Web Scraping Journey
Lessons from the Dataraflow Internship
Introduction
This week at Dataraflow, I was tasked with scraping legislative reports from the University of California's Operating Budget website. What seemed like a straightforward assignment turned into an incredible learning experience filled with challenges, breakthroughs, and valuable lessons in data extraction and manipulation.
The Challenge
My objective was to extract a structured dataset of legislative reports from UC's 2013-14 Legislative Session page, including report dates and titles, and organize them into a clean pandas DataFrame.
Initial Struggles
1. Understanding HTML Structure
My first major hurdle was navigating the complex HTML structure. The webpage had nested divs, tables with specific classes, and inconsistent formatting. I spent hours inspecting the page source, trying to identify the exact elements containing the data I needed.
from bs4 import BeautifulSoup
import requests
import pandas as pd
from pandas import Series, DataFrame
url = 'https://www.ucop.edu/operating-budget/budgets-and-reports/2013-14-legislative-session.html'
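# A timeout keeps the script from hanging; raise_for_status() fails fast on HTTP errors
result = requests.get(url, timeout=10)
result.raise_for_status()
web_content = result.content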
# Creating Beautiful Soup object
soup = BeautifulSoup(web_content, 'html.parser')
The Struggle: Initially, I didn't pass the parser argument to BeautifulSoup, which triggered a warning that no parser was explicitly specified. I also struggled to locate the right div section containing the reports.
2. Targeting the Right Elements
Finding the specific section with reports was tricky. The page had multiple divs, and I needed to target the exact one.
# Finding the right section
summary = soup.find("div", {'class':'list-land', 'id':'content'})
# Extracting all tables from that section
tables = summary.find_all('table')
The Breakthrough: Learning to use both class and id attributes together helped me pinpoint the exact location of my data. This was a game-changer!
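To make the class-plus-id idea concrete, here is a minimal sketch that runs on made-up HTML rather than the real page. The second div and its sidebar id are assumptions for illustration only. It also checks for None, since find returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs share a class, only the id tells them apart.
html = """
<div class="list-land" id="sidebar"><table><tr><td>nav</td></tr></table></div>
<div class="list-land" id="content"><table><tr><td>08/01/13</td></tr></table></div>
"""

soup = BeautifulSoup(html, 'html.parser')

# Both attributes must match, so the sidebar div is skipped.
section = soup.find('div', {'class': 'list-land', 'id': 'content'})
if section is None:
    raise ValueError('Could not find the div with class list-land and id content')

tables = section.find_all('table')
print(len(tables))

# The same lookup written as a CSS selector.
same_section = soup.select_one('div#content.list-land')
```

Failing loudly when the section is missing gives a clearer error than the AttributeError that would otherwise come from calling find_all on None.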
3. Extracting Data from Table Rows
The most frustrating part was extracting text from each table cell while maintaining the relationship between dates and report titles.
# Setting up empty data list
data = []
# Getting all table rows
rows = tables[0].find_all('tr')
# Extracting the first text node from each cell (None if the cell is empty)
for tr in rows:
    cols = tr.find_all('td')
    for td in cols:
        text = td.find(string=True)
        data.append(text)
The Problem: This approach produced a flat list with dates and titles mixed together, along with None values and empty strings. I needed a reliable way to pair each date with its corresponding report.
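One way to avoid the flat list altogether is to keep each row's cells together as they are read. The sketch below is a variation, not the approach I ended up using. It assumes every report row has exactly two cells (date, then title), and the sample table is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for one of the page's tables.
SAMPLE_HTML = """
<table>
  <tr><th>Date</th><th>Report</th></tr>
  <tr><td>08/01/13</td><td><a href="#">2013-14 (EDU 92495) Proposed Capital Outlay Projects (pdf)</a></td></tr>
  <tr><td>11/15/13</td><td><a href="#">Statewide Energy Partnership Program (pdf)</a></td></tr>
  <tr><td></td><td></td></tr>
</table>
"""


def extract_rows(table):
    """Return one (date, title) tuple per row that has two non-empty td cells."""
    pairs = []
    for tr in table.find_all('tr'):
        # get_text collects all text inside the cell; strip=True trims whitespace.
        cells = [td.get_text(strip=True) for td in tr.find_all('td')]
        # Header rows (th only) and blank rows are skipped here.
        if len(cells) == 2 and all(cells):
            pairs.append((cells[0], cells[1]))
    return pairs


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
pairs = extract_rows(soup.find('table'))
print(pairs)
```

Because each tuple comes from a single row, the date and title cannot drift apart the way they can in a flat list.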
4. Dealing with Special Characters
Unicode characters like \xa0 (non-breaking spaces) appeared in my data, making it messy and hard to read.
# The nightmare of special characters
'2014-15\xa0 (EDU 92495) Proposed Capital Outlay Projects (pdf)'
The Solution: Using Python's replace method to clean these characters:
reports.append(item.replace(u'\xa0', u' '))
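A plain replace leaves a double space wherever \xa0 sat next to a regular space. A slightly more thorough variation is sketched below. The helper name clean_text is my own, and collapsing all whitespace is an assumption about what the cleaned output should look like.

```python
def clean_text(value):
    """Replace non-breaking spaces and collapse runs of whitespace."""
    if not isinstance(value, str):
        # None and other non-string values become an empty string.
        return ''
    # str.split() with no arguments splits on any Unicode whitespace,
    # which includes \xa0, so joining with ' ' normalizes the spacing.
    return ' '.join(value.split())


raw = '2014-15\xa0 (EDU 92495) Proposed Capital Outlay Projects (pdf)'
print(clean_text(raw))
```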
The Final Solution
After multiple iterations and debugging sessions, here's the working solution I developed:
# Set up empty lists for organized data
reports = []
date = []
# Set index counter
index = 0
for item in data:
    # Only check for 'pdf' when item is a string
    if isinstance(item, str) and 'pdf' in item:
        # Add the corresponding date
        if index > 0:
            date.append(data[index-1])
        else:
            date.append('')
        reports.append(item.replace(u'\xa0', u' '))
    index += 1
# Convert to pandas Series
date = Series(date)
reports = Series(reports)
# Create the final DataFrame
legislative_df = pd.concat([date, reports], axis=1)
legislative_df.columns = ['Date', 'Reports']
The Beautiful Result
   Date      Reports
0  08/01/13  2013-14 (EDU 92495) Proposed Capital Outlay Pr...
1  09/01/13  2014-15 (EDU 92495) Proposed Capital Outlay P...
2  11/01/13  Utilization of Classroom and Teaching Laborato...
3  11/01/13  Instruction and Research Space Summary & Analy...
4  11/15/13  Statewide Energy Partnership Program (pdf)
...
Key Lessons Learned
Always specify the parser: Using BeautifulSoup(content, 'html.parser') prevents warnings and ensures consistent behavior.
Inspect before you extract: Spend time understanding the HTML structure before writing extraction code.
Type checking matters: Using isinstance(item, str) prevents errors when encountering None values.
Data cleaning is crucial: Real-world data is messy. Always plan for special characters and inconsistencies.
Iterative development works: My final solution was version 5 or 6. Each iteration taught me something new.
Overcoming the Struggles
The turning point came when I realized I needed to:
Think about data relationships (dates paired with reports)
Filter data intelligently (only items containing 'pdf')
Handle edge cases (None values, special characters)
Use pandas effectively for data organization
My mentor at Dataraflow suggested a cleaner approach using list comprehensions and better error handling, which I plan to use in future projects.
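I don't have my mentor's exact code, so the following is only my own guess at what a list-comprehension version of the pairing loop might look like. The function name and the sample list are hypothetical.

```python
def pair_dates_with_reports(data):
    """Pair each report title with the item just before it in the flat list."""
    return [
        (
            # Use the previous item as the date only if it exists and is a string.
            data[i - 1] if i > 0 and isinstance(data[i - 1], str) else '',
            item.replace('\xa0', ' '),
        )
        for i, item in enumerate(data)
        # Skip None values and anything that is not a report link.
        if isinstance(item, str) and 'pdf' in item
    ]


# Made-up flat list in the same shape as the scraped data.
sample = [
    None,
    '08/01/13',
    '2013-14\xa0(EDU 92495) Report (pdf)',
    '',
    '11/15/13',
    'Statewide Energy Partnership Program (pdf)',
]
pairs = pair_dates_with_reports(sample)
print(pairs)
```

The resulting list of tuples can go straight into pd.DataFrame(pairs, columns=['Date', 'Reports']), which removes the need for the index counter and the two separate Series.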
What I Look Forward To
This project opened my eyes to the world of web scraping and data engineering. Moving forward, I'm excited to:
1. Advanced Scraping Techniques
Learning Selenium for dynamic JavaScript-heavy websites
Implementing robust error handling and retry logic
Working with APIs as an alternative to scraping
2. Scaling Up
Scraping multiple pages automatically
Building data pipelines that run on schedules
Implementing data validation and quality checks
3. Database Integration
Storing scraped data in SQL databases
Creating automated ETL processes
Building dashboards to visualize scraped data
4. Legal and Ethical Considerations
Understanding robots.txt and web scraping ethics
Implementing rate limiting to respect server resources
Learning about data privacy and usage rights
5. Performance Optimization
Using multiprocessing for concurrent scraping
Implementing caching mechanisms
Optimizing memory usage for large datasets
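For the robots.txt and rate-limiting items above, here is a small sketch of where I might start, using only the standard library. The rules, the example.com URLs, and the my-scraper user agent are all made up so the example runs offline. A real scraper would load the site's actual robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, parsed from text so no network call is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent='my-scraper'):
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(user_agent='my-scraper', default=1.0):
    """Return the Crawl-delay in seconds, or a default when none is declared."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


print(allowed('https://example.com/reports/index.html'))
print(allowed('https://example.com/private/data.html'))
print(polite_delay())
```

In a scraping loop, calling time.sleep(polite_delay()) between requests would be the simplest form of rate limiting.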
Advice for Fellow Interns
If you're starting with web scraping, here's what I wish I knew:
Start small: Don't try to scrape entire websites on day one
Read the documentation: BeautifulSoup and pandas docs are your friends
Debug systematically: Print intermediate results to understand what's happening
Ask for help: Your mentors have been through this before
Celebrate small wins: Getting that first table extracted feels amazing!
Conclusion
This web scraping project at Dataraflow transformed from a daunting task into one of my favorite learning experiences. Every error message taught me something, every bug fixed boosted my confidence, and every successful extraction felt like solving a puzzle.
The journey from seeing a wall of HTML to producing a clean DataFrame was challenging but incredibly rewarding. I'm grateful for this opportunity and excited to apply these skills to future data science and engineering challenges.