Web Scraping Journey

Lessons from the Dataraflow Internship

Introduction

This week at Dataraflow, I was tasked with scraping legislative reports from the University of California's Operating Budget website. What seemed like a straightforward assignment turned into an incredible learning experience filled with challenges, breakthroughs, and valuable lessons in data extraction and manipulation.

The Challenge

My objective was to extract a structured dataset of legislative reports from UC's 2013-14 Legislative Session page, including report dates and titles, and organize them into a clean pandas DataFrame.

Initial Struggles

1. Understanding HTML Structure

My first major hurdle was navigating the complex HTML structure. The webpage had nested divs, tables with specific classes, and inconsistent formatting. I spent hours inspecting the page source, trying to identify the exact elements containing the data I needed.

from bs4 import BeautifulSoup
import requests
import pandas as pd
from pandas import Series, DataFrame

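url = 'https://www.ucop.edu/operating-budget/budgets-and-reports/2013-14-legislative-session.html'
# A timeout keeps the script from hanging; raise_for_status() surfaces HTTP errors early
result = requests.get(url, timeout=30)
result.raise_for_status()
web_content = result.content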

# Creating Beautiful Soup object
soup = BeautifulSoup(web_content, 'html.parser')

The Struggle: Initially, I didn't pass the parser argument to BeautifulSoup, which triggered a warning that no parser was explicitly specified. I also struggled to locate the right div section containing the reports.

2. Targeting the Right Elements

Finding the specific section with reports was tricky. The page had multiple divs, and I needed to target the exact one.

# Finding the right section
summary = soup.find("div", {'class':'list-land', 'id':'content'})

# Extracting all tables from that section
tables = summary.find_all('table')

The Breakthrough: Learning to use both class and id attributes together helped me pinpoint the exact location of my data. This was a game-changer!
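
To make the class-plus-id idea concrete, here is a minimal sketch. The HTML string is a made-up miniature of the page layout (the real markup may differ), and the names SAMPLE_HTML, by_attrs, and by_css are only for illustration. It shows the attribute-dictionary lookup next to an equivalent CSS selector, plus a guard for the case where nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the page layout, invented for illustration.
SAMPLE_HTML = """
<div id="nav" class="menu"><table><tr><td>menu</td></tr></table></div>
<div id="content" class="list-land">
  <table><tr><td>08/01/13</td><td>Report A (pdf)</td></tr></table>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute filter: both the class and the id must match.
by_attrs = soup.find("div", {"class": "list-land", "id": "content"})

# The same target expressed as a CSS selector.
by_css = soup.select_one("div#content.list-land")

# find() returns None when nothing matches, so check before chaining calls.
if by_attrs is None:
    raise ValueError("content section not found; the page layout may have changed")

tables = by_attrs.find_all("table")
```

The None check matters because calling find_all on a missing section would otherwise fail with an AttributeError that says little about the real cause.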

3. Extracting Data from Table Rows

The most frustrating part was extracting text from each table cell while maintaining the relationship between dates and report titles.

# Setting up empty data list
data = []

# Getting all table rows
rows = tables[0].find_all('tr')

# Extracting the first text node from each cell (None for empty cells)
for tr in rows:
    cols = tr.find_all('td')
    for td in cols:
        text = td.find(string=True)
        data.append(text)

The Problem: This approach produced one flat list with every cell mixed together, including None values and empty strings. Nothing in the list recorded which date belonged to which report, so I needed a reliable way to pair them back up.
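
One possible way around the flat list is to keep each row's cells together while extracting. The sketch below assumes a simplified table in which every data row holds a date cell followed by a report cell; SAMPLE_TABLE and extract_rows are hypothetical names, and the real page may not be this regular. The final solution later in this post takes a different route and works from the flat list.

```python
from bs4 import BeautifulSoup

# Hypothetical table; the real page's markup may differ.
SAMPLE_TABLE = """
<table>
  <tr><td>08/01/13</td><td><a href="#">Report A (pdf)</a></td></tr>
  <tr><td></td><td></td></tr>
  <tr><td>09/01/13</td><td><a href="#">Report B (pdf)</a></td></tr>
</table>
"""

def extract_rows(table):
    """Return one list of cell texts per table row, skipping rows with no text."""
    rows = []
    for tr in table.find_all("tr"):
        # get_text() gathers all text inside the cell, including nested tags.
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if any(cells):
            rows.append(cells)
    return rows

table = BeautifulSoup(SAMPLE_TABLE, "html.parser").find("table")
rows = extract_rows(table)
```

Because each inner list is one row, the date and its report stay side by side and no index arithmetic is needed afterwards.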

4. Dealing with Special Characters

Unicode characters like \xa0 (non-breaking spaces) appeared in my data, making it messy and hard to read.

# The nightmare of special characters
'2014-15\xa0 (EDU 92495) Proposed Capital Outlay Projects (pdf)'

The Solution: Using Python's replace method to clean these characters:

reports.append(item.replace('\xa0', ' '))
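
A slightly broader cleanup is sketched below, using the standard library only. The helper name clean_text is hypothetical. Unlike the plain replace above, it also collapses the double space that remains once the non-breaking space has been swapped out.

```python
def clean_text(text):
    """Replace non-breaking spaces, then collapse runs of whitespace into one space."""
    return " ".join(text.replace("\xa0", " ").split())

raw = "2014-15\xa0 (EDU 92495) Proposed Capital Outlay Projects (pdf)"
cleaned = clean_text(raw)
```

Whether to collapse the extra space is a judgment call: it makes titles easier to compare, but the text no longer matches the page character for character.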

The Final Solution

After multiple iterations and debugging sessions, here's the working solution I developed:

# Set up empty lists for organized data
reports = []
date = []

# enumerate() tracks the position of each item in the flat list
for index, item in enumerate(data):
    # Only check for 'pdf' when item is a string
    if isinstance(item, str) and 'pdf' in item:
        # Add the corresponding date (the cell just before the report)
        if index > 0:
            date.append(data[index-1])
        else:
            date.append('')

        reports.append(item.replace('\xa0', ' '))

# Convert to pandas Series
date = Series(date)
reports = Series(reports)

# Create the final DataFrame
legislative_df = pd.concat([date, reports], axis=1)
legislative_df.columns = ['Date', 'Reports']

The Beautiful Result

        Date                                            Reports
0   08/01/13  2013-14 (EDU 92495) Proposed Capital Outlay Pr...
1   09/01/13  2014-15  (EDU 92495) Proposed Capital Outlay P...
2   11/01/13  Utilization of Classroom and Teaching Laborato...
3   11/01/13  Instruction and Research Space Summary & Analy...
4   11/15/13         Statewide Energy Partnership Program (pdf)
...
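
As an alternative to the Series-and-concat step, the same table could be built in one call from a dictionary of lists. This is a sketch with made-up placeholder values rather than the scraped data, and the date parsing assumes every date follows the MM/DD/YY pattern seen in the output above.

```python
import pandas as pd

# Hypothetical values standing in for the scraped lists.
dates = ["08/01/13", "09/01/13"]
reports = ["Report A (pdf)", "Report B (pdf)"]

# Column names come from the dictionary keys, so no separate rename is needed.
df = pd.DataFrame({"Date": dates, "Reports": reports})

# Parse the text dates into real timestamps so they sort and filter correctly.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%y")
```

Converting the Date column is optional, but it allows chronological sorting and date-range filters that plain strings would get wrong.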

Key Lessons Learned

  1. Always specify the parser: Using BeautifulSoup(content, 'html.parser') prevents the missing-parser warning and keeps parsing behavior consistent across machines.

  2. Inspect before you extract: Spend time understanding the HTML structure before writing extraction code.

  3. Type checking matters: Using isinstance(item, str) prevents errors when encountering None values.

  4. Data cleaning is crucial: Real-world data is messy. Always plan for special characters and inconsistencies.

  5. Iterative development works: My final solution was version 5 or 6. Each iteration taught me something new.

Overcoming the Struggles

The turning point came when I realized I needed to:

  • Think about data relationships (dates paired with reports)

  • Filter data intelligently (only items containing 'pdf')

  • Handle edge cases (None values, special characters)

  • Use pandas effectively for data organization

My mentor at Dataraflow suggested a cleaner approach with list comprehensions and better error handling, which I plan to apply in future projects.
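
A list-comprehension version of the pairing step might look like the sketch below. This is one guess at what such a cleanup could be, not the mentor's actual code; pair_dates_with_reports and the sample list are hypothetical. One deliberate difference from the loop in this post: a missing date (None) is stored as an empty string.

```python
def pair_dates_with_reports(cells):
    """Pair each report cell (a string containing 'pdf') with the cell just before it."""
    return [
        (prev or "", cell.replace("\xa0", " "))
        # Shifting the list by one lines every cell up with its predecessor.
        for prev, cell in zip([None] + cells, cells)
        if isinstance(cell, str) and "pdf" in cell
    ]

# Hypothetical flat list shaped like the one described in this post.
flat = ["08/01/13", "Report A (pdf)", None, "", "09/01/13", "Report\xa0B (pdf)"]
pairs = pair_dates_with_reports(flat)
```

The resulting list of tuples can be passed straight to pandas, for example with pd.DataFrame(pairs, columns=['Date', 'Reports']).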

What I Look Forward To

This project opened my eyes to the world of web scraping and data engineering. Moving forward, I'm excited to explore:

1. Advanced Scraping Techniques

  • Learning Selenium for dynamic JavaScript-heavy websites

  • Implementing robust error handling and retry logic

  • Working with APIs as an alternative to scraping
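
As a first step toward the retry logic mentioned above, here is a small sketch that uses only the standard library. The names with_retries, fetch, attempts, base_delay, and sleep are all hypothetical; fetch stands for any zero-argument function that performs a request. Catching every Exception is a simplification, and a real scraper would likely retry only on network errors.

```python
import time

def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            # On the last attempt, let the original error propagate.
            if attempt == attempts - 1:
                raise
            # Exponential backoff: base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * 2 ** attempt)
```

With requests, a call could look like with_retries(lambda: requests.get(url, timeout=30)). The sleep parameter exists so the waiting can be swapped out when testing.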

2. Scaling Up

  • Scraping multiple pages automatically

  • Building data pipelines that run on schedules

  • Implementing data validation and quality checks

3. Database Integration

  • Storing scraped data in SQL databases

  • Creating automated ETL processes

  • Building dashboards to visualize scraped data

4. Ethical and Legal Considerations

  • Understanding robots.txt and web scraping ethics

  • Implementing rate limiting to respect server resources

  • Learning about data privacy and usage rights
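
Python's standard library already includes a robots.txt parser, so checking permissions before scraping could start out like the sketch below. The robots.txt content and the example.com URLs are invented for illustration, and polite_delay is a hypothetical helper. A real script would load the site's actual file, for example with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def polite_delay(parser, user_agent="*", default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay if set, else a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default

allowed = parser.can_fetch("*", "https://example.com/reports/index.html")
blocked = parser.can_fetch("*", "https://example.com/private/data.html")
```

Calling time.sleep(polite_delay(parser)) between requests is a simple form of rate limiting that follows the delay the site asks for.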

5. Performance Optimization

  • Using multiprocessing for concurrent scraping

  • Implementing caching mechanisms

  • Optimizing memory usage for large datasets

Advice for Fellow Interns

If you're starting with web scraping, here's what I wish I knew:

  • Start small: Don't try to scrape entire websites on day one

  • Read the documentation: BeautifulSoup and pandas docs are your friends

  • Debug systematically: Print intermediate results to understand what's happening

  • Ask for help: Your mentors have been through this before

  • Celebrate small wins: Getting that first table extracted feels amazing!

Conclusion

This web scraping project at Dataraflow transformed from a daunting task into one of my favorite learning experiences. Every error message taught me something, every bug fixed boosted my confidence, and every successful extraction felt like solving a puzzle.

The journey from seeing a wall of HTML to producing a clean DataFrame was challenging but incredibly rewarding. I'm grateful for this opportunity and excited to apply these skills to future data science and engineering challenges.