Lessons from the Dataraflow Internship

Introduction

This week at Dataraflow, I was tasked with scraping legislative reports from the University of California's Operating Budget website. What seemed like a straightforward assignment turned into an incredible learning experience filled with challenges, breakthroughs, and valuable lessons in data extraction and manipulation.

The Challenge

My objective was to extract a structured dataset of legislative reports from UC's 2013-14 Legislative Session page, including report dates and titles, and organize them into a clean pandas DataFrame.

Initial Struggles

1. Understanding HTML Structure

My first major hurdle was navigating the complex HTML structure. The webpage had nested divs, tables with specific classes, and inconsistent formatting. I spent hours inspecting the page source, trying to identify the exact elements containing the data I needed.

from bs4 import BeautifulSoup
import requests
import pandas as pd
from pandas import Series, DataFrame

url = 'https://www.ucop.edu/operating-budget/budgets-and-reports/2013-14-legislative-session.html'
result = requests.get(url)
web_content = result.content

# Creating Beautiful Soup object
soup = BeautifulSoup(web_content, 'html.parser')

The Struggle: Initially, I didn't pass the parser argument to BeautifulSoup, so it warned me that it was guessing which parser to use. I also struggled to locate the right div section containing the reports.

2. Targeting the Right Elements

Finding the specific section with reports was tricky. The page had multiple divs, and I needed to target the exact one.

# Finding the right section
summary = soup.find("div", {'class':'list-land', 'id':'content'})

# Extracting all tables from that section
tables = summary.find_all('table')

The Breakthrough: Learning to use both class and id attributes together helped me pinpoint the exact location of my data. This was a game-changer!
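
To make the idea concrete, here is a minimal sketch on a made-up HTML snippet (the markup and the extra div ids are assumptions for illustration, not the actual UC page). It shows that an attribute dictionary with both class and id only matches an element that carries both, and that it is worth guarding against find() returning None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page is more complex.
sample_html = """
<div class="list-land" id="sidebar"><table><tr><td>ignore me</td></tr></table></div>
<div class="list-land" id="content"><table><tr><td>08/01/13</td></tr></table></div>
<div class="other" id="footer"></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Both conditions must hold: class includes 'list-land' AND id equals 'content'
section = soup.find("div", {"class": "list-land", "id": "content"})

# find() returns None when nothing matches, so check before chaining calls
if section is None:
    raise ValueError("Could not find the reports section")

first_cell = section.find("td").get_text(strip=True)
print(first_cell)
```

The first div shares the class but not the id, so it is skipped even though it appears earlier in the document.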

3. Extracting Data from Table Rows

The most frustrating part was extracting text from each table cell while maintaining the relationship between dates and report titles.

# Setting up empty data list
data = []

# Getting all table rows
rows = tables[0].find_all('tr')

# Extracting the first text node from each cell (None if the cell has no text)
for tr in rows:
    cols = tr.find_all('td')
    for td in cols:
        text = td.find(string=True)
        data.append(text)

The Problem: This approach produced a flat list with dates and titles mixed together, along with None values and empty strings. I needed a reliable way to pair each date with its corresponding report.
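
One way to avoid the flat-list problem altogether is to keep each row's cells together while extracting. The sketch below is an alternative, not the approach used in this post, and it assumes a simplified table where every row holds the date in its first cell and the title in its second (the sample markup is invented for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical two-column table: date in the first cell, title in the second.
sample_html = """
<table>
  <tr><th>Date</th><th>Report</th></tr>
  <tr><td>08/01/13</td><td><a href="#">Report A (pdf)</a></td></tr>
  <tr><td>09/01/13</td><td><a href="#">Report B (pdf)</a></td></tr>
  <tr><td></td><td></td></tr>
</table>
"""

table = BeautifulSoup(sample_html, "html.parser").find("table")

records = []
for tr in table.find_all("tr"):
    # get_text() always returns a string, so there are no None values to filter
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    # Keep only rows that have both a date and a title
    if len(cells) >= 2 and cells[0] and cells[1]:
        records.append((cells[0], cells[1]))

print(records)
```

Because each tuple comes from a single row, the date and title cannot drift apart. The header row and the empty row are dropped by the same check.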

4. Dealing with Special Characters

Unicode characters like \xa0 (non-breaking spaces) appeared in my data, making it messy and hard to read.

# The nightmare of special characters
'2014-15\xa0 (EDU 92495) Proposed Capital Outlay Projects (pdf)'

The Solution: Using Python's replace method to clean these characters:

reports.append(item.replace('\xa0', ' '))
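
The same cleanup can be wrapped in a small helper. This is a sketch that goes slightly beyond the replace call above: the function name clean_text is hypothetical, and collapsing repeated whitespace is an extra step that is assumed to be wanted.

```python
import re

def clean_text(value):
    """Replace non-breaking spaces and collapse repeated whitespace."""
    # Non-string inputs (such as None from an empty cell) become an empty string
    if not isinstance(value, str):
        return ""
    value = value.replace("\xa0", " ")
    return re.sub(r"\s+", " ", value).strip()

raw = "2014-15\xa0 (EDU 92495) Proposed Capital Outlay Projects (pdf)"
print(clean_text(raw))
```

A plain replace leaves two consecutive spaces after "2014-15" because the original string has a non-breaking space followed by a regular one. The regular expression step reduces them to a single space.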

The Final Solution

After multiple iterations and debugging sessions, here's the working solution I developed:

# Set up empty lists for organized data
reports = []
date = []

# enumerate() tracks the position so we can look back at the previous item
for index, item in enumerate(data):
    # Only check for 'pdf' when item is a string
    if isinstance(item, str) and 'pdf' in item:
        # The date is the item immediately before the report title
        if index > 0:
            date.append(data[index-1])
        else:
            date.append('')

        reports.append(item.replace('\xa0', ' '))

# Convert to pandas Series
date = Series(date)
reports = Series(reports)

# Create the final DataFrame
legislative_df = pd.concat([date, reports], axis=1)
legislative_df.columns = ['Date', 'Reports']

The Beautiful Result

        Date                                            Reports
0   08/01/13  2013-14 (EDU 92495) Proposed Capital Outlay Pr...
1   09/01/13  2014-15  (EDU 92495) Proposed Capital Outlay P...
2   11/01/13  Utilization of Classroom and Teaching Laborato...
3   11/01/13  Instruction and Research Space Summary & Analy...
4   11/15/13         Statewide Energy Partnership Program (pdf)
...

Key Lessons Learned

Always specify the parser: Using BeautifulSoup(content, 'html.parser') prevents warnings and ensures consistent behavior.
Inspect before you extract: Spend time understanding the HTML structure before writing extraction code.
Type checking matters: Using isinstance(item, str) prevents errors when encountering None values.
Data cleaning is crucial: Real-world data is messy. Always plan for special characters and inconsistencies.
Iterative development works: My final solution was version 5 or 6. Each iteration taught me something new.

Overcoming the Struggles

The turning point came when I realized I needed to:

Think about data relationships (dates paired with reports)
Filter data intelligently (only items containing 'pdf')
Handle edge cases (None values, special characters)
Use pandas effectively for data organization

My mentor at Dataraflow suggested a cleaner approach using list comprehensions and better error handling, which I plan to apply in future projects.
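
The mentor's exact suggestion isn't recorded here, so the following is only a sketch of what a list-comprehension version might look like. It assumes the same kind of flat data list as before. The sample values are illustrative, and a non-string previous item is turned into an empty date, which is a small departure from the loop version.

```python
# Illustrative stand-in for the flat list built from the table cells
data = [
    None,
    "09/01/13", "2014-15\xa0 (EDU 92495) Proposed Capital Outlay Projects (pdf)",
    "11/15/13", "Statewide Energy Partnership Program (pdf)",
    "", "Not a report",
]

# zip() pairs every item with the one before it; the first item gets None
pairs = [
    (prev if isinstance(prev, str) else "", item.replace("\xa0", " "))
    for prev, item in zip([None] + data, data)
    if isinstance(item, str) and "pdf" in item
]

dates = [d for d, _ in pairs]
reports = [r for _, r in pairs]
print(pairs)
```

The filter condition and the pairing now live in one expression, so there is no index counter to keep in sync.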

What I Look Forward To

This project opened my eyes to the world of web scraping and data engineering. Moving forward, I'm excited to:

1. Advanced Scraping Techniques

Learning Selenium for dynamic JavaScript-heavy websites
Implementing robust error handling and retry logic
Working with APIs as an alternative to scraping
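
As a first step toward retry logic, here is a minimal sketch of a retry helper with exponential backoff. The name with_retries and its parameters are hypothetical, and the flaky function simply simulates a network call that fails twice before succeeding.

```python
import time

def with_retries(func, attempts=3, base_delay=1.0, exceptions=(Exception,)):
    """Call func(); on failure wait and retry, doubling the delay each time."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1))

# Stand-in for a flaky network call: fails twice, then succeeds
calls = {"count": 0}

def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

html = with_retries(flaky_fetch, attempts=3, base_delay=0.01)
print(html, calls["count"])
```

With requests, func could be something like lambda: requests.get(url, timeout=10), with exceptions narrowed to requests.RequestException so that unrelated errors are not retried.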

2. Scaling Up

Scraping multiple pages automatically
Building data pipelines that run on schedules
Implementing data validation and quality checks

3. Database Integration

Storing scraped data in SQL databases
Creating automated ETL processes
Building dashboards to visualize scraped data
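
For the database step, a small sketch with Python's built-in sqlite3 module gives a feel for what storing the scraped rows could look like. The table name, column names, and sample rows are assumptions for illustration.

```python
import sqlite3

# Illustrative rows shaped like the Date/Reports columns of the DataFrame
rows = [
    ("09/01/13", "2014-15 (EDU 92495) Proposed Capital Outlay Projects (pdf)"),
    ("11/15/13", "Statewide Energy Partnership Program (pdf)"),
]

# ":memory:" keeps the example self-contained; a file path would persist the data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS reports (report_date TEXT, title TEXT)")

# Parameter placeholders avoid building SQL through string concatenation
conn.executemany("INSERT INTO reports (report_date, title) VALUES (?, ?)", rows)
conn.commit()

stored = conn.execute(
    "SELECT report_date, title FROM reports ORDER BY rowid"
).fetchall()
print(stored)
conn.close()
```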

4. Legal and Ethical Considerations

Understanding robots.txt and web scraping ethics
Implementing rate limiting to respect server resources
Learning about data privacy and usage rights
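
The standard library already covers the basics of checking robots.txt. The sketch below parses a made-up robots.txt (the rules, the example.com URLs, and the "my-scraper" user agent are all assumptions) and pauses between requests instead of firing them back to back.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; set_url() and read() would fetch a real one
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 2",
]

parser = RobotFileParser()
parser.parse(robots_lines)

urls = [
    "https://example.com/reports/2013.html",
    "https://example.com/private/admin.html",
]

# Keep only the URLs the site allows this user agent to fetch
allowed = [u for u in urls if parser.can_fetch("my-scraper", u)]

# Respect the site's Crawl-delay if it has one, else fall back to 1 second
delay = parser.crawl_delay("my-scraper") or 1

for i, url in enumerate(allowed):
    if i > 0:
        time.sleep(delay)  # pause between requests, not before the first one
    print("would fetch", url)
```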

5. Performance Optimization

Using multiprocessing for concurrent scraping
Implementing caching mechanisms
Optimizing memory usage for large datasets

Advice for Fellow Interns

If you're starting with web scraping, here's what I wish I knew:

Start small: Don't try to scrape entire websites on day one
Read the documentation: BeautifulSoup and pandas docs are your friends
Debug systematically: Print intermediate results to understand what's happening
Ask for help: Your mentors have been through this before
Celebrate small wins: Getting that first table extracted feels amazing!

Conclusion

This web scraping project at Dataraflow transformed from a daunting task into one of my favorite learning experiences. Every error message taught me something, every bug fixed boosted my confidence, and every successful extraction felt like solving a puzzle.

The journey from seeing a wall of HTML to producing a clean DataFrame was challenging but incredibly rewarding. I'm grateful for this opportunity and excited to apply these skills to future data science and engineering challenges.
