import requests
import bs4
import pandas as pd
Lecture 10 - Advanced Web Scraping
Overview
In this lecture, we cover more advanced web scraping with:
scrapy
playwright
References
This lecture contains material from:
The Document Object Model
From MDN Web Docs:
The Document Object Model (DOM) is the data representation of the objects that comprise the structure and content of a document on the web.
By Birger Eriksson - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=18034500
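As a small illustration, BeautifulSoup parses an HTML document into a tree of element objects that we can traverse programmatically (a minimal sketch; the HTML snippet is made up):

import bs4

# a tiny, made-up HTML document
html = "<html><body><p class='greeting'>Hello <b>world</b></p></body></html>"
soup = bs4.BeautifulSoup(html, 'html.parser')

p = soup.find('p')   # the <p> element node in the tree
print(p['class'])    # ['greeting'] -- an attribute of the node
print(p.b.text)      # 'world'      -- the <b> child element of <p>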
Review: Quotes
We are going to scrape https://quotes.toscrape.com, a website that lists quotes from famous authors.
We first scrape a single page of the quotes website using tools from last lecture. Then, we show how to automate scraping many pages with scrapy.
req = requests.get('https://quotes.toscrape.com')
req
<Response [200]>
soup = bs4.BeautifulSoup(req.text, 'html.parser')
quotes = soup.select("div.quote")
print(quotes[0].prettify())
<div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
<span class="text" itemprop="text">
“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
</span>
<span>
by
<small class="author" itemprop="author">
Albert Einstein
</small>
<a href="/author/Albert-Einstein">
(about)
</a>
</span>
<div class="tags">
Tags:
<meta class="keywords" content="change,deep-thoughts,thinking,world" itemprop="keywords"/>
<a class="tag" href="/tag/change/page/1/">
change
</a>
<a class="tag" href="/tag/deep-thoughts/page/1/">
deep-thoughts
</a>
<a class="tag" href="/tag/thinking/page/1/">
thinking
</a>
<a class="tag" href="/tag/world/page/1/">
world
</a>
</div>
</div>
# tag: span, class: text
quotes[0].select("span.text")
[<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>]
.select returns a list, so we need to select the first element and then get the text via .text:
quotes[0].select("span.text")[0].text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
# tag: small, class: author
quotes[0].select("small.author")[0].text
'Albert Einstein'
# tag: div, class: tags. > get children with tag: a, class: tag
[i.text for i in quotes[0].select("div.tags > a.tag")]
['change', 'deep-thoughts', 'thinking', 'world']
quote_list = []
for quote in quotes:
    quote_list.append({'quote': quote.select("span.text")[0].text,
                       'author': quote.select("small.author")[0].text,
                       'tags': [i.text for i in quote.select("div.tags > a.tag")]})
pd.DataFrame(quote_list)
quote | author | tags | |
---|---|---|---|
0 | “The world as we have created it is a process ... | Albert Einstein | [change, deep-thoughts, thinking, world] |
1 | “It is our choices, Harry, that show what we t... | J.K. Rowling | [abilities, choices] |
2 | “There are only two ways to live your life. On... | Albert Einstein | [inspirational, life, live, miracle, miracles] |
3 | “The person, be it gentleman or lady, who has ... | Jane Austen | [aliteracy, books, classic, humor] |
4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe | [be-yourself, inspirational] |
5 | “Try not to become a man of success. Rather be... | Albert Einstein | [adulthood, success, value] |
6 | “It is better to be hated for what you are tha... | André Gide | [life, love] |
7 | “I have not failed. I've just found 10,000 way... | Thomas A. Edison | [edison, failure, inspirational, paraphrased] |
8 | “A woman is like a tea bag; you never know how... | Eleanor Roosevelt | [misattributed-eleanor-roosevelt] |
9 | “A day without sunshine is like, you know, nig... | Steve Martin | [humor, obvious, simile] |
How do we scrape multiple pages easily? This is where scrapy is useful.
Scrapy Introduction
About
Scrapy is a framework for crawling websites and extracting structured data.
Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs.
Installation
First, install scrapy in your conda environment:
Terminal
pip install scrapy
Example
This example follows the tutorial from the scrapy documentation.
- Create a new directory (i.e. folder) called scrapy-project
- In scrapy-project, create a Python file quotes_spider.py
- Copy the following code and paste it in quotes_spider.py:
python
import scrapy
class QuoteSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/',
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                'text': quote.css("span.text::text").get(),
                'author': quote.css("small.author::text").get(),
                'tags': quote.css("div.tags > a.tag::text").getall()
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page)
- Open your command line / Terminal, navigate to scrapy-project, and run:
Terminal
scrapy runspider quotes_spider.py -o quotes.jsonl
- When this finishes you will have a new file called quotes.jsonl with a list of the quotes in JSON Lines format, containing the text and author, which will look like this:
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]}
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]}
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]}
...
We can read in JSONL files with:
quotes = pd.read_json(path_or_buf='scrapy/quotes.jsonl', lines=True)
JSON: the entire file is a valid JSON document
JSON example
{
"users": [
{"id": 1, "name": "Alice", "email": "alice@example.com"},
{"id": 2, "name": "Bob", "email": "bob@example.com"},
{"id": 3, "name": "Charlie", "email": "charlie@example.com"}
]
}
JSONL: each line is a separate JSON object
JSONL example
{"id": 1, "name": "Alice", "email": "alice@example.com"}
{"id": 2, "name": "Bob", "email": "bob@example.com"}
{"id": 3, "name": "Charlie", "email": "charlie@example.com"}
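As a quick sketch of how the two formats are read in Python (the file names users.json and users.jsonl are hypothetical):

import json
import pandas as pd

# JSON: parse the whole document at once
with open('users.json') as f:
    users = json.load(f)['users']

# JSONL: each line is parsed as its own JSON object
users_df = pd.read_json('users.jsonl', lines=True)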
What just happened?
When you ran the command scrapy runspider quotes_spider.py
, Scrapy looked for a Spider definition inside it and ran it through its crawler engine.
The crawl:
- Makes requests to the URLs defined in the start_urls attribute
- Takes the HTML output and passes it into the parse method. The parse method:
  - Loops through the quote elements using a CSS selector
  - Yields a Python dict with extracted quote text and author
  - Looks for a link to the next page and schedules another request
We will go through the code in more detail in the section Scrapy Spiders.
Before we do that, however, we need a refresher of object oriented programming in Python.
Interlude: Object-Oriented Programming
Python is an object-oriented programming language.
Every item you interact with is an object. An object has a type, as well as:
- attributes (values)
- methods (functions that can be applied to the object)
As an example, 'hello' in Python is an object of type str, with methods such as split, join, etc.
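For example (a small sketch):

s = 'hello world'
print(type(s))               # <class 'str'>  -- the object's type
print(s.split(' '))          # ['hello', 'world']  -- calling the split method
print('-'.join(['a', 'b']))  # 'a-b'  -- calling the join method on another str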
Classes
A Python class is a “blueprint” for an object.
We can define new classes. For example:
class Point:
""" Point class represents and manipulates x,y coords. """
def __init__(self, x=0, y=0):
""" Create a new point at the origin """
self.x = x
self.y = y
All classes have a function called __init__()
, which is executed when the class is being initiated.
The self
parameter is a reference to the current instance of the class, and is used to access variables that belong to the class.
Above, self.x=x
means that each instance of the class has its own x
value.
Let’s create an object p1
which is a new instance of the Point
class:
p1 = Point(x=1.2, y=4)
We can extract the x
and y
attributes of p1
:
print(p1.x, p1.y)
1.2 4
We can also change attributes after the object has been created:
p1.x = 5
print(p1.x, p1.y)
5 4
Let’s re-write the Point
class and add a method:
class Point:
""" Create a new Point, at coordinates x, y """
def __init__(self, x=0, y=0):
""" Create a new point at x, y """
self.x = x
self.y = y
def distance_from_origin(self):
""" Compute my distance from the origin """
return ((self.x ** 2) + (self.y ** 2)) ** 0.5
p2 = Point(x=3, y=4)
p2.distance_from_origin()
5.0
Inheritance
Inheritance allows us to define a class that inherits all the methods and properties from another class.
# Parent class
class Animal:
def __init__(self, name):
self.name = name # Initialize the name attribute
def speak(self):
pass # Placeholder method to be overridden by child classes
# Child class inheriting from Animal
class Dog(Animal):
def speak(self):
return f"{self.name} barks!" # Override the speak method
# Child class inheriting from Animal
class Cat(Animal):
def speak(self):
return f"{self.name} meows!" # Override the speak method
# Creating an instance of Dog
dog = Dog("Buddy")
print(dog.speak())

# Creating an instance of Cat
cat = Cat("Puss")
print(cat.speak())
Buddy barks!
Puss meows!
Scrapy Spiders
Spiders are classes which define how a certain site (or a group of sites) will be scraped.
Specifically:
- how to perform the crawl (i.e. follow links)
- how to extract structured data from their pages (i.e. scraping items)
Looking back at the beginning of the QuoteSpider
class:
- The first line indicates that QuoteSpider is a new class inheriting from the scrapy.Spider class. Spiders require a name.

  class QuoteSpider(scrapy.Spider):
      name = "quotes"
      start_urls = [
          'http://quotes.toscrape.com/',
      ]

- The parent scrapy.Spider class provides a default start_requests() implementation which sends requests from the start_urls spider attribute and calls the spider's parse method for each of the resulting responses.

- Now let's consider the parse callback function. The parse method is in charge of processing the response and returning scraped data and/or more URLs to follow. The first part of the function processes the response:

  for quote in response.css("div.quote"):
      yield {
          'text': quote.css("span.text::text").get(),
          'author': quote.css("small.author::text").get(),
          'tags': quote.css("div.tags > a.tag::text").getall()
      }
- We can see this is similar to our BeautifulSoup processing:
  quote_list = []
  for quote in quotes:
      quote_list.append({'quote': quote.select("span.text")[0].text,
                         'author': quote.select("small.author")[0].text,
                         'tags': [i.text for i in quote.select("div.tags > a.tag")]})
- A difference is that in BeautifulSoup, to extract the text, we needed .text. With Scrapy's .css, we can use ::text after the CSS selector, e.g. span.text::text.

- Notice also that we use yield instead of return for the output. Unlike a return statement, which terminates the function, yield allows the function to produce values incrementally while maintaining its state (see the short sketch after this list).
- The second part of the parse function looks for a link to the next page and schedules another request.

  - In the HTML, we can see the next page link (/page/2/) is in tag li, with class='next':

    <li class="next">
        <a href="/page/2/">Next <span aria-hidden="true">→</span></a>
    </li>
  - In the QuoteSpider, the following command extracts /page/2/, using ::attr(href) to access the href attribute of <a>:

    next_page = response.css("li.next a::attr(href)").get()
  - Then, this command creates the new URL to scrape, that is https://quotes.toscrape.com/page/2:

    next_page = response.urljoin(next_page)

  - Finally, we yield a scrapy.Request:

    yield scrapy.Request(next_page)
  - When you yield a Request in the parse method, Scrapy will schedule that request to be sent and the resulting response will be fed back to the parse method.
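To illustrate the behaviour of yield mentioned above, here is a minimal sketch of a generator function (the function name is made up):

def count_up_to(n):
    """Yield 1, 2, ..., n one value at a time."""
    i = 1
    while i <= n:
        yield i      # pause here; resume from this line when the next value is requested
        i += 1

for value in count_up_to(3):
    print(value)     # prints 1, then 2, then 3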
Scrapy Shell
We can open an interactive scrapy shell to test our functions:
Terminal
scrapy shell 'URL'
To exit the scrapy interactive shell, run exit.
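For example, to test the quote selectors from earlier interactively (a sketch; the commands after the first line are typed inside the shell, and the values returned are what we would expect on the first page):
Terminal
scrapy shell 'https://quotes.toscrape.com'
response.css("div.quote span.text::text").get()
response.css("div.quote small.author::text").getall()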
Scrapy Example: Jazz Transcriptions
Single Page
We now use scrapy to get book titles and URLs from this website: https://blueblackjazz.com/en/books.
Workflow:
- Inspect the website HTML to figure out HTML tags and classes we want (potentially using SelectorGadget)
- Use scrapy shell to test our tags/classes and see what processing we need
Terminal
scrapy shell "https://blueblackjazz.com/en/books"
response.css('#body li a')
link = response.css('#body li a')[0]
link.css('::text').get()
link.attrib['href']
jazz_single_page.py
import scrapy
class BookSpider(scrapy.Spider):
    name = 'jazzspider'
    start_urls = ['https://blueblackjazz.com/en/books']

    def parse(self, response):
        for link in response.css('#body li a'):
            yield {'title': link.css('::text').get(),
                   'url': link.attrib['href']}
Terminal
scrapy runspider jazz_single_page.py -o jazz_data.csv
Multi Page
The next spider clicks on the link for each transcription and extracts information from that link.
Before we run the spider, let’s see what the first link looks like:
response = requests.get('https://blueblackjazz.com/en/transcription/142/142-albert-ammons-12th-street-boogie-c-maj-transcription-pdf')
soup = bs4.BeautifulSoup(response.text, 'html.parser')
elements = soup.select('h2.transcriptionTitle')
elements[0].text
'\nAlbert Ammons\n 12th Street Boogie\n \n C Maj\n\n - 5\n\n pages\n \n \n'
We can clean up this string:
info = [i.strip() for i in elements[0].text.split('\n') if len(i.strip()) > 0]
info
['Albert Ammons', '12th Street Boogie', 'C Maj', '- 5', 'pages']
We can also do the code checking in the scrapy shell.
jazz_multi_page.py
import scrapy
class BlogSpider(scrapy.Spider):
    name = 'jazzspider'
    start_urls = ['https://blueblackjazz.com/en/books']

    def parse(self, response):
        for link in response.css('div.row > div > ul > li > a'):
            another_url = response.urljoin(link.attrib['href'])
            # callback specifies the function that will be called with first parameter being
            # the response of this request (once it's downloaded)
            yield scrapy.Request(another_url, callback=self.parse_transcription_page)

    def parse_transcription_page(self, response):
        for title in response.css('h2.transcriptionTitle'):
            # this outputs a list
            info = title.css('::text').extract()
            # join the list
            info = ' '.join(info)
            info = [i.strip() for i in info.split('\n') if len(i.strip()) > 0]

            yield {
                'Musician': title.css('small::text').extract_first(),
                'Title': info[1],
                'Key': info[2],
                'Pages': info[3].strip('- ')
            }
Terminal
scrapy runspider jazz_multi_page.py -o jazz_multi_data.csv
Scrapy Example: Hockey Teams
Now, let’s scrape the Hockey Team data from this site.
hockey.py
import scrapy
class HockeyTeamsSpider(scrapy.Spider):
    name = "hockey_teams"
    allowed_domains = ["scrapethissite.com"]
    start_urls = ["https://www.scrapethissite.com/pages/forms/"]

    def parse(self, response):
        # Extract team data from the current page
        for team in response.css('tr.team'):
            yield {
                'Team Name': team.css('td.name::text').get().strip(),
                'Year': team.css('td.year::text').get().strip(),
                'Wins': team.css('td.wins::text').get().strip(),
                'Losses': team.css('td.losses::text').get().strip(),
                'OT Losses': team.css('td.ot-losses::text').get().strip(),
                'Win Percentage': team.css('td.pct::text').get().strip(),
                'Goals For': team.css('td.gf::text').get().strip(),
                'Goals Against': team.css('td.ga::text').get().strip(),
            }

        # Follow pagination links
        next_page = response.css('a[aria-label="Next"]::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield response.follow(next_page, self.parse)
Terminal
scrapy runspider hockey.py -o hockey_teams_data.csv
Scrapy Example: Books
A bookstore website http://books.toscrape.com/ has been built to learn web scraping.
We build a spider to click on each book URL and extract the info. There are 50 webpages of books, so start_urls
is a list of 50 webpages.
We also write a new parse function for each page.
book_spider.py
import scrapy
class BookSpider(scrapy.Spider):
    name = "toscrape-css"
    start_urls = [
        'http://books.toscrape.com/catalogue/page-%s.html' % page
        for page in range(1, 51)
    ]

    def parse(self, response):
        for link in response.css("ol.row article.product_pod h3 a"):
            # we can extract a shortened title with the following line, although long titles are shortened with (...)
            link.css("::text").extract_first()
            # url to an individual book page where complete information can be scraped
            book_url = response.urljoin(link.attrib['href'])
            yield scrapy.Request(book_url,
                                 callback=self.parse_book)

    def parse_book(self, response):
        yield {
            'title': response.css('div.product_main h1::text').extract_first(),
            'price': response.css('div.product_main p.price_color::text').extract_first(),
            'genre': response.css('li:nth-child(3) a::text').extract_first(),
            'image_url': response.css('#product_gallery img::attr(src)').extract_first(),
            'stars': response.css('div.product_main p.star-rating::attr(class)').extract_first().replace('star-rating ', '')
        }
Terminal
scrapy runspider book_spider.py -o bookstore.csv
Scrapy Advantages
- Requests are scheduled and processed asynchronously (scrapy doesn’t need to wait for a request to be finished, it can send another in the meantime)
- If a request fails or an error happens, other requests can keep going
Scrapy also gives you control over the “politeness” of your crawl, e.g. you can
- set a download delay between each request
- limit the amount of concurrent requests per domain
- use an auto-throttling extension that tries to figure out these settings automatically
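For example, these politeness settings can be configured per spider through the custom_settings attribute (a minimal sketch; the spider name and values are illustrative, not recommendations):

import scrapy

class PoliteSpider(scrapy.Spider):
    name = "polite"
    start_urls = ['http://quotes.toscrape.com/']

    custom_settings = {
        'DOWNLOAD_DELAY': 1.0,                # wait 1 second between requests
        'CONCURRENT_REQUESTS_PER_DOMAIN': 2,  # at most 2 requests in flight per domain
        'AUTOTHROTTLE_ENABLED': True,         # let Scrapy adjust delays automatically
    }

    def parse(self, response):
        yield {'url': response.url}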
Static vs Dynamic Websites
So far, we have been mostly looking at static websites.
Static websites serve pre-built HTML files directly to the browser:
- all content is present in the initial HTML response
- typically built with HTML, CSS and minimal JavaScript
We can use simple HTTP requests to obtain this content.
Dynamic websites generate content on-the-fly, often using JavaScript:
- initial HTML is minimal, with JavaScript loading content after page load
- content appears after JavaScript executes API calls
We need browser automation tools to extract this content because we must wait for JavaScript execution to finish. A popular automation framework is Playwright.
Before we discuss Playwright, however, we need to discuss asynchronous operations.
Interlude: Asyncio
Asyncio is a built-in Python library that helps us handle input-output operations more efficiently.
The Problem: Many programs spend a lot of time just waiting - waiting for websites to respond, files to download, or databases to return results. During this waiting time, your CPU sits idle, doing nothing useful.
Traditional Approach: In regular Python code, operations happen one after another:
- Start operation A
- Wait for A to finish completely
- Only then start operation B
Asyncio Solution: Asyncio lets your program work on multiple tasks by switching between them when one is waiting:
- Start operation A
- When A needs to wait (like for a website to respond), temporarily pause it
- Start operation B during this waiting time
- Switch back to A when its data is ready
- Both operations finish faster than if done one after another
How It Works: Asyncio uses special functions called “coroutines” (defined with async def
). Inside these functions, the await
keyword marks spots where the function might need to wait. When your program hits an await
, Python can temporarily switch to another task instead of sitting idle.
python
async def example_coroutine():
    # pause here and come back to example_coroutine when f() is ready
    r = await f()
    return r

def example_function():
    # this function only exits when f() has finished
    r = f()
    return r
Note: You can only use await
with awaitable functions.
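For instance, asyncio.sleep() is awaitable, while time.sleep() is an ordinary blocking call (a minimal sketch; the function name is made up):

import asyncio

async def pause_politely():
    # awaitable: the event loop can run other tasks during this one-second pause
    await asyncio.sleep(1)
    # time.sleep(1) here would block the whole event loop,
    # and `await time.sleep(1)` would fail because its result is not awaitable
    return "done"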
Example: Traditional Approach
import time
start_time = time.time()

def fetch(url):
    print(f'Starting {url}')
    data = requests.get(url).text
    print(f'Finished {url}')
    return data

page1 = fetch('http://example.com')
page2 = fetch('http://example.org')
print(f"Done in {time.time() - start_time} seconds")
Starting http://example.com
Finished http://example.com
Starting http://example.org
Finished http://example.org
Done in 0.2890596389770508 seconds
Example: Asyncio Approach
Now let’s do the same thing, but asynchronously.
To do so, we need the package aiohttp
(website)
Terminal
pip install aiohttp
This is because the requests.get(url)
function is a synchronous (blocking) HTTP request function - it’s not designed to work with asyncio.
import aiohttp
import asyncio
import time
async def fetch_async(url, session):
    print(f'Starting {url}')
    async with session.get(url) as response:
        data = await response.text()
    print(f'Finished {url}')
    return data

async def main():
    async with aiohttp.ClientSession() as session:
        # task1 = asyncio.create_task(fetch_async('http://example.com', session))
        # task2 = asyncio.create_task(fetch_async('http://example.org', session))
        # page1 = await task1
        # page2 = await task2
        page1, page2 = await asyncio.gather(
            fetch_async('http://example.com', session),
            fetch_async('http://example.org', session)
        )
        return page1, page2

start_time = time.time()
page1, page2 = await main()
print(f"Done in {time.time() - start_time} seconds")
Starting http://example.com
Starting http://example.org
Finished http://example.com
Finished http://example.org
Done in 0.20457220077514648 seconds
What is happening?
- We define an async function called fetch_async that:
  - Takes a URL and an aiohttp session
  - Makes an asynchronous HTTP GET request
  - Returns the text content of the response

- In the main() function:
  - We create a single aiohttp ClientSession, which is specifically designed to work with asyncio
  - We start task1 and task2 using asyncio.create_task() (this immediately begins the work)
  - When fetch_async() reaches await response.text(), it pauses and lets the next runnable task continue
  - As soon as data arrives from one server, the corresponding task is resumed
  - We then await the tasks to collect their results:

    page1 = await task1
    page2 = await task2

  - Alternative: the above steps can be simplified using await asyncio.gather()

Note:

- Both requests run concurrently - we are not blocking the entire program while waiting for the response
- The ClientSession is properly managed using a context manager (i.e. a with statement)
In Jupyter Notebook, to call an async function we can use await, e.g.

out1, out2 = await main()
If we were running Python in a Terminal, we would need to use:
out1, out2 = asyncio.run(main())
Playwright
Playwright is an open-source testing and automation framework that can automate web browser interactions. It is developed by Microsoft.
With Playwright, you can write code that can open a browser. Then, your code can extract text, click buttons, navigate to URLs etc.
To install Playwright, activate your conda environment and run:
Terminal
pip install playwright
playwright install
The second command installs the browser binaries (Chromium, Firefox and Webkit).
Asynchronous vs Synchronous API
Playwright provides two distinct APIs in Python:
Synchronous API:
from playwright.sync_api import sync_playwright
- Uses regular function calls
- Blocks execution until operations complete
Asynchronous API:
from playwright.async_api import async_playwright
- Uses Python's async/await syntax
- Non-blocking execution
- Better performance
- Enables parallel operation of multiple browser tabs
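For comparison, here is a minimal sketch of the synchronous API (run as a plain Python script, not in a Jupyter notebook):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://quotes.toscrape.com")
    print(page.title())   # blocks until navigation has completed
    browser.close()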
For Jupyter notebooks, you have to use the asynchronous API (otherwise you will get an error).
Example: Faculty Webpage
We first provide an example of Playwright and then we explain the code.
import asyncio
from playwright.async_api import async_playwright
async def scrape_faculty_data():
async with async_playwright() as p:
# Launch the browser and create a new context
        browser = await p.chromium.launch(headless=False)
        context = await browser.new_context()
        # Create a new page within the context
        page = await context.new_page()
        # Navigate to the webpage containing the faculty list
        await page.goto("https://statistics.rutgers.edu/people-pages/faculty")
        # Wait for the content to load
        await page.wait_for_load_state("networkidle")
        # Initialize an empty list to store data
        data = []
        # Loop through all .newsinfo elements
        faculty_items = await page.query_selector_all('.newsinfo')
        for item in faculty_items:
            # Extract data
            name = await item.query_selector('span')
            title = await item.query_selector('span.detail_data')
            email = await item.query_selector('a[href^="mailto"]')
data.append({'name': await name.text_content(),
'title': await title.text_content(),
'email': await email.text_content()
})
# Close the browser
await browser.close()
return pd.DataFrame(data)
faculty_df = await scrape_faculty_data()
faculty_df
name | title | email | |
---|---|---|---|
0 | Pierre Bellec | Associate Professor | pcb71@stat.rutgers.edu |
1 | Matteo Bonvini | Assistant Professor | mb1662@stat.rutgers.edu |
2 | Steve Buyske | Associate Professor; Co-Director of Undergradu... | buyske@stat.rutgers.edu |
3 | Javier Cabrera | Professor | cabrera@stat.rutgers.edu |
4 | Rong Chen | Distinguished Professor and Chair | rongchen@stat.rutgers.edu |
5 | Yaqing Chen | Assistant Professor | yqchen@stat.rutgers.edu |
6 | Harry Crane | Professor | hcrane@stat.rutgers.edu |
7 | Tirthankar DasGupta | Professor and Co-Graduate Director | tirthankar.dasgupta@rutgers.edu |
8 | Ruobin Gong | Associate Professor | ruobin.gong@rutgers.edu |
9 | Zijian Guo | Associate Professor | zijguo@stat.rutgers.edu |
10 | Qiyang Han | Associate Professor | qh85@stat.rutgers.edu |
11 | Donald R. Hoover | Professor | drhoover@stat.rutgers.edu |
12 | Ying Hung | Professor | yhung@stat.rutgers.edu |
13 | Koulik Khamaru | Assistant Professor | kk1241@stat.rutgers.edu |
14 | John Kolassa | Distinguished Professor | kolassa@stat.rutgers.edu |
15 | Regina Y. Liu | Distinguished Professor | rliu@stat.rutgers.edu |
16 | Gemma Moran | Assistant Professor | gm845@stat.rutgers.edu |
17 | Nicole Pashley | Assistant Professor | np755@stat.rutgers.edu |
18 | Harold B. Sackrowitz | Distinguished Professor and Undergraduate Dire... | sackrowi@stat.rutgers.edu |
19 | Michael L. Stein | Distinguished Professor | ms2870@stat.rutgers.edu |
20 | Zhiqiang Tan | Distinguished Professor | ztan@stat.rutgers.edu |
21 | David E. Tyler | Distinguished Professor | dtyler@stat.rutgers.edu |
22 | Guanyang Wang | Assistant Professor | guanyang.wang@rutgers.edu |
23 | Sijian Wang | Professor and Co-Director of FSRM and MSDS pro... | sijian.wang@stat.rutgers.edu |
24 | Han Xiao | Professor and Co-Graduate Director | hxiao@stat.rutgers.edu |
25 | Minge Xie | Distinguished Professor and Director, Office o... | mxie@stat.rutgers.edu |
26 | Min Xu | Assistant Professor | mx76@stat.rutgers.edu |
27 | Cun-Hui Zhang | Distinguished Professor and Co-Director of FSR... | czhang@stat.rutgers.edu |
28 | Linjun Zhang | Associate Professor | linjun.zhang@rutgers.edu |
What just happened?
We use the asynchronous API:

from playwright.async_api import async_playwright

We define a function with async:

async def scrape_faculty_data():
    async with async_playwright() as p:
        ...

The syntax async def defines the function scrape_faculty_data as a "coroutine".

The command async with async_playwright() as p: returns an asynchronous context manager that initializes the Playwright library. On exit of async with, all remaining browsers are closed and resources are properly cleaned up.
We set up the browser, context and page:
browser = await p.chromium.launch(headless=False)
context = await browser.new_context()
page = await context.new_page()

This initializes Playwright and creates a browser window (visible because headless=False).

Then, browser.new_context() creates an isolated browsing environment (similar to an incognito window).

Finally, context.new_page() creates a new tab within a context.

We navigate to the website and wait until all network activity stops, i.e. the page is fully loaded:

await page.goto("https://statistics.rutgers.edu/people-pages/faculty")
await page.wait_for_load_state("networkidle")
We extract data:
faculty_items = await page.query_selector_all('.newsinfo')
for item in faculty_items:
    # ... extraction code ...

- query_selector_all returns all elements that match the selector
- query_selector returns the first element that matches the selector

Note: we can use ^ to specify "starts with", e.g. a[href^="mailto"] selects the <a> tag whose href attribute starts with "mailto".

We store data in a list of dictionaries:

data.append({'name': await name.text_content(),
             'title': await title.text_content(),
             'email': await email.text_content()
             })
We use .text_content() to extract the text (similar to .text in BeautifulSoup).
Example: Jazz Transcriptions
Let’s look at a website where we need to use Playwright because the website is dynamically loaded.
https://www.blueblackjazz.com/en/jazz-piano-transcriptions/
First, we show that requests
and BeautifulSoup
can’t extract the table.
import pandas as pd
from bs4 import BeautifulSoup
import requests
url = "https://www.blueblackjazz.com/en/jazz-piano-transcriptions/"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
print(soup.find('tr').prettify())
<tr class="small">
<th>
</th>
<th>
<a href="#searchFieldId" ng-click="sortBy='name';">
Sort by transcription name
</a>
</th>
<th>
<a href="#searchFieldId" ng-click="sortBy='artist';">
By artist
</a>
</th>
<th>
<a href="#searchFieldId" ng-click="sortBy='date';">
By date
</a>
</th>
<th>
<a href="#searchFieldId" ng-click="sortBy='note';">
By note
</a>
</th>
<th>
</th>
</tr>
Now, let’s try with Playwright.
First, let’s go to the website, right click on the table and select “Inspect Element”:
We can see the table is:
<table class="table table-striped table-condensed">
This means there are three classes associated with <table> (table, table-striped, table-condensed).
Then there is another tag <tbody>, a child of <table>.
Finally, the info we are interested in is in <tr>, a child tag of <tbody> (see next image).
In <tr> with class="ng-scope":
- the second child has the title in the text (“12th Street Boogie”)
- the third child has the name in the text (“Albert Ammons”)
- the fourth child has the year in the text (“1946”)
- the fifth child has the key in the text (“C Maj”)
import pandas as pd
from playwright.async_api import async_playwright
async def scrape_jazz_data():
async with async_playwright() as p:
# Launch the browser and create a new context
        browser = await p.chromium.launch()
        context = await browser.new_context()
        # Create a new page within the context
        page = await context.new_page()
        # Navigate to the webpage containing the list of transcriptions
        await page.goto("https://www.blueblackjazz.com/en/jazz-piano-transcriptions/")
        # Wait for the content to load
        await page.wait_for_load_state("networkidle")
        await page.wait_for_selector('.table.table-striped.table-condensed')
        # Find all rows in the table
        rows = await page.query_selector_all('.table.table-striped.table-condensed tbody tr')
        transcriptions = []
        for row in rows:
            # Extract data from each row
            cells = await row.query_selector_all('td')
            if len(cells) > 4:
                title = await cells[1].inner_text()
                artist = await cells[2].inner_text()
                date = await cells[3].inner_text()
                note = await cells[4].inner_text()

                transcriptions.append({'title': title,
                                       'artist': artist,
                                       'date': date,
                                       'note': note
                                       })
        await browser.close()
    return pd.DataFrame(transcriptions)

jazz_df = await scrape_jazz_data()
jazz_df
title | artist | date | note | |
---|---|---|---|---|
0 | 20 Solos For Piano Volume 1 | Fats Waller | ||
1 | 20 Solos For Piano Volume 2 | Fats Waller | ||
2 | 20 Solos For Piano Volume 3 | Fats Waller | ||
3 | 17 Solos For Piano Volume 1 | James P. Johnson | ||
4 | 17 Solos For Piano Volume 2 | James P. Johnson | ||
... | ... | ... | ... | ... |
455 | You're Gonna Be Sorry | Fats Waller | 1941 | C Maj |
456 | You're My Favorite Memory | Teddy Wilson | 1946 | Eb Maj |
457 | You're the top | Fats Waller | 1935 | Eb Maj |
458 | Zig Zag | Willie "the Lion" Smith | 1949 | E min |
459 | Zonky | Fats Waller | 1935 | F min |
460 rows × 4 columns
.text_content() vs. .inner_text()
We have seen both .text_content() and .inner_text().
- .text_content(): gets all text, including text hidden from display
- .inner_text(): gets just the text displayed, as you would see it in the browser
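A small sketch of the difference, assuming an async Playwright page object is already open (the HTML is set directly for illustration):

await page.set_content('<div id="d">visible <span style="display:none">hidden</span></div>')
el = await page.query_selector('#d')
print(await el.text_content())   # 'visible hidden' -- includes the hidden text
print(await el.inner_text())     # 'visible'        -- only the text rendered in the browser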
Example: Quotes with Multiple Pages
Playwright can emulate browser actions, so we can “click” next to get to the next page:
import asyncio
from playwright.async_api import async_playwright
async def scrape_quotes():
async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://quotes.toscrape.com")

        all_quotes = []

        while True:
            quotes = await page.query_selector_all(".quote")
            for quote in quotes:
                text = await (await quote.query_selector(".text")).inner_text()
                author = await (await quote.query_selector(".author")).inner_text()

                all_quotes.append({'text': text,
                                   'author': author
                                   })

            next_button = await page.query_selector("li.next > a")
            if next_button:
                await next_button.click()
                await page.wait_for_load_state("load")
            else:
                break

        await browser.close()
        return all_quotes

quotes = await scrape_quotes()
Example: Quotes with Infinite Scroll
Using Playwright, we can also scrape websites that have dynamically loaded content accessed through scrolling down the webpage.
The website we consider is https://quotes.toscrape.com/scroll.
The logic is:
- scroll to bottom of the page
- count how many quote selectors there are
- scroll again
- if there is a larger number of quotes, go to the previous step
- if the number of quotes is constant for 3 scrolls, take that as total number of quotes
- process quotes
import asyncio
from playwright.async_api import async_playwright
async def scrape_quotes_scroll():
async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto("https://quotes.toscrape.com/scroll")

        last_count = 0
        same_count_repeats = 0
        max_repeats = 3

        while same_count_repeats < max_repeats:
            # scroll from 0 (top of page) to bottom of page (document.body.scrollHeight)
            await page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
            await page.wait_for_timeout(3000)  # wait for 3 seconds
            quotes = await page.query_selector_all(".quote")
            current_count = len(quotes)

            if current_count == last_count:
                same_count_repeats += 1
            else:
                same_count_repeats = 0
                last_count = current_count

        print(f"\nTotal quotes found: {len(quotes)}\n")

        quote_list = []

        for quote in quotes:
            text = await (await quote.query_selector(".text")).inner_text()
            author = await (await quote.query_selector(".author")).inner_text()

            quote_list.append({'text': text,
                               'author': author
                               })

        await browser.close()
        return quote_list

quotes = await scrape_quotes_scroll()
Total quotes found: 100
pd.DataFrame(quotes)
text | author | |
---|---|---|
0 | “The world as we have created it is a process ... | Albert Einstein |
1 | “It is our choices, Harry, that show what we t... | J.K. Rowling |
2 | “There are only two ways to live your life. On... | Albert Einstein |
3 | “The person, be it gentleman or lady, who has ... | Jane Austen |
4 | “Imperfection is beauty, madness is genius and... | Marilyn Monroe |
... | ... | ... |
95 | “You never really understand a person until yo... | Harper Lee |
96 | “You have to write the book that wants to be w... | Madeleine L'Engle |
97 | “Never tell the truth to people who are not wo... | Mark Twain |
98 | “A person's a person, no matter how small.” | Dr. Seuss |
99 | “... a mind needs books as a sword needs a whe... | George R.R. Martin |
100 rows × 2 columns
Review: Wikipedia
We review scraping with requests
and BeautifulSoup
for a difficult scraping task. The goal is to scrape current events from Wikipedia.
import requests
import bs4
import pandas as pd
url = "https://en.wikipedia.org/wiki/Portal:Current_events/March_2025"
response = requests.get(url)
soup = bs4.BeautifulSoup(response.text, "html.parser")
We see the days have class="vevent".
days = soup.select('.vevent')
Where is the date of each day?
print(days[0].prettify()[:390])
<div aria-label="March 1" class="current-events-main vevent" id="2025_March_1" role="region">
<div class="current-events-heading plainlinks">
<div class="current-events-title" role="heading">
<span class="summary">
March 1, 2025
<span style="display: none;">
(
<span class="bday dtstart published updated itvstart">
2025-03-01
</span>
)
</span>
days[0].select('.bday.dtstart.published.updated.itvstart')[0].text
'2025-03-01'
The events themselves are in class="description".
events = days[0].select('.description')[0]  # returns a list, need [0]
print(events.prettify()[:1000])
<div class="current-events-content description">
<p>
<b>
Armed conflicts and attacks
</b>
</p>
<ul>
<li>
<a href="/wiki/War_on_terror" title="War on terror">
War on terror
</a>
<ul>
<li>
<a href="/wiki/War_against_the_Islamic_State" title="War against the Islamic State">
War against the Islamic State
</a>
<ul>
<li>
<a href="/wiki/Islamic_State_insurgency_in_Puntland" title="Islamic State insurgency in Puntland">
Islamic State insurgency in Puntland
</a>
<ul>
<li>
<a href="/wiki/Puntland_counter-terrorism_operations" title="Puntland counter-terrorism operations">
Puntland counter-terrorism operations
</a>
<ul>
<li>
The
<a href="/wiki/Puntland_Dervish_Force" title="Puntland Dervish Force">
Puntland Dervish Force
</a>
captures an
<a href="/wiki/Islamic_State_%E2%80%93_Somalia_Province"
Let’s see the categories:
categories = events.select('p')
categories
[<p><b>Armed conflicts and attacks</b>
</p>,
<p><b>Arts and culture</b>
</p>,
<p><b>Disasters and accidents</b>
</p>,
<p><b>International relations</b>
</p>,
<p><b>Politics and elections</b>
</p>,
<p><b>Sports</b>
</p>]
We can get direct children of events using .contents.
sub_events = events.contents
We now can grab the events between the categories.
Let’s keep only the items that have no children items (these are the actual events, instead of their event sub-categories).
event_list = []

for sub in sub_events:
    if isinstance(sub, bs4.element.Tag):
        cat = sub.select('b')
        if cat:
            cat = cat[0].text
            print(cat)
        # if there are <li> tags in sub:
        dat = sub.select('li')
        if dat:
            # Find all <li> tags that do NOT contain <li> children
            leaf_lis = [
                li for li in sub.find_all('li')
                if not li.find('li')
            ]
            # Print each leaf <li> and its cleaned-up text
            for li in leaf_lis:
                print('---', li.text)
Armed conflicts and attacks
--- The Puntland Dervish Force captures an IS–Somalia base in Buqa Caleed, in the Cal Miskaad mountain range of Bari Region, Puntland, Somalia. (The Somali Digest) (Horseed Media)
--- United States Central Command says that it has carried out a precision airstrike in Syria, targeting and killing Muhammed Yusuf Ziya Talay, a senior military leader in Hurras al-Din. (Al Arabiya)
--- Three men are found dead in a vehicle near the village of Orú in the Tibú municipality, Norte de Santander, Colombia, with one body dressed in a National Liberation Army (ELN) uniform. (El Heraldo de Colombia)
--- After placing an ELN flag at the entrance of the municipality of Saravena, Arauca, the ELN detonate an improvised explosive device, targeting Colombian soldiers attempting to remove the flag. No casualties are reported. (El Heraldo de Colombia)
--- The first phase of the ceasefire between Israel and Hamas is scheduled to expire today while talks on the second phase, which aims to end the war, remain inconclusive. (DW)
--- Israel blocks the entry of all humanitarian aid from entering Gaza as the first phase of the ceasefire ends. (AP)
--- Civil society groups in the Ituri Province, Democratic Republic of the Congo, report that 23 people were killed and another 20 were taken hostage in raids by an Islamic State-affiliated faction of the Allied Democratic Forces militia over the past week. (Arab News)
--- The Kurdistan Workers' Party announces a ceasefire with Turkey after forty years of conflict. (Al Jazeera)
--- At least one person is killed and approximately nine others are wounded in the Druze-majority city of Jaramana, following armed confrontations between local residents and security forces affiliated with the transitional government. In response, the Suwayda Military Council declares a state of alert, while Israeli prime minister Benjamin Netanyahu and Defense Minister Israel Katz instruct the Israel Defense Forces to "prepare to defend" the city. (ANHA) (Times of Israel)
Arts and culture
--- At the 2025 Brit Awards, Charli XCX wins British Artist of the Year, while her album Brat wins British Album of the Year and her song "Guess" wins Song of the Year in collaboration with Billie Eilish. Ezra Collective wins Best British Group. (BBC News)
--- A group of winter swimmers in Most, Czechia, set a new world record for the largest polar bear plunge with 2,461 participants. The previous record was 1,799 participants set in Mielno, Poland, in 2015. (AP)
Disasters and accidents
--- At least 37 people are killed and 30 others are injured when two passenger buses collide near Uyuni, Potosí department, Bolivia. (BBC News)
International relations
--- Ukrainian president Volodymyr Zelenskyy meets with UK prime minister Keir Starmer in London, where they sign off on a British loan of GB£2.26 billion to buy military supplies for Ukraine. (BBC)
Politics and elections
--- Tens of thousands of demonstrators hold a rally in Bucharest, Romania, in support of presidential candidate Călin Georgescu and demand that the second round of the annulled 2024 election is held instead of a new election. (AP)
--- United States President Donald Trump signs an executive order designating English as the country's official language. (The Guardian)
--- Yamandú Orsi and Carolina Cosse are inaugurated as the president and vice president of Uruguay in Montevideo. (Reuters)
--- Italian nun Raffaella Petrini is sworn is as President of the Pontifical Commission for Vatican City State and President of the Governorate of Vatican City State, becoming the first woman to assume one of the highest political offices in the Vatican. She succeeds Spanish cardinal Fernando Vérgez Alzaga. (RTVE)
Sports
--- At their annual general meeting in Northern Ireland, the International Football Association Board approves a new rule stating that beginning the following season, if a goalkeeper holds the ball for more than eight seconds, the opposing team is awarded a corner kick. (BBC)
Now, let’s put all this together.
days = soup.select('.vevent')

all = []
for day in days:
    date = day.select_one('.bday.dtstart.published.updated.itvstart').text
    events = day.select_one('.description')
    sub_events = events.contents

    for sub in sub_events:
        if isinstance(sub, bs4.element.Tag):
            if sub.select('b'):
                cat = sub.select_one('b').text

            # if there are <li> tags in sub:
            if sub.select('li'):
                # Find all <li> tags that do NOT contain <li> children
                for li in sub.find_all('li'):
                    if not li.find('li'):
                        all.append({'date': date,
                                    'category': cat,
                                    'event': li.text})

march = pd.DataFrame(all)
march
date | category | event | |
---|---|---|---|
0 | 2025-03-01 | Armed conflicts and attacks | The Puntland Dervish Force captures an IS–Soma... |
1 | 2025-03-01 | Armed conflicts and attacks | United States Central Command says that it has... |
2 | 2025-03-01 | Armed conflicts and attacks | Three men are found dead in a vehicle near the... |
3 | 2025-03-01 | Armed conflicts and attacks | After placing an ELN flag at the entrance of t... |
4 | 2025-03-01 | Armed conflicts and attacks | The first phase of the ceasefire between Israe... |
... | ... | ... | ... |
434 | 2025-03-31 | International relations | The United States announces sanctions on six C... |
435 | 2025-03-31 | International relations | The Moldovan foreign affairs ministry expels t... |
436 | 2025-03-31 | International relations | Newly elected Prime Minister of Greenland Jens... |
437 | 2025-03-31 | Law and crime | A court in Abu Dhabi, United Arab Emirates, se... |
438 | 2025-03-31 | Law and crime | National Rally politician Marine Le Pen is con... |
439 rows × 3 columns
Conclusion
We have considered a number of different approaches to web scraping with Python:
- requests for HTTP requests and BeautifulSoup for parsing HTML
- scrapy
- Playwright
When to use what?
In general:
Static HTML, simple tasks
✅ requests and BeautifulSoup are good for quick scrapes
✅ scrapy is very efficient
❌ Playwright can be overkill (and slow)

Static HTML, multiple pages, large scale
❌ requests and BeautifulSoup are slow
✅ scrapy is very efficient
❌ Playwright can be slow (runs a browser)

Javascript-rendered content, interactivity
❌ requests and BeautifulSoup
❌ scrapy
✅ Playwright