In this tutorial, we’ll show you how to perform web scraping using Python 3 and the BeautifulSoup library.
We’ll be scraping weather forecasts from the National Weather Service, and then analyzing them using the Pandas library.
Before we get started, if you’re looking for more background on APIs or the csv format, you might want to check out our Dataquest courses on:
The requests library
Let’s try downloading a simple sample website,
http://dataquestio.github.io/web-scraping-pages/simple.html
. We’ll need to
first download it using the
requests.get
method.
Hello World with requests
import requests
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page # <Response [200]>
page.status_code # 200
page.content # display html
Parsing a page with BeautifulSoup
use BeautifulSoup library to parse this document, and extract text
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
soup.prettify() # print html
list(soup.children) # collect
type(item) for item in list(soup.children)
# [bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag]
# [0] `Doctype` object, which contains information about the type of the document.
# [1] `NavigableString`, which represents text found in the HTML document.
# [2] `Tag` object, which contains other nested tags.
html = list(soup.children)[2]
parse children inside the html
tag:
list(html.children)
# ['\n',
# <head> <title>A simple example page</title> </head>,
# '\n',
# <body> <p>Here is some simple content for this page.</p> </body>,
# '\n']
body = list(html.children)[3] # get <body>
list(body.children) # <body> content
# ['\n',
# <p>Here is some simple content for this page.</p>,
# '\n']
p = list(body.children)[1] # <p> content
p.get_text() # extract text from <p>
# 'Here is some simple content for this page.'
Finding all instances of a tag at once
Use find_all
and find
instead of manually traversing
soup.find_all('p') # returns [] of all tags found
# [<p>Here is some simple content for this page.</p>]
soup.find_all('p')[0].get_text() # go directly to <p> we want
# 'Here is some simple content for this page.'
soup.find('p') # only get 1st instance found
# <p>Here is some simple content for this page.</p>
Finding instances by class or id
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
outer_text = soup.find_all(class_="outer-text") # find all elements with class
outer_text = soup.find_all('p', class_='outer-text') # find p with class
first_id = soup.find_all(id="first") # find by id
Using CSS Selectors
You can also search for items using CSS selectors.
-
p a
— finds alla
tags inside of ap
tag. -
body p a
— finds alla
tags inside of ap
tag inside of abody
tag. -
html body
— finds allbody
tags inside of anhtml
tag. -
p.outer-text
— finds allp
tags with a class ofouter-text
. -
p#first
— finds allp
tags with an id offirst
. -
body p.outer-text
— finds anyp
tags with a class ofouter-text
inside of abody
tag.
Use CSS Selector select
soup.select("div p")
# [<p class="inner-text first-item" id="first">
# First paragraph.
# </p>, <p class="inner-text">
# Second paragraph.
# </p>]
Downloading weather data
Extract weather information about downtown San Francisco here.
Exploring page structure with Chrome DevTools
The first thing we’ll need to do is inspect the page using Chrome Devtools.
We want the div
tag with the id seven-day-forecast
If you click around on the console, and explore the div, you’ll discover that
each forecast item (like “Tonight”, “Thursday”, and “Thursday Night”) is
contained in a div
with the class tombstone-container
.
We now know enough to download the page and start parsing it. In the below code, we:
- Download the web page containing the forecast.
- Create a
BeautifulSoup
class to parse the page. - Find the
div
with idseven-day-forecast
, and assign toseven_day
- Inside
seven_day
, find each individual forecast item. - Extract and print the first forecast item.
Parse weather and print first forecast
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
print(tonight.prettify())
Note the use of class_=
vs id=
in find_all
extract the name, short description, and temperature
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
extract the title
attribute from the img
tag.
desc = img['title']
img = tonight.find("img")
print(desc)
Select all items within tombstone-container
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
print(periods)
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(short_descs)
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print(temps)
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(descs)
Combining our data into a Pandas Dataframe
We can now combine the data into a Pandas DataFrame and analyze it. A DataFrame is an object that can store tabular data, making data analysis easy. If you want to learn more about Pandas, check out our free to start course here.
Set up pandas DataFrame
import pandas as pd
weather = pd.DataFrame({
"period": periods,
"short_desc": short_descs,
"temp": temps,
"desc":descs
})
print(weather)
We can now do some analysis on the data using Series.str.extract
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html)
temp_nums = weather["temp"].str.extract("(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
print(temp_nums)
print(weather["temp_num"].mean()) # find the mean of all the high and low temperatures
# We could also only select the rows that happen at night:
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
print(is_night)
Final Code
scrape.py
scrape.py
import requests
from bs4 import BeautifulSoup
page = requests.get("http://dataquestio.github.io/web-scraping-pages/simple.html")
page # <Response [200]>
page.status_code # 200
page.content # display html
soup = BeautifulSoup(page.content, 'html.parser')
#print(soup.prettify()) # print html
# print(list(soup.children)) # collect
# print([type(item) for item in list(soup.children)])
#bs4.element.Doctype, bs4.element.NavigableString, bs4.element.Tag
html = list(soup.children)[2] # [2] contains the <html> content
#print(html) # print html content
list(html.children)
# ['\n',
# <head> <title>A simple example page</title> </head>,
# '\n',
# <body> <p>Here is some simple content for this page.</p> </body>,
# '\n']
body = list(html.children)[3] # get <body>
list(body.children) # <body> content
# ['\n',
# <p>Here is some simple content for this page.</p>,
# '\n']
p = list(body.children)[1] # <p> content
p = p.get_text() # extract text from <p>
# 'Here is some simple content for this page.'
print("---\n<p>: " + p)
# ---
# Find with css selectors
# ---
page = requests.get("http://dataquestio.github.io/web-scraping-pages/ids_and_classes.html")
soup = BeautifulSoup(page.content, 'html.parser')
outer_text = soup.find_all(class_="outer-text") # find all elements with class
outer_text = soup.find_all('p', class_='outer-text') # find p with class
print("\n--- outer_text: ")
print(outer_text)
first_id = soup.find_all(id="first") # find by id
print("\n--- first_id: ")
print(first_id)
div_p = soup.select("div p") # Use CSS Selector
print("\n--- div_p: ")
print(div_p)
weather.py
weather.py
import requests
from bs4 import BeautifulSoup
page = requests.get("http://forecast.weather.gov/MapClick.php?lat=37.7772&lon=-122.4168")
soup = BeautifulSoup(page.content, 'html.parser')
seven_day = soup.find(id="seven-day-forecast")
forecast_items = seven_day.find_all(class_="tombstone-container")
tonight = forecast_items[0]
period = tonight.find(class_="period-name").get_text()
short_desc = tonight.find(class_="short-desc").get_text()
temp = tonight.find(class_="temp").get_text()
print(period)
print(short_desc)
print(temp)
img = tonight.find("img")
desc = img['title']
print(desc)
print("\n---\n")
# :ship: Select all items within `tombstone-container`
period_tags = seven_day.select(".tombstone-container .period-name")
periods = [pt.get_text() for pt in period_tags]
print(periods)
short_descs = [sd.get_text() for sd in seven_day.select(".tombstone-container .short-desc")]
print(short_descs)
temps = [t.get_text() for t in seven_day.select(".tombstone-container .temp")]
print(temps)
descs = [d["title"] for d in seven_day.select(".tombstone-container img")]
print(descs)
print("\n---\n")
import pandas as pd
# :ship: Set up pandas DataFrame
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
weather = pd.DataFrame({
"period": periods,
"short_desc": short_descs,
"temp": temps,
"desc":descs
})
print(weather)
print("\n---\n")
# :ship: We can now do some analysis on the data using `Series.str.extract`
# http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.str.extract.html)
temp_nums = weather["temp"].str.extract("(?P<temp_num>\d+)", expand=False)
weather["temp_num"] = temp_nums.astype('int')
print(temp_nums)
print("\n---\n")
print(weather["temp_num"].mean()) # find the mean of all the high and low temperatures
print("\n---\n")
# We could also only select the rows that happen at night:
is_night = weather["temp"].str.contains("Low")
weather["is_night"] = is_night
print(is_night)
Next Steps
If you want to learn more about any of the topics covered here, check out our interactive courses which you can start for free: Web Scraping in Python