Python for Network Engineers

How to Scrape the Web With Python and lxml or Beautiful Soup

by: George El., January 2019, Reading time: 3 minutes

Most of the time, you will use the API provided by the site to access information. However, sometimes you may need to read the web page and extract the data yourself. Please read the license terms of every site first, since scraping may not be permitted.

Here I will show you how to get the box office for the current week from imdb.com. I will show you how to do it with lxml, which uses XPath, and with Beautiful Soup, which uses CSS selectors.

First we use the requests library to get the page content. Then we pass this content to the html.fromstring method and get a tree object of the HTML document. Now we can use XPath to navigate through the document.

To identify our content, we right-click on one of the titles and click Inspect in Chrome. A new window will open which shows us the HTML and CSS structure. The content we want is in a table, in a td with class titleColumn.

from lxml import html
import requests

page = requests.get("https://www.imdb.com/chart/boxoffice")
tree = html.fromstring(page.content)

for content in tree.xpath('//td[@class="titleColumn"]'):
    # xpath() returns a list, so take the first match
    title_link = content.xpath('.//a/@href')[0]
    title = content.xpath('.//a/text()')[0]
    print(title, title_link)

This will output:

Glass /title/tt6823368?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=f9f31d04-fc22-4d12-86b4-f46e25aa2f6f&pf_rd_r=E73JZF4AF4AHJ8W4D2TG&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=boxoffice&ref_=cht_bo_1
The Upside /title/tt1987680?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=f9f31d04-fc22-4d12-86b4-f46e25aa2f6f&pf_rd_r=E73JZF4AF4AHJ8W4D2TG&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=boxoffice&ref_=cht_bo_2
Doragon bôru chô: Burorî - Dragon Ball Super: Broly /title/tt7961060?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=f9f31d04-fc22-4d12-86b4-f46e25aa2f6f&pf_rd_r=E73JZF4AF4AHJ8W4D2TG&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=boxoffice&ref_=cht_bo_3
Aquaman /title/tt1477834?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=f9f31d04-fc22-4d12-86b4-f46e25aa2f6f&pf_rd_r=E73JZF4AF4AHJ8W4D2TG&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=boxoffice&ref_=cht_bo_4
Spider-Man: Into the Spider-Verse /title/tt4633694?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=f9f31d04-fc22-4d12-86b4-f46e25aa2f6f&pf_rd_r=E73JZF4AF4AHJ8W4D2TG&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=boxoffice&ref_=cht_bo_5
A Dog's Way Home /title/tt7616798?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=f9f31d04-fc22-4d12-86b4-f46e25aa2f6f&pf_rd_r=E73JZF4AF4AHJ8W4D2TG&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=boxoffice&ref_=cht_bo_6
Escape Room /title/tt5886046?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=f9f31d04-fc22-4d12-86b4-f46e25aa2f6f&pf_rd_r=E73JZF4AF4AHJ8W4D2TG&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=boxoffice&ref_=cht_bo_7
Mary Poppins Returns /title/tt5028340?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=f9f31d04-fc22-4d12-86b4-f46e25aa2f6f&pf_rd_r=E73JZF4AF4AHJ8W4D2TG&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=boxoffice&ref_=cht_bo_8
Bumblebee /title/tt4701182?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=f9f31d04-fc22-4d12-86b4-f46e25aa2f6f&pf_rd_r=E73JZF4AF4AHJ8W4D2TG&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=boxoffice&ref_=cht_bo_9
On the Basis of Sex /title/tt4669788?pf_rd_m=A2FGELUUNOQJNL&pf_rd_p=f9f31d04-fc22-4d12-86b4-f46e25aa2f6f&pf_rd_r=E73JZF4AF4AHJ8W4D2TG&pf_rd_s=center-1&pf_rd_t=15506&pf_rd_i=boxoffice&ref_=cht_bo_10

Actually, I don't need anything in the link after the "?", and I also need to add http://www.imdb.com at the front.

So I am going to keep the string only up to the "?" and then join it with "http://www.imdb.com":

    link = link[:link.index("?")]
    link = "".join(("http://www.imdb.com", link))

Now I get the following output:

Glass http://www.imdb.com/title/tt6823368
The Upside http://www.imdb.com/title/tt1987680
Doragon bôru chô: Burorî - Dragon Ball Super: Broly http://www.imdb.com/title/tt7961060
Aquaman http://www.imdb.com/title/tt1477834
Spider-Man: Into the Spider-Verse http://www.imdb.com/title/tt4633694
A Dog's Way Home http://www.imdb.com/title/tt7616798
Escape Room http://www.imdb.com/title/tt5886046
Mary Poppins Returns http://www.imdb.com/title/tt5028340
Bumblebee http://www.imdb.com/title/tt4701182
On the Basis of Sex http://www.imdb.com/title/tt4669788
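One caveat with the slicing above: link.index("?") raises ValueError when a link has no query string. As a minimal sketch of an alternative, the standard library's urllib.parse can do both steps. The helper name clean_link and the BASE_URL constant are my own, not part of the original program.

```python
from urllib.parse import urljoin, urlsplit

# Base taken from the article; adjust if the site moves.
BASE_URL = "http://www.imdb.com"


def clean_link(href, base=BASE_URL):
    """Drop the query string and fragment, then make the link absolute."""
    path = urlsplit(href).path      # "/title/tt6823368?x=1" -> "/title/tt6823368"
    return urljoin(base, path)      # prepend scheme and host


# Works with or without a "?" in the link.
print(clean_link("/title/tt6823368?pf_rd_i=boxoffice&ref_=cht_bo_1"))
print(clean_link("/title/tt6823368"))
```

Both calls should print http://www.imdb.com/title/tt6823368.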

The whole program is:

from lxml import html
import requests

page = requests.get("https://www.imdb.com/chart/boxoffice")
tree = html.fromstring(page.content)

# Each title sits in a <td class="titleColumn"> cell
for content in tree.xpath('//td[@class="titleColumn"]'):
    link = content.xpath('.//a/@href')[0]
    title = content.xpath('.//a/text()')[0]
    link = link[:link.index("?")]  # drop the query string
    link = "".join(("http://www.imdb.com", link))
    print(title, link)
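The XPath expressions can also be tried without touching the live site. The sketch below runs them against a small hand-written snippet. SAMPLE is an assumption standing in for the real chart markup, which is larger and may differ from what is shown here.

```python
from lxml import html

# Hand-written stand-in for the chart table (assumed structure).
SAMPLE = """
<table>
  <tr><td class="titleColumn"><a href="/title/tt6823368?ref_=cht_bo_1">Glass</a></td></tr>
  <tr><td class="titleColumn"><a href="/title/tt1987680?ref_=cht_bo_2">The Upside</a></td></tr>
  <tr><td class="ratingColumn">not a title cell</td></tr>
</table>
"""

tree = html.fromstring(SAMPLE)

rows = []
for cell in tree.xpath('//td[@class="titleColumn"]'):
    # xpath() always returns a list; skip cells that have no link.
    hrefs = cell.xpath('.//a/@href')
    titles = cell.xpath('.//a/text()')
    if hrefs and titles:
        rows.append((titles[0], hrefs[0]))

for title, href in rows:
    print(title, href)
```

Checking that the lists are non-empty before indexing avoids an IndexError if the page layout changes and a cell no longer contains a link.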

With Beautiful Soup the code is as follows:

from bs4 import BeautifulSoup
import requests

page_link = 'https://www.imdb.com/chart/boxoffice'
page_response = requests.get(page_link, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")

# Select every <a> inside a <td> whose class attribute is titleColumn
for anchor in soup.select('td[class=titleColumn] a'):
    title = anchor.text
    link = anchor.get('href')
    link = link[:link.index("?")]  # drop the query string
    link = "".join(("http://www.imdb.com", link))
    print(title, link)

It is more or less the same. After we get the page, we pass the content to BeautifulSoup to create a soup object. Then we can use CSS selectors with its select method to extract the data.
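As with lxml, the selector can be checked offline first. This is a minimal sketch on a hand-written snippet. SAMPLE is an assumption, not the real page. It uses the shorter class selector td.titleColumn, which also matches when the cell carries additional classes.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for the chart table (assumed structure).
SAMPLE = """
<table>
  <tr><td class="titleColumn"><a href="/title/tt6823368?ref_=cht_bo_1">Glass</a></td></tr>
  <tr><td class="titleColumn extra"><a href="/title/tt1987680?ref_=cht_bo_2">The Upside</a></td></tr>
  <tr><td class="ratingColumn"><a href="/ratings">not a title</a></td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

# "td.titleColumn a" = every <a> nested in a <td> that has the class titleColumn
rows = [
    (anchor.get_text(strip=True), anchor.get("href"))
    for anchor in soup.select("td.titleColumn a")
]

for title, href in rows:
    print(title, href)
```

The second row would be missed by an exact attribute match such as td[class=titleColumn], because its class attribute is "titleColumn extra".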

Please note that scraping can be illegal. Read the license terms, or ask permission before you do any heavy scraping. If the site offers an API, you should use the API. And never overload the server with requests.
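Besides the license terms, many sites publish a robots.txt file stating which paths automated clients may fetch. Here is a minimal sketch of checking it with the standard library. The rules in ROBOTS_TXT, the example.com URLs, and the agent name example-bot are all made up for illustration; a real script would load the site's own file with set_url() and read().

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, agent="example-bot"):
    """Return True if the parsed rules let this agent fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("https://example.com/chart/boxoffice"))  # permitted path
print(allowed("https://example.com/private/data"))     # disallowed path
print(parser.crawl_delay("example-bot"))                # seconds to wait between requests
```

If a crawl delay is given, calling time.sleep() with that value between requests is a simple way to avoid overloading the server.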
