HER DATA LEARNS: WEB SCRAPING

Updated: May 11

Priyanka Dobhal walks us through Web Scraping

For our first Her Data Learns, we have Priyanka Dobhal teaching us how to web scrape. Watch our Zoom call with her as she talks us through an example, and check out the resources and references she prepared. All content from the call is listed below for reference.


To get started, Priyanka walked us through the HTML basics needed for web scraping.


What is HTML?

HTML is the language in which most websites are written. It is used to create pages and define how their content is structured and displayed.


There are two important concepts to cover - tags and attributes.

  • HTML tags are the hidden keywords within a web page that define how your web browser must format and display the content.

  • Most tags must have two parts, an opening and a closing part. For example, <html> is the opening tag and </html> is the closing tag. Note that the closing tag has the same text as the opening tag, but has an additional forward-slash ( / ) character.

  • There are some tags that are an exception to this rule, and where a closing tag is not required. The <img> tag for showing images is one example of this.

  • Example - <b> content </b>

  • An attribute is used to define the characteristics of an HTML element and is placed inside the element's opening tag. All attributes are made up of two parts - a name and a value.

  • Example - <p align = "left">This is left aligned</p>
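To see how a tag, its attribute, and its content fit together, here is a small illustrative Python sketch (not from the call itself; it uses the BeautifulSoup library introduced later in this post) that parses the paragraph example above:

    from bs4 import BeautifulSoup

    html = '<p align="left">This is left aligned</p>'
    p_tag = BeautifulSoup(html, "html.parser").find("p")
    print(p_tag.name)          # tag name: p
    print(p_tag.get("align"))  # attribute value: left
    print(p_tag.text)          # content between the opening and closing tags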

Next, she walked through the steps of web scraping, using Google Colab throughout; a rough code sketch of the whole workflow follows the list below.


What are the steps in Web Scraping?

  1. Find the URL that you want to scrape

  2. Inspect the page

  3. Write the code

  4. Run the code and extract the data

  5. Store the data in the required format
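As a rough, illustrative sketch of these five steps end to end (the URL is the one used later in the call; the output file name is made up):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import pandas as pd

    # Step 1: the URL you want to scrape
    url = "https://en.wikipedia.org/wiki/The_Good_Place"
    # Steps 2-4: after inspecting the page, write and run code to extract the data
    page_soup = BeautifulSoup(urlopen(url), "html.parser")
    links = [a.get("href") for a in page_soup.findAll("a") if a.get("href")]
    # Step 5: store the data in the required format
    pd.DataFrame({"link": links}).to_csv("links.csv", index=False)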

Before writing the code, Priyanka explained, we need to install the required libraries for this process.


Python Libraries -

  • urllib.request

The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world - basic and digest authentication, redirections, cookies and more.

In particular, the urllib.request module contains a function called urlopen() that can be used to open a URL within a program.
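For example, a minimal use of urlopen() (the URL is just for illustration):

    from urllib.request import urlopen

    # urlopen() returns a response object; read() gives the page's raw bytes
    with urlopen("https://en.wikipedia.org/wiki/The_Good_Place") as response:
        html_bytes = response.read()
    print(html_bytes[:100])  # the first 100 bytes of the HTML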

  • BeautifulSoup

Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
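A small illustrative sketch of those idioms - parsing a string of HTML and iterating over the tags it contains:

    from bs4 import BeautifulSoup

    html = "<ul><li>Tags</li><li>Attributes</li></ul>"
    tree = BeautifulSoup(html, "html.parser")
    for li in tree.findAll("li"):  # search for every <li> in the parse tree
        print(li.text)             # prints "Tags", then "Attributes"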

  • Pandas

Pandas is the most popular Python library for data analysis. Using it, you can organize the scraped data and export it to a file (such as CSV or Excel) for further use.
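For instance, a minimal sketch of the export step (the values and file name here are made up):

    import pandas as pd

    # hypothetical scraped values
    data = {"episode": ["Pilot", "Flying"], "season": [1, 1]}
    df = pd.DataFrame(data)
    df.to_csv("episodes.csv", index=False)  # or df.to_excel("episodes.xlsx", index=False)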


Now we write the code in Google Colab.


Most common commands used:

  • findAll()

Pass a tag name to find all occurrences of that tag.

Example: soup.findAll("a")

This would return all the <a> tags.

To get a specific attribute, call get() on each returned tag - for example, [a.get("href") for a in soup.findAll("a")]. Note that findAll() returns a list of tags, so get() must be called on each element, not on the list itself.

This would return all the links. A runnable demo of both commands appears after the find() section below.

  • find()

Pass a tag name to find only the first occurrence.

Example: soup.find("a")

This would return the first <a> tag.

To find a specific attribute within the tag - soup.find("a").get("href")

This would return the link of the first <a> tag.
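To make the difference between the two commands concrete, a small self-contained demo (the HTML string and URLs are made up):

    from bs4 import BeautifulSoup

    html = '<a href="https://a.example">first</a> <a href="https://b.example">second</a>'
    soup = BeautifulSoup(html, "html.parser")

    print(soup.find("a").get("href"))                  # just the first link
    print([a.get("href") for a in soup.findAll("a")])  # every link on the page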


Google Colab Code:


For web scraping, we'll need these libraries -

  • urllib.request - To open URLs

  • BeautifulSoup - To extract data from HTML files

  • Pandas - To perform any data manipulation

  • xlsxwriter - To save the result in Excel

Let us check whether the above-mentioned libraries are pre-installed -

    !pip list

To install any library, use the syntax below -


    !pip install beautifulsoup4
    !pip install pandas
    !pip install urllib3

Import the libraries -

    from bs4 import BeautifulSoup as soup
    from urllib.request import urlopen
    import pandas as pd
    from google.colab import files

Get the HTML of the page and parse it -


    url = "https://en.wikipedia.org/wiki/The_Good_Place"
    page_html = urlopen(url)
    print(page_html)


Create a Beautiful Soup object from the HTML. This is done by passing the HTML to the BeautifulSoup() function. The Beautiful Soup package is used to parse the HTML, that is, take the raw HTML text and break it into Python objects.

    page_soup = soup(page_html, "html.parser")
    print(page_soup)

Find the outer section (tag) to capture. Note: I'm using find() since I need to extract only the first table with class = "infobox vevent".

    table_outer = page_soup.find("table", {"class": "infobox vevent"})
    # If there are multiple tables with the same name and you need to capture all of them, then use findAll()
    print(table_outer)

To capture the individual information, identify the one common tag among them. In this case, it is the "tr" tag. Extract all the tr tags within table_outer -

    tr_tags = table_outer.findAll('tr')
    print(tr_tags)

Iterate over the tr tags.

Trial Section - 1

Each of the tr tags is saved in tr_tags and can now be accessed individually. For example, let us try to access the Genre -

    tr_tags[2].find("td").text  # the original cell is cut off here; .find("td").text is an assumed completion to read the Genre row's value
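To finish the remaining steps - iterating over every row and storing the result (step 5) - here is a minimal sketch, assuming each infobox row pairs a th label with a td value (the output file name is made up):

    # continuing from the cells above (tr_tags, pd, and files are already defined)
    rows = []
    for tr in tr_tags:
        th = tr.find("th")  # the row's label, e.g. "Genre"
        td = tr.find("td")  # the row's value
        if th and td:
            rows.append({"field": th.get_text(strip=True), "value": td.get_text(strip=True)})

    df = pd.DataFrame(rows)
    df.to_excel("the_good_place_infobox.xlsx", index=False, engine="xlsxwriter")  # save to Excel via xlsxwriter
    files.download("the_good_place_infobox.xlsx")  # download the file from Colab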