The most precious thing in today's era is data. By data, we mean almost anything: information about airlines, weather, foreign exchange rates, and so on. Moreover, we do not always have API access to the data we need. Sometimes an API is not available, or it is too expensive to buy; in such situations, web scraping comes to the rescue.
Web scraping, also known as screen scraping, is a technique for extracting the data that resides within HTML tags, either to save it locally or to use it in your own websites.
Many websites only allow their data to be viewed, so to get access to it we do web scraping in Python. Is web scraping available in other languages? Yes, but scraping scripts are mostly written, and preferred to be written, in Python, because Python provides many modules that make it very easy to scrape data. In Python, there are three main modules used extensively for web scraping. This article focuses on the following:
- Beautiful Soup
Beautiful Soup is a Python library that allows a user to scrape or parse HTML data. Data is scraped based on tags, attributes, and classes. Before starting, however, we have to install two modules: one is beautifulsoup4, and the other is lxml. lxml is a parser that Beautiful Soup uses to convert the HTML into a tree so the data can be parsed quickly.
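Both modules can be installed with pip (the exact command may vary with your setup, e.g. pip3 on some systems):

```shell
pip install beautifulsoup4 lxml
```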
However, before diving into web scraping, we must know that we should not make many requests at once: doing so may crash the web application, or we may get blocked from visiting the website. One request per web page every 10 seconds is a reasonable approach.
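This throttling idea can be sketched as a small helper. The function name, the placeholder URLs, and the omitted download step are all illustrative assumptions, not part of any library:

```python
import time

def polite_crawl(urls, delay=10):
    """Visit each URL with a pause between requests.

    The actual download is a placeholder here; in a real scraper
    you would fetch the page, e.g. with the requests library.
    """
    visited = []
    for url in urls:
        visited.append(url)  # placeholder for the actual request
        time.sleep(delay)    # be polite: one request every `delay` seconds
    return visited

# Hypothetical example URLs
polite_crawl(["https://example.com/a", "https://example.com/b"], delay=10)
```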
Before web scraping, we must also know about HTML tags, their attributes, and their classes, because these are the key values that allow us to scrape the data.
The example code is as follows:
from bs4 import BeautifulSoup as bs
In the above code, we make a string of HTML and pass it to a BeautifulSoup instance. Then we find the h1 tag. The find function finds the first occurrence of that tag, and the text attribute extracts the text within it. That is why it is essential to understand the HTML before scraping the data: if you know where the data resides, it becomes easy to scrape.
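A complete version of the example described above might look like this; the HTML string is invented purely for illustration:

```python
from bs4 import BeautifulSoup as bs

# A small, made-up HTML document to demonstrate parsing
html = "<html><body><h1>Hello, Web Scraping</h1></body></html>"

soup = bs(html, "html.parser")  # "lxml" also works here if installed
heading = soup.find("h1")       # find the first <h1> tag
print(heading.text)             # prints: Hello, Web Scraping
```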
Now the question arises: where do we get the attributes, and how do we find the class names, before requesting the HTML? The easy answer is to use the inspect-element tool in your browser.
Inspecting elements in the browser:
To understand the HTML, we first open the inspect-element view for the URL in the browser and locate the element that we have to scrape. This reduces the time spent significantly and makes the work more efficient. Just hover over and click the element you want to scrape, and the inspector will show its tag name along with its class, which makes scraping straightforward.
The title of the web page is given in the title tag. Calling soup.find('title') returns the whole tag, including the surrounding <title> markup.
To extract only the text, we print soup.find('title').text, which removes the HTML tags and prints just the text within them.
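As a sketch, assuming a made-up HTML document in place of a downloaded page, the difference looks like this:

```python
from bs4 import BeautifulSoup as bs

# Invented HTML standing in for a real page
html = "<html><head><title>My Example Page</title></head><body></body></html>"
soup = bs(html, "html.parser")

print(soup.find("title"))       # prints the whole tag: <title>My Example Page</title>
print(soup.find("title").text)  # prints only the text: My Example Page
```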
Scraping Multiple Items:
The find function returns only the first element it finds. However, what if we have to extract all the links from a web page? For that purpose, we use the find_all() function, which returns every matching tag on the page. See the example below:
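A minimal sketch of this, again using an invented HTML snippet in place of a real page (the URLs are placeholders):

```python
from bs4 import BeautifulSoup as bs

# Invented HTML with a few links; a real scraper would download this
html = """
<a href="https://example.com/one">One</a>
<a href="https://example.com/two">Two</a>
<a href="https://example.com/three">Three</a>
"""
soup = bs(html, "html.parser")

# find_all("a") returns every anchor tag; indexing a tag fetches an attribute
links = [a["href"] for a in soup.find_all("a")]
for link in links:
    print(link)
```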
This example outputs all the links given in the href attributes; it is also an example of fetching attribute values with Beautiful Soup.