Python Tutorials: Web Scraping In Python

Python Tutorials: Learn Python Web Scraping in Detail with Live Examples
Written by Paayi Tech | 17-Oct-2020

The most precious thing in today's era is data. By data, we mean anything: information about airlines, weather, foreign exchange rates, and so on. However, we do not have API access for every kind of data. Sometimes an API is not available, or it is too expensive to buy; in such situations, web scraping comes to the rescue.

Web scraping, also known as screen scraping, is a technique for extracting the data that resides within HTML tags, either to save it locally or to use it on your own websites.

Many websites only let you view their data in a browser, so to get programmatic access to it, we do web scraping in Python. Is web scraping possible in other languages? Yes, but scraping scripts are usually written in Python, because Python provides many modules that make it very easy to scrape data. In Python, three main modules are used extensively for web scraping:

  1. Beautiful Soup
  2. Selenium
  3. Scrapy

Beautiful Soup is the basic scraper and does the job very well in most scenarios, but when a site is built with dynamic JavaScript code or AJAX calls, Beautiful Soup cannot execute the JavaScript.

In such a scenario, we use Selenium to execute the JavaScript and then get the page source, which is then passed to Beautiful Soup to parse the data. Selenium is also widely used for browser automation and testing.
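
The following is a minimal sketch of that combination, assuming the selenium package and the Chrome browser are installed (the URL is just an example):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()            # start a browser that can execute JavaScript
driver.get('https://www.python.org/')  # load the page
html = driver.page_source              # HTML after the JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'html.parser')
print(soup.find('title').text)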

 

Beautiful Soup:

Beautiful Soup is a Python library that allows a user to scrape or parse HTML data. The data is scraped based on tags, attributes, and classes. However, before starting, we have to install two modules: one is beautifulsoup4, and the other is lxml. lxml is a fast parser that Beautiful Soup can use to parse the HTML quickly.

Windows

pip install BeautifulSoup4

pip install lxml

 

Linux

pip3 install BeautifulSoup4

pip3 install lxml
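
Once lxml is installed, it can be selected as the parser when creating the Beautiful Soup object. A quick check, using a tiny throwaway snippet of HTML:

from bs4 import BeautifulSoup

# Parse a small HTML snippet with the lxml parser instead of the built-in html.parser
soup = BeautifulSoup('<p>Hello</p>', 'lxml')
print(soup.p.text)   # Hello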

 

However, before diving into web scraping, we must remember not to make too many requests at one time. Doing so may overload the web application, or you may get blocked from visiting the website. One request per page every 10 seconds is a reasonable approach.
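
A simple way to respect that limit is to pause between requests with time.sleep(). The URLs below are only placeholders, not real pages to scrape:

import time
import requests

urls = ['https://example.com/page1', 'https://example.com/page2']
for url in urls:
    response = requests.get(url)
    print(url, response.status_code)
    time.sleep(10)   # wait ten seconds before the next request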

Before web scraping, we must know about the HTML tags, their attributes, and their classes, because these are the key values that allow us to scrape the data.

The example code is as follows:

from bs4 import BeautifulSoup as bs

html = '''
<html>
  <body>
    <h1>My First Scrapper</h1>
    <p>Hello World From Python</p>
  </body>
</html>
'''

soup = bs(html,'html.parser')

print(soup.find('h1').text)

 

 

 

In the above code, we make a string of HTML and pass it to a Beautiful Soup instance. Then we find the h1 tag. The output is as follows:

 My First Scrapper

 

The .text attribute extracts the text within a tag, and the find() function locates that tag. That is why it is essential to understand the HTML before scraping the data: if you know where the data resides, it becomes easy to scrape.
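
Continuing the example above, the difference between the tag object returned by find() and its .text can be seen as follows:

tag = soup.find('p')
print(tag)        # <p>Hello World From Python</p>
print(tag.text)   # Hello World From Python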

Now the question arises: where do we get the attributes and class names before requesting the HTML? The easy answer is to inspect the elements in your browser.

 

Inspecting the elements in the browser:

To understand the HTML, we first open the browser's inspect-element tool on that URL and locate the element that we have to scrape. This reduces the time significantly and makes the work more efficient. The following is the way to inspect elements.

Figure 1: Inspecting the elements in the browser

 

This is how you inspect an element: hover over or click the element you want to scrape, and the inspector will show its tag name and class, which makes scraping much easier.
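
Once the inspector shows the tag and its class, both can be passed to find() or find_all(). In the sketch below, the class name 'introduction' is only an assumed example, not a class taken from a real page:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.python.org/')
soup = bs(r.text, 'html.parser')

# find() accepts a tag name plus a class_ keyword (class is a reserved word in Python)
intro = soup.find('div', class_='introduction')
if intro is not None:
    print(intro.get_text(strip=True))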

 

Scraping Title:

The title of the web page is given in the title tag. To extract the title, the following code is executed:

import requests

from bs4 import BeautifulSoup as bs

r = requests.get('https://www.python.org/')

soup = bs(r.text,'html.parser')

print(soup.find('title'))

 

This will return the following output:

<title>Welcome to Python.org</title>

To extract only the text, we print soup.find('title').text, which prints the text and removes the HTML tags.
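
For example, continuing the snippet above:

print(soup.find('title').text)   # Welcome to Python.org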

 

 

Scraping Multiple Items:

The find function only returns the first element it finds. However, what if we have to extract all the links from a web page? For that purpose, we use the find_all() function, which returns all the matching tags on the page. See the example below:

import requests

from bs4 import BeautifulSoup as bs

r = requests.get('https://www.python.org/')

soup = bs(r.text,'html.parser')

links = soup.find_all('a')

for e in links:

    print(e.get('href'))

 

The above example outputs all the links given in the href attributes. It is also an example of fetching attribute values with Beautiful Soup. The following are some of the links given as output:

http://brochure.getpython.info/

/downloads/

/downloads/

/downloads/source/

/downloads/windows/

/downloads/mac-osx/

/download/other/

https://docs.python.org/3/license.html

/download/alternatives

/doc/

/doc/

/doc/av

https://wiki.python.org/moin/BeginnersGuide

https://devguide.python.org/

https://docs.python.org/faq/

http://wiki.python.org/moin/Languages

http://python.org/dev/peps/

https://wiki.python.org/moin/PythonBooks

/doc/essays/

/community/

/community/survey

/community/diversity/

/community/lists/

/community/irc/

/community/forums/

/community/workshops/

http://www.djangoproject.com/

http://www.pylonsproject.org/

http://bottlepy.org

http://tornadoweb.org

http://flask.pocoo.org/

http://www.web2py.com/

http://wiki.python.org/moin/TkInter

https://wiki.gnome.org/Projects/PyGObject

http://www.riverbankcomputing.co.uk/software/pyqt/intro

https://wiki.qt.io/PySide

https://kivy.org/

http://www.wxpython.org/

http://www.scipy.org

http://pandas.pydata.org/

http://ipython.org

http://buildbot.net/

http://trac.edgewall.org/

http://roundup.sourceforge.net/

http://www.ansible.com

http://www.saltstack.com

https://www.openstack.org




