Web Scraping In Python

Python Web scraping: Learn Python Web scraping In Detail with Live Examples
Written by Paayi Tech | 24-Apr-2019

Data is one of the most valuable resources today. By data we mean any kind of information: airline schedules, weather, foreign exchange rates, and so on. However, an API is not available for every data source. Sometimes no API is offered, or the API is too expensive to buy. In such situations web scraping comes to the rescue.

Web scraping, also known as screen scraping, is a technique for extracting the data that resides within HTML tags. The extracted data can then be saved locally or used in your own website.

Many websites provide data that can only be viewed in the browser, so to get access to it we scrape the pages with Python. Is web scraping possible in other languages? Yes, but most web scraping scripts are written in Python because it provides many modules that make scraping easy. Three modules are used extensively for web scraping in Python:

  1. Beautiful Soup
  2. Selenium
  3. Scrapy

Beautiful Soup is the basic scraper and does the job well in most scenarios. However, when a site builds its content dynamically with JavaScript or AJAX calls, Beautiful Soup alone falls short, because it only parses HTML and does not execute JavaScript.
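
To make this limitation concrete, here is a minimal sketch. The HTML string and its element names are made up for illustration. The heading exists only inside a script, so a parser that does not run JavaScript never sees it as an element:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <h1> is only created when a browser runs the script.
html = """
<html>
  <body>
    <div id="content"></div>
    <script>
      document.getElementById("content").innerHTML = "<h1>Loaded by JavaScript</h1>";
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Beautiful Soup only parses the markup it is given; it does not run the script,
# so the heading is never created and the container stays empty.
heading = soup.find("h1")
container_text = soup.find("div", id="content").get_text()

print(heading)               # None
print(repr(container_text))  # ''
```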

In such a scenario we use Selenium to execute the JavaScript and then get the page source, which is passed to Beautiful Soup to parse the data. Selenium is also used for 

 

Beautiful Soup:

Beautiful Soup is a Python library that allows a user to scrape or parse HTML data. The data is located by tags, attributes, and classes. Before starting, we have to install two modules: beautifulsoup4 and lxml. lxml is a fast parser that Beautiful Soup can use to turn HTML into a tree that is quick to search. The examples in this article use Python's built-in html.parser, so they also run without lxml.

 

Windows

pip install BeautifulSoup4
pip install lxml

 

Linux

pip3 install BeautifulSoup4
pip3 install lxml

 

Before diving into web scraping, note that we should not send many requests in a short time: doing so may overload or crash the web application, or get us blocked from the website. About one request per web page every 10 seconds is a reasonable pace.
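
One way to keep to such a pace is to pause between requests. The following is only a sketch: the function name fetch_politely and its parameters are illustrative, not part of any library, and the fetch argument stands for whatever function actually downloads a page.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=10):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one
            time.sleep(delay_seconds)
        results[url] = fetch(url)
    return results


# Demo with a stand-in fetcher so the sketch runs without network access.
# The zero delay is only there so the demo finishes instantly.
pages = fetch_politely(
    ["page-1", "page-2"],
    fetch=lambda url: "<html>" + url + "</html>",
    delay_seconds=0,
)
print(pages)
```

With the requests library, the fetch argument could be something like lambda url: requests.get(url, timeout=10).text, with delay_seconds left at 10.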

Before web scraping, we must also know about HTML tags, their attributes, and their classes, because these are the critical values that allow us to locate the data.

The example code is as follows:

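from bs4 import BeautifulSoup as bs

# Sample HTML document held in a string.
# The <p> wrapper around the second line of text is an assumed reconstruction.
html = '''
<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <h1>My First Scrapper</h1>
        <p>Hello World From Python</p>
    </body>
</html>
'''

# Parse the string with Python's built-in HTML parser
soup = bs(html, 'html.parser')

# find() returns the first <h1> tag; .text gives the text inside it
print(soup.find('h1').text)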

 

In the above code we build a string of HTML and pass it to the BeautifulSoup constructor. Then we find the h1 tag and print its text. The output is as follows:

 My First Scrapper

 

The text attribute extracts the text within a tag, and the find function locates the first matching tag. This is why it is essential to understand the HTML before scraping: if you know which data resides where, scraping becomes easy.
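
Besides the tag name, find can also match on a class or another attribute. The snippet below is a small sketch of that idea. The HTML and the names summary, headline, and more are invented for the example:

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment with a class, an id, and an href attribute
html = """
<div class="article">
  <h2 id="headline">Scraping Basics</h2>
  <p class="summary">Tags, attributes, and classes locate the data.</p>
  <a class="more" href="/read-more/">Read more</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters by CSS class
summary = soup.find("p", class_="summary").text

# Any attribute, such as id, can be passed as a keyword argument
headline = soup.find(id="headline").text

# get() reads the value of an attribute on the tag that was found
link_target = soup.find("a", class_="more").get("href")

print(summary)
print(headline)
print(link_target)
```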

 

Now the question arises: where do we get the tag attributes and class names before requesting the HTML? The easy answer is to use the inspect element tool in your browser.

 

Inspecting the elements in browser:

To understand the HTML, we first open the page in the browser, open the inspect element tool, and locate the element that we have to scrape. This saves significant time and makes the work more efficient. The following figure shows how to inspect an element.

Figure 1: Inspecting the elements in browser

 

To inspect an element, hover over or click the element you want to scrape. The inspector shows its tag name and class, which are what we need to scrape it.

 

Scraping Title:

The title of a web page is given in the title tag. To extract the title, the following code is executed:

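import requests
from bs4 import BeautifulSoup as bs

# Download the page and parse the returned HTML
r = requests.get('https://www.python.org/')
soup = bs(r.text, 'html.parser')

# find() returns the whole <title> tag, including its markup
print(soup.find('title'))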

 

This will return the following output:

<title>Welcome to Python.org</title>

 

To extract only the text, print soup.find('title').text instead, which returns the text without the HTML tags.
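
Not every page has a title tag, and find returns None when nothing matches, so reading .text directly can fail. The helper below is a hedged sketch of one way to handle that. The name extract_title is made up for illustration, and the HTML strings are local samples rather than real pages:

```python
from bs4 import BeautifulSoup


def extract_title(html):
    """Return the page title as plain text, or None if there is no <title> tag."""
    soup = BeautifulSoup(html, "html.parser")
    title_tag = soup.find("title")
    if title_tag is None:
        # find() found nothing, so there is no text to read
        return None
    # .text drops the surrounding tags; strip() removes stray whitespace
    return title_tag.text.strip()


with_title = "<html><head><title> Welcome to Python.org </title></head></html>"
without_title = "<html><body><p>No title here</p></body></html>"

print(extract_title(with_title))
print(extract_title(without_title))
```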

 

 

Scraping Multiple Items:

The find function returns only the first matching element. However, what if we have to extract all the links from a web page? For that purpose we use the find_all() function, which returns every matching tag on the page. See the example below:

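import requests
from bs4 import BeautifulSoup as bs

# Download the page and parse the returned HTML
r = requests.get('https://www.python.org/')
soup = bs(r.text, 'html.parser')

# find_all() returns a list of every <a> tag on the page
links = soup.find_all('a')

# Print the href attribute of each link
for e in links:
    print(e.get('href'))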

 

The above example prints the value of the href attribute of every link, so it also shows how to fetch attribute values with Beautiful Soup. Following are some of the links printed as output:

http://brochure.getpython.info/

/downloads/

/downloads/

/downloads/source/

/downloads/windows/

/downloads/mac-osx/

/download/other/

https://docs.python.org/3/license.html

/download/alternatives

/doc/

/doc/

/doc/av

https://wiki.python.org/moin/BeginnersGuide

https://devguide.python.org/

https://docs.python.org/faq/

http://wiki.python.org/moin/Languages

http://python.org/dev/peps/

https://wiki.python.org/moin/PythonBooks

/doc/essays/

/community/

/community/survey

/community/diversity/

/community/lists/

/community/irc/

/community/forums/

/community/workshops/

http://www.djangoproject.com/

http://www.pylonsproject.org/

http://bottlepy.org

http://tornadoweb.org

http://flask.pocoo.org/

http://www.web2py.com/

http://wiki.python.org/moin/TkInter

https://wiki.gnome.org/Projects/PyGObject

http://www.riverbankcomputing.co.uk/software/pyqt/intro

https://wiki.qt.io/PySide

https://kivy.org/

http://www.wxpython.org/

http://www.scipy.org

http://pandas.pydata.org/

http://ipython.org

http://buildbot.net/

http://trac.edgewall.org/

http://roundup.sourceforge.net/

http://www.ansible.com

http://www.saltstack.com

https://www.openstack.org
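
Several of the links above, such as /downloads/, are relative to the site. If full URLs are needed, one possible approach is to join each href with the page address using urljoin from the standard library. This is a sketch on a small made-up HTML fragment, not on the live page:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Address of the page the links were scraped from
base_url = "https://www.python.org/"

# Made-up fragment containing a relative link, an absolute link,
# and an anchor without an href attribute
html = """
<a href="/downloads/">Downloads</a>
<a href="https://docs.python.org/3/license.html">License</a>
<a>No href here</a>
"""

soup = BeautifulSoup(html, "html.parser")

absolute_links = []
for anchor in soup.find_all("a"):
    href = anchor.get("href")
    if href is None:
        # get() returns None when the attribute is missing; skip such anchors
        continue
    # urljoin resolves relative paths and leaves absolute URLs unchanged
    absolute_links.append(urljoin(base_url, href))

print(absolute_links)
```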





© Copyright 2019, All Rights Reserved. paayi.com
