Python Tutorials: Moderate Start to Natural Language Processing (NLP) using Python

Python Tutorials: In this part of Learning Python, we cover Natural Language Processing in Python
Written by Paayi Tech |17-Oct-2020 | 0 Comments | 566 Views

Natural language processing is a branch of computer science that deals with human speech and text, interpreting and manipulating it so that it can be understood by a machine. It fills the gap between humans and computers.

Royal Bank of Scotland used natural language processing (NLP) to extract people's sentiments from emails and forum discussions. The company analyzes data gathered from emails, surveys, and call centers to find the actual root causes of customer dissatisfaction.

In this section, we will perform natural language processing using NLTK and spaCy, both of which are natural language processing libraries for Python. To install these libraries, run the following commands:

pip3 install nltk

pip3 install spacy

python3 -m spacy download en_core_web_sm


en_core_web_sm provides the data and models that we will be using.

So first, we will talk about one of the basics of NLP: the regular expression.


Regular Expression:

A regular expression is a pattern that describes a certain set of text. It is a sequence of characters that helps us find the particular data we want to extract from a document. Regular expressions are a key method in the UNIX world for searching strings and files.

The code to implement the regex is as follows (the email addresses used here are placeholders; any addresses would do):

import re

string = "my email address is abc@gmail.com and xyz@yahoo.com"

m = re.findall(r'[\w.-]+@[\w.-]+', string)

print(m)


In the above code, all we have to do is ignore the other information and extract the emails from the string. So we made a regex that matches a run of word characters, dots, or hyphens, then an @ sign, then another such run. The output is as follows:

['abc@gmail.com', 'xyz@yahoo.com']


In the next code, we want to extract just the yahoo email. The implementation is as follows:

import re

string = "my email address is abc@gmail.com and xyz@yahoo.com"

m = re.findall(r'[\w.-]+@yahoo[\w.-]+', string)

print(m)


In this code, we change the regex so that, after the @ sign, we require the string yahoo. The output is as follows:

['xyz@yahoo.com']
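Capture groups take this one step further: by wrapping parts of the pattern in parentheses, `re.findall` returns the pieces separately instead of the whole match. This is a small sketch using the same placeholder addresses as above:

```python
import re

string = "my email address is abc@gmail.com and xyz@yahoo.com"

# Parentheses create capture groups: group 1 is the part before
# the @ sign (the user), group 2 the part after it (the domain).
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', string)

print(pairs)  # each match is a (user, domain) tuple
```

This is handy when we want, for example, to count how many addresses belong to each mail provider.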


Extracting Numbers:

Now we will see how to extract a number from a string. The method is as follows:

import re

string = "my phone number is +92-321-000000"

m = re.findall(r'[+\d.-]+', string)

print(m)


In this code, we made a regex that matches any run of digits, dots, hyphens, and plus signs; whichever of these characters is present, the regex will include it. The output is as follows:

['+92-321-000000']

Now we will extract only the number and remove all other characters.

num = re.sub(r'\D', "", string)

print(f'The number is: {num}')


From the above string, we extract the numbers only and discard all other characters.

The output is as follows:

The number is: 92321000000


Figure 1 shows a screenshot of the output.

Word Tokenization:

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, known as tokens, perhaps at the same time throwing away certain characters, such as punctuation.

These tokens are loosely referred to as terms or words, but it is sometimes essential to make a type/token distinction. A token is an instance of a sequence of characters in some particular document that is grouped together as a useful semantic unit for processing.

import nltk

nltk.download('punkt')


We first import nltk and then run the download command. This is a one-time process, as it downloads the corpus that the NLTK library needs to perform the desired actions.

sent = "This is the practical example of how to tokenize the sentence."

tokens = nltk.word_tokenize(sent)



The output of this is as follows:

['This', 'is', 'the', 'practical', 'example', 'of', 'how', 'to', 'tokenize', 'the', 'sentence', '.']
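To see the idea behind tokenization without any corpus download, a minimal tokenizer can be sketched with the standard re module alone. Note this is a simplification for illustration, not what NLTK's word_tokenize does internally:

```python
import re

sent = "This is the practical example of how to tokenize the sentence."

# Match either a run of word characters (a word) or a single
# character that is neither a word character nor whitespace
# (e.g. punctuation), so the final period becomes its own token.
tokens = re.findall(r"\w+|[^\w\s]", sent)

print(tokens)
```

On this sentence, the toy tokenizer happens to produce the same list as word_tokenize above; NLTK's tokenizer handles many harder cases (contractions, quotes, abbreviations) that this sketch does not.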



For grammatical reasons, documents use different forms of a word, such as organize, organized, and organizing. In addition, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization.

In many situations, it would be useful for a search for one of these words to also return documents that contain another word in the set. The objective of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base. For instance:

car, cars, car's, cars' => car


Figure 2

As we can see from Figure 2, both functional and functionality have the same stem, function. This process allows us to normalize the data. As further normalization and preprocessing, we also eliminate stop words such as is, are, and am, along with punctuation marks.
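The idea behind stemming can be sketched in a few lines of plain Python. This toy version just strips a few common suffixes, longest first; real stemmers such as nltk.stem.PorterStemmer apply far more careful rules, so treat this purely as an illustration. The suffix and stop-word lists here are hand-picked for the example:

```python
# Suffixes to strip, ordered longest first so "ality" wins over "al".
SUFFIXES = ["ality", "izing", "ized", "ize", "ing", "ed", "al", "s"]

# A tiny hand-picked stop-word list for illustration only.
STOP_WORDS = {"is", "are", "am", "the", "of"}

def toy_stem(word):
    # Strip the first matching suffix, keeping at least 3 characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(tokens):
    # Drop stop words and punctuation, then stem what is left.
    return [toy_stem(t.lower())
            for t in tokens
            if t.isalpha() and t.lower() not in STOP_WORDS]

print(toy_stem("functionality"))  # function
print(toy_stem("functional"))     # function
print(normalize(["The", "cars", "are", "functional", "."]))  # ['car', 'function']
```

Both functional and functionality reduce to the same base, function, which is exactly the normalization effect described above.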
