Moderate Start to Natural Language Processing (NLP) using Python

In this part of Learning Python, we cover Natural Language Processing in Python
Written by Paayi Tech | 11-May-2019

Natural language processing is a branch of computer science concerned with interpreting and manipulating human speech and text so that machines can understand it. It fills the gap between humans and computers.

The Royal Bank of Scotland used natural language processing (NLP) to gauge customer sentiment from emails and forum discussions. The company analyzes data gathered from email, surveys, and call-center records to find the root causes of customer dissatisfaction.

In this section, we will perform natural language processing using NLTK and spaCy, both of which are Python NLP libraries. To install them, run the following commands:

pip3 install nltk
pip3 install spacy
python3 -m spacy download en_core_web_sm

 

en_core_web_sm provides the data and models that we will use.

So first we will talk about one of the basics of NLP, and that is the regular expression.

 

Regular Expression:

A regular expression is defined as a pattern that describes a set of strings. It is a sequence of characters that helps us find and extract the particular data we want from a document. Regular expressions are a key tool in the UNIX world for searching within strings and files.

The code to implement the regex is as follows:

import re
string = "my email address is abc@gmail.com and xyz@yahoo.com"
m = re.findall(r'[\w.-]+@[\w.-]+', string)
print(m)

 

In the above code, all we have to do is discard the other information and extract only the email addresses from the string. So we have made a regex as follows:

[\w.-]+@[\w.-]+

 

This regex matches a run of word characters, dots, and hyphens, followed by an @ sign, followed by another such run. The output is as follows:

['abc@gmail.com', 'xyz@yahoo.com']
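If we also want to split each address into its user and domain parts, the same pattern can be extended with capturing groups. This variant is a small sketch added for illustration, not part of the original example:

```python
import re

string = "my email address is abc@gmail.com and xyz@yahoo.com"

# Parentheses create capturing groups: group 1 is the user part,
# group 2 is the domain part. findall returns one tuple per match.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', string)
print(pairs)  # [('abc', 'gmail.com'), ('xyz', 'yahoo.com')]
```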

 

In the next code, we extract only the Yahoo email address. The implementation is as follows:

import re
string = "my email address is abc@gmail.com and xyz@yahoo.com"
m = re.findall(r'[\w.-]+@yahoo\.com', string)
print(m)

 

In this code, we changed the regex so that, after the @ sign, it only matches strings that end in yahoo.com. The output is as follows:

['xyz@yahoo.com']

 

Extracting Numbers:

Now we will see how to extract a number from a string. The method is as follows:

import re
string = "my phone number is +92-321-000000"
m = re.findall(r'[+\d.-]+', string)
print(m)

 

In this code, we made a regex that matches any run of digits, plus signs, dots, or hyphens. If any of these characters is present, the regex will include it in the match. The output is as follows:

['+92-321-000000']

 

Now we will extract only the digits and remove all other characters.

num = re.sub(r'\D', "", string)
print(f'The number is: {num}')

 

From the above string, we extracted only the digits and discarded all other characters.

The output is as follows:

The number is: 92321000000
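If the leading + of the country code should be kept, a negated character class can be used instead of \D, stripping everything that is not a digit or a plus sign. This variant is a sketch added for illustration, not part of the original example:

```python
import re

string = "my phone number is +92-321-000000"

# [^\d+] matches any character that is NOT a digit or a plus sign,
# so substituting "" removes letters, spaces, and hyphens but keeps +.
num = re.sub(r'[^\d+]', "", string)
print(f'The number is: {num}')  # The number is: +92321000000
```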

 

The screenshot is as follows:

Figure 1

 

Word Tokenization:

Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, known as tokens, perhaps at the same time throwing away certain characters, such as punctuation. These tokens are loosely referred to as terms or words, but it is sometimes essential to make a type/token distinction: a token is an instance of a sequence of characters in some particular document that is grouped together as a useful semantic unit for processing.

import nltk
nltk.download('punkt')

 

We first imported NLTK and then ran the download command. This is a one-time step, as it downloads the corpus that the NLTK library needs to perform the desired actions.

sent = "This is the practical example of how to tokenize the sentence."
tokens = nltk.word_tokenize(sent)
print(tokens)

 

The output of this is as follows:

['This', 'is', 'the', 'practical', 'example', 'of', 'how', 'to', 'tokenize', 'the', 'sentence', '.']

 

Stemming:

For grammatical reasons, documents use different forms of a word, such as organize, organized, and organizing. Additionally, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations, it is useful for a search on one of these words to also find documents that contain another word in the set. The objective of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. For instance:

car, cars, car's, cars' => car

Figure 2

As we can see from Figure 2, both functional and functionality have the same stem, function. This process normalizes the data. For further normalization and preprocessing, we also eliminate stop words such as is, are, and am, as well as punctuation marks.
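The stemming described above can be sketched with NLTK's PorterStemmer, which runs without any extra corpus downloads; the sample words here are our own, chosen to echo the organize family mentioned earlier:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["cars", "organized", "organizing"]

# The Porter algorithm strips common suffixes (-s, -ed, -ing, -ize, ...)
# to reduce each word to a base form.
stems = [stemmer.stem(w) for w in words]
print(stems)  # ['car', 'organ', 'organ']
```

Note that a stem such as organ need not be a dictionary word; stemming only guarantees that related forms collapse to the same string, which is enough for normalization.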




