Natural language processing (NLP) is a branch of computer science concerned with human speech and text: it interprets and manipulates natural language so that machines can understand it. It bridges the gap between humans and computers.
Royal Bank of Scotland, for example, used natural language processing (NLP) to gauge customer sentiment from emails and forum discussions. The company analyzes data gathered from emails, surveys, and call-center records to identify the root causes of customer dissatisfaction.
In this section, we will use NLTK and spaCy, two natural language processing libraries for Python. To install these libraries, run the following commands:
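The original installation commands are not shown; the following are the standard commands for installing both libraries with pip and fetching the spaCy English model:

```shell
# Install the two NLP libraries
pip install nltk spacy

# Download spaCy's small English model (en_core_web_sm)
python -m spacy download en_core_web_sm
```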
The en_core_web_sm model provides all the data and models that we will use.
First, we will talk about one of the basics of NLP: the regular expression.
A regular expression is a pattern that describes a certain amount of text. It is a sequence of characters that helps us find and extract the particular data we want from a document. Regular expressions are a key tool in the UNIX world for searching strings and files.
The code to implement the regex is as follows:
In the above code, all we have to do is remove the other information and extract the email address from the string. We have written the regex as follows:
This regex matches a word that starts with letters, has an @ sign in the middle, and ends with letters. The output is as follows:
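Since the original listing is not shown, here is a minimal sketch of email extraction with `re.findall`; the sample string and the exact pattern are assumptions:

```python
import re

# Hypothetical sample string; the original example string is not shown
text = "Contact john_doe@gmail.com or jane99@yahoo.com for details"

# Letters, digits, underscores or dots, then @, then a domain of letters and dots
emails = re.findall(r"[\w.]+@[a-zA-Z.]+[a-zA-Z]", text)
print(emails)  # → ['john_doe@gmail.com', 'jane99@yahoo.com']
```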
In the next code, we extract only the Yahoo email addresses. The implementation is as follows:
In this code, we change the regex so that the string after the @ sign must end with yahoo.com. The output is as follows:
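A sketch of the Yahoo-only variant, again with an assumed sample string: anchoring the domain part of the pattern to `yahoo.com` filters out all other providers.

```python
import re

text = "Contact john_doe@gmail.com or jane99@yahoo.com for details"

# Only match addresses whose domain is exactly yahoo.com
yahoo_emails = re.findall(r"[\w.]+@yahoo\.com", text)
print(yahoo_emails)  # → ['jane99@yahoo.com']
```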
Now we will see how to extract numbers from a string. The method is as follows:
In this code, we build a regex that matches any run of digits, hyphens, and plus signs; if any of these characters appears at least once, the regex includes it in the match. The output is as follows:
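A sketch of this step with a hypothetical phone-number string: the character class `[\d+\-]` covers digits, the plus sign, and the hyphen.

```python
import re

# Hypothetical input; the original string is not shown
text = "You can reach us at +1-202-555-0147 or 888-4567"

# Match runs made up of digits, hyphens and plus signs
numbers = re.findall(r"[\d+\-]+", text)
print(numbers)  # → ['+1-202-555-0147', '888-4567']
```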
Now we will extract only the digits and remove all other characters.
From the above string, we extract only the digits and discard all the other characters.
The output is as follows:
The screenshot is as follows:
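One way to sketch this step is with `re.sub`, deleting every non-digit character; the input string is again an assumption:

```python
import re

text = "You can reach us at +1-202-555-0147"

# Strip every character that is not a digit
digits_only = re.sub(r"[^0-9]", "", text)
print(digits_only)  # → '12025550147'
```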
Given a character sequence and a defined document unit, tokenization is the task of chopping it up into pieces, called tokens, perhaps at the same time throwing away certain characters, such as punctuation. These tokens are often loosely referred to as terms or words, but it is sometimes essential to make a type-token distinction. A token is an instance of a sequence of characters in some particular document that is grouped together as a useful semantic unit for processing.
We first import nltk and then run the download command. This is a one-time process: it downloads the corpora that the NLTK library needs to perform the desired actions.
The output of this is as follows:
For grammatical reasons, documents use different forms of a word, such as organize, organized, and organizing. In addition, there are families of derivationally related words with similar meanings, such as democracy, democratic, and democratization. In many situations it would be useful for a search for one of these words to also return documents that contain another word in the set. The objective of both stemming and lemmatization is to reduce inflectional forms, and sometimes derivationally related forms, of a word to a common base form. For instance:
car, cars, car's, cars' => car
As we can see from figure 2, both functional and functionality have the same stem, function. This process normalizes the data. As further normalization and preprocessing, we also eliminate stop words such as is, are, and am, along with punctuation marks.