Natural Language Processing in Python

In this part of Learning Python, we cover Natural Language Processing in Python.
Written by Paayi Tech | 20-Jun-2019

Now we will study how to make n-grams in Python. NLTK provides the functionality to convert a string into bigrams, trigrams, or n-grams of any order.

import nltk
from nltk.util import ngrams

# The Punkt tokenizer models may need to be downloaded once:
# nltk.download('punkt')

sent = "Winston Churchill had many accomplishments during his life. He was a remarkable politician but also a great soldier, speech writer, and artist. He was considered one of the best politicians and speech writers of both his time and ours"

tokens = nltk.word_tokenize(sent)

# Build 5-grams from the token list
output = list(ngrams(tokens, 5))

print(output)

 

This is how n-grams are created. The ngrams function takes two parameters: the list of tokens and the order n of the grams you want. Here we have used n = 5. The output is as follows:

[('Winston', 'Churchill', 'had', 'many', 'accomplishments'), ('Churchill', 'had', 'many', 'accomplishments', 'during'), ('had', 'many', 'accomplishments', 'during', 'his'), ('many', 'accomplishments', 'during', 'his', 'life'), ('accomplishments', 'during', 'his', 'life', '.'), ('during', 'his', 'life', '.', 'He'), ('his', 'life', '.', 'He', 'was'), ('life', '.', 'He', 'was', 'a'), ('.', 'He', 'was', 'a', 'remarkable'), ('He', 'was', 'a', 'remarkable', 'politician'), ('was', 'a', 'remarkable', 'politician', 'but'), ('a', 'remarkable', 'politician', 'but', 'also'), ('remarkable', 'politician', 'but', 'also', 'a'), ('politician', 'but', 'also', 'a', 'great'), ('but', 'also', 'a', 'great', 'soldier'), ('also', 'a', 'great', 'soldier', ','), ('a', 'great', 'soldier', ',', 'speech'), ('great', 'soldier', ',', 'speech', 'writer'), ('soldier', ',', 'speech', 'writer', ','), (',', 'speech', 'writer', ',', 'and'), ('speech', 'writer', ',', 'and', 'artist'), ('writer', ',', 'and', 'artist', '.'), (',', 'and', 'artist', '.', 'He'), ('and', 'artist', '.', 'He', 'was'), ('artist', '.', 'He', 'was', 'considered'), ('.', 'He', 'was', 'considered', 'one'), ('He', 'was', 'considered', 'one', 'of'), ('was', 'considered', 'one', 'of', 'the'), ('considered', 'one', 'of', 'the', 'best'), ('one', 'of', 'the', 'best', 'politicians'), ('of', 'the', 'best', 'politicians', 'and'), ('the', 'best', 'politicians', 'and', 'speech'), ('best', 'politicians', 'and', 'speech', 'writers'), ('politicians', 'and', 'speech', 'writers', 'of'), ('and', 'speech', 'writers', 'of', 'both'), ('speech', 'writers', 'of', 'both', 'his'), ('writers', 'of', 'both', 'his', 'time'), ('of', 'both', 'his', 'time', 'and'), ('both', 'his', 'time', 'and', 'ours')]

 

Each tuple consists of 5 tokens; with each successive tuple the first word of the previous one drops off and the next word from the text is appended, like a sliding window. In practice, 4-grams are about the largest order that helps a model without a steep increase in computational cost; Google, for example, has reportedly used up to 4-grams in its models.
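To see what ngrams does under the hood, here is a minimal pure-Python sketch of the same sliding-window idea (the helper name make_ngrams is ours, not an NLTK function):

def make_ngrams(tokens, n):
    # Each n-gram starts one token later than the previous one
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(make_ngrams("Winston Churchill had many accomplishments".split(), 3))
# [('Winston', 'Churchill', 'had'), ('Churchill', 'had', 'many'), ('had', 'many', 'accomplishments')]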

We can also do this for bigrams; we only have to change the second parameter to 2.

import nltk
from nltk.util import ngrams
 
sent = "Winston Churchill had many accomplishments during his life. He was a remarkable politician but also a great solider, speech writer, and artist. He was considered one of the best politicians and speech writers of both his time and ours"
 
tokens = nltk.word_tokenize(sent)

output = list(ngrams(tokens, 2))

print(output)

 

The output is as follows:

[('Winston', 'Churchill'), ('Churchill', 'had'), ('had', 'many'), ('many', 'accomplishments'), ('accomplishments', 'during'), ('during', 'his'), ('his', 'life'), ('life', '.'), ('.', 'He'), ('He', 'was'), ('was', 'a'), ('a', 'remarkable'), ('remarkable', 'politician'), ('politician', 'but'), ('but', 'also'), ('also', 'a'), ('a', 'great'), ('great', 'soldier'), ('soldier', ','), (',', 'speech'), ('speech', 'writer'), ('writer', ','), (',', 'and'), ('and', 'artist'), ('artist', '.'), ('.', 'He'), ('He', 'was'), ('was', 'considered'), ('considered', 'one'), ('one', 'of'), ('of', 'the'), ('the', 'best'), ('best', 'politicians'), ('politicians', 'and'), ('and', 'speech'), ('speech', 'writers'), ('writers', 'of'), ('of', 'both'), ('both', 'his'), ('his', 'time'), ('time', 'and'), ('and', 'ours')]
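Once the bigrams are in hand, the usual next step for language modeling is to count how often each one occurs. A small sketch continuing from the snippet above, using collections.Counter (the variable name bigram_counts is ours):

from collections import Counter

bigram_counts = Counter(output)   # 'output' is the bigram list printed above
print(bigram_counts.most_common(3))
# ('.', 'He') and ('He', 'was') each occur twice; every other bigram occurs once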

 

How to Evaluate a Language Model:

By evaluating a language model we mean measuring how well it fits held-out data, for example to decide which value of n gives the best n-gram model. There are two ways to evaluate the performance of a language model:

  • Extrinsic evaluation
  • Intrinsic evaluation

 

Extrinsic Evaluation:

The best way to evaluate the performance of a language model is to embed it in an application and measure how much the application improves. Such end-to-end evaluation is called extrinsic evaluation. It is the only way to know if a particular improvement in a component is going to make some difference or not.

It can be done on a small corpus, but on a large corpus it is impractical and expensive, often taking days to evaluate the model. In such a scenario, we instead use a metric that can estimate the potential improvement in a system quickly.

 

Intrinsic Evaluation:

An intrinsic evaluation metric is one that measures the quality of a model independently of any application; it reflects how well the model captures the statistics it is supposed to learn. The intrinsic metric used for language models is perplexity.

 

Perplexity:

Perplexity is the inverse of the probability of the test set as assigned by the language model, normalized by the number of word tokens in the test set.

The lower the perplexity, the higher the probability of the test set. Suppose we have two language models, LM1 and LM2. If LM1 assigns a lower perplexity to the test corpus than LM2, then LM1 is the better model.

In practice, we don't use raw probability as our metric for evaluating language models but a variant of it, perplexity. The perplexity of a language model on a test set is the inverse probability of the test set, normalized by the number of words. For a bigram model, the equation is:

PP(W) = P(w_1 w_2 \dots w_N)^{-1/N} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
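As an illustration, here is a minimal sketch of computing bigram perplexity with maximum-likelihood estimates on a toy corpus of our own. It assumes every test bigram was seen in training; otherwise its probability is zero and the perplexity is infinite, which is exactly the problem smoothing addresses below.

import math
from collections import Counter

train = "the cat sat on the mat the cat ate".split()
test = "the cat sat".split()

unigram_counts = Counter(train)
bigram_counts = Counter(zip(train, train[1:]))

def p(w_prev, w):
    # Maximum-likelihood estimate: P(w | w_prev) = C(w_prev, w) / C(w_prev)
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

# PP = (product of 1 / P(w_i | w_{i-1})) ** (1 / N), computed in log space
N = len(test) - 1
log_prob = sum(math.log(p(w1, w2)) for w1, w2 in zip(test, test[1:]))
print(math.exp(-log_prob / N))   # about 1.73 for this toy example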

When unigram, bigram, and trigram models were trained on 38 million words from the Wall Street Journal using a 19,979-word vocabulary, the perplexities were:

 

Model         Unigram   Bigram   Trigram
Perplexity    962       170      109

 

Generalization and Zeros:

If we take Shakespeare's works as a corpus, there are 884,647 tokens and a vocabulary of 29,066 word types. Shakespeare produced only about 300,000 distinct bigram types out of the roughly 844 million possible bigrams (29,066 squared), which means 99.96% of the possible bigrams were never seen and have a zero count in the table. For 4-grams the situation is even worse.
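A quick back-of-the-envelope check of that percentage, using the Shakespeare figures quoted above:

vocab_size = 29_066
possible_bigrams = vocab_size ** 2          # about 844.8 million
observed_bigram_types = 300_000
unseen = 1 - observed_bigram_types / possible_bigrams
print(f"{unseen:.2%}")                      # 99.96%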

 

Overfitting:

N-grams only work well for word prediction if the test corpus looks like the training corpus, which in reality is rarely the case. We need to train a robust model that generalizes well. One obstacle to generalization is zeros: things that never occur in the training set but do occur in the test set.

 

Add-One Smoothing:

Add-one smoothing is also called Laplace smoothing. It handles the zero-probability cases: words that appear in the test data but were never seen in the training data would otherwise get probability zero. It simply adds 1 to all the counts. The formula is given by:

P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}

where c_i is the count of word w_i, N is the total number of tokens, and V is the vocabulary size.

 

Suppose we have a tiny corpus with a vocabulary of V = 4 words and a total of N = 20 tokens:

Word      Count (c)   Unsmoothed P   Count + 1   Add-one P
eat       10          0.50           11          0.46
British   4           0.20           5           0.21
food      6           0.30           7           0.29
happily   0           0.00           1           0.04
Total     20          1.00           24          1.00

Note that the Count + 1 column sums to N + V = 24, which is why each add-one probability is divided by 24 rather than 20.
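The add-one column above can be reproduced directly from the formula. A small sketch using the counts from the table:

counts = {"eat": 10, "British": 4, "food": 6, "happily": 0}
N = sum(counts.values())   # 20 tokens
V = len(counts)            # vocabulary of 4

for word, c in counts.items():
    # P_Laplace(w) = (c + 1) / (N + V)
    print(word, round((c + 1) / (N + V), 2))
# eat 0.46, British 0.21, food 0.29, happily 0.04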

Problems with Add-One Smoothing:

  • It takes too much probability mass away from seen (non-zero) events.
  • It moves too much mass over to unseen n-grams.

Add-K Smoothing:

One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k (for example 0.5 or 0.05). This algorithm is therefore called add-k smoothing.

P_{\text{Add-}k}(w_i) = \frac{c_i + k}{N + kV}

It doesn't work well for language modeling, however.
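For comparison, here is the same computation with a fractional count k instead of 1; k = 0.05 is just an illustrative choice, and the counts are the toy ones from the add-one table:

def add_k_prob(count, total_tokens, vocab_size, k):
    # P_add-k(w) = (c + k) / (N + k * V)
    return (count + k) / (total_tokens + k * vocab_size)

counts = {"eat": 10, "British": 4, "food": 6, "happily": 0}
N, V, k = 20, 4, 0.05
for word, c in counts.items():
    print(word, round(add_k_prob(c, N, V, k), 3))
# With a small k the seen words keep nearly their unsmoothed probabilities,
# while 'happily' receives only a tiny share of the mass (about 0.002).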

 




