Now we will study how to make n-grams in Python. NLTK provides the functionality to convert a string of tokens into bigrams, trigrams, or n-grams of any order.
This is done with the `ngrams` method. It takes two parameters: the first is the list of tokens, and the second is the order of n-gram you want. In this example, we generate 5-grams. The output is as follows:
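As a minimal sketch (the sentence below is a hypothetical example, not from the original text), `nltk.ngrams` can be used like this:

```python
from nltk import ngrams

# Hypothetical example sentence; any whitespace-tokenized text works.
tokens = "I love reading books about natural language processing".split()

# The second argument is the n-gram order; here we ask for 5-grams.
fivegrams = list(ngrams(tokens, 5))
for gram in fivegrams:
    print(gram)
```

Each printed item is a tuple of 5 consecutive tokens.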
Each tuple consists of 5 words; each successive tuple drops the earliest word and adds the next word from the text. In practice, 4-grams are about the largest order that helps a model without greatly increasing the computational cost; Google also uses a maximum of 4-grams in its models.
We can also do this for bigrams; we only have to change the second parameter.
The output is as follows:
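For instance, with an assumed example sentence:

```python
from nltk import ngrams

tokens = "the quick brown fox".split()

# Order 2 gives bigrams instead of 5-grams.
bigrams = list(ngrams(tokens, 2))
print(bigrams)  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```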
How to Evaluate a Language Model:
By evaluating a language model, we mean finding the value of N for the n-gram that best fits the model. There are two methods to evaluate the performance of a language model:
- Extrinsic evaluation
- Intrinsic evaluation
Extrinsic Evaluation:
The best way to evaluate the performance of a language model is to embed it in an application and measure how much the application improves. Such end-to-end evaluation is called extrinsic evaluation. It is the only way to know whether a particular improvement in a component will actually make a difference.
Extrinsic evaluation can be done on a small corpus, but on a large corpus it is impractical and expensive, often taking days. In such a scenario, we instead use a metric that can estimate the potential improvement in a system quickly.
Intrinsic Evaluation:
An intrinsic evaluation metric is one that measures the quality of a model independent of any application. It captures how well the model achieves what it is supposed to achieve. For the intrinsic evaluation of language models, perplexity is the standard metric.
Perplexity:
Perplexity is the inverse of the probability the language model assigns to the test set, normalized by the number of word tokens in the test set.
When the perplexity is at its minimum, the probability is at its maximum. Suppose we have two language models, LM1 and LM2. If LM1 assigns a lower perplexity to the test corpus than LM2, then LM1 is the better model.
In practice, we don't use raw probability as our metric for evaluating language models but this variant of it, perplexity. Thus, if we are calculating the perplexity of a bigram model, the equation is:
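In standard notation (for a test set $W = w_1 w_2 \ldots w_N$ of $N$ word tokens), perplexity and its bigram form can be written as:

```latex
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
      = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}

\text{and, for a bigram model:}\qquad
PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
```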
When unigram, bigram, and trigram models were trained on 38 million words from the Wall Street Journal using a 19,979-word vocabulary, the perplexities were:
Model       Unigram   Bigram   Trigram
Perplexity  962       170      109
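To make the computation concrete, here is a small sketch with made-up bigram probabilities (the values are purely illustrative, not taken from the table above):

```python
import math

# Hypothetical conditional probabilities P(w_i | w_{i-1}) that a bigram
# model assigns to a 4-word test sequence.
bigram_probs = [0.2, 0.1, 0.25, 0.05]
N = len(bigram_probs)

# Perplexity = inverse probability of the test set, normalized by length.
# Computed in log space to avoid underflow on longer sequences.
log_prob = sum(math.log(p) for p in bigram_probs)
perplexity = math.exp(-log_prob / N)
print(round(perplexity, 2))
```

A model that assigned higher probabilities to the same sequence would yield a lower perplexity.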
Generalization and Zeros:
If we take Shakespeare's works as a corpus, there are 884,647 tokens with a vocabulary of 29,066 word types. Shakespeare produced about 300,000 distinct bigrams out of roughly 844 million possible bigrams. That means 99.96% of the possible bigrams were never seen; they have zero counts in the table. For 4-grams the sparsity is even worse.
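A quick arithmetic check of these figures:

```python
V = 29_066                 # vocabulary size (distinct word types)
possible = V * V           # all possible bigrams over this vocabulary
seen = 300_000             # distinct bigram types actually observed
unseen_pct = 100 * (1 - seen / possible)
print(f"{possible:,} possible bigrams; {unseen_pct:.2f}% never seen")
```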
Overfitting:
N-grams only work well for word prediction if the test corpus looks like the training corpus; unfortunately, in reality it often does not. We need to train a robust model that generalizes well. One particular problem of generalization is zeros: things that never occur in the training set but do occur in the test set.
Add-One Smoothing:
Add-one smoothing, also called Laplace smoothing, handles the case where a probability would be zero because a word appears in the test data but not in the training data. It simply adds 1 to all the counts. The formula is given by:
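Written out (using $c_i$ for the training count of word $w_i$, $N$ for the total number of tokens, and $V$ for the vocabulary size), the add-one estimate for a unigram is:

```latex
P_{\text{Add-1}}(w_i) = \frac{c_i + 1}{N + V}
```

Adding 1 to every count requires adding $V$ to the denominator so the probabilities still sum to 1.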
Suppose we have a tiny corpus with vocabulary V = 4 and a total of N = 20 tokens:

Word      Frequency   Unsmoothed   Frequency (new)   Add-one
eat       10          0.5          11                0.46
British   4           0.2          5                 0.21
food      6           0.3          7                 0.29
happily   0           0.0          1                 0.04
Total     20          1.0          ~20               1.0
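The table above can be reproduced with a short sketch:

```python
# Training counts from the table above: V = 4 word types, N = 20 tokens.
counts = {"eat": 10, "British": 4, "food": 6, "happily": 0}
N = sum(counts.values())   # 20
V = len(counts)            # 4

for word, c in counts.items():
    unsmoothed = c / N                 # maximum-likelihood estimate
    add_one = (c + 1) / (N + V)        # Laplace-smoothed estimate
    print(f"{word:8s} {unsmoothed:.2f} {add_one:.2f}")
```

Note how "happily" goes from probability 0.0 to 0.04, while the mass of the seen words shrinks to compensate.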
Problems with Add-One Smoothing:
- It changes the estimates of seen (nonzero) events too drastically.
- It moves too much probability mass over to unseen n-grams.
Add-K Smoothing:
One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k. This algorithm is therefore called add-k smoothing.
Even so, add-k smoothing doesn't work well for language modeling.
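A sketch of add-k as a small helper (the function name and the value of k are hypothetical choices; k = 1 recovers add-one smoothing):

```python
def add_k_prob(count, N, V, k=0.5):
    """Add-k smoothed unigram probability; k=1 is add-one smoothing."""
    return (count + k) / (N + k * V)

# With the tiny corpus from the add-one section (N=20, V=4), a smaller k
# moves less probability mass to the unseen word "happily".
print(round(add_k_prob(0, 20, 4, k=1.0), 3))   # add-one
print(round(add_k_prob(0, 20, 4, k=0.1), 3))   # add-k with k = 0.1
```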