Let us now see how to build a spelling corrector in Python. The program is based on edit distance, and it needs only the standard library: the regex module (re) and the Counter class from collections. So let's dive into the code.
We first import the libraries we are going to use. Then we read the file containing the 20,000 words that serve as our dictionary.
The data looks like this:
These are a few of the words in our file. We now write a function to read this file:
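A minimal sketch of such a reader is shown below. The function names words and read_words are our own choice for illustration, and we assume the file is plain UTF-8 text; adjust both to match your data.

```python
import re

def words(text):
    """Return every run of letters in text, lowercased."""
    return re.findall(r"[a-z]+", text.lower())

def read_words(path):
    """Read a text file and return the list of words it contains."""
    with open(path, encoding="utf-8") as handle:
        return words(handle.read())

print(words("The cat, the Hat!"))  # ['the', 'cat', 'the', 'hat']
```

Lowercasing before matching means that 'The' and 'the' are treated as the same dictionary entry.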
Once the file has been read, we use a Counter to record how many times each word occurs in it, with the help of the following method:
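As a small illustration of what Counter does, the sketch below counts a hand-made word list. The list sample_words is toy data that we made up; in the real program it would be the words read from the file.

```python
from collections import Counter

# Toy data standing in for the words read from the dictionary file.
sample_words = ["the", "cat", "sat", "on", "the", "mat", "the"]

WORDS = Counter(sample_words)

print(WORDS["the"])          # 3
print(WORDS["dog"])          # 0, words that never occur count as zero
print(WORDS.most_common(1))  # [('the', 3)]
```

A Counter behaves like a dictionary that returns 0 for missing keys, which is convenient when we later look up words that are not in the corpus.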
If we print the result, the data looks like this:
Next, we write a function that computes the probability of a word occurring in the corpus:
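One simple way to do this, assuming the probability is just the relative frequency of the word, is sketched below. The name probability and the toy counts are ours, chosen only for illustration.

```python
from collections import Counter

# Toy corpus of 8 words (illustrative only).
WORDS = Counter(["the", "cat", "sat", "on", "the", "mat", "the", "end"])

def probability(word, counts=WORDS):
    """Relative frequency of word: its count divided by the corpus size."""
    total = sum(counts.values())
    return counts[word] / total

print(probability("the"))  # 0.375
print(probability("dog"))  # 0.0
```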
The probability of a word is its count divided by the total number of words in the corpus.
We also need a function that checks whether a word is present in the dictionary.
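A minimal sketch of such a membership check is given below. We assume the dictionary is the Counter of word counts, and the name known is our own; the function keeps only those candidate words that actually occur in the dictionary.

```python
from collections import Counter

# Toy dictionary (illustrative only).
WORDS = Counter(["the", "cat", "sat", "on", "the", "mat"])

def known(candidates, dictionary=WORDS):
    """Return the subset of candidates that appear in the dictionary."""
    return {word for word in candidates if word in dictionary}

print(known(["teh"]))  # set()
```

Calling known(["the", "teh", "cat"]) keeps 'the' and 'cat' and drops 'teh'.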
Two further functions handle the edit distance: they generate the strings that are one edit and two edits away from a given word. They are based on the rules that we have learned earlier.
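The two functions can be sketched as follows. We assume the edit rules are deletion, transposition, replacement and insertion of a single lowercase letter; the names edits1 and edits2 are our own.

```python
import string

def edits1(word):
    """All strings that are one edit away from word."""
    letters = string.ascii_lowercase
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings that are two edits away from word."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

print("the" in edits1("teh"))              # True (transposition)
print("something" in edits2("somthin"))    # True (two insertions)
```

Note that these functions generate every possible string, most of which are not real words, so the result is normally filtered against the dictionary.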
The candidates function first checks whether the word is in the corpus; if it is not, it falls back to the words that are a small edit distance away. The correction function then returns the candidate with the highest probability.
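Putting these pieces together, a minimal sketch could look like the following. The word counts are toy values that we made up, and the helper edits1 is repeated here only so that the example runs on its own.

```python
import string
from collections import Counter

# Toy word counts (illustrative only, not taken from a real corpus).
WORDS = Counter({"the": 50, "spelling": 5, "spilling": 1, "corrected": 3})

def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(candidates):
    """Keep only the candidates that occur in the dictionary."""
    return {w for w in candidates if w in WORDS}

def candidates(word):
    """Prefer the word itself, then words one edit away, then two edits away."""
    return (known([word])
            or known(edits1(word))
            or known(e2 for e1 in edits1(word) for e2 in edits1(e1))
            or {word})

def correction(word):
    """Return the candidate with the highest count, hence highest probability."""
    return max(candidates(word), key=lambda w: WORDS[w])

print(correction("teh"))        # the
print(correction("splling"))    # spelling
print(correction("korrectud"))  # corrected
```

For 'splling', both 'spelling' and 'spilling' are one edit away, and 'spelling' wins because it has the higher count. A word with no known candidate within two edits is returned unchanged.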
All the code is as follows:
The output is as follows:
Improvements to the Channel Model:
By using a confusion matrix, we can allow richer edits. This improvement was added by Brill and Moore in 2000. Some letters are frequently typed in place of others. For example, ‘a’ is very often confused with ‘e’, and ‘m’ is often confused with ‘n’. By building a matrix of how often each such confusion occurs, it becomes easy to choose the most likely replacement when a word is not in the dictionary. Richer edits apply the same idea to groups of letters:
- ent → ant
- ph → f
The above are examples of such multi-letter substitutions.
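The idea of richer edits can be sketched as below. This is only an illustration: the rule counts are invented, the direction of each rule (typed fragment, intended fragment) is our own convention, and the function name rich_candidates is hypothetical. It is not the model of Brill and Moore, which estimates the probabilities from data.

```python
# Each key is (typed fragment, intended fragment); each value is how often
# that confusion was seen. The counts are invented for illustration.
RULES = {
    ("ant", "ent"): 12,
    ("ent", "ant"): 10,
    ("f", "ph"): 8,
}

DICTIONARY = {"dependent", "assistant", "phone"}

def rich_candidates(word, rules=RULES, dictionary=DICTIONARY):
    """Rewrite fragments of word using the rules; keep dictionary words,
    ordered from the most to the least frequent confusion."""
    scored = {}
    for (typed, intended), count in rules.items():
        start = word.find(typed)
        while start != -1:
            candidate = word[:start] + intended + word[start + len(typed):]
            if candidate in dictionary:
                scored[candidate] = max(scored.get(candidate, 0), count)
            start = word.find(typed, start + 1)
    return sorted(scored, key=scored.get, reverse=True)

print(rich_candidates("dependant"))  # ['dependent']
print(rich_candidates("fone"))       # ['phone']
```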
The second improvement was made by Toutanova and Moore in 2002, who incorporated pronunciation into the channel model. This makes spelling correction more robust and accurate for words that are misspelled the way they sound.
Factors on which the improvements are based:
The channel model can be improved by taking the following factors into account:
- Surrounding letters: which letters surround the error, and how they change the probability that a letter is misspelled.
- Position of the word: where the word occurs and which words surround it. We then compute the bigram probability to estimate which word is likely to come next.
- Nearby keys on the keyboard: which keys are next to each other on the keyboard. Many spelling mistakes happen because the user presses a neighboring key instead of the intended one.
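The last factor can be illustrated with a small sketch. We assume a QWERTY layout and treat two keys as neighbors when they are at most one row and one column apart, which ignores the stagger between rows; the function names are our own.

```python
# Letter rows of a QWERTY keyboard (layout is an assumption).
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POSITION = {key: (r, c) for r, row in enumerate(ROWS) for c, key in enumerate(row)}

def are_adjacent(a, b):
    """True if keys a and b are different and at most one row and column apart."""
    if a == b or a not in POSITION or b not in POSITION:
        return False
    (r1, c1), (r2, c2) = POSITION[a], POSITION[b]
    return abs(r1 - r2) <= 1 and abs(c1 - c2) <= 1

def is_adjacent_key_slip(typed, intended):
    """True if typed differs from intended in exactly one letter,
    and that letter is a neighboring key of the intended one."""
    if len(typed) != len(intended):
        return False
    diffs = [(t, i) for t, i in zip(typed, intended) if t != i]
    return len(diffs) == 1 and are_adjacent(*diffs[0])

print(are_adjacent("m", "n"))                # True
print(is_adjacent_key_slip("wprd", "word"))  # True, 'p' is next to 'o'
print(is_adjacent_key_slip("wxrd", "word"))  # False, 'x' is far from 'o'
```

A channel model could give such adjacent-key substitutions a higher probability than substitutions between distant keys.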
Classifier-based methods for real-life spell checkers:
For a real-life spell checker, the Noisy Channel Model alone is not good enough, so classifier-based methods are used. Suppose the word ‘cloudy’ appears nearby: the system should know that the intended word is ‘weather’. If the word is instead followed by a verb or by ‘or not’, then the intended word should be ‘whether’.
We have to train a classifier for each such pair of confusable words.
Classifiers that can be used for such problems include:
- Naive Bayes
- KNN etc.
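As an illustration, here is a minimal Naive Bayes sketch for the ‘weather’ versus ‘whether’ pair, written in plain Python with add-one smoothing. The training sentences are invented toy data, the underscore marks the position of the ambiguous word, and the function names are our own; a real system would be trained on a large corpus.

```python
import math
from collections import Counter, defaultdict

# Invented toy examples: (correct word, context with "_" in its place).
TRAIN = [
    ("weather", "the cloudy _ will continue today"),
    ("weather", "rainy _ is expected this week"),
    ("weather", "the _ forecast says sunny"),
    ("whether", "i wonder _ to go or not"),
    ("whether", "she asked _ he would come"),
    ("whether", "decide _ or not to stay"),
]

def train(examples):
    """Count classes and the context words seen with each class."""
    class_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for label, context in examples:
        class_counts[label] += 1
        for word in context.split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(context, model):
    """Return the class with the highest log posterior for the context."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        denominator = sum(word_counts[label].values()) + len(vocab)
        for word in context.split():
            if word in vocab:  # skip words never seen in training
                score += math.log((word_counts[label][word] + 1) / denominator)
        scores[label] = score
    return max(scores, key=scores.get)

model = train(TRAIN)
print(classify("a cloudy _ today", model))  # weather
print(classify("tell me _ or not", model))  # whether
```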
Text classification assigns classes to texts according to their content. The classes are related to the texts they describe and can be organized into a hierarchy of categories. Text classification involves several pre-processing steps, including text extraction, tokenization, removal of stop words, and conversion to unigrams or bigrams.
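These pre-processing steps can be sketched in a few lines. The stop-word list below is a tiny made-up sample and the function names are our own; real projects usually rely on a much longer list.

```python
import re

# A tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop the tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over the tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = remove_stop_words(tokenize("The weather in the city is cloudy"))
print(tokens)             # ['weather', 'city', 'cloudy']
print(ngrams(tokens, 2))  # [('weather', 'city'), ('city', 'cloudy')]
```

With n set to 1 the same function returns unigrams, so one helper covers both representations.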