Build a Spelling Checker using NLP in Python

In this part of Learning Python, we will build a spelling checker using NLP in Python.
Written by Paayi Tech | 10-May-2019

Now we will see how we can build a spelling corrector in Python. The approach is based on edit distance between strings. Apart from the standard library's regex and collections modules, no external libraries are needed. So let's dive into the code.

import re
from collections import Counter

 

We first import the libraries that we are going to use. Then we will read the file which contains the 20,000 words that we will be using for the dictionary.

The data looks like this:

the
of
and
to
a
in
for
is
on
that
by
this
with
i
you
it
not
or
be
are
from
at
as
your
all
have
new
more
an
was
we
will
home
can
us
about
if
page
my
has
search
free
but

These are a few of the words in our file. We will now write a function to read this file:

def file_read(text):
    # Extract lowercase word tokens from the raw file contents.
    return re.findall(r'\w+', text.lower())

 

This function reads the words out of the file contents. We then use Counter to record how many times each word occurs, with the help of the following line:

data = Counter(file_read(open('data/20K.txt').read()))

 

If we print the data, it will look like this (since each word appears exactly once in the 20K list, every count is 1):

Counter({'the': 1,
         'of': 1,
         'and': 1,
         'to': 1,
         'a': 1,
         'in': 1,
         'for': 1,
         'is': 1,
         'on': 1,
         'that': 1,
         'by': 1,
         'this': 1,
         'with': 1,
         'i': 1,
         'you': 1,
         ...})

 

 

Now we will write another function that computes the probability of a word occurring in the corpus:

def Probability(word, N=sum(data.values())):
    # Relative frequency of the word in the corpus (0 for unseen words).
    return data[word] / N

 

This function computes that probability: the count of the word divided by the total number of tokens N.
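For instance (a quick check, assuming the 20K file contains 20,000 unique words, so N is 20000):

print(Probability('the'))     # 5e-05, i.e. 1/20000
print(Probability('zzzzzz'))  # 0.0 -- unseen words get probability 0

Note that because our word list stores each word only once, all known words tie at the same probability; a corpus with real frequency counts would rank candidates more usefully.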

def known(words):
    # Keep only the candidates that actually appear in the dictionary.
    return set(w for w in words if w in data)

 

This function filters a collection of candidate strings, keeping only those that are real words in our dictionary.
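A quick example (the order of the resulting set may vary):

print(known(['the', 'qzxw', 'you']))  # {'the', 'you'} -- only dictionary words survive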

def edits1(word):
    # All strings one edit away from `word`: deletions, transpositions,
    # single-letter replacements, and insertions.
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts    = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # All strings two edits away: one more edit applied to each edits1 result.
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

 

These two functions generate every string at edit distance one and two from the input, following the edit operations we covered earlier (deletion, transposition, replacement, insertion).
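For a word of length n, edits1 builds n deletions, n-1 transpositions, 26n replacements, and 26(n+1) insertions, i.e. 54n+25 strings before the set removes duplicates. A quick check:

print(len(edits1('word')))        # a couple of hundred strings at distance 1
print(len(set(edits2('word'))))   # tens of thousands at distance 2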

def correction(word):
    # Pick the candidate with the highest probability in the corpus.
    return max(candidates(word), key=Probability)

def candidates(word):
    # Prefer the word itself, then known words one edit away,
    # then two edits away, and finally fall back to the word unchanged.
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

 

The candidates function first checks whether the word itself is in the corpus; if not, it falls back to known words at edit distance one, then edit distance two, and finally the word unchanged. The correction function then picks the candidate with the highest probability.
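We can watch that fallback order in action (example output, assuming the 20K dictionary loaded above):

print(candidates('the'))   # {'the'} -- the word itself is known
print(candidates('wite'))  # known words one edit away, e.g. {'with', 'site', 'white', ...}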

All the code is as follows:

import re
from collections import Counter

def file_read(text):
    # Extract lowercase word tokens from the raw file contents.
    return re.findall(r'\w+', text.lower())

data = Counter(file_read(open('data/20K.txt').read()))

def Probability(word, N=sum(data.values())):
    # Relative frequency of the word in the corpus (0 for unseen words).
    return data[word] / N

def correction(word):
    # Pick the candidate with the highest probability in the corpus.
    return max(candidates(word), key=Probability)

def candidates(word):
    # Prefer the word itself, then known words one edit away,
    # then two edits away, and finally fall back to the word unchanged.
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def known(words):
    # Keep only the candidates that actually appear in the dictionary.
    return set(w for w in words if w in data)

def edits1(word):
    # All strings one edit away from `word`.
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes    = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts    = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    # All strings two edits away.
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

 

We can test the corrector as follows:

print(correction("candida"))

print(correction("jind"))

print(correction("wite"))

 

Output:

candid
find
with

 

 

Improvement in the Channel Model:

By using a confusion matrix, we can build a richer edit model. This improvement was proposed by Brill and Moore in 2000. Some letters are commonly mistyped as other letters: for example, 'a' is most often misspelled as 'e', and 'm' is often misspelled as 'n'. By building a matrix of such occurrences, it becomes easier to choose the right replacement when a word is not in the dictionary. The model also covers multi-character edits such as:

  • ent → ant
  • ph → f

The above are examples of common multi-character misspellings.
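A minimal sketch of the idea, reusing known, edits1, and Probability from above (the substitution counts below are made-up illustrative values, not a real confusion matrix learned from an error corpus):

# Illustrative substitution counts (made up): how often the typed letter
# (first) was produced when the intended letter (second) was meant.
SUB_COUNTS = {('e', 'a'): 120, ('a', 'e'): 95, ('n', 'm'): 60, ('m', 'n'): 55}

def sub_weight(typed, intended):
    # Unseen substitutions get a small smoothing weight of 1.
    return SUB_COUNTS.get((typed, intended), 1)

def weighted_correction(word):
    # Score each known distance-1 candidate by language-model probability
    # times a crude channel weight for single-letter replacements.
    best, best_score = word, 0.0
    for cand in known(edits1(word)):
        weight = 1
        if len(cand) == len(word):
            diffs = [(t, c) for t, c in zip(word, cand) if t != c]
            if len(diffs) == 1:  # this edit was a single replacement
                weight = sub_weight(*diffs[0])
        score = Probability(cand) * weight
        if score > best_score:
            best, best_score = cand, score
    return best

With counts estimated from a real error corpus, this prefers common substitutions (like a typed 'e' for an intended 'a') over rarer ones when several dictionary words are one edit away.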

A second improvement was made by Toutanova and Moore in 2002 by incorporating pronunciation into the channel model. This made the corrector more robust and accurate on phonetic misspellings.
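As a rough illustration of pronunciation-aware matching (a toy, simplified Soundex-style key, not Toutanova and Moore's actual model):

def phonetic_key(word):
    # Map consonants to Soundex-style digit classes; vowels, h, w, y are dropped.
    codes = {}
    for group, digit in [('bfpv', '1'), ('cgjkqsxz', '2'), ('dt', '3'),
                         ('l', '4'), ('mn', '5'), ('r', '6')]:
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    key, last = word[0], codes.get(word[0], '')
    for ch in word[1:]:
        digit = codes.get(ch, '')
        if digit and digit != last:  # collapse adjacent duplicate codes
            key += digit
        last = digit
    return (key + '000')[:4]

print(phonetic_key('wether'), phonetic_key('weather'))  # both 'w360'

Candidates that share a phonetic key with the misspelling can then be ranked ahead of candidates that merely look similar on the page.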

 

Factors behind the improvements:

The improvements were made on the following basis:

  • Surrounding letters: which letters surround the misspelled one, and how likely each letter is to be mistyped in that context.
  • Position of the word: where the word sits in the sentence and which words surround it; we then compute a bigram probability to predict the next word.
  • Nearby keys on the keyboard: which keys sit next to the one intended. Many spelling mistakes happen when a user presses a nearby key instead of the intended one (see the sketch after this list).
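A minimal sketch of the keyboard idea, reusing known, edits1, and Probability from above (the adjacency map is a small hand-written excerpt of a QWERTY layout, not a complete one):

# Hand-written QWERTY neighbourhood excerpt (illustrative, incomplete).
NEARBY = {
    'a': 'qwsz', 's': 'awedxz', 'd': 'serfcx', 'j': 'huikmn',
    'i': 'ujko', 'o': 'iklp', 'n': 'bhjm', 'm': 'njk',
}

def keyboard_score(typed, cand):
    # Boost same-length candidates whose single differing letter sits
    # next to the key that was actually pressed (a "fat-finger" error).
    if len(typed) != len(cand):
        return 1.0
    diffs = [(t, c) for t, c in zip(typed, cand) if t != c]
    if len(diffs) == 1 and diffs[0][1] in NEARBY.get(diffs[0][0], ''):
        return 5.0  # arbitrary illustrative boost
    return 1.0

def keyboard_correction(word):
    cands = known([word]) or known(edits1(word)) or [word]
    return max(cands, key=lambda c: Probability(c) * keyboard_score(word, c))

# e.g. for 'jind' this prefers words like 'kind' or 'mind', whose first
# letters neighbour 'j' on the keyboard, over equally probable 'find'.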

 

Classifier-based methods for real-life spell checker:

For a real-life spell checker, the Noisy Channel Model alone is not good enough; classifier-based methods are used instead. Suppose the surrounding context contains the word 'cloudy': the system should know that the intended word is 'weather'. If instead the word is followed by a verb or by 'or not', it should be 'whether'.

We have to train a classifier on such confusion pairs.

Classifiers that can be used for such problems include (a minimal sketch follows this list):

  • Naive Bayes
  • SVM
  • KNN etc.
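As a minimal sketch (assuming scikit-learn is installed; the tiny training set below is made up purely for illustration), a Naive Bayes classifier can pick between 'weather' and 'whether' from the surrounding words:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: contexts with the target word blanked out.
contexts = [
    'the ___ is cloudy today', 'sunny ___ all week', 'cold and rainy ___',
    '___ or not we go', 'i wonder ___ he will come', 'decide ___ to stay',
]
labels = ['weather', 'weather', 'weather', 'whether', 'whether', 'whether']

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(contexts, labels)

print(model.predict(['the ___ looks cloudy']))  # expected: ['weather']
print(model.predict(['___ or not it rains']))   # expected: ['whether']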

 

Text Classification:

Text classification assigns classes to a text according to its content. These classes are linked to the texts they describe and form a hierarchy of categories. Text classification involves several pre-processing steps, including text extraction, tokenization, removal of stop words, and conversion to unigrams or bigrams.
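A minimal sketch of that pre-processing (standard library only, reusing the re module imported earlier; the stop word set is a small illustrative excerpt):

STOP_WORDS = {'the', 'of', 'and', 'to', 'a', 'in', 'is', 'it'}  # small excerpt

def preprocess(text):
    # Tokenize, lower-case, drop stop words, then add bigrams to the unigrams.
    tokens = [t for t in re.findall(r'\w+', text.lower()) if t not in STOP_WORDS]
    bigrams = [' '.join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

print(preprocess('The weather in London is cloudy'))
# ['weather', 'london', 'cloudy', 'weather london', 'london cloudy']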




