Recommendation System using Natural Language Processing Technique in Python

Python Machine Learning: Recommendation System NLP Techniques
Written by Paayi Tech | 17-Oct-2020

In this section, we will see how to build a movie recommendation system using natural language processing. Several types of recommendation systems are in common use today:

  • Content-Based Filtering: This approach uses the titles and descriptions of the movies and measures how similar the text is, so it relies on natural language processing techniques. This is the technique we will use today.
  • Collaborative Filtering: This approach recommends movies to a user based on the likes, dislikes, or ratings of other users.
  • Hybrid Filtering: This approach combines content-based and collaborative filtering into a single technique.


So today we will implement content-based filtering, using a dataset from Kaggle. Now let's dive into the code. We first import the essential modules.

import pandas as pd

import numpy as np

from ast import literal_eval

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

from sklearn.metrics.pairwise import linear_kernel, cosine_similarity

import warnings



We use pandas to load the CSV files and to manipulate the data. literal_eval is a function from the ast module that converts a string into the Python object it describes. For example, if a string contains text that looks like a list, literal_eval turns it into a real list. Unlike the built-in eval, it only accepts Python literals such as strings, numbers, lists, and dictionaries.
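To make the behaviour of literal_eval concrete, here is a minimal sketch. The sample string is made up for illustration and only imitates the shape of a genre entry; it is not taken from the dataset.

```python
from ast import literal_eval

# A string that only *looks* like a list of dictionaries (made-up sample data).
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

parsed = literal_eval(raw)                 # now a real Python list of dicts
names = [item['name'] for item in parsed]  # keep only the genre names

print(type(raw).__name__, '->', type(parsed).__name__)
print(names)

# literal_eval only accepts literals; a function call is rejected with ValueError.
try:
    literal_eval("__import__('os').getcwd()")
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Because expressions such as function calls are rejected, literal_eval is a safer choice than eval when the strings come from a file you did not write yourself.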

The next import is the vectorizer. A machine learning model cannot work on raw text directly, so we have to convert the text into numeric vectors. TF-IDF is a common choice for this.
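As a rough illustration of what "converting text to vectors" means, the sketch below uses CountVectorizer on two made-up sentences. Each column corresponds to one word of the vocabulary and each row holds the word counts of one sentence. The sentences and variable names are assumptions for this example only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, used only to show the shape of the result.
sentences = ["the cat sat", "the cat sat on the cat"]

cv = CountVectorizer()
counts = cv.fit_transform(sentences)   # sparse matrix: one row per sentence

# vocabulary_ maps each word to its column index; sort the words by that index.
vocab = sorted(cv.vocabulary_, key=cv.vocabulary_.get)

print(vocab)
print(counts.toarray())
```

TfidfVectorizer builds the same kind of matrix but replaces the raw counts with weights that reduce the influence of words appearing in many documents.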

Then we will use a similarity function to measure how similar two texts are.

Now we load the CSV files so that we can work with the data.

movie = pd.read_csv('movie.csv')

links = pd.read_csv('links_small.csv')



After loading, the data looks like this:

Figure 1


The movie dataset has many features, so we first have to clean it. But what needs cleaning? The genres column looks like a list, but its values are actually strings. To parse the data and iterate over the lists and dictionaries, we first have to evaluate each string into a Python object, as follows:

movie['genres'] = movie['genres'].fillna('[]').apply(literal_eval).apply(lambda x:[i['name'] for i in x] if isinstance(x,list) else [])


movie['year'] = pd.to_datetime(movie['release_date'], errors='coerce').apply(lambda x: str(x).split('-')[0] if pd.notnull(x) else np.nan)


In the first line, we change the type of the genres column from a string to a list. We first replace null values with the string '[]', which evaluates to an empty list, and then keep only the name of each genre. In the second line, we convert the date string to a pandas datetime and extract the year. With the year, we can see how many movies were released in each year.
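The two cleaning steps can be tried on a tiny table first. The sketch below uses a hypothetical three-row DataFrame (not the Kaggle data) with one missing genre entry and two unusable dates. It detects missing dates with pd.notnull, because a comparison such as x != np.nan is always true and therefore cannot be used to find missing values.

```python
import numpy as np
import pandas as pd
from ast import literal_eval

# Hypothetical three-row stand-in for the movie table.
toy = pd.DataFrame({
    'genres': ["[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]",
               np.nan,
               "[]"],
    'release_date': ['1995-10-30', 'not a date', None],
})

# String -> list of dicts -> list of genre names; missing values become [].
toy['genres'] = (toy['genres']
                 .fillna('[]')
                 .apply(literal_eval)
                 .apply(lambda x: [i['name'] for i in x] if isinstance(x, list) else []))

# Unparseable or missing dates become NaT, which pd.notnull reports as missing.
dates = pd.to_datetime(toy['release_date'], errors='coerce')
toy['year'] = dates.apply(lambda x: str(x.year) if pd.notnull(x) else np.nan)

print(toy[['genres', 'year']])
```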





Figure 2

This is a plot of the number of movies released in each year. Now we will link the two CSV files by id. This lets us work with a smaller subset of the movie data, because the full dataset is large and would consume a lot of memory.

links = links[links['tmdbId'].notnull()]['tmdbId'].astype('int')

movie['id'] = movie['id'].astype('int')


In the lines above, we convert the ids to integers so we can easily match the ids from the two data sources.
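To see why the conversion matters, here is a small sketch with two hypothetical tables. It assumes, purely for illustration, that one file stores the ids as floats (because of a missing value) and the other as strings. Before the conversion nothing matches; afterwards the expected rows are found.

```python
import pandas as pd

# Hypothetical stand-ins for the links and movie tables (made-up rows).
toy_links = pd.DataFrame({'tmdbId': [862.0, None, 8844.0]})
toy_movie = pd.DataFrame({'id': ['862', '8844', '15602'],
                          'title': ['Toy Story', 'Jumanji', 'Grumpier Old Men']})

# Strings never equal floats, so this selects no rows at all.
unmatched = toy_movie[toy_movie['id'].isin(toy_links['tmdbId'])]

# Drop missing ids, then bring both sides to the same integer type.
link_ids = toy_links[toy_links['tmdbId'].notnull()]['tmdbId'].astype('int')
toy_movie['id'] = toy_movie['id'].astype('int')
matched = toy_movie[toy_movie['id'].isin(link_ids)]

print(len(unmatched))
print(matched['title'].tolist())
```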

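# Keep only the movies whose id appears in links; copy so we do not modify a view of movie

small = movie[movie['id'].isin(links)].copy()

small['tagline'] = small['tagline'].fillna('')

# Combine overview and tagline from the same filtered DataFrame

small['description'] = small['overview'] + small['tagline']

small['description'] = small['description'].fillna('')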


Here we build a new DataFrame that only includes the movies present in both the links and the movie data. We then build a description by combining the overview and the tagline of each movie. The descriptions look like this:

1        When siblings Judy and Peter discover an encha...

2        A family wedding reignites the ancient feud be...

3        Cheated on, mistreated, and stepped on, the wom...

4        Just when George Banks has recovered from his ...

5        Obsessive master thief, Neil McCauley leads a ...

6        An ugly duckling having undergone a remarkable...

7        A mischievous young boy, Tom Sawyer, witnesses...

8        International action superstar Jean Claude Van...

9        James Bond must unmask the mysterious head of ...

10       Widowed U.S. president Andrew Shepherd, one of...



These descriptions are the text we will use to measure similarity. To make them understandable to the computer, we vectorize them as follows:

tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 4), min_df=1, stop_words='english')

matrix = tf.fit_transform(small['description'])


If we print the matrix, we see the following output:




Output: <9098x635391 sparse matrix of type '<class 'numpy.float64'>'

        with 849580 stored elements in Compressed Sparse Row format>


We can see that the matrix is stored in a compressed sparse format. Now we will compute the similarity between movies using cosine similarity.

sim = cosine_similarity(matrix, matrix)


The output of this code is as follows:

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,

        0.        ],

       [0.        , 1.        , 0.00715858, ..., 0.00185797, 0.        ,

        0.        ],

       [0.        , 0.00715858, 1.        , ..., 0.        , 0.        ,

        0.        ],


       [0.        , 0.00185797, 0.        , ..., 1.        , 0.        ,

        0.        ],

       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,

        0.        ],

       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,

        0.        ]])



We can see that the similarity matrix is very sparse: most entries are zero.
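The same two steps, vectorizing and then comparing, can be followed on a corpus small enough to read by eye. The three descriptions below are made up for this sketch. The first two share several words, while the third shares none with them once stop words are removed.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Made-up descriptions, used only for illustration.
docs = [
    "A board game traps two children in a magical jungle",
    "Children discover a magical board game in the attic",
    "A detective hunts a master thief across Los Angeles",
]

tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words='english')
matrix = tf.fit_transform(docs)          # 3 rows, one column per word or word pair

sim = cosine_similarity(matrix, matrix)  # 3x3 matrix of pairwise similarities
print(np.round(sim, 3))

# TfidfVectorizer L2-normalises each row by default, so the plain dot product
# (linear_kernel) gives the same values as cosine_similarity here.
print(np.allclose(sim, linear_kernel(matrix, matrix)))
```

Each document has similarity 1 with itself, the first two documents have a score above zero, and the unrelated third document scores 0 against both. On a large corpus, linear_kernel can be a cheaper way to get the same matrix, since it skips the normalisation that has already been done.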

small = small.reset_index()

titles = small['title']

ind = pd.Series(small.index, index=titles)


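def Get_Recommendation(title):

    # Row position of the requested movie in the similarity matrix

    idd = ind[title]

    scores = list(enumerate(sim[idd]))

    # Sort by similarity score, highest first

    scores = sorted(scores, key=lambda x: x[1], reverse=True)

    # Skip the first entry (normally the movie itself) and keep the next 19

    scores = scores[1:20]

    movie_indices = [i[0] for i in scores]


    return titles.iloc[movie_indices]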


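The function looks up the similarity scores for the given title, sorts them in descending order, and returns the 19 most similar movies, skipping the first entry, which is normally the movie itself. Now we call the function and look at the results:

Get_Recommendation('Jumanji')
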
The output is as follows:

8888                        Pixels

8607       Guardians of the Galaxy

8153                Wreck-It Ralph

6391                    Stay Alive

3195            Dungeons & Dragons

8669                         Ouija

5802               Comfort and Joy

5355      Night of the Living Dead

4081     The Giant Spider Invasion

6322                 Grandma's Boy

8210              Would You Rather

1644                     Peter Pan

2080                      eXistenZ

6544             Sleeping Dogs Lie

6284    Zathura: A Space Adventure

4172                    Rollerball

8409                  Ender's Game

1618                   BASEketball

7307                         Gamer


These are the movies whose descriptions are closest to that of Jumanji.
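For readers who want to experiment without downloading the dataset, the whole pipeline fits in a short self-contained sketch. The four movies, their descriptions, and the function name recommend are all invented for this example. Instead of assuming that the queried movie always sorts first, the sketch filters it out by index, which is a design choice of this sketch rather than something the code above does.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented titles and descriptions, used only for illustration.
toy = pd.DataFrame({
    'title': ['Game Night', 'Jungle Quest', 'City Heist', 'Space Game'],
    'description': [
        'Friends are trapped inside a magical board game',
        'Explorers are trapped in a dangerous jungle',
        'A master thief plans a bank robbery',
        'Two brothers play a magical board game in space',
    ],
})

tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy['description'])
sim = cosine_similarity(tfidf)                    # pairwise similarity matrix
ind = pd.Series(toy.index, index=toy['title'])    # title -> row position

def recommend(title, top_n=2):
    """Return the top_n titles most similar to the given title."""
    idx = ind[title]
    scores = sorted(enumerate(sim[idx]), key=lambda x: x[1], reverse=True)
    # Drop the queried movie by index instead of assuming it sorts first.
    scores = [s for s in scores if s[0] != idx][:top_n]
    return toy['title'].iloc[[i for i, _ in scores]].tolist()

print(recommend('Game Night'))
```

"Space Game" ranks first because it shares three words with "Game Night", while "Jungle Quest" shares only one.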
