In this section, we will build a movie recommendation system using natural language processing. Several types of recommendation system are in common use today:
- Content-Based Filtering: We use the titles and descriptions of the movies and measure how similar they are based on this textual data. This technique relies on natural language processing, and it is the one we will use today.
- Collaborative Filtering: We recommend a movie to a user based on the likes, dislikes, or ratings of other users.
- Hybrid Filtering: Content-based and collaborative filtering are combined into a single technique.
So today we will use the content-based technique, with a dataset from Kaggle. Now let's dive into the code. We will first import the essential modules.
Pandas will be used to import the CSV files and to manipulate the data. Eval is a function that converts a string to a Python object. For example, if we have a string whose characters look like a list, then eval will convert it to a real list. It evaluates the string into a real Python object.
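To make the string-to-object step concrete, here is a minimal sketch. It assumes the safe variant `ast.literal_eval` from the standard library (the text above only says "eval"), and the sample string is a made-up value, not a row from the dataset.

```python
from ast import literal_eval

# A string that merely looks like a list of dictionaries (hypothetical sample value).
raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"

# literal_eval only accepts Python literals, so it cannot run arbitrary code.
parsed = literal_eval(raw)

# Now we can iterate over a real list of real dictionaries.
names = [item["name"] for item in parsed]
print(type(raw).__name__, "->", type(parsed).__name__, names)
```

Unlike the built-in `eval`, `literal_eval` refuses anything that is not a plain literal, which is usually the safer choice for data read from a file.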
The other module that we are going to import is the vectorizer. A computer does not understand what text is, so we have to convert the text to numeric vectors, and a vectorizer is a standard way to do that.
Then we will use a similarity method to measure the similarity between the texts.
Now we will import the CSV files so that we can continue with the implementation.
After that, the data looks like this:
The movie dataset has many features, so first we have to clean it. But what do we have to clean? We can see that the genre is given in list form, but its type is a string. To parse the data and iterate over the list and its dictionaries, we first have to evaluate the string into a Python object. So the following steps are needed.
In the first line, we change the type of the genre column from string to list. We first fill the null values with an empty string. After that, we convert the date strings to Python's datetime format. With the datetime values, we can see how many movies were released in each year.
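The cleaning steps above can be sketched roughly as follows. This is an illustration under assumptions: the column names `genres` and `release_date` and the three sample rows are hypothetical, and missing genres are filled with the string `"[]"` rather than an empty string, because `literal_eval("")` would raise an error.

```python
from ast import literal_eval

import pandas as pd

# Hypothetical stand-in for the movie metadata; column names are assumptions.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "genres": ["[{'id': 16, 'name': 'Animation'}]", None, "[{'id': 35, 'name': 'Comedy'}]"],
    "release_date": ["1995-10-30", "1995-12-15", "not a date"],
})

# Fill missing genres with an empty list literal, parse the strings,
# then keep only the genre names from each dictionary.
movies["genres"] = (
    movies["genres"]
    .fillna("[]")
    .apply(literal_eval)
    .apply(lambda items: [g["name"] for g in items])
)

# Unparseable dates become NaT instead of raising an error.
movies["release_date"] = pd.to_datetime(movies["release_date"], errors="coerce")

# Number of movies released per year (rows without a valid date are ignored).
per_year = movies["release_date"].dt.year.value_counts().sort_index()
print(per_year)
```

Calling `per_year.plot()` on the resulting series would give a movies-per-year chart, provided matplotlib is installed.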
This is the plot of the number of movies released in each year. Now we will link the two CSV files by id. In this way, we can shrink the movie data, because the full dataset is very large and would consume a lot of memory.
In the above line of code, we converted the id column to integers so that we can easily match the ids from the two data sources.
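As a hedged sketch of this id matching, the snippet below uses two tiny made-up tables. The column names `id` and `tmdbId` are assumptions, as is the presence of a malformed id; the real files may differ.

```python
import pandas as pd

# Hypothetical miniature versions of the two sources; names are assumptions.
movies = pd.DataFrame({
    "id": ["862", "8844", "bad-id"],
    "title": ["Toy Story", "Jumanji", "Broken row"],
})
links = pd.DataFrame({"tmdbId": [862.0, 8844.0, None]})

# Convert ids to numbers; anything unparseable becomes NaN and is dropped,
# after which the column can safely be cast to int.
movies["id"] = pd.to_numeric(movies["id"], errors="coerce")
movies = movies.dropna(subset=["id"]).astype({"id": int})

# The link ids are cleaned the same way.
link_ids = links["tmdbId"].dropna().astype(int)

# Keep only the movies whose id also appears in the links data.
matched = movies[movies["id"].isin(link_ids)]
print(matched)
```

Casting both sides to the same integer type matters because `"862"`, `862.0`, and `862` would otherwise not compare as equal ids.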
Now we will build a new data source that includes only the movies present in both the links and the movies data. Then we create a description that combines the tagline and the overview of each movie, which together form the full description. The data looks as follows:
It contains all the text that we need for feature extraction. Now, to make these descriptions understandable to the computer, we have to vectorize the data as follows:
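A minimal sketch of this step, assuming scikit-learn's `TfidfVectorizer` (the text names only a "vectorizer", so TF-IDF is an assumption) and a few invented rows with `tagline` and `overview` columns:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample rows; real data would come from the filtered movie table.
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "tagline": ["Roll the dice", None, "A board game adventure"],
    "overview": [
        "Two children find a magical board game.",
        "A detective hunts a thief through the city.",
        "Siblings play a board game that comes alive.",
    ],
})

# Description = tagline + overview, with missing values treated as empty text.
movies["description"] = (
    movies["tagline"].fillna("") + " " + movies["overview"].fillna("")
).str.strip()

# TF-IDF over single words, ignoring common English stop words.
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(movies["description"])

# One row per movie, one column per distinct term.
print(tfidf_matrix.shape)
```

The result is a SciPy sparse matrix, which stores only the non-zero weights. That is why printing it shows coordinates and values rather than a full table.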
If we print the matrix, the following output can be seen.
We can see that the data is stored in a compressed (sparse) form. Now we will calculate the similarity by using cosine similarity.
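Cosine similarity between two vectors is their dot product divided by the product of their lengths, so identical directions score 1 and vectors with no shared terms score 0. The sketch below assumes scikit-learn's `cosine_similarity` and three invented descriptions. It is an illustration, not necessarily the call used in the original code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical descriptions: the first two share words, the third shares none.
descriptions = [
    "children find a magical board game",
    "siblings play a magical board game",
    "detective hunts a thief",
]

tfidf_matrix = TfidfVectorizer().fit_transform(descriptions)

# Pairwise similarity between every pair of rows; the result is a dense
# square array with one row and one column per description.
similarity = cosine_similarity(tfidf_matrix)
print(similarity.round(2))
```

Each entry `similarity[i, j]` is the score between description `i` and description `j`, and the diagonal is 1 because every text is identical to itself.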
The output of this code is as follows:
We can see that the data is very sparse.
Once the similarity is computed, we sort the scores and keep the 20 highest, that is, the 20 closest movies. Now we will call the function and look at the results:
The output is as follows:
These are the movies that are closest to Jumanji.
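A recommendation function of the kind described above can be sketched as follows. The function name `recommend`, the `top_n` parameter, and the four invented movies are all hypothetical, and the sketch assumes that titles are unique.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical catalogue with made-up titles and descriptions.
movies = pd.DataFrame({
    "title": ["Game Night", "Dice Quest", "City Chase", "Jungle Board"],
    "description": [
        "friends play a magical board game with dice",
        "a magical board game with dice traps two children",
        "a detective hunts a thief through the city",
        "a board game releases jungle animals",
    ],
})

tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(movies["description"])
similarity = cosine_similarity(tfidf_matrix)

# Map each title to its row position (assumes titles are unique).
title_to_index = pd.Series(movies.index, index=movies["title"])


def recommend(title, top_n=20):
    """Return up to top_n titles, ordered from most to least similar."""
    idx = title_to_index[title]
    scores = pd.Series(similarity[idx], index=movies["title"])
    # Drop the query itself, since a movie is always most similar to itself.
    scores = scores.drop(title)
    return scores.sort_values(ascending=False).head(top_n).index.tolist()


print(recommend("Game Night", top_n=2))
```

Precomputing the full similarity matrix keeps each lookup cheap, but the matrix grows with the square of the number of movies, which is one reason for shrinking the dataset beforehand.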