Information retrieval is finding material, usually documents, of an unstructured nature that satisfies an information need from within large collections, usually stored on computers. Information retrieval takes many forms: email search, searching files on a laptop, corporate knowledge bases, legal information retrieval, and so on.
Basic Assumption of Information Retrieval:
The collection is a set of documents. Most of the time, this collection is static.
The primary goal of information retrieval is to retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.
Document term matrix:
A document-term matrix is a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each cell in the matrix should take. Document-term matrices are useful in the field of natural language processing.
In its simplest form, the document-term matrix is a binary matrix, and it is typically sparse.
For the query "tea AND me", the following operations are performed:
Take the bit representations: tea = 110, me = 011.
Perform a bitwise AND, which results in 010.
That indicates that document two contains both tea and me.
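The bitwise operation above can be sketched in Python using integer bitmasks. The terms and bit patterns come from the example; the rest is illustrative:

```python
# Bit vectors over three documents; reading left to right,
# bit positions correspond to documents one, two, and three.
tea = 0b110  # "tea" appears in documents one and two
me = 0b011   # "me" appears in documents two and three

# A bitwise AND keeps only the documents that contain both terms.
both = tea & me

print(format(both, "03b"))  # 010: only document two matches
```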
The efficiency of Binary TDM:
For any realistically sized corpus, the TDM will be too large to store.
For example, 1 million documents and 500,000 unique terms lead to 500 billion matrix cells.
However, the matrix is extremely sparse: most documents contain only a handful of terms, and the majority of terms occur in only a handful of documents.
A better representation therefore records only the 1 positions, and an inverted index achieves exactly this.
In computer science, an inverted index is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, a document, or a set of documents. The purpose of an inverted index is to permit fast full-text searches, at the cost of increased processing when a document is added to the database. The inverted index may be the database file itself, rather than a separate index. It is the most prominent data structure used in large-scale document retrieval systems, for example in search engines.
How does it work?
For each term t, we must store a list of all documents that contain t. Each document is identified by a docID, a document serial number.
We cannot use a fixed-size array for this; we need variable-size postings lists. On disk, a contiguous run of postings is normal and best; in memory, we can use linked lists or variable-length arrays.
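As a minimal sketch, an inverted index with sorted, variable-length postings lists can be built in Python as follows (the sample documents are invented for illustration):

```python
from collections import defaultdict

docs = {
    1: "tea is hot",
    2: "tea with me",
    3: "me and you",
}

# Map each term to the sorted list of docIDs containing it (its postings list).
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in set(docs[doc_id].lower().split()):
        index[term].append(doc_id)

def intersect(p1, p2):
    """Merge two sorted postings lists for an AND query."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

hits = intersect(index["tea"], index["me"])  # documents containing both terms
```

The merge walks both sorted lists once, so an AND query costs time linear in the length of the postings lists rather than in the size of the collection.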
Bag of Words:
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms. The approach is straightforward to use as a way of extracting features from a document. A bag of words is a representation of text that describes the occurrence of words within a document. It involves two things:
- A vocabulary of all the known words.
- A measure of the occurrence of those known words.
It is known as a bag of words because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not with where they occur.
The vector representation does not consider the ordering of words in a document. For example, the two sentences "John is quicker than Mary" and "Mary is quicker than John" have the same vector, which can cause ambiguity.
Following is the code to make a bag of words in Python:
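A minimal pure-Python sketch is shown below; the whitespace tokenization and the two example sentences are illustrative assumptions (libraries such as scikit-learn's CountVectorizer do the same job):

```python
from collections import Counter

docs = [
    "John is quicker than Mary",
    "Mary is quicker than John",
]

# Build the vocabulary: every known word across the collection.
vocab = sorted({word.lower() for doc in docs for word in doc.split()})

def bag_of_words(text):
    """Return the term-count vector of `text` over `vocab`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
# Both sentences yield the identical vector, since word order is discarded.
```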
The problem with simple word counts is that highly frequent words start to dominate the document representation, yet they carry less informational content for the model than rarer but perhaps domain-specific words. One approach is to rescale the frequency of words by how often they appear across all documents, so that the scores of frequent words like "the", which are frequent across all documents, are penalized.
This approach to scoring is called Term Frequency-Inverse Document Frequency, or TF-IDF for short where:
- Term Frequency: is a scoring of the frequency of the word in the current document.
- Inverse Document Frequency: is a scoring of how rare the word is across documents.
Therefore, the IDF of a rare term is high, while the IDF of a frequent term is likely to be low.
The log is used to dampen the effect of the IDF term. Following is the code to extract TF-IDF values in Python:
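A minimal pure-Python sketch of the tf-idf computation follows; the toy corpus and the raw idf = log(N/df) variant are assumptions (real systems often use smoothed variants):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [doc.lower().split() for doc in docs]
N = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter()
for tokens in tokenized:
    df.update(set(tokens))

def tf_idf(term, tokens):
    """tf-idf = tf * log(N / df); the log dampens the IDF component."""
    tf = tokens.count(term) / len(tokens)
    idf = math.log(N / df[term])
    return tf * idf

# "the" occurs in two of the three documents, so its IDF, log(3/2), is small;
# "mat" occurs in only one, so its IDF, log(3), is larger.
scores = {term: tf_idf(term, tokenized[0]) for term in set(tokenized[0])}
```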
Queries as Vectors:
In this model, documents are points, or vectors, in a high-dimensional space. These are very sparse vectors: most entries are zero. The first key idea behind this approach is to do the same for queries and represent them as vectors in the same space. The second key idea is to rank documents according to their proximity to the query in this space: the closer two vectors are, the more similar they are. More relevant documents are thus ranked higher than less relevant ones.
To make this approach workable, we measure the proximity of two vectors by the angle between them (equivalently, by cosine similarity) rather than by Euclidean distance, because Euclidean distance is large for vectors of different lengths even when their term distributions are similar.
The figure above shows that query q has the smallest angle with d2, a larger angle with d1, and the largest angle with d3.
Variants of TF-IDF weighting:
Summary of vector space ranking:
It represents the query as a weighted TF-IDF vector and also represents each document as a weighted TF-IDF vector. It computes the cosine similarity score between the query vector and each document vector, ranks the documents with respect to the query by score, and finally returns the top 10 documents to the user.
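The whole ranking step can be sketched as follows; the vectors are made-up tf-idf weights over a hypothetical shared vocabulary, chosen to mirror the ordering in the figure:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical tf-idf document vectors over a shared vocabulary.
docs = {
    "d1": [0.0, 0.9, 0.1],
    "d2": [0.5, 0.5, 0.0],
    "d3": [0.9, 0.0, 0.4],
}
query = [0.6, 0.6, 0.0]

# Rank documents by cosine similarity to the query, highest first.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
```

Because cosine similarity normalizes by vector length, a long document is not favored over a short one merely for containing more words.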