Information retrieval is finding material, usually documents of an unstructured nature, that satisfies an information need from within large collections, usually stored on computers. Information retrieval takes many forms, such as email search, searching files on a laptop, corporate knowledge bases, and legal information retrieval.
Basic Assumption of Information Retrieval:
The collection is a set of documents. This collection is a static collection most of the time.
The primary goal of information retrieval is to retrieve documents with information that is relevant to the user’s information need and helps the user complete a task.
Document term matrix:
A document term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. Rows correspond to documents in the collection and columns correspond to terms. There are different schemes for determining the value that each cell in the matrix should take. Document term matrices are widely used in the field of natural language processing.
When it records only the presence or absence of each term, the document term matrix is a binary matrix, and it is typically a sparse matrix.
For the query tea AND me, the following operations are performed:
Take the bit representations: tea = 110, me = 011.
Perform a bitwise AND operation, which results in 010.
This indicates that only the second document contains both tea and me.
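The operations above can be sketched in Python. The three toy documents below are hypothetical, chosen so that the incidence bits match the example (tea = 110, me = 011):

```python
# Toy collection: doc 0 contains "tea", doc 1 contains both, doc 2 contains "me".
docs = [
    "tea party",     # doc 0 (hypothetical contents)
    "tea with me",   # doc 1
    "call me",       # doc 2
]

def incidence_vector(term, docs):
    """Return the term's row of the binary term-document incidence matrix."""
    return [1 if term in doc.split() else 0 for doc in docs]

tea = incidence_vector("tea", docs)  # 110 as a list of bits
me = incidence_vector("me", docs)    # 011 as a list of bits

# Boolean query "tea AND me": bitwise AND of the two incidence rows.
both = [a & b for a, b in zip(tea, me)]
print(both)  # -> [0, 1, 0]: only the second document matches
```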
The inefficiency of a binary TDM:
For any realistically sized corpus, the TDM will be too large to store.
For example, 1M documents and 500K unique terms lead to 500 billion matrix cells.
However, the matrix is extremely sparse: most documents contain only a handful of terms, and the majority of terms occur in only a handful of documents.
A better representation is therefore to record only the positions of the 1s, and this is exactly what an inverted index achieves.
In computer science, an inverted index is an index data structure that stores a mapping from content, such as words or numbers, to its locations in a database file, a document, or a set of documents. The purpose of an inverted index is to permit fast full-text searches, at the cost of increased processing when a document is added to the database. The inverted index may be the database file itself, rather than its index. It is the most prominent data structure used in large-scale document retrieval systems, for example, web search engines.
How does it work?
For each term t, we must store a list of all documents that contain t, identifying each document by an id or document serial number.
We cannot use a fixed-size array for this, because postings lists grow as documents are added; we need variable-size postings lists. On disk, a contiguous run of postings is normal and best; in memory, we can use linked lists or variable-length arrays.
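A simple in-memory inverted index along these lines can be sketched in Python; the example documents are hypothetical:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted postings list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sorted lists model the contiguous on-disk postings representation.
    return {term: sorted(ids) for term, ids in index.items()}

docs = ["new home sales top forecasts",
        "home sales rise in july",
        "increase in home sales in july"]
index = build_inverted_index(docs)
print(index["home"])  # -> [0, 1, 2]
print(index["july"])  # -> [1, 2]
```

Intersecting two sorted postings lists then answers a Boolean AND query without ever materializing the full matrix.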
Bag of Words:
A bag of words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms. The approach is straightforward to use for extracting features from a document. A bag of words is a representation of a text that describes the occurrence of words within a document. It involves two things:
- A vocabulary of all the known words.
- A measure of the occurrence of those known words.
It is known as a bag of words because any information about the order or structure of the words in the document is discarded. The model is only concerned with whether a known word occurs in the document, not where it occurs.
This vector representation doesn’t consider the ordering of words in a document. For example, the two sentences “John is quicker than Mary” and “Mary is quicker than John” have the same vector, which can cause ambiguity.
Following is the code to make a bag of words in Python.
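The original code listing is not reproduced here; the following is a minimal pure-Python sketch of a bag-of-words representation (the helper name `bag_of_words` is hypothetical), using the John/Mary sentences from above:

```python
def bag_of_words(documents):
    """Build a sorted vocabulary and per-document word-count vectors."""
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    vectors = []
    for doc in documents:
        counts = {}
        for w in doc.lower().split():
            counts[w] = counts.get(w, 0) + 1
        vectors.append([counts.get(w, 0) for w in vocab])
    return vocab, vectors

docs = ["John is quicker than Mary", "Mary is quicker than John"]
vocab, vectors = bag_of_words(docs)
print(vocab)                     # -> ['is', 'john', 'mary', 'quicker', 'than']
print(vectors[0] == vectors[1])  # -> True: word order is discarded
```

Because word order is thrown away, the two sentences collapse to the same vector, which is exactly the ambiguity mentioned above.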
The problem with a simple count vectorizer is that highly frequent words start to dominate the document representation, even though they carry less informational content for the model than rarer but perhaps domain-specific words. One approach is to rescale the counts of words by how many documents they appear in, so that the scores for frequent words like “the”, which are frequent across all documents, are penalized.
This approach to scoring is called Term Frequency-Inverse Document Frequency, or TF-IDF for short, where:
- Term Frequency: a score for how frequently the word occurs in the current document.
- Inverse Document Frequency: a score for how rare the word is across all documents.
Therefore, the IDF of a rare term is high, while the IDF of a frequent term is likely to be low.
The log is used to dampen the effect of IDF: idf_t = log(N / df_t), where N is the total number of documents and df_t is the number of documents containing term t. Following is the code to extract TF-IDF values in Python:
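The original listing is not reproduced here; the following is a minimal pure-Python sketch of TF-IDF scoring (the helper `tf_idf` and the toy documents are hypothetical), using idf_t = log10(N / df_t) as defined above:

```python
import math

def tf_idf(documents):
    """Compute per-document TF-IDF weights: tf * log10(N / df)."""
    n = len(documents)
    tokenized = [doc.lower().split() for doc in documents]
    vocab = sorted({w for toks in tokenized for w in toks})
    # Document frequency: in how many documents each term occurs.
    df = {t: sum(1 for toks in tokenized if t in toks) for t in vocab}
    weights = []
    for toks in tokenized:
        tf = {t: toks.count(t) for t in set(toks)}
        weights.append({t: tf[t] * math.log10(n / df[t]) for t in tf})
    return weights

docs = ["the cat sat", "the dog sat", "the dog barked loudly"]
w = tf_idf(docs)
# "the" occurs in every document, so its idf is log10(3/3) = 0.
print(w[0]["the"])           # -> 0.0
# "cat" occurs in only one document, so it gets a high weight.
print(round(w[0]["cat"], 3))
```

Note how the ubiquitous word “the” is weighted down to zero while the rare word “cat” keeps a high score, which is exactly the penalization described above.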
Queries as a Vector:
In this concept, documents are points, or vectors, in this space. These are very sparse vectors; most entries are zero. The first key idea behind this approach is to represent the queries in the same way, as vectors in the same space. The second key idea is to rank documents according to their proximity to the query in this space, where the proximity of two vectors indicates their similarity. More relevant documents are thus ranked higher than less relevant ones.
To make this approach workable, we measure the proximity of two vectors by the angle between them rather than by Euclidean distance, because Euclidean distance is large for vectors of different lengths even when their term distributions are similar.
The above figure shows that the query q has a smaller angle with d2 than with d1, and the largest angle with d3.
Variants of TF-IDF weighting:
Summary of vector space ranking:
It represents the query as a weighted TF-IDF vector and also represents each document as a weighted TF-IDF vector. It computes the cosine similarity score between the query vector and each document vector, then ranks the documents with respect to the query by score and, finally, returns the top 10 documents to the user.
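The ranking steps above can be sketched as follows; the vectors are hypothetical toy TF-IDF weights over a three-term vocabulary, chosen so the ranking matches the figure (d2 first, then d1, then d3):

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product of the vectors over the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_vec, doc_vecs, k=10):
    """Return (doc_id, score) pairs sorted by cosine similarity to the query."""
    scores = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:k]

# Hypothetical TF-IDF weight vectors for three documents.
docs = [[0.0, 2.0, 1.0],   # d1
        [1.5, 1.0, 0.0],   # d2
        [0.5, 0.0, 2.0]]   # d3
query = [1.0, 1.0, 0.0]
print(rank(query, docs))   # d2 ranks first, then d1, then d3
```

Because cosine similarity depends only on the angle between vectors, a long document and a short query can still score highly when their term distributions point in the same direction.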