The principle of maximum entropy states that the probability distribution which best represents the current state of knowledge is the one with the largest entropy, in the context of precisely stated prior data. Unlike the naive Bayes method, maximum entropy is used to guarantee the uniqueness and consistency of probability assignments obtained by different methods, statistical mechanics and logical inference in particular.

A maximum entropy probability distribution has entropy at least as great as that of all other members of a specified class of probability distributions. According to the principle of maximum entropy, if nothing is known about a distribution except that it belongs to a particular class, then the distribution with the largest entropy should be chosen, as the least informative default.

- Maximum entropy methods are very general ways to predict probability distributions given constraints on their moments.
- They can predict relative abundance distributions based on the number of individuals, species, and total energy.
- Uniformity means high entropy – we can search for distributions that have properties we desire but also have high entropy.

The uncertainty of a distribution is called its entropy. The surprise of a single outcome *x* is quantified by the formula:

*log ( 1 / p_{x} )*

and the entropy of the distribution is the expected surprise over all outcomes, *H = Σ_{x} p_{x} log ( 1 / p_{x} )*.
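As a concrete check, entropy can be computed directly from a distribution. A minimal sketch in Python (the two distributions below are made-up examples):

```python
import math

def entropy(p):
    """Shannon entropy H(p) = sum_x p_x * log2(1 / p_x), in bits."""
    return sum(px * math.log2(1.0 / px) for px in p if px > 0)

# A uniform distribution is maximally uncertain...
uniform = [0.25, 0.25, 0.25, 0.25]
# ...while a peaked distribution is more predictable, hence lower entropy.
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform))  # 2.0 bits
print(entropy(skewed))   # about 1.36 bits
```

This matches the intuition above: uniformity means high entropy.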

**Why Use Maxent:**

- Presence-only data
- Explore complex relationships with the environment.
- Unbiased predictions based on constraints

**Why not use Maxent:**

- The algorithm is somewhat of a black box and allows only limited customization.
- Statistical properties are not well understood.

**Feature overlapping:**

The Maxent model handles overlapping features well. Unlike the Naive Bayes model, there is no double-counting of probability, because the feature weights are trained jointly rather than estimated independently. The model does not assume the features are independent; instead, it weighs the evidence of the current word, the previous word, and the next word together. To make it more robust, further features are added, such as part-of-speech tags and the signatures of the current, previous, and next words.
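A tiny illustration of this joint weighting, as a sketch of maxent (logistic-regression) training by stochastic gradient ascent; the data and features here are invented for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical data: each example is ([bias, feat1, feat2], label).
# feat1 and feat2 overlap (they often fire together), which naive Bayes
# would double-count; maxent adjusts their weights jointly instead.
data = [
    ([1, 1, 1], 1),
    ([1, 1, 0], 1),
    ([1, 1, 1], 1),
    ([1, 0, 0], 0),
    ([1, 0, 1], 0),
    ([1, 1, 0], 0),
]

w = [0.0, 0.0, 0.0]
lr = 0.5
for _ in range(500):
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        # Gradient of the log-likelihood for one example: (y - p) * x
        for i in range(len(w)):
            w[i] += lr * (y - p) * x[i]

def predict(x):
    """Model probability that the label is 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
```

Because the two overlapping features are fit together, the trained model assigns high probability only where the training evidence jointly supports it, rather than multiplying two independent estimates.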

The features interact in the following ways:

Figure 1

With no constraints, the distribution is uniform, and its probabilities always sum to 1.

**Feature Interaction:**

- For log-linear/logistic regression models in statistics, it is standard to do a greedy stepwise search over the space of all possible interaction terms.
- This combinatorial space is exponential in size, but that's okay as most statistics models only have 4-8 features.
- In NLP, our models commonly use hundreds of thousands of features, so that's not okay.
- Commonly, interaction terms are added by hand based on linguistic intuition.

**How to avoid Over-fitting of a Model:**

**Issues of Scale:**

- Lots of features: NLP models can have well over a million features. Even storing a single array of parameter values can have a substantial memory cost.
- Lots of sparsity: the model overfits easily, since many features seen in training will never occur again at test time.
- Optimization problem: Feature weights can be infinite; an iterative solver can take a long time to get to those infinities.

We can solve it with the following methods:

- Early stopping
- Priors
- Regularization
- Smoothing with virtual data (pseudo-counts).

**Early Stopping:**

Suppose we have data consisting of 4 heads and 0 tails. In this case, the optimal value of lambda goes to infinity. One way to deal with this issue is to stop the optimization early, after a few iterations.

The value of lambda will then be finite, but presumably large, and the optimization won't take forever. This technique was used in the early work on Maxent.
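A minimal sketch of the coin example (the learning rate and iteration count are arbitrary choices): with 4 heads and 0 tails, gradient ascent on the log-likelihood keeps pushing lambda upward forever, and stopping after a few steps leaves it large but finite.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

heads, n = 4, 4          # observed data: 4 heads, 0 tails
lam = 0.0                # weight for the single "heads" feature
lr = 1.0
for step in range(10):   # early stopping: quit after 10 iterations
    p = sigmoid(lam)               # model probability of heads
    grad = (heads - n * p) / n     # gradient of the average log-likelihood
    lam += lr * grad

# lam is sizable but finite; running forever would push it to infinity.
print(lam)
```

Each step adds a shrinking but always-positive amount, so lambda grows without bound unless we cut the loop off.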

**Prior:**

What if we had a prior expectation that parameter values would not be very large? We could then balance the evidence for large parameters against our prior. The evidence would never totally defeat the prior, and the parameters would be smoothed. We can do this explicitly by changing the optimization objective to the maximum posterior likelihood.

We can use Gaussian (quadratic, L2) priors. The intuition is that parameters shouldn't be large, encoded as a prior expectation that each parameter is distributed according to a Gaussian with mean μ and variance σ².
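Continuing the coin sketch under a Gaussian prior with mean 0 and variance σ² = 1 (hypothetical values): the L2 penalty pulls the weight back toward zero, so even running the optimizer to convergence leaves it finite.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

heads, n = 4, 4          # same data: 4 heads, 0 tails
sigma2 = 1.0             # prior variance; smaller values smooth harder
lam = 0.0
lr = 0.1
for _ in range(5000):    # run essentially to convergence
    p = sigmoid(lam)
    # Gradient of log-likelihood minus the Gaussian penalty lam^2 / (2 * sigma2)
    grad = (heads - n * p) - lam / sigma2
    lam += lr * grad

# The evidence never totally defeats the prior: lam settles at a finite value.
print(lam)
```

The fixed point balances the data's pull upward against the prior's pull toward zero, which is exactly the smoothing described above.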

Figure 2

**POS Tagging**

A part-of-speech tagger is a software tool that labels the words in a text as one of several categories in order to identify each word's function in a given language. In English, words fall into one of eight or nine parts of speech. POS taggers use algorithms to label the terms in text bodies.

These taggers often use categories finer than the basic parts of speech, with tags such as "noun-plural" or even more complex labels. Part-of-speech taggers categorize terms by their relational position in a phrase, their relationship with nearby terms, and the word's definition. POS taggers fall into two broad groups: those that use stochastic (probability-based) methods and those that are rule-based.

**Use Cases:**

- It is used for sentiment analysis. Sentiment analysis can be done without POS tags, but they make the analysis more robust. For example, consider the sentences "I like you" and "I am like you." If these sentences are analyzed without POS tags, the results will be unreliable.
- POS tags are used for grammatical correction within a sentence; when we write a grammatically wrong sentence, Microsoft Word flags it as a grammatical error.
- POS tags are used for prediction or completion of a sentence. When we write an incomplete sentence in the Google search bar, it makes a prediction and completes the sentence based on the previous words.
- Relation extraction also uses POS tags to identify the relation between two entities.

Figure 3

The code above (Figure 3) extracts the parts of speech from a sentence in Python. These parts of speech are then used to build a hierarchical structure for the sentence, which can later be used to extract relations.
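The Figure 3 code itself is not reproduced here; as a toy stand-in, the sketch below uses a hand-built lexicon and a simple noun-phrase chunker (the function names and the tiny lexicon are invented for illustration) to show the same pipeline of tagging followed by hierarchical grouping:

```python
# Hypothetical mini-lexicon: word -> POS tag (DT determiner, JJ adjective,
# NN noun, VB verb, IN preposition).
LEXICON = {"the": "DT", "quick": "JJ", "brown": "JJ", "fox": "NN",
           "jumps": "VB", "over": "IN", "lazy": "JJ", "dog": "NN"}

def tag(sentence):
    """Tag each token by lexicon lookup, defaulting to NN for unknowns."""
    return [(w, LEXICON.get(w, "NN")) for w in sentence.lower().split()]

def chunk_noun_phrases(tagged):
    """Group DT JJ* NN runs into noun-phrase chunks."""
    chunks, current = [], []
    for word, t in tagged:
        if t in ("DT", "JJ", "NN"):
            current.append(word)
            if t == "NN":           # a noun closes the current chunk
                chunks.append(" ".join(current))
                current = []
        else:                        # a verb or preposition breaks the chunk
            current = []
    return chunks

tagged = tag("The quick brown fox jumps over the lazy dog")
print(chunk_noun_phrases(tagged))
```

In a real system the lexicon lookup would be replaced by a trained tagger, but the tag-then-chunk structure is the same.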

**How difficult is POS tagging:**

About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech. However, the ambiguous types tend to be very common words like that, this, etc., so they account for a much larger share of tokens.

**Source of Information:**

Following are the primary sources of information for POS tagging:

- Knowledge of neighboring words.
- Knowledge of word probabilities.
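Both information sources can be encoded as maxent features. A hypothetical feature extractor (all feature names here are invented) in the style of an MEMM tagger:

```python
def tagger_features(words, i):
    """Feature set for the word at position i of a tokenized sentence."""
    return {
        "word=" + words[i],                                          # word identity
        "prev=" + (words[i - 1] if i > 0 else "<S>"),                # left neighbor
        "next=" + (words[i + 1] if i + 1 < len(words) else "</S>"),  # right neighbor
        "suffix3=" + words[i][-3:],                                  # word-shape signature
    }

feats = tagger_features(["the", "dog", "runs"], 1)
print(sorted(feats))
```

Neighboring-word features capture the first source of information, while the word-identity and suffix features let the model learn per-word tag probabilities.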

Empirically, POS taggers that use an MEMM with bidirectional dependencies achieve much higher accuracy than a plain Maxent model.

**Tagging without Sequence Information:**

Figure 4