How Does Gradient Boosting Work?
We start with a weak learner and, at each step, add another weak learner to improve performance and build a strong learner. Each addition reduces the value of the loss function. We fit the models iteratively and compute the loss at each step. The loss represents the error residuals, and the predictions are updated using this loss value to minimize the residuals.
Let's break it down step by step:
In the first iteration, we take a simple model and try to fit the complete data. The error residuals that remain are then reduced by adding more weak learners. Each new weak learner concentrates on the areas where the existing learners are performing poorly.
After three iterations, the model fits the data much better. This iterative process continues until the residual is close to zero or a set number of iterations is reached.
After 20 iterations, the model fits the data exactly and the residual error drops to zero.
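The loop described above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: it assumes squared-error loss (so the negative gradient is simply the residual), one-dimensional inputs, and one-split decision stumps as the weak learners. The names `fit_stump` and `gradient_boost` and the toy sine data are made up for this example.

```python
import math


def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    # Skip the largest x so that both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_estimators=20, learning_rate=0.5):
    """Return a predict function and the training MSE after each round."""
    base = sum(ys) / len(ys)          # initial model: the mean of the targets
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_estimators):
        # For squared error, the residual is the negative gradient of the loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add a shrunken copy of the new weak learner to the ensemble.
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


# Toy data (an assumption for illustration): a sine curve sampled at 20 points.
xs = [0.5 * i for i in range(20)]
ys = [math.sin(x) for x in xs]
predict, history = gradient_boost(xs, ys)
```

Printing `history` shows the training error shrinking from one round to the next, which is the behaviour the iterations above describe.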
Optimizing Gradient Boost:
We can tune the following hyper-parameters to optimize gradient boosting:
- n_estimators: Controls the number of weak learners.
- learning_rate: Controls the contribution of each weak learner to the final combination.
- max_depth: Limits the maximum depth of each tree, and therefore the number of nodes in it.
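One way to see how these parameters interact is the additive update that gradient boosting is usually written as. The notation is introduced here only for illustration and is not from the text above: $F_m$ is the ensemble after $m$ rounds, $h_m$ is the $m$-th weak learner, $\nu$ plays the role of learning_rate, and $M$ plays the role of the number of weak learners.

```latex
F_0(x) = \arg\min_{c} \sum_{i=1}^{n} L(y_i, c), \qquad
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \quad m = 1, \dots, M
```

A smaller $\nu$ means each tree contributes less, so more rounds are typically needed, while max_depth bounds how complex each individual $h_m$ can be.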
XGBoost:
Extreme gradient boosting (XGBoost) has taken data science competitions by storm. XGBoost belongs to the family of ensemble methods. It is similar to the gradient boosting algorithm, but it has a few tricks that make it stand out from the others.
Both XGBoost and gradient boosting follow the same principle. The difference is that XGBoost uses a more regularized model formalization to control overfitting, which gives it better performance.
Regularization:
Regularization keeps the model from overfitting. Its main function is to control model complexity during training.
The traditional treatment of tree learning emphasized improving impurity while paying little attention to complexity control. By defining complexity formally, we can improve the model we learn.
Working of XGBoost:
When using gradient boosting for regression, the weak learners are regression trees, and each regression tree maps an input data point to one of its leaves, which contains a continuous score. XGBoost minimizes a regularized (L1 and L2) objective function that combines a convex loss function with a penalty on model complexity. Training proceeds iteratively: each new tree predicts the residuals, or errors, of the prior trees and is then combined with them to make the final prediction. It is called gradient boosting because it uses a gradient descent algorithm to minimize the loss when adding new models.
Objective Function = training loss + Regularization.
Increasing the regularization strength makes the model more conservative.
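Written out, the objective above usually takes the following form. This is a sketch in the notation commonly used for XGBoost, not something defined in this article: $l$ is the training loss, $f_k$ is the $k$-th tree, $T$ is the number of leaves in a tree, $w_j$ is the score of leaf $j$, and $\gamma$, $\lambda$ and $\alpha$ are the regularization hyper-parameters.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad
\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2 + \alpha \sum_{j=1}^{T} |w_j|
```

The first sum is the training loss and the second is the regularization term: larger $\gamma$, $\lambda$ or $\alpha$ penalize large or numerous leaves more strongly, which is what makes the model conservative.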
Optimizing XGBoost:
We can tune XGBoost for better accuracy depending on the type of data set.
- Regularization: Adjusting the L1 and L2 hyper-parameters (alpha and lambda in XGBoost) controls overfitting. Increasing their values makes the model more conservative.
- objective: Specifies the learning task, depending on the data set: reg:linear for linear regression (named reg:squarederror in recent XGBoost releases), reg:logistic for logistic regression, multi:softmax for multi-class classification, etc.
- eval_metric: Evaluation metric for the validation data. A default metric is assigned according to the objective (rmse for regression, error for classification, mean average precision for ranking). The choices include:
  - rmse: root mean square error
  - mae: mean absolute error
  - logloss: negative log-likelihood
- gamma: Minimum loss reduction required to make a further partition on a leaf node of the tree. The larger gamma is, the more conservative the algorithm will be.
- max_depth: Maximum depth of a tree. Increasing this value makes the model more complex and more likely to overfit.
- min_child_weight: Minimum sum of instance weights needed in a child. If the tree partition step results in a leaf node whose sum of instance weights is less than min_child_weight, the building process gives up further partitioning.
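To make gamma and min_child_weight more concrete, here is a small sketch of the split test they take part in. It follows the split-gain formula commonly given for XGBoost, under the assumption that the sum of instance weights is the sum of second-order gradients (hessians) in each child. The function names `split_gain` and `should_split` and the numbers are made up for this example; this is not XGBoost's actual code.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node, minus the gamma penalty.

    g_* are sums of first-order gradients and h_* are sums of second-order
    gradients (hessians) of the instances falling into each child.
    """
    def score(g, h):
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right) - parent) - gamma


def should_split(g_left, h_left, g_right, h_right,
                 reg_lambda=1.0, gamma=0.0, min_child_weight=1.0):
    """Reject the split if a child is too light or the penalized gain is not positive."""
    if h_left < min_child_weight or h_right < min_child_weight:
        return False
    return split_gain(g_left, h_left, g_right, h_right, reg_lambda, gamma) > 0


# Toy node under squared-error loss (hessian of 1 per instance):
# three instances with gradient -1 go left, three with gradient +1 go right.
gain = split_gain(-3.0, 3.0, 3.0, 3.0)                                # 2.25
accepted = should_split(-3.0, 3.0, 3.0, 3.0)                          # True
blocked_by_gamma = should_split(-3.0, 3.0, 3.0, 3.0, gamma=3.0)       # False
blocked_by_weight = should_split(-3.0, 3.0, 3.0, 3.0, min_child_weight=4.0)  # False
```

Raising gamma or min_child_weight turns an otherwise useful split into a rejected one, which is the sense in which larger values make the algorithm more conservative.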
Features of XGBoost:
- Clever penalization of trees
- Newton boosting: Uses the Newton-Raphson method of approximation, which provides a more direct route to the minimum than gradient descent.
- Extra randomization of trees: Column subsampling
- Proportional shrinking of leaf nodes: Leaf weights that are calculated with less evidence are shrunk more heavily.
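The Newton step and the proportional shrinking of leaves can be illustrated together. The sketch below uses the closed-form leaf score usually quoted for XGBoost with L2 regularization only, w = -G / (H + lambda), where G and H are the sums of first- and second-order gradients in the leaf. The function name `leaf_weight` and the toy numbers are assumptions made for this example.

```python
def leaf_weight(gradients, hessians, reg_lambda=1.0):
    """Newton-style leaf score: -G / (H + lambda), with L2 regularization only."""
    return -sum(gradients) / (sum(hessians) + reg_lambda)


# Squared-error loss: gradient = prediction - target, hessian = 1 per instance.
# Both leaves hold instances whose target exceeds the prediction by exactly 1,
# so the unregularized optimum for each leaf would be +1.
small_leaf = leaf_weight([-1.0] * 2, [1.0] * 2)       # 2 / (2 + 1)     ~ 0.667
large_leaf = leaf_weight([-1.0] * 200, [1.0] * 200)   # 200 / (200 + 1) ~ 0.995
```

The leaf backed by only two instances is pulled much further toward zero than the leaf backed by two hundred, even though both see the same average error.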
Bagging Vs Boosting:
- There is no clear winner; the choice depends on the data and on how much data is available.
- Boosting takes more computation power than bagging.
- Both techniques combine different models to reduce variance, which decreases the chance of overfitting.
- A model that combines different models is more reliable than a model that consists of only one algorithm.
- Boosting can reduce bias while training the model, whereas bagging can't.
- If we want to minimize variance and bias is not high, then bagging performs much better than boosting.
Which one is better?
Both bagging and boosting fall under the umbrella of techniques known as ensemble methods.
Bagging trains multiple classifiers on different subsets of the samples and lets these classifiers vote on a final decision, in contrast to using just one classifier.
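A minimal sketch of bagging as just described, assuming bootstrap sampling (drawing with replacement) to create the subsets, one-dimensional inputs, binary labels, and threshold stumps as the base classifiers. The names `fit_stump` and `bagging` and the toy data are hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def fit_stump(points):
    """Pick the threshold and label orientation with the fewest training errors."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        for low_label, high_label in ((0, 1), (1, 0)):
            errors = sum((low_label if x <= threshold else high_label) != y
                         for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, threshold, low_label, high_label)
    _, threshold, low_label, high_label = best
    return lambda x: low_label if x <= threshold else high_label


def bagging(points, n_models=25, seed=0):
    """Train each stump on its own bootstrap sample and combine them by majority vote."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(points) for _ in points]  # draw with replacement
        models.append(fit_stump(sample))

    def predict(x):
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Toy data: label 0 for x in 0..9 and label 1 for x in 10..19.
points = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
predict = bagging(points)
```

Each stump sees a slightly different sample, so individual thresholds vary, but the vote averages those differences out.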
Boosting trains a series of classifiers on the data set, gradually putting more emphasis on the training examples that the previous classifiers got wrong, so that later classifiers focus on these harder examples. In the end, you have a series of classifiers that are balanced in general but slightly more focused on the hard training examples.
So in practice boosting often beats bagging, and both boosting and bagging usually beat a single plain classifier.
Ada Boost Vs. Gradient Boost:
AdaBoost | Gradient Boost |
AdaBoost uses a weak base learner and tries to boost its performance by iteratively shifting the focus toward problematic observations. | Gradient boost uses a weak base learner and tries to boost its performance by iteratively shifting the focus toward problematic observations. |
The shift is done by up-weighting observations that were misclassified before. | Difficult observations are identified by the large residuals computed in the previous iterations. |
High-weight data points identify shortcomings. | Gradients identify shortcomings. |
The exponential loss of AdaBoost gives more weight to the samples that were fitted worse. | Gradient boost dissects the error into components to bring in more explanation. |
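The up-weighting in the AdaBoost column can be sketched as one reweighting round. This follows the standard discrete AdaBoost update and assumes labels and predictions in {-1, +1}, weights that already sum to 1, and a weighted error strictly between 0 and 1. The function name `adaboost_round` and the data are made up for this example.

```python
import math


def adaboost_round(weights, labels, predictions):
    """One AdaBoost reweighting step for labels and predictions in {-1, +1}."""
    # Weighted error of the current weak learner.
    error = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    # The learner's vote grows as its weighted error shrinks.
    alpha = 0.5 * math.log((1 - error) / error)
    # Misclassified points (y * p == -1) are scaled up, correct ones scaled down.
    updated = [w * math.exp(-alpha * y * p)
               for w, y, p in zip(weights, labels, predictions)]
    total = sum(updated)
    return alpha, [w / total for w in updated]


labels = [1, 1, -1, -1, 1]
predictions = [1, 1, -1, -1, -1]   # only the last observation is misclassified
weights = [0.2] * 5
alpha, new_weights = adaboost_round(weights, labels, predictions)
```

After this round the single misclassified observation carries half of the total weight, so the next weak learner is pushed to get it right. Gradient boosting reaches a similar effect without instance weights by fitting the next learner to the residuals, or more generally the negative gradients, of the current predictions.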