XGBoost Detailed Explanation

Tavish Aggarwal

February 7, 2022

In the previous post, Boosting Algorithms explained in detail, we discussed boosting algorithms and how they work.

In this post, let's talk about one more popular algorithm in boosting category: XGBoost.

XGBoost

XGBoost (Extreme Gradient Boosting) is one of the most popular Gradient Boosting algorithms. It is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework.

NOTE: The Gradient Boosting tree algorithm is the same as the Gradient Boosting algorithm, except that the model we fit at each iteration is a decision tree.

The XGBoost algorithm is a modification of the Gradient Boosting tree algorithm and involves Gradient Boosting on shallow decision trees. Therefore, let's summarize the Gradient Boosting tree algorithm first before understanding XGBoost.

  1. Initialize a crude initial function \(F_0\)
  2. For t = 0 to T-1 (where T is the number of iterations or models)
    • Compute the loss function \(L(y_i, F_t(x_i))\)
    • Compute the new target values, the negative gradients \(- \frac{\partial L(y,F_t)}{\partial F_t}\) for all data points
    • Fit a shallow decision tree on the above data points to get \(J_{t}\) terminal nodes represented as \(R_{j}\), j = 1 to \(J_{t}\)
    • For each terminal node j = 1 to \(J_{t}\)
      • Find the \(\alpha_j\) that minimizes \(\sum_{x_i \in R_j} L(y_i, F_t(x_i) + \alpha_j)\)
      • All these \(\alpha_j\) constitute the incremental tree \(h_{t + 1}\).
    • Perform \(F_{t+1} = F_t + \lambda_th_{t+1}\)
  3. The final model is \(F_T\)

NOTE: The Gradient Boosting tree algorithm is almost the same as the Gradient Boosting algorithm, with a slight modification: the model we fit at each iteration is a decision tree.

The \(\alpha_j\) are the values of the terminal nodes/leaves. The stopping criterion here is that the gradients are close to zero (or exactly zero).
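To make the steps above concrete, here is a minimal sketch of Gradient Boosting on shallow trees, assuming squared-error loss (so the negative gradient is simply the residual \(y_i - F_t(x_i)\)) and using scikit-learn's DecisionTreeRegressor for the incremental trees. The function names and the constant learning rate are illustrative choices, not part of any library API.

# A minimal sketch of Gradient Boosting on shallow regression trees,
# assuming squared-error loss so the negative gradients are the residuals.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosted_trees(X, y, T=100, learning_rate=0.1, max_depth=2):
    # Step 1: crude initial function F_0 -- here, the mean of the targets
    f0 = y.mean()
    F = np.full(len(y), f0, dtype=float)
    trees = []
    for t in range(T):
        # Negative gradients of the squared-error loss are the residuals
        residuals = y - F
        # Fit a shallow decision tree h_{t+1} on the residuals; its leaves
        # hold the mean residual of each terminal node (the alpha_j values)
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # F_{t+1} = F_t + lambda * h_{t+1}
        F += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_gradient_boosted_trees(f0, trees, X, learning_rate=0.1):
    # The final model F_T is the initial guess plus the scaled sum of all trees
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)

For squared-error loss, the leaf values fitted by the regression tree already minimize the per-leaf objective, so the separate \(\alpha_j\) search collapses into the tree-fitting step; for other losses (for example, log loss) that step would need to be carried out explicitly.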

XGBoost uses a regularised model formulation to control overfitting, which gives it better performance. It is also known as a regularised boosting technique.

As with Ridge and Lasso regression, where the objective function is the sum of the loss function and a regularization term, the XGBoost objective function is the sum of the loss function and a regularization term over all predictors/trees.

$$\text{Objective Function : Training Loss + Regularization}$$

The major difference between Gradient Boosting and XGBoost is that XGBoost incorporates a regularisation term in its objective function to control over-fitting.

$$\text{Objective Function} = \sum^n_{i=1}L(y_i, F_t(x_i)) + \sum^T_{t=1} \Omega(h_t) $$
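A common choice, and the one described in the XGBoost paper, is to penalise both the number of leaves and the magnitude of the leaf weights of each tree. Writing \(J\) for the number of leaves of a tree \(h\) and \(w_j\) for its leaf values, the regularization term takes the form:

$$\Omega(h) = \gamma J + \frac{1}{2}\lambda \sum_{j=1}^{J} w_j^2$$

Here \(\gamma\) and \(\lambda\) are hyperparameters that control how strongly tree complexity is penalised.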

Since it's not easy to learn all the trees at once, instead of optimizing the objective function over all trees simultaneously, we use Additive Training.

The Additive strategy is where we fix what we have learned and add one tree at a time. For example:

$$F_0(x_i) = 0; \\
F_1(x_i) = F_0(x_i) + h_1(x_i); \\
F_2(x_i) = F_1(x_i) + h_2(x_i)$$

And so on.

Here, \(F_t(x_i)\) represents the prediction for the \(i^{th}\) instance at the \(t^{th}\) iteration, and we greedily add \(h_t\) to our model to minimize the objective function.
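Equivalently, since \(F_0\) starts at zero, the model after t rounds is just the sum of all trees added so far:

$$F_t(x_i) = \sum_{k=1}^{t} h_k(x_i) = F_{t-1}(x_i) + h_t(x_i)$$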

\(F_t(x_i)\) is what we need to calculate at round t, and since \(F_{t-1}\) is already fixed, this amounts to learning the new tree \(h_t\).

$$\text{Objective Function}^{(t)} = \sum^n_{i=1}L(y_i, F_{t-1}(x_i) + h_t(x_i)) + \Omega(h_t) + \text{constant}$$

The regularization terms of the trees fixed in earlier rounds are constants at round t, so only \(\Omega(h_t)\) matters for the optimization.

The objective function shown above can be approximated using a second-order Taylor series expansion.
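Expanding the loss to second order around the current prediction \(F_{t-1}(x_i)\) and dropping constant terms gives the approximation XGBoost works with:

$$\text{Objective Function}^{(t)} \approx \sum^n_{i=1}\Big[g_i\, h_t(x_i) + \frac{1}{2}\, k_i\, h_t^2(x_i)\Big] + \Omega(h_t)$$

where \(g_i = \frac{\partial L(y_i, F_{t-1}(x_i))}{\partial F_{t-1}(x_i)}\) is the first derivative (gradient) of the loss and \(k_i\) is the corresponding second derivative (written \(k_i\) here to avoid a clash with the tree \(h_t\); it is usually denoted \(h_i\), the Hessian). This approximation is what makes it possible to solve for the optimal leaf values of \(h_t\) in closed form.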

Let's see XGBoost in action on the Iris dataset using its scikit-learn-compatible API:

# Importing packages
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Loading the Iris dataset
data = load_iris()

X = data.data
y = data.target

# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Training the model
xgbClassifier = XGBClassifier()
xgbClassifier.fit(X_train, y_train)

# Predicting on the test set
predictions = xgbClassifier.predict(X_test)

# Mean accuracy on the test set
print(xgbClassifier.score(X_test, y_test))

Advantages of XGBoost algorithm

  • Parallel Computing: When we run xgboost, by default it uses all the cores of your laptop/machine, enabling parallel computation. The parallelization happens within the construction of a single tree: the search for the best split at each node is carried out in parallel across features.
  • Regularization: The biggest advantage of xgboost is that it uses regularization to control the overfitting and complexity of the model, which gives it better performance.
  • Enabled Cross Validation: XGBoost comes with a built-in cross-validation function, so the number of boosting rounds can be tuned without external tooling (see the sketch after this list).
  • Missing Values: XGBoost is designed to handle missing values internally. Missing values are treated in such a manner that if there is any trend in them, it is captured by the model.
  • Flexibility: XGBoost is not just limited to regression, classification, and ranking problems; it supports user-defined objective functions as well. Furthermore, it supports user-defined evaluation metrics.
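As a sketch of the built-in cross-validation mentioned above, the native API exposes an xgb.cv helper; the parameter values below (tree depth, learning rate, number of rounds) are illustrative choices rather than recommendations.

# Cross-validation with XGBoost's native API on the Iris dataset
import xgboost as xgb
from sklearn.datasets import load_iris

data = load_iris()

# DMatrix is XGBoost's internal data structure
dtrain = xgb.DMatrix(data.data, label=data.target)

# Illustrative parameters for a multiclass problem
params = {
    "objective": "multi:softmax",
    "num_class": 3,
    "max_depth": 3,
    "eta": 0.1,
}

# 5-fold cross-validation, reporting the multiclass log-loss per boosting round
cv_results = xgb.cv(params, dtrain, num_boost_round=50, nfold=5,
                    metrics="mlogloss", seed=42)
print(cv_results.tail())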

Summary

In this post, we learned about one of the most popular Gradient Boosting algorithms, i.e., XGBoost, and its advantages, such as:

  • Parallel Computing
  • Regularization
  • Enabled Cross Validation
  • Missing Values
  • Flexibility

There are a few other boosting algorithms as well, like CatBoost and LightGBM, which we will cover in upcoming posts.

Author Info

Tavish Aggarwal

Website: http://tavishaggarwal.com

Living in Hyderabad and working as a research-based Data Scientist specializing in improving major key business performance indicators in the areas of sales, marketing, logistics, and plant production. He is an innovative team leader with out-of-the-box data wrangling capabilities such as outlier treatment, data discovery, and data transformation, with a focus on yielding high-quality results.