Boosting Algorithms explained in detail

Tavish Aggarwal

December 19, 2023

In the post Random Forests Explained in detail, we discussed Random Forest, which uses the technique of bagging to create an ensemble of decision trees. In this post, we will discuss boosting techniques and look at two popular algorithms:

  • Adaptive Boosting or AdaBoost
  • Gradient Boosting

Both algorithms use boosting techniques to create an ensemble.

Introduction to Boosting

The basic idea of boosting is to combine many weak learners into a strong learner, where a weak learner refers to a model that is only slightly better than a model that predicts at random.

In simple terms, it refers to the technique where we train a series of individual models sequentially, and each model learns from the mistakes made by the previous model.

The basic idea of the Boosting algorithms is to:

  1. Create an ensemble that makes high errors only on the less frequent data points and low errors on the more frequent data points.
  2. Build a series of models (which use weak learning algorithms) specifically targeted at the data points that have been incorrectly predicted by the other models in the ensemble. In this way, the series of models keeps reducing the average error, and we end up with an extremely high-accuracy ensemble.

Boosting is thus a way of generating a strong learner from a weak learning algorithm.

NOTE: SVM, regression, and other such techniques are learning algorithms that are used to create models.

Boosting algorithms can be used for both classification and regression problems.

Weak Learning Algorithms

A weak learning algorithm produces a model that does only marginally better than a random guess. Any model that has, say, a 60-70% chance of being correct (against the 50% of a random guess in binary classification) can be considered a model generated from a weak learning algorithm.

Consider a hypothetical example where I am using the decision tree algorithm to generate a model. With a tree of depth 6, I would expect to achieve an accuracy of 80% or above.

However, because of computational limitations, I am only able to train a tree of depth 2, with which I achieve around 70% accuracy. In other words, I have regularized my tree to depth 2 to control the computational resources.

In this case, because the depth is restricted to 2, my algorithm behaves like a weak learning algorithm: it does better than a random guess, but the model it produces is not ready to use in production on its own.
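The following is a minimal sketch (not from the original post) of such a weak learner: a depth-limited decision tree trained with scikit-learn on an illustrative synthetic dataset. The exact accuracies will vary with the data, but the depth-2 tree generally behaves like the weak learner described above.

    # A minimal sketch of a weak learner: a decision tree restricted to depth 2.
    # The dataset here is synthetic and only for illustration.
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    # A deep tree (depth 6) versus a shallow, regularized tree (depth 2)
    deep_tree = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X_train, y_train)
    weak_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)

    print(deep_tree.score(X_test, y_test))   # typically the higher of the two
    print(weak_tree.score(X_test, y_test))   # better than random, but weaker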

Our goal here is to use such weak learning algorithms to generate a model that can be used in production. Let's look at one such boosting algorithm, called the AdaBoost algorithm.

AdaBoost Algorithm

AdaBoost stands for Adaptive Boosting. Here we generate an ensemble of models in such a way that some data points are correctly predicted by model \(M_1\), some by model \(M_2\), and so on. Let's understand the AdaBoost algorithm in a classification setting.

Here the algorithm \(A\) produces a model \(M\) given the training set \(T\) and the distribution \(D\) over the data points:

$$A(T,D) = M$$

Initially, the distribution will be uniform over all of the data points, i.e. all data points will have the same probability. But as we go along, the probability of the data points incorrectly predicted by model \(M_1\) will be pushed up in the distribution.

In mathematical terms, the AdaBoost algorithm boosts the probability of points that are incorrectly classified and suppresses the probability of points that are correctly classified.

NOTE: The objective of doing this is to make sure that the model pays much more attention to data points that are not correctly classified.

For every new model, the distribution of the data points (i.e. the weight assigned to each data point) changes, and an objective function is defined which needs to be minimized:

$$\text{Objective Function} = \min_h E_D[L(h(x_i),y_i)] = \min_h \sum^n_{i=1} p_D(x_i) \cdot L(h(x_i),y_i)$$

where \(p_D(x_i)\) represents the distribution for the current iteration. The objective function ensures that the loss is low for the data points that have a high probability under the distribution.
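As a minimal sketch (with an illustrative synthetic dataset, not from the original post), this objective amounts to fitting a weak learner with per-sample weights and measuring its weighted 0-1 loss:

    # A minimal sketch: fit a weak learner under a distribution D (sample weights)
    # and compute the weighted 0-1 loss, i.e. the objective function above.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=200, random_state=0)

    # Initial distribution D: uniform weights 1/n over the data points
    p_D = np.full(len(y), 1.0 / len(y))

    # A(T, D) = M: the algorithm receives the training set and the distribution
    weak_learner = DecisionTreeClassifier(max_depth=1, random_state=0)
    weak_learner.fit(X, y, sample_weight=p_D)

    # Weighted 0-1 loss: sum of p_D(x_i) over the misclassified points
    misclassified = weak_learner.predict(X) != y
    print(np.sum(p_D[misclassified]))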

Now that we have an Objective function defined, we need to answer two questions:

  • How to change the distribution to generate the next model in the ensemble.
  • How to calculate the weight given to each model to get the final ensemble.

Create a new distribution for the next iteration

We know that the points that are not correctly predicted need to be boosted in the next iteration. So we define the probability of a data point under the new distribution for iteration \(t+1\) as:

$$p_{D(t+1)}(x_i)={p_{D(t)}(x_i) \cdot e^{-\alpha_t y_i h_t(x_i)} \over z_t}$$

where,

$$z_t = \sum^n_{i=1}p_{D(t)}(x_i) \cdot e^{-\alpha_t y_i h_t(x_i)}$$

and \(\alpha_t\) is a positive number (as long as the model is better than a random guess) that controls the weight of the model in the ensemble, while \(z_t\) is a normalization constant that makes the new probabilities sum to 1.

Note that here we have chosen the exponential function \(e^{-x}\) to compute the distribution for the next iteration. The reason for doing this is that:

  1. For an incorrect prediction, we need to give more importance to that data point in the next model. Here \(y_i h_t(x_i)\) will be negative, so \(e^{-\alpha _t y_i h_t(x_i)}\) will result in a large positive value, increasing the importance of the data point in the next model.
  2. For a correct prediction, we do not need to pay as much attention to that data point in the next model. Here \(y_i h_t(x_i)\) will be positive, so \(e^{-\alpha _t y_i h_t(x_i)}\) will result in a value smaller than 1, suppressing the importance of the data point in the next model.

NOTE: The initial distribution assigns probability \(1/n\) to every data point, where \(n\) is the number of data points.
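A minimal sketch of one such weight update, with made-up labels, predictions, and \(\alpha_t\) (assuming labels are encoded as -1/+1):

    # A minimal sketch of the AdaBoost distribution update, assuming labels in {-1, +1}.
    # The labels, predictions, and alpha below are illustrative values only.
    import numpy as np

    y = np.array([1, 1, -1, -1, 1])        # true labels
    h = np.array([1, -1, -1, 1, 1])        # predictions of the current model h_t
    p_D = np.full(len(y), 1.0 / len(y))    # current distribution (uniform here)
    alpha_t = 0.5                          # weight of the current model

    # Misclassified points (y_i * h_t(x_i) = -1) get boosted,
    # correctly classified points (y_i * h_t(x_i) = +1) get suppressed.
    unnormalized = p_D * np.exp(-alpha_t * y * h)
    z_t = unnormalized.sum()               # normalization constant z_t
    p_D_next = unnormalized / z_t

    print(p_D_next)                        # misclassified points now carry more weight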

So far so good. We now know how to create the individual distributions, but how do we combine the individual models to create the final ensemble?

Weights are given to each of the models to get the final ensemble

To create the final ensemble, we combine the models as a weighted sum. The weight \(\alpha_t\) is defined as:

$$\alpha_t = {1\over 2} \ln\left({1-\epsilon_t \over \epsilon_t}\right)$$

where,

$$\epsilon_t = \sum_{i:\, h_t(x_i) \ne y_i} p_{D(t)}(x_i)$$

represents the probability of model \(h_t\) making a mistake in classification, i.e. its weighted error under the current distribution.

The choice of the \(\alpha_t\) expression ensures that a model which is better than a random guess (i.e. \(\epsilon_t < 0.5\)) gets a positive weight, since in that case \({1-\epsilon_t \over \epsilon_t} > 1\) and therefore \(\ln({1-\epsilon_t \over \epsilon_t}) > 0\). The lower the error \(\epsilon_t\), the larger the weight of the model in the ensemble.
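For example, \(\epsilon_t = 0.3\) gives \(\alpha_t = {1\over 2}\ln(0.7/0.3) \approx 0.42\), \(\epsilon_t = 0.1\) gives \(\alpha_t \approx 1.10\), and \(\epsilon_t = 0.5\) (no better than a random guess) gives \(\alpha_t = 0\), so more accurate models contribute more to the final ensemble.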

AdaBoost tends to boost the probabilities of misclassified points, and since there is a high chance that outliers will be misclassified, it will keep increasing the probability associated with the outliers. Therefore it is advisable to remove outliers before training.

The example below demonstrates a Python implementation of the AdaBoost algorithm using scikit-learn.

    # Importing packages
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import load_iris

    # Loading the Iris dataset
    data = load_iris()

    X = data.data
    y = data.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    # Training an AdaBoost model with 100 weak learners
    adaClassifier = AdaBoostClassifier(n_estimators=100, random_state=0)
    adaClassifier.fit(X_train, y_train)

    # Predicting on the test set and evaluating accuracy
    predictions = adaClassifier.predict(X_test)
    print(adaClassifier.score(X_test, y_test))
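By default, scikit-learn's AdaBoostClassifier uses decision stumps (decision trees of depth 1) as its weak learners, which ties back to the depth-restricted trees discussed in the weak learning section above.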

Gradient Boosting

As discussed earlier, the key idea behind boosting is learning from previous mistakes. Let's understand Gradient Boosting in a regression setting.

Suppose we have \(n\) data points and we are looking for a function \(F(x_i)\) that produces output almost equal to \(y_i\). In real-world scenarios, there is a difference between the predicted output \(y_{pi}\) and the actual output \(y_i\). This difference, \(y_i - y_{pi}\), is called the residual.

Now in Gradient Boosting, we start from an initial model \(F_0\) and train another model \(M_1\) on the data points with the target variable set to \(y_i - y_{i}^0\), i.e. \(y_i - F_0(x_i)\).

We keep on training models in this fashion up to the \(F_t\) model, and the target variable for the \(F_t\) model is represented as:

$$y - [F_0 + F_1 + F_2 + F_3 + \ldots + F_{t-2} + F_{t-1}]$$

Gradient Boosting consists of two steps at each iteration:

  1. Find the residuals and fit a new model on them: to find the change we need to make to the current model, we compute the negative gradient of the loss function with respect to the current predictions (a sketch of this step follows this list).
  2. Add the new model to the existing model and continue with the next iteration.
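A minimal sketch of one such iteration with squared-error loss (the dataset, the base model, and the learning rate below are illustrative choices, not from the original post):

    # A minimal sketch of a single Gradient Boosting iteration with squared-error loss.
    # The dataset, tree depth, and learning rate are illustrative.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=200, n_features=5, random_state=0)

    # A crude initial model F_0 that simply predicts the mean of y
    F = np.full(len(y), y.mean())

    # Step 1: find the residuals and fit a new model on them
    residuals = y - F
    h_1 = DecisionTreeRegressor(max_depth=2, random_state=0)
    h_1.fit(X, residuals)

    # Step 2: add the new model to the existing model (scaled by a learning rate)
    learning_rate = 0.1
    F = F + learning_rate * h_1.predict(X)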

NOTE: We can use any loss function here, based on the problem. The only requirement is that the loss function must be differentiable. A few loss functions that we can use are:

  • Quadratic cost
  • Cross-entropy cost
  • Exponential cost
  • Hellinger distance
  • Kullback–Leibler divergence
  • Generalized Kullback–Leibler divergence
  • Itakura–Saito distance

In simple words, at each iteration we add an incremental model, which is fit on the negative gradients of the loss function evaluated at the current predictions. This incremental model can be a linear regression model, a decision tree, or any other model. We stop iterating when the gradients are very close to zero.
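For example, with the squared-error loss (a standard result, not specific to this post), the negative gradient is exactly the residual, which is why fitting on residuals and fitting on negative gradients coincide in the regression setting above:

$$L(y_i, F(x_i)) = \frac{1}{2}\big(y_i - F(x_i)\big)^2 \quad\Rightarrow\quad -\frac{\partial L}{\partial F(x_i)} = y_i - F(x_i)$$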

The example below demonstrates a Python implementation of the Gradient Boosting algorithm using scikit-learn.

    # Importing packages
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.datasets import fetch_california_housing
    from sklearn.metrics import r2_score

    # Loading dataset (the Boston housing dataset used originally has been removed
    # from recent scikit-learn versions, so the California housing dataset is used here)
    data = fetch_california_housing()

    X = data.data
    y = data.target

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

    # Training a Gradient Boosting model with 100 trees
    gradientRegressor = GradientBoostingRegressor(n_estimators=100, random_state=0)
    gradientRegressor.fit(X_train, y_train)

    # Evaluating the model with the R^2 score on the test set
    print(r2_score(y_test, gradientRegressor.predict(X_test)))
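Note that GradientBoostingRegressor also exposes a learning_rate parameter (0.1 by default), which plays the role of the shrinkage factor \(\lambda_t\) used in the summary below; smaller values typically require more estimators.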

Summary

In this post, we learned about the AdaBoost and Gradient Boosting algorithms. Let's summarize the steps involved in each of them.

Summary of the steps we take for the AdaBoost algorithm:

  1. Initialize the probabilities of the distribution as \(1/n\), where \(n\) is the number of data points.
  2. For t = 1 to T, repeat the following (T is the total number of iterations or models):
    • Fit a model \(h_t\) on the training data using the current probabilities
    • Compute \(\epsilon_t = \sum_{i:\, h_t(x_i) \ne y_i} p_{D(t)}(x_i)\)
    • Compute \(\alpha_t = {1\over 2} \ln({1-\epsilon_t \over \epsilon_t})\)
    • Update \(p_{D(t+1)}(x_i)={p_{D(t)}(x_i) \cdot e^{-\alpha_t y_i h_t(x_i)} \over z_t}\), where \(z_t = \sum^n_{i=1}p_{D(t)}(x_i) \cdot e^{-\alpha_t y_i h_t(x_i)}\)
  3. Final model: \(H=\sum_{t=1}^T \alpha_t h_t\)
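These steps can be sketched in a few lines of code (a minimal illustration, not the author's implementation, using decision stumps as weak learners and assuming binary labels encoded as -1/+1 on an illustrative synthetic dataset):

    # A minimal from-scratch AdaBoost sketch following the steps above.
    # Assumes binary labels encoded as -1/+1; decision stumps act as the weak learners.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=300, random_state=0)
    y = np.where(y == 0, -1, 1)                  # encode labels as -1/+1

    n, T = len(y), 50
    p_D = np.full(n, 1.0 / n)                    # step 1: uniform initial distribution
    models, alphas = [], []

    for t in range(T):
        h_t = DecisionTreeClassifier(max_depth=1, random_state=t)
        h_t.fit(X, y, sample_weight=p_D)         # fit h_t under the current distribution
        pred = h_t.predict(X)

        eps_t = np.sum(p_D[pred != y])           # weighted error epsilon_t
        if eps_t <= 0 or eps_t >= 0.5:           # stop if perfect or no better than random
            break
        alpha_t = 0.5 * np.log((1 - eps_t) / eps_t)

        unnormalized = p_D * np.exp(-alpha_t * y * pred)
        p_D = unnormalized / unnormalized.sum()  # update the distribution

        models.append(h_t)
        alphas.append(alpha_t)

    # Final model: sign of the weighted sum of the weak learners
    H = np.sign(sum(a * m.predict(X) for a, m in zip(alphas, models)))
    print((H == y).mean())                       # training accuracy of the ensemble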

Summary of the steps we take for the Gradient Boosting algorithm:

  1. Initialize a crude initial function \(F_0\).
  2. For t = 0 to T-1 (where T is the number of iterations or models):
    • Compute the loss function \(L(y_i, F_t(x_i))\)
    • Compute the new target values, the negative gradients \(- \frac{\partial L(y,F_t)}{\partial F_t}\), for all data points
    • Fit a model \(h_{t+1}\) on these new target values
    • Compute the next function \(F_{t+1} = F_t + \lambda_t h_{t+1}\), where \(\lambda_t\) represents the learning rate
  3. The final model is \(F_T\).
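These steps can likewise be sketched directly (a minimal illustration with squared-error loss, where the negative gradient is the residual; the dataset, tree depth, and learning rate are illustrative choices, not the author's):

    # A minimal from-scratch Gradient Boosting sketch following the steps above,
    # using squared-error loss (negative gradient = residual). Illustrative only.
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.datasets import make_regression

    X, y = make_regression(n_samples=300, n_features=5, random_state=0)

    T, lam = 100, 0.1                         # number of models and learning rate
    F = np.full(len(y), y.mean())             # step 1: crude initial function F_0
    models = []

    for t in range(T):
        negative_gradient = y - F             # new targets: residuals for squared loss
        h = DecisionTreeRegressor(max_depth=2, random_state=t)
        h.fit(X, negative_gradient)           # fit a model on the new target values
        F = F + lam * h.predict(X)            # F_{t+1} = F_t + lambda_t * h_{t+1}
        models.append(h)

    # Prediction for new points: initial constant plus the scaled sum of the trees
    def predict(X_new):
        return y.mean() + lam * sum(h.predict(X_new) for h in models)

    print(np.mean((predict(X) - y) ** 2))     # training mean squared error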

There are a few other boosting algorithms as well, such as XGBoost, CatBoost, and LightGBM. To know more about XGBoost, refer to the post XGBoost Detailed Explanation.

Author Info

Tavish Aggarwal

Website: http://tavishaggarwal.com

Living in Hyderabad and working as a research-based Data Scientist, specializing in improving key business performance indicators in the areas of sales, marketing, logistics, and plant production. He is an innovative team leader with out-of-the-box data wrangling capabilities such as outlier treatment, data discovery, and data transformation, with a focus on yielding high-quality results.