Ensemble Learning with Scikit-Learn: A Friendly Introduction | by Riccardo Andreoni | Sep, 2023

Ensemble learning algorithms like XGBoost or Random Forests are among the top-performing models in Kaggle competitions. How do they work?


Fundamental learning algorithms such as logistic regression or linear regression are often too simple to achieve adequate results on a machine learning problem. One possible solution is to use neural networks, but they require a vast amount of training data, which is rarely available. Ensemble learning techniques can boost the performance of simple models even with a limited amount of data.

Imagine asking a person to guess how many jellybeans are inside a big jar. One person's answer is unlikely to be a precise estimate of the correct number. If instead we ask a thousand people the same question, the average answer will likely be close to the actual number. This phenomenon is called the wisdom of the crowd [1]. On complex estimation tasks, the crowd can be considerably more precise than any individual.

Ensemble learning algorithms take advantage of this simple principle by aggregating the predictions of a group of models, such as regressors or classifiers. For an aggregation of classifiers, the ensemble model can simply pick the most common class among the predictions of the low-level classifiers. For a regression task, instead, the ensemble can use the mean or the median of all the predictions.
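As a minimal sketch of this majority-vote aggregation, scikit-learn's `VotingClassifier` combines three different classifiers (the dataset and the choice of base models here are illustrative assumptions, not taken from the article):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, chosen only for illustration
X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Hard voting: the ensemble predicts the class chosen by the
# majority of the three low-level classifiers
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(random_state=42)),
        ("tree", DecisionTreeClassifier(random_state=42)),
        ("svc", SVC(random_state=42)),
    ],
    voting="hard",
)
voting_clf.fit(X_train, y_train)
print("ensemble accuracy:", voting_clf.score(X_test, y_test))
```

Switching `voting="hard"` to `voting="soft"` averages the predicted class probabilities instead of counting votes, which often works even better when the base models are well calibrated.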


By aggregating a large number of weak learners, i.e. classifiers or regressors that are only slightly better than random guessing, we can achieve remarkable results. Consider a binary classification task: by aggregating 1000 independent classifiers, each with an individual accuracy of 51%, we can create an ensemble achieving an accuracy of about 75% [2]. Note that this holds only if the classifiers' errors are truly independent, which is rarely the case in practice, since they are trained on the same data.
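We can check this ballpark figure directly: the majority vote of 1000 independent classifiers is correct whenever at least 501 of them are correct, which is a binomial tail probability (this computation is my addition, not part of the article):

```python
from scipy.stats import binom

# Probability that at least 501 of 1000 independent classifiers,
# each correct with probability 0.51, vote for the right class.
# binom.sf(k, n, p) gives P(X > k), so sf(500, ...) is P(X >= 501).
p_majority = binom.sf(500, 1000, 0.51)
print(f"ensemble accuracy under perfect independence: {p_majority:.3f}")
```

Under the strict independence assumption the exact value comes out around 0.73, in the same ballpark as the figure cited above; with correlated errors the gain shrinks further.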

This is the reason why ensemble algorithms are often the winning solutions in many machine-learning competitions!

Several techniques exist to build an ensemble learning algorithm. The principal ones are bagging, boosting, and stacking. In the following…
