Boosting#
Boosting refers to an ensemble method in which several models are trained sequentially, with each model learning from the errors of its predecessors. In this chapter, you’ll be introduced to two boosting methods: AdaBoost and Gradient Boosting.
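To make "learning from the errors of its predecessors" concrete, here is a minimal sketch of the idea for regression with squared error (the flavor used by gradient boosting, not AdaBoost's reweighting scheme). The synthetic data, the learning rate of 0.5 and the 20 rounds are all assumptions chosen for illustration; they do not come from the exercises.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem (illustrative only)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.5  # assumed shrinkage factor

# Start from a constant prediction: the mean of the targets
prediction = np.full_like(y, y.mean())
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(20):
    # Each new tree is trained on the errors left by the ensemble so far
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add a shrunken version of the new tree's output to the ensemble
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print('Training MSE before boosting: {:.4f}'.format(mse_history[0]))
print('Training MSE after 20 rounds: {:.4f}'.format(mse_history[-1]))
```

Each round can only lower (or leave unchanged) the training error, which is why boosted ensembles usually need a held-out test set to judge how many rounds are useful.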
AdaBoost#
Define the AdaBoost classifier#
In the following exercises you’ll revisit the Indian Liver Patient dataset which was introduced in a previous chapter. Your task is to predict whether a patient suffers from a liver disease using 10 features including Albumin, age and gender. However, this time, you’ll be training an AdaBoost ensemble to perform the classification task. In addition, given that this dataset is imbalanced, you’ll be using the ROC AUC score as a metric instead of accuracy.
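The reason for preferring ROC AUC on imbalanced data can be illustrated with a small sketch. The labels below are made up (90 negatives, 10 positives) and do not correspond to the liver dataset; the "model" is a hypothetical one that always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Made-up imbalanced labels: 90 negatives and 10 positives
y_true = np.array([0] * 90 + [1] * 10)

# A useless model that always predicts the majority class
y_majority = np.zeros(100, dtype=int)
# The same model expressed as scores: every sample gets the same score
constant_scores = np.zeros(100)

acc = accuracy_score(y_true, y_majority)
auc = roc_auc_score(y_true, constant_scores)

print('Accuracy: {:.2f}'.format(acc))
print('ROC AUC: {:.2f}'.format(auc))
```

Accuracy rewards the model for the class imbalance alone, while ROC AUC stays at chance level because the scores cannot rank any positive above any negative.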
As a first step, you’ll start by instantiating an AdaBoost classifier.
Import AdaBoostClassifier from sklearn.ensemble.
Instantiate a DecisionTreeClassifier with max_depth set to 2.
Instantiate an AdaBoostClassifier consisting of 180 trees, setting the base_estimator to dt.
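# Load and prepare the Indian Liver Patient dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
indian_liver_patient = pd.read_csv("archive/Machine-Learning-with-Tree-Based-Models-in-Python/datasets/indian_liver_patient.csv")
df = indian_liver_patient.rename(columns={'Dataset':'Liver_disease'})
df = df.dropna()
# .copy() ensures the new Is_male column is added to an independent DataFrame
X = df[['Age', 'Total_Bilirubin',
        'Direct_Bilirubin',
        'Alkaline_Phosphotase',
        'Alamine_Aminotransferase', 'Aspartate_Aminotransferase',
        'Total_Protiens', 'Albumin', 'Albumin_and_Globulin_Ratio', 'Gender']].copy()
# Encode Gender as an integer column
label_encoder = LabelEncoder()
X['Is_male'] = label_encoder.fit_transform(X['Gender'])
X = X.drop(columns='Gender')
# Shift the labels down by one so that they start at 0
y = df['Liver_disease']-1
# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)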
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# Import AdaBoostClassifier
from sklearn.ensemble import AdaBoostClassifier
# Instantiate dt
dt = DecisionTreeClassifier(max_depth=2, random_state=1)
# Instantiate ada
# (scikit-learn 1.2 renamed base_estimator to estimator; use estimator=dt on newer versions)
ada = AdaBoostClassifier(base_estimator=dt,
                         n_estimators=180, random_state=1)
Well done! Next comes training ada and evaluating the probability of obtaining the positive class in the test set.
Train the AdaBoost classifier#
Now that you’ve instantiated the AdaBoost classifier ada, it’s time to train it. You will also predict the probabilities of obtaining the positive class in the test set. Once the classifier ada is trained, call the .predict_proba() method, passing X_test as an argument, and extract these probabilities by slicing all the values in the second column as follows:
ada.predict_proba(X_test)[:,1]
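The slicing expression above relies on the layout of the array returned by .predict_proba(): one row per sample and one column per class, ordered as in the fitted model's classes_ attribute. The following sketch checks this on synthetic data from make_classification; the data, the variable names and the 20 estimators are assumptions for illustration. The tree is passed positionally so that the snippet does not depend on whether the keyword is called base_estimator or estimator in your scikit-learn version.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

# Small AdaBoost ensemble of depth-2 trees
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2, random_state=1),
                         n_estimators=20, random_state=1)
clf.fit(X_demo, y_demo)

# One row per sample, one column per class
proba = clf.predict_proba(X_demo)
print(proba.shape)
# Column order follows clf.classes_, so column 1 belongs to class 1
print(clf.classes_)

# Probabilities of the positive class only
positive_proba = proba[:, 1]
```

Because each row sums to 1 in the binary case, the second column alone carries all the information needed by roc_auc_score.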
The Indian Liver dataset is processed for you and split into 80% train and 20% test. Feature matrices X_train and X_test, as well as the arrays of labels y_train and y_test, are available in your workspace. In addition, we have also loaded the instantiated model ada from the previous exercise.
Fit ada to the training set.
Compute the probabilities of obtaining the positive class in the test set and assign them to y_pred_proba.
# Fit ada to the training set
ada.fit(X_train, y_train)
## AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2,
##                                                          random_state=1),
##                    n_estimators=180, random_state=1)
# Compute the probabilities of obtaining the positive class
y_pred_proba = ada.predict_proba(X_test)[:,1]
Great work! Next, you’ll evaluate ada’s ROC AUC score.
Evaluate the AdaBoost classifier#
Now that you’re done training ada and predicting the probabilities of obtaining the positive class in the test set, it’s time to evaluate ada’s ROC AUC score. Recall that the ROC AUC score of a binary classifier can be determined using the roc_auc_score() function from sklearn.metrics.
The arrays y_test and y_pred_proba that you computed in the previous exercise are available in your workspace.
Import roc_auc_score from sklearn.metrics.
Compute ada’s test set ROC AUC score, assign it to ada_roc_auc, and print it out.
# Import roc_auc_score
from sklearn.metrics import roc_auc_score
# Evaluate test-set roc_auc_score
ada_roc_auc = roc_auc_score(y_test, y_pred_proba)
# Print roc_auc_score
print('ROC AUC score: {:.2f}'.format(ada_roc_auc))
## ROC AUC score: 0.72
Not bad! This untuned AdaBoost classifier achieved a ROC AUC score of 0.72!
Gradient Boosting (GB)#
Define the GB regressor#
You’ll now revisit the Bike Sharing Demand dataset that was introduced in the previous chapter. Recall that your task is to predict the bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C. For this purpose, you’ll be using a gradient boosting regressor.
As a first step, you’ll start by instantiating a gradient boosting regressor which you will train in the next exercise.
Import GradientBoostingRegressor from sklearn.ensemble.
Instantiate a gradient boosting regressor by setting the parameters:
max_depth to 4
n_estimators to 200
# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor
# Instantiate gb
gb = GradientBoostingRegressor(max_depth=4,
n_estimators=200,
random_state=2)
Awesome! Time to train the regressor and predict test set labels.
Train the GB regressor#
You’ll now train the gradient boosting regressor gb that you instantiated in the previous exercise and predict test set labels.
The dataset is split into 80% train and 20% test. Feature matrices X_train and X_test, as well as the arrays y_train and y_test, are available in your workspace. In addition, we have also loaded the model instance gb that you defined in the previous exercise.
Fit gb to the training set.
Predict the test set labels and assign the result to y_pred.
# Fit gb to the training set
gb.fit(X_train,y_train)
## GradientBoostingRegressor(max_depth=4, n_estimators=200, random_state=2)
# Predict test set labels
y_pred = gb.predict(X_test)
Great work! Time to evaluate the test set RMSE!
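Since a gradient boosting model adds its trees one at a time, it can be instructive to look at the test error after each added tree. The sketch below uses the staged_predict() method of GradientBoostingRegressor for this. It runs on synthetic data from make_regression rather than the bike dataset, and uses 50 trees instead of 200 to keep it quick; both are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data (illustrative only, not the bike dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=6,
                                 noise=10.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=2)

gb_demo = GradientBoostingRegressor(max_depth=4, n_estimators=50,
                                    random_state=2)
gb_demo.fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ..., 50 trees
staged_rmse = [mean_squared_error(y_te, y_stage) ** 0.5
               for y_stage in gb_demo.staged_predict(X_te)]

# Number of trees with the lowest test RMSE
best_n = int(np.argmin(staged_rmse)) + 1
print('Lowest test RMSE after {} trees'.format(best_n))
```

The last entry of staged_rmse is the RMSE of the full ensemble, so it matches what gb_demo.predict() would give.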
Evaluate the GB regressor#
Now that the test set predictions are available, you can use them to evaluate the test set Root Mean Squared Error (RMSE) of gb. The array y_test and the predictions y_pred are available in your workspace.
Import mean_squared_error from sklearn.metrics as MSE.
Compute the test set MSE and assign it to mse_test.
Compute the test set RMSE and assign it to rmse_test.
# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE
# Compute MSE
mse_test = MSE(y_test, y_pred)
# Compute RMSE
rmse_test = mse_test**0.5
# Print RMSE
print('Test set RMSE of gb: {:.3f}'.format(rmse_test))
## Test set RMSE of gb: 0.452
Great work!
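As a quick check of what the RMSE computation does, the sketch below compares the MSE-based calculation with the formula written out by hand: the square root of the mean of the squared differences. The four target values and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error as MSE

# Made-up targets and predictions (illustrative only)
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.5, 5.0, 4.0, 8.0])

# RMSE via scikit-learn's MSE, as in the exercise
rmse_sklearn = MSE(y_true_demo, y_pred_demo) ** 0.5

# RMSE written out: sqrt(mean((y - y_hat)^2))
rmse_manual = np.sqrt(np.mean((y_true_demo - y_pred_demo) ** 2))

print('RMSE (sklearn): {:.3f}'.format(rmse_sklearn))
print('RMSE (manual):  {:.3f}'.format(rmse_manual))
```

Taking the square root puts the error back in the units of the target, which makes RMSE easier to interpret than MSE.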
Stochastic Gradient Boosting#
Regression with SGB#
As in the exercises from the previous lesson, you’ll be working with the Bike Sharing Demand dataset. In the following set of exercises, you’ll solve this bike count regression problem using stochastic gradient boosting.
Instantiate a Stochastic Gradient Boosting Regressor (SGBR) and set:
max_depth to 4 and n_estimators to 200,
subsample to 0.9, and
max_features to 0.75.
# Import GradientBoostingRegressor
from sklearn.ensemble import GradientBoostingRegressor
# Instantiate sgbr
sgbr = GradientBoostingRegressor(
max_depth=4,
subsample=0.9,
max_features=0.75,
n_estimators=200,
random_state=2)
Well done!
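The two parameters that make the regressor "stochastic" can be inspected after fitting. With subsample=0.9, each tree is trained on a random 90% of the training rows, and with max_features=0.75, each split considers a random 75% of the features. The sketch below fits the same configuration on synthetic data with 8 features and 50 trees (both assumptions for illustration) and reads back two fitted attributes: max_features_, the resolved number of features per split, and oob_improvement_, which scikit-learn only provides when subsample is below 1.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic regression data with 8 features (illustrative only)
X_demo, y_demo = make_regression(n_samples=200, n_features=8,
                                 noise=5.0, random_state=2)

sgbr_demo = GradientBoostingRegressor(max_depth=4,
                                      subsample=0.9,
                                      max_features=0.75,
                                      n_estimators=50,
                                      random_state=2)
sgbr_demo.fit(X_demo, y_demo)

# 75% of 8 features: number of features considered at each split
n_features_per_split = sgbr_demo.max_features_
print(n_features_per_split)

# Per-tree loss improvement measured on the rows left out of each subsample
oob_gain = sgbr_demo.oob_improvement_
print(oob_gain.shape)
```

The randomness in rows and features makes the individual trees more diverse, which is the mechanism stochastic gradient boosting relies on to reduce variance.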
Train the SGB regressor#
In this exercise, you’ll train the SGBR sgbr instantiated in the previous exercise and predict the test set labels.
The bike sharing demand dataset is already loaded and processed for you; it is split into 80% train and 20% test. The feature matrices X_train and X_test, the arrays of labels y_train and y_test, and the model instance sgbr that you defined in the previous exercise are available in your workspace.
Fit sgbr to the training set.
Predict the test set labels and assign the result to y_pred.
# Fit sgbr to the training set
sgbr.fit(X_train,y_train)
## GradientBoostingRegressor(max_depth=4, max_features=0.75, n_estimators=200,
##                           random_state=2, subsample=0.9)
# Predict test set labels
y_pred = sgbr.predict(X_test)
Great! Next comes test set evaluation!
Evaluate the SGB regressor#
You have prepared the ground to determine the test set RMSE of sgbr, which you shall evaluate in this exercise. y_pred and y_test are available in your workspace.
Import mean_squared_error as MSE from sklearn.metrics.
Compute the test set MSE and assign it to mse_test.
Compute the test set RMSE and assign it to rmse_test.
# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE
# Compute test set MSE
mse_test = MSE(y_test,y_pred)
# Compute test set RMSE
rmse_test = mse_test**0.5
# Print rmse_test
print('Test set RMSE of sgbr: {:.3f}'.format(rmse_test))
## Test set RMSE of sgbr: 0.445
The stochastic gradient boosting regressor achieves a lower test set RMSE than the gradient boosting regressor (which was 0.452)!