Bagging and Random Forests#
Bagging is an ensemble method that involves training the same algorithm many times on different subsets sampled from the training data. In this chapter, you’ll understand how bagging can be used to create a tree ensemble. You’ll also learn how the random forests algorithm can lead to further ensemble diversity through randomization at the level of each split in the trees forming the ensemble.
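To make the bootstrap-and-aggregate idea concrete, here is a minimal from-scratch sketch of bagging for classification. It is for illustration only (the exercises below use sklearn’s BaggingClassifier instead), and it assumes X and y are NumPy arrays with integer class labels and X_new holds the rows to predict.
# Minimal bagging sketch: bootstrap sampling + majority vote (illustration only)
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X, y, X_new, n_estimators=50, seed=1):
    rng = np.random.RandomState(seed)
    all_preds = []
    for _ in range(n_estimators):
        # Draw a bootstrap sample: rows sampled with replacement
        idx = rng.randint(0, len(X), size=len(X))
        tree = DecisionTreeClassifier(random_state=seed)
        tree.fit(X[idx], y[idx])
        all_preds.append(tree.predict(X_new))
    # Aggregate the individual trees' predictions by majority vote
    all_preds = np.array(all_preds, dtype=int)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)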
Bagging#
Define the bagging classifier#
In the following exercises you’ll work with the Indian Liver Patient dataset from the UCI machine learning repository. Your task is to predict whether a patient suffers from a liver disease using 10 features including Albumin, age and gender. You’ll do so using a Bagging Classifier.
Import DecisionTreeClassifier from sklearn.tree and BaggingClassifier from sklearn.ensemble.
Instantiate a DecisionTreeClassifier called dt.
Instantiate a BaggingClassifier called bc consisting of 50 trees.
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier
# Instantiate dt
dt = DecisionTreeClassifier(random_state=1)
# Instantiate bc
bc = BaggingClassifier(base_estimator=dt, n_estimators=50, random_state=1)
Great! In the following exercise, you’ll train bc and
evaluate its test set performance.
Evaluate Bagging performance#
Now that you have instantiated the bagging classifier, it’s time to train it and evaluate its test set accuracy.
The Indian Liver Patient dataset is processed for you and split into 80%
train and 20% test. The feature matrices X_train and
X_test, as well as the arrays of labels
y_train and y_test are available in your
workspace. In addition, we have also loaded the bagging classifier
bc that you instantiated in the previous exercise and the
function accuracy_score() from
sklearn.metrics.
Fit bc to the training set.
Predict the test set labels and assign the result to y_pred.
Evaluate bc’s test set accuracy.
# Fit bc to the training set
bc.fit(X_train, y_train)
## BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1),
##                   n_estimators=50, random_state=1)
# Predict test set labels
y_pred = bc.predict(X_test)
# Evaluate acc_test
acc_test = accuracy_score(y_test, y_pred)
print('Test set accuracy of bc: {:.2f}'.format(acc_test)) 
## Test set accuracy of bc: 0.70
Great work! A single tree dt would have achieved an accuracy of 63%, which is 7 percentage points lower than bc’s accuracy!
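If you want to reproduce that comparison, a minimal sketch is shown below. It reuses the dt instantiated earlier and the same train/test split; the exact number you get may differ slightly depending on the preprocessing.
# For reference: accuracy of the single, unbagged tree on the same split
dt.fit(X_train, y_train)
acc_dt = accuracy_score(y_test, dt.predict(X_test))
print('Test set accuracy of dt: {:.2f}'.format(acc_dt))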
Out of Bag Evaluation#
Prepare the ground#
In the following exercises, you’ll compare the OOB accuracy to the test set accuracy of a bagging classifier trained on the Indian Liver Patient dataset.
In sklearn, you can evaluate the OOB accuracy of an ensemble classifier by setting the parameter oob_score to True during instantiation. After training the classifier, the OOB accuracy can be obtained by accessing the .oob_score_ attribute from the corresponding instance.
In your environment, we have made available the class DecisionTreeClassifier from sklearn.tree and the class BaggingClassifier from sklearn.ensemble.
Instantiate a DecisionTreeClassifier with min_samples_leaf set to 8.
Instantiate a BaggingClassifier called bc consisting of 50 trees and set oob_score to True.
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
# Import BaggingClassifier
from sklearn.ensemble import BaggingClassifier
# Instantiate dt
dt = DecisionTreeClassifier(min_samples_leaf=8, random_state=1)
# Instantiate bc
bc = BaggingClassifier(base_estimator=dt,
                       n_estimators=50,
                       oob_score=True,
                       random_state=1)
Great! In the following exercise, you’ll train bc and
compare its test set accuracy to its OOB accuracy.
OOB Score vs Test Set Score#
Now that you have instantiated bc, you will fit it to the training set and evaluate its test set and OOB accuracies.
The dataset is processed for you and split into 80% train and 20% test.
The feature matrices X_train and X_test, as
well as the arrays of labels y_train and
y_test are available in your workspace. In addition, we
have also loaded the classifier bc instantiated in the
previous exercise and the function accuracy_score() from
sklearn.metrics.
Fit bc to the training set, predict the test set labels, and assign the results to y_pred.
Evaluate acc_test by calling accuracy_score().
Evaluate bc’s OOB accuracy acc_oob by extracting the attribute oob_score_ from bc.
# Fit bc to the training set 
bc.fit(X_train, y_train)
## BaggingClassifier(base_estimator=DecisionTreeClassifier(min_samples_leaf=8,
##                                                         random_state=1),
##                   n_estimators=50, oob_score=True, random_state=1)
# Predict test set labels
y_pred = bc.predict(X_test)
# Evaluate test set accuracy
acc_test = accuracy_score(y_test, y_pred)
# Evaluate OOB accuracy
acc_oob = bc.oob_score_
# Print acc_test and acc_oob
print('Test set accuracy: {:.3f}, OOB accuracy: {:.3f}'.format(acc_test, acc_oob))
## Test set accuracy: 0.690, OOB accuracy: 0.687
Great work! The test set accuracy and the OOB accuracy of bc are very close to each other, both roughly 69%!
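For intuition, here is a rough sketch of what .oob_score_ measures (not how sklearn computes it internally): each training row is predicted only by the trees whose bootstrap sample did not contain it, and those out-of-bag predictions are compared to the true labels. It assumes the fitted bc from above and integer class labels.
# Manual OOB accuracy sketch (illustration only)
import numpy as np

X_arr, y_arr = np.asarray(X_train), np.asarray(y_train).astype(int)
n = len(X_arr)
votes = np.zeros((n, y_arr.max() + 1))
for tree, sampled_idx in zip(bc.estimators_, bc.estimators_samples_):
    oob = np.ones(n, dtype=bool)
    oob[sampled_idx] = False                       # rows this tree never saw
    preds = tree.predict(X_arr[oob]).astype(int)
    votes[np.flatnonzero(oob), preds] += 1         # one OOB vote per tree and row
seen = votes.sum(axis=1) > 0                       # rows with at least one OOB vote
manual_oob = np.mean(votes[seen].argmax(axis=1) == y_arr[seen])
print('Manually computed OOB accuracy: {:.3f}'.format(manual_oob))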
Random Forests (RF)#
Train an RF regressor#
In the following exercises you’ll predict bike rental demand in the Capital Bikeshare program in Washington, D.C., using historical weather data from the Bike Sharing Demand dataset available through Kaggle. For this purpose, you will use the random forests algorithm. As a first step, you’ll define a random forests regressor and fit it to the training set.
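What distinguishes a random forest from plain bagging of trees is that each split considers only a random subset of the features; in sklearn this is exposed through the max_features parameter. The exercise below keeps the default settings, but as a sketch of how you could increase split-level randomization (the 'sqrt' value is an illustrative choice, not part of the exercise):
# Illustration only: consider roughly sqrt(n_features) candidate features per split
from sklearn.ensemble import RandomForestRegressor
rf_sqrt = RandomForestRegressor(n_estimators=25, max_features='sqrt', random_state=2)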
The dataset is processed for you and split into 80% train and 20% test.
The feature matrix X_train and the array y_train are available in your workspace.
Import RandomForestRegressor from sklearn.ensemble.
Instantiate a RandomForestRegressor called rf consisting of 25 trees.
Fit rf to the training set.
# Import RandomForestRegressor
from sklearn.ensemble import RandomForestRegressor
# Instantiate rf
rf = RandomForestRegressor(n_estimators=25,
                           random_state=2)
# Fit rf to the training set    
rf.fit(X_train, y_train) 
## RandomForestRegressor(n_estimators=25, random_state=2)
Great work! Next, you’ll evaluate the test set RMSE.
Evaluate the RF regressor#
You’ll now evaluate the test set RMSE of the random forests regressor
rf that you trained in the previous exercise.
The dataset is processed for you and split into 80% train and 20% test.
The feature matrix X_test and the array y_test are available in your workspace. In addition, we
have also loaded the model rf that you trained in the
previous exercise.
Import mean_squared_error from sklearn.metrics as MSE.
Predict the test set labels and assign the result to y_pred.
Evaluate the test set RMSE and assign it to rmse_test.
# Import mean_squared_error as MSE
from sklearn.metrics import mean_squared_error as MSE
# Predict the test set labels
y_pred = rf.predict(X_test)
# Evaluate the test set RMSE
rmse_test = MSE(y_test, y_pred)**0.5
# Print rmse_test
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))
## Test set RMSE of rf: 0.43
Great work! You can try training a single CART on the same dataset. The
test set RMSE achieved by rf is significantly smaller than
that achieved by a single CART!
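To check that claim yourself, a minimal sketch is shown below. It fits a single regression tree on the same split; the exact RMSE you get will depend on the tree's settings.
# For reference: test set RMSE of a single regression tree (CART) on the same split
from sklearn.tree import DecisionTreeRegressor
cart = DecisionTreeRegressor(random_state=2)
cart.fit(X_train, y_train)
rmse_cart = MSE(y_test, cart.predict(X_test))**0.5
print('Test set RMSE of a single CART: {:.2f}'.format(rmse_cart))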
Visualizing feature importances#
In this exercise, you’ll determine which features were the most
predictive according to the random forests regressor rf
that you trained in a previous exercise.
For this purpose, you’ll draw a horizontal barplot of the feature importances as assessed by rf. Fortunately, this can be done easily thanks to the plotting capabilities of pandas.
We have created a pandas.Series object called
importances containing the feature names as
index and their importances as values. In addition,
matplotlib.pyplot is available as plt and
pandas as pd.
Call the .sort_values() method on importances and assign the result to importances_sorted.
Call the .plot() method on importances_sorted and set the arguments:
kind to ‘barh’
color to ‘lightgreen’
# Create a pd.Series of feature importances
importances = pd.Series(data=rf.feature_importances_,
                        index=X_train.columns)
# Sort importances
importances_sorted = importances.sort_values()
# Draw a horizontal barplot of importances_sorted
importances_sorted.plot(kind='barh', color='lightgreen')
plt.title('Features Importances')
plt.show()
[Figure: horizontal barplot of the sorted feature importances]
Apparently, hr and workingday are the most
important features according to rf. The importances of
these two features add up to more than 90%!
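If you want to verify that figure, you can sum the two importances directly; this assumes 'hr' and 'workingday' are the exact column names in X_train.
# Combined importance of the two dominant features
print(importances[['hr', 'workingday']].sum())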
