Model Tuning#
The hyperparameters of a machine learning model are parameters that are not learned from data; they must be set before fitting the model to the training set. In this chapter, you'll learn how to tune the hyperparameters of a tree-based model using grid search cross-validation.
Tuning a CART’s Hyperparameters#
Tree hyperparameters#
In the following exercises you’ll revisit the Indian Liver Patient dataset which was introduced in a previous chapter.
Your task is to tune the hyperparameters of a classification tree. Given that this dataset is imbalanced, you’ll be using the ROC AUC score as a metric instead of accuracy.
We have instantiated a DecisionTreeClassifier and assigned it to dt with sklearn's default hyperparameters. You can inspect the hyperparameters of dt in your console.
Which of the following is not a hyperparameter of dt?
min_impurity_decrease
min_weight_fraction_leaf
min_features
splitter
Well done! There is no hyperparameter named min_features.
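For reference, every hyperparameter of a classification tree can be listed directly with get_params(); a minimal sketch:
from sklearn.tree import DecisionTreeClassifier
# List every tunable hyperparameter of a default classification tree
dt = DecisionTreeClassifier()
print(sorted(dt.get_params().keys()))
# The keys include 'min_impurity_decrease', 'min_weight_fraction_leaf' and
# 'splitter', but there is no 'min_features' entry.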
Set the tree’s hyperparameter grid#
In this exercise, you'll manually set the grid of hyperparameters that will be used to tune the classification tree dt and find the optimal classifier in the next exercise.
Define a grid of hyperparameters corresponding to a Python dictionary called params_dt with:
'max_depth' set to a list of values 2, 3, and 4
'min_samples_leaf' set to a list of values 0.12, 0.14, 0.16, 0.18
# Define params_dt
params_dt = {'max_depth':[2,3,4], 'min_samples_leaf':[0.12,0.14,0.16,0.18]}
Great! Next comes performing the grid search.
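As an optional sanity check (not part of the exercise), ParameterGrid can count how many candidate models the upcoming search will evaluate, using the params_dt dictionary defined above:
# Optional check: number of hyperparameter combinations in the grid
from sklearn.model_selection import ParameterGrid
print(len(ParameterGrid(params_dt)))  # 3 * 4 = 12 candidate trees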
Search for the optimal tree#
In this exercise, you'll perform grid search using 5-fold cross-validation to find dt's optimal hyperparameters. Note that because grid search is an exhaustive process, it may take a long time to train the model. Here you'll only be instantiating the GridSearchCV object without fitting it to the training set.
As discussed in the video, you can train such an object like any scikit-learn estimator by using the .fit() method:
grid_object.fit(X_train, y_train)
An untuned classification tree dt as well as the dictionary params_dt that you defined in the previous exercise are available in your workspace.
Import GridSearchCV from sklearn.model_selection.
Instantiate a GridSearchCV object using 5-fold CV by setting the parameters: estimator to dt, param_grid to params_dt, and scoring to 'roc_auc'.
# edited/added
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV
# Instantiate grid_dt
grid_dt = GridSearchCV(estimator=dt,
param_grid=params_dt,
scoring='roc_auc',
cv=5,
n_jobs=-1)
Awesome! As we said earlier, we will fit the model to the training data for you and in the next exercise you will compute the test set ROC AUC score.
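Once the grid has been fitted, its winning combination and cross-validated ROC AUC can be read directly off the object. A minimal sketch, assuming the workspace variables X_train and y_train:
# Sketch: fit the grid and inspect the winner (X_train, y_train from the workspace)
grid_dt.fit(X_train, y_train)
print(grid_dt.best_params_)   # winning hyperparameter combination
print(grid_dt.best_score_)    # best cross-validated ROC AUC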
Evaluate the optimal tree#
In this exercise, you'll evaluate the test set ROC AUC score of grid_dt's optimal model.
In order to do so, you will first determine the probability of obtaining the positive label for each test set observation. You can use the method predict_proba() of an sklearn classifier to compute a 2D array containing the probabilities of the negative and positive class labels, respectively, along the columns.
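To illustrate that layout, here is a small self-contained sketch on toy data (the toy dataset is purely illustrative and is not the liver patient data):
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
# Toy data, only to show the shape of predict_proba's output
X, y = make_classification(n_samples=100, random_state=1)
clf = DecisionTreeClassifier(max_depth=2, random_state=1).fit(X, y)
proba = clf.predict_proba(X)   # shape: (n_samples, 2)
print(proba[:3, 0])            # column 0: negative-class probabilities
print(proba[:3, 1])            # column 1: positive-class probabilities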
The dataset is already loaded and processed for you (numerical features are standardized); it is split into 80% train and 20% test. X_test and y_test are available in your workspace. In addition, we have also loaded the trained GridSearchCV object grid_dt that you instantiated in the previous exercise. Note that grid_dt was trained as follows:
grid_dt.fit(X_train, y_train)
Import roc_auc_score from sklearn.metrics.
Extract the .best_estimator_ attribute from grid_dt and assign it to best_model.
Predict the test set probabilities of the positive class and assign them to y_pred_proba.
Compute the test set ROC AUC score test_roc_auc of best_model.
# edited/added
grid_dt.fit(X_train, y_train)
## GridSearchCV(cv=5, estimator=DecisionTreeClassifier(), n_jobs=-1,
##              param_grid={'max_depth': [2, 3, 4],
##                          'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]},
##              scoring='roc_auc')
# Import roc_auc_score from sklearn.metrics
from sklearn.metrics import roc_auc_score
# Extract the best estimator
best_model = grid_dt.best_estimator_
# Predict the test set probabilities of the positive class
y_pred_proba = best_model.predict_proba(X_test)[:, 1]
# Compute test_roc_auc
test_roc_auc = roc_auc_score(y_test, y_pred_proba)
# Print test_roc_auc
print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))
## Test set ROC AUC score: 0.700
Great work! An untuned classification tree would achieve a ROC AUC score of 0.54!
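For comparison, that untuned baseline could be reproduced along these lines (a sketch assuming the same workspace splits; the 0.54 figure comes from the course environment):
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
# Sketch: ROC AUC of a tree left at its default hyperparameters
untuned_dt = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
print(roc_auc_score(y_test, untuned_dt.predict_proba(X_test)[:, 1]))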
Tuning a RF’s Hyperparameters#
Random forests hyperparameters#
In the following exercises, you'll be revisiting the Bike Sharing Demand dataset that was introduced in a previous chapter. Recall that your task is to predict the bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C. For this purpose, you'll be tuning the hyperparameters of a Random Forests regressor.
We have instantiated a RandomForestRegressor called rf using sklearn's default hyperparameters. You can inspect the hyperparameters of rf in your console.
Which of the following is not a hyperparameter of rf?
min_weight_fraction_leaf
criterion
learning_rate
warm_start
Well done! There is no hyperparameter named learning_rate.
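A quick way to confirm this in the console is to look the name up in rf's parameter dictionary:
from sklearn.ensemble import RandomForestRegressor
# learning_rate belongs to boosting estimators, not to random forests
rf = RandomForestRegressor()
print('learning_rate' in rf.get_params())             # False
print('min_weight_fraction_leaf' in rf.get_params())  # True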
Set the hyperparameter grid of RF#
In this exercise, you'll manually set the grid of hyperparameters that will be used to tune rf's hyperparameters and find the optimal regressor. For this purpose, you will be constructing a grid of hyperparameters and tuning the number of estimators, the maximum number of features used when splitting each node, and the minimum number of samples (or fraction) per leaf.
Define a grid of hyperparameters corresponding to a Python dictionary called params_rf with:
'n_estimators' set to a list of values 100, 350, 500
'max_features' set to a list of values 'log2', 'auto', 'sqrt'
'min_samples_leaf' set to a list of values 2, 10, 30
# Define the dictionary 'params_rf'
params_rf = {'n_estimators':[100, 350, 500],
'max_features':['log2','auto','sqrt'],
'min_samples_leaf':[2,10,30]}
Great work! Time to perform the grid search.
Search for the optimal forest#
In this exercise, you'll perform grid search using 3-fold cross-validation to find rf's optimal hyperparameters. To evaluate each model in the grid, you'll be using the negative mean squared error metric.
Note that because grid search is an exhaustive search process, it may take a long time to train the model. Here you'll only be instantiating the GridSearchCV object without fitting it to the training set. As discussed in the video, you can train such an object like any scikit-learn estimator by using the .fit() method:
grid_object.fit(X_train, y_train)
The untuned random forests regressor model rf as well as the dictionary params_rf that you defined in the previous exercise are available in your workspace.
Import GridSearchCV from sklearn.model_selection.
Instantiate a GridSearchCV object using 3-fold CV, using negative mean squared error as the scoring metric.
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
# Instantiate grid_rf
grid_rf = GridSearchCV(estimator=rf,
param_grid=params_rf,
scoring='neg_mean_squared_error',
cv=3,
verbose=1,
n_jobs=-1)
Awesome! Next comes evaluating the test set RMSE of the best model.
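Because the scoring metric is negative MSE, the best cross-validated score of a fitted grid is negative; flipping its sign and taking the square root turns it back into an RMSE. A minimal sketch, assuming grid_rf has been fitted as in the next exercise:
# Sketch: read the best cross-validated RMSE off a fitted grid
best_cv_rmse = (-grid_rf.best_score_) ** 0.5
print('Best CV RMSE: {:.3f}'.format(best_cv_rmse))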
Evaluate the optimal forest#
In this last exercise of the course, you'll evaluate the test set RMSE of grid_rf's optimal model.
The dataset is already loaded and processed for you and is split into 80% train and 20% test. In your environment are available X_test, y_test, and the function mean_squared_error from sklearn.metrics under the alias MSE. In addition, we have also loaded the trained GridSearchCV object grid_rf that you instantiated in the previous exercise. Note that grid_rf was trained as follows:
grid_rf.fit(X_train, y_train)
Import mean_squared_error as MSE from sklearn.metrics.
Extract the best estimator from grid_rf and assign it to best_model.
Predict best_model's test set labels and assign the result to y_pred.
Compute best_model's test set RMSE.
# edited/added
grid_rf = grid_rf.fit(X_train, y_train)
## Fitting 3 folds for each of 27 candidates, totalling 81 fits
# Import mean_squared_error from sklearn.metrics as MSE
from sklearn.metrics import mean_squared_error as MSE
# Extract the best estimator
best_model = grid_rf.best_estimator_
# Predict test set labels
y_pred = best_model.predict(X_test)
# Compute rmse_test
rmse_test = MSE(y_test, y_pred)**0.5
# Print rmse_test
print('Test RMSE of best model: {:.3f}'.format(rmse_test))
## Test RMSE of best model: 0.405
Magnificent work!
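If you are curious which combination the search selected, the fitted grid exposes it directly; a one-line sketch using the grid_rf object from above:
# Sketch: the hyperparameter combination that won the grid search
print(grid_rf.best_params_)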
Congratulations!#
Congratulations on completing this course!
How far you have come#
Take a moment to look at how far you have come! In chapter 1, you started off by understanding and applying the CART algorithm to train decision trees, or CARTs, for problems involving classification and regression. In chapter 2, you understood what the generalization error of a supervised learning model is. In addition, you also learned how underfitting and overfitting can be diagnosed with cross-validation. Furthermore, you learned how model ensembling can produce results that are more robust than individual decision trees. In chapter 3, you applied randomization through bootstrapping and constructed a diverse set of trees in an ensemble through bagging. You also explored how random forests introduce further randomization by sampling features at the level of each node in each tree forming the ensemble. Chapter 4 introduced you to boosting, an ensemble method in which predictors are trained sequentially and where each predictor tries to correct the errors made by its predecessor. Specifically, you saw how AdaBoost involved tweaking the weights of the training samples while gradient boosting involved fitting each tree using the residuals of its predecessor as labels. You also learned how subsampling instances and features can lead to better performance through Stochastic Gradient Boosting. Finally, in chapter 5, you explored hyperparameter tuning through grid search cross-validation and learned how important it is for getting the most out of your models.
Thank you!#
I hope you enjoyed taking this course as much as I enjoyed developing it. Finally, I encourage you to apply the skills you learned by practicing on real-world datasets.