Model Tuning#

The hyperparameters of a machine learning model are parameters that are not learned from data: they must be set before the model is fit to the training set. In this chapter, you’ll learn how to tune the hyperparameters of a tree-based model using grid search cross-validation.
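To make the distinction concrete, here is a minimal sketch (not part of the exercises; the dataset is synthetic): a hyperparameter like max_depth is chosen by you before fitting, while the tree’s structure is a set of parameters learned from the data during fitting.

    # A hyperparameter (max_depth) is set before fitting;
    # the tree structure is learned from the data during fitting.
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=100, random_state=1)

    dt = DecisionTreeClassifier(max_depth=3, random_state=1)  # hyperparameter chosen by us
    dt.fit(X, y)

    print(dt.get_depth())       # depth actually reached (at most 3)
    print(dt.tree_.node_count)  # number of nodes learned from X, y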

Tuning a CART’s Hyperparameters#

Tree hyperparameters#

In the following exercises you’ll revisit the Indian Liver Patient dataset, which was introduced in a previous chapter.

Your task is to tune the hyperparameters of a classification tree. Given that this dataset is imbalanced, you’ll be using the ROC AUC score as a metric instead of accuracy.

We have instantiated a DecisionTreeClassifier with sklearn’s default hyperparameters and assigned it to dt. You can inspect the hyperparameters of dt in your console.

Which of the following is not a hyperparameter of dt?

  • min_impurity_decrease

  • min_weight_fraction_leaf

  • min_features

  • splitter

Well done! There is no hyperparameter named min_features.
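If you’d like to check this yourself, one option (a quick sketch, assuming dt is a default DecisionTreeClassifier as in the exercise) is to list its hyperparameters with .get_params():

    from sklearn.tree import DecisionTreeClassifier

    dt = DecisionTreeClassifier()
    # min_impurity_decrease, min_weight_fraction_leaf and splitter are all listed;
    # min_features is not
    print(sorted(dt.get_params().keys()))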

Set the tree’s hyperparameter grid#

In this exercise, you’ll manually set the grid of hyperparameters that will be used to tune the classification tree dt and find the optimal classifier in the next exercise.

  • Define a grid of hyperparameters corresponding to a Python dictionary called params_dt with:

  • the key ‘max_depth’ set to a list of values 2, 3, and 4
  • the key ‘min_samples_leaf’ set to a list of values 0.12, 0.14, 0.16, 0.18
  • # Define params_dt
    params_dt = {'max_depth':[2,3,4], 'min_samples_leaf':[0.12,0.14,0.16,0.18]}
    

    Great! Next comes performing the grid search.
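    Note that because the min_samples_leaf values are floats, scikit-learn interprets them as fractions of the training samples rather than absolute counts. If you want to see how many candidate models this grid defines, here is a small optional sketch using sklearn’s ParameterGrid (not required by the exercise):

    from sklearn.model_selection import ParameterGrid

    # params_dt from above expands into 3 x 4 = 12 candidate models
    print(len(ParameterGrid(params_dt)))        # 12
    print(list(ParameterGrid(params_dt))[:2])   # first two hyperparameter combinations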

    Search for the optimal tree#

    In this exercise, you’ll perform grid search using 5-fold cross validation to find dt’s optimal hyperparameters. Note that because grid search is an exhaustive process, it may take a lot of time to train the model. Here you’ll only be instantiating the GridSearchCV object without fitting it to the training set. As discussed in the video, you can train such an object in the same way as any other scikit-learn estimator, by using the .fit() method:

    grid_object.fit(X_train, y_train)
    

    An untuned classification tree dt as well as the dictionary params_dt that you defined in the previous exercise are available in your workspace.

  • Import GridSearchCV from sklearn.model_selection.
  • Instantiate a GridSearchCV object using 5-fold CV by setting the parameters:

  • estimator to dt, param_grid to params_dt and
  • scoring to ‘roc_auc’.
  • # edited/added
    from sklearn.tree import DecisionTreeClassifier
    dt = DecisionTreeClassifier()
    
    # Import GridSearchCV
    from sklearn.model_selection import GridSearchCV
    
    # Instantiate grid_dt
    grid_dt = GridSearchCV(estimator=dt,
                           param_grid=params_dt,
                           scoring='roc_auc',
                           cv=5,
                           n_jobs=-1)
    

    Awesome! As we said earlier, we will fit the model to the training data for you and in the next exercise you will compute the test set ROC AUC score.
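    Once grid_dt has been fitted, the outcome of the search can be inspected through its attributes; a short sketch (exact values depend on the data and the CV splits):

    # Inspect the search results after grid_dt.fit(X_train, y_train)
    print(grid_dt.best_params_)   # hyperparameter combination with the highest CV ROC AUC
    print(grid_dt.best_score_)    # the corresponding mean cross-validated ROC AUC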

    Evaluate the optimal tree#

    In this exercise, you’ll evaluate the test set ROC AUC score of grid_dt’s optimal model.

    In order to do so, you will first determine the probability of obtaining the positive label for each test set observation. You can use the method predict_proba() of an sklearn classifier, which returns a 2D array whose columns contain the probabilities of the negative and positive class labels, respectively.
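    The column order of predict_proba() follows the classifier’s .classes_ attribute, so for a 0/1 target the positive class sits in the second column. A self-contained toy sketch, if it helps to see the alignment (X_toy, y_toy and clf are made-up names, not part of the exercise):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    X_toy = np.array([[0.], [1.], [2.], [3.]])
    y_toy = np.array([0, 0, 1, 1])

    clf = DecisionTreeClassifier(max_depth=1).fit(X_toy, y_toy)
    print(clf.classes_)              # [0 1]
    print(clf.predict_proba(X_toy))  # column 0: P(class 0), column 1: P(class 1)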

    The dataset is already loaded and processed for you (numerical features are standardized); it is split into 80% train and 20% test. X_test, y_test are available in your workspace. In addition, we have also loaded the trained GridSearchCV object grid_dt that you instantiated in the previous exercise. Note that grid_dt was trained as follows:

    grid_dt.fit(X_train, y_train)
    
  • Import roc_auc_score from sklearn.metrics.
  • Extract the .best_estimator_ attribute from grid_dt and assign it to best_model.
  • Predict the test set probabilities of obtaining the positive class y_pred_proba.
  • Compute the test set ROC AUC score test_roc_auc of best_model.
  • # edited/added
    grid_dt.fit(X_train, y_train)
    
    ## GridSearchCV(cv=5, estimator=DecisionTreeClassifier(), n_jobs=-1,
    ##              param_grid={'max_depth': [2, 3, 4],
    ##                          'min_samples_leaf': [0.12, 0.14, 0.16, 0.18]},
    ##              scoring='roc_auc')
    
    # Import roc_auc_score from sklearn.metrics
    from sklearn.metrics import roc_auc_score
    
    # Extract the best estimator
    best_model = grid_dt.best_estimator_
    
    # Predict the test set probabilities of the positive class
    y_pred_proba = best_model.predict_proba(X_test)[:, 1]
    
    # Compute test_roc_auc
    test_roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Print test_roc_auc
    print('Test set ROC AUC score: {:.3f}'.format(test_roc_auc))
    
    ## Test set ROC AUC score: 0.700
    

    Great work! An untuned classification tree would achieve a ROC AUC score of only 0.54!
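    For comparison, the untuned baseline mentioned above could be computed along these lines (a sketch; random_state is arbitrary and the exact score depends on the split, but in the course environment it comes out around 0.54):

    # Baseline: an untuned classification tree on the same train/test split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import roc_auc_score

    dt_untuned = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)
    untuned_proba = dt_untuned.predict_proba(X_test)[:, 1]
    print(roc_auc_score(y_test, untuned_proba))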

    Tuning a RF’s Hyperparameters#

    Random forests hyperparameters#

    In the following exercises, you’ll be revisiting the Bike Sharing Demand dataset that was introduced in a previous chapter. Recall that your task is to predict the bike rental demand using historical weather data from the Capital Bikeshare program in Washington, D.C. For this purpose, you’ll be tuning the hyperparameters of a Random Forests regressor.

    We have instantiated a RandomForestRegressor called rf using sklearn’s default hyperparameters. You can inspect the hyperparameters of rf in your console.

    Which of the following is not a hyperparameter of rf?

    • min_weight_fraction_leaf

    • criterion

    • learning_rate

    • warm_start

    Well done! There is no hyperparameter named learning_rate.
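    As before, you can confirm this from the console (a sketch assuming rf is a default RandomForestRegressor). learning_rate is a hyperparameter of boosting models such as GradientBoostingRegressor, not of random forests:

    from sklearn.ensemble import RandomForestRegressor

    rf = RandomForestRegressor()
    print('learning_rate' in rf.get_params())   # False
    # criterion, min_weight_fraction_leaf and warm_start all appear in this list
    print(sorted(rf.get_params().keys()))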

    Set the hyperparameter grid of RF#

    In this exercise, you’ll manually set the grid of hyperparameters that will be used to tune rf’s hyperparameters and find the optimal regressor. For this purpose, you will construct a grid to tune the number of estimators, the maximum number of features used when splitting each node, and the minimum number of samples (or fraction of samples) per leaf.

  • Define a grid of hyperparameters corresponding to a Python dictionary called params_rf with:

  • the key ‘n_estimators’ set to a list of values 100, 350, 500
  • the key ‘max_features’ set to a list of values ‘log2’, ‘auto’, ‘sqrt’
  • the key ‘min_samples_leaf’ set to a list of values 2, 10, 30
  • # Define the dictionary 'params_rf'
    params_rf = {'n_estimators':[100, 350, 500],
                 'max_features':['log2','auto','sqrt'],
                 'min_samples_leaf':[2,10,30]}
    

    Great work! Time to perform the grid search.
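    One practical caveat: in recent scikit-learn releases the ‘auto’ option for max_features was deprecated and later removed, so the grid above may raise an error depending on your installed version (this is an assumption about your environment, not part of the original exercise). A drop-in variant in that case:

    # Variant of params_rf for scikit-learn versions that no longer accept
    # max_features='auto'; keep the original grid if 'auto' still works for you
    params_rf = {'n_estimators': [100, 350, 500],
                 'max_features': ['log2', 1.0, 'sqrt'],   # 1.0 = consider all features
                 'min_samples_leaf': [2, 10, 30]}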

    Search for the optimal forest#

    In this exercise, you’ll perform grid search using 3-fold cross validation to find rf’s optimal hyperparameters. To evaluate each model in the grid, you’ll be using the negative mean squared error metric.

    Note that because grid search is an exhaustive search process, it may take a lot of time to train the model. Here you’ll only be instantiating the GridSearchCV object without fitting it to the training set. As discussed in the video, you can train such an object in the same way as any other scikit-learn estimator, by using the .fit() method:

    grid_object.fit(X_train, y_train)
    

    The untuned random forests regressor model rf as well as the dictionary params_rf that you defined in the previous exercise are available in your workspace.

  • Import GridSearchCV from sklearn.model_selection.
  • Instantiate a GridSearchCV object using 3-fold CV, with negative mean squared error as the scoring metric.
  • # Import GridSearchCV
    from sklearn.model_selection import GridSearchCV
    
    from sklearn.ensemble import RandomForestRegressor
    rf = RandomForestRegressor()
    
    # Instantiate grid_rf
    grid_rf = GridSearchCV(estimator=rf,
                           param_grid=params_rf,
                           scoring='neg_mean_squared_error',
                           cv=3,
                           verbose=1,
                           n_jobs=-1)
    

    Awesome! Next comes evaluating the test set RMSE of the best model.
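    Before turning to the test set, the cross-validation results of grid_rf can already be summarized once it has been fitted (a sketch; these are CV scores on the training set, not test scores). Because the scorer is negative MSE, the sign must be flipped before taking the square root:

    import numpy as np

    # Once grid_rf has been fitted (as in the next exercise):
    print(grid_rf.best_params_)                    # best hyperparameter combination found
    best_cv_rmse = np.sqrt(-grid_rf.best_score_)   # best_score_ is a *negative* MSE
    print('Best CV RMSE: {:.3f}'.format(best_cv_rmse))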

    Evaluate the optimal forest#

    In this last exercise of the course, you’ll evaluate the test set RMSE of grid_rf’s optimal model.

    The dataset is already loaded and processed for you and is split into 80% train and 20% test. X_test, y_test, and the function mean_squared_error from sklearn.metrics (imported under the alias MSE) are available in your environment. In addition, we have also loaded the trained GridSearchCV object grid_rf that you instantiated in the previous exercise. Note that grid_rf was trained as follows:

    grid_rf.fit(X_train, y_train)
    
  • Import mean_squared_error as MSE from sklearn.metrics.
  • Extract the best estimator from grid_rf and assign it to best_model.
  • Predict best_model’s test set labels and assign the result to y_pred.
  • Compute best_model’s test set RMSE.
  • # edited/added
    grid_rf = grid_rf.fit(X_train, y_train)
    
    ## Fitting 3 folds for each of 27 candidates, totalling 81 fits
    
    # Import mean_squared_error from sklearn.metrics as MSE
    from sklearn.metrics import mean_squared_error as MSE
    
    # Extract the best estimator
    best_model = grid_rf.best_estimator_
    
    # Predict test set labels
    y_pred = best_model.predict(X_test)
    
    # Compute rmse_test
    rmse_test = MSE(y_test, y_pred)**0.5
    
    # Print rmse_test
    print('Test RMSE of best model: {:.3f}'.format(rmse_test)) 
    
    ## Test RMSE of best model: 0.405
    

    Magnificent work!

    Congratulations!#


    Congratulations on completing this course!

    How far you have come#

    Take a moment to look at how far you have come! In chapter 1, you started off by understanding and applying the CART algorithm to train decision trees, or CARTs, for problems involving classification and regression. In chapter 2, you understood what the generalization error of a supervised learning model is. In addition, you also learned how underfitting and overfitting can be diagnosed with cross-validation. Furthermore, you learned how model ensembling can produce results that are more robust than individual decision trees. In chapter 3, you applied randomization through bootstrapping and constructed a diverse set of trees in an ensemble through bagging. You also explored how random forests introduce further randomization by sampling features at the level of each node of each tree forming the ensemble. Chapter 4 introduced you to boosting, an ensemble method in which predictors are trained sequentially and each predictor tries to correct the errors made by its predecessor. Specifically, you saw how AdaBoost involves tweaking the weights of the training samples, while gradient boosting involves fitting each tree using the residuals of its predecessor as labels. You also learned how subsampling instances and features can lead to better performance through Stochastic Gradient Boosting. Finally, in chapter 5, you explored hyperparameter tuning through Grid Search cross-validation, and you learned how important it is for getting the most out of your models.

    Thank you!#

    I hope you enjoyed taking this course as much as I enjoyed developing it. Finally, I encourage you to apply the skills you learned by practicing on real-world datasets.