Fine-tuning your XGBoost model#

This chapter will teach you how to make your XGBoost models as performant as possible. You’ll learn about the variety of parameters that can be adjusted to alter the behavior of XGBoost and how to tune them efficiently so that you can supercharge the performance of your models.

Why tune your model?#

When is tuning your model a bad idea?#

Now that you’ve seen the effect that tuning has on the overall performance of your XGBoost model, let’s turn the question on its head and see if you can figure out when tuning your model might not be the best idea. Given that model tuning can be time-intensive and complicated, which of the following scenarios would NOT call for careful tuning of your model?

  • You have lots of examples in your dataset and a very large number of features at your disposal.
  • You are very short on time before you must push an initial model to production and have little data to train your model on.
  • You have access to a multi-core (64 cores) server with lots of memory (200GB RAM) and no time constraints.
  • You must squeeze every last bit of performance out of your xgboost model.
  • Yup! The second scenario is the answer: you cannot tune if you do not have time!

    Tuning the number of boosting rounds#

    Let’s start with parameter tuning by seeing how the number of boosting rounds (number of trees you build) impacts the out-of-sample performance of your XGBoost model. You’ll use xgb.cv() inside a for loop and build one model per num_boost_round value.

    Here, you’ll continue working with the Ames housing dataset. The features are available in the array X, and the target vector is contained in y.

  • Create a DMatrix called housing_dmatrix from X and y.
  • Create a parameter dictionary called params, passing in the appropriate “objective” (“reg:linear”) and “max_depth” (set it to 3).
  • Iterate over num_rounds inside a for loop and perform 3-fold cross-validation. In each iteration of the loop, pass in the current number of boosting rounds (curr_num_rounds) to xgb.cv() as the argument to num_boost_round.
  • Append the final boosting round RMSE for each cross-validated XGBoost model to the final_rmse_per_round list.
  • num_rounds and final_rmse_per_round have been zipped and converted into a DataFrame so you can easily see how the model performs with each number of boosting rounds.
  • # Create the DMatrix: housing_dmatrix
    housing_dmatrix = xgb.DMatrix(data=X, label=y)
    
    # Create the parameter dictionary for each tree: params 
    params = {"objective":"reg:linear", "max_depth":3}
    
    # Create list of number of boosting rounds
    num_rounds = [5, 10, 15]
    
    # Empty list to store final round rmse per XGBoost model
    final_rmse_per_round = []
    
    # Iterate over num_rounds and build one model per num_boost_round parameter
    for curr_num_rounds in num_rounds:
    
        # Perform cross-validation: cv_results
        cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, num_boost_round=curr_num_rounds, metrics="rmse", as_pandas=True, seed=123)
        
        # Append final round RMSE
        final_rmse_per_round.append(cv_results["test-rmse-mean"].tail().values[-1])
        
    # Print the resultant DataFrame
    
    ## [15:32:32] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    ## (warning repeated for each cross-validation fold and each call to xgb.cv)
    
    num_rounds_rmses = list(zip(num_rounds, final_rmse_per_round))
    print(pd.DataFrame(num_rounds_rmses,columns=["num_boosting_rounds","rmse"]))
    
    ##    num_boosting_rounds          rmse
    ## 0                    5  50903.299479
    ## 1                   10  34774.191406
    ## 2                   15  32895.098307
    

    Awesome! As you can see, increasing the number of boosting rounds decreases the RMSE.
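
    If you want to visualize the trend, here is a minimal sketch (not part of the exercise) that plots the values collected above; it assumes matplotlib is installed, which the exercise itself does not require:

    # Optional visual check of the trend (assumes matplotlib is available)
    import matplotlib.pyplot as plt

    plt.plot(num_rounds, final_rmse_per_round, marker="o")
    plt.xlabel("num_boost_round")
    plt.ylabel("final test RMSE (CV mean)")
    plt.title("RMSE vs. number of boosting rounds")
    plt.show()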

    Automated boosting round selection using early_stopping#

    Now, instead of attempting to cherry-pick the best possible number of boosting rounds, you can easily have XGBoost select the number of boosting rounds for you automatically within xgb.cv(). This is done using a technique called early stopping.

    Early stopping works by testing the XGBoost model against a hold-out set after every boosting round and halting the creation of additional boosting rounds (thereby finishing training early) if the hold-out metric (“rmse” in our case) does not improve for a given number of rounds. Here you will use the early_stopping_rounds parameter in xgb.cv() with a large possible number of boosting rounds (50). Bear in mind that if the hold-out metric keeps improving all the way up to num_boost_round, early stopping never triggers.

    Here, the DMatrix and parameter dictionary have been created for you. Your task is to use cross-validation with early stopping. Go for it!

  • Perform 3-fold cross-validation with early stopping and “rmse” as your metric. Use 10 early stopping rounds and 50 boosting rounds. Specify a seed of 123 and make sure the output is a pandas DataFrame. Remember to specify the other parameters such as dtrain, params, and metrics.
  • Print cv_results.
  • # Create your housing DMatrix: housing_dmatrix
    housing_dmatrix = xgb.DMatrix(data=X,label=y)
    
    # Create the parameter dictionary for each tree: params
    params = {"objective":"reg:linear", "max_depth":4}
    
    # Perform cross-validation with early stopping: cv_results
    cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3, num_boost_round=50, early_stopping_rounds=10, metrics="rmse", as_pandas=True, seed=123)
    
    # Print cv_results
    
    ## [15:32:35] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    ## (warning repeated for each cross-validation fold)
    
    print(cv_results)
    
    ##     train-rmse-mean  train-rmse-std  test-rmse-mean  test-rmse-std
    ## 0     141871.630208      403.632409   142640.630208     705.552907
    ## 1     103057.033854       73.787612   104907.677083     111.124997
    ## 2      75975.958333      253.705643    79262.057292     563.761707
    ## 3      57420.515625      521.666323    61620.138021    1087.681933
    ## 4      44552.960938      544.168971    50437.558594    1846.450522
    ## 5      35763.942708      681.796885    43035.660156    2034.476339
    ## 6      29861.469401      769.567549    38600.881511    2169.803563
    ## 7      25994.679036      756.524834    36071.816407    2109.801581
    ## 8      23306.832031      759.237670    34383.183594    1934.542189
    ## 9      21459.772786      745.623841    33509.141927    1887.374589
    ## 10     20148.728516      749.612756    32916.806641    1850.890045
    ## 11     19215.382162      641.387202    32197.834635    1734.459068
    ## 12     18627.391276      716.256399    31770.848958    1802.156167
    ## 13     17960.697265      557.046469    31482.781901    1779.126300
    ## 14     17559.733724      631.413289    31389.990234    1892.321401
    ## 15     17205.712891      590.168517    31302.885417    1955.164927
    ## 16     16876.571615      703.636538    31234.060547    1880.707358
    ## 17     16597.666992      703.677646    31318.347656    1828.860164
    ## 18     16330.460612      607.275030    31323.636719    1775.911103
    ## 19     16005.972331      520.472435    31204.138021    1739.073743
    ## 20     15814.299479      518.603218    31089.865885    1756.024090
    ## 21     15493.405924      505.617405    31047.996094    1624.672630
    ## 22     15270.733724      502.021346    31056.920573    1668.036788
    ## 23     15086.381836      503.910642    31024.981120    1548.988924
    ## 24     14917.606445      486.208398    30983.680990    1663.131129
    ## 25     14709.591797      449.666844    30989.479818    1686.664414
    ## 26     14457.285156      376.785590    30952.116536    1613.170520
    ## 27     14185.567708      383.100492    31066.899088    1648.531897
    ## 28     13934.065104      473.464919    31095.643880    1709.226491
    ## 29     13749.646485      473.671156    31103.885417    1778.882817
    ## 30     13549.837891      454.900755    30976.083984    1744.514903
    ## 31     13413.480469      399.601066    30938.469401    1746.051298
    ## 32     13275.916341      415.404898    30931.000651    1772.471473
    ## 33     13085.878906      493.793750    30929.056640    1765.541487
    ## 34     12947.182292      517.789542    30890.625651    1786.510889
    ## 35     12846.026367      547.731831    30884.489583    1769.731829
    ## 36     12702.380534      505.522036    30833.541667    1690.999881
    ## 37     12532.243815      508.298122    30856.692709    1771.447014
    ## 38     12384.056641      536.224879    30818.013672    1782.783623
    ## 39     12198.445312      545.165866    30839.394531    1847.325690
    ## 40     12054.582682      508.840691    30776.964844    1912.779519
    ## 41     11897.033528      477.177882    30794.703776    1919.677255
    ## 42     11756.221354      502.993261    30780.961589    1906.820582
    ## 43     11618.846029      519.835813    30783.754557    1951.258396
    ## 44     11484.081380      578.429092    30776.734375    1953.449992
    ## 45     11356.550781      565.367451    30758.544271    1947.456794
    ## 46     11193.557292      552.298192    30729.973307    1985.701585
    ## 47     11071.317383      604.088404    30732.662760    1966.997355
    ## 48     10950.777018      574.864279    30712.243490    1957.751584
    ## 49     10824.865885      576.664748    30720.852214    1950.513825
    

    Great work!
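
    If you want to read the result of early stopping off of cv_results programmatically, here is a minimal sketch (not part of the exercise) using plain pandas operations on the DataFrame returned above:

    # Locate the boosting round with the lowest mean hold-out RMSE
    best_round = cv_results["test-rmse-mean"].idxmin()
    best_rmse = cv_results["test-rmse-mean"].min()

    print("Best round:", best_round)
    print("Best test RMSE:", best_rmse)
    print("Rounds actually run:", len(cv_results))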

    Overview of XGBoost’s hyperparameters#
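
    The exercises below vary one tree-booster hyperparameter at a time. As a quick reference, here is an illustrative parameter dictionary collecting the hyperparameters touched on in this chapter; the values are placeholders for demonstration, not tuned recommendations:

    # Illustrative only: common tree-booster hyperparameters covered in this chapter
    example_params = {
        "objective": "reg:linear",   # regression objective (deprecated alias of reg:squarederror)
        "eta": 0.1,                  # learning rate: shrinks the contribution of each new tree
        "max_depth": 3,              # maximum depth of each tree
        "colsample_bytree": 0.8,     # fraction of features sampled for each tree
        "subsample": 0.9,            # fraction of training rows used per boosting round
    }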

    Tuning eta#

    It’s time to practice tuning other XGBoost hyperparameters in earnest and observing their effect on model performance! You’ll begin by tuning “eta”, also known as the learning rate.

    The learning rate in XGBoost is a parameter that can range between 0 and 1. It scales (shrinks) the contribution of each new tree, so lower values of “eta” make the boosting process more conservative, acting as a stronger form of regularization that typically requires more boosting rounds to reach the same fit.

  • Create a list called eta_vals to store the following “eta” values: 0.001, 0.01, and 0.1.
  • Iterate over your eta_vals list using a for loop.
  • In each iteration of the for loop, set the “eta” key of params to be equal to curr_val. Then, perform 3-fold cross-validation with early stopping (5 rounds), 10 boosting rounds, a metric of “rmse”, and a seed of 123. Ensure the output is a DataFrame.
  • Append the final round RMSE to the best_rmse list.
  • # Create your housing DMatrix: housing_dmatrix
    housing_dmatrix = xgb.DMatrix(data=X, label=y)
    
    # Create the parameter dictionary for each tree (boosting round)
    params = {"objective":"reg:linear", "max_depth":3}
    
    # Create list of eta values and empty list to store final round rmse per xgboost model
    eta_vals = [0.001, 0.01, 0.1]
    best_rmse = []
    
    # Systematically vary the eta
    for curr_val in eta_vals:
    
        params["eta"] = curr_val
        
        # Perform cross-validation: cv_results
        cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=3,
                            num_boost_round=10, early_stopping_rounds=5,
                            metrics="rmse", as_pandas=True, seed=123)
        
        # Append the final round rmse to best_rmse
        best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
    
    # Print the resultant DataFrame
    
    ## [15:32:39] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    ## (warning repeated for each cross-validation fold and each call to xgb.cv)
    
    print(pd.DataFrame(list(zip(eta_vals, best_rmse)), columns=["eta","best_rmse"]))
    
    ##      eta      best_rmse
    ## 0  0.001  195736.406250
    ## 1  0.010  179932.161458
    ## 2  0.100   79759.401041
    

    Great work!

    Tuning max_depth#

    In this exercise, your job is to tune max_depth, which is the parameter that dictates the maximum depth that each tree in a boosting round can grow to. Smaller values will lead to shallower trees, and larger values to deeper trees.

  • Create a list called max_depths to store the following “max_depth” values: 2, 5, 10, and 20.
  • Iterate over your max_depths list using a for loop.
  • Systematically vary “max_depth” in each iteration of the for loop and perform 2-fold cross-validation with early stopping (5 rounds), 10 boosting rounds, a metric of “rmse”, and a seed of 123. Ensure the output is a DataFrame.
  • # Create your housing DMatrix: housing_dmatrix
    housing_dmatrix = xgb.DMatrix(data=X,label=y)
    
    # Create the parameter dictionary
    params = {"objective":"reg:linear"}
    
    # Create list of max_depth values
    max_depths = [2, 5, 10, 20]
    best_rmse = []
    
    # Systematically vary the max_depth
    for curr_val in max_depths:
    
        params["max_depth"] = curr_val
        
        # Perform cross-validation
        cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                     num_boost_round=10, early_stopping_rounds=5,
                     metrics="rmse", as_pandas=True, seed=123)
        
        # Append the final round rmse to best_rmse
        best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
    
    # Print the resultant DataFrame
    
    ## [15:32:41] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    ## (warning repeated for each cross-validation fold and each call to xgb.cv)
    
    print(pd.DataFrame(list(zip(max_depths, best_rmse)),columns=["max_depth","best_rmse"]))
    
    ##    max_depth     best_rmse
    ## 0          2  37957.476562
    ## 1          5  35596.599610
    ## 2         10  36065.537110
    ## 3         20  36739.574219
    

    Great work!

    Tuning colsample_bytree#

    Now, it’s time to tune “colsample_bytree”. You’ve already seen a closely related idea if you’ve ever worked with scikit-learn’s RandomForestClassifier or RandomForestRegressor, where it is called max_features. Both parameters limit how many features a tree can use, though they act at slightly different points: max_features restricts the features considered at each split, whereas “colsample_bytree” specifies the fraction of features (columns) sampled once for each tree that is built. In xgboost, colsample_bytree must be specified as a float between 0 and 1.

  • Create a list called colsample_bytree_vals to store the values 0.1, 0.5, 0.8, and 1.
  • Systematically vary “colsample_bytree” and perform cross-validation, exactly as you did with max_depth and eta previously.
  • # Create your housing DMatrix
    housing_dmatrix = xgb.DMatrix(data=X,label=y)
    
    # Create the parameter dictionary
    params={"objective":"reg:linear","max_depth":3}
    
    # Create list of hyperparameter values
    colsample_bytree_vals = [0.1, 0.5, 0.8, 1]
    best_rmse = []
    
    # Systematically vary the hyperparameter value 
    for curr_val in colsample_bytree_vals:
    
        params["colsample_bytree"] = curr_val
        
        # Perform cross-validation
        cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                     num_boost_round=10, early_stopping_rounds=5,
                     metrics="rmse", as_pandas=True, seed=123)
        
        # Append the final round rmse to best_rmse
        best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])
    
    # Print the resultant DataFrame
    
    ## [15:32:43] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
    ## (warning repeated for each cross-validation fold and each call to xgb.cv)
    
    print(pd.DataFrame(list(zip(colsample_bytree_vals, best_rmse)), columns=["colsample_bytree","best_rmse"]))
    
    ##    colsample_bytree     best_rmse
    ## 0               0.1  51386.587890
    ## 1               0.5  36585.345703
    ## 2               0.8  36093.660157
    ## 3               1.0  35836.042968
    

    Awesome! There are several other individual parameters that you can tune, such as “subsample”, which dictates the fraction of the training data that is used during any given boosting round. Next up: Grid Search and Random Search to tune XGBoost hyperparameters more efficiently!
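
    Before moving on, here is a minimal sketch of tuning “subsample” with the same loop pattern used above. It assumes X, y, xgb, and pd are already loaded as in the earlier exercises, and the candidate values are illustrative rather than taken from an exercise:

    # Sketch: vary "subsample" exactly as max_depth and colsample_bytree were varied above
    housing_dmatrix = xgb.DMatrix(data=X, label=y)
    params = {"objective": "reg:linear", "max_depth": 3}

    subsample_vals = [0.3, 0.6, 0.9, 1]   # illustrative candidate values
    best_rmse = []

    for curr_val in subsample_vals:
        params["subsample"] = curr_val
        cv_results = xgb.cv(dtrain=housing_dmatrix, params=params, nfold=2,
                            num_boost_round=10, early_stopping_rounds=5,
                            metrics="rmse", as_pandas=True, seed=123)
        best_rmse.append(cv_results["test-rmse-mean"].tail().values[-1])

    print(pd.DataFrame(list(zip(subsample_vals, best_rmse)),
                       columns=["subsample", "best_rmse"]))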