Using XGBoost in pipelines#
Take your XGBoost skills to the next level by incorporating your models into two end-to-end machine learning pipelines. You’ll learn how to tune the most important XGBoost hyperparameters efficiently within a pipeline, and get an introduction to some more advanced preprocessing techniques.
Review of pipelines using sklearn#
Exploratory data analysis#
Before diving into the nitty gritty of pipelines and preprocessing, let’s do some exploratory analysis of the original, unprocessed Ames housing dataset. When you worked with this data in previous chapters, we preprocessed it for you so you could focus on the core XGBoost concepts. In this chapter, you’ll do the preprocessing yourself!
A smaller version of this original, unprocessed dataset has been
pre-loaded into a pandas
DataFrame called df
.
Your task is to explore df
in the Shell and pick the option
that is incorrect. The larger purpose of this exercise
is to understand the kinds of transformations you will need to perform
in order to be able to use XGBoost.
- The mean of the LotArea column is 10516.828082.
- The LotFrontage column has no missing values and its entries are of type float64.
- The standard deviation of the SalePrice column is 79442.502883.
Well done! The LotFrontage
column actually does have
missing values: 259, to be precise. Additionally, notice how columns
such as MSZoning
, PavedDrive
, and
HouseStyle
are categorical. These need to be encoded
numerically before you can use XGBoost. This is what you’ll do in the
coming exercises.
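For reference, here are a few commands you could run in the Shell to verify observations like these. This is just a sketch, assuming the pre-loaded df described above:

# Inspect the unprocessed Ames data
print(df.shape)                                  # rows and columns
print(df.dtypes)                                 # object columns are the categorical ones
print(df.isnull().sum())                         # LotFrontage shows 259 missing values
print(df[["LotArea", "SalePrice"]].describe())   # means, standard deviations, etc.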
Encoding categorical columns I: LabelEncoder#
Now that you’ve seen what will need to be done to get the housing data ready for XGBoost, let’s go through the process step-by-step.
First, you will need to fill in missing values - as you saw previously,
the column LotFrontage
has many missing values. Then, you
will need to encode any categorical columns in the dataset using one-hot
encoding so that they are encoded numerically. You can watch
this
video from
Supervised
Learning with scikit-learn for a refresher on the idea.
The data has five categorical columns: MSZoning
,
PavedDrive
, Neighborhood
,
BldgType
, and HouseStyle
. Scikit-learn has a
LabelEncoder
class that converts the values in each categorical column into
integers. You’ll practice using this here.
- Import LabelEncoder from sklearn.preprocessing.
- Fill in missing values in the LotFrontage column with 0 using .fillna().
- Create a boolean mask for categorical columns by checking whether df.dtypes equals object.
- Create a LabelEncoder object. You can do this in the same way you instantiate any scikit-learn estimator.
- Encode the categorical columns as integers using LabelEncoder(). To do this, use the .fit_transform() method of le in the provided lambda function.
# edited/added
df = pd.read_csv("archive/Extreme-Gradient-Boosting-with-XGBoost/datasets/ames_unprocessed_data.csv")
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder
# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)
# Create a boolean mask for categorical columns
categorical_mask = (df.dtypes == object)
# Get list of categorical column names
categorical_columns = df.columns[categorical_mask].tolist()
# Print the head of the categorical columns
print(df[categorical_columns].head())
## MSZoning Neighborhood BldgType HouseStyle PavedDrive
## 0 RL CollgCr 1Fam 2Story Y
## 1 RL Veenker 1Fam 1Story Y
## 2 RL CollgCr 1Fam 2Story Y
## 3 RL Crawfor 1Fam 2Story Y
## 4 RL NoRidge 1Fam 2Story Y
# Create LabelEncoder object: le
le = LabelEncoder()
# Apply LabelEncoder to categorical columns
df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))
# Print the head of the LabelEncoded categorical columns
print(df[categorical_columns].head())
## MSZoning Neighborhood BldgType HouseStyle PavedDrive
## 0 3 5 0 5 2
## 1 3 24 0 2 2
## 2 3 5 0 5 2
## 3 3 6 0 5 2
## 4 3 15 0 5 2
Well done! Notice how the entries in each categorical column are now encoded numerically. A BldgType of 1Fam is encoded as 0, while a HouseStyle of 2Story is encoded as 5.
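One caveat of reusing a single le inside the lambda is that only the mapping for the last column it was fit on survives in le.classes_. If you want to inspect the mappings later, a minimal alternative sketch (run in place of the lambda above, on the raw string columns) keeps one encoder per column:

# Alternative sketch: one LabelEncoder per raw categorical column, so every
# string-to-integer mapping can be inspected afterwards
from sklearn.preprocessing import LabelEncoder

encoders = {col: LabelEncoder() for col in categorical_columns}
for col, enc in encoders.items():
    df[col] = enc.fit_transform(df[col])

# The integer code of each category is its index in classes_
print(list(enumerate(encoders["Neighborhood"].classes_))[:5])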
Encoding categorical columns II: OneHotEncoder#
Okay - so you have your categorical columns encoded numerically. Can you
now move onto using pipelines and XGBoost? Not yet! In the categorical
columns of this dataset, there is no natural ordering between the
entries. As an example: Using LabelEncoder
, the
CollgCr
Neighborhood
was encoded as
5
, while the Veenker
Neighborhood
was encoded as 24
, and Crawfor
as
6
. Is Veenker
“greater” than
Crawfor
and CollgCr
? No - and allowing the
model to assume this natural ordering may result in poor performance.
As a result, there is another step needed: You have to apply a one-hot encoding to create binary, or “dummy” variables. You can do this using scikit-learn’s OneHotEncoder.
- Import OneHotEncoder from sklearn.preprocessing.
- Create a OneHotEncoder object called ohe, specifying the keyword argument sparse=False. (The original exercise also passed categorical_features=categorical_mask, but that argument has since been removed from scikit-learn; the code below uses categories="auto" instead, which means every column of df gets encoded.)
- Using its .fit_transform() method, apply the OneHotEncoder to df and save the result as df_encoded. The output will be a NumPy array.
- Print the first five rows of df_encoded, and then the shapes of df and df_encoded to compare the difference.
# Import OneHotEncoder
from sklearn.preprocessing import OneHotEncoder
# Create OneHotEncoder: ohe
ohe = OneHotEncoder(categories="auto", sparse=False)
# Apply OneHotEncoder to the whole DataFrame - output is no longer a dataframe: df_encoded
df_encoded = ohe.fit_transform(df)
# Print first 5 rows of the resulting dataset - again, this will no longer be a pandas dataframe
print(df_encoded[:5, :])
## [[0. 0. 0. ... 0. 0. 0.]
## [1. 0. 0. ... 0. 0. 0.]
## [0. 0. 0. ... 0. 0. 0.]
## [0. 0. 0. ... 0. 0. 0.]
## [0. 0. 0. ... 0. 0. 0.]]
# Print the shape of the original DataFrame
print(df.shape)
## (1460, 21)
# Print the shape of the transformed array
print(df_encoded.shape)
## (1460, 3369)
Superb! As you can see, one-hot encoding creates binary variables out of the categorical values. Because the encoder here was applied to every column of df, the column count jumps from 21 to 3369; encoding only the five categorical columns (and leaving the 16 numeric ones alone) would give the 62 columns of the original exercise.
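For comparison, here is a minimal sketch (not part of the exercise) of encoding only the five categorical columns and keeping the numeric ones untouched. It assumes the label-encoded df and categorical_columns from earlier and the same scikit-learn version used above, where sparse=False is still accepted:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One-hot encode only the five categorical columns...
ohe_cat = OneHotEncoder(categories="auto", sparse=False)
cat_dummies = ohe_cat.fit_transform(df[categorical_columns])

# ...and keep the 16 numeric columns as they are
numeric_part = df.drop(columns=categorical_columns).to_numpy()
df_encoded_small = np.concatenate([numeric_part, cat_dummies], axis=1)

print(df_encoded_small.shape)  # expected (1460, 62): 16 numeric + 46 dummy columns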
Encoding categorical columns III: DictVectorizer#
Alright, one final trick before you dive into pipelines. The two step
process you just went through - LabelEncoder
followed by
OneHotEncoder
- can be simplified by using a
DictVectorizer.
Using a DictVectorizer
on a DataFrame that has been
converted to a dictionary allows you to get label encoding as well as
one-hot encoding in one go.
Your task is to work through this strategy in this exercise!
- Import DictVectorizer from sklearn.feature_extraction.
- Convert df into a dictionary called df_dict using its .to_dict() method with "records" as the argument.
- Instantiate a DictVectorizer object called dv with the keyword argument sparse=False.
- Apply the DictVectorizer on df_dict by using its .fit_transform() method.
# Import DictVectorizer
from sklearn.feature_extraction import DictVectorizer
# Convert df into a dictionary: df_dict
df_dict = df.to_dict("records")
# Create the DictVectorizer object: dv
dv = DictVectorizer(sparse=False)
# Apply dv on df: df_encoded
df_encoded = dv.fit_transform(df_dict)
# Print the resulting first five rows
print(df_encoded[:5,:])
## [[3.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 2.000e+00 5.480e+02
## 1.710e+03 1.000e+00 5.000e+00 8.450e+03 6.500e+01 6.000e+01 3.000e+00
## 5.000e+00 5.000e+00 7.000e+00 2.000e+00 0.000e+00 2.085e+05 2.003e+03]
## [3.000e+00 0.000e+00 0.000e+00 1.000e+00 1.000e+00 2.000e+00 4.600e+02
## 1.262e+03 0.000e+00 2.000e+00 9.600e+03 8.000e+01 2.000e+01 3.000e+00
## 2.400e+01 8.000e+00 6.000e+00 2.000e+00 0.000e+00 1.815e+05 1.976e+03]
## [3.000e+00 0.000e+00 1.000e+00 0.000e+00 1.000e+00 2.000e+00 6.080e+02
## 1.786e+03 1.000e+00 5.000e+00 1.125e+04 6.800e+01 6.000e+01 3.000e+00
## 5.000e+00 5.000e+00 7.000e+00 2.000e+00 1.000e+00 2.235e+05 2.001e+03]
## [3.000e+00 0.000e+00 1.000e+00 0.000e+00 1.000e+00 1.000e+00 6.420e+02
## 1.717e+03 0.000e+00 5.000e+00 9.550e+03 6.000e+01 7.000e+01 3.000e+00
## 6.000e+00 5.000e+00 7.000e+00 2.000e+00 1.000e+00 1.400e+05 1.915e+03]
## [4.000e+00 0.000e+00 1.000e+00 0.000e+00 1.000e+00 2.000e+00 8.360e+02
## 2.198e+03 1.000e+00 5.000e+00 1.426e+04 8.400e+01 6.000e+01 3.000e+00
## 1.500e+01 5.000e+00 8.000e+00 2.000e+00 0.000e+00 2.500e+05 2.000e+03]]
# Print the vocabulary
print(dv.vocabulary_)
## {'MSSubClass': 12, 'MSZoning': 13, 'LotFrontage': 11, 'LotArea': 10, 'Neighborhood': 14, 'BldgType': 1, 'HouseStyle': 9, 'OverallQual': 16, 'OverallCond': 15, 'YearBuilt': 20, 'Remodeled': 18, 'GrLivArea': 7, 'BsmtFullBath': 2, 'BsmtHalfBath': 3, 'FullBath': 5, 'HalfBath': 8, 'BedroomAbvGr': 0, 'Fireplaces': 4, 'GarageArea': 6, 'PavedDrive': 17, 'SalePrice': 19}
Fantastic! Besides simplifying the process into one step,
DictVectorizer
has useful attributes such as
vocabulary_
which maps the names of the features to their
indices. With the data preprocessed, it’s time to move onto pipelines!
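Before doing so, one more convenience worth noting, as a sketch rather than something the exercises require: the vectorizer's feature_names_ attribute gives the column order of the encoded array, so you can rebuild a labelled DataFrame from df_encoded and dv if you need one.

# Sketch: rebuild a labelled DataFrame from the encoded array; the column order
# follows dv.feature_names_ (and hence the indices in dv.vocabulary_)
df_encoded_named = pd.DataFrame(df_encoded, columns=dv.feature_names_)
print(df_encoded_named.head())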
Preprocessing within a pipeline#
Now that you’ve seen what steps need to be taken individually to
properly process the Ames housing data, let’s use the much cleaner and
more succinct DictVectorizer
approach and put it alongside
an XGBRegressor
inside of a scikit-learn pipeline.
- Import DictVectorizer from sklearn.feature_extraction and Pipeline from sklearn.pipeline.
- Fill the missing values in the LotFrontage column of X with 0.
- Set up the pipeline using DictVectorizer(sparse=False) for "ohe_onestep" and xgb.XGBRegressor() for "xgb_model".
- Create the pipeline using Pipeline() and steps.
- Fit the Pipeline. Don't forget to convert X into a format that DictVectorizer understands by calling the to_dict("records") method on X.
# Import necessary modules
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)
# Setup the pipeline steps: steps
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor())]
# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)
# Fit the pipeline
xgb_pipeline.fit(X.to_dict("records"), y)
## [15:33:00] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
## Pipeline(steps=[('ohe_onestep', DictVectorizer(sparse=False)),
## ('xgb_model', XGBRegressor())])
Well done! It’s now time to see what it takes to use XGBoost within pipelines.
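Before moving on, note that the fitted pipeline can be used for prediction in the usual way; a minimal sketch, assuming the X used above:

# Sketch: predictions from the fitted pipeline; the same .to_dict("records")
# conversion is needed at prediction time
preds = xgb_pipeline.predict(X.iloc[:5].to_dict("records"))
print(preds)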
Incorporating XGBoost into pipelines#
Cross-validating your XGBoost model#
In this exercise, you’ll go one step further by using the pipeline you’ve created to preprocess and cross-validate your model.
- Create a pipeline called xgb_pipeline using steps.
- Cross-validate the pipeline using cross_val_score(). You'll have to pass in the pipeline, X (as a dictionary, using .to_dict("records")), y, the number of folds you want to use, and scoring ("neg_mean_squared_error").
# Import necessary modules
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# Fill LotFrontage missing values with 0
X.LotFrontage = X.LotFrontage.fillna(0)
# Setup the pipeline steps: steps
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor(max_depth=2, objective="reg:linear"))]
# Create the pipeline: xgb_pipeline
xgb_pipeline = Pipeline(steps)
# Cross-validate the model
cross_val_scores = cross_val_score(xgb_pipeline, X.to_dict("records"), y, cv=10, scoring="neg_mean_squared_error")
## [15:33:03] WARNING: src/objective/regression_obj.cu:152: reg:linear is now deprecated in favor of reg:squarederror.
## (the same warning is printed once for each of the 10 CV folds)
# Print the 10-fold RMSE
print("10-fold RMSE: ", np.mean(np.sqrt(np.abs(cross_val_scores))))
## 10-fold RMSE: 29903.48369050373
Great work!
Kidney disease case study I: Categorical Imputer#
You’ll now continue your exploration of using pipelines with a dataset that requires significantly more wrangling. The chronic kidney disease dataset contains both categorical and numeric features, but contains lots of missing values. The goal here is to predict who has chronic kidney disease given various blood indicators as features.
As Sergey mentioned in the video, you’ll be introduced to a new library,
sklearn_pandas
,
that allows you to chain many more processing steps inside of a pipeline
than are currently supported in scikit-learn. Specifically, you’ll be
able to impute missing categorical values directly using the
CategoricalImputer() class in sklearn_pandas,
and the DataFrameMapper()
class to apply any arbitrary
sklearn-compatible transformer on DataFrame columns, where the resulting
output can be either a NumPy array or DataFrame.
We’ve also created a transformer called a Dictifier
that
encapsulates converting a DataFrame using
.to_dict(“records”)
without you having to do it explicitly
(and so that it works in a pipeline). Finally, we’ve also provided the
list of feature names in kidney_feature_names
, the target
name in kidney_target_name
, the features in X
,
and the target in y
.
In this exercise, your task is to apply the
CategoricalImputer
to impute all of the categorical columns
in the dataset. You can refer to how the numeric imputation mapper was
created as a template. Notice the keyword arguments
input_df=True
and df_out=True
? This is so that
you can work with DataFrames instead of arrays. By default, the
transformers are passed a numpy
array of the selected
columns as input, and as a result, the output of the DataFrame mapper is
also an array. Scikit-learn transformers have historically been designed
to work with numpy
arrays, not pandas
DataFrames, even though their basic indexing interfaces are similar.
- Apply the categorical imputer using DataFrameMapper() and CategoricalImputer() (no arguments need to be passed in). The columns are contained in categorical_columns. Be sure to specify input_df=True and df_out=True, and use category_feature as your iterator variable in the list comprehension.
# edited/added
import pandas as pd
X = pd.read_csv('archive/Extreme-Gradient-Boosting-with-XGBoost/datasets/chronic_kidney_X.csv')
y = pd.read_csv('archive/Extreme-Gradient-Boosting-with-XGBoost/datasets/chronic_kidney_y.csv').to_numpy().ravel()
# Import necessary modules
from sklearn_pandas import DataFrameMapper, CategoricalImputer
from sklearn.impute import SimpleImputer
# Check the number of nulls in each feature column
nulls_per_column = X.isnull().sum()
print(nulls_per_column)
## age 9
## bp 12
## sg 47
## al 46
## su 49
## bgr 44
## bu 19
## sc 17
## sod 87
## pot 88
## hemo 52
## pcv 71
## wc 106
## rc 131
## rbc 152
## pc 65
## pcc 4
## ba 4
## htn 2
## dm 2
## cad 2
## appet 1
## pe 1
## ane 1
## dtype: int64
# Create a boolean mask for categorical columns
categorical_feature_mask = X.dtypes == object
# Get list of categorical column names
categorical_columns = X.columns[categorical_feature_mask].tolist()
# Get list of non-categorical column names
non_categorical_columns = X.columns[~categorical_feature_mask].tolist()
# Apply numeric imputer
numeric_imputation_mapper = DataFrameMapper(
    [([numeric_feature], SimpleImputer(strategy='median'))
     for numeric_feature in non_categorical_columns],
    input_df=True,
    df_out=True
)
# Apply categorical imputer
categorical_imputation_mapper = DataFrameMapper(
    [(category_feature, CategoricalImputer())
     for category_feature in categorical_columns],
    input_df=True,
    df_out=True
)
Great work!
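As a quick sanity check, a sketch that is not part of the exercise: fit-transform the categorical mapper on its own and confirm the missing values are gone, using the X loaded above.

# Sketch: the mapper returns a DataFrame (df_out=True), so nulls are easy to count
imputed_categoricals = categorical_imputation_mapper.fit_transform(X)
print(imputed_categoricals.isnull().sum().sum())  # expected: 0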
Kidney disease case study II: Feature Union#
Having separately imputed numeric as well as categorical columns, your
task is now to use scikit-learn’s
FeatureUnion
to concatenate their results, which are contained in two separate
transformer objects - numeric_imputation_mapper
, and
categorical_imputation_mapper
, respectively.
You may have already encountered FeatureUnion
in
Machine
Learning with the Experts: School Budgets. Just like with pipelines,
you have to pass it a list of (string, transformer)
tuples,
where the first half of each tuple is the name of the transformer.
- Import FeatureUnion from sklearn.pipeline.
- Combine the results of numeric_imputation_mapper and categorical_imputation_mapper using FeatureUnion(), with the names "num_mapper" and "cat_mapper" respectively.
# Import FeatureUnion
from sklearn.pipeline import FeatureUnion
# Combine the numeric and categorical transformations
numeric_categorical_union = FeatureUnion([
    ("num_mapper", numeric_imputation_mapper),
    ("cat_mapper", categorical_imputation_mapper)
])
Great work!
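To see what the union produces, a small sketch using the X loaded above: fit-transform it directly; the numeric columns come first, followed by the imputed categorical ones.

# Sketch: the union stacks the two mappers' outputs side by side as one array
imputed_all = numeric_categorical_union.fit_transform(X)
print(imputed_all.shape)  # one column per original feature (numeric + categorical)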
Kidney disease case study III: Full pipeline#
It’s time to piece together all of the transforms along with an
XGBClassifier
to build the full pipeline!
Besides the numeric_categorical_union
that you created in
the previous exercise, there are two other transforms needed: the
Dictifier()
transform which we created for you, and the
DictVectorizer()
.
After creating the pipeline, your task is to cross-validate it to see how well it performs.
- Create the pipeline using the numeric_categorical_union, Dictifier(), and DictVectorizer(sort=False) transforms, and the xgb.XGBClassifier() estimator with max_depth=3. Name the transforms "featureunion", "dictifier", and "vectorizer", and the estimator "clf".
- Cross-validate the full pipeline using cross_val_score(). Pass it the pipeline, the features X, and the outcomes y. Also set scoring to "roc_auc" and cv to 3.
# edited/added
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import xgboost as xgb
import numpy as np
# Define Dictifier class to turn df into dictionary as part of pipeline
class Dictifier(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        if isinstance(X, pd.DataFrame):
            return X.to_dict("records")
        else:
            return pd.DataFrame(X).to_dict("records")
# Create full pipeline
pipeline = Pipeline([
    ("featureunion", numeric_categorical_union),
    ("dictifier", Dictifier()),
    ("vectorizer", DictVectorizer(sort=False)),
    ("clf", xgb.XGBClassifier(max_depth=3))
])
# Perform cross-validation
cross_val_scores = cross_val_score(pipeline, X, y, scoring='roc_auc', cv=3)
# Print avg. AUC
print("3-fold AUC: ", np.mean(cross_val_scores))
## 3-fold AUC: 0.998637406769937
Great work!
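With cross-validation looking good, the same pipeline can be fit on all of the data and used for prediction. A minimal sketch, assuming the X and y loaded above and the environment used in this chapter:

# Sketch: fit the full preprocessing + XGBoost pipeline on all of the data
pipeline.fit(X, y)
# ...and predict the class for the first few patients
print(pipeline.predict(X.iloc[:5]))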
Tuning XGBoost hyperparameters#
Bringing it all together#
Alright, it’s time to bring together everything you’ve learned so far! In this final exercise of the course, you will combine your work from the previous exercises into one end-to-end XGBoost pipeline to really cement your understanding of preprocessing and pipelines in XGBoost.
Your work from the previous 3 exercises, where you preprocessed the data and set up your pipeline, has been pre-loaded. Your job is to perform a randomized search and identify the best hyperparameters.
- Set up the parameter grid: 'clf__learning_rate' (from 0.05 to 1 in increments of 0.05), 'clf__max_depth' (from 3 to 10 in increments of 1), and 'clf__n_estimators' (from 50 to 200 in increments of 50).
- Using your pipeline as the estimator, perform 2-fold RandomizedSearchCV with an n_iter of 2. Use "roc_auc" as the metric, and set verbose to 1 so the output is more detailed. Store the result in randomized_roc_auc.
- Fit randomized_roc_auc to X and y.
- Compute metrics by printing the best_score_ and best_estimator_ attributes of randomized_roc_auc.
# edited/added
from sklearn.model_selection import RandomizedSearchCV
# Create the parameter grid
gbm_param_grid = {
    'clf__learning_rate': np.arange(0.05, 1, 0.05),
    'clf__max_depth': np.arange(3, 10, 1),
    'clf__n_estimators': np.arange(50, 200, 50)
}
# Perform RandomizedSearchCV
randomized_roc_auc = RandomizedSearchCV(estimator=pipeline,
                                        param_distributions=gbm_param_grid,
                                        n_iter=2, scoring='roc_auc', cv=2, verbose=1)
# Fit the estimator
randomized_roc_auc.fit(X, y)
## Fitting 2 folds for each of 2 candidates, totalling 4 fits
## RandomizedSearchCV(cv=2,
## estimator=Pipeline(steps=[('featureunion',
## FeatureUnion(transformer_list=[('num_mapper',
## DataFrameMapper(df_out=True,
## features=[(['age'],
## SimpleImputer(strategy='median')),
## (['bp'],
## SimpleImputer(strategy='median')),
## (['sg'],
## SimpleImputer(strategy='median')),
## (['al'],
## SimpleImputer(strategy='median')),
## (['su'],
## SimpleImputer(strategy='...
## input_df=True))])),
## ('dictifier', Dictifier()),
## ('vectorizer',
## DictVectorizer(sort=False)),
## ('clf', XGBClassifier())]),
## n_iter=2,
## param_distributions={'clf__learning_rate': array([0.05, 0.1 , 0.15, 0.2 , 0.25, 0.3 , 0.35, 0.4 , 0.45, 0.5 , 0.55,
## 0.6 , 0.65, 0.7 , 0.75, 0.8 , 0.85, 0.9 , 0.95]),
## 'clf__max_depth': array([3, 4, 5, 6, 7, 8, 9]),
## 'clf__n_estimators': array([ 50, 100, 150])},
## scoring='roc_auc', verbose=1)
# Compute metrics
print(randomized_roc_auc.best_score_)
## 0.9969066666666666
print(randomized_roc_auc.best_estimator_)
## Pipeline(steps=[('featureunion',
## FeatureUnion(transformer_list=[('num_mapper',
## DataFrameMapper(df_out=True,
## features=[(['age'],
## SimpleImputer(strategy='median')),
## (['bp'],
## SimpleImputer(strategy='median')),
## (['sg'],
## SimpleImputer(strategy='median')),
## (['al'],
## SimpleImputer(strategy='median')),
## (['su'],
## SimpleImputer(strategy='median')),
## (['bgr'],
## SimpleImputer(s...
## CategoricalImputer()),
## ('htn',
## CategoricalImputer()),
## ('dm',
## CategoricalImputer()),
## ('cad',
## CategoricalImputer()),
## ('appet',
## CategoricalImputer()),
## ('pe',
## CategoricalImputer()),
## ('ane',
## CategoricalImputer())],
## input_df=True))])),
## ('dictifier', Dictifier()),
## ('vectorizer', DictVectorizer(sort=False)),
## ('clf',
## XGBClassifier(learning_rate=0.4, max_depth=7,
## n_estimators=150))])
Amazing work! This type of pipelining is very common in real-world data science and you’re well on your way towards mastering it.
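Once the search has finished, the winning configuration is easy to pull out for reuse; a short sketch:

# Best hyperparameter combination found by the randomized search
print(randomized_roc_auc.best_params_)
# best_estimator_ is a fully refitted pipeline that can be used directly
best_pipeline = randomized_roc_auc.best_estimator_
print(best_pipeline.predict(X.iloc[:5]))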
Final Thoughts#
Congratulations on completing this course. Let’s go over everything we’ve covered in this course, as well as where you can go from here with learning other topics related to XGBoost that we didn’t have a chance to cover.
What We Have Covered And You Have Learned#
So, what have we been able to cover in this course? Well, we’ve learned how to use XGBoost for both classification and regression tasks. We’ve also covered all the most important hyperparameters that you should tune when creating XGBoost models, so that they are as performant as possible. And we just finished up how to incorporate XGBoost into pipelines, and used some more advanced functions that allow us to seamlessly work with Pandas DataFrames and scikit-learn. That’s quite a lot of ground we’ve covered and you should be proud of what you’ve been able to accomplish.
What We Have Not Covered (And How You Can Proceed)#
However, although we’ve covered quite a lot, we didn’t cover some other topics that would advance your mastery of XGBoost. Specifically, we never looked into how to use XGBoost for ranking or recommendation problems, which can be done by modifying the loss function you use when constructing your model. We also didn’t look into more advanced hyperparameter selection strategies. The most powerful strategy, called Bayesian optimization, has been used with lots of success, and entire companies have been created just for specifically using this method in tuning models (for example, the company sigopt does exactly this). It’s a powerful method, but would take an entire other DataCamp course to teach properly! Finally, we haven’t talked about ensembling XGBoost with other models. Although XGBoost is itself an ensemble method, nothing stops you from combining the predictions you get from an XGBoost model with other models, as this is usually a very powerful additional way to squeeze the last bit of juice from your data. Learning about all of these additional topics will help you become an even more powerful user of XGBoost. Now that you know your way around the package, there’s no reason for you to stop learning how to get even more benefits out of it.
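As a flavour of that last idea, here is a minimal, self-contained sketch (not from the course, using a synthetic dataset) of blending an XGBoost classifier with a logistic regression by averaging their predicted probabilities:

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data just for illustration
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

xgb_clf = xgb.XGBClassifier(max_depth=3).fit(X_tr, y_tr)
lr_clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Simple 50/50 blend of the two models' predicted probabilities
blend = 0.5 * xgb_clf.predict_proba(X_te)[:, 1] + 0.5 * lr_clf.predict_proba(X_te)[:, 1]
print(roc_auc_score(y_te, blend))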
Congratulations!#
I hope you’ve enjoyed taking this course on XGBoost as I have teaching it. Please let us know if you’ve enjoyed the course and definitely let me know how I can improve it. It’s been a pleasure, and I hope you continue your data science journey from here!