Classification with XGBoost#
This chapter will introduce you to the fundamental idea behind XGBoost: boosted learners. Once you understand how XGBoost works, you’ll apply it to a common classification problem found in industry: predicting whether a customer will churn, that is, stop using a service at some point in the future.
Welcome to the course!#
Which of these is a classification problem?#
Given below are 4 potential machine learning problems you might encounter in the wild. Pick the one that is a classification problem.
Which of these is a binary classification problem?#
Great! A classification problem involves predicting the category a given data point belongs to out of a finite set of possible categories. Depending on how many possible categories there are to predict, a classification problem can be either binary or multi-class. Let’s do another quick refresher here. Your job is to pick the binary classification problem out of the following list of supervised learning problems.
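To make the distinction concrete, here is a minimal illustration using made-up label arrays (the values are invented purely for illustration):
import numpy as np
y_binary = np.array([0, 1, 1, 0, 1])   # only two possible categories -> binary classification
y_multi = np.array([0, 2, 1, 3, 2])    # more than two categories -> multi-class classification
print(len(np.unique(y_binary)), len(np.unique(y_multi)))  # 2 vs. 4 distinct classes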
Introducing XGBoost#
XGBoost: Fit/Predict#
It’s time to create your first XGBoost model! As Sergey showed you in the video, you can use the scikit-learn .fit()/.predict() paradigm that you are already familiar with to build your XGBoost models, as the xgboost library has a scikit-learn compatible API!
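Before the full exercise, here is a minimal sketch of that paradigm on a synthetic toy dataset (the make_classification data and its settings are illustrative assumptions, not part of the exercise):
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import xgboost as xgb
# Build a small synthetic binary classification problem
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
# Same .fit()/.predict() calls you would use with any scikit-learn estimator
clf = xgb.XGBClassifier(n_estimators=10, objective="binary:logistic")
clf.fit(X_tr, y_tr)
print(clf.predict(X_te)[:5])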
Here, you’ll be working with churn data. This dataset contains imaginary data from a ride-sharing app with user behaviors over their first month of app usage in a set of imaginary cities, as well as whether they used the service 5 months after sign-up. It has been pre-loaded for you into a DataFrame called churn_data - explore it in the Shell!

Your goal is to use the first month’s worth of data to predict whether the app’s users will remain users of the service at the 5-month mark. This is a typical setup for a churn prediction problem. To do this, you’ll split the data into training and test sets, fit a small xgboost model on the training set, and evaluate its performance on the test set by computing its accuracy.
pandas and numpy have been imported as pd and np, and train_test_split has been imported from sklearn.model_selection. Additionally, the arrays for the features and the target have been created as X and y.
- Import xgboost as xgb.
- Create training and test sets such that 20% of the data is used for testing, using a random_state of 123.
- Instantiate an XGBClassifier as xg_cl using xgb.XGBClassifier(). Specify n_estimators to be 10 estimators and an objective of ‘binary:logistic’. Do not worry about what this means just yet, you will learn about these parameters later in this course.
- Fit xg_cl to the training set (X_train, y_train) using the .fit() method.
- Predict the labels of the test set (X_test) using the .predict() method and hit ‘Submit Answer’ to print the accuracy.
# edited/added
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
churn_data = pd.read_csv("archive/Extreme-Gradient-Boosting-with-XGBoost/datasets/churn_data.csv")
# import xgboost
import xgboost as xgb
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Instantiate the XGBClassifier: xg_cl
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
# Fit the classifier to the training set
xg_cl.fit(X_train, y_train)
## XGBClassifier(n_estimators=10, seed=123)
# Predict the labels of the test set: preds
preds = xg_cl.predict(X_test)
# Compute the accuracy: accuracy
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))
## accuracy: 0.743300
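As an aside, the manual accuracy computation above is equivalent to scikit-learn’s accuracy_score; a minimal sketch, assuming preds and y_test from the code above are still in scope:
from sklearn.metrics import accuracy_score
# Same result as the np.sum(...)/shape[0] computation above
print("accuracy: %f" % accuracy_score(y_test, preds))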
Well done! Your model has an accuracy of around 74%. In Chapter 3, you’ll learn about ways to fine-tune your XGBoost models. For now, let’s refresh our memories on how decision trees work. See you in the next video!
What is a decision tree?#
Decision trees#
Your task in this exercise is to make a simple decision tree using scikit-learn’s DecisionTreeClassifier on the breast cancer dataset that comes pre-loaded with scikit-learn.

This dataset contains numeric measurements of various dimensions of individual tumors (such as perimeter and texture) from breast biopsies and a single outcome value (the tumor is either malignant or benign).

We’ve preloaded the dataset of samples (measurements) into X and the target values per tumor into y. Now, you have to split the complete dataset into training and testing sets, and then train a DecisionTreeClassifier. You’ll specify a parameter called max_depth. Many other parameters can be modified within this model, and you can check all of them out here.
- Import train_test_split from sklearn.model_selection and DecisionTreeClassifier from sklearn.tree.
- Create training and test sets such that 20% of the data is used for testing, using a random_state of 123.
- Instantiate a DecisionTreeClassifier called dt_clf_4 with a max_depth of 4. This parameter specifies the maximum number of successive split points you can have before reaching a leaf node (illustrated in the sketch after the code below).
# edited/added
breast_cancer = pd.read_csv("archive/Extreme-Gradient-Boosting-with-XGBoost/datasets/breast_cancer.csv")
X = breast_cancer.iloc[:,2:].to_numpy()
y = np.array([0 if i == "M" else 1 for i in breast_cancer.iloc[:,1]])
# Import the necessary modules
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
# Create the training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# Instantiate the classifier: dt_clf_4
dt_clf_4 = DecisionTreeClassifier(max_depth=4)
# Fit the classifier to the training set
dt_clf_4.fit(X_train, y_train)
## DecisionTreeClassifier(max_depth=4)
# Predict the labels of the test set: y_pred_4
y_pred_4 = dt_clf_4.predict(X_test)
# Compute the accuracy of the predictions: accuracy
accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
print("accuracy:", accuracy)
## accuracy: 0.9736842105263158
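To see how max_depth caps the number of successive splits, you can inspect the fitted tree; a minimal sketch, assuming the fitted dt_clf_4 from above is still in scope:
from sklearn.tree import export_text
# Actual depth reached by the fitted tree (at most the max_depth of 4)
print(dt_clf_4.get_depth())
# Text rendering of the top of the tree (truncated to its first two levels)
print(export_text(dt_clf_4, max_depth=2))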
Great work! It’s now time to learn about what gives XGBoost its state-of-the-art performance: Boosting.
What is Boosting?#
Measuring accuracy#
You’ll now practice using XGBoost’s learning API through its baked-in cross-validation capabilities. As Sergey discussed in the previous video, XGBoost gets its lauded performance and efficiency gains by utilizing its own optimized data structure for datasets, called a DMatrix.
In the previous exercise, the input datasets were converted into DMatrix data on the fly, but when you use xgboost’s cv() function, you have to first explicitly convert your data into a DMatrix. So, that’s what you will do here before running cross-validation on churn_data.
- Create a DMatrix called churn_dmatrix from churn_data using xgb.DMatrix(). The features are available in X and the labels in y.
- Perform 3-fold cross-validation by calling xgb.cv(). dtrain is your churn_dmatrix, params is your parameter dictionary, nfold is the number of cross-validation folds (3), num_boost_round is the number of trees we want to build (5), and metrics is the metric you want to compute (this will be “error”, which we will convert to an accuracy).
# Create arrays for the features and the target: X, y
X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]
# Create the DMatrix from X and y: churn_dmatrix
churn_dmatrix = xgb.DMatrix(data=X, label=y)
# Create the parameter dictionary: params
params = {"objective":"reg:logistic", "max_depth":3}
# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params,
                    nfold=3, num_boost_round=5,
                    metrics="error", as_pandas=True, seed=123)
# Print cv_results
print(cv_results)
##    train-error-mean  train-error-std  test-error-mean  test-error-std
## 0           0.28232         0.002366          0.28378        0.001932
## 1           0.26951         0.001855          0.27190        0.001932
## 2           0.25605         0.003213          0.25798        0.003963
## 3           0.25090         0.001845          0.25434        0.003827
## 4           0.24654         0.001981          0.24852        0.000934
# Print the accuracy
print(((1-cv_results["test-error-mean"]).iloc[-1]))
## 0.75148
Nice work. cv_results stores the training and test mean and standard deviation of the error per boosting round (tree built) as a DataFrame. From cv_results, the final round ‘test-error-mean’ is extracted and converted into an accuracy, where accuracy is 1 - error. The final accuracy of around 75% is an improvement from earlier!
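Since cv_results is a regular DataFrame, you can turn the error columns into accuracies for every boosting round, not just the final one; a small sketch, assuming cv_results from the cross-validation above:
# One accuracy value per boosting round, for both training and test folds
print(1 - cv_results[["train-error-mean", "test-error-mean"]])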
Measuring AUC#
Now that you’ve used cross-validation to compute average out-of-sample accuracy (after converting from an error), it’s very easy to compute any other metric you might be interested in. All you have to do is pass it (or a list of metrics) in as an argument to the metrics parameter of xgb.cv().

Your job in this exercise is to compute another common metric used in binary classification - the area under the curve (“auc”).
As before, churn_data is available in your workspace, along with the DMatrix churn_dmatrix and parameter dictionary params.

- Perform 3-fold cross-validation with 5 boosting rounds and “auc” as your metric.
- Print the “test-auc-mean” column of cv_results.
# Perform cross_validation: cv_results
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params,
                    nfold=3, num_boost_round=5,
                    metrics="auc", as_pandas=True, seed=123)
# Print cv_results
print(cv_results)
##    train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
## 0        0.768893       0.001544       0.767863      0.002820
## 1        0.790864       0.006758       0.789157      0.006846
## 2        0.815872       0.003900       0.814476      0.005997
## 3        0.822959       0.002018       0.821682      0.003912
## 4        0.827528       0.000769       0.826191      0.001937
# Print the AUC
print((cv_results["test-auc-mean"]).iloc[-1])
## 0.826191
Fantastic! An AUC of around 0.83 is quite strong. As you have seen, XGBoost’s learning API makes it very easy to compute any metric you may be interested in. In Chapter 3, you’ll learn about techniques to fine-tune your XGBoost models to improve their performance even further. For now, it’s time to learn a little about exactly when to use XGBoost.
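Before moving on, note that the metrics argument also accepts a list, so a single cross-validation run can track several metrics at once; a hedged sketch, reusing churn_dmatrix and params from above:
# Track both classification error and AUC in one run
cv_multi = xgb.cv(dtrain=churn_dmatrix, params=params,
                  nfold=3, num_boost_round=5,
                  metrics=["error", "auc"], as_pandas=True, seed=123)
print(cv_multi.columns.tolist())  # contains both error and auc columns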
When should I use XGBoost?#
Using XGBoost#
XGBoost is a powerful library that scales very well to many samples and works for a variety of supervised learning problems. That said, as Sergey described in the video, you shouldn’t always pick it as your default machine learning library when starting a new project, since there are some situations in which it is not the best option. In this exercise, your job is to consider the examples below and select the one that would be the best use of XGBoost.