Classification with XGBoost#

This chapter will introduce you to the fundamental idea behind XGBoost: boosted learners. Once you understand how XGBoost works, you’ll apply it to a common classification problem found in industry: churn prediction, i.e. predicting whether a customer will stop using a service at some point in the future.

Welcome to the course!#

Which of these is a classification problem?#

Given below are 4 potential machine learning problems you might encounter in the wild. Pick the one that is a classification problem.

  • Given past performance of stocks and various other financial data, predicting the exact price of a given stock (Google) tomorrow.
  • Given a large dataset of user behaviors on a website, generating an informative segmentation of the users based on their behaviors.
  • Predicting whether a given user will click on an ad given the ad content and metadata associated with the user.
  • Given a user’s past behavior on a video platform, presenting him/her with a series of recommended videos to watch next.

    Well done! Predicting whether a user will click on an ad is indeed a classification problem: the outcome is one of a finite set of categories.

    Which of these is a binary classification problem?#

    Great! A classification problem involves predicting the category a given data point belongs to out of a finite set of possible categories. Depending on how many possible categories there are to predict, a classification problem can be either binary or multi-class. Let’s do another quick refresher here. Your job is to pick the binary classification problem out of the following list of supervised learning problems.

  • Predicting whether a given image contains a cat.
  • Predicting the emotional valence of a sentence (Valence can be positive, negative, or neutral).
  • Recommending the most tax-efficient strategy for tax filing in an automated accounting system.
  • Given a list of symptoms, generating a rank-ordered list of most likely diseases.

    Correct! A binary classification problem involves picking between 2 choices, and predicting whether an image contains a cat is the only yes/no problem in the list.

    Introducing XGBoost#

    XGBoost: Fit/Predict#

    It’s time to create your first XGBoost model! As Sergey showed you in the video, you can use the scikit-learn .fit() / .predict() paradigm that you are already familiar with to build your XGBoost models, as the xgboost library has a scikit-learn compatible API!

    Here, you’ll be working with churn data. This dataset contains imaginary data from a ride-sharing app with user behaviors over their first month of app usage in a set of imaginary cities as well as whether they used the service 5 months after sign-up. It has been pre-loaded for you into a DataFrame called churn_data - explore it in the Shell!

    Your goal is to use the first month’s worth of data to predict whether the app’s users will remain users of the service at the 5-month mark. This is a typical setup for a churn prediction problem. To do this, you’ll split the data into training and test sets, fit a small xgboost model on the training set, and evaluate its performance on the test set by computing its accuracy.

    pandas and numpy have been imported as pd and np, and train_test_split has been imported from sklearn.model_selection. Additionally, the arrays for the features and the target have been created as X and y.

  • Import xgboost as xgb.
  • Create training and test sets such that 20% of the data is used for testing. Use a random_state of 123.
  • Instantiate an XGBClassifier as xg_cl using xgb.XGBClassifier(). Specify n_estimators to be 10 and an objective of ‘binary:logistic’. Do not worry about what this means just yet, you will learn about these parameters later in this course.
  • Fit xg_cl to the training set (X_train, y_train) using the .fit() method.
  • Predict the labels of the test set (X_test) using the .predict() method and hit ‘Submit Answer’ to print the accuracy.
  • # edited/added
    import pandas as pd
    import numpy as np
    from sklearn.model_selection import train_test_split
    churn_data = pd.read_csv("archive/Extreme-Gradient-Boosting-with-XGBoost/datasets/churn_data.csv")
    
    # import xgboost
    import xgboost as xgb
    
    # Create arrays for the features and the target: X, y
    X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]
    
    # Create the training and test sets
    X_train,X_test,y_train,y_test= train_test_split(X, y, test_size=0.2, random_state=123)
    
    # Instantiate the XGBClassifier: xg_cl
    xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=123)
    
    # Fit the classifier to the training set
    xg_cl.fit(X_train,y_train)
    
    ## XGBClassifier(n_estimators=10, seed=123)
    
    # Predict the labels of the test set: preds
    preds = xg_cl.predict(X_test)
    
    # Compute the accuracy: accuracy
    accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
    print("accuracy: %f" % (accuracy))
    
    ## accuracy: 0.743300
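
    As a side note, the same accuracy can be computed with scikit-learn’s accuracy_score. A minimal equivalent sketch, assuming preds and y_test from the code above:

    # Equivalent accuracy computation using scikit-learn's metrics module
    from sklearn.metrics import accuracy_score
    print("accuracy: %f" % accuracy_score(y_test, preds))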
    

    Well done! Your model has an accuracy of around 74%. In Chapter 3, you’ll learn about ways to fine-tune your XGBoost models. For now, let’s refresh our memories on how decision trees work. See you in the next video!

    What is a decision tree?#

    Decision trees#

    Your task in this exercise is to make a simple decision tree using scikit-learn’s DecisionTreeClassifier on the breast cancer dataset that comes pre-loaded with scikit-learn.

    This dataset contains numeric measurements of various dimensions of individual tumors (such as perimeter and texture) from breast biopsies and a single outcome value (the tumor is either malignant or benign).

    We’ve preloaded the dataset of samples (measurements) into X and the target values per tumor into y. Now, you have to split the complete dataset into training and testing sets, and then train a DecisionTreeClassifier. You’ll specify a parameter called max_depth. Many other parameters can be modified within this model, and you can check all of them out in the scikit-learn documentation.

  • Import:
    • train_test_split from sklearn.model_selection.
    • DecisionTreeClassifier from sklearn.tree.
  • Create training and test sets such that 20% of the data is used for testing. Use a random_state of 123.
  • Instantiate a DecisionTreeClassifier called dt_clf_4 with a max_depth of 4. This parameter specifies the maximum number of successive split points you can have before reaching a leaf node.
  • Fit the classifier to the training set and predict the labels of the test set.
  • # edited/added
    breast_cancer = pd.read_csv("archive/Extreme-Gradient-Boosting-with-XGBoost/datasets/breast_cancer.csv")
    X = breast_cancer.iloc[:,2:].to_numpy()
    y = np.array([0 if i == "M" else 1 for i in breast_cancer.iloc[:,1]])
    
    # Import the necessary modules
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    
    # Create the training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
    
    # Instantiate the classifier: dt_clf_4
    dt_clf_4 = DecisionTreeClassifier(max_depth=4)
    
    # Fit the classifier to the training set
    dt_clf_4.fit(X_train, y_train)
    
    ## DecisionTreeClassifier(max_depth=4)
    
    # Predict the labels of the test set: y_pred_4
    y_pred_4 = dt_clf_4.predict(X_test)
    
    # Compute the accuracy of the predictions: accuracy
    accuracy = float(np.sum(y_pred_4==y_test))/y_test.shape[0]
    print("accuracy:", accuracy)
    
    ## accuracy: 0.9736842105263158
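
    If you want to peek inside the fitted tree, scikit-learn’s export_text utility prints the learned split rules as plain text. A minimal sketch, assuming dt_clf_4 from above:

    # Print the first two levels of the fitted tree's split rules
    from sklearn.tree import export_text
    print(export_text(dt_clf_4, max_depth=2))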
    

    Great work! It’s now time to learn about what gives XGBoost its state-of-the-art performance: Boosting.

    What is Boosting?#

    Measuring accuracy#

    You’ll now practice using XGBoost’s learning API through its baked-in cross-validation capabilities. As Sergey discussed in the previous video, XGBoost gets its lauded performance and efficiency gains by utilizing its own optimized data structure for datasets, called a DMatrix.

    In the previous exercise, the input datasets were converted into DMatrix data on the fly, but when you use the xgboost cv() function, you have to first explicitly convert your data into a DMatrix. So, that’s what you will do here before running cross-validation on churn_data.

  • Create a DMatrix called churn_dmatrix from churn_data using xgb.DMatrix(). The features are available in X and the labels in y.
  • Perform 3-fold cross-validation by calling xgb.cv(). dtrain is your churn_dmatrix, params is your parameter dictionary, nfold is the number of cross-validation folds (3), num_boost_round is the number of trees to build (5), and metrics is the metric you want to compute (this will be “error”, which we will convert to an accuracy).
  • # Create arrays for the features and the target: X, y
    X, y = churn_data.iloc[:,:-1], churn_data.iloc[:,-1]
    
    # Create the DMatrix from X and y: churn_dmatrix
    churn_dmatrix = xgb.DMatrix(data=X, label=y)
    
    # Create the parameter dictionary: params
    params = {"objective":"reg:logistic", "max_depth":3}
    
    # Perform cross-validation: cv_results
    cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                        nfold=3, num_boost_round=5, 
                        metrics="error", as_pandas=True, seed=123)
    
    # Print cv_results
    print(cv_results)
    
    ##    train-error-mean  train-error-std  test-error-mean  test-error-std
    ## 0           0.28232         0.002366          0.28378        0.001932
    ## 1           0.26951         0.001855          0.27190        0.001932
    ## 2           0.25605         0.003213          0.25798        0.003963
    ## 3           0.25090         0.001845          0.25434        0.003827
    ## 4           0.24654         0.001981          0.24852        0.000934
    
    # Print the accuracy
    print(((1-cv_results["test-error-mean"]).iloc[-1]))
    
    ## 0.75148
    

    Nice work. cv_results stores the training and test mean and standard deviation of the error per boosting round (i.e., per tree built) as a DataFrame. From cv_results, the final round’s ‘test-error-mean’ is extracted and converted into an accuracy, where accuracy is 1 - error. The final accuracy of around 75% is an improvement from earlier!
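
    Since the error columns hold one value per boosting round, you can also convert the whole column at once to see how accuracy improves as trees are added. A small follow-up sketch, assuming cv_results from above:

    # Per-round test accuracy: 1 - error for every boosting round
    test_accuracy_per_round = 1 - cv_results["test-error-mean"]
    print(test_accuracy_per_round)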

    Measuring AUC#

    Now that you’ve used cross-validation to compute average out-of-sample accuracy (after converting from an error), it’s very easy to compute any other metric you might be interested in. All you have to do is pass it (or a list of metrics) in as an argument to the metrics parameter of xgb.cv().
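
    For example, here is a minimal sketch of passing a list of metrics, assuming the churn_dmatrix and params from the previous exercise (the resulting DataFrame gains mean/std columns for each metric):

    # Compute both classification error and AUC in a single cv() run
    cv_both = xgb.cv(dtrain=churn_dmatrix, params=params,
                     nfold=3, num_boost_round=5,
                     metrics=["error", "auc"], as_pandas=True, seed=123)
    print(cv_both.columns.tolist())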

    Your job in this exercise is to compute another common metric used in binary classification - the area under the ROC curve (“auc”). As before, churn_data is available in your workspace, along with the DMatrix churn_dmatrix and parameter dictionary params.

  • Perform 3-fold cross-validation with 5 boosting rounds and “auc” as your metric.
  • Print the “test-auc-mean” column of cv_results.
  • # Perform cross_validation: cv_results
    cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, 
                        nfold=3, num_boost_round=5, 
                        metrics="auc", as_pandas=True, seed=123)
    
    # Print cv_results
    print(cv_results)
    
    ##    train-auc-mean  train-auc-std  test-auc-mean  test-auc-std
    ## 0        0.768893       0.001544       0.767863      0.002820
    ## 1        0.790864       0.006758       0.789157      0.006846
    ## 2        0.815872       0.003900       0.814476      0.005997
    ## 3        0.822959       0.002018       0.821682      0.003912
    ## 4        0.827528       0.000769       0.826191      0.001937
    
    # Print the AUC
    print((cv_results["test-auc-mean"]).iloc[-1])
    
    ## 0.826191
    

    Fantastic! An AUC of about 0.83 is quite strong. As you have seen, XGBoost’s learning API makes it very easy to compute any metric you may be interested in. In Chapter 3, you’ll learn about techniques to fine-tune your XGBoost models to improve their performance even further. For now, it’s time to learn a little about exactly when to use XGBoost.

    When should I use XGBoost?#

    Using XGBoost#

    XGBoost is a powerful library that scales very well to many samples and works for a variety of supervised learning problems. That said, as Sergey described in the video, you shouldn’t always pick it as your default machine learning library when starting a new project, since there are some situations in which it is not the best option. In this exercise, your job is to consider the below examples and select the one which would be the best use of XGBoost.

  • Visualizing the similarity between stocks by comparing the time series of their historical prices relative to each other.
  • Predicting whether a person will develop cancer using genetic data with millions of genes, given 23 genomes of people who didn’t develop cancer and 3 genomes of people who did.
  • Clustering documents into topics based on the terms used in them.
  • Predicting the likelihood that a given user will click an ad from a very large clickstream log with millions of users and their web interactions.

    Correct! Predicting ad clicks from a very large clickstream log plays to XGBoost’s strengths: a large amount of labeled, tabular training data. Way to end the chapter. Time to apply XGBoost to solve regression problems!