Regression#
In this chapter, you will be introduced to regression, and build models to predict sales values using a dataset on advertising expenditure. You will learn about the mechanics of linear regression and common performance metrics such as R-squared and root mean squared error. You will perform k-fold cross-validation, and apply regularization to regression models to reduce the risk of overfitting.
Introduction to regression#
Creating features#
In this chapter, you will work with a dataset called sales_df, which contains information on advertising campaign expenditure across different media types, and the number of dollars generated in sales for the respective campaign. The dataset has been preloaded for you. Here are the first two rows:
   tv       radio     social_media  sales
1  13000.0  9237.76   2409.57       46677.90
2  41000.0  15886.45  2913.41       150177.83
You will use the advertising expenditure as features to predict sales values, initially working with the “radio” column. However, before you make any predictions you will need to create the feature and target arrays, reshaping them to the correct format for scikit-learn.
Instructions:
- Create X, an array of the values from the sales_df DataFrame’s “radio” column.
- Create y, an array of the values from the sales_df DataFrame’s “sales” column.
- Reshape X into a two-dimensional NumPy array.
- Print the shape of X and y.
# edited/added
import pandas as pd
import numpy as np
sales_df = pd.read_csv("archive/Supervised-Learning-with-scikit-learn/datasets/advertising_and_sales_clean.csv").drop(['influencer'], axis=1)
# Create X from the radio column's values
X = sales_df["radio"].values
# Create y from the sales column's values
y = sales_df["sales"].values
# Reshape X
X = X.reshape(-1, 1)
# Check the shape of the features and targets
print(X.shape, y.shape)
## (4546, 1) (4546,)
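Why the reshape? scikit-learn expects a two-dimensional feature array of shape (n_samples, n_features), even when there is only one feature. Here is a minimal sketch on a toy array (plain NumPy, independent of the exercise data):
import numpy as np
# A 1D array of three expenditure values
toy = np.array([9237.76, 15886.45, 25000.0])
print(toy.shape)
## (3,)
# reshape(-1, 1) infers the row count and adds a single feature column
toy_2d = toy.reshape(-1, 1)
print(toy_2d.shape)
## (3, 1)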
Excellent! See that there are 4546 values in both arrays? Now let’s build a linear regression model!
Building a linear regression model#
Now that you have created your feature and target arrays, you will train a linear regression model on all feature and target values.
As the goal is to assess the relationship between the feature and target values, there is no need to split the data into training and test sets.
X and y have been preloaded for you as follows:
y = sales_df["sales"].values
X = sales_df["radio"].values.reshape(-1, 1)
Instructions:
- Import LinearRegression.
- Instantiate a linear regression model.
- Fit the model to the data.
- Predict sales values using X, storing the result as predictions.
# Import LinearRegression
from sklearn.linear_model import LinearRegression
# Create the model
reg = LinearRegression()
# Fit the model to the data
reg.fit(X, y)
## LinearRegression()
# Make predictions
predictions = reg.predict(X)
print(predictions[:5])
## [ 95491.17119147 117829.51038393 173423.38071499 291603.11444202
## 111137.28167129]
Great model building! See how sales values for the first five predictions range from $95,000 to over $290,000. Let’s visualize the model’s fit.
Visualizing a linear regression model#
Now that you have built your linear regression model and trained it using all available observations, you can visualize how well the model fits the data. This allows you to interpret the relationship between radio advertising expenditure and sales values.
The variables X, an array of radio values, y, an array of sales values, and predictions, an array of the model’s predicted values for y given X, have all been preloaded for you from the previous exercise.
Instructions:
- Import matplotlib.pyplot as plt.
- Create a scatter plot visualizing y against X, with observations in blue.
- Draw a red line plot displaying the predictions against X.
- Display the plot.
# Import matplotlib.pyplot
import matplotlib.pyplot as plt
# Create scatter plot
plt.scatter(X, y, color="blue")
# Create line plot
plt.plot(X, predictions, color="red")
plt.xlabel("Radio Expenditure ($)")
plt.ylabel("Sales ($)")
# Display the plot
plt.show()
The model nicely captures a near-perfect linear correlation between radio advertising expenditure and sales! Now let’s take a look at what is going on under the hood to calculate this relationship.
The basics of linear regression#
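Before the next exercise, it is worth seeing the mechanics themselves. For simple linear regression, the fitted line y = ax + b has a closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. A minimal sketch verifying this against scikit-learn (assuming the X, y, and reg objects from the earlier radio exercise):
import numpy as np
x = X.flatten()
# Ordinary least squares in closed form
slope = np.cov(x, y, ddof=0)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()
# These should match the fitted model's reg.coef_[0] and reg.intercept_
print(slope, intercept)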
Fit and predict for regression#
Now that you have seen how linear regression works, your task is to create a multiple linear regression model using all of the features in the sales_df dataset, which has been preloaded for you. As a reminder, here are the first two rows:
   tv       radio     social_media  sales
1  13000.0  9237.76   2409.57       46677.90
2  41000.0  15886.45  2913.41       150177.83
You will then use this model to predict sales based on the values of the test features.
LinearRegression and train_test_split have been preloaded for you from their respective modules.
Instructions:
- Create X, an array containing values of all features in sales_df, and y, containing all values from the “sales” column.
- Instantiate a linear regression model.
- Fit the model to the training data.
- Create y_pred, making predictions for sales using the test features.
# Create X and y arrays
X = sales_df.drop("sales", axis=1).values
y = sales_df["sales"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Instantiate the model
reg = LinearRegression()
# Fit the model to the data
reg.fit(X_train, y_train)
## LinearRegression()
# Make predictions
y_pred = reg.predict(X_test)
print("Predictions: {}, Actual Values: {}".format(y_pred[:2], y_test[:2]))
## Predictions: [53176.66154234 70996.19873235], Actual Values: [55261.28 67574.9 ]
Great work! The first two predictions appear to be within around 5% of the actual values from the test set!
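That ballpark figure is easy to verify; a quick sketch, assuming the y_pred and y_test arrays from the code above:
import numpy as np
# Relative error of the first two predictions
print(np.abs(y_pred[:2] - y_test[:2]) / y_test[:2])
# roughly [0.038 0.051], i.e. within about 4-5%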
Regression performance#
Now that you have fit a model, reg, using all features from sales_df, and made predictions of sales values, you can evaluate performance using some common regression metrics.
The variables X_train, X_test, y_train, y_test, and y_pred, along with the fitted model, reg, all from the last exercise, have been preloaded for you.
Your task is to find out how well the features can explain the variance in the target values, along with assessing the model’s ability to make predictions on unseen data.
Instructions:
- Import mean_squared_error.
- Calculate the model’s R-squared score by passing the test feature and target values to an appropriate method.
- Calculate the model’s root mean squared error using y_test and y_pred.
- Print r_squared and rmse.
# Import mean_squared_error
from sklearn.metrics import mean_squared_error
# Compute R-squared
r_squared = reg.score(X_test, y_test)
# Compute RMSE
rmse = mean_squared_error(y_test, y_pred, squared=False)
# Print the metrics
print("R^2: {}".format(r_squared))
## R^2: 0.9990152104759368
print("RMSE: {}".format(rmse))
## RMSE: 2944.4331996001
Wow, the features explain 99.9% of the variance in sales values! Looks like this company’s advertising strategy is working well!
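Both metrics can also be reproduced directly from their definitions, which makes the reported numbers less of a black box. A minimal sketch, assuming the y_test and y_pred arrays from above:
import numpy as np
# R-squared: 1 minus the ratio of residual to total sum of squares
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - np.mean(y_test)) ** 2)
print(1 - ss_res / ss_tot)  # should match reg.score(X_test, y_test)
# RMSE: square root of the mean squared residual, in target units ($)
print(np.sqrt(np.mean((y_test - y_pred) ** 2)))  # should match rmse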
Cross-validation#
Cross-validation for R-squared#
Cross-validation is a vital approach to evaluating a model. It maximizes the amount of data that is available to the model, as the model is not only trained but also tested on all of the available data.
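To see what this looks like in practice, here is a minimal sketch of k-fold splitting on a toy array of ten samples (three folds for brevity; every sample lands in exactly one test fold):
import numpy as np
from sklearn.model_selection import KFold
toy_X = np.arange(10).reshape(-1, 1)
kf_demo = KFold(n_splits=3, shuffle=True, random_state=5)
for train_idx, test_idx in kf_demo.split(toy_X):
    print("train:", train_idx, "test:", test_idx)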
In this exercise, you will build a linear regression model, then use 6-fold cross-validation to assess its accuracy for predicting sales using social media advertising expenditure. You will display the individual score for each of the six folds.
The sales_df dataset has been split into y for the target variable, and X for the features, and preloaded for you. LinearRegression has been imported from sklearn.linear_model.
Instructions:
- Import KFold and cross_val_score.
- Create kf by calling KFold(), setting the number of splits to six, shuffle to True, and setting a seed of 5.
- Perform cross-validation using reg on X and y, passing kf to cv.
- Print the cv_scores.
# Import the necessary modules
from sklearn.model_selection import KFold, cross_val_score
# Create a KFold object
kf = KFold(n_splits=6, shuffle=True, random_state=5)
reg = LinearRegression()
# Compute 6-fold cross-validation scores
cv_scores = cross_val_score(reg, X, y, cv=kf)
# Print scores
print(cv_scores)
## [0.99894062 0.99909245 0.9990103 0.99896344 0.99889153 0.99903953]
Notice how R-squared for each fold ranged between 0.9989 and 0.9991? By using cross-validation, we can see how performance varies depending on how the data is split!
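Under the hood, cross_val_score is roughly equivalent to the following sketch: fit on each training portion, score on the held-out fold, and collect the results (assuming reg, X, y, and kf from the exercise; note that cross_val_score clones the estimator rather than refitting reg in place):
manual_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on five folds, compute R-squared on the sixth
    reg.fit(X[train_idx], y[train_idx])
    manual_scores.append(reg.score(X[test_idx], y[test_idx]))
print(manual_scores)  # should match cv_scores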
Analyzing cross-validation metrics#
Now you have performed cross-validation, it’s time to analyze the results.
You will display the mean, standard deviation, and 95% confidence interval for cv_results, which has been preloaded for you from the previous exercise. numpy has been imported for you as np.
Instructions:
- Calculate and print the mean of cv_results.
- Calculate and print the standard deviation of cv_results.
- Display the 95% confidence interval for your results using np.quantile().
# edited/added
cv_results = cv_scores
# Print the mean
print(np.mean(cv_results))
## 0.9989896443678249
# Print the standard deviation
print(np.std(cv_results))
## 6.608118371529651e-05
# Print the 95% confidence interval
print(np.quantile(cv_results, [0.025, 0.975]))
## [0.99889767 0.99908583]
An average score of 0.999 with a low standard deviation is pretty good for a model out of the box! Now let’s learn how to apply regularization to our regression models.
Regularized regression#
Regularized regression: Ridge#
Ridge regression performs regularization by computing the squared values of the model parameters multiplied by alpha and adding them to the loss function.
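In other words, ridge minimizes the ordinary least squares loss plus a penalty term: $\sum_i (y_i - \hat{y}_i)^2 + \alpha \sum_j w_j^2$. A minimal sketch of that penalized loss for a single fitted model (assuming the X_train and y_train arrays described below):
import numpy as np
from sklearn.linear_model import Ridge
alpha = 10.0
ridge = Ridge(alpha=alpha).fit(X_train, y_train)
# Residual sum of squares plus alpha times the sum of squared
# coefficients (scikit-learn does not penalize the intercept)
residuals = y_train - ridge.predict(X_train)
print(np.sum(residuals ** 2) + alpha * np.sum(ridge.coef_ ** 2))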
In this exercise, you will fit ridge regression models over a range of different alpha values, and print their $R^2$ scores. You will use all of the features in the sales_df dataset to predict “sales”. The data has been split into X_train, X_test, y_train, and y_test for you.
A variable called alphas has been provided as a list containing different alpha values, which you will loop through to generate scores.
Instructions:
- Import Ridge.
- Instantiate Ridge, setting alpha equal to alpha.
- Fit the model to the training data.
- Calculate the R-squared score for each iteration of ridge, appending it to ridge_scores.
# Import Ridge
from sklearn.linear_model import Ridge
alphas = [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]
ridge_scores = []
for alpha in alphas:
    # Create a Ridge regression model
    ridge = Ridge(alpha=alpha)
    # Fit the data
    ridge.fit(X_train, y_train)
    # Obtain R-squared
    score = ridge.score(X_test, y_test)
    ridge_scores.append(score)
## Ridge(alpha=0.1)
## Ridge()
## Ridge(alpha=10.0)
## Ridge(alpha=100.0)
## Ridge(alpha=1000.0)
## Ridge(alpha=10000.0)
print(ridge_scores)
## [0.9990152104759369, 0.9990152104759373, 0.9990152104759419, 0.999015210475987, 0.9990152104764387, 0.9990152104809561]
Well done! The scores don’t appear to change much as alpha increases, which is indicative of how well the features explain the variance in the target: even when large coefficients are heavily penalized, underfitting does not occur!
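One quick way to see this stability is to plot the scores against alpha on a log scale; a short sketch, assuming the alphas and ridge_scores lists from above:
import matplotlib.pyplot as plt
plt.plot(alphas, ridge_scores, marker="o")
plt.xscale("log")
plt.xlabel("Alpha (log scale)")
plt.ylabel("R-squared")
plt.show()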
Lasso regression for feature importance#
In the video, you saw how lasso regression can be used to identify important features in a dataset.
In this exercise, you will fit a lasso regression model to the sales_df data and plot the model’s coefficients.
The feature and target variable arrays have been pre-loaded as X and y, along with sales_columns, which contains the dataset’s feature names.
Instructions:
- Import Lasso from sklearn.linear_model.
- Instantiate a Lasso regressor with an alpha of 0.3.
- Fit the model to the data.
- Compute the model’s coefficients, storing the result as lasso_coef.
# edited/added
sales_columns = sales_df.drop(['sales'], axis = 1).columns
# Import Lasso
from sklearn.linear_model import Lasso
# Instantiate a lasso regression model
lasso = Lasso(alpha=0.3)
# Fit the model to the data
lasso.fit(X, y)
## Lasso(alpha=0.3)
# Compute and print the coefficients
lasso_coef = lasso.coef_
print(lasso_coef)
## [ 3.56256962 -0.00397035 0.00496385]
plt.bar(sales_columns, lasso_coef)
## <BarContainer object of 3 artists>
plt.xticks(rotation=45)
## ([0, 1, 2], [Text(0, 0, ''), Text(0, 0, ''), Text(0, 0, '')])
plt.show()
See how the figure makes it clear that expenditure on TV advertising is the most important feature in the dataset to predict sales values! In the next chapter, we will learn how to further assess and improve our model’s performance!
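Lasso’s feature-selection behavior is even clearer if alpha is increased: coefficients for weak features are driven exactly to zero. A minimal sketch, assuming X, y, and sales_columns from the exercise above:
from sklearn.linear_model import Lasso
for alpha in [0.3, 10.0, 1000.0]:
    coefs = Lasso(alpha=alpha).fit(X, y).coef_
    # Weak coefficients shrink toward, and eventually to, exactly zero
    print(alpha, dict(zip(sales_columns, coefs.round(4))))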