Applying logistic regression and SVM#
In this chapter you will learn the basics of applying logistic
regression and support vector machines (SVMs) to classification
problems. You’ll use the scikit-learn
library to fit
classification models to real data.
scikit-learn refresher#
KNN classification#
In this exercise you’ll explore a subset of the Large Movie Review Dataset. The variables X_train, X_test, y_train, and y_test are already loaded into the environment. The X variables contain features based on the words in the movie reviews, and the y variables contain labels for whether the review sentiment is positive (+1) or negative (-1).
This course touches on a lot of concepts you may have forgotten, so if you ever need a quick refresher, download the scikit-learn Cheat Sheet and keep it handy!
# edited/added
import numpy as np
from sklearn.datasets import load_svmlight_file
# Load the bag-of-words features and star ratings, then keep a small slice
X_train, y_train = load_svmlight_file('archive/Linear-Classifiers-in-Python/datasets/train_labeledBow.feat')
X_test, y_test = load_svmlight_file('archive/Linear-Classifiers-in-Python/datasets/test_labeledBow.feat')
X_train = X_train[11000:13000, :2500]
y_train = y_train[11000:13000]
# Convert star ratings to binary sentiment labels
y_train[y_train < 5] = -1.0
y_train[y_train >= 5] = 1.0
X_test = X_test[11000:13000, :2500]
y_test = y_test[11000:13000]
y_test[y_test < 5] = -1.0
y_test[y_test >= 5] = 1.0
from sklearn.neighbors import KNeighborsClassifier
# Create and fit the model
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
## KNeighborsClassifier()
# Predict on the test features, print the results
pred = knn.predict(X_test)[0]
print("Prediction for test example 0:", pred)
## Prediction for test example 0: 1.0
Nice work! Looks like you remember how to use scikit-learn
for supervised learning.
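If you want to go one step beyond a single prediction, knn.score reports accuracy over the whole test set. This is an optional extra check, not part of the exercise:
accuracy = knn.score(X_test, y_test)
print("Test set accuracy:", accuracy)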
Comparing models#
Compare k nearest neighbors classifiers with k=1 and k=5 on the handwritten digits data set, which is already loaded into the variables X_train, y_train, X_test, and y_test. You can set k with the n_neighbors parameter when creating the KNeighborsClassifier object, which is also already imported into the environment.
Which model has a higher test accuracy?
# Create and fit the model with k=1
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
## KNeighborsClassifier(n_neighbors=1)
knn.score(X_test, y_test)
## 0.1645
# Predict on the test features, print the results
pred = knn.predict(X_test)[0]
print("Prediction for test example 0:", pred)
## Prediction for test example 0: 1.0
# Create and fit the model with k=5 (the default value of n_neighbors)
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
## KNeighborsClassifier()
knn.score(X_test, y_test)
## 0.056
# Predict on the test features, print the results
pred = knn.predict(X_test)[0]
print("Prediction for test example 0:", pred)
## Prediction for test example 0: 1.0
k=1
k=5
Great! You’ve just done a bit of model selection!
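Picking k by hand like this works for two candidate values; for a more systematic search you could cross-validate over a grid of k values. A minimal sketch, assuming X_train and y_train from this exercise are still in scope (the grid below is illustrative, not from the course):
from sklearn.model_selection import GridSearchCV
param_grid = {'n_neighbors': [1, 3, 5, 7]}  # illustrative candidate values
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)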
Overfitting#
Which of the following situations looks like an example of overfitting?
Training accuracy 50%, testing accuracy 50%.
Training accuracy 95%, testing accuracy 95%.
Training accuracy 95%, testing accuracy 50%.
Training accuracy 50%, testing accuracy 95%.
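Overfitting shows up as a large gap between training and test accuracy, and you can check for it directly in code. The snippet below is just an illustrative sketch that reuses the fitted knn model from the previous exercise:
train_acc = knn.score(X_train, y_train)
test_acc = knn.score(X_test, y_test)
# A much higher training score than test score suggests overfitting
print("train: {:.2f}, test: {:.2f}, gap: {:.2f}".format(train_acc, test_acc, train_acc - test_acc))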
Applying logistic regression and SVM#
Running LogisticRegression and SVC#
In this exercise, you’ll apply logistic regression and a support vector machine (SVC()) to classify images of handwritten digits, fitting both models to the handwritten digits data set using the provided train/validation split.
# edited/added
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn import datasets
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target)
# Apply logistic regression and print scores
lr = LogisticRegression()
lr.fit(X_train, y_train)
## LogisticRegression()
## ConvergenceWarning: lbfgs failed to converge (status=1):
## STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
## Increase the number of iterations (max_iter) or scale the data as shown in:
## https://scikit-learn.org/stable/modules/preprocessing.html
print(lr.score(X_train, y_train))
## 1.0
print(lr.score(X_test, y_test))
## 0.9488888888888889
# Apply SVM and print scores
svm = SVC()
svm.fit(X_train, y_train)
## SVC()
print(svm.score(X_train, y_train))
## 0.994060876020787
print(svm.score(X_test, y_test))
## 0.9844444444444445
Nicely done! Later in the course we’ll look at the similarities and differences of logistic regression vs. SVMs.
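As an aside, the ConvergenceWarning printed above can usually be avoided by scaling the inputs or raising max_iter. A minimal sketch, not part of the exercise (the max_iter value is just an example):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
lr_scaled = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
lr_scaled.fit(X_train, y_train)
print(lr_scaled.score(X_test, y_test))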
Sentiment analysis for movie reviews#
In this exercise you’ll explore the probabilities output by logistic regression on a subset of the Large Movie Review Dataset. The variables X and y are already loaded into the environment. X contains features based on the number of times words appear in the movie reviews, and y contains labels for whether the review sentiment is positive (+1) or negative (-1).
# edited/added
import numpy as np
import pandas as pd
from sklearn.datasets import load_svmlight_file
X, y = load_svmlight_file('archive/Linear-Classifiers-in-Python/datasets/train_labeledBow.feat')
X = X[11000:13000,:2500]
y = y[11000:13000]
y[y < 5] = -1.0
y[y >= 5] = 1.0
vocab = pd.read_csv('archive/Linear-Classifiers-in-Python/datasets/vocab.csv')['0'].values.tolist()
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(vocabulary = vocab)
def get_features(review):
    # Turn a raw review string into the same bag-of-words representation as X
    return vectorizer.transform([review])
# Instantiate logistic regression and train
lr = LogisticRegression()
lr.fit(X, y)
## LogisticRegression()
## ConvergenceWarning: lbfgs failed to converge (status=1):
## STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.
## Increase the number of iterations (max_iter) or scale the data as shown in:
## https://scikit-learn.org/stable/modules/preprocessing.html
# Predict sentiment for a glowing review
review1 = "LOVED IT! This movie was amazing. Top 10 this year."
review1_features = get_features(review1)
print("Review:", review1)
## Review: LOVED IT! This movie was amazing. Top 10 this year.
print("Probability of positive review:", lr.predict_proba(review1_features)[0,1])
## Probability of positive review: 0.8807769884058808
# Predict sentiment for a poor review
review2 = "Total junk! I'll never watch a film by that director again, no matter how good the reviews."
review2_features = get_features(review2)
print("Review:", review2)
## Review: Total junk! I'll never watch a film by that director again, no matter how good the reviews.
print("Probability of positive review:", lr.predict_proba(review2_features)[0,1])
## Probability of positive review: 0.9086001000263592
Fantastic! You might have expected the second probability to be much lower, but the word “good” trips the model up, since that’s considered a “positive” word.
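If you’re curious why “good” pushes the probability up, you can peek at the learned coefficient for that word. A minimal sketch, assuming vocab lines up with the columns of X as set up above and that 'good' appears in vocab:
# Look up the column index of "good" in the vocabulary (illustrative check)
idx = vocab.index('good')
if idx < X.shape[1]:
    # A positive coefficient means the word pushes predictions toward +1
    print("Coefficient for 'good':", lr.coef_[0, idx])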
Linear classifiers#
Which decision boundary is linear?#
Which of the following is a linear decision boundary?
[Four scatter plots with candidate decision boundaries, labeled (1) through (4); the plots are not reproduced here.]
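Recall that a linear decision boundary is a straight line (more generally, a hyperplane) of the form w·x + b = 0, so a linear classifier predicts with the sign of a linear function of the features. A tiny illustration (w, b, and x below are made-up values, not from the exercise):
import numpy as np
w, b = np.array([1.0, -2.0]), 0.5   # hypothetical weights and intercept
x = np.array([3.0, 1.0])            # a hypothetical 2D example
print(np.sign(w @ x + b))           # +1 on one side of the boundary, -1 on the other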
Visualizing decision boundaries#
In this exercise, you’ll visualize the decision boundaries of various classifier types. A subset of scikit-learn’s built-in wine dataset is already loaded into X, along with binary labels in y.
Create the following classifiers with default hyperparameters: LogisticRegression, LinearSVC, SVC, and KNeighborsClassifier. Fit each of them on the provided data using a for loop. Then call the plot_4_classifiers() function (similar to the code here), passing in X, y, and a list containing the four classifiers.
# edited/added
import matplotlib.pyplot as plt
X = np.array([[11.45, 2.4 ],
[13.62, 4.95],
[13.88, 1.89],
[12.42, 2.55],
[12.81, 2.31],
[12.58, 1.29],
[13.83, 1.57],
[13.07, 1.5 ],
[12.7 , 3.55],
[13.77, 1.9 ],
[12.84, 2.96],
[12.37, 1.63],
[13.51, 1.8 ],
[13.87, 1.9 ],
[12.08, 1.39],
[13.58, 1.66],
[13.08, 3.9 ],
[11.79, 2.13],
[12.45, 3.03],
[13.68, 1.83],
[13.52, 3.17],
[13.5 , 3.12],
[12.87, 4.61],
[14.02, 1.68],
[12.29, 3.17],
[12.08, 1.13],
[12.7 , 3.87],
[11.03, 1.51],
[13.32, 3.24],
[14.13, 4.1 ],
[13.49, 1.66],
[11.84, 2.89],
[13.05, 2.05],
[12.72, 1.81],
[12.82, 3.37],
[13.4 , 4.6 ],
[14.22, 3.99],
[13.72, 1.43],
[12.93, 2.81],
[11.64, 2.06],
[12.29, 1.61],
[11.65, 1.67],
[13.28, 1.64],
[12.93, 3.8 ],
[13.86, 1.35],
[11.82, 1.72],
[12.37, 1.17],
[12.42, 1.61],
[13.9 , 1.68],
[14.16, 2.51]])
y = np.array([ True, True, False, True, True, True, False, False, True,
False, True, True, False, False, True, False, True, True,
True, False, True, True, True, False, True, True, True,
True, True, True, True, True, False, True, True, True,
False, False, True, True, True, True, False, False, False,
True, True, True, False, True])
def make_meshgrid(x, y, h=.02, lims=None):
    """Create a mesh of points to plot in
    Parameters
    ----------
    x: data to base x-axis meshgrid on
    y: data to base y-axis meshgrid on
    h: stepsize for meshgrid, optional
    Returns
    -------
    xx, yy : ndarray
    """
    if lims is None:
        x_min, x_max = x.min() - 1, x.max() + 1
        y_min, y_max = y.min() - 1, y.max() + 1
    else:
        x_min, x_max, y_min, y_max = lims
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy
def plot_contours(ax, clf, xx, yy, proba=False, **params):
    """Plot the decision boundaries for a classifier.
    Parameters
    ----------
    ax: matplotlib axes object
    clf: a classifier
    xx: meshgrid ndarray
    yy: meshgrid ndarray
    params: dictionary of params to pass to contourf, optional
    """
    if proba:
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, -1]
        Z = Z.reshape(xx.shape)
        out = ax.imshow(Z, extent=(np.min(xx), np.max(xx), np.min(yy), np.max(yy)),
                        origin='lower', vmin=0, vmax=1, **params)
        ax.contour(xx, yy, Z, levels=[0.5])
    else:
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        out = ax.contourf(xx, yy, Z, **params)
    return out
def plot_classifier(X, y, clf, ax=None, ticks=False, proba=False, lims=None):
    # assumes classifier "clf" is already fit
    X0, X1 = X[:, 0], X[:, 1]
    xx, yy = make_meshgrid(X0, X1, lims=lims)
    if ax is None:
        plt.figure()
        ax = plt.gca()
        show = True
    else:
        show = False
    # can abstract some of this into a higher-level function for learners to call
    cs = plot_contours(ax, clf, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8, proba=proba)
    if proba:
        cbar = plt.colorbar(cs)
        cbar.ax.set_ylabel(r'probability of red $\Delta$ class', fontsize=20, rotation=270, labelpad=30)
        cbar.ax.tick_params(labelsize=14)
    # ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=30, edgecolors='k', linewidth=1)
    labels = np.unique(y)
    if len(labels) == 2:
        ax.scatter(X0[y == labels[0]], X1[y == labels[0]], cmap=plt.cm.coolwarm,
                   s=60, c='b', marker='o', edgecolors='k')
        ax.scatter(X0[y == labels[1]], X1[y == labels[1]], cmap=plt.cm.coolwarm,
                   s=60, c='r', marker='^', edgecolors='k')
    else:
        ax.scatter(X0, X1, c=y, cmap=plt.cm.coolwarm, s=50, edgecolors='k', linewidth=1)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    # ax.set_xlabel(data.feature_names[0])
    # ax.set_ylabel(data.feature_names[1])
    if ticks:
        ax.set_xticks(())
        ax.set_yticks(())
    # ax.set_title(title)
    if show:
        plt.show()
    else:
        return ax
def plot_4_classifiers(X, y, clfs):
    # Set up a 2x2 grid for plotting.
    fig, sub = plt.subplots(2, 2)
    plt.subplots_adjust(wspace=0.2, hspace=0.2)
    for clf, ax, title in zip(clfs, sub.flatten(), ("(1)", "(2)", "(3)", "(4)")):
        # clf.fit(X, y)
        plot_classifier(X, y, clf, ax, ticks=True)
        ax.set_title(title)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
# Define the classifiers
classifiers = [LogisticRegression(), LinearSVC(),
               SVC(), KNeighborsClassifier()]
# Fit the classifiers
for c in classifiers:
    c.fit(X, y)
## LogisticRegression()
## LinearSVC()
## SVC()
## KNeighborsClassifier()
## ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
# Plot the classifiers
plot_4_classifiers(X, y, classifiers)
plt.show()
Nice! As you can see, logistic regression and linear SVM are linear classifiers whereas KNN is not. The default SVM is also non-linear, but this is hard to see in the plot because it performs poorly with default hyperparameters. With better hyperparameters, it performs well.
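If you want to see the RBF SVM’s non-linear boundary more clearly, you can refit it with a larger gamma and plot it with the plot_classifier helper defined above. A minimal sketch (the gamma value is chosen for illustration only, not a recommendation):
svm_big_gamma = SVC(gamma=2)  # illustrative hyperparameter, not from the course
svm_big_gamma.fit(X, y)
plot_classifier(X, y, svm_big_gamma)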