7  Classification Models

Published

June 9, 2026

Keywords

Categorical Data, Binary Classification, Logistic Regression, Principal Components Analysis, PCA, k-Means Clustering, Support Vector Machines, SVM, Confusion Matrices

7.1 Introduction

This module teaches some classification methods which are useful for working with response variables that are categorical (factors).

Learning Outcomes

  • Explain the purpose of classification models.
  • Explain the difference between regression and binary classification.
  • Build different classification models: Logistic Regression and SVM.
  • Tune Classifiers and calculate Performance Metrics
  • Explain False Positives and False Negatives from a confusion matrix.

7.1.1 References

7.2 Classification

When your question of interest involves a categorical response, you must use a classification method.

  • Regression methods/models won’t work with a categorical response very well as they assume the response is continuous.
  • They will just generate predictions with decimal places, and that does not work for things like “hair color” or “gender”.

There are many classification methods. Some handle binary classification (when there are only two levels), while more general approaches handle a response variable with multiple levels (categories), such as hair color.

Consider a dataset containing several variables (columns). Some of the variables are designated as the “response variables” (or “outputs”). You’d like to predict the response variables from the “explanatory variables” (or “inputs”). There are also variables that you didn’t (or couldn’t) measure, which are the “latent variables”.

These distinctions between variable roles are a bit arbitrary, and it can be a bit of a problem to determine which variables should play which role, especially because the same variable can play different roles depending on your research question.

Exploring your data (on the training set, like in Lesson 6) is an important part of determining possible variable roles.

“Machine learning” is a term that you’ve probably heard, perhaps surrounded by a bit of mystery. While there are several different kinds of algorithms that are usually considered “learning”, they are essentially just fancier versions of statistical regression.

  • Machine learning comes in two varieties: “supervised” and “unsupervised”.
    • In supervised learning, we have a known response variable that is labeled in our data so we can build a prediction model and use a loss function to calculate the difference between our prediction and the “true” value. Linear Regression is an example.
    • In unsupervised learning, the goal is a little different since we don’t have labelled data; we don’t know what the correct response is. Here we want to determine which variables should play which roles so we get the most information out of our data.

As in Lesson 6, careful attention to the train-validation-test process with the data is critical, because modern supervised learning algorithms are extremely flexible.

  • Moreover, since they are quite a bit more sensitive to the data than (say) linear regression, it is very easy to get a misleading performance measurement if you don’t follow the correct process.

7.3 Logistic Regression for Binary Classification

Many problems in data science involve predicting whether an observation belongs to one of two categories. Examples include:

  • whether a municipality contains a cultural site
  • whether a customer defaults on a loan
  • whether an email is spam
  • whether a medical test result is positive or negative

These are examples of binary classification problems because the outcome has only two possible classes.

Linear regression is generally not appropriate for this setting because predicted values can fall outside the range from 0 to 1. Instead, a common approach is logistic regression, which models the probability that an observation belongs to one of the classes.
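As a small illustration of the point above, the sketch below fits both lm() and glm() to a tiny made-up dataset (the x and y values are invented for this example and are not from the course data) and compares the range of the fitted values.

```r
# Made-up data: ten observations with a 0/1 response
toy <- data.frame(
  x = 1:10,
  y = c(0, 0, 0, 0, 1, 0, 1, 1, 1, 1)
)

# Linear regression treats the 0/1 response as a continuous number
lin_fit <- lm(y ~ x, data = toy)
lin_pred <- predict(lin_fit)

# Logistic regression models the probability that y equals 1
log_fit <- glm(y ~ x, data = toy, family = "binomial")
log_pred <- predict(log_fit, type = "response")

range(lin_pred) # extends below 0 and above 1
range(log_pred) # stays strictly between 0 and 1
```

Even on the observed x values, the straight line produces "probabilities" that are negative or larger than 1, while the logistic fit does not.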

The logistic regression model produces estimated probabilities such as:

  • 0.92 → very likely to belong to the positive class
  • 0.51 → uncertain but slightly more likely positive
  • 0.08 → unlikely to belong to the positive class

The predicted probabilities are constrained to remain between 0 and 1 through the use of the logistic (sigmoid) function.

\[P(Y=1)=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\cdots+\beta_px_p)}} \tag{7.1}\]

  • Note that the exponent in Equation 7.1 looks similar to a linear regression function \(y = \beta_0+\beta_1x_1+\cdots+\beta_px_p\).

Figure 7.1 shows a plot of a sigmoid function and you can see how with the right parameters, you can convert any number \(x\) to a number \(y\) that is between 0 and 1 like a probability.

library(tidyverse)
beta0 <- 0
beta1 <- 1

df <- tibble(
  x = (seq_len(1000) - 500) / 50,
  y = exp(beta0 + beta1 * x) / (1 + exp(beta0 + beta1 * x))
)

ggplot(df, aes(x, y)) +
  geom_line() 
Figure 7.1: A logistic curve ranges between 0 and 1 on the Y axis

Essentially, instead of predicting a category label directly, logistic regression fits a regression for the linear predictor and then uses the sigmoid function to turn that value into an estimated probability of membership in a class.
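Here is a minimal sketch of that conversion in base R. The coefficients and x values are hypothetical, chosen only for illustration; plogis() and qlogis() are the logistic function and its inverse from {stats}.

```r
# Hypothetical coefficients and inputs, chosen only for illustration
beta0 <- -1
beta1 <- 0.5
x <- c(-10, 0, 2, 10)

# The linear predictor can be any real number (this is the log-odds)
eta <- beta0 + beta1 * x

# Equation 7.1 written out by hand
p_manual <- 1 / (1 + exp(-eta))

# Base R's plogis() is the same logistic function
p_builtin <- plogis(eta)
all.equal(p_manual, p_builtin)

# qlogis() is the inverse: it maps probabilities back to log-odds
all.equal(qlogis(p_manual), eta)
```

So the "regression" part happens on the log-odds scale, and the sigmoid is just the translation back to a probability.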

7.3.1 Classification Thresholds

To convert our estimated probabilities into actual classifications, we have to choose a decision threshold.

A common choice is 0.5:

  • probability ≥ 0.5 → classify as positive
  • probability < 0.5 → classify as negative

However, the threshold does not need to be fixed at 0.5. Different applications may require different tradeoffs between false positives and false negatives.

For example:

  • In medical screening, a lower threshold may be preferred to avoid missing possible disease cases.
  • In spam filtering, a higher threshold may reduce the risk of incorrectly classifying legitimate email as spam.

Threshold selection is therefore connected to the costs and risks associated with classification errors.
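A short sketch of how the threshold changes the labels, using the three example probabilities from earlier in this section. The classify() helper is hypothetical, written just for this illustration.

```r
# The three example probabilities from the text above
probs <- c(0.92, 0.51, 0.08)

# Hypothetical helper: label a probability as positive at or above the threshold
classify <- function(p, threshold = 0.5) {
  ifelse(p >= threshold, "positive", "negative")
}

classify(probs)                   # default threshold of 0.5
classify(probs, threshold = 0.6)  # stricter: the 0.51 case flips to negative
classify(probs, threshold = 0.05) # more lenient: all three become positive
```

The model and its probabilities never change here; only the decision rule applied to them does.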

7.3.2 Evaluating Classification Models

Binary classification models are commonly evaluated using metrics such as:

  • accuracy
  • precision
  • recall (sensitivity)
  • specificity
  • F1 score
  • ROC curves and AUC

These measures help determine how well the model distinguishes between the two classes and whether a chosen threshold is appropriate for the problem being studied.
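To make the first few metrics concrete, here is a sketch that computes them from a pair of made-up label vectors (the data are invented; 1 is the positive class). ROC curves and AUC need predicted probabilities rather than labels, so they are not part of this sketch.

```r
# Made-up labels for eight observations (1 = positive, 0 = negative)
truth <- c(1, 1, 1, 0, 0, 0, 0, 1)
pred  <- c(1, 0, 1, 0, 1, 0, 0, 1)

# The four possible outcomes
tp <- sum(truth == 1 & pred == 1)
tn <- sum(truth == 0 & pred == 0)
fp <- sum(truth == 0 & pred == 1)
fn <- sum(truth == 1 & pred == 0)

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)
specificity <- tn / (tn + fp)
f1          <- 2 * precision * recall / (precision + recall)

c(accuracy = accuracy, precision = precision, recall = recall,
  specificity = specificity, f1 = f1)
```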

7.3.3 Example: Predicting the Probability of a Cultural Site

Suppose we want to predict whether a municipality contains a cultural site based on household and socio-economic characteristics.

  • Because the response variable has two possible outcomes (TRUE/FALSE or Yes/No), logistic regression is an appropriate modeling approach.

In this example, the model estimates the probability that a municipality contains a cultural site using variables such as:

  • average household size
  • population
  • computer ownership
  • university

To keep things simple, we will not split the data in this example, so our model may overfit the data and give optimistic performance metrics.

7.3.4 Fit the Model

We will use the glm() (generalized linear model) function from R {stats}.

  • We first have to convert the values of the response to be 0 or 1.
  • The argument family = "binomial" is what tells glm() to use logistic regression.
read_csv("./data/house_ed_gdp_joined.csv") |> 
  select(culture_site, Average_size, population, has_computer, university) |> 
  mutate(culture_site = if_else(culture_site =="has_site", 1, 0)) |> 
  drop_na() ->
  logreg_data

cultural_logit <- glm(culture_site ~ Average_size + population + has_computer +
    university, data = logreg_data,
  family = "binomial"
)

summary(cultural_logit)

Call:
glm(formula = culture_site ~ Average_size + population + has_computer + 
    university, family = "binomial", data = logreg_data)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -5.170e-01  4.095e+00  -0.126   0.8995  
Average_size -1.611e+00  1.145e+00  -1.407   0.1596  
population    3.121e-05  1.831e-05   1.705   0.0883 .
has_computer  2.380e-02  1.147e-01   0.208   0.8356  
university    1.947e-01  1.431e-01   1.361   0.1736  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 62.226  on 58  degrees of freedom
Residual deviance: 35.581  on 54  degrees of freedom
AIC: 45.581

Number of Fisher Scoring iterations: 6
  • We can interpret the model using summary():
    • Positive coefficients increase the predicted probability of a cultural site, while negative coefficients decrease it.
    • The \(p\)-values indicate how much evidence there is that each coefficient differs from zero, given the other predictors in the model.
      • Here only population has a relatively low \(p\)-value (0.088), and even that is above the usual 0.05 cutoff.
    • The substantial reduction in deviance from 62.2 (null) to 35.6 (residual) indicates that the predictors collectively explain meaningful variation in the probability of a cultural site.
  • Note that logistic regression is fit by an iterative process, and here it converged in just 6 Fisher scoring iterations.
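A few of the numbers in the summary() output can be checked with simple arithmetic. The sketch below copies values from the output above; the chi-squared comparison against the intercept-only model and the "per 10,000" odds ratio are standard ways of reading a logistic fit, but they are additions here, and with only 59 observations the chi-squared approximation should be taken as rough.

```r
# Values copied from the summary() output above
null_dev  <- 62.226
null_df   <- 58
resid_dev <- 35.581
resid_df  <- 54

# Drop in deviance and the number of predictors that produced it
dev_drop <- null_dev - resid_dev
df_drop  <- null_df - resid_df

# Chi-squared test of the fitted model against the intercept-only model
p_value <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
p_value

# For this 0/1 response, AIC is the residual deviance plus 2 per coefficient (5 here)
aic_check <- resid_dev + 2 * 5
aic_check

# Odds ratio for population, per increase of 10,000 (coefficient from the table)
or_pop <- exp(3.121e-05 * 10000)
or_pop
```

The AIC check reproduces the 45.581 reported by summary(), and the odds ratio says that, holding the other predictors fixed, the odds of a cultural site are multiplied by roughly 1.37 for each additional 10,000 in population.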

7.3.4.1 An ROC Curve Directly Evaluates Classification Performance

Earlier, the model converted predictors into predicted probabilities:

predict(cultural_logit, type = "response")[1:15] |> round(2)
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
0.17 0.07 0.02 0.04 0.02 0.01 0.02 0.02 0.12 0.98 0.10 0.07 0.02 0.03 0.91 

To classify municipalities, you eventually choose a threshold such as .5, .4, or even .7.

  • However, different thresholds produce different tradeoffs between:
    • True positives: correctly identifying municipalities with cultural sites
    • False positives: incorrectly predicting a cultural site

The ROC curve shows model performance across all possible classification thresholds simultaneously. It plots:

  • True Positive Rate (Sensitivity) on the y-axis:

\[\text{Sensitivity} = \text{TPR} = \frac{TP}{TP+FN}\]

  • This measures the proportion of actual positive cases that are correctly identified.
  • False Positive Rate on the x-axis

\[\text{FPR} = \frac{FP}{FP+TN} = 1 - \text{Specificity}\]

  • This measures the proportion of actual negative cases that are incorrectly classified as positive.

As the classification threshold changes, the balance between sensitivity and false positives also changes.

  • The ROC curve visualizes these tradeoffs across all thresholds rather than relying on a single fixed cutoff such as 0.5.
  • A stronger classifier pushes the curve toward the upper-left corner while a weak classifier stays close to the diagonal line representing random guessing.
library(pROC)
# Predicted probabilities
logreg_data <- logreg_data |>
  mutate( pred_prob = predict(cultural_logit, type = "response")
  )

# ROC curve
roc_obj <- roc(response = logreg_data$culture_site,
               predictor = logreg_data$pred_prob
)

# Plot ROC
plot(roc_obj,legacy.axes = TRUE, xlim = c(1, 0), xaxs = "i",main = "ROC Curve")

  • This ROC curve lies well above the diagonal reference line which suggests the logistic regression model has reasonably good classification ability for distinguishing municipalities with and without cultural sites.
  • This can be interpreted as municipalities with cultural sites tend to receive systematically higher predicted probabilities than municipalities without cultural sites.
  • Near the upper-left region, the model achieves both relatively high sensitivity and relatively high specificity.
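To see where the points on an ROC curve come from, here is a hand-built sketch in base R. The labels and scores are made up for illustration, and this is not how pROC computes its curve internally; it only follows the TPR and FPR definitions given above.

```r
# Made-up labels (1 = positive) and predicted probabilities for eight cases
truth <- c(1, 1, 1, 0, 0, 1, 0, 0)
score <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2)

# Candidate thresholds: every observed score, plus Inf so the curve starts at (0, 0)
thresholds <- c(Inf, sort(unique(score), decreasing = TRUE))

# One (FPR, TPR) point per threshold
roc_points <- data.frame(
  threshold = thresholds,
  tpr = sapply(thresholds, function(th) sum(score >= th & truth == 1) / sum(truth == 1)),
  fpr = sapply(thresholds, function(th) sum(score >= th & truth == 0) / sum(truth == 0))
)
roc_points
```

Reading down the table, lowering the threshold can only add predicted positives, so both rates climb from 0 to 1; the order in which they climb is what gives the curve its shape.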

7.3.4.2 Area Under the Curve (AUC)

A common summary statistic is the AUC.

  • AUC = 0.5 → essentially random guessing
  • AUC ≈ 0.7 → reasonable discrimination
  • AUC ≈ 0.8+ → strong discrimination
  • AUC = 1.0 → perfect classification
# AUC
auc(roc_obj)
Area under the curve: 0.913

This model has a very high AUC of 0.91, which indicates that the model can often correctly identify municipalities containing cultural sites without generating excessive false positives.

  • If we randomly select one municipality with a cultural site and one without a cultural site, the model will assign a higher predicted probability to the municipality containing the cultural site about 91.3% of the time.
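That "random pair" interpretation can be computed directly. The sketch below uses made-up labels and scores (not the municipality data) and compares every positive case with every negative case.

```r
# Made-up labels (1 = positive) and predicted probabilities for eight cases
truth <- c(1, 1, 1, 0, 0, 1, 0, 0)
score <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2)

pos <- score[truth == 1]
neg <- score[truth == 0]

# Compare every positive with every negative:
# 1 if the positive gets the higher score, 0.5 for a tie, 0 otherwise
wins <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))

# The AUC is the share of pairs that are ranked correctly
auc_by_hand <- mean(wins)
auc_by_hand
```

In this toy example 14 of the 16 positive-negative pairs are ranked correctly, giving an AUC of 0.875.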

7.3.4.3 Selecting the Threshold

The ROC curve also illustrates that there is no single universally “correct” classification threshold.

  • Choosing the classification threshold is fundamentally a decision about balancing different kinds of errors, as you typically cannot reduce both False Positives and False Negatives at the same time.

One common strategy is to select the threshold that maximizes the balance between sensitivity and specificity.

coords(
  roc_obj,
  "best", #Defaults to Youden to determine the best
  ret = c(
    "threshold",
    "sensitivity",
    "specificity"
  )
)
  threshold sensitivity specificity
1 0.1512663   0.8461538    0.826087
  • The Youden Index is \(J = \text{Sensitivity} + \text{Specificity} - 1\); the “best” threshold is the one that maximizes \(J\).

Here, the results suggest a threshold of .15 (far from .5) which achieves a balance of approximately:

  • Sensitivity ≈ 0.846
  • Specificity ≈ 0.826

A threshold can be lower than .5 for multiple reasons related to the structure of the data.

  • This threshold does not mean: “15% probability means likely cultural site.”
  • Instead, it means: “Given the behavior of this model and dataset, classifying observations above 0.151 produces the best discrimination performance.”

The threshold is therefore a decision rule, not a probability interpretation.
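The Youden calculation is simple enough to sketch by hand. The first line uses the sensitivity and specificity reported by coords() above; the rest uses made-up labels and scores, and takes the observed scores themselves as candidate thresholds, which is an assumption of this sketch (pROC's own candidate thresholds may differ).

```r
# J for the threshold reported by coords() above
j_reported <- 0.8461538 + 0.826087 - 1
j_reported

# Made-up labels (1 = positive) and predicted probabilities
truth <- c(1, 1, 1, 0, 0, 1, 0, 0)
score <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2)

# Classify as positive when score >= threshold
thresholds <- sort(unique(score))
sens <- sapply(thresholds, function(th) mean(score[truth == 1] >= th))
spec <- sapply(thresholds, function(th) mean(score[truth == 0] < th))

# Youden's J at each threshold, and the threshold that maximizes it
youden <- sens + spec - 1
best <- which.max(youden)
data.frame(threshold = thresholds, sens, spec, youden)[best, ]
```

In the toy data the best cutoff lands at 0.7, not 0.5, which echoes the point above: the chosen threshold reflects how the scores of the two classes are distributed.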

7.4 Support Vector Machines (SVMs)

We’ll now focus our attention on a more sophisticated supervised learning classification technique, called support vector machines (SVMs).

7.4.1 SVM Overview

The goal of an SVM is simple: given a table of numerical variables, assign a classification to each observation.

  • Unlike the logistic regression model above, which handles only a binary response, SVMs can do both binary and multi-class classification.

You use the training sample of your data to help the SVM give the correct classification to a set of observations with known classifications.

Once “trained” you can use the SVM to classify other observations that don’t have known classifications.

  • For instance, a common task for an SVM might be to determine if a pedestrian is in view of a car’s collision avoidance camera.
  • Given a digital image – a big collection of numerical variables – the output of such an SVM would simply be “pedestrian” (true) or “not pedestrian” (false).

A given SVM is a rather elaborate algorithm, whose behavior is determined by quite a few numerical “parameters”.

  • The training data are used internally to determine these parameters.
  • Thankfully, that process is handled internally by the {e1071} package, so you can just tell an SVM to “train thyself” given some data and it will!

Actually, there are many kinds of SVMs, though they ultimately look about the same to the data scientist from the standpoint of inputs and outputs.

  • The {e1071} library provides four different kernels (kinds of SVM), though there are many others that are in wide usage.
  • In the machine learning jargon, this means that in addition to the parameters of each SVM, there are “hyperparameters” that select the particular SVM algorithm you want to use.

In our train-tuning-test process, the tuning stage, where we select one algorithm without changing it, can be thought of as the training stage for the hyperparameters.

It sounds fancy, but it’s really pretty easy: try a handful of versions of your algorithm, and then pick the best.

library(tidyverse)
library(e1071)

If you get an error, you probably do not have the e1071 package installed, and you’ll need kernlab for our dataset. Here is how to fix that:

install.packages('e1071')
install.packages('kernlab')

7.4.2 The spam data set

The data we’ll be using for this lesson is kind of amusing:

data("spam", package = "kernlab")

It is a single table, called spam, in which each row corresponds to an email message.

  • The columns are normalized word frequencies for various words present in the textual content of the messages.
  • Finally, there is a type column that either contains the string “spam” or “nonspam”.

You guessed it! We are going to make an email spam filter!

Now it happens that this particular dataset is rather easy to “cheat” because it includes a few oddities in how it was collected.

  • Specifically, this dataset was collected before “spear phishing” was common (where a spammer impersonates a trusted sender whose account has been compromised).
  • As a result, words that are characteristic of the data collector’s organization are a dead giveaway that a message is “nonspam”.
  • These correspond to two columns: george and num650, so we’ll deselect these.

7.4.3 Create Training, Validation and Test Sets

As we described earlier, we need to create a sampling frame and split our data into training, tuning, and test sets.

set.seed(1234)
raw_data_samplingframe <- spam |>
  select(-george, -num650) |> # Remove the "cheating" columns...
  mutate(snum = sample.int(n(), n()) / n())

training <- raw_data_samplingframe |>
  filter(snum < 0.6) |>
  select(-snum)

tuning <- raw_data_samplingframe |>
  filter(snum >= 0.6, snum < 0.8) |>
  select(-snum)

test <- raw_data_samplingframe |>
  filter(snum >= 0.8) |>
  select(-snum)
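If the sampling-frame trick looks mysterious, here is a base R sketch of the same idea on a hypothetical stand-in table of 1000 rows (the id column is invented for the example). It checks that the three pieces are roughly 60/20/20 and do not overlap.

```r
set.seed(1234)

# Hypothetical stand-in for the real data: 1000 rows with an id column
n <- 1000
frame <- data.frame(id = seq_len(n))

# sample.int(n, n) is a random permutation of 1..n, so snum takes each of
# 1/n, 2/n, ..., 1 exactly once, in random order
frame$snum <- sample.int(n, n) / n

training <- frame[frame$snum < 0.6, ]
tuning   <- frame[frame$snum >= 0.6 & frame$snum < 0.8, ]
test     <- frame[frame$snum >= 0.8, ]

# Share of rows in each set
c(training = nrow(training), tuning = nrow(tuning), test = nrow(test)) / n

# Every row lands in exactly one set
length(intersect(training$id, tuning$id))
length(intersect(tuning$id, test$id))
```

Because snum is a shuffled set of evenly spaced values rather than independent random draws, the split sizes are fixed by the cutoffs; only which rows land in each set is random.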

You can save these off in CSV files if you like.

dir.create("./outputs", showWarnings = FALSE)
training |> write_csv("./outputs/spam_training.csv")
tuning |> write_csv("./outputs/spam_tuning.csv")
test |> write_csv("./outputs/spam_test.csv")

In a few places we will want just the input variables. It will be frequently useful to have the correct answers in a separate table as well.

Let’s split these off now:

training_input <- training |> select(-type)
training_truth <- training$type

tuning_input <- tuning |> select(-type)
tuning_truth <- tuning$type

test_input <- test |> select(-type)
test_truth <- test$type

7.4.4 Training stage:

7.4.4.1 Principal Components Analysis

Let’s start with a quick visualization! Since all the explanatory variables are numerical (they’re normalized word frequencies) and there are quite a few of them, principal components analysis (PCA) is a good way to create just two variables that contain most of the information in the data.

Note

Principal Components Analysis (PCA) is a well-established method for reducing the dimensions of the data.

It uses linear algebra to turn a set of \(k\) variables into \(k\) Principal Components, each a linear combination of the original variables and each uncorrelated with the others.

The Principal Components are designed so the first principal component captures as much information (variance) as possible from one linear combination, then the second captures the most out of what is left, and so on.

Thus the first two principal components are often used in plots as they capture the bulk of the information in just two variables so we can easily plot them.

  • Use the prcomp() function from the Base R {stats} package.
spam_pca <- training_input |> prcomp()

Now we can plot the messages as a scatterplot, and color by type:

spam_pca$x |>
  as_tibble() |>
  mutate(type = training_truth) |>
  ggplot(aes(PC1, PC2, color = type)) +
  geom_point()

Each message corresponds to a point in this plot.

  • Messages with similar word usage end up near each other, while messages using different words tend to be further away from each other.
  • Notice how the spam messages spread out a bit further from the nonspam messages.
  • These outliers are easier to detect than the spam messages that do a better job of looking like a nonspam message.
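To check the claims in the PCA note on something small, the sketch below runs prcomp() on R's built-in USArrests data, used here only as a stand-in with four numeric columns. The scale. = TRUE argument standardizes each column first; that is a choice made for this sketch, not something the spam example above does.

```r
# Built-in dataset used as a stand-in (not the spam data): four numeric columns
pca <- prcomp(USArrests, scale. = TRUE) # scale. = TRUE standardizes each column first

# Proportion of the total variance captured by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Running total: how much the first one, two, three, ... components capture together
cumsum(var_explained)

# The component scores are uncorrelated with each other (off-diagonals are zero)
round(cor(pca$x), 10)
```

The same two checks (variance explained and cumulative share) are a quick way to judge how much a two-dimensional plot like the one above is leaving out.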

7.4.5 K-Means Clustering

Clustering with k-means is an unsupervised method, but it is a kind of “semi”-supervised learning in the loose sense that we tell it the number of classes we want.

Note

K-means clustering is a method for separating data into \(k\) clusters or groups (you pick the \(k\)) such that the points in each cluster are as similar to each other as possible and as dissimilar as possible to points in any other cluster.

K-means is an iterative algorithm.

  • We set the seed because the first step in K-means is random: by default, kmeans() picks \(k\) randomly chosen observations as the starting cluster centers.
  • It then calculates a lot of distances and moves points from one cluster to another.
  • It then recalculates the distances and moves points, over and over, until the improvement in performance is minimal.

In the case of spam detection, there are two classes: spam and nonspam. So, here we go!

  • Use the kmeans() function from the Base R {stats} package.
set.seed(1234)
spam_kmeans <- training_input |> kmeans(2)

Did k-means correctly identify the spam? We can tell if we make a table comparing the k-means cluster ID (an integer) with the true classes:

kmeans_results <- tibble(training_truth, km = spam_kmeans$cluster)

Then we can count the number of times we get a match between the two.

We could use table() but I think this looks nicer as a contingency table, so we’ll need to pivot_wider():

table(kmeans_results)
kmeans_results |>
  count(training_truth, km) |>
  pivot_wider(names_from = km, values_from = n)
              km
training_truth    1    2
       nonspam 1675   31
       spam     946  108
# A tibble: 2 × 3
  training_truth   `1`   `2`
  <fct>          <int> <int>
1 nonspam         1675    31
2 spam             946   108

Have a look at the results closely.

  • The best possible performance would have exactly one zero entry in each row and each column.
  • Because k-means has an element of randomness to it, I can’t tell you specifically which class ID number (the km column) means spam or not, but it’s not very effective.

Just as a sanity check, though, you can ask whether k-means is detecting “something” about the spamminess of a message.

  • The way to do this is with chi-squared.
  • We’ve done this many times before, so the following block of code should make perfect sense to set up the contingency table:
ct <- kmeans_results |>
  count(training_truth, km) |>
  pivot_wider(names_from = km, values_from = n) |>
  column_to_rownames("training_truth") # Look out! column named the same as a variable!

And then let’s just run the test:

chisq.test(ct)

    Pearson's Chi-squared test with Yates' continuity correction

data:  ct
X-squared = 95.041, df = 1, p-value < 2.2e-16

Well, I got a small \(p\)-value. So k-means detects a statistically significant feature of the spam, but it’s not really conclusive either.
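To put a number on "significant but not conclusive", the sketch below re-enters the counts from the contingency table above and compares the best possible k-means accuracy with a baseline that always guesses the majority class. The baseline comparison is an addition here, not part of the original analysis.

```r
# Counts copied from the contingency table above (rows: truth, columns: cluster)
ct_counts <- matrix(c(1675, 946, 31, 108),
  nrow = 2,
  dimnames = list(truth = c("nonspam", "spam"), cluster = c("1", "2"))
)
n <- sum(ct_counts)

# Cluster IDs are arbitrary, so score both possible labelings
acc_1_is_nonspam <- (ct_counts["nonspam", "1"] + ct_counts["spam", "2"]) / n
acc_1_is_spam    <- (ct_counts["nonspam", "2"] + ct_counts["spam", "1"]) / n

# Baseline: always guess the majority class
baseline <- max(rowSums(ct_counts)) / n

round(c(acc_1_is_nonspam, acc_1_is_spam, baseline), 3)

# Same chi-squared test as above, run directly on the counts
chi_stat <- chisq.test(ct_counts)$statistic
chi_stat
```

Under the better labeling, k-means gets about 65% of the training messages right, barely ahead of the roughly 62% you would get by calling everything nonspam. A tiny \(p\)-value tells you the association is real, not that it is useful.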

Why did k-means do poorly? Let’s label the PCA plot we made earlier with the k-means clusters instead of the true values:

  • We use as.factor() to convert the clusters from numbers to a factor so the legend is correct.
spam_pca$x |>
  as_tibble() |>
  ggplot(aes(PC1, PC2, color = as.factor(spam_kmeans$cluster))) +
  geom_point(alpha = .6) 

What you should see is that k-means splits the messages into an “inner core” and “outliers”.

  • It labels all of the inner core as nonspam, and the outliers as spam.
  • While this works for the really far outliers (which are obviously spam messages!), it doesn’t do a good job closer to the core.

Let’s give up on k-means, and switch over to SVM training.

7.4.6 Training the SVM

You train an SVM using the svm() function.

  • The svm() function can take two styles of input, though they both do the same thing internally. If you’re interested, you can read the documentation:
?svm

Here’s the format that the e1071 authors use more frequently in the documentation:

  • The ~ is the formula operator again.
  • The . on the right side of the ~ means use all the variables.
svm_linear <- svm(type ~ ., # Watch the syntax: a tilde, a period, and then a comma
  data = training, # Unlike tidyverse functions, data is not the first argument, so we can't pipe into it
  kernel = "linear" # "linear" is the kernel hyperparameter (explained below)
)

The other option is tidyverse compatible, but you have to split the data into input and output first:

svm_linear <- training_input |>
  svm(y = training_truth, kernel = "linear")

These both do the same thing, so pick whichever you like. I’m going to use the first version in what follows.

The way SVMs work is that they split the space of observations (imagine the PCA plot) into two halves by way of a “separating surface”.

  • That surface is high dimensional and takes quite a few parameters to define, but you need not worry too much about them.
  • Additionally, there are many kinds of surfaces you might try to use, each defined by a set of equations.
    • Choose a different set of equations, and your surfaces will look different!
  • The choice of a set of equations is called a “hyperparameter”, and in the case of svm(), the hyperparameter is called kernel.

There are four kernels supported by e1071:

  • ‘linear’ : The separating surface is a flat plane
  • ‘polynomial’ : The separating surface is given by a polynomial… like \(z=ax^2+by^2+cxy+...\) (lots more)
  • ‘radial’ : The separating surface is an ellipsoidal blob
  • ‘sigmoid’ : The separating surface is kind of “S”-shaped
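Behind each kernel name is a small formula that measures how similar two observations \(u\) and \(v\) are. The sketch below writes the four formulas out as plain R functions. The helper names (k_linear and so on) are hypothetical, invented for this illustration; the argument names gamma, coef0, and degree are the same ones svm() accepts, but the default values here were picked for the example and are not svm()'s defaults.

```r
# Hypothetical helper functions; gamma, coef0, and degree mirror svm() argument names
k_linear  <- function(u, v) sum(u * v)
k_poly    <- function(u, v, gamma = 1, coef0 = 0, degree = 3) (gamma * sum(u * v) + coef0)^degree
k_radial  <- function(u, v, gamma = 1) exp(-gamma * sum((u - v)^2))
k_sigmoid <- function(u, v, gamma = 1, coef0 = 0) tanh(gamma * sum(u * v) + coef0)

# Two made-up observations with two variables each
u <- c(1, 2)
v <- c(3, 4)

k_linear(u, v)                        # plain dot product
k_poly(u, v, gamma = 0.5, degree = 3) # scaled dot product raised to a power
k_radial(u, v, gamma = 0.5)           # shrinks toward 0 as u and v move apart
k_sigmoid(u, v, gamma = 0.5)          # squashed dot product, always between -1 and 1
```

This is only meant to show what the hyperparameters control; svm() handles all of this internally.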

Since our dataset isn’t too big, it doesn’t hurt to try all of these!

  • We can use the tuning dataset to select the one that does the best job at correctly classifying the email messages as spam or nonspam.

How do we do this? Well, we just repeat with a different kernel:

svm_poly <- svm(type ~ .,
  data = training,
  kernel = "polynomial"
)
svm_radial <- svm(type ~ .,
  data = training,
  kernel = "radial"
)
svm_sigmoid <- svm(type ~ .,
  data = training,
  kernel = "sigmoid"
)

Note: each of the different kernels has additional hyperparameters (like degree, gamma, and the like) that you might need to change in some circumstances.

  • We don’t need to do that here, but you may have to change them if you don’t get good results on other data…

Now that we’ve trained the SVMs for our data, it’s a good idea to anticipate their performance.

7.4.7 Plotting Results

Plotting is perhaps the best way to do this, but plotting in the original data variables is not reasonable.

  • That’s unfortunate because that’s how our SVMs were trained.
  • But, since we are using the training data, there is nothing wrong with training an additional set of SVMs to work on the PCA data since these plot better.
  • The hope (which will be validated against the tuning sample) is that these plots will be representative.

To that end, we need to transform the training data using PCA, and add back the true classifications (‘spam’ / ‘nonspam’).

training_pca <- spam_pca$x |>
  as_tibble() |>
  mutate(type = training_truth)

OK, let’s plot each SVM. It happens that the trained SVMs made by the svm() function already know how to plot themselves, though this doesn’t use ggplot().

  • Instead, it uses the base R plot() function. Here’s how to do it:
sl <- svm(type ~ ., data = training_pca, kernel = "linear") # Watch the punctuation!
plot(sl, training_pca, PC1 ~ PC2, grid = 400) # This plots PC1 versus PC2

  • The points are color coded according to the actual response.
  • The x shapes are the points that are the “support vectors” used to figure out the boundary.
  • The black points in the red space are misclassified.

If you look closely at this, you can see a line slicing through the plot, separating the spam from the nonspam. It does a pretty good job!

Try this with the other three kernels, changing what evidently needs to be changed.

  • Notice the different shapes of the different spam/nonspam regions! They all look a little different.

Which do you think best matches the shape of the true classes?

7.4.8 Validation/Tuning stage

Now we’re ready to choose the SVM kernel that will serve as our final version! Since our SVMs are already trained, we can just apply them to the new input data.

The way to do this is via the predict() function. It works like this:

predict(svm_linear, tuning_input)[1:20]
   6    8    9   14   20   22   24   28   32   39   45   52   53   55   66   79 
spam spam spam spam spam spam spam spam spam spam spam spam spam spam spam spam 
  88   97   98  101 
spam spam spam spam 
Levels: nonspam spam

You should get spammed with some “spam” and “nonspam”! Not very helpful, but it lets us see that the SVM is working.

Let’s use the tibble() function to bundle it all into one big data frame table:

tuning_results <- tibble(tuning_truth,
  linear = predict(svm_linear, tuning_input),
  poly = predict(svm_poly, tuning_input),
  radial = predict(svm_radial, tuning_input),
  sigmoid = predict(svm_sigmoid, tuning_input)
)

We want to check how many instances we have of where the tuning_truth matches with each of the other four columns: each one is where we correctly identified a message.

The way that we built the tuning_results table is a little annoying, because we really want to do these comparisons systematically.

So, let’s pivot the data longer by reworking the columns:

tuning_results1 <- tuning_results |> pivot_longer(cols = !tuning_truth)

After this, we check for matches. There are four kinds of these!

tuning_results2 <- tuning_results1 |>
  mutate(
    tp = (tuning_truth == "spam" & value == "spam"), # True positives (SVM got it right!)
    tn = (tuning_truth == "nonspam" & value == "nonspam"), # True negatives (SVM got it right!)
    fp = (tuning_truth == "nonspam" & value == "spam"), # False positive (SVM made a mistake)
    fn = (tuning_truth == "spam" & value == "nonspam") # False negative (SVM made a mistake)
  )

Count up the occurrences of each of these

tuning_results3 <- tuning_results2 |>
  group_by(name) |> # Group by the kernel names
  summarize(
    tp = sum(tp),
    tn = sum(tn),
    fp = sum(fp),
    fn = sum(fn)
  )

7.4.9 2x2 Confusion Matrices

A 2x2 confusion matrix shows the performance of a model on a sample of size \(n\) by organizing the \(n\) predictions into each of the four possible outcomes and showing the counts for each outcome.

  • A confusion matrix allows one to see how much or how little a model “confuses” the two classifications.
Sample Confusion Matrix

Actual \ Predicted   Positive (PP)               Negative (PN)
Positive (P)         True Positive (TP) count    False Negative (FN) count
Negative (N)         False Positive (FP) count   True Negative (TN) count
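In base R, table() builds this layout directly from two factors. The sketch below uses six made-up messages and treats "spam" as the positive class; listing "spam" first in the factor levels is what puts the positive class in the first row and column, as in the sample matrix above.

```r
# Made-up labels for six messages; "spam" is treated as the positive class
lv <- c("spam", "nonspam")
actual    <- factor(c("spam", "spam", "nonspam", "nonspam", "spam", "nonspam"), levels = lv)
predicted <- factor(c("spam", "nonspam", "nonspam", "spam", "spam", "nonspam"), levels = lv)

# Rows are the actual class, columns are the predicted class
cm <- table(Actual = actual, Predicted = predicted)
cm

# Pull the four cells out by name
tp <- cm["spam", "spam"]
fn <- cm["spam", "nonspam"]
fp <- cm["nonspam", "spam"]
tn <- cm["nonspam", "nonspam"]
```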

7.4.10 Testing stage

With the counts of true/false positives/negatives, there are many possible kinds of scores you can make to determine which SVM kernel works best.

  • If you divide the counts by the number of predictions \(n\) (the size of the sample you are scoring), you get the corresponding “rates”.

Different disciplines tend to prefer some of these scores over others: medical tests often report sensitivity and specificity, while most deep learning researchers prefer F1 scores.

If you’re curious, have a look at

https://en.wikipedia.org/wiki/Sensitivity_and_specificity

Just using the formulas on that page, we can compute a few of these:

tuning_results3 |>
  mutate(
    accuracy = (tp + tn) / (tp + tn + fp + fn),
    sens = tp / (tp + fn),
    spec = tn / (tn + fp),
    ppv = tp / (tp + fp),
    npv = tn / (tn + fn), # negative predictive value uses the true negatives
    f1 = 2 * tp / (2 * tp + fp + fn)
  )
# A tibble: 4 × 11
  name       tp    tn    fp    fn accuracy  sens  spec   ppv   npv    f1
  <chr>   <int> <int> <int> <int>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 linear    346   502    31    41    0.922 0.894 0.942 0.918 0.924 0.906
2 poly      184   520    13   203    0.765 0.475 0.976 0.934 0.719 0.630
3 radial    346   511    22    41    0.932 0.894 0.959 0.940 0.926 0.917
4 sigmoid   335   484    49    52    0.890 0.866 0.908 0.872 0.903 0.869

I noticed that ‘radial’ was the best in my case, at least in terms of the accuracy and the F1 score, but ‘linear’ is awfully close. This is now a situation where plotting can help, if you like.

Finally, you can repeat with just the winning SVM (svm_radial in my case) on the test data!

7.4.11 Another Example

The graphics in Figure 7.2 are from an SVM trained on data about home sales.

We want to classify home QUALITY, a factor with three different levels: LOW, MEDIUM, and HIGH.

  • Our predictors are SALES PRICE and YEAR_BUILT.

  • Figure 7.2 (a) shows that the points overlap, so we cannot draw a simple line separating the points.

  • Figure 7.2 (b) shows the linear kernel.

  • Figure 7.2 (c) shows the radial kernel.

(a) Homes Raw Data
(b) Homes SVM Linear Kernel
(c) Homes SVM Radial Kernel
Figure 7.2: SVM for Home Sales with linear and radial kernels

You can see the vastly different shapes of the boundaries for the different regions in the pictures.

  • You can see the points that are Errors in prediction, e.g., green circles in the MEDIUM region, red circles in either HIGH or LOW, and black circles in the MEDIUM region.
  • Visually, it looks like the Radial kernel might be a better solution.

It can be challenging to visualize what is happening with the different kernels.

  • In short, a kernel transforms the data into a higher-dimensional space, then fits the best hyperplane it can in that space, and then projects the result back into the original space.

  • This YouTube video provides a nice example for a polynomial kernel.
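A tiny example can make the "higher-dimensional space" idea less abstract. In the sketch below the data are made up, and the explicit feature \(x^2\) is a hand-picked stand-in for what a polynomial kernel does implicitly; it is not how svm() computes things internally.

```r
# Made-up 1-D data: the "outer" class sits on both sides of the "inner" class
x <- c(-3, -2.5, -2, -1, 0, 1, 2, 2.5, 3)
class <- ifelse(abs(x) > 1.5, "outer", "inner")

# Try many single cut points on x, letting either side be "outer"
cuts <- seq(-3.25, 3.25, by = 0.5)
best_1d <- max(sapply(cuts, function(ct) {
  acc <- mean((x > ct) == (class == "outer"))
  max(acc, 1 - acc)
}))
best_1d # no single cut on x gets every point right

# Add a second dimension, z = x^2, and one cut on z separates the classes
z <- x^2
acc_2d <- mean((z > 2.25) == (class == "outer"))
acc_2d
```

A straight cut in the new (x, z) space corresponds to a pair of cut points back on the original x axis, which is why kernel boundaries can look curved or disconnected in the original variables.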

7.5 Summary

Classifiers are an important set of methods for working with categorical response variables.

We have examined just a few forms of classifiers as there are many and there are also many packages with different implementations of the methods.

As with Regression methods, there is no “best classifier” as their performance depends on the underlying (hidden) multi-dimensional structure of the data and how well different methods might fit the data without overfitting.

Logistic Regression is popular for binary classification as it is interpretable and runs quickly.

SVM is also a quite popular method as it handles multi-class classification, supports multiple kernels, and is computationally efficient.

Both of these methods are common baselines for assessing the performance of more complex classifiers such as neural networks.

7.6 Exercise 07: Classification Models

  1. Locate a dataset that has a number of numerical variables that you can use as explanatory variables and a binary (boolean) variable that you can use as a response variable. You could use the dataset from the previous lessons if you like.

Caution: some datasets do not separate well according to the response variables. It’s better if the data separate, but don’t worry if you can’t find a good dataset. The process will “work”, but it might not work well. Be sure to explain your findings; that’s more important than getting good results!

  2. Split your chosen dataset into the three training/tuning/test samples. Save each as a CSV file and upload these.

  3. Train at least three SVMs on your training data, being careful to explain which variables are the response and the explanatory variables. Feel free to experiment with different hyperparameters beyond what was done in the worksheet above. The hyperparameters you need might be quite different!

  4. Using your tuning set, select which SVM performs the best… and then…

  5. … determine its performance using the testing set.

  6. Make sure to explain Steps 4 and 5 in comments in your Quarto file!