8  Fitting Models

Published

June 9, 2026

Keywords

Overfitting, Bias-Variance Tradeoff, Splines, Regularization, Ridge, LASSO

8.1 Introduction

This module explores concepts and examples for choosing between models of reality based on data while navigating the complexities suggested in the following two statements.

George Box: All Models are wrong, but some are useful.
Figure 8.1: World famous statistician George E.P. Box

We all know that art is not truth. Art is a lie that makes us realize truth, at least the truth that is given us to understand. The artist must know the manner whereby to convince others of the truthfulness of his lies.

— Picasso, Pablo (1923)

XKCD also has a thought-provoking perspective we should consider.

XKCD comic on Curve Fitting Messages.

Learning Outcomes

  • Explain the Machine Learning concepts such as overfitting, flexible and restrictive models, and the bias-variance trade off.
  • Explain the purpose of training, validation, and test data sets for comparing models.
  • Explain the role of prediction error when comparing models.
  • Explain why regularization methods are used with ML models and what they can do.

8.1.1 References

8.2 Review of (Statistical) Machine Learning (SML)

SML can be thought of as collection of algorithms for building models of reality using data

  • to better understand (“learn”) the relationships among the variables in the data,
  • make predictions based on new observations,
  • and asses the models’ performance,
  • all in a repeatable manner.

SML algorithms are:

  • Programmed on a computer (so can be repeated).
  • Evaluated quantitatively (using a computer) where prediction accuracy is our main measure of performance.
  • Tuned to be optimal. Machine learning models generally have hyper-parameters that can be adjusted to try to improve performance
Important

From a conceptual perspective, given a data set, SML algorithms estimate some fixed but unknown mathematical relationship between the variables.

  • The algorithm tries to use the information in the data to create a model that mimics this unknown relationship so, given new data, it can predict the true value of response variable,
    • where the predicted value is close to the true value, i.e., the residual error or difference between the two is very small.

8.2.1 SML Can Help Answer Two Types of Questions

  • Inference: How are the variables related? Which ones are more important? Can I quantify my uncertainty?
  • Prediction: How well can we make predictions using new data?

Your questions will guide how you make decisions in building and choosing a “best” model based on your data.

8.2.2 Supervised or Unsupervised Learning?

Is there is a labeled response variable you can use for comparison?

(a) Labeled as Cats
(b) Labeled as Dogs
(c) Unlabeled Images
(d) More unlabeled Images
Figure 8.2: Same images, but some are labeled so the algorithm “knows” the image is a Cat or Dog as input and can compare a prediction to the “known” value. The second set is unlabeled so the algorithm has to determine how to distinguish among the possible groups of images.
  • The goal of Supervised learning is to build a model where the variables can accurately predict the associated response for a given set of input values.
    • For inference, we want to understand which predictors are useful or not and by how much in our predictions.
    • Methods include multiple forms of Regression or Classification.
  • The goal of Unsupervised Learning is to identify possible relationships among variables when there is no labeled response variable.
    • We want to find the “features” that help us differentiate or identify outcomes.
    • Methods include classification, clustering, and principal components

8.2.3 Regression or Classification?

  • Regression algorithms have response variables that are numerical/quantitative response.
  • Classification algorithms have response variables that are categorical (take values from a discrete set of categories such as TRUE or FALSE).

8.2.4 Parametric or Non-Parametric

  • Parametric: we assume there is a “nice” mathematical relationship between the variables. It could be linear or non-linear. Then we can estimate the parameters in our model so we can predict on new data.
    • These algorithms often allow us to make statistical tests about our results.
    • The closer our assumed relationship is to the “true but unknown” relationship, the more likely we will have good predictions.
  • Non-Parametric: We don’t make any assumptions about the form of the relationship.
    • We may still create parameters for the model but they don’t have any known statistical properties.
    • Non-parametric models can create more more flexible solutions, but we have to use other other methods (and run them many times) to quantify the uncertainty (accuracy) of our results.

8.3 Flexible or Restrictive?

A key concept in SML is choosing how flexible or non-flexible (restrictive) a model is by tuning the number of parameters (degrees of freedom) available for fitting the model.

  • Flexible models have more parameters (or degrees of freedom) so they can more closely match the input data.
    • Highly flexible models, with hundreds (billions) of parameters, can be complicated to interpret.
    • These are “black-box” models - since we can’t explain what is happening inside the model to explain the prediction, e.g., why was someone turned down for a loan.
    • Over-fitting the data occurs when a model is too flexible; it matches the sample data so closely its predictions on new data have high variance so are not very good. It is modeling the “noise” in addition to the “signal”.
  • Restrictive models have fewer parameters (or degrees of freedom).
    • They can be easier to interpret but may not match the data as well as a more flexible model.
    • Under-fitting the data occurs when a model is too restrictive; it misses key features of the relationship and its predictions on new data have high bias.

Think of degrees of freedom (df) as the number of times we can change the shape of the relationship between the set of predictor variables and the response variable across the range of the predictors.

  • A straight line is inflexible as it only has two degrees of freedom (intercept and slope). It has no bends or curves in it.

When we allow more degrees of freedom, we can create a more flexible model.

  • Flexible models follow the data more closely, e.g., adding a squared term allows a linear model to bend to follow the data more closely than a single straight line.
  • A high-degree polynomial (e.g., a spline) can be quite curvy and can be different on different parts of the data.
  • As an example, the plot of Covid 19 cases in DC over time in Figure 8.3 is highly non-linear so a linear model would not be a good fit as it would underfit the true relationship.
Show code
url <- c("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
readr::read_csv(url) |> 
  dplyr::filter(FIPS == 11001) |> 
  dplyr::select(tidyselect::contains("/")) |> 
  tidyr::pivot_longer(cols = tidyselect::contains("/"), names_to = "date", values_to = "cases", names_transform = lubridate::mdy) |> 
  dplyr::mutate(daily_cases = cases - dplyr::lag(cases)) |> 
  dplyr::filter(daily_cases != 0) ->
  covid_19_dc
rm(url)
covid_19_dc |> 
  ggplot2::ggplot(ggplot2::aes(date, daily_cases)) +
  ggplot2::geom_point(alpha = .5, size = .8) +
  ggplot2::geom_smooth(se = FALSE, method = lm, formula = y ~ x) +
  ggplot2::scale_y_log10() +
  ggplot2::labs(title = "Daily Covid Cases Reported in DC",
                subtitle = "log scale",
       caption = "COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University") ->
  covid_19_dc_plot
covid_19_dc_plot
A point plot of the daily cases for Covid19 in the US city of  Washington, DC showing the highly nonlinear nature of the data. A linear regression line does not fit the data well at all.
Figure 8.3: Daily Covid19 Cases in Washington DC

When choosing the degrees of freedom we will want to find the balance between too flexible and too restrictive.

8.4 Comparing Models

Regardless of our question (inference or prediction), we are usually building multiple models.

SML algorithms have strengths and weaknesses for different data sets so we want to try a variety of approaches to building a model.

  • For a regression problem we could try linear regression, polynomial regression, ridge Regression, LASSO Regression, regression Trees (multiple types), principal components regression, neural networks, …
  • For a classification problem we could try logistic regression, SVM, K-nearest Neighbors, Linear Discriminant Analysis/Quadratic Discriminant Analysis (LDA/QDA), classification trees (several types), naive Bayes prediction, neural networks, …

Even with the same approach or algorithm, we still have more choices to make:

  • choosing what variables or “features” to select for use in the model.
  • choosing whether and how to transform the selected variables.
  • choosing “hyper-parameters” to change how the algorithm runs to try to get the best performance, e.g., setting the hyper-parameter for when to stop based on the rate of improvement.

Given all these choices and models, as stated in Section 8.2, we need to quantitatively measure model performance.

There are a variety of metrics we can use but, as discussed in Section 6.2, we want to ensure we compare models based on how well the model performs on new data, i.e., on a validation or test data set.

  • We want to avoid comparing models on their performance on the training data.

Since we want the “best” predictions, we use metrics based on predictive error to compare models.

  • For Regression Models, that usually means using Mean Squared Error (MSE) for predictions from the new data.
  • For Classification Models, that usually means using Classification Error rates for predictions from the new data.
Note

The predictive error metrics are also known as the loss functions in machine learning.

  • We want to minimize the difference (the loss) between the true values and our predicted values

While these metrics provide a good start, there are other considerations at play as well.

  • A key one is the risk analysis for the predictions. It is often true that the risk is different for different kinds of errors.
  • If you are legally required to be able to explain the model results, you may have to forgo better performing “black box” models in favor of a more transparent but less accurate model.
  • If you are more concerned about how the model will perform on a variety of new data for prediction, you may be willing trade off some bias for reduced variance.
Important

As Figure 8.1 suggests, we are on a quest to find a model that is useful in answering our question.

Machines can learn, but Humans still have to decide what is useful to them.

8.5 Using Mean Squared Error (MSE) as a Measure of Performance for Regression

Prediction error is the flip side of prediction accuracy; How well can a model predict the unknown.

If a model has high variance, there is a lot of noise in the predictions; change the data a little and the predictions can change a lot.

Regression model performance is often measured by prediction mean squared error (MSE) where the predictions are on validation or test data, i.e., data not used to train the model.

  • To calculate a model’s MSE, (read the name from right to left).
    • Make a prediction for each of the new data points.
    • Subtract the prediction from the “true” known response value (that is the error).
    • Multiple each error by itself (Square the error)
    • Calculate the average across all the squared errors (get the mean).
    • Now you have the MSE.

We could do some math to show that MSE is the sum of three sources of error (reasons our predictions are off).

  1. The Inherent Variance of our Response variable: This is “Irreducible” error as we can’t do anything about it.
  • The higher the variance of our response variable, the higher the model’s MSE no matter how good the model is.
  • If we are predicting weight based on height, there are a lot of factors that influence weight besides height. - Thus we will have some error based on the fact we are using a model that is a subset or abstraction of reality.
  • We might be able to reduce this if we had more observations that provide a more accurate estimate of our response variables variance; but if it is high, that is life.
  • Life is messy and randomness is everywhere.
  1. The Squared Bias of our predictions: sometimes called “systematic error”.
  • How far are are our predictions from the true values on average (then squared).
    • Squaring penalizes predictions that are much farther away from the true value more than those that are close.
  • This can happen if we have too simple of a model, too restrictive.
  • If we try to model weight based on height as a straight line, but it should be a curve, we will consistently under- or over- predict the true value.
  • We can reduce the squared bias by increasing the flexibility of our model.
    • If our expected bias is 0, we say our estimates are “unbiased” which is a good thing.
  • High bias means our model is wrong in the same way over and over again (systematically).
  1. The Variance of our predicted values: sometimes called “model sensitivity to new data”.
  • How scattered are our predictions as we train the model on new data.
  • This can happen when we fit the model too tightly or closely to one set of data (the training data) so given a new training data set it does poorly on the new data.
  • Let’s say we built a model that accurately predicted weight for one group of people, and then tried it on a second group with similar but different heights.
    • If the predictions for people with similar heights but in the different groups are vastly different, that means the model has high variance in its predictions. This suggests it is too flexible.
  • We can reduce it by decreasing the flexibility of our model.
  • High variance means our model is too sensitive to the noise in the data.
ImportantThe Bias Variance Trade Off

The degrees of freedom trade off between flexible and restrictive models is equivalent to the bias and variance trade off in the results.

  • Flexible Methods have low bias as they can follow the data but can have high variance.
  • Restrictive Methods have low variance and but may have high bias as they may not match all the data well.

This need to balance or trade off bias and variance in predictions is why SML is not just “plug and chug”– we have to think.

  • The good news is we now have ways to quantitatively evaluate the trade off using the predictive error metrics.

Figure 8.4 shows possible options. We would always prefer low bias and low variance, but that rarely happens so we wind up with one of the other options.

  • This includes forgoing an “unbiased estimator” if the total predictive error is less by accepting a little bias.
(a) High Bias - Low Variance
(b) High Bias - High Variance
(c) Low Bias - Low Variance
(d) Low Bias - High Variance
Figure 8.4: The Bias Variance Trade Off for a True Value at the Bulls eye (c > [a = d] > b)

8.6 Choosing How Much Flexibility is “Best” for a Model

We will use the Auto data set from the American Statistical Association’s Exposition of 1983 (ASA 1983) to see how we can compare different levels of flexibility and choose a model.

  • Note: we will use just a training and a test set in this example. We could split into three sets but are keeping it simpler.

8.6.1 Get the Data and Plot the Variables of Interest.

Let’s load the {tidyverse}, read in the data, and glimpse() it.

Show code
library(tidyverse)
Auto <- read_csv("./data/Auto.csv")
glimpse(Auto)
Rows: 392
Columns: 9
$ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
$ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
$ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
$ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
$ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
$ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
$ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
$ origin       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
$ name         <chr> "chevrolet chevelle malibu", "buick skylark 320", "plymou…

We are interested in the relationship between weight and mpg (miles per gallon).

Let’s create a scatterplot with weight on the x axis and mpg on the y axis.

  • Use alpha = .5 to reduce the overplotting.
  • Assign the name base_plot to the plot and show it.
Show code
ggplot(Auto, aes(x = weight, y = mpg)) +
  geom_point(alpha = .5) ->
base_plot
base_plot

  • It does not look quite linear.

Start with the base_plot and use geom_smooth() to add a Linear Regression Line as in Figure 8.5.

  • Save as base_plot and show it.
base_plot +
  geom_smooth(method = lm, se = FALSE, col = "red", linewidth  = 2) ->
base_plot
base_plot
Figure 8.5: A plot of mpg vs weight with a linear smoother
  • Points seem to be more above the lines on the extremes and below the line in the middle (high bias).

8.6.2 Calculate the Prediction MSE

To calculate the Prediction MSE, we need test data so we will split the data into halves, training and testing data sets.

  • To create a 50/50 split, we will get a random sample of size \(n/2\) where \(n\) is the number of observations.
  • We can use that sample as a vector of indexes to subset the data set.
  • We will also set the seed for the pseudo-random number generator so our split of the observations is repeatable.
n <-  length(Auto$mpg)
n
[1] 392
set.seed(1234)
z <-  sample(n, n/2) # drawn from integers 1:n

Use the lm() function to create a regression model based on the training data subset using the indices in z.

reg <-  lm(mpg ~ weight, subset = z, data = Auto)

Use the testing data with predict() to generate the predictions from the new data.

Y <-  Auto$mpg[-z] # the True Y
Yhat <-  predict(reg, newdata = Auto[-z,]) # The predicted Y

Now, calculate the prediction MSE by calculating the differences between the vectors of \(Y\) and \(\hat{Y}\), squaring the differences, and then using the mean() function to get the average.

MSE <- mean((Y - Yhat)^2)
MSE
[1] 17.25805
  • This number does not mean much by itself. The MSE is only meaningful when comparing to the MSE from other models.
  • Perhaps we can get a lower MSE then 17.25805 with a different model.

The original plot in Figure 8.5 suggested the relationship was not linear. Since linear regression is a restrictive model it is probably not representing the data very well.

Let’s use a plot of the residuals versus fitted values commonly used in regression analysis.

  • If the “true” relationship is linear, and the linear model is a good fit, we would expect to see the residuals generally evenly-spaced both above and below the line across the range of the values.

  • Use Base R plot() to produce a quick plot of the Residuals (errors) vs the Fitted values.

plot(reg, which = 1)
Figure 8.6: A plot of the residuals versus the fitted values (the regression line)
  • Figure 8.6 shows there is clear curvature to the residuals as they are not evenly spaced above and below the line.
  • The linear model is not a good fit

8.6.3 Let’s Try a Spline Model which can be Flexible.

Spline models allow us to adjust the number of degrees of freedom or parameters in the model to provide more flexibility.

  • This makes df a hyper-parameter we can use later to tune our model.

Let’s start with df = 2. This is the same as simple linear regression with two parameters for intercept and slope.

We will use the geom_spline() plotting function from the {ggformula} package.

  • Start with base_plot and add the spline on top (in the next layer). Note the df=2 argument.
  • Save as base_plot2 and show it.
library(ggformula) # the package for geom_spline

base_plot +
geom_spline(df = 2, color = "orange", lty = 2, linewidth = 2) ->
  base_plot2
base_plot2

  • Notice how the linear regression line and the spline are exactly the same.

Let’s increase the degrees of freedom which allows for more flexibility.

  • Use df = 3, then 10, then 40, and then 140.
  • This does not mean the spline is fitting only higher-order polynomials. It may decide to split into different polynomials for different parts of the data.
base_plot +
  geom_spline(df = 2, color = "orange", lty = 2) +
  geom_spline(df = 3, color = "blue",  linewidth = 2) +
  geom_spline(df = 10, color = "brown", linewidth = 2) +
  geom_spline(df = 40, color = "green", linewidth = 2) +
  geom_spline(df = 140, color = "purple", linewidth = 1) 

  • As the degrees of freedom increase, the line is more flexible and matches the data better, but the model might be less accurate for prediction as the bias decreases but the variance increases.

8.7 Using Test MSE to Explore Tuning a Model

We want to compare the prediction MSE across various versions of a regression model.

  • We created multiple spline models where each had a different degrees of freedom.

The question is which was best? Can we optimize the model for the degrees of freedom that minimizes MSE.

As we saw in Section 8.5, MSE has two elements that are reducible, the Squared Bias and the Variance of the predictions. We can’t do anything about the irreducible error.

Important

To minimize the MSE we have to find the balance between minimizing the variance of the estimates and the bias of the estimates.

This step in tuning the model is where we face The Bias-Variance Trade-off.

8.7.1 Tuning an SML model

Tuning is optimization of a SML to minimize the predictive error, trying to find the balance between flexible and restrictive models.

Let’s go back to the Auto data. Recall the Figure 8.5 with regression line of mpg vs weight.

base_plot

The plot suggests linear regression tends to underestimate mpg for the light and heavy cars and overestimate the mid-size cars.

When we add a highly flexible spline model (df = 100), it is much closer to the data but the accuracy of the predictions appear to be highly dependent upon the value of weight.

base_plot +
  geom_spline(df = 100, color = "purple")

Let’s try to tune our model.

Step one is create training and test data. Instead of splitting 50-50, let’s pick 200 cars for our training set out of 392 observations.

set.seed(123)
z <-  sample(nrow(Auto),200)
training <-  Auto[z,]
test <-  Auto[-z,] # the complement of the indices for the training data

8.7.2 Fit a Regression Model

Let’s use the lm() function to fit a regression model on the training data and check the summary.

reg <- lm(mpg ~ weight, data = training)
summary(reg)

Call:
lm(formula = mpg ~ weight, data = training)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.0822 -2.7920 -0.4653  2.2020 16.4458 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 46.2373794  1.1255991   41.08   <2e-16 ***
weight      -0.0076223  0.0003641  -20.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.481 on 198 degrees of freedom
Multiple R-squared:  0.6888,    Adjusted R-squared:  0.6872 
F-statistic: 438.2 on 1 and 198 DF,  p-value: < 2.2e-16

Note the df is \(198 = 200 - 2 \text{ parameters}\).

We will use this model to make our predictions. We can predict for the entire data set and then select the -Z indexed values to calculate \(MSE_{test}\)

Yhat <-  predict(reg, newdata = Auto)
MSE_test <-  mean((Yhat[-z] - Auto$mpg[-z])^2)
MSE_test
[1] 17.44414

What’s the unit of measure for MSE_test? It’s miles-squared per gallons-squared, a bit hard to interpret.

What can we do? Let’s take the square root to give us the Root Mean Squared Error (RMSE).

RMSE_reg <- sqrt(MSE_test) |> round(2)
RMSE_reg
[1] 4.18

So the liner regression model is predicting within \(\pm\) 4.18

We could add more more variables from the data set, besides weight, which would make the model more flexible. How do we know if it makes it too flexible?

8.7.3 Fit a more flexible spline model instead

Use the smooth.spline() function from the Base R {stats} package to generate the spline model.

ss <-  smooth.spline(Auto$weight[z], Auto$mpg[z], df = 100)

Use predict() to make the predictions.

  • Note predict() returns a list of x and y. All we want is the y element.
Yhat <- predict(ss, Auto$weight[-z])
str(Yhat)
List of 2
 $ x: num [1:192] 3504 3693 3436 4341 4312 ...
 $ y: num [1:192] 18.3 22.2 16.1 15.4 13.9 ...
Yhat <- predict(ss, Auto$weight[-z])$y

Get the RSME and compare to RMSE_Reg

RMSE_ss <- sqrt(mean((Yhat - Auto$mpg[-z])^2))
RMSE_ss
[1] 6.205467
RMSE_reg
[1] 4.18
Note

Building spline models allow for fractional degrees of freedom as long as \(df \geq 1\). Hard to imagine in our minds, but mathematically fine.

Let’s use this to see if we can find an optimal value for our RMSE.

8.7.4 Fitting Many Models

Let’s fit 100 spline models with different degrees of freedom

  • We will use a random sample of 200 rows for training and the rest for test.
  • We will use the map_dbl() function from {purrr} which allows us to pass it a vector of degrees of freedom and a function name.
  • It will use the function to calculate the RMSE for each degree of freedom in the vector and return all the results as a numeric vector.
set.seed(123)
z <-  sample(nrow(Auto), 200)
deg_free <- seq(1.1, 11, .1) # create a sequence of numbers for the df

# Create a custom function to calculate RMSE (we use it several times)
calc_rmse_spline <- function(x, y, index, df){
  ss <-  smooth.spline(x = x[index], y = y[index], df = df)
  Yhat <-  predict(ss, x = x[-index])$y
  RMSE <- sqrt(mean((Yhat - y[-index])^2))
}

# we are using the anonymous function syntax \(arg) with argument df
# 
RMSE <- map_dbl(deg_free,
                \(df) calc_rmse_spline(Auto$weight, Auto$mpg, z, df)) 

Let’s plot how the RMSE changes with degrees of freedom.

Show code
ggplot(bind_cols(deg_free = deg_free, RMSE = RMSE), aes(deg_free, RMSE)) + # add the vectors as columns into a data frame.
  geom_point() +
  geom_point(aes(x = deg_free[(RMSE == min(RMSE))], y = min(RMSE)), color = "red") + 
  geom_text(aes(x = deg_free[(RMSE == min(RMSE))], y = min(RMSE)), label = (deg_free[(RMSE == min(RMSE))]), nudge_y = -.006,check_overlap = TRUE) + 
  geom_vline(xintercept = 2, color = "blue", lty = 2) +
  geom_text(aes(x = 1.4, y = 4.1), label = "Too Rigid\nHigh Bias\nUnderfit") +
  geom_text(aes(x = 6, y = 4.1), label = "Too Flexible\nHigh Variance\nOverfit") 

What happens if we change the seed for selecting our sample? What happens if we change our sample size?

Try the shiny app for Exploring Splines with the Auto Data.

Important

Selecting the hyper-parameter(s) to create a model that optimizes the bias-variance trade off is a typical challenge in Data Science.

8.7.5 DC Covid Data Example Continued

Recall Figure 8.3?

Show code
covid_19_dc_plot

How many splines are optimal??

Let’s get the data for DC and shape it for our analysis.

url <- c("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_US.csv")
readr::read_csv(url) |> 
  dplyr::filter(FIPS == 11001) |>   # Get just data for Washington DC
  dplyr::select(tidyselect::contains("/")) |> 
  tidyr::pivot_longer(cols = tidyselect::contains("/"), 
                      names_to = "date", values_to = "cases", 
                      names_transform = lubridate::mdy) |> 
  # Subtract totals for successive days to get new daily cases
  dplyr::mutate(daily_cases = cases - dplyr::lag(cases)) |>  
  dplyr::filter(daily_cases != 0) ->
  covid_19_dc
rm(url)

Let’s calculate the RMSE for all the possible Degrees of Freedom

set.seed(1234)
z <- sample(nrow(covid_19_dc), floor(nrow(covid_19_dc) * .6))
deg_free <- seq(2, 100, 1)

map_dbl(
  deg_free,
  \(df) calc_rmse_spline(
      as.integer(covid_19_dc$date),
      covid_19_dc$daily_cases, z, df
  )
) ->
RMSE
Show code
ggplot(bind_cols(deg_free = deg_free, RMSE = RMSE), aes(deg_free, RMSE)) +
  geom_point() +
  geom_point(aes(x = deg_free[(RMSE == min(RMSE))], y = min(RMSE)),
             color = "red") +
  geom_text(aes(x = deg_free[(RMSE == min(RMSE))], y = min(RMSE)),
            label = (deg_free[(RMSE == min(RMSE))]), nudge_y = -4,
            check_overlap = TRUE) +
  geom_vline(xintercept = 2, color = "blue", lty = 2)

Show code
cols <- c(
  "#56B4E9",  # light blue
  "#009E73",  # green-teal
  "#D55E00"   # vermillion
)

covid_19_dc_plot +
  geom_spline(
    df = 18,
    color = cols[1],
    linewidth = 1,
    linetype = "solid"
  ) +
  
  geom_spline(
    df = 50,
    color = cols[2],
    linewidth = 1,
    linetype = "solid"
  ) +
  
  geom_spline(
    df = 72,
    color = cols[3],
    linewidth = 1,
    linetype = "solid"
  ) +
  
  geom_text(
    aes(
      x = lubridate::mdy("07/05/2020"),
      y = 1200
    ),
    label = "df = 18",
    color = cols[1],
    fontface = "bold",
    hjust = 0
  ) +
  
  geom_text(
    aes(
      x = lubridate::mdy("07/05/2020"),
      y = 800
    ),
    label = "df = 50",
    color = cols[2],
    fontface = "bold",
    hjust = 0
  ) +
  
  geom_text(
    aes(
      x = lubridate::mdy("07/05/2020"),
      y = 550
    ),
    label = "df = 72",
    color = cols[3],
    fontface = "bold",
    hjust = 0
  ) +
  
  theme(
    panel.grid.minor = element_blank()
  )

What does this suggest to you?

8.8 Classification Prediction Error

Concepts such as bias-variance trade-off and optimizing flexibility to minimize prediction error are applicable in classification models, but the formulas are different than in regression.

  • We can’t calculate differences between two categories, e.g., if we were predicting hair color, \(\text{brown} - \text{red} = ?\) does not mean anything, so we just determine is the prediction is correct or incorrect, and call an incorrect prediction an error.

The error rate is the fraction or proportion of incorrect classifications.

  • The training error rate is the proportion of predictions based on the training data that are incorrect or misclassified.
  • What we will use to compare models is the proportion of misclassifications based on new data (test data not used in creating the model) or the test terror rate.

We want to optimize our classifier by minimizing the test error rate.

  • Note: the error rate is a percentage so the \(\text{accuracy rate} = 1 - \text{error rate}\).
    • An error rate of 20% means the accuracy rate is 80%.
  • Minimizing the error rate is the same as maximizing the accuracy rate.
Important

Optimizing SML models to be useful means humans making choices about the bias-variance trade off.

8.9 Regularization

Regularization is a family of methods used to improve model performance.

Think of regularization as a set of tools to keep a model from getting carried away.

  • It’s a way to control the complexity of a model so it doesn’t just memorize the data (overfitting) but actually learns patterns that generalize well to new data.

Regularization is like saying to your model: “Hey, don’t try to be perfect, just be reasonable.”

Without regularization, a model, especially complex ones like deep neural networks or models with lots of features, might:

  • Fit the noise in the data (like trying to explain randomness).
  • Depend too heavily on certain input variables.
  • Become so complicated that it’s unstable and unpredictable.

Methods exist to help mitigate the risk of overfitting by selecting the right variables, and reduce variance of our predictions by “shrinking” our coefficient estimates.

When we have a lot of variables to choose from in building a model, how do we know which ones to use? Why not use them all?

8.9.1 Every Variable Costs a Degree of Freedom

Every variable we include in the model adds a degree of freedom for the model making it more flexible

  • That also increases the risk of overfitting, capturing noise rather than the true signal.

There is an associated cost as well, especially in linear models.

  • The total degrees of freedom are determined by the number of observations, \(n\).
  • Each predictor we add uses up one degree of freedom.
  • For small to moderate sample sizes (e.g., in the hundreds), this trade-off becomes important.

Why does it matter?

The variance of the coefficient estimates increases as residual degrees of freedom decrease:

  • That means: fewer df for residuals \(\implies\) noisier estimates of the model coefficients.
  • Noisier estimates \(\implies\) higher variance in predictions \(\implies\) poorer generalization to new data.

8.9.2 More Variables Means More Chance of Multicollinearity

One of the statistical assumptions in linear regression is that our predictor variables are “statistically independent” of each other, i.e., knowing the value of one variable tells us nothing about the value of another.

  • We know that assumption is not really 100% true in the real world.
  • Even so, linear models are generally robust to small correlations among variables.
  • However, when two or more variables are highly correlated, trouble can arise with our models and their predictions.

Multicollinearity in statistics is a fancy term for when two or more variables are highly correlated with each other.

  • When this happens, it means that they are both trying to provide the same information to the model.
  • This makes the model unstable as small shifts in the data may tilt the model in different directions towards one variable versus the other.
  • It also increases the variance of the estimates of the variables’ coefficients.

One measure is called the Variance Inflation Factor (VIF). This a measure of how much a variable will inflate or increase the variance of other variables in the model due to having multicollinearity with them.

  • It is useful for identifying which variables might be removed from the model.

Let’s use the Auto.csv data.

Show code
library(tidyverse)
Auto <- read_csv("./data/Auto.csv")
glimpse(Auto)
Rows: 392
Columns: 9
$ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 15, 14, 2…
$ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6, 6, 4, …
$ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390, 383, 34…
$ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190, 170, 16…
$ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4425, 385…
$ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10.0, 8.5, …
$ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7…
$ origin       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1, 1, 3, …
$ name         <chr> "chevrolet chevelle malibu", "buick skylark 320", "plymou…

Let’s predict mpg based on the other variables not including name.

reg <- lm(mpg ~ . - name, data = Auto)

We can use the {car} package (Companion to Applied Regression) to extract variance inflation factors for the variables. (Fox and Weisberg 2019)

library(car)
vif(reg) |> round(2)
   cylinders displacement   horsepower       weight acceleration         year 
       10.74        21.84         9.94        10.83         2.63         1.24 
      origin 
        1.77 

How to interpret the VIF?

  • Having displacement in the model will inflate the variances of other variables by 21.843.

How was it computed?

  • By regressing displacement on all the other variables (except the response mpg and name which we left out of the model.)
Show code
reg_vif <- lm(displacement ~ . -name - mpg, data = Auto)
summary(reg_vif)

Call:
lm(formula = displacement ~ . - name - mpg, data = Auto)

Residuals:
    Min      1Q  Median      3Q     Max 
-74.429 -12.546   0.438  13.041 103.012 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -43.88383   31.41652  -1.397   0.1633    
cylinders     28.54918    1.63999  17.408  < 2e-16 ***
horsepower     0.53804    0.08939   6.019 4.09e-09 ***
weight         0.03347    0.00408   8.204 3.54e-15 ***
acceleration  -1.34481    0.66682  -2.017   0.0444 *  
year          -0.45258    0.34491  -1.312   0.1902    
origin       -11.73431    1.78891  -6.559 1.74e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.57 on 385 degrees of freedom
Multiple R-squared:  0.9542,    Adjusted R-squared:  0.9535 
F-statistic:  1337 on 6 and 385 DF,  p-value: < 2.2e-16

We can calculate the VIF of displacement as

1/(1 - summary(reg_vif)$r.squared)
[1] 21.83679

The VIF shows how well displacement could be predicted by the other variables in the data. While not perfect correlation, it is pretty high.

Let’s check the house_ed_gdp_df.

house_ed_gdp_df <- read_csv("./data/house_ed_gdp_joined.csv")
names(house_ed_gdp_df)
 [1] "Municipality"               "culture_site"              
 [3] "Average_size"               "has_washing_machine"       
 [5] "has_internet"               "has_computer"              
 [7] "has_car"                    "Illiteracy_rate"           
 [9] "primary_lower_secondary"    "upper_secondary"           
[11] "university"                 "population"                
[13] "general_revenue_per_person"
  • Regress general_revenue_per_person on all the variables but Municipality and check the VIF.
grp_reg <- lm(general_revenue_per_person ~ . -Municipality, data = house_ed_gdp_df)
vif(grp_reg) |> round(2)
           culture_site            Average_size     has_washing_machine 
                   2.01                    1.62                    1.84 
           has_internet            has_computer                 has_car 
                   3.01                    6.16                    2.37 
        Illiteracy_rate primary_lower_secondary         upper_secondary 
                   2.72                 1548.76                  437.86 
             university              population 
                 578.91                    2.37 
  • Are the results surprising? How do you interpret it?

  • What should we do?

Show code
grp_reg2 <- lm(general_revenue_per_person ~ . -Municipality - primary_lower_secondary - upper_secondary, data = house_ed_gdp_df)
vif(grp_reg2) |> round(2)
       culture_site        Average_size has_washing_machine        has_internet 
               1.85                1.58                1.68                2.44 
       has_computer             has_car     Illiteracy_rate          university 
               5.94                2.28                1.76                4.05 
         population 
               2.34 
  • These are much more reasonable.

Compare the summaries and note how the estimates have changed and the university standard error is much smaller.

summary(grp_reg)

Call:
lm(formula = general_revenue_per_person ~ . - Municipality, data = house_ed_gdp_df)

Residuals:
   Min     1Q Median     3Q    Max 
-21748  -7871  -1651   3947  63262 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)  
(Intercept)             -2.914e+05  9.498e+05  -0.307   0.7604  
culture_siteno_site      1.873e+03  6.823e+03   0.274   0.7849  
Average_size            -4.895e+03  4.896e+03  -1.000   0.3225  
has_washing_machine      8.897e+01  1.083e+03   0.082   0.9349  
has_internet            -4.064e+02  2.187e+02  -1.858   0.0694 .
has_computer             5.099e+02  6.546e+02   0.779   0.4399  
has_car                 -2.520e+01  3.310e+02  -0.076   0.9396  
Illiteracy_rate          2.142e+03  3.060e+03   0.700   0.4874  
primary_lower_secondary  3.359e+03  9.513e+03   0.353   0.7256  
upper_secondary          3.545e+03  9.571e+03   0.370   0.7127  
university               3.527e+03  9.398e+03   0.375   0.7091  
population              -2.485e-02  3.867e-02  -0.643   0.5236  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15330 on 47 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.2599,    Adjusted R-squared:  0.08674 
F-statistic: 1.501 on 11 and 47 DF,  p-value: 0.1632
summary(grp_reg2)

Call:
lm(formula = general_revenue_per_person ~ . - Municipality - 
    primary_lower_secondary - upper_secondary, data = house_ed_gdp_df)

Residuals:
   Min     1Q Median     3Q    Max 
-20511  -7482  -1912   3244  64321 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)  
(Intercept)          5.563e+04  9.950e+04   0.559   0.5786  
culture_siteno_site  1.500e+03  6.431e+03   0.233   0.8166  
Average_size        -4.570e+03  4.743e+03  -0.963   0.3400  
has_washing_machine  1.841e+01  1.017e+03   0.018   0.9856  
has_internet        -4.332e+02  1.930e+02  -2.244   0.0294 *
has_computer         4.634e+02  6.307e+02   0.735   0.4660  
has_car              4.674e-01  3.186e+02   0.001   0.9988  
Illiteracy_rate      1.462e+03  2.417e+03   0.605   0.5482  
university           3.050e+02  7.709e+02   0.396   0.6941  
population          -2.498e-02  3.770e-02  -0.663   0.5107  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15050 on 49 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.2571,    Adjusted R-squared:  0.1207 
F-statistic: 1.884 on 9 and 49 DF,  p-value: 0.07665
Note

A VIF of 4 is generally considered okay.

  • 5-10 should be noted, and
  • \(\text{VIF} \geq 10\) should be avoided.

The VIF is one method for humans making choices about what variables to include/exclude in a model.

8.9.3 Correlation Plots

Correlation plots provide some insights into how variables might be correlated with each other.

The {corrplot} package has a nice visualization.

  • Limit to numeric variables and no need to include the response.
cor_matrix <- cor(house_ed_gdp_df[, 3:12], use = "complete.obs")  # handles missing values
corrplot::corrplot(cor_matrix, method = "circle", type = "upper", tl.cex = 0.8)

  • Note the large dark red circles on primary_lower_secondary indicating high negative correlation with upper_secondary and university

8.9.4 Comparing Models for Variable Selection

Another method for selecting variables in regression is to build two models, a full and a reduced model.

  • The Reduced model has one or more variables of interest.
  • The Full model has all the variables in the Reduced model plus one or more additional variables.
    • The Reduced model is “nested” inside the Full model.

Then run a regression on the full model and one on the reduced model.

  • In this situation, the Full model’s Residual Sum of Squares must be equal to or less than the Residual Sum of Squares of the Reduced model because it is more flexible.

There is a statistical test for linear regression called the F-test that compares nested models based on how much the MSE is reduced by the Full relative to degrees of freedom

The ratio of the reduction in the MSE per df to the Full model MSE is called the F-Statistic.

If the F-Statistic is large, then the Full model is considered significantly better.

  • If the F-statistic is small, then the information in the extra variables was not worth the cost in degrees of freedom so the Reduced model with fewer variables is better

8.10 Regularization by Shrinkage and Variable Selection

We have seen above that we can choose to remove variables from the model. When there are more than a few variables that can be problematic.

  • The good news is there are ways to tweak the algorithms to have them automatically make choices of variables.

Without regularization, a model, especially complex ones like deep neural networks or models with lots of variables, might:

  • Fit the noise in the data trying to explain randomness.
  • Depend too heavily on certain input variables.
  • Become so complicated, it’s unstable and unpredictable.

Two big ideas in algorithmic regularization are shrinkage and variable selection.

8.10.1 Shrinkage:

Shrinkage is when you tweak the algorithm to slowly shift the model’s coefficient estimates (or weights in a neural network) closer to zero.

  • The algorithms are usually trying to minimize MSE. Now we make the goal to minimize MSE with the constraint that the sum of the squared coefficients must be \(\leq 1\).
    • The data in these models is pre-processed (cleaned) so it is all “standardized” with mean 0 and standard deviation 1 so the model can focus on the information in the data, not just the magnitude of the values.
  • Imagine you have a model that assigns a coefficient to every variable (e.g., “height is worth +2.3, age is worth –1.7,” etc.). Now force the model to reduce all the coefficients so they sum to one. It will figure out which variables deserve the largest coefficients.
  • This helps prevent the model from becoming overly confident in any single variable.
  • Keeping the coefficients smaller also reduces the variance in any prediction.

Shrinkage is like saying: “You can use these variables, but don’t shout — speak softly.”

8.10.2 Variable Selection:

Variable Selection is when you tweak the algorithm to shrink some coefficients and potentially shrink other coefficients all the way to 0, effectively removing them from the model.

  • Again the data is standardized to eliminate the effect of the magnitude of the data.
  • The penalty function now uses the constraint that the sum of the absolute value of the coefficients must be \(\leq 1\). This changes the geometry of the constraint so now 0 is possible.

It’s like saying: “You’re not pulling your weight — you’re out.”

  • This makes the model simpler, needing fewer degrees of freedom, and easier to interpret.

8.10.3 Regularization Beyond Classical Multiple Regression

Classical multiple regression estimates coefficients by minimizing the residual sum of squares (RSS) or equivalently the Mean Squared Error (MSE).

  • As the number of predictors grows, however, ordinary least squares models can become unstable, overly sensitive to noise, and prone to overfitting, especially when predictors are highly correlated or when there are many variables relative to the number of observations.
  • In earlier applications, regression models might contain only tens of predictors, and methods such as step-wise regression were commonly used to iteratively add or remove variables based on whether they improved the model.
  • Modern machine learning problems often involve hundreds, thousands, or even millions of predictors, making exhaustive variable-selection approaches such as step-wise regression computationally impractical and statistically unstable.

Modern regression methods address these problems through regularization, which extends multiple regression by adding a penalty term to the optimization problem.

  • Rather than only minimizing prediction error, the model is also penalized for overly large coefficients.
  • This encourages simpler and more stable models that often generalize better to new data.
  • Regularization methods can also help reduce multicollinearity and improve performance when working with high-dimensional data.
  • The penalty term also helps address situations where the number of predictors exceeds the number of observations.
    • In modern machine learning problems, it is common to work with models containing hundreds of thousands (or even millions) of parameters but far fewer observations, a setting where ordinary least squares regression either becomes unstable or cannot be uniquely estimated at all.
    • Regularization allows meaningful models to still be fit in these high-dimensional settings.

This leads naturally to methods such as Ridge Regression, LASSO Regression, and Elastic Net Regression, which are now widely used as strong baseline models for comparison with more complex machine learning methods such as neural networks.

8.10.4 Ridge Regression (L2 Regularization)

Ridge regression adds a penalty based on the sum of squared coefficients (a.k.a the L2 Norm):

\[\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 + \lambda \sum_{j=1}^{p}\beta_j^2\]

The tuning parameter \(\lambda\) controls the amount of shrinkage.

  • When \(\lambda = 0\), Ridge regression becomes ordinary least squares regression.
  • As \(\lambda\) increases, coefficients are progressively shrunk toward zero.
  • Ridge regression usually keeps all predictors in the model, but reduces their influence.

This approach is especially useful when predictors are highly correlated or when many variables may contain at least some predictive information.

8.10.5 LASSO Regression (L1 Regularization)

LASSO regression changes the penalty from squared coefficients to the sum of the absolute values of the coefficients (a.k.a.the L1 Norm):

\[\sum_{i=1}^{n}(y_i-\hat{y}_i)^2 + \lambda \sum_{j=1}^{p}|\beta_j|\]

The geometry of the L1 penalty (hyper-tetrahedrons versus hyper-spheres for L2) allows some coefficients to become exactly zero.

  • Variables with coefficients shrunk to zero are effectively removed from the model.
  • This performs automatic variable selection while also reducing overfitting.
  • LASSO often produces simpler and more interpretable models than Ridge regression.

LASSO is particularly useful when you think only some variables really matter, and want to automatically choose the important ones.

8.10.6 Elastic Net Regression

Elastic Net regression is useful when you want the best of both worlds — some shrinkage, some variable selection.

  • Elastic Net combines both L1 and L2 penalties:
    • Ridge-style shrinkage stabilizes correlated predictors.
    • LASSO-style penalties allow variable selection.

Elastic Net is often preferred in high-dimensional settings where predictors are strongly correlated and some level of feature selection is desired.

The {glmnet} R package runs Ridge, LASSO, and Elastic Net regression by using a “mixing parameter” (\(\alpha\)) which controls how much influence each penalty has in the optimization.

  • It can also use the same L1 and L2 regularization methods on other than linear regression models to include logistic and multinomial, poisson, and Cox regression models.

Figure 8.7 shows typical outputs from {glmnet} showing how the coefficients shrink or go to zero as the size of the allowable regions decrease (the penalty increases) from 10 to 0.

The algorithms use an optimized grid search to find the “sweet spot” penalty value that gets good performance while trying to avoid overfitting.

(a) Plot of coefficient values getting close to 0 during Ridge Regression
(b) Plot of coefficient values getting close and going to 0 during LASSO Regression
Figure 8.7: Plots of how the coefficients change as Ridge and LASSO regression algorithms iterate through increasing the penalty (which is equivalent to decreasing the Norm).
Note

The regularize models are also effective for data with small numbers of parameters such that modeling frameworks such as {tidynmodels} no longer include step-wise regression.

8.10.7 Regularization for Neural Networks

Looking ahead, regularization is also fundamental in neural networks and modern machine learning systems.

  • Neural networks commonly use L2 regularization (often called weight decay) to prevent excessively large weights.
  • Sparsity-inducing penalties related to L1 regularization can also be applied in deep learning models.
  • Techniques such as dropout, early stopping, and batch normalization can all be viewed as additional forms of regularization.

Because Ridge, LASSO, and Elastic Net are computationally efficient, interpretable, and often surprisingly accurate, they are widely used as baseline models when evaluating more complex machine learning methods such as neural networks.

In practice, a neural network should generally outperform these regularized linear models before the additional complexity of deep learning is considered justified.

  • Regularized regression models are often surprisingly powerful baselines.
    • Whether a neural network performs substantially better depends largely on the scale and complexity of the non-linear relationships present in the data.
    • If the underlying patterns are mostly linear or only mildly non-linear, methods such as Ridge, LASSO, or Elastic Net may perform nearly as well while remaining far simpler, faster, and easier to interpret.

8.10.8 Summary

Regularization is like adding a little discipline to your model-building process.

Instead of letting the model memorize all the noise in the data to get the “best fit”, you guide it toward simpler, more general solutions.

  • Without it, a model might ace the test it trained on but flunk real life.
  • With regularization, you’re teaching it to learn the lesson, not the cheat sheet.

8.11 Exercise 08: Regularization

Let’s try ridge and lasso regression to see how they regularize the models

  1. Library the {tidyverse} and {glemnet} packages
Show code
library(tidyverse)
library(glmnet)
  1. Read in the house_ed_gdp_joined.csv file and assign the name house_ed_gdp_df.
Show code
house_ed_gdp_df <- read_csv("./data/house_ed_gdp_joined.csv")
  1. Remove Municipality and all rows with NA in one or more columns and assign the name ex_df.
  • Why would we remove Municipality?
  • We remove all NA as that is required by {glmnet}.
Show code
house_ed_gdp_df |> 
  select(-Municipality) |> 
  drop_na() ->
 ex_df
  1. Create a training/testing split of 75%/25% using seed 123:
Show code
set.seed(123)

train_index <- sample(
  x = seq_len(nrow(ex_df)),
  size = floor(0.75 * nrow(ex_df))
)

train_df <- ex_df[train_index, ]
test_df  <- ex_df[-train_index, ]
  1. Create a linear model predicting general_revenue_per_person using the other 11 variables and get the summary.
Show code
model_lm <- lm(general_revenue_per_person ~ ., data = train_df)
summary(model_lm)

Call:
lm(formula = general_revenue_per_person ~ ., data = train_df)

Residuals:
   Min     1Q Median     3Q    Max 
-23285  -8057  -1302   3336  56416 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)             -6.545e+05  1.177e+06  -0.556    0.582
culture_siteno_site     -4.524e+03  9.953e+03  -0.455    0.652
Average_size             2.999e+02  7.527e+03   0.040    0.968
has_washing_machine      6.486e+02  1.564e+03   0.415    0.681
has_internet            -5.802e+02  3.676e+02  -1.578    0.124
has_computer             1.194e+03  9.390e+02   1.272    0.213
has_car                 -3.928e+02  5.173e+02  -0.759    0.453
Illiteracy_rate          2.013e+03  4.214e+03   0.478    0.636
primary_lower_secondary  6.648e+03  1.184e+04   0.562    0.578
upper_secondary          6.768e+03  1.182e+04   0.573    0.571
university               6.098e+03  1.164e+04   0.524    0.604
population              -3.587e-02  4.924e-02  -0.729    0.472

Residual standard error: 17420 on 32 degrees of freedom
Multiple R-squared:  0.2737,    Adjusted R-squared:  0.02397 
F-statistic: 1.096 on 11 and 32 DF,  p-value: 0.3953
  1. Now use the model.matrix() function (as a nice trick) on the training data to create the X matrix for the predictors and assign the name x_train and then assign the name y_train to the vector of the response variable general_revenue_per_person.
  • Then repeat for the test data to create x_test and y_test.

  • Any character or factor variables are converted to dummy variables as required by {glmnet}.

Show code
x_train<- model.matrix(
  general_revenue_per_person ~ .,
  data = train_df
)[, -1]

y_train <- train_df$general_revenue_per_person

x_test <- model.matrix(
  general_revenue_per_person ~ .,
  data = test_df
)[, -1]

y_test <- test_df$general_revenue_per_person
  1. Let’s find possible values for the penalty parameter \(\lambda\). Set the random number seed to 123 and then run cv.glmnet() with alpha = 0 for Ridge Regression.
  • Assign the name ridge_cv to the result.
  • Note: the cv in cv.glmnet() is a reminder that the algorithm is using 10-fold cross-validation to share the data across 10 different models so that no data used in training a model is used in testing it.
Show code
set.seed(123)

ridge_cv <- cv.glmnet(
  x = x_train,
  y = y_train,
  alpha = 0,       # alpha = 0 means Ridge regression
  standardize = TRUE # the default
  )
  1. Use plot(ridge_cv) to generate a plot of the MSE against the \(log(\lambda)\) penalty.
  • Extract the model values of lambda.min and lambda.1se.
    • lambda.min is the value that gets the minimum mean cross-validated error
    • lambda.min is the value that is the largest value of lambda such that error is within 1 standard error of the minimum. This is the default for reducing the chances of overfitting.
Show code
plot(ridge_cv)

Show code
ridge_cv$lambda.min
[1] 119849.9
Show code
ridge_cv$lambda.1se
[1] 7885321
  1. Now build a final model by using glmnet() with with a selected value for \(\lambda\). Let’s use lambda.1se. Assign the name model_ridge.
Show code
model_ridge <- glmnet(
  x = x_train,
  y = y_train,
  alpha = 0,
  lambda = ridge_cv$lambda.1se,
  standardize = TRUE # the default
)
  1. Use plot(ridge_cv$glmnet.fit, xvar = "norm", label = TRUE) to see the plot of the variable coefficients and coef(ridge_cv, s = "lambda.min") to see the actual coefficients.
plot(ridge_cv$glmnet.fit, xvar =  "norm", label = TRUE)

coef(ridge_cv, s = "lambda.min")
12 x 1 sparse Matrix of class "dgCMatrix"
                           lambda.min
(Intercept)              3.884386e+04
culture_siteno_site     -6.351185e+02
Average_size            -1.253236e+03
has_washing_machine      2.059006e+00
has_internet            -5.516536e+01
has_computer            -1.232510e+01
has_car                 -3.263712e+01
Illiteracy_rate          2.792019e+02
primary_lower_secondary -2.662319e+01
upper_secondary          7.407499e+01
university               1.534945e+01
population              -9.802349e-04
  • Note how the number of variables stays the same across the top of the plot.
  1. Set the random number seed to 123 and then run cv.glmnet() with alpha = 1 for LASSO Regression.
  • Assign the name lasso_cv to the result.
Show code
set.seed(123)

lasso_cv <- cv.glmnet(
  x = x_train,
  y = y_train,
  alpha = 1,       # alpha = 1 means LASSO regression
  standardize = TRUE # the default
  )
  1. Plot lasso_cv and look at the values for lambda.min and lambda.1se
Show code
plot(lasso_cv)

Show code
lasso_cv$lambda.min
[1] 4111.413
Show code
lasso_cv$lambda.1se
[1] 7885.321
  • Note how the number of variables keeps dropping across the top.
  1. Run glmnet() to get the final model with lambda.min.
Show code
model_lasso <- glmnet(
  x = x_train,
  y = y_train,
  alpha = 1,
  lambda = lasso_cv$lambda.min,
  standardize = TRUE # the default
)
  1. Plot the variables and look at the coefficients
Show code
plot(lasso_cv$glmnet.fit, xvar =  "norm", label = TRUE)

Show code
coef(lasso_cv, s = "lambda.min")
12 x 1 sparse Matrix of class "dgCMatrix"
                        lambda.min
(Intercept)             46944.5739
culture_siteno_site         .     
Average_size                .     
has_washing_machine         .     
has_internet             -234.2723
has_computer                .     
has_car                     .     
Illiteracy_rate             .     
primary_lower_secondary     .     
upper_secondary             .     
university                  .     
population                  .     
Show code
coef(lasso_cv, s = "lambda.1se")
12 x 1 sparse Matrix of class "dgCMatrix"
                        lambda.1se
(Intercept)                  31325
culture_siteno_site              .
Average_size                     .
has_washing_machine              .
has_internet                     .
has_computer                     .
has_car                          .
Illiteracy_rate                  .
primary_lower_secondary          .
upper_secondary                  .
university                       .
population                       .
  • Note how many variables have non-zero coefficients at the end.
  • lambda.min prioritizes predictive accuracy.
  • lambda.1se prioritizes simplicity and robustness against overfitting.
  1. We now have three fitted models using the training data. Let’s generate predictions on the testing data and compare their RMSE values.
  • We will use the custom function to generate the MSE
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

Generate predictions for each model using the test data.

  • Note that the predict function operates different for lm objects versus glmnet objects so we have different arguments
lm_test_pred <- predict(
  model_lm,
  newdata = test_df
)

ridge_test_pred <- predict(
  model_ridge,
  newx = x_test
)

lasso_test_pred <- predict(
  model_lasso,
  newx = x_test
)
  1. Calculate the \(\text{RMSE}_\text{test}\) for each model and put into a tibble
test_results <- tibble(
  Model = c(
    "Linear Regression",
    "Ridge Regression",
    "LASSO Regression"
  ),
  Test_RMSE = c(
    rmse(y_test, lm_test_pred),
    rmse(y_test, ridge_test_pred),
    rmse(y_test, lasso_test_pred)
  )
)

test_results
# A tibble: 3 × 2
  Model             Test_RMSE
  <chr>                 <dbl>
1 Linear Regression    12288.
2 Ridge Regression     10338.
3 LASSO Regression      9265.

Interpretation:

  • The ordinary linear regression model had the largest prediction error on unseen data.
  • Ridge regression improved performance substantially by shrinking coefficients and reducing model variance.
  • LASSO regression performed best, suggesting that some predictors were likely adding noise rather than useful signal.

This pattern is common when:

  • predictors are correlated,
  • the sample size is relatively small, (which is what we have here)
  • or some variables have weak predictive value (which is also what we have here).

The LASSO model likely improved generalization by shrinking less useful coefficients all the way to zero, effectively simplifying the model.

Points to consider:

  • Ordinary least squares focuses entirely on minimizing training error,
  • Ridge and LASSO intentionally accept a little more bias in exchange for lower variance and better out-of-sample prediction performance.

The following is one of the central ideas behind modern machine learning:

A slightly simpler model often predicts new data better than a highly flexible model that fits the training data extremely closely.

We will discuss Neural Networks next and it’s important to note as well:

Neural networks require even stronger forms of regularization because they may contain thousands or millions of parameters.