11  Deep Learning/Neural Networks

Chapter 10 ISLR2

Author
Affiliation

Richard Ressler

American University

Published

April 16, 2024

Deep Learning and Neural Nets are two terms that describe a set of machine learning methods that build networks with one or more layers of calculations to manipulate input data to derive outputs.

Deep Learning methods can be used for supervised or unsupervised learning.

They tend to require lots of input data and many calculations so their development has expanded greatly in the last few years given the large increases in available data and affordable computing power.

Note

Deep Learning and Neural Nets have evolved in both computing and statistical domains and combine aspects of networks and statistical methods and transformations. Thus they tend to use different terms describing the same concepts, e.g., “features” are “predictors” are “inputs.”

11.1 A Single-Layer Neural Network

Neural networks are still trying to estimate the true but unknown function that defines the relationship between one set of data and another.

  • In the case of supervised learning, the relationship is between a set of inputs \(X\) and a response (output) \(Y\).

Neural networks use one or more layers to connect inputs to outputs.

  • Similar to boosted trees, the layers may have multiple nodes where each node is “weak”, but the combination of multiple nodes in the layer can generate “strong” (useful) results.

Figure 11.1 depicts a single layer neural network with input nodes on the left, a single “hidden” layer with multiple “units” (nodes), and an output layer on the right.

Figure 11.1: Figure 10.1 from ISLR2

This network is taking information from four (predictor) inputs, \(X_1, X_2, X_3, X_4\) to generate an output layer \(\hat{f}(X)\) which predicts the response \(\hat{Y}\).

  • The arrows from the inputs to the units (nodes) in the middle layer indicate where each input is fed to a unit that will operate on all the inputs it receives.
    • The number of units in a hidden layer, \(K\), is a tuning parameter (to be discussed later).
  • In this single-layer model, each of the \(K\) hidden units makes its calculation and feeds its output to the output layer.
  • The output layer then combines all of its \(K\) inputs to calculate the final result/prediction.
Note

The terms units, nodes, or “perceptrons” all refer to the individual elements of a network layer.

11.1.1 Structure of a Single Layer Neural Net

What distinguishes a neural network from other methods is the way the hidden layers make calculations on the inputs and the final layer then combines the outputs from the previous layer (the inputs to the final layer) to calculate a final prediction.

Each hidden layer unit uses a pre-specified activation function on the input.

Let \(A_k\) represent the result of the calculation \(h_k(X)\) in the \(k\)th unit in the hidden layer.

For each unit, there is a non-linear activation function \(g\) applied to a linear combination of the \(X_j,\, j = 1, \ldots, p\) such that

\[A_k = h_k(X) = g(z_k) = g\left(w_{k0} + \sum_{j=1}^p w_{kj}X_j\right). \tag{11.1}\]

Each output \(A_k\) is fed to the output layer where they are combined as a weighted linear combination such that

\[f(X) = \beta_0 + \sum_{k=1}^K \beta_k A_k = \beta_0 + \sum_{k=1}^K \beta_k\, g\left(w_{k0} + \sum_{j=1}^p w_{kj}X_j\right). \tag{11.2}\]

All of the parameters in both Equation 11.1 (the \(w_{kj}\)) and in Equation 11.2 (the \(\beta_k\)) have to be calculated/estimated from the data.
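As a minimal sketch of Equation 11.1 and Equation 11.2, the forward calculation for one observation can be written in a few lines of base R. The weights, the input values, and the choice of ReLU as \(g\) are made up for illustration; they are not estimates from any data.

```r
# Sketch of Equations 11.1 and 11.2 for one observation.
# All numbers are hypothetical: p = 4 inputs, K = 3 hidden units.
relu <- function(z) pmax(z, 0)

x <- c(1.0, -2.0, 0.5, 3.0)             # one observation with p = 4 inputs

# W has one row per hidden unit: column 1 is the bias w_k0,
# columns 2..(p + 1) hold w_k1, ..., w_kp.
W <- matrix(c( 0.1,  0.5, -0.3,  0.2,  0.1,
              -0.2,  0.4,  0.1, -0.5,  0.3,
               0.0, -0.1,  0.2,  0.3, -0.4),
            nrow = 3, byrow = TRUE)
beta <- c(0.5, 1.0, -1.0, 2.0)          # beta_0, beta_1, ..., beta_K

z   <- as.vector(W %*% c(1, x))         # z_k = w_k0 + sum_j w_kj * x_j
A   <- relu(z)                          # A_k = g(z_k)               (Eq. 11.1)
f_x <- beta[1] + sum(beta[-1] * A)      # beta_0 + sum_k beta_k A_k  (Eq. 11.2)
f_x
```

Storing the bias in the first column of W lets a single matrix product compute all \(K\) linear combinations at once.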

11.1.2 Activation Functions

The user pre-selects the activation function with the goal of helping to clearly differentiate between signal and noise in the data.

An early choice was the sigmoid activation function

\[g(z) = \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}}. \tag{11.3}\]

  • This is the same function we used with logistic regression.
  • It has the nice property of transforming linear functions into probabilities from 0 to 1.
  • Its sigmoid shape also helps highlight differences between signal and noise.

A more popular choice now is the ReLU (rectified linear unit) activation function.

\[g(z) = (z)_+ = \cases{0 \quad \text{if } z<0 \\ z\quad \text{otherwise}} \tag{11.4}\]

  • The ReLU calculations can be stored and computed faster than the sigmoid calculations.
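Both activation functions are one-liners in base R. This is an illustrative sketch; the function names sigmoid and relu are our own and not part of any package used in these notes.

```r
# Equation 11.3: sigmoid maps any real z into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Equation 11.4: ReLU returns 0 for negative z and z otherwise
relu <- function(z) pmax(z, 0)

z <- c(-2, 0, 2)
sigmoid(z)   # values strictly between 0 and 1, with sigmoid(0) = 0.5
relu(z)      # negative values are clipped to 0
```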

Figure 11.2 shows the non-linear shapes of a sigmoid and a ReLU activation function.

Figure 11.2: Activation Functions have non-linear shapes. Fig 10.2 from ISLR2.

Activation functions are non-linear by choice to allow for non-linear relationships and interactions among the data.

  • If the activation functions were linear, Equation 11.2 would just be a linear combination of linear combinations from Equation 11.1 which is not new.
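A small numerical check of this point, using made-up random weights and the identity function as a (linear) activation: the network output matches a single linear model whose coefficients are obtained by collapsing the two sets of weights.

```r
# Sketch: with a linear (identity) activation, the network is just a linear model.
# Weights and inputs are random, hypothetical values.
set.seed(1)
p <- 3; K <- 2
W    <- matrix(rnorm(K * (p + 1)), nrow = K)  # row k: w_k0, w_k1, ..., w_kp
beta <- rnorm(K + 1)                          # beta_0, ..., beta_K
x    <- rnorm(p)

# Network output with g(z) = z
f_net <- beta[1] + sum(beta[-1] * as.vector(W %*% c(1, x)))

# Equivalent single linear model
intercept <- beta[1] + sum(beta[-1] * W[, 1])        # combined intercept
slopes    <- as.vector(t(W[, -1]) %*% beta[-1])      # one slope per input
f_lin     <- intercept + sum(slopes * x)

all.equal(f_net, f_lin)
```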

However, even though the \(g(z)\) are non-linear, the combinations in Equation 11.2 are still linear in the activations \(A_k\) (the transformed inputs).

  • The final model results from calculating linear combinations at each hidden unit, doing a non-linear transformation of each, and then, doing a linear combination on the transformed results.
  • The final layer is still linear in the transformed inputs so we can use (for quantitative variables) the usual squared error loss as an objective function (loss function) to be minimized to calculate the \(\beta\)s, i.e.,

\[\sum_{i=1}^{n}(y_i - f(x_i))^2. \tag{11.5}\]

11.2 Multi-layer Neural Networks

A single layer network can represent many possible \(\hat{f}(x)\).

Adding more hidden units in a layer and adding more layers allows for more possible transformations and provides more flexibility in the model.

  • In general, adding more, smaller layers can make solutions easier to find than just adding more units in a single layer.

Figure 11.3 shows a multi-layer network for classifying digits (0-9) with a different number of units in each hidden layer.

Figure 11.3: Multilayer Network Example. Fig 10.4 from ISLR2
  • It has two hidden layers L1 (256 units) and L2 (128 units).
  • There are 10 nodes in the output layer since there are 10 possible levels to be classified - each is a dummy variable. In this case, the ten variables really represent a single qualitative variable so are dependent on each other.
Note

Each of the activations in the units in the second layer is still a function of the original input \(X\), albeit a transformed version of it based on the activations in the first layer.

Adding more and more layers to the model builds a series of simple transformations into a complex model for the final result.

With more layers comes more notation.

  • We can add a superscript to indicate the layer, e.g., \(A^{(1)}_{k}\).
  • Consider all the parameters for each layer as a matrix. Thus we have \(W_1\), \(W_2\) and \(B\) as our three matrices of “weights” for the network.

11.2.1 Coefficients to be Estimated

There are a lot of coefficients (weights) to be estimated in \(W_1\), \(W_2\), and \(B\).

There are 784 pixels in a \(28 \times 28\) pixel image. These are the inputs.

  • Matrix \(W_1\) has \(785 \times 256 = 200,960\) elements. This is based on the 256 units and the 784 input values plus the intercept term (in NN called the “bias” term).
  • Matrix \(W_2\) thus has \(257 \times 128 = 32,896\) elements (the 256 outputs from the first layer plus the bias term, for each of the 128 units).
  • Matrix \(B\) thus has \(10 \times 129 = 1,290\) elements. The 10 linear models for each level and the 128 outputs plus the bias term.

Together there are \(200,960 + 32,896 + 1,290 = 235,146\) coefficients (weights) to be estimated.

  • This is about 33 times more than just doing multinomial logistic regression.
  • With a training set of only 60,000 images, there is a lot of opportunity for overfitting.
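The counts above can be reproduced with simple arithmetic in R. The comparison value for multinomial logistic regression is an assumption here: it takes \(785 \times 9\) coefficients (784 pixels plus an intercept for each of 9 non-baseline classes), which is the count used in ISLR2.

```r
# Weights per layer: (inputs + 1 bias) * units
n_w1  <- (784 + 1) * 256    # first hidden layer
n_w2  <- (256 + 1) * 128    # second hidden layer
n_b   <- (128 + 1) * 10     # output layer
total <- n_w1 + n_w2 + n_b
c(W1 = n_w1, W2 = n_w2, B = n_b, total = total)

# Assumed size of a multinomial logistic regression on the same inputs
n_mlr <- (784 + 1) * 9
total / n_mlr               # ratio of network weights to regression coefficients
```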

To avoid overfitting, some regularization is needed. Options include Ridge, Lasso, or neural-network specific methods such as dropout regularization.

11.2.2 Softmax Activation Function

Since the outputs are dependent, essentially the probabilities that a given input image is a specific digit, this model will use the softmax activation function for the output layer.

\[f_m(X) = Pr(Y=m | X) = \frac{e^{Z_m}}{\sum_{l=0}^{9}e^{Z_l}} \tag{11.6}\]

Equation 11.6 is a generalization of the logistic function from binary logistic regression to the multiple levels needed in multinomial logistic regression.

The final prediction is based on selecting the dummy variable with the highest probability.
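A minimal sketch of Equation 11.6 in base R follows. The ten \(Z_m\) values are hypothetical, and subtracting the maximum before exponentiating is a common numerical safeguard that does not change the result; it is our addition, not something the equation requires.

```r
softmax <- function(z) {
  ez <- exp(z - max(z))   # shift by max(z) to avoid overflow; ratios are unchanged
  ez / sum(ez)
}

# Hypothetical output-layer values Z_0, ..., Z_9 for one image
z <- c(0.2, 1.5, -0.3, 0.0, 2.1, 0.4, -1.0, 0.7, 0.1, -0.5)
names(z) <- 0:9

prob <- softmax(z)
sum(prob)                          # the class probabilities sum to 1
predicted <- names(which.max(prob))  # digit with the highest probability
predicted
```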

11.2.3 Minimization by Cross-Entropy

Since in this example, the response is qualitative (categorical), the objective is to minimize the negative multinomial log-likelihood.

\[R(\theta) = -\sum_{i=1}^{n}\sum_{m=0}^9 y_{im} \log(f_m(x_i)) \tag{11.7}\]

Equation 11.7 is a generalization of the loss function for negative log-likelihood from binary logistic regression to multiple levels.
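To make Equation 11.7 concrete, here is a small sketch with two observations and three classes (rather than ten digits). The one-hot responses and the fitted probabilities are invented for illustration.

```r
# y: n x M matrix of dummy (one-hot) responses
# prob: n x M matrix of fitted class probabilities f_m(x_i)
cross_entropy <- function(y, prob) -sum(y * log(prob))

y    <- rbind(c(1, 0, 0),
              c(0, 1, 0))
prob <- rbind(c(0.7, 0.2, 0.1),
              c(0.1, 0.8, 0.1))

# Only the probability assigned to the true class of each observation contributes
loss <- cross_entropy(y, prob)
loss
```

Because each row of y has a single 1, the double sum reduces to the sum of the negative log-probabilities of the observed classes.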

11.3 Fitting a Neural Network

Fitting a neural network is complex due to the necessarily non-linear activation functions.

Warning

Deep learning and neural networks is an active area of research across many communities. You will see many different terms often describing the same concept or approach.

You will also see many articles or references describing the latest approaches.

What follows is not exhaustive by any means. It is designed to provide familiarity with some approaches to provide a basic understanding.

There are many tuning parameters (hyper-parameters) in neural networks and the selection for a given set of data is still very much an art more than a science.

While the objective function in Equation 11.5 looks familiar, minimizing it over the weights is a non-linear problem because the weights sit inside the activation functions.

Important

Neural Networks use non-linear activation functions so their objective functions are non-convex and have multiple local optima in addition to a global optimum.

Instead of trying to find the “best” model at the one global optimum, potentially out of millions, we seek to find a useful model at a local minimum.

Two strategies help find better local optima and reduce the chance of overfitting.

  • Slow Learning: As we saw with boosting, slow learning (taking small steps in gradient descent) helps reduce the chance of overfitting. The algorithm stops when overfitting is detected.
  • Regularization: Imposing penalties on the parameters such as we saw with Ridge and Lasso regression.

Assume all the parameters (coefficients) are in a single vector \(\theta\).

Define the objective then as

\[R(\theta) = \frac{1}{2}\sum_{i=1}^n(y_i - f_\theta(x_i))^2. \tag{11.8}\]

A general algorithm to minimize Equation 11.8 could be:

  1. Make an initial guess for all the parameters \(\theta^0\) and set \(t = 0\).
  2. Find a vector \(\delta\) that creates a small change in \(\theta\) such that \(\theta^{t+1} = \theta^t + \delta\) reduces the objective, i.e., \(R(\theta^{t+1}) < R(\theta^t)\).
  3. If \(R(\theta^{t+1})<R(\theta^t)\), add 1 to \(t\). Now \(t + 1\) is the new \(t\) and go back to step 2.
  4. Once \(R(\theta^{t+1})\geq R(\theta^t)\) stop. We have reached the bottom and found a local minimum (that we hope is good).

So how to find a good \(\delta\)?

11.3.1 Backpropagation

Finding \(\delta\) is a key in neural network optimization.

Let’s define the gradient of \(R(\theta)\) evaluated at the current value \(\theta^m\) (writing the iteration index as \(m\) rather than \(t\)) as the vector of its partial derivatives evaluated at \(\theta^m\):

\[ \nabla R(\theta^{m})= \frac{\partial R(\theta)}{\partial \theta}\biggr\rvert_{\theta=\theta^m}. \tag{11.9}\]

  • This gives the direction to move in \(\theta\) space in which \(R(\theta)\) increases the most rapidly.

We want to move in the opposite direction. So we update \(\theta^{m+1}\) as

\[ \theta^{m+1} \leftarrow \theta^{m} - \rho \nabla R(\theta^{m}). \tag{11.10}\]

  • where \(\rho\) is a learning parameter that controls the “rate of learning”.
  • For a sufficiently small \(\rho\), this step decreases \(R(\theta)\) without overshooting the minimum, where \(\nabla R(\theta)=0\).
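The update in Equation 11.10, combined with the stopping rule from the general algorithm, can be sketched on a toy objective. The quadratic R_fun below is a hypothetical stand-in with a known minimum at \((3, -1)\); it is not a neural network loss, and the learning rate of 0.1 is an arbitrary choice.

```r
# Toy objective and its gradient (hypothetical; minimum at theta = (3, -1))
R_fun  <- function(theta) sum((theta - c(3, -1))^2)
R_grad <- function(theta) 2 * (theta - c(3, -1))

theta <- c(0, 0)    # initial guess theta^0
rho   <- 0.1        # learning rate

for (m in 1:1000) {
  theta_new <- theta - rho * R_grad(theta)        # Equation 11.10
  if (R_fun(theta_new) >= R_fun(theta)) break     # no improvement: stop
  theta <- theta_new
}
theta
```

For a neural network the same loop applies, but the objective is non-convex, so the point where the loop stops depends on the initial guess.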

The good news is that Equation 11.8 is a sum over the \(n\) observations, so the gradient in Equation 11.9 is also a sum of per-observation gradients.

Expanding the \(f_\theta(x_i)\) term in Equation 11.8, we get the complicated looking expression for a single observation \(i\)

\[ R_i(\theta) = \frac{1}{2}\left(y_i - \beta_0 - \sum_{k=1}^K\beta_k\,g\left(w_{k0} + \sum_{j=1}^{p}w_{kj}x_{ij}\right)\right)^2. \tag{11.11}\]

Let’s use

\[z_{ik} = w_{k0} + \sum_{j=1}^pw_{kj}x_{ij} \tag{11.12}\]

to simplify Equation 11.11 as

\[ R_i(\theta) = \frac{1}{2}\left(y_i - \beta_0 - \sum_{k=1}^K\beta_k\,g(z_{ik})\right)^2. \tag{11.13}\]

Since \(\delta\) is the change we want in the current values of \(\theta\), let’s take the partial derivative of Equation 11.13 with respect to \(\beta_k\) (using the chain rule) to find a good value of \(\delta\) for the \(\beta\)s as

\[ \begin{align} \frac{\partial R_i(\theta)}{\partial\beta_k} &= \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial\beta_k} \\ &= -(y_i - f_\theta(x_i)) \cdot g(z_{ik}) \end{align} \tag{11.14}\]

where we are using the derivative of Equation 11.8 to get the first term.

To find a value of \(\delta\) for the \(w\)s, we can now take the derivative of Equation 11.13 with respect to \(w_{kj}\) to get

\[ \begin{align} \frac{\partial R_i(\theta)}{\partial w_{kj}} &= \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial g(z_{ik})} \cdot \frac{\partial g(z_{ik})}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{\partial w_{kj}}\\ &= -(y_i - f_\theta(x_i)) \cdot \beta_k \cdot g'(z_{ik}) \cdot x_{ij} \end{align} \tag{11.15}\]

where we are using the derivative of Equation 11.8 for the first term, the derivative of Equation 11.2 for the second term, the derivative of \(g(z)\) for the third, and the derivative of Equation 11.12 for the fourth term.

Important

The first term in both partial derivatives is the (negative of the) residual from the final layer, \((y_i - f_\theta(x_i))\).

Equation 11.14 shows the \(\delta_\beta\)’s are based on how the residual gets allocated across each of \(K\) hidden units in the last hidden layer given the value of \(g(z_{ik})\) for that unit.

Equation 11.15 shows the \(\delta_w\)’s are based on how the residual gets allocated across each of the \(p\) inputs to the unit according to the value of each hidden unit’s \(\beta_k g'(z_{ik})\) and each input \(x_{ij}\). The form of \(g'()\) will depend upon the activation function we chose.

By starting with the final residual, we can now use the derivatives to calculate the new \(\delta\) for each parameter in each layer by moving from right to left across the network.

This process is known as backpropagation, as we are moving backwards, (right to left) to propagate the information in the latest residuals to update the weights (coefficients) for each unit in each layer.
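As a sketch under stated assumptions, Equation 11.14 and Equation 11.15 can be coded for a single observation and checked against a finite-difference approximation. The sigmoid activation, the weights, and the data point are all hypothetical choices made for illustration.

```r
sigmoid       <- function(z) 1 / (1 + exp(-z))
sigmoid_prime <- function(z) sigmoid(z) * (1 - sigmoid(z))   # g'(z) for the sigmoid

# Forward pass for one observation; W has one row per hidden unit (bias first)
forward <- function(x, W, beta) {
  z <- as.vector(W %*% c(1, x))
  list(z = z, f = beta[1] + sum(beta[-1] * sigmoid(z)))
}
R_i <- function(x, y, W, beta) 0.5 * (y - forward(x, W, beta)$f)^2  # Eq. 11.13

# Hypothetical data and weights: p = 3 inputs, K = 2 hidden units
x <- c(1, -1, 0.5); y <- 2
W <- rbind(c( 0.1, 0.2, -0.3,  0.4),
           c(-0.1, 0.5,  0.2, -0.2))
beta <- c(0.3, 0.7, -0.5)

fw    <- forward(x, W, beta)
resid <- y - fw$f

# Eq. 11.14: dR_i / d beta_k = -(y_i - f(x_i)) * g(z_ik), for k = 1..K
grad_beta <- -resid * sigmoid(fw$z)
# Eq. 11.15: dR_i / d w_kj = -(y_i - f(x_i)) * beta_k * g'(z_ik) * x_ij
grad_W <- outer(-resid * beta[-1] * sigmoid_prime(fw$z), x)  # K x p, biases excluded

# Central finite-difference check on w_12 (stored in W[1, 3]; column 1 is the bias)
eps <- 1e-5
W_plus  <- W; W_plus[1, 3]  <- W[1, 3] + eps
W_minus <- W; W_minus[1, 3] <- W[1, 3] - eps
num_grad <- (R_i(x, y, W_plus, beta) - R_i(x, y, W_minus, beta)) / (2 * eps)
all.equal(num_grad, grad_W[1, 2], tolerance = 1e-6)
```

The gradients for the intercepts \(\beta_0\) and \(w_{k0}\) follow the same pattern with the corresponding input replaced by 1; they are left out to keep the sketch short.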

11.3.2 Epochs

We know how to do three things:

  1. A Forward Pass: Use the inputs to compute the outputs for each unit in each layer (using the current estimated weights and the activation functions) through to the output layer, where the current parameters (\(\beta\)s) produce a prediction \(\hat{y}_i\) for each observation \(x_i\).
  2. Calculate the residuals \((y_i - \hat{y}_i)\).
  3. A Backwards Pass: Use the gradient of the loss function to allocate the residuals as a \(\delta\) to update each of the weights in the network through backpropagation.

We will repeat those steps multiple times in training the network.

An Epoch is defined as one complete forward pass and a complete backward pass (steps 1-3) through the network where every observation in the training data set contributes to the update.

  • The number of epochs to use is a tuning parameter when building the model.
  • There are trade offs of time and accuracy as well as preventing overfitting.

11.3.3 Options for Epochs

With so many calculations to be made, there are different approaches to how to use the training data in each epoch.

  • Massive data sets can require many computations and significant memory to hold all the data at once.
  • Research continues into multiple options for improving the speed and reducing the memory requirements for neural networks.

Each approach has trade offs between how long it takes to complete an epoch, how fast or slow to get to convergence, how smoothly to get to convergence, how to reduce bias or multicollinearity, and how much memory is required.

There are three popular approaches for how much data to include in each forward/backward pass through the network (see Batch, Mini Batch & Stochastic Gradient Descent).

  • Batch Gradient Descent
    • All the observations are used at once, in a single batch or iteration.
    • When data space is “well-behaved”, it can provide for a smooth convergence across multiple epochs.
    • It does require all the data to fit into memory at once.
  • Stochastic Gradient Descent is the other extreme where only one observation at a time is fed into the model.
    • Thus there are \(n\) batches and each batch is processed in an iteration.
    • Much less data has to be computed each time and much less has to fit into memory.
    • The convergence can be less smooth as observations can vary widely.
  • Mini-Batch Gradient Descent
    • This is in between the others where a fraction of the data is used for each batch.
    • If there are \(m\) batches, each batch uses about \(n/m\) observations, with the last batch using whatever remains after the first \(m-1\) batches.
    • There are different methods for whether to update the \(\delta\)s after each mini-batch (do a backwards pass) or just store the residuals from a forward pass to update \(\delta\)s after all mini-batches have been run and you have \(n\) residuals.
    • There is some evidence that small batches (32) may be less smooth but converge more quickly and be more robust in prediction.
  • Hybrid Approaches
    • This is my term for the variety of methods that combine aspects of multiple approaches.
    • One such method is to select the observations for each iteration at random (with or without replacement).
    • This would mean that perhaps not all \(n\) observations make it into each epoch.
    • Another method is to randomly shuffle all the input data at each epoch so different data goes into different batches for each epoch.
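The mini-batch idea with reshuffling can be sketched in base R by splitting shuffled row indices into batches. The training-set size of 176 and the batch size of 32 are illustrative values only.

```r
set.seed(1)
n          <- 176   # hypothetical number of training observations
batch_size <- 32

# One epoch: shuffle the row indices, then cut them into consecutive batches
shuffled <- sample(n)
batches  <- split(shuffled, ceiling(seq_along(shuffled) / batch_size))

length(batches)     # number of iterations (gradient updates) in this epoch
lengths(batches)    # the last batch holds whatever remains
```

Calling sample(n) again at the start of the next epoch puts different observations into different batches. Setting batch_size to n gives batch gradient descent, and setting it to 1 gives one observation per iteration.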
Note

Some use the term Stochastic Gradient Descent (SGD) as shorthand to refer to all the above approaches for selecting the data to be used in an iteration.

Since the number of coefficients to be estimated (the \(w\)s and \(\beta\)s) can often be greater than the number of observations, using a regularization method can reduce the risk of overfitting.

11.4 Regularization of Neural Networks

11.4.1 Regularization Methods for Optimization

We could use Ridge or LASSO Regression.

For Ridge, we add a penalty term to Equation 11.7 to now optimize

\[R(\theta; \lambda) = -\sum_{i=1}^{n}\sum_{m=0}^9 y_{im} \log(f_m(x_i)) + \lambda\sum_j\theta_j^2. \tag{11.16}\]

For LASSO, we would change the last term to the \(L_1\) norm or \(\lambda\sum_j|\theta_j|\).

  • We can set \(\lambda\) to a small value or use a validation method to get it.
  • We could choose separate values of \(\lambda\) for each layer as well.
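The two penalty terms are easy to compute directly. In this sketch the weight vector, the value of \(\lambda\), and the unpenalized loss value are all hypothetical placeholders.

```r
theta  <- c(0.5, -1.0, 2.0)   # hypothetical weights
lambda <- 0.01                # hypothetical penalty strength
loss   <- 0.58                # hypothetical unpenalized loss (e.g., cross-entropy)

ridge_penalty <- lambda * sum(theta^2)      # L2 penalty as in Equation 11.16
lasso_penalty <- lambda * sum(abs(theta))   # L1 penalty

c(ridge = loss + ridge_penalty,
  lasso = loss + lasso_penalty)
```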

We could also implement a hyper-parameter to “stop early” to reduce overfitting.

11.4.2 Dropout Learning

Dropout Learning is a regularization method, similar in spirit to Random Forests, where for each iteration in an epoch we randomly select a fraction of the nodes, \(\phi\), to “drop out” of the calculation.

  • The remaining weights are scaled by a factor of \(1/(1-\phi)\) to compensate for the missing units.
  • In practice, dropout is achieved by randomly setting the activations for the “dropped out” units to zero.
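A minimal sketch of one dropout step on a vector of hidden-layer activations follows. The activations are made up, and two simplifying assumptions are made: each unit is dropped independently with probability \(\phi\), and the surviving activations (rather than their outgoing weights) are scaled by \(1/(1-\phi)\), which has the same effect on the next layer.

```r
set.seed(7)
phi <- 0.4                                   # dropout fraction
A   <- c(1.2, 0.0, 3.1, 0.7, 2.4)            # hypothetical activations of one layer

# 1 = keep the unit, 0 = drop it; each unit is dropped with probability phi
keep <- rbinom(length(A), size = 1, prob = 1 - phi)

# Zero the dropped activations and rescale the rest to compensate
A_dropped <- A * keep / (1 - phi)
rbind(A, keep, A_dropped)
```

A new mask is drawn for every iteration during training; no units are dropped when predicting.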

One can also insert a “dropout layer” between the input nodes and the first hidden layer.

Dropout Learning reduces the opportunity for nodes to become “too specialized” as other nodes have to make up for the residuals when they are dropped out. This tends to improve performance in prediction.

There are multiple suggestions to help with adjusting the drop out rate (Dropout Regularization in Deep Learning Models with Keras). These include:

  • Use a small dropout value of 20%-50% of neurons, with 20% providing a good starting point. A probability too low has minimal effect, and a value too high results in under-learning by the network.
  • Use a larger network. You are likely to get better performance when Dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
  • Use Dropout on incoming (visible) as well as hidden units. Application of Dropout at each layer of the network has shown good results.

11.5 Tuning Neural Networks

Tuning a Neural Network has additional complexity beyond selecting the inputs.

The model developer has to choose:

  • Designing the network based on the number of hidden layers, and the number of units per layer.
  • Setting the parameters for stochastic gradient descent. These include the batch size, the number of epochs, and if used, details of data augmentation.
  • Regularization tuning parameters depending upon the method(s) in use. These include the size of \(\lambda\) for Ridge or LASSO for each layer as well as the dropout fraction \(\phi\).

It is common to plot the improvement in model performance against the number of epochs to see if there is a good stopping point.

  • If the improvement is too rough/noisy, you can shrink the learning rate to smooth things out.

Properly tuning a network can take some time. However, well-tuned networks can outperform other methods.

11.6 Example of Deep Learning in R from ISLR2 (Section 10.9 Lab)

We will use the Hitters data set from {ISLR2}.

Let’s predict Salary based on all the other predictors and calculate the mean absolute prediction error.

We will use Keras and TensorFlow, which are popular open-source platforms for machine learning in Python. They can also be used from R.

  • TensorFlow is an end-to-end platform for machine learning.
    • A tensor is a specific class of data structure similar to a matrix or a Python NumPy array.
    • The tensor “flows” through the network model as each iteration completes a forward and then a backwards pass.
    • TensorFlow requires tensors to be “rectangular” — that is, along each axis, every element is the same size.
    • There are specialized types of tensors that can handle different shapes.
  • Keras is a high-level API for TensorFlow that can simplify the project workflow so Keras and TensorFlow work together.

11.6.1 Keras TensorFlow Setup

We will use the {keras} R package which links to the {tensorflow} R package which then links to compiled Python code for speed.

  • Making the connection to Python requires the {reticulate} package as well.

Setup can take some time to configure properly depending upon what you already have installed on your system.

See the following references for assistance in installing {reticulate}, a version of python, {keras}, and {tensorflow}.

11.6.2 Prepare the data

  • Remove NAs and set up a vector for making a test set of 1/3 the observations.
library(ISLR2)
Gitters <- na.omit(Hitters)
dplyr::glimpse(Gitters)
Rows: 263
Columns: 20
$ AtBat     <int> 315, 479, 496, 321, 594, 185, 298, 323, 401, 574, 202, 418, …
$ Hits      <int> 81, 130, 141, 87, 169, 37, 73, 81, 92, 159, 53, 113, 60, 43,…
$ HmRun     <int> 7, 18, 20, 10, 4, 1, 0, 6, 17, 21, 4, 13, 0, 7, 20, 2, 8, 16…
$ Runs      <int> 24, 66, 65, 39, 74, 23, 24, 26, 49, 107, 31, 48, 30, 29, 89,…
$ RBI       <int> 38, 72, 78, 42, 51, 8, 24, 32, 66, 75, 26, 61, 11, 27, 75, 8…
$ Walks     <int> 39, 76, 37, 30, 35, 21, 7, 8, 65, 59, 27, 47, 22, 30, 73, 15…
$ Years     <int> 14, 3, 11, 2, 11, 2, 3, 2, 13, 10, 9, 4, 6, 13, 15, 5, 8, 1,…
$ CAtBat    <int> 3449, 1624, 5628, 396, 4408, 214, 509, 341, 5206, 4631, 1876…
$ CHits     <int> 835, 457, 1575, 101, 1133, 42, 108, 86, 1332, 1300, 467, 392…
$ CHmRun    <int> 69, 63, 225, 12, 19, 1, 0, 6, 253, 90, 15, 41, 4, 36, 177, 5…
$ CRuns     <int> 321, 224, 828, 48, 501, 30, 41, 32, 784, 702, 192, 205, 309,…
$ CRBI      <int> 414, 266, 838, 46, 336, 9, 37, 34, 890, 504, 186, 204, 103, …
$ CWalks    <int> 375, 263, 354, 33, 194, 24, 12, 8, 866, 488, 161, 203, 207, …
$ League    <fct> N, A, N, N, A, N, A, N, A, A, N, N, A, N, N, A, N, N, A, N, …
$ Division  <fct> W, W, E, E, W, E, W, W, E, E, W, E, E, E, W, W, W, E, W, W, …
$ PutOuts   <int> 632, 880, 200, 805, 282, 76, 121, 143, 0, 238, 304, 211, 121…
$ Assists   <int> 43, 82, 11, 40, 421, 127, 283, 290, 0, 445, 45, 11, 151, 45,…
$ Errors    <int> 10, 14, 3, 4, 25, 7, 9, 19, 0, 22, 11, 7, 6, 8, 10, 16, 2, 5…
$ Salary    <dbl> 475.000, 480.000, 500.000, 91.500, 750.000, 70.000, 100.000,…
$ NewLeague <fct> N, A, N, N, A, A, A, N, A, A, N, N, A, N, N, A, N, N, N, N, …
n <- nrow(Gitters)
set.seed(13)
ntest <- trunc(n / 3)
testid <- sample(1:n, ntest)

11.6.3 Prepare Models using Other Methods for Comparison

Create a linear model as a basis for comparison.

lfit <- lm(Salary ~ ., data = Gitters[-testid, ])
lpred <- predict(lfit, Gitters[testid, ])
lae_lm <- with(Gitters[testid, ], mean(abs(lpred - Salary)))
lae_lm 
[1] 254.6687

Let’s create a lasso model as a basis for comparison.

  • Create the model matrix and scale it for use in cv.glmnet().
  • Run cv.glmnet() using type.measure = "mae" and then calculate the mean absolute prediction error.
x <- scale(model.matrix(Salary ~ . - 1, data = Gitters))
y <- Gitters$Salary

library(glmnet)
set.seed(123)
cvfit <- cv.glmnet(x[-testid, ], y[-testid], type.measure = "mae")
cpred <- predict(cvfit, x[testid, ], s = "lambda.min")
lae_lasso <- mean(abs(y[testid] - cpred))
lae_lasso 
[1] 253.1462

Slightly better.

11.6.4 Set up the Network Model

First we have to define the neural network object as a sequential learning model.

  • Load the {keras} package with library().

    • We use {keras} to define our network and set several hyper-parameters.
    • It is part of the TensorFlow framework.
  • keras_model_sequential creates a model object in which we can define a “stack of layers”.

    • A sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.
    • This includes both the single hidden layer model and the multiple hidden layer models discussed earlier.
  • Specify each layer in order using layer_dense().

    • Here we define a single hidden layer with 50 units and the ReLU activation function.
  • You can also set the dropout layer with layer_dropout().

    • Here we set a dropout rate of .4 or 40% of the activations from the previous layer will be set to 0 each pass.
  • Finally, add the last layer (the output).

    • Here we want only one prediction, so one unit with no activation function.
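    library(keras)
    modnn <- keras_model_sequential() %>%
      # hidden layer: 50 units, ReLU activation, one input per column of x
      layer_dense(units = 50, activation = "relu", input_shape = ncol(x)) %>%
      # dropout layer: zero out 40% of the hidden activations on each pass
      layer_dropout(rate = 0.4) %>%
      # output layer: one unit, no activation (quantitative response)
      layer_dense(units = 1)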

Now that we have built the model for training, we have to get it into a structure to tell the python engine about it.

  • compile() invokes compile.keras.engine.training.Model().
  • We tell it the loss function to minimize, here mean squared error ("mse").
  • We have to tell it which optimizer to use; here we use optimizer_rmsprop(), a common default choice.
  • We also tell it the metrics to be evaluated by the model during training and testing.
  • By default, Keras will create a placeholder for the model’s target “tensor” (the response data), which will be fed with the target data during training.
modnn %>% compile(loss = "mse",
    optimizer = optimizer_rmsprop(),
    metrics = list("mean_absolute_error")
   )

11.6.5 Fit the Neural Network Model

Now we fit the model.

  • fit() invokes fit.keras.engine.training.Model().
  • We supply the training data and two fitting parameters, epochs and batch_size.
  • Using a batch of 32 at each step of descent, the algorithm selects 32 training observations for the computation of the gradient.
  • Let’s start with just 15 epochs.
  • We also use the testing data for the validation_data= argument.
history <- modnn %>% fit(
#    x[-testid, ], y[-testid], epochs = 750, batch_size = 32, verbose = 0,
    x[-testid, ], y[-testid], epochs = 15, batch_size = 32,
    validation_data = list(x[testid, ], y[testid])
  )
class(history)
[1] "keras_training_history"

We can plot the results for the most recent set of epochs.

  • It uses {ggplot2} if available by default, otherwise Base R graphics.

  • There are plotting options for plot.keras_training_history() but it is not exported from {keras}.

    • You can look at help, but not call it directly.
    plot(history)

Note

If you run the fit() command a second time in the same R session, the fitting process will pick up where it left off.

  • Try re-running the fit() command, and then the plot() command!
  • You can restart R to reset.

Here is an example plot for 750 epochs.

  • Note how the green line for the test set (validation data) starts to stay above the training data at about 500 epochs.
  • This can be suggestive of overfitting.

750 Epochs on Hitters data

11.6.6 Predict and Measure Performance

  • predict() invokes predict.keras.engine.training.Model().
summary(modnn)
Model: "sequential"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #     
================================================================================
 dense_1 (Dense)                    (None, 50)                      1050        
 dropout (Dropout)                  (None, 50)                      0           
 dense (Dense)                      (None, 1)                       51          
================================================================================
Total params: 1,101
Trainable params: 1,101
Non-trainable params: 0
________________________________________________________________________________
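The parameter counts in the summary can be checked by hand. This sketch assumes the scaled model matrix x has 20 columns, which is consistent with the 1,050 parameters reported for the hidden layer.

```r
n_inputs <- 20    # assumption: ncol(x) for the scaled Hitters model matrix
n_hidden <- 50    # units in the hidden layer

hidden_params <- (n_inputs + 1) * n_hidden   # weights plus one bias per hidden unit
output_params <- (n_hidden + 1) * 1          # weights plus bias for the single output
total_params  <- hidden_params + output_params

c(hidden = hidden_params, output = output_params, total = total_params)
```

The dropout layer contributes no parameters because it only masks activations.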
npred <- predict(modnn, x[testid, ])
mean(abs(y[testid] - npred))
[1] 535.2723
  • After just a few epochs, the model will probably not perform as well as the other models.
  • Here is one set of sequential outputs - running each set of epochs one after the other.
    • 15 epochs yields a mean absolute error of 537.4442.
    • 200 yields 378.8374
    • 500 yields 271.2441
    • 700 yields 264.6041
    • 1,000 yields 259.3959
    • 1,500 yields 255.6646
    • 2,000 yields 256.7299
  • These are all higher than the 253 achieved by the Lasso model. However, more tuning may get better results.
    • One set of sequential runs had a result of 248.
Warning
  • Reproducing results can be a challenge in this setting as setting a seed in R does not carry over into Python where TensorFlow is running.

Keras and TensorFlow allow the building of very complicated models for a variety of purposes.

The more complicated the model the more hyper-parameters might be required and thus more dimensions to the tuning space to find the most useful model.

11.7 Summary

Important

Deep Learning and Neural Networks is a vast field of research and application.

There are multiple variants for working with special classes of problems, such as Convolutional Neural Networks, or building models upon models, or models that compete with other models.

While it can be a powerful method for prediction, it is often of little help in inference as the models tend to be “black-box” models where explainability is hard.

It can also be outperformed by other methods, depending upon the data set and the question to be answered. Constraints on available computational power and memory, as well as the time it takes to properly tune a model for maximum performance, may make other methods much better choices if they perform reasonably well.

Machines may learn, but humans still have to make choices.