Rows: 263
Columns: 20
$ AtBat <int> 315, 479, 496, 321, 594, 185, 298, 323, 401, 574, 202, 418, …
$ Hits <int> 81, 130, 141, 87, 169, 37, 73, 81, 92, 159, 53, 113, 60, 43,…
$ HmRun <int> 7, 18, 20, 10, 4, 1, 0, 6, 17, 21, 4, 13, 0, 7, 20, 2, 8, 16…
$ Runs <int> 24, 66, 65, 39, 74, 23, 24, 26, 49, 107, 31, 48, 30, 29, 89,…
$ RBI <int> 38, 72, 78, 42, 51, 8, 24, 32, 66, 75, 26, 61, 11, 27, 75, 8…
$ Walks <int> 39, 76, 37, 30, 35, 21, 7, 8, 65, 59, 27, 47, 22, 30, 73, 15…
$ Years <int> 14, 3, 11, 2, 11, 2, 3, 2, 13, 10, 9, 4, 6, 13, 15, 5, 8, 1,…
$ CAtBat <int> 3449, 1624, 5628, 396, 4408, 214, 509, 341, 5206, 4631, 1876…
$ CHits <int> 835, 457, 1575, 101, 1133, 42, 108, 86, 1332, 1300, 467, 392…
$ CHmRun <int> 69, 63, 225, 12, 19, 1, 0, 6, 253, 90, 15, 41, 4, 36, 177, 5…
$ CRuns <int> 321, 224, 828, 48, 501, 30, 41, 32, 784, 702, 192, 205, 309,…
$ CRBI <int> 414, 266, 838, 46, 336, 9, 37, 34, 890, 504, 186, 204, 103, …
$ CWalks <int> 375, 263, 354, 33, 194, 24, 12, 8, 866, 488, 161, 203, 207, …
$ League <fct> N, A, N, N, A, N, A, N, A, A, N, N, A, N, N, A, N, N, A, N, …
$ Division <fct> W, W, E, E, W, E, W, W, E, E, W, E, E, E, W, W, W, E, W, W, …
$ PutOuts <int> 632, 880, 200, 805, 282, 76, 121, 143, 0, 238, 304, 211, 121…
$ Assists <int> 43, 82, 11, 40, 421, 127, 283, 290, 0, 445, 45, 11, 151, 45,…
$ Errors <int> 10, 14, 3, 4, 25, 7, 9, 19, 0, 22, 11, 7, 6, 8, 10, 16, 2, 5…
$ Salary <dbl> 475.000, 480.000, 500.000, 91.500, 750.000, 70.000, 100.000,…
$ NewLeague <fct> N, A, N, N, A, A, A, N, A, A, N, N, A, N, N, A, N, N, N, N, …
11 Deep Learning/Neural Networks
Chapter 10 ISLR2
Deep Learning and Neural Nets are two terms that describe a set of machine learning methods that build networks with one or more layers of calculations to manipulate input data to derive outputs.
- The Deep in Deep Learning refers to the use of multiple layers in the networks. The more layers of calculations, the deeper the network.
- The Neural in Neural Networks (Neural Nets) refers to the use of methods called Artificial Neural Networks whose design is inspired by or emulates the actions of nerves in animals.
- Nerves receive inputs from multiple chemical signals and when the level of signal crosses a threshold, the nerve can “fire”.
- When a nerve reaches its action potential and fires, it sends electro-chemical signals cascading down the nerve to generate outputs.
- These outputs generate input signals to other (nearby) nerves (or other cells).
- These signals can either trigger the activation in nearby nerves or suppress their activation.
- The outputs may also cause a cell to take or suppress a given activity.
Deep Learning methods can be used for supervised or unsupervised learning.
They tend to require lots of input data and many calculations so their development has expanded greatly in the last few years given the large increases in available data and affordable computing power.
Deep Learning and Neural Nets have evolved in both computing and statistical domains and combine aspects of networks and statistical methods and transformations. Thus they tend to use different terms describing the same concepts, e.g., “features” are “predictors” are “inputs.”
11.1 A Single-Layer Neural Network
Neural networks are still trying to estimate the true but unknown function that defines the relationship between one set of data and another.
- In the case of supervised learning, the relationship is between a set of inputs \(X\) and a response (output) \(Y\).
Neural networks use one or more layers to connect inputs to outputs.
- Similar to boosted trees, the layers may have multiple nodes where each node is “weak”, but the combination of multiple nodes in the layer can generate “strong” (useful) results.
Figure 11.1 depicts a single layer neural network with input nodes on the left, a single “hidden” layer with multiple “units” (nodes), and an output layer on the right.
This network takes information from four (predictor) inputs, \(X_1, X_2, X_3, X_4\), to generate an output \(\hat{f}(X)\), which is the prediction \(\hat{Y}\) of the response.
- The arrows from the inputs to the units (nodes) in the middle layer indicate where each input is fed to a unit that will operate on all the inputs it receives.
- The number of units in a hidden layer, \(K\), is a tuning parameter (to be discussed later).
- In this single-layer model, each of the \(K\) hidden units makes its calculation and feeds its output to the output layer.
- The output layer then combines all of its \(K\) inputs to calculate the final result/prediction.
The terms units, nodes, or “perceptrons” all refer to the individual elements of a network layer.
11.1.1 Structure of a Single Layer Neural Net
What distinguishes a neural network from other methods is the way the hidden layers make calculations on the inputs and the final layer then combines the outputs from the previous layer (the inputs to the final layer) to calculate a final prediction.
Each hidden layer unit uses a pre-specified activation function on the input.
Let \(A_k\) represent the result of the calculation \(h_k(X)\) in the \(k\)th unit in the hidden layer.
For each unit, there is a non-linear activation function \(g\) applied to a linear combination of the \(X_j,\, j = 1, \ldots, p\), such that
\[A_k = h_k(X) = g(z_k) = g\left(w_{k0} + \sum_{j=1}^p w_{kj}X_j\right). \tag{11.1}\]
Each output \(A_k\) is fed to the output layer, where the outputs are combined as a weighted linear combination such that
\[f(X) = \beta_0 + \sum_{k=1}^K \beta_k A_k = \beta_0 + \sum_{k=1}^K \beta_k\, g\left(w_{k0} + \sum_{j=1}^p w_{kj}X_j\right). \tag{11.2}\]
All of the parameters in both Equation 11.1 (the \(w_{kj}\)) and Equation 11.2 (the \(\beta_k\)) have to be estimated from the data.
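To make Equations 11.1 and 11.2 concrete, here is a minimal base R sketch of the forward calculation for a single hidden layer. The function name `forward_pass`, the layout of `W` (one row per hidden unit, bias in the first column), and all numeric values are illustrative assumptions, not taken from ISLR2. ReLU (described in the next section) is used for \(g\) only as an example.

```r
# Forward pass for a single-hidden-layer network (Equations 11.1 and 11.2).
# All names and numbers below are illustrative assumptions.
relu <- function(z) pmax(z, 0)

forward_pass <- function(X, W, beta, g = relu) {
  # X:    n x p matrix of inputs
  # W:    K x (p + 1) matrix; column 1 holds the biases w_k0
  # beta: vector of length K + 1; beta[1] is the intercept beta_0
  Z <- cbind(1, X) %*% t(W)    # n x K matrix of z_ik
  A <- g(Z)                    # activations A_k = g(z_k)
  drop(cbind(1, A) %*% beta)   # f(X) = beta_0 + sum_k beta_k * A_k
}

# Two observations, p = 4 inputs, K = 3 hidden units
X <- matrix(c(1,  2, 3, 4,
              0, -1, 2, 1), nrow = 2, byrow = TRUE)
W <- matrix(c( 0.5,  1, 0, 0, 0,
              -1.0,  0, 1, 0, 0,
               0.0, -1, 0, 0, 1), nrow = 3, byrow = TRUE)
beta <- c(1, 2, -1, 0.5)

f_hat <- forward_pass(X, W, beta)
f_hat
```

For the first observation the three hidden units give activations 1.5, 1, and 3, so the output is 1 + 2(1.5) - 1(1) + 0.5(3) = 4.5. For the second observation the second unit is switched off by the ReLU and the output is 2.5.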
11.1.2 Activation Functions
The user pre-selects the activation function with the goal of helping to clearly differentiate between signal and noise in the data.
An early choice was the sigmoid activation function
\[g(z) = \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}}. \tag{11.3}\]
- This is the same function we used with logistic regression.
- It has the nice property of transforming linear functions into probabilities from 0 to 1.
- Its sigmoid shape also helps highlight differences between signal and noise.
A more popular choice now is the ReLU (rectified linear unit) activation function.
\[g(z) = (z)_+ = \cases{0 \quad \text{if } z<0 \\ z\quad \text{otherwise}} \tag{11.4}\]
- The ReLU calculations can be stored and computed faster than the sigmoid calculations.
Figure 11.2 shows the non-linear shapes of a sigmoid and a ReLU activation function.
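A quick way to compare the two activation functions in Equations 11.3 and 11.4 is to evaluate them on a few values of \(z\). This is a minimal base R sketch; the function names `sigmoid` and `relu` are chosen here for illustration.

```r
# Sigmoid (Equation 11.3) and ReLU (Equation 11.4) activation functions.
sigmoid <- function(z) 1 / (1 + exp(-z))
relu    <- function(z) pmax(z, 0)

z <- c(-2, 0, 2)
sigmoid(z)  # squashes every value into the interval (0, 1)
relu(z)     # zero for negative z, identity otherwise
```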
Activation functions are non-linear by choice to allow for non-linear relationships and interactions among the data.
- If the activation functions were linear, Equation 11.2 would just be a linear combination of linear combinations from Equation 11.1 which is not new.
However, even though the \(g(z)\) are non-linear, the combinations in Equation 11.2 are still linear in the activations \(A_k\) (the transformed inputs).
- The final model results from calculating linear combinations at each hidden unit, doing a non-linear transformation of each, and then doing a linear combination on the transformed results.
- The final layer is still linear in the transformed inputs, so for a quantitative response we can use the usual squared error loss as an objective function (loss function) to be minimized to estimate the parameters, i.e.,
\[\sum_{i=1}^{n}(y_i - f(x_i))^2. \tag{11.5}\]
11.2 Multi-layer Neural Networks
A single layer network can represent many possible \(\hat{f}(x)\).
Adding more hidden units in a layer and adding more layers allows for more possible transformations and provides more flexibility in the model.
- In general, adding more, smaller layers can make solutions easier to find than just adding more units in a single layer.
Figure 11.3 shows a multi-layer network for classifying digits (0-9) with a different number of units in each hidden layer.
- It has two hidden layers L1 (256 units) and L2 (128 units).
- There are 10 nodes in the output layer since there are 10 possible levels to be classified - each is a dummy variable. In this case, the ten variables really represent a single qualitative variable so are dependent on each other.
Each of the activations in the units in the second layer is still a function of the original input \(X\), albeit a transformed version of it based on the activations in the first layer.
Adding more and more layers to the model builds a series of simple transformations into a complex model for the final result.
With more layers comes more notation.
- We can add a superscript to indicate the layer, e.g., \(A^{(1)}_{k}\).
- Consider all the parameters for each layer as a matrix. Thus we have \(W_1\), \(W_2\) and \(B\) as our three matrices of “weights” for the network.
11.2.1 Coefficients to be Estimated
There are a lot of coefficients (weights) to be estimated in \(W_1\), \(W_2\), and \(B\).
There are 784 pixels in a \(28 \times 28\) pixel image. These are the inputs.
- Matrix \(W_1\) has \(785 \times 256 = 200,960\) elements. This is based on the 256 units and the 784 input values plus the intercept term (in NN called the “bias” term).
- Matrix \(W_2\) thus has \(128 \times 257 = 32,896\) (256 + bias term).
- Matrix \(B\) thus has \(10 \times 129 = 1,290\) elements. The 10 linear models for each level and the 128 outputs plus the bias term.
Together there are \(200,960 + 32,896 + 1,290 = 235,146\) coefficients (weights) to be estimated.
- This is 33 times more than just doing multinomial logistic regression.
- With a training set of only 60,000 images, there is a lot of opportunity for overfitting.
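The parameter counts above can be checked with a few lines of arithmetic. This is a small base R sketch; the variable names are illustrative, and the multinomial logistic regression count of \(785 \times 9\) is stated here as an assumption (one set of coefficients for each of nine non-baseline classes).

```r
# Count the weights in the 784 -> 256 -> 128 -> 10 network.
n_in  <- c(784, 256, 128)   # inputs feeding W1, W2, and B
units <- c(256, 128, 10)    # units in L1, L2, and the output layer

params <- (n_in + 1) * units  # "+ 1" adds the bias term for each unit
names(params) <- c("W1", "W2", "B")
params
sum(params)

# Assumed comparison: multinomial logistic regression on the raw pixels
n_mlr <- (784 + 1) * 9
sum(params) / n_mlr   # a little over 33
```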
To avoid overfitting, some regularization is needed. Options include Ridge, Lasso, or neural-network specific methods such as dropout regularization.
11.2.2 Softmax Activation Function
Since the outputs are dependent (they are essentially the probabilities that a given input image is a specific digit), this model uses the softmax activation function for the output layer.
\[f_m(X) = Pr(Y=m | X) = \frac{e^{Z_m}}{\sum_{l=0}^{9}e^{Z_l}} \tag{11.6}\]
Equation 11.6 is a generalization of the logistic function from binary logistic regression to the multiple levels needed in multinomial logistic regression.
The final prediction is based on selecting the dummy variable with the highest probability.
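As an illustration of Equation 11.6, the following base R sketch converts ten output-layer values \(Z_0, \ldots, Z_9\) into probabilities and picks the most likely digit. The function name `softmax` and the values in `z` are made up for the example; subtracting `max(z)` is a common numerical-stability step that does not change the result.

```r
# Softmax over the ten output units (Equation 11.6).
softmax <- function(z) {
  z <- z - max(z)          # guards against overflow; cancels in the ratio
  exp(z) / sum(exp(z))
}

# Hypothetical output-layer values for digits 0-9
z <- c(0.1, 2.0, -1.0, 0.5, 0.0, 3.0, -0.5, 1.0, 0.2, -2.0)
names(z) <- 0:9

p <- softmax(z)
round(p, 3)
sum(p)                 # probabilities sum to 1
names(which.max(p))    # predicted digit: the class with the highest probability
```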
11.2.3 Minimization by Cross-Entropy
Since in this example, the response is qualitative (categorical), the objective is to minimize the negative multinomial log-likelihood.
\[R(\theta) = -\sum_{i=1}^{n}\sum_{m=0}^9 y_{im} \text{log}(f_m(x_i)) \tag{11.7}\]
Equation 11.7 is a generalization of the loss function for negative log-likelihood from binary logistic regression to multiple levels.
- Equation 11.7 is known as the cross-entropy in information theory.
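Equation 11.7 can be computed directly when the responses are coded as dummy (one-hot) variables. The sketch below uses base R with a tiny made-up example of two observations and three classes; `cross_entropy`, `Y`, and `P` are illustrative names.

```r
# Cross-entropy (negative multinomial log-likelihood), Equation 11.7.
# Y: n x M matrix of one-hot responses; P: n x M matrix of fitted probabilities.
cross_entropy <- function(Y, P) -sum(Y * log(P))

Y <- rbind(c(1, 0, 0),
           c(0, 1, 0))
P <- rbind(c(0.7, 0.2, 0.1),
           c(0.1, 0.8, 0.1))

R_theta <- cross_entropy(Y, P)
R_theta   # only the probability assigned to the true class contributes
```

Here the loss is \(-(\log 0.7 + \log 0.8)\): only the probabilities assigned to the correct classes enter the sum.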
11.3 Fitting a Neural Network
Fitting a neural network is complex due to the necessarily non-linear activation functions.
Deep learning and neural networks is an active area of research across many communities. You will see many different terms often describing the same concept or approach.
You will also see many articles or references describing the latest approaches.
What follows is not exhaustive by any means. It is designed to provide familiarity with some approaches to provide a basic understanding.
There are many tuning parameters (hyper-parameters) in neural networks, and their selection for a given set of data is still very much an art more than a science.
While the objective function in Equation 11.5 looks familiar, minimizing it is a non-linear problem because the parameters sit inside the activation functions.
Neural Networks use non-linear activation functions so their objective functions are non-convex and have multiple local optima in addition to a global optimum.
Instead of trying to find the “best” model at the one global optimum, potentially out of millions, we seek to find a useful model at a local minimum.
Two strategies help find better local optima and reduce the chance of overfitting.
- Slow Learning: As we saw with boosting, taking small steps in gradient descent (slow learning) helps reduce the chance of overfitting. The algorithm stops when overfitting is detected.
- Regularization: Imposing penalties on the parameters such as we saw with Ridge and Lasso regression.
Assume all the parameters (coefficients) are in a single vector \(\theta\).
Define the objective then as
\[R(\theta) = \frac{1}{2}\sum_{i=1}^n(y_i - f_\theta(x_i))^2. \tag{11.8}\]
A general algorithm to minimize Equation 11.8 could be:
- Make an initial guess for all the parameters \(\theta^0\) and set \(t = 0\).
- Find a vector \(\delta\) that creates a small change in \(\theta\) such that \(\theta^{t+1} = \theta^t + \delta\) reduces the objective, i.e., \(R(\theta^{t+1}) < R(\theta^t)\).
- If \(R(\theta^{t+1})<R(\theta^t)\), add 1 to \(t\) (so \(t + 1\) becomes the new \(t\)) and go back to step 2.
- Once \(R(\theta^{t+1})\geq R(\theta^t)\) stop. We have reached the bottom and found a local minimum (that we hope is good).
So how to find a good \(\delta\)?
11.3.1 Backpropagation
Finding \(\delta\) is a key in neural network optimization.
Let’s define the value of the gradient of \(R(\theta)\) evaluated at its current value \(\theta^m\) as the vector of its partial derivatives evaluated at \(\theta^m\):
\[ \nabla R(\theta^{m})= \frac{\partial R(\theta)}{\partial \theta}\biggr\rvert_{\theta=\theta^m}. \tag{11.9}\]
- This gives the direction to move in \(\theta\) space in which \(R(\theta)\) increases the most rapidly.
We want to move in the opposite direction. So we update \(\theta^{m+1}\) as
\[ \theta^{m+1} \leftarrow \theta^{m} - \rho \nabla R(\theta^{m}). \tag{11.10}\]
- where \(\rho\) is the learning rate, a tuning parameter that controls the “rate of learning”.
- For a small enough \(\rho\), this step should decrease \(R(\theta)\) without overshooting the point where the gradient is zero, \(\nabla R(\theta)=0\).
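The general algorithm and the update in Equation 11.10 can be sketched on a problem small enough to check by hand. The example below is an assumption chosen for illustration: it uses a two-parameter linear model rather than a neural network, so that the answer can be compared with `lm()`. The data, the starting values, and the learning rate `rho = 0.01` are all made up.

```r
# Gradient descent on R(theta) = 0.5 * sum((y - theta1 - theta2 * x)^2).
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

R <- function(theta) 0.5 * sum((y - theta[1] - theta[2] * x)^2)
grad_R <- function(theta) {
  res <- y - theta[1] - theta[2] * x
  c(-sum(res), -sum(res * x))     # partial derivatives of R
}

theta <- c(0, 0)   # step 1: initial guess
rho   <- 0.01      # learning rate (illustrative value)

for (t in 1:10000) {
  theta_new <- theta - rho * grad_R(theta)   # Equation 11.10
  if (R(theta_new) >= R(theta)) break        # stop once R no longer decreases
  theta <- theta_new
}

theta
coef(lm(y ~ x))    # least squares solution for comparison
```

With a much larger `rho` the updates would overshoot and \(R(\theta)\) could increase on the first step, which is why the learning rate is treated as a tuning parameter.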
The good news is that Equation 11.8 is a sum over the observations, so its gradient in Equation 11.9 is also a sum over the observations.
Expanding the \(f_\theta(x_i)\) term in Equation 11.8, we get the complicated looking expression for a single observation \(i\)
\[ R_i(\theta) = \frac{1}{2}\left(y_i - \beta_0 - \sum_{k=1}^K\beta_k\,g\left(w_{k0} + \sum_{j=1}^{p}w_{kj}x_{ij}\right)\right)^2. \tag{11.11}\]
Let’s use
\[z_{ik} = w_{k0} + \sum_{j=1}^pw_{kj}x_{ij} \tag{11.12}\]
to simplify Equation 11.11 as
\[ R_i(\theta) = \frac{1}{2}\left(y_i - \beta_0 - \sum_{k=1}^K\beta_k\,g(z_{ik})\right)^2. \tag{11.13}\]
Since \(\delta\) is the change we want in the current values of \(\theta\), let’s take the partial derivative of Equation 11.13 with respect to \(\beta_k\) (using the chain rule) to find a good value of \(\delta\) for the \(\beta\)s as
\[ \begin{aligned} \frac{\partial R_i(\theta)}{\partial\beta_k} &= \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial\beta_k} \\ &= -(y_i - f_\theta(x_i)) \cdot g(z_{ik}) \end{aligned} \tag{11.14}\]
where we are using the derivative of Equation 11.8 to get the first term.
To find a value of \(\delta\) for the \(w\)s, we can now take the derivative of Equation 11.13 with respect to \(w_{kj}\) to get
\[ \begin{aligned} \frac{\partial R_i(\theta)}{\partial w_{kj}} &= \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial g(z_{ik})} \cdot \frac{\partial g(z_{ik})}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{\partial w_{kj}}\\ &= -(y_i - f_\theta(x_i)) \cdot \beta_k \cdot g'(z_{ik}) \cdot x_{ij} \end{aligned} \tag{11.15}\]
where we are using the derivative of Equation 11.8 for the first term, the derivative of Equation 11.2 for the second term, the derivative of \(g(z)\) for the third, and the derivative of Equation 11.12 for the fourth term.
The first term in both partial derivatives is the negative of the residual from the final layer, \(-(y_i - f_\theta(x_i))\).
Equation 11.14 shows the \(\delta_\beta\)’s are based on how the residual gets allocated across each of \(K\) hidden units in the last hidden layer given the value of \(g(z_{ik})\) for that unit.
Equation 11.15 shows the \(\delta_w\)’s are based on how the residual gets allocated across each of the \(p\) inputs to the unit according to the value of each hidden unit’s \(\beta_k g'(z_{ik})\) and each input \(x_{ij}\). The form of \(g'()\) will depend upon the activation function we chose.
By starting with the final residual, we can now use the derivatives to calculate the new \(\delta\) for each parameter in each layer by moving from right to left across the network.
This process is known as backpropagation, as we are moving backwards, (right to left) to propagate the information in the latest residuals to update the weights (coefficients) for each unit in each layer.
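One way to build confidence in Equations 11.14 and 11.15 is to compare them with numerical (finite-difference) derivatives for a single observation. The base R sketch below does this for a network with \(p = 2\) inputs and \(K = 2\) hidden units. The sigmoid activation, the weights, and the data point are all assumptions picked for the example.

```r
# Check the backpropagation formulas against finite differences.
sigmoid       <- function(z) 1 / (1 + exp(-z))
sigmoid_prime <- function(z) sigmoid(z) * (1 - sigmoid(z))

x <- c(0.5, -1.0)   # one observation with p = 2 inputs
y <- 1.0
W <- matrix(c( 0.1, 0.4, -0.3,
              -0.2, 0.2,  0.5), nrow = 2, byrow = TRUE)  # K x (p + 1), column 1 = w_k0
beta <- c(0.3, 0.7, -0.6)                                # beta_0, beta_1, beta_2

f_theta <- function(W, beta) {
  z <- drop(W %*% c(1, x))               # z_ik, Equation 11.12
  beta[1] + sum(beta[-1] * sigmoid(z))   # Equation 11.2
}
R_i <- function(W, beta) 0.5 * (y - f_theta(W, beta))^2  # Equation 11.13

# Analytic gradients
z     <- drop(W %*% c(1, x))
resid <- y - f_theta(W, beta)
grad_beta <- -resid * sigmoid(z)                               # Equation 11.14
grad_W    <- -resid * outer(beta[-1] * sigmoid_prime(z), x)    # Equation 11.15, K x p

# Central finite differences for beta_1 and w_22
eps <- 1e-6
num_beta1 <- (R_i(W, beta + c(0, eps, 0)) - R_i(W, beta - c(0, eps, 0))) / (2 * eps)
W_plus  <- W; W_plus[2, 3]  <- W[2, 3] + eps
W_minus <- W; W_minus[2, 3] <- W[2, 3] - eps
num_w22 <- (R_i(W_plus, beta) - R_i(W_minus, beta)) / (2 * eps)

c(analytic = grad_beta[1], numeric = num_beta1)
c(analytic = grad_W[2, 2], numeric = num_w22)
```

The analytic and numerical values should agree to several decimal places, which is a common sanity check when coding gradients by hand.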
11.3.2 Epochs
We know how to do three things:
- A Forward Pass: Use the inputs to compute the outputs for each unit in each layer (using the current estimates of the weights and the activation functions) through to the output layer, where the current \(\beta\)s produce a prediction \(\hat{y}_i\) for each observation \(x_i\).
- Calculate the residuals \((y_i - \hat{y}_i)\).
- A Backwards Pass: Use the gradient of the loss function to allocate the residuals as a \(\delta\) to update each of the weights in the network through backpropagation.
We will repeat those steps multiple times in training the network.
An Epoch is defined as one complete forward pass and one complete backward pass (steps 1-3) through the network where every observation in the training data set contributes to the update.
- The number of epochs to use is a tuning parameter when building the model.
- There are trade-offs between time and accuracy, as well as preventing overfitting.
11.3.3 Options for Epochs
With so many calculations to be made, there are different approaches to how to use the training data in each epoch.
- Massive data sets can require many computations and significant memory to hold all the data at once.
- Research continues into multiple options for improving the speed and reducing the memory requirements for neural networks.
Each approach has trade offs between how long it takes to complete an epoch, how fast or slow to get to convergence, how smoothly to get to convergence, how to reduce bias or multicollinearity, and how much memory is required.
There are three popular approaches for how much data to include in each forward/backward pass through the network (see Batch, Mini Batch & Stochastic Gradient Descent).
- Batch Gradient Descent
  - All the observations are used at once, in a single batch or iteration.
  - When the data space is “well-behaved”, it can provide for a smooth convergence across multiple epochs.
  - It does require that all the data fit into memory at once.
- Stochastic Gradient Descent is the other extreme, where only one observation at a time is fed into the model.
  - Thus there are \(n\) batches and each batch is processed in an iteration.
  - Much less has to be computed each time and much less data has to fit into memory.
  - The convergence can be less smooth as observations can vary widely.
- Mini-Batch Gradient Descent
  - This is in between the others, where a fraction of the data is used for each batch.
  - If there are \(m\) batches, each batch uses about \(n/m\) observations, with the last batch using whatever remains after the first \(m-1\) batches.
  - There are different methods for whether to update the \(\delta\)s after each mini-batch (do a backwards pass) or just store the residuals from a forward pass to update \(\delta\)s after all mini-batches have been run and you have \(n\) residuals.
  - There is some evidence that small batches (e.g., 32 observations) may be less smooth but converge more quickly and be more robust in prediction.
- Hybrid Approaches
  - This is my term for the variety of methods that combine aspects of multiple approaches.
  - One such method is to select the observations for each iteration at random (with or without replacement).
    - This would mean that perhaps not all \(n\) observations make it into each epoch.
  - Another method is to randomly shuffle all the input data at each epoch so different data goes into different batches for each epoch.
Some use the term Stochastic Gradient Descent (SGD) as shorthand to refer to all the above approaches for selecting the data to be used in an iteration.
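The mini-batch idea with shuffling at each epoch can be sketched in a few lines of base R. The helper `make_batches()` is a hypothetical name, and \(n = 176\) is used as an example training-set size (it equals the 263 Hitters observations minus a test set of 87); the batch size of 32 matches the value used later in this chapter.

```r
# Split n shuffled observation indices into mini-batches.
make_batches <- function(n, batch_size) {
  idx <- sample(n)                                   # random shuffle of 1..n
  split(idx, ceiling(seq_along(idx) / batch_size))   # consecutive chunks
}

set.seed(1)
batches <- make_batches(n = 176, batch_size = 32)

length(batches)    # iterations (gradient steps) per epoch
lengths(batches)   # five full batches and one smaller final batch
```

Calling `make_batches()` again at the start of each epoch gives a fresh shuffle, so different observations end up together in each batch.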
Since the number of coefficients to be estimated (the \(w\)s and \(\beta\)s) can often be greater than the number of observations, using a regularization method can reduce the risk of overfitting.
11.4 Regularization of Neural Networks
11.4.1 Regularization Methods for Optimization
We could use Ridge or LASSO Regression.
For Ridge, we add a penalty term to Equation 11.7 to now optimize
\[R(\theta; \lambda) = -\sum_{i=1}^{n}\sum_{m=0}^9 y_{im} \text{log}(f_m(x_i)) + \lambda\sum_j\theta_j^2. \tag{11.16}\]
For LASSO, we would change the last term to the \(L_1\) norm or \(\lambda\sum_j|\theta_j|\).
- We can set \(\lambda\) to a small value or use a validation method to get it.
- We could choose separate values of \(\lambda\) for each layer as well.
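The penalized objective in Equation 11.16 and its LASSO counterpart differ only in the penalty term, which the following base R sketch makes explicit. The function `penalized_loss()` and the numbers passed to it are hypothetical; `loss` stands in for the unpenalized cross-entropy value.

```r
# Add a Ridge (L2) or LASSO (L1) penalty to an unpenalized loss value.
penalized_loss <- function(loss, theta, lambda, type = c("ridge", "lasso")) {
  type <- match.arg(type)
  penalty <- if (type == "ridge") sum(theta^2) else sum(abs(theta))
  loss + lambda * penalty
}

theta <- c(0.5, -2, 1)   # illustrative parameter vector
loss  <- 1.2             # illustrative unpenalized loss

ridge_obj <- penalized_loss(loss, theta, lambda = 0.1, type = "ridge")
lasso_obj <- penalized_loss(loss, theta, lambda = 0.1, type = "lasso")
c(ridge = ridge_obj, lasso = lasso_obj)
```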
We could also implement a hyper-parameter to “stop early” to reduce overfitting.
11.4.2 Dropout Learning
Dropout Learning is a regularization method similar to Random Forests where for each iteration in an epoch, we randomly select a fraction of the nodes, \(\phi\), to “drop out” of the calculation.
- The remaining weights are scaled by a factor of \(1/(1-\phi)\) to compensate for the missing units.
- In practice, dropout is achieved by randomly setting the activations for the “dropped out” units to zero.
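A minimal base R sketch of the dropout step is shown below. It is an illustration under stated assumptions, not the Keras implementation: the helper `dropout()` is a hypothetical name, each activation is dropped independently with probability \(\phi\), and the surviving activations (rather than the weights) are scaled by \(1/(1-\phi)\), which has the same effect on what the next layer receives.

```r
# Randomly zero out a fraction phi of the activations and rescale the rest.
dropout <- function(A, phi) {
  keep <- matrix(rbinom(length(A), size = 1, prob = 1 - phi), nrow = nrow(A))
  A * keep / (1 - phi)
}

set.seed(42)
A <- matrix(1, nrow = 1000, ncol = 50)   # pretend activations, all equal to 1
A_drop <- dropout(A, phi = 0.4)

mean(A_drop == 0)   # close to the dropout rate of 0.4
mean(A_drop)        # close to 1: the rescaling preserves the average signal
```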
One can also insert a “dropout layer” between the input nodes and the first hidden layer.
Dropout Learning reduces the opportunity for nodes to become “too specialized” as other nodes have to make up for the residuals when they are dropped out. This tends to improve performance in prediction.
There are multiple suggestions to help with adjusting the drop out rate (Dropout Regularization in Deep Learning Models with Keras). These include:
- Use a small dropout value of 20%-50% of neurons, with 20% providing a good starting point. A probability too low has minimal effect, and a value too high results in under-learning by the network.
- Use a larger network. You are likely to get better performance when Dropout is used on a larger network, giving the model more of an opportunity to learn independent representations.
- Use Dropout on incoming (visible) as well as hidden units. Application of Dropout at each layer of the network has shown good results.
11.5 Tuning Neural Networks
Tuning a Neural Network has additional complexity beyond selecting the inputs.
The model developer has to choose:
- The design of the network: the number of hidden layers and the number of units per layer.
- The parameters for stochastic gradient descent. These include the batch size, the number of epochs, and, if used, details of data augmentation.
- The regularization tuning parameters, depending upon the method(s) in use. These include the size of \(\lambda\) for Ridge or LASSO for each layer as well as the dropout fraction \(\phi\).
It is common to plot the improvement in model performance against the number of epochs to see if there is a good stopping point.
- If the improvement is too rough/noisy, you can shrink the learning rate to smooth things out.
Properly tuning a network can take some time. However, well-tuned networks can outperform other methods.
11.6 Example of Deep Learning in R from ISLR2 (Section 10.9 Lab)
We will use the Hitters data set from {ISLR2}.
Let’s predict Salary based on all the other predictors and calculate the mean absolute prediction error.
We will use Keras and TensorFlow, which are popular open-source platforms for machine learning in Python. They can also be used from R.
- TensorFlow is an end-to-end platform for machine learning.
- A tensor is a specific class of data structure similar to a matrix or a Python NumPy array.
- The tensor “flows” through the network model as each iteration completes a forward and then a backwards pass.
- TensorFlow requires tensors to be “rectangular” — that is, along each axis, every element is the same size.
- There are specialized types of tensors that can handle different shapes.
- Keras is a high-level API for TensorFlow that can simplify the project workflow so Keras and TensorFlow work together.
11.6.1 Keras TensorFlow Setup
We will use the {keras} R package, which links to the {tensorflow} R package, which then links to compiled Python code for speed.
- Making the connection to Python requires the {reticulate} package as well.
Setup can take some time to configure properly depending upon what you already have installed on your system.
See the following references for assistance in installing {reticulate}, a version of python, {keras}, and {tensorflow}.
11.6.2 Prepare the data
- Remove NAs and set up a vector for making a test set of 1/3 the observations.
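The data-preparation step described above might look like the following sketch. The object names `Gitters` and `testid` are taken from the code later in this section; the seed value is an assumption, so a different seed will give a different test set and different error values than those reported below.

```r
# Prepare the Hitters data: drop missing values and choose a test set.
library(ISLR2)

Gitters <- na.omit(Hitters)    # remove rows with NA (missing Salary)
n <- nrow(Gitters)             # 263 observations remain

set.seed(13)                   # assumed seed, for a reproducible split
ntest  <- trunc(n / 3)         # one third of the observations
testid <- sample(1:n, ntest)   # row indices of the test set
```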
11.6.3 Prepare Models using Other Methods for Comparison
Create a linear model as a basis for comparison.
lfit <- lm(Salary ~ ., data = Gitters[-testid, ])
lpred <- predict(lfit, Gitters[testid, ])
lae_lm <- with(Gitters[testid, ], mean(abs(lpred - Salary)))
lae_lm
[1] 254.6687
Let’s create a lasso model as a basis for comparison.
- Create the model matrix and scale it for use in cv.glmnet().
- Run cv.glmnet() using type.measure = "mae" and then calculate the mean absolute prediction error.
x <- scale(model.matrix(Salary ~ . - 1, data = Gitters))
y <- Gitters$Salary
library(glmnet)
set.seed(123)
cvfit <- cv.glmnet(x[-testid, ], y[-testid], type.measure = "mae")
cpred <- predict(cvfit, x[testid, ], s = "lambda.min")
lae_lasso <- mean(abs(y[testid] - cpred))
lae_lasso
[1] 253.1462
Slightly better.
11.6.4 Set up the Network Model
First we have to define the neural network object as a sequential learning model.
- Load the {keras} package with library().
  - We use {keras} to define our network and set several hyper-parameters.
  - It is part of the TensorFlow framework.
- keras_model_sequential creates a model object in which we can define a “stack of layers”.
  - A sequential model is appropriate for a plain stack of layers where each layer has exactly one input tensor and one output tensor.
  - This includes both the single hidden layer model and the multiple hidden layer models discussed earlier.
- Specify each layer in order using layer_dense().
  - Here we define a single hidden layer with 50 units and the ReLU activation function.
- You can also set the dropout layer with layer_dropout().
  - Here we set a dropout rate of 0.4, so 40% of the activations from the previous layer will be set to 0 on each pass.
- Finally, add the last layer (the output).
  - Here we want only one prediction, so one unit with no activation function.
Now that we have built the model for training, we have to get it into a structure to tell the python engine about it.
- compile() invokes compile.keras.engine.training.Model().
  - We have to tell it the optimizer to use, and for most models that is optimizer_rmsprop().
  - We also tell it the metrics to be evaluated by the model during training and testing.
  - By default, Keras will create a placeholder for the model’s target “tensor” (the response data), which will be fed with the target data during training.
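The model definition and compile steps described above can be sketched as follows. This is a reconstruction from the description in the text, not necessarily the original code: the object name `modnn` comes from the later fit() call, `p <- 20` is assumed to equal ncol(x) (which is consistent with the 1,050 parameters shown for the hidden layer in the model summary), and the loss ("mse") and metric ("mean_absolute_error") are assumptions chosen to match the mean absolute error reported in this section. It requires a working {keras} and TensorFlow installation.

```r
library(keras)

p <- 20  # assumed number of predictor columns; in this workflow it would be ncol(x)

# One hidden layer (50 ReLU units), a dropout layer, and a single linear output unit
modnn <- keras_model_sequential() %>%
  layer_dense(units = 50, activation = "relu", input_shape = p) %>%
  layer_dropout(rate = 0.4) %>%
  layer_dense(units = 1)

# Assumed loss and metric: squared error for training, MAE for monitoring
modnn %>% compile(
  loss = "mse",
  optimizer = optimizer_rmsprop(),
  metrics = list("mean_absolute_error")
)
```

The parameter count follows from the layer sizes: \(50 \times (20 + 1) = 1{,}050\) weights in the hidden layer and \(50 + 1 = 51\) in the output layer.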
11.6.5 Fit the Neural Network Model
Now we fit the model.
- fit() invokes fit.keras.engine.training.Model().
  - We supply the training data and two fitting parameters, epochs and batch_size.
  - With a batch size of 32, the algorithm selects 32 training observations to compute the gradient at each step of descent.
  - Let’s start with just 15 epochs.
  - We also use the testing data for the validation_data= argument.
history <- modnn %>% fit(
# x[-testid, ], y[-testid], epochs = 750, batch_size = 32, verbose = 0,
x[-testid, ], y[-testid], epochs = 15, batch_size = 32,
validation_data = list(x[testid, ], y[testid])
)
class(history)
[1] "keras_training_history"
We can plot the results for the most recent set of epochs.
It uses {ggplot2} if available by default, otherwise Base R graphics.
- There are plotting options for plot.keras_training_history() but it is not exported from {keras}.
  - You can look at help, but not call it directly.
If you run the fit() command a second time in the same R session, the fitting process will pick up where it left off.
- Try re-running the fit() command, and then the plot() command!
- You can restart R to reset.
Here is an example plot for 750 epochs.
- Note how the green line for the test set (validation data) starts to stay above the training data at about 500 epochs.
- This can be suggestive of overfitting.
11.6.6 Predict and Measure Performance
- predict() invokes predict.keras.engine.training.Model().
Model: "sequential"
________________________________________________________________________________
 Layer (type)                       Output Shape                    Param #
================================================================================
 dense_1 (Dense)                    (None, 50)                      1050
 dropout (Dropout)                  (None, 50)                      0
 dense (Dense)                      (None, 1)                       51
================================================================================
Total params: 1,101
Trainable params: 1,101
Non-trainable params: 0
________________________________________________________________________________
[1] 535.2723
- After just a few epochs, the model will probably not perform as well as the other models.
- Here is one set of sequential outputs - running each set of epochs one after the other.
- 15 Epochs yields LAE of 537.4442.
- 200 yields 378.8374
- 500 yields 271.2441
- 700 yields 264.6041
- 1,000 yields 259.3959
- 1,500 yields 255.6646
- 2,000 yields 256.7299
- These are all higher than the 253 achieved by the Lasso model. However, more tuning may get better results.
- One set of sequential runs had a result of 248.
- Reproducing results can be a challenge in this setting as setting a seed in R does not carry over into Python where TensorFlow is running.
Keras and TensorFlow allow the building of very complicated models for a variety of purposes.
The more complicated the model the more hyper-parameters might be required and thus more dimensions to the tuning space to find the most useful model.
11.7 Summary
Deep Learning and Neural Networks is a vast field of research and application.
There are multiple variants for working with special classes of problems, such as Convolutional Neural Networks, building models upon models, or models that compete with other models.
While it can be a powerful method for prediction, it is often of little help in inference as the models tend to be “black-box” models where explainability is hard.
It can also be outperformed by other methods, depending upon the data set and the question to be answered. The constraints on available computational power and memory, as well as the time it takes to properly tune a model for maximum performance, may make other methods much better choices if they perform reasonably well.
Machines may learn, but humans still have to make choices.