AU STAT 427/627 Statistical Machine Learning

1 Course Overview

1.1 Purpose

Introduce statistical machine learning concepts, models, and algorithms so students can apply the appropriate methods, algorithms and tools to understand data through analysis, modeling, and evaluation of model performance.

1.2 Description

We explore supervised learning for regression and classification, unsupervised learning for clustering and principal components analysis, and related topics such as discriminant analysis, splines, lasso and other shrinkage methods, bootstrap, regression and classification trees, and support vector machines, along with their tuning, diagnostics, and performance evaluation.

Students work in groups to create a final project to demonstrate the effective application of statistical machine learning to real-world data.

This course assumes recent experience with the material covered in the prerequisite STAT 415/615 Regression or STAT 520 Applied Multivariate Analysis. It also assumes the ability to manipulate data, create plots, and conduct statistical analysis using the R (or other) programming language. This is not a programming course, but we will use R to explore the topics with data. Advanced programming skills and computer knowledge are not required.

1.3 Goals and Learning Outcomes

Upon successful completion of this course, you will be able to demonstrate competence in using different statistical learning methods involving large, messy, and multidimensional numerical and categorical data.

Specific learning outcomes for each course are:

Course | Learning Outcome
427/627 | Identify appropriate statistical learning methods for a given problem involving real data.
427/627 | Analyze the underlying assumptions of the methods and be able to verify them, and then propose appropriate remedies for invalid assumptions.
627-only | Identify other possible problems with messy data, such as multicollinearity, understand their consequences, and propose appropriate solutions.
427/627 | Create and use training and test data to evaluate the performance of the chosen regression and/or classification techniques and analyze the results.
427-only | Use available empirical tools to find the optimal balance between precision within training data and prediction power.
627-only | Show, analytically or empirically, the optimal balance between precision within training data and prediction power.
627-only | Apply cross-validation techniques to find the optimal degree of flexibility - the best subset of predictors or the optimal tuning parameters.
427/627 | Illustrate results with appropriate plots and diagrams.
427/627 | Assess ethical implications of the application of statistical machine learning for a given problem.
427/627 | Communicate analysis approaches, results/findings and implications in oral presentations or written reports in clear language, appropriately formatted with supporting references.

1.4 Prerequisites

This course requires successful completion of STAT-415 or STAT-615 Regression, or STAT 520 Applied Multivariate Analysis, as a prerequisite. These courses have their own mathematics/statistics pre- or co-requisites, so graduate students are assumed to have an intermediate-level knowledge of statistical methods.

2 Required Resources

The textbook is required, but it, and all references, are free and available online at the links below.

You will need a computing device capable of running R or Python on the device or in the cloud.

  • RStudio is recommended, or you may use other tools such as Jupyter Notebooks or VS Code.

You will also need access to AU’s Canvas Learning Management System and, in the event of a remote class, AU’s Zoom license.

2.1 Textbook

An Introduction to Statistical Learning with Applications in R, 2nd Edition
by G. James, D. Witten, T. Hastie, and R. Tibshirani, 2021. ISBN 1071614177.

2.2 Lecture Notes

Available on Canvas and online at https://rressler.quarto.pub/sml-lecture-notes/.


2.4 Other References

3 Computing Environment

  • In class we will use the R statistical programming language, minimum version 4.3. R is free and may be downloaded from the R website.
  • In class we will interface with R through the free version of the RStudio Desktop Integrated Development Environment (IDE), minimum version 2023.06.1.
  • You may use whatever language or system you choose to complete assignments, but the output must be submitted to Canvas in HTML or PDF format and written in a literate programming style, in a clear, concise manner with minimal errors. Your submissions must include reproducible code files, i.e., code that runs on someone else’s computer, e.g., by using relative paths to data and figures.
  • You will need reliable internet access to support Zoom video sessions for Office Hours.
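
As one illustration of the reproducibility requirement above, here is a minimal Python sketch of referring to data by a path relative to the project folder instead of an absolute path tied to one machine. The folder name `data`, the file name `example.csv`, and the helper `data_path` are hypothetical and are not part of the course materials; the same idea applies in R or any other language you choose.

```python
from pathlib import Path

# Hypothetical project layout: analysis files sit next to a "data" folder.
DATA_DIR = Path("data")


def data_path(filename, data_dir=DATA_DIR):
    """Return a path to `filename` inside the project's data folder.

    The result is relative, so it resolves against the current working
    directory and avoids hard-coding one machine's absolute path.
    """
    return Path(data_dir) / filename


if __name__ == "__main__":
    # Prints the relative path to the (hypothetical) example file.
    print(data_path("example.csv"))
```

A path built this way works for anyone who runs the code from the project folder, whereas an absolute path to your own home directory does not.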

4 General Schedule of Topics

The following is the general sequence of topics, but topics may carry over from one class period to the next.

Week | Date (Section 001) | Topics | Chapters
1 | 01/18/24 | Introduction, motivation, and examples. Concepts in SML. | 1, 2
2 | 01/25/24 | Regression modeling and analysis. | 3
3 | 02/01/24 | Linearity and Classification. K-Nearest Neighbors. | 3, 4
4 | 02/08/24 | Classification problems and classification tools. Logistic regression. | 4
5 | 02/15/24 | Classification: Linear and quadratic discriminant analysis. General Linear Models. | 4
6 | 02/22/24 | Resampling and cross-validation methods. LOOCV, K-fold CV. | 5
7 | 02/29/24 | Mid Term (1 hour). High-dimensional data and shrinkage. | 5
8 | 03/07/24 | Jackknife, bootstrap. | 6
 | 03/14/24 | Spring Break |
9 | 03/21/24 | Model selection methods and dimension reduction. | 6
10 | 03/28/24 | Ridge regression, LASSO Regression. | 6
11 | 04/04/24 | Principal components. Partial least squares. Nonlinear trends and splines. | 6, 7
12 | 04/11/24 | Trees. | 8
13 | 04/18/24 | Bagging, Random Forests and SVM. | 8, 9
14 | 04/25/24 | Clustering methods and Poster Presentations. | 12
15 | 05/02/24 | Final Exam |

5 Overall Structure

  • We will use Canvas as our Learning Management System for the course.
  • We will meet each week in person in the designated classroom (or on Zoom per AU guidelines).
  • The typical class will include walking through lecture notes on concepts and methods as well as exercises and demonstrations of methods in R.
  • There will be a quiz in most classes covering material from the prior class and homework so you can assess your understanding of the material.
  • Lecture Notes will be posted on Canvas before each class. The notes cover concepts, methods and practical examples using R.
  • If a class session creates additional materials, they will be posted on Canvas after class.
  • I will use Zoom to record portions of each lecture.
  • Zoom recordings will be available in the Canvas Media Gallery within a day or two after class.