AU STAT 427/627 Statistical Machine Learning

1 Course Overview

1.1 Purpose

Introduce statistical machine learning concepts, models, and algorithms so students can apply the appropriate methods, algorithms and tools to understand data through analysis, modeling, and evaluation of model performance.

1.2 Description

We explore supervised learning for regression and classification, unsupervised learning for clustering and principal components analysis, and related topics such as discriminant analysis, splines, lasso and other shrinkage methods, the bootstrap, regression and classification trees, and support vector machines, along with their tuning, diagnostics, and performance evaluation.

Students work in groups to create a final project to demonstrate the effective application of statistical machine learning to real-world data.

This course assumes recent experience with the material covered in the prerequisite STAT 415/615 Regression or STAT 520 Applied Multivariate Analysis. It also assumes the ability to manipulate data, create plots, and conduct statistical analysis using R (or another programming language). This is not a programming course, but we will use R to explore the topics with data. Advanced programming skills and computer knowledge are not required.

1.3 Goals and Learning Outcomes

Upon successful completion of this course, you will be able to demonstrate competence in using different statistical learning methods involving large, messy, and multidimensional numerical and categorical data.

Specific learning outcomes for each course are:

Course	Learning Outcome
427/627	Identify appropriate statistical learning methods for a given problem involving real data.
427/627	Analyze and verify the underlying assumptions of the methods, and propose appropriate remedies when assumptions are invalid.
627-only	Identify other possible problems with messy data, such as multicollinearity, understand their consequences, and propose appropriate solutions.
427/627	Create and use training and test data to evaluate the performance of the chosen regression and/or classification techniques and analyze the results.
427-only	Use available empirical tools to find the optimal balance between precision within training data and prediction power.
627-only	Show, analytically or empirically, the optimal balance between precision within training data and prediction power.
627-only	Apply cross-validation techniques to find the optimal degree of flexibility - the best subset of predictors or the optimal tuning parameters.
427/627	Illustrate results with appropriate plots and diagrams.
427/627	Assess ethical implications of the application of statistical machine learning for a given problem.
427/627	Communicate analysis approaches, results/findings and implications in oral presentations or written reports in clear language, appropriately formatted with supporting references.

1.4 Prerequisites

This course requires successful completion of STAT-415 or STAT-615 Regression, or STAT 520 Applied Multivariate Analysis, as a prerequisite. These courses themselves have mathematics and statistics pre- or co-requisites, so graduate students are assumed to have an intermediate-level knowledge of statistical methods.

2 Required Resources

The textbook is required, but it and all of the references are available online for free at the links below.

You will need a computing device capable of running R or Python on the device or in the cloud.

RStudio is recommended, but you may use other tools such as Jupyter Notebooks or VS Code.

You will also need access to AU’s Canvas Learning Management System and, in the event of a remote class, AU’s Zoom license.

2.1 Textbook

An Introduction to Statistical Learning with Applications in R, 2nd Edition
by G. James, D. Witten, T. Hastie, and R. Tibshirani, 2021. ISBN 1071614177.

2.2 Lecture Notes

Available on Canvas and online at https://rressler.quarto.pub/sml-lecture-notes/.

2.4 Other References

Introduction to Quarto: Posit’s new publishing capability based on R Markdown and Pandoc
R for Data Science, Second Edition, by Wickham, Çetinkaya-Rundel, and Grolemund
The tidyverse Style Guide by Hadley Wickham
Posit Cheat Sheets
Help files and vignettes for the packages used in the course, available via RStudio Help

3 Computing Environment

In class we will use the R statistical programming language, minimum version 4.3. R is free and may be downloaded from the R website.
In class we will interface with R through the free version of the RStudio Desktop Integrated Development Environment (IDE), minimum version 2023.06.1.
You can use whatever language or system you choose to complete assignments, but the output must be submitted to Canvas in HTML or PDF format and written in a literate programming style that is clear and concise, with minimal errors. Your submissions must include reproducible code files, i.e., files that run on someone else’s computer, e.g., by using relative paths to data and figures.
You will need reliable internet access to support Zoom video sessions for Office Hours.

4 General Schedule of Topics

The following is the general sequence of topics, although topics may carry over from one class period to the next.

Week	Date (Section 001)	Topics	Chapters
1	01/18/24	Introduction, motivation, and examples. Concepts in SML.	1, 2
2	01/25/24	Regression modeling and analysis.	3
3	02/01/24	Linearity and classification. K-nearest neighbors.	3, 4
4	02/08/24	Classification problems and classification tools. Logistic regression.	4
5	02/15/24	Classification: Linear and quadratic discriminant analysis. General Linear Models.	4
6	02/22/24	Resampling and cross-validation methods. LOOCV, K-fold CV.	5
7	02/29/24	Midterm (1 hour). High-dimensional data and shrinkage.	5
8	03/07/24	Jackknife, bootstrap.	6
	03/14/24	Spring Break
9	03/21/24	Model selection methods and dimension reduction.	6
10	03/28/24	Ridge regression and lasso regression.	6
11	04/04/24	Principal components. Partial least squares. Nonlinear trends and splines.	6, 7
12	04/11/24	Trees	8
13	04/18/24	Bagging, Random Forests and SVM	8, 9
14	04/25/24	Clustering methods and Poster Presentations	12
15	05/02/24	Final Exam

5 Overall Structure

We will use Canvas as our Learning Management System for the course.
We will meet each week in person in the designated classroom (or on Zoom per AU guidelines).
The typical class will include walking through lecture notes on concepts and methods as well as exercises and demonstrations of methods in R.
Most classes will include a quiz covering material from the prior class and the homework so you can assess your understanding of the material.
Lecture Notes will be posted on Canvas before each class. The notes cover concepts, methods and practical examples using R.
If a class session creates additional materials, they will be posted on Canvas after class.
I will use Zoom to record portions of each lecture.
Zoom recordings will be available in the Canvas Media Gallery within a day or two after class.