# AU STAT 427/627 Statistical Machine Learning

## 1 Course Overview

### 1.1 Purpose

Introduce statistical machine learning concepts, models, and algorithms so students can apply the appropriate methods, algorithms and tools to understand data through analysis, modeling, and evaluation of model performance.

### 1.2 Description

We explore supervised learning for regression and classification, unsupervised learning for clustering and principal components analysis, and related topics such as discriminant analysis, splines, lasso and other shrinkage methods, bootstrap, regression and classification trees, and support vector machines, along with their tuning, diagnostics, and performance evaluation.

Students work in groups to create a final project to demonstrate the effective application of statistical machine learning to real-world data.

This course assumes recent experience with the material covered in the prerequisite STAT 415/615 “Regression” or STAT 520 “Applied Multivariate Analysis”. It also assumes some knowledge of data manipulation and statistical analysis using the R programming language. This is not a programming course, but we will use R during class to explore the topics with data. Advanced programming skills and advanced computer knowledge are not required.

### 1.3 Learning Outcomes

Upon successful completion of this course, you will be able to demonstrate competence in using different statistical learning methods involving large, messy, and multidimensional numerical and categorical data.

Specific learning outcomes for each course are:

Course | Learning Outcome |
---|---|
427/627 | Identify appropriate statistical learning methods for a given problem involving real data. |
427/627 | Analyze the underlying assumptions of the methods and be able to verify them, and then propose appropriate remedies for invalid assumptions. |
627-only | Identify other possible problems with messy data, such as multicollinearity, understand their consequences, and propose appropriate solutions. |
427/627 | Create and use training and test data to evaluate the performance of the chosen regression and/or classification techniques and analyze the results. |
427-only | Use available empirical tools to find the optimal balance between precision within training data and prediction power. |
627-only | Apply cross-validation techniques to find the optimal degree of flexibility - the best subset of predictors or the optimal tuning parameters. |
627-only | Show, analytically or empirically, the optimal balance between precision within training data and prediction power. |
427/627 | Illustrate results with appropriate plots and diagrams. |
427/627 | Assess ethical implications of the application of statistical machine learning for a given problem. |
427/627 | Communicate analysis approaches, results/findings and implications in oral presentations or written reports in clear language, appropriately formatted with supporting references. |

### 1.4 Prerequisites

This course requires successful completion of STAT-415 or STAT-615 “Regression” or STAT 520 “Applied Multivariate Analysis” as a prerequisite. These courses have pre/co-requisite math/statistics courses, so it is assumed that graduate students in this course have an intermediate knowledge of statistical methods.

## 2 Required Resources

**The textbook is required, but it and all references are free and available online at the links below.**

You will need a computing device capable of running R or Python on the device or in the cloud.

- RStudio is recommended, or you may use other tools such as VS Code.

You will also need to be able to access and interact with American University’s Canvas Learning Management System and use its Zoom license in the event of a remote class.

- Lecture notes are available at STAT-427/627 Statistical Machine Learning Lecture Notes.

### 2.1 Textbook

The textbook is An Introduction to Statistical Learning with Applications in R, 2nd Edition, by G. James, D. Witten, T. Hastie, and R. Tibshirani, 2021. ISBN 1071614177.

### 2.3 Other References

- Introduction to Quarto. Posit’s new publishing capability based on R Markdown and Pandoc.
- R for Data Science - Second Edition by Wickham and Grolemund - This edition is a work in progress.
- The tidyverse Style Guide by Hadley Wickham
- RStudio Cheat Sheets
- Help files and vignettes for multiple packages used in the courses

## 3 Computing Environment

- In class we will use the R statistical programming language, minimum version 4.2. R is free and may be downloaded from the R website.
- In class we will interface with R through the free version of the RStudio Desktop Integrated Development Environment (IDE).
- You can use whatever language or system you choose to complete assignments, but the output must be submitted to Canvas in HTML or PDF format and written in a literate programming style, in a clear, concise manner with minimal errors.
- You will need access to reliable internet capable of supporting Zoom video sessions for Office Hours and potentially for Class.

## 4 General Schedule of Topics

The following is a general sequence of topics but they will blend over class periods.

Week | Topics | Chapters |
---|---|---|
1 | Introduction, motivation, and examples. Main principles of statistical machine learning. Regression and classification, bias and variance, training and testing, prediction and inference. | 1, 2 |
2 | Regression modeling and analysis. | 3 |
3 | Linearity and Classification K-Nearest Neighbors. | 3, 4 |
4 | Classification problems and classification tools. Logistic regression. | 4 |
5 | Classification: Linear and quadratic discriminant analysis. General Linear Models. | 4 |
6 | Resampling and cross-validation methods. LOOCV, K-fold CV. | 5 |
7 | Jackknife, bootstrap. | 5 |
8 | Midterm (1 hour). High-dimensional data and shrinkage. | 6 |
9 | Model selection methods and dimension reduction. | 6 |
10 | Ridge regression, LASSO regression. | 6 |
11 | Principal components. Partial least squares. Nonlinear trends and splines. | 6, 7 |
12 | Trees. | 8 |
13 | Bagging, Random Forests and SVM. | 8, 9 |
14 | Clustering methods and Poster Presentations. | 12 |
15 | Final Exam. | |

## 5 Overall Structure

- We will use Canvas as our Learning Management System for the course.
- We will meet each week in person in the designated classroom (or on Zoom per AU guidelines).
- The typical class will include walking through lecture notes on concepts and methods as well as exercises and demonstrations of methods in R.
- There will be a quiz at the end of each class covering material from the prior class and homework.
- Lecture Notes will be posted on Canvas before each class. The notes cover concepts, methods and practical examples using R.
- If a class session creates additional materials, they will be posted on Canvas after class.
- I will use Zoom to record portions of each lecture.
- Zoom Recordings will be available on Canvas Media Gallery within a day or two after class.

## 6 Graded Elements

There are six graded elements for the course that are weighted as part of determining a final grade.

Element | Weight | Description |
---|---|---|
Attendance and Engagement | 5% | Engagement during class discussions helps achievement of learning outcomes. |
Weekly homework assignments | 15% | Homework is assigned weekly and is due the following week. A typical homework includes problems to be done by hand, to see how things work, and problems to be done using R (or other) software. |
Weekly Quizzes | 15% | 10- to 15-minute quizzes at the end of almost every class. Each quiz covers the material of the preceding week and the latest homework. Quizzes are to be completed by hand to demonstrate knowledge of concepts and how the methods work. |
Midterm Exam | 20% | The midterm covers the first several chapters of the material. It is in-class and problems will be by hand and by computer. Notes and course materials are allowed. Time: 1 hour 15 minutes. |
Final Project | 20% | Students form groups of two to four students, pick a topic and data set, and then use multiple methods from the course for modeling, data analysis, tuning, and cross-validation to make predictions, evaluate models, and make final recommendations. A group can choose to either write a report or make a poster and present it in class. All groups submit the code supporting their report or poster. |
Final Exam | 25% | The final covers the material since the midterm. It is indirectly cumulative because later course materials depend heavily on earlier concepts and methods. Notes and course materials are allowed. Time: 2.5 hours. |
Mini-projects | | Mini-projects may be assigned to individuals or groups to demonstrate knowledge of course learning outcomes. Mini-projects may be aligned to any of the graded elements. |

### 6.1 Final Grades

Final grades are based on the weighted average of the graded elements with an emphasis on how students demonstrate the course learning outcomes by the end of the course.

Range % | Letter |
---|---|
93 or above | A |
90-92 | A- |
87-89 | B+ |
83-86 | B |
80-82 | B- |
77-79 | C+ |
70-76 | C |
67-69 | C- |
60-66 | D |
59 or less | F |
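
As an illustration of how the weighted average and the letter ranges fit together, here is a minimal sketch in Python (the syllabus allows R or Python). The weights are taken from the Graded Elements table. Everything else is an assumption made for illustration and is not official policy: the names `weighted_average` and `letter_grade`, the example scores, mapping the 70-76 range to C, and treating a fractional score as belonging to the range below the next cutoff (no rounding). Mini-projects carry no separate weight in this sketch.

```python
# Weights from the Graded Elements table (they sum to 100%).
WEIGHTS = {
    "attendance": 0.05,
    "homework": 0.15,
    "quizzes": 0.15,
    "midterm": 0.20,
    "final_project": 0.20,
    "final_exam": 0.25,
}

# Lower bound of each letter range, highest first.
# Assumption: the 70-76 range maps to C.
CUTOFFS = [
    (93, "A"), (90, "A-"), (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (70, "C"), (67, "C-"), (60, "D"),
]


def weighted_average(scores):
    """Return the weighted average of per-element percentages (0-100)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def letter_grade(percent):
    """Map a percentage to a letter using the lower bounds in CUTOFFS.

    Assumption: no rounding, so 92.9 falls in the A- range.
    """
    for lower, letter in CUTOFFS:
        if percent >= lower:
            return letter
    return "F"


# Hypothetical scores for one student, one percentage per graded element.
example_scores = {
    "attendance": 100,
    "homework": 90,
    "quizzes": 80,
    "midterm": 85,
    "final_project": 92,
    "final_exam": 88,
}

overall = weighted_average(example_scores)
print(round(overall, 1), letter_grade(overall))
```

For these hypothetical scores the weighted average is 87.9, which falls in the B+ range. The final grade may still differ from this arithmetic, since the instructor also weighs how students demonstrate the learning outcomes by the end of the course.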