AU STAT 427/627 Statistical Machine Learning

1 Course Overview

1.1 Purpose

Develop students as members of a data science professional community able to strategize about and solve diverse, complex problems in an efficient and reproducible manner by applying in-depth and broad knowledge of modern statistical programming capabilities, methods, and tools.

1.2 Description

Classes cover multiple data science methods focused on collecting, organizing, and analyzing data to build models and communicate analytical results in a reproducible manner.

  • You will use R packages to acquire, clean, and tidy data from a variety of online sources; develop Shiny applications; build statistical models and analyses; and present and communicate findings.

  • You will use Git and GitHub for configuration management and collaboration.

  • You will collaborate with a small group to design and develop a new Shiny application for others to use to analyze data from an online source of your choice.

This course assumes recent experience with the material covered in DATA-412/612: the use of tidyverse packages {ggplot2}, {dplyr}, {tidyr}, {stringr}, {readr}, {forcats}, and {lubridate}, and the use of Quarto/R Markdown and the RStudio Integrated Development Environment. It also assumes some knowledge of statistical analysis in R.

1.3 Learning Outcomes

Upon successful completion of this course, you will be able to demonstrate competence in developing solutions requiring diverse data science methods for large and messy data.

Specific learning outcomes for each course are:

Course Learning Outcome
413/613 Design and implement advanced statistical programming solutions using the R Programming Language (with Tidyverse and other packages), the RStudio Desktop IDE, and Quarto/R Markdown to create efficient, reproducible analyses of large, real-world data sets to solve problems.
413/613 Apply statistical programming capabilities for using application programming interfaces, web-scraping, and/or SQL to gather large data sets efficiently and securely.
413/613 Use Tidyverse methods to efficiently manipulate and analyze large data sets.
413/613 Create online solutions for generating data, enabling distributed interactive numerical and graphical analysis, and/or communicating results by developing an app, a website, and/or a dashboard.
413/613 Analyze textual data using basic Natural Language Processing concepts and methods.
413/613 Evaluate Data Science scenarios for potential ethical issues across the data science life cycle and identify possible mitigation strategies.
413/613 Understand multiple ways to engage with the professional Data Science community.
613 Use statistical programming capabilities to enable others to interactively create explanatory and predictive models, test hypotheses, analyze assumptions, and interpret the results as part of a group-developed, deployed, interactive analysis product.
413/613 Use a version control system (Git) and a Git-centric, cloud-based collaboration environment (GitHub) to enable distributed management of analysis and support collaboration on products.

1.4 Prerequisites

This course requires successful completion of either DATA-412 or DATA-612 as a prerequisite. These courses have a pre/co-requisite for math/statistics courses. It is assumed that graduate students in this course have an intermediate knowledge of statistical methods.

2 Required Resources

The textbook is required, but it and all references are free and available online at the links below.

You will need a computing device capable of running R or Python on the device or in the cloud.

  • RStudio is recommended, or you may use other tools such as VS Code.

You will also need to be able to access and interact with American University’s Canvas Learning Management System and use its Zoom license in the event of a remote class.

2.1 Textbook

All recommended references are free and available online at the links below:

3 Computing Environment

  • In class we will use the R statistical programming language, minimum version 4.2. R is free and may be downloaded from the R website.
  • In class we will interface with R through the free version of the RStudio Desktop Integrated Development Environment (IDE).
  • You can use whatever language or system you choose to complete assignments, but the output must be submitted to Canvas in HTML or PDF format and written in a literate programming style, in a clear, concise manner with minimal errors.
  • You will need access to reliable internet capable of supporting Zoom video sessions for Office Hours and potentially for Class.

4 General Schedule of Topics

The following is a general sequence of topics, but topics will blend across class periods.

Week Topics
1 Git and GitHub
2 Tidyverse Review
3 Vectors, Lists, For Loops, and purrr
4 Writing and Debugging Functions
5 Getting Data: Open Data, APIs
6 Getting Data: Web Scraping and Rectangling
7 Databases and SQL
8 GitHub Team Workflow and Shiny I: Basics: UI and Server Functions
9 Shiny II: Layouts and Customization
10 Responsible Data Science and List Columns/Many Models
11 Mapping with R
12 Text Analysis 1: Sentiment Analysis
13 Thanksgiving Break
13 Text Analysis 2: TF-IDF and Group Project Collaboration
14 Python in R and SETs
15 Group Project Demonstrations

5 Overall Structure

  • We will use Canvas as our Learning Management System for the course.

  • We will meet each week in person in the designated classroom (or on Zoom per AU guidelines).

  • The typical class will include walking through lecture notes on concepts and methods as well as exercises and demonstrations of methods in R.

  • Lecture Notes will be posted on GitHub before each class. The notes cover concepts, methods and practical examples using R.

  • If a class session creates additional materials, they will be posted on GitHub after class.

  • I will use Zoom to record portions of each lecture.

  • Zoom recordings will be available in the Canvas Media Gallery within a day or two after class.

6