AU STAT 413/613 Data Science
1 Course Overview
1.1 Purpose
Develop students as members of a data science professional community able to strategize about and solve diverse, complex problems in an efficient and reproducible manner by applying in-depth and broad knowledge of modern statistical programming capabilities, methods, and tools.
1.2 Description
Classes cover multiple data science methods focused on collecting, organizing, and analyzing data to build models and communicate analytical results in a reproducible manner.
You will use R packages to acquire, clean, and tidy data from a variety of online sources; develop Shiny applications; build statistical models and analyses; and present and communicate findings.
You will use Git and GitHub for configuration management and collaboration.
You will collaborate with a small group to design and develop a new Shiny application for others to use to analyze data from an online source of your choice.
This course assumes recent experience with the material covered in DATA-412/612: the use of tidyverse packages {ggplot2}, {dplyr}, {tidyr}, {stringr}, {readr}, {forcats}, and {lubridate}, and the use of Quarto/R Markdown and the RStudio Integrated Development Environment. It also assumes some knowledge of statistical analysis in R.
1.3 Learning Outcomes
Upon successful completion of this course, you will be able to demonstrate competence in developing solutions requiring diverse data science methods for large and messy data.
Specific learning outcomes for each course are:
Course | Learning Outcome |
---|---|
413/613 | Design and implement advanced statistical programming solutions using the R Programming Language (with Tidyverse and other packages), the RStudio Desktop IDE, and Quarto/R Markdown to create efficient reproducible analysis involving large, real-world, data sets to solve problems. |
413/613 | Apply statistical programming capabilities for using application programming interfaces, web-scraping, and/or SQL to gather large data sets efficiently and securely. |
413/613 | Use Tidyverse methods to efficiently manipulate and analyze large data sets. |
413/613 | Create on-line solutions for generating data, enabling distributed interactive numerical and graphical analysis, and/or communicating results by developing an app, a website, and/or a dashboard. |
413/613 | Analyze textual data using basic Natural Language Processing concepts and methods. |
413/613 | Evaluate Data Science scenarios for potential ethical issues across the data science life cycle and identify possible mitigation strategies. |
413/613 | Understand multiple ways to engage with the professional Data Science community. |
613 | Use statistical programming capabilities to enable others to interactively create explanatory and predictive models, test hypotheses, analyze assumptions, and interpret the results as part of a group-developed deployed, interactive analysis product. |
413/613 | Use a version control system (Git) and a Git-centric, cloud-based collaboration environment (GitHub) to enable distributed management of analysis and support collaboration on products. |
1.4 Prerequisites
This course requires successful completion of either DATA-412 or DATA-612 as a prerequisite. These courses have math/statistics courses as pre- or co-requisites. It is assumed that graduate students in this course have an intermediate knowledge of statistical methods.
2 Required Resources
The textbook is required, but it and all references are free and available online at the links below.
You will need a computing device capable of running R or Python on the device or in the cloud.
- RStudio is recommended or you may use other tools such as VS Code.
You will also need to be able to access and interact with American University’s Canvas Learning Management System and use its Zoom license in the event of a remote class.
- Lecture notes are available at DATA 413-613 Data Science Lecture Notes.
2.1 Textbook
All recommended references are free and available online at the links below:
R for Data Science - Second Edition by Wickham, Çetinkaya-Rundel, and Grolemund
Advanced R by Wickham (Chapman & Hall/CRC)
Mastering Shiny by Wickham
Text Mining with R by Silge & Robinson
Posit Cheat Sheets
The tidyverse Style Guide by Hadley Wickham
3 Computing Environment
- In class we will use the R statistical programming language, minimum version 4.2. R is free and may be downloaded from the R website.
- In class we will interface with R through the free version of the RStudio Desktop Integrated Development Environment (IDE).
- You may use whatever language or system you choose to complete assignments, but the output must be submitted to Canvas in HTML or PDF format and written in a literate programming style, in a clear, concise manner with minimal errors.
- You will need access to reliable internet capable of supporting Zoom video sessions for Office Hours and potentially for Class.
4 General Schedule of Topics
The following is a general sequence of topics, but topics may blend across class periods.
Week | Topics |
---|---|
1 | Git and GitHub |
2 | Tidyverse Review |
3 | Vectors, Lists, For Loops, and purrr |
4 | Writing and Debugging Functions |
5 | Getting Data: Open Data, APIs |
6 | Getting Data: Web Scraping and Rectangling |
7 | Databases and SQL |
8 | GitHub Team Workflow and Shiny I: Basics: UI and Server Functions |
9 | Shiny II: Layouts and Customization |
10 | Responsible Data Science and List Columns/Many Models |
11 | Mapping with R |
12 | Text Analysis 1: Sentiment Analysis |
13 | Text Analysis 2: TF-IDF and Group Project Collaboration |
14 | Python in R and SETs |
15 | Group Project Demonstrations |
5 Overall Structure
We will use Canvas as our Learning Management System for the course.
We will meet each week in person in the designated classroom (or on Zoom per AU guidelines).
The typical class will include walking through lecture notes on concepts and methods as well as exercises and demonstrations of methods in R.
Lecture Notes will be posted on GitHub before each class. The notes cover concepts, methods and practical examples using R.
If a class session creates additional materials, they will be posted on GitHub after class.
I will use Zoom to record portions of each lecture.
Zoom Recordings will be available on Canvas Media Gallery within a day or two after class.