1  Introduction

Published

November 6, 2024

Keywords

data science, life cycle

1.1 Introduction to DATA 413/613 Data Science

1.1.1 Purpose of this Course

  • Develop your competence, confidence, and creativity in employing a wide variety of data science methods, at AU and beyond, in an efficient and responsible manner to solve real-world problems.
  • Build foundational knowledge of, and experience in, current data science strategies, methods, and capabilities to enable post-course learning.
  • Enhance your engagement with data science professional communities.
  • Contribute to your portfolio of work suitable for sharing with others.

1.1.2 Catalog Description

  • This course builds on the R tidyverse programming skills developed in DATA-412/612.
  • Students strategize about and solve complex problems involving large amounts of messy data in an efficient, reproducible, and ethical manner by learning and applying advanced R programming concepts, methods, and tools.
  • Core topics include version control, accessing data using application programming interfaces and web scraping, web application development with R Shiny, advanced R programming, functional programming, and large-scale statistical modeling.
  • A comprehensive project, based on multiple sets of current real-world data, integrates learning from across the course.

1.1.3 Learning Outcomes

After successful completion of this course, you should be able to …

  • Design and implement statistical programming solutions using advanced R, the Tidyverse, and R Markdown/Quarto to create efficient, reproducible analyses that solve problems involving large, real-world data sets.
  • Apply statistical programming capabilities for web-scraping and using application programming interfaces to gather large data sets efficiently and securely.
  • Use code-based methods to efficiently manipulate and analyze large data sets.
  • Create web-based applications for generating data, enabling distributed interactive numerical and graphical analysis, and/or communicating results by developing an app, a website, and/or a dashboard.
  • Evaluate Data Science scenarios for potential ethical issues and possible mitigation strategies.
  • Understand multiple ways to engage with the professional Data Science community.
  • Employ a version control system (Git) and a Git-centric, cloud-based collaboration environment (GitHub) to enable distributed management of analysis and support collaboration on products.
  • Use statistical programming capabilities to enable others to interactively create explanatory and predictive models, test hypotheses, analyze assumptions, and interpret the results as part of a developed and deployed interactive analysis product.

1.1.4 Course Overview

This course is structured to help you successfully complete the life cycle of a data science project with real-world complexities.

  • The course modules vary in length. Most weeks will cover one module, but some will cover two.
  • We begin with the version control system Git and the GitHub collaboration platform, both of which we will use throughout the course.
  • Modules 3-5 delve deeper into R, Tidyverse, and RStudio capabilities we will need throughout the course.
  • Modules 6-8 focus on getting data from a wide variety of sources and how to “wrangle” this “wild-caught” data so you can use it for analysis.
  • Module 9 covers some advanced Git and GitHub capabilities for collaboration.
  • Modules 10 and 11 focus on building R Shiny web applications for enabling distributed analysis by others.
  • Module 12 addresses Considerations for Responsible Data Science.
  • Modules 13-16 address different kinds of analysis with multiple models, mapping, and text.
  • Module 17 introduces several aspects of the Python language and how you can access those capabilities in R.
  • A key aspect of this course is completing a group project where you develop an R Shiny App to enable other people to do their own analysis of a real-world data set.

1.2 A Data Science Life Cycle

Responsible Data Science depends upon following a repeatable process or life cycle for analysis and solution development.

There are many different life cycles and frameworks in the community. Some are tailored to one aspect of data science. Others attempt to include all aspects of data science in a single framework.

This course will use the following life cycle as a frame of reference given its focus on answering a question of interest.

Figure 1.1: A basic 8-step lifecycle for responsible data science.
  • Figure 1.1 portrays eight steps for a Data Science life cycle that start with someone asking a question and end with observing the outcomes of the solution.
  • Some might be tempted to stop at an earlier step, but a data scientist knows that every analysis and solution is based on assumptions, explicit and implicit.
  • Observing outcomes is a responsible way to check whether those assumptions were sound.
Figure 1.2: Responsible Data Science uses feedback from each step to assess the need to revisit earlier steps.
  • Figure 1.2 provides additional details on the types of activities that can occur within each step.
  • It also highlights that while Figure 1.1 shows a nice, circular process that is always making progress, responsible data science often takes one step forward and then two steps backwards.
  • Feedback from the activities at a step might indicate one should back up and repeat an earlier step.
  • As an example, if modeling and analysis show that the data is not as robust as desired, or reveal sampling bias that will render the results less useful for the question, one may need to back up to step 3 to get more data or even to step 1 to get guidance on reframing the question of interest.
Figure 1.3: Implementing recommendations or a solution should generate more data that could support future analyses.
  • As implementation occurs, it will usually generate new data that could support future analysis.
  • Responsible data science will use this new data to assess assumptions made in building the solutions and whether there is disparate impact on the populations affected by the implementation.
Figure 1.4: Responsible data science includes considerations for shaping the analysis and solutions as well as how the analysis is conducted.
  • Figure 1.4 shows that responsible data science is not a single step in the life cycle but underlies activities at each step in the life cycle.
  • The top of the figure identifies several considerations for shaping the analysis or solution at each step to ensure the analysis or solution complies with laws and ethical guidelines while minimizing risks to fairness, privacy, and confidentiality of data and people.
  • The bottom of the figure identifies attributes for the activities at each step to ensure the work aligns with principles for responsible data science.
  • We will address aspects of responsible data science in more detail throughout the course.
Figure 1.5: Engaging in the data science community promotes career development and contribution as a responsible data scientist.
  • Figure 1.5 shows that the data science community is here to help you as you work through each step in the data science life cycle.
  • The data science community is not a single organization but an ecosystem of professional organizations, online forums, individual mentors, peers, and others who can help you.
  • Engaging with the community supports your professional development and lets you give back by helping others.