Teaching Responsible Data Science - A Conversation

AU Department of Mathematics and Statistics Colloquium

Richard Ressler, American University

2024-01-30

Context for our Conversation

Big Data and Artificial Intelligence (AI) surround us:

  • In the News, Entertainment, Politics, Social Media, …
  • In Legal Settings around the World
  • In Academic Conversations
  • In our Lives

What is our role in preparing our graduates and students for the changing world?

How do we/should we inculcate ideas and practices of “Responsible Data Science”?

George Box reminds us: “All models are wrong, but some are useful.”

Data Science (DS) programs have a learning outcome for “Responsible Data Science.”

There are books, articles, and talks on “Responsible Data Science” but no standard definition.

We have defined Responsible Data Science (DS) in the context of a DS Life Cycle.

  • Graduates should be able to Evaluate problems for potential ethical issues across the DS life cycle and ensure analysis and solutions are transparent, reproducible, and developed in accordance with professional codes of conduct.

The goal for today is to have a “Conversation.”

  • How are we teaching Responsible Data Science?

  • Is what we are doing sufficient, useful? What else could/should we cover?

  • Is there a place for a Departmental strategy on teaching ethical practices?

How are we Teaching Responsible Data Science?

  • The DATA 413/613 Module on Responsible DS covers multiple topics:
    • Data Science Life Cycle and Responsible Data Science
    • Legal, Professional, and Ethical Considerations
    • Ethical Frameworks
    • Identifying Ethical Issues
    • Practicing Responsible Data Science
  • Some courses use Group Discussions
  • We teach aspects of Reproducibility and Transparency

Data Science Follows a Life Cycle

Eight step data science life cycle in a circle.

The DS Life Cycle is about answering a question.

Eight step data science life cycle in a circle with elements of responsible data science inside it.

Responsible Data Science asks questions across the life cycle.

Data Scientists Make Choices

  • What kinds of projects do I work on, and what questions do I analyze?
  • Who are the Stakeholders?
  • Towards what goals do I optimize my models? Accuracy, Fairness, Equity, Equality, …?
  • How do I get my data? How do I protect my data?
  • Is my data representative of the population?
  • What do I do about “bad” or missing data or “outliers” that “mess up” my results?
  • What variables/features/attributes do I use?
  • How much effort do I put into checking my results? Are they repeatable?
  • How do I leverage/credit other people’s work?
  • How do I report my results - what is intellectual property and what should be public?

  • Choices can involve Legal, Professional, or Ethical Considerations

Choices have consequences, with benefits and risks.

Ursula the sea witch from Disney's Little Mermaid.

Laws Proscribe Some Choices

Professional Considerations

Professional considerations address the risks arising from activities related to the organizations with which you affiliate.

  • Professional Organizations, Employers or Volunteer Organizations
  • Organization bylaws or policies identify and manage risk to the institution.
  • Organizational behavior and ethics may conflict with individual ethics
    • Choices can include trying to change the organization, offending individuals, leaving, or being forced to leave the organization, with potential legal issues as well.

When in doubt, ask a mentor or manager you trust.

Ethical Considerations

Ethical considerations arise when asking

  • What should I do?
  • What is the “right” or “moral” thing to do?

Ethical choices can be hard, especially when they may require violating a law, regulation, or professional guideline.

  • Individual principles and cultural norms shape options and guide choices in complex situations.

  • Often, there is no universally accepted, or even good, “right answer”.

  • May have to choose between two bad outcomes.

Depiction of the Trolley Problem, with a trolley car and a person at a lever controlling which track the trolley car will take.

  • May have to choose between individual and group outcomes.

  • Ethical choices can lead to feelings of guilt, group reprobation, civil action (torts), or criminal charges.

Many, many, frameworks attempt to guide Ethical Choices.

Ethical Frameworks

Three (of many) Ethical Frameworks

  • Consequentialism or Utilitarianism: seek the greatest balance of good over harm (for groups or individuals).
    • Choose the future outcomes that produce the most good.
    • Compromise is expected as the end justifies the means.
  • Duty: Do your Duty, Respect Rights, Be Fair, Follow Divine Guidance:
    • Do what is “right” regardless of the consequences or emotions.
    • Everyone has the same duties and ethical obligations at all times.
  • Virtues: Live a Virtuous Life by developing the proper character traits.
    • Ethical behavior is whatever a virtuous person would do.
    • Tends to reinforce local cultural norms as the standard of ethical behavior.

Frameworks can conflict with each other, and each can go “wrong” when taken to the extreme.

No single or simple right answer!

Identifying Ethical Issues

Given Bias in Data and Algorithms, Are DS Systems Ethical?

This is not a new issue; it goes back decades. However, the explosive growth of AI systems that support, and even make, decisions is generating new concerns.

Three Articles for Class Discussion

  1. Higher error rates in classifying the gender of darker-skinned women than for lighter-skinned men (O’Brien 2019)

  2. Big Data used to generate unregulated e-scores in lieu of FICO scores for Credit in Lending (Bracey and Moeller 2018)

  3. Learning Analytics Can Violate Student Privacy (Raths 2018)

  • Discussion Questions
    • Is there an ethical issue or more than one? What is it?
    • Who is affected and who is responsible?
    • Pick one of the Professional Codes of Conduct or Guidelines. How would it apply?
    • What would you do differently or recommend?

Practicing Responsible Data Science

What Can You Do? What Should You Do?

Consider Ethical Choices Across the DS Life Cycle

Eight step data science life cycle in a circle with elements of responsible data science inside it.

  • Ask a Question: Equity or equality? Stakeholders and Trade-offs? What are our interests? Recency or Confirmation Bias?

  • Frame the Analysis: What is the population? Role of proxy variables? How are metrics for “fairness” affecting groups/individuals? Do we need an IRB (APA 2022)?

  • Get Data: How was it collected? Was informed consent required/given? Is there balanced representation? Selection Bias? Availability Bias? Survivorship Bias?

  • Shape Data: Are we aggregating distinct groups? How do we treat missing data? Are we separating training and testing data?

  • Model and Analyze: How are we documenting assumptions, treating extreme values, or checking over-fitting? Are we checking multiple fairness and performance metrics (see the sketch below)?

  • Communicate Results: Are the graphs misleading? Did we cherry pick or data snoop? Are we reporting \(p\)-values and hyper-parameters?

  • Deploy/Implement: Is the deployment accessible to all?

  • Observe Outcomes: Can we check assumptions and analyze outcomes for bias?

Are you following professional guidelines from ASA, ACM, INFORMS, …?
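
To make the question about checking multiple fairness metrics concrete, a minimal sketch like the one below can anchor the discussion. It is an illustration only, not part of the course module: the data frame and its columns (group, actual, predicted) are hypothetical, and it simply summarizes selection rate, true positive rate, and accuracy by group.

```r
# Minimal sketch (illustrative, not course material): compare several
# fairness and performance metrics by group for a binary classifier.
# Assumes a data frame with hypothetical columns group, actual, predicted.
library(dplyr)

fairness_check <- function(df) {
  df |>
    group_by(group) |>
    summarize(
      n              = n(),
      selection_rate = mean(predicted == 1),               # demographic-parity view
      true_pos_rate  = mean(predicted[actual == 1] == 1),  # equal-opportunity view
      accuracy       = mean(predicted == actual),
      .groups = "drop"
    )
}

# Simulated toy data so the sketch runs on its own
set.seed(613)
toy <- tibble(
  group     = sample(c("A", "B"), 500, replace = TRUE),
  actual    = rbinom(500, 1, 0.4),
  predicted = rbinom(500, 1, 0.4)
)
fairness_check(toy)
```

Seeing the selection rate and the true positive rate diverge for the same model is usually enough to start a conversation about which notion of “fairness” fits the question being asked.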

Consider Other Frameworks for Responsible Data Science

Published by the Royal Statistical Society and the Institute and Faculty of Actuaries in A Guide for Ethical Data Science

  1. Start with clear user need and public benefit
  2. Be aware of relevant legislation and codes of practice
  3. Use data that is proportionate to the user need
  4. Understand the limitations of the data
  5. Ensure robust practices and work within your skill set
  6. Make your work transparent and be accountable
  7. Embed data use responsibly

(RSS-IFA 2021)

Department of Defense AI Capabilities shall be:

  • Responsible … exercise appropriate levels of judgment and care, while remaining responsible for the development, deployment, and use….

  • Equitable … take deliberate steps to minimize unintended bias in AI capabilities.

  • Traceable … develop and deploy AI capabilities such that relevant personnel have an appropriate understanding …, including with transparent and auditable methodologies, data sources, and design procedures and documentation.

  • Reliable … AI capabilities will have explicit, well-defined uses, and the safety, security, and effectiveness … will be subject to testing and assurance within those defined uses ….

  • Governable … design and engineer AI capabilities to fulfill their intended functions while possessing the ability to detect and avoid unintended consequences, and the ability to disengage or deactivate deployed systems that demonstrate unintended behavior.

(DoD 2020)

AI Principles

  1. Be socially beneficial.
  2. Avoid creating or reinforcing unfair bias.
  3. Be built and tested for safety.
  4. Be accountable to people.
  5. Incorporate privacy design principles.
  6. Uphold high standards of scientific excellence.
  7. Be made available for uses that accord with these principles.

We will not design or deploy AI in the following application areas: Weapons, Surveillance, …

(GoogleAI 2022)

IBM Principles for Trust and Transparency

  1. The purpose of AI is to augment human intelligence.
    • The purpose of AI and cognitive systems developed and applied by IBM is to augment – not replace – human intelligence.
  2. Data and insights belong to their creator.
    • IBM clients’ data is their data, and their insights are their insights. Client data and the insights produced on IBM’s cloud or from IBM’s AI are owned by IBM’s clients. We believe that government data policies should be fair and equitable and prioritize openness.
  3. New technology, including AI systems, must be transparent and explainable.
    • For the public to trust AI, it must be transparent. Technology companies must be clear about who trains their AI systems, what data was used in that training and, most importantly, what went into their algorithm’s recommendations. If we are to use AI to help make important decisions, it must be explainable.

(IBM 2019)

A Guide to Building “Trustworthy” Data Products

Based on the golden rule: Treat others’ data as you would have them treat your data

  • Consent - Get permission from the owners or subjects of the data before …

  • Clarity - Ensure permission is based on a clear understanding of the extent of your intended usage

  • Consistency - Build trust by ensuring third parties adhere to your standards/agreements

  • Control (and Transparency) - Respond to data subject requests for access/modification/deletion, e.g., the right to be forgotten

  • Consequences (and Harm) - Consider how your usage may affect others in society and potential unintended applications.

(Loukides, Mason, and Patil 2018)

To Be an Ethically Responsible Data Scientist …

Integrate ethical decision making into your environment.

As Davy Crockett might say, “Try to be sure you are right, then Go Ahead!”

(“Davy Crockett” 2024)

American frontiersman Davy Crockett.

Stay Current on Emerging Ideas

Try to stay on the fast-moving train that is Responsible Data Science.


Using Group Discussions to Heighten Awareness

  • Listen to the Data Skeptic Podcast on Fraudulent Amazon Reviewers.
    • Describe at least two ethical issues discussed in the podcast and their implications for machine learning model developers.
  • Listen to the Cognilytica Podcast AI System Transparency (skip ahead to about 5:55) and, in a few short sentences in the Canvas discussion, answer the following:
    • From your perspective, what is the primary ethical benefit of using such a framework …?
  • Listen to the ASA Stats and Stories Podcast The Impact of College Vaccine Mandates.
    • What are the ethical implications of releasing the paper before peer review, …?
  • Listen to the ASA Stats and Stories Podcast on Inclusive Data Collection
    • What are some ethical benefits and challenges of inclusive data collection?

Teaching Reproducibility and Transparency

  • Communicate with Literate Programming e.g., use Quarto for R, Python, …
    • Integrate text, code, images, graphics and results using text files.
    • Create Multiple Formats (HTML, PDF, Word, PowerPoint, …) from one file.
    • Create Accessible Documents for publication or sharing online.
  • Require Reproducible Code and Results (a minimal sketch follows this list)
    • Use Relative paths so code works on other people’s machines.
    • Use GitHub Repositories for collaboration and shareable code.
    • Follow Tidyverse Style for readable, consistent code.
    • Set Random Number Seeds for reproducible results.
  • Emphasize Transparency in Analysis and Results
    • Document assumptions.
    • Document data collection processes and data manipulation.
    • Document tuning or hyper-parameters.
    • Report \(p\)-values and other performance metrics.
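
A minimal sketch of these habits in R follows; the file name, folder layout, and the use of the here and rsample packages are illustrative assumptions rather than course requirements.

```r
# Illustrative sketch of reproducible-workflow habits (assumed file names and packages)
library(here)     # here() builds paths relative to the project root
library(readr)
library(rsample)  # initial_split(), training(), testing()

set.seed(413)                                    # reproducible random numbers

survey <- read_csv(here("data", "survey.csv"))   # relative path, not "C:/Users/..."

data_split <- initial_split(survey, prop = 0.80) # keep training and testing data separate
train_data <- training(data_split)
test_data  <- testing(data_split)

sessionInfo()                                    # record R and package versions
```

In a Quarto document, a chunk like this renders alongside the narrative, so the seed, the data path, and the split that produced each result are visible to every reader.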

Conversations

Are the Learning Outcome and What We Are Teaching Sufficient?

  • How to Think About Responsible DS Issues and Practices?
  • How to Implement Responsible DS Practices?
    • Gaps in some hard research issues.
    • Choosing the “right” Fairness Metrics for the question.
    • Checking Input Data for balance and bias? (a possible starting point is sketched after this list)
    • Checking Output Results for balance and bias?
  • Others?
  • Should we incorporate into more Assignments and Projects?

  • Recent Papers/Examples/Case Studies?

  • Ideas for In-Class Exercises?

  • Other Teaching Approaches?
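
For the balance-and-bias questions above, one possible starting point for an assignment is a quick comparison of each group’s share of the data against a reference share; the column name, reference proportions, and toy data below are all hypothetical.

```r
# Hypothetical sketch: compare group shares in the sample to reference shares
library(dplyr)

check_balance <- function(df, group_col, reference_shares) {
  df |>
    count(.data[[group_col]], name = "n") |>
    mutate(
      sample_share    = n / sum(n),
      reference_share = unname(reference_shares[.data[[group_col]]]),
      difference      = sample_share - reference_share
    )
}

# Toy data and assumed reference proportions
set.seed(2024)
toy_data <- data.frame(
  group = sample(c("A", "B", "C"), 300, replace = TRUE, prob = c(0.7, 0.2, 0.1))
)
reference_shares <- c(A = 0.50, B = 0.30, C = 0.20)

check_balance(toy_data, "group", reference_shares)
```

The same kind of summary applied to model outputs (e.g., predicted positives by group) gives a first look at balance in the results as well.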

Conversation on a Department Strategy?

Math, Stat, and DS professions have common ground in ethical guidelines.

  • Should we create a department-wide strategy to infuse specific ethical topics into select courses at the 100, 200/300, and 400/600 levels that build on each other?

AU has a University-level learning outcome for Ethics covering all programs.

  • Graduates will demonstrate ethical decision-making skills; they will act with integrity, critically examine their own values, and respect how different values might be applied to address complex and ambiguous problems.

All undergraduates take a course in “Ethical Reasoning” with four learning outcomes.

  • 23 courses in 16 depts, e.g., BIO (2), ECON, GOVT, MGMT, & PHYS. Most are 200-level.

Should we create a Math-Stat Department course in Ethical Reasoning?

  • e.g., “Ethics and Numbers in Society” or “Ethics of Data and Algorithms”, or “Ain’t no such thing as a Bad Number”

Closing Thoughts

Thank you for your contributions to the conversation on Teaching Responsible Data Science!

We have made progress over the past few years, but have a ways to go.

As the closing of the DATA 413/613 module states…

When it comes to Responsible Data Science, Jane Addams reminds us that just thinking about ethics is not enough, …

“Action indeed is the sole medium of expression for ethics.”

Social Reformer Jane Addams.

References

ACM. 2021. “The Code for Computing Professionals.” https://www.acm.org/code-of-ethics.
Acton, Carmen. 2022. “Are You Aware of Your Biases?” Harvard Business Review, February. https://hbr.org/2022/02/are-you-aware-of-your-biases.
APA. 2022. “FAQs about IRBs.” https://www.apa.org/advocacy/research/defending-research/review-boards.
ASA. 2021. “Ethical Guidelines for Statistical Practice.” https://www.amstat.org/ASA/Your-Career/Ethical-Guidelines-for-Statistical-Practice.aspx.
Data Science Association. 2021. “Code of Conduct.” https://www.datascienceassn.org/code-of-conduct.html.
AutoML. 2022. “Ethics and Accessibility Guidelines.” https://2023.automl.cc/ethics/.
Baer, Tobias. 2019. Understand, Manage, and Prevent Algorithmic Bias. 1st ed. Apress.
Bracey, Kali, and Marguerite Moeller. 2018. “Legal Considerations When Using Big Data and Artificial Intelligence to Make Credit Decisions.” https://github.com/AU-datascience/data/blob/main/413-613/Bracey%20Moeller%20Unregulated%20e-scores%20March%202018.pdf.
Brown University, Science and Technology Studies. 2021. “A Framework for Making Ethical Decisions.” https://www.brown.edu/academics/science-and-technology-studies/framework-making-ethical-decisions.
US Census Bureau. 2023. “Understanding Differential Privacy.” Census.gov. https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/process/disclosure-avoidance/differential-privacy.html.
Burgess, Matt. 2020. “What Is GDPR? A Summary Guide for the UK.” Wired UK, March. https://www.wired.co.uk/article/what-is-gdpr-uk-eu-legislation-compliance-summary-fines-2018.
Calzon, Bernadita. 2021. “Misleading Statistics Real World Examples For Misuse of Data.” BI Blog | Data Visualization & Analytics Blog | Datapine. https://www.datapine.com/blog/misleading-statistics-and-data/.
Coy, Kevin, and Neil Hoffman. 2016. “Big Data Analytics Under HIPAA.” https://www.jdsupra.com/legalnews/big-data-analytics-under-hipaa-80678/.
“Davy Crockett.” 2024. Wikipedia, January. https://en.wikipedia.org/w/index.php?title=Davy_Crockett&oldid=1193687251.
Devaux, Elise. 2022. “What Is Differential Privacy: Definition, Mechanisms, and Examples - Statice.” https://www.statice.ai/post/what-is-differential-privacy-definition-mechanisms-examples.
DHUD, US. 2021. “Fair Housing Act.” https://www.hud.gov/program_offices/fair_housing_equal_opp/fair_housing_act_overview.
DoD, US. 2020. “DOD Adopts 5 Principles of Artificial Intelligence Ethics.” https://www.defense.gov/Explore/News/Article/Article/2094085/dod-adopts-5-principles-of-artificial-intelligence-ethics/.
Edelman, Gilead. 2020. “Everything You Need to Know About the CCPA.” https://www.wired.com/story/ccpa-guide-california-privacy-law-takes-effect/.
EEOC, US. 2021a. “Genetic Information Nondiscrimination Act of 2008.” https://www.eeoc.gov/laws/statutes/gina.cfm.
———. 2021b. “Laws Enforced by EEOC.” https://www.eeoc.gov/statutes/laws-enforced-eeoc.
Fleisher, Will. 2024. “AI Ethics.” Center for Digital Ethics. https://digitalethics.georgetown.edu/ai-artificial-intelligence-ethics/.
FTC. 2013a. “Children’s Online Privacy Protection Rule ("COPPA").” https://www.ftc.gov/enforcement/rules/rulemaking-regulatory-reform-proceedings/childrens-online-privacy-protection-rule.
FTC, US. 2013b. “Fair Credit Reporting Act.” https://www.ftc.gov/enforcement/statutes/fair-credit-reporting-act.
———. 2018. “Credit Reporting Information.” https://www.ftc.gov/news-events/media-resources/consumer-finance/credit-reporting.
GoogleAI. 2022. “Our Principles.” Google AI. https://ai.google/principles/.
Gordon, Cindy. 2022. “2023 Will Be The Year Of AI Ethics Legislation Acceleration.” Forbes. https://www.forbes.com/sites/cindygordon/2022/12/28/2023-will-be-the-year-of-ai-ethics-legislation-acceleration/.
Hand, David J. 2018. “Aspects of Data Ethics in a Changing World: Where Are We Now?” Big Data 6 (3): 176–90. https://doi.org/10.1089/big.2018.0083.
HCAI, Stanford. 2022. “The 2022 AI Index Report.” Artificial Intelligence Index. https://aiindex.stanford.edu/report/.
IBM. 2019. “IBM’s Principles for Data Trust and Transparency.” IBM Policy. https://www.ibm.com/policy/trust-principles/.
IEAI-ML. 2021. “Awesome AI Guidelines.” https://github.com/EthicalML/awesome-artificial-intelligence-guidelines.
INFORMS. 2021. “INFORMS Ethics Guidelines.” https://www.informs.org/About-INFORMS/Governance/INFORMS-Ethics-Guidelines.
Loeb, Emily, Adam Unikowsky, Caroline Cease, and Benjamin Hand. 2023. “Client Alert: Emerging AI Regulation: A Global Patchwork Quilt.” Jenner - Client Alert: Emerging AI Regulation: A Global Patchwork Quilt. https://www.jenner.com/en/news-insights/publications/client-alert-emerging-ai-regulation-a-global-patchwork-quilt.
Loukides, Mike, Hilary Mason, and DJ Patil. 2018. Ethics and Data Science. 1st ed. O’Reilly Media, Inc. https://www.oreilly.com/library/view/ethics-and-data/9781492043898/.
Lynch, Shana. 2023. “2023 State of AI in 14 Charts.” Stanford HAI. https://hai.stanford.edu/news/2023-state-ai-14-charts.
MIT Media Lab. 2019. “AI Blindspot: A Discovery Process for Preventing, Detecting, and Mitigating Bias in AI Systems.” https://aiblindspot.media.mit.edu/.
O’Brien, Matt. 2019. “MIT Researcher Exposing Bias in Facial Recognition Tech ….” https://www.insurancejournal.com/news/national/2019/04/08/523153.htm.
O’Neil, Cathy. 2017. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. https://weaponsofmathdestructionbook.com/.
Pomeroy, Robin, and Simon Torkington. 2023. “How Can We Create Responsible AI? Industry Leaders Share Their Insights.” World Economic Forum. https://www.weforum.org/agenda/2023/06/responsible-generative-ai-industry-regulation/.
Posit. 2022. “Quarto.” Quarto Overview. https://quarto.org/.
Raths, David. 2018. “When Learning Analytics Violate Student Privacy.” https://campustechnology.com/articles/2018/05/02/when-learning-analytics-violate-student-privacy.aspx.
RSS-IFA. 2021. “Data Science Ethics Guidelines UK.” https://www.actuaries.org.uk/upholding-standards/data-science-ethics.
Schneble, Christophe Olivier, Bernice Simone Elger, and David Martin Shaw. 2020. “Google’s Project Nightingale Highlights the Necessity of Data Science Ethics Review.” EMBO Molecular Medicine 12 (3): e12053. https://doi.org/10.15252/emmm.202012053.
SCU, Santa Clara. 2022. “An Introduction to Data Ethics.” https://www.scu.edu/ethics/focus-areas/technology-ethics/resources/an-introduction-to-data-ethics/.
“Trolley Problem.” 2024. Wikipedia, January. https://en.wikipedia.org/w/index.php?title=Trolley_problem&oldid=1193140026.
UChicago. n.d. “Center for Applied Artificial Intelligence.” The University of Chicago Booth School of Business. Accessed December 26, 2022. https://www.chicagobooth.edu/research/center-for-applied-artificial-intelligence.
Union, European. 2023. “Artificial Intelligence Act: Deal on Comprehensive Rules for Trustworthy AI.” https://www.europarl.europa.eu/news/en/press-room/20231206IPR15699/artificial-intelligence-act-deal-on-comprehensive-rules-for-trustworthy-ai.
University, Princeton. 2018. “Dialogues on AI and Ethics Case Studies.” https://aiethics.princeton.edu/case-studies/.
US Dept of Labor. 2024. “Guidance on the Protection of Personal Identifiable Information.” DOL. http://www.dol.gov/general/ppii.
US DHHS, Office for Civil Rights. 2015. “HIPAA for Professionals.” https://www.hhs.gov/hipaa/for-professionals/index.html.
US DOEd, Privacy Technical Assistance Center. 2017. “Integrated Data Systems and Student Privacy.”
Vigen, Tyler. 2021. “15 Insane Things That Correlate With Each Other.” http://tylervigen.com/spurious-correlations.
Weissgerber, Tracey L., et al. 2019. “Reveal, Don’t Conceal.” Circulation 140 (18): 1506–18. https://doi.org/10.1161/CIRCULATIONAHA.118.037777.
Wikipedia. 2022. “General Data Protection Regulation.” Wikipedia, December. https://en.wikipedia.org/w/index.php?title=General_Data_Protection_Regulation&oldid=1128013242.