General Instructions for Weekly Assignments
American University DATA 413-613 Data Science
D.1 General Instructions
- Accept all assignments via the link in the Canvas assignments but submit products only on GitHub via the assignment repository in the class organization.
- Clone your new repository to your local computer under the homework or assignments folder.
- Convert the assignment to an RStudio Project to automatically set the working directories for the console and terminal panes.
- If using the {here} package, check for `library(here)` or, if missing, insert it for automatic detection of the root directory (with a `.git` folder or `.Rproj` file). You can also explicitly set the root directory with `here::i_am("analysis/hwxx_starter.qmd")`.
- Review the assignments early.
- Review the HTML file provided in the `\analysis` folder to see specific instructions and any plots or tables that are to be duplicated for your answers.
- Review the rubric contained in the HTML file as it may help in understanding the questions. It details the graded elements of each question and their possible score. Higher scores suggest more complex elements. Not addressing or missing an element will result in 0 points for that element. Don’t sacrifice points unnecessarily.
- If you think there is an error in a question (it happens) or it is unclear how to interpret the question and rubric, please collaborate with others and/or email me right away!
- Use the `Starter.qmd` file if provided.
- Most (but not all) assignments have a .qmd file with the word starter in it. If present, use it for your solutions.
- This file contains the questions so you do not need to copy the questions.
- It may not contain other information which is in the assignment HTML file.
- Rename the starter file under the `\analysis` folder as `hwXX_yourname.qmd`.
- Modify the “author” field in the YAML header to be your name.
- You can modify other fields in the YAML but the format must remain HTML and the option `embed-resources: true` must stay to ensure your plots are included in the HTML output file.
- As you work, Save, Stage/Add, Commit and Push to GitHub at least every hour/day.
- Review your HTML file before final submission:
- Ensure the text format is appropriate, e.g., proper headings are used, images are present, and there are no excess pages of data.
- Check for general compliance with The tidyverse style guide, e.g., using pipes to break up code for clarity, proper spacing, and no code lines over 80 characters. Consider using {styler}.
- Review for bad practices such as: using absolute paths, improper file or variable names or file structures, leaving excessive practice code in the final document, or showing many pages of excessive data in the output.
- Review the rubric to ensure all elements have been answered.
- Use RStudio `Edit/Check Spelling` to avoid excessive spelling errors. Note this will not catch all errors, e.g., in the headings, so also read the output file.
- Render your final .qmd file to HTML prior to staging, committing, and pushing your final submission.
- Complete Final Submission
- Stage and commit your repository, e.g., Quarto, HTML, .R, and any other required files.
- Push the committed repo to GitHub.
- Go to Canvas and enter a text comment for the assignment that your homework is complete and ready for scoring.
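As an illustration of the {here} step above, the following is a minimal sketch of building a path from the project root rather than from the current file. The file name `example.csv` is hypothetical and used only for illustration; the `data_raw` folder name comes from these instructions.

```r
library(here)

# Build a path relative to the project root (the folder holding the
# .git folder or .Rproj file), not relative to the current .qmd file.
# "example.csv" is a hypothetical file name, for illustration only.
raw_path <- here("data_raw", "example.csv")

# The path could then be passed to a reader, e.g.:
# my_data <- readr::read_csv(raw_path)
```

Because the path is assembled from the root, the same line should work whether the code is run from the console or rendered from a file in the `analysis` folder.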
D.2 Guidelines
- Formatting
- Include the questions as well as your answers in a properly formatted file, i.e., questions are in text, code is in a code chunk, and any text answers, e.g., interpretations, are in text following the code, NOT as comments in a code chunk.
- Recommend using a bootstrap theme. See Quarto HTML Theming.
- Directory Structure and Data
- Use the same file/directory structure as in the assignment repo for most assignments. This means you must use relative paths in your .qmd files to read and write data/output so your results are reproducible.
- Recall the working directory for the .qmd file is the directory in which it is saved. See The working directory for R code chunks.
- Do not change the working directory for your file or use the `setwd()` function.
- See An Overview of Relative Path in Linux [The Complete Guide].
- Since all assignments should be in a project repo with all work below the top level folder, consider using the {here} package.
- Note: {here} uses a slightly different syntax to build relative paths from the designated root of the project, not the current file.
- Do not move data files around the file structure. All data should either be in `data_raw` or `data` directories as appropriate.
- Do not edit any provided data files. You must fix any errors using code. You can save the cleaned files in the data folder in a different format if desired.
- Avoiding Point Deductions. Follow these guidelines to reduce the risk of overall point deductions on an assignment.
- Only include code that is necessary to answer the questions. Excess code chunks or “test code” that is not part of your final answer may result in deductions.
- Generally follow tidyverse style for readability. Consider using {styler}. Excessive poor coding style, e.g., multiple pipes in one line, no spacing on either side of operators, or excessive line length that does not show in the html file, may result in point deductions.
- Use tidyverse functions where appropriate. Base R is fine for tasks where there is no tidyverse function or the tidyverse function is a simple wrapper for completeness. Too much unnecessary base R will result in point deductions and/or no partial credit if it does not work.
- Do not show excessive data in your output. An HTML file with lots (10s or 100s) of extraneous pages of data will result in point deductions. Consider using `head()`, `glimpse()`, or `slice()` to show data instead of just using the variable name in your code.
- Do not change working directories in your file.
- Do not leave an active `view()` or `View()` or `browser()` in your file. Ensure these commands are deleted or commented out.
- Excessive spelling errors may lose points.
- Getting Help on the Questions.
- Homework questions will often require you to think about how to combine ideas and methods from multiple classes, not just repeat something shown in class.
- The questions are not designed to “trick” you but often require attention to detail and building a strategy or plan to get to the solution.
- If you are unsure what the question or the rubric is asking you to do, ask your peers, the TA, or me.
- Collaboration with your peers is encouraged to discuss ideas and concepts (do not share or show code) about the questions and the rubric.
- Getting Help on the Solutions.
- There are multiple ways to solve any programming/analysis question so the questions often have specific requirements to guide you to a general tidyverse-based approach that meets the learning goal.
- However, the score is not based on an exact set of code but on demonstrating a correct tidyverse-based approach that meets the requirements of the question as well as gets the right answer.
- For simple questions there may be only one acceptable approach, but even these may vary in terms of variable names, etc.
- Use the RStudio Help menu and the Help pane to access help documents, vignettes, or cheat sheets, e.g., for function arguments you may not have used before or want to confirm the syntax.
- I use the RStudio Help ALL the time and I keep a folder of tidyverse Cheat Sheets on my computer desktop.
- Do not spend hours on a single part of a problem, getting frustrated. Email me (or use GitHub pull requests) to ask me any questions or request office hours. If you email me, ensure your most recent code is pushed to GitHub whether it is working or not.
- Ask for Help! I am generally able to answer emails during the week when I am not teaching or in a meeting.
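To illustrate the guideline above about not showing excessive data, here is a minimal sketch of three ways to preview a data frame instead of printing it in full. The built-in `mtcars` data set is an assumption chosen only so the example is self-contained; it is not part of any assignment.

```r
library(dplyr)

# Typing just `mtcars` would print every row into the HTML output.
# Each option below shows a small, controlled preview instead.
first_rows <- mtcars |> head(5)       # first five rows only
some_rows  <- mtcars |> slice(10:12)  # rows 10 through 12 only

first_rows
some_rows
glimpse(mtcars)                       # one line per column
```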
D.3 Best Practices for Success
Consider why the question is being asked in the way it is being asked. What is the learning purpose? What is the question trying to answer?
- All of the questions have the goal of helping you demonstrate competence and creativity in data science methods and tools from this course (and pre-requisites).
- That can help in understanding the question and how to interpret the results.
- The questions may require you to put things together differently, but you should not have to search for obscure packages to get to an answer. If you think you do, please ask.
Ask for clarification if the question is unclear or you think there might be an error in the question or the rubric. I am human and I do make mistakes or do not always think of all the ways others might interpret one of my questions. Collaborate with others on the meaning of the questions or ask me.
Think before you code (Philosophers say Cogito ergo sum but Data Scientists say Cogita antequam codas).
- Develop a strategy for solving a problem before you start typing.
- Think about what needs to be done with the current data to get to the solution, and also try thinking backwards: what must the data look like to get the end result/plot that answers the question?
- Integrating both views can be helpful.
Develop an approach to implement your strategy through a series of steps. Think about your approach in terms of tidyverse verbs (functions).
Develop solutions using code that is concise, clear, and repeatable on someone else’s machine so it is easy to understand and maintain by your future self and others.
Write necessary and sufficient code but no more.
- Consider minimalist ideas such as Occam’s Razor or the KISS (Keep It Simple Student) principle.
- As you develop your strategy, consider if you can solve the problem in a few steps or lines of code versus many lines using brute force.
- Most questions can be answered in less than 15 lines of code. If you are writing 30+ lines of code in a chunk, there might be an easier way.
- Take advantage of the vectorized nature of R and the tidyverse.
- Develop custom functions instead of copying and pasting a lot of code several times.
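As a sketch of the last two points above (vectorization and custom functions), the example below writes one small helper and applies it to several columns at once instead of copying the same calculation three times. The helper name `rescale01` and the choice of the built-in `mtcars` data are assumptions made purely for illustration.

```r
library(dplyr)

# Hypothetical helper: rescale a numeric vector to the 0-1 range.
# The arithmetic is vectorized, so no loop is needed.
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Apply the same function to several columns in one step
mtcars_scaled <- mtcars |>
  mutate(across(c(mpg, hp, wt), rescale01))
```

If the calculation later needs to change, only the body of the helper has to be edited, rather than every pasted copy.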
Build a little, test a little, then repeat. Parva celeriter probantur or “Small things are tested quickly.”
- Check intermediate results step by step, checking the size, shape, and values of your data.
- If the code is more than one line, strongly recommend a coding style where you start the chunk with the data object and a pipe and end with the plot or assignment. That makes it easier to check results without affecting the original data. It is perfectly fine to start with the assignment but it can make it more work if you make a mistake and have to recreate the data again.
- Avoid intermediate variables unless you plan to use them in a different code chunk. Use pipes and insert `head()`, `view()`, or `glimpse()` after a pipe in the midst of a series of steps to check your answer and then comment it out or delete it. Some questions might be:
- Is the number of rows/columns about right?
- Are joins creating .x and .y variables when they should not?
- Are probabilities greater than 1?
- Are totals/averages in the right order of magnitude?
- Are you operating on a grouped data frame or not?
- Check outside sources or test calculations to confirm your results are reasonable if appropriate.
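The coding style described above (start with the data object and a pipe, check in the middle, end with the assignment) might look like the following minimal sketch. The built-in `mtcars` data and the name `mpg_by_cyl` are assumptions used only for illustration.

```r
library(dplyr)

# Start with the data object, pipe through the steps, and assign last.
# While developing, run the pipeline without the final assignment to
# inspect the result without touching the original data.
mtcars |>
  group_by(cyl) |>
  summarize(mean_mpg = mean(mpg), n = n(), .groups = "drop") |>
  # glimpse() |>  # temporary check; comment out or delete when done
  arrange(desc(mean_mpg)) ->
  mpg_by_cyl

# Quick sanity checks on size and grouping
nrow(mpg_by_cyl)
is_grouped_df(mpg_by_cyl)
```

The two calls at the end correspond to the questions in the list above: whether the number of rows is about right and whether the result is still a grouped data frame.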
Use the RStudio debug environment or debug tools to get inside your functions. We will discuss this in class.
Read the questions carefully. If a question asks for a number of rows or the top five rows, or the five highest observations, ensure your answer only shows the desired results, not all the rows.
- If the question says to save to the data frame, or to update the data frame, ensure you do that and do not create a new data frame, or worse, not save or update the results.
- If the question says to interpret a graph or model result, discuss what the plot means, not just what it shows. How does the plot/results affect your recommendations or next steps in the analysis?
- In many cases, the questions build on each other. The results of one question provide a starting point for later questions.
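As one possible illustration of showing only the requested rows, the sketch below returns the five highest observations of a variable rather than the whole data frame. The built-in `mtcars` data and the name `top_five_mpg` are assumptions chosen for illustration; an actual question may call for a different approach.

```r
library(dplyr)

# Keep only the five rows with the highest mpg.
# with_ties = FALSE guarantees exactly five rows even if values tie.
top_five_mpg <- mtcars |>
  slice_max(mpg, n = 5, with_ties = FALSE)

top_five_mpg
```

Note that `slice_max()` keeps ties by default, so without `with_ties = FALSE` it can return more than `n` rows.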
When all else fails, enable Partial Credit.
- You get credit for three parts of a solution:
- Having the correct conceptual approach and strategy.
- Executing the approach correctly in R/RStudio.
- Correctly interpreting the results and communicating them in writing using the appropriate language.
- If you are not able to completely solve a problem, write down something relevant to indicate your knowledge of each part of the solution.
- If your code is not working, comment out the code block (highlight, then CMD+SHIFT+C on a Mac), and indicate what is happening.
Use RStudio Keyboard Shortcuts for more efficient coding. You will be inserting new code chunks, running partial code, and inserting pipes and assignments hundreds of times in this course. Go to RStudio `Tools/Keyboard Shortcuts Help` to see what works for your machine.
D.4 Summary
This course is challenging. These general instructions can help you be successful in this course while growing your competencies in using the data science life cycle when analyzing problems, developing and executing coding strategies, and collaborating with others.
Feel free to email me at any time with questions about these instructions, the instructions for the assignments, or any topic in class, or if you get stuck and just can’t figure out a path to an answer. The earlier in the assignment period you ask for help, the more likely you will get the help you need to figure things out.