Appendix C — Projects, Files, and Paths

C.1 References

C.2 Data Science Projects

Data science workflows are usually constructed around the concept of a “Project”.

From a conceptual perspective, a data science project is a coherent set of work structured to answer a set of questions, solve a problem, or create a capability.

  • In typical settings, a project may be associated with a defined outcome and a budget.
  • For this course, the entire set of lecture notes is a project. Each assignment will be its own project.

From a physical perspective, a data science “project” is a coherent set of files that are created or configured to support the work.

Tip

A best practice is to keep all project files within or beneath a single project folder (directory) on your computer.

  • Many coding tools such as Git expect this structure.
  • Putting your project under a single folder makes your work easier to maintain by future you or others.

However, not everything belongs in one flat folder. You can, and should, use multiple sub-folders to organize your work.

  • Separate your files based on their purpose, e.g., keep raw data, cleaned data, R scripts, and analysis files in separate folders.
    • Depending upon the nature of the work, a specific folder structure may be required, e.g., for an R package.
  • Figure C.1 shows the potential structure for a project in R.
myproject/              # Top-level project folder
├── data_raw/           # Folder for raw, untouched data and data cleaning scripts
│   ├── my_data.csv     # Raw data files
│   └── clean_data.R    # R script for cleaning raw data
├── data/               # Folder for cleaned data
│   └── my_data.rds     # Cleaned data files
├── R/                  # Folder for scripts and custom functions
└── analysis/           # Folder for analysis and report documents
    └── my_analysis.qmd # Quarto report
Figure C.1: Example folder structure for a project using R.
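
A skeleton like the one in Figure C.1 can be created from R itself. The following is a minimal sketch using base R only. The folder names follow the figure; building the project under tempdir() is an assumption made here so the example does not touch your real files.

```r
# Build the Figure C.1 skeleton inside a temporary directory (illustrative location).
project_root <- file.path(tempdir(), "myproject")
sub_folders <- c("data_raw", "data", "R", "analysis")

# recursive = TRUE also creates the top-level folder if it is missing;
# showWarnings = FALSE keeps re-runs quiet when the folders already exist.
for (folder in sub_folders) {
  dir.create(file.path(project_root, folder),
             recursive = TRUE, showWarnings = FALSE)
}

# List the folders that now exist directly under the project root.
created <- list.dirs(project_root, full.names = FALSE, recursive = FALSE)
print(created)
```

In a real project you would replace the temporary location with the folder you want to use as the project root.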

Most Integrated Development Environments (IDEs) expect you to organize your work under a single top-level folder, often referred to as the project root.

  • VS Code and Positron expect you to open a folder to define your workspace.
    • You can also open multiple folders and configure a workspace to access files from multiple projects at once.
  • RStudio allows you to create an RStudio Project, which includes special .Rproj and configuration files to track the state of your project.
    • Most modern workflows for developing R packages assume the project files are organized within an RStudio Project.
  • Jupyter does not enforce a formal project structure. However, best practice is to launch Jupyter from your top-level project folder and create or activate a virtual environment within that directory.
    • This ensures consistent paths and environment isolation across notebooks.

Figure C.2 shows the folder structure for an RStudio Project that is managed in Git. It also includes the files Quarto needs to build a website or book and publish it online.

myproject/             # Top-level project folder
├── .git/              # Git repository folder (hidden)
├── .gitignore         # Git ignore file for excluding files from version control
├── myproject.Rproj    # RStudio project file defining project settings
├── .Rproj.user/       # RStudio user-specific session data (auto-generated)
├── _quarto.yml        # Quarto website/book configuration
├── _publish.yml       # Publication config for quarto.pub
├── data_raw/          # Raw, untouched data and cleaning scripts
│   ├── my_data.csv    # Raw data file
│   └── clean_data.R   # Data cleaning script
├── data/              # Cleaned, processed data
│   └── my_data.rds    # Cleaned data file
├── R/                 # Custom R scripts and functions
├── chapter_01.qmd     # First chapter content
├── chapter_01_images/ # Images for chapter 01
└── README.md          # GitHub project overview
Figure C.2: Example project folder structure for an RStudio Project that is managed in Git and creates a Quarto website.

These are just examples, and projects may take on many different shapes.

  • You may have sub-folders for SQL or Python, or different types of R scripts.
  • If you have created modules, you may have sub-folders for each module.
  • If you have a website with multiple tabs or pages, you may have folders for each tab or page.
  • If you are creating an R package to publish on CRAN, you will have to follow specific guidelines for some elements of your project. See the ggplot2 GitHub repository as one example.

Warning

Unless you are an expert in working with Git submodules, there is rarely a good reason to nest one project inside another. Doing so can lead to a number of problems:

  • Git confusion: Git expects each repository to maintain its own isolated history. Nesting repositories can lead to unexpected behavior and complicate version control workflows.
  • Broken relative paths: Tools like Quarto assume a single project root. Nested projects can disrupt these assumptions and break file references.
  • Conflicting configurations: Having multiple .Rproj files, renv environments, or environment.yml files can create ambiguity about which settings or environment should be used.

✅ Best practice: Before you create a project, check that its root directory is not already inside another project. You can run git status in the terminal from that directory; you want it to fail with a “not a git repository” error. You can also look up the directory tree for *.Rproj files.
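
The "look up the directory tree" check can also be sketched in base R. The helper name find_enclosing_project() is hypothetical (it is not part of any package), and the marker files it looks for (.git and *.Rproj) are only the two mentioned above; real tools check more markers.

```r
# Walk up from `start` and return the first folder (starting with `start` itself)
# that contains a .git entry or an .Rproj file; return NA if none is found.
# `find_enclosing_project` is a hypothetical helper written for this example.
find_enclosing_project <- function(start = getwd()) {
  current <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    has_git <- file.exists(file.path(current, ".git"))
    has_rproj <- length(list.files(current, pattern = "\\.Rproj$")) > 0
    if (has_git || has_rproj) return(current)
    parent <- dirname(current)
    # dirname() of the file system root is the root itself, so stop there.
    if (identical(parent, current)) return(NA_character_)
    current <- parent
  }
}

# Demonstration in a temporary folder: "outer_project" is already a project,
# and "outer_project/inner" is where we are thinking of creating a new one.
outer <- file.path(tempdir(), "outer_project")
inner <- file.path(outer, "inner")
dir.create(inner, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(outer, "outer_project.Rproj"))

enclosing <- find_enclosing_project(inner)
print(basename(enclosing))
```

If the function returns a folder rather than NA, the candidate directory is already inside a project and a new project should not be created there.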

The key is to organize your files in a way that is easy to understand (by you and others), separates or modularizes the work, facilitates configuration management, and meets deployment or publishing requirements.

C.2.1 RStudio Projects

When working in RStudio, you should convert every project of any duration into an “RStudio Project” to take full advantage of the IDE’s capabilities.

To create an RStudio Project, use the IDE menu File → New Project → New Directory → New Project or the drop-down arrow in the top right of the IDE.

  • If you have an existing folder, choose Existing Directory instead of New Directory to use that folder as the basis for the RStudio Project.

RStudio will create the .Rproj file and the user-specific session folder (both shown in Figure C.2) to keep track of your project.

  • You can open the project in RStudio by opening the .Rproj file, by using the menu File → Open Project..., or by using the drop-down arrow at the top right of the IDE.
  • RStudio will automatically set up your console working directory, your Files pane working directory, and your Terminal pane prompt working directory to be the root folder of the RStudio Project.

See Working with RStudio Projects for more details.

Note

VS Code and Positron will also set the working directory of the environment to the project root when a Folder or Workspace is opened.

C.3 File Paths and Working Directories

Good practices include separating files into different folders. Thus, you will usually need to connect from the file you are in, be it .qmd, .R, or .py, to another file that has data, images, or functions you want to read or write.

Connecting to another file requires telling your code the “path” to the file of interest.

  • A file path tells your code where to find a file or folder. It’s like the GPS location for reading data, saving plots, or loading resources.

There are two types of file paths: absolute and relative.

C.3.1 Absolute Paths

An absolute path specifies the full location of a file, starting from the root of the computer’s file system and working down through each level of folders to get to the final file name.

  • On macOS/Linux: "/Users/jane/my_documents/DATA-413/Assignments/hw_01/data/mydata.csv"
  • On Windows: "C:/Users/Jane/Documents/DATA-413/Assignments/hw_01/data/mydata.csv"

The advantage of absolute paths is they are always accurate on your machine.

The major disadvantage is they are almost always broken on someone else’s machine: your code is not portable to other systems or reproducible by other people.

Tip

If you can see your computer user name in a path, it is an absolute path and will not work on my machine! Change it.

The solution is to use a relative path.

C.3.2 Relative Paths

A relative path starts from the current working directory and navigates up (if needed) and then down through the folder levels to get to the file of interest.

Assume you are working in a .qmd file in your RStudio Project analysis folder and you want to read in some data from a file in the data folder that is at the same level as the analysis folder. The relative paths might look like:

  • On macOS/Linux: "../data/mydata.csv"
  • On Windows: "../data/mydata.csv"

The major advantage of relative paths is they will work on someone else’s machine. You do not need to care what their computer’s folder structure looks like (as long as they are using the same folder structure for the project as you, which they should).

A minor disadvantage is that if someone changes the project folder structure, the path will not work on anyone’s machine, though this should be rare.

Building a relative path is straightforward.

  • Use . for the current directory, .. to go up one level, and /name to go down one level to the folder or file called name.
  • You can connect these together as much as needed to traverse the levels of folders.
  • You should only need to go up as many levels as needed and then down as many levels as needed to find the file.
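
The rules above can be tried out with a small sketch in base R. The project name hw_01 and the file mydata.csv mirror the earlier examples; the project is built under tempdir() purely for illustration.

```r
# Create a tiny project with analysis/ and data/ folders at the same level.
project <- file.path(tempdir(), "hw_01")
dir.create(file.path(project, "analysis"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), showWarnings = FALSE)
writeLines(c("x,y", "1,2"), file.path(project, "data", "mydata.csv"))

# The relative path as it would be written in a file saved in analysis/:
# up one level (..), then down into data/, then the file name.
rel_path <- file.path("..", "data", "mydata.csv")
print(rel_path)

# Resolve the relative path against analysis/ to see where it really points.
# normalizePath() collapses the ".." step into a plain absolute path.
resolved <- normalizePath(file.path(project, "analysis", rel_path), winslash = "/")
print(basename(dirname(resolved)))
```

The resolved path differs from machine to machine, but the relative path that produced it does not, which is what makes it portable.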

C.3.3 Where is the Working Directory?

A working directory is the folder your computer uses as the default starting point to look for files the code wants to read or write.

  • Your computer can have many working directories at once as each computing process (a file or interactive pane with a cursor e.g., the Console pane or a Terminal window) typically has its own, independent working directory.
  • Understanding how working directories are set in different tools and file types is key to writing portable, reliable code.

RStudio has at least three possible working directories: one for the interactive Console pane, one for the Source file, e.g., .qmd or .R, and one for the Terminal window.

  • Console Pane: RStudio will set the Console working directory to the project root when opening an RStudio project or to the default from the Global Options - General if not in a project.
    • It will show the absolute path starting from the User root folder, shown as ~, at the top of the Console pane.
    • You can also run getwd() to see the path and setwd("new_path") to change the Console working directory.
    • You can also use the File pane More menu to change the working directory.
  • Source Pane
    • Quarto (.qmd) Files: When working interactively in a document code chunk or rendering a .qmd file, RStudio sets the working directory to the location of the file, not the project root.
      • If the file is saved in the analysis folder, that is the working directory.
    • R Scripts (.R files)
      • Running an R script uses the current Console Pane working directory, which defaults to the project root if opened as an RStudio Project.
    • Python Scripts (.py files)
      • If you run Python in RStudio via {reticulate}, it uses the R Console Pane working directory.
  • Terminal Window: RStudio will set the Terminal window working directory to the RStudio project root when first opening a project or leave it at the last folder that was being used if restoring a project or opening outside a project.
    • Whatever folder is showing in front of the cursor is the working directory for the window.
    • Any new terminal windows in a project will open at the root of the project.
    • You can see the full path starting from the user root, shown with ~, at the top of the terminal window.
    • If you want to change the working directory, navigate using the shell command cd newpath, where newpath is built from combinations of .. and /foldername as in a relative path.

Warning

Avoid using setwd() in Quarto documents or scripts as it is not reliable for ensuring your code works on anyone else’s computer.

  • You will see lots of blog posts and tutorials about how to use setwd() but most are old. Newer posts recommend against it in favor of using relative paths or functions from the {here} package or Base R.

The multiple working directories can cause challenges when running code in a file versus running it in the console.
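
The following sketch illustrates the problem in base R: the same relative path is found from one working directory and missing from another. The helper exists_from() is hypothetical and uses setwd() only to simulate what the Console (project root) and a .qmd file in analysis/ would each see; it restores the previous working directory before returning.

```r
# Hypothetical helper: check file.exists(path) as if `dir` were the working
# directory, then restore the previous working directory on exit.
exists_from <- function(dir, path) {
  old <- setwd(dir)
  on.exit(setwd(old), add = TRUE)
  file.exists(path)
}

# A small temporary project with analysis/ and data/ at the same level.
project <- file.path(tempdir(), "wd_demo")
dir.create(file.path(project, "analysis"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), showWarnings = FALSE)
writeLines(c("x,y", "1,2"), file.path(project, "data", "mydata.csv"))

# Console view: working directory is the project root.
from_root <- exists_from(project, "data/mydata.csv")
# .qmd view: working directory is analysis/, so the same path is not found.
from_analysis <- exists_from(file.path(project, "analysis"), "data/mydata.csv")
# From analysis/ the path has to go up one level first.
fixed_path <- exists_from(file.path(project, "analysis"), "../data/mydata.csv")

print(c(from_root = from_root, from_analysis = from_analysis, fixed_path = fixed_path))
```

This is why code that runs in the Console can fail when the document is rendered, and vice versa.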

C.4 The {here} Package for Finding Files in R Projects

The {here} package was designed to help users find files in their projects, from both notebooks (such as Quarto files) and R scripts, and to help make their work reproducible.

The package assumes your work is in some sort of project structure that has a root directory.

It uses a helper package called {rprojroot} for finding the project root directory.

  • It does this by looking for standard files that indicate a project such as .Rproj, .git, .here, DESCRIPTION, etc.

Once it knows the project root directory, it creates an absolute path for it for the machine that is running the code.

Then when working in a document, it checks the working directory for the document relative to the root.

This allows you to then specify the path from the root directory to any other file (in or outside the project) with the here::here() function.

  • The function will figure out where it is being called from and the correct path to the file.

Rather than using the current working directory (getwd()), here() always interprets paths relative to the project root.

It then converts those paths into absolute paths for the computer running the code. This makes your file references consistent whether you or someone else is:

  • running a script,
  • rendering a Quarto document,
  • or, executing code in an interactive R session.

The here() syntax allows you to give the complete path from the project root to the file as a single string, or to pass each folder level and the file name as separate arguments, which it joins with / as the separator.
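
As a quick check of the two calling styles, the sketch below assumes the {here} package is installed. The folder and file names are placeholders; the file does not need to exist because here() only builds the path.

```r
library(here)

# Both calls describe the same file below the project root.
path_from_parts  <- here("data", "file.csv")   # one argument per level
path_from_string <- here("data/file.csv")      # one string with / separators

# The two styles produce the same absolute path.
print(identical(path_from_parts, path_from_string))

# The prefix of this path depends on where the project lives on your machine.
print(path_from_parts)
```

Passing each level as a separate argument avoids typing separators by hand, which is a common reason to prefer that style.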

Assume we have a .qmd file in an analysis folder just below the project root and we want to load data from a data folder at the same level. The following code allows us to do that.

```r
library(here)                              # 1
readr::read_csv(here("data", "file.csv"))  # 2
```
  1. When you run library(here) in a .qmd file located in an analysis subfolder, it automatically walks up the directory tree to find the project root. It then creates a variable with the absolute path to the root.
  2. The call here("data", "file.csv") appends data/file.csv to the absolute path of the root to create a complete path to the file.

Tip

If you are wondering why a path is not working, call here::dr_here() and it will provide a message that by default includes the reason why here() is set to a particular directory.

here::dr_here()

Note

You may notice the help file for here() states “This package is intended for interactive use only.” That does not mean you should not use here() in your projects or scripts. That is where it is best.

Interpret that statement as a warning to be careful about using it inside a package you may be developing as it creates another dependency in the package.

  • If you are building a package, consider using file.path(), a base R function for building file paths in a cross-platform way.
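
A short sketch of file.path() in use follows. The image file names are placeholders invented for the example; the folder names come from the project structures shown earlier.

```r
# file.path() joins its arguments with "/", which R accepts on
# Windows, macOS, and Linux.
data_path <- file.path("data", "file.csv")
print(data_path)

# It is vectorized, so several paths can be built in one call.
# "fig1.png" and "fig2.png" are hypothetical file names.
chapter_images <- file.path("chapter_01_images", c("fig1.png", "fig2.png"))
print(chapter_images)
```

Unlike here(), file.path() does not look for a project root; it only joins the pieces you give it, so the result is still relative to the current working directory.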

Also note that other packages such as {plyr} or {txtutils} have a here() function, so be careful about what gets loaded after {here}. You can always use the :: operator to be precise: here::here().

C.4.1 The {fs} package

Given the discussion of working with files and directories, if you are trying to work programmatically with folders and files, the {fs} package may be of interest. It is a tidyverse package that provides a “cross-platform, uniform interface to file system operations.”

The {fs} package (I think of it as file/folder support) provides functions in four main categories:

  • path_ for manipulating and constructing paths
  • file_ for files
  • dir_ for directories
  • link_ for links

Like other tidyverse functions, it works well with pipes, is vectorized, returns “tidy” results, and “fails nice” in that it provides detailed error messages.

It works well with {purrr} so if you need to work on all files in a directory, {fs} may be right for you.
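
The following brief sketch, which assumes {fs} is installed, touches three of the four function families. The folder and file names are made up for the example, and everything happens under tempdir().

```r
library(fs)

# Work inside a temporary folder so nothing in a real project is touched.
demo_dir <- path(tempdir(), "fs_demo", "data")      # path_ family: build a path
dir_create(demo_dir)                                # dir_ family: create folders
file_create(path(demo_dir, c("a.csv", "b.csv", "notes.txt")))  # file_ family

# dir_ls() returns the matching paths as a vector, here only the .csv files.
csv_files <- dir_ls(demo_dir, glob = "*.csv")

# path_ family again: keep the file name and drop the extension.
csv_names <- path_ext_remove(path_file(csv_files))
print(csv_names)
```

Because dir_ls() returns a vector of paths, the result can be passed directly to a {purrr} function such as map() to read or process every file in turn.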

C.4.2 Summary

  • Use well-organized Projects for clean, reproducible work for R or Python.
  • When using RStudio, convert each project into an RStudio Project for ease of navigation.
  • For R and Python, keep code portable with relative paths.
    • Although Python has different rules for determining the working directory, it still works well with relative paths.
  • If working in R, the {here} package can make it much easier to create reusable code.