2  Git and GitHub

Published

August 30, 2024

Keywords

Git, GitHub, Version Control

2.1 Introduction

2.1.1 Learning Outcomes

  • Use a terminal pane in RStudio.
  • Execute BASH shell commands to navigate among directories and manipulate files.
  • Use Git for version control of local repositories and files.
  • Use Git with the cloud-hosted GitHub for syncing local and remote repositories.
  • Manage class materials and homework assignments using Git and GitHub.

2.1.2 References:

2.1.2.1 Other References

2.1.3 Configure Your Computing System

  • If you have not used R, RStudio, or Quarto before, see the Appendix on installing software.
  • If you have not configured your system, see the Appendix on Git and GitHub Setup.

2.2 The Terminal Window

A terminal window provides access to a command line where we can enter commands to the system through a shell layer.

  • The easiest way to open a terminal window is within RStudio with Tools > Terminal > New Terminal.

 

  • The terminal window should look like this:


  • The path before the dollar sign is the working directory of the terminal window, not necessarily the same as the RStudio Console working directory. It’s where the terminal shell will look as the starting point for all commands and files.

  • The tilde or twiddle, “~”, is shorthand for the “home directory”. Each user has a home directory that is the default starting directory for the shell (including Git Bash) - usually your Users/username directory.

2.2.1 The Command Line

  • The command line is like the R command prompt: you insert code, hit enter, and then the computer executes your command.

    • Other words for command line: shell, terminal, command line interface (CLI), and console.
    • These terms are technically slightly different.
  • All commands get placed after the dollar sign.

  • However, instead of typing R code, you type shell commands.

  • In this class, we will use the command line primarily for two things:

    • Moving around your file system.
    • Running Git commands.

2.3 Bash (the Bourne-Again SHell)

  • There are many types of shells, each with their own scripting language.

  • We will use the Bash scripting language for this class.

  • Bash is different from R in the syntax for commands/functions.

    • R: f(x, y = 1)
    • Bash: f x --y=1
  • Arguments that are “flags” use dashes: a single-letter flag takes one dash, so f x -g would incorporate the g flag, while a spelled-out option takes two dashes, as in --y=1 above.

  • We’ll see this in Git.

  • Bash is also case sensitive for commands and names, e.g., cd, not CD.
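As a small illustration of the flag syntax, the sketch below lists a throwaway folder with and without flags. The folder and file names are made up for this example, and it assumes a Bash shell where the standard mktemp and ls commands are available.

```shell
# Create a throwaway sandbox folder so none of your real files are touched.
sandbox=$(mktemp -d)
touch "$sandbox/notes.txt" "$sandbox/.hidden_config"

# No flags: hidden files (names starting with a dot) are not listed.
plain_listing=$(ls "$sandbox")
echo "$plain_listing"

# One short flag: -a lists all entries, including hidden ones.
all_listing=$(ls -a "$sandbox")
echo "$all_listing"

# Short flags can be combined behind one dash: -l (long format) plus -a.
ls -la "$sandbox"
```

Note that typing LS instead of ls would typically fail with a "command not found" message, because command names are case sensitive.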

2.3.1 Bash Commands for Navigating the Directory Structure:

  • Changing the Bash terminal window working directory is done by using Bash commands to navigate (move) to another directory, e.g., where you have files of interest.

  • pwd: Print working directory. Show the absolute path to the working directory. This is like getwd() in R.

pwd
  • ls: List the current files and folders in a directory. If you are on a Windows machine in the default Command Prompt, use dir.
ls
  • cd: Change directories. This is like setwd() in R.
  • We use the same elements to specify a path as we do in relative paths in R.
    • Two periods (..) mean “move back (up) a folder”, and
    • a slash followed by a folder name (/foldername) means “move down into that folder”: cd ../foldername.
    cd ../
    pwd
  • If you use cd without specifying a folder to move to, it will move the working directory to the home directory.
cd
pwd
  • . is shorthand for “the current working directory”

  • You can combine the move up and move down arguments to navigate several layers up and then down several layers in one command.

  • As an example, I’m going to move from the home directory back to my course directory: /OneDrive\ -\ american.edu/Courses/DATA-413-613.

  • As a shortcut for any command, type a letter then hit the tab key and Bash will fill in valid letters to the point of ambiguity (if there is one). This can save a lot of typing errors.

    • If nothing happens after one letter, that means you are already at a point of ambiguity so type another letter, and so on.
    • If in doubt, use ls
    • Note: When you use tab completion, Bash automatically escapes any spaces in directory names by putting a backslash \ in front of them.
cd ./OneDrive\ -\ american.edu/Courses/DATA-413-613/
  • man: Read the manual of a command. Just like help() in R.
man ls
  • This will open up the help page for ls. You can scroll through this page using the up and down arrows.
  • Exit this page by typing q.
  • This won’t work for Git Bash (for Windows users). Instead, you’ll need to type
ls --help
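The navigation commands above can be tried safely in a practice folder. This is a minimal sketch: the folder names are invented for the example (one contains a space on purpose), and everything is created in a temporary location rather than in your real course directory.

```shell
# Build a small practice tree in a temporary location (hypothetical names).
sandbox=$(mktemp -d)
mkdir -p "$sandbox/Courses/DATA-413-613" "$sandbox/Courses/Other Course"

cd "$sandbox/Courses/DATA-413-613"
pwd                                  # absolute path of the working directory

cd ..                                # up one level, into Courses
up_one=$(basename "$(pwd)")
echo "$up_one"

cd "./Other Course"                  # down one level; quotes protect the space
cd ../DATA-413-613                   # up one level, then down, in one command
combined=$(basename "$(pwd)")
echo "$combined"

cd ../Other\ Course                  # a backslash-escaped space works as well
escaped=$(basename "$(pwd)")
echo "$escaped"
```

Quoting the path and escaping the space with a backslash are two ways of writing the same folder name; tab completion produces the backslash form.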

2.3.1.1 Exercises

  1. What is your home directory? Navigate to it and see what files/folders exist in your home directory. Navigate back to the directory with your notes file.
cd
  2. Where does the following command take you? How does it work?
cd ~/../../..
  3. Read the manual page of ls. What does the a flag do? Try it out!
man ls
## ls --help ## windows
  4. Move back to the directory with your notes for this class like the example below.
cd Users/rressler/OneDrive\ -\ american.edu/Courses/DATA-413-613/Lectures_Class/01_git_github/
## or use the ~ for home directory
cd ~/OneDrive\ -\ american.edu/Courses/DATA-413-613/Lectures_Class/01_git_github/

2.3.2 Bash Commands for Managing Files and Directories

  • I rarely use these but they are there for you.
  • cp: Copy a file in your working directory with cp filename newfilename. You can add relative paths to either argument.
cp 01a_basic_bash.html hellobash.html
ls
  • mv: Move/rename a file.
mv hellobash.html goodbyebash.html
ls
  • rm: Remove (delete) a file.
rm goodbyebash.html
ls
  • mkdir: Make a new directory/folder.
mkdir tempdir
ls
  • rmdir: Remove (delete) a directory/folder.
rmdir tempdir
ls
  • Terminus is a game you can play for more practice at navigating using Bash.
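If you do not have the files used above, the same file-management commands can be practiced in a sandbox. This sketch uses made-up file names in a temporary folder; it also shows one detail worth knowing: rmdir only removes a folder that is already empty.

```shell
# Work inside a temporary folder so nothing important can be deleted.
sandbox=$(mktemp -d)
cd "$sandbox"

echo "hello bash" > hellobash.txt       # create a small file to practice on
mkdir tempdir                           # make a new folder

cp hellobash.txt tempdir/copy.txt       # copy into the folder via a relative path
mv hellobash.txt goodbyebash.txt        # rename the original file
ls . tempdir                            # list both folders

rm goodbyebash.txt tempdir/copy.txt     # delete both files (there is no undo)
rmdir tempdir                           # works because tempdir is now empty

remaining=$(ls -A | wc -l)              # count what is left in the sandbox
echo "$remaining"
```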

2.4 Why Bother Using a Version Control System (VCS)

  • A version control system (VCS) is a program that tracks iterative changes to your local files.

    • VCS tools have been around for years to support software development projects.
    • Popular ones include CVS, Subversion, Mercurial, and Git.
  • Git is the most popular VCS and “one of the best version control tools available” in 2022

  • You can go back to previous versions of your code/text and compare with the most recent version, or keep the old version and start a new development path.

  • You can create copies of your code or files, change them, then merge these copies together later.

2.4.1 Motivation 1: Change code without the fear of breaking the baseline or production version

  • You want to try out something new, but you aren’t sure if it will work.

  • Non-Git solution: Copy and rename the files over and over

    • analysis.R,
    • analysis2.R,
    • analysis3.R,
    • analysis_final.R,
    • analysis_final_final.R,
    • analysis_absolute_final.R,
    • analysis7.R
    • analysis8.R
  • Issues:

    • Difficult to remember differences among files.
    • Which files produced specific results?
    • Requires a lot of careful documentation and user bookkeeping (not likely to happen)!
  • Git lets you change files while automatically keeping track of all older versions. It is easy to revert back to older versions if you decide the new changes don’t work or you want to try a different approach.

2.4.2 Motivation 2: Easy collaboration. Everyone can see exactly what changes have been made and who made them and why. This can reduce code (and people) conflicts.

  • In a group setting, your collaborators might (will) suggest or drive changes to your analysis/code as they make changes to their code, e.g., creating new data or renaming variables and functions.

  • A first non-Git solution: Email files back/forth.

    • Issues:
      • You have to manually incorporate changes across the team.
      • Only one person can work on integrating the code at a time (otherwise multiple changes might be incompatible).
  • A second non-Git solution: Share a Dropbox or Google Docs folder (a “centralized” version control system).

    • Issues:
      • Again, only one person can work on the code at a time.
      • Difficult to see the history of changes.
      • Less user-friendly for tracking changes.
      • Difficult to run excursions.
      • Even the R {trackdown} Package for working with GoogleDrive says use Git to manage code.
  • Git lets each individual work on their own local repository and offer their changes to others for review before integrating into the code.

  • It allows you to control which changes get approved to be incorporated into the baseline and then automatically incorporates those changes and identifies any conflicts.

  • Documentation of changes is built into the workflow (so it actually happens!).

2.4.3 Motivation 3: High demand skill for future employment

2.4.4 Motivation 4 - Git is the Foundation for GitHub.

GitHub is a popular forum for creating, collaborating on, and sharing Git-based repositories.

  • You will see many R packages that are not ready for CRAN, or don’t want to be on CRAN, hosted and accessible on GitHub.

As examples:

  • You can make your course group-project repo public so everyone can see it or just share it with select individuals.
    • Prospective employers can see your work.
    • Provides evidence for your resume.
  • If you build a personal website (e.g., using Quarto), you can use Git and GitHub for version control of your website.
    • You can then use GitHub Actions so that when you make changes and push them to GitHub, your website hosted on GitHub or elsewhere, e.g., quarto.pub or netlify.com, is updated automatically.

In short, version control is a best practice for any project involving software as it provides transparency and reproducibility and enables collaboration across teams.

  • Employers expect it.
  • Your future you will appreciate it.
  • Git, GitHub, and RStudio IDE make it easy to integrate into your workflows.

2.5 Git Overview

Git works by managing file status using three levels:

Figure 2.1: https://marklodato.github.io/visual-git-guide/index-en.html
Note

We are starting by learning how to work with Git and GitHub in the terminal window as opposed to using integrated software (RStudio Git) or desktop applications (GitHub Desktop).

  • Understanding how to use Git in the terminal window makes it easier to use the other approaches.
  • It also provides access to all Git commands that may not be present in an integrated application.
    • As an example, if you need to fix an error, like committing a very large file (>100 MB) into Git that will be rejected by GitHub for being too large, you must use commands only available in the terminal window.

You may use the terminal window or RStudio functionality for this course as long as you commit and push regularly.

2.5.1 The Git Workflow for Versioning Files.

The workflow consists of actions to move files across the three levels.

Files start at the lowest level, the working directory. Users advance files into the history or out of the history.

  • Working Directory: Git uses the terminal working directory as its working directory.

    • This is the folder where your terminal pane shell tries to execute commands and where it looks for files.
    • It is not necessarily the same as your RStudio Console working directory!
    • Any changes to files you have saved but not staged or committed to Git only exist in the working directory and are not yet indexed or saved in the Git history.
  • Stage: Files that are staged (added) are prepared (scheduled) to be committed to the history, but are not yet committed. Only files in the stage will be committed to the history.

  • History: The timeline of file versions (snapshots). You commit a file to the history and then, even if you modify it later, you can always go back to that same file version.

  • We’ll focus on the right-hand-side of Figure 2.1 for now.

  • Here the workflow is typically:

    1. Modify files in your working directory as normal until you want to take a snapshot of one or more files. Save the files.
    2. Add these modified files to the staging area.
    3. Commit staged files to history, where they will be kept forever (on your computer).
  • The workflow on the left-hand side of Figure 2.1 is used to undo actions from the right side.

    • Usually only when you need to undo mistakes or have changed your mind.
Important

Git does not need to store a full, separate copy of every file each time you commit.

Each commit is a snapshot, but unchanged files are reused and similar versions are compressed by storing only their differences, so Git can recreate the version at any point in the file’s commit history.

Working with “differences”, which are shown line by line for text-based files, allows for very efficient storage and processing.

Git is not as efficient at tracking non-text (binary) files such as PDF or .xlsx where changes can only be identified at the whole file level.

2.5.2 Common Git Commands

All Git commands begin with git followed immediately by space and then an argument for the type of command you want to execute.

  • Most everything you will do is moving up the right side of the picture.

We will be using the following Git commands most often:

  • git init: Initialize (or create) a Git repository. Only do this once per project/repository.
  • git status: Show which files are staged, which are modified but not staged, and which are untracked.
  • git add: Add modified files from your working directory to the stage.
  • git commit -m "descriptive message": commit your staged content as a new commit snapshot.
  • git push: create a copy or update a copy of your repository on GitHub.

If you want to see the “differences” between the previous version and the current saved version, you can use

  • git diff: Look at how files in the working directory have been modified.
  • git diff --staged: Look at how files in the stage have been modified.
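The commands listed above can be seen moving one file through the three levels in the sketch below. It is only an illustration: the file name, commit message, and the name and email given to git config are placeholders, and the repo lives in a temporary folder. The --short option of git status gives a compact, one-line-per-file report.

```shell
# Practice repo in a temporary folder.
sandbox=$(mktemp -d)
cd "$sandbox"

git init -q                                   # create the repository (once per project)
git config user.name  "Example Student"       # placeholder identity, this repo only
git config user.email "student@example.com"

echo "first draft" > notes.txt                # a new file in the working directory
before_add=$(git status --short)
echo "$before_add"                            # ?? notes.txt   (untracked)

git add notes.txt                             # working directory -> stage
after_add=$(git status --short)
echo "$after_add"                             # A  notes.txt   (staged)

git commit -q -m "Add first draft of notes"   # stage -> history
after_commit=$(git status --short)
echo "$after_commit"                          # empty: nothing left to commit

git log --oneline                             # one line per commit in the history
```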

2.5.3 Repositories and Folder Structure (Housekeeping)

A repository (or repo, for short) is a collection of files (in a folder and its sub-folders) being version controlled (configuration managed) as a set.

  • The repo also contains the local version control data, in a hidden .git folder and files.
    • You can use the RStudio IDE to see hidden files and folders by using the More option in the Files pane.

In data science, each repository is typically one project (like an analysis, a model, a homework, or a collection of code that performs a similar task).

  • Users often turn a Git repo into an RStudio project as well. While Git does not care about whether a repo is an RStudio project, the RStudio IDE has features which make it easier to use Git with RStudio projects as we will see.

It is not a good practice to nest repos or RStudio projects inside another repo or project!

  • Before you create a new repo, take care that the terminal working directory is not already under a Git repo.
Tip

Before you create a new repo in the working directory of the terminal window, enter git status in the terminal window to see if it generates an error that the folder is not a git repository.

  • If it generates the error, that is good: the folder is not already in a repo, so it is safe to create a new repo.
  • If it does not generate an error, then you should navigate to a new folder before creating the repo.
  • If you think it should not be a repo, you can navigate up the directory structure to see where the hidden .git folder is located.
  • If it should not be a repo (created in error), you can delete the .git folder and go back to the original location to try to create your new repo again. Be sure to check again with git status.
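The check described in this tip can also be scripted. The sketch below is one possible way to do it, not the only one: check_repo is a hypothetical helper name, and it relies on git rev-parse, which can report whether a folder is inside a repo and where that repo's top-level folder is.

```shell
# Report whether a folder is inside a Git repository (hypothetical helper).
check_repo() {
  if git -C "$1" rev-parse --is-inside-work-tree >/dev/null 2>&1; then
    echo "inside a repo rooted at: $(git -C "$1" rev-parse --show-toplevel)"
  else
    echo "not a repo: safe to run git init here"
  fi
}

sandbox=$(mktemp -d)
check_repo "$sandbox"               # a fresh temporary folder is normally not in a repo

git init -q "$sandbox"              # turn the sandbox into a repo
mkdir "$sandbox/subfolder"
check_repo "$sandbox/subfolder"     # subfolders of a repo are inside it too
```

The second call shows why nesting happens by accident: every folder below a repo's top level already belongs to that repo.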

Recommend the following structure for this course:

  • Create a folder somewhere on your computer called DATA_613 (or DATA_413).
    • Recommend a Google Drive or OneDrive location, NOT Downloads.
  • Create a folder under DATA_XXX called Homework where you will clone each week’s assignment as an individual repo in its own folder.
    • Each assignment will come as an individual repository with its own set of folders.
  • These repositories will constitute your local repositories within which you will manage your files with Git and sync between Git and GitHub.
Warning
  • Note, your DATA_XXX folder should NOT be a repo. It is just the top-level folder you use to organize multiple repos.
  • The folder Homework should not be a repo either. It will have multiple repos underneath it.
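The recommended folder structure can be created from the terminal. In this sketch a temporary folder stands in for the synced location you choose, so substitute your own path; the final check uses git status the same way as the tip above.

```shell
# Stand-in for the location you choose (e.g., a OneDrive or Google Drive folder).
base=$(mktemp -d)

mkdir -p "$base/DATA_613/Homework"    # -p also creates DATA_613 if it is missing
cd "$base/DATA_613/Homework"
pwd

# Neither DATA_613 nor Homework should be a repo, so git status should fail here.
if git status >/dev/null 2>&1; then
  echo "Warning: this folder is already inside a Git repo"
else
  echo "OK: not inside a repo"
fi
```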

2.6 Git Basics

2.6.1 Intro

We’ll practice with the common Git commands as we examine a topic from the famous paper of Oeppen and Vaupel (2002).

  • They found perhaps the strongest association in social science: a linear relationship between year of birth and the maximum life expectancy where the maximum is taken over countries. We’ll examine this relationship for ourselves.

  • We’ll use the gapminder_unfiltered data frame from the {gapminder} package. The variables in this data frame are:

    • country: The name of the country.
    • continent: The continent of the country.
    • year: The year of the measurement. From 1952 to 2007.
    • lifeExp: The life expectancy at birth, in years, of an individual.
    • pop: Population.
    • gdpPercap: GDP per capita (US$, inflation-adjusted).
  • Create a folder called life_exp, e.g., directly under DATA_XXX.

  • Create a new Quarto document called life_exp_analysis.qmd within the life_exp folder. Your .qmd might look something like this:

---
title: "Examine Life Expectancy"
editor: visual
date: today
format: html
---
library(tidyverse)

# Abstract

Here, I re-examine the analysis of Oeppen and Vaupel (2002).

# Analysis
  • Save life_exp_analysis.qmd.

2.6.2 Initialize (Create) a Local Git Repository

Open up a terminal window and use the cd command to navigate so the terminal window working directory is DATA_XXX/life_exp/.

  • Track progress by looking at the prompt

  • Your RStudio should look something like:  

Use the command git init to turn the folder into a Git repository. Type it in the terminal:

git init
  • You’ve just created a Git repository!

There is a .git hidden folder tracking all of the changes you make for the files you tell it about.

  • Use ls -a to show hidden files and folders
ls -a
  • Recall from Figure 2.1, Git won’t track any files in the working directory until they are saved and you tell it to do so.

2.6.3 Git Status

Use git status to see what files Git is tracking and which are untracked.

git status
  • The output should tell you that life_exp_analysis.qmd is not tracked. In fact, you should have no tracked files.

2.6.3.1 Branch Names: Master vs Main

Git (and GitHub) use branches to manage multiple different versions of the same code. We will discuss branches in a later class. For now, you have one branch in your repo - one set of code.

  • When you created your new repo, you may have noticed some lines starting with “hint:”

    hint: Using 'master' as the name for the initial branch. This default branch name
    hint: is subject to change. To configure the initial branch name to use in all
    hint: of your new repositories, which will suppress this warning, call:
    hint: 
    hint:   git config --global init.defaultBranch <name>
    hint: 
    hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
    hint: 'development'. The just-created branch can be renamed via this command:
    hint: 
    hint:   git branch -m <name>

For years, Git and GitHub used “master” as the default name for the branch holding the baseline (working/production) code.

  • In the past few years both Git and GitHub established a preference for using terms other than “master”.

  • At the top of your git status you might see a line that says:

    On branch master
  • To rename the current branch, enter the following:

    git branch -m main
  • To change to use “main” as the default for new repositories, enter the following:

    git config --global init.defaultBranch main
  • Now all new git repos on your machine will use “main” as the name of the baseline (top) branch.

  • We will update your GitHub settings in a few minutes.

Note: many earlier references on Google search results will reference the “master” branch and they mean the baseline or top-level branch in the repo we now call “main.”
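To see the branch name change in practice, here is a small sketch in a throwaway repo. The file name and identity are placeholders, and git branch --show-current assumes Git 2.22 or newer. The rename is wrapped in a check so the commands are safe to run whether your Git starts on “master” or already on “main”.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q
git config user.name  "Example Student"      # placeholder identity, this repo only
git config user.email "student@example.com"

echo "demo" > demo.txt
git add demo.txt
git commit -q -m "Initial commit"

git branch --show-current                    # name of the branch you are on

# Rename only when needed, so this is safe to re-run.
if [ "$(git branch --show-current)" != "main" ]; then
  git branch -m main
fi
current_branch=$(git branch --show-current)
echo "$current_branch"

# Show the default used for new repos, if one has been configured.
git config --get init.defaultBranch || echo "init.defaultBranch is not set"
```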

2.6.4 Stage Files

Use git add filename to add files to the staged level.

git add life_exp_analysis.qmd
  • It’s good practice to always check which files have been added:
git status

Useful Arguments for git add:

  • -A or --all will stage all modified and untracked files.
  • -u or --update will stage all modified files, but only if they are already being tracked.
  • Try git help add in the terminal window and use Enter or Return to scroll down to see all the options and then q to exit.
Tip
  • You can also use the up-arrow key to go back through the previous commands.
  • Unfortunately, on some shells, you cannot click into the middle of a command to edit it; you have to delete from the end and retype, but you can use auto-complete.

Use the .gitignore file to streamline your adding of files.

  • This file contains the names of files that you do not want Git to track.

  • These are usually configuration files or administrative files.

  • A typical .gitignore file for an R Project looks like the following

          .Rproj.user
          .Rhistory
          .RData
          .Ruserdata
          .Rproj
          .DS_Store
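A quick way to confirm that a .gitignore is doing its job is sketched below. The file names are examples only; git check-ignore is a standard Git command that reports whether a path is ignored and, with -v, which rule is responsible.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q

# Two files Git should skip and one it should track (example names).
touch .Rhistory .DS_Store analysis.R

# Write the ignore rules, one pattern per line.
printf '%s\n' ".Rhistory" ".DS_Store" > .gitignore

git status --short                 # the ignored files do not appear in the list
git check-ignore -v .Rhistory      # shows the .gitignore rule that matches
```

The .gitignore file itself is an ordinary file, so it is usually staged and committed like any other.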

2.6.5 Commit Files

Use git commit to create permanent snapshots in the commit history of the staged files.

  • The -m argument will allow/require you to make a comment about the commit.
git commit -m "New life exp qmd file."
  • Your message (written after the -m argument) should be concise, and describe what you have changed since the last commit. These often refer to issues or change report numbers as well.

  • If you forget to add a message, Git will open up your default text-editor where you can write down a message, save the file, and exit. The commit will occur after you exit the text editor.

Tip

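If your default text editor is vim, press the “escape” (ESC) key and then type :wq and press Enter to save the message and exit. Typing :q! instead exits without saving, which aborts the commit.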

git status should now be clear because there are no modified files:

git status

You can see all of your commits using git log.

git log

Congratulations! You have now completed the workflow on the right side of Figure 2.1.

  • As a next step in the analysis of the gapminder_unfiltered data frame,
  1. Add code into life_exp_analysis.qmd to do the following:
    • Load the {tidyverse} packages into R with library()
    • Access the gapminder_unfiltered data frame in the {gapminder} package with gapminder::gapminder_unfiltered.
    • Find the maximum life expectancy for each year and the country which had the maximum life expectancy.
      • Hint: There are multiple ways to do this, but suggest using group_by() and filter().
    • Save the year, country and maximum life expectancy into a new data frame called sumdat.
  2. Edit the .qmd file to change the header from “Analysis” to “Life Expectancy Analysis”.
  3. Save life_exp_analysis.qmd.
library(tidyverse)
gapminder::gapminder_unfiltered  |> 
  group_by(year) |>
  filter(lifeExp == max(lifeExp)) |>
  ungroup() |>
  select(year, country, lifeExp) |>
  arrange(year) ->
  sumdat

2.6.6 Looking at Changes

Use git diff to see changes in all modified files.

git diff

Git tracks changes based on lines in the text file (not letters or words). Any change in a line tracks the whole line being changed.

  • Lines after a “+” are being added. Lines after a “-” are being removed.
  • When there are a lot of lines filling your terminal window, you can exit git diff by entering q.

Check the status of your modified files.

Stage your modified files with git add, but don’t commit yet.

Recheck your status.

git status
git add life_exp_analysis.qmd
git status
  • git diff won’t check for changes in the staged files by default. But you can see the differences using git diff --staged where --staged is an argument.
git diff
git diff --staged

Commit your changes. Use a short but informative commit message.
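The difference between git diff and git diff --staged is easiest to see on a one-line file. This is a minimal sketch with a made-up file (report.md) and a placeholder identity; it mirrors the header change from the exercise above.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q
git config user.name  "Example Student"      # placeholder identity, this repo only
git config user.email "student@example.com"

printf '# Analysis\n' > report.md
git add report.md
git commit -q -m "Add report skeleton"

# Modify the tracked file: the whole line counts as changed.
printf '# Life Expectancy Analysis\n' > report.md

git diff              # shows the old line after "-" and the new line after "+"
git add report.md
git diff              # prints nothing: the change has moved to the stage
git diff --staged     # shows the same change, now waiting to be committed
```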

2.7 Using GitHub as a Remote Repository for Version Control

2.7.1 Intro to GitHub

GitHub is a website that hosts Git repositories and allows workflows for collaboration and continuous integration (among other things).

  • Don’t confuse Git with GitHub.
    • Git is a version control system that runs on your local machine.
    • GitHub is a cloud-based hosting service for managing and sharing many, many repositories.
  • GitHub recommends repositories remain small, ideally less than 1 GB, and less than 5 GB is strongly recommended.
    • Smaller repositories are faster to clone and easier to work with and maintain.
    • Individual files in a repository are strictly limited to a maximum size of 100 MB. For more information, see About Git Large File Storage.

Once you have an account, you do three things to host a repository on GitHub:

  1. Create a repo on GitHub.
  2. Tell Git where GitHub is going to host your repo (the URL).
  3. Tell Git to move (push) your committed files and commit history to the designated GitHub repo.

2.7.2 Configure Your Settings

2.7.2.1 Set the GitHub Default Branch Name to “main” for New Repositories

  • If you set the Git default to main, it could be confusing unless you set the default for GitHub to be the same.
  • Go to your GitHub account Settings (click on your profile picture in the top right).
    • Select Repositories
    • Change the Repository default branch for new repos to use “main”.
  • Note, you can also change the default branch name of individual repos.

2.7.3 Create a Repository on GitHub

  • Go to your GitHub account with your GitHub ID
  • Create a new repo on GitHub by selecting New on the homepage:

 

  • Or go to the “Repositories” tab and select New



  • Tell GitHub the name of your repo. In general, it can be a different name than the repo on your local machine.
    • For this class, name it “life_exp_USERNAME” where “USERNAME” is your GitHub username.
    • In general, you do not need to include your username in your repo name
  • Make a small description.
  • Make it Public for now.
  • To avoid errors, do not initialize the new repository with README, license, or gitignore files.
    • You can add these files after your project has been pushed to GitHub.
  • Then, click Create Repository.


  • You will get a new screen with the suggestions for what to do on your computer in the terminal window to add code to your new repo


2.7.4 Tell Your Local Git Where GitHub Will Host Your Repository.

The location of a GitHub repository is the URL for the repo.

  • It is generally of the form “https://github.com/GHUser/GitHubRepoName.git” where GHUser is the user name of the repo owner and “GitHubRepoName” is whatever the owner chose to name the repo on GitHub.
  • Your GHUser name is your GitHub user name you created for this course.

Use the command git remote to tell Git to do something associated with a remote repository, e.g., on GitHub.

  • We want to tell Git to add a new remote repo (on GitHub) to store a copy of our local repo.
  • We need to tell Git the name and location of the new remote repository.

GitHub allows you to copy the URL for your new repository

  • You can use the Clone or Download button to copy it to your clipboard for pasting into your terminal.
    • It also gives you suggestions for the commands to use.


  • Use git remote add to tell Git the nickname for the remote repo and where it is hosted.
  • This example uses the URL for my GHUser name and repo name. Substitute your own repo URL.
    git remote add origin https://github.com/rressler/life_exp_rressler.git
  • In the above command, “origin” is just the nickname (or alias) we gave to the location URL that is hosting our repo so we don’t have to type the URL every time.
    • We could have used “github” or “deep_space_nine” instead, but “origin” or “upstream” are traditional nicknames you will see in documentation and on-line posts. Think of the “o” in origin as “online”.
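Adding a remote only records a nickname and a URL in your local repo, so it can be tried without any network connection. In this sketch the URLs are placeholders in the general form described above (GHUser, GitHubRepoName, and NewRepoName are not real); substitute your own.

```shell
sandbox=$(mktemp -d)
cd "$sandbox"
git init -q

# Placeholder URL; substitute the URL of your own GitHub repo.
git remote add origin https://github.com/GHUser/GitHubRepoName.git

git remote -v                      # each nickname with its fetch and push URLs
git remote get-url origin          # just the URL stored under "origin"

# If the repo URL ever changes, update the stored URL (hypothetical new name).
git remote set-url origin https://github.com/GHUser/NewRepoName.git
git remote get-url origin
```

Nothing is contacted on GitHub until you push, pull, or fetch, so a typo in the URL only shows up at that point.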

2.7.5 Push Files From Your Local Repo to the Remote GitHub Origin

When you create a new local repo, you are working on the top-level branch, known as the main branch (it used to be called the master branch by default).

The first time you are pushing to a brand new repo on GitHub, you need to use the -u flag (for upstream) and identify the remote nickname and branch:

git push -u origin main
  • Read this as git push (first time) to the upstream online repo with the origin (URL) and merge to its main branch.
Note
  • If this is the first time you have pushed code to GitHub on this account from this computer, the GitHub Credential Manager (GCM) you installed earlier will open a window to allow you to authenticate with a credential so you can connect your local Git to upstream GitHub.
    • A window will pop up asking you to authenticate using a browser or a code.
    • Choose the browser.
    • If you are not already logged into GitHub with the same email, it will take you through the 2FA process you have established.
    • Assuming success, the GCM should take care of future authentication for you.
  • The -u is needed since this is a new repository for GitHub.
    • It tells Git to connect the behind-the-scenes commit history from the local repo to the upstream repo (GitHub URL) we nicknamed origin and to do it for the main branch of the upstream repo.
    • It is equivalent to the --set-upstream option where the “upstream” location is the GitHub origin URL.
    • If for some reason you need to change the URL for the GitHub repo, e.g., you changed your account name or you renamed the repo on GitHub (uncommon events) so the URL has changed, you could use git remote set-url origin followed by the new URL to update Git with the new URL for origin.

You will see output scroll by in the terminal window showing the actions that are happening as part of the push.

When the push is complete, your code is now up on GitHub.

 

  • Once Git knows where to go on GitHub for a given repo, just use git push to push new commits to GitHub.
    • This can be after you have a successful initial git push -u origin main for a new local repo you have pushed to GitHub.
    • Or, if you have cloned a repo that already existed on GitHub to your local computer as a new repo, the connection is known (as we will see later for an assignment repo).
  • In both cases, for all subsequent pushes for the same repo, you can just use:
git push
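The push sequence can be rehearsed offline. In the sketch below, a bare repository on your own disk plays the role of the GitHub repo; that substitution is an assumption made for practice only, and on GitHub the remote would be an https URL and the first push would trigger authentication. File names and the identity are placeholders, and git branch --show-current assumes Git 2.22 or newer.

```shell
sandbox=$(mktemp -d)

# A bare repository on disk stands in for the GitHub repo in this sketch.
git init -q --bare "$sandbox/remote.git"

mkdir "$sandbox/local"
cd "$sandbox/local"
git init -q
git config user.name  "Example Student"      # placeholder identity, this repo only
git config user.email "student@example.com"

echo "first version" > notes.txt
git add notes.txt
git commit -q -m "Add notes"
if [ "$(git branch --show-current)" != "main" ]; then git branch -m main; fi

git remote add origin "$sandbox/remote.git"  # on GitHub this would be an https URL
git push -q -u origin main                   # first push: also records the upstream

echo "second version" > notes.txt
git commit -q -a -m "Revise notes"
git push -q                                  # later pushes need no extra arguments
```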

2.7.5.1 Exercise

  • Add some text before the code chunk in life_exp_analysis.qmd describing what the code is doing.
  • Save the file, stage (add) the file, commit the changes (with a comment), and then push the changes to GitHub.
  • Go to your GitHub account and confirm it has recorded your latest push - you may need to refresh the browser window.
  • You should now have a public repository on GitHub with an updated life_exp_analysis.qmd file in it.

2.8 Sharing GitHub Repositories with Others

  • GitHub has mechanisms for creating collaboration workflows to help users manage the sharing of repositories.

2.8.1 Access Control and Permissions

  • The first mechanism is access control.
  • All repos have one or more owners who can set permissions for access.
  • Owners of a repo can control who can see, read from (pull), and/or write to (push) each of their repos.
  • To control who can see a repo, owners choose to make them public or private.
    • Public repositories are visible to everyone on the internet.
    • If a repository is private, only the owners can see it by default, and they can authorize others to see it.
    • I am the owner of our classroom account and all the repositories in it are Private by default.
    • Your personal account repositories are Public by default. Free accounts are no longer limited in how many private repos or collaborators they can have.
    • As of July 2024, GitHub is reported to have over 100 million users (2023 figure) and more than 200 million repositories (2022 figure), including at least 28 million public repositories, making it the largest host of source code in the world (https://expandedramblings.com/index.php/github-statistics/).

2.8.2 GitHub Commands for Sharing Repositories

The whole concept of GitHub is to enable widespread sharing of code while protecting the integrity of the code and the intellectual property rights of the code owners.

  • Repo owners set read and write permissions for the public and for authorized users.
  • By default, you can create a copy of any of the public repositories and any private ones for which you have permission.

If you can see a repo, GitHub allows multiple ways of creating copies:

  • fork copies a repo “horizontally”, from one GitHub account to another GitHub account without direct links or write permissions to the original repo.
  • clone copies a repo “vertically”, moving a copy of a repo on a GitHub account down to your local machine while maintaining a link to the original (origin) on GitHub so it’s easy to update your copy or even write to the original repo (if authorized).
  • download creates a file you can put on your local machine without any links.

When you have used fork or clone, you can use a pull request or git pull to update your copy with any changes from the original repo.
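
As a rough terminal sketch of that last point: a plain clone already knows its source as origin, so git pull is enough. For a clone of a fork, one common convention (an assumption here, not something GitHub configures for you) is to add the original repo as a second remote named upstream and pull from it. The sketch uses local bare repositories in a temporary directory as stand-ins for the two GitHub repos, and every name, path, and identity in it is hypothetical.

```shell
set -e
work=$(mktemp -d)

# Build a stand-in for the original owner's repo on GitHub.
git init -q "$work/seed"
cd "$work/seed"
git config user.name  "Owner Example"          # placeholder identity
git config user.email "owner@example.com"
echo "v1" > README.md
git add README.md
git commit -q -m "Initial commit"
git branch -M main
git clone -q --bare "$work/seed" "$work/original.git"      # the "original" repo
git clone -q --bare "$work/original.git" "$work/fork.git"  # your "fork" of it

# Clone your fork down to your machine; origin now points at the fork.
git clone -q "$work/fork.git" "$work/local"

# Later, the owner updates the original repo.
git remote add origin "$work/original.git"
echo "v2" >> README.md
git commit -q -am "Update README"
git push -q origin main

# In your clone, register the original as "upstream" and pull its changes.
cd "$work/local"
git remote add upstream "$work/original.git"
git pull -q --ff-only upstream main
git remote -v                                  # lists both origin and upstream
```

The --ff-only option makes the pull refuse to do anything other than a simple fast-forward, which keeps the sketch free of merge commits.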

2.8.3 Fork a Repository

  • The GitHub fork command creates a separate and distinct copy of a given GitHub repo on your GitHub Account with no permissions on the original repo. The repo stays in the cloud.
  • You can make all the changes you want to the files in a forked repo.
  • However, the original owner will not see them directly - it is a separate repo on GitHub (and it’s not on your local system - yet).
  • If you make a change you think the original repo owner might like, you create a GitHub pull request to ask them to look at (pull) your code for review. They can comment on it and even decide to merge it into their original baseline. This is the way open source software (and many R packages) are developed.

2.8.4 Clone a Repository

  • clone is a Git command you use in your terminal window to copy a GitHub repo onto your machine as a new repo and maintain a link to the original repo.
  • You can pull updates from a cloned repo at any time using git pull.
  • You can push updates to the original if you have write privileges.
    • If you do not have write privileges and want to push updates to GitHub, you can go through a two-step process:
      1. Fork the repo so you have a copy on your own GitHub account.
      2. Clone your forked copy to your local machine.
    • You can then make as many changes as you want and upload to your forked copy on GitHub.
    • You can then create a pull request to the original owners as before if you want to share your code with them for review.
  • To Clone a Repo:
    • Go to GitHub and copy the URL for the repo you want to clone to your clipboard.
    • Go to your terminal window and create/navigate to the directory under which you want to place the new repo.
Important
  • You do NOT need to create a new directory for the repo as it will be in its own directory.
  • The directory should NOT be part of a repo already to avoid nesting of repos.
  • Run git status in your directory. It should fail with fatal: not a git repository, which confirms that neither this directory nor any parent directory above it is a repo with a .git folder.
  • Once you are sure the folder is NOT an existing repo (or under one) then …
  • Enter the command git clone URL where instead of typing URL you paste in the URL you copied from the GitHub repo.
  • Look at help with git clone -h to see all the options.
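
The check-then-clone steps above can be sketched as a short script. This is a minimal illustration under assumptions: a local bare repository in a temporary directory stands in for the GitHub repo, so URL here is a local path rather than the https address you would copy from GitHub, and the Homework directory, file names, and identity are hypothetical.

```shell
set -e
work=$(mktemp -d)

# Build a stand-in for the repo on GitHub (assumption for this sketch).
git init -q "$work/seed"
cd "$work/seed"
git config user.name  "Partner Example"        # placeholder identity
git config user.email "partner@example.com"
echo "# life expectancy analysis" > life_exp.qmd
git add life_exp.qmd
git commit -q -m "Initial commit"
git branch -M main
git clone -q --bare "$work/seed" "$work/github_repo.git"
URL="$work/github_repo.git"    # with GitHub, paste the URL you copied instead

# Go to the directory that should contain the new repo folder.
mkdir -p "$work/Homework"
cd "$work/Homework"

# git status succeeds only inside a repo, so success here means "do not clone".
if git status >/dev/null 2>&1; then
  nested=yes
  echo "Already inside a repo: choose another directory before cloning."
else
  nested=no
  echo "Not inside a repo: safe to clone here."
  git clone -q "$URL" life_exp   # creates its own life_exp directory
  ls -a life_exp                 # the hidden .git folder marks it as a repo
fi
```

The second argument to git clone (life_exp here) is optional. Without it, Git names the new directory after the repo.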

2.8.4.1 Exercise Continued - Fork and Clone (in groups of 2-3)

  • Connect with someone else in the class and tell each other your GitHub ID.

  • The first step is to fork their life_exp_USERNAME repository to your account.

    • Search for the person on GitHub using their GitHub user ID - the repo should be public.
    • Go to their “life_exp_USERNAME” repository. The url should be something like “https://github.com/USERNAME/life_exp_USERNAME” where “USERNAME” is their username.
    • At the top right, on the level of the repo name, click the “Fork” button.
  • You should now have a copy of their life_exp_USERNAME repo on your own GitHub account page.

  • The second step is to clone the repo to your machine so you can make changes to it.

    • Copy the URL from the newly forked repo - click on the Code Button

    • Use your terminal window and navigate to the DATA_XXX/Homework directory on your local machine.

    • Clone the forked version of their repo to your local machine:

    git clone https://github.com/YOUR_USERNAME/life_exp_PARTNER_USERNAME.git
  • You should now be able to see (and edit) the files from the new repo in your local machine directory:

ls
ls -a
  • You will clone each homework assignment throughout the course onto your local machine.
  1. Edit your partner’s file to add code to create a scatterplot of year versus maximum life expectancy. Color code by country, and add a single Least Squares (LS) smoother line.
  • Your plot should look like this:

  2. Save the modified file, then add it (stage), and commit the changes with a comment.
  3. Push the changes to your forked repo on GitHub.
  • You should now see the changes in your repo on GitHub
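# One possible solution for the scatterplot in step 1.
# Assumes sumdat was created earlier in the file and ggplot2 is already loaded.
sumdat |>
  ggplot(aes(x = year, y = lifeExp)) +
  geom_point(aes(color = country)) +        # points colored by country
  geom_point(pch = 1) +                     # open circles overlaid as outlines
  theme_bw() +
  geom_smooth(method = "lm", se = FALSE) +  # a single least squares line for all points
  xlab("Year") +
  ylab("Maximum Life Expectancy") +
  scale_color_discrete(name = "Country")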

2.9 Git and GitHub in RStudio

RStudio has built-in capabilities for working with Git and GitHub.

RStudio uses the construct of an RStudio “Project” for working with Git and GitHub.

  • RStudio Projects are like repos in that they cannot be embedded or nested within each other - so each repo can be a project and each project can be a repo.
Tip

For each homework assignment, once you have cloned it to your machine, create it as an RStudio Project.

  • Open the project each time you work on your homework.
  • This will automatically position your console and terminal working directories to the correct folders.
  • It will also allow you to use the RStudio Git Pane for routine workflow with Git and GitHub.

2.9.1 Exercise: Turn the life_exp Repo into an RStudio Project

Go to RStudio Tools/Global Options.

  • Select Git/SVN
  • Ensure “Enable version control interface for RStudio Projects” is checked.

Go to your RStudio window and at the Top Right you see an icon for Project (none) with a down arrow

  • Click on the arrow and select New Project.

  • Click on the Existing Directory option.

  • Find your life_exp folder.

    • Click the open in new session check box.
    • Click Create Project.

RStudio will now open up a second session.

  • You should see a second RStudio icon with the Project Folder’s name in it in your dock/taskbar.

  • The pane that contains the Environment tab should have a new tab called Git.

  • If you do not see it, go to the RStudio Project Options (top right), select Git/SVN, and confirm you can see your folder, and select OK.

2.9.2 The Git Pane

The Git Pane lists the files in the repo directory that have uncommitted changes, whether modified, staged, or new and untracked. Files matched by your .gitignore file are not listed.

  • It has several commands in its menu to allow you to execute common Git and GitHub commands.
  • Both Diff and History open up a new window with either the un-committed changes to the selected file or the history of the committed changes to the selected file.

Changes

History

2.9.3 The Stage, Commit, Push Workflow

Open your life_exp.Rmd file and make a small change to add a few words and save it.

  • Go to the Git pane and you should see it.

  • In the main Git Pane you can click the check box to the left of a file to stage it.

    • You will see the status change to staged.
  • Then click on the Commit button in the menu row.

    • A new window will open with the files to be committed (you can also stage or un-stage files by clicking on them here).
  • Write a commit message in the upper right Commit message box and click the Commit button below the message.

    • You get a pop-up window with the results (from the terminal window).
    • Notice the file has disappeared from the Git Pane file listing but shows up in the history.
  • If you use the “Diff” pop-up window, you can select a line, a chunk or a file to Stage.

    • You can use the separate buttons for each.
  • There is a Git status message in the Git pane that shows you are ahead of origin/main by one commit.

  • Select the Push button at the top right (green up arrow) to push to GitHub.

Status and Push

  • The status message is now gone.

  • You can go to GitHub to see your updated files.

  • There are other options in RStudio Git Pane that we will discuss in a later lecture on branching and collaboration workflow.

The combination of RStudio Projects with a dedicated pane for Git and GitHub makes for a convenient workflow for using Git and GitHub to manage your code.
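
For comparison, each Git pane action has a terminal counterpart. The mapping in the comments below is an illustrative assumption (RStudio may run slightly different commands internally), and the sketch again uses a temporary local bare repository as a stand-in for GitHub, with placeholder file contents and identity.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"          # assumption: stands in for GitHub
git init -q "$work/life_exp"
cd "$work/life_exp"
git config user.name  "Student Example"        # placeholder identity
git config user.email "student@example.com"
echo "Life expectancy analysis" > life_exp.Rmd
git add life_exp.Rmd
git commit -q -m "Initial commit"
git branch -M main
git remote add origin "$work/remote.git"
git push -q -u origin main

echo "A few more words." >> life_exp.Rmd       # edit and save the file
git status --short                             # Git pane file list
git --no-pager diff                            # Diff button
git add life_exp.Rmd                           # tick the Staged check box
git commit -q -m "Add a few words"             # Commit button
ahead=$(git rev-list --count origin/main..HEAD)        # commits not yet pushed
git status -sb                                 # branch line reports "ahead 1"
git --no-pager log --oneline                   # History button
git push -q                                    # Push button (green up arrow)
ahead_after=$(git rev-list --count origin/main..HEAD)  # nothing left to push
echo "ahead before push: $ahead, after push: $ahead_after"
```

The count from git rev-list is the same number the pane's status message reports, and it drops to zero after the push, which is why the message disappears.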

2.10 Manage Class Materials and Homework Assignments Using Git and GitHub

  • We have a private organization in GitHub with a name based on the semester, something like 24F-DATA-613.
  • All class materials and assignments for this course will be online or provided via GitHub or Canvas.

2.10.1 Class Materials

  • Class Lecture notes will be online or on Canvas as HTML documents.
    • The notes are HTML files. The Canvas versions will have embedded images so can be quite large.
    • The online version (https://rressler.quarto.pub/data-413-613/) may be updated more frequently as errors or clarifications are identified.
    • If you want to point out any errors or suggest corrections, please email the instructor or DATA-613@american.edu.

2.10.2 Homework and Group Project Assignments

2.10.2.1 Accepting Assignments

Homework and Project Assignments will be provided on GitHub via a link posted in Canvas.

  • Each individual homework assignment will be its own repo.
  • You will see a link in the Canvas assignment. Click on the link to accept the assignment.
  • This will create the repo in GitHub.
  • Clone the assignment under your computer’s Homework directory. You do not need to create a new directory for it as it will create its own directory.
  • Cloning will create a new folder with any necessary sub-folders for analysis and data in your assignment.
  • Your repo will be a fork of the baseline homework repo (which you can’t see) and not a copy. This means that if the instructor updates the baseline repo for an assignment, e.g., to correct an error, you can use git pull to get the update.

2.10.2.2 Submitting Homework

All completed assignments should be submitted via pushing the final repo with .qmd and knitted HTML files and any data to GitHub.

  • Submit a comment on Canvas when completed so it is clear the assignment is ready for grading.
  • Save, stage, commit after each part of a question is completed and push after each major question is completed.
  • You may also submit pull requests to the instructor while you are working on them to submit questions.
  • GitHub tracks all commits and pushes so the instructor can see how you progress on individual assignments.
  • The instructor can also see all the commits and pushes on Group projects to include scope and authors.

You may get feedback on GitHub but all grades will be recorded in Canvas.

2.10.2.3 Exercise

  • There is a link to a hw00_student_info assignment in Canvas.
  • Accept the link.
  • Navigate to your Homework directory in your terminal window.
  • Clone the assignment repo.
    • You should see an analysis folder and a data folder with a .csv file with possible data sources in it.
  • Edit the .qmd file to change the name in the header.
  • Save, Stage(add), Commit, and push back to GitHub.

2.11 Summary

2.11.1 Best Practices

Do not nest repos or R projects. Always check with git status before creating a new repo.

The best way to learn Git and GitHub is to use it over and over and over so you build “muscle memory” into your workflow.

  • Use git add -u to reduce typing when you are just updating the same set of files.
  • Commit frequently with meaningful messages.
  • On the homework, commit after you complete each part of a question.
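
To see what git add -u does and does not stage, here is a small sketch in a throwaway repo. The file names and identity are hypothetical. The point is that -u picks up changes to files Git already tracks and leaves brand-new files alone, so new files still need an explicit git add.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.name  "Student Example"        # placeholder identity
git config user.email "student@example.com"

echo "one" > tracked.qmd
git add tracked.qmd
git commit -q -m "Track tracked.qmd"

echo "two" >> tracked.qmd        # modify a file Git already tracks
echo "new" > untracked.qmd       # create a file Git has never seen

git add -u                       # stage changes to tracked files only
staged=$(git diff --cached --name-only)                 # what is staged now
untracked=$(git ls-files --others --exclude-standard)   # what was left alone
echo "staged: $staged"
echo "still untracked: $untracked"
git status --short
```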

Keep your local and remote (GitHub) files in sync to minimize merge conflicts.

When collaborating, communicate who is working on what. Use Issues and Pull Requests for each issue and solve one issue at a time as separate workflows.

Usually, you should only commit plain text files such as .Rmd, .qmd, or HTML, along with any images or data.

  • You can commit R-generated PDFs or .docx but they are easy to reproduce and clog up your repo storage.
    • Git is actually tracking changes in each line in the files, not the whole files themselves. This is very efficient for plain text files.
    • However, for non-plain text (binary) files like PDFs and .docx, the whole file is changed when you change a single line, so Git ends up saving the whole file each time you change it.

Do Not Push or Upload sensitive data or information to GitHub. This could include personal identifying information, passwords, SSH private keys, etc.
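
One way to make the last two practices harder to violate by accident is a .gitignore file. The sketch below is an illustration only: the patterns and file names are hypothetical examples, not a course requirement, and should be adapted to the files in your own project.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"

# Hypothetical patterns; adjust them to your own project.
cat > .gitignore <<'EOF'
# generated documents that are easy to re-render
*.pdf
*.docx
# files that may hold secrets
.env
*.pem
EOF

echo "API_KEY=not-a-real-key" > .env     # pretend secret
echo "rendered output" > report.pdf      # pretend generated PDF
echo "analysis text" > analysis.qmd      # the file we do want to track

git add .                                # ignored files are skipped
git check-ignore -v .env report.pdf      # shows which pattern ignores each file
git status --short                       # only .gitignore and analysis.qmd are staged
```

Note that .gitignore only keeps untracked files from being added. It does not remove a file that has already been committed, so check what is staged before your first commit.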

Git and GitHub have extensive help but there are also many references and stackoverflow or other posts about issues so feel free to research any issues or ask your instructor/TA for help.

2.11.2 Parting Wisdom from XKCD

 


  1. graphic from Mark Lodato