3  Tidyverse Review

Published

November 6, 2024

Keywords

Tidyverse, stringr, regex, lubridate, forcats

3.1 Introduction

3.1.1 Learning Outcomes

  • Manipulate strings with the {stringr} package.
  • Employ regular expressions (REGEX or regex) with strings.
  • Manipulate dates and times using the {lubridate} package.
  • Manipulate factors using the {forcats} package.
  • Employ current capabilities of Tidyverse packages to deliver data science solutions.
  • Employ literate programming as part of responsible data science.

3.1.2 References:

3.1.2.1 Other References

3.2 Strings

  • In R, strings (also called “characters”) are created and displayed within quotes.
  • Anything within quotes is a string, even numbers or punctuation.
  • You can have a vector of strings and use the subset operator [...].
x <- c("I", "am a", "4", "string", "vector.")
x[2:3]
[1] "am a" "4"   
  • You can use special characters in strings using the “escape character” \ which indicates the character after the backslash is to be treated as a special character.
  • For example, if you want to put a quotation mark in a string, you must “escape” the quotation mark with a backslash.
x <- "As Tolkien said, \"Not all those who wander are lost\""
writeLines(x) ## show the string as output
As Tolkien said, "Not all those who wander are lost"
print(x)      ## show the string as a character value
[1] "As Tolkien said, \"Not all those who wander are lost\""
  • "\n" means create a new line.
x <- "Not all those\nwho wander are lost."
writeLines(x)
Not all those
who wander are lost.
  • If you want to indicate a \, you must escape it, so use two \\.

3.2.1 The {stringr} Package

  • The {stringr} package has functions to make manipulating strings easier - (more user-friendly than base R’s grep() and gsub()).
  • {stringr} loads with the {tidyverse}.
library(tidyverse)
  • All {stringr} functions begin with str_ so in RStudio you can press tab after typing str_ to show a list of possible string manipulation functions.
  • The {stringr} functions we will cover today are often useful in cleaning or analyzing data from the web or during text analysis.
  • Many {stringr} functions have a pattern = argument where the pattern describes a string of interest that one is trying to match.
    • The pattern is often described using a “regular expression”. We will review these later.
    • Simple patterns can use just letters and operators such as | for OR.

3.2.1.1 Detect Matches

  • str_detect(string, pattern): Returns a logical value or vector: TRUE for a match to the pattern and FALSE if no match is found.
    • The negate = argument reverses the result.
  • Useful for creating new variables or filtering rows.
data("presidential") ## 12 former US presidents
presidential  |>
  mutate(has_a_e = str_detect(name, "e|a"))
# A tibble: 12 × 5
   name       start      end        party      has_a_e
   <chr>      <date>     <date>     <chr>      <lgl>  
 1 Eisenhower 1953-01-20 1961-01-20 Republican TRUE   
 2 Kennedy    1961-01-20 1963-11-22 Democratic TRUE   
 3 Johnson    1963-11-22 1969-01-20 Democratic FALSE  
 4 Nixon      1969-01-20 1974-08-09 Republican FALSE  
 5 Ford       1974-08-09 1977-01-20 Republican FALSE  
 6 Carter     1977-01-20 1981-01-20 Democratic TRUE   
 7 Reagan     1981-01-20 1989-01-20 Republican TRUE   
 8 Bush       1989-01-20 1993-01-20 Republican FALSE  
 9 Clinton    1993-01-20 2001-01-20 Democratic FALSE  
10 Bush       2001-01-20 2009-01-20 Republican FALSE  
11 Obama      2009-01-20 2017-01-20 Democratic TRUE   
12 Trump      2017-01-20 2021-01-20 Republican FALSE  
presidential  |>
  filter(str_detect(name, "e|a"))
# A tibble: 5 × 4
  name       start      end        party     
  <chr>      <date>     <date>     <chr>     
1 Eisenhower 1953-01-20 1961-01-20 Republican
2 Kennedy    1961-01-20 1963-11-22 Democratic
3 Carter     1977-01-20 1981-01-20 Democratic
4 Reagan     1981-01-20 1989-01-20 Republican
5 Obama      2009-01-20 2017-01-20 Democratic
presidential  |>
  filter(str_detect(name, "e|a", negate = TRUE))
# A tibble: 7 × 4
  name    start      end        party     
  <chr>   <date>     <date>     <chr>     
1 Johnson 1963-11-22 1969-01-20 Democratic
2 Nixon   1969-01-20 1974-08-09 Republican
3 Ford    1974-08-09 1977-01-20 Republican
4 Bush    1989-01-20 1993-01-20 Republican
5 Clinton 1993-01-20 2001-01-20 Democratic
6 Bush    2001-01-20 2009-01-20 Republican
7 Trump   2017-01-20 2021-01-20 Republican
  • str_count(string, pattern): Counts the number of matches within a string.
tibble(name = presidential$name,
   count = str_count(presidential$name, "a|e"))
# A tibble: 12 × 2
   name       count
   <chr>      <int>
 1 Eisenhower     2
 2 Kennedy        2
 3 Johnson        0
 4 Nixon          0
 5 Ford           0
 6 Carter         2
 7 Reagan         3
 8 Bush           0
 9 Clinton        0
10 Bush           0
11 Obama          2
12 Trump          0
  • str_count() counts non-overlapping matches.
str_count("abababa", "aba")
[1] 2

3.2.1.2 Subset Strings

  • str_sub(string, start, end) extracts the substring between the location of two characters (inclusive).
  • Default values for start and end are 1L and -1L.
x <- "The Road goes ever on and on"
str_sub(x)
[1] "The Road goes ever on and on"
str_sub(x, start = 3)
[1] "e Road goes ever on and on"
str_sub(x, end = 6)
[1] "The Ro"
str_sub(x, start = 3, end = 6)
[1] "e Ro"
  • You can also use str_sub() to replace substrings by assignment.
str_sub(x, start = 3, end = 6) <- "Tolkien"
x
[1] "ThTolkienad goes ever on and on"
  • str_subset(string, pattern, negate) returns the strings where there is a match.
  • negate = TRUE returns those without a match.
str_subset(presidential$name, "(e|a)")
[1] "Eisenhower" "Kennedy"    "Carter"     "Reagan"     "Obama"     
str_subset(presidential$name, "(e|a)", negate = TRUE)
[1] "Johnson" "Nixon"   "Ford"    "Bush"    "Clinton" "Bush"    "Trump"  

3.2.1.3 Manage Lengths

  • str_length(string) returns the count of “characters” (including spaces and punctuation).
  • Special characters, starting with the escape symbol \, only count as 1 character.
str_length("My apple tastes sweet")
[1] 21
str_length("My apple tastes sweet.")
[1] 22
str_length("My apple tastes sweet  .")
[1] 24
str_length("My apple tastes \nsweet  .")
[1] 25
  • str_trim(string, side) removes (potentially troublesome) invisible white space on the beginning and/or the end of a string.
  • str_squish(string) removes invisible spaces at the beginning and end, and also collapses multiple spaces in the middle of a string into one space.
  • Extra spaces can happen a lot with human-entered data.
x <- " This is a string with   extra whitespaces   in the beginning, middle and end. "
x
[1] " This is a string with   extra whitespaces   in the beginning, middle and end. "
str_trim(x)
[1] "This is a string with   extra whitespaces   in the beginning, middle and end."
str_squish(x)
[1] "This is a string with extra whitespaces in the beginning, middle and end."

3.2.1.4 Mutate Strings - Change the Characters in a String

3.2.1.4.1 Extracting/Assigning Substrings (Parts of a String)
  • As seen in Section 3.2.1.2, str_sub() can mutate a string by extracting a substring located between two characters (inclusive) or assigning new values to a substring.

  • str_replace() and str_replace_all() will replace matched patterns with the provided strings.

    • str_replace(string, pattern, replacement) replaces just the first match to the pattern in a string with the replacement string
    • str_replace_all(string, pattern, replacement) replaces all matched patterns.
  • Example: we want to get rid of all punctuation in the following phrase (the | means “OR”). Notice we have to escape the period . and ! with \\.

farewell <- c("Farewell we call to hearth and hall! \nThough wind may blow and rain may fall,\nWe must away ere break of day\nFar over wood and mountain tall.")
writeLines(farewell)
Farewell we call to hearth and hall! 
Though wind may blow and rain may fall,
We must away ere break of day
Far over wood and mountain tall.
farewell |>
  str_replace_all("\\.|\\!|,", "")
[1] "Farewell we call to hearth and hall \nThough wind may blow and rain may fall\nWe must away ere break of day\nFar over wood and mountain tall"
  • str_to_lower(string) and str_to_upper(string) convert all letters to lower or capital case.
  • str_to_sentence converts all words and letters to sentence case, even acronyms.
x2 <- "Deeds will Not be LESS vaLiant in the USA because THey are unpraised."
x2
[1] "Deeds will Not be LESS vaLiant in the USA because THey are unpraised."
str_to_lower(x2)
[1] "deeds will not be less valiant in the usa because they are unpraised."
str_to_upper(x2)
[1] "DEEDS WILL NOT BE LESS VALIANT IN THE USA BECAUSE THEY ARE UNPRAISED."
str_to_sentence(x2)
[1] "Deeds will not be less valiant in the usa because they are unpraised."
## fixing an acronym with str_replace()
str_to_sentence(x2) |>
  str_replace(" usa ", " USA ")
[1] "Deeds will not be less valiant in the USA because they are unpraised."

3.2.1.5 Join and Split Strings

3.2.1.5.1 Joining Strings
  • Combine/concatenate strings with str_c(..., sep = "", collapse = NULL).
x <- "Faithless is he that says"
y <- "farewell when the road darkens."
str_c(x, y)
[1] "Faithless is he that saysfarewell when the road darkens."
  • The default is to separate strings by nothing. Use the sep argument to change the separator.
str_c(x, y, sep = " ")
[1] "Faithless is he that says farewell when the road darkens."
  • If you provide str_c() a vector of arguments, it will vectorize the combining to include recycling the shorter argument.
  • You can provide a collapse = argument to convert to a single string from a vector.
  • Also see help for str_flatten().
x <- c("Short", "cuts", "make", "long", "delays.")
x
[1] "Short"   "cuts"    "make"    "long"    "delays."
str_c(x, "LOTR", sep = "-") # vectorized with recycling
[1] "Short-LOTR"   "cuts-LOTR"    "make-LOTR"    "long-LOTR"    "delays.-LOTR"
str_c(x, collapse = " ")   # single string
[1] "Short cuts make long delays."
str_c(x, "LOTR", collapse = " ")
[1] "ShortLOTR cutsLOTR makeLOTR longLOTR delays.LOTR"
  • Combining with NA results in NA:
str_c("Faithless is he that says", NA)
[1] NA

3.2.1.6 Joining Strings with Values from Variables or Functions

  • str_glue(..., .sep) is a wrapper around the glue::glue() function which allows you to “interpolate” strings and variables, e.g., combine a string with the dynamic value of a variable.
    • Notice the argument is .sep not sep
  • This can be useful in in-line code and text elements in graphics as an alternative to paste() and paste0().
str_glue("The mtcars data frame has", nrow(mtcars), "rows.", .sep = " ")
The mtcars data frame has 32 rows.

3.2.1.7 Split a String into Multiple Strings

  • str_split() will split a string based on a character we choose; useful with dates or full names at times.
  • Can help simplify working with long strings by breaking them into shorter pieces for analysis.
  • By default, the output is a list. Have to be careful when the string could split into different numbers of elements.
str_split(nycflights13::airports$name, "\\s") |> head()
[[1]]
[1] "Lansdowne" "Airport"  

[[2]]
[1] "Moton"     "Field"     "Municipal" "Airport"  

[[3]]
[1] "Schaumburg" "Regional"  

[[4]]
[1] "Randall" "Airport"

[[5]]
[1] "Jekyll"  "Island"  "Airport"

[[6]]
[1] "Elizabethton" "Municipal"    "Airport"     

3.2.1.8 Split a Data Frame Column of Type Character with separate()

  • The separate() function, from the {tidyr} package, is useful for splitting all the strings in a data frame column into multiple columns.

  • You can use regex or a vector of positions to turn a single column into multiple columns.

  • You need to specify at least three arguments:

    1. The column you want to separate that has two (or more) variables,
    2. The character vector of the names of the new variables, and
    3. The character or numeric positions by which to separate out the new variables from the current column.
  • Consider {tidyr}’s table three which has cases and country population combined.

head(table3) ## from tidyr package
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
table3  |>
  separate(rate, into = c("cases", "population"),
           sep = "/") |>
  head()
# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583
  • See also tidyr::extract().

The separate() function has recently been superseded by three new functions in {tidyr}.

  • separate_wider_delim() splits by delimiter.
  • separate_wider_position() splits at fixed widths.
  • separate_wider_regex() splits with regular expression matches.
separate_wider_delim(table3, rate, 
                     names = c("cases", "population"),
                     delim = "/")
# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

3.2.2 Regular Expressions

3.2.2.1 Intro

  • Regular expressions (regex or regexp) are a syntax for pattern matching in strings.
  • Regex syntax is used by programmers in many computer languages. There are some slight variations but most uses are the same.
  • Wherever there is a pattern argument in a {stringr} function, you can use regex (to extract strings, get a logical if there is a match, etc.).

3.2.2.2 Regex Match Characters and Escaping with the Backslash \

  • Regex syntax includes special “match characters”, e.g., ., ?, (, ), ", and \.

  • These must be escaped using \ if you want to match their normal value.

  • Then you must also escape the \ so always use two \\ in front of a match character as stated on the Cheat Sheet.

  • Basic usage: find exact match(es) of a string pattern and replace them with the designated replacement string.

x <- "Ho! Ho! Ho! to the bottle I go to heal my heart and drown my woe."
str_replace_all(x, "hea", "XX")
[1] "Ho! Ho! Ho! to the bottle I go to XXl my XXrt and drown my woe."
  • A period “.matches any character to include spaces and punctuation.
str_replace_all(x, "hea.", "XX")
[1] "Ho! Ho! Ho! to the bottle I go to XX my XXt and drown my woe."
  • You “escape” a period with two backslashes “\\.” to match actual periods in the text.
  • You can also use match characters in the replacement string.
str_replace_all(x, ".", "X") ### Matches everything
[1] "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
str_replace_all(x, "\\.", "X") ### Matches the only period
[1] "Ho! Ho! Ho! to the bottle I go to heal my heart and drown my woeX"
str_replace_all(x, "\\.", "\\?") ### Matches the period and replaces with ?
[1] "Ho! Ho! Ho! to the bottle I go to heal my heart and drown my woe?"
str_replace_all(x, "\\!", "\\,") ### Matches the period and replaces with ?
[1] "Ho, Ho, Ho, to the bottle I go to heal my heart and drown my woe."
  • To match an escaped backslash \\, you need four backslashes (to escape the escape twice).
y <- "Rain and\\or snow may fall and\\or blow"
writeLines(y)
Rain and\or snow may fall and\or blow
y
[1] "Rain and\\or snow may fall and\\or blow"
str_replace_all(y, "\\\\", "XX")
[1] "Rain andXXor snow may fall andXXor blow"
  • Replace "back" and "lack" with "foo?" in the following.
x <- "but better is Beer if drink we lack, and Water Hot poured down the back."
str_replace_all(x, ".ack", "foo?")
[1] "but better is Beer if drink we foo?, and Water Hot poured down the foo?."

3.2.2.3 Other Regex Match Characters Serve as “Wild Cards”

  • These each have a negated value by using the capital letter in the syntax.
  • We’ll use this character vector for practice:
x <- c("Abba: 555-1234", "Ann: 555-0987", "Andy_R: 555-7654")
  • \\d matches any digit, \\D any non-digit.
str_replace(x, "\\d\\d\\d-\\d\\d\\d\\d", "XXX-XXXX")
[1] "Abba: XXX-XXXX"   "Ann: XXX-XXXX"    "Andy_R: XXX-XXXX"
str_replace_all(x, "\\D", "X")
[1] "XXXXXX555X1234"   "XXXXX555X0987"    "XXXXXXXX555X7654"
  • \\s matches any white space (e.g. space, tab, newline), \\S any non-white space.
x
[1] "Abba: 555-1234"   "Ann: 555-0987"    "Andy_R: 555-7654"
str_replace(x, "\\s", "X")
[1] "Abba:X555-1234"   "Ann:X555-0987"    "Andy_R:X555-7654"
str_replace_all(x, "\\S", "X")
[1] "XXXXX XXXXXXXX"   "XXXX XXXXXXXX"    "XXXXXXX XXXXXXXX"
  • \\w matches any word character (letter, digit, or underscore), \\W any non-word character.
x
[1] "Abba: 555-1234"   "Ann: 555-0987"    "Andy_R: 555-7654"
str_replace_all(x, "\\w", "X")
[1] "XXXX: XXX-XXXX"   "XXX: XXX-XXXX"    "XXXXXX: XXX-XXXX"
str_replace_all(x, "\\W", "X")
[1] "AbbaXX555X1234"   "AnnXX555X0987"    "Andy_RXX555X7654"
  • \\b matches word boundaries: a place where there is a non-word character followed by a word character or vice versa.
  • This includes the beginning of a string, the end of a string, or a substring with non-word characters on either end.
  • This is helpful for finding whole words regardless of where they appear.
x
[1] "Abba: 555-1234"   "Ann: 555-0987"    "Andy_R: 555-7654"
str_replace_all(x, ".\\b", "X")
[1] "AbbX:X55XX123X"   "AnX:X55XX098X"    "Andy_X:X55XX765X"
str_replace_all(x, "\\b.", "X")
[1] "XbbaX X55XX234"   "XnnX X55XX987"    "Xndy_RX X55XX654"
str_replace_all(x, "\\b...\\b", "X")
[1] "Abba: X-1234"   "X: X-0987"      "Andy_R: X-7654"
  • Use the {stringr} cheat sheet to see other helpful match characters such as [:alpha:], [:digit:], or [:punct:].
x
[1] "Abba: 555-1234"   "Ann: 555-0987"    "Andy_R: 555-7654"
str_replace_all(x, "[:alpha:]", "X")
[1] "XXXX: 555-1234"   "XXX: 555-0987"    "XXXX_X: 555-7654"
str_replace_all(x, "[:digit:]", "X")
[1] "Abba: XXX-XXXX"   "Ann: XXX-XXXX"    "Andy_R: XXX-XXXX"
str_replace_all(x, "[:punct:]", "X")
[1] "AbbaX 555X1234"   "AnnX 555X0987"    "AndyXRX 555X7654"

3.2.2.4 Alternates in Matches

  • [abc]: matches any of the characters inside the [], here, a, or b, or c.
str_replace(x, "A[bn][bn]", "XXXX")
[1] "XXXXa: 555-1234"  "XXXX: 555-0987"   "Andy_R: 555-7654"
  • Using ^ inside the [ ] is negation - so the match cannot have any of the characters inside the [ ].
  • [^abc]: matches anything except a, b, or c.
Note

This is a different use of ^ then we will see shortly since it is inside the [ ].

str_replace(x, "A[^b]", "XX")
[1] "Abba: 555-1234"   "XXn: 555-0987"    "XXdy_R: 555-7654"
  • abc|xyz matches either abc OR xyz. This is called “alternation”
  • You can use parentheses to control where the alternation occurs.
    • a(bc|xy)z matches either abcz or axyz.
str_replace(x, "An(n|dy)", "XXXX")
[1] "Abba: 555-1234"   "XXXX: 555-0987"   "XXXX_R: 555-7654"
  • To ignore case, place a (?i) at the beginning of the regex.
str_replace("AB", "ab", "X")
[1] "AB"
str_replace("AB", "(?i)ab", "X")
[1] "X"

3.2.2.5 Anchoring the Regex (so Search Starts at the Beginning or at the End of a String)

Anchoring the regex limits matching to start with the first character (left-to-right) in the string or backwards from the last character at the end of the string (working-right to left) or both.

  • ^ only matches patters starting at the beginning of a string.
  • $ only matches patters starting at the end of a string.
  • Use both to only match a complete string.
x <- c("apple pie", "apple", "apple cake", "baked apple")
str_replace_all(x, "^apple", "XX")
[1] "XX pie"      "XX"          "XX cake"     "baked apple"
str_replace_all(x, "apple$", "XX")
[1] "apple pie"  "XX"         "apple cake" "baked XX"  
str_replace_all(x, "^apple$", "XX")
[1] "apple pie"   "XX"          "apple cake"  "baked apple"

3.2.2.6 Using Quantifiers to Match Patterns: ?, +, *, {n}, {n,},{0,n}, {n,m}

Quantifiers provide flexibility in not having to always match an exact number of characters

  • They also help shorten regex to be more interpretable.
  • To match a pattern that occurs multiple times in a row, use:
    • ?: 0 or 1
    • +: 1 or more
    • *: 0 or more
x <- c("A", "AA", "AAA", "AAAA", "B", "BB")
x
[1] "A"    "AA"   "AAA"  "AAAA" "B"    "BB"  
str_replace_all(x, "^A?", "X") ## starting with 0 or 1 A
[1] "X"    "XA"   "XAA"  "XAAA" "XB"   "XBB" 
str_replace_all(x, "^A+", "X") ## starting with 1 or more A's
[1] "X"  "X"  "X"  "X"  "B"  "BB"
str_replace_all(x, "^A*", "X") ## starting with 0 or more A's
[1] "X"   "X"   "X"   "X"   "XB"  "XBB"
  • To control exactly how many repetitions are in a match, use:
    • {n}: exactly n.
    • {n,}: n or more.
    • {0,m}: at most m.
    • {n,m}: between n and m.

Regex are “greedy” by default, matching the longest string possible and move on by default.

  • You can make the matches “lazy”, matching the shortest string possible, by putting a ? after the pattern:
x
[1] "A"    "AA"   "AAA"  "AAAA" "B"    "BB"  
str_replace_all(x, "A{2}", "X")   ## exactly 2 A's
[1] "A"  "X"  "XA" "XX" "B"  "BB"
str_replace_all(x, "A{2,}", "X")  ## 2 or more A's
[1] "A"  "X"  "X"  "X"  "B"  "BB"
str_replace_all(x, "A{0,2}", "X") ## 0 to 2 A's
[1] "XX"    "XX"    "XXX"   "XXX"   "XBX"   "XBXBX"
str_replace_all(x, "A{3,4}", "X") ## 3 to 4 A's
[1] "A"  "AA" "X"  "X"  "B"  "BB"

3.2.2.7 Grouping and Back-References

  • Parentheses can also create a group.
  • R assigns each group in a regex match string an index number based on order of its appearance in the regex pattern.
  • You can use a group’s index number as a reference back to the what was found in the match - a back-reference.
  • Use the group index number preceded by \\ to represent the group in a matching pattern and/or a replacement pattern.
    • Use \\1 to refer to what was matched by the first group, \\2 by the second group, etc. …
  • The back-reference substitutes the actual values that were matched, not the original regex itself into the pattern.
    • If the first grouped regex pattern is (a[bd]c) and it matches to an adc, then the back reference \\1 will substitute adc in any pattern where it is used.
  • What is happening in the following?
str_replace("cococola", "(..)", "pepsi")
[1] "pepsicocola"
str_replace("cococola", "(..)\\1", "pepsi")
[1] "pepsicola"
str_replace("cococola", "(..)\\1\\1", "pepsi")
[1] "pepsila"
str_replace("banana", "([aeiou][^aeiou])\\1", "XX")
[1] "bXXa"
  • You can use back-references in both the match AND the replacement.
str_replace("Mississippi", "(i)(s)\\2\\1\\2\\2", "X")
[1] "MXippi"
#finds "ississ" and replaces with XXXXXX

str_replace("Mississippi", "(i)(s)\\2\\1\\2\\2", "X\\1")
[1] "MXiippi"
## finds "ississ" and replaces it with "xi"

str_replace("Mississippi", "(i)(s)\\2\\1\\2\\2", "X\\2")
[1] "MXsippi"
## finds "ississ" and replaces it with "Xs"

str_replace("Mississippi","(i)(s)\\2\\1\\2\\2", "\\2\\1\\1\\2\\1\\1")
[1] "Msiisiiippi"
## finds "ississ" and replaces with "siisii"

3.2.2.8 Look Arounds

  • Most regex engines allow you to use characters either directly before or after a string to determine if there is a match but Not extract/replace/ or otherwise affect those characters.
  • This is sometimes referred to as “lookbehind” (what came before the string) or “lookahead” (after the string - what is coming next).
  • In R, it can be called “preceded by” or “followed by” (and their negations created with !).
  • You can use ( ) to combine characters in the look around.
  • Will reset after each match (a feature) so best to experiment to check results.
x <- c("Chapter 1", "Chapter I", "chapter of", "chapter's accent")
x
[1] "Chapter 1"        "Chapter I"        "chapter of"       "chapter's accent"
str_replace_all(x, "ter(?=( 1))", "Xx") ## ter followed by " 1"
[1] "ChapXx 1"         "Chapter I"        "chapter of"       "chapter's accent"
str_replace_all(x, "ter(?!( 1))", "Xx") ## ter not followed by " 1"
[1] "Chapter 1"       "ChapXx I"        "chapXx of"       "chapXx's accent"
str_replace_all(x, "(?<=(Ch))apt", "Xx") ## apt preceded by "Ch" - replace "apt"
[1] "ChXxer 1"         "ChXxer I"         "chapter of"       "chapter's accent"
str_replace_all(x, "(?<!(Ch))apt", "Xx") ## apt not not preceded by "Ch"
[1] "Chapter 1"       "Chapter I"       "chXxer of"       "chXxer's accent"

3.2.3 {stringr} Modifier Functions

  • {stringr} has several “modifier” functions that allow you to handle special cases and override the defaults.
  • We will use the regex() function in later classes. It allows you to substitute arguments for regex characters.
  • See ?stringr::modifiers or {stringr} modifiers
writeLines("The Cat\nin the Hat")
The Cat
in the Hat
str_extract_all("The Cat\nin the Hat", "[a-z]+")
[[1]]
[1] "he"  "at"  "in"  "the" "at" 
str_extract_all("The Cat\nin the Hat", regex("[a-z]+", ignore_case = TRUE))
[[1]]
[1] "The" "Cat" "in"  "the" "Hat"
str_extract_all("The Cat\nin the Hat", regex("[a-z]$+", multiline = FALSE))
[[1]]
[1] "t"
str_extract_all("The Cat\nin the Hat", regex("[a-z]$+", multiline = TRUE))
[[1]]
[1] "t" "t"

3.3 Dates and Times

3.3.1 The {lubridate} Package

  • The {lubridate} package has many functions to simplify working with dates and times.
  • The {lubridate} package is now part of the {tidyverse}, so you no longer need to load it separately.
library(tidyverse)

3.3.2 Three Main Classes for Date/Time Data

  1. Date: for just the date.
  2. POSIXct: for both the date and the time (with Time Zone).
    • “POSIXct” stands for “Portable Operating System Interface - Calendar Time”.
    • It is a part of a standardized system (based on UNIX) of representing time across many computing computing platforms.
  3. hms: for just the time - based on the hms (hours, minutes, and seconds) R package.

3.3.3 Parsing Dates and Times

3.3.3.1 Parsing Dates and Times using {readr}

  • You may recall the {readr} functions parse_date(), parse_datetime(), and parse_time() all parse a date/date-time/time based on a specification you had to carefully describe in a string.
x <- parse_date("10/11/2020", format = "%m/%d/%Y")
x
[1] "2020-10-11"
class(x)
[1] "Date"
y <- parse_datetime("10/11/2020 11:59:20", format = "%m/%d/%Y %H:%M:%S")
y
[1] "2020-10-11 11:59:20 UTC"
class(y)
[1] "POSIXct" "POSIXt" 
z <- parse_time("11:59:20", "%H:%M:%S")
z
11:59:20
class(z)
[1] "hms"      "difftime"

3.3.3.2 Parsing Dates and Times is Much Easier with {lubridate}

  • {lubridate} has many “helper” functions which parse dates/times automatically based on the name of the function.

  • The function name specifies the order of the components: year, month, day, hours, minutes, and seconds.

  • The help page for ymd shows multiple functions to parse dates with different sequences of year, month and day,

  • Only the order of year, month, and day matters.

ymd(c("2011/01-10", "2011-01/10", "20110110"))
[1] "2011-01-10" "2011-01-10" "2011-01-10"
mdy(c("01/10/2011", "01 adsl; 10 df 2011", "January 10, 2011"))
[1] "2011-01-10" "2011-01-10" "2011-01-10"
  • The help page for ms shows multiple functions to parse periods based on hours, minutes and seconds
  • Only the order of hours, minutes, and seconds matter.
hms(c("10:40:10", "10 40 10"))
[1] "10H 40M 10S" "10H 40M 10S"
  • Note: ms(), hm(), and hms() don’t recognize “-” as a separator because they treat it as negative time. So use parse_time() here.
    ms("10-10")
[1] "10M -10S"
  • The help page for ymd_hms shows multiple functions to parse date-times.
ymd_hms("2021,01,17,12,12,12", tz = Sys.timezone())
[1] "2021-01-17 12:12:12 EST"

3.3.4 Creating Date-Time Values from Individual Components

  • Use make_date() or make_datetime() to create dates and date-times if you have a vector of years, months, days, hours, minutes, and/or seconds.
make_date(year = 1981, month = 6, day = 25)
[1] "1981-06-25"
make_datetime(year = 1972, month = 2, day = 22, hour = 10, min = 9, sec = 01)
[1] "1972-02-22 10:09:01 UTC"
  • nycflights13 example:
library(nycflights13)
data("flights")
head(flights)
# A tibble: 6 × 19
   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
1  2013     1     1      517            515         2      830            819
2  2013     1     1      533            529         4      850            830
3  2013     1     1      542            540         2      923            850
4  2013     1     1      544            545        -1     1004           1022
5  2013     1     1      554            600        -6      812            837
6  2013     1     1      554            558        -4      740            728
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
  • You can see variables for the year, month, day, hour, and minute of the scheduled departure time.
flights |>
   mutate(datetime = make_datetime(year   = year,
                                  month  = month,
                                  day    = day,
                                  hour   = hour,
                                  min = minute)) ->
  flights
select(flights, datetime)
# A tibble: 336,776 × 1
   datetime           
   <dttm>             
 1 2013-01-01 05:15:00
 2 2013-01-01 05:29:00
 3 2013-01-01 05:40:00
 4 2013-01-01 05:45:00
 5 2013-01-01 06:00:00
 6 2013-01-01 05:58:00
 7 2013-01-01 06:00:00
 8 2013-01-01 06:00:00
 9 2013-01-01 06:00:00
10 2013-01-01 06:00:00
# ℹ 336,766 more rows
  • Having it in the date-time format makes it easier to plot.
ggplot(flights, aes(x = datetime)) +
  geom_freqpoly(bins = 365)

  • It also makes it easier to filter by date usingas_date() and ymd.
flights |>
  filter(as_date(datetime) == ymd(20130704)) |>
  ggplot(aes(x = datetime)) +
  geom_freqpoly(binwidth = 600)

  • as_datetime() tries to coerce an object to a POSIXct object.

3.3.5 Getting/Setting Components of a Date-Time

3.3.5.1 Getting (extracting) the Components of a Date-Time

  • year() extracts the year.
  • month() extracts the month.
  • week() extracts the week.
  • mday() extracts the day of the month (1, 2, 3, …).
  • wday() extracts the day of the week (Saturday, Sunday, Monday …).
  • yday() extracts the day of the year (1, 2, 3, …).
  • hour() extracts the hour.
  • minute() extract the minute.
  • second() extracts the second.
ddat <- mdy_hms("01/02/1970 03:51:44")
ddat
[1] "1970-01-02 03:51:44 UTC"
year(ddat)
[1] 1970
month(ddat, label = TRUE)
[1] Jan
12 Levels: Jan < Feb < Mar < Apr < May < Jun < Jul < Aug < Sep < ... < Dec
week(ddat)
[1] 1
mday(ddat)
[1] 2
wday(ddat, label = TRUE)
[1] Fri
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
yday(ddat)
[1] 2
hour(ddat)
[1] 3
minute(ddat)
[1] 51
second(ddat)
[1] 44

3.3.5.2 Setting the Components of a Date-Time

  • You can use the same functions to set or overwrite components of a date-time.
ddat <- mdy_hms("01/02/1970 03:51:44")
ddat
[1] "1970-01-02 03:51:44 UTC"
year(ddat) <- 1988
ddat
[1] "1988-01-02 03:51:44 UTC"
  • Use round_date() to round to the levels of individual components.
  • There is also floor_date() and ceiling_date().
ddat <- mdy_hms("01/04/1971 03:51:44")
ddat
[1] "1971-01-04 03:51:44 UTC"
round_date(ddat, unit = "year")
[1] "1971-01-01 UTC"
 round_date(ddat, unit = "month")
[1] "1971-01-01 UTC"
 floor_date(ddat, unit = "day")
[1] "1971-01-04 UTC"
 ceiling_date(ddat, unit = "hour")
[1] "1971-01-04 04:00:00 UTC"

3.3.6 Three Different Types of Time Spans

3.3.6.1 Durations - Keeping track of Seconds

  • A duration counts the exact number of seconds in a time span, between two dates.
  • Durations allow exact comparisons with other durations.
    • However, they don’t always match the number of hours, months, or days you might calculate.
  • We can find out Lady Gaga’s “approximate” age using durations.
d1 <- ymd(19860328)
d2 <- today()
agesec <- as.duration(d2 - d1)
agesec
[1] "1218326400s (~38.61 years)"
  • Create durations from years with dyears(), from days with ddays(), etc..
dyears(1)
[1] "31557600s (~1 years)"
ddays(1)
[1] "86400s (~1 days)"
dhours(1)
[1] "3600s (~1 hours)"
dminutes(1)
[1] "60s (~1 minutes)"
dseconds(1)
[1] "1s"
  • You can add durations to date-times, but it always adds seconds.
    • If there is a daylight savings change, you get weird results (add a day but the time is not the same as the time the previous day).
one_pm <- ymd_hms("2016-03-12 13:00:00", tz = "America/New_York")
one_pm
[1] "2016-03-12 13:00:00 EST"
one_pm + ddays(1)
[1] "2016-03-13 14:00:00 EDT"

3.3.6.2 Periods

  • Periods track the change in the human “clock time” between two date-times by ignoring actual physical time spans.
    • Adding a period takes into account daylight savings.
one_pm
[1] "2016-03-12 13:00:00 EST"
one_pm + days(1)
[1] "2016-03-13 13:00:00 EDT"
  • See help for periods to be aware of the unintuitive calculations that can result if not careful.
help("Period-class")
ymd(19800128) + months(1)
ymd(19800131) + months(1)

3.3.6.3 Intervals

  • Intervals record time spans as a sequence of seconds beginning at a specified date.
  • Intervals are like durations, but with an associated start time.
  • Intervals and durations are good for physical processes that don’t care about human clock time.
  • Working with intervals requires some care:
    • Divide an interval by a duration to determine its physical length.
    • Divide an interval by a period to determine its implied length in clock time.

3.3.7 Time Zones

  • Time zones are specified using the tz or tzone arguments (for example, in the call to ymd_hms() above).

    • Time zones are specified by “continent/city.” For example, "America/New_York" and "Europe_Paris".
    • You can see a complete list of time zones with OlsonNames().
    • You can also use tz = Sys.timezone() to use the time zone of the computer you are working on.
  • The default time zone is UTC (which has no daylight savings).

    • Coordinated Universal Time UTC) is the successor time standard to Greenwich Mean Time (GMT) which is a specific time zone.
  • You usually don’t have to worry about timezones unless they loaded incorrectly. For example, R might think it’s UTC even though it should be America/New_York and then forget daylight savings.

  • If a date-time is labeled with the incorrect time zone, use force_tz() to change it.

    d1 <- ymd_hms("20140101 10:01:11")
    d1
[1] "2014-01-01 10:01:11 UTC"
    force_tz(d1, tzone = "America/New_York")
[1] "2014-01-01 10:01:11 EST"
    force_tz(d1, tzone = Sys.timezone())
[1] "2014-01-01 10:01:11 EST"
  • If the timezone is correct, but you want to change it, use with_tz().
    d1
[1] "2014-01-01 10:01:11 UTC"
    with_tz(d1, tzone = "America/New_York")
[1] "2014-01-01 05:01:11 EST"
    with_tz(d1, tzone = Sys.timezone())
[1] "2014-01-01 05:01:11 EST"

3.4 Factors

3.4.1 Introduction

  • R uses the class factor to represent categorical data, i.e., a variable where each observational/experimental unit belongs to one and only one group, category, or level out of a finite set of groups, categories, or levels.
  • Being designated as a factor allows R to provide special treatment to the data
    • Hair Color could be a factor with levels: Auburn, Red, Brown, Black, Blonde, …
    • City sizes: Small, Medium, Large, or Megalopolis
    • Major: Data Science, Mathematics, Statistics, or Other
  • Factors may have words as labels for the levels but they are different from a character variable such as “Brown” or “Small”:
    1. Factors have a fixed (usually small) number of “levels” (possible values)
      • A character variable, e.g., “City_name”, can have many different values.
    2. Factors have a default ordering of the levels.
      • Character variables are only ordered through alpha-numeric sequencing.
      • The default ordering for factors in R is alpha-numeric but you can change it to whatever makes sense for the problem.
      • Ordering determines how categories are arrayed (ordered) in plots.

3.4.2 Factors have Two Different Representations

When we look at the data, we see the labels of the levels be they words or numbers.

  • Use levels(factor_var) to access the labels and the order of the levels.

R actually stores factors as integers from 1, 2, … number of levels.

  • Use as.numeric(factor_var) to access the stored integer values.

Example: An experimental treatment has levels 2, 15, 32, and 51.

x  <- c("51", "32", "15", "2", "32")
xf <- factor(x)
x
[1] "51" "32" "15" "2"  "32"
xf
[1] 51 32 15 2  32
Levels: 15 2 32 51
class(x)
[1] "character"
class(xf)
[1] "factor"
typeof(x)
[1] "character"
typeof(xf)
[1] "integer"
as.numeric(x)
[1] 51 32 15  2 32
as.numeric(xf)
[1] 4 3 1 2 3
Note

As of R 4.1.0, combining factors with c() now creates a new factor with changed levels, not a factor with integrated levels.

fac1 <- factor(c("x1", "x2", "x3", "x4"))
fac2 <- factor(c("y1", "y2", "y3"))
fac1
[1] x1 x2 x3 x4
Levels: x1 x2 x3 x4
fac2
[1] y1 y2 y3
Levels: y1 y2 y3
as.numeric(fac1)
[1] 1 2 3 4
as.numeric(fac2)
[1] 1 2 3
c(fac1, fac2)
[1] x1 x2 x3 x4 y1 y2 y3
Levels: x1 x2 x3 x4 y1 y2 y3
as.numeric(c(fac1, fac2))
[1] 1 2 3 4 5 6 7

3.4.3 Creating Factors

3.4.3.1 Both {readr} and Base R have Functions to Create Factors

You can use readr::parse_factor() or base::factor() to create a factor variable.

  • However, they sort differently by default.
  • readr::parse_factor() returns better warnings.

Let’s create a factor for the months of the year using their standard abbreviations.

library(tidyverse)
  • Note the differences in the order of the levels
month.abb ## a built-in constant for the month abbreviations
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
factor(month.abb)
 [1] Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Levels: Apr Aug Dec Feb Jan Jul Jun Mar May Nov Oct Sep
parse_factor(month.abb)
 [1] Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
as.numeric(factor(month.abb))
 [1]  5  4  8  1  9  7  6  2 12 11 10  3
as.numeric(parse_factor(month.abb))
 [1]  1  2  3  4  5  6  7  8  9 10 11 12

Both functions assume the levels are the unique values of the vector but they take different approaches to how to “order” the levels

  • Base R factor() sets the order of the levels to be the same order returned by sort().
  • readr::parse_factor() sets the order of the levels to be the order of the values as introduced in the data.
    • You can change the order to be alphabetical using the levels = sort(x) argument

To access or extract the levels of a factor (and their order) use the levels() function.

levels(parse_factor(month.abb))
 [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
levels(parse_factor(month.abb, levels = sort(month.abb)))
 [1] "Apr" "Aug" "Dec" "Feb" "Jan" "Jul" "Jun" "Mar" "May" "Nov" "Oct" "Sep"
  • You can count the number of levels with nlevels().
nlevels(parse_factor(month.abb))
[1] 12

3.4.4 The Tidyverse {forcats} Package

The {forcats} package provides functions to make it easier to manipulate factors.

  • All forcats functions begin with fct_. So you can type fct_ then use tab-completion to scroll through the possible functions.
  • {forcats} is loaded with the {tidyverse} package so there is no need to reload it after loading {tidyverse}.

Three common tasks:

  • Change the order of the levels:
    • fct_reorder(): Reorder the levels of a factor based on the values of another variable.
    • fct_relevel(): Reorder the levels of a factor manually.
    • fct_infreq(): Reorder the levels of a factor by the frequency of its values in the data.
  • Change the values of the levels (the labels)
    • fct_recode(): Change the labels of the levels
    • fct_collapse(): Collapsing the least/most frequent values of a factor into “other”.
  • Add or Drop levels
    • fct_expand(): Add levels
    • fct_drop: Drop unused levels
    • fct_explicit_na(): Assign a level to NAs so you can plot them. Check the help and compare to Base R factor()

3.4.4.1 Change the Order of the Levels

You often want to change the order of the levels of a factor to make the data easier for humans to work with or plots easier to interpret.

  • You can reorder the levels manually, e.g., to making working with the data easier.
  • You can reorder the levels based on the value of another variable, e.g., for better plotting.
  • You can change the order by using a {forcats} function inside a mutate()
  • You can change the order by using a {forcats} function inside an aes() call for a plot.

Let’s use the subset of the data from the General Social Survey stored in the gss_cat data set in {forcats}.

data(gss_cat)
glimpse(gss_cat)
Rows: 21,483
Columns: 9
$ year    <int> 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 20…
$ marital <fct> Never married, Divorced, Widowed, Never married, Divorced, Mar…
$ age     <int> 26, 48, 67, 39, 25, 25, 36, 44, 44, 47, 53, 52, 52, 51, 52, 40…
$ race    <fct> White, White, White, White, White, White, White, White, White,…
$ rincome <fct> $8000 to 9999, $8000 to 9999, Not applicable, Not applicable, …
$ partyid <fct> "Ind,near rep", "Not str republican", "Independent", "Ind,near…
$ relig   <fct> Protestant, Protestant, Protestant, Orthodox-christian, None, …
$ denom   <fct> "Southern baptist", "Baptist-dk which", "No denomination", "No…
$ tvhours <int> 12, NA, 2, 4, 1, NA, 3, NA, 0, 3, 2, NA, 1, NA, 1, 7, NA, 3, 3…

Example: TV hours by Religion - default order.

gss_cat |>
  group_by(relig) |>
  summarize(tvhours_mean = mean(tvhours, na.rm = TRUE)) ->
  tvdat

ggplot(tvdat, aes(x = tvhours_mean, y = relig)) +
  geom_point() +
  xlab("Average TV Hours") +
  ylab("Religion")

3.4.4.2 Reordering to be Alphabetical or Based on Order in the Data

If you used factor() or parse_factor() with the default levels you can change the order of the levels.

  • Use fct_relevel() with sort to convert to alpha-numeric order of the levels.
gss_cat |>
  mutate(relig = fct_relevel(relig, sort)) ->
  gss_cat2
  levels(gss_cat2$relig)
 [1] "Buddhism"                "Catholic"               
 [3] "Christian"               "Don't know"             
 [5] "Hinduism"                "Inter-nondenominational"
 [7] "Jewish"                  "Moslem/islam"           
 [9] "Native american"         "No answer"              
[11] "None"                    "Not applicable"         
[13] "Orthodox-christian"      "Other"                  
[15] "Other eastern"           "Protestant"             
  • Use fct_inorder() to convert to the order as in the data.
    • This will stop with an error if there is a level with no rows in the data!
levels(gss_cat$partyid)
 [1] "No answer"          "Don't know"         "Other party"       
 [4] "Strong republican"  "Not str republican" "Ind,near rep"      
 [7] "Independent"        "Ind,near dem"       "Not str democrat"  
[10] "Strong democrat"   
gss_cat |>
  mutate(partyid = fct_inorder(partyid)) ->
gss_cat2
levels(gss_cat2$partyid)
 [1] "Ind,near rep"       "Not str republican" "Independent"       
 [4] "Not str democrat"   "Strong democrat"    "Ind,near dem"      
 [7] "Strong republican"  "Other party"        "No answer"         
[10] "Don't know"        
## gss_cat |>
##   mutate(relig = fct_inorder(relig)) ->
## gss_cat2
gss_cat |>
  count(relig) |> nrow()
[1] 15
nlevels(gss_cat$relig)
[1] 16

3.4.4.3 Reordering the Levels Based on Another Variable

Use fct_reorder(f, x, fun, ..., .desc) to reorder the levels of a factor based on the values of a numeric variable.

  • .f: The factor vector you want to reorder.
  • .x: A numeric vector you want to use to reorder the levels.
  • .fun: A function to apply to x whose result will be used to order the levels of f. + default is fun = median and in ascending order desc = FALSE.
    • Note the . in the argument names.

Let’s reorder relig based on the mean of TV hours we calculated earlier.

  • We will use fct_reorder() inside a mutate() and save the data frame.
levels(tvdat$relig)
 [1] "No answer"               "Don't know"             
 [3] "Inter-nondenominational" "Native american"        
 [5] "Christian"               "Orthodox-christian"     
 [7] "Moslem/islam"            "Other eastern"          
 [9] "Hinduism"                "Buddhism"               
[11] "Other"                   "None"                   
[13] "Jewish"                  "Catholic"               
[15] "Protestant"              "Not applicable"         
tvdat |>
  mutate(relig = fct_reorder(.f = relig, .x = tvhours_mean)) ->
  tvdat
levels(tvdat$relig)
 [1] "Other eastern"           "Hinduism"               
 [3] "Buddhism"                "Orthodox-christian"     
 [5] "Moslem/islam"            "Jewish"                 
 [7] "None"                    "No answer"              
 [9] "Other"                   "Christian"              
[11] "Inter-nondenominational" "Catholic"               
[13] "Protestant"              "Native american"        
[15] "Don't know"              "Not applicable"         
  • Using the .fun = max on the original data.
levels(gss_cat$relig)
 [1] "No answer"               "Don't know"             
 [3] "Inter-nondenominational" "Native american"        
 [5] "Christian"               "Orthodox-christian"     
 [7] "Moslem/islam"            "Other eastern"          
 [9] "Hinduism"                "Buddhism"               
[11] "Other"                   "None"                   
[13] "Jewish"                  "Catholic"               
[15] "Protestant"              "Not applicable"         
gss_cat |>
  mutate(relig = fct_reorder(.f = relig, .x = tvhours, .fun = max, na.rm = TRUE)) ->
  gss_cat2
levels(gss_cat2$relig)
 [1] "Other eastern"           "Hinduism"               
 [3] "Orthodox-christian"      "Buddhism"               
 [5] "No answer"               "Other"                  
 [7] "Jewish"                  "Moslem/islam"           
 [9] "Inter-nondenominational" "Native american"        
[11] "Christian"               "Don't know"             
[13] "None"                    "Catholic"               
[15] "Protestant"              "Not applicable"         
  • The plot now reorders the y-axis according to the new level order.
ggplot(tvdat, aes(x = tvhours_mean, y = relig)) +
  geom_point() +
  ggtitle("Ordered by Mean") +
  xlab("Average TV Hours") +
  ylab("Religion")

gss_cat2 |>
  filter(!is.na(tvhours), !is.na(relig)) |>
ggplot(aes(x = tvhours, y = relig)) +
  geom_boxplot() +
  ggtitle("Ordered by Max") +
  xlab("Average TV Hours") +
  ylab("Religion")

fct_relevel(.f, ..., after = 0L) also allows you to manually move existing levels to any location you choose.

  • Note the use of integer values, e.g., 2L.
  • Any levels not mentioned will be left in existing order.
### Moves "None" to first level
fct_relevel(tvdat$relig, "None") |>
  levels()
 [1] "None"                    "Other eastern"          
 [3] "Hinduism"                "Buddhism"               
 [5] "Orthodox-christian"      "Moslem/islam"           
 [7] "Jewish"                  "No answer"              
 [9] "Other"                   "Christian"              
[11] "Inter-nondenominational" "Catholic"               
[13] "Protestant"              "Native american"        
[15] "Don't know"              "Not applicable"         
### Moves "None" to the third level
fct_relevel(tvdat$relig, "None", after = 2L) |>
  levels()
 [1] "Other eastern"           "Hinduism"               
 [3] "None"                    "Buddhism"               
 [5] "Orthodox-christian"      "Moslem/islam"           
 [7] "Jewish"                  "No answer"              
 [9] "Other"                   "Christian"              
[11] "Inter-nondenominational" "Catholic"               
[13] "Protestant"              "Native american"        
[15] "Don't know"              "Not applicable"         
### Moves "None" to the last level
fct_relevel(tvdat$relig, "None", after = nlevels(tvdat$relig)) |>
  levels()
 [1] "Other eastern"           "Hinduism"               
 [3] "Buddhism"                "Orthodox-christian"     
 [5] "Moslem/islam"            "Jewish"                 
 [7] "No answer"               "Other"                  
 [9] "Christian"               "Inter-nondenominational"
[11] "Catholic"                "Protestant"             
[13] "Native american"         "Don't know"             
[15] "Not applicable"          "None"                   
### Returns a warning because "Cthulhuism" is not a level
fct_relevel(tvdat$relig, "Cthulhuism")
 [1] No answer               Don't know              Inter-nondenominational
 [4] Native american         Christian               Orthodox-christian     
 [7] Moslem/islam            Other eastern           Hinduism               
[10] Buddhism                Other                   None                   
[13] Jewish                  Catholic                Protestant             
16 Levels: Other eastern Hinduism Buddhism Orthodox-christian ... Not applicable

fct_infreq() reorders the levels by frequency of a level in the data from largest to smallest.

fct_count(): will count the number of values in each level

gss_cat |>
  mutate(relig = fct_infreq(relig)) ->
  gss_cat2
levels(gss_cat2$relig)
 [1] "Protestant"              "Catholic"               
 [3] "None"                    "Christian"              
 [5] "Jewish"                  "Other"                  
 [7] "Buddhism"                "Inter-nondenominational"
 [9] "Moslem/islam"            "Orthodox-christian"     
[11] "No answer"               "Hinduism"               
[13] "Other eastern"           "Native american"        
[15] "Don't know"              "Not applicable"         
fct_count(gss_cat2$relig)
# A tibble: 16 × 2
   f                           n
   <fct>                   <int>
 1 Protestant              10846
 2 Catholic                 5124
 3 None                     3523
 4 Christian                 689
 5 Jewish                    388
 6 Other                     224
 7 Buddhism                  147
 8 Inter-nondenominational   109
 9 Moslem/islam              104
10 Orthodox-christian         95
11 No answer                  93
12 Hinduism                   71
13 Other eastern              32
14 Native american            23
15 Don't know                 15
16 Not applicable              0

fct_rev() reverses the order of the factors.

tvdat |>
  mutate(relig = fct_rev(relig)) |>
  ggplot(aes(x = tvhours_mean, y = relig)) +
geom_point() +
xlab("Average TV Hours") +
ylab("Religion")

3.4.4.4 Change Orders of Levels in a Plot

Instead of mutating the data before the plot, you can change the levels inside the aes() call.

tvdat |>
  ggplot(aes(x = tvhours_mean, y = fct_rev(relig))) +
geom_point() +
xlab("Average TV Hours") +
ylab("Religion")

gss_cat |>
  ggplot(aes(x = tvhours, y = fct_rev(fct_infreq(relig)))) +
geom_boxplot() +
xlab("Average TV Hours") +
ylab("Religion")

3.4.5 Change the Values of the Levels (the Labels)

Use fct_recode() to manually change the labels for the levels.

  • The new label level goes on the left of the equals sign.
  • The old label level goes on the right.
  • Think New <- Old just like mutate().
gss_cat |>
  mutate(partyid = fct_recode(partyid,
                      "Republican, strong"    = "Strong republican",
                      "Republican, weak"      = "Not str republican",
                      "Independent, near rep" = "Ind,near rep",
                      "Independent, near dem" = "Ind,near dem",
                      "Democrat, weak"        = "Not str democrat",
                      "Democrat, strong"      = "Strong democrat"
                      )) ->
gss_cat2
levels(gss_cat2$partyid)
 [1] "No answer"             "Don't know"            "Other party"          
 [4] "Republican, strong"    "Republican, weak"      "Independent, near rep"
 [7] "Independent"           "Independent, near dem" "Democrat, weak"       
[10] "Democrat, strong"     

fct_collapse(): combines multiple levels into one level.

fct_count(gss_cat$partyid)
# A tibble: 10 × 2
   f                      n
   <fct>              <int>
 1 No answer            154
 2 Don't know             1
 3 Other party          393
 4 Strong republican   2314
 5 Not str republican  3032
 6 Ind,near rep        1791
 7 Independent         4119
 8 Ind,near dem        2499
 9 Not str democrat    3690
10 Strong democrat     3490
gss_cat |>
  mutate(partyid = fct_collapse(partyid,
    missing = c("No answer", "Don't know"),
    other = "Other party",
    rep = c("Strong republican", "Not str republican"),
    ind = c("Ind,near rep", "Independent", "Ind,near dem"),
    dem = c("Not str democrat", "Strong democrat")
    )) ->
gss_cat2
fct_count(gss_cat2$partyid)
# A tibble: 5 × 2
  f           n
  <fct>   <int>
1 missing   155
2 other     393
3 rep      5346
4 ind      8409
5 dem      7180

3.4.6 Add or Drop Levels

fct_expand(): adds one or more new levels.

fc <- parse_factor(c("A", "B"))
fc
[1] A B
Levels: A B
fct_expand(fc, "C", "Z")
[1] A B
Levels: A B C Z

fct_drop(): removes any levels that are unused.

fc <- parse_factor(c("A", "B"), levels = c("A", "B", "C"))
fc
[1] A B
Levels: A B C
fct_drop(fc)
[1] A B
Levels: A B

Example of using fct_drop to clean up gss_cat$relig so we can use fct_inorder.

gss_cat |>
  mutate(relig = fct_inorder(fct_drop(relig))) -> gss_cat2
levels(gss_cat2$relig)
 [1] "Protestant"              "Orthodox-christian"     
 [3] "None"                    "Christian"              
 [5] "Jewish"                  "Catholic"               
 [7] "Other"                   "Inter-nondenominational"
 [9] "Hinduism"                "Native american"        
[11] "No answer"               "Buddhism"               
[13] "Moslem/islam"            "Other eastern"          
[15] "Don't know"             

fct_explicit_na() allows you to create a level for NAs in case you want to plot them as a category.

3.5 Tidyverse Review

3.5.1 Things to Consider with {readr}

  • Use it!
  • Always use a relative path instead of an absolute path to ensure reproducibility.
  • Use the read_csv() function instead of the read.csv() function for better warnings and speed.
  • Use the na = and other read_csv() arguments to reduce parsing errors.

3.5.2 Things to Consider with {ggplot2}

  • Using a log scale is not the same as logging the data.
    • Both give a different view of the data but the log scale preserves the values and changes the “scale” on the axis.
    • Logging the data changes the values of the data and the values of the scale.
    • You may have to use log scale on the plot and then log the data in the statistical model.
  • Whenever you are asked to plot data for an association between two variables, consider the question in terms of how does the variable on the X axis explain what is happening on the Y axis (the response). Typically read the problem as plot Y vs X.
  • Whenever plotting to look for an association, use a smoother.
    • Start with the default, non-linear, geom_smooth(se = FALSE).
    • Set method = "lm" to see a linear smoother.
    • On assignments, check the question or rubric if in doubt which to use.
  • Consider the difference between a linear smoother (Ordinary Least Squares OLS regression line) and the default geom_abline() where y = x.
    • The first is useful for examining if there is a linear relationship between two variables.
    • The second is useful for determining if there is a pattern to the relative values of the two variables compared to the y = x line.
  • Consider the options for ggtitle() and labs().
  • Consider using str_glue() or glue() to create dynamic labels.
  • Use themes to change colors, legends, and adjust other elements of the plots
  • Consider {ggthemes::scale_color_blind} or alternative color palates such as the viridis family for accessibility.
  • Remember the difference between color and fill for different kinds of plots.

3.5.3 Things to Consider with {dplyr}

  • Resist using a full_join() whenever possible. They can lead to trouble. Consider a left_join() as the start.
  • Consider the filtering joins: anti_join() and semi_join() as they save a lot of code.
  • If you did not cover across() with helper functions in your previous class review Column-wise operations vignette.
  • If you did not cover rowwise in your previous class, review Row-wise operations vignette.
  • When you see a statement like ” for each xxx”, that should suggest the need to group using group_by().
    • If the statement says : for each “xxx, yyy, and zzz” then consider grouping by the three things.
  • Think about whether you want to summarize() or mutate() and the implications of one over the other.
    • summarize() reduces rows and columns (only those in the summarize and the grouping variables).
    • mutate() keeps all the rows and columns but makes duplicate values.
  • Consider how to deal with the output of using group_by(), especially after grouping more than one variable and then using summarize(). Do you want to ungroup() or do you want to operate next on a grouped data frame?
  • Consider how the slice() functions work with ties as compared to head().
    • If a problem says find the n largest values, that generally suggests using slice_max( order_by = xxx, n = n) and not using arrange() and then head(n = n) in case there are ties.

3.5.4 Things to Consider with {tidyr}

  • You can reshape the data to get the variables you need to do some calculations and then reshape again.
  • Use the pivot_wider() and pivot_longer() functions and not the deprecated gather() and spread().
  • Consider using the additional arguments for the pivots to minimize extra code to transform the values/class of the variables.

3.6 Strategies for the Course and Data Science

3.6.1 Data Science Cheat Sheet

There are many cheat sheets for tidyverse and other R packages. Here is one for doing well on assignments (and future data science work).

  • Use the starter file. If you don’t, you run the risk of missing a part of a question and your homework is harder to grade. Be sure to rename the file and put your name as Author.
  • Read the Problem all the way to the end.
    • Are there multiple parts?
    • What is the output required to answer the question?
    • What format should the answer be in?
    • Are there restrictions on the packages to be used?
    • Does it require re-saving the modified data? If so to what/where?
    • Check the Rubric in the HTML file to see if it makes sense.
  • Organize the files in the proper directories
  • Look at the data after loading.
    • Visually inspect the data after you read it in.
    • Check for parsing errors - if so - why?
    • Check the number of rows and columns - does it make sense?
    • Is the data in the right shape?
    • Does it have the right variables?
  • Develop a Strategy and Plan for how to Solve the Problem
    • Think about the starting point and where you want to end up.
    • Some problems may be easier by thinking from the end state and working backwards towards the beginning.
    • What should the final data frame look like and how do I get there.
    • What should it look like on the way?
    • How can I test I am getting the right results along the way?
  • Build a little - Test a little!
    • Do not try to solve all the problem at once!
    • Enter a few lines and check your results.
      • Does the data have the right shape?
      • Does it have about the right number of rows/columns?
      • Are the values reasonable?
      • Is there another source you can check to validate your results?
  • Use the Help function, RStudio Cheat Sheets, Google, StackOverflow etc..
    • You are not expected to remember every bit of syntax or possible argument to a function.
    • You are expected to be able to read the help.
    • If a problem says use function or an argument to do something, use the help to get to the function or argument and how to use it.
    • The arguments are there to save you extra lines of code and computation to make your code faster and easier to read.
  • Save and Commit Early and Often!
    • Push Regularly
  • When creating plots, check the instructions/rubric for any specifications or constraints.
    • Read the HTML if you are trying to reproduce a plot.
  • When interpreting a plot - don’t just describe it, but say what you think it means and what the implications for your analysis might be.
  • Knit to HTML and then add commit and push and then submit a note on Canvas.

3.6.2 Things to Avoid

  • Do Not Use Absolute Paths. If your name is in the path it means the repo is not reproducible on someone else’s machine. Use Relative Paths.
  • Do Not Show Excessive Data.
    • We will be using data sets that can easily generate hundreds of pages in your HTML output.
    • That is not a good thing.
    • Depending upon the structure of the data use head() or glimpse() if you want to check something.
    • You can also use the code chunk option include: FALSE so no results from the chunk go into the output.
  • Do Not Submit Messy Code.
    • Do follow tidyverse style and especially do the following:
      • Use a linebreak after each pipe.
      • Use a linebreak after each + in ggplot2.
      • Use a linebreak after each { and }.
      • Put spaces on both sides of infix operators, e.g.,= and +, <-, ->, %>% or |>.
    • Use the {lintr} or {stylr} packages (and RStudio Addins) to assist.
    • Tidyverse style is not just about looking pretty but being able to read and debug code faster than if the code is messy.
  • Do not use unusual packages for most assignments.
    • Some assignments and the group project may require specialized packages not taught in class. For the most part however, any packages you need will be covered in class.
  • Do not use excessive Base R when a tidyverse function is appropriate.
    • This is a course focused on tidyverse approaches so there is no need to use grep, lapply, etc..
    • Base R primitives and common functions without tidyverse equivalents are fine.
  • Do not set or change the working directory in an .Rmd file.
    • The default is the directory where it is stored.
    • Use relative paths to load data or save output as desired.
  • Do not leave lots of excess code in an .qmd file.
    • It’s great to try things out but no one wants to see all extra code.
    • Delete it or comment it out (use CTRL Shift C or Command Shift C).
    • Even when submitting code that is not working, only include what you think is the “closest” to what you are trying to do.

3.6.3 Final Thoughts

Use the RStudio capabilities to assist you. Use the environment, files, and packages panes to keep track of what is going on.

Use RStudio Projects to simplify working with Git and GitHub and consider the Git Pane for efficient routine workflow.

Experiment with small examples. If you are not sure what the code is doing, use the .Rmd or console to try on a small data set such as mtcars that is already present.

Collaborate with your peers on what the question means and what the answers should look like. Do not share code or show code to each other.

Ask the TA or instructor for help and /or office hours! Push your code to GitHub so the instructor can see it as screen shots are always missing some context.

3.7 Using Quarto for Literate Programming

Quarto is a new open-source scientific and technical publishing system from Posit.

Quarto users can:

  • Create multi-language projects with R, Python, Julia, Observable, … (R not required).
  • Use one document to create reports, journal articles, presentations, web content,… in HTML, PDF, Word, PowerPoint, … All Formats.
  • Write/code in multiple tools such as RStudio, JupyterLab, VS Code, Text Editors, ….
  • Use .Rmd files or python .ipynb, or “upgrade” to .qmd for more features ….
  • Collaborate across teams using different formats, languages, and tools.

Quarto has a shallow learning curve, especially for R Markdown users.

  • It’s the next-generation of R Markdown and Posit’s focus for new capabilities.
  • Comes “batteries included” so no need to manage multiple packages.
  • Extensive on-line documentation, an active community, and numerous videos and blogs, e.g. A Quarto tip a day .
  • Multiple themes with options for customized CSS/SCSS, or \(\LaTeX\).

3.7.1 Quarto Unifies the R Markdown Ecosystem

Quarto incorporates capabilities of multiple packages to allow for diverse outputs:

  • Documents: HTML, PDF, Word
  • Websites
  • Blogs
  • Web-Books
  • Presentations:
    • HTML (Revealjs)
    • PDF (Beamer)
    • PowerPoint (.pptx)

Quarto tries to create consistency across formats but HTML can be interactive and PDF cannot.

Quarto supports multiple languages: R, Python, Julia, Observable, ….

  • Can use either Knitr or Jupyter computation engines (R is NOT required).
  • Can support R and Python in the same file with R {reticulate} package.

Quarto operates as a Command Line Interface in the Terminal window and ….

  • Is integrated into RStudio (v2022.07+) and works in JupyterLab , VS Code, Text Editors.
  • Can render python .ipynb.

Quarto supports \(\LaTeX\) and math with MathJax (including AMSmath).

3.7.2 Quarto Extends R Markdown

Quarto creates plain-text files using YAML and R Markdown tags.

Quarto works with .Rmd files but recommends upgrading for more features.

  • Quarto files end in .qmd instead of .Rmd.
  • Use a slightly different YAML Syntax to make it easier to change outputs and be consistent across languages and tools.
    • format: html instead of output: html_document.
  • Use #| option: value for chunk options instead of fitting into the chunk header.
  • You can edit to convert the chunk options or use knitr::convert_chunk_header().
    • knitr::convert_chunk_header("your.Rmd", "your.qmd", type = "yaml")

R Markdown Chunk Options

```{r mtcars, eval=FALSE}
nrow(mtcars)
```

Quarto Chunk Options as #|

```{r}
#| label: mtcars2
#| eval: false
nrow(mtcars)
```

3.7.3 Getting Started

Getting Started with Quarto in RStudio is a good resource if you are new to RMarkdown.

Quarto is fully integrated into RStudio (V2022.07 +).

  • You can operate using the IDE or use the console command line.

You can use the IDE to:

  • Create new Quarto documents using the menu for File/New File/Quarto Document.
  • Auto-complete yaml and chunk options.
  • Open an existing .Rmd file or edit an existing .qmd file as usual.

You can add many options to your YAML and see the auto complete in action.

---
title: "Untitled"
format: html
editor: visual
date: today
theme: cerulean  # lots of options for themes
number-sections: true # automated numbering of headings
self-contained: true # For HTML to allow sharing of complete files (makes them larger)
code-fold: true # collapse code chunks with option to show the code
execute:        # set global code chunk options as in Rmarkdown Setup chunk
  echo: true
  eval: true
  freeze: true  # New capability to save calculations to speed later renders
---

3.7.3.1 Quarto has an Expanded Visual Editor in RStudio

The RStudio Visual Editor provides a WYSIWYM capability.

  • It shows what you “mean” but does not completely render all the elements on the page.
  • You can preview the rendered version in either the Viewer or Presentations Panes or switch to a Window view in a browser using the open browser icon .

The Visual Editor provides access to many formatting options and special features through the Format and Insert menus.

I often go back and forth between the Visual Editor and the Source Editor as I find it much easier to “fix” things in the Source Editor.

The Visual Editor is great for inserting features such as tab panels, columns, callout blocks, or span formatting and then going to the Source Editor to fine tune them.

Tables are still limited in R Markdown but the Visual Editor makes it much easier to create and edit large tables.

The Visual Editor will “Escape” errors with a \\.

If you edit something in source editor, such as a table or math, and there is an error in it and you try to see it in the visual editor, Quarto will “escape” all special characters such as [ or | with a \ so you have a lot to clean up manually. Using the IDE Find and Replace can help.

3.7.4 Writing Math and \(\LaTeX\) in Quarto

Quarto supports \(\LaTeX\) for writing math equations (\(\text{Area} = \pi r^2\)), matrices, as well as numbering and cross-references using {#eq-my-equation}.

\[ \begin{align} \hat{\beta}_1 & = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sum(x_i - \bar{x})^2} \\ \hat{\beta}_0 & = \bar{y} - \hat{\beta}_1x \end{align} \tag{3.1}\]

\[ A = \begin{bmatrix}a_{11}&a_{12}&\cdots &a_{1m} \\ \vdots & \vdots & \ddots & \vdots\\a_{n1}&a_{n2}&\cdots &a_{nm}\end{bmatrix} \]

\[ m_{ij} = (a_{i1}b_{1j} + a_{i2}b_{2j} + \cdots + a_{im}b_{mj}) = \sum_{k=1}^{m} a_{ik}b_{kj} \tag{3.2}\]

Careful with Spacing

Both source and visual editor views will preview correct latex.

  • No space after leading $ or before trailing $ in inline mode.
  • You can put $$ on separate lines.
  • If using a \(LaTeX\) environment, e.g.,\begin{array} you can skip the $$.

If copying and pasting latex code, be sure to paste into the Source editor otherwise all of your \ may get escaped as \\.

For revealjs presentations, adding html-math-method: mathjax to YAML avoids a formatting bug in mathjax for Chrome.

3.7.5 Divs and Spans

Quarto allows you to create a Div or a Span to customize sections of the document, or pieces of code.

See Divs and Spans for detailed help.

A Div creates a block (division) of the document that gets special treatment.

  • Divs are used for many built-in features to include tab-sets and callout boxes.
  • You can customize a Div and nest Divs within other divs.
  • Use the Visual editor to Insert and or enter a pair of ::: to delineate the div.
  • The Div should be separated by blank lines from preceding and following blocks.

A Span is used for giving special treatment to a section (span) of text.

  • You can customize a Span using CSS and nest Spans within other Spans
  • Use the Visual editor to Format a span or delineate the text inside and add the attributes you want after it using [This is *some text*]{.class key="val"}.

I often use the Visual editor to insert a div or span, especially tabsets or callout boxes, and then switch to the Source editor to move content around.

3.7.6 Quarto Extensions

Quarto has an array of Extensions that allow you to add additional capabilities to your document.

  • Extensions exist for Shortcodes/Filters, Journal Articles, Custom Formats, and Revealjs.

Extensions must be installed for each Quarto project.

Consider the FontAwesome extension for including 2,000 free icons in either HTML or PDF.

  • Install using the terminal with quarto install extension quarto-ext/fontawesome.

  • This will create a directory _extensions/quarto-ext/fontawesome

  • Users can adjust the size, color, and accessible text for the icons. {}

  • Use shortcode {{< fa name >}} : fa thumbs-up fa folder fa chess-pawn fa brands bluetooth fa brands github size=2x fa battery-half size=Huge fa envelope title= “an envelope”

See Icons in Quarto for more ideas.

3.7.7 Scholarly Writing

Quarto has features to simplify formatting for publication on the web or in academic journals.

Cross References provide numbered references and hyperlinks to elements of your document that you choose.

Labels {#type-name} go after the element (separated by a space) or as an option in a code chunk, e.g., #| label: fig-plots

  • Common elements include Figures #fig-, Tables #tbl-, equations #eq-, Section Headings #sec-, and others
  • The name must be unique within the document (or across documents in a project)
  • Add crossref: chapters : true to the YAML to indicate that top-level headings (H1) in your document correspond to chapters, and cross-references should be sub-numbered by chapter.

Figures and tables, whether generated by markdown or in a code chunk, must have a Caption as well to be cross-referenced.

To reference a labeled item, use @type-name for the item of interest.

Quarto can support BibLaTeX (‘.bib’) or BibTeX (‘.bibtex’) files for Citations

  • Add the relative path to the file to the YAML, e.g., bibliography: references.bib
  • Recommend using Zotero with the Better Bibtex addin to manage your collection of citations.

Quarto uses the standard format [@citation-key where citations go inside square brackets and are separated by semicolons - The citation-key must be present in the referenced .bib file.

The default is the Chicago Manual of Style author-date format, but you can specify a custom format by adding csl: myformat.csl to the YAML where myformat is a file such as found at Zotero Style Repository

Quarto has Templates for several journals and Posit and the open source community are building more (and you can build your own as well).

For pre-built templates, use the terminal to install the extension, e.g., quarto install extension quarto-journals/jasa.

Then edit the YAML to use the template format, e.g., format: jasa-pdf.

The pre-built templates usually have an example .qmd file with their options.

3.7.8 Customizing Quarto

Quarto has many options to customize your output for each type of format.

  • The quarto Gallery has lots of examples.

If one of the built in themes fonts, colors, sizes, layout does not suit your purpose, you can create your own variations.

  • for HTML: embed HTML tags or more generally, use a custom SCSS file to adjust existing CSS classes or add your own.
  • for PDF, you can use any option Pandoc allows for PDF Latex.

You can change the page layout

  • Add columns using a div
  • Move content to the margins

You can choose different languages for the document.

3.7.9 Quarto Projects

3.7.9.1 Quarto Projects Overview

Quarto Projects are not the same as RStudio Projects.

A Quarto Project is a folder with (usually) multiple documents that should be rendered together under common options.

  • You can still override the common options though for portions of the project

Projects are used to create articles from multiple source documents, websites, books or other collaborative efforts.

Quarto projects are characterized by their use of a _quarto.yml file which serves as a configuration file for the project.

  • Edit the _quarto.yml file to define the elements of the project and any common options you want to apply across the project.
  • Adding the The YAML freeze: auto option will keep track of all documents that have been rendered and only re-render a file when it has changed.

Use the RStudio File/New project command or run quarto create_project in the terminal in the directory where you want to create a project.

3.7.9.2 Personal Websites and Blogs

The {blogdown} package is an R Markdown based package for creating personal websites or blogs using Hugo.

Quarto streamlines creating a website/blog with many options for customizing and enhancing it over time.

The Quarto Guide Creating a Website shows how to create a website or a blog.

  • These are both Quarto Projects as you use multiple .qmd files for a website or blog.
  • A Website organizes your files based on the structure navigation you define in the _quarto.yaml , e.g. navigation bars of Tabs and/or Menus and/or a sidebar menu etc..
  • A Blog uses a listing page which will search for files under a directory and organize the files by date/time.

If you want a combination of website and blog, recommend creating a website and then in the navigation add a .qmd file with the line listing: default in the YAML header.

Quarto generates a standard structure of HTML files for all the contents of your website/blog.

Thus you can publish it in a wide variety of places for free.

Quarto has its own site, quarto.pub.com, where you can publish your work.

Once you set up an account, it is a two step process:

  1. Render your website in RStudio (this is where the freeze: auto speeds things up).
  2. Publish your website using the command quarto publish in the terminal window in the project directory.

You can add new files and update your website as often as you wish.

Listing 3.1: _quarto.yml
    project:
      type: website
    website:
      title: "Richard Ressler's Data Science Site"
      navbar:
        left:
          - text: "Home"
            file: index.qmd
          - text: "Data Sets"
            file: data_sets/data_sets.qmd
          - text: "Blog"
            file: posts/posts.qmd
          - text: "Talks"
            file: talks/talks.qmd
    execute:
      freeze: false # re-render only when source changes
    format:
      html:
        theme:
          light: flatly
          dark: darkly
        css: styles.css
        toc: true
    editor: visual

Website Directory Structure

3.7.10 Quarto Summary

Quarto is the next generation of R Markdown and integrates capabilities from multiple packages for a more consistent approach to literate programming and scientific and technical publishing.

These notes just scratch the surface of its capabilities right out of the box and how you can customize your work.

More developments are coming:

  • Add Track Changes/Comments in the visual editor for collaboration.
  • Add HTML widgets.
  • Add more templates for journals.

Use Quarto for reproducible work at AU and beyond!