5  Functions in R

Published

November 6, 2024

Keywords

functions, environments, formals, anonymous functions, packages

5.1 Introduction

5.1.1 Learning Outcomes

  • Create custom R functions for efficient coding.
  • Apply understanding of R function structure and rules when creating and debugging functions.
  • Apply considerations for creating functions and R scripts to create effective and reproducible code.
  • Employ a variety of debugging strategies, methods and R and RStudio tools to write better code, faster.

5.1.2 References:

5.1.2.1 Other References

5.2 Creating New Functions in R

5.2.1 You should create a new function when …

You have a complicated task you need to accomplish more than once:

  • within the same analysis.
  • across multiple analyses.

The task involves manipulating inputs to get one or more outputs over and over.

  • Data conversions or calculations.
  • Generating multiple plots for a weekly report.

You want to simplify your code and make it more robust and easier to debug and maintain.

  • This is especially true in building R Shiny apps when you want to minimize the amount of code in the reactive server code.
  • You want to make your life simpler in the long run!
Tip

A rule of thumb: “If you have to do the same thing three or more times, write a function for it.”

5.2.2 Creating a function with function()

To create and name a function: use the function() function to create the function object and bind a name to it with <-.

  • This is the same process as vectors; you create an object with some value and bind a name to it (not the other way around).

The syntax for function is function(arglist) expr where

  • arglist is a (possibly empty) set of terms to be passed to the function.
  • expr is an R expression: typically a sequence of function calls, wrapped in { }, that the function object evaluates in order when it is called.

R expressions use the { primitive to allow for multiple function calls that can take many lines of code inside the {...}.

  • The use of { } is recommended as organizing the code into multiple lines makes the code much easier to read, interpret, and maintain.
  • One can also nest expressions within expressions to allow for complex functions.
  • When the expression is “simple” and fits on one line, the {} are not needed.
f01_sin <- function(x) {
  sin(1 / x^2)
}
f01_sin(2)
[1] 0.247404
f01_sin_emb <- function(x) {
  if (x > 0) {
    sin(1 / x^2)
  } else {
    x
  }
}
f01_sin_emb(-2)
[1] -2
f01_sin_oneline <- function(x) sin(1 / x^2)

f01_sin_oneline(2)
[1] 0.247404

5.2.3 Creating a function with the backslash \

Starting with R version 4.1, R has a new way to define a function using just a backslash \.

  • The syntax is still \( arglist ) expr.
f01_sin_backslash <- \(x) {
  sin(1 / x^2)
}
f01_sin_backslash(2)
[1] 0.247404

5.2.4 Creating Anonymous Functions

You can choose to not bind a name to the function object. That makes it an anonymous function.

  • This is useful when it’s not worth the effort to figure out a name (it’s a one-time usage) and you are usually executing several steps in a single line.
    • We do this a lot without thinking about it.
    • The {purrr} package functions such as map() accept anonymous functions, including those created with the \ shorthand.

Two examples of anonymous functions:

purrr::map_dbl(mtcars, function(x) length(unique(x)))
 mpg  cyl disp   hp drat   wt qsec   vs   am gear carb 
  25    3   27   22   22   29   30    2    2    3    6 
Filter(\(x) !is.numeric(x), mtcars) ## Filter(f, x)
data frame with 0 columns and 32 rows
## Filter is a base R function that extracts the elements of a vector
## for which a predicate (logical) function gives TRUE.

5.2.5 Employing Functions in Sequence

We have connected multiple functions in a sequence (sending outputs from one as inputs to the next) in three ways:

  • Nesting, e.g., sum(exp(c(1,2,5)))
  • Intermediate variables
  • Using either the R native pipe operator |> or the {magrittr} forward pipe operator %>%.
sum(exp(c(1, 2, 5)))
[1] 158.5205
x <- c(1, 2, 5)
exp_x <- exp(x)
sum_ex <- sum(exp_x)
sum_ex
[1] 158.5205
library(magrittr) # add to the search path
c(1, 2, 5) |>
  exp() %>%
  sum()
[1] 158.5205
detach(package:magrittr) # remove from the search path
  • Each has advantages and disadvantages. The choice depends upon how complicated (long) the overall body of the expression is and whether you need the intermediate variables for other purposes.

Before you write too many functions though, it is good to understand more about what they are and how they operate.

5.3 The Structure of Functions

5.3.1 Functions are Objects

Functions are objects, just as vectors are objects.

  • A function object has three parts:
    1. The formals: the list of arguments controlling how you call the function, often specified as a dotted pairlist.
    2. The body: the code inside the function.
    3. The environment: the data structure that determines how the function finds the values associated with the names.
  • There are some exceptions: a small selection of “primitive” base functions implemented in C, e.g. sum.

You specify the formals and the body when you create a function.

  • The environment is specified implicitly, based on where you define the function.

You can check the formals, body, and environment by calling functions with those names:

f02_add <- function(x, y) {
  ## A comment
  x + y
}
formals(f02_add)
$x


$y
body(f02_add)
{
    x + y
}
environment(f02_add)
<environment: R_GlobalEnv>

A response of NULL means either it’s not a function or it’s a primitive.

formals(99)
NULL
## formals([])

Let’s load {tidyverse} and look at the formals, body, and environment with sum() and ggplot().

library(tidyverse)
formals(sum)
NULL
body(sum)
NULL
environment(sum)
NULL
formals(ggplot)
$data
NULL

$mapping
aes()

$...


$environment
parent.frame()
body(ggplot)
{
    UseMethod("ggplot")
}
environment(ggplot)
<environment: namespace:ggplot2>

R is a functional programming language, so functions are objects in their own right, a language property often called “first-class functions”.
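Because functions are ordinary objects, they can be stored in a list and passed to other functions like any other value. A minimal sketch; the names summary_fns, apply_fn, and vals are made up for this illustration:

```r
## Store functions in a list, just like any other object
summary_fns <- list(avg = mean, mid = median, spread = sd)

## Pass a function as an argument to another function
apply_fn <- function(f, values) {
  f(values)
}

vals <- c(1, 2, 3, 10)
apply_fn(summary_fns$avg, vals) ## calls mean(vals)

## Apply every stored function to the same data
sapply(summary_fns, apply_fn, values = vals)
```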

5.3.2 Environments and Functions

R operates with multiple environments that are nested like layers of an onion within the “Global Environment”, the outer layer.

  • An environment in R is a data structure that associates, or binds, a set of names to a set of values.
    • You can think of an environment as a bag of names, with no implied order.

Two key environments are:

  • The current environment, returned by rlang::current_env() (or base R's environment()), is the environment in which your function or code is currently executing.
  • The global environment, returned by rlang::global_env() (or base R's globalenv()), sometimes called your “workspace”, is where all interactive (i.e., outside of a function) computation takes place.

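Every environment has a parent, another higher-level or encompassing environment (like layers of an onion). For the code you write, the chain of parents leads up to the global environment, then on through the attached packages, and ends at the empty environment, the only environment without a parent.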

All functions operate inside an R environment.

  • Every function call has its own set of bindings: the arguments and any variables or functions defined inside the function live in that function's environment. (A package's NAMESPACE plays a related role at the package level by listing the names the package exports and imports.)
  • When executing a function, R looks first in the current environment for any functions or variables before it looks elsewhere.

5.3.3 Calling a Function Creates a New Environment

We are used to referring to functions with functionname(); however, this is imprecise.

  • The actual function is the object to which functionname is bound.
  • The parentheses that follow the name tell R to call the object the name is bound to.

The code functionname() is referred to as “calling the function”, which means to evaluate the function (with its arguments).

When you call a function, R creates a new environment just for that call and uses it as the current working environment. The new environment's parent is the environment in which the function was defined.

  • As an example, each call to a function you defined at the top level of your script gets its own environment just under the Global Environment, while each call to a package function such as mean() gets one just under that package's namespace.

When you nest functions inside other functions, R creates corresponding nested environments.

  • If f() is defined inside g(), then calling g() creates an environment for g(), and
  • when g() calls f(), R creates a new environment for f() nested inside g()'s environment and makes that the current working environment.

When R completes a function and returns an evaluated result, it deletes the function’s environment as it is no longer needed.
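To see that each call gets its own environment, the following sketch has a function return its execution environment. The name show_env is made up for this illustration. Returning the environment keeps it alive so we can inspect it; normally it is discarded when the function finishes.

```r
show_env <- function() {
  ## environment() with no arguments returns the current
  ## (execution) environment of this call
  environment()
}

env_1 <- show_env()
env_2 <- show_env()

## Each call created a different environment
identical(env_1, env_2)

## The parent of the execution environment is the environment
## where show_env was defined
identical(parent.env(env_1), environment(show_env))
```

The first comparison returns FALSE and the second returns TRUE.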

5.3.4 Function Forms

Everything that happens in R is a result of a function call, but not all calls look the same.

Function calls come in four varieties:

  • prefix: the function name comes before its arguments, like foofy(a, b, c).
    • These constitute the majority of function calls in R.
  • infix: the function name comes in between its arguments, like x - y.
    • Infix forms are used for many mathematical operators, and for user-defined functions that begin and end with %.
  • replacement: functions that replace values by assignment, like names(df) <- c("a", "b", "c").
    • They actually look like prefix functions.
  • special: functions like [[, if, and for.
    • While they don’t have a consistent structure, they play important roles in R’s syntax.

While there are four forms, you actually only need one because you can write any call in prefix form and it’s the most common.
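As a small illustration of that claim, each of the other three forms can be rewritten as a prefix call by quoting the function name with backticks. The objects v, w, and w2 are made up for this sketch:

```r
## infix call and its prefix equivalent
3 + 4
`+`(3, 4)

## special form and its prefix equivalent
v <- list(10, 20)
v[[2]]
`[[`(v, 2)

## replacement call and its prefix equivalent
w <- 1:3
names(w) <- c("a", "b", "c")
w2 <- `names<-`(1:3, c("a", "b", "c"))
identical(w, w2)
```

Each pair gives the same result; the prefix versions are rarely used directly but show what R is doing underneath.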

Prefix calls allow you to specify arguments in three ways:

  1. By position, like help(mean).
  2. Using partial matching, like help(top = mean).
  3. By name, like help(topic = mean).

Arguments are matched by exact name, then with unique prefixes, and finally by position.

k01 <- function(abcdef, bcde1, bcde2) {
  list(a = abcdef, b1 = bcde1, b2 = bcde2)
}
str(k01(1, 2, 3))
List of 3
 $ a : num 1
 $ b1: num 2
 $ b2: num 3
str(k01(2, 3, abcdef = 1))
List of 3
 $ a : num 1
 $ b1: num 2
 $ b2: num 3
## Can abbreviate long argument names:
str(k01(2, 3, a = 1))
List of 3
 $ a : num 1
 $ b1: num 2
 $ b2: num 3
## But this does not work because abbreviation is ambiguous
str(k01(1, 3, b = 1))
Error in k01(1, 3, b = 1): argument 3 matches multiple formal arguments

Prefix form is what allows functions to change other functions by manipulating their arguments programmatically.

Tip

Some Best Practices for working with arguments.

  • Use positional matching only for the first one or two arguments; they will be the most commonly used, and most users will know what they are.
    • Avoid using positional matching for less commonly used arguments.
  • Set default values for non-data arguments to the values expected to be most commonly used.
  • Never use partial matching.
    • Unfortunately you can’t disable partial matching, but you can turn it into a warning with the warnPartialMatchArgs option.
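A minimal sketch of the warnPartialMatchArgs option in use. The function ex_partial is hypothetical, and tryCatch() is used only so the warning text can be captured and inspected:

```r
## Hypothetical function used only for this illustration
ex_partial <- function(abcdef) abcdef

old_opts <- options(warnPartialMatchArgs = TRUE) ## save the old setting

## `a` partially matches `abcdef`, which now raises a warning;
## tryCatch() returns the warning message instead of the value
partial_msg <- tryCatch(
  ex_partial(a = 1),
  warning = function(w) conditionMessage(w)
)
partial_msg

options(old_opts) ## restore the previous setting
```

Setting the option in your .Rprofile is one way to apply it to every session.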

5.4 Rules for Function Evaluation

R evaluates functions using “Lexical Scoping.”

R looks up the values of names based on how a function is defined, not how it is called.

  • Here, “lexical” is a computer science term meaning the scoping rules use a parse-time, rather than a run-time approach.

Four primary rules:

  • Name masking: names defined inside a function mask names defined outside a function.
  • Functions versus variables
  • A fresh start
  • Dynamic lookup

5.4.1 Name Masking Rule

Try to figure out what value the following will return before you run it.

  • Then execute each step to see what happens.
x <- 10
y <- 20
g02 <- function() {
  x <- 1
  y <- 2
  c(x, y)
}
g02()
[1] 1 2
c(x, y)
[1] 10 20

If a name is not defined inside a function, R looks one level up to the next higher environment.

x <- 2
g03 <- function() {
  y <- 1
  c(x, y)
}
g03()
[1] 2 1
c(x, y) ## y is unchanged
[1]  2 20

If a function is defined inside another function, R looks inside the current function for names.

  • If it does not find them, it looks up one level to where that function was defined, and so on, all the way up to the global environment.
  • Finally, it looks in the attached packages (the search path).
  • If it can’t find a name, you get a “could not find function” or “object not found” error message.

Look at help for mean().

  • The first argument is x (for the data object).
  • Since x is defined in the mean() function, R looks no further and it does not care that you may have a variable called x in your Global Environment.
  • R assigns whatever input you gave the function for x to the function variable x and uses its own x in executing the function.

5.4.2 Functions versus Variables Rule

Run the following code in your head. What is the result? Then run the code to check your answer:

x <- 3
g04 <- function() {
  y <- 2
  i <- function() {
    z <- 1
    c(x, y, z)
  }
  i()
}
g04()

Functions are ordinary objects so the scoping rules (name masking) also apply to functions.

g07 <- function(x) x - 1
g08 <- function() {
  g07 <- function(x) x - 100
  g07(10)
}
g08()
[1] -90
g07(1)
[1] 0
Tip

Best practices for Functions and Variable Names

When a function and a non-function share the same name (from different environments), it gets complicated.

  • When you use a name in a function call, R ignores non-function objects when looking for that value.
  • Just because you can does not mean it’s a good idea to reuse names, especially R function names.

To minimize issues with names and scoping across levels:

  • Avoid reusing data-specific names inside and outside of functions and between your documents and the console for different values.
  • Use common, generic names for objects (variables or other new functions) inside your functions, such as x or y, and df for a data frame, or else use names specific to that function.
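The rule that R ignores non-function objects when a name is used in a function call can be seen in a short sketch. The names ex_add100 and ex_caller are made up for this illustration:

```r
ex_add100 <- function(x) x + 100

ex_caller <- function() {
  ex_add100 <- 10 ## a non-function object with the same name
  ## In call position R skips the number and finds the function;
  ## in argument position it finds the local value 10
  ex_add100(ex_add100)
}

ex_caller()
```

The call returns 110, but code like this is hard to read, which is why reusing names this way is best avoided.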

5.4.3 Fresh Start Rule

Every time a function is called, R creates a new environment to host its execution.

  • A function has no way to tell what happened the last time it was run;
  • Each invocation is completely independent - they are “memory-less processes”.
g11 <- function() {
  if (!exists("b")) {
    b <- 1
  } else {
    b <- b - 1
  }
  b
}

g11()
[1] 1
g11() ## does not know b exists from before
[1] 1

5.4.4 Dynamic Lookup (at Run Time) Rule

Lexical scoping determines where, but not when, to look for values.

  • R looks for values when the function is run, not when the function is created.
  • The output of a function can differ depending on the objects outside the function’s environment.
g12 <- function() x - 1
x <- 15
g12()
[1] 14
x <- 20 ## new external variable value
g12()
[1] 19
  • This can be a pain if you have misspelled a variable name and it happens to match an external variable name.
  • R has no idea it was an error, unless it violates some other rule, e.g., taking a log of a negative number, so there will be no error message! (bug or feature…?)

Use codetools::findGlobals() to list all the external dependencies (unbound symbols) within a function:

codetools::findGlobals(g12)
[1] "-" "x"

A drastic approach to check things is to manually change the function’s environment to emptyenv(), an environment that contains nothing, not even primitive operators such as -:

x <- 4
environment(g12) <- emptyenv()
g12()
Error in x - 1: could not find function "-"

5.4.4.1 Exercise

What value does the following function return? Make a prediction before running the code yourself.

f <- function(x) {
  f <- function(x) {
    f <- function() {
      x^2
    }
    f() - 1
  }
  f(x) * 2
}
f(10)

5.4.5 Evaluating Function Arguments

5.4.5.1 Lazy Evaluation

R uses lazy evaluation of function arguments; they’re only evaluated if accessed.

  • This code does not generate an error because x is never used:
h01 <- function(x) {
  10
}

h01(stop("This is an error!"))
[1] 10
  • This is an important feature because it allows you to do things like include potentially expensive computations in function arguments that will only be evaluated if needed.

Lazy evaluation is powered by a data structure called a promise, or (less commonly) a thunk.

  • A promise has three components:
  1. An expression, like x - y, which gives rise to the delayed computation.
  2. An environment where the expression should be evaluated, i.e., the environment where the function is called.
    • This makes sure the following function returns 11, not 101:
y <- 10
h02 <- function(x) {
  y <- 100
  x + 1
}

h02(y)
[1] 11

This also means that when you do an assignment inside a call to a function, the variable is bound outside of the function, not inside of it.

  • What is the new value of y in the global environment?
h02(y <- 1000)
[1] 1001
y
[1] 1000
  3. A value, which is computed and cached the first time the promise is accessed, when the expression is evaluated in the specified environment.
    • This ensures the promise is evaluated at most once.
    • Example: This is why you only see Calculating... printed once in the following example.
triple <- function(x) {
  message("Calculating...")
  x * 3
}

h03 <- function(x) {
  c(x, x)
}

h03(triple(20))
[1] 60 60

You cannot manipulate promises with R code; they work behind the scenes.

5.4.5.2 Default Arguments

Lazy evaluation means default values can be defined in terms of other arguments, or even in terms of variables defined later in the function.

  • Again, a programming technique to be careful about and ensure you document with comments!
h04 <- function(x = 2, y = x * 2, z = a - b) {
  a <- 10
  b <- 100
  c(x, y, z)
}

h04()
[1]   2   4 -90
  • Many base R functions use this technique, but it’s not a best practice in general; it makes the code harder to understand.
  • To predict what will be returned, you need to know the exact order in which the default arguments are evaluated.

The evaluation environment is slightly different for default and user-supplied arguments.

  • Default arguments are evaluated inside the function, whereas user-supplied arguments are evaluated in the environment from which the function is called.
  • Seemingly identical calls can yield different results.

The following is an extreme example using ls(), which returns a character vector of the names of the objects in the specified environment:

h05 <- function(x = ls()) {
  a <- 1
  x
}

ls()
 [1] "exp_x"             "f01_sin"           "f01_sin_backslash"
 [4] "f01_sin_emb"       "f01_sin_oneline"   "f02_add"          
 [7] "g02"               "g03"               "g07"              
[10] "g08"               "g11"               "g12"              
[13] "h01"               "h02"               "h03"              
[16] "h04"               "h05"               "k01"              
[19] "object"            "sum_ex"            "triple"           
[22] "x"                 "y"                
## ls() evaluated inside h05:
h05()
[1] "a" "x"
## ls() evaluated in global environment:
h05(ls())
 [1] "exp_x"             "f01_sin"           "f01_sin_backslash"
 [4] "f01_sin_emb"       "f01_sin_oneline"   "f02_add"          
 [7] "g02"               "g03"               "g07"              
[10] "g08"               "g11"               "g12"              
[13] "h01"               "h02"               "h03"              
[16] "h04"               "h05"               "k01"              
[19] "object"            "sum_ex"            "triple"           
[22] "x"                 "y"                

5.4.5.3 Missing Arguments

To determine whether an argument’s value comes from the user or from a default, you can use missing().

  • missing(x) returns TRUE when x is missing from the user input (so the default is used) and FALSE when the user supplied it.
h06 <- function(x = 10) {
  list(missing(x), x)
}
str(h06())
List of 2
 $ : logi TRUE
 $ : num 10
str(h06(10))
List of 2
 $ : logi FALSE
 $ : num 10
  • missing() is best used sparingly.

5.4.5.4 The ... (dot,dot,dot) Argument

Functions can have a special argument, ..., which allows the function to take any number of additional arguments.

  • This is often used to pass any optional additional arguments on to another function that may be called inside the function, e.g. as used in map().
  • You have seen it in the help for many functions, e.g., purrr::map() or t.test().
i01 <- function(y, z) {
  list(y = y, z = z)
}

i02 <- function(x, ...) {
  i01(...)
}

str(i02(x = 1, y = 2, z = 3))
List of 2
 $ y: num 2
 $ z: num 3
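Two further common uses of ... are capturing the extra arguments with list(...) and forwarding them to a base R function. A minimal sketch; ex_dots and ex_mean are hypothetical names:

```r
## Capture everything passed through ... as a list
ex_dots <- function(...) {
  args <- list(...)
  names(args)
}
ex_dots(a = 1, b = "two")

## Forward optional arguments to another function
ex_mean <- function(x, ...) {
  mean(x, ...)
}
ex_mean(c(1, NA, 3), na.rm = TRUE)
```

The first call returns the argument names "a" and "b"; the second passes na.rm = TRUE through to mean() and returns 2.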

5.4.6 Exiting a Function

5.4.6.1 Implicit versus Explicit Returns with return()

A function can return a value either implicitly or explicitly.

  • Implicit: the function returns the value of the last evaluated expression.
j01 <- function(x) {
  if (x < 10) {
    0
  } else {
    10
  }
}
j01(5)
[1] 0
j01(15)
[1] 10
  • Explicit: the function returns the value specified in return().
j02 <- function(x) {
  if (x < 10) {
    return(0)
  } else {
    return(10)
  }
}

Explicit returns are more user-friendly and help in debugging and maintaining functions of any length.

  • It’s okay to go with implicit for short functions, e.g. 3 or fewer lines, where it is clear how the return is calculated.
  • Tidyverse style is to not use return if your function is expected to run from top to bottom with the last line always being the returned value.
  • If you are using conditionals, use a return() to be explicit about what could happen.

There are two approaches:

  1. Put a return() everywhere you plan for a return.
  2. Use an intermediate variable and only one return() at the end. That can make it easier to troubleshoot.
  • The following example is shown two ways:
    • with an explicit return() for each possible ending, and,
    • with an intermediate variable and only one return() at the end.
## Explicit `return()` for each possible ending
has_name <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    return(rep(FALSE, length(x)))
  } else {
    return(!is.na(nms) & nms != "")
  }
}
## create two sets of test data and test our function
x1 <- c(1, 2, 3)
x2 <- c(a = 1, 2, c = 3)
has_name(x1)
[1] FALSE FALSE FALSE
has_name(x2)
[1]  TRUE FALSE  TRUE
##  Use of intermediate variable for the return value
##  with only one `return()`
has_name <- function(x) {
  nms <- names(x)
  if (is.null(nms)) {
    my_return <- (rep(FALSE, length(x)))
  } else {
    my_return <- (!is.na(nms) & nms != "")
  } ## end else block
  return(my_return)
} ## end function

## test our function
has_name(x1)
[1] FALSE FALSE FALSE
has_name(x2)
[1]  TRUE FALSE  TRUE

5.4.6.2 Invisible Return Values

Most functions return visibly: calling the function in an interactive context prints the result.

j03 <- function() 1
j03()
[1] 1

You can prevent automatic printing by applying invisible() to the last value:

j04 <- function() invisible(1)
j04()

To verify this value does indeed exist, you can explicitly print it or wrap it in parentheses:

print(j04())
[1] 1
(j04())
[1] 1

The most common function that returns invisibly is <-, which means you can chain assignments without printing lots of intermediate results.

a <- b <- c <- d <- 4
a - b - c - d
[1] -8
(a <- b <- c <- d <- 4)
[1] 4

5.4.7 Error Checking

If a function cannot complete its assigned task, it should throw an error with stop(), which immediately terminates the execution of the function.

j05 <- function() {
  stop("I'm an error")
  return(10)
}
j05()
Error in j05(): I'm an error

An error means something has gone wrong, and forces the user to deal with the problem.

  • It’s a best practice to try to predict where errors might occur and write your code so it throws an informative error in those cases.

A primary place for error checking is at the top of the function body, where you check the class/type, length, and value constraints of each argument.

Two common approaches:

  1. Enter a conditional check for each argument and use stop() to create a custom error message.
  2. As an alternative, stopifnot() will stop execution if any of its argument expressions does not evaluate to TRUE and then provide a generic error message based on the first failing argument.
  • This can be much faster to write, with multiple conditions separated by commas, than creating a lot of individual if statements.
  • The default messages are generic, but you can name a condition with a custom message to be more informative to the user.
  • Any conditions you are checking should also be defined in the documentation of the function.
my_fun <- function(df, y = "cat", z = TRUE) {
  ## length(df) counts columns, so df needs at least three columns
  stopifnot(is.data.frame(df), is.character(y), is.logical(z),
            length(df) > 2, length(y) == 1, length(z) == 1,
            "data frame must have more than one row" = nrow(df) > 1)
  c(names(df), y)
}
my_fun("cat")
Error in my_fun("cat"): is.data.frame(df) is not TRUE
my_fun(mtcars, 3)
Error in my_fun(mtcars, 3): is.character(y) is not TRUE
my_fun(mtcars, "dog", 3)
Error in my_fun(mtcars, "dog", 3): is.logical(z) is not TRUE
my_fun(mtcars[,1:2])
Error in my_fun(mtcars[, 1:2]): length(df) > 2 is not TRUE
my_fun(mtcars, c("cat", "dog"))
Error in my_fun(mtcars, c("cat", "dog")): length(y) == 1 is not TRUE
my_fun(mtcars, "dog", str_detect(names(mtcars), "am"))
Error in my_fun(mtcars, "dog", str_detect(names(mtcars), "am")): length(z) == 1 is not TRUE
my_fun(mtcars[1,]) # custom error message to be more informative
Error in my_fun(mtcars[1, ]): data frame must have more than one row
Note

Both options evaluate the arguments and tests one at a time and stop at the first error.

Keep in mind: stop() is usually executed after an if statement condition is TRUE, whereas stopifnot() executes if any one of the conditions is FALSE.
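The first approach, an if statement plus stop() for each argument, is not shown above. A minimal sketch of it follows; the function ex_check and its messages are made up for illustration, and call. = FALSE is an optional choice that drops the call from the printed error:

```r
ex_check <- function(df, y = "cat") {
  ## One conditional check per argument, each with its own message
  if (!is.data.frame(df)) {
    stop("`df` must be a data frame, not an object of class ",
         class(df)[1], call. = FALSE)
  }
  if (!is.character(y) || length(y) != 1) {
    stop("`y` must be a single character string", call. = FALSE)
  }
  c(names(df), y)
}

## Capture the error message instead of halting the script
msg <- tryCatch(ex_check("cat"), error = function(e) conditionMessage(e))
msg
```

This takes more typing than stopifnot(), but each message can tell the user exactly what was expected and what was received.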

5.5 Considerations for Creating Functions

5.5.1 Follow Repeatable Steps When Creating a Function

  1. Work out the logic in a simple case (with few iterations or a small amount of data).
  2. Name your function something meaningful, preferably a verb.
  3. Bind your function name to the output of the function() function,
  • e.g. my_function_name <- function(x, y, z) {} where x, y, z represent your arguments (if any).
  4. Place code for the function inside the expression created by the set of {}.
  • Expressions may have multiple lines.
  • Always end the first line with the {.
  • Put the final } on its own line.
  • Add comments to the code to explain why it is doing something.
  5. Test your function with different inputs.
  6. “Error-proof” your functions by checking for common input errors, e.g., class/type, length, or values.
  7. Test your error checks, including extreme cases such as 0 and very large values, positive and negative.
  8. Document any function you expect to be persistent or used by others with Roxygen2 comments.

5.5.2 Use Best Practices for Writing Functions

  • Use Tidyverse Style to improve readability. Wickham (2021)
    • Indent code to indicate the flow.
    • Limit width of code (80 columns) using multiple lines as necessary.
    • Break lines after commas, pipes (|> or %>%), or the + inside ggplot() calls, with no trailing spaces.
  • Limit the scope of individual functions - each should do one task.
    • It’s easier to debug shorter functions.
    • It’s good to have a function call other functions.
  • When using conditionals, use return() to explicitly identify the source of the return value (the output result) of the function.
  • Include error checking to prevent bad arguments causing errors.
    • Use stopifnot() for rapid checking of multiple conditions for the arguments with simple error messages.
    • For more complex error messages or conditions, create custom checks using if style statements.
  • Design/Build a little then Test a little, and expand. Don’t try to create it all at once.
  • Create “helper” functions to break a long function into smaller pieces.

5.5.3 Consider Reusing a Custom Function in Other Places

You can always reuse a custom function in the same document in which it is defined.

If you want to reuse your function in another document or analysis, write your function in a .R file, with only code and comments, no Rmarkdown, in what is known as an R Script file.

  • It is good practice to put these files in an /R directory.
  • If there is only one function in the script file, then name the file the same as the function.
  • If there are multiple functions in the file, then name the file with a meaningful title.

5.5.4 “Sourcing” the Script Enables Reuse in Other Places

To make the function available in other files (.qmd or .R), you have to “source” it in each file.

  • This tells R the path to find the R Script.
  • Make sure you use Relative Paths!

To source the geometric mean R script in this file (so I can use the function in the script), enter:

source("./R/geo_mean.R")
geo_mean
geo_mean(c(1, 2, 3, 7))
  • Now any code in this file can use the geo_mean() function … with a caveat about dependencies.
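The contents of geo_mean.R are not shown in this section, so the sketch below uses a made-up stand-in, geo_mean_sketch, written to a temporary file, purely to illustrate the write-then-source workflow. It uses only base R and is not necessarily how the actual geo_mean() is written.

```r
## Write a small script file (in practice this would live in ./R/)
script_path <- file.path(tempdir(), "geo_mean_sketch.R")
writeLines(c(
  "## Hypothetical stand-in for a geometric mean function",
  "geo_mean_sketch <- function(x) {",
  "  stopifnot(is.numeric(x), all(x > 0))",
  "  exp(mean(log(x)))",
  "}"
), script_path)

## Sourcing runs the script and creates the function in the workspace
source(script_path)
geo_mean_sketch(c(1, 2, 4))
```

The geometric mean of 1, 2, and 4 is 2 (up to floating-point rounding).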

5.5.5 Reduce Dependence on Other Packages

If an R script uses a function that is not in Base R but comes from another package, e.g., %>% or ggplot(), then it depends on that package being available in the current environment.

Managing these kinds of package dependencies is a major responsibility of R Packages.

To use the geo_mean() function we have to load the {magrittr} package to get access to the %>% pipe operator.

library(magrittr) ## to load the pipe operator
geo_mean(c(1, 6, 2, 5))

Not having to load extra packages is one reason why using the Native Pipe, |>, if appropriate, is more convenient for functions you want to share across documents or with others.

It also helps, when writing functions for use by others, to fully specify each function using the package name and the double-colon operator, ::, as in package::function().
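A minimal sketch of the package::function() style. The function ex_middle is hypothetical; it calls median() through its package name, so it works even if the caller has not attached {stats} with library():

```r
ex_middle <- function(x) {
  ## Fully qualified call: the reader can see where median() comes from
  stats::median(x, na.rm = TRUE)
}

ex_middle(c(3, 1, NA, 2))
```

The call returns 2. The package still has to be installed; :: only removes the need to attach it.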

5.6 Documenting your Function with Roxygen2

It is good practice to add documentation to your function to answer at a minimum:

  1. What is the usage syntax?
  2. What are the input parameters?
  3. What is the value (the output)?

You can just manually use # to put comments in front of your file to answer these questions but there is a better, more standardized approach in R.

The {roxygen2} package allows you to create consistent help documentation for functions for use by you and eventually others.

5.6.1 Creating a Roxygen2 Skeleton

  • To insert a Roxygen2 Skeleton into your function:
    • Use the console to install.packages("roxygen2")
    • Put your cursor in the line with the function()
    • Go to the Code menu and select Insert Roxygen Skeleton
  • A basic function skeleton looks like:

#' Title
#'
#' @param x
#' @param y
#'
#' @return
#'
#' @examples
my_fun_name <- function(x, y) {

}

5.6.2 Roxygen tags for functions

The first sentence is the title: that’s what you see when you look at help(package = mypackage) and at the top of each help file.

  • It should fit on one line, be written in sentence case, and not end in a full stop.
  • It should be a very brief description of the function to make it recognizable, not the actual name of the function.

The second paragraph is the description: this comes first in the documentation and should briefly describe what the function does.

  • Save the long explanations for the Details section.
  • The description should start with a capital letter and end with a full stop (e.g., period or question mark).
  • It can span multiple lines (or even paragraphs) if really necessary.

When not creating a package, you can manually create the next two lines, before the @param tags.

  • First, enter “#' Usage” on a line by itself
  • Second, on the line below, enter the syntax as if you were using the function with all of the arguments and any default values.
    • e.g., #' my_fun(x, y, na.rm = FALSE)
  • If creating a package, the helper functions will create the Usage entry for you in the help document.

Arguments: The skeleton will list all of the function’s arguments using the tag @param name.

  • Enter a short sentence description (starting with a Capital letter) for each argument starting with the argument class (e.g., string, or numeric vector) and any constraints on it.
  • You can document multiple arguments in one place by separating the names with commas (no spaces).
  • For example, to document both x and y, you can write @param x,y numeric vectors ...

The Details paragraph(s) comes after the arguments and is optional. This could be several paragraphs to provide specifics on how the function operates.

@return describes the output from the function. Since not every function has a return it is optional but is a good idea when there is one.

  • Enter the output type (e.g., string, or numeric vector) and short description starting with a lower-case letter.

@export is used in packages to identify the functions you want others to have direct access to from your package.

  • Don’t include unless you are creating a package.

@examples are optional but important. They are typically a few lines of executable R code showing how to use the function in practice.

  • Either include sample data with the example or if creating a package, as part of the package, but you must also document the data.
  • This is an important part of the documentation because many people look at the examples first.
  • See Hadley’s book on R Packages for more details.

5.6.2.1 Examples

Here is an example of re-writing the diff() function.

#' Calculate differences in a vector  ## Title - no period
#'
#' Takes a vector of length n and calculates the differences between
#' successive values to return a vector of length n-1. ## Description
#'
#' @param x A numeric vector
#'
#' @return numeric vector of the differences between successive values of x
#' @export ## not used for now
#'
#' @examples 
#' diff2(c(1, 5, 21))
#' diff(c(1, 5, 21))
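diff2 <- function(x) {
  stopifnot(is.numeric(x))
  ## Negative indexing also works for vectors of length 0 or 1,
  ## where 2:length(x) would count backwards and give wrong results
  lagvec <- x[-1] - x[-length(x)]
  return(lagvec)
} ## end function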

diff2(c(1, 5, 21))
diff(c(1, 5, 21))

5.7 Debugging in R

Use Best Practices to Minimize Debugging Time.

A best practice for code development is build-a-little, test-a-little, commit-a-little to minimize the need for debugging large chunks of code.

  • Commit and push your working code to get it into the history.
  • Use a new branch (to be discussed in a later lecture).
  • Think before you start typing code.
    • Analyze the requirements - what are you asked to do or produce.
    • What should the answer look like: size, shape, type, …?
    • Develop your logic - step by step.
    • Predict what the results of the intermediate steps should look like.
  • Code in small pieces against major requirements.
    • Build your code in modular pieces
    • Create small functions and incorporate into larger functions.
    • Use temporary variables for inputs.
    • Use meaningful names for variables.
    • Incorporate status checking as interim steps, e.g., str(), head(), view(), ….
    • Use comments to specify the purpose for the code sections.
  • Add pieces together to create the complete function.
    • Incorporate Error Checking of inputs against arguments.
    • Add bells and whistles after the baseline code is working.
  • Test and Document.
    • Test against defaults and then non-defaults for each argument.
    • Complete the header documentation with {roxygen2}.
  • Commit and push your working code to get it into the history.
  • When ready, merge into the overall baseline (to be discussed in a later lecture).

5.7.1 When Things Go Wrong

The RStudio IDE provides indicators to help find “easy” errors such as missing closing parentheses (braces, brackets).

  • RStudio also has diagnostics you can turn on to indicate other potential issues.
  • See Code Diagnostics in RStudio Preferences

R provides three kinds of feedback that something is perhaps not quite right:

  • Messages give the user a hint something is wrong, or may be missing. They can be ignored, or suppressed altogether with suppressMessages() or message = FALSE in a code chunk option.
  • Warnings don’t stop the execution of a function, but rather give a heads up something unusual is happening. They display potential problems. They can be suppressed with suppressWarnings() or warning = FALSE in a code chunk option.
  • Errors are problems that are fatal: code execution stops, and you cannot suppress or ignore them.
Important

Don’t ignore or suppress warnings or errors on new code!

  • Make sure you know what they are trying to tell you, and that you are okay with the results, before ignoring or suppressing them.

As nice (or inscrutable) as they are, messages, warnings, and errors do not find errors in your logic or typing that are still acceptable to R but result in outputs that are not what you wanted.

  • Always check your outputs for reasonableness in matching your expectations!
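
As a minimal sketch of the three kinds of feedback, the hypothetical helper `safe_log()` below signals a message, a warning, and an error. The function name and its messages are made up for illustration. Note that an error cannot simply be silenced; the last step has to handle it explicitly with `tryCatch()`.

```r
# Hypothetical helper that signals all three kinds of conditions
safe_log <- function(x) {
  if (!is.numeric(x)) {
    stop("`x` must be numeric")             # error: execution stops here
  }
  if (any(x < 0, na.rm = TRUE)) {
    warning("negative values produce NaN")  # warning: execution continues
  }
  message("taking the natural log of ", length(x), " value(s)")
  suppressWarnings(log(x))                  # silence log()'s own NaN warning
}

# Messages and warnings can be silenced once you understand them
quiet <- suppressMessages(suppressWarnings(safe_log(c(1, -1))))

# An error stops execution unless it is handled explicitly
result <- tryCatch(
  safe_log("a"),
  error = function(e) conditionMessage(e)
)
result
```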

5.7.2 Researching Errors and Asking for Help

5.7.2.1 Help Yourself

Look at the R Help and Package Help and vignettes. See Getting Help with R.

  • Double check the arguments of a function against the RStudio help as well as their default values and examples.
Tip

A beauty of R is the use of generic functions, e.g., plot() or tidy() that call other functions based on the class of the argument.

However, this means that the help for the generic function is of little use.

Be sure to identify what specific function is being called, e.g., plot.lm() for plotting the output of a linear model, to understand the arguments and details of the function.
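
One way to find the specific method behind a generic is sketched below using base R tools; the model `fit` is just an illustrative object built from the built-in mtcars data.

```r
# Fit a simple model so there is an object with a class to dispatch on
fit <- lm(mpg ~ wt, data = mtcars)
class(fit)

# List the plot() methods known in the current session
plot_methods <- methods("plot")

# Retrieve the method that plot(fit) dispatches to and inspect its arguments
plot_lm <- getS3method("plot", "lm")
names(formals(plot_lm))

# The help page for the method, not the generic, documents those arguments:
# ?plot.lm
```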

Tip

In extreme cases, go look at the source code for the function at the GitHub repository for the package.

  • Every R package has a /R directory for the source code.
  • This directory usually has many files and each file may have multiple functions.
  • However, with a little bit of searching (ok, maybe more than a little bit), you can find the function you are curious about and look at the code to see what is happening.
  • These functions may use multiple other helper functions, so following the logic may not be trivial.
  • However, you can examine the code, or copy and paste it, or even fork the whole package, play with it and understand what is happening and why you might have an error.

5.7.2.2 Search Online

Include R and/or the package/function name in your search.

  • Be sure to check the date on any response. A response older than two years may be out of date given changes in R and packages.
  • Some questions are common though and remain valid for years, e.g., on REGEX structure which has seen few changes.

The {searcher} and {errorist} packages help you search faster. Balamuta (2020b) Balamuta (2020a)

5.7.2.2.1 The {searcher} Package

The {searcher} package has functions for different search engines which automatically include the terms “denoting” R programming.

  • You can load the {searcher} package manually or you can set it as the default search error handler by using the Base R options() to set global options of which error= is one possible argument.

Let’s load {tidyverse} and {searcher} and create some errors

  • Note the code chunk options include #| error: true so this can render.
library(tidyverse)
library(searcher)
## Using the generic search error handler
log(-1)
[1] NaN
search_stackoverflow("NaNs produced")

options(error = searcher("google"))
z <- list(1, 2, tribble(x = 2, y = c(5, 2, 3)))
Error in `tribble()`:
! Must specify at least one column using the `~name` syntax.
options(error = NULL)
5.7.2.2.2 The {errorist} Package

The {errorist} package invokes {searcher} automatically as specified in the setup.

  • Load {errorist} package with library(errorist) .
  • Turn it on in a code chunk using enable_errorist() where you want it.
  • Turn it off in a code chunk using disable_errorist().
  • You can unload the package like any other package with detach("package:errorist", unload = TRUE).
library(errorist)
enable_errorist()
log(-1)
[1] NaN
z <- list(1, 2, tribble(x = 2, y = c(5, 2, 3)))
Error in `tribble()`:
! Must specify at least one column using the `~name` syntax.
enable_errorist()
z <- list(1, 2, tribble(x = 2, y = c(5, 2, 3)))
Error in `tribble()`:
! Must specify at least one column using the `~name` syntax.
disable_errorist()
z <- list(1, 2, df(x = 2, y = c(5, 2, 3)))
Error in df(x = 2, y = c(5, 2, 3)): unused argument (y = c(5, 2, 3))

5.7.2.3 Make the Error Repeatable: create a “Reprex”

To find the root cause of an error, you may need to execute the code many times.

It’s often worth some upfront investment to make the problem both easy and fast to reproduce.

Create a minimal reproducible example or REPREX by copying the code into a new .R file and removing (or commenting out) code and simplifying data.

  • Keep only the code you need to generate the error (say <15 lines).
  • Create a small test data set, with your own data or using a common dataset such as mtcars.
  • Use a few test cases to see if you can reproduce the error.
    • Check extreme values.
    • You may discover inputs that don’t trigger the error. Make note of them.

If you ask for help, from a peer or online, include a reprex; otherwise the most likely response is a request to submit one.

Often, creating a reprex helps you locate and diagnose the error without asking for additional help.
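
A reprex along the lines described above might look like the following sketch. The data and the failing expression are invented for illustration: a column that was accidentally converted to character. The {reprex} package can format a script like this for posting, but the sketch uses only base R.

```r
# Minimal reproducible example: small data, few lines, one clear failure
small_df <- head(mtcars[, c("mpg", "cyl")], 5)   # built-in data, 5 rows
small_df$cyl <- as.character(small_df$cyl)       # simplified suspected cause

# try() lets the script keep running so the error message can be captured
outcome <- try(sum(small_df$cyl), silent = TRUE)
cat(conditionMessage(attr(outcome, "condition")), "\n")

# Include the environment details that helpers will ask for
R.version.string
```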

5.7.2.4 Reach Out to the Data Science and R Communities for Help

If you can’t find an example that helps you with the error, try asking for help on your specific error.

  1. Find the right community to ask as different communities have different areas of interest.

  2. Search to see if there is a similar question already. If not, …

  3. Formulate your request

    • Succinctly describe the issue, the error message, and your questions.
    • Be sure to include a reprex.
    • Include your operating system, and versions of R, RStudio, and key packages.
  4. Sit back and wait for a response.

  5. If you get a useful response, be sure to identify it as the accepted answer.

    • This helps the answerer get “credit” and other people looking for a good answer to a similar question.
  6. Say thank you! Most answerers are volunteers so they spent their time helping you for free!

5.7.3 Locating the Source of an Error

Once you’ve made the error repeatable, the next step is to figure out where it comes from and what is causing it. This is the hard part.

Important

Keep in mind, the error may be caused by code much earlier than where the error message occurs.

  • This could be due to error in manipulating data earlier in the code, e.g., the current function is expecting a data frame and due to an earlier error, the data is now a single value.
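
The sketch below illustrates this situation with the built-in mtcars data; the object names are hypothetical. The mistake happens in step 1, but R only reports an error in step 2.

```r
# Step 1 (the real mistake): selecting one column with [ , ] silently
# drops the data frame down to a plain numeric vector
mileage <- mtcars[, "mpg"]
is.data.frame(mileage)

# Step 2 (where the error surfaces): later code assumes a data frame
msg <- tryCatch(
  mileage$mpg,
  error = function(e) conditionMessage(e)
)
msg

# Keeping the data frame structure in step 1 avoids the downstream error
mileage_df <- mtcars[, "mpg", drop = FALSE]
is.data.frame(mileage_df)
```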

Follow a systematic process of generating and testing hypotheses about the cause and track your test case results.

One of the advantages of using an Integrated Development Environment (IDE) like RStudio or Visual Studio Code over a plain text editor is the built-in capabilities for debugging.

Debugging tools speed the process of elimination to identify the line of code causing the error and what is actually causing the error.

These tools include functions such as traceback() and browser() and the ability to use “breakpoints” in the integrated debugging environment for R Scripts.

There is no need to write multiple print(x) statements to look at one result at a time. Debugging tools allow you to interact with the code while it is running.

5.7.3.1 traceback() in RStudio

The traceback() function is not interactive, but it does show the sequence of function calls that occurred leading up to the error.

  • This can help identify where the error occurred and why.

Here’s a simple example where f() calls g() calls h() calls i(), which checks if its argument is numeric:

f <- function(a) g(a)
g <- function(b) h(b)
h <- function(c) i(c)
i <- function(d) {
  if (!is.numeric(d)) {
    stop("`d` must be numeric", call. = FALSE)
  }
  d - 10
}

Run f("a") in an RStudio code chunk. You should get an error message and, at the right end of the message, a blue up-facing arrow with the words Show Traceback.

f("a")
Error: `d` must be numeric

Click the “Show Traceback” arrow to see the steps that occurred along the way to the error

  • It reads from bottom to top, from the first function call (bottom) to the one that errored out (top).

5.7.3.2 traceback() as a function

If you’re not using RStudio, you can use traceback() to get the same information.

f("a")
Error: `d` must be numeric
traceback()
No traceback available 
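  • Note: “No traceback available” appears above because these notes were rendered non-interactively; in an interactive session, traceback() lists the calls from f() through i().
  • Read the traceback() output from bottom to top as well.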
  • If you’re calling code that you source()d into R, the traceback will also display the location of the function, in the form filename.r#linenumber.
  • These are clickable in RStudio, and will take you to the corresponding line of code in the editor.
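
When code runs non-interactively (for example, while rendering a document), traceback() has nothing to report. One possible workaround, sketched here with base R only, is to record the call stack yourself with sys.calls() inside a calling handler. The variable names `call_stack` and `wanted` are illustrative.

```r
f <- function(a) g(a)
g <- function(b) h(b)
h <- function(c) i(c)
i <- function(d) {
  if (!is.numeric(d)) stop("`d` must be numeric", call. = FALSE)
  d - 10
}

call_stack <- NULL
result <- tryCatch(
  withCallingHandlers(
    f("a"),
    error = function(e) {
      # The failing calls are still on the stack while this handler runs
      calls <- sys.calls()
      call_stack <<- vapply(calls, function(cl) deparse(cl)[1], character(1))
    }
  ),
  error = function(e) conditionMessage(e)
)

# Keep only the calls to the example functions (the stack also contains
# the tryCatch() and handler machinery)
wanted <- c('f("a")', "g(a)", "h(b)", "i(c)")
call_stack[call_stack %in% wanted]
```

Unlike traceback(), the recorded stack is listed from the first call to the most recent one.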

5.8 Using the RStudio Interactive Debug Environment

RStudio has a special debugging environment designed to help you debug faster by allowing you to start code execution and then stop and interact with the code inside the function environment.

If traceback() is unclear about the location or the error is “hidden” (is actually occurring before the line that resulted in the error message), RStudio has an interactive debugger so you can pause the execution of a function and interactively explore its state.

5.8.1 The browser() Function

You can insert a call to browser() in scripts and R code chunks where you want to pause and examine the function.

For example, we could insert a call to browser() in g().

  • Note the changes in the Console Tab commands and the environment Tab and the appearance of a Traceback pane.
  • You also get a special prompt in the Console Tab: Browse[1]>.
  • You need to re-source the functions.
g <- function(b) {
  browser()
  h(b)
}
f("a")

browser() is just a regular function call so you can run it conditionally by wrapping it in an if statement.

A dummy code example.

```{r}
#| eval: false
g <- function(b) {
  if (b < 0) {
    browser()
  }
  h(b)
}
```

Another dummy code example:

  • If the error does not appear right away in a loop, use an if statement to start debugging only after hundreds of loop iterations, when the error might first appear:
for (i in 1:1024) {
  ##      start_work ...
  if (i == 512) {
    browser()
  }
  ##      finish_work ...
}

5.8.2 Use Breakpoints to Help Debug R Script Files

When you are editing in a .R script file (not an R Markdown or .qmd file), you have access to a wider array of debugging tools.

Breakpoints are an easy way to identify individual lines of the code where you want to pause the function.

  • You can set a breakpoint in three ways:
    1. Toggle on/off by clicking to the left of the line number in the script file.
    2. Press Shift+F9.
    3. Use the Debug menu and select Toggle Breakpoint.
  • Save and source/run the script.

Once the code pauses, you are placed into the interactive Debug Environment.

  • This is why many people build their functions in R Script files and then copy them into R Markdown code chunks when they are fully working.

5.8.2.1 Extracting Code Chunks into an R Script File

The {knitr} package’s purl() function extracts all the code chunks from a .qmd or .Rmd file into an R Script File (to make for easier debugging).

Run knitr::purl("path_to_qmd_file/file_name") in the console and it will create a .R file in the current working directory, with the same base name as your file and just the code from the code chunks.

  • It may be convenient to change the Console working directory to match the .qmd file working directory first.
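
A small sketch of purl() in action follows. The file names are hypothetical and everything is written to a temporary directory; the explicit `output` argument is used here so the result does not depend on the working directory.

```r
# Hypothetical file names, written to a temporary directory for illustration
rmd_file <- file.path(tempdir(), "purl_demo.Rmd")
r_file   <- file.path(tempdir(), "purl_demo.R")

# Build a tiny R Markdown document with one code chunk
fence <- strrep("`", 3)
writeLines(c(
  "# Demo",
  "",
  "Some narrative text.",
  "",
  paste0(fence, "{r}"),
  "x <- 1 + 1",
  fence
), rmd_file)

# Extract only the code chunks; `output` sets where the .R file is written
knitr::purl(rmd_file, output = r_file, quiet = TRUE)
readLines(r_file)
```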

Assuming you set a breakpoint, when you run the function, it will pause at the breakpoint (if it gets to it).

  • When your code pauses, the IDE enters the Debug environment.

The Debug environment has a variety of tools for inspecting and altering the state of your program.

5.8.3 Environment Pane

In debug mode, the IDE views and interacts only with the currently executing function’s environment.

  • The objects you see in the Environment pane belong to the currently executing function.
  • Statements you enter in the Console will be evaluated only in the context of the function.
  • The external environment is “masked”.
  • Above the list of local objects is a very small drop-down list with the “environment stack”, the list of places R will search to resolve variable names to values.

5.8.4 Traceback Pane

The traceback shows how execution reached the current point, from the first function that was run (at the bottom) to the function that is running now (at the top).

  • You can click on any function in the traceback callstack to see the current contents of its environment and the execution point in the function’s code, if it can be determined.
  • Note: selecting a frame in the callstack does not change the active environment in the console!

5.8.5 Code Window

The code window shows you the currently executing function.

  • The line about to execute is highlighted in yellow.

5.8.6 Console

The R console in debug mode still supports all the same commands as the ordinary console, with a few differences:

  • Statements are evaluated in the current environment - try ls()
  • Simply pressing Enter at the console will execute the current statement and move on to the next one.
  • A variety of special debugging commands are available
  • There’s a new toolbar on top of the console with convenient buttons to execute the special debug control commands.
  • There’s no difference between using the toolbar and entering the commands at the console directly.

| Command | Shortcut | Description |
|---|---|---|
| n or Enter | F10 | Execute next statement |
| s | Shift+F4 | Step into function |
| f | Shift+F6 | Finish function/loop |
| c | Shift+F5 | Continue running |
| Q | Shift+F8 | Stop debugging |

5.8.6.1 Exercise

  • Open the R_debug_demo.R file in the R folder.
  • Set Breakpoints at lines 7 and 20.
  • Source the file.
  • Run fibvec(7) to completion.
    • What happens in the environment pane?
    • Use the console to check the value of fibvec when i = 3.
  • Run fibnos(c(4,5,6)).
    • What happens in the environment pane?
    • Use the console to check the value of fibnos when i = 2.

5.8.7 Special Situations

5.8.7.1 Debugging in Quarto and R Markdown documents

Breakpoints don’t currently work inside R code chunks in Quarto or R Markdown documents.

  • Use browser() to halt execution in a chunk if needed.
  • By default, RStudio renders R Markdown documents using a separate R process when you click the Knit button.
  • However, debugging only works with the primary R process, so when rendering the document for debugging, you’ll need to ensure it renders as a primary process using the render() function.

5.8.7.2 Using rmarkdown::render()

To render as a primary process call rmarkdown::render() directly on your file via the console: rmarkdown::render("~/mydocs/doc.Rmd")

  • When the interactive debugger shows up in the console, it will not print user output.
  • If you want to see output in the console, you should use sink() in the debug console.

Finally, because R Markdown chunks don’t contain source references, most of the debugger’s visual features are disabled.

  • You won’t see the active line highlighting in the editor
  • Most debugging will need to be done in the console.

5.8.8 Debugging in Shiny Applications

Breakpoints can’t be set until the application is executed; the function objects that need to have breakpoints injected don’t exist until then.

Breakpoints in Shiny applications only work inside the Shiny server function.

  • You can use the browser() debug environment if the Breakpoints are not helping.
  • Breakpoints are not currently supported in the user interface (i.e. ui.R), globals (i.e. global.R), or other .R sources used in Shiny applications. This may be improved in a future version of RStudio.
  • Finally, be aware that Shiny’s infrastructure will show up in the callstack, and there’s quite a lot of it before control reaches your code!
  • The challenges that can arise with long shiny apps are a good reason to use functions and modules to streamline your shiny apps (for a later discussion)!

5.8.9 Fix Code and Testing Fixes

Do not just “hard code” a solution, e.g., replacing a variable with a number; a one-off “hack” inevitably leads to later problems.

Once you have found the error and understand why it is happening, update the code to fix it.

Then test the fix to see if it actually worked across multiple test cases.
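
As a sketch of what testing a fix across multiple cases can look like, consider a hypothetical function `rescale01()` that originally failed on constant input (division by zero). The function and its test cases are invented for illustration.

```r
# Hypothetical function under repair: rescale a numeric vector to [0, 1]
rescale01 <- function(x) {
  stopifnot(is.numeric(x))
  rng <- range(x, na.rm = TRUE)
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))   # the fix: avoid dividing by zero
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

# Re-run several cases, not only the one that originally failed
stopifnot(
  identical(rescale01(c(0, 5, 10)), c(0, 0.5, 1)),  # typical input
  identical(rescale01(c(3, 3, 3)), c(0, 0, 0)),     # constant input (the bug)
  identical(rescale01(c(-2, 2)), c(0, 1)),          # negative values
  is.na(rescale01(c(1, NA, 3))[2])                  # missing values preserved
)
```

Note that the fix handles the general case (any constant vector) rather than hard coding the one value that triggered the error.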

Once you are satisfied…

  • Create a comment in the code to document any key or unique ideas.
  • Commit and push your code.
  • Continue to build a little and test a little. :)

If you got help from the community, acknowledge the solution worked, and say thank you.

5.9 Building R Packages

Building packages is an optional topic which may or may not be covered in class or assignments.

However, since the package is the fundamental unit of shareable code in R, information in this section and its references can be helpful in understanding how the R ecosystem works.

5.9.1 Learning Outcomes

  • Use R and RStudio tools to build packages for personal or distributed use.
  • Document functions using Roxygen2 tags.

5.9.1.1 References

Note
  • The notes below include the word Execute at the beginning of a line to indicate this is a step you should do in your console to build your own package as you go through the notes.
  • The notes identify additional actions you can do to improve your package as well, e.g., enhancing the function with {roxygen2} tags, adding more testing cases and updating other files.
  • The notes also put “{ }” around package names to help identify them as packages.

5.9.2 Introduction

  • In R, a package is the fundamental unit of shareable code.
  • A package bundles together code, data, documentation, and tests, and is easy to share with others.
  • As of 5/20/23, there were 19,530 packages available on CRAN in 43 major categories (or views).
  • This does not count the many packages that are distributed on GitHub but are not hosted by CRAN yet.
  • This huge variety of packages is one of the reasons R is so successful: the chances are that someone has already solved a problem similar to what you are working on and you can benefit from their work by downloading their package.
  • Popular packages cover the Data Science life cycle.
  • You already know how to use packages:
    • Install them from CRAN with install.packages("x").
    • If not on CRAN yet, you can often install from GitHub with devtools::install_github("owner/repo").
    • Load and attach them in R with library("x").
    • Get help on them with package?x and help(package = "x").
  • You can also write your own!
    • Save your time by enabling reuse across multiple projects
    • Share with others by bundling your code into a package so any R user can use it.
  • Some people use the package as the fundamental unit of their analysis projects, not just for code management

5.9.2.1 Getting Ready

  • Use the console to install or update several packages to help you create your own packages
  • install.packages(c("devtools", "roxygen2", "testthat", "knitr", "usethis"))
  • RStudio has tools to help you use these packages as you go through the package development workflow
    • Check that you have a recent version of RStudio installed.
    • Run install.packages("rstudioapi") in the console.
    • Then run rstudioapi::getVersion(); the release it reports should be no more than about 6 months old.
  • Use the following code in the console to access the newest devtools functions.
    • Use line number 1 for ALL and say No to installing versions requiring compilation.
devtools::install_github("r-lib/devtools")
  • You will also need a C compiler and a few other command line tools. If you don’t already have them, RStudio will usually install them for you. Otherwise:
    • On Windows, download and install Rtools. NB: this is not an R package! See Using RTools40 for Windows
    • On Mac, make sure you have either XCode (available for free in the App Store) or the “Command Line Tools for Xcode”. You will need to have a (free) Apple ID.
  • Execute: Check everything is installed and working by running the following code.
  • If everything is okay, it will return Your system is ready to build packages!
library(devtools)
has_devel()

5.9.3 Creating Your Package

5.9.3.1 Packages Require a Standard File Structure

  • The structure is designed to enforce rules for working well with Base R and other compliant packages

  • Repositories often require specific implementations of the files to allow posting

  • The smallest usable package has three components:

    1. An R/ directory, with R code.
    2. A DESCRIPTION file, with package metadata.
    3. A NAMESPACE file.
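
To show how little is required, the sketch below builds the three components by hand in a temporary directory using base R. The package name `testpackage`, the function `hello()`, and the metadata values are placeholders; in practice the helper functions introduced later create these files for you.

```r
pkg_dir <- file.path(tempdir(), "testpackage")   # hypothetical package name
dir.create(file.path(pkg_dir, "R"), recursive = TRUE, showWarnings = FALSE)

# 1. An R/ directory with R code
writeLines(
  "hello <- function() \"Hello from testpackage\"",
  file.path(pkg_dir, "R", "hello.R")
)

# 2. A DESCRIPTION file with package metadata (placeholder values)
writeLines(c(
  "Package: testpackage",
  "Title: A Minimal Example Package",
  "Version: 0.0.0.9000",
  "Description: Illustrates the smallest usable package structure.",
  "License: GPL-3"
), file.path(pkg_dir, "DESCRIPTION"))

# 3. A NAMESPACE file declaring what the package exports
writeLines("export(hello)", file.path(pkg_dir, "NAMESPACE"))

list.files(pkg_dir, recursive = TRUE)
```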

5.9.3.2 The First Step: Picking a Name for your Package

  • Before you can create your first package, you need to come up with a name for it.
  • There are three formal requirements:
    • The name can consist only of letters, numbers, and periods (no “-” or “_”).
    • It must start with a letter, and
    • It cannot end with a period.
  • Although periods are allowed, you should avoid them to prevent confusion with other structures (such as file extensions and S3 methods).
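
The formal requirements can be expressed as a single regular expression. The helper `is_valid_pkg_name()` below is a hypothetical sketch, not part of any package; it checks only the syntax of a name, not whether the name is already taken.

```r
# Hypothetical helper: TRUE when a name meets the formal requirements
is_valid_pkg_name <- function(name) {
  # Starts with a letter, then letters/numbers/periods, and does not end
  # with a period. As written, the pattern also requires at least two
  # characters, which R requires as well.
  grepl("^[A-Za-z][A-Za-z0-9.]*[A-Za-z0-9]$", name)
}

candidates <- c("dplyr", "data.table", "my_pkg", "my-pkg", "2fast", "stats.")
setNames(is_valid_pkg_name(candidates), candidates)
```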
5.9.3.2.1 Best Practices
  • Avoid capital letters (other than perhaps one).
  • Pick a unique name you can easily Google.
  • Check if a name exists on CRAN by trying the following in your browser: http://cran.r-project.org/web/packages/[mypackagename].
    • Replace [mypackagename] with your potential name, e.g., http://cran.r-project.org/web/packages/dplyr
    • If you get a package page, time to pick a new name!
  • Use the {available} package to check if your name is used elsewhere, or
    • if it has another meaning in pop culture:
install.packages("available")
library(available)
available("ggplot2")
available("rttest")
available("nsfw") #non-parametric statistics functions with weighting?
  • Find a word that evokes the problem and modify it so it’s unique:
    • plyr is a generalization of the R apply family, and evokes using pliers to twist or manipulate data.
    • lubridate makes dates and times easier - it lubricates the process.
  • Use abbreviations:
    • Rcpp = R + C++ (plus plus)
    • lvplot = letter value plots.
  • Add an extra R:
    • stringr provides string tools.
    • tourr implements grand tours (a visualization method).
  • If you’re creating a package that talks to a commercial service, check the branding guidelines.
    • For example, rDrop is not called rDropbox because Dropbox prohibits any applications from using the full trademarked name.

5.9.3.3 Second Step: Create a New “Skeleton” Package

  • Pick a folder for where you want to create the new folder for your package.
    • The folder must NOT be in an existing RStudio Project or Git Repository!
    • Go to the folder location in the terminal and run git status to check.
  • Once you have confirmed the folder is not a repo or R Project, set the console working directory to be the same as the folder.
  • You create packages using the console.
  • The {usethis} package has functions designed to ease the process.
  • Execute: create_package("path/packagename") to create a “skeleton” R package and invoke a new RStudio Session.
    • Replace path with the relative path from the console working directory to your desired package location.
library(devtools)
create_package("../../../../R_packages/testpackage")

create_package() Results
  • You should see two icons for RStudio.
  • The new one has the name of the RStudio Project on it.
5.9.3.3.1 Your New Package has a Specific Directory Structure and Default Files
  • Some of these you can edit directly and others you will only indirectly edit in other ways, and others you will not edit at all.
  • .gitignore anticipates Git usage and ignores some standard, behind-the-scenes files created by R and RStudio.
  • .Rbuildignore lists files that we need to have around but that should not be included when building the R package from source.
  • DESCRIPTION provides metadata about your package.
    • Every package must have a DESCRIPTION file.
    • It’s the defining feature of a package (RStudio and devtools consider any directory containing a DESCRIPTION file to be a package).
    • We will edit this shortly.
  • The NAMESPACE file declares the functions your package exports for external use and the external functions your package imports from other packages.
    • At the moment, it holds temporary-yet-functional placeholder content.
  • The R/ directory is the “business end” of your package.
    • It will soon contain .R files with the function definitions for your package (or app.R files for a shiny app).
  • packagename.Rproj is the file that makes this directory an RStudio Project.
  • .Rproj.user, if you have it, is a directory used internally by RStudio.
5.9.3.3.2 Test the Initial Skeleton Package
  • You want to test your skeleton package to see if it will build (compile) and install properly.

  • Notice, there is a new tab called “Build” in the same pane as Environment and History.

  • Your package is empty of code but you can test if it works from a package perspective.

  • Execute: In the new RStudio session go to the Build Tab (or the “Build menu”) and select “Install and Restart”

Build, Install and Restart
5.9.3.3.3 Establish the New Package as a Git Repository
  • You could use the terminal with git init
  • The {usethis} package also has a function for git (and many more we will use).
  • Execute: Load the {usethis} package using the console.
library(usethis)
  • The package directory is already an R source package and an RStudio Project.

  • Execute: Now make it a Git repository, with use_git().

usethis::use_git()
  • You will get asked to commit some files. Choose to do so. This will help establish the HEAD pointer for the project/package/repo.
  • You will also get asked to restart so it can set up the Git Pane in RStudio. Choose to do so.
    • Go to the Files tab and select “More, Show Hidden Files” to see the .git folder
    • You can also go to your computer file manager and look for hidden files
    • Cmd+Shift+. in the Mac Finder

5.9.4 One-Time Actions

  • There are a number of administrative actions to take to put more flesh on the skeleton package.
  • These provide basic information for package users and create directories and templated starter files to help create a user-friendly package.
  • Since documentation often lags code development, this package development process is designed to build what you need in the beginning so there is a minimum viable set of information you need for others to use your package instead of running out of time at the end.

5.9.4.1 Update the DESCRIPTION file Metadata.

  • Execute: Edit the DESCRIPTION file.
    • Make sure the Authors@R is populated correctly.
      • aut -> Author
      • cre -> Creator (the package maintainer)
    • Modify the Title and Description fields.
    • Save the file.
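
The Authors@R field holds R code that builds a person object with base R's person() function. The sketch below uses a made-up name and email address; running it in the console is a quick way to check the entry before pasting it into DESCRIPTION.

```r
# Hypothetical author details; replace with your own
author <- person(
  given  = "Jane",
  family = "Doe",
  email  = "jane.doe@example.com",
  role   = c("aut", "cre")   # author and creator/maintainer
)
author

# The matching DESCRIPTION entry would look like:
# Authors@R: person("Jane", "Doe", email = "jane.doe@example.com",
#                   role = c("aut", "cre"))
```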

5.9.4.2 Choose a License for the Package

  • See Choose a License for more info (especially the section on “I don’t want to choose a license”).
    • MIT License: A short and simple permissive license with conditions only requiring preservation of copyright and license notices. Licensed works, modifications, and larger works may be distributed under different terms and without source code.
    • GNU GPLv3: Permissions of this strong copyleft license are conditioned on making available complete source code of licensed works and modifications, which include larger works using a licensed work, under the same license. Copyright and license notices must be preserved. Contributors provide an express grant of patent rights.
  • Execute: Use one of the following functions to create a license.
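use_mit_license("copyright holder name")
use_gpl_license(version = 3)  # replaces the deprecated use_gpl3_license()
use_apache_license()          # replaces the deprecated use_apl2_license()
use_cc0_license()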

5.9.4.3 Add a README.Rmd file for GitHub for New Users

  • Execute: Use use_readme_rmd() and you can edit it later.

  • The goal of the README.Rmd is to answer the following questions for new users about your package:

    • Why should I use it?
    • How do I use it?
    • How do I get it?
  • On GitHub, the knitted README.md will be rendered as HTML and displayed on the repository home page.

  • Recommended Structure for a README.Rmd:

    • A paragraph describing the high-level purpose of the package.
    • An example that shows how to use the package to solve a simple problem.
    • Installation instructions using code that can be copied and pasted into R.
    • An overview of the main components of the package. For more complex packages, point to vignettes.
  • Execute: Edit the README.Rmd file to change the Installation section (the lines starting with “You can install the released version of {mypackagename} from CRAN …”) to something like:

    • “This package is only available by permission of the author from my GitHub repo using the install_github() function from the {devtools} or {remotes} packages.”
remotes::install_github("mygithubid/mypackagename", build_vignettes = TRUE)
  • Execute: Knit the README.Rmd file and then add and commit the files so everything is up to date.

5.9.4.4 Add a News File for Updating Current Users on Changes

  • A NEWS file helps people keep track of what is happening with the package.
  • Execute: Use use_news_md() to create a Markdown file, NEWS.md, that you can edit.
  • Recommended Structure for a NEWS.md:
    • Show the current version number as a major heading.
    • List the user-visible changes worth mentioning.
5.9.4.4.1 NEWS File Workflow
  • As you develop your code for each new release, add a new header with the version number at the top of the file and then list the changes in a bullet list.
  • Don’t discard old items; leave them in the file after the newer items. This way, a user upgrading from any previous version can see what is new.
  • If the NEWS file gets very long, move some of the older items into a file named ONEWS and put a note at the end referring the user to that file.
  • There is a {newsmd} package that may be of interest for long-term package development

5.9.4.5 Add Spellcheck into the Workflow.

  • Execute: Install the {spelling} package and run use_spell_check().
  • It will add some files for use later on.
install.packages("spelling") # in the console
use_spell_check()

5.9.4.6 Add a Directory for Raw Data (if needed)

  • If you are going to include raw data in your package or use it in developing your package, include a data-raw folder where the data is created/formatted.

  • Execute: Use use_data_raw()

    • Save your scripts for processing the data in this folder.
  • If you are going to include processed data in your package, you can call usethis::use_data(mydata), with the name of the data to be saved, in your processing script.

    • It will create a data directory (if not already there) and save mydata with the desired format (e.g., .Rda) and compression.
    • See line 34 of Hadley’s Babynames names.R script and then check the data directory
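
The sketch below approximates, in base R only, what a data-raw script ending in use_data() accomplishes. The data frame, the temporary directory standing in for the package's data folder, and the choice of bzip2 compression are assumptions for illustration; inside a real package you would simply call usethis::use_data(mydata).

```r
# Stand-in for a data-raw/ script: build a small data frame from "raw" values.
mydata <- data.frame(
  name = c("Ava", "Ben", "Cy"),
  n    = c(12L, 7L, 3L)
)

# A temporary directory plays the role of the package's data/ folder.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
rda_path <- file.path(data_dir, "mydata.rda")

# Save one object per file, with the file named after the object.
save(mydata, file = rda_path, compress = "bzip2")

# Reload into a fresh environment to confirm the object round-trips.
check_env <- new.env()
load(rda_path, envir = check_env)
identical(check_env$mydata, mydata)
```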

5.9.4.7 Check Your Package

  • Execute: Run Install and Restart in RStudio to check everything still works.
  • Your screen may look something like this:

5.9.4.8 Add and Commit …

  • Execute: A good time to make some git history:)

5.9.5 Typical Workflow for Package Development

  • A typical workflow uses the following steps and may go back and forth:
#  | Action                                      | Keyboard Shortcut
1. | Write Code and Document Functions           |
2. | Restart R Session                           | Cmd+Shift+F10 (Ctrl+Shift+F10 for Windows)
3. | Build and Reload (or Install and Restart)   | Cmd+Shift+B (Ctrl+Shift+B for Windows)
4. | Test Package                                | Cmd+Shift+T (Ctrl+Shift+T for Windows)
5. | Check Package                               | Cmd+Shift+E (Ctrl+Shift+E for Windows)
6. | Document Package                            | Cmd+Shift+D (Ctrl+Shift+D for Windows)

5.9.5.1 Write Code and Document Functions

  • The bulk of your time will be writing (and debugging) code as normal.
  • Writing code for a package has a few additional considerations.
  • You will be deciding whether to use functions from other packages and if so, how to access them in your package
    • Do you import the whole package, import just a few functions, or just load the package and access individual functions via the :: operator?
  • You will be deciding which of your functions need to be exported so your package users can access them and which should stay internal, e.g., helper functions
  • As you make those decisions, you will use various functions to help implement them.
  • You will also be documenting your functions as you go along using roxygen2 tags.
  • As you update your code, remember Git and GitHub are your friends.
    • Think about using branches to address specific tasks.
    • Add, commit, and push to collaborate.
    • Use Pulls and Merges to update the master baseline.

5.9.5.2 Restart R Session

  • You can restart the R Session at any time
  • Restarting enables you to re-establish a clean environment for your code as you write and debug.
    • It is a good practice to ensure you have not done something in the console that is now lurking in the background and covering up a bug in the code.
  • Once you restart, you will have to run library(devtools) in the console again.
    • It will load and attach the {usethis} package as well.
  • Back to coding (and debugging) and Git, ….

5.9.5.3 Build and Reload

  • You can Build and Reload at any time.
    • “Build and Reload” is the same as the “Install and Restart” in RStudio Build Pane.
  • This is a good way to check your package still functions as a package.
    • Are all the necessary files in the right places and using the proper syntax to interact with each other?
    • Can your package be installed?
  • Build and Reload also restarts R so…
  • Once you restart, you will have to run library(devtools) in the console.
    • It will load and attach the {usethis} package as well.
  • Back to coding (and debugging) and Git, …

5.9.5.4 Test Package

  • Build and Reload tests if the package works as a package - it does not check if your code makes sense.
  • The package development process includes several capabilities to streamline the development of tests for your code.
  • These are especially important for what is known as “regression testing”: does your new code break your old code?
  • You will build up test cases with each version and the testing capabilities make this a much less painful process than running individual tests by hand each time.
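
To make the idea of a regression test concrete, here is a base-R sketch that uses stopifnot() as a stand-in for the package testing tools. It assumes the small fbind() function that this chapter builds as its running example; the specific checks are illustrative choices, not a prescribed test suite.

```r
# The running example: concatenate two factors.
fbind <- function(a, b) {
  factor(c(as.character(a), as.character(b)))
}

result <- fbind(factor(c("dog", "cat")), factor(c("gerbil", "parakeet")))

# Each check records behaviour that must keep working in later versions.
stopifnot(
  is.factor(result),
  identical(as.character(result), c("dog", "cat", "gerbil", "parakeet")),
  identical(levels(result), c("cat", "dog", "gerbil", "parakeet"))
)
```

Rerunning a script like this after every change is the manual version of what the package testing tools automate.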

5.9.5.5 Check Package

  • Checking your package uses devtools::check() to run R CMD check, which conducts many individual checks “using all known best practices.”
    • check() runs several functions to update your files and then builds your package and runs R CMD check mypackage.tar.gz
  • Passing R CMD check is essential if you want to submit your package to CRAN.
  • Even if you are not submitting to CRAN, at least ensure that there are no ERRORs or WARNINGs: these typically represent serious problems.

5.9.5.6 Document Package

  • As you execute the steps above, you will be creating the inputs for the essential documentation for your package.
  • You can also expand upon this documentation using vignettes as well as expanding your README and NEWS files.
  • Vignettes are R Markdown documents that describe how to use your package.
  • We’ll cover several of these steps in much more detail in the sections that follow.

5.9.6 Writing Code

  • Writing code most likely includes writing functions.
  • All code goes into the R/ directory (other than the scripts in data-raw/ that turn your raw data into .rda files in data/).
  • If you have already written your functions elsewhere, you can copy and paste them into the R directory.

5.9.6.1 The use_r() Function Creates or Opens any .R file

  • You may want to use the use_r() function: use_r("file_name") .
  • It opens an existing .R file or creates a new one for your function’s code.
  • use_r() can make navigating your R files much easier when you have a lot of files.
  • Suppose you wanted to create a function to concatenate the levels of two factors.
fbind <- function(a, b) {
  factor(c(as.character(a), as.character(b)))
}
### Test it
fbind(as.factor(c("dog", "cat")), as.factor(c("gerbil", "parakeet")))
  • Execute: Create a new .R file for the fbind() function.
### use_r("fbind")
  • This opens up a tab in the Source pane with the new file so you can enter your code.
  • Execute: Put the code for fbind() and only for fbind() in R/fbind.R and save it.
Important
  • Do not include any calls to library()!
  • Packages and scripts use different mechanisms to declare their dependency on other packages.
  • We’ll document the function later.

5.9.6.2 Instead of source(), Use load_all() to Access the Function

  • Instead of using source(), we use a {devtools} function to access the function.
  • Execute: Use the console to load the {devtools} package with library().
  • Execute: Call load_all() to make fbind() available to other functions/scripts.
  • Execute: Test it in the console of your package:
    • fbind(as.factor(c("dog","cat")),as.factor(c("gerbil","parakeet")))
  • Note: load_all() only simulates the process of building, installing, and attaching the new package.
  • As packages grow with more and more functions, load_all() provides a more accurate sense of how the package is developing than creating test functions in the global workspace.
    • It’s also faster than actually building, installing, and attaching the package.
  • Execute: Add and Commit the new file now.

5.9.7 Check Your Package Still Works Using devtools::check()

  • Your function worked as desired on the command line.
  • As you add more and more functions, you may want to see if all the moving parts of the package still work together.
  • Although shown as Step 5 in the table, you can run check() at any time to see if your package is in working order (it does not check your code for logic errors!).
  • check() is a convenient way to run the shell command R CMD check for checking if an R package is in full working order, without leaving your R session.
  • check() executes many checks of the code and package files for compliance with package standards.
  • It provides a list of the checks as well as outputs errors, warnings and notes at the end.
  • Execute: Use the console to run check().
    • You should not have any errors at this point.

5.9.8 Background: the Search Path and Load vs. Attach

  • When writing a function, you will often want to use functions from other packages, e.g., {magrittr} (%>%) or {ggplot2}.
  • As stated earlier, packages do not use the library() function to load (attach) other packages, they import them in a different manner.
  • Packages import functions from the “NAMESPACE” of other packages.
  • If you plan to submit a package to CRAN, this even applies to functions in packages you think of as always available, such as stats::median() or utils::head().
  • To use functions from another package, two conditions have to be true:
  1. The package must be installed somewhere R can find it, and
  2. The function must be accessible either by use of the :: operator or by being imported into your package’s namespace.

5.9.8.1 The Search Path

  • To call a function (or use a variable), R first has to find it.
  • R first looks for the name in the Global Environment.
  • If R does not find the name there, it looks in the current search path.
  • The search path is the ordered list of all the packages you have attached (the global environment is listed first).
    • You can see this list by running search().
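
A short base-R sketch of inspecting the search path follows. It assumes a default R session, where {stats} is one of the packages attached at startup.

```r
# The first entry is the global environment; attached packages follow.
search_path <- search()
head(search_path, 3)

# utils::find() reports which entries on the search path define a name.
median_location <- utils::find("median")
median_location
```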

5.9.8.2 Loading and Attaching a Package are Not the Same.

  • Let’s assume a package is installed (condition 1 above)
  • To use it in our code, we often say we need to load the package using library(), but that is technically imprecise.
  • Loading a package makes its contents (including the namespace) available in memory, but does Not add it to the search path.
    • So if a package is just loaded, you still can’t access its components without using ::.
    • Confusingly, :: will also load a package if it is not already loaded, but it does not attach it.
  • It’s rare to just load a package explicitly, but you can do so with requireNamespace() or loadNamespace().
  • Attaching puts the package in the search path so you don’t need to use ::.
    • You can’t attach a package without first loading it.
    • Both library() and require() load, then attach, the package.
  • If a package is not installed, loading (and hence attaching) will fail with an error.
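
The difference between loading and attaching can be observed directly. This sketch uses {splines} only because it ships with R and is not attached by default; in a fresh session attached_before is expected to be FALSE.

```r
# Load the namespace without attaching it.
loadNamespace("splines")
loaded_only     <- isNamespaceLoaded("splines")
attached_before <- "package:splines" %in% search()

# Loaded but not attached: the function is reachable with :: only.
basis <- splines::bs(1:10, df = 3)

# library() loads (already done here) and then attaches the package.
library(splines)
attached_after <- "package:splines" %in% search()

c(loaded = loaded_only,
  attached_before = attached_before,
  attached_after = attached_after)
```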

5.9.9 The NAMESPACE File

  • With apologies to the bard and Romeo, but when Juliet was doing her R and not talking to Romeo, she was heard to lament …

What’s in a namespace? That which we call our function
by any other name would not smell sweet at all …

5.9.9.1 Namespaces

  • As the name suggests, R namespaces provide “spaces” for “names”.
  • They allow R to look up the value of an object by its associated name.
  • As we have seen before, R uses namespaces to ensure all objects, including functions and variables, have unique names within the local environment.
  • If you attach a package that exports a function with the same name as a function from a package attached earlier, R will “mask” the earlier function in favor of the new one.
  • The order in which you attach packages matters; attaching {purrr} and then {maps} means when R follows the search path, it will find maps::map() instead of purrr::map().
    • The :: operator differentiates between functions with the same name from different packages by adding the package name to the function name.
    • So, in the prior example, you could still use the now-masked {purrr} version of the function called map(), but you would need to use purrr::map().
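
Masking can be seen without installing any packages. In this base-R sketch, a function defined in the working environment sits ahead of every attached package in the lookup order, so it masks stats::median(); the replacement function is a deliberately silly placeholder.

```r
# A user-defined function with the same name as stats::median().
median <- function(x, ...) "my own median"

masked_result   <- median(c(1, 2, 10))         # the new definition is found first
explicit_result <- stats::median(c(1, 2, 10))  # :: goes straight to the package

rm(median)                                     # remove the mask
restored_result <- median(c(1, 2, 10))         # stats::median() is found again
```

The same lookup logic applies when two attached packages export the same name: the one found first wins, and :: reaches the other.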

5.9.9.2 Imports and Exports

  • Namespaces make your packages self-contained in two ways: the imports and the exports.
  • Identifying “imports” specifies how a function in one package finds a function in another.
  • Identifying “exports” specifies which functions are available outside of your package (to others).
    • Unless you export a function it is an Internal function, not visible or usable by another package.
  • Generally, you want to export a minimal set of functions; the fewer you export, the smaller the chance of a conflict with another package.

5.9.9.3 Updating the NAMESPACE File

  • create_package() always creates a file called NAMESPACE because it is essential for a package to work with other packages.

  • The NAMESPACE file is Not the namespace of the package per se, but it records the elements of the namespace for the package, which enables the creation of a functioning namespace.

  • Each line contains a “directive”: e.g., export(), importFrom() or others.

  • Each directive describes an R object and indicates if: 1. it’s exported from this package to be used by others, or, 2. it’s imported from another package to be used locally.

  • Execute: Look at your NAMESPACE file now.

  • There is no need to edit the NAMESPACE file by hand (and it’s not a good idea).

  • We will see later how functions from {usethis} and {roxygen2} automatically update the NAMESPACE file based on how we identify packages and document our code.
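
For orientation, the sketch below holds the contents of a hypothetical NAMESPACE file as text, as it might look once a function is exported and the pipe and tibble() are imported. Because each directive has the syntax of an R call, parse() can read it; the file contents shown are an assumed example, not output copied from a real package.

```r
# Hypothetical NAMESPACE contents; real files are written by roxygen2.
namespace_lines <- c(
  "# Generated by roxygen2: do not edit by hand",
  "",
  'export("%>%")',
  "export(fbind)",
  'importFrom(magrittr,"%>%")',
  "importFrom(tibble,tibble)"
)

# Parse the directives and pull out the name of each one.
directives <- parse(text = namespace_lines)
directive_names <- vapply(
  directives,
  function(d) as.character(d[[1]]),
  character(1)
)
table(directive_names)
```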

5.9.10 Updating the DESCRIPTION File to Identify Needed Packages

  • The DESCRIPTION file has a field called Imports.
    • create_package() does not create the “Imports” field in the bare-bones DESCRIPTION file; it assumes no other packages are needed.
    • You can add one manually or use the steps below to add it as you identify packages to be added
  • THIS CAN GET CONFUSING: The DESCRIPTION file “Imports” field does not actually manage which packages get “imported” into the namespace.
    • It only identifies all the packages your package needs to use during execution.

5.9.10.1 Add a Package to the Imports Field with use_package()

  • To add a package to the Imports field in the DESCRIPTION file, use use_package(), e.g. use_package("dplyr").
    • If the Imports field is not present, use_package() will add it.
  • Adding to Imports, declares your general intent to use some functions from the exported functions in another package’s namespace.
  • A few common functions have their own special use_*() functions, e.g., use_pipe() (magrittr pipe) and use_tibble().
  • Execute: Use the console to add {ggplot2}, the magrittr pipe, and tibble to your DESCRIPTION file.
### in the package session console
use_package("ggplot2")
use_pipe()
use_tibble()
  • use_tibble() will add a file called R/mypackagename-package.R under the R directory.
  • use_pipe() will add a file called utils-pipe.R under the R directory.
  • Both of these files contain the roxygen2 tags needed to enable the import/export of the functions.
  • Look inside your R folder now to see new files
  • Look at your DESCRIPTION file to see the new imports field
  • Look at the NAMESPACE file to see the updates from the functions.
    • They added directives for both import and export of %>% and for import of tibble().
    • The import directives in the NAMESPACE make these functions available to the code inside your package, so you can use them there without using ::.
  • You can manually add packages to the DESCRIPTION Imports field as well.
    • Put one package on each line, and,
    • Keep them in alphabetical order.
  • You can also set criteria if you need a specific (not-earlier-than) version of a package
    • Use use_package("thatpackage", min_version = TRUE) to go with the current version you are using.
    • If adding by hand, put it in parentheses after the package name:
      • Imports: dplyr (>= 0.3.0.1)
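
As an illustration of the layout described above, this sketch writes a hypothetical DESCRIPTION fragment to a temporary file and reads the Imports field back with base R's read.dcf(). The package name and version numbers are placeholders.

```r
# Hypothetical DESCRIPTION fragment: one package per line, alphabetical order.
desc_text <- c(
  "Package: mypackagename",
  "Version: 0.0.0.9000",
  "Imports:",
  "    dplyr (>= 0.3.0.1),",
  "    ggplot2,",
  "    magrittr,",
  "    tibble"
)
desc_file <- tempfile()
writeLines(desc_text, desc_file)

# Read the Imports field and split it into individual entries.
imports_field <- read.dcf(desc_file, fields = "Imports")[1, 1]
entries <- trimws(strsplit(imports_field, ",")[[1]])

# Drop any version requirement in parentheses to get bare package names.
package_names <- sub("\\s*\\(.*\\)$", "", entries)
package_names
```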

5.9.10.2 Implications of Adding a Package to the Imports Field

  • Unfortunately, the choice of the word “Imports” for the Imports field is imprecise.
    • The Imports field has nothing to do with functions being imported into the namespace.
    • Adding a package to Imports does Not mean its functions will be “imported” into your package’s namespace (usable without ::)!
    • That is the responsibility of the NAMESPACE file.
  • Adding a package to the DESCRIPTION Imports field only means it must be installed (not attached) for your package to work.
    • This is called creating a dependency. This means your package depends to some degree on the other package.
  • When someone installs your package, the install function will check if all the packages in your DESCRIPTION Imports are installed or accessible on their computer. If not already present, it will ask to install them on their computer.
    • Running devtools::load_all() also checks to see if the packages in Imports are installed.

Many users want to limit the number of packages they install on their computers and don’t like packages with lots of dependencies. So don’t add more packages than you really need to the Imports field!

  • It’s common for packages to be listed in Imports in DESCRIPTION, but not have a directive in NAMESPACE.

  • Since you may not be using a lot of functions from some packages you do not need to import/attach the whole package.

  • A best practice is to explicitly refer to external functions using the syntax package::function().

    • This makes it easier to identify the functions from other (external) packages.
    • This is especially useful when you (or others) read/maintain your code in the future.
    • Example: use forcats::fct_count() inside your function. This way you do not have to worry if the package is attached.

5.9.10.3 There are Actually More Levels of Dependencies Available for the DESCRIPTION File

  • A goal in building a package is to minimize the number of packages and functions actually imported into the namespace.
    • The more you require be imported to the namespace, the more onerous your package may become for others.
    • And, you can always use the :: to refer to specific packages that are installed based on the Imports field.
  • So, the use_package() argument type = has a default of "Imports" as the “best practice.”
  • The DESCRIPTION file has other fields for other levels of dependency, e.g., Depends and Suggests.
5.9.10.3.1 Use the “Depends” Field to Import an Entire Package
  • As we have seen, the Imports field just loads the package; it does not attach it.
  • Putting a package in the Depends field loads and attaches it (changing the search path).
  • This should be rare as you rarely need ALL the functions in a package!
    • A good package is self-contained and minimizes changes to the global environment (e.g., the search path).
    • The primary exception is if your package is designed to be used in conjunction with another package.
    • For example, packages that build on {ggplot2} should put {ggplot2} in Depends, not Imports.
  • You can use the use_package("thatpackage", type = "Depends") to add a package to the Depends field.
5.9.10.3.2 Use the “Suggests” Field to Identify Packages that May Benefit the User
  • You may have only one function or vignette or test case in your package that needs a specific package.
  • If that function is expected to be used rarely, you can add it to the Suggests field.
  • Packages listed in Suggests are not automatically installed along with your package.
    • This means you need to add a check in your code to see if the package is available before using it (use requireNamespace(x, quietly = TRUE)).
  • You can use the use_package("thatpackage", type = "Suggests") to add a package to the Suggests field.
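
A minimal sketch of the availability check for a suggested package is shown below. The function name pretty_summary() and the package name "somesuggestedpkg" are hypothetical; the pattern is simply to test with requireNamespace() and fall back gracefully when the package is missing.

```r
# Hypothetical function that would use a suggested package when available.
pretty_summary <- function(x, suggested_pkg = "somesuggestedpkg") {
  if (!requireNamespace(suggested_pkg, quietly = TRUE)) {
    message("Package '", suggested_pkg,
            "' is not installed; using a plain summary instead.")
    return(summary(x))
  }
  # A richer summary from the suggested package would be produced here,
  # called with the package::function() syntax. This sketch falls through
  # to the plain summary in both cases.
  summary(x)
}

plain <- pretty_summary(1:10)
plain
```

An alternative design is to stop() with an informative message when the function cannot do anything useful without the suggested package.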

5.9.10.4 Importing Just a Few Functions from a Package

  • This is a common practice for often-used functions that do not have a special use_* function
  • Often there are specific functions you want to use all the time but you do not want to import the whole package.
    • Example: {ggplot2} has a lot of functions but you may only need a few for your package.
  • The solution is to add the package to the Imports field and then use roxygen2 tags in the upfront comments section of your .R files
  • Run use_package("packagename") to add the package to DESCRIPTION Imports so you know the package will be loaded.
  • Add the roxygen2 tag @importFrom package function1 function2 in your .R file documentation with the function names as arguments, e.g.,
    • @importFrom ggplot2 ggplot aes geom_bar coord_flip.
  • When you update the package documentation with document(), this will update the NAMESPACE file automatically!
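
Here is a sketch of how the @importFrom tag sits in a function's roxygen2 comments. It uses stats::median() so the example runs with base R alone; the function center_on_median() is a hypothetical illustration. In a package, running document() on a file like this is expected to add importFrom(stats,median) and export(center_on_median) to NAMESPACE.

```r
#' Center a numeric vector on its median
#'
#' Subtract the median of the vector from each of its values.
#'
#' @param x A numeric vector.
#' @return a numeric vector the same length as x.
#' @importFrom stats median
#' @export
center_on_median <- function(x) {
  x - median(x, na.rm = TRUE)
}

centered <- center_on_median(c(1, 2, 6))
centered
```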

5.9.10.5 Special Cases

5.9.10.5.1 Unexported Functions
  • You may want to use some functions but get a warning from check() that the function is not exported directly from its package.
  • One workaround is to see if the function is exported as part of a larger element.
  • An example is the where() function in {tidyselect}.
  • The {tidyselect} package developer did not want to export where() directly since it was such a common word.
  • Instead they exported a higher-level list of functions called vars_select_helpers where where() is an element in the list.
  • So to access where(), without importing the whole {tidyselect} package, first use use_package("tidyselect") and then inside your function use tidyselect::vars_select_helpers$where().
5.9.10.5.2 no visible binding for global variable ‘xxx’
  • When running check, you may get a note similar to no visible binding for global variable ‘xxx’ where 'xxx' is one of your variables.

  • This is “a peculiarity” of using {dplyr} or {ggplot2} with unquoted variable names inside a package, where “the use of bare variable names looks suspicious.” Chap 6.5.

  • This is often due to how tidy evaluation works in tidyverse functions where we can use “bare variable” names due to data masking.

  • Data masking makes interactive analysis faster but means we have to be more careful when passing variables to functions, especially in packages using tidyverse functions. See Programming with dplyr for details.

  • Three solution approaches:

    • Option 1. If the variables are truly global across the package.
      • Import {utils} with use_package("utils") and then add a line of code with utils::globalVariables(c("my_globalvar1", "my_globalvar2")) where you include each variable.
    • Option 2. If the variables are truly global across the package, a quick and dirty approach:
      • Enter a line of code with my_globalvar1 <- my_globalvar2 <- NULL.
    • Option 3: If the variables are being passed from one function to another it may be unclear if they are global or data frame variables.
      • Use the {rlang} .data and .env pronouns.
      • These provide a way to distinguish the source of a variable within tidy evaluation between the data frame (.data$my_var) or the environment (.env$my_var).
      • Import the {rlang} package with use_package("rlang"), add the #' @importFrom rlang .data and use, as an example with aes(), aes(x = .data$my_var) in your function.
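
Option 3 might look like the sketch below. The function plot_counts() and the column names category and n are hypothetical, and the code assumes {ggplot2} is installed; the usage lines are guarded so nothing fails when it is not.

```r
#' Plot counts by category
#'
#' @param df A data frame with columns category and n (assumed names).
#' @return a ggplot object.
#' @importFrom rlang .data
plot_counts <- function(df) {
  # .data$ marks these names as data frame columns, not global variables,
  # which avoids the "no visible binding" note from check().
  ggplot2::ggplot(df, ggplot2::aes(x = .data$category, y = .data$n)) +
    ggplot2::geom_col()
}

# Usage, run only if ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- plot_counts(data.frame(category = c("a", "b"), n = c(3, 5)))
}
```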

5.9.11 Document Each Function with {roxygen2}

  • You saw the warning in check() about an undocumented function.
  • We should create a way to get help on fbind().
  • Every exported function should have an R documentation file, e.g., man/fbind.Rd, written in an R-specific markup language (.Rd) that is sort of like LaTeX.
  • Luckily, we do not have to write that directly.

5.9.11.1 The {roxygen2} Package Documents Functions and Updates the NAMESPACE file

  • We can write specially formatted comments right above fbind(), in its source file, and then let a package called {roxygen2} handle the creation of man/fbind.Rd.

  • Execute: In RStudio, open R/fbind.R in the source editor and put the cursor somewhere in the fbind() function definition.

    • Do Code > Insert Roxygen Skeleton.
  • A set of special comments should appear above your function, in which each line begins with #'.

  • RStudio only inserts a bare bones template, so you will need to edit it to be useful (and pass check()).

5.9.11.2 Roxygen Parameters for Functions

  • You should edit each of the tags to convey the required information you want users to see in help.
  • The first sentence is the title: that’s what you see when you look at help(package = mypackage), and it is shown at the top of each help file.
    • It should fit on one line, be written in sentence case, and not end in a full stop.
    • It should be a very brief description of the function to make it recognizable, not the actual name of the function.
  • The second paragraph is the description: this comes first in the documentation and should briefly describe what the function does.
    • Save the long explanations for the Details section
    • The description should start with a capital letter and end with a full stop (e.g., period or question mark).
    • It can span multiple lines (or even paragraphs) if really necessary.
  • When not creating a package, manually create the next two lines before the @param tags
    • First, enter “#' Usage” on a line by itself
    • Second, on the line below enter the syntax as if you were using the function
      • e.g., #' my_fun(x, y, na.rm = FALSE)
    • If creating a package, the helper functions will create this entry for you in the help document.
  • The skeleton should then list all of the function’s arguments using the tag @param name
    • For each argument, enter a short sentence description. Start with an uppercase letter and include the argument class and shape (data frame, vector, value, etc.) and any constraints on it ( > 0, length of 3, etc.).
    • You can document multiple arguments in one place by separating the names with commas (no spaces).
      • For example, to document both x and y, you can write @param x,y Numeric vectors ...
  • The Details paragraph(s) comes after the arguments and is optional. This could be several paragraphs to provide specifics on how the function operates.
    • You can use .Rd formatting tags such as \code{sum()} or \url{http://rstudio.com} or bullet or named lists. See Chapter 10.10 in R packages for details.
    • Named lists are useful when documenting data sets you may be including in your package.
  • @return describes the output from the function. Since not every function has a return value, it is optional, but it is a good idea when there is one.
    • Enter the output type (e.g., string, or numeric vector) and short description starting with a lower-case letter.
  • @export identifies the function as one you want others to have direct access to from your package.
    • Delete it if you do not want to export the function, e.g., it’s a helper function for internal use only.
    • Delete it from the skeleton if you are not creating a package.
  • @examples are optional but important for exported functions. They are typically a few lines of executable R code showing how to use the function in practice.
    • This is an important part of the documentation because many people look at the examples first.
    • If you are NOT exporting the function, delete the @examples tag and do not enter any examples as check() will not be able to test the examples since they are private functions.
    • Example code must work without errors as it is run automatically as part of R CMD check or check().
    • For the purpose of illustration, it’s often useful to include code that causes an error.
      • \dontrun{} allows you to include code in the example that is not run.
    • Instead of including examples directly in the documentation, you can put them in separate files and use @example path/relative/to/package/root to insert them into the documentation. (Note that the @example tag here has no ‘s’.)
  • Documentation Chapter in R Packages has more details about documentation in general.
  • Rd (documentation) tags has more information as well.
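
Putting the tags together, one possible way to fill in the skeleton for fbind() is sketched below. The wording of the title, description, and parameter text is an illustrative choice rather than required content.

```r
#' Bind two factors
#'
#' Create a new factor from two existing factors, where the new factor's
#' levels are the union of the levels of the input factors.
#'
#' @param a A factor.
#' @param b A factor.
#'
#' @return a factor holding the values of a followed by the values of b.
#' @export
#' @examples
#' fbind(factor(c("dog", "cat")), factor(c("gerbil", "parakeet")))
fbind <- function(a, b) {
  factor(c(as.character(a), as.character(b)))
}

combined <- fbind(factor(c("dog", "cat")), factor(c("gerbil", "parakeet")))
```

Because roxygen2 comments are ordinary R comments, the file still runs as plain R code before document() is ever called.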

5.9.12 Documenting Data

  • If you want to include a dataset in your package you have to document it.
  • See Data Chapter in R Packages.
  • If you want to store example datasets and make them available to the user, put them in data/.
  • Each file in this directory should be an .rda file created by save() containing a single R object, with the same name as the file.
    • The easiest way to achieve this is to use usethis::use_data().
  • If the data is cleaned and “tidied” data from a raw source, then put the raw data under data-raw along with the R scripts used to tidy and clean it.
    • Be sure the folder data-raw is listed in .Rbuildignore.
    • usethis::use_data_raw("my_pkg_datasetname") will create data-raw and put it in .Rbuildignore
  • A typical script in data-raw/ includes code to prepare a dataset and ends with a call to use_data() to move it into the data folder.
  • Documenting data is like documenting a function only you document the name of the dataset and save it in R/.
  • Two roxygen tags are especially important for documenting datasets:
    • @format gives an overview of the dataset. For data frames, use \describe{} to include a list with each variable name, its description, and its units.
    • @source provides details of where you got the data, often a URL.
  • Never @export a data set as they are made available automatically.

Here is an example from R Packages (2e) Wickham and Bryan (2023) showing how to document a dataset in an R script.

#' World Health Organization TB data
#'
#' A subset of data from the World Health Organization Global Tuberculosis
#' Report ...
#'
#' @format #### `who`
#' A data frame with 7,240 rows and 60 columns:
#' \describe{
#'   \item{country}{Country name}
#'   \item{iso2, iso3}{2 & 3 letter ISO country codes}
#'   \item{year}{Year}
#'   ...
#' }
#' @source <https://www.who.int/teams/global-tuberculosis-programme/data>
"who"


5.9.13 Use document() to Create the Man(ual) Help Files

  • Execute: Use document() to convert the new Roxygen comments into man/fbind.Rd (or for each function).
  • You should see something like:

  • You should now be able to use ?function_name to preview your help file for each function.
  • Execute: Enter ?fbind in the console.

5.9.13.1 document() Also Updates the NAMESPACE file

  • In addition to converting fbind()’s special comments into man/fbind.Rd, the call to document() updates the NAMESPACE file, based on any @export directives found in the Roxygen comments.

  • Execute: Open NAMESPACE for inspection.

5.9.13.2 Spelling and wordlist()

  • If you added the spelling package and it identifies words you know are spelled correctly you can use the function update_wordlist() to check spelling and add words to the wordlist.

  • The update_wordlist() function will prompt you about adding words (run after running check()).

  • Words you agree to add will be added to the packagename/inst/WORDLIST file in the package directory structure.

  • Execute: run update_wordlist()

5.9.14 Confirm Everything is Working and Install

  • You have done a lot:
    • Picked a package name
    • Used create_package() to create a new package
    • Updated the metadata and other information
    • Created some functions
    • Used use_package() to add packages to the DESCRIPTION file
    • Used Roxygen comments and tags to document the functions, especially those to be exported e.g., fbind()
    • Used document() to produce a help file and update the NAMESPACE file
    • You have saved files, added to the repo, and committed multiple times!!
  • It’s a good time to see if everything is still working

5.9.14.1 Run Final Checks and Test

  • Execute: Rerun check() to see if the warnings went away.
  • Execute: Install and Restart
  • If there is an issue, fix and retry.
  • If everything passes, congratulations on your working package!

5.9.14.2 Installing Your Package on Your system

  • You have a working package so you are ready to install it onto your computer.
  • Execute: Use install() to install your locally developed package.
  • Now you can use library() to attach and use your package like any other package!
  • Execute: Restart your R session to clear out your environment and run the following in the console.
library(mytestpackage)
fbind(as.factor(c("dog", "cat")), as.factor(c("gerbil", "parakeet")))
  • If you get an error message that your function cannot be found, check the NAMESPACE file as the function may not have been documented with the proper tags to update the NAMESPACE file with document().
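
One way to check exports from the console, rather than by reading the NAMESPACE file, is sketched below. The helper is_exported() is hypothetical (not part of any package); it is demonstrated on {stats} so it runs anywhere.

```r
# TRUE if `fn` is listed among the exports of the installed package `pkg`.
# is_exported() is a hypothetical helper written for this example.
is_exported <- function(fn, pkg) {
  fn %in% getNamespaceExports(pkg)
}

# Demonstration with a base package.
is_exported("median", "stats")

# In your own session, after install():
# is_exported("fbind", "mytestpackage")
```

A result of FALSE for your function suggests the @export tag is missing or document() was not rerun before installing.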

5.9.15 Testing with use_testthat(), use_test() and test()

Once you have created your function and you think it is working, it is time to add some tests!

  • This is done using the use_test() function, and it works much the same way as use_r().

  • You tested fbind() informally, in a single example on the console. This is neither repeatable nor scalable.

  • Formalizing unit tests requires expressing a few concrete expectations about the correct fbind() result for various inputs.

  • First, declare your intent to write unit tests and to use the {testthat} package via use_testthat()

  • This initializes the unit testing machinery for your package.

    • Adds Suggests: testthat to DESCRIPTION,
    • Creates the directory tests/testthat/, and
    • Adds the script tests/testthat.R.
  • Execute: Run use_testthat().

  • However, it’s still up to YOU to write the actual tests!

  • Execute: Open the fbind.R file for editing (the .R file you want to build test cases for must be open).

    • Use use_test("function_name") to create/open a test file stored under the tests/testthat folder.
    • Add the example code into the test-fbind.R file to create two tests.
      • Look at the help for expect_identical().
    • Save and close the test-fbind.R file.
test_that("fbind() binds factor (or character)", {
  x <- c("a", "b")
  x_fact <- factor(x)
  y <- c("c", "d")
  z <- factor(c("a", "b", "c", "d"))

  expect_identical(fbind(x, y), z)
  expect_identical(fbind(x_fact, y), z)
})
  • This test case checks that fbind() gives the expected result when combining two character vectors, and when combining a factor with a character vector.
  • Attach the {testthat} package via library(testthat) in the console.
  • Run load_all()
  • Run this test interactively with test() (Hopefully you got a Woot!)
  • You can add more tests to your test-function_name file.
  • You can then run multiple tests at once via test().
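
As a sketch of what an additional test might look like, the block below checks the class, length, and levels of the result. The fbind() definition here is a stand-in (an assumption) so the example runs on its own; inside your package, the test file would contain only the test_that() call and would test your real function.

```r
library(testthat)

# Stand-in definition (assumption): your package's fbind() may differ.
fbind <- function(a, b) {
  factor(c(as.character(a), as.character(b)))
}

test_that("fbind() returns a factor containing all levels", {
  res <- fbind(factor(c("dog", "cat")), factor(c("gerbil", "parakeet")))
  expect_s3_class(res, "factor")
  expect_length(res, 4)
  expect_setequal(levels(res), c("cat", "dog", "gerbil", "parakeet"))
})
```

Several small expectations like these usually make it easier to see which behavior broke than one large comparison.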

5.9.16 Writing Vignettes

  • A vignette is documentation to tell others how you expect people to use the functions in your package.
    • It should help explain your package through examples.
  • A vignette should describe the problem your package is designed to solve and then show the reader how to solve it.
    • Examples: Use the RStudio Console with browseVignettes(package = "dplyr") or browseVignettes(package = "rmarkdown")
    • A vignette should divide functions into useful categories and then demonstrate how to use individual functions and then coordinate multiple functions to solve problems.
  • Many packages have a single vignette but can have more than one as well.
  • Using the package name as the file name for the initial vignette can streamline creating a {pkgdown} site.

5.9.16.1 Knitting Vignettes

  • You should knit your vignette to make sure it will knit before you run check().
  • devtools::check() will attempt to knit it and ensure any data used in the vignette is documented and any packages used in the vignette are declared in the DESCRIPTION file (under Imports or Suggests).
  • However, the HTML (or PDF) file will NOT be saved.
    • Package development practice has changed over time: the rendered file used to be saved, or it was saved under packagename/inst/doc. That is no longer the case.
    • The approach now is to use the VignetteEngine specified in the vignette YAML header to recreate the vignette when the package is installed.
  • When the package is installed with vignettes, the VignetteEngine will create a doc folder and then add three files for each vignette under the packagename/doc folder:
    • The original source file,
    • A readable HTML page or PDF, and
    • A file of R code
  • We will use the R Markdown vignette engine provided by knitr.

5.9.16.2 Create your First Vignette with usethis::use_vignette("packagename")

  • Execute: Run usethis::use_vignette("packagename") where packagename is the name of your package.
  • It will:
    • Create a vignettes/ directory.
    • Add the necessary dependencies to DESCRIPTION (i.e., it adds knitr to the Suggests and VignetteBuilder fields).
    • Create a draft vignette, vignettes/packagename.Rmd.
  • You can now edit and knit the vignette's R Markdown file as usual.

5.9.16.3 The Vignette has Special Fields in the YAML Header

  • The output type “rmarkdown::html_vignette” has been specifically designed to work well inside packages:
    ---
    title: "Vignette Title"
    output: rmarkdown::html_vignette
    vignette: >
      %\VignetteIndexEntry{Vignette Title}
      %\VignetteEngine{knitr::rmarkdown}
      %\VignetteEncoding{UTF-8}
    ---
  • You can also add other YAML fields like author or date if desired.
  • The vignette: > entry contains a special block of metadata needed by R.
  • Modify the \VignetteIndexEntry to provide the title of your vignette as you’d like it to appear in the vignette index.
  • Leave the other two lines as is. They tell R to use knitr to process the file, and the file is encoded in UTF-8 (the only encoding you should ever use to write vignettes).

5.9.16.4 Installing the Vignettes for Your Package.

  • Packages installed by install.packages() from CRAN will have their vignettes recreated automatically and placed in the doc folder.
  • Packages installed from GitHub using devtools::install_github() or remotes::install_github() will not.
  • For packages from GitHub, you must set the argument build_vignettes = TRUE to have the vignettes recreated. If they are recreated, they will be in the doc folder.

Package without Vignettes - no doc folder
  • To install the package with vignettes from GitHub, use remotes::install_github("mygithubid/mypackagename", build_vignettes = TRUE) and it will look something like the following on the user's computer.
  • Include the above syntax in the README file instructions.

Package with Vignettes in doc folder
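
To confirm from the console whether an installed package came with vignettes, a small helper like the one below can be used. list_vignettes() is a hypothetical name; it wraps utils::vignette(), and {utils} is used as a stand-in package so the sketch runs without installing anything.

```r
# Return the names of the vignettes installed with package `pkg`.
# list_vignettes() is a hypothetical helper written for this example.
list_vignettes <- function(pkg) {
  info <- utils::vignette(package = pkg)
  unname(info$results[, "Item"])
}

# Stand-in package; replace with your own package name after installing it.
vigs <- list_vignettes("utils")
vigs
```

An empty result for your package after installing from GitHub usually means build_vignettes = TRUE was not set.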

5.9.16.5 Common Problems

  • The vignette builds interactively, but when running check(), it fails with an error about a missing package you know is installed.
    • This usually means you have forgotten to declare that dependency in the DESCRIPTION (usually it should go in Suggests).
  • Everything works interactively, but the vignette does not show up after you have installed the package.
    • Because RStudio’s “build and reload” does not build vignettes, you may need to run devtools::install() instead.

5.9.17 Final Package Directory Structure

  • Your directory should look a little different from the skeleton package you started with.
  • Something like

Final Directory Structure

  • When your package is installed by a user, it will go under the default directory for all package installs.
  • On a Mac it could be under: /Library/Frameworks/R.framework/Versions/Current/Resources/library
  • Note the structure for an installed package (for the {broom} package below) is NOT the same as the original GitHub repo package structure.
  • The installation takes the information from your GitHub Repo to create the files necessary for the package to run.

Installed Package Folder Structure
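
You can inspect the install location and the installed folder structure from R itself. The sketch below uses {stats} as a stand-in because it is always installed; substitute "broom" or your own package name if it is installed on your system.

```r
# Directories where R looks for (and installs) packages on this machine.
.libPaths()

# Top-level folder of an installed package ({stats} is a stand-in here).
pkg_dir <- system.file(package = "stats")
pkg_dir

# Files and folders of the installed package; compare with the source repo.
installed_files <- list.files(pkg_dir)
installed_files
```

The installed folder contains files such as DESCRIPTION and NAMESPACE, but the R code is stored in a processed form rather than as the original .R scripts.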

5.9.18 Exercise: Adding a Second Function

  • Let’s add a second function to the package:
  • We want a frequency table for a factor, as a regular data frame with nice variable names where the rows are sorted so the most prevalent level is at the top.
  • Use use_r("fcount") to open a new file
  • Enter the following code in the file.
#' Make a sorted frequency table for a factor
#'
#' @param x factor
#'
#' @return A tibble
#' @export
#' @examples
#' fcount(iris$Species)
fcount <- function(x) {
  forcats::fct_count(x, sort = TRUE)
}
  • Use load_all() to simulate the installation of the package.
  • Manually Test on the command line with fcount(iris$Species).
  • Generate the associated help file via document().
  • Install using install().
  • Use library() to attach the package in a new R session and test it.
library(testpackage)
fcount(iris$Species)
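
As a follow-on to the exercise, a unit test for fcount() might look like the sketch below. The function is copied from the exercise so the example runs on its own, and it assumes {forcats} and {testthat} are installed. In the package, the test_that() call would go in the file opened by use_test("fcount").

```r
library(testthat)

# Copy of fcount() from the exercise so this sketch is self-contained.
fcount <- function(x) {
  forcats::fct_count(x, sort = TRUE)
}

test_that("fcount() sorts levels by frequency", {
  res <- fcount(factor(c("a", "b", "b")))
  expect_named(res, c("f", "n"))               # columns returned by fct_count()
  expect_equal(as.character(res$f[1]), "b")    # most prevalent level first
  expect_equal(res$n, c(2L, 1L))
})
```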

5.9.19 GitHub

5.9.19.1 Creating a Package Repo under your Personal Account (Organization)

  • DO NOT DO THIS FOR ANY HOMEWORK ASSIGNMENT
  • use_github(organisation = "my_organization", private = TRUE) will automatically create a private GitHub repo for your package at the GitHub organization you designate - assuming you have write privileges.
  • The default of organisation = NULL will use your personal account.
  • The default for private = is FALSE so it will create it as a public repository.
  • It will look for your token using gh::gh_token(). You should already have a Personal Access Token (PAT) from GitHub - see Creating a Personal Access Token
    • If not, follow the instructions and then secure your PAT using the {keyring} package.
      • Enter library(keyring) in the console
      • Use key_set("GitHub_PAT") and enter the PAT when asked for the password.
    • Once your PAT is in your keyring,
      • If you get asked for a GitHub password when attempting to push, enter your PAT.
      • If you are asked for a login password, enter your computer login password.
  • Alternatively, you can create the repo as normal using the terminal window:
    • Create the remote repo on GitHub and leave it empty, e.g., no README file.
    • Copy the URL.
    • Go to your Terminal window and ensure you are in the folder for your package.
    • Use the normal process of git remote add origin URL, where you paste in the URL.
    • For your first push, use git push -u origin main.
  • If you use use_github(), answer a few questions and your repo will be created along with an initial push.
  • You may get a warning to re-knit your README.md to get it to commit without turning off validation.
  • You can then use normal commands in the terminal to manage the remote, etc.
  • There are many other GitHub-related functions in the {usethis} package to streamline maintaining a package on GitHub
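
Before running use_github(), it can help to check whether a token is visible to R at all. The sketch below is a rough check only: pat_in_env() is a hypothetical helper that looks at the GITHUB_PAT and GITHUB_TOKEN environment variables, whereas gh::gh_token() may also find a token in the Git credential store, so FALSE here does not necessarily mean you have no token.

```r
# TRUE if a GitHub token is set in one of the common environment variables.
# pat_in_env() is a hypothetical helper; it does not check the credential store.
pat_in_env <- function() {
  vars <- c("GITHUB_PAT", "GITHUB_TOKEN")
  any(nzchar(Sys.getenv(vars)))
}

token_found <- pat_in_env()
token_found
```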

5.9.19.2 Creating a Package Repo under a different Organization

  • Use use_github(organisation = "XXX", private = TRUE) where XXX is the name of the other organization.
  • If you have already created the repo on your personal account, it will tell you to remove it using the terminal.
    • Go to the terminal window, make sure you are in top level of your package repo folder, and enter the following: git remote rm origin
      • That will remove the linkage between your local repo and GitHub.
    • Then you can run
      use_github(organisation = "XXX", private = TRUE)
    • It may also give you an error message that the URL in the DESCRIPTION file is incorrect: Error: URL has a different value in DESCRIPTION.
    • If so, use the console to run use_github_links(overwrite = TRUE) and it will update two URLs in your DESCRIPTION file to be the current remote location.
  • You can go to your terminal window to check your status with git status and git remote -v.

5.9.20 Coding Style with the {lintr} and {styler} packages and use_tidy_style()

  • Many organizations enforce a style guide for writing as well as for any code.

  • Users accustomed to the Tidyverse coding style like packages to be coded in accordance with the standards in the Tidyverse Style Guide Wickham (2021).

  • The style guide is intended to make all code easier to read, debug, and maintain by the original developer and especially by people other than the original developer.

  • There are two common packages for conducting static code analysis of your .R and .qmd/.Rmd files: {lintr} and {styler}.

    • {lintr} checks for compliance and identifies potential issues in the Markers tab in RStudio (near the Console tab).
    • Lint capabilities exist for many languages and {lintr} brings them to R.
    • {styler} lets you choose between having it change your code automatically and editing your code interactively to fix issues.
  • Both are available as RStudio Addins making it easy to keep your code properly formatted.

  • The {lintr} package Hester et al. (2024) has a default configuration for which items of “lint” to check.

    • You can also customize it to add or exclude items of lint as well.
  • The {styler} package Muller, Walthert, and Patil (2024) can assist with checking and updating code in a .R file to correspond to the style guide.

    • Install it from the console with install.packages("styler").
library(styler)
  • See the {styler} Getting Started vignette.
  • Warning: Functions that style files in place, such as styler::style_file() and usethis::use_tidy_style(), will overwrite files!
  • It is strongly suggested to only style files that are already under version control and recently pushed or to first create a backup copy.
  • There are several helper functions in the {usethis} package as well.
  • Look for help on use_tidy or go to usethis Helper functions Wickham, Bryan, and Barrett (2021).
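
To preview what {styler} would change without touching any file, you can style a string with styler::style_text(). This is a minimal sketch assuming {styler} is installed and its default tidyverse style is in use; the messy input line is made up for illustration.

```r
# A deliberately unstyled line of code (made up for this example).
messy <- "x=c(1,2)"

# style_text() returns the restyled code and does not write to disk.
styled <- styler::style_text(messy)
styled
```

Because style_text() only returns the result, it is a low-risk way to see the style rules in action before running a function that overwrites files.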

5.9.21 Summary

Completing the steps above should have allowed you to create a working package for sharing your code with yourself and with others.