4 Lists and Iteration using R and {purrr}
vectors, lists, for-loops, purrr
4.1 Introduction
4.1.1 Learning Outcomes
- Manipulate Vectors and Lists using Base R syntax.
- Apply techniques of iteration using
  - For-Loops in Base R
  - map*() functions in the tidyverse {purrr} package
4.1.2 References:
- R for Data Science 2nd Edition Wickham, Cetinkaya-Rundel, and Grolemund (2023)
- Chapters 3 and 4 in Advanced R Wickham (2023)
- Purrr Overview Wickham and Henry (2023)
4.1.2.1 Other References
4.2 Review of Vectors
- We’ll use just a few tidyverse functions.
4.2.1 R has Two Kinds of Vectors: Atomic Vectors and Lists
Atomic vectors are sequences of elements of the same data type and class.
Lists are data structures where the elements do Not have to be the same class.
- Sometimes called recursive vectors because lists can contain other lists!
Vectors have two main attributes: Length and data Type.
4.2.2 R has Six Data Types
The six data types are:
1. logical,
2. integer,
3. double,
4. character,
5. complex, and
6. raw (byte-level data).
The factor and date classes are special encodings of data types integer and double.
- Integer and double vectors are collectively known as numeric vectors.
A missing (absent) vector is represented by NULL, whereas a missing or empty value within a vector is represented by NA.
Wickham, Cetinkaya-Rundel, and Grolemund (2023)
4.2.3 Creating Vectors
Use c() to create a vector from the argument elements.
- Use length() to see the length of a vector.
- Use typeof() to see the type of vector.
- Use is_*() to check the type of vector (from package purrr).
  - e.g., is_double(), is_logical(), is_character(), ….
- Base R has similar functions is.*().
  - e.g., is.numeric(), is.logical(), is.character().
4.2.3.1 Examples
Double:
[1] 3
[1] "double"
[1] TRUE
Integer: Use L to tell R to treat (store) a number as an integer:
Character:
[1] 3
[1] "character"
[1] TRUE
Logical:
Factor: Factors are actually integers with extra attributes.
[1] A B B
Levels: A B
[1] "integer"
[1] TRUE
[1] "factor"
[1] "A" "B"
[1] 1 2 2
Dates: Dates are actually doubles with extra attributes.
- Dates are the number of days since January 1, 1970, with negative values for earlier dates.
- The POSIXct class stores date/time values as the number of seconds since January 1, 1970.
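The type checks described in this section can be sketched as follows. The vector names (x_dbl, x_fct, and so on) and their values are hypothetical, chosen only for illustration; they are not the objects used to produce the outputs above.

```r
# Hypothetical example vectors; names and values are illustrative only.
x_dbl <- c(1, 2.5, 10)
x_int <- c(1L, 2L, 10L)      # the L suffix stores the numbers as integers
x_chr <- c("a", "b", "c")
x_lgl <- c(TRUE, FALSE, NA)

typeof(x_dbl)        # "double"
typeof(x_int)        # "integer"
is.character(x_chr)  # TRUE
length(x_lgl)        # 3

# A factor is stored as integers plus "levels" and "class" attributes.
x_fct <- factor(c("A", "B", "B"))
typeof(x_fct)        # "integer"
class(x_fct)         # "factor"
levels(x_fct)        # "A" "B"
as.integer(x_fct)    # 1 2 2

# A Date is stored as a double counting days since 1970-01-01.
x_date <- as.Date("1970-01-10")
typeof(x_date)       # "double"
as.numeric(x_date)   # 9
```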
4.2.4 Working with Elements in a Vector
Each element of a vector can have a name.
See and set/change element names with the names() function.
Subset with brackets [...].
- Brackets are an extract or replace operator (see help for Extract, i.e., ?Extract).
- The index object ... can be numeric, logical, character or empty.
- Putting a logical vector inside the brackets returns (extracts) the elements from the outer vector where the corresponding value in the inner logical vector is TRUE.
[1] "like" "dogs"
[1] "I" "dogs"
Replacement
Extract with negative values to drop elements.
Extract a named vector with the name(s) of the desired element(s) as a character vector.
Two brackets [[...]] extract only a single element and drop the name.
- The [[...]] operator performs the [...] operation twice.
  - It reduces an atomic vector to a named element and then extracts the element out of the named element.
- Useful in working with lists (later on).
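As a minimal sketch of the subsetting strategies in this section, consider the short named character vector below. The element names a, b, and c are assumptions made for this illustration.

```r
# Hypothetical named vector; the names a, b, c are illustrative only.
x <- c(a = "I", b = "like", c = "dogs")
names(x)                               # "a" "b" "c"

by_position <- x[c(2, 3)]              # positive integers keep elements
by_logical  <- x[c(TRUE, FALSE, TRUE)] # keep elements where TRUE
by_negative <- x[-2]                   # negative integers drop elements
by_name     <- x[c("a", "c")]          # character vector of names
single      <- x[["b"]]                # one element, name dropped

# Replacement: brackets on the left-hand side of an assignment.
y <- x
y[2] <- "love"
```

Here by_logical, by_negative, and by_name all hold the same two elements ("I" and "dogs"), which shows that the different index types are interchangeable ways of describing the same subset.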
4.2.4.1 Exercise
- Create the following vector:
Extract Yoshi and Peach from the above vector using:
- Integer subsetting.
- Negative integer subsetting.
- Logical subsetting.
- Name subsetting.
- In the vector above, replace Yoshi’s number with 19L.
4.2.5 Recycling
You are used to doing vectorized operations.
When operating on a vector with a scalar or two vectors of different lengths, R does what is called “recycling”.
- Internally, R is reusing or “recycling” elements of the scalar or shorter vector to complete the operations. So the above example is treated the same as:
You can choose to recycle non-scalars, i.e., vectors of different lengths, but it’s almost never a good idea as R may not behave how you think it will.
- It makes your code less robust to changes in the data structure over time, e.g., if someone added additional observations to the data:
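A short sketch of recycling, using hypothetical values, shows both the harmless scalar case and the riskier case of two vectors with different lengths.

```r
# Hypothetical values, for illustration only.
x <- c(2, 4, 6, 8)

# A scalar is recycled to the length of x ...
scaled <- x * 10
# ... which gives the same result as spelling out the repetition.
scaled_explicit <- x * rep(10, length(x))

# A length-2 vector is recycled silently because 2 divides 4 evenly.
silent <- x + c(1, 100)                 # 3 104 7 108

# A length-3 vector is recycled with a warning because 3 does not divide 4.
# suppressWarnings() is used here only to keep the sketch quiet.
partial <- suppressWarnings(x + c(1, 2, 3))   # 3 6 9 9
```

The silent case is the dangerous one: if the data grows or shrinks, the code keeps running but the pairing of elements changes.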
4.3 Lists
4.3.1 Creating Lists
Lists are vectors whose elements can be of different types.
- A tibble or data frame is a special kind of list (organized by columns with elements of the same class in them)
Use list() to make a list.
- Each element in a list can have many elements (including other lists) in it.
- The length of the list is just how many elements are present at the top level of the list.
$x
[1] "a"
$y
[1] 1
$z
[1] TRUE FALSE TRUE
[[4]]
[[4]][[1]]
[1] "a"
[[4]][[2]]
[1] 1
[1] 4
The above list, of length 4, has three named elements: a character, a numeric, and a logical vector, and then it has an un-named list as the fourth element.
- The internal unnamed list has two elements, ("a", 1), which are also unnamed.
Use str() (for structure) to see the internal properties of a list.
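A list with the structure described above can be built as in the sketch below. The object name my_list is an assumption, since the original name is not shown in the text.

```r
# my_list is a hypothetical name; the contents mirror the printed list above.
my_list <- list(
  x = "a",
  y = 1,
  z = c(TRUE, FALSE, TRUE),
  list("a", 1)              # an unnamed list as the fourth element
)

length(my_list)   # 4: only top-level elements are counted
names(my_list)    # "x" "y" "z" "" (the fourth element has no name)
str(my_list)      # compact view of the nested structure
```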
4.3.2 Working with Lists
Single brackets [...] extract a sublist. You use the same extracting strategies as for vectors.
$x
[1] "a"
$y
[1] 1
List of 2
$ x: chr "a"
$ y: num 1
$y
[1] 1
Double brackets [[...]] extract a single list element (which could also be a list).
- Each set of brackets subsets one layer.
[1] "a"
[1] TRUE FALSE TRUE
logi [1:3] TRUE FALSE TRUE
Single [] or double [[]] when subsetting elements from a list.
Consider the list below.
- Subsetting elements with a single set of brackets always returns a list with one element.
- To remove the list layer, i.e., subset an element as its own structure, you must use two sets of brackets to remove the top layer of the list.
[[1]]
[1] "a"
[1] "a"
List of 1
$ :'data.frame': 2 obs. of 2 variables:
..$ mpg: num [1:2] 21 21
..$ cyl: num [1:2] 6 6
'data.frame': 2 obs. of 2 variables:
$ mpg: num 21 21
$ cyl: num 6 6
List of 1
$ :List of 2
..$ : chr "c"
..$ : chr "d"
List of 2
$ : chr "c"
$ : chr "d"
You can subset multiple elements of different classes with single [] but not with double [[]].
Use dollar signs $ to extract named list elements (like in data frames).
Remove elements of a list by replacing them with NULL.
List of 4
$ x: chr "a"
$ y: num 1
$ z: logi [1:3] TRUE FALSE TRUE
$ :List of 2
..$ : chr "a"
..$ : num 1
List of 3
$ y: num 1
$ z: logi [1:3] TRUE FALSE TRUE
$ :List of 2
..$ : chr "a"
..$ : num 1
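The list operations in this section can be sketched together as follows. The name my_list is hypothetical and the list is rebuilt here so the example is self-contained.

```r
# my_list is a hypothetical name; contents mirror the list printed above.
my_list <- list(x = "a", y = 1, z = c(TRUE, FALSE, TRUE), list("a", 1))

sub_list <- my_list[c("x", "y")]   # single brackets: always a list
z_vec    <- my_list[["z"]]         # double brackets: the element itself
y_val    <- my_list$y              # $ works like [["y"]] for named elements
nested   <- my_list[[4]][[2]]      # each [[ ]] goes down one layer

# Assigning NULL removes an element and shortens the list.
my_list$x <- NULL
length(my_list)                    # 3
```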
4.3.2.1 Exercise
- Create the following list:
- Wario can’t actually make it.
- Remove his row from the data frame.
- Add a new named vector to the list.
  - Call it meal with the elements V is "Vegetarian", C is "Chicken", and B is "Beef".
- Extract the venue and the date from wedding.
  - Use three different techniques to do this.
- "chick-fil-a" should be capitalized.
  - Capitalize the first "c" and last "a".
4.4 For-Loops in Base R
4.4.1 Motivation
Iteration is the repetition of some amount of code.
If we didn’t know the sum() function, how would we add up the elements of a vector?
- We could manually add the elements.
But this is prone to error (especially if we try to copy and paste multiple lines). Also, what if x has 10,000 elements?
For-loops to the rescue! Here is an example:
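A minimal sketch of such a loop follows. The values of x are an assumption, chosen to be consistent with the results printed later in this section (a total of 16 over five elements).

```r
# Assumed values for x, chosen to match the totals shown later in the section.
x <- c(8, 1, 3, 1, 3)

sumval <- 0                    # output, created before the loop starts
for (i in seq_along(x)) {      # i takes the values 1, 2, ..., length(x)
  sumval <- sumval + x[[i]]    # body: add the ith element to the running total
}
sumval                         # 16
```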
4.4.2 For-Loop Structure
For-loops are a standard means in multiple computer languages for iterating (repeating) sections of code for a specified number of iterations (repetitions).
Each for-loop contains the following elements:
- Condition: This sets the number of iterations. It sets the values by which the loops will be sequenced (often by 1) and defines the variable by which the loops will be counted, normally variables such as i, j, or k.
  - In the example above, the function seq_along(x) is a special version of seq which creates a vector from 1 to the length of x, incremented by 1 (so 1, 2, 3, 4, 5), and the variable i will contain each successive value from the vector as the loop iterates.
- Body: This is the expression or code between the curly braces {}. This is the code that will be evaluated each iteration with the new value of i for that iteration. After the end of the expression, the for loop sets the index to the next value in the sequence defined in the condition and checks if the loop needs to be run again. If there is a next value, it executes the expression with the new value of i. If not, the for loop is complete and it moves to the line of code after the closing } of the for loop.
- Output: The variable that is produced by the iteration. This is sumval above.
  - It’s best to allocate the memory for the output before starting the for-loop.
In the above sequence, R internally transforms the code to:
sumval <- 0
sumval <- sumval + x[[1]]
sumval <- sumval + x[[2]]
sumval <- sumval + x[[3]]
sumval <- sumval + x[[4]]
sumval <- sumval + x[[5]]
sumval
[1] 16
There are four variations on the basic theme of the for loop:
- Modifying an existing object, instead of creating a new object.
- Looping over names or values, instead of indices.
- Handling outputs of unknown length.
- Handling sequences of unknown length.
A best practice for filling a new vector with values is to create it before you run the loop.
- Create the new vector beforehand using the vector() function by specifying the type (mode) of the elements and the number of elements you need.
  - Allocating memory is slow so it is faster to do it once before the loop than to add more memory with each iteration of the loop.
  - Look at help for vector() and seq_along().
- You can also create by assigning a vector of NAs but you should ensure they are of the correct type to avoid conversion during the loop. See help for NA.
For example, let’s calculate a vector of the cumulative sums of the elements in x.
- The for loop sets the condition and then checks if i == 1.
- If so, it executes the next line. If not, it jumps to the else line.
- At the end of the expression, it increments i by 1, so i = i + 1, and goes back to the line with the for loop condition.
- It checks if the new value of i still meets the condition. If so, it executes the expression again. If not, the for loop is done and it jumps to the line of code after the for loop ending }.
## Allocate the memory in a new variable
cumvec <- vector(mode = "double", length = length(x))
cumvec
[1] 0 0 0 0 0
## start the for-loop
for (i in seq_along(cumvec)) {
if (i == 1) {
cumvec[[i]] <- x[[i]]
} else {
cumvec[[i]] <- cumvec[[i - 1]] + x[[i]]
}
}
cumvec
[1] 8 9 12 13 16
[1] 8 9 12 13 16
4.4.2.1 Exercise
- The first two numbers of the Fibonacci Sequence are 0 and 1. Each succeeding number is the sum of the previous two numbers in the sequence. For example, the third element is 0 + 1 = 1, while the fourth element is 1 + 1 = 2, and the fifth element is 2 + 1 = 3 and so on.
Use a for loop to calculate the first 100 Fibonacci Numbers.
As a sanity check: The \(\log_2\) of the 100th Fibonacci Number is about 67.57.
Design your steps - there are two alternatives: Set the first two values inside the loop or set them outside the loop
- Initialize an empty vector of the desired length to hold your numbers
- If you initialize outside the loop, do it now.
- Create your for loop to iterate as many times as you need.
- Create the logic to compute the numbers.
fibvec <- vector(mode = "double", length = 100)
for (i in seq_along(fibvec)) {
if (i > 2) {
fibvec[[i]] <- fibvec[[i - 1]] + fibvec[[i - 2]]
} else if (i == 1) {
fibvec[[i]] <- 0
} else if (i == 2) {
fibvec[[i]] <- 1
} else {
stop(paste0("i = ", i))
}
}
## Test the results
head(fibvec, n = 10)
log2(fibvec[100])
fibvec <- vector(mode = "double", length = 100)
## initialize outside the loop
fibvec[1:2] <- c(0, 1)
## run the loop skipping the first two numbers
for (i in seq(from = 3, to = length(fibvec))) {
fibvec[[i]] <- fibvec[[i - 1]] + fibvec[[i - 2]]
}
## Test the results
head(fibvec, n = 10)
log2(fibvec[100])
4.4.3 Looping Over the Columns of a Data Frame.
For a data frame df, seq_along(df) is the same as 1:ncol(df) which is the same as 1:length(df) (since data frames are special cases of lists).
Let’s calculate the mean of each column of mtcars.
mean_vec <- vector(mode = "numeric", length = length(mtcars))
for (i in seq_along(mtcars)) {
mean_vec[[i]] <- mean(mtcars[[i]], na.rm = TRUE)
}
mean_vec
[1] 20.090625 6.187500 230.721875 146.687500 3.596563 3.217250
[7] 17.848750 0.437500 0.406250 3.687500 2.812500
mpg cyl disp hp drat wt qsec
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
vs am gear carb
0.437500 0.406250 3.687500 2.812500
Why not just use colMeans()? Well, you could if you only wanted means.
- However, if you wanted column standard deviations, there is no “colSDs” function.
- Thus you need some form of iteration for applying custom functions to multiple elements.
For-loops are one method for iteration.
sd_vec <- vector(mode = "numeric", length = length(mtcars))
for (i in seq_along(mtcars)) {
sd_vec[[i]] <- sd(mtcars[[i]], na.rm = TRUE)
}
sd_vec
[1] 6.0269481 1.7859216 123.9386938 68.5628685 0.5346787 0.9784574
[7] 1.7869432 0.5040161 0.4989909 0.7378041 1.6152000
Good News! With {dplyr} 1.0 we can now use across() as another approach to iteration for many functions.
Let’s load the {tidyverse} which should have {dplyr} version 1.1.2 or greater.
- Look at help for across() and/or the Column-wise vignette to learn about across() and its counterpart rowwise().
Use across() inside a call to summarize() or mutate().
- Note the fn argument of where() is the name of the function - do not include the paren operator which “calls” the function.
mpg cyl disp hp drat wt qsec vs
1 6.026948 1.785922 123.9387 68.56287 0.5346787 0.9784574 1.786943 0.5040161
am gear carb
1 0.4989909 0.7378041 1.6152
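A call of this kind might look like the sketch below. It assumes {dplyr} is installed, and the object names sd_all and sd_numeric are hypothetical.

```r
library(dplyr)

# Apply sd() to every column; sd is passed by name, without parentheses.
sd_all <- mtcars |>
  summarize(across(everything(), sd))

# Restrict to numeric columns with where(), and pass extra arguments
# through an anonymous function.
sd_numeric <- mtcars |>
  summarize(across(where(is.numeric), \(x) sd(x, na.rm = TRUE)))

sd_all
```

Both calls return a one-row data frame with one column per summarized variable, in contrast to the plain numeric vector produced by the for-loop.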
So, you no longer need to write a for-loop to do complex summaries of columns, but there are many times when iteration is the best approach to accomplish a desired transformation of the data.
4.4.3.1 Exercise
- Use a for-loop to calculate the standard deviation of the four numeric traits in columns 3 to 6 of the penguins data frame from {palmerpenguins}.
- Repeat using across() using where() to limit to numeric columns.
- What is different about the results?
data(penguins, package = "palmerpenguins")
sdvec <- rep(NA_real_, length.out = 4)
for (i in seq_along(sdvec)) {
sdvec[i] <- sd(penguins[[i + 2]], na.rm = TRUE)
}
sdvec
penguins |>
  summarize(across(where(is.numeric), .fns = \(x) sd(x, na.rm = TRUE)))
# for loop creates a vector and across creates a data frame.
4.4.4 The while() function as an alternative
Sometimes you don’t know how many times to repeat the code block as it may depend upon the results of the loop.
- You might want to loop until you get three heads in a row in a simulation, or,
- You might want to loop until the difference between two values is below or above some threshold.
- You can’t do that sort of iteration with the for-loop. Instead, use a while-loop.
A while-loop is simpler than a for-loop because it only has two components, a condition and a body.
A while-loop is also more general than a for-loop, because you can rewrite any for-loop as a while-loop, but you can’t rewrite every while-loop as a for-loop.
Example: Use a while-loop to find how many tosses of a coin it takes till one gets three heads in a row:
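One way to write this simulation is sketched below. The variable names and the seed are assumptions made for illustration; the seed only makes the run reproducible.

```r
set.seed(1)            # arbitrary seed, used only for reproducibility

heads_in_row <- 0      # current streak of consecutive heads
n_tosses <- 0          # total number of tosses so far

# Condition: keep looping until the streak reaches three.
while (heads_in_row < 3) {
  toss <- sample(c("H", "T"), size = 1)   # one fair coin toss
  n_tosses <- n_tosses + 1
  if (toss == "H") {
    heads_in_row <- heads_in_row + 1      # extend the streak
  } else {
    heads_in_row <- 0                     # tails resets the streak
  }
}
n_tosses
```

The number of iterations is not known in advance, which is why a while-loop fits here: the loop ends only when the body has changed the state enough to make the condition FALSE.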
4.5 Using the {purrr} Package for Iteration
4.5.1 Intro to the {purrr} Package
R is a functional programming language.
- You can pass functions to functions and use functions to create or change functions.
- You can compose functions together to effect what happens in the environment.
Suppose for the mtcars data frame, we want to calculate the column-wise mean, the column-wise median, the column-wise standard deviation, the column-wise maximum, the column-wise minimum, and the column-wise median absolute deviation (MAD).
The for-loop would look very similar for each function fun as shown in this non-executable example.
## Dummy Code with a generic "fun()" function
funvec <- rep(NA_real_, length = length(mtcars))
for (i in seq_along(funvec)) {
funvec[i] <- fun(mtcars[[i]], na.rm = TRUE)
}
funvec
- Ideally, we would like to just tell R what function to apply to each column of mtcars instead of writing all the code for a for-loop.
This is exactly what the {purrr} package allows us to do:
- {purrr} provides a consistent syntax for identifying a set of data and then a function to be applied to that data.
- {purrr} is a part of the {tidyverse} package so does not need to be loaded separately.
4.5.2 The {purrr} map Functions
The {purrr} functions that start with map (known as the map*() functions) use the same four arguments:
- .x: a list or atomic vector with one or more elements. .x can be a vector, a data frame, or a list.
- .f: a function you want to apply to each of the elements of the .x data structure.
- ...: a placeholder for additional arguments to be passed to the .f function, e.g., na.rm=TRUE.
- .progress: A TRUE/FALSE logical for whether to show a progress bar during execution.
The .f functions can be:
- a named function from Base R or an R package present in the global environment or accessible using package::function().
- a named function you wrote that is accessible in the environment (has been sourced).
- an unnamed or “anonymous function” you define inside the map function. These can include those defined with the new anonymous function shorthand backslash \().
As of version 1.0 (Dec 2022), the formula syntax with the ~ operator and . pronoun is no longer recommended.
The anonymous function shorthand syntax \(arg) expr with explicit arguments is now recommended.
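The three kinds of .f can be sketched side by side. The input list x and the helper my_range() are hypothetical, and {purrr} is assumed to be installed.

```r
library(purrr)

# Hypothetical input: a small list of double vectors, one with a missing value.
x <- list(a = c(1, 2, 3), b = c(2, NA, 6))

# 1. A named function from base R; na.rm = TRUE travels through `...`.
means <- map_dbl(x, mean, na.rm = TRUE)

# 2. A named function you wrote yourself (my_range is a hypothetical helper).
my_range <- function(v) max(v, na.rm = TRUE) - min(v, na.rm = TRUE)
ranges <- map_dbl(x, my_range)

# 3. An anonymous function using the \(arg) shorthand.
sums <- map_dbl(x, \(v) sum(v, na.rm = TRUE))
```

In each case the result is a named double vector with one value per element of x.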
The map*() functions do the work of running a for-loop for you.
- They create the necessary data structures (and memory) of the desired type.
- They break out the .x data structure into its top level elements.
  - A .x vector becomes the individual elements of the vector, .x[[1]], .x[[2]], ...
  - A .x data frame becomes the individual columns of the data frame, df[[1]], df[[2]], ...
  - A .x list becomes the individual elements of the list, which could be their own individual values, vectors, data frames or lists.
- They iterate to pass each of the individual elements as an input argument to the .f function, e.g.,
  - the ith element returns the output of .f(.x[[i]], ...) as the ith element of the designated output structure.
- They return the output in the desired form and type.
  - The output has the same number of elements as the original .x but may be of a different form, e.g., the .f() function may summarize columns of a data frame into a vector.
- The different variants of the map_* function return different forms of output.
  - map() always returns a list.
  - map_lgl() returns a logical vector.
  - map_int() returns an integer vector.
  - map_dbl() returns a double vector.
  - map_chr() returns a character vector.
  - map_vec() returns an atomic vector for use when .f returns classes such as dates, factors, and date-times. You can use the .ptype (prototype) argument to specify the class.
The use of map_* allows you to write succinct code such as the following.
- By using the ... argument, you can pass on more arguments to map_*() so they can be passed to the .f function.
mpg cyl disp hp drat wt qsec
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
vs am gear carb
0.437500 0.406250 3.687500 2.812500
- As an example, you can use map to get the output of summary() on each column.
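The different output variants can be compared on the columns of mtcars, as in this sketch. The result names are hypothetical and {purrr} is assumed to be installed.

```r
library(purrr)

# Each call iterates over the columns of mtcars; only the output type differs.
col_means  <- map_dbl(mtcars, mean, na.rm = TRUE)   # named double vector
col_types  <- map_chr(mtcars, typeof)               # named character vector
col_is_num <- map_lgl(mtcars, is.numeric)           # named logical vector
col_unique <- map_int(mtcars, \(col) length(unique(col)))  # named integer vector

# map() returns a list, which suits multi-valued results such as summary().
col_summaries <- map(mtcars, summary)
col_summaries$mpg
```

Choosing the variant up front also acts as a check: map_dbl() raises an error if .f returns something that is not a single number, rather than silently returning an unexpected structure.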
- Write code using one of the map functions to:
  1. Create a character vector with the type of each column in nycflights13::flights.
  2. Create an integer vector with the number of unique values in each column of ToothGrowth from the base R {datasets} package.
- Repeat 1 and 2 using dplyr::across() without using map*().
4.5.3 Why Use {purrr}?
The chief benefit of using map*() functions instead of for-loops is clarity, not speed; they can make your code easier to write, read, and maintain as the intent is clearer.
- The focus is on the operation being performed, e.g., mean(), not the bookkeeping code required to create an output vector, set conditions, loop over every element, and capture the iterated output.
Base R has the apply family of functions that are similar. Using {purrr} functions offers some advantages.
- {purrr} functions have consistent names and arguments and work well with other tidyverse functions.
- The first argument is always the data, so {purrr} works with the pipe.
- {purrr} functions use . as an argument prefix to avoid inadvertently mixing {purrr} function arguments with those of the .f function.
- {purrr} functions provide for all combinations of input and output variants and have specific map2_* functions for the common two argument case.
- {purrr} functions are written in C so can be faster than other options.
- See purrr base R for more details.
When performing operations on the columns of a data frame, the dplyr::across() function is an option besides using {purrr}.
4.5.4 {purrr} Functions for Working with Lists
The inherent flexibility of the list data structure can make working with lists both necessary and challenging.
- Deeply nested lists are commonly used to capture the complexity of a data set while minimizing redundancy and storage.
- However, deeply nested lists can be hard to operate on with other functions.
Flattening and simplification are operations on lists to expose the data of interest in a convenient form.
Version 1.0 of {purrr} superseded the dfc and dfr functions for creating data frames with three new sets of functions for working with lists:
- list_flatten() removes a single level of hierarchy from a list; the output is always a list.
  - This means it collapses lower level lists into the parent element above them.
- list_simplify() reduces a list to a homogeneous vector; the output is always the same length as the input.
  - This means all the list elements must have length 1.
- list_c(), list_cbind(), and list_rbind() concatenate the elements of a list to produce a vector or data frame.
  - list_c() concatenates all the elements into a single vector so they must be of the same class.
  - list_cbind() and list_rbind() only work with lists where every element is a data frame.
  - list_cbind() takes the elements and concatenates them as columns so they must each have the same number of rows.
  - list_rbind() takes the elements and concatenates them as rows.
    - If a row is missing a column, it will fill with NA.
    - You have to be careful with how names are assigned (see help).
Example of list_flatten() showing you can use it to keep removing lower level lists until the list is as flat as possible.
List of 2
$ : num 1
$ :List of 3
..$ : num 2
..$ :List of 2
.. ..$ : num 3
.. ..$ : num 4
..$ : num 5
List of 4
$ : num 1
$ : num 2
$ :List of 2
..$ : num 3
..$ : num 4
$ : num 5
List of 5
$ : num 1
$ : num 2
$ : num 3
$ : num 4
$ : num 5
List of 5
$ : num 1
$ : num 2
$ : num 3
$ : num 4
$ : num 5
num [1:5] 1 2 3 4 5
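A sequence of calls consistent with the structures printed above might look like this sketch. The object names x, x1, x2, and x3 are assumptions, and {purrr} version 1.0 or later is assumed to be installed.

```r
library(purrr)

# Hypothetical nested list matching the first structure printed above.
x <- list(1, list(2, list(3, 4), 5))

x1 <- list_flatten(x)    # removes one level: 4 elements, one still a list
x2 <- list_flatten(x1)   # removes the next level: 5 numeric elements
x3 <- list_flatten(x2)   # already flat, so nothing changes

# Every element now has length 1, so the list can become an atomic vector.
flat_vec <- list_simplify(x2)

str(x1)
str(flat_vec)
```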
Examples of the concatenation functions.
- Let’s create a list of three data frames using split().
- split() is a base R function that turns a data frame into a list of data frames, where all units (rows) within a given data frame share the same value of the f = argument (here, split(.$cyl)).
  - The “.” in split references the current data frame (since split() is not tidyverse).
- Note the data frames have the same length but different numbers of observations.
List of 3
$ 4:'data.frame': 11 obs. of 11 variables:
..$ mpg : num [1:11] 22.8 24.4 22.8 32.4 30.4 33.9 21.5 27.3 26 30.4 ...
..$ cyl : num [1:11] 4 4 4 4 4 4 4 4 4 4 ...
..$ disp: num [1:11] 108 146.7 140.8 78.7 75.7 ...
..$ hp : num [1:11] 93 62 95 66 52 65 97 66 91 113 ...
..$ drat: num [1:11] 3.85 3.69 3.92 4.08 4.93 4.22 3.7 4.08 4.43 3.77 ...
..$ wt : num [1:11] 2.32 3.19 3.15 2.2 1.61 ...
..$ qsec: num [1:11] 18.6 20 22.9 19.5 18.5 ...
..$ vs : num [1:11] 1 1 1 1 1 1 1 1 0 1 ...
..$ am : num [1:11] 1 0 0 1 1 1 0 1 1 1 ...
..$ gear: num [1:11] 4 4 4 4 4 4 3 4 5 5 ...
..$ carb: num [1:11] 1 2 2 1 2 1 1 1 2 2 ...
$ 6:'data.frame': 7 obs. of 11 variables:
..$ mpg : num [1:7] 21 21 21.4 18.1 19.2 17.8 19.7
..$ cyl : num [1:7] 6 6 6 6 6 6 6
..$ disp: num [1:7] 160 160 258 225 168 ...
..$ hp : num [1:7] 110 110 110 105 123 123 175
..$ drat: num [1:7] 3.9 3.9 3.08 2.76 3.92 3.92 3.62
..$ wt : num [1:7] 2.62 2.88 3.21 3.46 3.44 ...
..$ qsec: num [1:7] 16.5 17 19.4 20.2 18.3 ...
..$ vs : num [1:7] 0 0 1 1 1 1 0
..$ am : num [1:7] 1 1 0 0 0 0 1
..$ gear: num [1:7] 4 4 3 3 4 4 5
..$ carb: num [1:7] 4 4 1 1 4 4 6
$ 8:'data.frame': 14 obs. of 11 variables:
..$ mpg : num [1:14] 18.7 14.3 16.4 17.3 15.2 10.4 10.4 14.7 15.5 15.2 ...
..$ cyl : num [1:14] 8 8 8 8 8 8 8 8 8 8 ...
..$ disp: num [1:14] 360 360 276 276 276 ...
..$ hp : num [1:14] 175 245 180 180 180 205 215 230 150 150 ...
..$ drat: num [1:14] 3.15 3.21 3.07 3.07 3.07 2.93 3 3.23 2.76 3.15 ...
..$ wt : num [1:14] 3.44 3.57 4.07 3.73 3.78 ...
..$ qsec: num [1:14] 17 15.8 17.4 17.6 18 ...
..$ vs : num [1:14] 0 0 0 0 0 0 0 0 0 0 ...
..$ am : num [1:14] 0 0 0 0 0 0 0 0 0 0 ...
..$ gear: num [1:14] 3 3 3 3 3 3 3 3 3 3 ...
..$ carb: num [1:14] 2 4 3 3 3 4 4 4 2 2 ...
list_c() concatenates the three data frames from the list back into one data frame.
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
list_rbind() concatenates the three data frames from the list back into one data frame as well since they have the same column structure.
mpg cyl disp hp drat wt qsec vs am gear carb
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
list_cbind() generates an error as the three data frames from the list have different numbers of rows so can’t fit in the same column structure.
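The split-and-recombine steps above can be sketched as follows. The object names are hypothetical, {purrr} version 1.0 or later is assumed, and tryCatch() is used only so the expected list_cbind() error does not stop the script.

```r
library(purrr)

# Split mtcars into a list of three data frames, one per value of cyl.
cyl_list <- split(mtcars, mtcars$cyl)
rows_per_group <- map_int(cyl_list, nrow)   # 11, 7, and 14 rows

# Row-binding works because all three data frames share the same columns.
rebound <- list_rbind(cyl_list)
dim(rebound)                                # 32 rows, 11 columns

# Column-binding fails because the data frames have different row counts.
cbind_result <- tryCatch(
  list_cbind(cyl_list),
  error = function(e) "error"
)
cbind_result
```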
This example from the list_cbind() help shows that list_cbind() returns a data frame with packed columns where the packed data frames retain the names of the original list elements and their columns retain the names from the original data frame.
- You can still access these columns directly through the names and subsetting.
x2 <- list(
a = data.frame(x = 1:2),
b = data.frame(y = "a")
)
list_cbind(x2) -> x2_df
x2_df |> str()
'data.frame': 2 obs. of 2 variables:
$ a:'data.frame': 2 obs. of 1 variable:
..$ x: int 1 2
$ b:'data.frame': 2 obs. of 1 variable:
..$ y: chr "a" "a"
[1] "a" "b"
'data.frame': 2 obs. of 1 variable:
$ x: int 1 2
[1] "x"
x
1 1
2 2
x
1 1
2 2
You can also unpack the data frame columns with tidyr::unpack().
- Again you have to be careful with how to manage the names.
unpack(x2_df, cols = everything()) -> x2_un
x2_un |> str() # note the data frame names a and b are gone.
tibble [2 × 2] (S3: tbl_df/tbl/data.frame)
$ x: int [1:2] 1 2
$ y: chr [1:2] "a" "a"
[1] "x" "y"
[1] 1 2
This example shows list_cbind() with a larger data set, dplyr::starwars.
- We drop the list columns (12-14) as they cannot be converted to data frames.
Rows: 87
Columns: 14
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Or…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 2…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "brown", N…
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "light", "…
$ eye_color <chr> "blue", "yellow", "red", "yellow", "brown", "blue", "blue",…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0, 57.0, …
$ sex <chr> "male", "none", "none", "male", "female", "male", "female",…
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "femini…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "T…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Huma…
$ films <list> <"A New Hope", "The Empire Strikes Back", "Return of the J…
$ vehicles <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>, <>, "Imp…
$ starships <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Advanced x1",…
sw_c <- map(starwars[1:11], \(x) as.data.frame(x))
list_cbind(sw_c, name_repair = "unique") -> sw_cb
sw_cb |> str()
'data.frame': 87 obs. of 11 variables:
$ name :'data.frame': 87 obs. of 1 variable:
..$ x: chr "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
$ height :'data.frame': 87 obs. of 1 variable:
..$ x: int 172 167 96 202 150 178 165 97 183 182 ...
$ mass :'data.frame': 87 obs. of 1 variable:
..$ x: num 77 75 32 136 49 120 75 32 84 77 ...
$ hair_color:'data.frame': 87 obs. of 1 variable:
..$ x: chr "blond" NA NA "none" ...
$ skin_color:'data.frame': 87 obs. of 1 variable:
..$ x: chr "fair" "gold" "white, blue" "white" ...
$ eye_color :'data.frame': 87 obs. of 1 variable:
..$ x: chr "blue" "yellow" "red" "yellow" ...
$ birth_year:'data.frame': 87 obs. of 1 variable:
..$ x: num 19 112 33 41.9 19 52 47 NA 24 57 ...
$ sex :'data.frame': 87 obs. of 1 variable:
..$ x: chr "male" "none" "none" "male" ...
$ gender :'data.frame': 87 obs. of 1 variable:
..$ x: chr "masculine" "masculine" "masculine" "masculine" ...
$ homeworld :'data.frame': 87 obs. of 1 variable:
..$ x: chr "Tatooine" "Tatooine" "Naboo" "Tatooine" ...
$ species :'data.frame': 87 obs. of 1 variable:
..$ x: chr "Human" "Droid" "Droid" "Human" ...
[1] "name" "height" "mass" "hair_color" "skin_color"
[6] "eye_color" "birth_year" "sex" "gender" "homeworld"
[11] "species"
unpack(sw_cb,
cols = everything(),
names_sep = ""
) # adds the list element names to the data frame x names.
# A tibble: 87 × 11
namex heightx massx hair_colorx skin_colorx eye_colorx birth_yearx sexx
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 Luke Skyw… 172 77 blond fair blue 19 male
2 C-3PO 167 75 <NA> gold yellow 112 none
3 R2-D2 96 32 <NA> white, blue red 33 none
4 Darth Vad… 202 136 none white yellow 41.9 male
5 Leia Orga… 150 49 brown light brown 19 fema…
6 Owen Lars 178 120 brown, grey light blue 52 male
7 Beru Whit… 165 75 brown light blue 47 fema…
8 R5-D4 97 32 <NA> white, red red NA none
9 Biggs Dar… 183 84 black light brown 24 male
10 Obi-Wan K… 182 77 auburn, wh… fair blue-gray 57 male
# ℹ 77 more rows
# ℹ 3 more variables: genderx <chr>, homeworldx <chr>, speciesx <chr>
4.5.4.1 Exercise
Generate a data frame where each row is a sample of 10 random values from a Normal\((\mu, 1)\) distribution, with one row for each element in the sequence \(\mu = -10, 0, 10, \ldots, 100\).
4.5.5 Using map_* to Create Models and Extract Values
Consider the following chunk of code which allows us to fit many simple linear regression models:
- split converts the data frame mtcars into a list with three data frames as elements based on the three levels of cyl.
- function(df) lm(mpg ~ wt, data = df) defines an “anonymous function” to fit a linear model of mpg on wt based on the variables in the data frame df passed to it (by map()) as an input argument.
- It returns as output a list of three lm objects you can use to get fitted values and summaries.
Call:
lm(formula = mpg ~ wt, data = df)
Residuals:
Min 1Q Median 3Q Max
-4.1513 -1.9795 -0.6272 1.9299 5.2523
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 39.571 4.347 9.104 7.77e-06 ***
wt -5.647 1.850 -3.052 0.0137 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.332 on 9 degrees of freedom
Multiple R-squared: 0.5086, Adjusted R-squared: 0.454
F-statistic: 9.316 on 1 and 9 DF, p-value: 0.01374
Since the output of map() is a list, we can then use map() to generate a list with the summary() for each linear model object.
Extracting named elements is a common operation, so {purrr} provides a shortcut; you can use a string.
[1] 0.5086326
4 6 8
0.5086326 0.4645102 0.4229655
$`4`
value numdf dendf
9.316233 1.000000 9.000000
$`6`
value numdf dendf
4.337245 1.000000 5.000000
$`8`
value numdf dendf
8.795985 1.000000 12.000000
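The string shortcut could be used roughly as follows. This is a sketch, not necessarily the code that produced the output above; the names models, model_summaries, and r_squared are assumptions.

```r
library(purrr)

models <- mtcars |>
  split(mtcars$cyl) |>
  map(\(df) lm(mpg ~ wt, data = df))

# One summary.lm object per model
model_summaries <- map(models, summary)

# A string is shorthand for "extract the element with this name"
r_squared <- map_dbl(model_summaries, "r.squared")
r_squared

# map() keeps the result as a list, which suits elements longer than 1
map(model_summaries, "fstatistic")
```

Passing "r.squared" is equivalent to writing \(s) s[["r.squared"]] as the function.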
You can also use an integer to select elements by position:
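A small sketch of selecting by position. The list x and the name slopes are illustrative assumptions.

```r
library(purrr)

# Pull the second element out of each inner list
x <- list(list(1, 2, 3), list(4, 5, 6), list(7, 8, 9))
map_dbl(x, 2)   # 2 5 8

# The same idea on the models: coef() returns (Intercept) then wt,
# so position 2 is the slope of wt in each model.
models <- mtcars |>
  split(mtcars$cyl) |>
  map(\(df) lm(mpg ~ wt, data = df))

slopes <- models |> map(coef) |> map_dbl(2)
slopes
```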
4.5.6 Building map Expressions (Backwards) from an Example
It can be daunting at first to think through how to build a map expression from left to right. As an alternative, consider working from right to left.
- Write the code to do just one element of the .x input data.
- Once that works, convert it to an anonymous function using \(arg) where you put in a dummy argument representing the element.
- Wrap the anonymous function with map() with the .x (perhaps piped in).
- Since map() always returns a list, think about what kind of output you want and consider different map_*() variants, e.g., map_df() or map_dfr(), map_lgl(), etc.
- If you don’t want the map code to operate on every element of the .x input, then consider using the map_if() function for conditional execution on an element.
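The steps above can be sketched with the regression example from the previous section. The helper name slope_of_wt is hypothetical, and map_dbl() is chosen here only because each result is a single number.

```r
library(purrr)

# Step 1: make it work for a single element (here, the whole data frame).
coef(lm(mpg ~ wt, data = mtcars))[["wt"]]

# Step 2: turn that code into an anonymous function of one element.
slope_of_wt <- \(df) coef(lm(mpg ~ wt, data = df))[["wt"]]

# Step 3: wrap it in map(), with the split data piped in as .x.
mtcars |> split(mtcars$cyl) |> map(slope_of_wt)

# Step 4: pick the variant that matches the desired output type.
slopes <- mtcars |> split(mtcars$cyl) |> map_dbl(slope_of_wt)
slopes

# map_if() applies the function only where the predicate is TRUE.
doubled <- list(1, "a", 3) |> map_if(is.numeric, \(x) x * 2)
```

Getting Step 1 right before adding map() keeps any error message tied to one small piece of code rather than to the whole pipeline.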
- We often use a \(t\)-test to test if differences in population means are “real”. R implements this with t.test().
- For example, to test for differences between the mean mpg of automatics and manuals (coded in variable am), we would use the following syntax. Note, the output of t.test() is a list.
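A sketch of that syntax using the formula interface of t.test(). The object name tt is an assumption.

```r
# Two-sample t-test of mpg by transmission type
# (in mtcars, am is 0 for automatic and 1 for manual)
tt <- t.test(mpg ~ am, data = mtcars)
tt

# The result is a list with class "htest",
# so its elements can be extracted by name.
is.list(tt)
names(tt)
tt$p.value
```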
- Use split() to create three subsets of mtcars and then try the backwards approach to build a map() expression to conduct the t.test on each subset.
- Pipe to a map expression to get a vector of the \(p\)-values for each subset of cyl.
4.5.7 map2() and pmap() Enable Mapping over Multiple Arguments in Parallel
If you have multiple related input vectors you need to iterate along in parallel, that’s the job of the map2() and pmap() functions.
Example: you want to simulate five random draws from three different Normal distributions with different means and standard deviations.
- You could iterate over the indices of the inputs using seq_along() and index into vectors of means and standard deviations:
# N(1,1), N(100,5), N(-10, 20)
mu <- list(1, 100, -10)
sigma <- list(1, 5, 20)
set.seed(1)
seq_along(mu) |>
  map(\(i) rnorm(5, mu[[i]], sigma[[i]])) |>
  str()
List of 3
$ : num [1:5] 0.374 1.184 0.164 2.595 1.33
$ : num [1:5] 95.9 102.4 103.7 102.9 98.5
$ : num [1:5] 20.2 -2.2 -22.4 -54.3 12.5
That can be confusing code to read, so {purrr} provides the function map2().
- map2() has arguments for .x and .y.
  - They should be of the same length, or one can be of length 1, in which case it is recycled.
# N(1,1), N(100,5), N(-10, 20)
mu <- list(1, 100, -10)
sigma <- list(1, 5, 20)
set.seed(1)
map2(mu, sigma, \(x_mu, y_sigma)
     rnorm(n = 5, mean = x_mu, sd = y_sigma)) |>
  str()
List of 3
$ : num [1:5] 0.374 1.184 0.164 2.595 1.33
$ : num [1:5] 95.9 102.4 103.7 102.9 98.5
$ : num [1:5] 20.2 -2.2 -22.4 -54.3 12.5
map2(mu, sigma, \(x_mu, y_sigma)
     rnorm(n = 5, mean = x_mu, sd = y_sigma)) |>
  as.data.frame(col.names = c("mu_1_sigma_1", "mu_100_sigma_5", "mu_-10_sigma_20"))
mu_1_sigma_1 mu_100_sigma_5 mu_.10_sigma_20
1 0.9550664 104.59489 -11.122575
2 0.9838097 103.91068 -13.115910
3 1.9438362 100.37282 -39.415048
4 1.8212212 90.05324 -19.563001
5 1.5939013 103.09913 -1.641169
When you have more than two vectors to iterate over in parallel, use purrr::pmap().
- Instead of .x and .y, there is a .l argument for a list of inputs.
Suppose we want a different number of samples from the three distributions.
- We create a list of our arguments: three arguments (n, mean, and sd), each holding one value per distribution.
- That list becomes the .l argument (instead of .x).
n <- list(20, 30, 50) # the number of samples for the three N(mu,sigma)
args_list <- list(n = n, mean = mu, sd = sigma)
str(args_list)
List of 3
$ n :List of 3
..$ : num 20
..$ : num 30
..$ : num 50
$ mean:List of 3
..$ : num 1
..$ : num 100
..$ : num -10
$ sd :List of 3
..$ : num 1
..$ : num 5
..$ : num 20
List of 3
$ : num [1:20] 2.359 0.897 1.388 0.946 -0.377 ...
$ : num [1:30] 102 96.9 101.7 94.4 107.2 ...
$ : num [1:50] -21.37 -12.7 13.56 -40.47 1.88 ...
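A call along the following lines would produce a list of that shape. This is a sketch: the seed and the object name samples are assumptions, so the simulated values will not match the output above.

```r
library(purrr)

mu    <- list(1, 100, -10)
sigma <- list(1, 5, 20)
n     <- list(20, 30, 50)
args_list <- list(n = n, mean = mu, sd = sigma)

set.seed(1)  # arbitrary seed, for reproducibility only

# The names in args_list (n, mean, sd) match rnorm()'s argument names,
# so pmap() can pass each set of values to rnorm() by name.
samples <- pmap(args_list, rnorm)
lengths(samples)   # 20 30 50

# The same call written with an explicit anonymous function
pmap(args_list, \(n, mean, sd) rnorm(n = n, mean = mean, sd = sd)) |>
  str()
```

Naming the elements of the list is what makes the short form safe; with an unnamed list, pmap() matches arguments by position instead.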
4.5.8 {purrr} keep() and discard() Select Columns with Logicals
keep() selects all variables that return TRUE according to a function you choose or define.
- It is similar in concept to select() with the tidy-select functions (see ?tidyr_tidy_select).
Example: let’s keep all numeric variables in the starwars data frame and calculate their means as a vector.
height mass birth_year
174.60494 97.31186 87.56512
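One way to write this, as a sketch. The name numeric_means is an assumption, and the exact means depend on the version of the starwars data shipped with {dplyr}.

```r
library(dplyr)   # provides the starwars data frame
library(purrr)

numeric_means <- starwars |>
  keep(is.numeric) |>                      # keep integer and double columns
  map_dbl(\(x) mean(x, na.rm = TRUE))      # one mean per remaining column
numeric_means
```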
discard() will select all variables that return FALSE according to some function.
Let’s get the summary for each column that is not a list or character.
name height mass hair_color skin_color eye_color
"character" "integer" "numeric" "character" "character" "character"
birth_year sex gender homeworld species films
"numeric" "character" "character" "character" "character" "list"
vehicles starships
"list" "list"
$height
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
66.0 167.0 180.0 174.6 191.0 264.0 6
$mass
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
15.00 55.60 79.00 97.31 84.50 1358.00 28
$birth_year
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
8.00 35.00 52.00 87.57 72.00 896.00 44
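A sketch of code that gives output of this shape. The name col_summaries is an assumption.

```r
library(dplyr)   # provides the starwars data frame
library(purrr)

# The class of each column
map_chr(starwars, class)

# Drop list and character columns, then summarize what is left
col_summaries <- starwars |>
  discard(is.list) |>
  discard(is.character) |>
  map(summary)
col_summaries
```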
4.5.8.1 Exercise
- In the mtcars data frame, use only three lines of code to create a vector with the mean of only the variables that have a mean greater than 10.
4.5.9 {purrr} keep_at() and discard_at() Select Columns with Names
These functions work similarly to keep() and discard(), except that the predicate operates on the column names rather than on the column values.
You can use a character vector of the names.
height mass
174.60494 97.31186
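For instance, a sketch with a character vector of names. It assumes {purrr} version 1.0.0 or later, where keep_at() is available; the name hm_means is an assumption.

```r
library(dplyr)   # provides the starwars data frame
library(purrr)

hm_means <- starwars |>
  keep_at(c("height", "mass")) |>
  map_dbl(\(x) mean(x, na.rm = TRUE))
hm_means
```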
You can use an anonymous function of the names, e.g., with str_detect().
# A tibble: 87 × 3
hair_color skin_color eye_color
<chr> <chr> <chr>
1 blond fair blue
2 <NA> gold yellow
3 <NA> white, blue red
4 none white yellow
5 brown light brown
6 brown, grey light blue
7 brown light blue
8 <NA> white, red red
9 black light brown
10 auburn, white fair blue-gray
# ℹ 77 more rows
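A sketch of a call that keeps the three color columns. The pattern "color" and the name color_cols are assumptions.

```r
library(dplyr)     # provides the starwars data frame
library(purrr)
library(stringr)

# The anonymous function receives the column names as a character vector
color_cols <- starwars |>
  keep_at(\(nm) str_detect(nm, "color"))
color_cols
```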
You can use your own named function and also pipe to other functions.
[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE
Rows: 87
Columns: 7
$ name <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", "Leia Org…
$ height <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188, 180, 22…
$ mass <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 84.0, 77.0…
$ sex <chr> "male", "none", "none", "male", "female", "male", "female", …
$ gender <chr> "masculine", "masculine", "masculine", "masculine", "feminin…
$ homeworld <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alderaan", "Ta…
$ species <chr> "Human", "Droid", "Droid", "Human", "Human", "Human", "Human…
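A sketch consistent with the output above. The function name has_underscore is hypothetical, as is the guess that the original predicate matched names containing an underscore.

```r
library(dplyr)     # provides starwars and glimpse()
library(purrr)
library(stringr)

# Hypothetical named predicate: TRUE for column names containing "_"
has_underscore <- function(nm) str_detect(nm, "_")
has_underscore(names(starwars))

# Drop those columns by name, drop the list columns by type, then inspect
trimmed <- starwars |>
  discard_at(has_underscore) |>
  discard(is.list)
glimpse(trimmed)
```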
4.5.10 Get or Set an Element Deep in a Nested Data Structure
The {purrr} package has a function called pluck() which you can use to get or set elements deep within nested lists or data frames without having to manipulate the entire nested data structure.
pluck() is a shorthand way of combining multiple [[ subset operators.
- pluck() provides a way of retrieving objects from such data structures using a combination of numeric positions, vectors, or list names.
In our starwars example we can go to the films column, which is a list, and get the 4th element of the first row.
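A sketch of both getting and setting. The names fourth_film and nested are assumptions, and the small nested list exists only to illustrate assignment.

```r
library(dplyr)   # provides the starwars data frame
library(purrr)

# films is a list column; row 1 holds a character vector of film titles.
fourth_film <- starwars |> pluck("films", 1, 4)
fourth_film

# The equivalent chain of subset operators
starwars$films[[1]][[4]]

# Setting a deeply nested element with the assignment form of pluck()
nested <- list(a = list(b = list(c = 1)))
pluck(nested, "a", "b", "c") <- 99
pluck(nested, "a", "b", "c")   # 99
```

Unlike [[, pluck() returns NULL rather than an error when an element along the path does not exist, which can be convenient inside map() calls.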
4.5.11 {purrr} Summary
The {purrr} package facilitates functional programming by providing functions for working with vectors and for using functions as arguments to other functions.
- The map_*() functions take care of iterating through code so you don’t have to write for-loops.
- They can streamline your code, making it easier to interpret and maintain.
Learning the {purrr} functions can be a good use of your time as you work with more complicated data sets.