6  Getting Data Using APIs

6.1 Introduction

6.1.1 Learning Outcomes

  • Identify potential sources for data from federal, state, local and international governments and agencies.
  • Use R tools for interacting with Application Programming Interfaces (APIs) to obtain data from the web and other sources.
  • Apply {tidyr} functions to assist in tidying JSON data obtained from a web API output.

6.1.2 References:

6.1.2.1 Other References

6.2 Governmental Open Data

On Jan 14, 2019, the Foundations for Evidence-Based Policymaking Act of 2018 became law.

  • Known as the “Open Data Law”, it requires federal agencies to publish their information online as “open data”, using standardized, machine-readable data formats, with their metadata included in the Data.gov catalog.

  • The Federal Government Data Catalog has metadata on over 250,000 databases available for public use. Most are from state and local governments but there are many from the Federal level or international sources.

  • The topics cover a wide range of government functions.

  • The data catalog has links to web sites or repositories from the sponsoring agency.

  • In many cases, the data is not produced by the source but is curated by an organization as part of an international arrangement, e.g., meteorological or oceanic data.

6.2.1 Federal Government Web Sites

Most Departments, Agencies, and Offices of the federal government have their own websites for providing open data and/or use the central catalog.

  • These websites often have tools for visualizing the data as well as downloads.
  • Download options can include multiple formats, e.g., zip, csv, html, or Excel, as well as special formats for mapping data.
  • Many sites allow for direct download via your browser while others may require FTP for large data sets.

Some Sites Are Specialized; Others use data.gov

6.2.2 Local Government Open Data Initiatives

Many locales have publicly-available data. A few examples:

6.2.3 Open Data exists in, or for, Many Countries.

Sites with Open Data Around the World

6.2.4 Organizations Sponsor Open Data Sites (some charge a fee)

6.2.5 Bottom Line: Data, Data Everywhere
with Billions and Billions of Bytes to Think

with apologies to Samuel Taylor Coleridge

6.3 APIs and Authentication

6.3.1 Accessing Data on the Web

Where is Data on the Internet

Nohe (2017)

If you think of the web as an iceberg, Open Data is the 4% above the water level.

  • This is where web-crawlers and search engines roam freely: the surface web. Like an iceberg, there is much more data below the surface, in the deep web.
  • This data is protected behind firewalls or application programming interfaces where general search engines (or you) can’t access it without additional steps to establish your identity.

Organizations limit access to their data for many reasons:

  • They want to know who is accessing it (so they can prohibit excessive usage).
  • They want to protect data integrity (security), intellectual property, or privacy.

Below the deep web, about 6% of the data sits in the “Dark Web”, accessible only with special browsers and routers.

  • We will not be working at all with the “Dark Web”.

There are at least four ways to access data via the web:

  1. Click on a link to download a .csv/.xlsx/.txt/.json/.XML/… file.
  2. Use a package to interact with an API.
  3. Write code to interact directly with an API.
  4. Scrape directly from the HTML file.

We know how to do 1; this class is about doing 2 and 3, and the next class is about 4.
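For example, option 1 can be as simple as pointing a reader function at a file's URL. A minimal sketch; the URL is a placeholder:

```{r}
#| eval: false
# Option 1: download a .csv file directly from a URL (placeholder URL).
dat <- readr::read_csv("https://example.com/some_open_data.csv")
```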

6.3.2 What is an API?

When organizations put their data behind an Application Programming Interface (API) on the Deep Web, they are using software to act as a gateway to control the access to the data.

  • A common form of API is known as the Representational state transfer or REST API.
  • This architecture is especially useful for accessing data as a transaction through public-facing APIs.

APIs can control Who can access, What can be accessed, How can it be accessed, When it can be accessed, and What is returned.

  • An API Endpoint is a location of a resource to be accessed by an API (think of an endpoint as aligned with a specific URL).
  • APIs may act as gateways to many different endpoints which can be turned open or closed.

Organizations typically publish an API description document to provide users information on how to interact with the API.

  • A “how to” guide for requesting data from a particular API (and endpoint), and a description of what type of response you get after the request.

These days, most APIs will return data in one of two formats: JSON or XML.

APIs are an abstraction and can be implemented in many programming languages.

  • Think of it like a user interface for a computer program.
  • You click a button and expect the user interface to do something.
  • An API lets a program click a metaphorical “button” and it expects to get something back.

There are lots of free and public APIs: https://github.com/public-apis/public-apis

  • Remember this link for the homework if you can’t think of a site.
Important

Most APIs use common internet communication protocols and data structures.

However, they implement them differently so each API and each API wrapper package will be unique.

You have to read the documentation.

The {aRxiv} package is an example of an R package that wraps the arXiv API.

  • One can use it to access metadata about all of the papers on arXiv as well as the papers themselves.
  • The following example uses the {aRxiv} package Karthik and Broman (2023) to find all of the arXiv scholarly articles with Hadley Wickham as an author and in the categories under “stat”(istics).
library(aRxiv)
z <- arxiv_search(query = 'au:"Hadley Wickham" AND cat:stat*', limit = 50)
dplyr::glimpse(z)
Rows: 3
Columns: 15
$ id               <chr> "1401.3269v1", "1502.00318v1", "1712.07349v1"
$ submitted        <chr> "2014-01-14 17:43:11", "2015-02-01 21:43:51", "2017-1…
$ updated          <chr> "2014-01-14 17:43:11", "2015-02-01 21:43:51", "2017-1…
$ title            <chr> "Teaching precursors to data science in introductory …
$ abstract         <chr> "  Statistics students need to develop the capacity t…
$ authors          <chr> "Nicholas J Horton|Benjamin S Baumer|Hadley Wickham",…
$ affiliations     <chr> "", "", ""
$ link_abstract    <chr> "http://arxiv.org/abs/1401.3269v1", "http://arxiv.org…
$ link_pdf         <chr> "http://arxiv.org/pdf/1401.3269v1", "http://arxiv.org…
$ link_doi         <chr> "", "http://dx.doi.org/10.1080/09332480.2015.1042739"…
$ comment          <chr> "", "", ""
$ journal_ref      <chr> "", "", ""
$ doi              <chr> "", "10.1080/09332480.2015.1042739", ""
$ primary_category <chr> "stat.CO", "stat.CO", "stat.OT"
$ categories       <chr> "stat.CO|cs.CY|stat.OT|62-07", "stat.CO|cs.CY|stat.OT…
  • The package has other search arguments to customize your search.
  • The package has additional functions for working with the articles it retrieves; one is sketched below.
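For instance, arxiv_count() returns only the number of records matching a query, which can be useful before running a large search. A quick sketch, not run here:

```{r}
#| eval: false
# How many arXiv records match the query used above?
arxiv_count('au:"Hadley Wickham" AND cat:stat*')
```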

6.3.3 Authentication

Organizations differ in the level of control they establish for their APIs.

  • Some APIs are fully open and do not require any identification or authentication, e.g., the arXiv API in the example above. They may have terms of service to guide your usage but they do not require you to identify yourself.
  • Some APIs require you to register your email address so they know who is using their data.
  • The next level of greater control is basic authentication where you establish a username and password with the API owner for using the API.
  • Many APIs, especially for commercial organizations and US government agencies, now require more advanced authentication such as establishing a Personal Access Token or Access Key.
  • Finally, some organizations require you to register as a “developer” and request permission to use their data. They can issue multiple unique authorization tokens/keys for each “app” you are developing to govern your access and the access of any of your “App” users.

Organizations will typically provide documentation on their required level of control and how you gain access to using the API.

6.3.3.1 Basic Authentication

Some APIs use what’s called “Basic” authentication, where you provide a username and password. The basic syntax for this is:

GET("url", authenticate("username", "password"))
  • You can usually apply for a username and password.
  • Keep your password secure using the {keyring} package functions and variable names as discussed below.
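The GET()/authenticate() syntax above comes from the older {httr} package; the equivalent with the {httr2} package (covered later in this chapter) would be a request pipeline using req_auth_basic(). A minimal sketch; the URL and credentials are placeholders:

```{r}
#| eval: false
# Basic authentication with {httr2}; URL and credentials are placeholders.
library(httr2)
request("https://example.com/api") |>
  req_auth_basic("username", "password") |>
  req_perform()
```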

6.3.3.2 API Keys and Personal Access Tokens

More secure APIs now require you to obtain an API key or personal access token (PAT) credential to establish your identity.

  • You may also have to register as a developer.
Important
  • Once you get a key, you are responsible for all behavior associated with your key.
  • Never save or display your private key in a file you might share such as an R script, .qmd, or .Rmd file.

There are two main ways to store your key so you can access it in R.

  • Storing it in the .Renviron file. This approach is easy and does not depend on other libraries to access the key, but it leaves your key exposed in a plain text file.
  • Using the {keyring} package. This approach requires installing the package and interacting with your computer credential manager but it keeps your key password protected at the same level as your computer.

6.3.3.3 Example: Getting an API Key for OMDB

OMDB is the Open Movie Database, an open source for information on movies maintained by volunteers. The site is not endorsed by or affiliated with IMDb.com.

OMDB requires you to register your email to obtain a key to use the API.

Make sure to look for a link to verify your email and/or activate the key.

6.3.3.3.1 Option 1: Use the .Renviron File to Store and Access Your Key or PAT (Easy but Not Very Secure)

This is safer than pasting your key into an .qmd or .R file which you might share.

  • Make sure you have the {usethis} package installed on your computer (comes with {devtools}).
  • Use the console to open up your .Renviron file using the {usethis} package function edit_r_environ() e.g., usethis::edit_r_environ()
  • Enter the following in your .Renviron: OMDB_API_KEY = <your-private-key>
    • “OMDB_API_KEY” is the key name you chose for your key, and
    • <your-private-key> is the key OMDB sent you in the email.
  • Restart R.
  • You can now always access your private key using the function: Sys.getenv(key_name), e.g., Sys.getenv("OMDB_API_KEY").
6.3.3.3.2 Option 2: Use the {keyring} Package to Store and Access Your Key or PAT

Using .Renviron means your key is still stored in plain text in a hidden but unsecured file.

Add a layer of security by using the {keyring} package: https://github.com/r-lib/keyring.

  • This puts the key into your computer’s key and credential manager for storage.
  • Use the console to install the {keyring} package if you do not have it already: install.packages("keyring").
  • Load the {keyring} package using library(keyring).
library(keyring)

Use the console to save the key into your computer’s login credential manager using key_set(key_name).

  • You only do this once!
  • This should bring up a pop-up window where you can paste the key you got from OMDB in their email.
  • If asked, say Allow Always.

For this example, use the console to enter key_set("API_KEY_OMDB").

  • To access the key, a.k.a. credential, in your code, use the key_get(key_name) function.
  • You can do this inside your .qmd or .R file as many times as you need.
  • Your key is not exposed either in the file or your .Rhistory.
```{r}
key_get("API_KEY_OMDB")
```
[1] "a76d624c"
  • Note: For Mac Users. You may have to go to your Mac’s Keychain Access utility to change the permissions under login passwords to “allow all applications to access this item” so it works in RStudio.
Warning

A person with password access to your computer can open your credential manager to see your key.

If they just have access to your computer and know R and the {keyring} package, they could still get to your key.

However, using the credential manager with {keyring} is more secure than placing your key in a plain text file like .Renviron.

6.3.3.4 Using Your Key in your Code

You could access your key and save it to a variable in your file, e.g., my_key <- keyring::key_get("my_API_key_name").

  • However that exposes your key in the global environment and .Rhistory.

A better approach is to just use key_get(key_name) or Sys.getenv(key_name) inside your API functions to access your credential without creating an intermediate variable.

```{r}
#| eval: false
mout <- GET(url = "url", query = list(apikey = key_get("my_credential")))
```

6.4 Using the {httr2} Package to Access APIs

Most website APIs use HTTP (Hyper-Text Transfer Protocol), a protocol for requesting and transferring data.

  • The six most common HTTP verbs are: GET(), PATCH(), POST(), HEAD(), PUT(), and DELETE().

We won’t learn HTTP, but we will use the R {httr2} package to interface with APIs through HTTP.

  • The goal of this section is not to provide a comprehensive lesson on HTTP and extracting data using API’s. Rather, this just points you to some resources and gives you some examples.

Every API is different, so you always have to figure out how to interact with a new API.

APIs should have documentation you can read to understand important info:

  • What parameters can you send?
  • How do you need to send them?
  • What data should you get back based on the values of those parameters?

6.4.1 The {httr2} Package

The {httr2} package is a complete rewrite of the {httr} package which is now superseded (not updated).

  • The package uses several other packages such as {curl} which provides lower level interfaces using HTTP.
  • The R {curl} package uses curl capabilities that you may see mentioned in many online posts about accessing data from URLs.

The new {httr2} package transitions from using a single function, GET(), which was executed immediately, to a family of functions around the concept of a “request” object.

  • You will still see evidence of underlying usage of GET() but you do not have to use it as {httr2} allows you to operate at a higher level.

This family of functions allows you to create a request as an R object so you can work with it and refine your request without always executing it.

Use the console to install the {httr2} package.

Use library(httr2) to load and attach the {httr2} package.

library(httr2)

We will use the API for OMDB which is fairly simple and is documented at http://www.omdbapi.com/.

Let’s build a request object and fetch data from OMDB on the movie [The Lighthouse](https://en.wikipedia.org/wiki/The_Lighthouse_%282019_film%29).

6.4.2 Building a Request Object

6.4.2.1 Start with the URL

Every request starts with a URL.

  • OMDB API documentation says to send all requests to “http://www.omdbapi.com/?apikey=[yourkey]&”.
  • The terms after ? are queries, so we just use the base URL to start.

OMDB API Documentation on URLs

Use request() with the URL and assign a variable name to it.

  • To manage the request object we will be using functions that start with req_....
my_url <- "http://www.omdbapi.com/"
req <- request(my_url)
req
  • You can see the GET as the start of the request with the URL but the body of the request is empty.

You can run a “dry-run” to see exactly what will be sent to the URL without sending it.

req |>
  req_dry_run()
GET / HTTP/1.1
Host: www.omdbapi.com
User-Agent: httr2/1.0.5 r-curl/5.2.3 libcurl/8.7.1
Accept: */*
Accept-Encoding: gzip
  • The first line is the HTTP method which tells the server what you want to do, along with the protocol version. Here it’s GET but it could be other verbs if we were managing a website or posting to a blog.
  • The host is the URL (stripped of protocol information) and where we will build our query.
  • You can modify the additional parameters, but we don’t need to do that now; a quick sketch follows.
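For example, some APIs ask you to identify your client with a custom user agent. A minimal sketch (not needed for OMDB; the string is a placeholder):

```{r}
#| eval: false
# Replace the default user agent string before sending the request.
req |>
  req_user_agent("my-omdb-client (me@example.com)") |>
  req_dry_run()
```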

6.4.2.2 Add Parameters to the Query

We use the req_url_query() function to add name-value pairs to our URL as part of our query, again using the API documentation to define the pair parameter names and possible values.

  • The API documentation shows multiple possible parameters:

OMDB API Documentation on Query Parameters

We can build up the query one parameter at a time using the pipe so it is easy to modify.

  • Run req_dry_run() to see what it looks like.
req |>
  req_url_query(t = "The Lighthouse") |> # Movie title
  req_url_query(plot = "short") |> # Short or full plot
  req_url_query(r = "json") |> # data type for return
  req_url_query(apikey = keyring::key_get("API_KEY_OMDB")) |> # credential
  req_dry_run()
GET /?t=The%20Lighthouse&plot=short&r=json&apikey=a76d624c HTTP/1.1
Host: www.omdbapi.com
User-Agent: httr2/1.0.5 r-curl/5.2.3 libcurl/8.7.1
Accept: */*
Accept-Encoding: gzip

When you are happy with the request, use req_perform() to actually run the request and then save the response.

req |>
  req_url_query(t = "The Lighthouse") |> # Movie title
  req_url_query(plot = "short") |> # Short or full plot
  req_url_query(r = "json") |> # data type for return
  req_url_query(apikey = keyring::key_get("API_KEY_OMDB")) |> # credential
  req_perform() ->
my_resp

6.4.3 Working with the Response

We can see the response object has class httr2_response and displays information we can scan to quickly see what happened.

  • First line is the request URL with query.
  • Second line is the status code.
  • Third line is the Content type. It says application/json, but it is stored in RAW format.
  • Fourth line is the size of the body content in RAW form.
my_resp |> class()
[1] "httr2_response"
my_resp

The response object is a named list.

  • We have seen the $method, $url, and $status_code already.
  • We can use the $headers for troubleshooting if we absolutely need to with resp_headers().
  • The content we want is in the $body. Note the RAW encoding with 2-character hexadecimal codes.
dplyr::glimpse(my_resp)
List of 7
 $ method     : chr "GET"
 $ url        : chr "http://www.omdbapi.com/?t=The%20Lighthouse&plot=short&r=json&apikey=a76d624c"
 $ status_code: int 200
 $ headers    :List of 15
  ..$ Date                       : chr "Mon, 04 Nov 2024 22:54:27 GMT"
  ..$ Content-Type               : chr "application/json; charset=utf-8"
  ..$ Transfer-Encoding          : chr "chunked"
  ..$ Connection                 : chr "keep-alive"
  ..$ Cache-Control              : chr "public, max-age=86400"
  ..$ Expires                    : chr "Mon, 04 Nov 2024 23:54:27 GMT"
  ..$ Last-Modified              : chr "Mon, 04 Nov 2024 22:54:27 GMT"
  ..$ Vary                       : chr "*, Accept-Encoding"
  ..$ X-AspNet-Version           : chr "4.0.30319"
  ..$ X-Powered-By               : chr "ASP.NET"
  ..$ Access-Control-Allow-Origin: chr "*"
  ..$ CF-Cache-Status            : chr "MISS"
  ..$ Server                     : chr "cloudflare"
  ..$ CF-RAY                     : chr "8dd82fa0394ac93c-IAD"
  ..$ Content-Encoding           : chr "gzip"
  ..- attr(*, "class")= chr "httr2_headers"
 $ body       : raw [1:970] 7b 22 54 69 ...
 $ request    :List of 7
  ..$ url     : chr "http://www.omdbapi.com/?t=The%20Lighthouse&plot=short&r=json&apikey=a76d624c"
  ..$ method  : NULL
  ..$ headers : list()
  ..$ body    : NULL
  ..$ fields  : list()
  ..$ options : list()
  ..$ policies: list()
  ..- attr(*, "class")= chr "httr2_request"
 $ cache      :<environment: 0x116f80b68> 
 - attr(*, "class")= chr "httr2_response"
  • To manage the response we will be using functions that start with resp_....

6.4.3.1 Check the Status of the Request

The status code describes whether your request was successful.

Use resp_status() to get the code for our request:

resp_status(my_resp)
[1] 200

Here is a list of some possible status codes.

  • 200 OK means our request processed correctly through the API. It does Not mean we got the data we want.
  • 204 No Content means our request was accepted but did not return any content.
  • 400 Bad Request means the API could not interpret our request.
  • 401 Unauthorized means the apikey was not recognized or not validated.
  • 408 Time Out means it took too long to process due to a slow or broken internet connection.

Codes 204, 400, and 401 mean you should look at the query request and the help documentation for the API to ensure it is correct. Check for typos, etc.

Responses with status codes 4xx and 5xx are HTTP errors and {httr2} automatically turns these into R errors.
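If you would rather inspect a failing status yourself instead of letting {httr2} raise an R error, one option is req_error(). A minimal sketch:

```{r}
#| eval: false
# Tell {httr2} to never treat a status as an error, then check it manually.
resp <- req |>
  req_error(is_error = \(resp) FALSE) |>
  req_perform()
resp_status(resp)
```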

6.4.3.2 Manipulate the Body Content

There are several resp_body_* functions to convert the content into useful form.

  • resp_body_raw() shows the raw content in hexadecimal, so it is not so useful for us.
resp_body_raw(my_resp)[1:99]
 [1] 7b 22 54 69 74 6c 65 22 3a 22 54 68 65 20 4c 69 67 68 74 68 6f 75 73 65 22
[26] 2c 22 59 65 61 72 22 3a 22 32 30 31 39 22 2c 22 52 61 74 65 64 22 3a 22 52
[51] 22 2c 22 52 65 6c 65 61 73 65 64 22 3a 22 30 31 20 4e 6f 76 20 32 30 31 39
[76] 22 2c 22 52 75 6e 74 69 6d 65 22 3a 22 31 30 39 20 6d 69 6e 22 2c 22 47
  • resp_body_json() converts the RAW content into JSON content with all of the Key:Value pairs.
resp_body_json(my_resp) |> dplyr::glimpse()
List of 25
 $ Title     : chr "The Lighthouse"
 $ Year      : chr "2019"
 $ Rated     : chr "R"
 $ Released  : chr "01 Nov 2019"
 $ Runtime   : chr "109 min"
 $ Genre     : chr "Drama, Fantasy, Horror"
 $ Director  : chr "Robert Eggers"
 $ Writer    : chr "Robert Eggers, Max Eggers"
 $ Actors    : chr "Robert Pattinson, Willem Dafoe, Valeriia Karaman"
 $ Plot      : chr "Two lighthouse keepers try to maintain their sanity while living on a remote and mysterious New England island in the 1890s."
 $ Language  : chr "English"
 $ Country   : chr "Canada, United States"
 $ Awards    : chr "Nominated for 1 Oscar. 34 wins & 143 nominations total"
 $ Poster    : chr "https://m.media-amazon.com/images/M/MV5BMTI4MjFhMjAtNmQxYi00N2IxLWJjMGEtYWY1YmU3OTQ0Zjk3XkEyXkFqcGc@._V1_SX300.jpg"
 $ Ratings   :List of 3
  ..$ :List of 2
  .. ..$ Source: chr "Internet Movie Database"
  .. ..$ Value : chr "7.4/10"
  ..$ :List of 2
  .. ..$ Source: chr "Rotten Tomatoes"
  .. ..$ Value : chr "90%"
  ..$ :List of 2
  .. ..$ Source: chr "Metacritic"
  .. ..$ Value : chr "83/100"
 $ Metascore : chr "83"
 $ imdbRating: chr "7.4"
 $ imdbVotes : chr "265,578"
 $ imdbID    : chr "tt7984734"
 $ Type      : chr "movie"
 $ DVD       : chr "N/A"
 $ BoxOffice : chr "$10,867,104"
 $ Production: chr "N/A"
 $ Website   : chr "N/A"
 $ Response  : chr "True"
  • resp_body_string() converts the RAW content into a character vector of the JSON Key:Value pairs.
resp_body_string(my_resp) |> str()
 chr "{\"Title\":\"The Lighthouse\",\"Year\":\"2019\",\"Rated\":\"R\",\"Released\":\"01 Nov 2019\",\"Runtime\":\"109 "| __truncated__
  • Here you can see the JSON key:value pairs, e.g., "Title\":\"The Lighthouse\".

  • If we had used the parameter r = "xml", we could use resp_body_xml() to convert the RAW content into an XML object.

  • You can pipe the response from resp_body_json(my_resp) to as.data.frame() to convert it to a data.frame where the $Ratings list column has been flattened.

resp_body_json(my_resp) |>
  as.data.frame() |>
  dplyr::glimpse()
Rows: 1
Columns: 30
$ Title            <chr> "The Lighthouse"
$ Year             <chr> "2019"
$ Rated            <chr> "R"
$ Released         <chr> "01 Nov 2019"
$ Runtime          <chr> "109 min"
$ Genre            <chr> "Drama, Fantasy, Horror"
$ Director         <chr> "Robert Eggers"
$ Writer           <chr> "Robert Eggers, Max Eggers"
$ Actors           <chr> "Robert Pattinson, Willem Dafoe, Valeriia Karaman"
$ Plot             <chr> "Two lighthouse keepers try to maintain their sanity …
$ Language         <chr> "English"
$ Country          <chr> "Canada, United States"
$ Awards           <chr> "Nominated for 1 Oscar. 34 wins & 143 nominations tot…
$ Poster           <chr> "https://m.media-amazon.com/images/M/MV5BMTI4MjFhMjAt…
$ Ratings.Source   <chr> "Internet Movie Database"
$ Ratings.Value    <chr> "7.4/10"
$ Ratings.Source.1 <chr> "Rotten Tomatoes"
$ Ratings.Value.1  <chr> "90%"
$ Ratings.Source.2 <chr> "Metacritic"
$ Ratings.Value.2  <chr> "83/100"
$ Metascore        <chr> "83"
$ imdbRating       <chr> "7.4"
$ imdbVotes        <chr> "265,578"
$ imdbID           <chr> "tt7984734"
$ Type             <chr> "movie"
$ DVD              <chr> "N/A"
$ BoxOffice        <chr> "$10,867,104"
$ Production       <chr> "N/A"
$ Website          <chr> "N/A"
$ Response         <chr> "True"

We will see more methods for working with JSON data later.

6.5 R Packages to “Wrap” APIs

Many of the most popular databases/websites already have an R package to “wrap the API”.

The developers have built functions that use familiar R syntax to interact with the API on your behalf and return the data in a form amenable to working in R.

Tip

If an R package exists for an API, you should try it first.

  • The developers have tried to make it easier and faster for you to access the data through the API.
  • If the R package is missing some functionality you need, you can always use its source code as a guide for developing your own code to interact with the API based on the documentation for the API.

Examples of R packages that wrap APIs:

6.6 The {tidycensus} Package

The {tidycensus} package allows users to interact with the US Census Bureau’s decennial Census and every-five-year American Community Survey (ACS) APIs and return tidyverse-ready data frames, optionally with simple feature geometry included. Walker and Herman (2023b)

There are other Census Bureau API “endpoints” - see Available APIs.

  • For our purposes, an “endpoint” is essentially a specific URL a system owner has established to enable a user to access a set of data by using an API.
  • {tidycensus} has expanded to additional endpoints such as the Migration Flows; a sketch of that endpoint follows this list.
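As a sketch of one such endpoint, Migration Flows data can be requested with get_flows(); the arguments here are illustrative (see ?get_flows for the supported geographies and years):

```{r}
#| eval: false
# Migration flows for counties in DC from the 2015-2019 ACS (illustrative).
dc_flows <- get_flows(
  geography = "county",
  state = "DC",
  year = 2019,
  key = key_get("API_KEY_CENSUS")
)
```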

A decennial census may have multiple “Summary files”. See the Census Guidance.

There are other R packages for Census Bureau data.

6.6.1 Setup

  • Install the {tidycensus} package using the console.
  • Load {tidyverse} and {tidycensus}.
library(tidyverse)
library(tidycensus)
Note

The changelog for tidycensus 1.3 (2022-12-09) states

Given that the Census API allows for 500 queries per day without an API key, the API key requirement in the package has been removed to support reproducibility. Users without a key are now warned of potential performance limitations.

In case you need a key, these notes will proceed as if a key is needed.

Request a key from the US Census Bureau: https://api.census.gov/data/key_signup.html.

  • Use your name and email.
  • You should get an email in a few seconds.
  • Click on the link in the email to activate your key.
  • Copy the key from the email - it will be long….
  • Set up your key using the console pane: key_set("API_KEY_CENSUS").
  • When queried for a password, use your new key.
    • Do not use tidycensus::census_api_key() with install = TRUE, as that would put your key in plain text in your .Renviron file.

You could put your key value into a variable but it is better to just use the function key_get() each time e.g., key_get("API_KEY_CENSUS").

6.6.2 Accessing US Census Bureau APIs

The {tidycensus} package has two Main Functions to interact with the Census Bureau’s APIs:

The function get_decennial() grants access to the 1990, 2000, and 2010 decennial US Census APIs as well as the 2020 decennial Census PL-94171 redistricting data.

  • Unfortunately, as of 21 Sept 2020, this call returns: Error: The 1990 decennial Census endpoint has been removed by the Census Bureau. We will support 1990 data again when the endpoint is updated; in the meantime, we recommend using NHGIS (https://nhgis.org) and the {ipumsr} R package.
## This should error out as they have not fixed the end point as of 05/23/2023
m90 <- get_decennial(
  geography = "state", variables = "H043A001", year = 1990,
  key = key_get("API_KEY_CENSUS")
)

The function get_acs() grants access to the 5-year American Community Survey (ACS) APIs.

  • ACS data differ from decennial census data in that ACS data are based on an annual sample of approximately 3 million households, rather than a complete enumeration of the US population.

  • Thus, ACS data points are estimates with an associated margin of error.

  • {tidycensus} will always return the estimate and margin of error together for any requested variables when using get_acs().

  • get_acs() now defaults to year = 2020 to retrieve data from the 2016-2020 five-year ACS. One-year data for the 2020 ACS are not yet available in {tidycensus}.

  • Data may be available at multiple levels of resolution or “geographies”, e.g., state, county, tract, etc. See the full list at the Standard Hierarchy of Census Geographies.

6.6.3 First Example - Median Gross Rent

  • Let’s look at median gross rent by state in 2000 from the Summary File 3.
  • The geography is “state” and the variable is “H063001”.
mgr00 <- get_decennial(
  geography = "state", variables = "H063001", year = 2000,
  sumfile = "sf3",
  key = key_get("API_KEY_CENSUS")
)
head(mgr00)
# A tibble: 6 × 4
  GEOID NAME       variable value
  <chr> <chr>      <chr>    <dbl>
1 01    Alabama    H063001    447
2 02    Alaska     H063001    720
3 04    Arizona    H063001    619
4 05    Arkansas   H063001    453
5 06    California H063001    747
6 08    Colorado   H063001    671
nrow(mgr00)
[1] 52
mgr00 |>
  ggplot(aes(x = value, y = fct_reorder(NAME, value))) +
  geom_point() +
  ylab("") +
  xlab("Median Gross Rent")

6.6.4 Searching for Census Bureau Variable IDs

There are thousands of variable IDs across the different Census Bureau files, and the IDs are not consistent (but they are configuration managed).

Getting variables from the decennial census or ACS requires knowing the variable ID for the specific product based on the year(s).

To rapidly search for variables, use the load_variables() function.

  • Two required arguments:
    • the year of the census or the end year of the ACS sample, and
    • the dataset: “sf1”, “sf3”, “acs5”, or “acs1”.
  • For ideal functionality, assign the output to a variable, setting cache = TRUE to store the result on your computer for future access.
  • Then you can use the View function in RStudio to interactively browse for variables or use filter(str_detect()) if you know part of the name or concept.
v17a <- load_variables(2017, "acs5", cache = TRUE)
v19a <- load_variables(2019, "acs5", cache = TRUE)
v00d1 <- load_variables(2000, "sf1", cache = TRUE)
v00d3 <- load_variables(2000, "sf3", cache = TRUE)
v10d1 <- load_variables(2010, "sf1", cache = TRUE)
v20_pl <- load_variables(2020, "pl", cache = TRUE)
v20a <- load_variables(2020, "acs5", cache = TRUE)
## v10d3 <- load_variables(2010, "sf3", cache = TRUE)

6.6.5 Median Gross Rent Continued

Now let’s look at changes over time.

names(v10d1)
[1] "name"    "label"   "concept"
v10d1 |>
  filter(str_detect(concept, "MEDIAN GROSS RENT")) |>
  head()
# A tibble: 0 × 3
# ℹ 3 variables: name <chr>, label <chr>, concept <chr>
  • Surprise!! There is no median gross rent available in the 2010 census.

Let’s jump to the latest data for the ACS 2020 to see if we can find an equivalent variable.

names(v20a)
[1] "name"      "label"     "concept"   "geography"
v20a |>
  filter(str_detect(concept, "MEDIAN GROSS RENT")) |>
  head()
# A tibble: 6 × 4
  name       label                                             concept geography
  <chr>      <chr>                                             <chr>   <chr>    
1 B25031_001 Estimate!!Median gross rent --!!Total:            MEDIAN… tract    
2 B25031_002 Estimate!!Median gross rent --!!Total:!!No bedro… MEDIAN… tract    
3 B25031_003 Estimate!!Median gross rent --!!Total:!!1 bedroom MEDIAN… tract    
4 B25031_004 Estimate!!Median gross rent --!!Total:!!2 bedroo… MEDIAN… tract    
5 B25031_005 Estimate!!Median gross rent --!!Total:!!3 bedroo… MEDIAN… tract    
6 B25031_006 Estimate!!Median gross rent --!!Total:!!4 bedroo… MEDIAN… tract    
mgr20 <- get_acs(
  geography = "state", variables = "B25031_001", year = 2020,
  key = key_get("API_KEY_CENSUS")
)
head(mgr20)
# A tibble: 6 × 5
  GEOID NAME       variable   estimate   moe
  <chr> <chr>      <chr>         <dbl> <dbl>
1 01    Alabama    B25031_001      811     5
2 02    Alaska     B25031_001     1240    16
3 04    Arizona    B25031_001     1097     5
4 05    Arkansas   B25031_001      760     5
5 06    California B25031_001     1586     4
6 08    Colorado   B25031_001     1335     6
nrow(mgr20)
[1] 52

Now let’s combine the data frames and plot in sorted order by the 2000 values.

mgr00 |>
  mutate(estimate = value) |>
  bind_rows(mgr20) |>
  group_by(NAME) |>
  mutate(
    delta = 100 * (max(estimate) - min(estimate)) / min(estimate),
    min = min(estimate)
  ) |>
  ungroup() ->
mgr_00_20
mgr_00_20 |>
  ggplot(aes(x = fct_reorder(NAME, min), y = estimate, color = variable)) +
  geom_point() +
  coord_flip() +
  scale_color_discrete(name = "Year", labels = c("2020", "2000")) +
  xlab("") +
  theme(legend.position = "left") +
  ylab("Median Gross Rent") +
  ggtitle("Median Gross Rent") ->
p1
p1
Figure 6.1: Median Gross Rent in 2000 and 2020 by State

Now let’s combine the data frames and plot the relative percent increase in rent.

mgr_00_20 |>
  ggplot(aes(x = fct_reorder(NAME, min), y = delta)) +
  geom_point() +
  coord_flip() +
  xlab("") +
  theme(
    ## axis.text.x = element_blank(),
    axis.text.y = element_blank(),
    axis.ticks = element_blank()
  ) +
  ylab("% Increase Median Gross Rent") +
  theme(legend.position = "bottom") +
  ggtitle("% Change in MGR 2000 to 2020") ->
p2

p1
p2

Let’s plot the two years as a scatter plot and add a linear smoother.

mgr_00_20 |>
  mutate(year = ifelse(str_detect(variable, "^H"), 2000, 2020)) |>
  select(-c(variable, value, moe)) |>
  pivot_wider(names_from = year, names_prefix = "y", values_from = estimate) ->
mgr_wide

mgr_wide |>
  ggplot(aes(x = y2000, y = y2020)) +
  geom_point() +
  geom_smooth(se = FALSE, method = lm, color = "blue", lty = 2)

Run a linear model and check the output.

lmout <- lm(y2020 ~ y2000, data = mgr_wide)
summary(lmout)

Call:
lm(formula = y2020 ~ y2000, data = mgr_wide)

Residuals:
    Min      1Q  Median      3Q     Max 
-171.91  -52.83   -9.86   33.90  452.21 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -188.9385    72.5423  -2.605   0.0121 *  
y2000          2.1743     0.1276  17.035   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 96.19 on 50 degrees of freedom
Multiple R-squared:  0.853, Adjusted R-squared:  0.8501 
F-statistic: 290.2 on 1 and 50 DF,  p-value: < 2.2e-16
broom::tidy(lmout, conf.int = TRUE)
# A tibble: 2 × 7
  term        estimate std.error statistic  p.value conf.low conf.high
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Intercept)  -189.      72.5       -2.60 1.21e- 2  -335.      -43.2 
2 y2000           2.17     0.128     17.0  1.84e-22     1.92      2.43

Get the decennial census data for median age by state in 2010; the variable ID is P013001.

Plot the data and reorder the states, on the y axis, by value.

Show code
age10 <- get_decennial(
  geography = "state",
  variables = "P013001",
  year = 2010,
  key = key_get("API_KEY_CENSUS")
)

head(age10)
nrow(age10)
Show code
age10 |>
  ggplot(aes(x = value, y = fct_reorder(NAME, value))) +
  geom_point() +
  ylab(NULL)

6.6.6 Example: Median Age by Race

Find the variables for median age for both sexes for the 2000 census in the sf1 file from the v00d1 downloaded earlier.

v00d1 |>
  filter(
    str_detect(label, "Both sexes"),
    str_detect(concept, "MEDIAN AGE BY SEX")
  ) |>
  select(name, concept)
# A tibble: 10 × 2
   name     concept                                                             
   <chr>    <chr>                                                               
 1 P013001  MEDIAN AGE BY SEX [3]                                               
 2 P013A001 MEDIAN AGE BY SEX (WHITE ALONE) [3]                                 
 3 P013B001 MEDIAN AGE BY SEX (BLACK OR AFRICAN AMERICAN ALONE) [3]             
 4 P013C001 MEDIAN AGE BY SEX (AMERICAN INDIAN AND ALASKA NATIVE ALONE) [3]     
 5 P013D001 MEDIAN AGE BY SEX (ASIAN ALONE) [3]                                 
 6 P013E001 MEDIAN AGE BY SEX (NATIVE HAWAIIAN AND OTHER PACIFIC ISLANDER ALONE…
 7 P013F001 MEDIAN AGE BY SEX (SOME OTHER RACE ALONE) [3]                       
 8 P013G001 MEDIAN AGE BY SEX (TWO OR MORE RACES) [3]                           
 9 P013H001 MEDIAN AGE BY SEX (HISPANIC OR LATINO) [3]                          
10 P013I001 MEDIAN AGE BY SEX (WHITE ALONE, NOT HISPANIC OR LATINO) [3]         

Get the 2000 data for median age for both sexes for White, Black, American Indian/Alaska, and Asian, doing two variables at a time and combining into a single tibble.

get_decennial(
  geography = "state",
  variables = c("P013A001", "P013B001"),
  year = 2000,
  key = key_get("API_KEY_CENSUS")
) ->
m2000

bind_rows(
  m2000,
  get_decennial(
    geography = "state",
    variables = c("P013C001", "P013D001"),
    year = 2000,
    key = key_get("API_KEY_CENSUS")
  )
) ->
m2000

Recode variable as a factor with better levels.

fct_recode(parse_factor(m2000$variable),
  White = "P013A001",
  Black = "P013B001",
  American_Indian_Alaskan = "P013C001",
  Asian = "P013D001"
) ->
m2000$variable

Plot median age by Race. How do you interpret the results?

m2000 |>
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot(notch = TRUE) +
  ggtitle("Median Age by Race (as a Social Identity)")

Run an Analysis of Variance to test if all of the Races have the same average median age.

  • We have one quantitative response variable (value) and one categorical (factor) variable (variable).
  • How do you interpret the results?
aout <- aov(value ~ variable, data = m2000)
summary(aout)
             Df Sum Sq Mean Sq F value Pr(>F)    
variable      3   2411   803.7   93.64 <2e-16 ***
Residuals   204   1751     8.6                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::tidy(aout)
# A tibble: 2 × 6
  term         df sumsq meansq statistic   p.value
  <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
1 variable      3 2411. 804.        93.6  3.85e-38
2 Residuals   204 1751.   8.58      NA   NA       

Without creating any intermediate variables, let’s find the top 10 entities in terms of the difference between the highest and lowest median ages by Race and plot.

m2000 |>
  group_by(NAME) |>
  summarize(delta = max(value) - min(value)) |>
  slice_max(delta, n = 10) |>
  left_join(m2000) |>
  ggplot(aes(x = fct_reorder(NAME, delta), y = value)) +
  geom_point(aes(color = variable)) +
  coord_flip() +
  xlab("Entity") +
  ylab("Median Age") +
  scale_color_discrete(name = "Race") +
  theme(legend.position = "top") +
  ggtitle(paste0("Top 10 Entities in terms of the Delta between Max and Min\nacross all Races"))

6.6.7 2020 Decennial Redistricting Data

A limited set of variables as required by the Public Law.

v20_pl |>
  group_by(concept) |>
  summarize(n = n())
# A tibble: 6 × 2
  concept                                                                      n
  <chr>                                                                    <int>
1 GROUP QUARTERS POPULATION BY MAJOR GROUP QUARTERS TYPE                      10
2 HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE                      73
3 HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE FOR THE POPULATI…    73
4 OCCUPANCY STATUS                                                             3
5 RACE                                                                        71
6 RACE FOR THE POPULATION 18 YEARS AND OVER                                   71

There are only a limited number of concepts.

Let’s look at the concept HISPANIC OR LATINO, AND NOT HISPANIC OR LATINO BY RACE.

  • We are not interested in the breakout by Race so we will work with the top level variables.
v20_pl |> 
  filter(str_detect(concept, "LATINO BY RACE$"),
         str_detect(label, "Population", negate = TRUE)) |> 
  select(name, label)
# A tibble: 3 × 2
  name    label                               
  <chr>   <chr>                               
1 P2_001N " !!Total:"                         
2 P2_002N " !!Total:!!Hispanic or Latino"     
3 P2_003N " !!Total:!!Not Hispanic or Latino:"
  • Note that P2_001N is the total of P2_002N and P2_003N.

Let’s look at the distribution of the percentage of Hispanic or Latino in each county in each state, ordered by total state populations.

  • How do you interpret the results?
m2020 <- get_decennial(
  geography = "county",
  variables = c("P2_002N", "P2_003N"),
  year = 2020,
  summary_var = "P2_001N",
  key = key_get("API_KEY_CENSUS")
)
m2020 |>
  separate_wider_delim(NAME, names = c("County", "State"), delim = ", ") |>
  mutate(pct = value / summary_value) ->
m2020
m2020 |>
  group_by(State, variable) |>
  mutate(state_pop = sum(summary_value, na.rm = TRUE)) |>
  ungroup() |>
  filter(variable == "P2_002N") |>
  ggplot(aes(x = fct_reorder(State, state_pop), y = pct)) +
  geom_boxplot() +
  coord_flip()

Let’s look at the percentage of Hispanic or Latino in each county by census region.

census_regions <- read_csv("https://raw.githubusercontent.com/AU-datascience/data/main/413-613/census_region_states.csv")

m2020 |>
  inner_join(census_regions) |>
  group_by(Division, variable) |>
  mutate(div_pop = sum(summary_value, na.rm = TRUE)) |>
  ungroup() |>
  filter(variable == "P2_002N") |>
  ggplot(aes(x = Division, y = pct)) +
  geom_boxplot() +
  coord_flip()

6.6.8 Using get_acs()

Let’s fetch median household income data from the 2016-2020 ACS for counties in Virginia.

va <- get_acs(
  geography = "county",
  variables = c(medincome = "B19013_001"),
  state = "VA",
  year = 2020,
  key = key_get("API_KEY_CENSUS")
)
head(va)
# A tibble: 6 × 5
  GEOID NAME                        variable  estimate   moe
  <chr> <chr>                       <chr>        <dbl> <dbl>
1 51001 Accomack County, Virginia   medincome    46178  2575
2 51003 Albemarle County, Virginia  medincome    84643  3217
3 51005 Alleghany County, Virginia  medincome    48513  4275
4 51007 Amelia County, Virginia     medincome    63918  6276
5 51009 Amherst County, Virginia    medincome    57368  4546
6 51011 Appomattox County, Virginia medincome    55457  5246
tail(va)
# A tibble: 6 × 5
  GEOID NAME                          variable  estimate   moe
  <chr> <chr>                         <chr>        <dbl> <dbl>
1 51790 Staunton city, Virginia       medincome    52292  3116
2 51800 Suffolk city, Virginia        medincome    79899  3566
3 51810 Virginia Beach city, Virginia medincome    78136  1419
4 51820 Waynesboro city, Virginia     medincome    43480  3487
5 51830 Williamsburg city, Virginia   medincome    59288  5801
6 51840 Winchester city, Virginia     medincome    61102  4170

The output is similar to a call to get_decennial(), but instead of a value column, get_acs() returns estimate and moe (margin of error) columns.

  • The moe represents (by default) a 90 percent confidence level around the estimate; this can be changed to 95 or 99 percent with the moe_level parameter in get_acs() if desired; a sketch follows.
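A minimal sketch of the same request with a 99 percent margin of error:

```{r}
#| eval: false
# Same request as in the get_acs() example below, but with moe_level = 99.
va99 <- get_acs(
  geography = "county",
  variables = c(medincome = "B19013_001"),
  state = "VA",
  year = 2020,
  moe_level = 99,
  key = key_get("API_KEY_CENSUS")
)
```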

As we have the margin of error, we can visualize the uncertainty around the estimate.

Given the large number of counties and independent cities in VA, let’s randomly sample 40 counties/cities and plot the moe for their median house income.

  • How would you interpret the graph?
set.seed(1)
va |>
  mutate(NAME = str_remove_all(NAME, "(County)|(Virginia)|(,)")) |>
  slice_sample(n = 40) |>
  ggplot(aes(x = estimate, y = fct_reorder(NAME, estimate))) +
  geom_point(color = "red", size = 2) +
  geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +
  labs(
    title = "A Sample of Median Household Income\nfor Counties/Cities in Virginia",
    subtitle = "2016-2020 American Community Survey",
    y = "",
    x = "ACS estimate \n(bars are the margin of error at the 90% level)"
  )

6.7 Working with JSON Structured Data

JSON (Javascript Object Notation) is an open-standard text format that is largely replacing XML.

  • JSON is completely language independent but uses conventions familiar to programmers of many languages, so it is an ideal data-interchange format.

JSON uses arrays and objects of attribute-value pairs with multiple data types (a small example follows the list):

  • numbers (double)
  • strings (double-quotes)
  • boolean (true or false)
  • arrays (ordered list, square bracket notation with elements comma-separated)
  • null
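As a small made-up example, a single JSON object combining these types might look like:

```{r}
#| eval: false
# A made-up JSON record with a number, a string, a boolean, an array, and null.
json <- '{"id": 42, "name": "Ada", "active": true,
          "scores": [98.5, 87, 91], "middle_name": null}'
```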

6.7.1 JSON R packages

There are multiple packages for working with JSON when a download or API returns a JSON file (or you have a URL direct to a JSON file).

  • They typically can convert between JSON and a nested R list or data frame.

The {rjson} package was an early package that has been updated to be faster when working with large R objects. It outputs unnamed lists, so it can be unwieldy.

The {jsonlite} package is a JSON parser/generator optimized for the web.

  • It has bidirectional mappings between JSON data and the most important R data types.
  • One can convert between R objects and JSON without loss of type or information, and without the need for any manual data manipulation.
  • It also includes functions to stream, validate, and prettify JSON data.
  • It’s ideal for interacting with web APIs, or to build pipelines where data structures seamlessly flow in and out of R using JSON.

The {tidyjson} package takes an alternate approach to structuring JSON data into tidy data.frames. Similar to {tidyr}, {tidyjson} builds a grammar for manipulating JSON into a tidy table structure.

We’ll work with {jsonlite} package.

  • Use the console to install and then load/attach {jsonlite} using library().
library(jsonlite)

6.7.2 Simplification in the {jsonlite} Package

Simplification is the process by which JSON arrays are automatically converted from a list into a more specific R class.

  • The fromJSON() function has 3 arguments which control the simplification process: simplifyVector, simplifyDataFrame and simplifyMatrix.
Type                  Example JSON                                              Simplifies to   Argument in fromJSON()
Array of primitives   ["Amsterdam", "Rotterdam", "Utrecht", "Den Haag"]         Atomic vector   simplifyVector
Array of objects      [{"name":"Erik", "age":43}, {"name":"Anna", "age":32}]    Data frame      simplifyDataFrame
Array of arrays       [[1, 2, 3], [4, 5, 6]]                                    Matrix          simplifyMatrix

6.7.2.1 Examples for fromJSON() and toJSON()

Simplification is TRUE by default, and {jsonlite} uses the structure of the input array to determine what kind of output to return.

  • For an array of like primitives, you get an atomic vector.
json <- '["Mario", "Peach", null, "Bowser"]'
fromJSON(json)
[1] "Mario"  "Peach"  NA       "Bowser"
str(fromJSON(json))
 chr [1:4] "Mario" "Peach" NA "Bowser"
  • Turn off simplify and you get a list from any JSON array.
fromJSON(json, simplifyVector = FALSE)
[[1]]
[1] "Mario"

[[2]]
[1] "Peach"

[[3]]
NULL

[[4]]
[1] "Bowser"

When you have a JSON array with key-value pair objects, you can simplify it into a data frame by default.

json <-
  '[
      {"Name" : "Mario", "Age" : 32, "Occupation" : "Plumber"},
      {"Name" : "Peach", "Age" : 21, "Occupation" : "Princess"},
      {},
      {"Name" : "Bowser", "Occupation" : "Koopa"}
    ]'
mydf <- fromJSON(json)
mydf
    Name Age Occupation
1  Mario  32    Plumber
2  Peach  21   Princess
3   <NA>  NA       <NA>
4 Bowser  NA      Koopa

You can convert back with toJSON().

mydf$Ranking <- c(3, 1, 2, 4)
toJSON(mydf, pretty = TRUE)
[
  {
    "Name": "Mario",
    "Age": 32,
    "Occupation": "Plumber",
    "Ranking": 3
  },
  {
    "Name": "Peach",
    "Age": 21,
    "Occupation": "Princess",
    "Ranking": 1
  },
  {
    "Ranking": 2
  },
  {
    "Name": "Bowser",
    "Occupation": "Koopa",
    "Ranking": 4
  }
] 

6.7.3 Exercise with the US Congress JSON data

Question: Which gender in Congress do you think has the higher average age? Just take a guess. Let’s get some data to see what it shows.

Get data on the current members of Congress from the @unitedstates project.

  • Create a variable url_congress <- "https://theunitedstates.io/congress-legislators/legislators-current.json"
  • Use this as an argument of fromJSON().
  • What is the structure of the output?
Show code
url_congress <- "https://theunitedstates.io/congress-legislators/legislators-current.json"
json_congress <- fromJSON(url_congress)
glimpse(json_congress)

# A nested data frame with 3 columns of packed data frames and four columns of lists.

Using view() will automatically flatten the structure for viewing in RStudio.

  • Notice the variable names for the packed columns are in the form data-frame-name$column-name, e.g., id$bioguide.
view(json_congress)

Rerun with the flatten = TRUE argument. How did the structure change?

Show code
json_congress <- fromJSON(url_congress, flatten = TRUE)
glimpse(json_congress)
# An unpacked data frame with 23 columns of data and the same four columns of lists.

Notice the variable names for the packed columns are in the form data-frame-name.column-name, e.g., id.bioguide, where the $ operator has been replaced by a period.

Use {lubridate} functions to convert birthday to a date class and add a column for age in years, convert gender to a factor, add a column for their current term party, and save back to the same data frame.

  • Hint: look at help for time_length().
  • Hint: Note that the $terms column is an unnamed list of data frames where each row represents a term in Congress, so each member can have a different number of rows. However, each data frame has $party as a column of type character. Consider using map_chr() with dplyr::last() to extract the last value of $party for each member.
  • Hint: do not try to solve all of these at once. Code one and test it. Then code the next one and test it. and so on.

Who are the three oldest and the three youngest members? Show just the official full name, birthday, gender, and current party.

What is the count by gender? Create a table of counts by gender and current party.

Convert data.

Show code
json_congress |>
  mutate(
    bio.birthday = ymd(bio.birthday),
    age = round(time_length(today() - bio.birthday, unit = "years"), 1),
    bio.gender = as_factor(bio.gender),
    current_party = map_chr(terms, \(x) last(x$party))
  ) ->
json_congress

Three oldest and youngest.

Show code
json_congress |>
  select(name.official_full, bio.birthday, age, bio.gender, current_party) |>
  arrange(bio.birthday) |>
  slice(c(1:3, (nrow(json_congress) - 2):nrow(json_congress)))

Count by gender and table of counts by gender and current party.

Show code
fct_count(json_congress$bio.gender)
table(json_congress$bio.gender, json_congress$current_party)

Plot the distribution of ages by gender. What do you notice?

Multiple Options

Show code
json_congress |>
  ggplot(aes(y = age, x = bio.gender)) ->
my_plot

my_plot + geom_violin()
my_plot + geom_boxplot()

json_congress |>
  ggplot(aes(x = age, color = bio.gender)) +
  geom_density()

my_plot + geom_jitter()

# Each plot shows similarities between the distribution of ages for
# Females and Males. Both distributions appear symmetric. The violin and box
# plots have similar medians and the density plots have similar shape and mode.
# The jitter plot shows a similar distribution of points on the y (age) axis.

Use a t-test to determine if there is a difference between the mean ages of the genders.

Show code
tout <- t.test(age ~ bio.gender, data = json_congress)
tout
# The high p-value of .84 indicates there is almost no evidence to reject the
# Null hypothesis that the mean age is the same for both genders.

6.8 Summary for Getting Data Using APIs

Organizations use APIs to provide access to select data in a structured way that allows them control over who is accessing and how much, while allowing the user to customize which data they want.

If there is an R package to wrap the API, that often saves you time and effort. If there is not one, you can work with the {httr2} package and the API documentation to access the data.

Given the complexity of data sets, the semi-structured JSON format is convenient for data exchange with APIs. Multiple R packages provide functions to help you manipulate the output from APIs to support your analysis.

6.9 Manipulating Deeply Nested Data

Data from many APIs, often called “wild-caught JSON”, can be in deeply nested data structures as organizations try to maximize the efficiency of their data structures for storage and transfer rather than analysis.

“Rectangling” is the “art and craft” of taking a “wild” deeply-nested list and “taming” it into a tidy data set of rows and columns for the data of interest.

The goal of the {tidyr} package Wickham, Vaughan, and Girlich (2023) is to help users convert data from a variety of sources into “tidy” data amenable for analysis and modeling.

Functions in the {tidyr} package that are particularly useful for rectangling include:

  • unnest_longer(): takes each element of a list-column and makes a new row.
  • unnest_wider(): takes each element of a list-column and makes a new column.
  • unnest_auto(): guesses whether you want unnest_longer() or unnest_wider().
  • hoist(): is similar to unnest_wider() but only plucks out selected components, and can reach down multiple levels.

Many data rectangling problems can be solved by combining these functions with a splash of {dplyr} (largely eliminating prior approaches that combined mutate() with multiple purrr::map()s).
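As a tiny, self-contained illustration with made-up data (the real API examples follow below):

```{r}
#| eval: false
# A toy list-column: one row per repo, each repo is a named list.
toy <- tibble::tibble(
  repo = list(
    list(name = "pkgA", stars = 10, owner = list(login = "ann")),
    list(name = "pkgB", stars = 25, owner = list(login = "bob"))
  )
)
toy |> tidyr::unnest_wider(repo) # name, stars, and owner become columns
```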

Let’s load several packages we need for this section.

library(tidyverse)
library(keyring)
library(jsonlite)

The {repurrrsive} package Bryan (2023) provides a number of deeply-nested lists (most originally captured from web APIs).

Install the {repurrrsive} package using the console and load it.

library(repurrrsive)

6.9.1 Initial Example: GitHub Users

We’ll start with gh_users, an unnamed list with data on six GitHub users.

  • Each GitHub user is a named list with 30 identical elements, each of which is an atomic vector of length one.
  • These elements are what we want to become columns in a data frame.

To begin, let’s use tibble() to convert the gh_users list into a data frame where the list is now a list-column in the data frame.

# View(gh_users)
users <- tibble(user = gh_users)
glimpse(users)
Rows: 6
Columns: 1
$ user <list> ["gaborcsardi", 660288, "https://avatars.githubusercontent.com/u…
  • This seems a bit counter-intuitive: Why is the first step in making a list simpler to make it more complicated?
  • Because converting into a data frame creates a big advantage for us: multiple vectors are now tracked together in a single data frame with a consistent structure.
  • We have multiple tools in the tidyverse for working with data frames.

Let’s use unnest_wider() to turn each element of the list column into a new column:

  • Look at help for unnest_wider() to see there are two mandatory arguments: data and col.

We have one list column in our data frame so unnest_wider() will flatten the row for each user into new columns in our data frame in one step.

users |> unnest_wider(user) # a 6 x 30 tibble
# A tibble: 6 × 30
  login     id avatar_url gravatar_id url   html_url followers_url following_url
  <chr>  <int> <chr>      <chr>       <chr> <chr>    <chr>         <chr>        
1 gabo… 6.60e5 https://a… ""          http… https:/… https://api.… https://api.…
2 jenn… 5.99e5 https://a… ""          http… https:/… https://api.… https://api.…
3 jtle… 1.57e6 https://a… ""          http… https:/… https://api.… https://api.…
4 juli… 1.25e7 https://a… ""          http… https:/… https://api.… https://api.…
5 leep… 3.51e6 https://a… ""          http… https:/… https://api.… https://api.…
6 masa… 8.36e6 https://a… ""          http… https:/… https://api.… https://api.…
# ℹ 22 more variables: gists_url <chr>, starred_url <chr>,
#   subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
#   events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
#   name <chr>, company <chr>, blog <chr>, location <chr>, email <chr>,
#   hireable <lgl>, bio <chr>, public_repos <int>, public_gists <int>,
#   followers <int>, following <int>, created_at <chr>, updated_at <chr>

However, we are only interested in a few elements of the user lists so we don’t need to flatten all of the elements into the data frame.

We can instead use tidyr::hoist() to pull out the elements of interest, using the same syntax as purrr::pluck(), and leave the rest in the list column.

Let’s hoist the variables login, html_url, and followers from the column user.

users |> hoist(
  .col = user,
  login = "login",
  url = "html_url",
  followers = "followers"
)
# A tibble: 6 × 4
  login       url                            followers user             
  <chr>       <chr>                              <int> <list>           
1 gaborcsardi https://github.com/gaborcsardi       303 <named list [27]>
2 jennybc     https://github.com/jennybc           780 <named list [27]>
3 jtleek      https://github.com/jtleek           3958 <named list [27]>
4 juliasilge  https://github.com/juliasilge        115 <named list [27]>
5 leeper      https://github.com/leeper            213 <named list [27]>
6 masalmon    https://github.com/masalmon           34 <named list [27]>
  • Notice it leaves the other 27 elements in the list user.
  • hoist() removes the named components from the user list-column, so you can think of it as moving components out of the “inner list” into the “top-level data frame”.

6.9.2 A More Complicated Example: gh_repos

The data set gh_repos is an unnamed list with 6 elements.

  • Each element is itself a list of 26 or 30 repos for a specific GitHub user.
  • Each repo is a list of at least 68 elements with data about each repo such as name, owner (another list), fork status, and creation date.

Let’s use map_*() functions to confirm the structure.

map_dbl(gh_repos, \(x) length(x)) # number of repos for each user
[1] 30 30 30 26 30 30
map(gh_repos, \(x) map_dbl(x, \(y) length(y))) # nested map calls.
[[1]]
 [1] 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68
[26] 68 68 68 68 68

[[2]]
 [1] 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68
[26] 68 68 68 68 68

[[3]]
 [1] 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68
[26] 68 68 68 68 68

[[4]]
 [1] 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68
[26] 68

[[5]]
 [1] 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68
[26] 68 68 68 68 68

[[6]]
 [1] 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68 68
[26] 68 68 68 68 68
Tip

Use care when calling functions such as str() or glimpse() on deeply-nested objects, as they will print every layer.

When using RStudio, Base R utils::View() will allow you to examine the data interactively to see the structure and contents.
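
If you just want a quick textual peek, str() can cap both the depth and the number of elements printed at each level:

str(gh_repos, max.level = 2, list.len = 3) # show at most 2 levels and 3 elements per level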

Let’s convert gh_repos into a tibble like before.

# View(gh_repos)
repos <- tibble(repo = gh_repos)
total_repos <- map_dbl(repos$repo, \(x) length(x)) # repos per user
total_repos
[1] 30 30 30 26 30 30

Each element of the repo column is a list of the repositories belonging to one user.

  • Consider them as “observations”, so they could become new rows.

Use unnest_longer() to create rows (rather than unnest_wider() which creates columns).

  • This is similar to {tidyr}’s pivot_longer() and pivot_wider().
repos_u <- repos |> unnest_longer(repo)
repos_u |> head()
# A tibble: 6 × 1
  repo             
  <list>           
1 <named list [68]>
2 <named list [68]>
3 <named list [68]>
4 <named list [68]>
5 <named list [68]>
6 <named list [68]>
repos_u |> nrow()
[1] 176
sum(total_repos)
[1] 176

We have removed one layer of the list, so we now have a list-column in which each of the 176 repos is a row.

  • Each of these rows/observations is a list with 68 potential columns.
  • Not all 68+ elements are of length 1: each repo has an owner element, which is itself a list with 17 elements.
  • Base R unlist() can convert such a list to a named vector.
(pluck(repos_u$repo, 1, "owner") |> unlist())[1:4]
                                               login 
                                       "gaborcsardi" 
                                                  id 
                                            "660288" 
                                          avatar_url 
"https://avatars.githubusercontent.com/u/660288?v=3" 
                                         gravatar_id 
                                                  "" 
# note the () around the function call so we can use the [] operator after
# piping with the native pipe.

We can now use unnest_wider() or hoist() as before on this list column to get all of the elements, or just specific ones, as columns in a data frame.

set.seed(1234)
repos_u |> hoist(repo,
  login = c("owner", "login"),
  name = "name",
  watchers = "watchers_count",
  homepage = "homepage"
) |> slice_sample(n = 10)
# A tibble: 10 × 5
   login       name                 watchers homepage               repo        
   <chr>       <chr>                   <int> <chr>                  <list>      
 1 gaborcsardi pingr                       2 ""                     <named list>
 2 jtleek      genstats_site               7  <NA>                  <named list>
 3 masalmon    convertagd                  0 ""                     <named list>
 4 juliasilge  juliasilge.github.io        0 "http://juliasilge.co… <named list>
 5 juliasilge  r-travis                    0  <NA>                  <named list>
 6 leeper      drat                        0  <NA>                  <named list>
 7 leeper      designcourse                1  <NA>                  <named list>
 8 masalmon    openaq_figures              1 ""                     <named list>
 9 leeper      GREA                        0 ""                     <named list>
10 leeper      dataverse-1                 0 "http://dataverse.org" <named list>
  • Note the use of c("owner", "login"): this allowed us to reach two levels deep into each list.
  • An alternative approach would be to pull out just owner and then put each of the 17 elements of it in a column:
repos_u |>
  hoist(repo, owner = "owner") |>
  unnest_wider(owner)
# A tibble: 176 × 18
   login           id avatar_url        gravatar_id url   html_url followers_url
   <chr>        <int> <chr>             <chr>       <chr> <chr>    <chr>        
 1 gaborcsardi 660288 https://avatars.… ""          http… https:/… https://api.…
 2 gaborcsardi 660288 https://avatars.… ""          http… https:/… https://api.…
 3 gaborcsardi 660288 https://avatars.… ""          http… https:/… https://api.…
 4 gaborcsardi 660288 https://avatars.… ""          http… https:/… https://api.…
 5 gaborcsardi 660288 https://avatars.… ""          http… https:/… https://api.…
 6 gaborcsardi 660288 https://avatars.… ""          http… https:/… https://api.…
 7 gaborcsardi 660288 https://avatars.… ""          http… https:/… https://api.…
 8 gaborcsardi 660288 https://avatars.… ""          http… https:/… https://api.…
 9 gaborcsardi 660288 https://avatars.… ""          http… https:/… https://api.…
10 gaborcsardi 660288 https://avatars.… ""          http… https:/… https://api.…
# ℹ 166 more rows
# ℹ 11 more variables: following_url <chr>, gists_url <chr>, starred_url <chr>,
#   subscriptions_url <chr>, organizations_url <chr>, repos_url <chr>,
#   events_url <chr>, received_events_url <chr>, type <chr>, site_admin <lgl>,
#   repo <list>

Instead of looking at the list and thinking about whether it needs to become rows or columns, you can try unnest_auto().

  • It uses a few heuristics to figure out whether unnest_longer() or unnest_wider() is appropriate, and tells you about its reasoning.
  • This is convenient for exploration but it is not advised to rely on it in long-term code.
  • unnest_auto() has the undesirable property that it will always succeed; that means if your data structure changes, unnest_auto() will continue to work, but might give very different output that causes cryptic failures from downstream functions.
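
For example, applied to the users tibble from earlier, unnest_auto() reports which function it chose (a quick exploratory check, not something to keep in long-term code):

users |> unnest_auto(user) # messages which unnesting function it chose (unnest_wider() here)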

6.9.3 An Even More Complicated Example: Game of Thrones Characters

The got_chars data set has a similar structure to gh_users; it’s a list of named lists, where each element of the inner list describes some attribute of a GoT character.

We start in the same way, first by creating a data frame and then by unnesting each component into a column.

tibble(char = got_chars) |> 
 unnest_wider(char) ->
chars2
glimpse(chars2)
Rows: 30
Columns: 18
$ url         <chr> "https://www.anapioficeandfire.com/api/characters/1022", "…
$ id          <int> 1022, 1052, 1074, 1109, 1166, 1267, 1295, 130, 1303, 1319,…
$ name        <chr> "Theon Greyjoy", "Tyrion Lannister", "Victarion Greyjoy", …
$ gender      <chr> "Male", "Male", "Male", "Male", "Male", "Male", "Male", "F…
$ culture     <chr> "Ironborn", "", "Ironborn", "", "Norvoshi", "", "", "Dorni…
$ born        <chr> "In 278 AC or 279 AC, at Pyke", "In 273 AC, at Casterly Ro…
$ died        <chr> "", "", "", "In 297 AC, at Haunted Forest", "", "In 299 AC…
$ alive       <lgl> TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE, T…
$ titles      <list> <"Prince of Winterfell", "Lord of the Iron Islands (by la…
$ aliases     <list> <"Prince of Fools", "Theon Turncloak", "Reek", "Theon Kin…
$ father      <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
$ mother      <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", ""…
$ spouse      <chr> "", "https://www.anapioficeandfire.com/api/characters/2044…
$ allegiances <list> "House Greyjoy of Pyke", "House Lannister of Casterly Roc…
$ books       <list> <"A Game of Thrones", "A Storm of Swords", "A Feast for C…
$ povBooks    <list> <"A Clash of Kings", "A Dance with Dragons">, <"A Game of…
$ tvSeries    <list> <"Season 1", "Season 2", "Season 3", "Season 4", "Season …
$ playedBy    <list> "Alfie Allen", "Peter Dinklage", "", "Bronson Webb", "DeO…
  • This created several columns that were also list columns.
  • The list elements are no longer all of length 1; several have multiple elements themselves.

What you do next will depend on the questions being asked in the analysis.

Let’s say we want to analyze which books and TV series each character appears in.

Let’s try the following strategy to create a data frame to help answer the question.

  • First, select name, books, povBooks, and tvSeries. This gives us a data frame with a row for each of the 30 names and three list columns: books, povBooks, and tvSeries.
    • That is a good start, but we want rows that hold either books or TV series and a column that indicates the type of media for each row.
    • In essence, we want to convert variable (column) names into the values of a new variable.
  • That suggests using pivot_longer() to convert the three list columns into one list column holding the element values, plus a new column for the type of media (books, povBooks, or tvSeries) based on the source column.
    • Now we have a data frame with 90 rows, with columns for name and media type, and a list column value where each row is the list of books, povBooks, or tvSeries for that name and media type.
    • We want to transform the list column into a row for each element in the list.
  • Now we can use unnest_longer() to convert each row into as many rows as exist in the list for that row.
chars2 |>
  select(name, books, povBooks, tvSeries) |> ## view()
  pivot_longer(c(books, povBooks, tvSeries), names_to = "media", values_to = "value") |>  ## view()
  unnest_longer(value) ->
char_media
char_media
# A tibble: 240 × 3
   name          media    value               
   <chr>         <chr>    <chr>               
 1 Theon Greyjoy books    A Game of Thrones   
 2 Theon Greyjoy books    A Storm of Swords   
 3 Theon Greyjoy books    A Feast for Crows   
 4 Theon Greyjoy povBooks A Clash of Kings    
 5 Theon Greyjoy povBooks A Dance with Dragons
 6 Theon Greyjoy tvSeries Season 1            
 7 Theon Greyjoy tvSeries Season 2            
 8 Theon Greyjoy tvSeries Season 3            
 9 Theon Greyjoy tvSeries Season 4            
10 Theon Greyjoy tvSeries Season 5            
# ℹ 230 more rows

We created a data frame with one row for each book, povBook, or tvSeries for each character.

Now we can plot the total amount of media for each character, by type, sorted from most to least.

char_media |> 
  mutate(media = fct_rev(media)) |> 
  group_by(name, media) |> 
  summarize(n = n(), .groups = "drop_last") |> 
ggplot(aes(x = fct_reorder(name, .x = n, .fun = sum), y = n, fill = media)) +
  geom_col() +
  xlab("Characters") +
  ylab("Number of media") +
  labs(title = "Media Presence of Game of Thrones Characters",
       caption = "https://anapioficeandfire.com/") +
  coord_flip() ->
  my_plot
my_plot

When I first ran this example from {repurrrsive}, I used books and tvSeries but not povBooks. When I plotted the data, I noticed Arya Stark had NA for books, which, for a main character, seemed strange.

Use the open API at https://anapioficeandfire.com/ to get the data for Arya Stark to see if that is what the source data shows.

The following is a quick check against the original API source, which confirms books is NA but povBooks has five entries.

Show code
library(httr2)
request("https://www.anapioficeandfire.com/api/characters") |> 
  req_url_query(name = "Arya Stark") |>
  req_perform() ->
  resp
resp |> resp_body_json() |> tibble(values = _) |> unnest_wider(values) 
# A tibble: 1 × 16
  url       name  gender culture born  died  titles aliases father mother spouse
  <chr>     <chr> <chr>  <chr>   <chr> <chr> <list> <list>  <chr>  <chr>  <chr> 
1 https://… Arya… Female Northm… In 2… ""    <list> <list>  ""     ""     ""    
# ℹ 5 more variables: allegiances <list>, books <lgl>, povBooks <list>,
#   tvSeries <list>, playedBy <list>

Perhaps you want to know which characters hold the most honorific titles.

chars2 |>
  select(name, title = titles) |> ## view()
  unnest_longer(title) |> 
  count(name, sort = TRUE) |> head()
# A tibble: 6 × 2
  name                   n
  <chr>              <int>
1 Cersei Lannister       5
2 Daenerys Targaryen     5
3 Eddard Stark           5
4 Davos Seaworth         4
5 Kevan Lannister        4
6 Asha Greyjoy           3

The {tidyr} vignette has more examples of tidying even more complicated outputs from APIs.
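
You can open it from the R console (the rectangling vignette is named “rectangle”):

vignette("rectangle", package = "tidyr")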

6.10 Example: Rectangling NASA Data

6.10.1 The Data

The National Aeronautics and Space Administration’s SBDB Close-Approach Data (CAD) API provides access to current close-approach data (CAD) for all asteroids and comets in the California Institute of Technology’s Jet Propulsion Laboratory’s Small-Body DataBase.

Each CAD record is packaged as an array of fields (corresponding to those listed) in the following order:

  • des: primary designation of the asteroid or comet (e.g., 443, 2000 SG344)
  • orbit_id: orbit ID
  • jd: time of close-approach (JD Ephemeris Time)
  • cd: time of close-approach (formatted calendar date/time)
  • dist: nominal approach distance (au)
  • dist_min: minimum (3-sigma) approach distance (au)
  • dist_max: maximum (3-sigma) approach distance (au)
  • v_rel: velocity relative to the approach body at close approach (km/s)
  • v_inf: velocity relative to a massless body (km/s)
  • t_sigma_f: 3-sigma uncertainty in the time of close-approach (formatted in days, hours, and minutes;
    • days are not included if zero; example “13:02” is 13 hours 2 minutes;
    • example “2_09:08” is 2 days 9 hours 8 minutes)
  • body: name of the close-approach body (e.g., Earth),
    • only output if the body query parameter is set to ALL
  • h: absolute magnitude H (mag)
  • fullname: formatted full-name/designation of the asteroid or comet
    • optional: only output if requested with the appropriate query flag.
      • formatted with leading spaces for column alignment in mono-spaced font tables

6.10.2 Get the Most Recent Data and Save as RDS

Retrieve data for four maximum approach distances: 10, 20, 50, and 100 lunar distances (LD).

  • Here we create a request with all of the query elements but dist-max.
  • We then add that value to the query before we run req_perform().
  • Note the need to use backticks (`) for the non-standard parameter names that contain a -.

Save each response as an RDS file to enable analysis until the next update of data is needed.

  • Make sure you have created an 06_data folder.
library(tidyverse)
library(httr2)
request("https://ssd-api.jpl.nasa.gov/cad.api/") |> 
  req_url_query(`date-min` = "2013-01-01") |> 
  req_url_query(sort = "dist") |> 
  req_url_query(body = "ALL") ->
req_nasa

req_nasa |> 
  req_url_query(`dist-max` = "10LD") |> 
  req_perform()  |> 
  write_rds("./06_data/neo_data_10lu_2013.rds")

req_nasa |> 
  req_url_query(`dist-max` = "20LD") |> 
  req_perform()  |> 
  write_rds("./06_data/neo_data_20lu_2013.rds")

req_nasa |> 
  req_url_query(`dist-max` = "50LD") |> 
  req_perform()  |> 
  write_rds("./06_data/neo_data_50lu_2013.rds")

req_nasa |> 
  req_url_query(`dist-max` = "100LD") |> 
  req_perform()  |> 
  write_rds("./06_data/neo_data_100lu_2013.rds")

6.10.3 Develop the Steps to Tidy and Clean the Data from one of the RDS files

Load the Near Earth Object (NEO) data from the NASA close-approach data for NEOs within 10 lunar distances on or after 2013-Jan-01.

  • The $body element of the response holds the information we want as a raw vector.
neo_data_10 <- readRDS("./06_data/neo_data_10lu_2013.rds")
class(neo_data_10)
[1] "httr2_response"
glimpse(neo_data_10)
List of 7
 $ method     : chr "GET"
 $ url        : chr "https://ssd-api.jpl.nasa.gov/cad.api/?date-min=2013-01-01&sort=dist&body=ALL&dist-max=10LD"
 $ status_code: int 200
 $ headers    :List of 7
  ..$ date        : chr "Sat, 11 May 2024 20:11:39 GMT"
  ..$ content-type: chr "application/json"
  ..$ server      : chr "nginx"
  ..$ set-cookie  : chr "AWSALBAPP-0=_remove_; Expires=Sat, 18 May 2024 20:11:39 GMT; Path=/"
  ..$ set-cookie  : chr "AWSALBAPP-1=_remove_; Expires=Sat, 18 May 2024 20:11:39 GMT; Path=/"
  ..$ set-cookie  : chr "AWSALBAPP-2=_remove_; Expires=Sat, 18 May 2024 20:11:39 GMT; Path=/"
  ..$ set-cookie  : chr "AWSALBAPP-3=_remove_; Expires=Sat, 18 May 2024 20:11:39 GMT; Path=/"
  ..- attr(*, "class")= chr "httr2_headers"
 $ body       : raw [1:2085973] 7b 22 73 69 ...
 $ request    :List of 7
  ..$ url     : chr "https://ssd-api.jpl.nasa.gov/cad.api/?date-min=2013-01-01&sort=dist&body=ALL&dist-max=10LD"
  ..$ method  : NULL
  ..$ headers : list()
  ..$ body    : NULL
  ..$ fields  : list()
  ..$ options : list()
  ..$ policies: list()
  ..- attr(*, "class")= chr "httr2_request"
 $ cache      :<environment: 0x1567bf0b0> 
 - attr(*, "class")= chr "httr2_response"

Convert the JSON body into an R list.

neo_list_10 <- resp_body_json(neo_data_10)

The data we want is the vector of variable names in $fields and the list in $data.

We need to combine the elements from $fields and $data to create the tibble we want.

  • Convert data to a tibble using tibble().
  • Unnest the list column to create more columns and save as a new tibble.
  • Set the new tibble column names equal to the values from $fields (after unlisting to create a vector).
neo_df_10 <- tibble(value = neo_list_10$data)
neo_df_10 |> unnest_wider(value, names_sep = "") ->
  neo_df_10_un
names(neo_df_10_un) <- unlist(neo_list_10$fields)
glimpse(neo_df_10_un)
Rows: 11,317
Columns: 12
$ des       <chr> "2024 JV8", "2020 VT4", "2020 QG", "2021 UA1", "2023 BU", "2…
$ orbit_id  <chr> "2", "5", "4", "3", "22", "1", "27", "2", "2", "27", "3", "5…
$ jd        <chr> "2460362.065008999", "2459167.222686305", "2459077.672759561…
$ cd        <chr> "2024-Feb-21 13:34", "2020-Nov-13 17:21", "2020-Aug-16 04:09…
$ dist      <chr> "1.95292215627956e-05", "4.50910597356063e-05", "6.227979849…
$ dist_min  <chr> "0", "4.50230750909071e-05", "6.22040307793404e-05", "6.2949…
$ dist_max  <chr> "0.000112897191731237", "4.51593933094878e-05", "6.235555619…
$ v_rel     <chr> "2.01472928494528", "13.4271195491719", "12.3308673063873", …
$ v_inf     <chr> "0.838337877552835", "7.88069685599943", "8.15386324007669",…
$ t_sigma_f <chr> "10:14", "< 00:01", "< 00:01", "< 00:01", "< 00:01", "< 00:0…
$ body      <chr> "Moon", "Earth", "Earth", "Earth", "Earth", "Earth", "Moon",…
$ h         <chr> "26.242", "28.61", "29.90", "31.84", "29.70", "32.32", "31.7…

Every column is still character, so let’s convert the variable types and create a new variable for the minimum distance in lunar distances, based on dist_min, which is in astronomical units (au).

  • Note: 1 Lunar Distance = 384,317 Kilometers = 0.002569 astronomical units (au).
neo_df_10_un |>
  mutate(across(contains("_"), ~ parse_number(.x)),
    body = parse_factor(body),
    dist = parse_number(dist),
    h = parse_number(h),
    jd = parse_number(jd),
    cd = lubridate::ymd_hm(cd)
  ) |>
  mutate(dist_min_ld = dist_min / 0.002569, .after = dist_min) ->
neo_df_10_un

glimpse(neo_df_10_un)
Rows: 11,317
Columns: 13
$ des         <chr> "2024 JV8", "2020 VT4", "2020 QG", "2021 UA1", "2023 BU", …
$ orbit_id    <dbl> 2, 5, 4, 3, 22, 1, 27, 2, 2, 27, 3, 5, 3, 1, 4, 3, 3, 4, 1…
$ jd          <dbl> 2460362, 2459167, 2459078, 2459513, 2459972, 2460195, 2458…
$ cd          <dttm> 2024-02-21 13:34:00, 2020-11-13 17:21:00, 2020-08-16 04:0…
$ dist        <dbl> 1.952922e-05, 4.509106e-05, 6.227980e-05, 6.301350e-05, 6.…
$ dist_min    <dbl> 0.000000e+00, 4.502308e-05, 6.220403e-05, 6.294948e-05, 6.…
$ dist_min_ld <dbl> 0.00000000, 0.01752553, 0.02421332, 0.02450350, 0.02593414…
$ dist_max    <dbl> 1.128972e-04, 4.515939e-05, 6.235556e-05, 6.307751e-05, 6.…
$ v_rel       <dbl> 2.014729, 13.427120, 12.330867, 15.835007, 9.267245, 13.58…
$ v_inf       <dbl> 0.8383379, 7.8806969, 8.1538632, 12.8910404, 2.4284938, 10…
$ t_sigma_f   <dbl> 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
$ body        <fct> Moon, Earth, Earth, Earth, Earth, Earth, Moon, Earth, Moon…
$ h           <dbl> 26.242, 28.610, 29.900, 31.840, 29.700, 32.320, 31.740, 32…

6.10.4 Create Functions to Tidy and Clean

We have multiple .rds files, so instead of copying and pasting code, let’s build some helper functions.

  • We want to load, tidy, and clean the data for any .rds file saved from this API.
  • It should work on objects of class “httr2_response.”

Strategy:

  1. Generate/load the response data
  2. Extract the content from the response using resp_body_json() which will be returned as list.
  3. Convert the list to a tibble using tibble().
  4. Unnest the list column into multiple columns with minimal names.
  5. Set the tibble column names equal to the values from the $fields element.
  6. Use a second helper function to clean the data.

Create a vector with the paths to the .rds files.

rds_files <- c(
  "./06_data/neo_data_10lu_2013.rds",
  "./06_data/neo_data_20lu_2013.rds",
  "./06_data/neo_data_50lu_2013.rds",
  "./06_data/neo_data_100lu_2013.rds"
)

Write the “clean” function since we will want to source it and use it in the “tidy” function.

clean_neo_df <- function(df) {
  stopifnot("data.frame" %in% class(df))
  mutate(df,
    across(contains("_"), ~ parse_number(.x)),
    body = parse_factor(body),
    dist = parse_number(dist),
    h = parse_number(h),
    jd = parse_number(jd),
    cd = lubridate::ymd_hm(cd),
    dist_min_ld = dist_min / 0.002569
  )
}

Write the “tidy” function and call the “clean” function at the end.

  • I added a step that adds a column to each data frame with the maximum lunar distance used in the query, extracted from the URL of the original API call.
  • Note: the first several steps are general to many APIs.
  • By changing out the “clean” function and later steps you could customize the cleaning to other APIs.
tidy_nasa_neo <- function(response) {
  stopifnot(class(response) == "httr2_response")
  
  resp_body_json(response) ->
  temp_list

  tibble(value = temp_list$data) |>
    unnest_wider(value, names_sep = "") ->
  temp_df
  
  names(temp_df) <- unlist(temp_list$fields)

  ## what follows customized to neo-cad data
  temp_df <- clean_neo_df(temp_df)
  temp_df <- mutate(temp_df,
    ld = as.numeric(response$url |> str_extract("(?<=max=).+(?=LD)")),
    .before = everything()
  )
  temp_df
}
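
Before mapping over all four files, you can spot-check the function on a single response; the result should match the glimpse from the previous section, plus the new ld column.

tidy_nasa_neo(read_rds(rds_files[[1]])) |> glimpse()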

Now we can use map() to run the function over our list of files, which will produce a list of data frames.

Since each data frame has the same structure, we can combine them into one large tibble using list_rbind().

neo_list <- map(rds_files, \(resp) tidy_nasa_neo(read_rds(resp)), 
              .progress = TRUE) 
neo_df <- list_rbind(neo_list)
neo_df |> head()  
# A tibble: 6 × 14
     ld des      orbit_id       jd cd                     dist dist_min dist_max
  <dbl> <chr>       <dbl>    <dbl> <dttm>                <dbl>    <dbl>    <dbl>
1    10 2024 JV8        2 2460362. 2024-02-21 13:34:00 1.95e-5  0        1.13e-4
2    10 2020 VT4        5 2459167. 2020-11-13 17:21:00 4.51e-5  4.50e-5  4.52e-5
3    10 2020 QG         4 2459078. 2020-08-16 04:09:00 6.23e-5  6.22e-5  6.24e-5
4    10 2021 UA1        3 2459513. 2021-10-25 03:07:00 6.30e-5  6.29e-5  6.31e-5
5    10 2023 BU        22 2459972. 2023-01-27 00:29:00 6.66e-5  6.66e-5  6.66e-5
6    10 2023 RS         1 2460195. 2023-09-07 14:26:00 6.93e-5  6.92e-5  6.94e-5
# ℹ 6 more variables: v_rel <dbl>, v_inf <dbl>, t_sigma_f <dbl>, body <fct>,
#   h <dbl>, dist_min_ld <dbl>
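
A quick sanity check that each distance cut-off contributed rows (your counts will depend on when you pulled the data):

neo_df |> count(ld)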

6.10.5 Create Helper Functions to Plot and Explore the Data

6.10.5.1 Does the Number of Objects Vary with Minimum Approach Distance and Body?

Plot the distribution of the minimum distance of objects by body

Create two helper functions

  • The first is for adjusting the position of the labels on the plot
  • The second is to create the plots.
give.n <- function(x) {
  c(y = median(x) * 1.4, label = length(x))
  ## experiment with the multiplier to find a good position
}

plot_df <- function(df) {
  ggplot(df, aes(x = body, y = dist_min_ld)) +
    geom_boxplot() +
    ggtitle(paste0(
      df$ld,
      " LD: ",
      format(nrow(df), big.mark = ","), " Objects"
    )) +
    xlab("") +
    ylab("Min App Dist") +
    coord_flip() +
    stat_summary(
      fun.data = give.n, geom = "text", fun = median,
      position = position_dodge(width = 0.75)
    ) ->
  my_return_value
  return(my_return_value)
}

Use map() with the plot helper function to create a list where the elements are a plot for each tibble in the list.

  • You can use Quarto figure layout options to display multiple plots at once.
  • You can also use the {patchwork} package to lay them all out at once with more flexibility (see the sketch after the plots below).
plot_neo <- map(neo_list, \(df) plot_df(df))

plot_neo[[1]]
plot_neo[[2]]
plot_neo[[3]]
plot_neo[[4]]
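
As an alternative to printing the plots one at a time, here is a sketch using {patchwork} (install it from the console first) to lay out all four plots in a grid:

library(patchwork)
wrap_plots(plot_neo, ncol = 2) # arrange the list of plots in two columns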

There appears to be a difference between the Earth and the Moon that does not make physical/geometric sense. We’ll explore it in a few steps.

6.10.5.2 Does the Number of Objects vary by Day of the Week?

Plot the total number of objects by day of the week by body

Let’s create a second plot function for day of the week.

plot_dow <- function(df) {
  df |>
    mutate(wkday = lubridate::wday(cd)) |>
    group_by(wkday, body) |>
    summarize(tot_des = n(), .groups = "drop") |>
    ggplot(aes(x = wkday, y = tot_des, color = body)) +
    geom_line() +
    scale_y_log10() +
    ggtitle(paste0(
      df$ld,
      " LDs: ",
      format(nrow(df), big.mark = ","), " Objects"
    )) +
    xlab("Weekday") +
    ylab("Num Objects") +
    theme(legend.position = "top") +
    scale_color_discrete(name = " ")
}

Now plot.

plot_neo <- map(neo_list, \(df) plot_dow(df))

plot_neo[[1]]
plot_neo[[2]]
plot_neo[[3]]
plot_neo[[4]]

  • There appear to be some anomalies across the days of the week.
  • Run an analysis of variance for the 10 LD and 50 LD data to see what might be happening.
neo_list[[1]] |>
  mutate(wkday = lubridate::wday(cd)) |>
  group_by(wkday, body) |>
  summarize(tot_des = n(), .groups = "drop") -> neo_df_10_dow

aout_10 <- aov(tot_des ~ body * wkday, data = neo_df_10_dow)
summary(aout_10)
            Df  Sum Sq Mean Sq  F value  Pr(>F)    
body         4 6091874 1522968 6209.210 < 2e-16 ***
wkday        1    2169    2169    8.841 0.00644 ** 
body:wkday   4    2498     624    2.546 0.06448 .  
Residuals   25    6132     245                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
neo_list[[3]] |>
  mutate(wkday = lubridate::wday(cd)) |>
  group_by(wkday, body) |>
  summarize(tot_des = n(), .groups = "drop") -> neo_df_50_dow
aout_50 <- aov(tot_des ~ body * wkday, data = neo_df_50_dow)
summary(aout_50)
            Df   Sum Sq  Mean Sq   F value Pr(>F)    
body         5 51819815 10363963 15831.373 <2e-16 ***
wkday        1      232      232     0.354  0.557    
body:wkday   5     4678      936     1.429  0.248    
Residuals   25    16366      655                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

There appears to be some evidence of a difference across the days of the week for some bodies when looking at only 10 LDs. However, those bodies have relatively small sample sizes, so the model assumptions need more scrutiny before concluding the difference is real.
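
One way to start that check is with the standard diagnostic plots for the fitted model (note also that wkday is numeric here, so the model treats it as a linear trend rather than as a factor). A minimal sketch for the 10 LD model:

plot(aout_10, which = 1) # residuals vs fitted values
plot(aout_10, which = 2) # normal Q-Q plot of the residuals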

6.10.5.3 Assess the Differences Between the Earth and the Moon

Plot objects that pass by both the Moon and the Earth.

Create another plot function.

plot_m_e_o <- function(df) {
  df |>
    filter(body %in% c("Earth", "Moon")) |> ## Only objects for Moon and Earth
    group_by(des) |> ## for each object
    mutate(n_bodies = length(unique(body)), .after = body) |> ## how many bodies
    ungroup() |>
    filter(n_bodies > 1) |> ## Select objects that pass both Earth and Moon
    group_by(des, body) |> ## for each object and body
    filter(dist_min_ld == min(dist_min_ld)) -> tt ## Select the pass with the minimum distance
  ggplot(tt, aes(x = body, y = dist_min_ld)) +
    geom_boxplot() +
    stat_summary(
      fun.data = give.n, geom = "text", fun = median,
      position = position_dodge(width = 0.75)
    ) +
    ggtitle(paste0(
      df$ld,
      " LDs: ",
      format(nrow(df), big.mark = ","), " Objects"
    )) +
    xlab("Weekday") +
    ylab("Num Objects") +
    theme(legend.position = "top") +
    scale_color_discrete(name = " ")
}
plot_neo <- map(neo_list, \(df) plot_m_e_o(df))
plot_neo[[1]]
plot_neo[[2]]
plot_neo[[3]]
plot_neo[[4]]

Plot objects that pass by only the Earth or the Moon (facet 1) alongside those that pass by both (facet 2).

plot_m_or_e <- function(df) {
  df |>
    filter(body %in% c("Earth", "Moon")) |> ## Only objects for Moon and Earth
    group_by(des) |> ## for each object
    mutate(n_bodies = length(unique(body)), .after = body) |> ## how many bodies
    ungroup() |>
    ## filter(n_bodies > 1) |> #Select objects that pass both Earth and Moon
    group_by(des, body) |> ## for each object and body
    filter(dist_min_ld == min(dist_min_ld)) -> tt ## Select the pass with the minimum distance
  ggplot(tt, aes(x = body, y = dist_min_ld)) +
    geom_boxplot() +
    facet_grid(~n_bodies) +
    stat_summary(
      fun.data = give.n, geom = "text", fun = median,
      position = position_dodge(width = 0.75)
    ) +
    ggtitle(paste0(
      df$ld,
      " LDs: ", nrow(tt), " Objects \n(1 = E or M and 2 =  Both E and M)"
    ))
}
plot_neo <- map(neo_list, \(df) plot_m_or_e(df))
plot_neo[[1]]
plot_neo[[2]]
plot_neo[[3]]
plot_neo[[4]]

6.10.5.4 What is Happening?

After we asked the SBDB team, it turned out that the API imposes hard limits on the search distance for each close-approach body to keep the returned data set a reasonable size while still ensuring detection of close approaches of interest.

  • For example, the limit for Venus is 0.15 au, for Earth is 0.5 au, and for the moon is 0.01 au (about 4 LD).
  • That is why the Earth has far more objects than the moon.
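
We can see the moon’s cut-off (about 4 LD) directly in the data; a quick check (exact values depend on when you pulled the data):

neo_df |>
  filter(body == "Moon") |>
  summarize(max_dist_min_ld = max(dist_min_ld), n = n())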
Tip

Don’t be afraid to reach out to a data-source owner or the community to ask questions when the data does not seem to make sense!

6.10.6 Summary for Rectangling

Data returned from APIs can arrive in many different structures, often as deeply-nested lists once converted from JSON.

The {tidyr} package provides many functions to facilitate manipulating deeply nested lists so you can access the data you want for your analysis while keeping all of the data for future work.