8  Getting Data from SQL Databases

Published

August 30, 2024

Keywords

database, relational, sql, duckdb, dbeaver, duckplyr

The code chunks for SQL and R have different colors in this chapter.

  • Code chunks for SQL use this color.
  • Code chunks for R use this color.

8.1 Introduction

8.1.1 Learning Outcomes

  • Understand the organization of relational databases.
  • Use Structured Query Language (SQL) to access and manipulate data in SQL databases.
  • Use the {dbplyr} package to access and manipulate data in SQL databases.

8.1.2 References:

8.1.2.1 Other References

8.2 A Brief History of Databases

People and organizations have been collecting data for centuries. As early as the 21st century BCE, people were using clay tablets to make records of farmland and grain production. Records were kept as individual files (clay tablets) that were eventually organized based on conceptual model of the data. That model was typically a hierarchical set of fixed attributes of the records, e.g., subject, date, author, etc. for tracking records in a filing system.

It wasn’t until the 1960s that computerized “databases” began to appear. Early computerized databases focused on storing and retrieving individual records which were organized based on existing data models which used the attributes of the tables and records. Data were stored in tables with rows of values and columns for attributes. Users could find the desired table and row by navigating through the hierarchy. This model resulted in larger and larger tables with many redundant fields as users tried to integrate data at the table level. As an example, a record of a purchase could include all of the details of the purchase to include all of data on the buyer and seller such as name, address, and telephone number. Every record of every purchase by a buyer could include their name, address, and telephone number.

Database Management Systems (DBMS) were developed as the interface between the system user and the raw data in the records. DBMS used the data model to manage the physical, digital storage of the data and provide the navigation to find the right table and record. Users could now retrieve and update records electronically. However, computing and storage was expensive so researchers were looking for ways to reduce the redundant storage and inflexible perspective of the data inherent in the hierarchical data model.

In 1970, IBM researcher E.F. Codd published what became a landmark paper, A relational model of data for large shared data banks. Codd (1970) Codd proposed replacing the fixed hierarchy and its large tables with much smaller tables where the data model now focused on defining relationships among the tables instead of navigating across a hierarchy of attributes or parent child relationships among tables. The relational database model is now ubiquitous for storing extremely large data sets.

A strength and a drawback of the relational database model is the capability to manage not just the data, but all the relationships across the tables of data. This makes relational databases great for working with structured data, where the relationships are known ahead of time and built into the database model. The drawback is it takes time and processing power to enforce those relationships and structure. Thus, other approaches such as Apache HADOOP have been developed to capture and manage the variety, velocity, and volume of data associated with “Big Data” sources, e.g., capturing all the meta data and data associated with billions of transactions a day on a social media site.

For the relational database model to work, it needed a corresponding high-level language for implementation. Users had to be able to create databases and manage the life cycle of their databases and the data they contained. This need led to the development of what were known as “query languages” which eventually evolved into what is now called Structured Query Language or SQL (pronounced as “SEE-quil” or “ESS-q-l”). SQL began with a row-based approach to managing data in tables. Other languages e.g., No-SQL, developed a column-based approach to managing data. There are trade offs between the approaches. Many relational database developers create their own (proprietary) versions of SQL to be optimized for their database that incorporate features of both for speed and distributed processing. Apache has developed Apache Hive as a data warehouse system that sits atop HADOOP. “Hive allows users to read, write, and manage petabytes of data using SQL.”

Relational Databases and HADOOP/Hive-like approaches are not competitors as much as different solutions for different requirements.

This section will focus on the relational database methods and systems which use SQL.

8.3 Database Concepts and Terms

8.3.1 Database Tables

Relational databases use Relational DBMS (RDBMS) to keep track of large amounts of structured data in a single system.

A relational database uses rectangular tables, also called entities to organize its data by rows and/or columns.

  • Each table is typically designed to contain data on a single type or class of “entity”.

  • An entity can be a physical object such as a person, building, location, or sales order, or a concept such as this class or metadata on an object.

  • Each entity must have properties or attributes of interest.

    • These become the names for the columns in the table.
    • These are also known as the fields for the tables.
  • Each row in the table is also referred to as a record.

  • The intersection of row and column contains the value, the data, for that record for that field.

  • All of the values within a given field should be of the same data type. These include the basic types in R such as character, integer, double, or logical as well as complex objects such as pictures, videos, or other Binary Large Objects.

The tables are similar to R or Python data frames, but with some key differences.

  • Database tables are stored on a drive so are not constrained, like R data frames, to having to fit into memory. They can have many columns and many millions of rows, as long as they fit on the drive.
  • Unlike a data frame, a database field (column) may have a domain which limits the format and range of values in the field in addition to constraining the data type, e.g., the format for a telephone number may be xx-xxx-xxx-xxxx.
  • Database tables are often designed so each record is unique (no duplicate rows) with rules the DBMS uses to enforce the uniqueness.
  • Database tables are almost always indexed so each row has a unique value associated with it (behind the scenes).
    • The index identifies each record from a physical perspective, much like one’s operating system indexes files for faster search and retrieval.
    • The index is what allows a DBMS to find and retrieve records (rows) quickly.

8.3.2 Database Table Keys and Relationships

Most database tables will also be designed with a primary key which is combination of one or more fields that uniquely identify each record, from a logical perspective, across the database.

  • As an example, a key may be a person’s social security number which is expected to be unique to one person.

A combined key, uses multiple fields to create a key, such as house number, street name, city, state, zip-code.

Many database designers prefer to create a separate surrogate key, that does not use other values and has no meaning, to avoid potential changes in data over time.

The presence of primary keys for each table enable “relationships” between tables to be created and used by the RDBMS.

As an example, an organization is interested in tracking data about purchase orders and the buyer and a product in a purchase order.

A database designer might create three tables to hold this information:

  • A ORDER table with fields for the order number, the date of the purchase, the buyer_id, the product_id, the quantity purchased, and the purchase price.
    • The primary key is order_number.
  • A BUYER table with fields for the buyer_id, name, company id, street address, state, postal code, country, telephone number, and email.
    • The primary key is buyer_id.
  • A PRODUCT table with fields for the products id, name, weight, volume, purchase sizes, default_shipper_id, and standard price.
    • The primary key is product_id.

Each table describes a single object or entity and each has a primary key to uniquely identify each record.

The ORDER table uses the primary keys from the BUYER table and the PRODUCT table to make a relationship with those tables.

The presence of a table’s primary key in a different table is called being a foreign key in the second table.

  • The table with the primary key is known as a Parent table.
  • The table that references the parent table (using the foreign key) is known as the child table.
  • RDBMS use rules (constraints) to preserve data integrity so that someone editing or changing a child table cannot change or use a foreign key field that is not present in the parent table.
  • In this example, there is a foreign key in the BUYER table, company _id, and a foreign key in the PRODUCT table, default_shipper_id.

If you were to put all this data into a single table, the attributes of the buyers and products would be duplicated many times over as additional orders were created.

This example illustrates how relationships allow database designers to create multiple tables to minimize the amount of redundant data in a system.

8.3.3 ER Diagrams

Tables and their relationships are usually described in the database schema.

They can be depicted using class or entity-relationship diagrams to show the relationships between tables, fields and keys.

  • E-R diagrams can include additional information such as constraints and type of relationship, e.g., one-to-one, or one-to-many.

E-R diagrams can serve as a useful reference for understanding the structure of a large database and designing queries.

Figure 8.1 is an example of a simplified ER diagram (created in Quarto using Mermaid) showing the relationships between the tables and the associated primary and foreign keys.

Figure 8.1: Simplified ER Diagram

Figure 8.2 shows a different style of E-R diagram commonly used for database schema. The arrows link individual fields as primary and foreign keys for the tables and show the type of the relationships.

Figure 8.2: A partial E-R diagram from a hospital setting. w3resource (2023)

8.3.4 Bringing Tables Together

Now that we can see the database schema, we can find where the data is that we want.

Suppose we want to answer the question, how many orders were placed in the last month. We can do that with just the data in the ORDERS table.

However, suppose we want to answer the question how many orders were placed from buyers in the states of Virginia and Maryland last month. We would have to get data from both the ORDERS table and the BUYERS table.

We get data from multiple tables by temporarily combining tables using a SQL query that creates a JOIN between two or more tables.

Database joins are the same joins you can execute using {dplyr} verbs but the SQL syntax is slight different.

If you are using the same join over and over in your analysis. SQL allows you to create a database “View” and save it. Then in the future, you can query the “view” instead of having to rejoin the tables.

  • This is analogous to writing an R function to create a data frame based on joins and querying the results of the function instead of rewriting the code to join the data frames.

8.3.5 Database Design

Designing databases is its own area of expertise.

A poorly designed database will be slow and have insufficient rules (safeguards) to ensure data integrity.

We are used to operating as single users of our data since it is all in the memory of our machine and we can do what we want.

  • A database may have thousands of users at the same time trying to update or retrieve records.
  • Only one user can update a record at a time so databases use Record Locking to prevent others from updating the record or even retrieving it while someone else is updating the record.
  • RDBMS can enforce locking at higher levels such as the table or even the entire database which can delay user queries. See All about locking in SQL Server for more details.

Database designers make trade offs when designing tables to balance competing desires.

  • Storage Efficiency: limit each table to a single entity to minimize the amount of redundant data. This can generate lots of tables.
  • Maintenance Efficiency: The more tables you have, the fewer users might be trying to update the table at a time.
  • Query Speed: Minimize the number of joins needed to make a query. This leads to fewer tables.

These trade offs were recognized by Codd in his original paper.

Thus database designers may create tables that have redundant data if the most common queries join fields from those tables.

They will use database performance testing to optimize the queries and the database design to get the optimal cost performance trade offs. They can use other approaches to support data integrity in addition to database normalization.

Don’t be surprised to see E-R diagrams that have redundant data in them.

8.3.6 Relational Database Management Systems

8.3.6.1 Database Front and Back Ends

Databases have two main components.

  • The Front-end user interface where users create code to generate queries and receive results.
  • The Back-end server where the data is stored and the DBMS executes the queries to update or retrieve the data.

The Front-end and Back-end communicate through a Connection that uses an API for handling queries and data.

Many RDBMS bundle a Front-end and Back-end together. Examples include Oracle, SAP, and DuckDB has DBeaver.

Most databases allow you to use a Front-end of your choice as long as you have a server driver you can use to configure and open a connection to the Back-end server of the database.

  • RStudio can operate as a Front-end to RDBMS for which you have installed a server.
  • Excel can operate as a Front-end to a Microsoft SQL Server database.
  • DBeaver allows you to make a connection to many different DBMS servers.
  • MariaDB identifies over 30 Front-ends that can work with it, including DuckDB’s DBeaver SQL IDE.

There are multiple solutions. We will use both RStudio and DBeaver.

8.3.6.2 Database Connections: ODBC and the {DBI} Package

The reason there is so much flexibility with connecting Front and Back ends is the standardization of the Microsoft Open Data Base Connectivity (ODBC) Standard. David-Engel (2023)

The ODBC standard was designed for “maximum interoperability”.

The construct is similar to the HTTP interface for web browsers. The DBMS developer creates a driver that serves as an API for their Back-end DBMS that is ODBC-compliant. Then the Front-end developer creates their software to also be ODBC-compliant.

That way, the Front-end can create a connection to and interact with any DBMS back end for which there is an ODBC-compliant driver.

The R {DBI} package (for Data Base Interface) supports the connection between R and over 25 different DBMS Back-ends.

  • Popular Back-end packages include: {duckdb}, {RMariaDB}, {RPostgres}, {RSQLite}, and {bigrquery}.
  • These packages include the ODBC-compliant driver for the given database.

If no one has written a package but an ODBC-compliant driver exists, the {DBI} package works with the {ODBC} package and the driver to make the connection to the Back-end server.

8.3.6.3 Database Configuration Types

RDBMS can also be categorized based on their configuration or how front and back ends are physically connected and how the data is stored.

  • In-process databases run on your computer. The data is stored locally and accessible without a network connection.
    • They are optimized for the use case of a single user working with large amounts of data.
  • Client-Server databases use a small front-end client on your computer (perhaps an IDE or in a browser) to connect to a robust server over a network. Data can be stored across a set of servers.
    • They are optimized for the use case where multiple users can query the database at once and the RDBMS resolves contention.
  • Cloud-databases also separate the front and back ends over a network but now the data may be distributed across multiple locations in the cloud.
    • These are optimized for the use case of multiple users accessing a wide-variety of very large data and enabling distributed computing and parallel computing.

With the proliferation of cloud computing and software as a service (SAAS), you can expect to see more and more hybrids of client server and cloud databases.

As the amount of collected structured and unstructured data scales to Apache solutions for “Big Data”, the fundamental concepts of exploiting relationships in data using some version of SQL will continue to apply for the foreseeable future.

8.4 SQL Basics

  • This section is based on work by Professor David Gerard.

We will use R Studio as the Front-end, DuckDB as the Back-end and the {DBI} package to support the connection between the front and back ends.

8.4.1 Conecting to a Database

Before you can use SQL to interact with a database, you must have a way to connect the Front-end user interface to the Back-end server with the data.

  • When working in R, the {DBI} package can create the connection and allow you to assign a name to it.
  • Your database will be created by a specific DBMS which provides the Back-end server.
  • If you are not using the Front-End that comes with the DBMS, you need an ODBC-compliant driver for the DBMS to make the connection.

We will use the DuckDB database and the {duckdb} Package for most of this section.

  • DuckDB is an in-process database so it has no external dependencies which makes it useful for single user analysis of large data. See Why DuckDB for their story.

  • You can install DuckDB directly on your computer and use the built-in Command Line Interface without using R.

  • However, the {duckdb} package installs the DuckDB DBMS as well as the ODBC-compliant driver for R.

  • Use the console to install {DBI} and {duckdb}.

To make the connection between R and DuckDB requires loading the two packages {DBI} and {duckdb}.

```{r}
#| message: true
library(DBI)
library(duckdb)
```

We also want to compare the SQL code (working with the database tables) with R code (working with data frames) so load {tidyverse} and {nycflights13} which has the same data as the database we will use.

  • You will see the concepts for accessing data are quite similar but the syntax is different.
```{r}
#| message: false
library(nycflights13)
library(tidyverse)
```

Download the DuckDB database {nycflights13} data from https://github.com/AU-datascience/data/blob/main/413-613/flights.duckdb and save to a data folder under your current file’s working directory.

  • {duckdb} only works with local data bases not over a URL

Use duckdb() and DBI::dbConnect() to create a connection to flights.duckdb using a relative path to the database.

  • Assign a name to it.
```{r}
#| label: make-connection
con <- dbConnect(duckdb(dbdir = "./data/flights.duckdb", read_only = TRUE))
class(con)
```
[1] "duckdb_connection"
attr(,"package")
[1] "duckdb"
  • Notice the connection to the database in the environment.
  • You will use the con connection to interact with the database Back-end.

Once you have a connection, you can create a basic SQL code chunk like this in R Markdown.

```{sql, connection=con}
SHOW TABLES;
```
5 records
name
airlines
airports
flights
planes
weather

In Quarto .qmd files use the chunk option #| connection: con.

```{sql}
#| connection: con
SHOW TABLES;
```
5 records
name
airlines
airports
flights
planes
weather
Important

Whenever you are using a SQL chunk inside an R Markdown or Quarto document, you have to identify the connection in each code chunk so the SQL interpreter knows with which database to interact.

All the SQL code chunks in this section will have the #| connection my_con chunk option where my_con is the name of the appropriate connection.

8.4.2 Basic SQL Syntax

Whereas Tidyverse uses “verbs”, SQL uses “STATEMENTS” which can have clauses, keywords, or other modifiers.

  • SQL statements are used across the life cycle of a database; to create it, to create tables, add data, update data, select data, delete data and tables, establish rules, and delete a database.

We are most interested in SELECT statements as that is how we get data from a database table or a set of joined tables.

The SELECT statement identifies which fields are to be returned.

SELECT has several other clauses or keywords to support tailoring the data returned by a query.

  • FROM: Required to identify the table or set of tables where the data will be located.
  • WHERE is a clause used to filter which records (rows) should be returned.
  • ORDER BY is a keyword for sorting records. The default is ascending order.
  • GROUP BY is a keyword akin to dplyr::group_by and is used for creating summaries of data.

Other useful statements include:

  • AS: used for renaming or creating aliases.
  • IN: similar to R’s %in%.

SQL uses = for comparison instead of == since you don’t assign anything in SQL.

  • Use != for not equal.
  • SQL uses NULL instead of NA.
  • SQL uses AND, OR and NOT as logical operators.
Important

There is a standard for SQL: SQL-92. However, many database developers see this as a point of departure rather than a destination.

  • They use ODBC and SQL-92 to develop their own versions that are optimized for their intended use cases and database design.
  • Thus, there are many flavors of SQL in the world.

Be sure to review the SQL implementation for the DBMS you are using to be aware of any special features.

If you use proprietary SQL, it may give you performance improvements, but it will also lock you into that DBMS and you will have to revise your code if the Back-end DBMS is changed.

Like all languages, SQL has standard syntax and reserved words.

Rules to remember:

  • Case does not matter (i.e. select is the same as SELECT is the same as SeLeCt), but it is standard to have all statements be in UPPERCASE (e.g., SELECT).
  • Order matters. A SQL SELECT statement’s clauses and keywords must be in the following order: SELECT, FROM, WHERE, GROUP BY, ORDER BY.
  • New lines and white space don’t matter. But it is common to put those statements, clauses, and keywords on new lines with spaces to preserve a “river” of white space between the keyword and the values.
  • Character string values must be in single quotes.
    • You can use invalid variable names by putting them in double quotes (same as using back ticks in R).
    • Some folks always use double quotes because it is not always clear what is an invalid variable name in the database management system. This is what you see below.
  • Comments in SQL start with two hyphens --.
  • Some DBMS require a semicolon at the end of each statement.
    • It is good practice to put a semicolon ; at the end of a SQL statement.
    • This will allow you to have multiple SQL statements passed to the database at once, e.g., in one chunk.

8.5 Common SQL Actions and their Syntax

  • This section is based on work by Professor David Gerard.

8.5.1 Show and Describe Tables in the Database

The SHOW TABLES command can be used to get a list of all of the tables in the database.

```{sql}
#| connection: con
SHOW TABLES;
```
5 records
name
airlines
airports
flights
planes
weather

The DESCRIBE command can be used to show tables and the variables (in list columns).

```{sql}
#| connection: con
DESCRIBE;
```
5 records
database schema name column_names column_types temporary
flights main airlines carrier, name VARCHAR, VARCHAR FALSE
flights main airports faa , name , lat , lon , alt , tz , dst , tzone VARCHAR, VARCHAR, DOUBLE , DOUBLE , DOUBLE , DOUBLE , VARCHAR, VARCHAR FALSE
flights main flights year , month , day , dep_time , sched_dep_time, dep_delay , arr_time , sched_arr_time, arr_delay , carrier , flight , tailnum , origin , dest , air_time , distance , hour , minute , time_hour INTEGER , INTEGER , INTEGER , INTEGER , INTEGER , DOUBLE , INTEGER , INTEGER , DOUBLE , VARCHAR , INTEGER , VARCHAR , VARCHAR , VARCHAR , DOUBLE , DOUBLE , DOUBLE , DOUBLE , TIMESTAMP FALSE
flights main planes tailnum , year , type , manufacturer, model , engines , seats , speed , engine VARCHAR, INTEGER, VARCHAR, VARCHAR, VARCHAR, INTEGER, INTEGER, INTEGER, VARCHAR FALSE
flights main weather origin , year , month , day , hour , temp , dewp , humid , wind_dir , wind_speed, wind_gust , precip , pressure , visib , time_hour VARCHAR , INTEGER , INTEGER , INTEGER , INTEGER , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , TIMESTAMP FALSE

8.5.2 Select Fields (Columns) From a Table

To return specific fields (columns) from a table, use the SELECT statement.

  • The SELECT statement must always be followed by a FROM statement which identifies the source of the data.
  • That source is a table or a set of joined tables or a view.
  • The data which is returned as a result of executing a SELECT statement is stored in a result table, which is called the result-set. This means you can use it in a FROM clause.
  • A SELECT statement can have other clauses and key words to reduce the amount of data that is returned.
```{sql}
#| eval: false
#| connection: con
SELECT <field1>, <field2>, <field3> 
FROM <mytable>;
```

Let’s use this syntax to get the data in the fields tailnum, year, and model from the planes table.

```{sql}
#| connection: con
SELECT "tailnum", "year", "model" 
FROM planes;
```
Displaying records 1 - 10
tailnum year model
N10156 2004 EMB-145XR
N102UW 1998 A320-214
N103US 1999 A320-214
N104UW 1999 A320-214
N10575 2002 EMB-145LR
N105UW 1999 A320-214
N107US 1999 A320-214
N108UW 1999 A320-214
N109UW 1999 A320-214
N110UW 1999 A320-214
  • R equivalent
```{r}
#| eval: false
planes |>
  select(tailnum, year, model)
```

You can select every field in a table by using the wildcard *, e.g., SELECT *.

Tip

For tables with many fields, select only the specific fields of interest unless you are certain the table has only the fields of interest.

This has two benefits:

  • This makes your code more robust as using SELECT * will select every field and you don’t know what someone else may have done to the table.
  • On tables with many fields this can make your query much faster in two ways. The first is actually selecting the data, and the second is by limiting the amount of data that must returned through the connection.
  • For In-process databases that run on your computer the connection speed is less of a concern.
  • However, for the other database types, where the connection is over a network, using SELECT * can mean you are trying to return a lot of unneeded data through the connection, slowing you (and others) down.
```{sql}
#| connection: con
SELECT * 
FROM planes;
```
Displaying records 1 - 10
tailnum year type manufacturer model engines seats speed engine
N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NA Turbo-fan
N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N107US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N108UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N109UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N110UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
  • R equivalent
```{r}
#| eval: false
planes
```
  • There is no equivalent for excluding fields (like dplyr::select(-year)). Just select the ones you want.

8.5.3 Select Distinct or Unique Records

Some tables may allow duplicate records or have many records with duplicate values if only a few fields are selected.

To get distinct records use the statement SELECT DISTINCT field1, field2.

The following example will return all unique destinations.

```{sql}
#| connection: con
SELECT DISTINCT "dest" 
FROM flights;
```
Displaying records 1 - 10
dest
DCA
DAY
SRQ
CHS
AUS
SAV
SNA
OMA
MYR
MVY

8.5.4 Reduce the Number of Records that are Returned

8.5.4.1 Filter Records (Rows) by Value

You can use a WHERE clause after the FROM inside a SELECT statement to filter which subset of records will be returned based on the values in the record (row).

  • The WHERE clause requires an expression that returns a logical value.
```{sql}
#| connection: con
SELECT "flight", "distance", "origin", "dest" 
FROM flights
WHERE "distance" < 50;
```
1 records
flight distance origin dest
1632 17 EWR LGA
  • R equivalent:
```{r}
#| eval: false
flights |>
  select(flight, distance, origin, dest) |>
  filter(distance < 50)
```

To test for equality, use a single equal sign, =.

```{sql}
#| connection: con
SELECT "flight", "month" 
FROM flights
WHERE "month" = 12;
```
Displaying records 1 - 10
flight month
745 12
839 12
1895 12
1487 12
2243 12
939 12
3819 12
1441 12
2167 12
605 12
  • R equivalent
```{r}
#| eval: false
flights |>
  select(flight, month) |>
  filter(month == 12)
```

To use a character string you must use single quotes on the string, not double.

```{sql}
#| connection: con
SELECT "flight", "origin" 
FROM flights
WHERE "origin" = 'JFK';
```
Displaying records 1 - 10
flight origin
1141 JFK
725 JFK
79 JFK
49 JFK
71 JFK
194 JFK
1806 JFK
1743 JFK
303 JFK
135 JFK
  • R equivalent
```{r}
#| eval: false
flights |>
  select(flight, origin) |>
  filter(origin == "JFK")
```

You can filter records based on multiple criteria and the logical operators.

Here is an example of using the AND logical operator.

```{sql}
#| connection: con
SELECT "flight", "origin", "dest" 
FROM flights
WHERE "origin" = 'JFK' AND "dest" = 'CMH';
```
Displaying records 1 - 10
flight origin dest
4146 JFK CMH
3783 JFK CMH
4146 JFK CMH
3783 JFK CMH
4146 JFK CMH
3783 JFK CMH
4146 JFK CMH
3783 JFK CMH
4146 JFK CMH
3650 JFK CMH
  • R equivalent:
```{r}
#| eval: false
flights |>
  select(flight, origin, dest) |>
  filter(origin == "JFK", dest == "CMH")
```

You can use the OR and `NOT logical operators too.

  • Put parentheses around the portions of the expression to ensure you get the desired order of operations.
```{sql}
#| connection: con
SELECT
    DISTINCT "flight",
    "origin",
    "dest"
FROM
    flights
WHERE
    NOT ("origin" = 'JFK'
        OR "origin" = 'LGA')
    AND dest = 'CMH';
```
Displaying records 1 - 10
flight origin dest
5680 EWR CMH
3852 EWR CMH
4446 EWR CMH
5846 EWR CMH
3854 EWR CMH
4536 EWR CMH
4250 EWR CMH
4644 EWR CMH
4625 EWR CMH
4704 EWR CMH
  • R equivalent
```{r}
#| eval: false
flights |>
  select(flight, origin, dest) |>
  filter(!(origin == "JFK" | origin == "LGA"), dest == "CMH") |> 
  distinct()
```

8.5.4.2 Remove Records with Missing Data

Missing data is NULL in SQL (instead of NA).

We can explicitly remove records with missing values with the clause WHERE field IS NOT NULL.

```{sql}
#| connection: con
SELECT "flight", "dep_delay" 
FROM flights
WHERE "dep_delay" IS NOT NULL;
```
Displaying records 1 - 10
flight dep_delay
1545 2
1714 4
1141 2
725 -1
461 -6
1696 -4
507 -5
5708 -3
79 -3
301 -2
  • R equivalent
```{r}
#| eval: false
flights |>
  select(flight, dep_delay) |>
  filter(!is.na(dep_delay))
```

Just use IS if you want only the records that are missing data.

```{sql}
#| connection: con
SELECT "flight", "dep_delay" 
FROM flights
WHERE "dep_delay" IS NULL;
```
Displaying records 1 - 10
flight dep_delay
4308 NA
791 NA
1925 NA
125 NA
4352 NA
4406 NA
4434 NA
4935 NA
3849 NA
133 NA

8.5.4.3 Restrict the Number of Records Returned

When you are building a query, you often want to subset the rows while you are finishing it (you don’t want to return the whole table each time you are trouble shooting a query).

Use LIMIT output modifier at the end of the query to restrict the number of records that are returned.

  • Some flavors of SQL allow SELECT TOP and SELECT TOP PERCENT statements to restrict to the top number/percentage of records.
```{sql}
#| connection: con
SELECT "flight", "origin", "dest" 
FROM flights
LIMIT 5;
```
5 records
flight origin dest
1545 EWR IAH
1714 LGA IAH
1141 JFK MIA
725 JFK BQN
461 LGA ATL

If you want a limited number of records but you not just the first records, use the OFFSET keyword to identify how may records to skip before starting to collect the records.

```{sql}
#| connection: con
SELECT "flight", "origin", "dest" 
FROM flights
LIMIT 5
  OFFSET 4;
```
5 records
flight origin dest
461 LGA ATL
1696 EWR ORD
507 EWR FLL
5708 LGA IAD
79 JFK MCO

8.5.4.4 Return a Random Sample of Records

You can also randomly sample rows with the clause USING SAMPLE.

```{sql}
#| connection: con
SELECT "flight", "origin", "dest" 
FROM flights
USING SAMPLE 5 ROWS;
```
5 records
flight origin dest
703 JFK LAX
4150 EWR CVG
4191 EWR BNA
1791 EWR HOU
2171 LGA DCA

Depending upon your version of SQL there may be an modifier to SAMPLE that allows you to set the random number seed for the sample so it is repeatable.

DuckDB has multiple sampling methods. The following uses bernoulli sampling to get a small number of rows for a small percentage (just to avoid many rows in the output.)

  • Note that repeating the code above changes the selected rows whereas repeating the following does not.
```{sql}
#| connection: con
SELECT "flight", "origin", "dest" 
FROM flights
USING SAMPLE .0027% (bernoulli, 377);
```
5 records
flight origin dest
1164 EWR FLL
1160 LGA MIA
1281 EWR CLT
77 JFK FLL
1733 LGA CLT
Tip

When working with tables with more than a few hundred records, recommend using one of these methods when initially building your query to speed up the return of results, especially if the connection is over a network.

Remove the restrictions when the query is ready for use on the full table.

8.5.5 Arrange Rows

Use the ORDER BY clause to sort the records that are returned.

  • The default order is ascending.

Let’s remove missing values so we can see the ordering.

```{sql}
#| connection: con
SELECT "flight", "dep_delay" 
FROM flights
WHERE "dep_delay" IS NOT NULL
ORDER BY "dep_delay"
LIMIT 20;
```
Displaying records 1 - 10
flight dep_delay
97 -43
1715 -33
5713 -32
1435 -30
837 -27
3478 -26
4361 -25
4573 -25
2223 -24
3318 -24
  • R equivalent
```{r}
#| eval: false
flights |>
  select(flight, dep_delay) |>
  filter(!is.na(dep_delay)) |>
  arrange(dep_delay) |> 
  slice_head(n = 20)
```

Use the modifier DESC after the field name to arrange in descending order

```{sql}
#| connection: con
SELECT "flight", "dep_delay" 
FROM flights
WHERE "dep_delay" IS NOT NULL
ORDER BY "dep_delay" DESC;
```
Displaying records 1 - 10
flight dep_delay
51 1301
3535 1137
3695 1126
177 1014
3075 1005
2391 960
2119 911
2007 899
2047 898
172 896
  • R equivalent
```{r}
#| eval: false
flights |>
  select(flight, dep_delay) |>
  filter(!is.na(dep_delay)) |>
  arrange(desc(dep_delay))
```

You break ties by adding more variables in the ORDER BY clause.

```{sql}
#| connection: con
SELECT "flight", "origin", "dep_delay" 
FROM flights
WHERE "dep_delay" IS NOT NULL
ORDER BY "origin" DESC, "dep_delay"
LIMIT 20;
```
Displaying records 1 - 10
flight origin dep_delay
1715 LGA -33
5713 LGA -32
1435 LGA -30
837 LGA -27
3478 LGA -26
4573 LGA -25
2223 LGA -24
375 LGA -24
4065 LGA -24
1371 LGA -23
  • R equivalent
```{r}
#| eval: false
flights |>
  select(flight, origin, dep_delay) |>
  filter(!is.na(dep_delay)) |>
  arrange(desc(origin), dep_delay) |> 
  slice_head(n = 20)
```

8.5.6 Create, Transform, or Mutate Fields

In SQL, you can create, transform, or mutate the fields that will appear in a returned record-set based on existing fields in the table.

  • You are not updating the original table; you are creating an alias for a new temporary field that will disappear after the query is run.

Use the keyword AS, inside a SELECT statement, to specify the alias (new name for the new field)

```{sql}
#| eval: false
#| connection: con
SELECT <expression> AS <myvariable> 
FROM <mytable>;
```

Let’s calculate average speed from the flights table. We’ll also keep the flight number, distance, and air time variables.

```{sql}
#| connection: con
SELECT
    "flight",
    "distance" / "air_time" AS "speed",
    "distance",
    "air_time"
FROM
    flights
LIMIT 10;
```
Displaying records 1 - 10
flight speed distance air_time
1545 6.167401 1400 227
1714 6.237885 1416 227
1141 6.806250 1089 160
725 8.612022 1576 183
461 6.568966 762 116
1696 4.793333 719 150
507 6.740506 1065 158
5708 4.320755 229 53
79 6.742857 944 140
301 5.311594 733 138
  • R equivalent:
```{r}
#| eval: false
flights |>
  select(flight, distance, air_time) |>
  mutate(speed = distance / air_time) |> 
  slice_head(n = 10)
```

SQL has many functions for transforming values. Common ones include:

  • LN(): Natural log transformation.
  • EXP(): Exponentiation.
  • SQRT(): Square root.
  • POW(): Power transformation.
  • POW(2.0, x) would be \(2^x\)
  • POW(x, 2.0) would be \(x^2\)

Beware of integer division during transformations. Integer division will round results to integers.

```{sql}
#| connection: con
SELECT
    DISTINCT "month",
    "day",
    "day" // "month" AS "ratio"
FROM
    flights
WHERE
    "month" >= 5
ORDER BY
    "month",
    "day";
```
Displaying records 1 - 10
month day ratio
5 1 0
5 2 0
5 3 0
5 4 0
5 5 1
5 6 1
5 7 1
5 8 1
5 9 1
5 10 2

Use the CAST() function to explicitly convert an integer to a double before operations that could produce doubles.

```{sql}
#| connection: con
SELECT
    DISTINCT "month",
    "day",
    CAST("day" AS DOUBLE) / CAST("month" AS DOUBLE) AS "ratio"
FROM
    flights
WHERE
    "month" >= 5
ORDER BY
    "month",
    "day";
```
Displaying records 1 - 10
month day ratio
5 1 0.2
5 2 0.4
5 3 0.6
5 4 0.8
5 5 1.0
5 6 1.2
5 7 1.4
5 8 1.6
5 9 1.8
5 10 2.0
  • Note: DuckDB has two different division operators that operate differently if both numbers are integers.
    • The single slash, / ,will convert to floating point numbers for the division.
    • The double slash, //, will use integer division.
    • See DuckDB Numeric Operators.

8.5.7 Summarize Fields with Aggregate Functions

SQL has functions for summarizing data (SQL calls these “Aggregates”).

Aggregate functions must be inside a SELECT statement which identifies the fields of interest.

  • Aggregates are similar to the functions used inside a summarize() in R.
  • They calculate the results for the selected fields of the Fields and aggregate (collapse) the records to a single record.
  • By default, all records with missing data (NULL) are ignored (like setting na.rm = TRUE) and there is no warning or message.

Some of the common aggregates are:

  • COUNT(): Count the number of rows.
  • AVG(): Calculate average.
  • MEDIAN(): Median (not standard across all DBMS’s).
  • SUM(): Summation.
  • MIN(): Minimum.
  • MAX(): Maximum.
  • STDDEV(): Standard deviation.
  • VARIANCE(): Variance
  • CORR(y, x): Correlation between x and y.

Let’s calculate the average and standard deviation for departure delays across all records in the table.

```{sql}
#| connection: con
SELECT
    AVG("dep_delay"),
    STDDEV("dep_delay") AS "sd_dep_delay",
    COUNT("dep_delay") as "num_records"
FROM
    flights;
```
1 records
avg(dep_delay) sd_dep_delay num_records
12.63907 40.21006 328521
  • Note only one record is returned with fields identified by default with the function call, but you can create aliases using the AS statement.

  • Note how SQL automatically drops records with a NULL value before counting the records.

  • R equivalent:

```{r}
#| eval: true
flights |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE), 
            sd_dep_delay = sd(dep_delay, na.rm = TRUE),
            num_records = n())

flights |>
  filter(!is.na(dep_delay)) |> 
  summarize(avg_dep_delay = mean(dep_delay), 
            sd_dep_delay = sd(dep_delay),
            num_records = n())
```
# A tibble: 1 × 3
  avg_dep_delay sd_dep_delay num_records
          <dbl>        <dbl>       <int>
1          12.6         40.2      336776
# A tibble: 1 × 3
  avg_dep_delay sd_dep_delay num_records
          <dbl>        <dbl>       <int>
1          12.6         40.2      328521
  • Note how R counts the number of records that went into the summary, before the records were removed by is.na().
  • You need to filter the data in R to remove the NAs before sendint to the summarize() to get the same count.

8.5.7.1 Use the GROUP BY statement to Aggregate results across Groups of Records.

GROUP BY identifies one or more fields by which to aggregate the records.

  • This is similar (but not exactly the same) as using summarize() inside a group_by() in R.

Let’s summarize the average and standard deviation of the departure delays by each origin and month.

```{sql}
#| connection: con
SELECT
    "origin",
    "month",
    AVG("dep_delay") AS "avg_dep_delay",
    STDDEV("dep_delay") AS "sd_dep_delay"
FROM
    flights
GROUP BY
    "origin",
    "month"
ORDER BY
  "origin",
  "month";
```
Displaying records 1 - 10
origin month avg_dep_delay sd_dep_delay
EWR 1 14.905748 40.80044
EWR 2 13.067263 37.15890
EWR 3 18.102457 44.09987
EWR 4 17.400058 43.86926
EWR 5 15.366141 38.95333
EWR 6 22.470810 50.78085
EWR 7 22.035112 49.50592
EWR 8 13.493714 37.60204
EWR 9 7.290953 34.98619
EWR 10 8.636436 32.73880
  • Now one record is returned for each of the 36 groups.
  • Note the order of the fields origin and month is based on the order of the SELECT, not the GROUP BY.

Change the order of fields in the SELECT.

```{sql}
#| connection: con
SELECT
    "month",
    "origin",
    AVG("dep_delay") AS "avg_dep_delay",
    STDDEV("dep_delay") AS "sd_dep_delay"
FROM
    flights
GROUP BY
    "origin",
    "month"
ORDER BY
  "origin",
  "month";
```
Displaying records 1 - 10
month origin avg_dep_delay sd_dep_delay
1 EWR 14.905748 40.80044
2 EWR 13.067263 37.15890
3 EWR 18.102457 44.09987
4 EWR 17.400058 43.86926
5 EWR 15.366141 38.95333
6 EWR 22.470810 50.78085
7 EWR 22.035112 49.50592
8 EWR 13.493714 37.60204
9 EWR 7.290953 34.98619
10 EWR 8.636436 32.73880

Changing the order of the fields in the GROUP BY does not change sequence of the fields or the results.

```{sql}
#| connection: con
SELECT
    "month",
    "origin",
    AVG("dep_delay") AS "avg_dep_delay",
    STDDEV("dep_delay") AS "sd_dep_delay"
FROM
    flights
GROUP BY
    "month",
    "origin"
ORDER BY
  "origin",
  "month";
```
Displaying records 1 - 10
month origin avg_dep_delay sd_dep_delay
1 EWR 14.905748 40.80044
2 EWR 13.067263 37.15890
3 EWR 18.102457 44.09987
4 EWR 17.400058 43.86926
5 EWR 15.366141 38.95333
6 EWR 22.470810 50.78085
7 EWR 22.035112 49.50592
8 EWR 13.493714 37.60204
9 EWR 7.290953 34.98619
10 EWR 8.636436 32.73880
  • R equivalent
```{r}
#| eval: true
flights |>
  select(origin, month, dep_delay) |>
  group_by(origin, month) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE), 
            sd_dep_delay = sd(dep_delay, na.rm = TRUE),
            .groups = "drop")
```
# A tibble: 36 × 4
   origin month avg_dep_delay sd_dep_delay
   <chr>  <int>         <dbl>        <dbl>
 1 EWR        1         14.9          40.8
 2 EWR        2         13.1          37.2
 3 EWR        3         18.1          44.1
 4 EWR        4         17.4          43.9
 5 EWR        5         15.4          39.0
 6 EWR        6         22.5          50.8
 7 EWR        7         22.0          49.5
 8 EWR        8         13.5          37.6
 9 EWR        9          7.29         35.0
10 EWR       10          8.64         32.7
# ℹ 26 more rows
  • Note the order of the origin and month.

  • Changing the order in the select() does not matter in R.

```{r}
flights |>
  select(month, origin, dep_delay) |>
  group_by(origin, month) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE), 
            sd_dep_delay = sd(dep_delay, na.rm = TRUE),
            .groups = "drop")
```
# A tibble: 36 × 4
   origin month avg_dep_delay sd_dep_delay
   <chr>  <int>         <dbl>        <dbl>
 1 EWR        1         14.9          40.8
 2 EWR        2         13.1          37.2
 3 EWR        3         18.1          44.1
 4 EWR        4         17.4          43.9
 5 EWR        5         15.4          39.0
 6 EWR        6         22.5          50.8
 7 EWR        7         22.0          49.5
 8 EWR        8         13.5          37.6
 9 EWR        9          7.29         35.0
10 EWR       10          8.64         32.7
# ℹ 26 more rows

Changing the order in the group_by() does matter in how the fields are arranged by default.

```{r}
flights |>
  select(month, origin, dep_delay) |>
  group_by(month, origin) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE), 
            sd_dep_delay = sd(dep_delay, na.rm = TRUE),
            .groups = "drop")
```
# A tibble: 36 × 4
   month origin avg_dep_delay sd_dep_delay
   <int> <chr>          <dbl>        <dbl>
 1     1 EWR            14.9          40.8
 2     1 JFK             8.62         36.0
 3     1 LGA             5.64         29.7
 4     2 EWR            13.1          37.2
 5     2 JFK            11.8          37.4
 6     2 LGA             6.96         33.4
 7     3 EWR            18.1          44.1
 8     3 JFK            10.7          35.3
 9     3 LGA            10.2          39.7
10     4 EWR            17.4          43.9
# ℹ 26 more rows

8.5.8 Window Functions are for Creating Grouped Summaries without Aggregating the Records

There may be times when you want to add a grouped summary to the data without collapsing the data into one record for each groups.

  • As an example, you want to calculate the total departure delay time for each destination so you can use it to then calculate the percentage of that departure delay time for each airline flying to that destination.
  • You could do that with a summarized data frame and then a join.
  • In R, this can be done by using group_by() and then using mutate() instead of summarize() to add a new column to the data frame with the result for each group while preserving all of the rows.

In SQL, you use the aggregate function you want in a what that SQL calls a Window Function.

  • A Window function uses an aggregate function to calculate a result for one or more fields (across a specified set of records) and it adds the results as new fields in the result set (table), but it does Not collapse the records, so the result set has all the original un-aggregated records with the new aggregated results (for each set of records).
  • All the records for the given set of specified records have the same values in the new aggregated fields.

To use an aggregate function as a window function, use the OVER clause with a (PARTITION BY myfield) modifier.

  • The PARTITION BY clause is a sub-clause of the OVER clause. The partition by breaks up (partitions) the table into separate tables based one or more chosen fields. This is similar to creating a grouped data frame in R with one or more groups.
  • Then, the aggregate function is applied to the records in each partition and a new field is added to the records in that partition with the aggregated value.

Let’s find the flight numbers for the longest flights (in terms of air time) from each airport (origin).

  • We will specify origin as the partitioning (group by) field.
  • We will specify the alias amax as the name of the new field added to the table.
```{sql}
#| connection: con
SELECT
    "flight",
    "origin",
    "air_time",
    MAX("air_time") 
    OVER (PARTITION BY "origin") AS "amax"
FROM
    flights
LIMIT 10 OFFSET 120827;
```
Displaying records 1 - 10
flight origin air_time amax
1100 EWR 324 695
449 EWR 82 695
4382 EWR 87 695
1567 EWR 119 695
1037 EWR 330 695
1266 EWR 301 695
1148 EWR 169 695
4208 EWR 165 695
488 EWR 213 695
580 EWR 192 695

This approach can be useful as a sub-query (or inner nested query) inside the outer query FROM clause.

Here is how you find the flight numbers and destination for the longest flights (in terms of air time) from each airport after using the sub-query to find the longest air time from each origin.

```{sql}
#| label: sub-query-1
#| connection: con
SELECT
    "flight",
    "origin",
    "dest",
    "air_time"
FROM
    (
    SELECT
        "flight",
        "origin",
        "dest",
        "air_time",
        MAX("air_time") OVER (PARTITION BY "origin") AS "amax"
    FROM
        flights
)
WHERE
    "air_time" = "amax"
```
3 records
flight origin dest air_time
51 JFK HNL 691
15 EWR HNL 695
745 LGA DEN 331
  • Note that all fields used in the outer query must be returned by the inner query

8.5.9 Recode Field Values

SQL has a CASE expression you can use with a WHEN clause to recode the values in the records in a field.

  • Like dplyr::case_when(), you use the WHEN clause to specify a logical expression with the THEN keyword specifying the new value.
  • CASE will stop checking at the first TRUE.

Use the CASE WHEN syntax to recode values.

```{sql}
#| label: case-when
#| connection: con

SELECT
    "flight",
    "origin",
    CASE
        WHEN ("origin" = 'JFK') THEN 'John F. Kennedy'
        WHEN ("origin" = 'LGA') THEN 'LaGaurdia'
        WHEN ("origin" = 'EWR') THEN 'Newark Liberty'
        ELSE 'WHO KNOWS'
    END 
AS "orig_long"
FROM
    flights
LIMIT 10;
```
Displaying records 1 - 10
flight origin orig_long
1545 EWR Newark Liberty
1714 LGA LaGaurdia
1141 JFK John F. Kennedy
725 JFK John F. Kennedy
461 LGA LaGaurdia
1696 EWR Newark Liberty
507 EWR Newark Liberty
5708 LGA LaGaurdia
79 JFK John F. Kennedy
301 LGA LaGaurdia
  • R equivalent using either case_when() or case_match().
```{r}
#| eval: false
flights |>
  select(flight, origin) |>
  mutate(orig_long = case_when(
    origin == "JFK" ~ "John F. Kennedy",
    origin == "LGA" ~ "LaGuardia",
    origin == "EWR" ~ "Newark Liberty",
    .default = "other"
  ))

## or for value matching
flights |>
  select(flight, origin) |>
  mutate(orig_long = case_match(
    origin,
    "JFK" ~ "John F. Kennedy",
    "LGA" ~ "LaGuardia",
    "EWR" ~"Newark Liberty",
    .default = "other"
  ))
```

You can also use CASE WHEN to recode based on other logical operations.

```{sql}
#| connection: con
SELECT
    "flight",
    "air_time",
    CASE
        WHEN ("air_time" > 2500) THEN 'Long'
        WHEN ("air_time" <= 2500) THEN 'Short'
    END 
AS "qual_dist"
FROM
    flights;
```
Displaying records 1 - 10
flight air_time qual_dist
1545 227 Short
1714 227 Short
1141 160 Short
725 183 Short
461 116 Short
1696 150 Short
507 158 Short
5708 53 Short
79 140 Short
301 138 Short
  • R equivalent.
```{r}
#| eval: false
flights |>
  select(flight, air_time) |>
  mutate(qual_dist = case_when(
air_time > 2500 ~ "Long",
air_time <= 2500 ~ "Short"
  ))

## or
flights |>
  select(flight, air_time) |>
  mutate(qual_dist = if_else(air_time > 2500, "Long", "Short"))
```

8.5.10 Joining Tables

To join tables, use a SELECT statement and write out all of the columns in both tables you are joining.

  • If the tables share one or more column names, you need to distinguish between the two fields via table1."var" or table2."var".

Use the LEFT JOIN clause to declare a left join, and ON to declare the keys.

```{sql}
#| connection: con
-- flight is from the flights table
-- type is from the planes table
-- both tables have a tailnum column, so we need to tell them apart
-- if you list both tailnums in SELECT, you'll get two tailnum columns
SELECT
    "flight",
    flights."tailnum",
    "type"
FROM
    flights
JOIN planes
    ON
    flights."tailnum" = planes."tailnum";
```
Displaying records 1 - 10
flight tailnum type
569 N846UA Fixed wing multi engine
4424 N19966 Fixed wing multi engine
684 N809UA Fixed wing multi engine
1279 N328NB Fixed wing multi engine
1691 N34137 Fixed wing multi engine
1447 N117UW Fixed wing multi engine
583 N632JB Fixed wing multi engine
3574 N790SW Fixed wing multi engine
3351 N711MQ Fixed wing multi engine
303 N502UA Fixed wing multi engine
  • R equivalent:
```{r}
planes |>
  select(tailnum, type) ->
planes2
flights |>
  select(flight, tailnum) |>
  left_join(planes2, by = "tailnum")
```
# A tibble: 336,776 × 3
   flight tailnum type                   
    <int> <chr>   <chr>                  
 1   1545 N14228  Fixed wing multi engine
 2   1714 N24211  Fixed wing multi engine
 3   1141 N619AA  Fixed wing multi engine
 4    725 N804JB  Fixed wing multi engine
 5    461 N668DN  Fixed wing multi engine
 6   1696 N39463  Fixed wing multi engine
 7    507 N516JB  Fixed wing multi engine
 8   5708 N829AS  Fixed wing multi engine
 9     79 N593JB  Fixed wing multi engine
10    301 N3ALAA  <NA>                   
# ℹ 336,766 more rows

The other joins are:

  • RIGHT JOIN
  • INNER JOIN
  • FULL OUTER JOIN, which you should avoid using if possible.

8.5.11 Subqueries

SQL has the capability to allow you to embed or nest queries within queries. There are several varieties as you can see in How to write subqueries in SQL.

  • As an example, if you need data from a “table” that does not exist, you can use a query to create the new table (virtually) and insert it into the FROM clause of an outer query.
  • As stated in R for Data Science (2ed), “A subquery is just a query used as a data source in the FROM clause, instead of the usual table.”
Important

To delineate the query as a subquery you put it inside parentheses (...). This is similar to R to require the subquery to be executed before other parts of the query can be evaluated.

If using the results in other parts of a complicated query, you must also assign an “alias” for the result. This is similar to using a pronoun in R piping. It represents the returned object from the subquery.

A silly example of using a sub query to create a table with just the JFK flights.

  • Note this does not assign an alias as we are not using the results elsewhere.
```{sql}
#| connection: con
SELECT
    "flight",
    "tailnum",
    "origin"
FROM
    (SELECT *
    FROM flights
    WHERE "origin" = 'JFK' 
    LIMIT 20)
```
Displaying records 1 - 10
flight tailnum origin
1141 N619AA JFK
725 N804JB JFK
79 N593JB JFK
49 N793JB JFK
71 N657JB JFK
194 N29129 JFK
1806 N708JB JFK
1743 N3739P JFK
303 N532UA JFK
135 N635JB JFK

Subqueries can also be very useful when combined with JOINS.

```{sql}
#| connection: con
SELECT "flight",
    fli."tailnum",
    "origin",
    pla."manufacturer"
FROM flights
AS fli
JOIN
    (SELECT *
    FROM planes
    WHERE "manufacturer" = 'AIRBUS INDUSTRIE')
AS pla
ON
fli."tailnum"=pla."tailnum"
WHERE fli."origin"='JFK'
LIMIT 20
```
Displaying records 1 - 10
flight tailnum origin manufacturer
1443 N742PS JFK AIRBUS INDUSTRIE
83 N520JB JFK AIRBUS INDUSTRIE
1167 N333NW JFK AIRBUS INDUSTRIE
74 N516JB JFK AIRBUS INDUSTRIE
179 N535JB JFK AIRBUS INDUSTRIE
711 N503JB JFK AIRBUS INDUSTRIE
2190 N356NW JFK AIRBUS INDUSTRIE
21 N510JB JFK AIRBUS INDUSTRIE
35 N516JB JFK AIRBUS INDUSTRIE
112 N517JB JFK AIRBUS INDUSTRIE

8.5.12 Reshaping Data

Recall {tidyr} has several verbs for reshaping rectangular data: pivot_wider, pivot_longer, separate_*, and unite.

Reshaping is rarely done using SQL as people use SQL to query a subset of data they can then reshape much easier using other methods (R, Python, Excel, etc).

You can create a series of CASE WHEN statements to custom make an equivalent to a pivot.

If you are interested in reshaping, especially with an intent to aggregate the data, the various flavors have different approaches.

As an example, SQL Server has the statements PIVOT and UNPIVOT. See FROM - Using PIVOT and UNPIVOT.

8.6 Working with Tables

8.6.1 Creating Tables

You can create new tables using SQL.

Let’s create a new, temporary, connection we can use to create tables in a new Back-end database.

```{r}
#| label: create-tmpcon
tmpcon <- dbConnect(duckdb())
```

Use CREATE TABLE to declare new tables and INSERT INTO to insert new records into that table.

  • Note the semicolon ; after each SQL statement in the same code chunk.
```{sql}
#| connection: tmpcon
CREATE TABLE table4a (
  "country" VARCHAR,
  "1999"   INTEGER,
  "2000"   INTEGER
);

INSERT INTO table4a ("country", "1999", "2000") 
VALUES
  ('Afghanistan',745,   2666),
  ('Brazil' ,  37737,  80488),
  ('China'  , 212258, 213766);
```

DBI::dbWriteTable() will add a table to a connection from R.

Let’s write multiple data frames from {tidyr} into new tables in our tmpcon database.

```{r}
#| label: write-tidyr-tables
data("table1", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table1", value = table1)

data("table2", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table2", value = table2)

data("table3", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table3", value = table3)

data("table4b", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table4b", value = table4b)

data("table5", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table5", value = table5)
```

Here they are.

```{sql}
#| connection: tmpcon
DESCRIBE;
```
6 records
database schema name column_names column_types temporary
memory main table1 country , year , cases , population VARCHAR, DOUBLE , DOUBLE , DOUBLE FALSE
memory main table2 country, year , type , count VARCHAR, DOUBLE , VARCHAR, DOUBLE FALSE
memory main table3 country, year , rate VARCHAR, DOUBLE , VARCHAR FALSE
memory main table4a country, 1999 , 2000 VARCHAR, INTEGER, INTEGER FALSE
memory main table4b country, 1999 , 2000 VARCHAR, DOUBLE , DOUBLE FALSE
memory main table5 country, century, year , rate VARCHAR, VARCHAR, VARCHAR, VARCHAR FALSE

8.6.2 Writing Tables to a CSV File

Use COPY TO to write the outputs of a SQL query to a CSV file (using a relative path).

```{sql}
#| eval: false
COPY
(
  Your SQL query goes here
)
TO 'myfile.csv' (HEADER, DELIMITER ',');
```

Let’s write some group summaries to a CSV file.

```{sql}
#| connection: con
#| eval: false
COPY
(
  SELECT "origin", AVG("dep_delay")
  FROM flights
  GROUP BY "origin"
)
TO './output/summaries.csv' (HEADER, DELIMITER ',');
```

This is what the resulting file looks like:

origin,avg(dep_delay)
LGA,10.3468756464944
EWR,15.10795435218885
JFK,12.112159099217665

8.7 Calling SQL from R

Use DBI::dbGetQuery() to run SQL code in R and obtain the result as a data frame.

```{r}
#| label: dbgetquery
planes2 <- DBI::dbGetQuery(conn = con, 
                           statement = "SELECT * FROM planes")
glimpse(planes2)
```
Rows: 3,322
Columns: 9
$ tailnum      <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW…
$ year         <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199…
$ type         <chr> "Fixed wing multi engine", "Fixed wing multi engine", "Fi…
$ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU…
$ model        <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145…
$ engines      <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ seats        <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5…
$ speed        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ engine       <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb…

Often, folks have SQL written in a separate file (that ends in .sql).

  • Here is what is in “query.sql”.
-- Just gets some planes
SELECT * FROM planes;

If you want to load the results of a SQL query in R (saved in query.sql), use readr::read_file() inside DBI::dbGetQuery().

```{r}
mydf <- DBI::dbGetQuery(conn = con, statement = read_file("./sql/query.sql"))
glimpse(mydf)
```
Rows: 3,322
Columns: 9
$ tailnum      <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW…
$ year         <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199…
$ type         <chr> "Fixed wing multi engine", "Fixed wing multi engine", "Fi…
$ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU…
$ model        <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145…
$ engines      <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ seats        <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5…
$ speed        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ engine       <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb…

8.8 Closing the Connection

After you are finished working with a database, you should close down your connection:

```{r}
#| label: close-connection
DBI::dbDisconnect(con, shutdown = TRUE)
DBI::dbDisconnect(tmpcon, shutdown = TRUE)
```

8.9 Basic SQL Exercises

Download the file starwars.duckdb and save to a data directory.

  • It’s the data from the {starwarsdb} as a DuckDB database.

Use only SQL to answer the following questions.

  1. Download these data and open a connection to the database.
Show code
library(DBI)
library(duckdb)
starcon <- dbConnect(duckdb(dbdir = "./data/starwars.duckdb", read_only = TRUE))
  1. Print out a summary of the tables in this database.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
DESCRIBE;
```
  1. Select the entire people table.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
SELECT * 
  FROM people;
```
  1. Select just the name, height, mass, and species variables from the people table.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
SELECT
    "name",
    "height",
    "mass",
    "species"
FROM
    people;
```
  1. Add to the above query by selecting only the humans and droids.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
SELECT "name", "height", "mass", "species"
FROM people
WHERE "species" = 'Human' OR "species" = 'Droid';
```
  1. Remove the individuals with missing mass data from the above query.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
SELECT
    "name",
    "height",
    "mass",
    "species"
FROM
    people
WHERE
    ("species" = 'Human'
        OR "species" = 'Droid')
    AND "mass" IS NOT NULL;
```
  1. Modify the above query to calculate the average height and mass for humans and droids.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
SELECT "species", AVG("height") AS "height", AVG("mass") AS "mass"
  FROM people
  WHERE ("species" = 'Human' OR "species" = 'Droid') AND "mass" IS NOT NULL
  GROUP BY "species";
```
  1. Make sure Droids are in the first row and Humans are in the second row in the above summary output.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
SELECT
    "species",
    AVG("height") AS "height",
    AVG("mass") AS "mass"
FROM
    people
WHERE
    ("species" = 'Human'
        OR "species" = 'Droid')
    AND "mass" IS NOT NULL
GROUP BY
    "species"
ORDER BY
    "species";
```
  1. Here is the summary of the keys for the database from the {starwarsdb} GitHub page.

E-R diagram for the {starwarsdb} Database.

Select all films with characters whose homeworld is Kamino.

Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true

FROM (
SELECT
    DISTINCT title AS title
FROM
    people,
    films_people,
WHERE
    people."homeworld" = 'Kamino'
    AND people."name" = films_people."character"
GROUP BY
    "title"
) LHS
INNER JOIN films
ON
LHS."title" = films."title";
```
  1. Filter the people table to only contain humans from Tatooine and export the result to a CSV file called “folks.csv”.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
COPY
(
  SELECT *
  FROM people
  WHERE "homeworld" = 'Tatooine' AND "species" = 'Human'
)
TO './output/folks.csv' (HEADER, DELIMITER ',');
```
  1. Close the SQL connection.
Show code
```{r}
#| connection: starcon
#| eval: false
#| code-fold: true
dbDisconnect(conn = starcon, shutdown = TRUE)
```

8.10 Integrating SQL with R

Section 8.4 demonstrated using SQL inside an R environment using the {DBI} and {duckdb} packages.

All of the SQL was run in a SQL code chunk in a Quarto document.

  • We manually converted R code chunks to SQL and inserted the connection name.

  • All of the SQL code chunks were interactive and they did not affect the R environment.

Section 8.7 showed how one could run SQL scripts (.sql files) in R and use the DBI::dbGetQuery() function and assign an R name to the result-set of a SQL query.

This section will demonstrate some shortcuts to both creating SQL code chunks and then getting access to the result-sets as R objects.

Let’s create a connection to the flights.duckdb database.

```{r}
#| message: false
library(tidyverse)
library(DBI)
library(duckdb)
```
```{r}
con <- dbConnect(duckdb(dbdir = "./data/flights.duckdb", read_only = TRUE))
```

8.10.1 Create SQL Code Chunks

8.10.1.1 Customize a Keyboard Shortcut

RStudio has a built-in capability to create an SQL code chunk with an R Markdown code chunk option for the connection.

To access it quickly, suggest customizing a keyboard shortcut for it per Customizing Keyboard Shortcuts in the RStudio IDE. Ushey (2023)

Go to the RStudio Main Menu and select Tools/Modify Keyboard Shortcuts.

A pop up pane will appear with many options and the existing keyboard shortcuts.

  • Enter S Q in the search pane and you should see something like Figure 8.3.
Figure 8.3: A custom keyboard shortcut for an SQL code chunk.
  • Click in the Shortcut column to the right of “Insert Chunk S Q L”.

  • Enter the shortcut combination you want. Figure 8.3 shows the Mac shortcut I chose, Ctrl + Cmd + s.

  • Click Apply and your shortcut will be saved. You can use it right away.

Enter the shortcut in an area for R Markdown (not an existing chunk) in a .Rmd or .qmd file and you should get the following (without the con).

  • Make sure you enter in the name of the connection you have created.
```{sql connection=con}

```
Important

You cannot render a document with an SQL code chunk with invalid or empty connection.

You have to have created a valid connection object before the SQL code chunk.

8.10.1.2 Create a Custom Code Snippet

Another option is to create a custom code snippet per Code Snippets in the RStudio IDE. Allaire (2023)

RStudio snippets are templates for code or R Markdown. We used a code snippet to create a bare-bones template for a Shiny app.

To create a snippet for a custom SQL code chunk, go to the RStudio Main Menu and select Tools/Edit Code Snippets.

A pop-up pane should appear.

Select Markdown as you will be using the snippet in a Markdown section. You will see something like Figure 8.4 with the existing Markdown code snippets.

Figure 8.4: A Markdown code snippet for an SQL code chunk.

Figure 8.4 already shows the snippet I saved for creating a custom SSQL code chunk.

You can create your own snippet by following this example.

  • Enter snippet s_name where you enter your name for the snippet instead of s_name. It does not have to be rsql.
  • Enter the following lines and use the tab key not spaces to indent.
  • The ${0} is a placeholder for your SQL code. The 0 means it is the first place your cursor will go.
    • You could replace con with ${0} so you can fill in which connection each time. Then edit the next ${0} on the original line to ${1}.
  • Select Save to save it and then you can use it.

To use a snippet, enter the name, here “rsql”, and then SHIFT+TAB and the code chunk will appear like below (without the other options).

```{sql}
#| connection: con
#| eval: false

```

8.10.2 Other SQL Code Chunk Options

The R Markdown Definitive Guide section on SQL discusses several SQL chunk options you can use to adjust how SQL is interpreted by {knitr} in RStudio documents.

  • max.print = XX will adjust to print XX lines of SQL output to print in the rendered document from the default of XX = 10.
  • tab.cap = "My Caption" will change the default for a caption of how many records are displayed to whatever you want in the rendered document.

These options will work as either R Markdown style options or Quarto style options.

```{sql, connection=con, max.print=5, tab.cap= "R Markdown Options"}
SELECT
    DISTINCT "flight",
    "origin",
    "dest"
FROM
    flights
WHERE
    NOT ("origin" = 'JFK'
        OR "origin" = 'LGA')
    AND dest = 'CMH';
```
R Markdown Options
flight origin dest
5680 EWR CMH
3852 EWR CMH
3854 EWR CMH
4536 EWR CMH
4446 EWR CMH
```{sql}
#| connection: con
#| max.print: 5
#| tab.cap: "Quarto Options"
SELECT
    DISTINCT "flight",
    "origin",
    "dest"
FROM
    flights
WHERE
    NOT ("origin" = 'JFK'
        OR "origin" = 'LGA')
    AND dest = 'CMH';
```
Quarto Options
flight origin dest
4250 EWR CMH
4644 EWR CMH
4625 EWR CMH
4704 EWR CMH
3851 EWR CMH

8.10.3 Creating R Variables from SQL Code

You can use a SQL code chunk option output.var to create an R data frame variable from a SQL result-set.

Let’s create R variables for the tables in the database and the SQL description of the tables in the data base.

  • Again, either style chunk option will work.

Notice that just like assignment in R, the results are implicit and not seen in the output.

  • You do see the new variables in the Global Environment.
```{sql, connection=con, output.var="flights_tables_rm"}
SHOW TABLES
```
```{sql}
#| connection: con
#| output.var: "flights_tables_qo"
SHOW TABLES
```
```{sql connection=con, output.var="flights_desc_rm"}
DESCRIBE
```
```{sql}
#| connection: con
#| output.var: "flights_desc_qo"
DESCRIBE
```
```{r}
flights_desc_rm
class(flights_desc_rm)
```
  database schema     name
1  flights   main airlines
2  flights   main airports
3  flights   main  flights
4  flights   main   planes
5  flights   main  weather
                                                                                                                                                                     column_names
1                                                                                                                                                                   carrier, name
2                                                                                                                                        faa, name, lat, lon, alt, tz, dst, tzone
3 year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier, flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour
4                                                                                                         tailnum, year, type, manufacturer, model, engines, seats, speed, engine
5                                                          origin, year, month, day, hour, temp, dewp, humid, wind_dir, wind_speed, wind_gust, precip, pressure, visib, time_hour
                                                                                                                                                           column_types
1                                                                                                                                                      VARCHAR, VARCHAR
2                                                                                                    VARCHAR, VARCHAR, DOUBLE, DOUBLE, DOUBLE, DOUBLE, VARCHAR, VARCHAR
3 INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, DOUBLE, INTEGER, INTEGER, DOUBLE, VARCHAR, INTEGER, VARCHAR, VARCHAR, VARCHAR, DOUBLE, DOUBLE, DOUBLE, DOUBLE, TIMESTAMP
4                                                                                       VARCHAR, INTEGER, VARCHAR, VARCHAR, VARCHAR, INTEGER, INTEGER, INTEGER, VARCHAR
5                                        VARCHAR, INTEGER, INTEGER, INTEGER, INTEGER, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, TIMESTAMP
  temporary
1     FALSE
2     FALSE
3     FALSE
4     FALSE
5     FALSE
[1] "data.frame"
```{r}
glimpse(flights_desc_qo)
class(flights_desc_qo)
```
Rows: 5
Columns: 6
$ database     <chr> "flights", "flights", "flights", "flights", "flights"
$ schema       <chr> "main", "main", "main", "main", "main"
$ name         <chr> "airlines", "airports", "flights", "planes", "weather"
$ column_names <list> <"carrier", "name">, <"faa", "name", "lat", "lon", "alt",…
$ column_types <list> <"VARCHAR", "VARCHAR">, <"VARCHAR", "VARCHAR", "DOUBLE", …
$ temporary    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE
[1] "data.frame"

Let’s use SQL to get the departure delay for all flights from JFK and LaGuardia to Chicago and then use {ggplot2} to plot.

```{sql}
#| connection: con
#| output.var: "select_flights_df"
SELECT
    DISTINCT "flight",
    "origin",
    "dest",
    "dep_delay"
FROM
    flights
WHERE
     ("origin" = 'JFK'
        OR "origin" = 'LGA')
    AND dest = 'ORD';
```

Check our data.

```{r}
glimpse(select_flights_df)
```
Rows: 3,862
Columns: 4
$ flight    <int> 3525, 343, 371, 329, 544, 329, 1498, 343, 1491, 693, 301, 11…
$ origin    <chr> "JFK", "LGA", "LGA", "LGA", "LGA", "LGA", "LGA", "LGA", "LGA…
$ dest      <chr> "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD…
$ dep_delay <dbl> NA, 97, 50, NA, 16, -4, 14, 5, 41, NA, 12, -3, 7, -8, 54, 3,…

Plot.

```{r}
#| warning: false
select_flights_df |> 
  ggplot(aes(x = origin, y = dep_delay)) +
  geom_boxplot(notch = TRUE) +
  scale_y_log10() +
  labs(title = "Departure Delays for Flights to ORD",
       subtitle = "Log Scale",
       caption = "Data from {nycflights13}.")
```

8.10.4 Using R Variables in SQL Code

You can also use an existing R variable as a value in a SQL query clause.

Create the R Variable and use the name preceded by a ?.

```{r}
my_dest <- 'MCO'
```
```{sql}
#| connection: con
#| output.var: "select_flights_df2"
SELECT
    DISTINCT "flight",
    "origin",
    "dest",
    "dep_delay"
FROM
    flights
WHERE
     ("origin" = 'JFK'
        OR "origin" = 'LGA')
    AND dest = ?my_dest;
```
```{r}
glimpse(select_flights_df2)
```
Rows: 2,638
Columns: 4
$ flight    <int> 2159, 683, 1883, 683, 499, 2285, 1685, 1585, 2285, 1885, 228…
$ origin    <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "LGA", "LGA", "LGA…
$ dest      <chr> "MCO", "MCO", "MCO", "MCO", "MCO", "MCO", "MCO", "MCO", "MCO…
$ dep_delay <dbl> -1, 8, 53, 7, -3, -9, 0, -6, -6, -4, -10, -8, 13, 62, -1, 23…
Warning

Using parameters in SQL queries is common practice.

Use R to manipulate text to create queries as character strings is also common.

However, if you are building something, e.g., a Shiny app, where you are allowing others to input text that will feed into a query, you have to protect what you are doing.

You are liable to be victim of a SQL Injection Attack which can have bad consequences.

For R-specific approaches to mitigate the risk, see Run Queries Safely Posit (2023c) from the Posit Best Practices in working with Databases. Posit (2023b)

8.10.5 Setting the Connection Once in a Document

If you are going to use the same connection for all (or most chunks) in a document, you can set the connection as an option in a setup chunk.

Let’s close the current con.

```{r}
DBI::dbDisconnect(con, shutdown = TRUE)
```

Now create a new connection.

```{r}
#| label: setup
#| message: false
library(DBI)
library(duckdb)
con2 <- dbConnect(duckdb(dbdir = "./data/flights.duckdb", read_only = TRUE))
knitr::opts_chunk$set(connection = 'con2')
```

Now run a SQL code chunk without stating the connection.

```{sql}
SHOW TABLES
```
5 records
name
airlines
airports
flights
planes
weather
Note

This document will render, however, it does not work interactively as of August 2023.

That really reduces the utility of the set-up chunk option.

A comment to a related issue (13240) has been submitted on GitHub for RStudio.

I suspect this will be resolved in the near future.

8.10.6 Close the Connection.

To avoid a warning when rendering the document, close the connection at the end.

```{r}
DBI::dbDisconnect(con2, shutdown = TRUE)
```

8.11 Using the {dbplyr} Package to Translate {dplyr} into SQL

The {dbplyr} package allows you to use {dplyr} code and it will translate it into SQL to interact (through an established connection) with a Back-end server.

  • Consider R as a higher-level language you use to access code written in C.
  • The {shiny} package can be considered as providing a higher-level language to generate HTML from {shiny} code.
  • You can consider {dbplyr} as a higher-level package that generates SQL from {dplyr} code.
Important

{dbplyr} works for many flavors of SQL and can generate many SQL expressions.

However it cannot always do everything. For more details see the {dbplyr} vignette on Function Translation.

Even if you can’t use {dbplyr} in your situation, you can still use it to generate SQL code you can see and use. That makes it a good tool for learning SQL since you know {dplyr}.

Let’s load {dbplyr} and {tidyverse} along with {DBI} and {duckdb}.

```{r}
#| message: false
library(tidyverse)
library(dbplyr)
library(DBI)
library(duckdb)
```

Create the connection to the database.

```{r}
con <- dbConnect(duckdb(dbdir = "./data/flights.duckdb", read_only = TRUE))
```

Retrieve a table from a SQL database using dplyr::tbl().

  • Note you have to specify the source of the data, here the database connection, and the table name.
```{r}
flights2 <- tbl(src = con, "flights")

class(flights2)

names(flights2)
```
[1] "tbl_duckdb_connection" "tbl_dbi"               "tbl_sql"              
[4] "tbl_lazy"              "tbl"                  
[1] "src"        "lazy_query"
  • Note that flights2 is not a dataframe. It’s a list with class tbl_duckdb_connection

Now you can use your tidyverse code on the flights table to generate a query.

```{r}
flights2 |>
  select(flight, origin, dest, dep_delay) |>
  filter(origin == "JFK", dest == "CMH") |>
  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) ->
  my_query

my_query

class(my_query)
```
# Source:   SQL [1 x 1]
# Database: DuckDB v1.0.0 [root@Darwin 23.6.0:R 4.4.1//Users/rressler/Library/CloudStorage/OneDrive-american.edu/Courses/DATA-413-613/lectures_book/data/flights.duckdb]
  dep_delay
      <dbl>
1      22.0
[1] "tbl_duckdb_connection" "tbl_dbi"               "tbl_sql"              
[4] "tbl_lazy"              "tbl"                  

Execute the query by using collect() at the end.

```{r}
flights2 |>
  select(flight, origin, dest, dep_delay) |>
  filter(origin == "JFK", dest == "CMH") |>
  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  collect() ->
  my_result

my_result
class(my_result)
```
# A tibble: 1 × 1
  dep_delay
      <dbl>
1      22.0
[1] "tbl_df"     "tbl"        "data.frame"

To see the SQL code that was generated, use show_query() at the end (this is a good way to learn SQL).

```{r}
flights2 |>
  select(flight, origin, dest, dep_delay) |>
  filter(origin == "JFK", dest == "CMH") |>
  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  show_query()
```
<SQL>
SELECT AVG(dep_delay) AS dep_delay
FROM (
  SELECT flight, origin, dest, dep_delay
  FROM flights
  WHERE (origin = 'JFK') AND (dest = 'CMH')
) q01

After you are done with the database, close the connection.

```{r}
DBI::dbDisconnect(con, shutdown = TRUE)
```

8.12 The DBeaver Integrated Development Environment (IDE)

  • This section based on work by Professor David Gerard.

There is a free, separate Front-end for the DuckDB database called “DBeaver Community”.

Per DBeaver community:

DBeaver Community is a free cross-platform database tool for developers, database administrators, analysts, and everyone working with data. It supports all popular SQL databases like MySQL, MariaDB, PostgreSQL, SQLite, Apache Family, and more.

8.12.1 Setup DBeaver Community

The instructions here are only for DuckDB. But DBeaver works with other DBMS’s pretty easily.

Download and install DBeaver Community here: https://dbeaver.io/download/.

When you open it up, select Database Navigator and under All or SQL, select “DuckDB”.

Then select Next.

Select “Host”, and Open to navigate to your flights.duckdb database.

Click on Test Connection.

If Successful, you will get a pop-up window with the results of the test and which will allow you to see the details of the connection.

If not successful, and you haven’t used it before, it may ask you to install a JDBC-compliant driver to use to connect to a DuckDB database.

  • Click Download Driver.
  • After the download, it will run the test and you should get a positive result.

Click “Finish”.

You should now see the database under the Database Navigator.

Right click on the connection and create a new SQL script file.

Type some SQL and hit the execute button or just hit Control + Enter.

You can now play around with SQL.

When you want to save the data to a CSV file, just hit “Export data”.

  • There are a lot of features in DBeaver, but we won’t cover them. They are pretty intuitive. Just play with it for awhile.

See the DBeaver General User Guide for detailed documentation.

8.12.2 Exercises

Use DBeaver for all of these exercises.

  1. What is the mean temperature for flights from each airport?
Show code
```{sql}
#| eval: false
#| code-fold: true
    SELECT "origin", AVG("temp") AS "temp"
    FROM weather 
    GROUP BY "origin";
```
  1. What is the average number of flights from each airport per day in January?
Show code
```{sql}
#| eval: false
#| code-fold: true
    SELECT "origin", COUNT() / 31.0 AS "ave"
    FROM flights 
    WHERE "month" = 1
    GROUP BY "origin";
```
  1. What are the top destinations for each airport?
Show code
```{sql}
#| eval: false
#| code-fold: true
#| label: dbeaver-exer-3
    SELECT "origin", "dest", "n"
    FROM
    (
    SELECT "origin", "dest", "n", MAX("n") OVER (PARTITION BY "origin") AS "nmax"
    FROM
    (
    SELECT "origin", "dest", COUNT() AS "n",
    FROM flights
    GROUP BY "origin", "dest"
    )
    )
    WHERE "n" = "nmax"
```

8.13 Considerations for Working with SQL and Databases

8.13.1 Review Documentation about the database

When working with a new database, or one that is often changing, seek out and review the available documentation.

Data Dictionaries, schema descriptions, and/or E-R diagrams can help you understand the database design, the table descriptions and structure, the fields in each table, the primary keys, and whether tables are indexed.

The documentation can help you see a holistic picture of the data so you know where to look for tables and fields of interest and how to join tables most efficiently.

  • As an example, the database may have a naming convention for tables and fields to help you find the tables and fields you need to create your queries.

8.13.2 Review Documentation about the DBMS and its SQL

Every DBMS is slightly different as developers optimize their code for the goals for the database.

Review documentation on the “flavor” of SQL accepted by the database.

Many flavors have extra features (statements, clauses, keywords, or modifiers) you can use to simplify your code or speed up your queries.

8.13.3 Choose a Front-end IDE that works with the DBMS

SQL is just text so you can write SQL in many environments.

If you are doing a lot of SQL, find a Front-end IDE that has features to facilitate using SQL.

  • As an example, DBeaver has auto-complete for SQL statements while RStudio does not.

Many Front-ends have customized drivers for working with many different Back-end DBMS.

  • So chose a Front End that allows you to work with the different databases in your organization without having to switch.
  • Read the documentation for the Front-end to use it effectively.

8.13.4 Follow a Style Guide

Like tidyverse style, there are style guides to help you make your code easier to read, faster to debug, and more robust to future changes.

If you do nothing else from a style perspective, use line breaks often to break up your code into readable lines where each starts on a statement, clause, or keyword.

Both of the following code chunks are valid SQL.

Streaming SQL example.

```{sql}
#| eval: false
#| connection: con
SELECT "flight", "origin", CASE WHEN ("origin" = 'JFK') THEN 'John F. Kennedy' WHEN ("origin" = 'LGA') THEN 'LaGaurdia' WHEN ("origin" = 'EWR') THEN 'Newark Liberty' ELSE 'WHO KNOWS' END AS "orig_long" FROM flights LIMIT 10;
```

Stylized tidy SQL.

```{sql}
#| label: tidy-sql
#| eval: false
#| connection: con
SELECT
    "flight",
    "origin",
    CASE
        WHEN ("origin" = 'JFK') THEN 'John F. Kennedy'
        WHEN ("origin" = 'LGA') THEN 'LaGaurdia'
        WHEN ("origin" = 'EWR') THEN 'Newark Liberty'
        ELSE 'WHO KNOWS'
    END AS "orig_long"
FROM
    flights
LIMIT 10;
```

Which do you think is easier to read, debug, and maintain?

  • The second chunk was generated by the DBeaver Format SQL command under the Edit menu.

There is no single authoritative style guide so check if your organization has its own.

If it does not, consider the following.

8.13.5 Build-a-little-test-a-little

Writing and debugging SQL queries should proceed in steps just like building and debugging R code and functions.

Don’t write a 30 line query and then start to debug it.

Write short queries and use LIMIT to keep the returns fast.

Write multiple small queries and then integrate them piece by piece.

Consider using sub-queries and Common Table Expressions to combine them as well.

8.13.6 Follow Best Practices for Writing Queries

There are many sources for Best Practices in writing SQL queries to optimize speed.

Common threads appears across best practices:

  1. Shape the query to return only the data you need.
  2. Shape the query to take advantage of the DBMS ability to use indexes for faster performance.
  3. Make your queries easy to read.

The most commonly cited best practice is “Avoid the use of SELECT *”.

Other suggestions include:

  • Use Joins instead of sub-queries inside a WHERE when you are just matching values.
  • Avoid using LIKE with wildcards,%, at the beginning of a string as this means you have to search every record.
  • Use GROUP BY or add additional fields instead of using SELECT DISTINCT on large tables.
  • Use comments -- to explain why you are doing something.
  • Use AS to create aliases to clarify fields from different tables for readability.
  • Avoid sorting, especially for large tables.

For additional recommendations and rationale, consider the following as a starting point.

8.13.7 Consider Tuning Slow Queries

Queries go through three stages when being executed.

  • Syntax Check: The DBMS parser checks each query is syntactically correct.
  • Optimization: The DBMS optimizer examines the query and builds a plan for execution to minimize resource demand and time.
  • Execution: The DBMS runs the optimized plan to return any results.

Thus every query gets optimized but that does not mean the query could not be tuned to be faster.

If you have a query that is run many times, on large tables, and appears slow, then consider tuning it.

The first step is detecting a slow query.

The second step to is to assess how the query is actually being run.

  • DBMS also have a way get to the results of the Query Planner so you can review them to see what is actually happening.
  • Look at the help for EXPLAIN.
  • For DBeaver, review the PostgreSQL Documentation on Using EXPLAIN Group (2023)

Once you understand the execution, then consider if any of the best practices for writing queries will help improve performance.

For more explanation and ideas see SQL Performance Tuning. khushboogoyal499 (2022)

8.13.8 Integrate Your SQL Into Your Version Control Workflow

Regardless of the IDE you use for a Front-end, SQL is still text-based code.

You can put the files into a Git Repo and version control with GitHub.

  • If you are using DBeaver, use the search tool (magnifying glass) and look for “Install new software.”
  • Use the drop down arrow to select the Git option as in Figure 8.5.
Figure 8.5: Installing the DBeaver Extension for Git

For more details see How to Version Control Your SQL.

8.13.9 Consider More Advanced SQL Statments and Methods

These notes have just scratched the surface of a language that has been around for 50 years and is still going strong.

There are many more statements, clauses, and keywords you can use to create more effective SQL to access the data you want, especially for large databases that will be accessed many times.

Some of these are:

  • Views, Sub-queries, and Common Table Expressions
  • Additional clauses and modifiers for WHERE such as {stringr} style functions.
  • Dynamic SQL where you can pass parameters to a stored procedure akin to an R function.

There are many free references on SQL to expand your knowledge.

R for Data Science suggests two non-free books:

8.13.10 Strategize How to Use SQL in Your Analysis Workflow

From a data science perspective SQL is usually a means for getting data out of databases.

  • Data scientists are usually less involved in designing, building, and managing databases or maintaining the data.

Your working environment will often dictate which databases have the data of interest and the associated versions of SQL you must use.

There are SQL wizards who can make SQL dance by building and optimizing queries to the nth degree.

You do not have to be one!

As long as you can get the data you need, you can use the other tools in R and Python to make your analysis sing and dance.

If you have the time and hardware memory to get the data, you can clean and shape it outside of SQL if that works for you.

SQL is an important tool in your Data Science tool box for sure. But it works best when you use it with other tools in it as well.

Tip

If your primary analysis tools are R and RStudio, but you need to get subsets of data from a large database using SQL, consider using a SQL-focused IDE, such as DBeaver, to build your queries and then copying them into a .sql file or an SQL code chunk.

This way you get the advantage of using a SQL-focused IDE while using RStudio for the rest of the analysis workflow.