8  Getting Data from SQL Databases

Published

October 7, 2025

Keywords

database, relational, sql, duckdb, dbeaver, duckplyr

The code chunks for SQL and R have different colors in this chapter.

  • Code chunks for SQL use this color.
  • Code chunks for R use this color.

8.1 Introduction

8.1.1 Learning Outcomes

  • Understand the organization of relational databases.
  • Use Structured Query Language (SQL) to access and manipulate data in SQL databases.
  • Use the {dbplyr} package to access and manipulate data in SQL databases.

8.1.2 References:

8.1.2.1 Other References

8.2 A Brief History of Databases

People and organizations have been collecting data for centuries. As early as the 21st century BCE, people were using clay tablets to make records of farmland and grain production. Records were kept as individual files (clay tablets) that were eventually organized based on a conceptual model of the data. That model was typically a hierarchical set of fixed attributes of the records, e.g., subject, date, author, etc., for tracking records in a filing system.

It wasn’t until the 1960s that computerized “databases” began to appear. Early computerized databases focused on storing and retrieving individual records which were organized based on existing data models built from the attributes of the tables and records. Data were stored in tables with rows of values and columns for attributes. Users could find the desired table and row by navigating through the hierarchy. This model resulted in larger and larger tables with many redundant fields as users tried to integrate data at the table level. As an example, a record of a purchase could include all of the details of the purchase, including all of the data on the buyer and seller such as name, address, and telephone number. Every record of every purchase by a buyer could repeat their name, address, and telephone number.

Database Management Systems (DBMS) were developed as the interface between the system user and the raw data in the records. DBMS used the data model to manage the physical, digital storage of the data and provide the navigation to find the right table and record. Users could now retrieve and update records electronically. However, computing and storage were expensive, so researchers were looking for ways to reduce the redundant storage and inflexible perspective of the data inherent in the hierarchical data model.

In 1970, IBM researcher E.F. Codd published what became a landmark paper, “A Relational Model of Data for Large Shared Data Banks” (Codd 1970). Codd proposed replacing the fixed hierarchy and its large tables with much smaller tables where the data model now focused on defining relationships among the tables instead of navigating across a hierarchy of attributes or parent-child relationships among tables. The relational database model is now ubiquitous for storing extremely large data sets.

A strength and a drawback of the relational database model is the capability to manage not just the data, but all the relationships across the tables of data. This makes relational databases great for working with structured data, where the relationships are known ahead of time and built into the database model. The drawback is it takes time and processing power to enforce those relationships and structure. Thus, other approaches such as Apache HADOOP have been developed to capture and manage the variety, velocity, and volume of data associated with “Big Data” sources, e.g., capturing all the meta data and data associated with billions of transactions a day on a social media site.

For the relational database model to work, it needed a corresponding high-level language for implementation. Users had to be able to create databases and manage the life cycle of their databases and the data they contained. This need led to the development of what were known as “query languages” which eventually evolved into what is now called Structured Query Language or SQL (pronounced as “SEE-quil” or “ESS-q-l”). SQL began with a row-based approach to managing data in tables. Other approaches, e.g., NoSQL systems, developed a column-based approach to managing data. There are trade-offs between the approaches. Many relational database developers create their own (proprietary) versions of SQL, optimized for their database, that incorporate features of both for speed and distributed processing. Apache has developed Apache Hive as a data warehouse system that sits atop HADOOP. “Hive allows users to read, write, and manage petabytes of data using SQL.”

Relational Databases and HADOOP/Hive-like approaches are not competitors as much as different solutions for different requirements.

This section will focus on the relational database methods and systems which use SQL.

8.3 Database Concepts and Terms

8.3.1 Database Tables

Relational databases use Relational DBMS (RDBMS) to keep track of large amounts of structured data in a single system.

A relational database uses rectangular tables, also called entities, to organize its data by rows and columns.

  • Each table is typically designed to contain data on a single type or class of “entity”.

  • An entity can be a physical object such as a person, building, location, or sales order, or a concept such as this class or metadata on an object.

  • Each entity must have properties or attributes of interest.

    • These become the names for the columns in the table.
    • These are also known as the fields for the tables.
  • Each row in the table is also referred to as a record.

  • The intersection of row and column contains the value, the data, for that record for that field.

  • All of the values within a given field should be of the same data type. These include the basic types in R such as character, integer, double, or logical as well as complex objects such as pictures, videos, or other Binary Large Objects.

The tables are similar to R or Python data frames, but with some key differences.

  • Database tables are stored on a drive so are not constrained, like R data frames, to having to fit into memory. They can have many columns and many millions of rows, as long as they fit on the drive.
  • Unlike a data frame, a database field (column) may have a domain which limits the format and range of values in the field in addition to constraining the data type, e.g., the format for a telephone number may be xx-xxx-xxx-xxxx.
  • Database tables are often designed so each record is unique (no duplicate rows) with rules the DBMS uses to enforce the uniqueness.
  • Database tables are almost always indexed so each row has a unique value associated with it (behind the scenes).
    • The index identifies each record from a physical perspective, much like one’s operating system indexes files for faster search and retrieval.
    • The index is what allows a DBMS to find and retrieve records (rows) quickly.

8.3.2 Database Table Keys and Relationships

Most database tables will also be designed with a primary key which is a combination of one or more fields that uniquely identifies each record, from a logical perspective, across the database.

  • As an example, a key may be a person’s social security number which is expected to be unique to one person.

A combined key uses multiple fields to create a key, such as house number, street name, city, state, and zip code.

Many database designers prefer to create a separate surrogate key, that does not use other values and has no meaning, to avoid potential changes in data over time.

The presence of primary keys for each table enables “relationships” between tables to be created and used by the RDBMS.

As an example, suppose an organization is interested in tracking data about purchase orders, the buyers, and the products in each purchase order.

A database designer might create three tables to hold this information:

  • An ORDER table with fields for the order_number, the date of the purchase, the buyer_id, the product_id, the quantity purchased, and the purchase price.
    • The primary key is order_number.
  • A BUYER table with fields for the buyer_id, name, company id, street address, state, postal code, country, telephone number, and email.
    • The primary key is buyer_id.
  • A PRODUCT table with fields for the product_id, name, weight, volume, purchase sizes, default_shipper_id, and standard price.
    • The primary key is product_id.

Each table describes a single object or entity and each has a primary key to uniquely identify each record.

The ORDER table uses the primary keys from the BUYER table and the PRODUCT table to make a relationship with those tables.

The presence of a table’s primary key in a different table is called being a foreign key in the second table.

  • The table with the primary key is known as a Parent table.
  • The table that references the parent table (using the foreign key) is known as the child table.
  • RDBMS use rules (constraints) to preserve data integrity so that someone editing or changing a child table cannot change or use a foreign key field that is not present in the parent table.
  • In this example, there is a foreign key in the BUYER table, company_id, and a foreign key in the PRODUCT table, default_shipper_id.

If you were to put all this data into a single table, the attributes of the buyers and products would be duplicated many times over as additional orders were created.

This example illustrates how relationships allow database designers to create multiple tables to minimize the amount of redundant data in a system.
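As a sketch of how these tables and keys might be declared (hypothetical column types and a trimmed field list; DuckDB SQL run from R with {DBI}, which is covered in Section 8.4):

library(DBI)
library(duckdb)

con_demo <- dbConnect(duckdb()) # ephemeral in-memory database

# Parent tables first, each with a primary key.
dbExecute(con_demo, "
  CREATE TABLE buyer (
    buyer_id   INTEGER PRIMARY KEY,
    name       VARCHAR,
    company_id INTEGER
  );")
dbExecute(con_demo, "
  CREATE TABLE product (
    product_id     INTEGER PRIMARY KEY,
    name           VARCHAR,
    standard_price DOUBLE
  );")

# Child table: ORDER is a reserved word, so this sketch names it orders.
# REFERENCES declares the foreign keys the RDBMS will enforce.
dbExecute(con_demo, "
  CREATE TABLE orders (
    order_number   INTEGER PRIMARY KEY,
    buyer_id       INTEGER REFERENCES buyer (buyer_id),
    product_id     INTEGER REFERENCES product (product_id),
    quantity       INTEGER,
    purchase_price DOUBLE
  );")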

8.3.3 ER Diagrams

Tables and their relationships are usually described in the database schema.

They can be depicted using class or entity-relationship diagrams to show the relationships between tables, fields and keys.

  • E-R diagrams can include additional information such as constraints and type of relationship, e.g., one-to-one, or one-to-many.

E-R diagrams can serve as a useful reference for understanding the structure of a large database and designing queries.

Figure 8.1 is an example of a simplified ER diagram (created in Quarto using Mermaid) showing the relationships between the tables and the associated primary and foreign keys.

Figure 8.1: Simplified ER Diagram

Figure 8.2 shows a different style of E-R diagram commonly used for database schema. The arrows link individual fields as primary and foreign keys for the tables and show the type of the relationships.

Figure 8.2: A partial E-R diagram from a hospital setting. w3resource (2023)

8.3.4 Bringing Tables Together

Now that we can see the database schema, we can find where the data is that we want.

Suppose we want to answer the question: How many orders were placed in the last month? We can do that with just the data in the ORDER table.

However, suppose we want to answer the question: How many orders were placed by buyers in the states of Virginia and Maryland last month? We would have to get data from both the ORDER table and the BUYER table.

We get data from multiple tables by temporarily combining tables using a SQL query that creates a JOIN between two or more tables.

Database joins are the same joins you can execute using {dplyr} verbs but the SQL syntax is slightly different.

If you are using the same join over and over in your analysis, SQL allows you to create a database “View” and save it. Then, in the future, you can query the “view” instead of having to rejoin the tables.

  • This is analogous to writing an R function to create a data frame based on joins and querying the results of the function instead of rewriting the code to join the data frames.
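As a sketch, reusing the con_demo connection and tables from the schema sketch above, creating and then querying such a view looks like this:

# Save the join once as a named view.
dbExecute(con_demo, "
  CREATE VIEW order_details AS
  SELECT o.order_number, b.name AS buyer_name, o.quantity, o.purchase_price
  FROM orders AS o
  JOIN buyer AS b ON o.buyer_id = b.buyer_id;")

# Later queries treat the view like a table.
dbGetQuery(con_demo, "SELECT * FROM order_details;")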

8.3.5 Database Design

Designing databases is its own area of expertise.

A poorly designed database will be slow and have insufficient rules (safeguards) to ensure data integrity.

We are used to operating as single users of our data since it is all in the memory of our machine and we can do what we want.

  • A database may have thousands of users at the same time trying to update or retrieve records.
  • Only one user can update a record at a time so databases use Record Locking to prevent others from updating the record or even retrieving it while someone else is updating the record.
  • RDBMS can enforce locking at higher levels such as the table or even the entire database which can delay user queries. See All about locking in SQL Server for more details.

Database designers make trade-offs when designing tables to balance competing desires.

  • Storage Efficiency: limit each table to a single entity to minimize the amount of redundant data. This can generate lots of tables.
  • Maintenance Efficiency: The more tables you have, the fewer users might be trying to update the table at a time.
  • Query Speed: Minimize the number of joins needed to make a query. This leads to fewer tables.

These trade-offs were recognized by Codd in his original paper.

Thus database designers may create tables that have redundant data if the most common queries join fields from those tables.

They will use database performance testing to optimize the queries and the database design to get the optimal cost-performance trade-offs. They can use other approaches to support data integrity in addition to database normalization.

Don’t be surprised to see E-R diagrams that have redundant data in them.

8.3.6 Relational Database Management Systems

8.3.6.1 Database Front and Back Ends

Databases have two main components.

  • The Front-end user interface where users create code to generate queries and receive results.
  • The Back-end server where the data is stored and the DBMS executes the queries to update or retrieve the data.

The Front-end and Back-end communicate through a Connection that uses an API for handling queries and data.

Many RDBMS bundle a Front-end and Back-end together. Examples include Oracle and SAP; DuckDB pairs with DBeaver.

Most databases allow you to use a Front-end of your choice as long as you have a server driver you can use to configure and open a connection to the Back-end server of the database.

  • RStudio can operate as a Front-end to RDBMS for which you have installed a server.
  • Excel can operate as a Front-end to a Microsoft SQL Server database.
  • DBeaver allows you to make a connection to many different DBMS servers.
  • MariaDB identifies over 30 Front-ends that can work with it, including the DBeaver SQL IDE.

There are multiple solutions. We will use both RStudio and DBeaver.

8.3.6.2 Database Connections: ODBC and the {DBI} Package

The reason there is so much flexibility with connecting Front and Back ends is the standardization of the Microsoft Open Database Connectivity (ODBC) Standard. David-Engel (2023)

The ODBC standard was designed for “maximum interoperability”.

The construct is similar to the HTTP interface for web browsers. The DBMS developer creates a driver that serves as an API for their Back-end DBMS that is ODBC-compliant. Then the Front-end developer creates their software to also be ODBC-compliant.

That way, the Front-end can create a connection to and interact with any DBMS back end for which there is an ODBC-compliant driver.

The R {DBI} package (for DataBase Interface) supports the connection between R and over 25 different DBMS Back-ends.

  • Popular Back-end packages include: {duckdb}, {RMariaDB}, {RPostgres}, {RSQLite}, and {bigrquery}.
  • These packages include the ODBC-compliant driver for the given database.

If no one has written a package but an ODBC-compliant driver exists, the {DBI} package works with the {odbc} package and the driver to make the connection to the Back-end server.
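Here is a sketch of that fallback path; the driver name, server, and credentials are placeholders for whatever your DBMS and installed driver provide:

library(DBI)
library(odbc)

con_odbc <- dbConnect(
  odbc::odbc(),
  Driver   = "PostgreSQL Unicode",      # placeholder: an ODBC driver installed on your machine
  Server   = "db.example.com",          # placeholder host
  Database = "mydb",                    # placeholder database name
  UID      = "analyst",                 # placeholder user
  PWD      = Sys.getenv("DB_PASSWORD"), # avoid hard-coding secrets
  Port     = 5432
)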

8.3.6.3 Database Configuration Types

RDBMS can also be categorized based on their configuration or how front and back ends are physically connected and how the data is stored.

  • In-process databases run on your computer. The data is stored locally and accessible without a network connection.
    • They are optimized for the use case of a single user working with large amounts of data.
  • Client-Server databases use a small front-end client on your computer (perhaps an IDE or in a browser) to connect to a robust server over a network. Data can be stored across a set of servers.
    • They are optimized for the use case where multiple users can query the database at once and the RDBMS resolves contention.
  • Cloud-databases also separate the front and back ends over a network but now the data may be distributed across multiple locations in the cloud.
    • These are optimized for the use case of multiple users accessing a wide-variety of very large data and enabling distributed computing and parallel computing.

With the proliferation of cloud computing and software as a service (SaaS), you can expect to see more and more hybrids of client-server and cloud databases.

As the amount of collected structured and unstructured data scales to Apache solutions for “Big Data”, the fundamental concepts of exploiting relationships in data using some version of SQL will continue to apply for the foreseeable future.

8.4 SQL Basics

  • Portions of this section are based on work by Professor David Gerard.

We will use RStudio as the Front-end, DuckDB as the Back-end, and the {DBI} package to support the connection between the front and back ends.

8.4.1 Connecting to a Database

Before you can use SQL to interact with a database, you must have a way to connect the Front-end user interface to the Back-end server with the data.

  • When working in R, the {DBI} package can create the connection and allow you to assign a name to it.
  • Your database will be created by a specific DBMS which provides the Back-end server.
  • If you are not using the Front-End that comes with the DBMS, you need an ODBC-compliant driver for the DBMS to make the connection.

We will use the DuckDB database and the {duckdb} Package for most of this section.

  • DuckDB is an in-process database so it has no external dependencies which makes it useful for single user analysis of large data. See Why DuckDB for their story.

  • You can install DuckDB directly on your computer and use the built-in Command Line Interface without using R.

  • However, the {duckdb} package installs the DuckDB DBMS as well as the ODBC-compliant driver for R.

  • Use the console to install {DBI} and {duckdb}.

To make the connection between R and DuckDB requires loading the two packages {DBI} and {duckdb}.

library(DBI)
library(duckdb)

We also want to compare the SQL code (working with the database tables) with R code (working with data frames) so load {tidyverse} and {nycflights13} which has the same data as the database we will use.

  • You will see the concepts for accessing data are quite similar but the syntax is different.
library(nycflights13)
library(tidyverse)

The DuckDB database with {nycflights13} data is on GitHub at https://github.com/AU-datascience/data/blob/main/413-613/flights.duckdb.

Important
  • {duckdb} only works with local databases, not over a URL.
  • You can manually download to a data folder under your current file’s working directory or use the following once you are sure there is a data directory:
# GitHub raw link for the DuckDB database file
url <- "https://github.com/AU-datascience/data/raw/main/413-613/flights.duckdb"

# Local path to save the file ({here} builds a project-relative path)
library(here)
dest <- here("data", "flights.duckdb")

# Download the file
download.file(url, dest, mode = "wb")

Create a connection to flights.duckdb by using DBI::dbConnect() with duckdb() and a relative path to the database and assign the name con to the connection.

con <- dbConnect(duckdb(dbdir = "./data/flights.duckdb"))
class(con)
[1] "duckdb_connection"
attr(,"package")
[1] "duckdb"
  • Notice the connection to the database in the environment.
  • You will use the con connection to interact with the database Back-end.

Once you have a connection, you can create a basic SQL code chunk like this in R Markdown.

```{sql, connection=con}
#| label: lst-show-tables
SHOW TABLES;
```
5 records
name
airlines
airports
flights
planes
weather

In Quarto .qmd files use the chunk option #| connection: con.

```{sql}
#| connection: con
#| label: lst-show-tables-q
SHOW TABLES;
```
5 records
name
airlines
airports
flights
planes
weather
Important

Whenever you are using a SQL chunk inside an R Markdown or Quarto document, identify the connection in each code chunk so the SQL interpreter knows with which database to interact.

All the SQL code chunks in this section will have the #| connection: my_con chunk option where my_con is the name of the appropriate connection.

8.4.2 Connecting to Parquet Files

DuckDB and other modern databases (e.g., Apache Spark, Presto/Trino, Google BigQuery, Snowflake, Databricks) explicitly support the reading and writing of Parquet files.

  • Parquet is extremely common today in big data and analytics; it’s basically the default format for modern data lakes and is replacing CSV as a format for data exchange.
  • Parquet is a columnar, compressed, self-describing format so systems built with columnar execution in mind (analytics DBs, OLAP engines, distributed query engines) tend to support direct querying of compressed Parquet.
  • Traditional RDBMS (Postgres, MySQL, SQL Server, Oracle) may require an extension or pre-processing to ingest them.
Note

Parquet is a storage format, not a relational database.

  • It preserves schema information (column names, types, and optional metadata).
  • It does not preserve or enforce higher-level relational rules such as: Primary keys, Foreign keys, Indexes, or Constraints (NOT NULL, CHECK, etc.).

In Parquet format, each table can be a separate file or a set of files (for really large data). The files can be compressed for efficient storage.
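As a sketch of how such files can be produced, DuckDB can export a table to a Parquet file with a COPY statement (run here from R over the con connection created above, and assuming a ./data/parquet directory already exists):

# Export the flights table from flights.duckdb to a Parquet file.
# DuckDB compresses the output; a COMPRESSION option can tune the codec.
dbExecute(con, "
  COPY (SELECT * FROM flights)
  TO './data/parquet/flights.parquet' (FORMAT PARQUET);")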

Assume there is a directory (“./data/parquet”) of tables where each file is a separate table.

  • Create a new connection.
  • Read individual files by name.
con_p <- dbConnect(duckdb())
  • DESCRIBE a table.
```{sql}
#| connection: con_p
#| label: lst-read_parquet
DESCRIBE  './data/parquet/flights.parquet';
```
Displaying records 1 - 10
column_name column_type null key default extra
year INTEGER YES NA NA NA
month INTEGER YES NA NA NA
day INTEGER YES NA NA NA
dep_time INTEGER YES NA NA NA
sched_dep_time INTEGER YES NA NA NA
dep_delay DOUBLE YES NA NA NA
arr_time INTEGER YES NA NA NA
sched_arr_time INTEGER YES NA NA NA
arr_delay DOUBLE YES NA NA NA
carrier VARCHAR YES NA NA NA

When you have a DuckDB connection (here called con_p), DuckDB can automatically treat Parquet file paths as virtual tables so you can work with them.

  • Here is an example of a JOIN of two parquet files. (Don’t worry about the syntax for now.)
```{sql}
#| connection: con_p
#| label: lst-join_parquet
SELECT f.flight, a.name AS airline_name
FROM './data/parquet/flights.parquet' AS f
JOIN './data/parquet/airlines.parquet' AS a
  ON f.carrier = a.carrier
LIMIT 10;
```
Displaying records 1 - 10
flight airline_name
1545 United Air Lines Inc.
1714 United Air Lines Inc.
1141 American Airlines Inc.
725 JetBlue Airways
461 Delta Air Lines Inc.
1696 United Air Lines Inc.
507 JetBlue Airways
5708 ExpressJet Airlines Inc.
79 JetBlue Airways
301 American Airlines Inc.

Using the file names directly, though, makes the code less portable and harder to read.

To make portable code, create a VIEW for each table (Section 8.23) with an alias of the table name and write your code using the VIEW instead of the file name, as if working with a DBMS table.

  • Let’s close the con_p connection.
dbDisconnect(con_p)
  • Use R to create a new connection and the views.
# create connection (here, ephemeral in-memory db)
con_p <- dbConnect(duckdb())

# register all Parquet files in a folder
dbExecute(con_p, "
  CREATE VIEW flights AS
  SELECT * FROM parquet_scan('./data/parquet/flights.parquet');

  CREATE VIEW airlines AS
  SELECT * FROM parquet_scan('./data/parquet/airlines.parquet');
")
[1] 0

Now we can use the new connection to run the query as if it were standard SQL.

```{sql}
#| connection: con_p
#| label: lst-join_parquet-view
SELECT f.flight, a.name AS airline_name
FROM flights AS f
JOIN airlines AS a
  ON f.carrier = a.carrier
LIMIT 10;
```
Displaying records 1 - 10
flight airline_name
1545 United Air Lines Inc.
1714 United Air Lines Inc.
1141 American Airlines Inc.
725 JetBlue Airways
461 Delta Air Lines Inc.
1696 United Air Lines Inc.
507 JetBlue Airways
5708 ExpressJet Airlines Inc.
79 JetBlue Airways
301 American Airlines Inc.
  • Close the connection
dbDisconnect(con_p)

8.4.3 SQL Syntax Overview

Like all languages, SQL has standard syntax and reserved words.

Important

There is a standard for SQL: SQL-92. However, many database developers see this as a point of departure rather than a destination.

  • They use ODBC and SQL-92 to develop their own versions that are optimized for their intended use cases and database design.
  • Thus, there are many flavors of SQL in the world.

Be sure to review the SQL implementation for the DBMS you are using to be aware of any special features.

If you use proprietary SQL, it may give you performance improvements, but it will also lock you into that DBMS and you will have to revise your code if the back-end DBMS is changed.

These notes use the SQL flavor used by the DuckDB DBMS.

The Tidyverse uses functions as “verbs” to manipulate data; SQL uses “STATEMENTS” which can contain clauses, expressions, and other syntactic building blocks.

Table 8.1 identifies the common types of building blocks used in SQL along with some examples and their usage.

Table 8.1: SQL Language Building Blocks
Type Examples Role / Purpose
Statements SELECT, INSERT, UPDATE, DELETE, CREATE, DROP Complete SQL commands that tell the database what action to perform.
Clauses FROM, WHERE, GROUP BY, HAVING, ORDER BY, LIMIT, JOIN Sub-components of statements that define how the action is carried out.
Expressions CASE, arithmetic (a + b), boolean (x > 5), string concatenation Return a single value, often used inside clauses (computed or conditional).
Functions Aggregate: SUM(), AVG(), COUNT()
Scalar: ROUND(), UPPER(), NOW()
Built-in operations that transform values or sets of values.
Modifiers DISTINCT, ALL, TOP, AS (alias) Adjust or qualify the behavior of clauses or expressions.
Predicates IN, BETWEEN, LIKE, IS NULL, EXISTS Conditions that return TRUE/FALSE, often used in WHERE or HAVING.
Operators =, <, >, AND, OR, NOT, +, -, *, / Combine or compare values.
Keywords NULL, PRIMARY KEY, FOREIGN KEY, DEFAULT, CHECK Special reserved words that define schema rules or literal values.

A SQL “statement” is a “complete sentence” that can be executed independently as an executable query.

  • A clause is a phrase in the sentence.
  • An expression is like a formula inside a clause.
  • A function is a built-in calculator you can call inside expressions.
  • Modifiers tweak the behavior (like adverbs in English).
  • Predicates make true/false checks.
  • Operators are the glue (math or logic).
  • Keywords are the reserved words that give SQL its grammar.

The following diagram shows how the building blocks are ordered in creating a complete sentence.

SQL Statement (e.g., SELECT)

├── Clauses (parts of the statement)
│   ├── FROM (which tables?)
│   ├── WHERE (filter rows before grouping)
│   │   └── Predicates (IN, BETWEEN, LIKE, IS NULL, EXISTS)
│   │       └── Expressions (x > 5, col1 + col2)
│   ├── GROUP BY (form groups)
│   ├── HAVING (filter groups after aggregation)
│   │   └── Aggregate Functions (SUM, AVG, COUNT, MIN, MAX)
│   ├── SELECT (choose fields/expressions to return)
│   │   ├── Expressions (CASE, arithmetic, concatenation)
│   │   │   └── Functions (ROUND, UPPER, NOW, SQRT, etc.)
│   │   └── Modifiers (DISTINCT, AS alias)
│   ├── ORDER BY (sort results)
│   └── LIMIT / OFFSET (restrict result size)

├── Operators (math: +, -, *, / ; logic: AND, OR, NOT ; comparison: =, <, >)

└── Keywords (NULL, DEFAULT, PRIMARY KEY, FOREIGN KEY, CHECK, etc.)
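Here is a sketch against the flights table (using the con connection) that exercises these building blocks in that order:

dbGetQuery(con, "
  SELECT origin,                       -- clause: choose fields
         AVG(dep_delay) AS avg_delay   -- aggregate function + AS modifier
  FROM flights                         -- clause: source table
  WHERE dep_delay IS NOT NULL          -- predicate filters rows first
  GROUP BY origin                      -- form groups
  HAVING AVG(dep_delay) > 5            -- filter groups after aggregation
  ORDER BY avg_delay DESC              -- sort results
  LIMIT 3;                             -- restrict result size
")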

SQL is used across the entire database workflow:

  • Database setup: CREATE DATABASE, CREATE TABLE
  • Data management: INSERT, UPDATE, DELETE, DROP TABLE, ALTER TABLE
  • Data retrieval and analysis: SELECT (with various clauses)

We are most interested in SELECT statements as that is how we get data from a database table or a set of joined tables.

8.4.4 The SELECT statement

The SELECT statement is the SQL workhorse for getting and manipulating data akin to dplyr::select() combined with filter(), arrange(), and summarise().

SELECT has several other clauses or keywords to support tailoring the data returned by a query.

  • FROM: Required to identify the table or set of tables where the data will be located.
  • WHERE is a clause used to filter which records (rows) should be returned like dplyr::filter().
  • ORDER BY is a keyword for sorting records like dplyr::arrange(). The default is ascending order.
  • GROUP BY is a keyword akin to dplyr::group_by and is used for creating summaries of data, often with aggregate functions (SUM(), AVG(), COUNT(), …).

Other useful modifiers include:

  • AS: rename/alias columns or tables (like dplyr::rename()).
  • IN: test membership (like %in%).
  • DISTINCT: removes duplicate rows (like dplyr::distinct()).
  • LIMIT: restrict number of rows returned (like utils::head()).

SELECT Statement Operators

  • SQL uses = for comparison instead of == since you don’t assign anything in SQL.
  • Use != for not equal.
  • SQL uses NULL instead of NA. Test with IS NULL or IS NOT NULL.
  • SQL uses AND, OR and NOT as logical operators.

8.4.5 SQL Syntax Rules to Remember

  1. Case insensitivity:
  • Keywords are case insensitive, i.e., select is the same as SELECT is the same as SeLeCt.
  • Best practice is to have all statement keywords in UPPERCASE (e.g., SELECT) and table/field names in lowercase for readability.
  2. Order matters:
  • A SQL SELECT statement’s clauses and keywords must be in the following order: SELECT, FROM, WHERE, GROUP BY, ORDER BY.
  3. Whitespace & formatting:
  • SQL ignores newlines and extra spaces.
  • Best practice is to put statements, clauses, and keywords on new lines with spaces to align the keywords to preserve a “river” of white space between the keywords and the values.
  4. Quoting rules:
  • Character string values must be in single quotes, e.g., 'NYC'.
  • Identifiers (column/table names): use double quotes ("first name") if they contain spaces, reserved words, or special characters (same as using back ticks in R).
  • It is fine to always use double quotes because it is not always clear what is an invalid variable name in the database management system.
  5. Comments:
  • Single-line comments start with two hyphens --.
  • Multi-line comments use /* ... */ in most DBMS.
  6. Statement termination:
  • Some databases (e.g., PostgreSQL, Oracle) require a semicolon ; at the end of each statement.
  • Best practice is to always include ; so multiple statements can be run in one batch.

Together, these rules make SQL predictable (rigid order) but also forgiving (case, whitespace, optional semicolons).

8.4.6 Creating SQL Code Chunks

If you are creating working a lot with SQL in RStudio, you can manually enter a chunk and change the language engine to sql each time. However there are other ways to be more productive.

8.4.6.1 Customize a Keyboard Shortcut

RStudio has a built-in capability to create an SQL code chunk with an R Markdown code chunk option for the connection.

To access it quickly, I suggest customizing a keyboard shortcut for it per Customizing Keyboard Shortcuts in the RStudio IDE. Ushey (2023)

Go to the RStudio Main Menu and select Tools/Modify Keyboard Shortcuts.

A pop up pane will appear with many options and the existing keyboard shortcuts.

  • Enter S Q in the search pane and you should see something like Figure 8.3.
Figure 8.3: A custom keyboard shortcut for an SQL code chunk.
  • Click in the Shortcut column to the right of “Insert Chunk S Q L”.

  • Enter the shortcut combination you want. Figure 8.3 shows the Mac shortcut I chose, Ctrl + Cmd + s.

  • Click Apply and your shortcut will be saved. You can use it right away.

Enter the shortcut in a text area (not an existing chunk) and you should get the following (without the con).

  • Make sure you enter in the name of the connection you have created.
```{sql connection=con}
#| label: lst-sql-sequence
#| lst-cap: Generalized hierarchy of SQL building blocks
#| echo: true

```
Important

You cannot render a document with an SQL code chunk that has an invalid or empty connection.

You have to have created a valid connection object before the SQL code chunk.

If you have a connection, you must also have a valid SQL line; otherwise, set the code chunk option #| eval: false.

8.4.6.2 Create a Custom Code Snippet

Another option is to create a custom code snippet per Code Snippets in the RStudio IDE. Allaire (2023)

  • RStudio snippets are templates for code or R Markdown.

To create a snippet for a custom SQL code chunk, go to the RStudio Main Menu and select Tools/Edit Code Snippets.

  • A pop-up pane should appear.

Select Markdown as you will be using the snippet in a Markdown text section. You will see something like Figure 8.4 with the existing Markdown code snippets.

Figure 8.4: A Markdown code snippet for an SQL code chunk.
  • Figure 8.4 already shows the snippet I saved for creating a custom SQL code chunk.

You can create your own snippet.

  • Enter snippet s_name where you enter your name for the snippet instead of s_name. It does not have to be rsql.
  • Enter the following lines (a reconstruction is shown after this list) and use the tab key, not spaces, to indent.
  • The ${0} is a placeholder for your SQL code. The 0 means it is the first place your cursor will go.
    • You could replace con with ${0} so you can fill in which connection each time. Then edit the next ${0} on the original line to ${1}.
  • Select Save to save it and then you can use it.
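  • The snippet lines themselves are not reproduced here; a plausible reconstruction, based on the chunk the snippet produces below (the name rsql and the connection name con are assumptions), is:

snippet rsql
	```{sql connection=con}
	${0}
	```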

To use a snippet, enter the name you created (I use “rsql”) and then SHIFT+TAB (on a Mac) or TAB (on Windows), and the code chunk will appear like below (without the other options).

```{sql}
#| label: lst-sql-sequence
#| lst-cap: Generalized hierarchy of SQL building blocks
#| echo: true

```

Now we are ready to go through a number of the SQL Building blocks from Table 8.1 to manipulate and retrieve data using the connection to our DuckDB database.

8.5 SHOW and DESCRIBE Provide an Overview of the Tables in the Database

SHOW TABLES returns a list of all of the tables in the database.

```{sql}
#| connection: con
#| label: lst-show-tables-3
-- List all tables in the database
SHOW TABLES;
```
5 records
name
airlines
airports
flights
planes
weather

DESCRIBE (DuckDB) returns the set of tables with columns for the table names, their column names, and column types.

```{sql}
#| connection: con
#| label: lst-describe-table
#| eval: false
DESCRIBE;
```

DESCRIBE tablename (DuckDB) returns metadata about the table tablename.

  • column name, data type, whether NULL is allowed, key info (e.g. primary key), default values.
```{sql}
#| connection: con
#| label: lst-describe-table-airlines
-- Structure of a specific table
DESCRIBE airlines;
```
2 records
column_name column_type null key default extra
carrier VARCHAR YES NA NA NA
name VARCHAR YES NA NA NA
  • R equivalents for a data frame
str(flights)
tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
 $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
 $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
 $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
 $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
 $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
 $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
 $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
 $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
 $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
 $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
 $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
 $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
 $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
 $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
 $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
 $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
 $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
 $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
glimpse(flights)
Rows: 336,776
Columns: 19
$ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

8.6 SELECT Returns Specific Fields (columns) FROM a Table

  • The SELECT statement must always be followed by a FROM clause which identifies the source of the data.
    • That source is a table, a set of joined tables, or a previously-created table view (a saved query like an intermediate data frame variable).
  • The data which is returned as a result of executing a SELECT statement is stored in a result table, which can also be called the result-set.
    • This means you can use the output as a table in a new FROM clause.
  • A SELECT statement can have other clauses and modifiers to reduce the amount of data that is returned.
  • Standard SQL has no equivalent for excluding fields (like dplyr::select(-year)); just select the fields you want. (Some dialects, including DuckDB, offer a non-standard SELECT * EXCLUDE modifier.)
```{sql}
#| eval: false
#| connection: con
SELECT <field1>, <field2>, <field3> 
FROM <mytable>;
```

Let’s use this syntax to get all the data in the fields tailnum, year, and model from the planes table.

```{sql}
#| connection: con
#| label: lst-select-planes
SELECT "tailnum", "year", "model" 
FROM planes;
```
Displaying records 1 - 10
tailnum year model
N10156 2004 EMB-145XR
N102UW 1998 A320-214
N103US 1999 A320-214
N104UW 1999 A320-214
N10575 2002 EMB-145LR
N105UW 1999 A320-214
N107US 1999 A320-214
N108UW 1999 A320-214
N109UW 1999 A320-214
N110UW 1999 A320-214
  • R equivalent
planes |>
  select(tailnum, year, model)

Think of SELECT ... FROM ... as the core skeleton of every SQL query.

  • Everything else (WHERE, GROUP BY, ORDER BY, etc.) adds constraints to shape the result-set, typically to fewer rows that are more focused on the data needed for the question at hand.
Important

You can select every field in a table by using the wildcard *, e.g., SELECT *.

```{sql}
#| connection: con
#| label: lst-select-planes-2
SELECT * 
FROM planes
LIMIT 5;
```
5 records
tailnum year type manufacturer model engines seats speed engine
N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NA Turbo-fan
N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan

However, this is generally considered a bad practice.

  1. Fragile code
  • Tables evolve over time. New columns may appear or existing columns may change.
  • SELECT * may suddenly return unexpected fields and break downstream code.
  2. Performance issues
  • On wide tables (hundreds of columns), returning everything is wasteful.
  • Extra data must be:
    • retrieved from storage
    • processed by the database
    • transmitted over the connection
  3. Network overhead
  • If the database is remote, sending unnecessary data across the network makes queries slower for you and everyone else.

The best practice is to select only the specific fields of interest.

Benefits:

  1. More robust code; your code won’t break if new, irrelevant fields are added.
  2. More efficient: only the data you need is selected, reducing workload on both the database and the connection.

Exceptions:

  1. Small tables you know won’t change (e.g., a static lookup table).
  2. Quick interactive exploration; then switch to explicit fields in production queries.

Let’s examine several keywords that modify SELECT to reduce the number of rows or modify the data.

8.6.1 Select Distinct Records to Remove Duplicate Records

Some tables may allow duplicate records or a query may create a result set with duplicate values if only a few fields are selected.

The statement SELECT DISTINCT field1, field2 modifies the result set so it has no duplicate rows.

Let’s select all the distinct (unique) destinations.

```{sql}
#| connection: con
SELECT DISTINCT "dest" 
FROM flights;
```
Displaying records 1 - 10
dest
DCA
DAY
SRQ
CHS
AUS
SAV
SNA
OMA
MYR
MVY
  • R equivalent
distinct(flights |> select(dest))

8.7 WHERE Filters Records (Rows) by Values

To modify the result set based on values in the records, add a WHERE clause after the FROM clause with a condition that returns a logical value.

SELECT column1, column2
FROM table_name
WHERE condition;
  • The WHERE clause uses the condition to filter the returned records to those whose values meet the condition, i.e., only rows for which condition evaluates to TRUE will appear in the result-set.

  • The WHERE clause requires an expression that returns a logical value.

```{sql}
#| connection: con
SELECT "flight", "distance", "origin", "dest" 
FROM flights
WHERE "distance" < 50;
```
1 records
flight distance origin dest
1632 17 EWR LGA
  • R equivalent:
flights |>
  select(flight, distance, origin, dest) |>
  filter(distance < 50)

The allowable operators in WHERE are:

  • Comparison: =, !=, <, >, <=, >=
  • Logical: AND, OR, NOT
  • Membership: IN
  • Null check: IS NULL, IS NOT NULL

To test for equality, use a single equal sign, =.

```{sql}
#| connection: con
SELECT "flight", "month" 
FROM flights
WHERE "month" = 12;
```
Displaying records 1 - 10
flight month
745 12
839 12
1895 12
1487 12
2243 12
939 12
3819 12
1441 12
2167 12
605 12
  • R equivalent
flights |>
  select(flight, month) |>
  filter(month == 12)

To use a character string you must use single quotes on the string, not double.

```{sql}
#| connection: con
SELECT "flight", "origin" 
FROM flights
WHERE "origin" = 'JFK';
```
Displaying records 1 - 10
flight origin
1141 JFK
725 JFK
79 JFK
49 JFK
71 JFK
194 JFK
1806 JFK
1743 JFK
303 JFK
135 JFK
  • R equivalent
flights |>
  select(flight, origin) |>
  filter(origin == 'JFK')

8.7.1 Combine Multiple Criteria Using the Logical Operators.

Here is an example of using the AND logical operator.

```{sql}
#| connection: con
SELECT "flight", "origin", "dest" 
FROM flights
WHERE "origin" = 'JFK' AND "dest" = 'CMH';
```
Displaying records 1 - 10
flight origin dest
4146 JFK CMH
3783 JFK CMH
4146 JFK CMH
3783 JFK CMH
4146 JFK CMH
3783 JFK CMH
4146 JFK CMH
3783 JFK CMH
4146 JFK CMH
3650 JFK CMH
  • R equivalent:
flights |>
  select(flight, origin, dest) |>
  filter(origin == "JFK", dest == "CMH")

You can use the OR and NOT logical operators too.

  • Put parentheses around the portions of the expression to ensure you get the desired order of operations.
```{sql}
#| connection: con
SELECT
    DISTINCT "flight",
    "origin",
    "dest"
FROM
    flights
WHERE
    NOT ("origin" = 'JFK'
        OR "origin" = 'LGA')
    AND dest = 'CMH';
```
Displaying records 1 - 10
flight origin dest
3852 EWR CMH
4147 EWR CMH
4446 EWR CMH
4240 EWR CMH
3839 EWR CMH
3854 EWR CMH
4704 EWR CMH
5664 EWR CMH
5815 EWR CMH
4335 EWR CMH
  • R equivalent
flights |>
  select(flight, origin, dest) |>
  filter(!(origin == "JFK" | origin == "LGA"), dest == "CMH") |> 
  distinct()

8.7.2 Filter on Membership using IN or NOT IN

You can simplify and generalize a WHERE condition for membership using IN instead of multiple OR comparisons.

  • To get flights whose destination is one of several airports (e.g., BWI, IAD, DCA) and exclude certain origins:
```{sql}
#| connection: con
SELECT DISTINCT
    "flight",
    "origin",
    "dest"
FROM flights
WHERE
    "origin" NOT IN ('JFK', 'LGA')
    AND "dest" IN ('BWI', 'IAD', 'DCA');
```
Displaying records 1 - 10
flight origin dest
4241 EWR DCA
3838 EWR DCA
3845 EWR BWI
4372 EWR DCA
6049 EWR IAD
3832 EWR DCA
4368 EWR DCA
4103 EWR DCA
4348 EWR IAD
1672 EWR BWI
  • R Equivalent
flights |>
  filter(
    !origin %in% c("JFK", "LGA"), # equivalent to NOT IN
    dest %in% c("BWI", "IAD", "DCA") # equivalent to IN
  ) |>
  select(flight, origin, dest) |>
  distinct() 

8.7.3 Remove Records with Missing Data

Missing data is NULL in SQL (instead of NA).

We can explicitly remove records with missing values with the clause WHERE field IS NOT NULL.

```{sql}
#| connection: con
SELECT "flight", "dep_delay" 
FROM flights
WHERE "dep_delay" IS NOT NULL;
```
Displaying records 1 - 10
flight dep_delay
1545 2
1714 4
1141 2
725 -1
461 -6
1696 -4
507 -5
5708 -3
79 -3
301 -2
  • R equivalent
flights |>
  select(flight, dep_delay) |>
  filter(!is.na(dep_delay))

Use IS NULL if you want only the records that are missing data.

```{sql}
#| connection: con
SELECT "flight", "dep_delay" 
FROM flights
WHERE "dep_delay" IS NULL;
```
Displaying records 1 - 10
flight dep_delay
4308 NA
791 NA
1925 NA
125 NA
4352 NA
4406 NA
4434 NA
4935 NA
3849 NA
133 NA

8.8 String/Text Functions

SQL provides several functions for working with text.

  • These are often used in WHERE clauses to filter data:

  • Find flights for carriers with “Jet” in their name (case-insensitive).

```{sql}
#| connection: con

SELECT f.flight, a.name AS airline_name, f.origin
FROM flights f
JOIN airlines a
  ON f.carrier = a.carrier
WHERE a.name ILIKE '%jet%'   -- ILIKE for case-insensitive matching
LIMIT 10;
```
Displaying records 1 - 10
flight airline_name origin
4424 ExpressJet Airlines Inc. EWR
6177 ExpressJet Airlines Inc. EWR
583 JetBlue Airways JFK
4963 ExpressJet Airlines Inc. EWR
4380 ExpressJet Airlines Inc. EWR
601 JetBlue Airways JFK
4122 ExpressJet Airlines Inc. EWR
4334 ExpressJet Airlines Inc. EWR
4533 ExpressJet Airlines Inc. EWR
4166 ExpressJet Airlines Inc. EWR
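  • A sketch of an R equivalent, using {dplyr} and stringr::str_detect() with regex(ignore_case = TRUE) for the case-insensitive match:
flights |>
  inner_join(airlines, by = "carrier") |>
  filter(str_detect(name, regex("jet", ignore_case = TRUE))) |>
  select(flight, airline_name = name, origin) |>
  head(10)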
  • Extract first characters of the carrier code.
```{sql}
#| connection: con
SELECT carrier, SUBSTR(carrier, 1, 2) AS carrier_prefix
FROM airlines
LIMIT 10;
```
Displaying records 1 - 10
carrier carrier_prefix
9E 9E
AA AA
AS AS
B6 B6
DL DL
EV EV
F9 F9
FL FL
HA HA
MQ MQ
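  • A sketch of an R equivalent using stringr::str_sub():
airlines |>
  mutate(carrier_prefix = str_sub(carrier, 1, 2)) |>
  head(10)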

  • Trim whitespace from airport codes (if any).

```{sql}
#| connection: con
SELECT TRIM(origin) AS clean_origin, TRIM(dest) AS clean_dest
FROM flights
LIMIT 10;
```
Displaying records 1 - 10
clean_origin clean_dest
EWR IAH
LGA IAH
JFK MIA
JFK BQN
LGA ATL
EWR ORD
EWR FLL
LGA IAD
JFK MCO
LGA ORD
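  • A sketch of an R equivalent using stringr::str_trim():
flights |>
  transmute(clean_origin = str_trim(origin), clean_dest = str_trim(dest)) |>
  head(10)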

8.8.1 CONCAT Concatenates Multiple Strings or Lists

SQL has the CONCAT function and an operator for combining strings into a single value.

  • This includes concatenating columns. Add an alias so the result is easier to reference.
```{sql}
#| connection: con
SELECT DISTINCT CONCAT(f.carrier, '-', a.name) AS airline_name
FROM flights f
JOIN airlines a
  ON f.carrier = a.carrier
WHERE a.name ILIKE '%air%'   -- ILIKE for case-insensitive matching
ORDER BY airline_name
LIMIT 10;
```
Displaying records 1 - 10
airline_name
9E-Endeavor Air Inc.
AA-American Airlines Inc.
AS-Alaska Airlines Inc.
B6-JetBlue Airways
DL-Delta Air Lines Inc.
EV-ExpressJet Airlines Inc.
F9-Frontier Airlines Inc.
FL-AirTran Airways Corporation
HA-Hawaiian Airlines Inc.
MQ-Envoy Air
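  • A sketch of an R equivalent using stringr::str_c() for the concatenation:
flights |>
  inner_join(airlines, by = "carrier") |>
  filter(str_detect(name, regex("air", ignore_case = TRUE))) |>
  distinct(airline_name = str_c(carrier, "-", name)) |>
  arrange(airline_name) |>
  head(10)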

Concatenation is so common there is an operator for it.

Use || as a binary operator to concatenate the elements on both sides of it.

```{sql}
#| connection: con
SELECT DISTINCT f.carrier || '-' || a.name AS airline_name
FROM flights f
JOIN airlines a
  ON f.carrier = a.carrier
WHERE a.name ILIKE '%air%'   -- ILIKE for case-insensitive matching
ORDER BY airline_name
LIMIT 10;
```
Displaying records 1 - 10
airline_name
9E-Endeavor Air Inc.
AA-American Airlines Inc.
AS-Alaska Airlines Inc.
B6-JetBlue Airways
DL-Delta Air Lines Inc.
EV-ExpressJet Airlines Inc.
F9-Frontier Airlines Inc.
FL-AirTran Airways Corporation
HA-Hawaiian Airlines Inc.
MQ-Envoy Air

8.8.2 Regular Expressions in SQL

SQL supports regex pattern matching to filter strings more precisely.

  • In many SQL dialects you can use REGEXP (in DuckDB use REGEXP_MATCHES) to match patterns:

  • Flights whose destination airport code starts with ‘O’ and ends with ‘A’.

```{sql}
#| connection: con
SELECT flight, dest
FROM flights
WHERE REGEXP_MATCHES(dest, '^O.*A$')
LIMIT 10;
```
Displaying records 1 - 10
flight dest
4417 OMA
4085 OMA
4160 OMA
4417 OMA
4085 OMA
4294 OMA
4085 OMA
4294 OMA
4085 OMA
4085 OMA
  • Carriers whose name contains “ Air ”.
```{sql}
#| connection: con
SELECT name
FROM airlines
WHERE REGEXP_MATCHES(name, '^.* Air .*')
LIMIT 10;
```
3 records
name
Endeavor Air Inc.
Delta Air Lines Inc.
United Air Lines Inc.
  • Replace tail numbers that begin with “N” with only the digits after the “N”.
```{sql}
#| connection: con
SELECT tailnum,
       REGEXP_REPLACE(tailnum, '^N([0-9]+).*$', '\1') AS numeric_part
FROM flights
LIMIT 10;
```
Displaying records 1 - 10
tailnum numeric_part
N14228 14228
N24211 24211
N619AA 619
N804JB 804
N668DN 668
N39463 39463
N516JB 516
N829AS 829
N593JB 593
N3ALAA 3
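  • A sketch of an R equivalent using stringr::str_replace() with the same pattern (R needs \\1 for the backreference):
flights |>
  transmute(tailnum,
            numeric_part = str_replace(tailnum, "^N([0-9]+).*$", "\\1")) |>
  head(10)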

8.9 Date/Time Functions

SQL has operators and functions to work with dates and times.

  • These can be useful for filtering or aggregating flights.

Operators allow us to add and subtract DATE types.

```{sql}
#| connection: con
SELECT flight, carrier, time_hour
FROM flights
WHERE time_hour = DATE '2013-07-04' + 1
LIMIT 10;
```
Displaying records 1 - 10
flight carrier time_hour
418 B6 2013-07-05
2391 DL 2013-07-05
415 VX 2013-07-05
1505 B6 2013-07-05
1464 UA 2013-07-05
1244 UA 2013-07-05
686 B6 2013-07-05
485 FL 2013-07-05
1271 WN 2013-07-05
4079 9E 2013-07-05

Functions allow us to manipulate or extract information from DATEs or TIMEs.

Note

DuckDB has two versions of some functions, with and without an underscore, to help with compatibility.

  • The ones without an underscore, e.g., datediff, dateadd, etc., were originally added to mimic common SQL dialects like SQL Server or MySQL.
  • The ones with an underscore, e.g., date_diff, date_add, follow standard SQL/ISO naming conventions.

The functions provide the same result (although the argument order may differ).

Say we want to select only those flights that occurred in March at 10 PM.

  • We can use the concatenation operator to construct a date and assign it an alias.
  • We could also just use time_hour, which is type TIMESTAMP (combining DATE and TIME information).
  • Then filter to rows where the month of the date is March and the hour is 10 PM.
```{sql}
#| connection: con
SELECT flight, carrier, DATE(year || '-' || month || '-' || day) AS flight_date,
time_hour
FROM flights
WHERE EXTRACT(MONTH FROM flight_date) = 3 AND EXTRACT(HOUR from time_hour) = 22
LIMIT 10;
```
Displaying records 1 - 10
flight carrier flight_date time_hour
575 AA 2013-03-01 2013-03-01 22:00:00
2136 US 2013-03-01 2013-03-01 22:00:00
127 DL 2013-03-01 2013-03-01 22:00:00
1499 DL 2013-03-01 2013-03-01 22:00:00
2183 US 2013-03-01 2013-03-01 22:00:00
257 AA 2013-03-01 2013-03-01 22:00:00
2042 DL 2013-03-01 2013-03-01 22:00:00
4202 EV 2013-03-01 2013-03-01 22:00:00
31 DL 2013-03-01 2013-03-01 22:00:00
689 UA 2013-03-01 2013-03-01 22:00:00
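  • A sketch of an R equivalent; the flights data frame already has month and hour columns, so no date construction is needed:
flights |>
  filter(month == 3, hour == 22) |>
  select(flight, carrier, time_hour) |>
  head(10)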

Extract parts of a date

```{sql}
#| connection: con
SELECT year, month, day, EXTRACT(DAY FROM date '2013-03-15') AS day_of_month
FROM flights
LIMIT 5;
```
5 records
year month day day_of_month
2013 1 1 15
2013 1 1 15
2013 1 1 15
2013 1 1 15
2013 1 1 15

Calculate the difference between the scheduled departure date and a calculated arrival date.

  • Multiply the numeric air_time by INTERVAL '1 MINUTE' to convert it into a time interval.
  • Use the DATE_DIFF function to find the difference between two dates.
```{sql}
#| connection: con
SELECT flight, origin, dest, time_hour AS dep_datetime,
       time_hour + (air_time * INTERVAL '1 MINUTE') AS arr_datetime,
       date_diff('day',
                time_hour,
                time_hour + (air_time * INTERVAL '1 MINUTE')
               ) as days_different
FROM flights
WHERE date_diff('day',
                time_hour,
                time_hour + (air_time * INTERVAL '1 MINUTE')
               ) <> 0
LIMIT 10;
```
Displaying records 1 - 10
flight origin dest dep_datetime arr_datetime days_different
51 JFK HNL 2013-01-01 14:00:00 2013-01-02 00:59:00 1
117 JFK LAX 2013-01-01 18:00:00 2013-01-02 00:02:00 1
15 EWR HNL 2013-01-01 18:00:00 2013-01-02 04:56:00 1
16 EWR SEA 2013-01-01 19:00:00 2013-01-02 00:48:00 1
257 JFK SFO 2013-01-01 19:00:00 2013-01-02 01:02:00 1
355 JFK BUR 2013-01-01 18:00:00 2013-01-02 00:11:00 1
1322 JFK SFO 2013-01-01 19:00:00 2013-01-02 01:15:00 1
1467 JFK LAX 2013-01-01 20:00:00 2013-01-02 01:40:00 1
759 LGA DFW 2013-01-01 20:00:00 2013-01-02 00:08:00 1
1699 EWR SFO 2013-01-01 20:00:00 2013-01-02 01:48:00 1
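  • A sketch of an R equivalent using {lubridate} (attached with the tidyverse); dminutes() converts the numeric air_time to a duration:
flights |>
  mutate(
    arr_datetime   = time_hour + dminutes(air_time),
    days_different = as.integer(date(arr_datetime) - date(time_hour))
  ) |>
  filter(days_different != 0) |>
  select(flight, origin, dest,
         dep_datetime = time_hour, arr_datetime, days_different) |>
  head(10)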

For more complex date (or REGEX) functions, we need to install and load the International Components for Unicode (ICU) extension.

  • It’s a library that provides:
    • Unicode support – handles text in almost any language and script.
    • Advanced text functions – like regular expressions, case conversions, collation (sorting), normalization, etc.
    • Date/time parsing and formatting – especially when working with strings in different locales.
```{sql}
#| connection: con
INSTALL icu;
LOAD icu;
```

Compare with the current date.

```{sql}
#| connection: con
SELECT flight, carrier
FROM flights
WHERE DATE('2013-01-01') < CURRENT_DATE  -- example comparison
LIMIT 10;
```
Displaying records 1 - 10
flight carrier
1545 UA
1714 UA
1141 AA
725 B6
461 DL
1696 UA
507 B6
5708 EV
79 B6
301 AA

8.10 LIMIT Restricts the Number of Records Returned Starting with the OFFSET Record

When you are building a query, you often want to subset the rows while you are finishing it (you don’t want to return the whole table each time you are troubleshooting a query).

Use the LIMIT output modifier at the end of the query to restrict the number of records that are returned.

  • Some flavors of SQL allow SELECT TOP and SELECT TOP PERCENT statements to restrict to the top number/percentage of records.
```{sql}
#| connection: con
SELECT "flight", "origin", "dest" 
FROM flights
LIMIT 5;
```
5 records
flight origin dest
1545 EWR IAH
1714 LGA IAH
1141 JFK MIA
725 JFK BQN
461 LGA ATL

If you want a limited number of records, but not just the first records, use the OFFSET clause to identify how many records to skip before starting to collect the records.

  • The OFFSET clause indicates at which position to start reading the values, i.e., the first OFFSET values are ignored.
```{sql}
#| connection: con
SELECT "flight", "origin", "dest" 
FROM flights
LIMIT 5
  OFFSET 4;
```
5 records
flight origin dest
461 LGA ATL
1696 EWR ORD
507 EWR FLL
5708 LGA IAD
79 JFK MCO

8.11 USING SAMPLE Returns a Random Sample of Records

You can also randomly sample rows with the clause USING SAMPLE.

```{sql}
#| connection: con
SELECT "flight", "origin", "dest" 
FROM flights
USING SAMPLE 5 ROWS;
```
5 records
flight origin dest
927 EWR MIA
601 JFK MIA
3910 JFK RIC
1708 EWR SJU
1529 JFK LAS
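  • A sketch of an R equivalent using dplyr::slice_sample() (call set.seed() first for a repeatable sample, analogous to the seeded DuckDB sample below):
flights |>
  select(flight, origin, dest) |>
  slice_sample(n = 5)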

Depending upon your version of SQL, there may be a modifier to SAMPLE that allows you to set the random number seed for the sample so it is repeatable.

  • DuckDB has multiple sampling methods. The following uses Bernoulli sampling with a small percentage (just to avoid many rows in the output).
  • Note that repeating the code above changes the selected rows whereas repeating the following does not.
```{sql}
#| connection: con
SELECT "flight", "origin", "dest" 
FROM flights
USING SAMPLE .0027% (bernoulli, 377);
```
1 records
flight origin dest
340 EWR MIA
Tip

When working with tables with more than a few hundred records, we recommend using LIMIT and one of the WHERE or USING SAMPLE clauses when initially building/testing your query to speed up the return of results, especially if the connection is over a network.

Remove the restrictions when the query is ready for use on the full table.

8.12 ORDER BY Arranges Rows

The ORDER BY clause modifies the output by sorting the records that are returned.

  • The default order is ascending.

Let’s remove missing values so we can see the ordering.

```{sql}
#| connection: con
SELECT "flight", "dep_delay" 
FROM flights
WHERE "dep_delay" IS NOT NULL
ORDER BY "dep_delay"
LIMIT 20;
```
Displaying records 1 - 10
flight dep_delay
97 -43
1715 -33
5713 -32
1435 -30
837 -27
3478 -26
4573 -25
4361 -25
2223 -24
3318 -24
  • R equivalent
flights |>
  select(flight, dep_delay) |>
  filter(!is.na(dep_delay)) |>
  arrange(dep_delay) |> 
  slice_head(n = 20)

Use the modifier DESC after the field name to arrange in descending order.

```{sql}
#| connection: con
SELECT "flight", "dep_delay" 
FROM flights
WHERE "dep_delay" IS NOT NULL
ORDER BY "dep_delay" DESC;
```
Displaying records 1 - 10
flight dep_delay
51 1301
3535 1137
3695 1126
177 1014
3075 1005
2391 960
2119 911
2007 899
2047 898
172 896
  • R equivalent
flights |>
  select(flight, dep_delay) |>
  filter(!is.na(dep_delay)) |>
  arrange(desc(dep_delay))

Break ties by adding more variables in the ORDER BY clause.

```{sql}
#| connection: con
SELECT "flight", "origin", "dep_delay" 
FROM flights
WHERE "dep_delay" IS NOT NULL
ORDER BY "origin" DESC, "dep_delay"
LIMIT 20;
```
Displaying records 1 - 10
flight origin dep_delay
1715 LGA -33
5713 LGA -32
1435 LGA -30
837 LGA -27
3478 LGA -26
4573 LGA -25
375 LGA -24
4065 LGA -24
2223 LGA -24
1371 LGA -23
  • R equivalent
flights |>
  select(flight, origin, dep_delay) |>
  filter(!is.na(dep_delay)) |>
  arrange(desc(origin), dep_delay) |> 
  slice_head(n = 20)

8.13 AS Creates an Alias for a Field, Table, or Sub-Query/View

It is often helpful to create a new, often shorter, name for a field, table, or sub-query.

  • An alias is a temporary name which can make the code much easier to read and debug, especially with multiple joins and sub-queries at play.
  • AS does not change or update the original table; it creates an alias for a new temporary field/table that will disappear after the query is run.

To create an alias or new name for a field in a result set, use the keyword AS, inside the SELECT statement, after the column name or expression, to specify the alias (new name for the field resulting from the column name or expression).

```{sql}
#| eval: false
#| connection: con
SELECT <expression> AS <myvariable> 
FROM <mytable>;
```

Let’s calculate average speed from the flights table. We’ll also keep the flight number, distance, and air time variables.

```{sql}
#| connection: con
SELECT
    "flight",
    "distance" / "air_time" AS "speed", -- "speed" is the alias
    "distance",
    "air_time"
FROM
    flights
LIMIT 5;
```
5 records
flight speed distance air_time
1545 6.167401 1400 227
1714 6.237885 1416 227
1141 6.806250 1089 160
725 8.612022 1576 183
461 6.568966 762 116
  • R equivalent:
flights |>
  select(flight, distance, air_time) |>
  mutate(speed = distance / air_time) |> 
  slice_head(n = 5)

8.14 CAST() Changes the Type of a Variable

The CAST() function allows you to explicitly change the data type of a column, expression, or literal. It is commonly used when you need to:

  • Run numeric operations on text columns
  • Format dates or times
  • Ensure compatible types for joins or calculations
CAST(expression AS target_data_type)
  • The expression is what you want to convert and the target is the new type (e.g., INTEGER, FLOAT (single precision, 4 bytes), DOUBLE (double precision, 8 bytes), VARCHAR, DATE).

DuckDB also supports the shorthand use of :: as a binary operator, e.g., expr::TYPENAME, which is also present in PostgreSQL.

SELECT expression::target_data_type
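
Here is a minimal sketch (not evaluated here) showing that the two notations are interchangeable; the column aliases are made up for illustration.

```{sql}
#| eval: false
#| connection: con
SELECT CAST('3.14' AS DOUBLE) AS with_cast,       -- explicit CAST()
       '3.14'::DOUBLE         AS with_shorthand,  -- :: shorthand
       5::DOUBLE / 2          AS float_division;  -- cast before dividing
```
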
Note

Numeric Types in DuckDB and Their Relation to Standard SQL

DuckDB supports a variety of numeric types, which align closely with standard SQL data types. These types can be broadly categorized into integer types, approximate numeric types, and exact numeric types.

  1. Integer Types

DuckDB provides several integer types for whole numbers:

| Type | Storage | Description |
|------|---------|-------------|
| TINYINT | 1 byte | Small integer values |
| SMALLINT | 2 bytes | Medium-range integer values |
| INTEGER / INT | 4 bytes | Standard integer values |
| BIGINT | 8 bytes | Large integer values |

These are equivalent to standard SQL integer types and are used when exact whole numbers are needed.

  2. Floating-Point Types (Approximate)

| Type | Storage | Description |
|------|---------|-------------|
| FLOAT / REAL | 4 bytes | Single-precision floating point (approximate) |
| DOUBLE / DOUBLE PRECISION | 8 bytes | Double-precision floating point (approximate) |
  • These types store approximate numeric values, meaning they can introduce rounding errors.
  • They are useful for scientific or large-scale calculations where exact precision is not required.
  3. Exact Numeric Types

| Type | Description |
|------|-------------|
| DECIMAL(p, s) / NUMERIC(p, s) | Exact fixed-point numbers. p = total digits, s = digits after the decimal point. |
  • Ideal for financial or precise calculations, where exact decimal representation is required.
  • Arithmetic with DECIMAL preserves exact results, unlike floating-point types.

Summary

  • Integers: exact whole numbers.
  • FLOAT / DOUBLE: approximate numbers, faster, but may have rounding errors.
  • DECIMAL / NUMERIC: exact numbers with fixed precision, recommended for money or other high-precision use cases.

DuckDB’s numeric types closely follow standard SQL types, making it easy to write portable SQL queries and integrate with other database systems.
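
A small sketch (not evaluated here) of the approximate-vs-exact distinction: adding 0.1 and 0.2 as DOUBLEs picks up binary floating-point error, while DECIMAL arithmetic stays exact.

```{sql}
#| eval: false
#| connection: con
SELECT 0.1::DOUBLE         + 0.2::DOUBLE         AS approx_sum, -- 0.30000000000000004
       0.1::DECIMAL(10, 2) + 0.2::DECIMAL(10, 2) AS exact_sum;  -- 0.30
```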

8.15 Math Functions Transform Values

SQL has many functions for transforming values. Common ones include:

  • ABS(x): Absolute Value
  • LN(): Natural log transformation.
  • EXP(): Exponentiation.
  • SQRT(): Square root.
  • POW(): Power transformation.
    • POW(2.0, x) would be \(2^x\)
    • POW(x, 2.0) would be \(x^2\)
  • ROUND(x, n): Round x to n decimal places
  • CEIL(x) / CEILING(x): Smallest integer ≥ x
  • FLOOR(x): Largest integer ≤ x
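
Here is a quick sketch (not evaluated here) exercising several of these functions on literal values; the expected results are in the comments.

```{sql}
#| eval: false
#| connection: con
SELECT ABS(-5)           AS abs_val,    -- 5
       ROUND(3.14159, 2) AS rounded,    -- 3.14
       CEIL(2.1)         AS ceiling,    -- 3
       FLOOR(2.9)        AS floored,    -- 2
       POW(2.0, 10)      AS two_to_ten, -- 1024
       SQRT(81)          AS root;       -- 9
```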

8.15.1 Integer Division

In SQL, when you divide two integer columns, many database systems perform integer division by default.

DuckDB has two different division operators that operate differently if both numbers are integers.

  • The single slash, /, will convert to floating point numbers for the division.
  • The double slash, //, will use integer division.
  • See DuckDB Numeric Operators.
```{sql}
#| connection: con
SELECT 5 / 2;  -- Most SQL DBs return 2, not 2.5. DuckDB returns 2.5
SELECT 5 // 2;  -- returns 2, not 2.5 in DuckDB
```
1 records
(5 // 2)
2

To create more portable code, you can use CAST() to convert prior to division.

  • The fields "day" and "month" are integer columns in the flights table.
  • Without casting, “day” // “month” would give truncated integer results, which would be misleading for the “ratio” calculation.
  • Using CAST() ensures “ratio” is a precise decimal number, giving meaningful results.

Without CAST(), a standard SQL system would yield the following with a single /; we use DuckDB’s // here to mimic that integer division.

```{sql}
#| connection: con
SELECT
    DISTINCT "month",
    "day",
    "day" // "month" AS "ratio" -- using // to Force Integer Division
FROM
    flights
WHERE
    "month" >= 5
ORDER BY
    "month",
    "day";
```
Displaying records 1 - 10
month day ratio
5 1 0
5 2 0
5 3 0
5 4 0
5 5 1
5 6 1
5 7 1
5 8 1
5 9 1
5 10 2

Now use CAST() for more portable code.

  • CAST("day" AS DOUBLE) converts the integer column day into a double-precision floating-point number.
  • CAST("month" AS DOUBLE) does the same for month.
  • Dividing two DOUBLEs ensures floating-point division across most systems, preserving the decimal part.
```{sql}
#| connection: con
SELECT
    DISTINCT "month",
    "day",
    CAST("day" AS DOUBLE) / CAST("month" AS DOUBLE) AS "ratio"
FROM
    flights
WHERE
    "month" >= 5
ORDER BY
    "month",
    "day";
```
Displaying records 1 - 10
month day ratio
5 1 0.2
5 2 0.4
5 3 0.6
5 4 0.8
5 5 1.0
5 6 1.2
5 7 1.4
5 8 1.6
5 9 1.8
5 10 2.0

8.16 Aggregate Functions Summarize Fields

SQL has functions for summarizing data (SQL calls these “Aggregates”).

Aggregate functions must be inside a SELECT statement which identifies the fields of interest.

  • Aggregates are similar to the functions used inside a dplyr::summarize() in R.
  • They calculate the results for the selected fields and aggregate (collapse) the records into a single record.
  • By default, all records with missing data (NULL) are ignored (like setting na.rm = TRUE) and there is no warning or message.

Some of the common aggregates are:

  • COUNT(): Count the number of rows.
  • AVG(): Calculate the average.
  • MEDIAN(): Median (not standard across all DBMSs).
  • SUM(): Summation.
  • MIN(): Minimum.
  • MAX(): Maximum.
  • STDDEV(): Standard deviation.
  • VARIANCE(): Variance.
  • CORR(y, x): Correlation between x and y.
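
MEDIAN() and CORR() do not appear in the example below, so here is a minimal sketch (not evaluated here) applying them to the flights table, using its arr_delay field for the correlation.

```{sql}
#| eval: false
#| connection: con
SELECT MEDIAN(dep_delay)          AS med_dep_delay,    -- not standard in every DBMS
       CORR(arr_delay, dep_delay) AS delay_correlation -- NULLs are ignored
FROM flights;
```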

Let’s calculate the average and standard deviation for departure delays across all records in the table.

```{sql}
#| connection: con
SELECT
    AVG("dep_delay"), -- Note the default field name without an alias
    STDDEV("dep_delay") AS "sd_dep_delay",
    COUNT("dep_delay") AS "num_records"
FROM
    flights;
```
1 records
avg(dep_delay) sd_dep_delay num_records
12.63907 40.21006 328521
  • Note only one record is returned, with fields named by default from the function call, but you can create aliases using the AS statement.

  • Compare to the R results to see that SQL automatically drops records with a NULL value before counting the records.

  • R equivalent:

flights |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE), 
            sd_dep_delay = sd(dep_delay, na.rm = TRUE),
            num_records = n())
# A tibble: 1 × 3
  avg_dep_delay sd_dep_delay num_records
          <dbl>        <dbl>       <int>
1          12.6         40.2      336776
flights |>
  filter(!is.na(dep_delay)) |> 
  summarize(avg_dep_delay = mean(dep_delay), 
            sd_dep_delay = sd(dep_delay),
            num_records = n())
# A tibble: 1 × 3
  avg_dep_delay sd_dep_delay num_records
          <dbl>        <dbl>       <int>
1          12.6         40.2      328521
  • R’s n() counts the number of records that went into the summary, including those with NA values.
  • You need to filter the data in R to remove the NAs before sending it to summarize() to get the same count as SQL.

8.17 GROUP BY Aggregates Results across Groups of Records

GROUP BY identifies one or more fields by which to aggregate the records.

  • This is similar (but not exactly the same) as using summarize() inside a group_by() in R.

Let’s summarize the average and standard deviation of the departure delays by each origin and month.

```{sql}
#| connection: con
SELECT
    "origin",
    "month",
    AVG("dep_delay") AS "avg_dep_delay",
    STDDEV("dep_delay") AS "sd_dep_delay"
FROM
    flights
GROUP BY
    "origin",
    "month"
ORDER BY
  "origin",
  "month"
LIMIT 5
OFFSET 10
```
5 records
origin month avg_dep_delay sd_dep_delay
EWR 11 6.723769 28.78092
EWR 12 21.026575 45.74062
JFK 1 8.615826 35.99002
JFK 2 11.791355 37.43803
JFK 3 10.721825 35.33440
  • Now one record is returned for each of the 36 groups.
  • Note the order of the fields origin and month is based on the order of the SELECT, not the GROUP BY.

Change the order of fields in the SELECT.

```{sql}
#| connection: con
SELECT
    "month",
    "origin",
    AVG("dep_delay") AS "avg_dep_delay",
    STDDEV("dep_delay") AS "sd_dep_delay"
FROM
    flights
GROUP BY
    "origin",
    "month"
ORDER BY
  "origin",
  "month"
LIMIT 5
OFFSET 10
```
5 records
month origin avg_dep_delay sd_dep_delay
11 EWR 6.723769 28.78092
12 EWR 21.026575 45.74062
1 JFK 8.615826 35.99002
2 JFK 11.791355 37.43803
3 JFK 10.721825 35.33440

Changing the order of the fields in the GROUP BY does not change the sequence of the fields or the results.

```{sql}
#| connection: con
SELECT
    "origin",
    "month",
    AVG("dep_delay") AS "avg_dep_delay",
    STDDEV("dep_delay") AS "sd_dep_delay"
FROM
    flights
GROUP BY
    "month",
    "origin"
ORDER BY
  "origin",
  "month"
LIMIT 5
OFFSET 10
```
5 records
origin month avg_dep_delay sd_dep_delay
EWR 11 6.723769 28.78092
EWR 12 21.026575 45.74062
JFK 1 8.615826 35.99002
JFK 2 11.791355 37.43803
JFK 3 10.721825 35.33440
  • R equivalent
flights |>
  select(origin, month, dep_delay) |>
  group_by(origin, month) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE), 
            sd_dep_delay = sd(dep_delay, na.rm = TRUE),
            .groups = "drop") |> 
slice(11:15)
# A tibble: 5 × 4
  origin month avg_dep_delay sd_dep_delay
  <chr>  <int>         <dbl>        <dbl>
1 EWR       11          6.72         28.8
2 EWR       12         21.0          45.7
3 JFK        1          8.62         36.0
4 JFK        2         11.8          37.4
5 JFK        3         10.7          35.3
  • Note the order of the origin and month.
  • Changing the order in the select() does not matter in R.
flights |>
  select(month, origin, dep_delay) |>
  group_by(origin, month) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE), 
            sd_dep_delay = sd(dep_delay, na.rm = TRUE),
            .groups = "drop") |> 
slice(11:15)
# A tibble: 5 × 4
  origin month avg_dep_delay sd_dep_delay
  <chr>  <int>         <dbl>        <dbl>
1 EWR       11          6.72         28.8
2 EWR       12         21.0          45.7
3 JFK        1          8.62         36.0
4 JFK        2         11.8          37.4
5 JFK        3         10.7          35.3

Changing the order in the group_by() does matter in how the fields are arranged by default.

flights |>
  select(month, origin, dep_delay) |>
  group_by(month, origin) |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE), 
            sd_dep_delay = sd(dep_delay, na.rm = TRUE),
            .groups = "drop") |> 
slice(11:15)
# A tibble: 5 × 4
  month origin avg_dep_delay sd_dep_delay
  <int> <chr>          <dbl>        <dbl>
1     4 JFK             12.2         41.2
2     4 LGA             11.5         43.4
3     5 EWR             15.4         39.0
4     5 JFK             12.5         38.5
5     5 LGA             10.6         40.6

8.17.1 Use WHERE to Filter Records Before Grouping with GROUP BY

Using WHERE before a GROUP BY allows you to filter the rows that will be grouped.

As an example, suppose we want the departure delay calculations to include only flights that were actually delayed, not those that departed early and so have a negative departure delay.

```{sql}
#| connection: con
SELECT
    "origin",
    "month",
    AVG("dep_delay") AS "avg_dep_delay",
    STDDEV("dep_delay") AS "sd_dep_delay"
FROM
    flights
WHERE dep_delay > 0
GROUP BY
    "month",
    "origin"
ORDER BY
  "origin",
  "month"
LIMIT 5
OFFSET 10
```
5 records
origin month avg_dep_delay sd_dep_delay
EWR 11 27.34014 41.06456
EWR 12 38.42523 53.11234
JFK 1 33.11765 53.54925
JFK 2 36.82556 50.55263
JFK 3 35.88225 48.38411

Note the increases in the avg_dep_delay as the summary is based only on the filtered rows.

  • R equivalent
flights %>%
  filter(dep_delay > 0) |>                  # WHERE dep_delay > 0
  group_by(origin, month) |>                # GROUP BY origin, month
  summarise(
    avg_dep_delay = mean(dep_delay),        # AVG(dep_delay)
    sd_dep_delay = sd(dep_delay),           # STDDEV(dep_delay)
    .groups = "drop"
  ) |> 
  arrange(origin, month) |>                  # ORDER BY origin, month
  slice(11:15)                               # LIMIT 5 OFFSET 10

8.17.2 Use HAVING to Filter Records After Grouping with GROUP BY

HAVING allows you to filter groups after aggregation, rather than the individual rows going into the groups.

  • Often, you calculate summary statistics such as averages, totals, or counts, and then only want to keep groups that meet certain criteria. For example, you might only want destinations where the average departure delay exceeds a threshold.
  • Using HAVING allows these filtered groups to be used in subsequent calculations or as a subquery for joining with other tables, without having to manually filter the aggregated results outside the query.
  • This makes queries more compact, readable, and efficient, especially when working with multiple levels of aggregation.
Important

The key advantage of HAVING is that it allows you to filter based on the results of aggregate functions (like SUM(), AVG(), COUNT()) computed for each group.

  • In contrast, WHERE cannot use aggregate functions because it operates before any grouping or aggregation has taken place; it only filters individual rows.

This distinction is important: if you want to keep only groups that meet certain criteria, such as “destinations with more than one carrier” or “airlines with an average delay above 20 minutes,” HAVING lets you do this directly on the aggregated values.

  • Without HAVING, you would need an extra subquery or intermediate table to perform the same filtering, as sketched below.
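
For instance, here is a sketch (not evaluated here) of that workaround: the aggregation happens in a derived table, and the group-level filter becomes an ordinary WHERE in the outer query.

```{sql}
#| eval: false
#| connection: con
-- Same effect as HAVING AVG(dep_delay) > 35, without HAVING
SELECT *
FROM (
    SELECT origin, month, AVG(dep_delay) AS avg_dep_delay
    FROM flights
    WHERE dep_delay > 0
    GROUP BY origin, month
) AS g
WHERE avg_dep_delay > 35;
```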

Let’s use both WHERE to filter the rows to positive delays, and HAVING to find those origins and months with average delays greater than 35.

```{sql}
#| connection: con
SELECT
    origin,
    month,
    AVG(dep_delay) AS avg_dep_delay,
    STDDEV(dep_delay) AS sd_dep_delay
FROM flights
WHERE dep_delay > 0                     -- row-level filter
GROUP BY origin, month
HAVING AVG(dep_delay) > 35              -- group-level filter
ORDER BY origin, month
LIMIT 5
OFFSET 10;
```
5 records
origin month avg_dep_delay sd_dep_delay
JFK 3 35.88225 48.38411
JFK 4 40.91712 57.72617
JFK 5 39.10808 52.10440
JFK 6 47.98522 62.77220
JFK 7 49.56254 64.39900

  • R equivalent

flights %>%
  filter(dep_delay > 0) %>%                    # WHERE dep_delay > 0
  group_by(origin, month) %>%                  # GROUP BY origin, month
  summarise(
    avg_dep_delay = mean(dep_delay),
    sd_dep_delay = sd(dep_delay),
    .groups = "drop"
  ) %>%
  filter(avg_dep_delay > 35) %>%              # HAVING AVG(dep_delay) > 35
  arrange(origin, month) %>%                   # ORDER BY origin, month
  slice(11:15)                                 # LIMIT 5 OFFSET 10
SQL Query Execution Order

We write queries in the following order:

SELECT ...
FROM ...
WHERE ...
GROUP BY ...
HAVING ...
ORDER BY ...
LIMIT ...

However, the database actually executes them in this logical order:

| Step | Clause | Purpose |
|------|--------|---------|
| 1 | FROM | Choose the source tables (and joins). |
| 2 | WHERE | Filter individual rows BEFORE grouping. |
| 3 | GROUP BY | Group rows into partitions (if requested). |
| 4 | HAVING | Filter entire groups AFTER aggregation. |
| 5 | SELECT | Choose columns and compute expressions. |
| 6 | ORDER BY | Sort the final result set. |
| 7 | LIMIT | Restrict the number of rows returned. |

The difference between the way we write SQL syntax and the actual query execution order means that clauses like WHERE and HAVING operate at different stages than they appear in the code.

Understanding this helps explain why WHERE filters rows before aggregation while HAVING filters groups after aggregation, and why aggregate functions cannot be used in WHERE but can be used in HAVING.

8.18 Window Functions Use OVER to Create Grouped Summaries without Aggregating the Records

Window functions are for when you want to add a grouped summary to the data without collapsing the data into one record for each group.

  • As an example, you want to calculate the total departure delay time for each destination so you can use it to then calculate the percentage of that departure delay time for each airline flying to that destination.
  • You could do that with a summarized data frame and then a join.
  • In R, this can be done by using group_by() and then using mutate() instead of summarize() to add a new column to the data frame with the result for each group while preserving all of the rows.

In SQL, you use the aggregate function you want in what SQL calls a Window Function.

  • A window function uses an aggregate function to calculate a result for one or more fields (across a specified set of records) and adds the results as new fields in the result set (table), but it does not collapse the records, so the result set has all the original un-aggregated records along with the new aggregated results (for each set of records).
  • All the records in the given set of specified records have the same values in the new aggregated fields.

To use an aggregate function as a window function, use the OVER clause with a (PARTITION BY myfield) modifier.

AGG_FUNCTION(column) OVER (PARTITION BY column1, column2 ORDER BY column3)
  • AGG_FUNCTION is an aggregate function (e.g., SUM, AVG, COUNT)

  • PARTITION BY divides the data into groups (like group_by in R)

  • ORDER BY (optional) defines the order of rows inside each partition (important for ROW_NUMBER(), RANK(), LEAD(), LAG(), etc.; see the sketch after this list)

  • The PARTITION BY clause is a sub-clause of the OVER clause.

  • The PARTITION BY breaks up (partitions) the table into “separate” tables based on one or more chosen fields.

  • This is similar to creating a virtual grouped data frame in R with one or more groups.

  • Then, the aggregate function is applied to the records in each partition and a new field is added to the records in that partition with the aggregated value.
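
Here is the promised sketch of ORDER BY inside OVER (not evaluated here): ROW_NUMBER() ranks flights by departure delay within each origin; the alias delay_rank is made up for illustration.

```{sql}
#| eval: false
#| connection: con
SELECT flight, origin, dep_delay,
       ROW_NUMBER() OVER (
           PARTITION BY origin     -- restart the numbering for each origin
           ORDER BY dep_delay DESC -- longest delays get rank 1
       ) AS delay_rank
FROM flights
WHERE dep_delay IS NOT NULL
LIMIT 10;
```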

The comparison in Table 8.2 might help connect the concepts of using dplyr::group_by() with using window functions.

Table 8.2: Comparing dplyr::group_by() with the SQL Window function

| R (dplyr) | SQL (Window Function) |
|-----------|-----------------------|
| group_by(dest) | PARTITION BY dest |
| operate on each group | aggregate over each partition |
| mutate() preserves rows | window function preserves rows |
| summarize() collapses rows | normal aggregate without OVER() collapses rows |
Note

Window functions are most useful when you need a grouped statistic as part of another calculation, rather than as a final summarized result.

  • In R, this is the same situation where you’d use group_by() with mutate() instead of the usual summarize(): you want to preserve the full dataset, but also attach totals, percentages, or ranks.
  • For example, calculating the percentage of total delay per airline requires knowing the total delays for each destination (the group), but keeping the individual flight rows (the detail).

In everyday SQL, people often use simple SELECT and GROUP BY queries, so window functions are less common.

  • But when you need running totals, percentages of group totals, rankings, or comparisons within groups without collapsing the rows, window functions are the most natural and powerful tool.

8.18.1 Window Function Example 1

Let’s find the flight numbers for the longest flights (in terms of air time) for each airport (origin).

  • This uses the window function as part of the SELECT to create a new field.
  • The Aggregate function is MAX.
  • The PARTITION BY is origin.
  • Let’s use AS to create the alias apt_max for the name of the new field with the airport maximum flight time.
```{sql}
#| connection: con
SELECT
    "flight",
    "origin",
    "air_time",
    MAX("air_time") -- The AGG_function
    OVER (PARTITION BY "origin") AS "apt_max"
FROM
    flights
LIMIT 10 
OFFSET 120830;
```
Displaying records 1 - 10
flight origin air_time apt_max
1691 JFK 309 691
1447 JFK 75 691
583 JFK 128 691
303 JFK 350 691
1743 JFK 101 691
5796 EWR 97 695
1223 EWR 200 695
297 EWR 355 695
5 EWR 172 695
5814 EWR 120 695
  • Note the apt_max values are the same for all the records for each origin.

Having this new field (apt_max) may not be interesting on its own, but by computing it in a derived table, we can use it as an intermediate result for further analysis, like finding the percentage of each flight’s time relative to the maximum or joining with other datasets.

A “derived table” is a temporary table created on the fly inside a query. It is defined by using a query in the FROM clause and exists only for the duration of that query.

  • A derived table allows you to compute intermediate results that you can then filter, join, or aggregate in the outer query.
  • It is often used when you need a result that cannot be expressed directly in WHERE (like a window function).
  • It does not create a permanent table in the database; it is transient.

Let’s find the destination for the longest flights from each airport.

```{sql}
#| label: lst-derived-table
#| connection: con
SELECT flight, origin, dest, air_time    -- (1) outer query
FROM (
    SELECT                               -- (2) inner query / derived table
        flight,
        origin,
        dest,
        air_time,
        MAX(air_time) OVER (PARTITION BY origin) AS max_air_time
    FROM flights
) AS t
WHERE air_time = max_air_time;
```
3 records
flight origin dest air_time
51 JFK HNL 691
745 LGA DEN 331
15 EWR HNL 695
  1. Inner query / derived table (AS t):
  • Computes the maximum flight time (MAX(air_time) OVER (PARTITION BY origin)) for each origin airport.
  • Produces a temporary table with all flights and the maximum flight time for that airport.
  2. Outer query:
  • Filters the derived table to keep only flights where air_time equals the maximum for that origin.
  • Returns the longest flight(s) from each airport.

8.18.2 Window Function Example 2

Here is a slightly more complicated example.

We want the percentage of the total departure delay for a destination for each flight to that destination.

  • We first use a window function to SUM all delays for a destination and give it the alias total_delay_dest.
  • Then we use total_delay_dest to calculate for each flight, their percentage of that delay.
  • Use WHERE to only calculate for flights with a positive delay, i.e., that were actually delayed.
```{sql}
#| connection: con
SELECT
    dest,
    flight,
    month, 
    day,
    carrier,
    dep_delay,
    SUM(dep_delay) OVER (PARTITION BY dest) AS total_delay_dest,
    dep_delay * 100.0 / SUM(dep_delay) OVER (PARTITION BY dest) AS pct_of_total
FROM flights
WHERE dep_delay > 0
ORDER BY dest, carrier, day, flight
LIMIT 5;
```
5 records
dest flight month day carrier dep_delay total_delay_dest pct_of_total
ABQ 1505 8 1 B6 54 4076 1.3248283
ABQ 1505 7 1 B6 9 4076 0.2208047
ABQ 65 10 2 B6 9 4076 0.2208047
ABQ 1505 9 2 B6 125 4076 3.0667321
ABQ 1505 7 2 B6 21 4076 0.5152110

Now we want to also get the percentage of the total delay at each destination caused by each carrier across all their flights.

  • We add another window function to create pct_of_total_by_carrier.
```{sql}
#| connection: con
SELECT
    dest,
    flight,
    carrier,
    dep_delay,
    -- total delay at the destination
    SUM(dep_delay) OVER (PARTITION BY dest) AS total_delay_dest,

    -- flight-level % contribution
    dep_delay * 100.0 / SUM(dep_delay) OVER (PARTITION BY dest) AS pct_of_total_flight,

    -- carrier-level total delay
    SUM(dep_delay) OVER (PARTITION BY dest, carrier) AS total_delay_dest_by_carrier,

    -- carrier-level % contribution
    SUM(dep_delay) OVER (PARTITION BY dest, carrier) * 100.0
        / SUM(dep_delay) OVER (PARTITION BY dest) AS pct_of_total_by_carrier
FROM flights
WHERE dep_delay > 0
ORDER BY dest, carrier, flight, dep_delay
LIMIT 10;
```
Displaying records 1 - 10
dest flight carrier dep_delay total_delay_dest pct_of_total_flight total_delay_dest_by_carrier pct_of_total_by_carrier
ABQ 65 B6 1 4076 0.0245339 4076 100
ABQ 65 B6 1 4076 0.0245339 4076 100
ABQ 65 B6 2 4076 0.0490677 4076 100
ABQ 65 B6 3 4076 0.0736016 4076 100
ABQ 65 B6 3 4076 0.0736016 4076 100
ABQ 65 B6 4 4076 0.0981354 4076 100
ABQ 65 B6 4 4076 0.0981354 4076 100
ABQ 65 B6 6 4076 0.1472031 4076 100
ABQ 65 B6 7 4076 0.1717370 4076 100
ABQ 65 B6 9 4076 0.2208047 4076 100

Let’s limit to those destinations that have more than one carrier.

  • We will use the derived table approach.
```{sql}
#| connection: con
SELECT
    dest,
    flight,
    carrier,
    dep_delay,
    SUM(dep_delay) OVER (PARTITION BY dest) AS total_delay_dest,
    dep_delay * 100.0 / SUM(dep_delay) OVER (PARTITION BY dest) AS pct_of_total_flight,
    SUM(dep_delay) OVER (PARTITION BY dest, carrier) AS total_delay_dest_by_carrier,
    SUM(dep_delay) OVER (PARTITION BY dest, carrier) * 100.0
        / SUM(dep_delay) OVER (PARTITION BY dest) AS pct_of_total_by_carrier
FROM flights
WHERE dep_delay > 0
  AND dest IN (
      SELECT dest
      FROM flights
      GROUP BY dest
      HAVING COUNT(DISTINCT carrier) > 1
  )
ORDER BY dest, flight, carrier, dep_delay
LIMIT 10
OFFSET 3280;
```
Displaying records 1 - 10
dest flight carrier dep_delay total_delay_dest pct_of_total_flight total_delay_dest_by_carrier pct_of_total_by_carrier
ATL 1547 DL 112 254414 0.0440227 135500 53.2596477
ATL 1547 DL 114 254414 0.0448089 135500 53.2596477
ATL 1547 DL 127 254414 0.0499186 135500 53.2596477
ATL 1547 DL 138 254414 0.0542423 135500 53.2596477
ATL 1547 DL 141 254414 0.0554215 135500 53.2596477
ATL 1547 DL 192 254414 0.0754675 135500 53.2596477
ATL 1554 UA 2 254414 0.0007861 1801 0.7079013
ATL 1554 UA 4 254414 0.0015722 1801 0.7079013
ATL 1554 UA 7 254414 0.0027514 1801 0.7079013
ATL 1554 UA 18 254414 0.0070751 1801 0.7079013
  • R equivalent
flights %>%
  filter(dep_delay > 0)  |> 
  # Keep only destinations with more than one carrier
  group_by(dest) %>%
  filter(n_distinct(carrier) > 1)  |> 
  ungroup() |> 
  # Compute window-like summaries
  group_by(dest) %>%
  mutate(
    total_delay_dest = sum(dep_delay),
    pct_of_total_flight = dep_delay * 100 / total_delay_dest
  ) |> 
  group_by(dest, carrier) |> 
  mutate(
    total_delay_dest_by_carrier = sum(dep_delay),
    pct_of_total_by_carrier = total_delay_dest_by_carrier * 100 / total_delay_dest
  ) |> 
  ungroup() |> 
  arrange(dest, flight, carrier, dep_delay) %>%
  select(dest, flight, carrier, dep_delay,
         total_delay_dest, pct_of_total_flight,
         total_delay_dest_by_carrier, pct_of_total_by_carrier) |> 
  slice(3281:3290)

Now, if we only want the delay at the destination and carrier level, we can simplify without the window for each flight.

  • Let’s ROUND() the percentages as well.
```{sql}
#| connection: con
SELECT
    dest,
    carrier,
    SUM(SUM(dep_delay)) OVER (PARTITION BY dest) AS total_delay_dest,
    SUM(dep_delay) AS total_delay_dest_by_carrier,
    ROUND(SUM(dep_delay) * 100.0 / SUM(SUM(dep_delay)) OVER (PARTITION BY dest), 2) AS pct_of_total_by_carrier
FROM flights
WHERE dep_delay > 0
  AND dest IN (
      SELECT dest
      FROM flights
      GROUP BY dest
      HAVING COUNT(DISTINCT carrier) > 1
  )
GROUP BY dest, carrier
ORDER BY dest, pct_of_total_by_carrier DESC
LIMIT 10
OFFSET 0;
```
Displaying records 1 - 10
dest carrier total_delay_dest total_delay_dest_by_carrier pct_of_total_by_carrier
ATL DL 254414 135500 53.26
ATL FL 254414 47001 18.47
ATL EV 254414 40568 15.95
ATL MQ 254414 29149 11.46
ATL UA 254414 1801 0.71
ATL WN 254414 217 0.09
ATL 9E 254414 178 0.07
AUS B6 36623 12824 35.02
AUS UA 36623 11195 30.57
AUS AA 36623 6146 16.78
  • R equivalent
flights %>%
  filter(dep_delay > 0) |>                 # Only positive delays
  group_by(dest) |> 
  filter(n_distinct(carrier) > 1) |>       # Only destinations with >1 carrier
  ungroup() |> 
  group_by(dest, carrier) |> 
  summarise(
    total_delay_carrier = sum(dep_delay),
    .groups = "drop"
  ) |> 
  group_by(dest) |> 
  mutate(
    total_delay_dest = sum(total_delay_carrier),                        # total per destination
    pct_of_total_by_carrier = round(total_delay_carrier * 100 / total_delay_dest, 2)
  ) |> 
  ungroup() |> 
  select(dest, carrier, total_delay_dest, total_delay_carrier, pct_of_total_by_carrier) |> 
  arrange(dest, desc(pct_of_total_by_carrier)) |> 
  slice_head(n = 10)  

See SQL Window Functions or SQL PARTITION BY Clause overview for other examples.

8.19 CASE and CASE WHEN Recode or Create New Fields Based on the Values of Existing Fields

CASE WHEN is the SQL expression for introducing conditional logic into queries so you can create new columns or modify existing ones based on conditions.

CASE WHEN evaluates a condition for each row and returns a specified value if the condition is true, otherwise it can return another value or NULL.

  • It is similar to dplyr::case_when() in R or IF … ELSE statements in other programming languages.
  • CASE cannot stand alone; it always needs at least one WHEN.
  • The WHEN may use simple logic or complicated expressions.
  • CASE WHEN is particularly useful for categorizing numeric values into buckets, creating flags, or computing conditional summaries directly in the SELECT statement.
  • CASE WHEN stops its evaluation at the first TRUE, so the order of the WHENs can affect the results.

Using CASE keeps the logic inside the SQL query so it is transparent, and you don’t need a separate step in R or Python to mutate the table after retrieval.

The minimally valid syntax for a simple CASE WHEN working with a single column is:

CASE column
    WHEN value THEN result
END

The valid syntax for a more general CASE WHEN expression is:

CASE
    WHEN some_condition THEN result
    WHEN some_condition THEN result  -- 0 or more as needed
    ELSE result -- optional
END

Let’s use a simple CASE expression, operating on a single column without logical conditions in the WHEN, to add a variable with the full name for each value of origin.

```{sql}
#| label: case-when
#| connection: con
SELECT
    flight,
    origin,
    CASE origin
        WHEN 'JFK' THEN 'John F. Kennedy'
        WHEN 'LGA' THEN 'LaGuardia'
        WHEN 'EWR' THEN 'Newark'
        ELSE 'Other Airport'
    END AS origin_full_name
FROM flights
LIMIT 10;
```
Displaying records 1 - 10
flight origin origin_full_name
1545 EWR Newark
1714 LGA LaGuardia
1141 JFK John F. Kennedy
725 JFK John F. Kennedy
461 LGA LaGuardia
1696 EWR Newark
507 EWR Newark
5708 LGA LaGuardia
79 JFK John F. Kennedy
301 LGA LaGuardia
  • R equivalent using either case_when() or case_match().
flights |>
  select(flight, origin) |>
  mutate(orig_long = case_when(
    origin == "JFK" ~ "John F. Kennedy",
    origin == "LGA" ~ "LaGuardia",
    origin == "EWR" ~ "Newark Liberty",
    .default = "other"
  ))

## or for value matching
flights |>
  select(flight, origin) |>
  mutate(orig_long = case_match(
    origin,
    "JFK" ~ "John F. Kennedy",
    "LGA" ~ "LaGuardia",
    "EWR" ~"Newark Liberty",
    .default = "other"
  ))

Use the more general CASE WHEN to incorporate logical comparisons into the WHEN expression to create categories for departure delays.

```{sql}
#| connection: con
SELECT
    flight,
    dep_delay,
    CASE 
        WHEN dep_delay < 0 THEN 'early'
        WHEN dep_delay = 0 THEN 'on time'
        WHEN dep_delay <= 30 THEN 'slightly delayed'
        ELSE 'very delayed'
    END AS delay_category
FROM flights
LIMIT 10;
```
Displaying records 1 - 10
flight dep_delay delay_category
1545 2 slightly delayed
1714 4 slightly delayed
1141 2 slightly delayed
725 -1 early
461 -6 early
1696 -4 early
507 -5 early
5708 -3 early
79 -3 early
301 -2 early
  • R equivalent.
flights %>%
  select(flight, dep_delay) |> 
  mutate(
    delay_category = case_when(
      dep_delay < 0 ~ "early",
      dep_delay == 0 ~ "on time",
      dep_delay <= 30 ~ "slightly delayed",
      TRUE ~ "very delayed"
    )
  ) |> 
  slice_head(n = 10)

8.20 JOIN Uses Two Tables to Create a Single Table Using the Fields in ON

The JOIN clause combines rows from two tables into a single result set based on related columns.

  • Joins can add columns from another table or, depending on the type, remove rows (filtering joins) if no matches exist.
  • The default operation for JOIN is an INNER JOIN.

The ON sub-clause specifies the matching condition, i.e., how columns from each table are compared to decide which rows pair together.

  • While the most common condition is equality (=), you must always specify the condition (e.g., =, <, >, <>, BETWEEN, etc.).
  • USING is a simpler version of ON for when the column name(s) are the same in both tables and you want to join on equality of the values in the column.

The JOIN syntax is as follows, where table1 is considered the “left” table and table2 is considered the “right” table.

SELECT ... 
FROM table1
JOIN table2 ON table1.id = table2.id
...

JOIN has several modifiers that affect how the join operates. These include:

  • INNER JOIN: Returns only rows with matching keys in both tables.
    • Columns from both tables appear in the result; unmatched rows are excluded (what makes it an “inner” join).
    • This is the default operation for JOIN without a modifier.
  • LEFT JOIN (or LEFT OUTER JOIN): Returns all rows from the left table, with columns from the right table included where there’s a match.
    • Rows with unmatched right-table columns get NULL values (what makes it an “outer” join).
  • RIGHT JOIN (or RIGHT OUTER JOIN): Returns all rows from the right table, with columns from the left table included where there’s a match.
    • Unmatched left-table columns are NULL.
  • FULL JOIN (or FULL OUTER JOIN): Returns all rows from both tables; unmatched columns from either table are NULL.

Table 8.3 summarizes the results for each type of join.

Table 8.3: Summary of JOIN modifiers and results

| Join Type | Rows Returned | Columns Included |
|-----------|---------------|------------------|
| INNER JOIN | Only rows with matching keys in both tables | Columns from both tables; unmatched rows excluded |
| LEFT JOIN / LEFT OUTER | All rows from left table; matched rows from right | Columns from both tables; unmatched right columns = NULL |
| RIGHT JOIN / RIGHT OUTER | All rows from right table; matched rows from left | Columns from both tables; unmatched left columns = NULL |
| FULL JOIN / FULL OUTER | All rows from both tables | Columns from both tables; unmatched columns = NULL |

8.20.1 JOIN and its Modifiers Examples

Default JOIN with USING.

  • USING uses implicit equality between the column names.
  • Since USING can take one or more column names, they must be wrapped in (...).
  • USING automatically removes duplicate column names.
```{sql}
#| connection: con
-- Sample tables
-- flights: flight, carrier, dest
-- airlines: carrier, name

SELECT flight, origin, dest, name
FROM flights
JOIN airlines
  USING (carrier)
LIMIT 5;
```
5 records
flight origin dest name
1545 EWR IAH United Air Lines Inc.
1714 LGA IAH United Air Lines Inc.
1141 JFK MIA American Airlines Inc.
725 JFK BQN JetBlue Airways
461 LGA ATL Delta Air Lines Inc.

Default JOIN with ON.

  • Use an implicit AS to create aliases for the tables: f for flights and a for airlines.
  • Reference each variable with its table alias.
  • Use explicit equality in the ON sub-clause.
  • Note the columns in the ON do not have to be in the SELECT.
```{sql}
#| connection: con
-- Sample tables
-- flights: flight, carrier, dest
-- airlines: carrier, name

SELECT f.flight, f.origin, f.dest, a.name
FROM flights f
INNER JOIN airlines a
  ON f.carrier = a.carrier
LIMIT 5;
```
5 records
flight origin dest name
1545 EWR IAH United Air Lines Inc.
1714 LGA IAH United Air Lines Inc.
1141 JFK MIA American Airlines Inc.
725 JFK BQN JetBlue Airways
461 LGA ATL Delta Air Lines Inc.

The good news is the nycflights database is very robust, with aligned column names and few NULLs, which makes it useful for demonstrating USING.

  • The bad news is it is too clean to be useful for demonstrating different outcomes with the other joins, as they will usually return the same results.

We will use a new connection to a small demo database for this section.

  1. Create a new connection called con_demo and show the tables.
db_path_demo <- "./data/demo_joins.duckdb"
con_demo <- dbConnect(duckdb::duckdb(), dbdir = db_path_demo, read_only = FALSE)
```{sql}
#| connection: con_demo
SHOW tables;
```
4 records
name
departments
employees
projects
salaries

Since this is a very small database for demo purposes, we will use SELECT * for the departments and employees tables for now.

```{sql}
#| connection: con_demo
SELECT * 
FROM departments
LIMIT 10;
```
5 records
department_id department_name
10 HR
20 IT
30 Finance
40 Marketing
60 Legal
```{sql}
#| connection: con_demo
SELECT * 
FROM employees
LIMIT 10;
```
7 records
emp_id name dept_id age salary
1 Alice 10 25 40000
2 Bob 20 32 55000
3 Charlie 30 45 70000
4 Diana 20 28 50000
5 Eve 50 35 62000
6 Frank 40 29 38000
7 Grace NA 41 80000
  • Note: the two tables do not have any column names in common, so we can’t do a join with USING.

Default JOIN with ON

  • ON requires fully qualifying the variable names with the table name, using a . in between.
  • Here we create an alias (implicit AS) for each table and use them in the SELECT and ON expressions.
  • Again, the columns in the ON do not have to be part of the SELECT.
```{sql}
#| connection: con_demo
#| label: join-example-flights
-- JOIN: by default an Inner Join so only employees with a matching department
SELECT e.emp_id, e.name, d.department_name
FROM employees e
JOIN departments d
  ON e.dept_id = d.department_id
LIMIT 5;
```
5 records
emp_id name department_name
1 Alice HR
2 Bob IT
3 Charlie Finance
4 Diana IT
6 Frank Marketing
  • Note: Employees Eve and Grace did not show up, as Eve’s dept_id = 50 has no match in d.department_id and Grace’s dept_id is NULL.

Adding the modifier INNER does not change the result since the default for JOIN is an inner join.

```{sql}
#| connection: con_demo
#| label: inner-join-example-explicit
-- JOIN: by default an Inner Join so only employees with a matching department
SELECT e.emp_id, e.name, d.department_name
FROM employees e
INNER JOIN departments d
  ON e.dept_id = d.department_id
LIMIT 5;
```
5 records
emp_id name department_name
1 Alice HR
2 Bob IT
3 Charlie Finance
4 Diana IT
6 Frank Marketing

LEFT JOIN: all rows from employees, include department columns when available, otherwise NULL.

```{sql}
#| connection: con_demo
#| label: left-join-example
-- LEFT JOIN: all employees, NULL if no matching department
SELECT e.emp_id, e.name, d.department_name
FROM employees e
LEFT JOIN departments d
  ON e.dept_id = d.department_id;
```
7 records
emp_id name department_name
1 Alice HR
2 Bob IT
3 Charlie Finance
4 Diana IT
6 Frank Marketing
5 Eve NA
7 Grace NA
  • Note: Eve and Grace have NULL in the added department_name column, as neither has a matching d.department_id.

RIGHT JOIN: all rows from departments, include employee columns when available, otherwise NULL.

```{sql}
#| connection: con_demo
#| label: right-join-example
-- RIGHT JOIN: all departments, NULL if no matching employee
SELECT e.emp_id, e.name, d.department_name
FROM employees e
RIGHT JOIN departments d
  ON e.dept_id = d.department_id;
```
6 records
emp_id name department_name
1 Alice HR
2 Bob IT
3 Charlie Finance
4 Diana IT
6 Frank Marketing
NA NA Legal

Note: the legal department without any employees gets NULL for the employee columns.

FULL JOIN: all rows for all employees and all departments, NULL where unmatched

```{sql}
#| connection: con_demo
#| label: full-join-example
-- FULL JOIN: all employees and all departments, NULL where unmatched
SELECT e.emp_id, e.name, d.department_name
FROM employees e
FULL JOIN departments d
  ON e.dept_id = d.department_id;
```
8 records
emp_id name department_name
1 Alice HR
2 Bob IT
3 Charlie Finance
4 Diana IT
6 Frank Marketing
5 Eve NA
7 Grace NA
NA NA Legal

Note: Employees without a matched Department get NULL for the department column and departments without any employees get NULL for the employee columns.

SEMI JOIN returns rows in the left table only if they have a match in the right table.

```{sql}
#| connection: con_demo
#| label: lst-semi-join
SELECT e.*
FROM employees e
SEMI JOIN departments d
  ON e.dept_id = d.department_id;
```
5 records
emp_id name dept_id age salary
1 Alice 10 25 40000
2 Bob 20 32 55000
3 Charlie 30 45 70000
4 Diana 20 28 50000
6 Frank 40 29 38000

ANTI JOIN returns rows in the left table that do not have a match in the right table.

```{sql}
#| connection: con_demo
#| label: lst-anti-join
SELECT e.*
FROM employees e
ANTI JOIN departments d
  ON e.dept_id = d.department_id;
```
2 records
emp_id name dept_id age salary
5 Eve 50 35 62000
7 Grace NA 41 80000

Some SQL flavors do not have an explicit ANTI join modifier, but there are multiple ways to achieve the same kind of result. Here are two examples.

  • Using IS NULL
```{sql}
#| connection: con_demo
#| label: anti-join-example-NULL
-- ANTI JOIN: employees with no matching department
SELECT e.emp_id, e.name
FROM employees e
LEFT JOIN departments d 
  ON e.dept_id = d.department_id
WHERE d.department_id IS NULL;
```
2 records
emp_id name
5 Eve
7 Grace
  • Using NOT IN, accounting for potential NULL values.
```{sql}
#| connection: con_demo
#| label: anti-join-example-NOT-in
-- ANTI JOIN: employees with no matching department
SELECT e.emp_id, e.name
FROM employees e
WHERE e.dept_id NOT IN (SELECT department_id FROM departments)
 OR e.dept_id IS NULL;
```
2 records
emp_id name
5 Eve
7 Grace

8.20.2 ON with Non-Equality Conditions Examples

Let’s look at the other two tables.

```{sql}
#| connection: con_demo
SELECT * 
FROM salaries
```
3 records
band_id band_name min_salary max_salary
1 Junior 30000 49999
2 Mid-level 50000 69999
3 Senior 70000 99999
```{sql}
#| connection: con_demo
SELECT * 
FROM projects
```
6 records
emp_id project_name
1 Alpha
2 Beta
2 Gamma
3 Alpha
5 Delta
7 Gamma
  1. Inequality Join (<, >, <=, >=, <>)

Example: employees older than department_id × 2.

```{sql}
#| connection: con_demo
SELECT e.emp_id, e.name, e.age, d.department_id
FROM employees e
JOIN departments d
  ON e.age > d.department_id * 2
ORDER BY name;
```
9 records
emp_id name age department_id
1 Alice 25 10
2 Bob 32 10
3 Charlie 45 20
3 Charlie 45 10
4 Diana 28 10
5 Eve 35 10
6 Frank 29 10
7 Grace 41 20
7 Grace 41 10
  • Note: this returns every combination for which the condition is true.
  2. Non-Equi Join with Ranges (using BETWEEN)
  • Map employees to salary bands.
```{sql}
#| connection: con_demo
#| label: lst-join-range
SELECT e.emp_id, e.name, e.salary, s.band_name
FROM employees e
JOIN salaries s
  ON e.salary BETWEEN s.min_salary AND s.max_salary
ORDER BY name;
```
7 records
emp_id name salary band_name
1 Alice 40000 Junior
2 Bob 55000 Mid-level
3 Charlie 70000 Senior
4 Diana 50000 Mid-level
5 Eve 62000 Mid-level
6 Frank 38000 Junior
7 Grace 80000 Senior
  3. Join with IN / NOT IN
```{sql}
#| connection: con_demo
#| label: lst-join-in-not-in
SELECT e.emp_id, e.name, d.department_name
FROM employees e
JOIN departments d
  ON e.dept_id NOT IN (10, 20) AND e.dept_id = d.department_id
ORDER BY name;
```
2 records
emp_id name department_name
3 Charlie Finance
6 Frank Marketing
  4. Join with Compound Conditions (AND / OR)
```{sql}
#| connection: con_demo
#| label: lst-join-and-or
SELECT e.emp_id, e.name, e.salary, d.department_name, d.department_id
FROM employees e
LEFT JOIN departments d
  ON (e.dept_id IN (10, 20) OR e.salary > 70000)
ORDER BY name;
```
Displaying records 1 - 10
emp_id name salary department_name department_id
1 Alice 40000 HR 10
1 Alice 40000 IT 20
1 Alice 40000 Finance 30
1 Alice 40000 Marketing 40
1 Alice 40000 Legal 60
2 Bob 55000 HR 10
2 Bob 55000 IT 20
2 Bob 55000 Finance 30
2 Bob 55000 Marketing 40
2 Bob 55000 Legal 60

Note: This returns more rows than you might expect due to the combination of LEFT JOIN and OR.

  • ON does not just filter rows; it determines which right-table rows match.
  • Using OR inside ON with multiple right-table rows can create duplicates of left-table rows.
  • To avoid this, either:
    1. Filter the right table in a subquery before joining (see the sketch below).
    2. Use = or carefully structured conditions in ON.
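
Here is a sketch (not evaluated here) of the first option: the right table is reduced in a subquery and the join condition returns to plain equality.

```{sql}
#| eval: false
#| connection: con_demo
-- Filter the right table first, then join on equality only
SELECT e.emp_id, e.name, e.salary, d.department_name
FROM employees e
LEFT JOIN (
    SELECT department_id, department_name
    FROM departments
    WHERE department_id IN (10, 20)
) AS d
  ON e.dept_id = d.department_id
ORDER BY e.name;
```
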
  5. CROSS JOIN pairs every row of one table with every row of the other (a Cartesian product).
```{sql}
#| connection: con_demo
#| label: join-cross
SELECT e.emp_id, e.name, p.project_name
FROM employees e
CROSS JOIN projects p
LIMIT 10;
```
Displaying records 1 - 10
emp_id name project_name
1 Alice Alpha
2 Bob Alpha
3 Charlie Alpha
4 Diana Alpha
5 Eve Alpha
6 Frank Alpha
7 Grace Alpha
1 Alice Beta
2 Bob Beta
3 Charlie Beta
  6. Text Comparison with LIKE or IN
```{sql}
#| connection: con_demo
SELECT e.emp_id, e.name, d.department_name
FROM employees e
INNER JOIN departments d
  ON e.dept_id = d.department_id
     AND d.department_name LIKE '%I%'
ORDER BY e.emp_id;
```
2 records
emp_id name department_name
2 Bob IT
4 Diana IT

We are done with the demo database, so let’s close its connection.

dbDisconnect(con_demo, shutdown = TRUE)

8.21 Subqueries

SQL allows you to embed or nest queries within queries. These are referred to as subqueries.

“A subquery is just a query used as a data source in the FROM clause, instead of the usual table.” R for Data Science (2ed)

  • As an example, if you need data from a “table” that does not exist, you can use a query to create the new table (virtually) and insert it into the FROM clause of an outer query.
Important

To delineate the query as a subquery, you put it inside parentheses (...), which requires the subquery to be executed before other parts of the query can be evaluated.

If using the results in other parts of a complicated query, you must also assign an “alias” for the result. This is similar to using a pronoun in R piping; it represents the object returned by the subquery.

Subqueries are allowed in places where SQL expects a value, table, or expression.

  1. Subquery in the FROM clause (as a table)
  • Creates a “virtual table” that the outer query can use.
  • Must assign an alias so the outer query can reference it.
  • The subquery is treated like a table.
```{sql}
#| connection: con
SELECT f.carrier, f.dep_delay
FROM (
    SELECT *
    FROM flights
    WHERE dep_delay > 60
) AS f
WHERE f.origin = 'JFK'
LIMIT 10;
```
Displaying records 1 - 10
carrier dep_delay
AA 71
MQ 853
B6 77
B6 122
EV 119
B6 88
B6 91
9E 88
AA 63
AA 131
  2. Subquery in the SELECT clause (as a scalar value)
  • This subquery is an expression: it returns a single value for each row.
```{sql}
#| connection: con
SELECT carrier,
       (SELECT AVG(dep_delay) 
        FROM flights f2 
        WHERE f2.carrier = f1.carrier) AS avg_delay
FROM flights f1
GROUP BY carrier
LIMIT 10;
```
Displaying records 1 - 10
carrier avg_delay
OO 12.586207
WN 17.711744
AS 5.804775
F9 20.215543
DL 9.264504
B6 13.022522
FL 18.726075
HA 4.900585
EV 19.955390
AA 8.586016
  3. Subquery in the WHERE clause (as a condition)
  • Returns values to compare with column values.
  • Subquery returns a list of values; the outer query filters based on that.
```{sql}
#| connection: con
SELECT flight, carrier
FROM flights
WHERE carrier IN (
    SELECT carrier
    FROM airlines
    WHERE name LIKE '%Jet%'
)
LIMIT 10;
```
Displaying records 1 - 10
flight carrier
373 B6
4694 EV
24 B6
56 B6
3260 EV
673 B6
5273 EV
1 B6
600 B6
4553 EV
  4. Subquery in the HAVING clause
  • Works like WHERE, but after aggregation:
```{sql}
#| connection: con
SELECT carrier, AVG(dep_delay) AS avg_delay
FROM flights
GROUP BY carrier
HAVING AVG(dep_delay) > (
        SELECT AVG(dep_delay)
        FROM flights
        )
LIMIT 10;
```
8 records
carrier avg_delay
FL 18.72607
VX 12.86942
B6 13.02252
WN 17.71174
F9 20.21554
EV 19.95539
9E 16.72577
YV 18.99633
Important
  • The subquery must return exactly the type and number of values expected:
    • Table (one or more columns, any number of rows), with an alias, in FROM.
    • Scalar (one value) in SELECT or in a WHERE comparison such as =. If it returns multiple rows/columns, SQL throws an error.
    • Multiple rows, but only one column, for IN or FROM. If you tried to return multiple columns for IN, SQL would error.
  • You cannot just put a subquery anywhere; it must be valid in context.

8.21.1 Subquery Example

Let’s create a query that uses all four types:

  • Get flights delayed more than the average delay (WHERE subquery)
  • Include the airline name (JOIN with a subquery in FROM)
  • Include average delay per carrier (SELECT scalar subquery)
  • Only include carriers whose average delay is above the overall average (HAVING subquery)
Listing 8.1: A query using four types of subqueries
```{sql}
#| connection: con
#| lst-label: lst-combined-subquery
#| lst-cap: A query using four types of subqueries
SELECT f.year, f.month, f.day, f.dep_delay, a.name AS airline_name,
       -- scalar subquery in SELECT: average delay per carrier
       (SELECT AVG(dep_delay) 
        FROM flights f2 
        WHERE f2.carrier = f.carrier) AS avg_delay_per_carrier
FROM (
    -- subquery in FROM: only flights delayed more than the overall average
    SELECT *
    FROM flights
    WHERE dep_delay > (SELECT AVG(dep_delay) FROM flights)
) AS f
JOIN airlines a
  ON f.carrier = a.carrier
GROUP BY f.carrier, f.year, f.month, f.day, f.dep_delay, a.name
-- subquery in HAVING: only carriers with avg delay above overall avg
HAVING AVG(f.dep_delay) > (SELECT AVG(dep_delay) FROM flights)
ORDER BY f.dep_delay DESC
LIMIT 10;
```
Displaying records 1 - 10
year month day dep_delay airline_name avg_delay_per_carrier
2013 1 9 1301 Hawaiian Airlines Inc. 4.900585
2013 6 15 1137 Envoy Air 10.552041
2013 1 10 1126 Envoy Air 10.552041
2013 9 20 1014 American Airlines Inc. 8.586016
2013 7 22 1005 Envoy Air 10.552041
2013 4 10 960 Delta Air Lines Inc. 9.264504
2013 3 17 911 Delta Air Lines Inc. 9.264504
2013 6 27 899 Delta Air Lines Inc. 9.264504
2013 7 22 898 Delta Air Lines Inc. 9.264504
2013 12 5 896 American Airlines Inc. 8.586016

Table 8.4 shows how each use of a subquery contributes to the overall results while meeting the expectations of its usage.

Table 8.4: Four different uses of subqueries

| Feature | Type | Role in Query |
|---------|------|---------------|
| (SELECT AVG(dep_delay) FROM flights) in WHERE | Scalar subquery | Filters only flights delayed more than overall average |
| FROM (SELECT * FROM flights …) AS f | Subquery in FROM | Creates virtual table of delayed flights |
| (SELECT AVG(dep_delay) FROM flights f2 WHERE f2.carrier = f.carrier) in SELECT | Scalar subquery | Computes average delay per carrier |
| HAVING AVG(f.dep_delay) > (SELECT AVG(dep_delay) FROM flights) | Subquery in HAVING | Filters groups (carriers) with above-average delays |

8.22 WITH Defines Common Table Expressions (CTE)

WITH defines Common Table Expressions (CTEs) — named subqueries that make queries easier to read and reuse.

  • Think of a CTE as a temporary view: it exists only for the duration of the statement.
  • WITH must appear before the main statement (SELECT, INSERT, UPDATE, DELETE).
  • You can define multiple CTEs by separating them with commas.
  • The final query must follow immediately after the last CTE.

The syntax for a single CTE is:

WITH cte_name AS (
    -- subquery here
)
SELECT ...
FROM cte_name;

In the case of multiple CTEs, use one WITH and separate the CTEs with commas.

  • Note: order matters, as later CTEs can refer to earlier CTEs.
WITH
cte1 AS (
    -- first CTE
    SELECT ...
),
cte2 AS (
    -- second CTE, can reference cte1
    SELECT *
    FROM cte1
    WHERE ...
),
cte3 AS (
    -- third CTE, can reference cte1 or cte2
    SELECT ...
)
SELECT *
FROM cte3;

Let’s rewrite the query in Listing 8.1 using three CTEs to pull those subqueries out from being nested.

```{sql}
#| connection: con
#| label: lst-cte-subquery
WITH overall_avg AS (
    SELECT AVG(dep_delay) AS avg_dep_delay
    FROM flights
),
flights_above_avg AS (
    SELECT *
    FROM flights, overall_avg
    WHERE dep_delay > overall_avg.avg_dep_delay
),
carrier_avg AS (
    SELECT carrier, AVG(dep_delay) AS avg_delay_per_carrier
    FROM flights
    GROUP BY carrier
)
SELECT f.year, f.month, f.day, f.dep_delay,
       a.name AS airline_name,
       ca.avg_delay_per_carrier
FROM flights_above_avg f
JOIN airlines a
  ON f.carrier = a.carrier
JOIN carrier_avg ca
  ON f.carrier = ca.carrier
JOIN overall_avg oa
  ON TRUE
GROUP BY f.carrier, f.year, f.month, f.day, f.dep_delay, a.name, 
         ca.avg_delay_per_carrier, oa.avg_dep_delay
HAVING AVG(f.dep_delay) > oa.avg_dep_delay
ORDER BY f.dep_delay DESC
LIMIT 10;
```
Displaying records 1 - 10
year month day dep_delay airline_name avg_delay_per_carrier
2013 1 9 1301 Hawaiian Airlines Inc. 4.900585
2013 6 15 1137 Envoy Air 10.552041
2013 1 10 1126 Envoy Air 10.552041
2013 9 20 1014 American Airlines Inc. 8.586016
2013 7 22 1005 Envoy Air 10.552041
2013 4 10 960 Delta Air Lines Inc. 9.264504
2013 3 17 911 Delta Air Lines Inc. 9.264504
2013 6 27 899 Delta Air Lines Inc. 9.264504
2013 7 22 898 Delta Air Lines Inc. 9.264504
2013 12 5 896 American Airlines Inc. 8.586016
  • overall_avg computes the overall average once.
  • flights_above_avg uses that to filter flights.
  • carrier_avg pre-computes each carrier’s average (so no scalar subquery per row).
  • The final query joins them together.

This is considered easier to read, as the CTEs are defined at the top so the later clauses are more compact.

8.23 CREATE VIEW Saves a Query in the Database

CREATE VIEW creates a named version of a query and stores it in the database:

  • CTEs and subqueries exist only while the query runs.
  • A view persists in the database, so you (and others with access) can reuse it in multiple queries and even across different sessions.
  • You may also have access to views created by others.

All views must be named, just like tables.

  • Use SHOW VIEWS; in many SQL flavors, but to see the views in a DuckDB database, use the following:
SELECT table_name
FROM information_schema.tables
WHERE table_type = 'VIEW';
  • As a minimalist database, flights has no views.

Views must have unique names

  • A view is stored in the system catalog (metadata tables) of the database, alongside tables, indexes, functions, etc.
  • When you create a view with CREATE VIEW view_name AS …, the database inserts an entry for view_name into its catalog.
  • Uniqueness is enforced within a schema/namespace:
  • You can’t have two views (or a table and a view) with the same name in the same schema. If you need to redefine an existing view, see the sketch below.
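
If a name is already taken, CREATE VIEW fails. DuckDB, like many engines, supports CREATE OR REPLACE VIEW to swap in a new definition. Here is a minimal sketch; the view name is made up for illustration.

```{sql}
#| connection: con
#| eval: false
-- Replaces the definition if on_time_flights already exists;
-- creates it otherwise. Requires a writable connection.
CREATE OR REPLACE VIEW on_time_flights AS
SELECT *
FROM flights
WHERE dep_delay <= 0;
```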

Users create a view for these reasons:

  • Reusability: If you have a long, complex query (say, joining 5–6 tables, filtering, aggregating), you don’t want to rewrite it every time. Save it once as a view, then just SELECT from the named view just like a table.
  • Consistency: Ensures everyone uses the same definition for a derived dataset (e.g., “active customers” or “on-time flights”).
  • Security: A view can expose only the columns/rows you want others to see, hiding sensitive fields in the base tables (see the sketch after this list).
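
As a sketch of the security point, a view can project just a subset of a base table's columns. The columns below come from the flights table; the view name is made up for illustration.

```{sql}
#| connection: con
#| eval: false
-- Consumers of the view see only these scheduling columns,
-- not the rest of the fields in the base table.
CREATE VIEW flights_public AS
SELECT "year", "month", "day", carrier, flight, origin, dest
FROM flights;
```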

Performance Considerations

  • Regular Views: These are just stored query definitions. When you query the view, the database re-runs the underlying query. So, by default, they are not faster than repeating the query. The main gain is readability and reusability.
  • Materialized Views: Some systems (e.g., PostgreSQL, Oracle) allow you to create materialized views, which store the actual result set and can be indexed. These are precomputed and often faster, but they must be refreshed when the underlying data changes (see the sketch below).
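
DuckDB does not currently support materialized views, but in systems that do, the pattern looks roughly like this PostgreSQL-style sketch; the names are illustrative.

```{sql}
#| eval: false
-- PostgreSQL-style sketch: store the result set once, then
-- refresh it manually after the base table changes.
CREATE MATERIALIZED VIEW delayed_flights_mv AS
SELECT * FROM flights WHERE dep_delay > 60;

REFRESH MATERIALIZED VIEW delayed_flights_mv;
```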

Let’s create a view called delayed_flights with AS acting as a keyword (not an alias) assigning the name to the query definition.

```{sql}
#| connection: con
CREATE VIEW delayed_flights AS
SELECT *
FROM flights
WHERE dep_delay > 60;
```

Check that the view exists.

```{sql}
#| connection: con
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_name = 'delayed_flights';
```
1 records
table_name table_type
delayed_flights VIEW

Use the view with SELECT as if it were a table.

```{sql}
#| connection: con
SELECT flight, month, day, dep_delay
FROM delayed_flights
LIMIT 10;
```
Displaying records 1 - 10
flight month day dep_delay
4576 1 1 101
443 1 1 71
3944 1 1 853
856 1 1 144
1086 1 1 134
4495 1 1 96
4646 1 1 71
673 1 1 77
4869 1 1 70
4497 1 1 115

Note: all are above 60 minutes.

If you no longer need the view in the database, use DROP VIEW.

```{sql}
#| connection: con
DROP VIEW delayed_flights;
```

8.24 Reshaping/Pivoting Data Using SQL

Recall {tidyr} has several verbs for reshaping rectangular data: pivot_wider, pivot_longer, separate_*, and unite.

SQL doesn’t have these verbs natively, but you can reshape using:
  1. CASE WHEN statements
  2. GROUP BY with aggregation
  3. Vendor-specific PIVOT / UNPIVOT (e.g., SQL Server)

  1. Example of “Pivot-wider”-like reshaping in SQL with CASE WHEN

Suppose we want to see total departure delays per carrier for each origin airport. This is similar to a pivot_wider() in R.

```{sql}
#| connection: con
SELECT origin,
       SUM(CASE WHEN carrier = 'AA' THEN dep_delay ELSE 0 END) AS AA_total_delay,
       SUM(CASE WHEN carrier = 'DL' THEN dep_delay ELSE 0 END) AS DL_total_delay,
       SUM(CASE WHEN carrier = 'UA' THEN dep_delay ELSE 0 END) AS UA_total_delay
FROM flights
GROUP BY origin
ORDER BY origin;
```
3 records
origin AA_total_delay DL_total_delay UA_total_delay
EWR 34000 52000 571694
JFK 140542 171672 35471
LGA 101009 218810 94733
  • CASE WHEN carrier = 'AA' THEN dep_delay ELSE 0 END: creates a separate column for each carrier (like pivoting)
  • SUM(): aggregates the delays for each origin × carrier combination
  • GROUP BY origin: ensures aggregation per airport
  2. Example of “Pivot-longer”-like reshaping in SQL with UNION ALL
  • UNION ALL combines the results of two or more SELECT statements into a single result set.
    • It keeps all rows, including duplicates.
    • It stacks the columns into rows, producing a long format table.
    • Each SELECT must have the same number of columns and compatible data types.
  • WITH flights_wide AS (...) creates a temporary wide table with one column per carrier’s total departure delay since we don’t have one in the flights database.
```{sql}
#| connection: con

-- Step 1: Aggregate delays per origin per carrier (wide pivot)
WITH flights_wide AS (
    SELECT origin,
           SUM(CASE WHEN carrier = 'AA' THEN dep_delay ELSE 0 END) AS AA_total_delay,
           SUM(CASE WHEN carrier = 'DL' THEN dep_delay ELSE 0 END) AS DL_total_delay,
           SUM(CASE WHEN carrier = 'UA' THEN dep_delay ELSE 0 END) AS UA_total_delay
    FROM flights
    GROUP BY origin
)
-- Step 2: Convert wide format to long format using UNION ALL
SELECT origin, 'AA' AS carrier, AA_total_delay AS total_delay
FROM flights_wide
UNION ALL
SELECT origin, 'DL' AS carrier, DL_total_delay
FROM flights_wide
UNION ALL
SELECT origin, 'UA' AS carrier, UA_total_delay
FROM flights_wide
ORDER BY origin, carrier;
```
9 records
origin carrier total_delay
EWR AA 34000
EWR DL 52000
EWR UA 571694
JFK AA 140542
JFK DL 171672
JFK UA 35471
LGA AA 101009
LGA DL 218810
LGA UA 94733

If you are interested in reshaping, especially with an intent to aggregate the data, note that the various SQL flavors take different approaches.

As an example, SQL Server has the statements PIVOT and UNPIVOT. See FROM - Using PIVOT and UNPIVOT.
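
DuckDB has its own PIVOT statement as well. As a sketch (check the DuckDB documentation for the exact syntax in your version), the wide result above could be produced with:

```{sql}
#| connection: con
#| eval: false
-- One column per listed carrier, filled with SUM(dep_delay),
-- one row per origin.
PIVOT flights
ON carrier IN ('AA', 'DL', 'UA')
USING SUM(dep_delay)
GROUP BY origin;
```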

8.25 CREATE TABLE Declares New Tables

You can create new tables using SQL.

Let’s create a new, temporary connection we can use to create tables in a new, in-memory back-end database.

tmpcon <- dbConnect(duckdb())

Use CREATE TABLE to declare new tables and INSERT INTO to insert new records into that table.

  • Note the semicolon ; after each SQL statement in the same code chunk.
```{sql}
#| connection: tmpcon
CREATE TABLE table4a (
  "country" VARCHAR,
  "1999"   INTEGER,
  "2000"   INTEGER
);

INSERT INTO table4a ("country", "1999", "2000") 
VALUES
  ('Afghanistan',745,   2666),
  ('Brazil' ,  37737,  80488),
  ('China'  , 212258, 213766);
```

DBI::dbWriteTable() writes an R data frame as a new table in the connected database.

Let’s write multiple data frames from {tidyr} into new tables in our tmpcon database.

data("table1", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table1", value = table1)

data("table2", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table2", value = table2)

data("table3", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table3", value = table3)

data("table4b", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table4b", value = table4b)

data("table5", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table5", value = table5)

Here they are.

```{sql}
#| connection: tmpcon
DESCRIBE;
```
6 records
database schema name column_names column_types temporary
memory main table1 country , year , cases , population VARCHAR, DOUBLE , DOUBLE , DOUBLE FALSE
memory main table2 country, year , type , count VARCHAR, DOUBLE , VARCHAR, DOUBLE FALSE
memory main table3 country, year , rate VARCHAR, DOUBLE , VARCHAR FALSE
memory main table4a country, 1999 , 2000 VARCHAR, INTEGER, INTEGER FALSE
memory main table4b country, 1999 , 2000 VARCHAR, DOUBLE , DOUBLE FALSE
memory main table5 country, century, year , rate VARCHAR, VARCHAR, VARCHAR, VARCHAR FALSE

Close the connection tmpcon.

DBI::dbDisconnect(tmpcon, shutdown = TRUE)

8.26 COPY FROM / TO Imports/Exports Data from/to other Formats and Files

Use COPY TO to write the outputs of a SQL query to a CSV file (using a relative path).

```{sql}
#| eval: false
COPY
(
  -- Your SQL query goes here
)
TO 'myfile.csv' (HEADER, DELIMITER ',');
```

Let’s write some group summaries to a CSV file.

```{sql}
#| connection: con
#| eval: false
COPY
(
  SELECT "origin", AVG("dep_delay")
  FROM flights
  GROUP BY "origin"
)
TO './output/summaries.csv' (HEADER, DELIMITER ',');
```

This is what the resulting file looks like:

origin,avg(dep_delay)
LGA,10.3468756464944
EWR,15.10795435218885
JFK,12.112159099217665
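
COPY ... FROM goes in the other direction and loads a file into an existing table. Here is a minimal sketch, assuming the CSV above exists and the connection is writable; the table name is made up for illustration.

```{sql}
#| connection: con
#| eval: false
-- Declare a table whose columns match the CSV, then load it.
CREATE TABLE summaries (origin VARCHAR, avg_dep_delay DOUBLE);
COPY summaries FROM './output/summaries.csv' (HEADER, DELIMITER ',');
```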

8.27 Calling SQL from R

Use DBI::dbGetQuery() to run SQL code in R and obtain the result as a data frame.

con <- dbConnect(duckdb(dbdir = "./data/flights.duckdb", read_only = TRUE))
planes2 <- DBI::dbGetQuery(conn = con, 
                           statement = "SELECT DISTINCT tailnum, year FROM planes")
glimpse(planes2)
Rows: 3,322
Columns: 2
$ tailnum <chr> "N11121", "N11137", "N11140", "N11189", "N11191", "N11193", "N…
$ year    <int> 2003, 2003, 2003, 2005, 2005, 2005, 2001, 2002, 2000, 2003, 20…

Often, folks have SQL written in a separate file (that ends in .sql).

  • Here is what is in “query.sql”.
-- Just gets some planes
SELECT DISTINCT tailnum, year
FROM planes;

If you want to load into R the results of a SQL query saved in a file (here, query.sql), use readr::read_file() inside DBI::dbGetQuery().

mydf <- DBI::dbGetQuery(conn = con, statement = read_file("./sql/query.sql"))
glimpse(mydf)
Rows: 3,322
Columns: 2
$ tailnum <chr> "N103US", "N104UW", "N10575", "N11109", "N11194", "N11548", "N…
$ year    <int> 1999, 1999, 2002, 2002, 2005, 2002, 2000, 2004, 1987, 1998, 19…

8.28 Closing the Connection

After you are finished working with a database, you should close down your connection:

  • The shutdown = TRUE argument is a DuckDB option that closes the connection and shuts down the entire DuckDB engine.
DBI::dbDisconnect(con, shutdown = TRUE)

8.29 Basic SQL Exercises

Download the file starwars.duckdb and save to a data directory.

  • It’s the data from the {starwarsdb} as a DuckDB database.
  • Or use the following once you are sure there is a data directory:
# GitHub raw link for the starwars database file
url <- "https://github.com/AU-datascience/data/raw/main/413-613/starwars.duckdb"

# Local path to save the file
dest <- here("data", "starwars.duckdb")

# Download the file
download.file(url, dest, mode = "wb")

Use only SQL to answer the following questions.

  1. Open a connection to the database.
Show code
library(DBI)
library(duckdb)
starcon <- dbConnect(duckdb(dbdir = "./data/starwars.duckdb"))
  2. Print out a summary of the tables in this database.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
DESCRIBE;
```
  3. Select the entire people table.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
SELECT * 
  FROM people;
```
  4. Select just the name, height, mass, and species variables from the people table.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
SELECT
    "name",
    "height",
    "mass",
    "species"
FROM
    people;
```
  5. Add to the above query by selecting only the humans and droids.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
SELECT "name", "height", "mass", "species"
FROM people
WHERE "species" = 'Human' OR "species" = 'Droid';
```
  6. Remove the individuals with missing mass data from the above query.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
SELECT
    "name",
    "height",
    "mass",
    "species"
FROM
    people
WHERE
    ("species" = 'Human'
        OR "species" = 'Droid')
    AND "mass" IS NOT NULL;
```
  7. Modify the above query to calculate the average height and mass for humans and droids.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
SELECT "species", AVG("height") AS "height", AVG("mass") AS "mass"
  FROM people
  WHERE ("species" = 'Human' OR "species" = 'Droid') AND "mass" IS NOT NULL
  GROUP BY "species";
```
  8. Make sure Droids are in the first row and Humans are in the second row in the above summary output.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
SELECT
    "species",
    AVG("height") AS "height",
    AVG("mass") AS "mass"
FROM
    people
WHERE
    ("species" = 'Human'
        OR "species" = 'Droid')
    AND "mass" IS NOT NULL
GROUP BY
    "species"
ORDER BY
    "species";
```
  9. Here is the summary of the keys for the database from the {starwarsdb} GitHub page.

E-R diagram for the {starwarsdb} Database.

Select all films with characters whose homeworld is Kamino.

Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true

-- DuckDB allows a query to begin with FROM; SELECT * is implied.
FROM (
    SELECT DISTINCT
        "title"
    FROM
        people,
        films_people
    WHERE
        people."homeworld" = 'Kamino'
        AND people."name" = films_people."character"
) LHS
INNER JOIN films
    ON LHS."title" = films."title";
```
  10. Filter the people table to only contain humans from Tatooine and export the result to a CSV file called “folks.csv”.
Show code
```{sql}
#| connection: starcon
#| eval: false
#| code-fold: true
COPY
(
  SELECT *
  FROM people
  WHERE "homeworld" = 'Tatooine' AND "species" = 'Human'
)
TO './output/folks.csv' (HEADER, DELIMITER ',');
```
  11. Close the SQL connection.
Show code
dbDisconnect(conn = starcon, shutdown = TRUE)

8.30 Integrating SQL with R

Section 8.4 demonstrated using SQL inside an R environment using the {DBI} and {duckdb} packages.

All of the SQL was run in a SQL code chunk in a Quarto document.

  • We manually converted R code chunks to SQL and inserted the connection name.

  • All of the SQL code chunks were interactive and they did not affect the R environment.

Section 8.27 showed how one could run SQL scripts (.sql files) in R and use the DBI::dbGetQuery() function and assign an R name to the result-set of a SQL query.

This section will demonstrate some shortcuts to both creating SQL code chunks and then getting access to the result-sets as R objects.

Let’s create a connection to the flights.duckdb database.

library(tidyverse)
library(DBI)
library(duckdb)
con <- dbConnect(duckdb(dbdir = "./data/flights.duckdb", read_only = TRUE))

8.30.1 Other SQL Code Chunk Options

See Section 8.4.6 for using RStudio to create SQL code chunks quickly.

The R Markdown Definitive Guide section on SQL discusses several SQL chunk options you can use to adjust how SQL is interpreted by {knitr} in RStudio documents.

  • max.print = XX adjusts how many lines of SQL output are printed in the rendered document (the default is 10).
  • tab.cap = "My Caption" replaces the default caption (which reports how many records are displayed) with whatever you want in the rendered document.

These options will work as either R Markdown style options or Quarto style options.

```{sql, connection=con, max.print=5, tab.cap= "R Markdown Options"}
SELECT
    DISTINCT "flight",
    "origin",
    "dest"
FROM
    flights
WHERE
    NOT ("origin" = 'JFK'
        OR "origin" = 'LGA')
    AND dest = 'CMH';
```
R Markdown Options
flight origin dest
4250 EWR CMH
4692 EWR CMH
4622 EWR CMH
3852 EWR CMH
4147 EWR CMH
```{sql}
#| connection: con
#| max.print: 5
#| tab.cap: "Quarto Options"
SELECT
    DISTINCT "flight",
    "origin",
    "dest"
FROM
    flights
WHERE
    NOT ("origin" = 'JFK'
        OR "origin" = 'LGA')
    AND dest = 'CMH';
```
Quarto Options
flight origin dest
3813 EWR CMH
3842 EWR CMH
5906 EWR CMH
4230 EWR CMH
5679 EWR CMH

8.30.2 Creating R Variables from SQL Code

You can use a SQL code chunk option output.var to create an R data frame variable from a SQL result-set.

Let’s create R variables for the list of tables in the database and for the SQL description of those tables.

  • Again, either style chunk option will work.

Notice that, just like assignment in R, the result is stored silently and not shown in the output.

  • You do see the new variables in the Global Environment.
```{sql, connection=con, output.var="flights_tables_rm"}
SHOW TABLES
```
```{sql}
#| connection: con
#| output.var: "flights_tables_qo"
SHOW TABLES
```
```{sql connection=con, output.var="flights_desc_rm"}
DESCRIBE
```
```{sql}
#| connection: con
#| output.var: "flights_desc_qo"
DESCRIBE
```
```{r}
flights_desc_rm
class(flights_desc_rm)
```
  database schema     name
1  flights   main airlines
2  flights   main airports
3  flights   main  flights
4  flights   main   planes
5  flights   main  weather
                                                                                                                                                                     column_names
1                                                                                                                                                                   carrier, name
2                                                                                                                                        faa, name, lat, lon, alt, tz, dst, tzone
3 year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier, flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour
4                                                                                                         tailnum, year, type, manufacturer, model, engines, seats, speed, engine
5                                                          origin, year, month, day, hour, temp, dewp, humid, wind_dir, wind_speed, wind_gust, precip, pressure, visib, time_hour
                                                                                                                                                           column_types
1                                                                                                                                                      VARCHAR, VARCHAR
2                                                                                                    VARCHAR, VARCHAR, DOUBLE, DOUBLE, DOUBLE, DOUBLE, VARCHAR, VARCHAR
3 INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, DOUBLE, INTEGER, INTEGER, DOUBLE, VARCHAR, INTEGER, VARCHAR, VARCHAR, VARCHAR, DOUBLE, DOUBLE, DOUBLE, DOUBLE, TIMESTAMP
4                                                                                       VARCHAR, INTEGER, VARCHAR, VARCHAR, VARCHAR, INTEGER, INTEGER, INTEGER, VARCHAR
5                                        VARCHAR, INTEGER, INTEGER, INTEGER, INTEGER, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, TIMESTAMP
  temporary
1     FALSE
2     FALSE
3     FALSE
4     FALSE
5     FALSE
[1] "data.frame"
```{r}
glimpse(flights_desc_qo)
class(flights_desc_qo)
```
Rows: 5
Columns: 6
$ database     <chr> "flights", "flights", "flights", "flights", "flights"
$ schema       <chr> "main", "main", "main", "main", "main"
$ name         <chr> "airlines", "airports", "flights", "planes", "weather"
$ column_names <list> <"carrier", "name">, <"faa", "name", "lat", "lon", "alt",…
$ column_types <list> <"VARCHAR", "VARCHAR">, <"VARCHAR", "VARCHAR", "DOUBLE", …
$ temporary    <lgl> FALSE, FALSE, FALSE, FALSE, FALSE
[1] "data.frame"

Let’s use SQL to get the departure delay for all flights from JFK and LaGuardia to Chicago and then use {ggplot2} to plot.

```{sql}
#| connection: con
#| output.var: "select_flights_df"
SELECT
    DISTINCT "flight",
    "origin",
    "dest",
    "dep_delay"
FROM
    flights
WHERE
     ("origin" = 'JFK'
        OR "origin" = 'LGA')
    AND dest = 'ORD';
```

Check our data.

glimpse(select_flights_df)
Rows: 3,862
Columns: 4
$ flight    <int> 1477, 327, 331, 343, 317, 329, 353, 317, 329, 331, 343, 353,…
$ origin    <chr> "LGA", "LGA", "LGA", "LGA", "LGA", "LGA", "LGA", "LGA", "LGA…
$ dest      <chr> "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD…
$ dep_delay <dbl> 178, 181, 190, 97, NA, NA, NA, -3, -4, 3, 5, 66, -4, 125, 29…

Plot.

```{r}
#| warning: false
select_flights_df |> 
  ggplot(aes(x = origin, y = dep_delay)) +
  geom_boxplot(notch = TRUE) +
  scale_y_log10() +
  labs(title = "Departure Delays for Flights to ORD",
       subtitle = "Log Scale",
       caption = "Data from {nycflights13}.")
```

8.30.3 Using R Variables in SQL Code

You can also use an existing R variable as a value in a SQL query clause.

Create the R variable and reference it in the SQL code by preceding its name with a ?.

```{r}
my_dest <- 'MCO'
```
```{sql}
#| connection: con
#| output.var: "select_flights_df2"
SELECT
    DISTINCT "flight",
    "origin",
    "dest",
    "dep_delay"
FROM
    flights
WHERE
     ("origin" = 'JFK'
        OR "origin" = 'LGA')
    AND dest = ?my_dest;
```
glimpse(select_flights_df2)
Rows: 2,638
Columns: 4
$ flight    <int> 1183, 1099, 2181, 299, 2159, 683, 483, 1883, 883, 1815, 1885…
$ origin    <chr> "JFK", "LGA", "LGA", "LGA", "JFK", "JFK", "JFK", "JFK", "JFK…
$ dest      <chr> "MCO", "MCO", "MCO", "MCO", "MCO", "MCO", "MCO", "MCO", "MCO…
$ dep_delay <dbl> 16, 30, -6, 10, -5, -5, -4, -3, 11, 11, -4, -8, 1, 5, 30, -5…
Warning

Using parameters in SQL queries is common practice.

Using R to manipulate text to create queries as character strings is also common.

However, if you are building something, e.g., a Shiny app, where you are allowing others to input text that will feed into a query, you have to protect your queries.

Otherwise, you are liable to be the victim of a SQL Injection Attack, which can have bad consequences.

For R-specific approaches to mitigate the risk, see Run Queries Safely Posit (2023c) from the Posit Best Practices in working with Databases. Posit (2023b)
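
At the SQL level, a prepared statement is one way to keep user input out of the query text. Here is a minimal DuckDB-flavored sketch; the statement name is made up for illustration.

```{sql}
#| connection: con
#| eval: false
-- The ? placeholder is bound at execution time, so the input is
-- treated as a value, never parsed as SQL code.
PREPARE flights_to_dest AS
SELECT flight, origin, dest, dep_delay
FROM flights
WHERE dest = ?;

EXECUTE flights_to_dest('MCO');
```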

8.30.4 Setting the Connection Once in a Document

If you are going to use the same connection for all (or most) chunks in a document, you can set the connection as an option in a setup chunk.

Let’s close the current con.

DBI::dbDisconnect(con, shutdown = TRUE)

Now create a new connection.

```{r}
#| label: lst-setup
#| message: false
library(DBI)
library(duckdb)
con2 <- dbConnect(duckdb(dbdir = "./data/flights.duckdb", read_only = TRUE))
knitr::opts_chunk$set(connection = 'con2')
```

Now run a SQL code chunk without stating the connection.

```{sql}
#| label: lst-test-setup
#| eval: false
SHOW TABLES
```

8.30.5 Close the Connection

To avoid a warning when rendering the document, close the connection at the end.

DBI::dbDisconnect(con2, shutdown = TRUE)

8.31 Using the {dbplyr} Package to Translate {dplyr} into SQL

The {dbplyr} package lets you write {dplyr} code and translates it into SQL to interact (through an established connection) with a Back-end server.

  • Consider R as a higher-level language you use to access code written in C.
  • The {shiny} package can be considered as providing a higher-level language to generate HTML from {shiny} code.
  • You can consider {dbplyr} as a higher-level package that generates SQL from {dplyr} code.
Important

{dbplyr} works for many flavors of SQL and can generate many SQL expressions.

However, it cannot always translate everything. For more details see the {dbplyr} vignette on Function Translation.

Even if you can’t use {dbplyr} in your situation, you can still use it to generate SQL code you can see and use. That makes it a good tool for learning SQL since you know {dplyr}.

Let’s load {dbplyr} and {tidyverse} along with {DBI} and {duckdb}.

library(tidyverse)
library(dbplyr)
library(DBI)
library(duckdb)

Create the connection to the database.

con <- dbConnect(duckdb(dbdir = "./data/flights.duckdb", read_only = TRUE))

Retrieve a table from a SQL database using dplyr::tbl().

  • Note you have to specify the source of the data, here the database connection, and the table name.
flights2 <- tbl(src = con, "flights")

class(flights2)
[1] "tbl_duckdb_connection" "tbl_dbi"               "tbl_sql"              
[4] "tbl_lazy"              "tbl"                  
names(flights2)
[1] "src"        "lazy_query"
  • Note that flights2 is not a data frame. It’s a list with class tbl_duckdb_connection.

Now you can use your tidyverse code on the flights table to generate a query.

flights2 |>
  select(flight, origin, dest, dep_delay) |>
  filter(origin == "JFK", dest == "CMH") |>
  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) ->
  my_query

my_query
# Source:   SQL [?? x 1]
# Database: DuckDB 1.4.0 [root@Darwin 24.6.0:R 4.5.1//Users/rressler/Courses/DATA-413-613/lectures_book/data/flights.duckdb]
  dep_delay
      <dbl>
1      22.0
class(my_query)
[1] "tbl_duckdb_connection" "tbl_dbi"               "tbl_sql"              
[4] "tbl_lazy"              "tbl"                  

Execute the query by using collect() at the end.

flights2 |>
  select(flight, origin, dest, dep_delay) |>
  filter(origin == "JFK", dest == "CMH") |>
  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  collect() ->
  my_result

my_result
# A tibble: 1 × 1
  dep_delay
      <dbl>
1      22.0
class(my_result)
[1] "tbl_df"     "tbl"        "data.frame"

To see the SQL code that was generated, use show_query() at the end (this is a good way to learn SQL).

flights2 |>
  select(flight, origin, dest, dep_delay) |>
  filter(origin == "JFK", dest == "CMH") |>
  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) |>
  show_query()
<SQL>
SELECT AVG(dep_delay) AS dep_delay
FROM (
  SELECT flight, origin, dest, dep_delay
  FROM flights
  WHERE (origin = 'JFK') AND (dest = 'CMH')
) q01

After you are done with the database, close the connection.

DBI::dbDisconnect(con, shutdown = TRUE)

8.32 The DBeaver Integrated Development Environment (IDE)

  • This section is based on work by Professor David Gerard.

There is a free, separate Front-end for the DuckDB database called “DBeaver Community”.

Per DBeaver community:

DBeaver Community is a free cross-platform database tool for developers, database administrators, analysts, and everyone working with data. It supports all popular SQL databases like MySQL, MariaDB, PostgreSQL, SQLite, Apache Family, and more.

8.32.1 Setup DBeaver Community

The instructions here are only for DuckDB, but DBeaver works with other DBMSs pretty easily.

Download and install DBeaver Community here: https://dbeaver.io/download/.

When you open it up, select Database Navigator and under All or SQL, select “DuckDB”.

Then select Next.

Select “Host”, and Open to navigate to your flights.duckdb database.

Click on Test Connection.

If successful, you will get a pop-up window with the results of the test, which will also let you see the details of the connection.

If not successful, and you haven’t used it before, it may ask you to install a JDBC-compliant driver to use to connect to a DuckDB database.

  • Click Download Driver.
  • After the download, it will run the test and you should get a positive result.

Click “Finish”.

You should now see the database under the Database Navigator.

Right click on the connection and create a new SQL script file.

Type some SQL and hit the execute button or just hit Control + Enter.

You can now play around with SQL.

When you want to save the data to a CSV file, just hit “Export data”.

  • There are a lot of features in DBeaver, but we won’t cover them. They are pretty intuitive. Just play with it for a while.

See the DBeaver General User Guide for detailed documentation.

8.32.2 Exercises

Use DBeaver for all of these exercises.

  1. What is the mean temperature for flights from each airport?
Show code
    SELECT "origin", AVG("temp") AS "temp"
    FROM weather 
    GROUP BY "origin";
  2. What is the average number of flights from each airport per day in January?
Show code
    SELECT "origin", COUNT() / 31.0 AS "ave"
    FROM flights 
    WHERE "month" = 1
    GROUP BY "origin";
  3. What are the top destinations for each airport?
Show code
    SELECT "origin", "dest", "n"
    FROM
    (
    SELECT "origin", "dest", "n", MAX("n") OVER (PARTITION BY "origin") AS "nmax"
    FROM
    (
    SELECT "origin", "dest", COUNT() AS "n",
    FROM flights
    GROUP BY "origin", "dest"
    )
    )
    WHERE "n" = "nmax"

8.33 Considerations for Working with SQL and Databases

8.33.1 Review Documentation about the Database

When working with a new database, or one that is often changing, seek out and review the available documentation.

Data Dictionaries, schema descriptions, and/or E-R diagrams can help you understand the database design, the table descriptions and structure, the fields in each table, the primary keys, and whether tables are indexed.

The documentation can help you see a holistic picture of the data so you know where to look for tables and fields of interest and how to join tables most efficiently.

  • As an example, the database may have a naming convention for tables and fields to help you find the tables and fields you need to create your queries.

8.33.2 Review Documentation about the DBMS and its SQL

Every DBMS is slightly different, as developers optimize their code for the goals of the database.

Review documentation on the “flavor” of SQL accepted by the database.

Many flavors have extra features (statements, clauses, keywords, or modifiers) you can use to simplify your code or speed up your queries.

8.33.3 Choose a Front-end IDE that works with the DBMS

SQL is just text so you can write SQL in many environments.

If you are doing a lot of SQL, find a Front-end IDE that has features to facilitate using SQL.

  • As an example, DBeaver has auto-complete for SQL statements while RStudio does not.

Many Front-ends have customized drivers for working with many different Back-end DBMSs.

  • So choose a Front-end that allows you to work with the different databases in your organization without having to switch.
  • Read the documentation for the Front-end to use it effectively.

8.33.4 Follow a Style Guide

Like tidyverse style, there are style guides to help you make your code easier to read, faster to debug, and more robust to future changes.

If you do nothing else from a style perspective, use line breaks often to break up your code into readable lines where each starts on a statement, clause, or keyword.

Both of the following code chunks are valid SQL.

Streaming SQL example.

```{sql}
#| eval: false
#| connection: con
SELECT "flight", "origin", CASE WHEN ("origin" = 'JFK') THEN 'John F. Kennedy' WHEN ("origin" = 'LGA') THEN 'LaGaurdia' WHEN ("origin" = 'EWR') THEN 'Newark Liberty' ELSE 'WHO KNOWS' END AS "orig_long" FROM flights LIMIT 10;
```

Stylized tidy SQL.

```{sql}
#| label: tidy-sql
#| eval: false
#| connection: con
SELECT
    "flight",
    "origin",
    CASE
        WHEN ("origin" = 'JFK') THEN 'John F. Kennedy'
        WHEN ("origin" = 'LGA') THEN 'LaGaurdia'
        WHEN ("origin" = 'EWR') THEN 'Newark Liberty'
        ELSE 'WHO KNOWS'
    END AS "orig_long"
FROM
    flights
LIMIT 10;
```

Which do you think is easier to read, debug, and maintain?

  • The second chunk was generated by the DBeaver Format SQL command under the Edit menu.

There is no single authoritative style guide so check if your organization has its own.

If it does not, consider the following.

8.33.5 Build-a-little-test-a-little

Writing and debugging SQL queries should proceed in steps just like building and debugging R code and functions.

Don’t write a 30 line query and then start to debug it.

Write short queries and use LIMIT to keep the returns fast.

Write multiple small queries and then integrate them piece by piece.

Consider using sub-queries and Common Table Expressions to combine them as well.

8.33.6 Follow Best Practices for Writing Queries

There are many sources for Best Practices in writing SQL queries to optimize speed.

Common threads appear across best practices:

  1. Shape the query to return only the data you need.
  2. Shape the query to take advantage of the DBMS ability to use indexes for faster performance.
  3. Make your queries easy to read.

The most commonly cited best practice is “Avoid the use of SELECT *”.

Other suggestions include:

  • Use Joins instead of sub-queries inside a WHERE when you are just matching values (see the sketch after this list).
  • Avoid using LIKE with a wildcard, %, at the beginning of a string, as this means the DBMS has to search every record.
  • Use GROUP BY or add additional fields instead of using SELECT DISTINCT on large tables.
  • Use comments -- to explain why you are doing something.
  • Use AS to create aliases to clarify fields from different tables for readability.
  • Avoid sorting, especially for large tables.
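
As a sketch of the joins-versus-sub-queries suggestion (using the flights and planes tables), both queries below return the same matches, but many optimizers plan the join form more efficiently:

```{sql}
#| connection: con
#| eval: false
-- Sub-query version: match values inside WHERE.
SELECT flight, tailnum
FROM flights
WHERE tailnum IN (SELECT tailnum FROM planes WHERE "year" = 1999);

-- Join version: typically easier for the optimizer to plan.
SELECT f.flight, f.tailnum
FROM flights f
INNER JOIN planes p
  ON f.tailnum = p.tailnum
WHERE p."year" = 1999;
```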

For additional recommendations and rationale, consider the following as a starting point.

8.33.7 Consider Tuning Slow Queries

Queries go through three stages when being executed.

  • Syntax Check: The DBMS parser checks each query is syntactically correct.
  • Optimization: The DBMS optimizer examines the query and builds a plan for execution to minimize resource demand and time.
  • Execution: The DBMS runs the optimized plan to return any results.

Thus every query gets optimized, but that does not mean the query could not be tuned to run faster.

If you have a query that is run many times, on large tables, and appears slow, then consider tuning it.

The first step is detecting a slow query.

The second step is to assess how the query is actually being run.

  • DBMSs also have a way to get the results of the query planner so you can review them and see what is actually happening (see the sketch after this list).
  • Look at the help for EXPLAIN.
  • For DBeaver, review the PostgreSQL Documentation on Using EXPLAIN Group (2023)
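
Here is a minimal sketch of asking DuckDB for a plan; the output format varies by engine and version.

```{sql}
#| connection: con
#| eval: false
-- Prefix a query with EXPLAIN to see the planned operators
-- instead of the result set.
EXPLAIN
SELECT origin, AVG(dep_delay)
FROM flights
GROUP BY origin;
```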

Once you understand the execution, then consider if any of the best practices for writing queries will help improve performance.

For more explanation and ideas see SQL Performance Tuning. khushboogoyal499 (2022)

8.33.8 Integrate Your SQL Into Your Version Control Workflow

Regardless of the IDE you use for a Front-end, SQL is still text-based code.

You can put the files into a Git Repo and version control with GitHub.

  • If you are using DBeaver, use the search tool (magnifying glass) and look for “Install new software.”
  • Use the drop down arrow to select the Git option as in Figure 8.5.
Figure 8.5: Installing the DBeaver Extension for Git

For more details see How to Version Control Your SQL.

8.33.9 Consider More Advanced SQL Statements and Methods

These notes have just scratched the surface of a language that has been around for 50 years and is still going strong.

There are many more statements, clauses, and keywords you can use to create more effective SQL to access the data you want, especially for large databases that will be accessed many times.

Some of these are:

  • Views, Sub-queries, and Common Table Expressions
  • Additional clauses and modifiers for WHERE, such as {stringr}-style string functions.
  • Dynamic SQL, where you can pass parameters to a stored procedure akin to an R function (see the sketch below).
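
As a hedged sketch of the dynamic SQL item, here is T-SQL (SQL Server) syntax; a flights table is hypothetical in that context.

```{sql}
#| eval: false
-- T-SQL sketch: build the query as a string, then execute it
-- with a typed parameter via sp_executesql.
DECLARE @sql NVARCHAR(200) =
    N'SELECT flight, origin FROM flights WHERE origin = @o';
EXEC sp_executesql @sql, N'@o CHAR(3)', @o = 'JFK';
```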

There are many free references on SQL to expand your knowledge.

R for Data Science suggests two non-free books:

8.33.10 Strategize How to Use SQL in Your Analysis Workflow

From a data science perspective SQL is usually a means for getting data out of databases.

  • Data scientists are usually less involved in designing, building, and managing databases or maintaining the data.

Your working environment will often dictate which databases have the data of interest and the associated versions of SQL you must use.

There are SQL wizards who can make SQL dance by building and optimizing queries to the nth degree.

You do not have to be one!

As long as you can get the data you need, you can use the other tools in R and Python to make your analysis sing and dance.

If you have the time and hardware memory to get the data, you can clean and shape it outside of SQL if that works for you.

SQL is an important tool in your Data Science tool box for sure. But it works best when you use it alongside the other tools in that box.

Tip

If your primary analysis tools are R and RStudio, but you need to get subsets of data from a large database using SQL, consider using a SQL-focused IDE, such as DBeaver, to build your queries and then copying them into a .sql file or an SQL code chunk.

This way you get the advantage of using a SQL-focused IDE while using RStudio for the rest of the analysis workflow.