8 Getting Data from SQL Databases
database, relational, sql, duckdb, dbeaver, duckplyr
The code chunks for SQL and R have different colors in this chapter.
- Code chunks for SQL use this color.
- Code chunks for R use this color.
8.1 Introduction
8.1.1 Learning Outcomes
- Understand the organization of relational databases.
- Use Structured Query Language (SQL) to access and manipulate data in SQL databases.
- Use the {dbplyr} package to access and manipulate data in SQL databases.
8.1.2 References
- R for Data Science 2nd edition Chapter 25 Wickham, Cetinkaya-Rundel, and Grolemund (2023)
- {DBI} package R Special Interest Group on Databases (R-SIG-DB), Wickham, and Muller (2022)
- {odbc} package Hester, Wickham, and Gjoneski (2023)
- {duckdb} package Mühleisen et al. (2023)
- DuckDB Database Management System DuckDB (2023c)
- DuckDB Documentation
- W3 Schools SQL Schools (2023)
- DBeaver SQL IDE DuckDB (2023a)
- DBeaver General User Guide Team (2003)
- R Markdown: The Definitive Guide Xie, Allaire, and Grolemund (2023) Section 2.7.3 on SQL
- {dbplyr} package Wickham, Girlich, and Ruiz (2023)
- Writing SQL with dbplyr Posit (2023a)
- {starwarsdb} package Aden-Buie (2020)
- Parquet File Format Documentation
8.1.2.1 Other References
- Posit Solutions Best Practices in Working with Databases Posit (2023b)
- MariaDB Tutorial Database Theory Foundation (2023b)
- Customizing Keyboard Shortcuts in the RStudio IDE Ushey (2023)
- Code Snippets in the RStudio IDE Allaire (2023)
- SQL Injection kingthorin and Foundation (2023)
- How to Version Control your SQL Pullum (Swartz) (2020)
- Mermaid Diagramming and Charting Tool mermaid.js (2023)
- DuckDB Installation DuckDB (2023b)
- SQL Style Guide Holywell (2023)
- SQL Style Guide for GitLab Mestel and Bright (2023)
- Basic SQL Debugging Foundation (2023a)
- Best Practices for Writing SQL Queries Metabase (2023)
- 7 Bad Practices to Avoid When Writing SQL Queries for Better Performance Allam (2023)
- 3 Ways to Detect Slow Queries in PostgreSQL Schönig (2018)
- PostgreSQL Documentation on Using EXPLAIN Group (2023)
- SQL Performance Tuning khushboogoyal499 (2022)
8.2 A Brief History of Databases
People and organizations have been collecting data for centuries. As early as the 21st century BCE, people were using clay tablets to make records of farmland and grain production. Records were kept as individual files (clay tablets) that were eventually organized based on a conceptual model of the data. That model was typically a hierarchical set of fixed attributes of the records, e.g., subject, date, author, etc., for tracking records in a filing system.
It wasn’t until the 1960s that computerized “databases” began to appear. Early computerized databases focused on storing and retrieving individual records, organized according to existing data models built from the attributes of the tables and records. Data were stored in tables with rows of values and columns for attributes. Users could find the desired table and row by navigating through the hierarchy. This model resulted in larger and larger tables with many redundant fields as users tried to integrate data at the table level. As an example, a record of a purchase could include all of the details of the purchase, including all of the data on the buyer and seller such as name, address, and telephone number. Every record of every purchase by a buyer would then repeat their name, address, and telephone number.
Database Management Systems (DBMS) were developed as the interface between the system user and the raw data in the records. The DBMS used the data model to manage the physical, digital storage of the data and provide the navigation to find the right table and record. Users could now retrieve and update records electronically. However, computing and storage were expensive, so researchers looked for ways to reduce the redundant storage and inflexible perspective of the data inherent in the hierarchical data model.
In 1970, IBM researcher E.F. Codd published what became a landmark paper, A relational model of data for large shared data banks. Codd (1970) proposed replacing the fixed hierarchy and its large tables with much smaller tables where the data model focused on defining relationships among the tables instead of navigating across a hierarchy of attributes or parent-child relationships among tables. The relational database model is now ubiquitous for storing extremely large data sets.
A strength and a drawback of the relational database model is the capability to manage not just the data, but all the relationships across the tables of data. This makes relational databases great for working with structured data, where the relationships are known ahead of time and built into the database model. The drawback is it takes time and processing power to enforce those relationships and structure. Thus, other approaches such as Apache HADOOP have been developed to capture and manage the variety, velocity, and volume of data associated with “Big Data” sources, e.g., capturing all the meta data and data associated with billions of transactions a day on a social media site.
For the relational database model to work, it needed a corresponding high-level language for implementation. Users had to be able to create databases and manage the life cycle of their databases and the data they contained. This need led to the development of what were known as “query languages,” which eventually evolved into what is now called Structured Query Language or SQL (pronounced “SEE-quil” or “ESS-q-l”). SQL began with a row-based approach to managing data in tables. Other approaches, e.g., NoSQL, developed a column-based approach to managing data. There are trade-offs between the approaches. Many relational database developers create their own (proprietary) versions of SQL, optimized for their database, that incorporate features of both for speed and distributed processing. Apache has developed Apache Hive as a data warehouse system that sits atop HADOOP. “Hive allows users to read, write, and manage petabytes of data using SQL.”
Relational Databases and HADOOP/Hive-like approaches are not competitors as much as different solutions for different requirements.
This section will focus on the relational database methods and systems which use SQL.
8.3 Database Concepts and Terms
8.3.1 Database Tables
Relational databases use Relational DBMS (RDBMS) to keep track of large amounts of structured data in a single system.
A relational database uses rectangular tables, also called entities, to organize its data into rows and columns.
Each table is typically designed to contain data on a single type or class of “entity”.
An entity can be a physical object such as a person, building, location, or sales order, or a concept such as this class or metadata on an object.
Each entity must have properties or attributes of interest.
- These become the names for the columns in the table.
- These are also known as the fields for the tables.
Each row in the table is also referred to as a record.
The intersection of row and column contains the value, the data, for that record for that field.
All of the values within a given field should be of the same data type. These include the basic types in R such as character, integer, double, or logical as well as complex objects such as pictures, videos, or other Binary Large Objects.
The tables are similar to R or Python data frames, but with some key differences.
- Database tables are stored on a drive so are not constrained, like R data frames, to having to fit into memory. They can have many columns and many millions of rows, as long as they fit on the drive.
- Unlike a data frame, a database field (column) may have a domain which limits the format and range of values in the field in addition to constraining the data type, e.g., the format for a telephone number may be `xx-xxx-xxx-xxxx`.
- Database tables are often designed so each record is unique (no duplicate rows) with rules the DBMS uses to enforce the uniqueness.
- Database tables are almost always indexed so each row has a unique value associated with it (behind the scenes).
- The index identifies each record from a physical perspective, much like one’s operating system indexes files for faster search and retrieval.
- The index is what allows a DBMS to find and retrieve records (rows) quickly.
8.3.2 Database Table Keys and Relationships
Most database tables will also be designed with a primary key, which is a combination of one or more fields that uniquely identifies each record, from a logical perspective, across the database.
- As an example, a key may be a person’s social security number which is expected to be unique to one person.
A combined key uses multiple fields to create a key, such as house number, street name, city, state, and zip code.
Many database designers prefer to create a separate surrogate key, that does not use other values and has no meaning, to avoid potential changes in data over time.
The presence of primary keys for each table enables “relationships” between tables to be created and used by the RDBMS.
As an example, suppose an organization is interested in tracking data about purchase orders, including the buyer and the products in each purchase order.
A database designer might create three tables to hold this information:
- An `ORDER` table with fields for the order number, the date of the purchase, the buyer_id, the product_id, the quantity purchased, and the purchase price. The primary key is `order_number`.
- A `BUYER` table with fields for the buyer_id, name, company_id, street address, state, postal code, country, telephone number, and email. The primary key is `buyer_id`.
- A `PRODUCT` table with fields for the product_id, name, weight, volume, purchase sizes, default_shipper_id, and standard price. The primary key is `product_id`.
Each table describes a single object or entity and each has a primary key to uniquely identify each record.
The `ORDER` table uses the primary keys from the `BUYER` table and the `PRODUCT` table to make a relationship with those tables.
The presence of a table’s primary key in a different table is called being a foreign key in the second table.
- The table with the primary key is known as a Parent table.
- The table that references the parent table (using the foreign key) is known as the child table.
- RDBMS use rules (constraints) to preserve data integrity so that someone editing or changing a child table cannot change or use a foreign key field that is not present in the parent table.
- In this example, there is a foreign key in the `BUYER` table, `company_id`, and a foreign key in the `PRODUCT` table, `default_shipper_id`.
If you were to put all this data into a single table, the attributes of the buyers and products would be duplicated many times over as additional orders were created.
This example illustrates how relationships allow database designers to create multiple tables to minimize the amount of redundant data in a system.
8.3.3 ER Diagrams
Tables and their relationships are usually described in the database schema.
They can be depicted using class or entity-relationship diagrams to show the relationships between tables, fields and keys.
- E-R diagrams can include additional information such as constraints and type of relationship, e.g., one-to-one, or one-to-many.
E-R diagrams can serve as a useful reference for understanding the structure of a large database and designing queries.
Figure 8.1 is an example of a simplified ER diagram (created in Quarto using Mermaid) showing the relationships between the tables and the associated primary and foreign keys.

Figure 8.2 shows a different style of E-R diagram commonly used for database schema. The arrows link individual fields as primary and foreign keys for the tables and show the type of the relationships.

8.3.4 Bringing Tables Together
Now that we can see the database schema, we can find where the data we want is located.
Suppose we want to answer the question: how many orders were placed in the last month? We can do that with just the data in the `ORDERS` table.
However, suppose we want to answer the question: how many orders were placed by buyers in the states of Virginia and Maryland last month? We would have to get data from both the `ORDERS` table and the `BUYERS` table.
We get data from multiple tables by temporarily combining tables using a SQL query that creates a JOIN between two or more tables.
Database joins are the same joins you can execute using {dplyr} verbs, but the SQL syntax is slightly different.
If you are using the same join over and over in your analysis, SQL allows you to create a database “View” and save it. Then, in the future, you can query the “view” instead of having to rejoin the tables.
- This is analogous to writing an R function to create a data frame based on joins and querying the results of the function instead of rewriting the code to join the data frames.
8.3.5 Database Design
Designing databases is its own area of expertise.
A poorly designed database will be slow and have insufficient rules (safeguards) to ensure data integrity.
We are used to operating as single users of our data since it is all in the memory of our machine and we can do what we want.
- A database may have thousands of users at the same time trying to update or retrieve records.
- Only one user can update a record at a time so databases use Record Locking to prevent others from updating the record or even retrieving it while someone else is updating the record.
- RDBMS can enforce locking at higher levels such as the table or even the entire database which can delay user queries. See All about locking in SQL Server for more details.
Database designers make trade offs when designing tables to balance competing desires.
- Storage Efficiency: limit each table to a single entity to minimize the amount of redundant data. This can generate lots of tables.
- Maintenance Efficiency: The more tables you have, the fewer users might be trying to update the table at a time.
- Query Speed: Minimize the number of joins needed to make a query. This leads to fewer tables.
These trade offs were recognized by Codd in his original paper.
- Formal descriptions of the ways to design databases are known as Normal forms.
- The process for determining what form is best for an application is known as Database Normalization. For more info see The Trade-offs Between Database Normalization and Denormalization. Explorer (2023)
Thus database designers may create tables that have redundant data if the most common queries join fields from those tables.
They will use database performance testing to optimize the queries and the database design to get the optimal cost performance trade offs. They can use other approaches to support data integrity in addition to database normalization.
Don’t be surprised to see E-R diagrams that have redundant data in them.
8.3.6 Relational Database Management Systems
8.3.6.1 Database Front and Back Ends
Databases have two main components.
- The Front-end user interface where users create code to generate queries and receive results.
- The Back-end server where the data is stored and the DBMS executes the queries to update or retrieve the data.
The Front-end and Back-end communicate through a Connection that uses an API for handling queries and data.
Many RDBMS bundle a Front-end and Back-end together. Examples include Oracle and SAP; DuckDB has DBeaver.
Most databases allow you to use a Front-end of your choice as long as you have a server driver you can use to configure and open a connection to the Back-end server of the database.
- RStudio can operate as a Front-end to RDBMS for which you have installed a server.
- Excel can operate as a Front-end to a Microsoft SQL Server database.
- DBeaver allows you to make a connection to many different DBMS servers.
- MariaDB identifies over 30 Front-ends that can work with it, including DuckDB’s DBeaver SQL IDE.
There are multiple solutions. We will use both RStudio and DBeaver.
8.3.6.2 Database Connections: ODBC and the {DBI} Package
The reason there is so much flexibility with connecting Front and Back ends is the standardization of the Microsoft Open Data Base Connectivity (ODBC) Standard. David-Engel (2023)
The ODBC standard was designed for “maximum interoperability”.
The construct is similar to the HTTP interface for web browsers. The DBMS developer creates a driver that serves as an API for their Back-end DBMS that is ODBC-compliant. Then the Front-end developer creates their software to also be ODBC-compliant.
That way, the Front-end can create a connection to and interact with any DBMS back end for which there is an ODBC-compliant driver.
The R {DBI} package (for Data Base Interface) supports the connection between R and over 25 different DBMS Back-ends.
- Popular Back-end packages include: {duckdb}, {RMariaDB}, {RPostgres}, {RSQLite}, and {bigrquery}.
- These packages include the ODBC-compliant driver for the given database.
If no one has written a package but an ODBC-compliant driver exists, the {DBI} package works with the {odbc} package and the driver to make the connection to the Back-end server.
8.3.6.3 Database Configuration Types
RDBMS can also be categorized based on their configuration or how front and back ends are physically connected and how the data is stored.
- In-process databases run on your computer. The data is stored locally and accessible without a network connection.
- They are optimized for the use case of a single user working with large amounts of data.
- Client-Server databases use a small front-end client on your computer (perhaps an IDE or in a browser) to connect to a robust server over a network. Data can be stored across a set of servers.
- They are optimized for the use case where multiple users can query the database at once and the RDBMS resolves contention.
- Cloud-databases also separate the front and back ends over a network but now the data may be distributed across multiple locations in the cloud.
- These are optimized for the use case of multiple users accessing a wide-variety of very large data and enabling distributed computing and parallel computing.
With the proliferation of cloud computing and software as a service (SaaS), you can expect to see more and more hybrids of client-server and cloud databases.
As the amount of collected structured and unstructured data scales to Apache solutions for “Big Data”, the fundamental concepts of exploiting relationships in data using some version of SQL will continue to apply for the foreseeable future.
8.4 SQL Basics
- Portions of this section are based on work by Professor David Gerard.
We will use RStudio as the Front-end, DuckDB as the Back-end, and the {DBI} package to support the connection between the front and back ends.
8.4.1 Connecting to a Database
Before you can use SQL to interact with a database, you must have a way to connect the Front-end user interface to the Back-end server with the data.
- When working in R, the {DBI} package can create the connection and allow you to assign a name to it.
- Your database will be created by a specific DBMS which provides the Back-end server.
- If you are not using the Front-End that comes with the DBMS, you need an ODBC-compliant driver for the DBMS to make the connection.
We will use the DuckDB database and the {duckdb} Package for most of this section.
DuckDB is an in-process database so it has no external dependencies which makes it useful for single user analysis of large data. See Why DuckDB for their story.
You can install DuckDB directly on your computer and use the built-in Command Line Interface without using R.
However, the {duckdb} package installs the DuckDB DBMS as well as the ODBC-compliant driver for R.
Use the console to install {DBI} and {duckdb}.
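For example, run this once in the console (shown here as a sketch with `#| eval: false` so it is not re-run on render):

```{r}
#| eval: false
install.packages(c("DBI", "duckdb"))
```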
Making the connection between R and DuckDB requires loading the two packages {DBI} and {duckdb}.
We also want to compare the SQL code (working with the database tables) with R code (working with data frames) so load {tidyverse} and {nycflights13} which has the same data as the database we will use.
- You will see the concepts for accessing data are quite similar but the syntax is different.
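A sketch of the setup chunk:

```{r}
#| message: false
library(DBI)
library(duckdb)
library(tidyverse)
library(nycflights13)
```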
The DuckDB database with {nycflights13} data is on GitHub at https://github.com/AU-datascience/data/blob/main/413-613/flights.duckdb.
- {duckdb} only works with local databases, not over a URL.
- You can manually download the file to a `data` folder under your current file's working directory, or use the following once you are sure there is a `data` directory:
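A sketch, assuming the raw-content URL for the repository file above:

```{r}
#| eval: false
# download the binary database file into the local ./data directory
download.file(
  "https://github.com/AU-datascience/data/raw/main/413-613/flights.duckdb",
  destfile = "./data/flights.duckdb",
  mode = "wb" # binary mode so the file is not corrupted on Windows
)
```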
Create a connection to `flights.duckdb` by using `DBI::dbConnect()` with `duckdb()` and a relative path to the database, and assign the name `con` to the connection.
- Notice the connection to the database in the environment.
- You will use the `con` connection to interact with the database Back-end.
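A sketch of the connection chunk:

```{r}
# dbdir points at the local database file downloaded above
con <- dbConnect(duckdb(), dbdir = "./data/flights.duckdb")
```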
Once you have a connection, you can create a basic SQL code chunk like this in R Markdown.
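For example (a sketch using one common form of the knitr chunk-option syntax):

```{sql, connection=con}
SHOW TABLES;
```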
name |
---|
airlines |
airports |
flights |
planes |
weather |
In Quarto .qmd files, use the chunk option `#| connection: con`.
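For example (a sketch producing the same output below):

```{sql}
#| connection: con
SHOW TABLES;
```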
name |
---|
airlines |
airports |
flights |
planes |
weather |
Whenever you are using a SQL chunk inside an R Markdown or Quarto document, identify the connection in each code chunk so the SQL interpreter knows with which database to interact.
All the SQL code chunks in this section will have the `#| connection: my_con` chunk option, where `my_con` is the name of the appropriate connection.
8.4.2 Connecting to Parquet Files
DuckDB and other modern databases (e.g., Apache Spark, Presto/Trino, Google BigQuery, Snowflake, Databricks) explicitly support reading and writing Parquet files.
- Parquet is extremely common today in big data and analytics; it's basically the default format for modern data lakes and is replacing CSV as a format for data exchange.
- Parquet is a columnar, compressed, self-describing format so systems built with columnar execution in mind (analytics DBs, OLAP engines, distributed query engines) tend to support direct querying of compressed Parquet.
- Traditional RDBMS (Postgres, MySQL, SQL Server, Oracle) may require an extension or pre-processing to ingest them.
Parquet is a storage format, not a relational database.
- It preserves schema information (column names, types, and optional metadata).
- It does not preserve or enforce higher-level relational rules such as: Primary keys, Foreign keys, Indexes, or Constraints (NOT NULL, CHECK, etc.).
In parquet format, each table can be a separate file or a set of files (for really large data). The files can be compressed for efficient storage.
Assume there is a directory (“./data/parquet”) of tables where each file is a separate table.
- Create a new connection (see the sketch below).
- Read individual files by name.
- DESCRIBE a table.
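A sketch of creating the connection (any DuckDB connection, including an in-memory one, can read Parquet files directly):

```{r}
con_p <- dbConnect(duckdb()) # ephemeral in-memory database
```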
```{sql}
#| connection: con_p
#| label: lst-read_parquet
DESCRIBE './data/parquet/flights.parquet';
```
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
year | INTEGER | YES | NA | NA | NA |
month | INTEGER | YES | NA | NA | NA |
day | INTEGER | YES | NA | NA | NA |
dep_time | INTEGER | YES | NA | NA | NA |
sched_dep_time | INTEGER | YES | NA | NA | NA |
dep_delay | DOUBLE | YES | NA | NA | NA |
arr_time | INTEGER | YES | NA | NA | NA |
sched_arr_time | INTEGER | YES | NA | NA | NA |
arr_delay | DOUBLE | YES | NA | NA | NA |
carrier | VARCHAR | YES | NA | NA | NA |
When you have a DuckDB connection (here called `con_p`), DuckDB can automatically treat Parquet file paths as virtual tables so you can work with them.
- Here is an example of a `JOIN` of two parquet files. (Don't worry about the syntax for now.)
```{sql}
#| connection: con_p
#| label: lst-join_parquet
SELECT f.flight, a.name AS airline_name
FROM './data/parquet/flights.parquet' AS f
JOIN './data/parquet/airlines.parquet' AS a
ON f.carrier = a.carrier
LIMIT 10;
```
flight | airline_name |
---|---|
1545 | United Air Lines Inc. |
1714 | United Air Lines Inc. |
1141 | American Airlines Inc. |
725 | JetBlue Airways |
461 | Delta Air Lines Inc. |
1696 | United Air Lines Inc. |
507 | JetBlue Airways |
5708 | ExpressJet Airlines Inc. |
79 | JetBlue Airways |
301 | American Airlines Inc. |
Using the file names directly, though, makes the code less portable and harder to read.
To make portable code, create a VIEW for each table (Section 8.23) with an alias of the table name, and write your code using the `VIEW` instead of the file name, as if working with a DBMS table.
- Let's close the `con_p` connection.
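A sketch; `shutdown = TRUE` also shuts down the underlying DuckDB instance:

```{r}
dbDisconnect(con_p, shutdown = TRUE)
```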
- Use R to create a new connection and the views.
```{r}
# create connection (here, an ephemeral in-memory db)
con_p <- dbConnect(duckdb())
# register each Parquet file as a view named for its table
dbExecute(con_p, "
CREATE VIEW flights AS
SELECT * FROM parquet_scan('./data/parquet/flights.parquet');
CREATE VIEW airlines AS
SELECT * FROM parquet_scan('./data/parquet/airlines.parquet');
")
```
Now we can use the new connection to run the query as if it were standard SQL.
```{sql}
#| connection: con_p
#| label: lst-join_parquet-view
SELECT f.flight, a.name AS airline_name
FROM flights AS f
JOIN airlines AS a
ON f.carrier = a.carrier
LIMIT 10;
```
flight | airline_name |
---|---|
1545 | United Air Lines Inc. |
1714 | United Air Lines Inc. |
1141 | American Airlines Inc. |
725 | JetBlue Airways |
461 | Delta Air Lines Inc. |
1696 | United Air Lines Inc. |
507 | JetBlue Airways |
5708 | ExpressJet Airlines Inc. |
79 | JetBlue Airways |
301 | American Airlines Inc. |
- Close the connection
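As before (a sketch):

```{r}
dbDisconnect(con_p, shutdown = TRUE)
```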
8.4.3 SQL Syntax Overview
Like all languages, SQL has standard syntax and reserved words.
There is a standard for SQL: SQL-92. However, many database developers see this as a point of departure rather than a destination.
- They use ODBC and SQL-92 to develop their own versions that are optimized for their intended use cases and database design.
- Thus, there are many flavors of SQL in the world.
Be sure to review the SQL implementation for the DBMS you are using to be aware of any special features.
If you use proprietary SQL, it may give you performance improvements, but it will also lock you into that DBMS and you will have to revise your code if the back-end DBMS is changed.
These notes use the SQL flavor used by the DuckDB DBMS.
The Tidyverse uses functions as “verbs” to manipulate data; SQL uses “STATEMENTS” which can contain clauses, expressions, and other syntactic building blocks.
Table 8.1 identifies the common types of building blocks used in SQL along with some examples and their usage.
| Type | Examples | Role / Purpose |
|---|---|---|
| Statements | `SELECT`, `INSERT`, `UPDATE`, `DELETE`, `CREATE`, `DROP` | Complete SQL commands that tell the database what action to perform. |
| Clauses | `FROM`, `WHERE`, `GROUP BY`, `HAVING`, `ORDER BY`, `LIMIT`, `JOIN` | Sub-components of statements that define how the action is carried out. |
| Expressions | `CASE`, arithmetic (`a + b`), boolean (`x > 5`), string concatenation | Return a single value, often used inside clauses (computed or conditional). |
| Functions | Aggregate: `SUM()`, `AVG()`, `COUNT()`; Scalar: `ROUND()`, `UPPER()`, `NOW()` | Built-in operations that transform values or sets of values. |
| Modifiers | `DISTINCT`, `ALL`, `TOP`, `AS` (alias) | Adjust or qualify the behavior of clauses or expressions. |
| Predicates | `IN`, `BETWEEN`, `LIKE`, `IS NULL`, `EXISTS` | Conditions that return `TRUE`/`FALSE`, often used in `WHERE` or `HAVING`. |
| Operators | `=`, `<`, `>`, `AND`, `OR`, `NOT`, `+`, `-`, `*`, `/` | Combine or compare values. |
| Keywords | `NULL`, `PRIMARY KEY`, `FOREIGN KEY`, `DEFAULT`, `CHECK` | Special reserved words that define schema rules or literal values. |
A SQL “statement” is a “complete sentence” that can be executed independently as a query.
- A clause is a phrase in the sentence.
- An expression is like a formula inside a clause.
- A function is a built-in calculator you can call inside expressions.
- Modifiers tweak the behavior (like adverbs in English).
- Predicates make true/false checks.
- Operators are the glue (math or logic).
- Keywords are the reserved words that give SQL its grammar.
The following diagram shows how the building blocks are ordered in creating a complete sentence.
SQL Statement (e.g., SELECT)
│
├── Clauses (parts of the statement)
│ ├── FROM (which tables?)
│ ├── WHERE (filter rows before grouping)
│ │ └── Predicates (IN, BETWEEN, LIKE, IS NULL, EXISTS)
│ │ └── Expressions (x > 5, col1 + col2)
│ ├── GROUP BY (form groups)
│ ├── HAVING (filter groups after aggregation)
│ │ └── Aggregate Functions (SUM, AVG, COUNT, MIN, MAX)
│ ├── SELECT (choose fields/expressions to return)
│ │ ├── Expressions (CASE, arithmetic, concatenation)
│ │ │ └── Functions (ROUND, UPPER, NOW, SQRT, etc.)
│ │ └── Modifiers (DISTINCT, AS alias)
│ ├── ORDER BY (sort results)
│ └── LIMIT / OFFSET (restrict result size)
│
├── Operators (math: +, -, *, / ; logic: AND, OR, NOT ; comparison: =, <, >)
│
└── Keywords (NULL, DEFAULT, PRIMARY KEY, FOREIGN KEY, CHECK, etc.)
SQL is used across the entire database workflow:
- Database setup: `CREATE DATABASE`, `CREATE TABLE`
- Data management: `INSERT`, `UPDATE`, `DELETE`, `DROP TABLE`, `ALTER TABLE`
- Data retrieval and analysis: `SELECT` (with various clauses)
We are most interested in `SELECT` statements as that is how we get data from a database table or a set of joined tables.
8.4.4 The `SELECT` Statement
The `SELECT` statement is the SQL workhorse for getting and manipulating data, akin to `dplyr::select()` combined with `filter()`, `arrange()`, and `summarise()`.
`SELECT` has several other clauses or keywords to support tailoring the data returned by a query.
- `FROM`: required to identify the table or set of tables where the data will be located.
- `WHERE`: a clause used to filter which records (rows) should be returned, like `dplyr::filter()`.
- `ORDER BY`: a keyword for sorting records, like `dplyr::arrange()`. The default is ascending order.
- `GROUP BY`: a keyword akin to `dplyr::group_by()`, used for creating summaries of data, often with aggregate functions (`SUM()`, `AVG()`, `COUNT()`, …).
Other useful modifiers include:
- `AS`: rename/alias columns or tables (like `dplyr::rename()`).
- `IN`: test membership (like `%in%`).
- `DISTINCT`: removes duplicate rows (like `dplyr::distinct()`).
- `LIMIT`: restrict the number of rows returned (like `utils::head()`).
SELECT Statement Operators
- SQL uses `=` for comparison instead of `==`, since you don't assign anything in SQL.
- Use `!=` for not equal.
- SQL uses `NULL` instead of `NA`. Test with `IS NULL` or `IS NOT NULL`.
- SQL uses `AND`, `OR`, and `NOT` as logical operators.
8.4.5 SQL Syntax Rules to Remember
- Case insensitivity:
  - Keywords are case insensitive, i.e., `select` is the same as `SELECT` is the same as `SeLeCt`.
  - Best practice is to have all statement keywords in UPPERCASE (e.g., `SELECT`) and table/field names in lowercase for readability.
- Order matters:
  - A SQL `SELECT` statement's clauses and keywords must be in the following order: `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`.
- Whitespace & formatting:
  - SQL ignores newlines and extra spaces.
  - Best practice is to put statements, clauses, and keywords on new lines, with spaces to align the keywords and preserve a "river" of white space between the keywords and the values.
- Quoting rules:
  - Character string values must be in single quotes, e.g., `'NYC'`.
  - Identifiers (column/table names): use double quotes (`"first name"`) if they contain spaces, reserved words, or special characters (same as using backticks in R).
  - It is fine to always use double quotes because it is not always clear what is an invalid variable name in the database management system.
- Comments:
  - Single-line comments start with two hyphens, `--`.
  - Multi-line comments use `/* ... */` in most DBMS.
- Statement termination:
  - Some databases (e.g., PostgreSQL, Oracle) require a semicolon `;` at the end of each statement.
  - Best practice is to always include `;` so multiple statements can be run in one batch.
Together, these rules make SQL predictable (rigid order) but also forgiving (case, whitespace, optional semicolons).
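Putting these rules together, a minimal well-formed query against our flights table might look like this sketch:

```{sql}
#| connection: con
-- keywords uppercase, strings single-quoted, one clause per line
SELECT "flight", "origin", "dest"
FROM flights
WHERE "origin" = 'JFK'
ORDER BY "flight"
LIMIT 5;
```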
8.4.6 Creating SQL Code Chunks
If you are working a lot with SQL in RStudio, you can manually enter a chunk and change the language engine to sql each time. However, there are other ways to be more productive.
8.4.6.1 Customize a Keyboard Shortcut
RStudio has a built-in capability to create an SQL code chunk with an R Markdown code chunk option for the connection.
To access it quickly, I suggest customizing a keyboard shortcut for it per Customizing Keyboard Shortcuts in the RStudio IDE. Ushey (2023)
Go to the RStudio Main Menu and select `Tools/Modify Keyboard Shortcuts`.
A pop up pane will appear with many options and the existing keyboard shortcuts.
- Enter `S Q` in the search pane and you should see something like Figure 8.3.

1. Click in the `Shortcut` column to the right of "Insert Chunk S Q L".
2. Enter the shortcut combination you want. Figure 8.3 shows the Mac shortcut I chose, `Ctrl + Cmd + S`.
3. Click `Apply` and your shortcut will be saved. You can use it right away.
Enter the shortcut in a text area (not an existing chunk) and you should get the following (without the `con`).
- Make sure you enter the name of the connection you have created.
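Something like this sketch (exact details depend on your RStudio version); the connection name `con` is the part you fill in:

```{sql, connection=con}

```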
You cannot render a document with an SQL code chunk that has an invalid or empty connection.
You must have created a valid connection object before the SQL code chunk.
If you have a connection, you must also have a valid SQL line; otherwise, set the code chunk option `#| eval: false`.
8.4.6.2 Create a Custom Code Snippet
Another option is to create a custom code snippet per Code Snippets in the RStudio IDE. Allaire (2023)
- RStudio snippets are templates for code or R Markdown.
To create a snippet for a custom SQL code chunk, go to the RStudio Main Menu and select `Tools/Edit Code Snippets`.
- A pop-up pane should appear.
Select `Markdown` as you will be using the snippet in a Markdown text section. You will see something like Figure 8.4 with the existing Markdown code snippets.

- Figure 8.4 already shows the snippet I saved for creating a custom SQL code chunk.
You can create your own snippet.
- Enter `snippet s_name`, where you substitute your name for the snippet instead of `s_name`. It does not have to be `rsql`.
- Enter the following lines (see the sketch after this list) and use the tab key, not spaces, to indent.
- The `${0}` is a placeholder for your SQL code. The `0` means it is the first place your cursor will go.
  - You could replace `con` with `${0}` so you can fill in which connection each time. Then edit the next `${0}` on the original line to `${1}`.
- Select `Save` to save it and then you can use it.
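A sketch of what the snippet body might look like (the `rsql` name and the `con` connection are my choices; the indentation must be tabs):

snippet rsql
	```{sql}
	#| connection: con
	${0}
	```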
To use a snippet, enter the name you created (I use "rsql") and then `SHIFT+TAB` on a Mac or `TAB` on Windows, and the code chunk will appear like below (without the other options).
Now we are ready to go through a number of the SQL Building blocks from Table 8.1 to manipulate and retrieve data using the connection to our DuckDB database.
8.5 `SHOW` and `DESCRIBE` Provide an Overview of the Tables in the Database
`SHOW TABLES` returns a list of all of the tables in the database.
```{sql}
#| connection: con
#| label: lst-show-tables-3
-- List all tables in the database
SHOW TABLES;
```
name |
---|
airlines |
airports |
flights |
planes |
weather |
`DESCRIBE` (DuckDB) returns the set of tables, with columns for the table names, their column names, and column types.
`DESCRIBE tablename` (DuckDB) returns metadata about the table `tablename`:
- column name, data type, whether NULL is allowed, key info (e.g., primary key), default values.
```{sql}
#| connection: con
#| label: lst-describe-table-airlines
-- Structure of a specific table
DESCRIBE airlines;
```
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
carrier | VARCHAR | YES | NA | NA | NA |
name | VARCHAR | YES | NA | NA | NA |
- R equivalents for a data frame
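The chunks were presumably `str()` and `dplyr::glimpse()` on the flights data frame; a sketch (not re-evaluated here since the output is shown below):

```{r}
#| eval: false
str(flights)
glimpse(flights)
```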
tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
$ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
$ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
$ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
$ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
$ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
$ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
$ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
$ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
$ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
$ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
$ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
$ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
$ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
$ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
$ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
$ distance : num [1:336776] 1400 1416 1089 1576 762 ...
$ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
$ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
$ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
Rows: 336,776
Columns: 19
$ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
$ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
$ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
$ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
$ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
$ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
$ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
$ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
$ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
$ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
$ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
$ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
$ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
$ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
$ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
$ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
$ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
8.6 `SELECT` Returns Specific Fields (Columns) `FROM` a Table
- The `SELECT` statement must always be followed by a `FROM` clause which identifies the source of the data. That source is a table, a set of joined tables, or a previously-created table view (a saved query, like an intermediate data frame variable).
- The data returned as a result of executing a `SELECT` statement is stored in a result table, which can also be called the result-set. This means you can use the output as a table in a new `FROM` clause.
- A `SELECT` statement can have other clauses and modifiers to reduce the amount of data that is returned.
- There is no equivalent for excluding fields (like `dplyr::select(-year)`). Just select the fields you want.
Let’s use this syntax to get all the data in the fields `tailnum`, `year`, and `model` from the `planes` table.
```{sql}
#| connection: con
#| label: lst-select-planes
SELECT "tailnum", "year", "model"
FROM planes;
```
tailnum | year | model |
---|---|---|
N10156 | 2004 | EMB-145XR |
N102UW | 1998 | A320-214 |
N103US | 1999 | A320-214 |
N104UW | 1999 | A320-214 |
N10575 | 2002 | EMB-145LR |
N105UW | 1999 | A320-214 |
N107US | 1999 | A320-214 |
N108UW | 1999 | A320-214 |
N109UW | 1999 | A320-214 |
N110UW | 1999 | A320-214 |
- R equivalent
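A possible {dplyr} equivalent (sketch):

```{r}
planes |>
  select(tailnum, year, model)
```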
Think of `SELECT ... FROM ...` as the core skeleton of every SQL query.
- Everything else (`WHERE`, `GROUP BY`, `ORDER BY`, etc.) adds constraints to shape the result-set, typically to fewer rows that are more focused on the data needed for the question at hand.
You can select every field in a table by using the wildcard `*`, e.g., `SELECT *`.
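For example (a sketch matching the output below):

```{sql}
#| connection: con
SELECT *
FROM planes
LIMIT 5;
```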
tailnum | year | type | manufacturer | model | engines | seats | speed | engine |
---|---|---|---|---|---|---|---|---|
N10156 | 2004 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NA | Turbo-fan |
N102UW | 1998 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N103US | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N104UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N10575 | 2002 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NA | Turbo-fan |
However, this is generally considered a bad practice.
- Fragile code
- Tables evolve over time. New columns may appear or existing columns may change.
SELECT *
may suddenly return unexpected fields and break downstream code.
- Performance issues
- On wide tables (hundreds of columns), returning everything is wasteful.
- Extra data must be:
- retrieved from storage
- processed by the database
- transmitted over the connection
- Network overhead
- If the database is remote, sending unnecessary data across the network makes queries slower for you and everyone else.
The best practice is to select only the specific fields of interest.
Benefits:
- More robust code; your code won’t break if new, irrelevant fields are added.
- More efficient: only the data you need is selected, reducing workload on both the database and the connection.
Exceptions:
1. Small tables you know won't change (e.g., a static lookup table).
2. Quick interactive exploration; then switch to explicit fields in production queries.
Let’s examine several keywords that modify `SELECT` to reduce the number of rows or modify the data.
8.6.1 Select Distinct Records to Remove Duplicate Records
Some tables may allow duplicate records or a query may create a result set with duplicate values if only a few fields are selected.
The statement `SELECT DISTINCT field1, field2` modifies the result set so it has no duplicate rows.
Let’s select all the distinct (unique) destinations.
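A query like this sketch produces the output below:

```{sql}
#| connection: con
SELECT DISTINCT "dest"
FROM flights;
```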
dest |
---|
DCA |
DAY |
SRQ |
CHS |
AUS |
SAV |
SNA |
OMA |
MYR |
MVY |
- R equivalent
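A possible {dplyr} equivalent (sketch):

```{r}
flights |>
  distinct(dest)
```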
8.7 `WHERE` Filters Records (Rows) by Values
To modify the result set based on values in the records, add a `WHERE` clause after the `FROM` clause with a condition that returns a logical value.
- The `WHERE` clause uses the condition to filter the returned records to those whose values meet the condition, i.e., only rows for which the condition evaluates to TRUE will appear in the result-set.
- The `WHERE` clause requires an expression that returns a logical value.
```{sql}
#| connection: con
SELECT "flight", "distance", "origin", "dest"
FROM flights
WHERE "distance" < 50;
```
flight | distance | origin | dest |
---|---|---|---|
1632 | 17 | EWR | LGA |
- R equivalent:
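A possible {dplyr} equivalent (sketch):

```{r}
flights |>
  filter(distance < 50) |>
  select(flight, distance, origin, dest)
```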
The allowable operators in `WHERE` are:
- Comparison: `=`, `!=`, `<`, `>`, `<=`, `>=`
- Logical: `AND`, `OR`, `NOT`
- Membership: `IN`
- Null check: `IS NULL`, `IS NOT NULL`
To test for equality, use a single equal sign, `=`.
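For example (a sketch matching the output below):

```{sql}
#| connection: con
SELECT "flight", "month"
FROM flights
WHERE "month" = 12;
```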
flight | month |
---|---|
745 | 12 |
839 | 12 |
1895 | 12 |
1487 | 12 |
2243 | 12 |
939 | 12 |
3819 | 12 |
1441 | 12 |
2167 | 12 |
605 | 12 |
- R equivalent
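A possible {dplyr} equivalent (sketch):

```{r}
flights |>
  filter(month == 12) |>
  select(flight, month)
```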
To use a character string you must use single quotes on the string, not double.
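For example (a sketch matching the output below):

```{sql}
#| connection: con
SELECT "flight", "origin"
FROM flights
WHERE "origin" = 'JFK';
```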
flight | origin |
---|---|
1141 | JFK |
725 | JFK |
79 | JFK |
49 | JFK |
71 | JFK |
194 | JFK |
1806 | JFK |
1743 | JFK |
303 | JFK |
135 | JFK |
- R equivalent
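A possible {dplyr} equivalent (sketch):

```{r}
flights |>
  filter(origin == "JFK") |>
  select(flight, origin)
```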
8.7.1 Combine Multiple Criteria Using the Logical Operators.
Here is an example of using the `AND` logical operator.
```{sql}
#| connection: con
SELECT "flight", "origin", "dest"
FROM flights
WHERE "origin" = 'JFK' AND "dest" = 'CMH';
```
flight | origin | dest |
---|---|---|
4146 | JFK | CMH |
3783 | JFK | CMH |
4146 | JFK | CMH |
3783 | JFK | CMH |
4146 | JFK | CMH |
3783 | JFK | CMH |
4146 | JFK | CMH |
3783 | JFK | CMH |
4146 | JFK | CMH |
3650 | JFK | CMH |
- R equivalent:
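A possible {dplyr} equivalent (sketch):

```{r}
flights |>
  filter(origin == "JFK", dest == "CMH") |>
  select(flight, origin, dest)
```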
You can use the `OR` and `NOT` logical operators too.
- Put parentheses around the portions of the expression to ensure you get the desired order of operations.
```{sql}
#| connection: con
SELECT
DISTINCT "flight",
"origin",
"dest"
FROM
flights
WHERE
NOT ("origin" = 'JFK'
OR "origin" = 'LGA')
AND dest = 'CMH';
```
flight | origin | dest |
---|---|---|
3852 | EWR | CMH |
4147 | EWR | CMH |
4446 | EWR | CMH |
4240 | EWR | CMH |
3839 | EWR | CMH |
3854 | EWR | CMH |
4704 | EWR | CMH |
5664 | EWR | CMH |
5815 | EWR | CMH |
4335 | EWR | CMH |
- R equivalent
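A possible {dplyr} equivalent (sketch):

```{r}
flights |>
  filter(!(origin == "JFK" | origin == "LGA"), dest == "CMH") |>
  distinct(flight, origin, dest)
```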
8.7.2 Filter on Membership Using `IN` or `NOT IN`
You can simplify and generalize a `WHERE` condition for membership using `IN` instead of multiple `OR` comparisons.
- To get flights whose destination is one of several airports (e.g., BWI, IAD, DCA) and exclude certain origins:
```{sql}
#| connection: con
SELECT DISTINCT
"flight",
"origin",
"dest"
FROM flights
WHERE
"origin" NOT IN ('JFK', 'LGA')
AND "dest" IN ('BWI', 'IAD', 'DCA');
```
flight | origin | dest |
---|---|---|
4241 | EWR | DCA |
3838 | EWR | DCA |
3845 | EWR | BWI |
4372 | EWR | DCA |
6049 | EWR | IAD |
3832 | EWR | DCA |
4368 | EWR | DCA |
4103 | EWR | DCA |
4348 | EWR | IAD |
1672 | EWR | BWI |
- R Equivalent
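A possible {dplyr} equivalent (sketch):

```{r}
flights |>
  filter(!origin %in% c("JFK", "LGA"),
         dest %in% c("BWI", "IAD", "DCA")) |>
  distinct(flight, origin, dest)
```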
8.7.3 Remove Records with Missing Data
Missing data is `NULL` in SQL (instead of `NA`).
We can explicitly remove records with missing values with the clause `WHERE field IS NOT NULL`.
```{sql}
#| connection: con
SELECT "flight", "dep_delay"
FROM flights
WHERE "dep_delay" IS NOT NULL;
```
flight | dep_delay |
---|---|
1545 | 2 |
1714 | 4 |
1141 | 2 |
725 | -1 |
461 | -6 |
1696 | -4 |
507 | -5 |
5708 | -3 |
79 | -3 |
301 | -2 |
- R equivalent
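A possible {dplyr} equivalent (sketch):

```{r}
flights |>
  filter(!is.na(dep_delay)) |>
  select(flight, dep_delay)
```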
Just use `IS NULL` if you want only the records that are missing data.
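For example (a sketch):

```{sql}
#| connection: con
SELECT "flight", "dep_delay"
FROM flights
WHERE "dep_delay" IS NULL
LIMIT 10;
```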
8.8 String/Text Functions
SQL provides several functions for working with text.
- These are often used in `WHERE` clauses to filter data.
- Find flights for carriers with "Jet" in their name (case-insensitive).
```{sql}
#| connection: con
SELECT f.flight, a.name AS airline_name, f.origin
FROM flights f
JOIN airlines a
ON f.carrier = a.carrier
WHERE a.name ILIKE '%jet%' -- ILIKE for case-insensitive matching
LIMIT 10;
```
flight | airline_name | origin |
---|---|---|
4424 | ExpressJet Airlines Inc. | EWR |
6177 | ExpressJet Airlines Inc. | EWR |
583 | JetBlue Airways | JFK |
4963 | ExpressJet Airlines Inc. | EWR |
4380 | ExpressJet Airlines Inc. | EWR |
601 | JetBlue Airways | JFK |
4122 | ExpressJet Airlines Inc. | EWR |
4334 | ExpressJet Airlines Inc. | EWR |
4533 | ExpressJet Airlines Inc. | EWR |
4166 | ExpressJet Airlines Inc. | EWR |
- Extract the first two characters of the carrier code.
```{sql}
#| connection: con
SELECT carrier, SUBSTR(carrier, 1, 2) AS carrier_prefix
FROM airlines
LIMIT 10;
```
carrier | carrier_prefix |
---|---|
9E | 9E |
AA | AA |
AS | AS |
B6 | B6 |
DL | DL |
EV | EV |
F9 | F9 |
FL | FL |
HA | HA |
MQ | MQ |
- Trim whitespace from airport codes (if any).
```{sql}
#| connection: con
SELECT TRIM(origin) AS clean_origin, TRIM(dest) AS clean_dest
FROM flights
LIMIT 10;
```
clean_origin | clean_dest |
---|---|
EWR | IAH |
LGA | IAH |
JFK | MIA |
JFK | BQN |
LGA | ATL |
EWR | ORD |
EWR | FLL |
LGA | IAD |
JFK | MCO |
LGA | ORD |
8.8.1 `CONCAT` Concatenates Multiple Strings or Lists
SQL has the `CONCAT` function and an operator for combining strings into a single value.
- This includes concatenating column names. Add an alias so it is easier to reference.
```{sql}
#| connection: con
SELECT DISTINCT CONCAT(f.carrier, '-', a.name) AS airline_name
FROM flights f
JOIN airlines a
ON f.carrier = a.carrier
WHERE a.name ILIKE '%air%' -- ILIKE for case-insensitive matching
ORDER BY airline_name
LIMIT 10;
```
airline_name |
---|
9E-Endeavor Air Inc. |
AA-American Airlines Inc. |
AS-Alaska Airlines Inc. |
B6-JetBlue Airways |
DL-Delta Air Lines Inc. |
EV-ExpressJet Airlines Inc. |
F9-Frontier Airlines Inc. |
FL-AirTran Airways Corporation |
HA-Hawaiian Airlines Inc. |
MQ-Envoy Air |
Concatenation is so common there is an operator for it.
Use `||` as a binary operator to concatenate the elements on both sides of it.
```{sql}
#| connection: con
SELECT DISTINCT f.carrier || '-' || a.name AS airline_name
FROM flights f
JOIN airlines a
ON f.carrier = a.carrier
WHERE a.name ILIKE '%air%' -- ILIKE for case-insensitive matching
ORDER BY airline_name
LIMIT 10;
```
airline_name |
---|
9E-Endeavor Air Inc. |
AA-American Airlines Inc. |
AS-Alaska Airlines Inc. |
B6-JetBlue Airways |
DL-Delta Air Lines Inc. |
EV-ExpressJet Airlines Inc. |
F9-Frontier Airlines Inc. |
FL-AirTran Airways Corporation |
HA-Hawaiian Airlines Inc. |
MQ-Envoy Air |
8.8.2 Regular Expressions in SQL
SQL supports regex pattern matching to filter strings more precisely.
- In many SQL dialects you can use `REGEXP` (in DuckDB, use `REGEXP_MATCHES`) to match patterns.
- Flights whose destination airport code starts with 'O' and ends with 'A':
```{sql}
#| connection: con
SELECT flight, dest
FROM flights
WHERE REGEXP_MATCHES(dest, '^O.*A$')
LIMIT 10;
```
flight | dest |
---|---|
4417 | OMA |
4085 | OMA |
4160 | OMA |
4417 | OMA |
4085 | OMA |
4294 | OMA |
4085 | OMA |
4294 | OMA |
4085 | OMA |
4085 | OMA |
- Carriers whose name contains " Air ".
```{sql}
#| connection: con
SELECT name
FROM airlines
WHERE REGEXP_MATCHES(name, '^.* Air .*')
LIMIT 10;
```
name |
---|
Endeavor Air Inc. |
Delta Air Lines Inc. |
United Air Lines Inc. |
- Replace tail numbers that begin with "N" with only the digits after the "N" (see the sketch below).
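A sketch using DuckDB's `REGEXP_EXTRACT()` with a capture group for the digits:

```{sql}
#| connection: con
SELECT "tailnum",
       REGEXP_EXTRACT("tailnum", '^N([0-9]+)', 1) AS tail_digits
FROM planes
LIMIT 5;
```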
8.9 Date/Time Functions
SQL has operators and functions to work with dates and times.
- These can be useful for filtering or aggregating flights
Operators allow us to add and subtract DATE types
```{sql}
#| connection: con
SELECT flight, carrier, time_hour
FROM flights
WHERE time_hour = DATE '2013-07-04' + 1
LIMIT 10;
```
flight | carrier | time_hour |
---|---|---|
418 | B6 | 2013-07-05 |
2391 | DL | 2013-07-05 |
415 | VX | 2013-07-05 |
1505 | B6 | 2013-07-05 |
1464 | UA | 2013-07-05 |
1244 | UA | 2013-07-05 |
686 | B6 | 2013-07-05 |
485 | FL | 2013-07-05 |
1271 | WN | 2013-07-05 |
4079 | 9E | 2013-07-05 |
Functions allow us to manipulate or extract information from `DATE`s or `TIME`s.
DuckDB has two versions of some functions, with and without an underscore, to help with compatibility.
- The ones without an underscore, e.g., datediff, dateadd, etc., were originally added to mimic common SQL dialects like SQL Server or MySQL.
- The ones with an underscore, e.g., date_diff, date_add, follow standard SQL/ISO naming conventions.
The functions provide the same result (although the argument order may differ).
Say we want to select only those flights that occurred in March at 10 PM.
- We can use the concatenation operator to construct a date and assign an alias.
- We could also just use `time_hour`, which is type TIMESTAMP (combines DATE and TIME information).
- Then select the rows where the month of the date is March and the hour is 10 PM.
```{sql}
#| connection: con
SELECT flight, carrier, DATE(year || '-' || month || '-' || day) AS flight_date,
time_hour
FROM flights
WHERE EXTRACT(MONTH FROM flight_date) = 3 AND EXTRACT(HOUR from time_hour) = 22
LIMIT 10;
```
flight | carrier | flight_date | time_hour |
---|---|---|---|
575 | AA | 2013-03-01 | 2013-03-01 22:00:00 |
2136 | US | 2013-03-01 | 2013-03-01 22:00:00 |
127 | DL | 2013-03-01 | 2013-03-01 22:00:00 |
1499 | DL | 2013-03-01 | 2013-03-01 22:00:00 |
2183 | US | 2013-03-01 | 2013-03-01 22:00:00 |
257 | AA | 2013-03-01 | 2013-03-01 22:00:00 |
2042 | DL | 2013-03-01 | 2013-03-01 22:00:00 |
4202 | EV | 2013-03-01 | 2013-03-01 22:00:00 |
31 | DL | 2013-03-01 | 2013-03-01 22:00:00 |
689 | UA | 2013-03-01 | 2013-03-01 22:00:00 |
Extract parts of a date
```{sql}
#| connection: con
SELECT year, month, day, EXTRACT(DAY FROM date '2013-03-15') AS day_of_month
FROM flights
LIMIT 5;
```
year | month | day | day_of_month |
---|---|---|---|
2013 | 1 | 1 | 15 |
2013 | 1 | 1 | 15 |
2013 | 1 | 1 | 15 |
2013 | 1 | 1 | 15 |
2013 | 1 | 1 | 15 |
Calculate the difference between the scheduled departure date and a calculated arrival date.
- Use INTERVAL arithmetic to convert the numeric air_time (in minutes) into a time interval.
- Use the DATE_DIFF function to find the difference between two dates.
```{sql}
#| connection: con
SELECT flight, origin, dest, time_hour AS dep_datetime,
time_hour + (air_time * INTERVAL '1 MINUTE') AS arr_datetime,
date_diff('day',
time_hour,
time_hour + (air_time * INTERVAL '1 MINUTE')
) as days_different
FROM flights
WHERE date_diff('day',
time_hour,
time_hour + (air_time * INTERVAL '1 MINUTE')
) <> 0
LIMIT 10;
```
flight | origin | dest | dep_datetime | arr_datetime | days_different |
---|---|---|---|---|---|
51 | JFK | HNL | 2013-01-01 14:00:00 | 2013-01-02 00:59:00 | 1 |
117 | JFK | LAX | 2013-01-01 18:00:00 | 2013-01-02 00:02:00 | 1 |
15 | EWR | HNL | 2013-01-01 18:00:00 | 2013-01-02 04:56:00 | 1 |
16 | EWR | SEA | 2013-01-01 19:00:00 | 2013-01-02 00:48:00 | 1 |
257 | JFK | SFO | 2013-01-01 19:00:00 | 2013-01-02 01:02:00 | 1 |
355 | JFK | BUR | 2013-01-01 18:00:00 | 2013-01-02 00:11:00 | 1 |
1322 | JFK | SFO | 2013-01-01 19:00:00 | 2013-01-02 01:15:00 | 1 |
1467 | JFK | LAX | 2013-01-01 20:00:00 | 2013-01-02 01:40:00 | 1 |
759 | LGA | DFW | 2013-01-01 20:00:00 | 2013-01-02 00:08:00 | 1 |
1699 | EWR | SFO | 2013-01-01 20:00:00 | 2013-01-02 01:48:00 | 1 |
For more complex date (or REGEX) functions, we need to install and load the International Components for Unicode (ICU) extension.
- It’s a library that provides:
- Unicode support – handles text in almost any language and script.
- Advanced text functions – like regular expressions, case conversions, collation (sorting), normalization, etc.
- Date/time parsing and formatting – especially when working with strings in different locales.
Compare with the current date.
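A sketch using `current_date` (the `INSTALL`/`LOAD` lines are only needed once, if the ICU extension is not already available):

```{sql}
#| connection: con
-- INSTALL icu; LOAD icu;
SELECT "flight", time_hour,
       date_diff('day', time_hour::DATE, current_date) AS days_ago
FROM flights
LIMIT 5;
```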
8.10 `LIMIT` Restricts the Number of Records Returned, Starting with the `OFFSET` Record
When you are building a query, you often want to subset the rows while you are finishing it (you don't want to return the whole table each time you are troubleshooting a query).
Use the `LIMIT` output modifier at the end of the query to restrict the number of records that are returned.
- Some flavors of SQL allow `SELECT TOP` and `SELECT TOP PERCENT` statements to restrict to the top number/percentage of records.
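For example (a sketch matching the output below):

```{sql}
#| connection: con
SELECT "flight", "origin", "dest"
FROM flights
LIMIT 5;
```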
flight | origin | dest |
---|---|---|
1545 | EWR | IAH |
1714 | LGA | IAH |
1141 | JFK | MIA |
725 | JFK | BQN |
461 | LGA | ATL |
If you want a limited number of records, but not just the first records, use the `OFFSET` clause to identify how many records to skip before starting to collect the records.
- The `OFFSET` clause indicates at which position to start reading the values, i.e., the first `OFFSET` values are ignored.
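For example, this sketch skips the first three records and then returns five:

```{sql}
#| connection: con
SELECT "flight", "origin", "dest"
FROM flights
LIMIT 5 OFFSET 3;
```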
8.11 `USING SAMPLE` Returns a Random Sample of Records
You can also randomly sample rows with the clause `USING SAMPLE`.
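For example (a sketch; your rows will differ since the sample is random):

```{sql}
#| connection: con
SELECT "flight", "origin", "dest"
FROM flights
USING SAMPLE 5;
```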
flight | origin | dest |
---|---|---|
927 | EWR | MIA |
601 | JFK | MIA |
3910 | JFK | RIC |
1708 | EWR | SJU |
1529 | JFK | LAS |
Depending upon your version of SQL, there may be a modifier to `SAMPLE` that allows you to set the random number seed for the sample so it is repeatable.
- DuckDB has multiple sampling methods. The following uses `bernoulli` sampling to get a small number of rows for a small percentage (just to avoid many rows in the output).
- Note that repeating the code above changes the selected rows, whereas repeating the following does not.
```{sql}
#| connection: con
SELECT "flight", "origin", "dest"
FROM flights
USING SAMPLE .0027% (bernoulli, 377);
```
flight | origin | dest |
---|---|---|
340 | EWR | MIA |
When working with tables with more than a few hundred records, we recommend using `LIMIT` and one of the `WHERE` or `USING SAMPLE` clauses when initially building/testing your query to speed up the return of results, especially if the connection is over a network.
Remove the restrictions when the query is ready for use on the full table.
8.12 `ORDER BY` Arranges Rows
The `ORDER BY` clause modifies the output to sort the records that are returned.
- The default order is ascending.
Let's remove missing values so we can see the ordering.
```{sql}
#| connection: con
SELECT "flight", "dep_delay"
FROM flights
WHERE "dep_delay" IS NOT NULL
ORDER BY "dep_delay"
LIMIT 20;
```
flight | dep_delay |
---|---|
97 | -43 |
1715 | -33 |
5713 | -32 |
1435 | -30 |
837 | -27 |
3478 | -26 |
4573 | -25 |
4361 | -25 |
2223 | -24 |
3318 | -24 |
- R equivalent
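A possible {dplyr} equivalent (sketch):

```{r}
flights |>
  filter(!is.na(dep_delay)) |>
  select(flight, dep_delay) |>
  arrange(dep_delay)
```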
Use the modifier `DESC` after the field name to arrange in descending order.
```{sql}
#| connection: con
SELECT "flight", "dep_delay"
FROM flights
WHERE "dep_delay" IS NOT NULL
ORDER BY "dep_delay" DESC;
```
flight | dep_delay |
---|---|
51 | 1301 |
3535 | 1137 |
3695 | 1126 |
177 | 1014 |
3075 | 1005 |
2391 | 960 |
2119 | 911 |
2007 | 899 |
2047 | 898 |
172 | 896 |
- R equivalent
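A possible {dplyr} equivalent (sketch):

```{r}
flights |>
  filter(!is.na(dep_delay)) |>
  select(flight, dep_delay) |>
  arrange(desc(dep_delay))
```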
Break ties by adding more variables in the `ORDER BY` clause.
```{sql}
#| connection: con
SELECT "flight", "origin", "dep_delay"
FROM flights
WHERE "dep_delay" IS NOT NULL
ORDER BY "origin" DESC, "dep_delay"
LIMIT 20;
```
flight | origin | dep_delay |
---|---|---|
1715 | LGA | -33 |
5713 | LGA | -32 |
1435 | LGA | -30 |
837 | LGA | -27 |
3478 | LGA | -26 |
4573 | LGA | -25 |
375 | LGA | -24 |
4065 | LGA | -24 |
2223 | LGA | -24 |
1371 | LGA | -23 |
- R equivalent
8.13 `AS` Creates an Alias for a Field, Table, or Sub-Query/View
It is often helpful to create a new, often shorter, name for a field, table, or sub-query.
- An alias is a temporary name which can make it much easier to read and debug the code, especially with multiple joins and sub-queries at play.
- `AS` does not change or update the original table; it creates the alias for a new temporary field/table that will disappear after the query is run.
To create an alias or new name for a field in a result set, use the keyword `AS` inside the `SELECT` statement, after the column name or expression, to specify the alias (the new name for the field resulting from the column name or expression).
Let's calculate average speed from the `flights` table. We'll also keep the flight number, distance, and air time variables.
```{sql}
#| connection: con
SELECT
"flight",
"distance" / "air_time" AS "speed", -- "speed" is the alias
"distance",
"air_time"
FROM
flights
LIMIT 5;
```
flight | speed | distance | air_time |
---|---|---|---|
1545 | 6.167401 | 1400 | 227 |
1714 | 6.237885 | 1416 | 227 |
1141 | 6.806250 | 1089 | 160 |
725 | 8.612022 | 1576 | 183 |
461 | 6.568966 | 762 | 116 |
- R equivalent:
8.14 `CAST()` Changes the Type of a Variable
The `CAST()` function allows you to explicitly change the data type of a column, expression, or literal. It is commonly used when you need to:
- Run numeric operations on text columns
- Format dates or times
- Ensure compatible types for joins or calculations
- The expression is what you want to convert, and the target is the new type (e.g., `INTEGER`, `FLOAT` (single precision, 4 bytes), `DOUBLE` (double precision, 8 bytes), `VARCHAR`, `DATE`, etc.).
DuckDB also supports the shorthand use of `::` as a binary operator, e.g., `expr::TYPENAME`, which is also present in PostgreSQL.
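The general form is `CAST(expression AS TYPE)`. A minimal sketch of both styles (the aliases are just illustrative):

```{sql}
#| connection: con
SELECT
  CAST('3.14' AS DOUBLE) AS pi_double, -- string literal converted to a double
  42::VARCHAR AS answer_text;          -- DuckDB/PostgreSQL shorthand for CAST(42 AS VARCHAR)
```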
Numeric Types in DuckDB and Their Relation to Standard SQL
DuckDB supports a variety of numeric types, which align closely with standard SQL data types. These types can be broadly categorized into integer types, approximate numeric types, and exact numeric types.
- Integer Types
DuckDB provides several integer types for whole numbers:
Type | Storage | Description |
---|---|---|
TINYINT |
1 byte | Small integer values |
SMALLINT |
2 bytes | Medium-range integer values |
INTEGER / INT |
4 bytes | Standard integer values |
BIGINT |
8 bytes | Large integer values |
These are equivalent to standard SQL integer types and are used when exact whole numbers are needed.
- Floating-Point Types (Approximate)
Type | Storage | Description |
---|---|---|
FLOAT / REAL |
4 bytes | Single-precision floating point (approximate) |
DOUBLE / DOUBLE PRECISION |
8 bytes | Double-precision floating point (approximate) |
- These types store approximate numeric values, meaning they can introduce rounding errors.
- They are useful for scientific or large-scale calculations where exact precision is not required.
- Exact Numeric Types
Type | Description |
---|---|
DECIMAL(p, s) / NUMERIC(p, s) |
Exact fixed-point numbers. p = total digits, s = digits after the decimal point. |
- Ideal for financial or precise calculations, where exact decimal representation is required.
- Arithmetic with
DECIMAL
preserves exact results, unlike floating-point types.
Summary
- Integers: exact whole numbers.
- FLOAT / DOUBLE: approximate numbers, faster, but may have rounding errors.
- DECIMAL / NUMERIC: exact numbers with fixed precision, recommended for money or other high-precision use cases.
DuckDB’s numeric types closely follow standard SQL types, making it easy to write portable SQL queries and integrate with other database systems.
8.15 Math Functions Transform Values
SQL has many functions for transforming values. Common ones include:
- `ABS(x)`: Absolute value.
- `LN()`: Natural log transformation.
- `EXP()`: Exponentiation.
- `SQRT()`: Square root.
- `POW()`: Power transformation.
  - `POW(2.0, x)` would be \(2^x\).
  - `POW(x, 2.0)` would be \(x^2\).
- `ROUND(x, n)`: Round x to n decimal places.
- `CEIL(x)` / `CEILING(x)`: Smallest integer ≥ x.
- `FLOOR(x)`: Largest integer ≤ x.
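A quick sketch combining several of these (the column aliases are just illustrative):

```{sql}
#| connection: con
SELECT
  ABS(-4.2)         AS abs_val,    -- 4.2
  ROUND(3.14159, 2) AS rounded,    -- 3.14
  CEIL(2.1)         AS ceiling,    -- 3
  FLOOR(2.9)        AS floored,    -- 2
  POW(2.0, 10)      AS two_to_ten; -- 1024.0
```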
8.15.1 Integer Division
In SQL, when you divide two integer columns, many database systems perform integer division by default.
DuckDB has two different division operators that operate differently if both numbers are integers.
- The single slash, `/`, will convert to floating point numbers for the division.
- The double slash, `//`, will use integer division.
- See DuckDB Numeric Operators.
```{sql}
#| connection: con
SELECT 5 / 2; -- Most SQL DBs return 2, not 2.5. DuckDB returns 2.5
SELECT 5 // 2; -- returns 2, not 2.5 in DuckDB
```
(5 // 2) |
---|
2 |
To create more portable code, you can use `CAST()` to convert prior to division.
- The fields `day` and `month` are integer columns in the flights table.
- Without casting, `"day" // "month"` would give truncated integer results, which would be misleading for the `"ratio"` calculation.
- Using `CAST()` ensures `"ratio"` is a precise decimal number, giving meaningful results.
Without `CAST()`, a standard SQL system with a single `/` would yield the following (here we use DuckDB's `//` to mimic that integer division).
```{sql}
#| connection: con
SELECT
DISTINCT "month",
"day",
"day" // "month" AS "ratio" -- using // to Force Integer Division
FROM
flights
WHERE
"month" >= 5
ORDER BY
"month",
"day";
```
month | day | ratio |
---|---|---|
5 | 1 | 0 |
5 | 2 | 0 |
5 | 3 | 0 |
5 | 4 | 0 |
5 | 5 | 1 |
5 | 6 | 1 |
5 | 7 | 1 |
5 | 8 | 1 |
5 | 9 | 1 |
5 | 10 | 2 |
Now with `CAST()` for more portable code.
- `CAST("day" AS DOUBLE)` converts the integer column `day` into a double-precision floating-point number.
- `CAST("month" AS DOUBLE)` does the same for `month`.
- Dividing two `DOUBLE`s ensures floating-point division across most systems, preserving the decimal part.
```{sql}
#| connection: con
SELECT
DISTINCT "month",
"day",
CAST("day" AS DOUBLE) / CAST("month" AS DOUBLE) AS "ratio"
FROM
flights
WHERE
"month" >= 5
ORDER BY
"month",
"day";
```
month | day | ratio |
---|---|---|
5 | 1 | 0.2 |
5 | 2 | 0.4 |
5 | 3 | 0.6 |
5 | 4 | 0.8 |
5 | 5 | 1.0 |
5 | 6 | 1.2 |
5 | 7 | 1.4 |
5 | 8 | 1.6 |
5 | 9 | 1.8 |
5 | 10 | 2.0 |
8.16 Aggregate Functions Summarize Fields
SQL has functions for summarizing data (SQL calls these “Aggregates”).
Aggregate functions must be inside a `SELECT` statement which identifies the fields of interest.
- Aggregates are similar to the functions used inside a `dplyr::summarize()` in R.
- They calculate the results for the selected fields and aggregate (collapse) the records to a single record.
- By default, all records with missing data (`NULL`) are ignored (like setting `na.rm = TRUE`) and there is no warning or message.
Some of the common aggregates are:
- `COUNT()`: Count the number of rows.
- `AVG()`: Calculate average.
- `MEDIAN()`: Median (not standard across all DBMSs).
- `SUM()`: Summation.
- `MIN()`: Minimum.
- `MAX()`: Maximum.
- `STDDEV()`: Standard deviation.
- `VARIANCE()`: Variance.
- `CORR(y, x)`: Correlation between `x` and `y`.
Let’s calculate the average and standard deviation for departure delays across all records in the table.
```{sql}
#| connection: con
SELECT
AVG("dep_delay"), -- Note the default field name without an alias
STDDEV("dep_delay") AS "sd_dep_delay",
COUNT("dep_delay") AS "num_records"
FROM
flights;
```
avg(dep_delay) | sd_dep_delay | num_records |
---|---|---|
12.63907 | 40.21006 | 328521 |
- Note only one record is returned, with fields named by default from the function call, but you can create aliases using the `AS` statement.
- Compare to the R results to see that SQL automatically drops records with a `NULL` value before counting the records.
- R equivalent:
```{r}
flights |>
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
            sd_dep_delay = sd(dep_delay, na.rm = TRUE),
            num_records = n())
```
- R counts the number of records that went into the summary, before any records were removed for being `NA`.
- You need to filter the data in R to remove the `NA`s before sending it to `summarize()` to get the same count as SQL.
8.17 `GROUP BY` Aggregates Results across Groups of Records
`GROUP BY` identifies one or more fields by which to aggregate the records.
- This is similar (but not exactly the same) as using `summarize()` inside a `group_by()` in R.
Let’s summarize the average and standard deviation of the departure delays by each origin and month.
```{sql}
#| connection: con
SELECT
"origin",
"month",
AVG("dep_delay") AS "avg_dep_delay",
STDDEV("dep_delay") AS "sd_dep_delay"
FROM
flights
GROUP BY
"origin",
"month"
ORDER BY
"origin",
"month"
LIMIT 5
OFFSET 10
```
origin | month | avg_dep_delay | sd_dep_delay |
---|---|---|---|
EWR | 11 | 6.723769 | 28.78092 |
EWR | 12 | 21.026575 | 45.74062 |
JFK | 1 | 8.615826 | 35.99002 |
JFK | 2 | 11.791355 | 37.43803 |
JFK | 3 | 10.721825 | 35.33440 |
- Now one record is returned for each of the 36 groups.
- Note the order of the fields `origin` and `month` is based on the order of the `SELECT`, not the `GROUP BY`.
Change the order of fields in the `SELECT`.
```{sql}
#| connection: con
SELECT
"month",
"origin",
AVG("dep_delay") AS "avg_dep_delay",
STDDEV("dep_delay") AS "sd_dep_delay"
FROM
flights
GROUP BY
"origin",
"month"
ORDER BY
"origin",
"month"
LIMIT 5
OFFSET 10
```
month | origin | avg_dep_delay | sd_dep_delay |
---|---|---|---|
11 | EWR | 6.723769 | 28.78092 |
12 | EWR | 21.026575 | 45.74062 |
1 | JFK | 8.615826 | 35.99002 |
2 | JFK | 11.791355 | 37.43803 |
3 | JFK | 10.721825 | 35.33440 |
Changing the order of the fields in the `GROUP BY` does not change the sequence of the fields or the results.
```{sql}
#| connection: con
SELECT
"origin",
"month",
AVG("dep_delay") AS "avg_dep_delay",
STDDEV("dep_delay") AS "sd_dep_delay"
FROM
flights
GROUP BY
"month",
"origin"
ORDER BY
"origin",
"month"
LIMIT 5
OFFSET 10
```
origin | month | avg_dep_delay | sd_dep_delay |
---|---|---|---|
EWR | 11 | 6.723769 | 28.78092 |
EWR | 12 | 21.026575 | 45.74062 |
JFK | 1 | 8.615826 | 35.99002 |
JFK | 2 | 11.791355 | 37.43803 |
JFK | 3 | 10.721825 | 35.33440 |
- R equivalent
- Note the order of the `origin` and `month`.
- Changing the order in the `select()` does not matter in R.
- Changing the order in the `group_by()` does matter in how the fields are arranged by default.
8.17.1 Use `WHERE` to Filter Records Before Grouping with `GROUP BY`
Using `WHERE` before a `GROUP BY` allows you to filter the rows that will be grouped.
As an example, we want the departure delay calculations to only include flights that were delayed, not those that left early and so had a negative departure delay.
```{sql}
#| connection: con
SELECT
"origin",
"month",
AVG("dep_delay") AS "avg_dep_delay",
STDDEV("dep_delay") AS "sd_dep_delay"
FROM
flights
WHERE dep_delay > 0
GROUP BY
"month",
"origin"
ORDER BY
"origin",
"month"
LIMIT 5
OFFSET 10
```
origin | month | avg_dep_delay | sd_dep_delay |
---|---|---|---|
EWR | 11 | 27.34014 | 41.06456 |
EWR | 12 | 38.42523 | 53.11234 |
JFK | 1 | 33.11765 | 53.54925 |
JFK | 2 | 36.82556 | 50.55263 |
JFK | 3 | 35.88225 | 48.38411 |
Note the increases in the `avg_dep_delay` as the summary is based only on the filtered rows.
- R equivalent
```{r}
flights |>
  filter(dep_delay > 0) |> # WHERE dep_delay > 0
  group_by(origin, month) |> # GROUP BY origin, month
  summarise(
    avg_dep_delay = mean(dep_delay), # AVG(dep_delay)
    sd_dep_delay = sd(dep_delay), # STDDEV(dep_delay)
    .groups = "drop"
  ) |>
  arrange(origin, month) |> # ORDER BY origin, month
  slice(11:15) # LIMIT 5 OFFSET 10
```
8.17.2 Use `HAVING` to Filter Records After Grouping with `GROUP BY`
`HAVING` allows you to filter groups after aggregation, rather than the individual rows going into the groups.
- Often, you calculate summary statistics such as averages, totals, or counts, and then only want to keep groups that meet certain criteria. For example, you might only want destinations where the average departure delay exceeds a threshold.
- Using `HAVING` allows these filtered groups to be used in subsequent calculations or as a subquery for joining with other tables, without having to manually filter the aggregated results outside the query.
- This makes queries more compact, readable, and efficient, especially when working with multiple levels of aggregation.
The key advantage of `HAVING` is that it allows you to filter based on the results of aggregate functions (like `SUM()`, `AVG()`, `COUNT()`) computed for each group.
- In contrast, `WHERE` cannot use aggregate functions because it operates before any grouping or aggregation has taken place; it only filters individual rows.
This distinction is important: if you want to keep only groups that meet certain criteria, such as "destinations with more than one carrier" or "airlines with an average delay above 20 minutes," `HAVING` lets you do this directly on the aggregated values.
- Without HAVING, you would need an extra subquery or intermediate table to perform the same filtering.
Let's use both `WHERE` to filter the rows to positive delays, and `HAVING` to find those origins and months with average delays greater than 35.
```{sql}
#| connection: con
SELECT
origin,
month,
AVG(dep_delay) AS avg_dep_delay,
STDDEV(dep_delay) AS sd_dep_delay
FROM flights
WHERE dep_delay > 0 -- row-level filter
GROUP BY origin, month
HAVING AVG(dep_delay) > 35 -- group-level filter
ORDER BY origin, month
LIMIT 5
OFFSET 10;
```
origin | month | avg_dep_delay | sd_dep_delay |
---|---|---|---|
JFK | 3 | 35.88225 | 48.38411 |
JFK | 4 | 40.91712 | 57.72617 |
JFK | 5 | 39.10808 | 52.10440 |
JFK | 6 | 47.98522 | 62.77220 |
JFK | 7 | 49.56254 | 64.39900 |
- R equivalent
```{r}
flights |>
  filter(dep_delay > 0) |> # WHERE dep_delay > 0
  group_by(origin, month) |> # GROUP BY origin, month
  summarise(
    avg_dep_delay = mean(dep_delay),
    sd_dep_delay = sd(dep_delay),
    .groups = "drop"
  ) |>
  filter(avg_dep_delay > 35) |> # HAVING AVG(dep_delay) > 35
  arrange(origin, month) |> # ORDER BY origin, month
  slice(11:15) # LIMIT 5 OFFSET 10
```
We write queries in the following order: `SELECT` ... `FROM` ... `WHERE` ... `GROUP BY` ... `HAVING` ... `ORDER BY` ... `LIMIT`.
However, the database actually executes them in this logical order:
Step | Clause | Purpose |
---|---|---|
1 | FROM | Choose the source tables (and joins). |
2 | WHERE | Filter individual rows BEFORE grouping. |
3 | GROUP BY | Group rows into partitions (if requested). |
4 | HAVING | Filter entire groups AFTER aggregation. |
5 | SELECT | Choose columns and compute expressions. |
6 | ORDER BY | Sort the final result set. |
7 | LIMIT | Restrict the number of rows returned. |
The difference between the way we write SQL syntax and the actual query execution order means that clauses like `WHERE` and `HAVING` operate at different stages than they appear in the code. Understanding this helps explain why `WHERE` filters rows before aggregation while `HAVING` filters groups after aggregation, and why aggregate functions cannot be used in `WHERE` but can be used in `HAVING`.
8.18 Window Functions use `OVER` to Create Grouped Summaries without Aggregating the Records
Window functions are for when you want to add a grouped summary to the data without collapsing the data into one record for each group.
- As an example, you want to calculate the total departure delay time for each destination so you can then calculate the percentage of that departure delay time for each airline flying to that destination.
- You could do that with a summarized data frame and then a join.
- In R, this can be done by using `group_by()` and then using `mutate()` instead of `summarize()` to add a new column to the data frame with the result for each group while preserving all of the rows.
In SQL, you use the aggregate function you want in what SQL calls a Window Function.
- A window function uses an aggregate function to calculate a result for one or more fields (across a specified set of records) and adds the results as new fields in the result set (table), but it does not collapse the records, so the result set has all the original un-aggregated records along with the new aggregated results (for each set of records).
- All the records in a given set of specified records have the same values in the new aggregated fields.
To use an aggregate function as a window function, use the `OVER` clause with a `(PARTITION BY myfield)` modifier.
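A sketch of the general shape (the names `AGG_FUNCTION`, `myfield`, `other_field`, and `my_table` are placeholders):

```sql
SELECT
  AGG_FUNCTION(field) OVER (PARTITION BY myfield ORDER BY other_field) AS new_field
FROM my_table;
```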
- `AGG_FUNCTION` is an aggregate function (e.g., `SUM`, `AVG`, `COUNT`).
- `PARTITION BY` divides the data into groups (like `group_by()` in R).
- `ORDER BY` (optional) defines the order of rows inside each partition (important for `ROW_NUMBER()`, `RANK()`, `LEAD()`, `LAG()`, etc.).
- The `PARTITION BY` clause is a sub-clause of the `OVER` clause.
- The `PARTITION BY` breaks up (partitions) the table into "separate" tables based on one or more chosen fields.
- This is similar to creating a virtual grouped data frame in R with one or more groups.
- Then, the aggregate function is applied to the records in each partition and a new field is added to the records in that partition with the aggregated value.
The comparison in Table 8.2 might help connect the concepts for using `dplyr::group_by()` with using window functions.
Table 8.2: Comparing `dplyr::group_by()` with the SQL window function
R (dplyr) | SQL (Window Function) |
---|---|
group_by(dest) |
PARTITION BY dest |
operate on each group | aggregate over each partition |
mutate() preserves rows |
window function preserves rows |
summarize() collapses rows |
normal aggregate without OVER() collapses rows |
Window functions are most useful when you need a grouped statistic as part of another calculation, rather than as a final summarized result.
- In R, this is the same situation where you'd use `group_by()` with `mutate()` instead of with the usual `summarize()`: you want to preserve the full dataset, but also attach totals, percentages, or ranks.
- For example, calculating the percentage of total delay per airline requires knowing the total delays for each destination (the group), but keeping the individual flight rows (the detail).
In everyday SQL, people often use simple `SELECT` and `GROUP BY` queries, so window functions are less common.
- But when you need running totals, percentages of group totals, rankings, or comparisons within groups without collapsing the rows, window functions are the most natural and powerful tool.
8.18.1 Window Function Example 1
Let's find the flight numbers for the longest flights (in terms of air time) for each airport (`origin`).
- This uses the window function as part of the `SELECT` to create a new field.
- The aggregate function is `MAX`.
- The `PARTITION BY` is `origin`.
- Let's use `AS` to create the alias `apt_max` for the name of the new field with the airport maximum flight time.
```{sql}
#| connection: con
SELECT
"flight",
"origin",
"air_time",
MAX("air_time") -- The AGG_function
OVER (PARTITION BY "origin") AS "apt_max"
FROM
flights
LIMIT 10
OFFSET 120830;
```
flight | origin | air_time | apt_max |
---|---|---|---|
1691 | JFK | 309 | 691 |
1447 | JFK | 75 | 691 |
583 | JFK | 128 | 691 |
303 | JFK | 350 | 691 |
1743 | JFK | 101 | 691 |
5796 | EWR | 97 | 695 |
1223 | EWR | 200 | 695 |
297 | EWR | 355 | 695 |
5 | EWR | 172 | 695 |
5814 | EWR | 120 | 695 |
- Note the `apt_max` values are the same for all the records for each origin.
Having this new field (`max_air_time` below) may not be interesting on its own, but by computing it in a derived table, we can use it as an intermediate result for further analysis, like finding the percentage of each flight's time relative to the maximum or joining with other datasets.
A "derived table" is a temporary table created on the fly inside a query. It is defined by using a query in the `FROM` clause and exists only for the duration of that query.
- A derived table allows you to compute intermediate results that you can then filter, join, or aggregate in the outer query.
- It is often used when you need a result that cannot be expressed directly in `WHERE` (like a window function).
- It does not create a permanent table in the database; it is transient.
Let’s find the destination for the longest flights from each airport.
```{sql}
#| label: lst-derived-table
#| connection: con
SELECT flight, origin, dest, air_time
FROM (
  -- Inner query / derived table (AS t)
  SELECT
    flight,
    origin,
    dest,
    air_time,
    MAX(air_time) OVER (PARTITION BY origin) AS max_air_time
  FROM flights
) AS t
WHERE air_time = max_air_time;
```
flight | origin | dest | air_time |
---|---|---|---|
51 | JFK | HNL | 691 |
745 | LGA | DEN | 331 |
15 | EWR | HNL | 695 |
- Inner query / derived table (`AS t`):
  - Computes the maximum flight time (`MAX(air_time) OVER (PARTITION BY origin)`) for each origin airport.
  - Produces a temporary table with all flights and the maximum flight time for that airport.
- Outer query:
  - Filters the derived table to keep only flights where `air_time` equals the maximum for that origin.
  - Returns the longest flight(s) from each airport.
8.18.2 Window Function Example 2
Here is a slightly more complicated example.
We want the percentage of the total departure delay for a destination for each flight to that destination.
- We first use a window function to `SUM` all delays for a destination and give it the alias `total_delay_dest`.
- Then we use `total_delay_dest` to calculate, for each flight, its percentage of that delay.
- Use `WHERE` to only calculate for flights with a positive delay, i.e., that were actually delayed.
```{sql}
#| connection: con
SELECT
dest,
flight,
month,
day,
carrier,
dep_delay,
SUM(dep_delay) OVER (PARTITION BY dest) AS total_delay_dest,
dep_delay * 100.0 / SUM(dep_delay) OVER (PARTITION BY dest) AS pct_of_total
FROM flights
WHERE dep_delay > 0
ORDER BY dest, carrier, day, flight
LIMIT 5;
```
dest | flight | month | day | carrier | dep_delay | total_delay_dest | pct_of_total |
---|---|---|---|---|---|---|---|
ABQ | 1505 | 8 | 1 | B6 | 54 | 4076 | 1.3248283 |
ABQ | 1505 | 7 | 1 | B6 | 9 | 4076 | 0.2208047 |
ABQ | 65 | 10 | 2 | B6 | 9 | 4076 | 0.2208047 |
ABQ | 1505 | 9 | 2 | B6 | 125 | 4076 | 3.0667321 |
ABQ | 1505 | 7 | 2 | B6 | 21 | 4076 | 0.5152110 |
Now we want to also get the percentage of the total delay at each destination caused by each carrier across all their flights.
- We add another window function to create `pct_of_total_by_carrier`.
```{sql}
#| connection: con
SELECT
dest,
flight,
carrier,
dep_delay,
-- total delay at the destination
SUM(dep_delay) OVER (PARTITION BY dest) AS total_delay_dest,
-- flight-level % contribution
dep_delay * 100.0 / SUM(dep_delay) OVER (PARTITION BY dest) AS pct_of_total_flight,
-- carrier-level total delay
SUM(dep_delay) OVER (PARTITION BY dest, carrier) AS total_delay_dest_by_carrier,
-- carrier-level % contribution
SUM(dep_delay) OVER (PARTITION BY dest, carrier) * 100.0
/ SUM(dep_delay) OVER (PARTITION BY dest) AS pct_of_total_by_carrier
FROM flights
WHERE dep_delay > 0
ORDER BY dest, carrier, flight, dep_delay
LIMIT 10;
```
dest | flight | carrier | dep_delay | total_delay_dest | pct_of_total_flight | total_delay_dest_by_carrier | pct_of_total_by_carrier |
---|---|---|---|---|---|---|---|
ABQ | 65 | B6 | 1 | 4076 | 0.0245339 | 4076 | 100 |
ABQ | 65 | B6 | 1 | 4076 | 0.0245339 | 4076 | 100 |
ABQ | 65 | B6 | 2 | 4076 | 0.0490677 | 4076 | 100 |
ABQ | 65 | B6 | 3 | 4076 | 0.0736016 | 4076 | 100 |
ABQ | 65 | B6 | 3 | 4076 | 0.0736016 | 4076 | 100 |
ABQ | 65 | B6 | 4 | 4076 | 0.0981354 | 4076 | 100 |
ABQ | 65 | B6 | 4 | 4076 | 0.0981354 | 4076 | 100 |
ABQ | 65 | B6 | 6 | 4076 | 0.1472031 | 4076 | 100 |
ABQ | 65 | B6 | 7 | 4076 | 0.1717370 | 4076 | 100 |
ABQ | 65 | B6 | 9 | 4076 | 0.2208047 | 4076 | 100 |
Let’s limit to those destinations that have more than one carrier.
- We will use the derived table approach.
```{sql}
#| connection: con
SELECT
  dest,
  flight,
  carrier,
  dep_delay,
  SUM(dep_delay) OVER (PARTITION BY dest) AS total_delay_dest,
  dep_delay * 100.0 / SUM(dep_delay) OVER (PARTITION BY dest) AS pct_of_total_flight,
  SUM(dep_delay) OVER (PARTITION BY dest, carrier) AS total_delay_dest_by_carrier,
  SUM(dep_delay) OVER (PARTITION BY dest, carrier) * 100.0
    / SUM(dep_delay) OVER (PARTITION BY dest) AS pct_of_total_by_carrier
FROM flights
WHERE dep_delay > 0
  AND dest IN (
    SELECT dest
    FROM flights
    GROUP BY dest
    HAVING COUNT(DISTINCT carrier) > 1
  )
ORDER BY dest, flight, carrier, dep_delay
LIMIT 10
OFFSET 3280;
```
dest | flight | carrier | dep_delay | total_delay_dest | pct_of_total_flight | total_delay_dest_by_carrier | pct_of_total_by_carrier |
---|---|---|---|---|---|---|---|
ATL | 1547 | DL | 112 | 254414 | 0.0440227 | 135500 | 53.2596477 |
ATL | 1547 | DL | 114 | 254414 | 0.0448089 | 135500 | 53.2596477 |
ATL | 1547 | DL | 127 | 254414 | 0.0499186 | 135500 | 53.2596477 |
ATL | 1547 | DL | 138 | 254414 | 0.0542423 | 135500 | 53.2596477 |
ATL | 1547 | DL | 141 | 254414 | 0.0554215 | 135500 | 53.2596477 |
ATL | 1547 | DL | 192 | 254414 | 0.0754675 | 135500 | 53.2596477 |
ATL | 1554 | UA | 2 | 254414 | 0.0007861 | 1801 | 0.7079013 |
ATL | 1554 | UA | 4 | 254414 | 0.0015722 | 1801 | 0.7079013 |
ATL | 1554 | UA | 7 | 254414 | 0.0027514 | 1801 | 0.7079013 |
ATL | 1554 | UA | 18 | 254414 | 0.0070751 | 1801 | 0.7079013 |
- R equivalent
```{r}
flights |>
  filter(dep_delay > 0) |>
  # Keep only destinations with more than one carrier
  group_by(dest) |>
  filter(n_distinct(carrier) > 1) |>
  ungroup() |>
  # Compute window-like summaries
  group_by(dest) |>
  mutate(
    total_delay_dest = sum(dep_delay),
    pct_of_total_flight = dep_delay * 100 / total_delay_dest
  ) |>
  group_by(dest, carrier) |>
  mutate(
    total_delay_dest_by_carrier = sum(dep_delay),
    pct_of_total_by_carrier = total_delay_dest_by_carrier * 100 / total_delay_dest
  ) |>
  ungroup() |>
  arrange(dest, flight, carrier, dep_delay) |>
  select(dest, flight, carrier, dep_delay,
         total_delay_dest, pct_of_total_flight,
         total_delay_dest_by_carrier, pct_of_total_by_carrier) |>
  slice(3281:3290)
```
Now, if we only want the delay at the destination and carrier level, we can simplify without the window for each flight.
- Let's `ROUND()` the totals as well.
```{sql}
#| connection: con
SELECT
  dest,
  carrier,
  SUM(SUM(dep_delay)) OVER (PARTITION BY dest) AS total_delay_dest,
  SUM(dep_delay) AS total_delay_dest_by_carrier,
  ROUND(SUM(dep_delay) * 100.0 / SUM(SUM(dep_delay)) OVER (PARTITION BY dest), 2) AS pct_of_total_by_carrier
FROM flights
WHERE dep_delay > 0
  AND dest IN (
    SELECT dest
    FROM flights
    GROUP BY dest
    HAVING COUNT(DISTINCT carrier) > 1
  )
GROUP BY dest, carrier
ORDER BY dest, pct_of_total_by_carrier DESC
LIMIT 10
OFFSET 0;
```
dest | carrier | total_delay_dest | total_delay_dest_by_carrier | pct_of_total_by_carrier |
---|---|---|---|---|
ATL | DL | 254414 | 135500 | 53.26 |
ATL | FL | 254414 | 47001 | 18.47 |
ATL | EV | 254414 | 40568 | 15.95 |
ATL | MQ | 254414 | 29149 | 11.46 |
ATL | UA | 254414 | 1801 | 0.71 |
ATL | WN | 254414 | 217 | 0.09 |
ATL | 9E | 254414 | 178 | 0.07 |
AUS | B6 | 36623 | 12824 | 35.02 |
AUS | UA | 36623 | 11195 | 30.57 |
AUS | AA | 36623 | 6146 | 16.78 |
- R equivalent
```{r}
flights |>
  filter(dep_delay > 0) |> # Only positive delays
  group_by(dest) |>
  filter(n_distinct(carrier) > 1) |> # Only destinations with >1 carrier
  ungroup() |>
  group_by(dest, carrier) |>
  summarise(
    total_delay_carrier = sum(dep_delay),
    .groups = "drop"
  ) |>
  group_by(dest) |>
  mutate(
    total_delay_dest = sum(total_delay_carrier), # total per destination
    pct_of_total_by_carrier = round(total_delay_carrier * 100 / total_delay_dest, 2)
  ) |>
  ungroup() |>
  select(dest, carrier, total_delay_dest, total_delay_carrier, pct_of_total_by_carrier) |>
  arrange(dest, desc(pct_of_total_by_carrier)) |>
  slice_head(n = 10)
```
See SQL Window Functions or SQL PARTITION BY Clause overview for other examples.
8.19 `CASE` and `CASE WHEN` Recode or Create New Fields Based on the Values of Existing Fields
`CASE WHEN` is the SQL expression for introducing conditional logic into queries so you can create new columns or modify existing ones based on conditions.
`CASE WHEN` evaluates a condition for each row and returns a specified value if the condition is true; otherwise it can return another value or `NULL`.
- It is similar to `dplyr::case_when()` in R or `IF ... ELSE` statements in other programming languages.
- `CASE` cannot stand alone; it always needs at least one `WHEN`.
- The `WHEN` may use simple logic or complicated expressions.
- `CASE WHEN` is particularly useful for categorizing numeric values into buckets, creating flags, or computing conditional summaries directly in the `SELECT` statement.
- `CASE WHEN` stops its evaluation at the first `TRUE`, so the order of the `WHEN`s can affect the results.
Using `CASE` keeps the logic inside the SQL query so it is transparent and you don't need a separate step in R or Python to mutate the table after retrieval.
The minimally valid syntax for a simple `CASE WHEN`, working with a single column, is:
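A sketch with placeholder names:

```sql
CASE column_name
    WHEN value_1 THEN result_1
    WHEN value_2 THEN result_2
    ELSE default_result
END
```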
The valid syntax for a more general `CASE WHEN` expression is:
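Again a sketch, where each `condition_*` can be any logical expression:

```sql
CASE
    WHEN condition_1 THEN result_1
    WHEN condition_2 THEN result_2
    ELSE default_result
END
```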
Let's use a simple `CASE WHEN` expression, covering a single column without logical conditions in the `WHEN`s, to add a variable with names for each value of `origin`.
```{sql}
#| label: case-when
#| connection: con
SELECT
flight,
origin,
CASE origin
WHEN 'JFK' THEN 'John F. Kennedy'
WHEN 'LGA' THEN 'LaGuardia'
WHEN 'EWR' THEN 'Newark'
ELSE 'Other Airport'
END AS origin_full_name
FROM flights
LIMIT 10;
```
flight | origin | origin_full_name |
---|---|---|
1545 | EWR | Newark |
1714 | LGA | LaGuardia |
1141 | JFK | John F. Kennedy |
725 | JFK | John F. Kennedy |
461 | LGA | LaGuardia |
1696 | EWR | Newark |
507 | EWR | Newark |
5708 | LGA | LaGuardia |
79 | JFK | John F. Kennedy |
301 | LGA | LaGuardia |
- R equivalent using either `case_when()` or `case_match()`.
```{r}
flights |>
  select(flight, origin) |>
  mutate(orig_long = case_when(
    origin == "JFK" ~ "John F. Kennedy",
    origin == "LGA" ~ "LaGuardia",
    origin == "EWR" ~ "Newark Liberty",
    .default = "other"
  ))

## or for value matching
flights |>
  select(flight, origin) |>
  mutate(orig_long = case_match(
    origin,
    "JFK" ~ "John F. Kennedy",
    "LGA" ~ "LaGuardia",
    "EWR" ~ "Newark Liberty",
    .default = "other"
  ))
```
Use the more general `CASE WHEN` to incorporate logical comparisons into the `WHEN` expression to create categories for departure delays.
```{sql}
#| connection: con
SELECT
flight,
dep_delay,
CASE
WHEN dep_delay < 0 THEN 'early'
WHEN dep_delay = 0 THEN 'on time'
WHEN dep_delay <= 30 THEN 'slightly delayed'
ELSE 'very delayed'
END AS delay_category
FROM flights
LIMIT 10;
```
flight | dep_delay | delay_category |
---|---|---|
1545 | 2 | slightly delayed |
1714 | 4 | slightly delayed |
1141 | 2 | slightly delayed |
725 | -1 | early |
461 | -6 | early |
1696 | -4 | early |
507 | -5 | early |
5708 | -3 | early |
79 | -3 | early |
301 | -2 | early |
- R equivalent.
8.20 `JOIN` Uses Two Tables to Create a Single Table Using the Fields in `ON`
The `JOIN` clause combines rows from two tables into a single result set based on related columns.
- Joins can add columns from another table or, depending on the type, remove rows (filtering joins) if no matches exist.
- The default operation for `JOIN` is an `INNER JOIN`.
The `ON` sub-clause specifies the matching condition, i.e., how columns from each table are compared to decide which rows pair together.
- While the most common condition is equality (`=`), you must always specify the condition, e.g., `=`, `<`, `>`, `<>`, `BETWEEN`, etc.
- `USING` is a simpler version of `ON` for when the column name(s) are the same in both tables and you want to join on equality of the values in the column.
The `JOIN` syntax is as follows, where `table1` is considered the "left" table and `table2` is considered the "right" table.
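A generic sketch (placeholder table and key names):

```sql
SELECT columns
FROM table1
JOIN table2
  ON table1.key = table2.key;
```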
`JOIN` has several modifiers that affect how the join operates. These include:
- `INNER JOIN`: Returns only rows with matching keys in both tables.
  - Columns from both tables appear in the result; unmatched rows are excluded (what makes it an "inner" join).
  - This is the default operation for `JOIN` without a modifier.
- `LEFT JOIN` (or `LEFT OUTER JOIN`): Returns all rows from the left table, with columns from the right table included where there's a match.
  - Rows with unmatched right-table columns get `NULL` values (what makes it an "outer" join).
- `RIGHT JOIN` (or `RIGHT OUTER JOIN`): Returns all rows from the right table, with columns from the left table included where there's a match.
  - Unmatched left-table columns are `NULL`.
- `FULL JOIN` (or `FULL OUTER JOIN`): Returns all rows from both tables; unmatched columns from either table are `NULL`.
Table 8.3 summarizes the results for each type of join.
Join Type | Rows Returned | Columns Included |
---|---|---|
INNER JOIN | Only rows with matching keys in both tables | Columns from both tables; unmatched rows excluded |
LEFT JOIN / LEFT OUTER | All rows from left table; matched rows from right | Columns from both tables; unmatched right columns = NULL |
RIGHT JOIN / RIGHT OUTER | All rows from right table; matched rows from left | Columns from both tables; unmatched left columns = NULL |
FULL JOIN / FULL OUTER | All rows from both tables | Columns from both tables; unmatched columns = NULL |
8.20.1 `JOIN` and its Modifiers Examples
Default `JOIN` with `USING`.
- `USING` uses implicit equality between the column names.
- Since `USING` can take one or more column names, they must be wrapped in `(...)`.
- `USING` automatically removes duplicate column names.
```{sql}
#| connection: con
-- Sample tables
-- flights: flight, carrier, dest
-- airlines: carrier, name
SELECT flight, origin, dest, name
FROM flights
JOIN airlines
USING (carrier)
LIMIT 5;
```
flight | origin | dest | name |
---|---|---|---|
1545 | EWR | IAH | United Air Lines Inc. |
1714 | LGA | IAH | United Air Lines Inc. |
1141 | JFK | MIA | American Airlines Inc. |
725 | JFK | BQN | JetBlue Airways |
461 | LGA | ATL | Delta Air Lines Inc. |
Default `JOIN` with `ON`.
- Use implicit `AS` to create aliases for the tables: `f` for `flights` and `a` for `airlines`.
- Reference each variable with its table alias.
- Use explicit equality in the `ON` sub-clause.
- Note the columns in the `ON` do not have to be in the `SELECT`.
```{sql}
#| connection: con
-- Sample tables
-- flights: flight, carrier, dest
-- airlines: carrier, name
SELECT f.flight, f.origin, f.dest, a.name
FROM flights f
INNER JOIN airlines a
ON f.carrier = a.carrier
LIMIT 5;
```
flight | origin | dest | name |
---|---|---|---|
1545 | EWR | IAH | United Air Lines Inc. |
1714 | LGA | IAH | United Air Lines Inc. |
1141 | JFK | MIA | American Airlines Inc. |
725 | JFK | BQN | JetBlue Airways |
461 | LGA | ATL | Delta Air Lines Inc. |
The good news is the `nycflights` database is very robust, with aligned column names and few `NA`s, which makes it useful for demonstrating `USING`.
- The bad news is it is too good to be useful for demonstrating different outcomes using other joins, as they will usually have the same results.
We will use a new connection to a small demo database for this section.
- Create a new connection called `con_demo` and show the tables.
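A minimal sketch, assuming the demo database is a DuckDB file at a hypothetical path `data/demo.duckdb`:

```{r}
# hypothetical path to the demo database file
con_demo <- DBI::dbConnect(duckdb::duckdb(), dbdir = "data/demo.duckdb")
DBI::dbListTables(con_demo)
```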
Since this is a very small database for demo purposes, we will use `SELECT *` for the `departments` and `employees` tables for now.
department_id | department_name |
---|---|
10 | HR |
20 | IT |
30 | Finance |
40 | Marketing |
60 | Legal |
emp_id | name | dept_id | age | salary |
---|---|---|---|---|
1 | Alice | 10 | 25 | 40000 |
2 | Bob | 20 | 32 | 55000 |
3 | Charlie | 30 | 45 | 70000 |
4 | Diana | 20 | 28 | 50000 |
5 | Eve | 50 | 35 | 62000 |
6 | Frank | 40 | 29 | 38000 |
7 | Grace | NA | 41 | 80000 |
- Note: the two tables do not have any column names in common, so we can't do a join with `USING`.
Default `JOIN` with `ON`.
- `ON` requires fully referencing the variable names with the table name using a `.` in between.
- Here we create an alias (implicit `AS`) for each table and use them in the `SELECT` and `ON` expressions.
- Again, the columns in the `ON` do not have to be part of the `SELECT`.
```{sql}
#| connection: con_demo
#| label: join-example-flights
-- JOIN: by default an Inner Join so only employees with a matching department
SELECT e.emp_id, e.name, d.department_name
FROM employees e
JOIN departments d
ON e.dept_id = d.department_id
LIMIT 5;
```
emp_id | name | department_name |
---|---|---|
1 | Alice | HR |
2 | Bob | IT |
3 | Charlie | Finance |
4 | Diana | IT |
6 | Frank | Marketing |
- Note: employees Eve and Grace did not show up, as there is no match in `d.department_id` for Eve's `dept_id` of 50 or Grace's missing `dept_id`.
Adding the modifier `INNER` does not change the result, since the default for `JOIN` is an inner join.
```{sql}
#| connection: con_demo
#| label: inner-join-example-explicit
-- JOIN: by default an Inner Join so only employees with a matching department
SELECT e.emp_id, e.name, d.department_name
FROM employees e
INNER JOIN departments d
ON e.dept_id = d.department_id
LIMIT 5;
```
emp_id | name | department_name |
---|---|---|
1 | Alice | HR |
2 | Bob | IT |
3 | Charlie | Finance |
4 | Diana | IT |
6 | Frank | Marketing |
`LEFT JOIN`: all rows from employees; include department columns when available, otherwise `NULL`.
```{sql}
#| connection: con_demo
#| label: left-join-example
-- LEFT JOIN: all employees, NULL if no matching department
SELECT e.emp_id, e.name, d.department_name
FROM employees e
LEFT JOIN departments d
ON e.dept_id = d.department_id;
```
emp_id | name | department_name |
---|---|---|
1 | Alice | HR |
2 | Bob | IT |
3 | Charlie | Finance |
4 | Diana | IT |
6 | Frank | Marketing |
5 | Eve | NA |
7 | Grace | NA |
- Note: Eve and Grace have `NULL` in the added `department_name` column, as there is no match for their `dept_id` values in `d.department_id`.
`RIGHT JOIN`: all rows from departments; include employee columns when available, otherwise `NULL`.
```{sql}
#| connection: con_demo
#| label: right-join-example
-- RIGHT JOIN: all departments, NULL if no matching employee
SELECT e.emp_id, e.name, d.department_name
FROM employees e
RIGHT JOIN departments d
ON e.dept_id = d.department_id;
```
emp_id | name | department_name |
---|---|---|
1 | Alice | HR |
2 | Bob | IT |
3 | Charlie | Finance |
4 | Diana | IT |
6 | Frank | Marketing |
NA | NA | Legal |
Note: the Legal department, without any employees, gets `NULL` for the employee columns.
`FULL JOIN`: all rows for all employees and all departments, `NULL` where unmatched.
```{sql}
#| connection: con_demo
#| label: full-join-example
-- FULL JOIN: all employees and all departments, NULL where unmatched
SELECT e.emp_id, e.name, d.department_name
FROM employees e
FULL JOIN departments d
ON e.dept_id = d.department_id;
```
emp_id | name | department_name |
---|---|---|
1 | Alice | HR |
2 | Bob | IT |
3 | Charlie | Finance |
4 | Diana | IT |
6 | Frank | Marketing |
5 | Eve | NA |
7 | Grace | NA |
NA | NA | Legal |
Note: employees without a matched department get `NULL` for the department column, and departments without any employees get `NULL` for the employee columns.
`SEMI JOIN` returns rows in the left table only if they have a match in the right table.
```{sql}
#| connection: con_demo
#| label: lst-semi-join
SELECT e.*
FROM employees e
SEMI JOIN departments d
ON e.dept_id = d.department_id;
```
emp_id | name | dept_id | age | salary |
---|---|---|---|---|
1 | Alice | 10 | 25 | 40000 |
2 | Bob | 20 | 32 | 55000 |
3 | Charlie | 30 | 45 | 70000 |
4 | Diana | 20 | 28 | 50000 |
6 | Frank | 40 | 29 | 38000 |
`ANTI JOIN` returns rows in the left table that do not have a match in the right table.
```{sql}
#| connection: con_demo
#| label: lst-anti-join
SELECT e.*
FROM employees e
ANTI JOIN departments d
ON e.dept_id = d.department_id;
```
emp_id | name | dept_id | age | salary |
---|---|---|---|---|
5 | Eve | 50 | 35 | 62000 |
7 | Grace | NA | 41 | 80000 |
Some SQL flavors do not have an explicit `ANTI` join modifier, but there are multiple ways to achieve the same kind of result. Here are two examples.
- Using `IS NULL`
```{sql}
#| connection: con_demo
#| label: anti-join-example-NULL
-- ANTI JOIN: employees with no matching department
SELECT e.emp_id, e.name
FROM employees e
LEFT JOIN departments d
ON e.dept_id = d.department_id
WHERE d.department_id IS NULL;
```
emp_id | name |
---|---|
5 | Eve |
7 | Grace |
- Using `NOT IN` while accounting for potential `NULL` values, as sketched below.
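A sketch: the `IS NOT NULL` guard inside the subquery protects against `NOT IN` returning no rows when the right table has `NULL` keys, and the final `OR` keeps left rows whose own key is `NULL`.

```{sql}
#| connection: con_demo
-- ANTI JOIN alternative: employees with no matching department
SELECT e.emp_id, e.name
FROM employees e
WHERE e.dept_id NOT IN (
    SELECT d.department_id
    FROM departments d
    WHERE d.department_id IS NOT NULL -- guard against NULL keys
)
   OR e.dept_id IS NULL; -- keep employees with a missing dept_id
```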
8.20.2 `ON` with Non-Equality Conditions Examples
Let’s look at the other two tables.
band_id | band_name | min_salary | max_salary |
---|---|---|---|
1 | Junior | 30000 | 49999 |
2 | Mid-level | 50000 | 69999 |
3 | Senior | 70000 | 99999 |
emp_id | project_name |
---|---|
1 | Alpha |
2 | Beta |
2 | Gamma |
3 | Alpha |
5 | Delta |
7 | Gamma |
- Inequality Join (`<`, `>`, `<=`, `>=`, `<>`)
Example: employees older than `department_id` × 2.
```{sql}
#| connection: con_demo
SELECT e.emp_id, e.name, e.age, d.department_id
FROM employees e
JOIN departments d
ON e.age > d.department_id * 2
ORDER BY name;
```
emp_id | name | age | department_id |
---|---|---|---|
1 | Alice | 25 | 10 |
2 | Bob | 32 | 10 |
3 | Charlie | 45 | 20 |
3 | Charlie | 45 | 10 |
4 | Diana | 28 | 10 |
5 | Eve | 35 | 10 |
6 | Frank | 29 | 10 |
7 | Grace | 41 | 20 |
7 | Grace | 41 | 10 |
- Note: this gets all combinations that are true.
- Non-Equi Join with Ranges (using `BETWEEN`)
- Map employees to salary bands.
```{sql}
#| connection: con_demo
#| label: lst-join-range
SELECT e.emp_id, e.name, e.salary, s.band_name
FROM employees e
JOIN salaries s
ON e.salary BETWEEN s.min_salary AND s.max_salary
ORDER BY name;
```
emp_id | name | salary | band_name |
---|---|---|---|
1 | Alice | 40000 | Junior |
2 | Bob | 55000 | Mid-level |
3 | Charlie | 70000 | Senior |
4 | Diana | 50000 | Mid-level |
5 | Eve | 62000 | Mid-level |
6 | Frank | 38000 | Junior |
7 | Grace | 80000 | Senior |
- Join with `IN` / `NOT IN`
```{sql}
#| connection: con_demo
#| label: lst-join-in-not-in
SELECT e.emp_id, e.name, d.department_name
FROM employees e
JOIN departments d
ON e.dept_id NOT IN (10, 20) AND e.dept_id = d.department_id
ORDER BY name;
```
emp_id | name | department_name |
---|---|---|
3 | Charlie | Finance |
6 | Frank | Marketing |
- Join with Compound Conditions (`AND`/`OR`)
```{sql}
#| connection: con_demo
#| label: lst-join-and-or
SELECT e.emp_id, e.name, e.salary, d.department_name, d.department_id
FROM employees e
LEFT JOIN departments d
ON (e.dept_id IN (10, 20) OR e.salary > 70000)
ORDER BY name;
```
emp_id | name | salary | department_name | department_id |
---|---|---|---|---|
1 | Alice | 40000 | HR | 10 |
1 | Alice | 40000 | IT | 20 |
1 | Alice | 40000 | Finance | 30 |
1 | Alice | 40000 | Marketing | 40 |
1 | Alice | 40000 | Legal | 60 |
2 | Bob | 55000 | HR | 10 |
2 | Bob | 55000 | IT | 20 |
2 | Bob | 55000 | Finance | 30 |
2 | Bob | 55000 | Marketing | 40 |
2 | Bob | 55000 | Legal | 60 |
Note: this returns more rows than you might expect due to the combination of `LEFT JOIN` and `OR`.
- `ON` does not just filter rows; it determines which right-table rows match.
- Using `OR` inside `ON` with multiple right-table rows can create duplicates of left-table rows.
- To avoid this, either:
  - Filter the right table in a subquery before joining.
  - Use `=` or carefully structured conditions in `ON`.
A `CROSS JOIN` pairs every row of the left table with every row of the right table (a Cartesian product):
```{sql}
#| connection: con_demo
#| label: join-cross
SELECT e.emp_id, e.name, p.project_name
FROM employees e
CROSS JOIN projects p
LIMIT 10;
```
emp_id | name | project_name |
---|---|---|
1 | Alice | Alpha |
2 | Bob | Alpha |
3 | Charlie | Alpha |
4 | Diana | Alpha |
5 | Eve | Alpha |
6 | Frank | Alpha |
7 | Grace | Alpha |
1 | Alice | Beta |
2 | Bob | Beta |
3 | Charlie | Beta |
- Text Comparison with `LIKE` or `IN`
```{sql}
#| connection: con_demo
SELECT e.emp_id, e.name, d.department_name
FROM employees e
INNER JOIN departments d
ON e.dept_id = d.department_id
AND d.department_name LIKE '%I%'
ORDER BY e.emp_id;
```
emp_id | name | department_name |
---|---|---|
2 | Bob | IT |
4 | Diana | IT |
We are done with the demo database, so let's close the connection.
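A sketch using the DuckDB-specific `shutdown` argument:

```{r}
DBI::dbDisconnect(con_demo, shutdown = TRUE)
```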
8.21 Subqueries
SQL allows you to embed or nest queries within queries. These are referred to as subqueries.
"A subquery is just a query used as a data source in the `FROM` clause, instead of the usual table." R for Data Science (2ed)
- As an example, if you need data from a "table" that does not exist, you can use a query to create the new table (virtually) and insert it into the `FROM` clause of an outer query.
To delineate the query as a subquery, you put it inside parentheses `(...)`, which requires the subquery to be executed before other parts of the query can be evaluated.
If using the results in other parts of a complicated query, you must also assign an “alias” for the result. This is similar to using a pronoun in R piping. It represents the returned object from the subquery.
Subqueries are allowed in places where SQL expects a value, table, or expression.
- There are several varieties as you can see in How to write subqueries in SQL.
- Subquery in the `FROM` clause (as a table)
  - Creates a "virtual table" that the outer query can use.
  - Must assign an alias so the outer query can reference it.
  - The subquery is treated like a table.
```{sql}
#| connection: con
SELECT f.carrier, f.dep_delay
FROM (
SELECT *
FROM flights
WHERE dep_delay > 60
) AS f
WHERE f.origin = 'JFK'
LIMIT 10;
```
carrier | dep_delay |
---|---|
AA | 71 |
MQ | 853 |
B6 | 77 |
B6 | 122 |
EV | 119 |
B6 | 88 |
B6 | 91 |
9E | 88 |
AA | 63 |
AA | 131 |
- Subquery in the `SELECT` clause (as a scalar value)
  - This subquery is an expression: it returns a single value for each row.
```{sql}
#| connection: con
SELECT carrier,
(SELECT AVG(dep_delay)
FROM flights f2
WHERE f2.carrier = f1.carrier) AS avg_delay
FROM flights f1
GROUP BY carrier
LIMIT 10;
```
carrier | avg_delay |
---|---|
OO | 12.586207 |
WN | 17.711744 |
AS | 5.804775 |
F9 | 20.215543 |
DL | 9.264504 |
B6 | 13.022522 |
FL | 18.726075 |
HA | 4.900585 |
EV | 19.955390 |
AA | 8.586016 |
- Subquery in the `WHERE` clause (as a condition)
  - Returns values to compare with column values.
  - The subquery returns a list of values; the outer query filters based on that.
```{sql}
#| connection: con
SELECT flight, carrier
FROM flights
WHERE carrier IN (
SELECT carrier
FROM airlines
WHERE name LIKE '%Jet%'
)
LIMIT 10;
```
flight | carrier |
---|---|
373 | B6 |
4694 | EV |
24 | B6 |
56 | B6 |
3260 | EV |
673 | B6 |
5273 | EV |
1 | B6 |
600 | B6 |
4553 | EV |
- Subquery in the `HAVING` clause
  - Works like `WHERE` but after aggregation:
```{sql}
#| connection: con
SELECT carrier, AVG(dep_delay) AS avg_delay
FROM flights
GROUP BY carrier
HAVING AVG(dep_delay) > (
SELECT AVG(dep_delay)
FROM flights
)
LIMIT 10;
```
carrier | avg_delay |
---|---|
FL | 18.72607 |
VX | 12.86942 |
B6 | 13.02252 |
WN | 17.71174 |
F9 | 20.21554 |
EV | 19.95539 |
9E | 16.72577 |
YV | 18.99633 |
- The subquery must return exactly the type and number of values expected:
  - Table (one or more columns, any number of rows) with an alias in `FROM`.
  - Scalar (one value) in `SELECT` or `WHERE =`. If it returns multiple rows/columns, SQL throws an error.
  - Can return multiple rows, but only one column, for `IN` or `FROM`. If you tried to return multiple columns for `IN`, SQL would error.
- You cannot just put a subquery anywhere; it must be valid in context.
8.21.1 Subquery Example
Let’s create a query that uses all four types:
- Get flights delayed more than the average delay (`WHERE` subquery)
- Include airline name (`JOIN` subquery in `FROM`)
- Include average delay per carrier (`SELECT` scalar subquery)
- Only include carriers whose average delay is above the overall average (`HAVING` subquery)
```{sql}
#| connection: con
#| lst-label: lst-combined-subquery
#| lst-cap: A query using four types of subqueries
SELECT f.year, f.month, f.day, f.dep_delay, a.name AS airline_name,
-- scalar subquery in SELECT: average delay per carrier
(SELECT AVG(dep_delay)
FROM flights f2
WHERE f2.carrier = f.carrier) AS avg_delay_per_carrier
FROM (
-- subquery in FROM: only flights delayed more than the overall average
SELECT *
FROM flights
WHERE dep_delay > (SELECT AVG(dep_delay) FROM flights)
) AS f
JOIN airlines a
ON f.carrier = a.carrier
GROUP BY f.carrier, f.year, f.month, f.day, f.dep_delay, a.name
-- subquery in HAVING: only carriers with avg delay above overall avg
HAVING AVG(f.dep_delay) > (SELECT AVG(dep_delay) FROM flights)
ORDER BY f.dep_delay DESC
LIMIT 10;
```
year | month | day | dep_delay | airline_name | avg_delay_per_carrier |
---|---|---|---|---|---|
2013 | 1 | 9 | 1301 | Hawaiian Airlines Inc. | 4.900585 |
2013 | 6 | 15 | 1137 | Envoy Air | 10.552041 |
2013 | 1 | 10 | 1126 | Envoy Air | 10.552041 |
2013 | 9 | 20 | 1014 | American Airlines Inc. | 8.586016 |
2013 | 7 | 22 | 1005 | Envoy Air | 10.552041 |
2013 | 4 | 10 | 960 | Delta Air Lines Inc. | 9.264504 |
2013 | 3 | 17 | 911 | Delta Air Lines Inc. | 9.264504 |
2013 | 6 | 27 | 899 | Delta Air Lines Inc. | 9.264504 |
2013 | 7 | 22 | 898 | Delta Air Lines Inc. | 9.264504 |
2013 | 12 | 5 | 896 | American Airlines Inc. | 8.586016 |
Table 8.4 shows how each use of a subquery contributes to the overall results while meeting the expectations of its usage.
Feature | Type | Role in Query |
---|---|---|
(SELECT AVG(dep_delay) FROM flights) in WHERE |
Scalar subquery | Filters only flights delayed more than overall average |
FROM (SELECT * FROM flights …) AS f |
Subquery in FROM | Creates virtual table of delayed flights |
(SELECT AVG(dep_delay) FROM flights f2 WHERE f2.carrier = f.carrier) in SELECT |
Scalar subquery | Computes average delay per carrier |
HAVING AVG(f.dep_delay) > (SELECT AVG(dep_delay) FROM flights) |
Subquery in HAVING | Filters groups (carriers) with above-average delays |
8.22 `WITH` Defines Common Table Expressions (CTE)
`WITH` defines Common Table Expressions (CTEs): named subqueries that make queries easier to read and reuse.
- Think of a CTE as a temporary view: it exists only for the duration of the statement.
- `WITH` must appear before the main statement (`SELECT`, `INSERT`, `UPDATE`, `DELETE`).
- You can define multiple CTEs by separating them with commas.
- The final query must follow immediately after the last CTE.
The syntax for a single CTE is:
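A sketch with placeholder names:

```sql
WITH cte_name AS (
    SELECT ...
)
SELECT ...
FROM cte_name;
```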
In the case of multiple CTEs, use one `WITH` and separate the CTEs with commas.
- Note: order matters, as later CTEs can refer to earlier CTEs.
Let’s rewrite the query in Listing 8.1 using three CTEs to pull those subqueries out from being nested.
```{sql}
#| connection: con
#| label: lst-cte-subquery
WITH overall_avg AS (
SELECT AVG(dep_delay) AS avg_dep_delay
FROM flights
),
flights_above_avg AS (
SELECT *
FROM flights, overall_avg
WHERE dep_delay > overall_avg.avg_dep_delay
),
carrier_avg AS (
SELECT carrier, AVG(dep_delay) AS avg_delay_per_carrier
FROM flights
GROUP BY carrier
)
SELECT f.year, f.month, f.day, f.dep_delay,
a.name AS airline_name,
ca.avg_delay_per_carrier
FROM flights_above_avg f
JOIN airlines a
ON f.carrier = a.carrier
JOIN carrier_avg ca
ON f.carrier = ca.carrier
JOIN overall_avg oa
ON TRUE
GROUP BY f.carrier, f.year, f.month, f.day, f.dep_delay, a.name,
ca.avg_delay_per_carrier, oa.avg_dep_delay
HAVING AVG(f.dep_delay) > oa.avg_dep_delay
ORDER BY f.dep_delay DESC
LIMIT 10;
```
year | month | day | dep_delay | airline_name | avg_delay_per_carrier |
---|---|---|---|---|---|
2013 | 1 | 9 | 1301 | Hawaiian Airlines Inc. | 4.900585 |
2013 | 6 | 15 | 1137 | Envoy Air | 10.552041 |
2013 | 1 | 10 | 1126 | Envoy Air | 10.552041 |
2013 | 9 | 20 | 1014 | American Airlines Inc. | 8.586016 |
2013 | 7 | 22 | 1005 | Envoy Air | 10.552041 |
2013 | 4 | 10 | 960 | Delta Air Lines Inc. | 9.264504 |
2013 | 3 | 17 | 911 | Delta Air Lines Inc. | 9.264504 |
2013 | 6 | 27 | 899 | Delta Air Lines Inc. | 9.264504 |
2013 | 7 | 22 | 898 | Delta Air Lines Inc. | 9.264504 |
2013 | 12 | 5 | 896 | American Airlines Inc. | 8.586016 |
- `overall_avg` computes the overall average once.
- `flights_above_avg` uses that to filter `flights`.
- `carrier_avg` pre-computes each carrier's average (so no scalar subquery per row).
- The final query joins them together.
This is considered easier to read as the CTEs are defined at the top so the later clauses are more compact.
8.23 `CREATE VIEW` Saves a Query in the Database
`CREATE VIEW` creates a named version of a query and stores it in the database:
- CTEs and subqueries exist only while the query runs.
- A view persists in the database, so you (and others with access) can reuse it in multiple queries and even across different sessions.
- You may also have access to views created by others.
All views must be named, just like tables.
- Use `SHOW VIEWS;` in many SQL flavors, but to see the views in a DuckDB database you can query the catalog, as sketched below.
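One option (a sketch using the standard `information_schema`, which DuckDB supports):

```{sql}
#| connection: con
SELECT table_name
FROM information_schema.tables
WHERE table_type = 'VIEW';
```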
- As a minimalist database, `flights` has no views.
Views must have unique names.
- A view is stored in the system catalog (metadata tables) of the database, alongside tables, indexes, functions, etc.
- When you create a view with `CREATE VIEW view_name AS ...`, the database inserts an entry for `view_name` into its catalog.
- Uniqueness is enforced within a schema/namespace:
  - You can't have two views (or a table and a view) with the same name in the same schema.
Users create a view for these reasons:
- Reusability: If you have a long, complex query (say, joining 5-6 tables, filtering, aggregating), you don't want to rewrite it every time. Save it once as a view, then just `SELECT` from the named view like a table.
- Consistency: Ensures everyone uses the same definition for a derived dataset (e.g., "active customers" or "on-time flights").
- Security: A view can expose only the columns/rows you want others to see, hiding sensitive fields in the base tables.
Performance Considerations
- Regular Views: These are just stored query definitions. When you query the view, the database re-runs the underlying query. So, by default, they are not faster than repeating the query. The main gain is readability and reusability.
- Materialized Views: Some systems (e.g., PostgreSQL, Oracle) allow you to create materialized views, which store the actual result set and can be indexed. These are precomputed and often faster, but they must be refreshed when the underlying data changes.
Let's create a view called `delayed_flights`, with `AS` acting as a keyword (not an alias) assigning the name to the query definition.
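A sketch consistent with the columns and the 60-minute note in the output below:

```{sql}
#| connection: con
CREATE VIEW delayed_flights AS
SELECT flight, month, day, dep_delay
FROM flights
WHERE dep_delay > 60; -- only seriously delayed flights
```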
Check that the view exists.
```{sql}
#| connection: con
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_name = 'delayed_flights';
```
table_name | table_type |
---|---|
delayed_flights | VIEW |
Use the view with `SELECT` as if it were a table.
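A minimal sketch, treating the view as a table:

```{sql}
#| connection: con
SELECT *
FROM delayed_flights
LIMIT 10;
```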
flight | month | day | dep_delay |
---|---|---|---|
4576 | 1 | 1 | 101 |
443 | 1 | 1 | 71 |
3944 | 1 | 1 | 853 |
856 | 1 | 1 | 144 |
1086 | 1 | 1 | 134 |
4495 | 1 | 1 | 96 |
4646 | 1 | 1 | 71 |
673 | 1 | 1 | 77 |
4869 | 1 | 1 | 70 |
4497 | 1 | 1 | 115 |
Note: all are above 60 minutes.
If you no longer need the view in the database, use `DROP VIEW`.
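For example:

```{sql}
#| connection: con
DROP VIEW delayed_flights;
```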
8.24 Reshaping/Pivoting Data Using SQL
Recall {tidyr} has several verbs for reshaping rectangular data: `pivot_wider`, `pivot_longer`, `separate_*`, and `unite`.
SQL doesn't have these verbs natively, but you can reshape using:
1. `CASE WHEN` statements
2. `GROUP BY` with aggregation
3. Vendor-specific `PIVOT` / `UNPIVOT` (e.g., SQL Server)
- Example of "pivot-wider"-like reshaping in SQL with `CASE WHEN`
Suppose we want to see total departure delays per carrier for each origin airport. This is similar to a `pivot_wider()` in R.
```{sql}
#| connection: con
SELECT origin,
SUM(CASE WHEN carrier = 'AA' THEN dep_delay ELSE 0 END) AS AA_total_delay,
SUM(CASE WHEN carrier = 'DL' THEN dep_delay ELSE 0 END) AS DL_total_delay,
SUM(CASE WHEN carrier = 'UA' THEN dep_delay ELSE 0 END) AS UA_total_delay
FROM flights
GROUP BY origin
ORDER BY origin;
```
origin | AA_total_delay | DL_total_delay | UA_total_delay |
---|---|---|---|
EWR | 34000 | 52000 | 571694 |
JFK | 140542 | 171672 | 35471 |
LGA | 101009 | 218810 | 94733 |
- `CASE WHEN carrier = 'AA' THEN dep_delay ELSE 0 END`: creates a separate column for each carrier (like pivoting).
- `SUM()`: aggregates the delays for each origin × carrier combination.
- `GROUP BY origin`: ensures aggregation per airport.
- Example of "pivot-longer"-like reshaping in SQL with `UNION ALL`
- `UNION ALL` combines the results of two or more `SELECT` statements into a single result set.
  - It keeps all rows, including duplicates.
  - It stacks the columns into rows, producing a long-format table.
  - Each `SELECT` must have the same number of columns and compatible data types.
- `WITH flights_wide AS (...)` creates a temporary wide table with one column per carrier's total departure delay, since we don't have one in the flights database.
```{sql}
#| connection: con
-- Step 1: Aggregate delays per origin per carrier (wide pivot)
WITH flights_wide AS (
SELECT origin,
SUM(CASE WHEN carrier = 'AA' THEN dep_delay ELSE 0 END) AS AA_total_delay,
SUM(CASE WHEN carrier = 'DL' THEN dep_delay ELSE 0 END) AS DL_total_delay,
SUM(CASE WHEN carrier = 'UA' THEN dep_delay ELSE 0 END) AS UA_total_delay
FROM flights
GROUP BY origin
)
-- Step 2: Convert wide format to long format using UNION ALL
SELECT origin, 'AA' AS carrier, AA_total_delay AS total_delay
FROM flights_wide
UNION ALL
SELECT origin, 'DL' AS carrier, DL_total_delay
FROM flights_wide
UNION ALL
SELECT origin, 'UA' AS carrier, UA_total_delay
FROM flights_wide
ORDER BY origin, carrier;
```
origin | carrier | total_delay |
---|---|---|
EWR | AA | 34000 |
EWR | DL | 52000 |
EWR | UA | 571694 |
JFK | AA | 140542 |
JFK | DL | 171672 |
JFK | UA | 35471 |
LGA | AA | 101009 |
LGA | DL | 218810 |
LGA | UA | 94733 |
If you are interested in reshaping, especially with an intent to aggregate the data, the various SQL flavors have different approaches.
As an example, SQL Server has the statements `PIVOT` and `UNPIVOT`. See FROM - Using PIVOT and UNPIVOT.
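DuckDB, the flavor used in this chapter, also provides `PIVOT` and `UNPIVOT` statements in recent releases. A sketch, assuming a DuckDB version (0.8 or later) with `PIVOT` support:

```{sql}
#| connection: con
-- One column per listed carrier, summing dep_delay within each origin
PIVOT flights
ON carrier IN ('AA', 'DL', 'UA')
USING SUM(dep_delay)
GROUP BY origin;
```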
8.25 `CREATE TABLE` Declares New Tables
You can create new tables using SQL.
Let’s create a new, temporary connection, `tmpcon`, to a new in-memory back-end database we can use to create tables.
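A minimal sketch (calling `dbConnect()` without a `dbdir` creates an in-memory DuckDB database):

```{r}
tmpcon <- DBI::dbConnect(duckdb::duckdb())
```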
Use `CREATE TABLE` to declare new tables and `INSERT INTO` to insert new records into that table.
- Note the semicolon (`;`) after each SQL statement in the same code chunk.
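A sketch with a hypothetical `students` table; the table name and columns are illustrative, not from the original notes:

```{sql}
#| connection: tmpcon
CREATE TABLE students (
    id INTEGER PRIMARY KEY,
    name VARCHAR,
    gpa DOUBLE
);
INSERT INTO students VALUES
    (1, 'Ada', 3.9),
    (2, 'Grace', 3.8);
```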
`DBI::dbWriteTable()` will add a table to a connection from R.
Let’s write multiple data frames from {tidyr} into new tables in our `tmpcon` database.
data("table1", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table1", value = table1)
data("table2", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table2", value = table2)
data("table3", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table3", value = table3)
data("table4b", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table4b", value = table4b)
data("table5", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table5", value = table5)
Here they are.
| database | schema | name | column_names | column_types | temporary |
|---|---|---|---|---|---|
| memory | main | table1 | country, year, cases, population | VARCHAR, DOUBLE, DOUBLE, DOUBLE | FALSE |
| memory | main | table2 | country, year, type, count | VARCHAR, DOUBLE, VARCHAR, DOUBLE | FALSE |
| memory | main | table3 | country, year, rate | VARCHAR, DOUBLE, VARCHAR | FALSE |
| memory | main | table4a | country, 1999, 2000 | VARCHAR, INTEGER, INTEGER | FALSE |
| memory | main | table4b | country, 1999, 2000 | VARCHAR, DOUBLE, DOUBLE | FALSE |
| memory | main | table5 | country, century, year, rate | VARCHAR, VARCHAR, VARCHAR, VARCHAR | FALSE |
Close the connection `tmpcon`.
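For example:

```{r}
DBI::dbDisconnect(tmpcon, shutdown = TRUE)
```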
8.26 `COPY ... FROM` / `COPY ... TO` Imports/Exports Data from/to Other Formats and Files
Use `COPY ... TO` to write the output of a SQL query to a CSV file (using a relative path).
Let’s write some group summaries to a CSV file.
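A sketch, with a hypothetical summary query and output path:

```{sql}
#| connection: con
COPY (
    SELECT origin, COUNT(*) AS n_flights, AVG(dep_delay) AS avg_dep_delay
    FROM flights
    GROUP BY origin
) TO 'data/delays_by_origin.csv' (HEADER, DELIMITER ',');
```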
The resulting CSV file contains a header row followed by one row per group from the query.
8.27 Calling SQL from R
Use `DBI::dbGetQuery()` to run SQL code in R and obtain the result as a data frame.
Often, folks have SQL written in a separate file (one that ends in `.sql`).
If you want to load the results of a SQL query saved in `query.sql` into R, use `readr::read_file()` inside `DBI::dbGetQuery()`.
8.28 Closing the Connection
After you are finished working with a database, you should close down your connection:
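For the DuckDB connection used here, that looks like:

```{r}
DBI::dbDisconnect(con, shutdown = TRUE)
```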
- The `shutdown = TRUE` argument is a DuckDB option to close the connection and shut down the entire DuckDB engine.
8.29 Basic SQL Exercises
Download the file starwars.duckdb and save it to a `data` directory.
- It’s the data from the {starwarsdb} package as a DuckDB database.
- Or use the following once you are sure there is a `data` directory:
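A sketch with a placeholder URL; substitute the actual course link for the database file:

```{r}
dir.create("data", showWarnings = FALSE)
download.file(
  "https://example.com/starwars.duckdb", # hypothetical URL
  destfile = "data/starwars.duckdb",
  mode = "wb"
)
```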
Use only SQL to answer the following questions.
- Open a connection to the database.
- Print out a summary of the tables in this database.
- Select the entire `people` table.
- Select just the `name`, `height`, `mass`, and `species` variables from the `people` table.
- Add to the above query by selecting only the humans and droids.
- Remove the individuals with missing `mass` data from the above query.
- Modify the above query to calculate the average height and mass for humans and droids.
- Make sure Droids are in the first row and Humans are in the second row in the above summary output.
- Here is the summary of the keys for the database from the {starwarsdb} GitHub page.
- Select all films with characters whose homeworld is Kamino.
- Filter the `people` table to only contain humans from Tatooine and export the result to a CSV file called “folks.csv”.
- Close the SQL connection.
8.30 Integrating SQL with R
Section 8.4 demonstrated using SQL inside an R environment using the {DBI} and {duckdb} packages.
All of the SQL was run in a SQL code chunk in a Quarto document.
We manually converted R code chunks to SQL and inserted the connection name.
All of the SQL code chunks were interactive and they did not affect the R environment.
Section 8.27 showed how to run SQL scripts (`.sql` files) in R with the `DBI::dbGetQuery()` function and assign an R name to the result-set of a SQL query.
This section will demonstrate some shortcuts to both creating SQL code chunks and then getting access to the result-sets as R objects.
Let’s create a connection to the `flights.duckdb` database.
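A sketch, assuming the database file sits in the project directory:

```{r}
con <- DBI::dbConnect(duckdb::duckdb(),
  dbdir = "flights.duckdb",
  read_only = TRUE
)
```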
8.30.1 Other SQL Code Chunk Options
See Section 8.4.6 for using RStudio to create SQL code chunks quickly.
The R Markdown Definitive Guide section on SQL discusses several SQL chunk options you can use to adjust how SQL is interpreted by {knitr} in RStudio documents.
- `max.print = XX` adjusts how many lines of SQL output are printed in the rendered document (the default is `XX = 10`).
- `tab.cap = "My Caption"` replaces the default caption, which reports how many records are displayed, with whatever you want in the rendered document.
These options will work as either R Markdown style options or Quarto style options.
8.30.2 Creating R Variables from SQL Code
You can use the SQL code chunk option `output.var` to create an R data frame variable from a SQL result-set.
Let’s create R variables for the tables in the database and the SQL description of the tables in the database.
- Again, either style of chunk option will work.
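A sketch using DuckDB's `SHOW ALL TABLES`; the variable name `tables_df` is illustrative:

```{sql}
#| connection: con
#| output.var: "tables_df"
SHOW ALL TABLES;
```

```{r}
tables_df
class(tables_df)
```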
Notice that, just like assignment in R, the result is assigned invisibly and not shown in the output.
- You do see the new variables in the Global Environment.
database schema name
1 flights main airlines
2 flights main airports
3 flights main flights
4 flights main planes
5 flights main weather
column_names
1 carrier, name
2 faa, name, lat, lon, alt, tz, dst, tzone
3 year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier, flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour
4 tailnum, year, type, manufacturer, model, engines, seats, speed, engine
5 origin, year, month, day, hour, temp, dewp, humid, wind_dir, wind_speed, wind_gust, precip, pressure, visib, time_hour
column_types
1 VARCHAR, VARCHAR
2 VARCHAR, VARCHAR, DOUBLE, DOUBLE, DOUBLE, DOUBLE, VARCHAR, VARCHAR
3 INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, DOUBLE, INTEGER, INTEGER, DOUBLE, VARCHAR, INTEGER, VARCHAR, VARCHAR, VARCHAR, DOUBLE, DOUBLE, DOUBLE, DOUBLE, TIMESTAMP
4 VARCHAR, INTEGER, VARCHAR, VARCHAR, VARCHAR, INTEGER, INTEGER, INTEGER, VARCHAR
5 VARCHAR, INTEGER, INTEGER, INTEGER, INTEGER, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, TIMESTAMP
temporary
1 FALSE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
[1] "data.frame"
Rows: 5
Columns: 6
$ database <chr> "flights", "flights", "flights", "flights", "flights"
$ schema <chr> "main", "main", "main", "main", "main"
$ name <chr> "airlines", "airports", "flights", "planes", "weather"
$ column_names <list> <"carrier", "name">, <"faa", "name", "lat", "lon", "alt",…
$ column_types <list> <"VARCHAR", "VARCHAR">, <"VARCHAR", "VARCHAR", "DOUBLE", …
$ temporary <lgl> FALSE, FALSE, FALSE, FALSE, FALSE
[1] "data.frame"
Let’s use SQL to get the departure delay for all flights from JFK and LaGuardia to Chicago and then use {ggplot2} to plot.
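A sketch of such a query, storing the result in a hypothetical variable `ord_delays` (ORD is Chicago O'Hare):

```{sql}
#| connection: con
#| output.var: "ord_delays"
SELECT flight, origin, dest, dep_delay
FROM flights
WHERE origin IN ('JFK', 'LGA') AND dest = 'ORD';
```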
Check our data.
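Using the hypothetical `ord_delays` from above:

```{r}
dplyr::glimpse(ord_delays)
```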
Rows: 3,862
Columns: 4
$ flight <int> 1477, 327, 331, 343, 317, 329, 353, 317, 329, 331, 343, 353,…
$ origin <chr> "LGA", "LGA", "LGA", "LGA", "LGA", "LGA", "LGA", "LGA", "LGA…
$ dest <chr> "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD…
$ dep_delay <dbl> 178, 181, 190, 97, NA, NA, NA, -3, -4, 3, 5, 66, -4, 125, 29…
Plot.
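A sketch of one possible plot of these delays:

```{r}
library(ggplot2)
ggplot(ord_delays, aes(x = origin, y = dep_delay)) +
  geom_boxplot() +
  labs(
    x = "Origin", y = "Departure delay (minutes)",
    title = "Departure delays for flights to Chicago O'Hare"
  )
```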
8.30.3 Using R Variables in SQL Code
You can also use an existing R variable as a value in a SQL query clause.
Create the R variable and use its name preceded by a `?`.
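A sketch; the variable names `dest_var` and `mco_delays` are illustrative:

```{r}
dest_var <- "MCO"
```

```{sql}
#| connection: con
#| output.var: "mco_delays"
SELECT flight, origin, dest, dep_delay
FROM flights
WHERE dest = ?dest_var;
```

```{r}
dplyr::glimpse(mco_delays)
```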
Rows: 2,638
Columns: 4
$ flight <int> 1183, 1099, 2181, 299, 2159, 683, 483, 1883, 883, 1815, 1885…
$ origin <chr> "JFK", "LGA", "LGA", "LGA", "JFK", "JFK", "JFK", "JFK", "JFK…
$ dest <chr> "MCO", "MCO", "MCO", "MCO", "MCO", "MCO", "MCO", "MCO", "MCO…
$ dep_delay <dbl> 16, 30, -6, 10, -5, -5, -4, -3, 11, 11, -4, -8, 1, 5, 30, -5…
Using parameters in SQL queries is common practice.
Using R to manipulate text to create queries as character strings is also common.
However, if you are building something, e.g., a Shiny app, where you allow others to input text that will feed into a query, you have to protect what you are doing.
You are liable to become the victim of a SQL Injection Attack, which can have serious consequences.
For R-specific approaches to mitigate the risk, see Run Queries Safely Posit (2023c) from the Posit Best Practices in Working with Databases. Posit (2023b)
8.30.4 Setting the Connection Once in a Document
If you are going to use the same connection for all (or most) chunks in a document, you can set the connection as an option in a setup chunk.
Let’s close the current `con`.
Now create a new connection.
Now run a SQL code chunk without stating the connection.
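A sketch of the full sequence; the `connection` chunk option set via `knitr::opts_chunk$set()` becomes the default for subsequent SQL chunks:

```{r}
DBI::dbDisconnect(con, shutdown = TRUE)
con <- DBI::dbConnect(duckdb::duckdb(),
  dbdir = "flights.duckdb",
  read_only = TRUE
)
knitr::opts_chunk$set(connection = "con")
```

```{sql}
SELECT carrier, name
FROM airlines
LIMIT 5;
```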
8.30.5 Close the Connection
To avoid a warning when rendering the document, close the connection at the end.
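As before:

```{r}
DBI::dbDisconnect(con, shutdown = TRUE)
```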
8.31 Using the {dbplyr} Package to Translate {dplyr} into SQL
The {dbplyr} package allows you to use {dplyr} code, and it will translate it into SQL to interact (through an established connection) with a Back-end server.
- Consider R as a higher-level language you use to access code written in C.
- The {shiny} package can be considered as providing a higher-level language to generate HTML from {shiny} code.
- You can consider {dbplyr} as a higher-level package that generates SQL from {dplyr} code.
{dbplyr} works for many flavors of SQL and can generate many SQL expressions.
However, it cannot always do everything. For more details see the {dbplyr} vignette on Function Translation.
Even if you can’t use {dbplyr} in your situation, you can still use it to generate SQL code you can see and use. That makes it a good tool for learning SQL since you already know {dplyr}.
Let’s load {dbplyr} and {tidyverse} along with {DBI} and {duckdb}.
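For example:

```{r}
library(DBI)
library(duckdb)
library(dbplyr)
library(tidyverse)
```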
Create the connection to the database.
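A sketch, again assuming `flights.duckdb` is in the project directory:

```{r}
con <- dbConnect(duckdb(), dbdir = "flights.duckdb", read_only = TRUE)
```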
Retrieve a table from a SQL database using `dplyr::tbl()`.
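For example:

```{r}
flights2 <- tbl(con, "flights")
class(flights2)
```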
- Note you have to specify the source of the data, here the database connection, and the table name.
- Note that `flights2` is not a data frame. It’s a list with class `tbl_duckdb_connection`.
Now you can use your tidyverse code on the `flights` table to generate a query.
```{r}
flights2 |>
  select(flight, origin, dest, dep_delay) |>
  filter(origin == "JFK", dest == "CMH") |>
  summarize(dep_delay = mean(dep_delay, na.rm = TRUE)) ->
  my_query
my_query
```
Execute the query by using `collect()` at the end.
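For example:

```{r}
my_query |> collect()
```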
To see the SQL code that was generated, use `show_query()` at the end (this is a good way to learn SQL).
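For example:

```{r}
my_query |> show_query()
```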
After you are done with the database, close the connection.
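As before:

```{r}
dbDisconnect(con, shutdown = TRUE)
```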
8.32 The DBeaver Integrated Development Environment (IDE)
- This section based on work by Professor David Gerard.
There is a free, separate Front-end for the DuckDB database called “DBeaver Community”.
Per DBeaver community:
DBeaver Community is a free cross-platform database tool for developers, database administrators, analysts, and everyone working with data. It supports all popular SQL databases like MySQL, MariaDB, PostgreSQL, SQLite, Apache Family, and more.
8.32.1 Setup DBeaver Community
The instructions here are only for DuckDB. But DBeaver works with other DBMS’s pretty easily.
Download and install DBeaver Community here: https://dbeaver.io/download/.
When you open it up, select Database Navigator and, under All or SQL, select “DuckDB”.
Then select Next.
Select “Host”, and Open to navigate to your `flights.duckdb` database.
Click on Test Connection.
If successful, you will get a pop-up window with the results of the test, which will allow you to see the details of the connection.
If not successful, and you haven’t used it before, it may ask you to install a JDBC-compliant driver to connect to a DuckDB database.
- Click Download Driver.
- After the download, it will run the test and you should get a positive result.
Click “Finish”.
You should now see the database under the Database Navigator.
Right-click on the connection and create a new SQL script file.
Type some SQL and hit the execute button, or just hit Control + Enter.
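Any small query will do for a first try; a hypothetical example:

```sql
SELECT origin, dest, dep_delay
FROM flights
LIMIT 10;
```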
You can now play around with SQL.
When you want to save the data to a CSV file, just hit “Export data”.
- There are a lot of features in DBeaver, but we won’t cover them. They are pretty intuitive; just play with it for a while.
See the DBeaver General User Guide for detailed documentation.
8.32.2 Exercises
Use DBeaver for all of these exercises.
- What is the mean temperature for flights from each airport?
- What is the average number of flights from each airport per day in January?
- What are the top destinations for each airport?
8.33 Considerations for Working with SQL and Databases
8.33.1 Review Documentation about the database
When working with a new database, or one that is often changing, seek out and review the available documentation.
Data Dictionaries, schema descriptions, and/or E-R diagrams can help you understand the database design, the table descriptions and structure, the fields in each table, the primary keys, and whether tables are indexed.
The documentation can help you see a holistic picture of the data so you know where to look for tables and fields of interest and how to join tables most efficiently.
- As an example, the database may have a naming convention for tables and fields to help you find the tables and fields you need to create your queries.
8.33.2 Review Documentation about the DBMS and its SQL
Every DBMS is slightly different, as developers optimize their code for the goals of the database.
Review documentation on the “flavor” of SQL accepted by the database.
Many flavors have extra features (statements, clauses, keywords, or modifiers) you can use to simplify your code or speed up your queries.
8.33.3 Choose a Front-end IDE that works with the DBMS
SQL is just text so you can write SQL in many environments.
If you are doing a lot of SQL, find a Front-end IDE that has features to facilitate using SQL.
- As an example, DBeaver has auto-complete for SQL statements while RStudio does not.
Many Front-ends have customized drivers for working with many different Back-end DBMS.
- So choose a Front-end that allows you to work with the different databases in your organization without having to switch.
- Read the documentation for the Front-end to use it effectively.
8.33.4 Follow a Style Guide
Like tidyverse style, there are style guides to help you make your code easier to read, faster to debug, and more robust to future changes.
If you do nothing else from a style perspective, use line breaks often to break up your code into readable lines where each starts on a statement, clause, or keyword.
Both of the following code chunks are valid SQL.
Streaming SQL example (a hypothetical query written as one run-on line):
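```sql
select origin, count(*) as n, avg(dep_delay) as avg_delay from flights where dest = 'ORD' group by origin order by n desc;
```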
Stylized tidy SQL (the same hypothetical query, formatted):
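```sql
SELECT origin,
       COUNT(*) AS n,
       AVG(dep_delay) AS avg_delay
FROM flights
WHERE dest = 'ORD'
GROUP BY origin
ORDER BY n DESC;
```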
Which do you think is easier to read, debug, and maintain?
- The second style matches what the DBeaver Format SQL command under the Edit menu produces.
There is no single authoritative style guide, so check if your organization has its own.
If it does not, consider the SQL Style Guide Holywell (2023) or the SQL Style Guide for GitLab Mestel and Bright (2023) as starting points.
8.33.5 Build-a-little-test-a-little
Writing and debugging SQL queries should proceed in steps just like building and debugging R code and functions.
Don’t write a 30-line query and then start to debug it.
Write short queries and use `LIMIT` to keep the returns fast.
Write multiple small queries and then integrate them piece by piece.
Consider using sub-queries and Common Table Expressions to combine them as well.
8.33.6 Follow Best Practices for Writing Queries
There are many sources for Best Practices in writing SQL queries to optimize speed.
Common threads appear across best practices:
- Shape the query to return only the data you need.
- Shape the query to take advantage of the DBMS ability to use indexes for faster performance.
- Make your queries easy to read.
The most commonly cited best practice is “Avoid the use of `SELECT *`”.
Other suggestions include:
- Use joins instead of sub-queries inside a `WHERE` when you are just matching values.
- Avoid using `LIKE` with a wildcard (`%`) at the beginning of a string, as this means you have to search every record.
- Use `GROUP BY` or add additional fields instead of using `SELECT DISTINCT` on large tables.
- Use comments (`--`) to explain why you are doing something.
- Use `AS` to create aliases to clarify fields from different tables for readability.
- Avoid sorting, especially for large tables.
For additional recommendations and rationale, consider the following as a starting point.
- 7 Bad Practices to Avoid When Writing SQL Queries for Better Performance Allam (2023) is a good place to start.
- Best Practices for Writing SQL Queries Metabase (2023) has an extensive list of suggestions.
8.33.7 Consider Tuning Slow Queries
Queries go through three stages when being executed.
- Syntax Check: The DBMS parser checks each query is syntactically correct.
- Optimization: The DBMS optimizer examines the query and builds a plan for execution to minimize resource demand and time.
- Execution: The DBMS runs the optimized plan to return any results.
Thus every query gets optimized but that does not mean the query could not be tuned to be faster.
If you have a query that is run many times, on large tables, and appears slow, then consider tuning it.
The first step is detecting a slow query.
- Most DBMS have a way to monitor query performance.
- For DBeaver, see 3 Ways to Detect Slow Queries in PostgreSQL. Schönig (2018)
The second step is to assess how the query is actually being run.
- Most DBMS also have a way to get the results of the Query Planner so you can review them to see what is actually happening.
- Look at the help for `EXPLAIN`.
- For DBeaver, review the PostgreSQL Documentation on Using EXPLAIN Group (2023). A sketch of the general pattern appears after this list.
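Both DuckDB and PostgreSQL accept an `EXPLAIN` prefix; the query below is illustrative:

```sql
-- Prefix a query with EXPLAIN to see the plan the optimizer chose
EXPLAIN
SELECT origin, AVG(dep_delay) AS avg_delay
FROM flights
GROUP BY origin;
```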
Once you understand the execution, then consider if any of the best practices for writing queries will help improve performance.
For more explanation and ideas see SQL Performance Tuning. khushboogoyal499 (2022)
8.33.8 Integrate Your SQL Into Your Version Control Workflow
Regardless of the IDE you use for a Front-end, SQL is still text-based code.
You can put the files into a Git Repo and version control with GitHub.
- If you are using DBeaver, use the search tool (magnifying glass) and look for “Install new software.”
- Use the drop down arrow to select the Git option as in Figure 8.5.

For more details see How to Version Control Your SQL.
8.33.9 Consider More Advanced SQL Statements and Methods
These notes have just scratched the surface of a language that has been around for 50 years and is still going strong.
There are many more statements, clauses, and keywords you can use to create more effective SQL to access the data you want, especially for large databases that will be accessed many times.
Some of these are:
- Views, Sub-queries, and Common Table Expressions.
- Additional clauses and modifiers for `WHERE`, such as {stringr}-style string functions.
- Dynamic SQL, where you can pass parameters to a stored procedure, akin to an R function.
There are many free references on SQL to expand your knowledge.
R for Data Science suggests two non-free books: SQL for Data Scientists by Renée M. P. Teate and Practical SQL by Anthony DeBarros.
8.33.10 Strategize How to Use SQL in Your Analysis Workflow
From a data science perspective SQL is usually a means for getting data out of databases.
- Data scientists are usually less involved in designing, building, and managing databases or maintaining the data.
Your working environment will often dictate which databases have the data of interest and the associated versions of SQL you must use.
There are SQL wizards who can make SQL dance by building and optimizing queries to the nth degree.
You do not have to be one!
As long as you can get the data you need, you can use the other tools in R and Python to make your analysis sing and dance.
If you have the time and hardware memory to get the data, you can clean and shape it outside of SQL if that works for you.
SQL is an important tool in your Data Science tool box for sure, but it works best when you use it alongside the other tools in that box.
If your primary analysis tools are R and RStudio, but you need to get subsets of data from a large database using SQL, consider using a SQL-focused IDE, such as DBeaver, to build your queries, and then copy them into a `.sql` file or a SQL code chunk.
This way you get the advantage of a SQL-focused IDE while using RStudio for the rest of the analysis workflow.