8 Getting Data from SQL Databases
database, relational, sql, duckdb, dbeaver, duckplyr
The code chunks for SQL and R have different colors in this chapter.
- Code chunks for SQL use this color.
- Code chunks for R use this color.
8.1 Introduction
8.1.1 Learning Outcomes
- Understand the organization of relational databases.
- Use Structured Query Language (SQL) to access and manipulate data in SQL databases.
- Use the {dbplyr} package to access and manipulate data in SQL databases.
8.1.2 References
- R for Data Science 2nd edition Chapter 25 Wickham, Cetinkaya-Rundel, and Grolemund (2023)
- {DBI} package R Special Interest Group on Databases (R-SIG-DB), Wickham, and Müller (2022)
- {odbc} package Hester, Wickham, and Gjoneski (2023)
- {duckdb} package Mühleisen et al. (2023)
- DuckDB Database Management System DuckDB (2023c)
- W3 Schools SQL Schools (2023)
- DBeaver SQL IDE DuckDB (2023a)
- DBeaver General User Guide Team (2003)
- R Markdown: The Definitive Guide Xie, Allaire, and Grolemund (2023) Section 2.7.3 on SQL
- {dbplyr} package Wickham, Girlich, and Ruiz (2023)
- Writing SQL with dbplyr Posit (2023a)
- {starwarsdb} package Aden-Buie (2020)
8.1.2.1 Other References
- Posit Solutions Best Practices in Working with Databases Posit (2023b)
- MariaDB Tutorial Database Theory Foundation (2023b)
- Customizing Keyboard Shortcuts in the RStudio IDE Ushey (2023)
- Code Snippets in the RStudio IDE Allaire (2023)
- SQL Injection kingthorin and Foundation (2023)
- How to Version Control your SQL Pullum (Swartz) (2020)
- Mermaid Diagramming and Charting Tool mermaid.js (2023)
- DuckDB Installation DuckDB (2023b)
- SQL Style Guide Holywell (2023)
- SQL Style Guide for GitLab Mestel and Bright (2023)
- Basic SQL Debugging Foundation (2023a)
- Best Practices for Writing SQL Queries Metabase (2023)
- 7 Bad Practices to Avoid When Writing SQL Queries for Better Performance Allam (2023)
- 3 Ways to Detect Slow Queries in PostgreSQL Schönig (2018)
- PostgreSQL Documentation on Using EXPLAIN Group (2023)
- SQL Performance Tuning khushboogoyal499 (2022)
8.2 A Brief History of Databases
People and organizations have been collecting data for centuries. As early as the 21st century BCE, people were using clay tablets to make records of farmland and grain production. Records were kept as individual files (clay tablets) that were eventually organized based on a conceptual model of the data. That model was typically a hierarchical set of fixed attributes of the records, e.g., subject, date, author, etc., for tracking records in a filing system.
It wasn’t until the 1960s that computerized “databases” began to appear. Early computerized databases focused on storing and retrieving individual records, organized according to data models based on the attributes of the tables and records. Data were stored in tables with rows of values and columns for attributes. Users could find the desired table and row by navigating through the hierarchy. This model resulted in larger and larger tables with many redundant fields as users tried to integrate data at the table level. As an example, a record of a purchase could include all of the details of the purchase, including all of the data on the buyer and seller such as name, address, and telephone number. Every record of every purchase by a buyer could include their name, address, and telephone number.
Database Management Systems (DBMS) were developed as the interface between the system user and the raw data in the records. A DBMS used the data model to manage the physical, digital storage of the data and provide the navigation to find the right table and record. Users could now retrieve and update records electronically. However, computing and storage were expensive, so researchers looked for ways to reduce the redundant storage and the inflexible perspective of the data inherent in the hierarchical data model.
In 1970, IBM researcher E.F. Codd published what became a landmark paper, A relational model of data for large shared data banks. Codd (1970) proposed replacing the fixed hierarchy and its large tables with much smaller tables, where the data model now focused on defining relationships among the tables instead of navigating across a hierarchy of attributes or parent-child relationships among tables. The relational database model is now ubiquitous for storing extremely large data sets.
A strength and a drawback of the relational database model is the capability to manage not just the data, but all the relationships across the tables of data. This makes relational databases great for working with structured data, where the relationships are known ahead of time and built into the database model. The drawback is that it takes time and processing power to enforce those relationships and structure. Thus, other approaches such as Apache Hadoop have been developed to capture and manage the variety, velocity, and volume of data associated with “Big Data” sources, e.g., capturing all the metadata and data associated with billions of transactions a day on a social media site.
For the relational database model to work, it needed a corresponding high-level language for implementation. Users had to be able to create databases and manage the life cycle of their databases and the data they contained. This need led to the development of what were known as “query languages”, which eventually evolved into what is now called Structured Query Language or SQL (pronounced as “SEE-quil” or “ESS-q-l”). SQL began with a row-based approach to managing data in tables. Other approaches, e.g., NoSQL systems, developed a column-based approach to managing data; there are trade-offs between the approaches. Many relational database developers create their own (proprietary) versions of SQL, optimized for their database, that incorporate features of both approaches for speed and distributed processing. Apache has developed Apache Hive as a data warehouse system that sits atop Hadoop. “Hive allows users to read, write, and manage petabytes of data using SQL.”
Relational databases and Hadoop/Hive-like approaches are not competitors so much as different solutions for different requirements.
This section will focus on the relational database methods and systems which use SQL.
8.3 Database Concepts and Terms
8.3.1 Database Tables
Relational databases use Relational DBMS (RDBMS) to keep track of large amounts of structured data in a single system.
A relational database uses rectangular tables, also called entities, to organize its data by rows and columns.
Each table is typically designed to contain data on a single type or class of “entity”.
An entity can be a physical object such as a person, building, location, or sales order, or a concept such as this class or metadata on an object.
Each entity must have properties or attributes of interest.
- These become the names for the columns in the table.
- These are also known as the fields for the tables.
Each row in the table is also referred to as a record.
The intersection of row and column contains the value, the data, for that record for that field.
All of the values within a given field should be of the same data type. These include the basic types in R such as character, integer, double, or logical as well as complex objects such as pictures, videos, or other Binary Large Objects.
The tables are similar to R or Python data frames, but with some key differences.
- Database tables are stored on a drive, so they are not constrained, as R data frames are, to fit into memory. They can have many columns and many millions of rows, as long as they fit on the drive.
- Unlike a data frame, a database field (column) may have a domain which limits the format and range of values in the field in addition to constraining the data type, e.g., the format for a telephone number may be `xx-xxx-xxx-xxxx`.
- Database tables are often designed so each record is unique (no duplicate rows), with rules the DBMS uses to enforce the uniqueness.
- Database tables are almost always indexed so each row has a unique value associated with it (behind the scenes).
- The index identifies each record from a physical perspective, much like one’s operating system indexes files for faster search and retrieval.
- The index is what allows a DBMS to find and retrieve records (rows) quickly.
8.3.2 Database Table Keys and Relationships
Most database tables will also be designed with a primary key, which is a combination of one or more fields that uniquely identify each record, from a logical perspective, across the database.
- As an example, a key may be a person’s social security number which is expected to be unique to one person.
A combined key uses multiple fields to create a key, such as house number, street name, city, state, and zip code.
Many database designers prefer to create a separate surrogate key, one that does not reuse other values and has no meaning, to avoid problems from potential changes in the data over time.
The presence of primary keys for each table enable “relationships” between tables to be created and used by the RDBMS.
As an example, an organization is interested in tracking data about purchase orders and the buyer and a product in a purchase order.
A database designer might create three tables to hold this information:
- An `ORDER` table with fields for the order number, the date of the purchase, the buyer_id, the product_id, the quantity purchased, and the purchase price. The primary key is `order_number`.
- A `BUYER` table with fields for the buyer_id, name, company_id, street address, state, postal code, country, telephone number, and email. The primary key is `buyer_id`.
- A `PRODUCT` table with fields for the product_id, name, weight, volume, purchase sizes, default_shipper_id, and standard price. The primary key is `product_id`.
Each table describes a single object or entity and each has a primary key to uniquely identify each record.
The `ORDER` table uses the primary keys from the `BUYER` table and the `PRODUCT` table to make a relationship with those tables.
The presence of a table’s primary key in a different table is called being a foreign key in the second table.
- The table with the primary key is known as a Parent table.
- The table that references the parent table (using the foreign key) is known as the child table.
- RDBMS use rules (constraints) to preserve data integrity so that someone editing or changing a child table cannot change or use a foreign key field that is not present in the parent table.
- In this example, there is a foreign key in the `BUYER` table, `company_id`, and a foreign key in the `PRODUCT` table, `default_shipper_id`.
If you were to put all this data into a single table, the attributes of the buyers and products would be duplicated many times over as additional orders were created.
This example illustrates how relationships allow database designers to create multiple tables to minimize the amount of redundant data in a system.
8.3.3 ER Diagrams
Tables and their relationships are usually described in the database schema.
They can be depicted using class or entity-relationship diagrams to show the relationships between tables, fields and keys.
- E-R diagrams can include additional information such as constraints and type of relationship, e.g., one-to-one, or one-to-many.
E-R diagrams can serve as a useful reference for understanding the structure of a large database and designing queries.
Figure 8.1 is an example of a simplified ER diagram (created in Quarto using Mermaid) showing the relationships between the tables and the associated primary and foreign keys.
Figure 8.2 shows a different style of E-R diagram commonly used for database schema. The arrows link individual fields as primary and foreign keys for the tables and show the type of the relationships.
8.3.4 Bringing Tables Together
Now that we can see the database schema, we can find where the data we want is located.
Suppose we want to answer the question, how many orders were placed in the last month? We can do that with just the data in the `ORDER` table.
However, suppose we want to answer the question, how many orders were placed from buyers in the states of Virginia and Maryland last month? We would have to get data from both the `ORDER` table and the `BUYER` table.
We get data from multiple tables by temporarily combining tables using a SQL query that creates a JOIN between two or more tables.
Database joins are the same joins you can execute using {dplyr} verbs, but the SQL syntax is slightly different.
If you are using the same join over and over in your analysis, SQL allows you to create a database “View” and save it. Then in the future, you can query the “view” instead of having to rejoin the tables.
- This is analogous to writing an R function to create a data frame based on joins and querying the results of the function instead of rewriting the code to join the data frames.
8.3.5 Database Design
Designing databases is its own area of expertise.
A poorly designed database will be slow and have insufficient rules (safeguards) to ensure data integrity.
We are used to operating as single users of our data since it is all in the memory of our machine and we can do what we want.
- A database may have thousands of users at the same time trying to update or retrieve records.
- Only one user can update a record at a time so databases use Record Locking to prevent others from updating the record or even retrieving it while someone else is updating the record.
- RDBMS can enforce locking at higher levels such as the table or even the entire database which can delay user queries. See All about locking in SQL Server for more details.
Database designers make trade-offs when designing tables to balance competing desires.
- Storage Efficiency: limit each table to a single entity to minimize the amount of redundant data. This can generate lots of tables.
- Maintenance Efficiency: The more tables you have, the fewer users might be trying to update the table at a time.
- Query Speed: Minimize the number of joins needed to make a query. This leads to fewer tables.
These trade-offs were recognized by Codd in his original paper.
- Formal descriptions of the ways to design databases are known as Normal forms.
- The process for determining what form is best for an application is known as Database Normalization. For more info see The Trade-offs Between Database Normalization and Denormalization. Explorer (2023)
Thus database designers may create tables that have redundant data if the most common queries join fields from those tables.
They will use database performance testing to optimize the queries and the database design to get the optimal cost-performance trade-offs. They can use other approaches, in addition to database normalization, to support data integrity.
Don’t be surprised to see E-R diagrams that have redundant data in them.
8.3.6 Relational Database Management Systems
8.3.6.1 Database Front and Back Ends
Databases have two main components.
- The Front-end user interface where users create code to generate queries and receive results.
- The Back-end server where the data is stored and the DBMS executes the queries to update or retrieve the data.
The Front-end and Back-end communicate through a Connection that uses an API for handling queries and data.
Many RDBMS bundle a Front-end and Back-end together. Examples include Oracle and SAP; DuckDB has DBeaver.
Most databases allow you to use a Front-end of your choice as long as you have a server driver you can use to configure and open a connection to the Back-end server of the database.
- RStudio can operate as a Front-end to RDBMS for which you have installed a server.
- Excel can operate as a Front-end to a Microsoft SQL Server database.
- DBeaver allows you to make a connection to many different DBMS servers.
- MariaDB identifies over 30 Front-ends that can work with it, including DuckDB’s DBeaver SQL IDE.
There are multiple solutions. We will use both RStudio and DBeaver.
8.3.6.2 Database Connections: ODBC and the {DBI} Package
The reason there is so much flexibility with connecting Front and Back ends is the standardization of the Microsoft Open Database Connectivity (ODBC) Standard. David-Engel (2023)
The ODBC standard was designed for “maximum interoperability”.
The construct is similar to the HTTP interface for web browsers. The DBMS developer creates a driver that serves as an API for their Back-end DBMS that is ODBC-compliant. Then the Front-end developer creates their software to also be ODBC-compliant.
That way, the Front-end can create a connection to and interact with any DBMS back end for which there is an ODBC-compliant driver.
The R {DBI} package (for Data Base Interface) supports the connection between R and over 25 different DBMS Back-ends.
- Popular Back-end packages include: {duckdb}, {RMariaDB}, {RPostgres}, {RSQLite}, and {bigrquery}.
- These packages include the ODBC-compliant driver for the given database.
If no one has written a package but an ODBC-compliant driver exists, the {DBI} package works with the {odbc} package and the driver to make the connection to the Back-end server.
8.3.6.3 Database Configuration Types
RDBMS can also be categorized based on their configuration or how front and back ends are physically connected and how the data is stored.
- In-process databases run on your computer. The data is stored locally and accessible without a network connection.
- They are optimized for the use case of a single user working with large amounts of data.
- Client-Server databases use a small front-end client on your computer (perhaps an IDE or in a browser) to connect to a robust server over a network. Data can be stored across a set of servers.
- They are optimized for the use case where multiple users can query the database at once and the RDBMS resolves contention.
- Cloud-databases also separate the front and back ends over a network but now the data may be distributed across multiple locations in the cloud.
- These are optimized for the use case of multiple users accessing a wide variety of very large data and enabling distributed and parallel computing.
With the proliferation of cloud computing and software as a service (SaaS), you can expect to see more and more hybrids of client-server and cloud databases.
As the amount of collected structured and unstructured data scales to Apache solutions for “Big Data”, the fundamental concepts of exploiting relationships in data using some version of SQL will continue to apply for the foreseeable future.
8.4 SQL Basics
- This section is based on work by Professor David Gerard.
We will use RStudio as the Front-end, DuckDB as the Back-end, and the {DBI} package to support the connection between the front and back ends.
8.4.1 Connecting to a Database
Before you can use SQL to interact with a database, you must have a way to connect the Front-end user interface to the Back-end server with the data.
- When working in R, the {DBI} package can create the connection and allow you to assign a name to it.
- Your database will be created by a specific DBMS which provides the Back-end server.
- If you are not using the Front-End that comes with the DBMS, you need an ODBC-compliant driver for the DBMS to make the connection.
We will use the DuckDB database and the {duckdb} Package for most of this section.
DuckDB is an in-process database so it has no external dependencies which makes it useful for single user analysis of large data. See Why DuckDB for their story.
You can install DuckDB directly on your computer and use the built-in Command Line Interface without using R.
However, the {duckdb} package installs the DuckDB DBMS as well as the ODBC-compliant driver for R.
Use the console to install {DBI} and {duckdb}.
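For example:

```{r}
#| eval: false
# Run once in the console; packages only need to be installed one time
install.packages(c("DBI", "duckdb"))
```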
To make the connection between R and DuckDB requires loading the two packages {DBI} and {duckdb}.
We also want to compare the SQL code (working with the database tables) with R code (working with data frames) so load {tidyverse} and {nycflights13} which has the same data as the database we will use.
- You will see the concepts for accessing data are quite similar but the syntax is different.
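A typical setup chunk:

```{r}
#| message: false
library(DBI)
library(duckdb)
library(tidyverse)
library(nycflights13)
```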
Download the DuckDB database {nycflights13} data from https://github.com/AU-datascience/data/blob/main/413-613/flights.duckdb and save it to a `data` folder under your current file’s working directory.
- {duckdb} only works with local databases, not over a URL.
Use `duckdb()` and `DBI::dbConnect()` to create a connection to `flights.duckdb` using a relative path to the database, as sketched below.
- Assign a name to it.
- Notice the connection to the database in the environment.
- You will use the `con` connection to interact with the database Back-end.
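A minimal sketch, assuming the database was saved as `data/flights.duckdb`:

```{r}
con <- DBI::dbConnect(
  duckdb::duckdb(),             # the DuckDB driver from {duckdb}
  dbdir = "data/flights.duckdb" # relative path to the database file
)
```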
Once you have a connection, you can create a basic SQL code chunk like this in R Markdown. In Quarto `.qmd` files, use the chunk option `#| connection: con`.
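For example, a chunk like this lists the tables in the database (output below):

```{sql}
#| connection: con
SHOW TABLES;
```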
name |
---|
airlines |
airports |
flights |
planes |
weather |
Whenever you are using a SQL chunk inside an R Markdown or Quarto document, you have to identify the connection in each code chunk so the SQL interpreter knows which database to interact with.
All the SQL code chunks in this section will have the `#| connection: my_con` chunk option, where `my_con` is the name of the appropriate connection.
8.4.2 Basic SQL Syntax
Whereas the tidyverse uses “verbs”, SQL uses “STATEMENTS”, which can have clauses, keywords, or other modifiers.
- SQL statements are used across the life cycle of a database: to create it, to create tables, to add, update, select, and delete data and tables, to establish rules, and to delete a database.
We are most interested in `SELECT` statements, as that is how we get data from a database table or a set of joined tables.
The `SELECT` statement identifies which fields are to be returned.
`SELECT` has several other clauses or keywords to support tailoring the data returned by a query.
- `FROM`: required, to identify the table or set of tables where the data will be located.
- `WHERE`: a clause used to filter which records (rows) should be returned.
- `ORDER BY`: a keyword for sorting records. The default is ascending order.
- `GROUP BY`: a keyword akin to `dplyr::group_by()`, used for creating summaries of data.
Other useful statements include:
- `AS`: used for renaming or creating aliases.
- `IN`: similar to R’s `%in%`.
SQL uses `=` for comparison instead of `==`, since you don’t assign anything in SQL.
- Use `!=` for not equal.
- SQL uses `NULL` instead of `NA`.
- SQL uses `AND`, `OR`, and `NOT` as logical operators.
There is a standard for SQL: SQL-92. However, many database developers see this as a point of departure rather than a destination.
- They use ODBC and SQL-92 to develop their own versions that are optimized for their intended use cases and database design.
- Thus, there are many flavors of SQL in the world.
Be sure to review the SQL implementation for the DBMS you are using to be aware of any special features.
If you use proprietary SQL, it may give you performance improvements, but it will also lock you into that DBMS and you will have to revise your code if the Back-end DBMS is changed.
Like all languages, SQL has standard syntax and reserved words.
Rules to remember:
- Case does not matter (i.e., `select` is the same as `SELECT` is the same as `SeLeCt`), but it is standard to write all statements in UPPERCASE (e.g., `SELECT`).
- Order matters. A SQL `SELECT` statement’s clauses and keywords must be in the following order: `SELECT`, `FROM`, `WHERE`, `GROUP BY`, `ORDER BY`.
- New lines and white space don’t matter, but it is common to put statements, clauses, and keywords on new lines with spaces to preserve a “river” of white space between the keywords and the values.
- Character string values must be in single quotes.
- You can use invalid variable names by putting them in double quotes (same as using back ticks in R).
  - Some folks always use double quotes because it is not always clear what is an invalid variable name in the database management system. This is what you see below.
- Comments in SQL start with two hyphens `--`.
- Some DBMS require a semicolon at the end of each statement.
  - It is good practice to put a semicolon `;` at the end of a SQL statement. This also allows you to pass multiple SQL statements to the database at once, e.g., in one chunk.
8.5 Common SQL Actions and their Syntax
- This section is based on work by Professor David Gerard.
8.5.1 Show and Describe Tables in the Database
The `SHOW TABLES` command can be used to get a list of all of the tables in the database, e.g.:
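```{sql}
#| connection: con
SHOW TABLES;
```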
name |
---|
airlines |
airports |
flights |
planes |
weather |
The `DESCRIBE` command can be used to show the tables and their variables (in list columns), e.g.:
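```{sql}
#| connection: con
DESCRIBE;
```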
database | schema | name | column_names | column_types | temporary |
---|---|---|---|---|---|
flights | main | airlines | carrier, name | VARCHAR, VARCHAR | FALSE |
flights | main | airports | faa , name , lat , lon , alt , tz , dst , tzone | VARCHAR, VARCHAR, DOUBLE , DOUBLE , DOUBLE , DOUBLE , VARCHAR, VARCHAR | FALSE |
flights | main | flights | year , month , day , dep_time , sched_dep_time, dep_delay , arr_time , sched_arr_time, arr_delay , carrier , flight , tailnum , origin , dest , air_time , distance , hour , minute , time_hour | INTEGER , INTEGER , INTEGER , INTEGER , INTEGER , DOUBLE , INTEGER , INTEGER , DOUBLE , VARCHAR , INTEGER , VARCHAR , VARCHAR , VARCHAR , DOUBLE , DOUBLE , DOUBLE , DOUBLE , TIMESTAMP | FALSE |
flights | main | planes | tailnum , year , type , manufacturer, model , engines , seats , speed , engine | VARCHAR, INTEGER, VARCHAR, VARCHAR, VARCHAR, INTEGER, INTEGER, INTEGER, VARCHAR | FALSE |
flights | main | weather | origin , year , month , day , hour , temp , dewp , humid , wind_dir , wind_speed, wind_gust , precip , pressure , visib , time_hour | VARCHAR , INTEGER , INTEGER , INTEGER , INTEGER , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , DOUBLE , TIMESTAMP | FALSE |
8.5.2 Select Fields (Columns) From a Table
To return specific fields (columns) from a table, use the `SELECT` statement.
- The `SELECT` statement must always be followed by a `FROM` statement which identifies the source of the data.
- That source is a table, a set of joined tables, or a view.
- The data returned by executing a `SELECT` statement is stored in a result table, called the result-set. This means you can use it in a `FROM` clause.
- A `SELECT` statement can have other clauses and keywords to reduce the amount of data that is returned.
Let’s use this syntax to get the data in the fields `tailnum`, `year`, and `model` from the `planes` table.
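A query along these lines produces the output below (only the first records are printed):

```{sql}
#| connection: con
SELECT "tailnum", "year", "model"
FROM planes;
```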
tailnum | year | model |
---|---|---|
N10156 | 2004 | EMB-145XR |
N102UW | 1998 | A320-214 |
N103US | 1999 | A320-214 |
N104UW | 1999 | A320-214 |
N10575 | 2002 | EMB-145LR |
N105UW | 1999 | A320-214 |
N107US | 1999 | A320-214 |
N108UW | 1999 | A320-214 |
N109UW | 1999 | A320-214 |
N110UW | 1999 | A320-214 |
- R equivalent
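A sketch of the corresponding {dplyr} code:

```{r}
#| eval: false
planes |>
  select(tailnum, year, model)
```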
You can select every field in a table by using the wildcard `*`, e.g., `SELECT *`.
For tables with many fields, select only the specific fields of interest unless you are certain the table has only the fields of interest.
This has two benefits:
- This makes your code more robust, as `SELECT *` will select every field and you don’t know what someone else may have done to the table.
- On tables with many fields this can make your query much faster in two ways: in actually selecting the data, and in limiting the amount of data that must be returned through the connection.
- For in-process databases that run on your computer, the connection speed is less of a concern.
- However, for the other database types, where the connection is over a network, using `SELECT *` can mean you are trying to return a lot of unneeded data through the connection, slowing you (and others) down.
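For example (output below):

```{sql}
#| connection: con
SELECT *
FROM planes;
```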
tailnum | year | type | manufacturer | model | engines | seats | speed | engine |
---|---|---|---|---|---|---|---|---|
N10156 | 2004 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NA | Turbo-fan |
N102UW | 1998 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N103US | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N104UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N10575 | 2002 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NA | Turbo-fan |
N105UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N107US | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N108UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N109UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
N110UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
- R equivalent: the whole data frame, `planes`.
- There is no equivalent for excluding fields (like `dplyr::select(-year)`). Just select the ones you want.
8.5.3 Select Distinct or Unique Records
Some tables may allow duplicate records or have many records with duplicate values if only a few fields are selected.
To get distinct records, use the statement `SELECT DISTINCT field1, field2`.
The following example will return all unique destinations.
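```{sql}
#| connection: con
SELECT DISTINCT "dest"
FROM flights;
```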
8.5.4 Reduce the Number of Records that are Returned
8.5.4.1 Filter Records (Rows) by Value
You can use a `WHERE` clause after the `FROM` inside a `SELECT` statement to filter which subset of records will be returned, based on the values in the record (row).
- The `WHERE` clause requires an expression that returns a logical value.
```{sql}
#| connection: con
SELECT "flight", "distance", "origin", "dest"
FROM flights
WHERE "distance" < 50;
```
flight | distance | origin | dest |
---|---|---|---|
1632 | 17 | EWR | LGA |
- R equivalent:
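A sketch of the corresponding {dplyr} code:

```{r}
#| eval: false
flights |>
  select(flight, distance, origin, dest) |>
  filter(distance < 50)
```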
To test for equality, use a single equal sign, `=`.
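For example (output below):

```{sql}
#| connection: con
SELECT "flight", "month"
FROM flights
WHERE "month" = 12;
```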
flight | month |
---|---|
745 | 12 |
839 | 12 |
1895 | 12 |
1487 | 12 |
2243 | 12 |
939 | 12 |
3819 | 12 |
1441 | 12 |
2167 | 12 |
605 | 12 |
- R equivalent
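In {dplyr}, something like:

```{r}
#| eval: false
flights |>
  select(flight, month) |>
  filter(month == 12)
```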
To use a character string, you must put single quotes around the string, not double quotes.
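For example (output below):

```{sql}
#| connection: con
SELECT "flight", "origin"
FROM flights
WHERE "origin" = 'JFK';
```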
flight | origin |
---|---|
1141 | JFK |
725 | JFK |
79 | JFK |
49 | JFK |
71 | JFK |
194 | JFK |
1806 | JFK |
1743 | JFK |
303 | JFK |
135 | JFK |
- R equivalent
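In {dplyr}, something like:

```{r}
#| eval: false
flights |>
  select(flight, origin) |>
  filter(origin == "JFK")
```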
You can filter records based on multiple criteria and the logical operators.
Here is an example of using the AND
logical operator.
```{sql}
#| connection: con
SELECT "flight", "origin", "dest"
FROM flights
WHERE "origin" = 'JFK' AND "dest" = 'CMH';
```
flight | origin | dest |
---|---|---|
4146 | JFK | CMH |
3783 | JFK | CMH |
4146 | JFK | CMH |
3783 | JFK | CMH |
4146 | JFK | CMH |
3783 | JFK | CMH |
4146 | JFK | CMH |
3783 | JFK | CMH |
4146 | JFK | CMH |
3650 | JFK | CMH |
- R equivalent:
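Something like:

```{r}
#| eval: false
flights |>
  select(flight, origin, dest) |>
  filter(origin == "JFK", dest == "CMH")
```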
You can use the `OR` and `NOT` logical operators too.
- Put parentheses around the portions of the expression to ensure you get the desired order of operations.
```{sql}
#| connection: con
SELECT
DISTINCT "flight",
"origin",
"dest"
FROM
flights
WHERE
NOT ("origin" = 'JFK'
OR "origin" = 'LGA')
AND "dest" = 'CMH';
```
flight | origin | dest |
---|---|---|
3851 | EWR | CMH |
3853 | EWR | CMH |
5906 | EWR | CMH |
5664 | EWR | CMH |
4275 | EWR | CMH |
4147 | EWR | CMH |
4318 | EWR | CMH |
5675 | EWR | CMH |
3843 | EWR | CMH |
4240 | EWR | CMH |
- R equivalent
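Something like:

```{r}
#| eval: false
flights |>
  filter(!(origin %in% c("JFK", "LGA")), dest == "CMH") |>
  distinct(flight, origin, dest)
```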
8.5.4.2 Remove Records with Missing Data
Missing data is NULL
in SQL (instead of NA
).
We can explicitly remove records with missing values with the clause WHERE field IS NOT NULL
.
```{sql}
#| connection: con
SELECT "flight", "dep_delay"
FROM flights
WHERE "dep_delay" IS NOT NULL;
```
flight | dep_delay |
---|---|
1545 | 2 |
1714 | 4 |
1141 | 2 |
725 | -1 |
461 | -6 |
1696 | -4 |
507 | -5 |
5708 | -3 |
79 | -3 |
301 | -2 |
- R equivalent
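Something like:

```{r}
#| eval: false
flights |>
  select(flight, dep_delay) |>
  filter(!is.na(dep_delay))
```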
Use `IS NULL` instead if you want only the records that are missing data.
8.5.4.3 Restrict the Number of Records Returned
When you are building a query, you often want to subset the rows while you are finishing it (you don’t want to return the whole table each time you are troubleshooting a query).
Use the `LIMIT` output modifier at the end of the query to restrict the number of records that are returned.
- Some flavors of SQL allow `SELECT TOP` and `SELECT TOP PERCENT` statements to restrict to the top number/percentage of records.
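For example, just the first five records (output below):

```{sql}
#| connection: con
SELECT "flight", "origin", "dest"
FROM flights
LIMIT 5;
```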
flight | origin | dest |
---|---|---|
1545 | EWR | IAH |
1714 | LGA | IAH |
1141 | JFK | MIA |
725 | JFK | BQN |
461 | LGA | ATL |
If you want a limited number of records, but not just the first ones, use the `OFFSET` keyword to identify how many records to skip before starting to collect the records.
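A sketch (the offset of 100 here is arbitrary):

```{sql}
#| connection: con
SELECT "flight", "origin", "dest"
FROM flights
LIMIT 5 OFFSET 100;
```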
8.5.4.4 Return a Random Sample of Records
You can also randomly sample rows with the clause `USING SAMPLE`.
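For example, five randomly chosen records (output below):

```{sql}
#| connection: con
SELECT "flight", "origin", "dest"
FROM flights
USING SAMPLE 5;
```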
flight | origin | dest |
---|---|---|
3459 | JFK | BNA |
1253 | EWR | ORD |
4368 | EWR | DAY |
2140 | LGA | BOS |
1718 | LGA | IAH |
Depending upon your version of SQL, there may be a modifier to `SAMPLE` that allows you to set the random number seed so the sample is repeatable.
DuckDB has multiple sampling methods. The following uses `bernoulli` sampling to get a small number of rows for a small percentage (just to avoid many rows in the output).
- Note that repeating the code above changes the selected rows, whereas repeating the following does not.
```{sql}
#| connection: con
SELECT "flight", "origin", "dest"
FROM flights
USING SAMPLE .0027% (bernoulli, 377);
```
flight | origin | dest |
---|---|---|
1164 | EWR | FLL |
1160 | LGA | MIA |
1281 | EWR | CLT |
77 | JFK | FLL |
1733 | LGA | CLT |
When working with tables with more than a few hundred records, we recommend using one of these methods when initially building your query to speed up the return of results, especially if the connection is over a network.
Remove the restrictions when the query is ready for use on the full table.
8.5.5 Arrange Rows
Use the `ORDER BY` clause to sort the records that are returned.
- The default order is ascending.
Let’s remove missing values so we can see the ordering.
```{sql}
#| connection: con
SELECT "flight", "dep_delay"
FROM flights
WHERE "dep_delay" IS NOT NULL
ORDER BY "dep_delay"
LIMIT 20;
```
flight | dep_delay |
---|---|
97 | -43 |
1715 | -33 |
5713 | -32 |
1435 | -30 |
837 | -27 |
3478 | -26 |
4361 | -25 |
4573 | -25 |
2223 | -24 |
3318 | -24 |
- R equivalent
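In {dplyr}, something like:

```{r}
#| eval: false
flights |>
  select(flight, dep_delay) |>
  filter(!is.na(dep_delay)) |>
  arrange(dep_delay)
```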
Use the modifier `DESC` after the field name to arrange in descending order.
```{sql}
#| connection: con
SELECT "flight", "dep_delay"
FROM flights
WHERE "dep_delay" IS NOT NULL
ORDER BY "dep_delay" DESC;
```
flight | dep_delay |
---|---|
51 | 1301 |
3535 | 1137 |
3695 | 1126 |
177 | 1014 |
3075 | 1005 |
2391 | 960 |
2119 | 911 |
2007 | 899 |
2047 | 898 |
172 | 896 |
- R equivalent
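In {dplyr}, something like:

```{r}
#| eval: false
flights |>
  select(flight, dep_delay) |>
  filter(!is.na(dep_delay)) |>
  arrange(desc(dep_delay))
```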
You break ties by adding more variables to the `ORDER BY` clause.
```{sql}
#| connection: con
SELECT "flight", "origin", "dep_delay"
FROM flights
WHERE "dep_delay" IS NOT NULL
ORDER BY "origin" DESC, "dep_delay"
LIMIT 20;
```
flight | origin | dep_delay |
---|---|---|
1715 | LGA | -33 |
5713 | LGA | -32 |
1435 | LGA | -30 |
837 | LGA | -27 |
3478 | LGA | -26 |
4573 | LGA | -25 |
2223 | LGA | -24 |
375 | LGA | -24 |
4065 | LGA | -24 |
1371 | LGA | -23 |
- R equivalent
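In {dplyr}, something like:

```{r}
#| eval: false
flights |>
  select(flight, origin, dep_delay) |>
  filter(!is.na(dep_delay)) |>
  arrange(desc(origin), dep_delay)
```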
8.5.6 Create, Transform, or Mutate Fields
In SQL, you can create, transform, or mutate the fields that will appear in a returned record-set based on existing fields in the table.
- You are not updating the original table; you are creating an alias for a new temporary field that will disappear after the query is run.
Use the keyword `AS`, inside a `SELECT` statement, to specify the alias (the name for the new field).
Let’s calculate average speed from the `flights` table. We’ll also keep the flight number, distance, and air time variables.
```{sql}
#| connection: con
SELECT
"flight",
"distance" / "air_time" AS "speed",
"distance",
"air_time"
FROM
flights
LIMIT 10;
```
flight | speed | distance | air_time |
---|---|---|---|
1545 | 6.167401 | 1400 | 227 |
1714 | 6.237885 | 1416 | 227 |
1141 | 6.806250 | 1089 | 160 |
725 | 8.612022 | 1576 | 183 |
461 | 6.568966 | 762 | 116 |
1696 | 4.793333 | 719 | 150 |
507 | 6.740506 | 1065 | 158 |
5708 | 4.320755 | 229 | 53 |
79 | 6.742857 | 944 | 140 |
301 | 5.311594 | 733 | 138 |
- R equivalent:
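In {dplyr}, something like:

```{r}
#| eval: false
flights |>
  select(flight, distance, air_time) |>
  mutate(speed = distance / air_time, .after = flight)
```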
SQL has many functions for transforming values. Common ones include:
- `LN()`: natural log transformation.
- `EXP()`: exponentiation.
- `SQRT()`: square root.
- `POW()`: power transformation. `POW(2.0, x)` would be \(2^x\); `POW(x, 2.0)` would be \(x^2\).
Beware of integer division during transformations: integer division truncates results to integers.
```{sql}
#| connection: con
SELECT
DISTINCT "month",
"day",
"day" // "month" AS "ratio"
FROM
flights
WHERE
"month" >= 5
ORDER BY
"month",
"day";
```
month | day | ratio |
---|---|---|
5 | 1 | 0 |
5 | 2 | 0 |
5 | 3 | 0 |
5 | 4 | 0 |
5 | 5 | 1 |
5 | 6 | 1 |
5 | 7 | 1 |
5 | 8 | 1 |
5 | 9 | 1 |
5 | 10 | 2 |
Use the `CAST()` function to explicitly convert an integer to a double before operations that could produce doubles.
```{sql}
#| connection: con
SELECT
DISTINCT "month",
"day",
CAST("day" AS DOUBLE) / CAST("month" AS DOUBLE) AS "ratio"
FROM
flights
WHERE
"month" >= 5
ORDER BY
"month",
"day";
```
month | day | ratio |
---|---|---|
5 | 1 | 0.2 |
5 | 2 | 0.4 |
5 | 3 | 0.6 |
5 | 4 | 0.8 |
5 | 5 | 1.0 |
5 | 6 | 1.2 |
5 | 7 | 1.4 |
5 | 8 | 1.6 |
5 | 9 | 1.8 |
5 | 10 | 2.0 |
- Note: DuckDB has two different division operators that operate differently if both numbers are integers.
  - The single slash, `/`, will convert to floating point numbers for the division.
  - The double slash, `//`, will use integer division.
  - See DuckDB Numeric Operators.
8.5.7 Summarize Fields with Aggregate Functions
SQL has functions for summarizing data (SQL calls these “Aggregates”).
Aggregate functions must be inside a `SELECT` statement which identifies the fields of interest.
- Aggregates are similar to the functions used inside a `summarize()` in R.
- They calculate results for the selected fields and aggregate (collapse) the records to a single record.
- By default, all records with missing data (`NULL`) are ignored (like setting `na.rm = TRUE`) and there is no warning or message.
Some of the common aggregates are:
- `COUNT()`: count the number of rows.
- `AVG()`: calculate the average.
- `MEDIAN()`: median (not standard across all DBMSs).
- `SUM()`: summation.
- `MIN()`: minimum.
- `MAX()`: maximum.
- `STDDEV()`: standard deviation.
- `VARIANCE()`: variance.
- `CORR(y, x)`: correlation between `x` and `y`.
Let’s calculate the average and standard deviation for departure delays across all records in the table.
```{sql}
#| connection: con
SELECT
AVG("dep_delay"),
STDDEV("dep_delay") AS "sd_dep_delay",
COUNT("dep_delay") AS "num_records"
FROM
flights;
```
avg(dep_delay) | sd_dep_delay | num_records |
---|---|---|
12.63907 | 40.21006 | 328521 |
- Note only one record is returned, with fields named by default after the function call, but you can create aliases using the `AS` statement.
- Note how SQL automatically drops records with a `NULL` value before counting the records.
- R equivalent:
  - Note how R’s `n()` counts the number of records that went into the summary before any records are removed by `is.na()`.
  - You need to filter the data in R to remove the `NA`s before sending it to `summarize()` to get the same count, as sketched below.
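A sketch of matching R code:

```{r}
#| eval: false
flights |>
  filter(!is.na(dep_delay)) |> # drop the NAs first to match SQL's count
  summarize(
    avg_dep_delay = mean(dep_delay),
    sd_dep_delay = sd(dep_delay),
    num_records = n()
  )
```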
8.5.7.1 Use the GROUP BY Statement to Aggregate Results across Groups of Records
`GROUP BY` identifies one or more fields by which to aggregate the records.
- This is similar (but not exactly the same) as using `summarize()` inside a `group_by()` in R.
Let’s summarize the average and standard deviation of the departure delays by each origin and month.
```{sql}
#| connection: con
SELECT
"origin",
"month",
AVG("dep_delay") AS "avg_dep_delay",
STDDEV("dep_delay") AS "sd_dep_delay"
FROM
flights
GROUP BY
"origin",
"month"
ORDER BY
"origin",
"month";
```
origin | month | avg_dep_delay | sd_dep_delay |
---|---|---|---|
EWR | 1 | 14.905748 | 40.80044 |
EWR | 2 | 13.067263 | 37.15890 |
EWR | 3 | 18.102457 | 44.09987 |
EWR | 4 | 17.400058 | 43.86926 |
EWR | 5 | 15.366141 | 38.95333 |
EWR | 6 | 22.470810 | 50.78085 |
EWR | 7 | 22.035112 | 49.50592 |
EWR | 8 | 13.493714 | 37.60204 |
EWR | 9 | 7.290953 | 34.98619 |
EWR | 10 | 8.636436 | 32.73880 |
- Now one record is returned for each of the 36 groups.
- Note the order of the fields `origin` and `month` is based on the order of the `SELECT`, not the `GROUP BY`.
Change the order of the fields in the `SELECT`.
```{sql}
#| connection: con
SELECT
"month",
"origin",
AVG("dep_delay") AS "avg_dep_delay",
STDDEV("dep_delay") AS "sd_dep_delay"
FROM
flights
GROUP BY
"origin",
"month"
ORDER BY
"origin",
"month";
```
month | origin | avg_dep_delay | sd_dep_delay |
---|---|---|---|
1 | EWR | 14.905748 | 40.80044 |
2 | EWR | 13.067263 | 37.15890 |
3 | EWR | 18.102457 | 44.09987 |
4 | EWR | 17.400058 | 43.86926 |
5 | EWR | 15.366141 | 38.95333 |
6 | EWR | 22.470810 | 50.78085 |
7 | EWR | 22.035112 | 49.50592 |
8 | EWR | 13.493714 | 37.60204 |
9 | EWR | 7.290953 | 34.98619 |
10 | EWR | 8.636436 | 32.73880 |
Changing the order of the fields in the `GROUP BY` does not change the sequence of the fields or the results.
```{sql}
#| connection: con
SELECT
"month",
"origin",
AVG("dep_delay") AS "avg_dep_delay",
STDDEV("dep_delay") AS "sd_dep_delay"
FROM
flights
GROUP BY
"month",
"origin"
ORDER BY
"origin",
"month";
```
month | origin | avg_dep_delay | sd_dep_delay |
---|---|---|---|
1 | EWR | 14.905748 | 40.80044 |
2 | EWR | 13.067263 | 37.15890 |
3 | EWR | 18.102457 | 44.09987 |
4 | EWR | 17.400058 | 43.86926 |
5 | EWR | 15.366141 | 38.95333 |
6 | EWR | 22.470810 | 50.78085 |
7 | EWR | 22.035112 | 49.50592 |
8 | EWR | 13.493714 | 37.60204 |
9 | EWR | 7.290953 | 34.98619 |
10 | EWR | 8.636436 | 32.73880 |
- R equivalent (sketched after this list)
- Note the order of `origin` and `month`.
- Changing the order in the `select()` does not matter in R.
- Changing the order in the `group_by()` does matter in how the fields are arranged by default.
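A sketch of the corresponding R code:

```{r}
#| eval: false
flights |>
  group_by(origin, month) |>
  summarize(
    avg_dep_delay = mean(dep_delay, na.rm = TRUE),
    sd_dep_delay = sd(dep_delay, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(origin, month)
```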
8.5.8 Window Functions are for Creating Grouped Summaries without Aggregating the Records
There may be times when you want to add a grouped summary to the data without collapsing the data into one record for each group.
- As an example, you want to calculate the total departure delay time for each destination so you can use it to then calculate the percentage of that departure delay time for each airline flying to that destination.
- You could do that with a summarized data frame and then a join.
- In R, this can be done by using `group_by()` and then using `mutate()` instead of `summarize()` to add a new column to the data frame with the result for each group while preserving all of the rows.
In SQL, you use the aggregate function you want in what SQL calls a Window Function.
- A window function uses an aggregate function to calculate a result for one or more fields across a specified set of records, and it adds the results as new fields in the result set (table). It does not collapse the records, so the result set has all the original un-aggregated records along with the new aggregated results (for each set of records).
- All the records in a given set of records have the same values in the new aggregated fields.
To use an aggregate function as a window function, use the `OVER` clause with a `(PARTITION BY myfield)` modifier.
- The `PARTITION BY` clause is a sub-clause of the `OVER` clause. It breaks up (partitions) the table into separate tables based on one or more chosen fields. This is similar to creating a grouped data frame in R with one or more groups.
- Then, the aggregate function is applied to the records in each partition and a new field is added to the records in that partition with the aggregated value.
Let’s find the flight numbers for the longest flights (in terms of air time) from each airport (`origin`).
- We will specify `origin` as the partitioning (group by) field.
- We will specify the alias `amax` as the name of the new field added to the table.
```{sql}
#| connection: con
SELECT
"flight",
"origin",
"air_time",
MAX("air_time")
OVER (PARTITION BY "origin") AS "amax"
FROM
flights
LIMIT 10 OFFSET 120827;
```
flight | origin | air_time | amax |
---|---|---|---|
1100 | EWR | 324 | 695 |
449 | EWR | 82 | 695 |
4382 | EWR | 87 | 695 |
4691 | EWR | 114 | 695 |
1037 | EWR | 330 | 695 |
1266 | EWR | 301 | 695 |
1199 | EWR | 38 | 695 |
3709 | EWR | 114 | 695 |
4407 | EWR | 164 | 695 |
580 | EWR | 192 | 695 |
- Note the `amax` values are the same for all the records for each origin.
- See SQL Window Functions or SQL PARTITION BY Clause overview for other examples.
This approach can be useful as a sub-query (or inner nested query) inside the outer query’s `FROM` clause.
Here is how you find the flight numbers and destination for the longest flights (in terms of air time) from each airport after using the sub-query to find the longest air time from each origin.
```{sql}
#| label: sub-query-1
#| connection: con
SELECT
"flight",
"origin",
"dest",
"air_time"
FROM
(
SELECT
"flight",
"origin",
"dest",
"air_time",
MAX("air_time") OVER (PARTITION BY "origin") AS "amax"
FROM
flights
)
WHERE
"air_time" = "amax"
```
flight | origin | dest | air_time |
---|---|---|---|
51 | JFK | HNL | 691 |
15 | EWR | HNL | 695 |
745 | LGA | DEN | 331 |
- Note that all fields used in the outer query must be returned by the inner query.
8.5.9 Recode Field Values
SQL has a `CASE` expression you can use with a `WHEN` clause to recode the values in the records in a field.
- Like `dplyr::case_when()`, you use the `WHEN` clause to specify a logical expression, with the `THEN` keyword specifying the new value.
- `CASE` will stop checking at the first `TRUE`.
Use the `CASE WHEN` syntax to recode values.
```{sql}
#| label: case-when
#| connection: con
SELECT
"flight",
"origin",
CASE
WHEN ("origin" = 'JFK') THEN 'John F. Kennedy'
WHEN ("origin" = 'LGA') THEN 'LaGuardia'
WHEN ("origin" = 'EWR') THEN 'Newark Liberty'
ELSE 'WHO KNOWS'
END
AS "orig_long"
FROM
flights
LIMIT 10;
```
flight | origin | orig_long |
---|---|---|
1545 | EWR | Newark Liberty |
1714 | LGA | LaGuardia |
1141 | JFK | John F. Kennedy |
725 | JFK | John F. Kennedy |
461 | LGA | LaGuardia |
1696 | EWR | Newark Liberty |
507 | EWR | Newark Liberty |
5708 | LGA | LaGuardia |
79 | JFK | John F. Kennedy |
301 | LGA | LaGuardia |
- R equivalent using either `case_when()` or `case_match()`.
```{r}
#| eval: false
flights |>
select(flight, origin) |>
mutate(orig_long = case_when(
origin == "JFK" ~ "John F. Kennedy",
origin == "LGA" ~ "LaGuardia",
origin == "EWR" ~ "Newark Liberty",
.default = "other"
))
## or for value matching
flights |>
select(flight, origin) |>
mutate(orig_long = case_match(
origin,
"JFK" ~ "John F. Kennedy",
"LGA" ~ "LaGuardia",
"EWR" ~ "Newark Liberty",
.default = "other"
))
```
You can also use `CASE WHEN` to recode based on other logical operations.
```{sql}
#| connection: con
SELECT
"flight",
"air_time",
CASE
WHEN ("air_time" > 2500) THEN 'Long'
WHEN ("air_time" <= 2500) THEN 'Short'
END
AS "qual_dist"
FROM
flights;
```
flight | air_time | qual_dist |
---|---|---|
1545 | 227 | Short |
1714 | 227 | Short |
1141 | 160 | Short |
725 | 183 | Short |
461 | 116 | Short |
1696 | 150 | Short |
507 | 158 | Short |
5708 | 53 | Short |
79 | 140 | Short |
301 | 138 | Short |
- R equivalent.
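Something like:

```{r}
#| eval: false
flights |>
  select(flight, air_time) |>
  mutate(qual_dist = if_else(air_time > 2500, "Long", "Short"))
```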
8.5.10 Joining Tables
To join tables, use a `SELECT` statement and write out all of the columns in both tables you are joining.
- If the tables share one or more column names, you need to distinguish between the two fields via `table1."var"` or `table2."var"`.
Use the `LEFT JOIN` clause to declare a left join, and `ON` to declare the keys.
```{sql}
#| connection: con
-- flight is from the flights table
-- type is from the planes table
-- both tables have a tailnum column, so we need to tell them apart
-- if you list both tailnums in SELECT, you'll get two tailnum columns
SELECT
"flight",
flights."tailnum",
"type"
FROM
flights
JOIN planes
ON
flights."tailnum" = planes."tailnum";
```
flight | tailnum | type |
---|---|---|
461 | N693DL | Fixed wing multi engine |
4424 | N19966 | Fixed wing multi engine |
6177 | N34111 | Fixed wing multi engine |
731 | N319NB | Fixed wing multi engine |
684 | N809UA | Fixed wing multi engine |
1279 | N328NB | Fixed wing multi engine |
1691 | N34137 | Fixed wing multi engine |
1447 | N117UW | Fixed wing multi engine |
3574 | N790SW | Fixed wing multi engine |
3351 | N711MQ | Fixed wing multi engine |
- R equivalent:
```{r}
planes |>
select(tailnum, type) ->
planes2
flights |>
select(flight, tailnum) |>
left_join(planes2, by = "tailnum")
```
# A tibble: 336,776 × 3
flight tailnum type
<int> <chr> <chr>
1 1545 N14228 Fixed wing multi engine
2 1714 N24211 Fixed wing multi engine
3 1141 N619AA Fixed wing multi engine
4 725 N804JB Fixed wing multi engine
5 461 N668DN Fixed wing multi engine
6 1696 N39463 Fixed wing multi engine
7 507 N516JB Fixed wing multi engine
8 5708 N829AS Fixed wing multi engine
9 79 N593JB Fixed wing multi engine
10 301 N3ALAA <NA>
# ℹ 336,766 more rows
The other joins are:
- `RIGHT JOIN`
- `INNER JOIN`
- `FULL OUTER JOIN`, which you should avoid using if possible.
8.5.11 Subqueries
SQL has the capability to allow you to embed or nest queries within queries. There are several varieties as you can see in How to write subqueries in SQL.
- As an example, if you need data from a “table” that does not exist, you can use a query to create the new table (virtually) and insert it into the FROM clause of an outer query.
- As stated in R for Data Science (2ed), “A subquery is just a query used as a data source in the FROM clause, instead of the usual table.”
To delineate the query as a subquery, you put it inside parentheses `(...)`. As in R, the parentheses force the subquery to be executed before other parts of the query can be evaluated.
If you are using the results in other parts of a complicated query, you must also assign an “alias” for the result. This is similar to using a pronoun in R piping; the alias represents the object returned by the subquery.
A silly example of using a subquery to create a table with just the JFK flights:
- Note this does not assign an alias, as we are not using the results elsewhere.
```{sql}
#| connection: con
SELECT
"flight",
"tailnum",
"origin"
FROM
(SELECT *
FROM flights
WHERE "origin" = 'JFK'
LIMIT 20)
```
flight | tailnum | origin |
---|---|---|
1141 | N619AA | JFK |
725 | N804JB | JFK |
79 | N593JB | JFK |
49 | N793JB | JFK |
71 | N657JB | JFK |
194 | N29129 | JFK |
1806 | N708JB | JFK |
1743 | N3739P | JFK |
303 | N532UA | JFK |
135 | N635JB | JFK |
Subqueries can also be very useful when combined with JOINS.
```{sql}
#| connection: con
SELECT "flight",
fli."tailnum",
"origin",
pla."manufacturer"
FROM flights
AS fli
JOIN
(SELECT *
FROM planes
WHERE "manufacturer" = 'AIRBUS INDUSTRIE')
AS pla
ON
fli."tailnum"=pla."tailnum"
WHERE fli."origin"='JFK'
LIMIT 20
```
flight | tailnum | origin | manufacturer |
---|---|---|---|
1447 | N117UW | JFK | AIRBUS INDUSTRIE |
2190 | N312US | JFK | AIRBUS INDUSTRIE |
867 | N332NB | JFK | AIRBUS INDUSTRIE |
1831 | N183UW | JFK | AIRBUS INDUSTRIE |
189 | N520JB | JFK | AIRBUS INDUSTRIE |
1444 | N701UW | JFK | AIRBUS INDUSTRIE |
1 | N509JB | JFK | AIRBUS INDUSTRIE |
931 | N119US | JFK | AIRBUS INDUSTRIE |
2043 | N332NB | JFK | AIRBUS INDUSTRIE |
325 | N505JB | JFK | AIRBUS INDUSTRIE |
8.5.12 Reshaping Data
Recall {tidyr} has several verbs for reshaping rectangular data: `pivot_wider()`, `pivot_longer()`, `separate_*()`, and `unite()`.
Reshaping is rarely done in SQL, as people use SQL to query a subset of data which they can then reshape much more easily using other methods (R, Python, Excel, etc.).
You can create a series of `CASE WHEN` statements to custom-build an equivalent to a pivot.
If you are interested in reshaping, especially with an intent to aggregate the data, the various SQL flavors have different approaches.
As an example, SQL Server has the statements `PIVOT` and `UNPIVOT`. See FROM - Using PIVOT and UNPIVOT.
8.6 Working with Tables
8.6.1 Creating Tables
You can create new tables using SQL.
Let’s create a new, temporary, connection we can use to create tables in a new Back-end database.
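A sketch, using an in-memory database:

```{r}
# With no dbdir argument, duckdb() creates a temporary in-memory database
tmpcon <- DBI::dbConnect(duckdb::duckdb())
```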
Use `CREATE TABLE` to declare new tables and `INSERT INTO` to insert new records into that table, as sketched below.
- Note the semicolon `;` after each SQL statement in the same code chunk.
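For example, this sketch recreates {tidyr}’s `table4a` (the values come from `tidyr::table4a`):

```{sql}
#| connection: tmpcon
CREATE TABLE table4a ("country" VARCHAR, "1999" INTEGER, "2000" INTEGER);
INSERT INTO table4a VALUES
    ('Afghanistan', 745, 2666),
    ('Brazil', 37737, 80488),
    ('China', 212258, 213766);
```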
`DBI::dbWriteTable()` will add a table to a connection from R.
Let’s write multiple data frames from {tidyr} into new tables in our `tmpcon` database.
```{r}
#| label: write-tidyr-tables
data("table1", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table1", value = table1)
data("table2", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table2", value = table2)
data("table3", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table3", value = table3)
data("table4b", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table4b", value = table4b)
data("table5", package = "tidyr")
dbWriteTable(conn = tmpcon, name = "table5", value = table5)
```
Here they are.
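```{sql}
#| connection: tmpcon
DESCRIBE;
```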
database | schema | name | column_names | column_types | temporary |
---|---|---|---|---|---|
memory | main | table1 | country , year , cases , population | VARCHAR, DOUBLE , DOUBLE , DOUBLE | FALSE |
memory | main | table2 | country, year , type , count | VARCHAR, DOUBLE , VARCHAR, DOUBLE | FALSE |
memory | main | table3 | country, year , rate | VARCHAR, DOUBLE , VARCHAR | FALSE |
memory | main | table4a | country, 1999 , 2000 | VARCHAR, INTEGER, INTEGER | FALSE |
memory | main | table4b | country, 1999 , 2000 | VARCHAR, DOUBLE , DOUBLE | FALSE |
memory | main | table5 | country, century, year , rate | VARCHAR, VARCHAR, VARCHAR, VARCHAR | FALSE |
8.6.2 Writing Tables to a CSV File
Use `COPY ... TO` to write the output of a SQL query to a CSV file (using a relative path).
Let’s write some group summaries to a CSV file.
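A minimal sketch; the summary query and the file name `data/dep_delay_summary.csv` are just illustrations:

```{sql}
#| connection: con
COPY (
    SELECT "origin", AVG("dep_delay") AS "avg_dep_delay"
    FROM flights
    GROUP BY "origin"
) TO 'data/dep_delay_summary.csv' (HEADER, DELIMITER ',');
```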
This is what the resulting file looks like:
8.7 Calling SQL from R
Use `DBI::dbGetQuery()` to run SQL code in R and obtain the result as a data frame.
```{r}
#| label: dbgetquery
planes2 <- DBI::dbGetQuery(conn = con,
statement = "SELECT * FROM planes")
glimpse(planes2)
```
Rows: 3,322
Columns: 9
$ tailnum <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW…
$ year <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199…
$ type <chr> "Fixed wing multi engine", "Fixed wing multi engine", "Fi…
$ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU…
$ model <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145…
$ engines <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ seats <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5…
$ speed <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ engine <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb…
Often, folks have SQL written in a separate file (one that ends in `.sql`).
- Here is what is in “query.sql”.
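Presumably the same query used above:

```sql
SELECT * FROM planes
```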
If you want to load the results of a SQL query (saved in `query.sql`) into R, use `readr::read_file()` inside `DBI::dbGetQuery()`.
```{r}
mydf <- DBI::dbGetQuery(conn = con, statement = read_file("./sql/query.sql"))
glimpse(mydf)
```
Rows: 3,322
Columns: 9
$ tailnum <chr> "N10156", "N102UW", "N103US", "N104UW", "N10575", "N105UW…
$ year <int> 2004, 1998, 1999, 1999, 2002, 1999, 1999, 1999, 1999, 199…
$ type <chr> "Fixed wing multi engine", "Fixed wing multi engine", "Fi…
$ manufacturer <chr> "EMBRAER", "AIRBUS INDUSTRIE", "AIRBUS INDUSTRIE", "AIRBU…
$ model <chr> "EMB-145XR", "A320-214", "A320-214", "A320-214", "EMB-145…
$ engines <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
$ seats <int> 55, 182, 182, 182, 55, 182, 182, 182, 182, 182, 55, 55, 5…
$ speed <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ engine <chr> "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turbo-fan", "Turb…
8.8 Closing the Connection
After you are finished working with a database, you should close down your connection:
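For a {duckdb} connection:

```{r}
#| eval: false
DBI::dbDisconnect(con, shutdown = TRUE) # shutdown = TRUE also shuts down the DuckDB instance
```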
8.9 Basic SQL Exercises
Download the file starwars.duckdb and save it to a `data` directory.
- It’s the data from {starwarsdb} as a DuckDB database.
Use only SQL to answer the following questions.
- Download these data and open a connection to the database.
- Print out a summary of the tables in this database.
- Select the entire `people` table.
- Select just the `name`, `height`, `mass`, and `species` variables from the `people` table.
- Add to the above query by selecting only the humans and droids.
- Remove the individuals with missing `mass` data from the above query.
- Modify the above query to calculate the average height and mass for humans and droids.
- Make sure Droids are in the first row and Humans are in the second row in the above summary output.
- Here is the summary of the keys for the database from the {starwarsdb} GitHub page.
Select all films with characters whose homeworld is Kamino.
- Filter the `people` table to only contain humans from Tatooine and export the result to a CSV file called “folks.csv”.
- Close the SQL connection.
8.10 Integrating SQL with R
Section 8.4 demonstrated using SQL inside an R environment using the {DBI} and {duckdb} packages.
All of the SQL was run in a SQL code chunk in a Quarto document.
We manually converted R code chunks to SQL and inserted the connection name.
All of the SQL code chunks were interactive and they did not affect the R environment.
Section 8.7 showed how one could run SQL scripts (`.sql` files) in R, using the `DBI::dbGetQuery()` function to assign an R name to the result-set of a SQL query.
This section will demonstrate some shortcuts for creating SQL code chunks and for getting access to the result-sets as R objects.
Let’s create a connection to the `flights.duckdb` database.
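A sketch, assuming the database file is in the working directory (adjust the path as needed):

```{r}
#| label: connect-flights
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "flights.duckdb")
```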
8.10.1 Create SQL Code Chunks
8.10.1.1 Customize a Keyboard Shortcut
RStudio has a built-in capability to create an SQL code chunk with an R Markdown code chunk option for the connection.
To access it quickly, we suggest customizing a keyboard shortcut for it per Customizing Keyboard Shortcuts in the RStudio IDE. Ushey (2023)
Go to the RStudio Main Menu and select `Tools/Modify Keyboard Shortcuts`.
A pop-up pane will appear with many options and the existing keyboard shortcuts.
- Enter `S Q` in the search pane and you should see something like Figure 8.3.
- Click in the `Shortcut` column to the right of “Insert Chunk S Q L”.
- Enter the shortcut combination you want. Figure 8.3 shows the Mac shortcut I chose, `Ctrl + Cmd + s`.
- Click `Apply` and your shortcut will be saved. You can use it right away.
Enter the shortcut in an R Markdown area (not an existing chunk) of a `.Rmd` or `.qmd` file and you should get the following (without the `con`).
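That is, a chunk skeleton along these lines, shown here with the connection already filled in:

```{sql, connection=con}

```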
- Make sure you enter the name of the connection you have created.
You cannot render a document whose SQL code chunk has an invalid or empty connection; you must create a valid connection object before the SQL code chunk.
8.10.1.2 Create a Custom Code Snippet
Another option is to create a custom code snippet per Code Snippets in the RStudio IDE. Allaire (2023)
RStudio snippets are templates for code or R Markdown. We used a code snippet to create a bare-bones template for a Shiny app.
To create a snippet for a custom SQL code chunk, go to the RStudio Main Menu and select `Tools/Edit Code Snippets`.
A pop-up pane should appear.
Select `Markdown`, as you will be using the snippet in a Markdown section. You will see something like Figure 8.4 with the existing Markdown code snippets.
Figure 8.4 already shows the snippet I saved for creating a custom SQL code chunk.
You can create your own snippet by following this example.
- Enter `snippet s_name`, where you put your name for the snippet in place of `s_name`. It does not have to be `rsql`.
- Enter the following lines (see the sketch after this list) and use the tab key, not spaces, to indent.
- The `${0}` is a placeholder for your SQL code. The `0` means it is the first place your cursor will go.
  - You could replace `con` with `${0}` so you can fill in which connection to use each time. Then edit the `${0}` on the original line to `${1}`.
- Select `Save` to save it, and then you can use it.
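A sketch of what the saved snippet might look like (the name `rsql` and the connection `con` are illustrative; the body lines are indented with tabs):

````
snippet rsql
	```{sql, connection=con}
	${0}
	```
````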
To use a snippet, enter its name, here “rsql”, and then press `Shift+Tab`; the code chunk will appear like the one sketched above (without the other options).
8.10.2 Other SQL Code Chunk Options
The R Markdown Definitive Guide section on SQL discusses several SQL chunk options you can use to adjust how SQL is interpreted by {knitr} in RStudio documents.
- `max.print = XX` will adjust the number of lines of SQL output printed in the rendered document from the default of `XX = 10`.
- `tab.cap = "My Caption"` will change the default caption (which states how many records are displayed) to whatever you want in the rendered document.
These options will work as either R Markdown style options or Quarto style options.
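For example, a sketch of a chunk using both options, Quarto style (the values are illustrative):

```{sql}
#| connection: con
#| max.print: 20
#| tab.cap: "The first 20 planes"
SELECT * FROM planes
```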
8.10.3 Creating R Variables from SQL Code
You can use the SQL code chunk option `output.var` to create an R data frame variable from a SQL result-set.
Let’s create R variables for the tables in the database and the SQL description of the tables in the database.
- Again, either style chunk option will work.
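A sketch of the pattern; the variable name `tables_df` is illustrative, and DuckDB's `SHOW ALL TABLES` returns the kind of description printed below:

```{sql}
#| connection: con
#| output.var: "tables_df"
SHOW ALL TABLES;
```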
Notice that just like assignment in R, the results are implicit and not seen in the output.
- You do see the new variables in the Global Environment.
database schema name
1 flights main airlines
2 flights main airports
3 flights main flights
4 flights main planes
5 flights main weather
column_names
1 carrier, name
2 faa, name, lat, lon, alt, tz, dst, tzone
3 year, month, day, dep_time, sched_dep_time, dep_delay, arr_time, sched_arr_time, arr_delay, carrier, flight, tailnum, origin, dest, air_time, distance, hour, minute, time_hour
4 tailnum, year, type, manufacturer, model, engines, seats, speed, engine
5 origin, year, month, day, hour, temp, dewp, humid, wind_dir, wind_speed, wind_gust, precip, pressure, visib, time_hour
column_types
1 VARCHAR, VARCHAR
2 VARCHAR, VARCHAR, DOUBLE, DOUBLE, DOUBLE, DOUBLE, VARCHAR, VARCHAR
3 INTEGER, INTEGER, INTEGER, INTEGER, INTEGER, DOUBLE, INTEGER, INTEGER, DOUBLE, VARCHAR, INTEGER, VARCHAR, VARCHAR, VARCHAR, DOUBLE, DOUBLE, DOUBLE, DOUBLE, TIMESTAMP
4 VARCHAR, INTEGER, VARCHAR, VARCHAR, VARCHAR, INTEGER, INTEGER, INTEGER, VARCHAR
5 VARCHAR, INTEGER, INTEGER, INTEGER, INTEGER, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, DOUBLE, TIMESTAMP
temporary
1 FALSE
2 FALSE
3 FALSE
4 FALSE
5 FALSE
[1] "data.frame"
Rows: 5
Columns: 6
$ database <chr> "flights", "flights", "flights", "flights", "flights"
$ schema <chr> "main", "main", "main", "main", "main"
$ name <chr> "airlines", "airports", "flights", "planes", "weather"
$ column_names <list> <"carrier", "name">, <"faa", "name", "lat", "lon", "alt",…
$ column_types <list> <"VARCHAR", "VARCHAR">, <"VARCHAR", "VARCHAR", "DOUBLE", …
$ temporary <lgl> FALSE, FALSE, FALSE, FALSE, FALSE
[1] "data.frame"
Let’s use SQL to get the departure delay for all flights from JFK and LaGuardia to Chicago and then use {ggplot2} to plot.
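The original chunk is not shown; here is a sketch using `output.var` (the variable name `ord_delays` is illustrative):

```{sql}
#| connection: con
#| output.var: "ord_delays"
SELECT flight, origin, dest, dep_delay
FROM flights
WHERE origin IN ('JFK', 'LGA')
  AND dest = 'ORD';
```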
Check our data.
Rows: 3,862
Columns: 4
$ flight <int> 3521, 905, 301, 325, 1435, 255, 325, 3521, 3523, 331, 1477, …
$ origin <chr> "JFK", "JFK", "LGA", "LGA", "LGA", "LGA", "LGA", "JFK", "JFK…
$ dest <chr> "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD", "ORD…
$ dep_delay <dbl> NA, 5, 1, 146, 164, -11, NA, -4, 111, -2, 2, -3, 136, -7, 13…
Plot.
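A sketch of the plotting code, assuming the result-set above landed in `ord_delays`:

```{r}
#| label: plot-delays
# Boxplots of departure delay by origin airport; rows with missing
# dep_delay are dropped by geom_boxplot() with a warning.
ggplot(ord_delays, aes(x = origin, y = dep_delay)) +
  geom_boxplot() +
  labs(x = "Origin", y = "Departure delay (minutes)")
```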
8.10.4 Using R Variables in SQL Code
You can also use an existing R variable as a value in a SQL query clause.
Create the R variable and use its name preceded by a `?`, as in the sketch below.
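A sketch; judging by the output below, the destination was Orlando (`MCO`):

```{r}
dest <- "MCO"
```

```{sql}
#| connection: con
#| output.var: "mco_flights"
SELECT flight, origin, dest, dep_delay
FROM flights
WHERE dest = ?dest;
```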
Rows: 2,638
Columns: 4
$ flight <int> 283, 1959, 583, 1815, 2445, 299, 1099, 1885, 1443, 1783, 188…
$ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "LGA", "LGA", "LGA…
$ dest <chr> "MCO", "MCO", "MCO", "MCO", "MCO", "MCO", "MCO", "MCO", "MCO…
$ dep_delay <dbl> 31, -1, 6, 11, -2, 10, 0, 43, -3, -2, 1, -5, -3, 18, 18, -5,…
Using parameters in SQL queries is common practice, and using R to manipulate text to create queries as character strings is also common.
However, if you are building something, e.g., a Shiny app, where you allow others to input text that feeds into a query, you have to protect what you are doing.
Otherwise you are liable to be the victim of a SQL Injection Attack, which can have bad consequences.
For R-specific approaches to mitigate the risk, see Run Queries Safely Posit (2023c) from the Posit Best Practices in Working with Databases. Posit (2023b)
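A minimal sketch of one such approach, a parameterized query through {DBI} (the query and value are illustrative):

```{r}
#| label: run-safely
# The ? placeholder is bound by the driver, so the user-supplied value
# is never pasted into the SQL string itself.
user_input <- "MCO"
safe_df <- DBI::dbGetQuery(
  conn = con,
  statement = "SELECT flight, origin, dest FROM flights WHERE dest = ?",
  params = list(user_input)
)
```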
8.10.5 Setting the Connection Once in a Document
If you are going to use the same connection for all (or most) chunks in a document, you can set the connection as an option in a setup chunk.
Let’s close the current `con`.
Now create a new connection.
Now run a SQL code chunk without stating the connection.
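A sketch of the whole pattern (the path is illustrative): the setup chunk registers the connection with {knitr}, and a later SQL chunk omits the `connection` option.

```{r}
#| label: setup
con <- DBI::dbConnect(duckdb::duckdb(), dbdir = "flights.duckdb")
knitr::opts_chunk$set(connection = "con")
```

```{sql}
SELECT * FROM planes LIMIT 5
```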
This document will render; however, the option does not work interactively as of August 2023.
That really reduces the utility of the set-up chunk option.
A comment has been added to a related RStudio issue (13240) on GitHub.
I suspect this will be resolved in the near future.
8.10.6 Close the Connection.
To avoid a warning when rendering the document, close the connection at the end.
8.11 Using the {dbplyr} Package to Translate {dplyr} into SQL
The {dbplyr} package allows you to write {dplyr} code and have it translated into SQL to interact (through an established connection) with a Back-end server.
- Consider R as a higher-level language you use to access code written in C.
- The {shiny} package can be considered as providing a higher-level language to generate HTML from {shiny} code.
- You can consider {dbplyr} as a higher-level package that generates SQL from {dplyr} code.
{dbplyr} works for many flavors of SQL and can generate many SQL expressions.
However, it cannot always do everything. For more details, see the {dbplyr} vignette on Function Translation.
Even if you can’t use {dbplyr} in your situation, you can still use it to generate SQL code you can see and use. That makes it a good tool for learning SQL since you know {dplyr}.
Let’s load {dbplyr} and {tidyverse} along with {DBI} and {duckdb}.
Create the connection to the database.
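A sketch of the setup (the path is illustrative):

```{r}
#| label: load-packages
library(tidyverse)
library(dbplyr)
library(DBI)
library(duckdb)
con <- dbConnect(duckdb(), dbdir = "flights.duckdb")
```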
Retrieve a table from a SQL database using `dplyr::tbl()`.
- Note you have to specify the source of the data, here the database connection, and the table name.
- Note that `flights2` is not a data frame; it’s a list with class `tbl_duckdb_connection`, as the sketch below shows.
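A sketch, following the names used above:

```{r}
flights2 <- dplyr::tbl(con, "flights")
class(flights2)
```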
Now you can use your tidyverse code on the `flights` table to generate a query.
Execute the query by using `collect()` at the end.
To see the SQL code that was generated, use `show_query()` at the end (this is a good way to learn SQL).
After you are done with the database, close the connection.
8.12 The DBeaver Integrated Development Environment (IDE)
- This section is based on work by Professor David Gerard.
There is a free, separate Front-end for the DuckDB database called “DBeaver Community”.
Per DBeaver community:
DBeaver Community is a free cross-platform database tool for developers, database administrators, analysts, and everyone working with data. It supports all popular SQL databases like MySQL, MariaDB, PostgreSQL, SQLite, Apache Family, and more.
8.12.1 Setup DBeaver Community
The instructions here are only for DuckDB. But DBeaver works with other DBMS’s pretty easily.
Download and install DBeaver Community here: https://dbeaver.io/download/.
When you open it up, select `Database Navigator` and, under `All` or `SQL`, select “DuckDB”.
Then select `Next`.
Select “Host”, and `Open` to navigate to your `flights.duckdb` database.
Click on `Test Connection`.
If successful, you will get a pop-up window with the results of the test, which will allow you to see the details of the connection.
If not successful, and you haven’t used it before, it may ask you to install a JDBC-compliant driver to connect to a DuckDB database.
- Click `Download Driver`.
- After the download, it will run the test and you should get a positive result.
Click “Finish”.
You should now see the database under the Database Navigator.
Right-click on the connection and create a new SQL script file.
Type some SQL and hit the execute button, or just hit `Control + Enter`.
You can now play around with SQL.
When you want to save the data to a CSV file, just hit “Export data”.
- There are a lot of features in DBeaver, but we won’t cover them. They are pretty intuitive. Just play with it for a while.
See the DBeaver General User Guide for detailed documentation.
8.12.2 Exercises
Use DBeaver for all of these exercises.
- What is the mean temperature for flights from each airport?
- What is the average number of flights from each airport per day in January?
- What are the top destinations for each airport?
8.13 Considerations for Working with SQL and Databases
8.13.1 Review Documentation about the database
When working with a new database, or one that is often changing, seek out and review the available documentation.
Data Dictionaries, schema descriptions, and/or E-R diagrams can help you understand the database design, the table descriptions and structure, the fields in each table, the primary keys, and whether tables are indexed.
The documentation can help you see a holistic picture of the data so you know where to look for tables and fields of interest and how to join tables most efficiently.
- As an example, the database may have a naming convention for tables and fields to help you find the tables and fields you need to create your queries.
8.13.2 Review Documentation about the DBMS and its SQL
Every DBMS is slightly different, as developers optimize their code for the goals of the database.
Review documentation on the “flavor” of SQL accepted by the database.
Many flavors have extra features (statements, clauses, keywords, or modifiers) you can use to simplify your code or speed up your queries.
8.13.3 Choose a Front-end IDE that works with the DBMS
SQL is just text so you can write SQL in many environments.
If you are doing a lot of SQL, find a Front-end IDE that has features to facilitate using SQL.
- As an example, DBeaver has auto-complete for SQL statements while RStudio does not.
Many Front-ends have customized drivers for working with many different Back-end DBMSs.
- So choose a Front-end that allows you to work with the different databases in your organization without having to switch.
- Read the documentation for the Front-end to use it effectively.
8.13.4 Follow a Style Guide
Like tidyverse style, there are style guides to help you make your code easier to read, faster to debug, and more robust to future changes.
If you do nothing else from a style perspective, use line breaks often to break up your code into readable lines, each starting with a statement, clause, or keyword.
Both of the following code chunks are valid SQL (the queries shown are illustrative stand-ins for the originals).
Streaming SQL example.
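(An illustrative one-line query, equivalent to the formatted version below.)

```sql
select tailnum, year, seats from planes where seats > 200 and year >= 2000 order by seats desc limit 10
```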
Stylized tidy SQL.
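(The same illustrative query, with one clause per line.)

```sql
SELECT
    tailnum,
    year,
    seats
FROM planes
WHERE seats > 200
    AND year >= 2000
ORDER BY seats DESC
LIMIT 10
```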
Which do you think is easier to read, debug, and maintain?
- The second chunk was generated by the DBeaver `Format SQL` command under the Edit menu.
There is no single authoritative style guide, so check if your organization has its own.
If it does not, consider the SQL Style Guide Holywell (2023) or the SQL Style Guide for GitLab Mestel and Bright (2023) as starting points.
8.13.5 Build-a-little-test-a-little
Writing and debugging SQL queries should proceed in steps just like building and debugging R code and functions.
Don’t write a 30 line query and then start to debug it.
Write short queries and use `LIMIT` to keep the returns fast.
Write multiple small queries and then integrate them piece by piece.
Consider using sub-queries and Common Table Expressions to combine them as well.
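A sketch of the idea, with an illustrative query against the flights database: test each small piece with a `LIMIT`, then fold the tested piece into a Common Table Expression.

```sql
-- Test the small piece first, with a LIMIT to keep the return fast
SELECT tailnum, year FROM planes WHERE year >= 2000 LIMIT 5;

-- Then fold the tested piece into a CTE and build on it
WITH recent_planes AS (
    SELECT tailnum, year FROM planes WHERE year >= 2000
)
SELECT f.flight, f.origin, p.year
FROM flights AS f
JOIN recent_planes AS p ON f.tailnum = p.tailnum
LIMIT 10;
```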
8.13.6 Follow Best Practices for Writing Queries
There are many sources for Best Practices in writing SQL queries to optimize speed.
Common threads appear across best practices:
- Shape the query to return only the data you need.
- Shape the query to take advantage of the DBMS ability to use indexes for faster performance.
- Make your queries easy to read.
The most commonly cited best practice is “Avoid the use of `SELECT *`”.
Other suggestions include:
- Use joins instead of sub-queries inside a `WHERE` when you are just matching values.
- Avoid using `LIKE` with wildcards (`%`) at the beginning of a string, as this means every record must be searched.
- Use `GROUP BY` or add additional fields instead of using `SELECT DISTINCT` on large tables.
- Use comments (`--`) to explain why you are doing something.
- Use `AS` to create aliases to clarify fields from different tables for readability.
- Avoid sorting, especially for large tables.
For additional recommendations and rationale, consider the following as a starting point.
- 7 Bad Practices to Avoid When Writing SQL Queries for Better Performance Allam (2023) is a good place to start.
- Best Practices for Writing SQL Queries Metabase (2023) has an extensive list of suggestions.
8.13.7 Consider Tuning Slow Queries
Queries go through three stages when being executed.
- Syntax Check: The DBMS parser checks that each query is syntactically correct.
- Optimization: The DBMS optimizer examines the query and builds a plan for execution to minimize resource demand and time.
- Execution: The DBMS runs the optimized plan to return any results.
Thus every query gets optimized but that does not mean the query could not be tuned to be faster.
If you have a query that is run many times, on large tables, and appears slow, then consider tuning it.
The first step is detecting a slow query.
- Most DBMS have a way to monitor query performance.
- For DBeaver, see 3 Ways to Detect Slow Queries in PostgreSQL. Schönig (2018)
The second step is to assess how the query is actually being run.
- DBMSs also have a way to get to the results of the Query Planner so you can review them to see what is actually happening.
- Look at the help for `EXPLAIN`, sketched below.
- For DBeaver, review the PostgreSQL Documentation on Using EXPLAIN Group (2023)
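In DuckDB, for example, prefixing a query with `EXPLAIN` returns the plan instead of the result (the query is illustrative):

```sql
EXPLAIN
SELECT origin, AVG(dep_delay) AS avg_dep_delay
FROM flights
GROUP BY origin;
```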
Once you understand the execution, then consider if any of the best practices for writing queries will help improve performance.
For more explanation and ideas see SQL Performance Tuning. khushboogoyal499 (2022)
8.13.8 Integrate Your SQL Into Your Version Control Workflow
Regardless of the IDE you use for a Front-end, SQL is still text-based code.
You can put the files into a Git Repo and version control with GitHub.
- If you are using DBeaver, use the search tool (magnifying glass) and look for “Install new software.”
- Use the drop down arrow to select the Git option as in Figure 8.5.
For more details see How to Version Control Your SQL.
8.13.9 Consider More Advanced SQL Statements and Methods
These notes have just scratched the surface of a language that has been around for 50 years and is still going strong.
There are many more statements, clauses, and keywords you can use to create more effective SQL to access the data you want, especially for large databases that will be accessed many times.
Some of these are:
- Views, Sub-queries, and Common Table Expressions
- Additional clauses and modifiers for `WHERE`, such as {stringr}-style functions.
- Dynamic SQL, where you can pass parameters to a stored procedure akin to an R function.
There are many free references on SQL to expand your knowledge.
R for Data Science suggests two non-free books: SQL for Data Scientists by Renée M. P. Teate and Practical SQL by Anthony DeBarros.
8.13.10 Strategize How to Use SQL in Your Analysis Workflow
From a data science perspective SQL is usually a means for getting data out of databases.
- Data scientists are usually less involved in designing, building, and managing databases or maintaining the data.
Your working environment will often dictate which databases have the data of interest and the associated versions of SQL you must use.
There are SQL wizards who can make SQL dance by building and optimizing queries to the nth degree.
You do not have to be one!
As long as you can get the data you need, you can use the other tools in R and Python to make your analysis sing and dance.
If you have the time and hardware memory to get the data, you can clean and shape it outside of SQL if that works for you.
SQL is an important tool in your Data Science tool box for sure, but it works best when you use it alongside the other tools in that box.
If your primary analysis tools are R and RStudio, but you need to get subsets of data from a large database using SQL, consider using a SQL-focused IDE, such as DBeaver, to build your queries and then copying them into a `.sql` file or a SQL code chunk.
This way you get the advantage of a SQL-focused IDE while using RStudio for the rest of the analysis workflow.