3  Univariate (Single Variable) Analysis

Published

June 9, 2026

Keywords

univariate statistics, ggplot2, histograms, barplots, exploratory data analysis

3.1 Introduction

This section focuses on the analysis of individual variables, also called univariate data analysis.

Learning Outcomes

  • Generate statistics for variables
  • Create univariate plots using the {ggplot2} package
  • Customize {ggplot2} plots
  • Conduct Exploratory Data Analysis (EDA)

3.1.1 References

3.2 Univariate Data Analysis

Univariate data analysis is about understanding the characteristics of a single variable.

Characteristics of interest could include:

  • What type of data is it: quantitative, categorical, text, numeric?
  • How many observations are there?
  • What is its middle or most common value (mean, median, mode)?
  • How varied is the data in terms of its variance/ standard deviation and its range (from the minimum to its maximum.
  • What is the shape of its distribution?

Let’s explore the variables in two different data sets

  • The house_ed_gdp_joined data frame we created earlier which has mostly numeric values.
  • A new data frame from Open Data Albania with data about Agricultural land called land_clean.csv. This has some categorical values of interest.
  • The built in Penguins data set. Enter ?penguins in the console to read about it.

The tidyverse collection of packages arguably contains the most downloaded R packages today. It was introduced to the R community by Hadley Wickham and others starting in about 2016, and we will make heavy use of it.

However, you also have the basic R functionality. To give you a feel for both the basic R world and the tidyverse world, you will see similar skills demonstrated both ways. You will probably appreciate tidyverse more after seeing the “low tech” graphics!

Let’s start by loading the {tidyverse} package into our global environment and reading in the two data sets into the global environment.

  • We use data() as the penguins data frame is accessible from the {datasets} package that comes with R
library(tidyverse)
house_ed_gdp_joined <- read_csv("./data/house_ed_gdp_joined.csv")
land <- read_csv("./data/land_clean.csv")
data(penguins)

3.3 Getting Numerical Attributes

For our first peek at the data, we want to get some numerical summaries.

Here are a few base R commands to look at the data we’re going to use:

mean(house_ed_gdp_joined$Average_size)
[1] 3.242623
mean(house_ed_gdp_joined$population, na.rm = T) #  use na.rm=T  to drop NA values.
[1] 39563.95
var(house_ed_gdp_joined$general_revenue_per_person, na.rm = TRUE) # variance
[1] 257442323
sd(house_ed_gdp_joined$general_revenue_per_person, na.rm = TRUE) # standard deviation
[1] 16045.01

These can be done using tidyverse as well.

house_ed_gdp_joined |> summarize(mean(Average_size))
# A tibble: 1 × 1
  `mean(Average_size)`
                 <dbl>
1                 3.24
house_ed_gdp_joined |> summarize(mean(population, na.rm = TRUE))
# A tibble: 1 × 1
  `mean(population, na.rm = TRUE)`
                             <dbl>
1                           39564.
house_ed_gdp_joined |> summarize(var(general_revenue_per_person, na.rm = TRUE))
# A tibble: 1 × 1
  `var(general_revenue_per_person, na.rm = TRUE)`
                                            <dbl>
1                                      257442323.
house_ed_gdp_joined |> summarize(sd(general_revenue_per_person, na.rm = TRUE))
# A tibble: 1 × 1
  `sd(general_revenue_per_person, na.rm = TRUE)`
                                           <dbl>
1                                         16045.
house_ed_gdp_joined |> summarize(cor(Average_size, has_washing_machine))
# A tibble: 1 × 1
  `cor(Average_size, has_washing_machine)`
                                     <dbl>
1                                   -0.128

For the moment, this seems to be considerably more verbose but that is because tidyverse is designed to work with data frames more than individual variables.

  • Often tidyverse code is easier to understand and maintain, but it really is a matter of preference.

For quick summaries of the variables in a data frame we can use several options

  • The base R summary() function gives us different statistics based on the class of the variable.
    • Compare Municipality which is character with Average_sizewhich is numeric.
summary(house_ed_gdp_joined)
    Municipality    culture_site  Average_size   has_washing_machine
 Length   :61    Length   :61    Min.   :2.500   Min.   :86.70      
 N.unique :61    N.unique : 2    1st Qu.:2.800   1st Qu.:93.50      
 N.blank  : 0    N.blank  : 0    Median :3.100   Median :95.10      
 Min.nchar: 3    Min.nchar: 7    Mean   :3.243   Mean   :94.41      
 Max.nchar:14    Max.nchar: 8    3rd Qu.:3.700   3rd Qu.:95.90      
                                 Max.   :4.900   Max.   :98.00      
                                                                    
  has_internet    has_computer      has_car      Illiteracy_rate
 Min.   :27.70   Min.   : 5.80   Min.   :20.90   Min.   :1.200  
 1st Qu.:52.80   1st Qu.:12.20   1st Qu.:31.50   1st Qu.:2.300  
 Median :70.60   Median :14.90   Median :38.70   Median :2.600  
 Mean   :65.83   Mean   :17.16   Mean   :38.57   Mean   :2.836  
 3rd Qu.:78.10   3rd Qu.:20.40   3rd Qu.:45.40   3rd Qu.:3.000  
 Max.   :87.20   Max.   :49.30   Max.   :61.60   Max.   :7.500  
                                                                
 primary_lower_secondary upper_secondary   university     population    
 Min.   :25.00           Min.   :24.70   Min.   : 5.3   Min.   :  1843  
 1st Qu.:48.30           1st Qu.:30.10   1st Qu.:10.5   1st Qu.: 11082  
 Median :53.40           Median :32.80   Median :12.9   Median : 19261  
 Mean   :52.68           Mean   :33.23   Mean   :13.3   Mean   : 39564  
 3rd Qu.:57.80           3rd Qu.:36.10   3rd Qu.:14.7   3rd Qu.: 32741  
 Max.   :68.90           Max.   :46.00   Max.   :36.4   Max.   :598176  
                                                        NAs    :2       
 general_revenue_per_person
 Min.   : 14856            
 1st Qu.: 20960            
 Median : 26377            
 Mean   : 30425            
 3rd Qu.: 36042            
 Max.   :100942            
 NAs    :2                 
  • Tidyverse dplyr::glimpse().
    • This provides a look at size of the data frame, the variables, their class, and as many values as fit across the page.
    • Note: during cleaning a new variable was created, prop_tyhpe_broad to align the many variants of Property_type into fewer categories.
    • Looking at the unique values of Property_type show many variations in spelling, capitalization and abbreviation common in human-generated data.
glimpse(land)
Rows: 7,369
Columns: 9
$ County_Region   <chr> "Berat", "Berat", "Berat", "Berat", "Berat", "Berat", …
$ Municipality    <chr> "Berat", "Berat", "Berat", "Berat", "Berat", "Berat", …
$ muni_ascii      <chr> "berat", "berat", "berat", "berat", "berat", "berat", …
$ Property_type   <chr> "Ullishte", "Vresht", "Vresht", "Are", "Are", "Vresht"…
$ prop_type_broad <chr> "Olive_grove", "Vineyard", "Vineyard", "Arable", "Arab…
$ Area_hectares   <dbl> 0.2550, 0.0925, 2.1100, 0.3125, 0.4400, 0.1300, 0.2200…
$ Regestry_area   <dbl> 2550, 925, 21100, 3125, 4400, 1300, 2200, 675, 1775, 9…
$ Land_quality    <chr> "8", "6", "6", "6", "6", "6", "8", "8", "8", NA, NA, N…
$ Ownership       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
unique(land$Property_type)
 [1] "Ullishte"       "Vresht"         "Are"            "Pemtore"       
 [5] "pemtore"        "Toke Are"       "Vreshte"        "Pemtare"       
 [9] "Pemtar"         "LEDH"           NA               "Are+Truall"    
[13] "are"            "truall"         "Are+truall"     "ullishte"      
[17] "Vresht/Pemtore" "Are truall"     "Pemtore/are"    "Vresht+truall" 
[21] "vresht"         "ARE"            "PEMTORE"        "ULLISHTE"      
[25] "Are+ truall"    "Truall/Are"     "Kanal/are"      "Rruge/are"     
[29] "Rruge /Are"     "Rruge/Are"      "Are/Are"        "Livadh/Are"    
[33] "Pemtor"         "Pemt."          "Livadh"         "Pemt"          
[37] "are+truall"     "Pemetore"       "Arre"           "shkemb"        
[41] "Peme"           "Ullisht"        "Rruge +kanal"   "pemetore"      
[45] "vreshte"        "rruge"          "Pemetor"        "3"             
[49] "kullot"         "ledh"           "Truall / Are"   "shkembore"     
[53] "gurishte"       "pyll"          
unique(land$prop_type_broad)
[1] "Olive_grove"    "Vineyard"       "Arable"         "Orchard"       
[5] "Other"          "Meadow"         "Forest_pasture"
  • The {skimr} package skim() which groups several different summaries include a simple histogram (scroll right)
skimr::skim(house_ed_gdp_joined)
Data summary
Name house_ed_gdp_joined
Number of rows 61
Number of columns 13
_______________________
Column type frequency:
character 2
numeric 11
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Municipality 0 1 3 14 0 61 0
culture_site 0 1 7 8 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Average_size 0 1.00 3.24 0.52 2.5 2.8 3.1 3.7 4.9 ▇▆▅▁▁
has_washing_machine 0 1.00 94.41 2.67 86.7 93.5 95.1 95.9 98.0 ▂▁▂▇▅
has_internet 0 1.00 65.83 16.00 27.7 52.8 70.6 78.1 87.2 ▂▂▃▆▇
has_computer 0 1.00 17.16 7.58 5.8 12.2 14.9 20.4 49.3 ▇▆▃▁▁
has_car 0 1.00 38.57 9.35 20.9 31.5 38.7 45.4 61.6 ▅▇▇▅▂
Illiteracy_rate 0 1.00 2.84 1.07 1.2 2.3 2.6 3.0 7.5 ▆▇▁▁▁
primary_lower_secondary 0 1.00 52.68 8.23 25.0 48.3 53.4 57.8 68.9 ▁▂▅▇▃
upper_secondary 0 1.00 33.23 4.49 24.7 30.1 32.8 36.1 46.0 ▃▇▆▂▁
university 0 1.00 13.30 5.08 5.3 10.5 12.9 14.7 36.4 ▆▇▂▁▁
population 2 0.97 39563.95 80223.57 1843.0 11081.5 19261.0 32741.0 598176.0 ▇▁▁▁▁
general_revenue_per_person 2 0.97 30424.88 16045.01 14856.0 20960.5 26377.0 36041.5 100942.0 ▇▃▁▁▁

3.4 The ggplot2 Package

The {ggplot2} package is the primary tidyverse package for creating plots.

  • The “gg” in ggplot2 stands for the Grammar of Graphics, a framework developed by Leland Wilkinson for describing statistical graphics as combinations of independent components.
  • Rather than thinking of a plot as a single chart type (e.g., “scatter plot” or “bar chart”), the Grammar of Graphics treats a visualization as a layered mapping between data and visual representations.
  • Hadley Wickham used the GG framework for developing ggplot2 to help users make quick, useful, and extensible graphics.

In ggplot2, plots are built incrementally by combining layers with +. Each layer corresponds to a conceptual part of the grammar and is implemented through specific functions.

A typical ggplot2 plot starts with the function ggplot() followed by several key components implemented using functions:

  • Data: the dataset being visualized e.g., ggplot(data = mpg).
  • Aesthetics (aes()): mappings from variables to visual properties such as x-position, y-position, color, size, or shape, e.g., aes(x = displ, y = hwy, color = class).
  • Geometries (geom_*()): the geometric objects used to display observations, e.g., geom_point(),geom_histogram(), orgeom_bar()` and many others.
  • Statistical transformations (stat_*()): computations performed before drawing the plot to know how to draw it, e.g., stat_smooth
    • Many geoms have default statistical transformations built in. For example, geom_histogram() automatically bins data using stat_bin().
  • Scales (scale_*()): control how data values map to visual properties and how axes or legends appear.
    • Scales can affect breaks, labels, colors, transformations, and legend formatting.
  • Coordinate systems (coord_*()): define how the plot is projected into space, e.g., coord_flip() or coord_cartesian()
  • Faceting (facet_*()): splits the data into multiple panels (plots) based on one or more variables, e.g., facet_wrap(~ class) or facet_grid(~ class).
  • Themes (theme_*()): control non-data visual elements such as fonts, backgrounds, grids, and spacing - the scaffolding of the plot

Many of these have default settings so you do not have to include them in every plot; at a minimum you need the data, the aesthetics, and a geom.

  • If you wish to see the various geoms or stats in RStudio, type geom and wait there at the end of the m (or stat and wait at the end of the t). A menu should appear to guide you through the information you seek. To explore a specific geom or stat, type ?gemo_bar (etc).

Here are two examples: a minimal plot and one with more custom settings.

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  stat_smooth(method = lm, se = FALSE) +
  scale_color_brewer(palette = "Dark2") +
  labs(
    title = "Engine Displacement and Highway MPG",
    x = "Engine Size",
    y = "Highway MPG"
  ) +
  theme_minimal()

  • Each of the functions has multiple arguments (most with default values) to help you customize the plot to what you want.
  • There are also multiple extension packages based on {ggplot2} to add even more capabilities,, e.g., {ggrepel}, {ggmosaic}, {gganimate}, and {patchwork}.

3.5 Visualizing a Variable’s Sample Distribution

Now that we have seen some summary statistics about our variables, we want to explore what the distribution looks like.

  • The shape of the distribution is often important for determining how well the data set meets certain assumptions for statistical tests or how we can align it with known probability distributions.

We can plot the shape of the distribution for a quantitative variable using histograms, density plots, or box plots.

  • Histograms break up the values into a fixed number of evenly-sized bins based on the range of the variable and count how many observations are in each bin..
  • Density plots also break up the values but smooth the counts in each bin into a smooth curve.
  • Box Plots are a summary plot that highlight key statistics of the variables sample distribution.

We can plot the counts for a categorical variable by counting the observations in each category and using a bar plot to show them as the height of each bar.

  • Unlike histograms where the bins are sorted numerically, the bars on a bar chart can be sorted in different ways based on the intent for the plot.

To a human, histograms and bar plots can look similar, but to a computer, it’s a significant difference.

  • So, when talking to a computer (or to a statistician), you must be careful not to confuse the two.

3.5.1 Plotting Continuous Variables

Let’s compare the base R plot and the tidyverse plot for a histogram of house_ed_gdp_joined$population.

hist(house_ed_gdp_joined$population)
# ggplot2
house_ed_gdp_joined |>
  ggplot(aes(x = population)) +
  geom_histogram()

  • They both show the general shape of the distribution to include the presence of extreme values on the right end making this a highly skewed distribution.
  • However, the base R plot chose a smaller number of wider bins compared to the ggplot2 plot which hides some of the variability on the left end.
    • You can choose different numbers of bins or different bin sizes by adding the argument inside the geom_histogram() function.
  • They each choose different titles based on their inputs.

Let’s compare the tidyverse histogram versus a density plot.

  • We can add a theme to change the look of the plot.
house_ed_gdp_joined |>
  ggplot(aes(x = population)) +
  geom_histogram() +
  theme_minimal()
house_ed_gdp_joined |>
  ggplot(aes(x = population)) +
  geom_density() +
  theme_minimal()

  • They both show the general shape of the distribution to include the presence of extreme values on the right.
  • The density plot might smooths out the differences and can be adjusted for different levels of smoothing.
  • While the X axis values are the same, the Y axis values are different with the histogram using counts and the density using
  • In short, they are conveying the same information but in slightly different ways and depending on the intent, one may be better than the other.

Finally, lets look at two versions of a univariate box plot which is a summary of the distribution.

  • The box plot shows several key values that allow you to interpret the shape of the distribution.
    • The line in the middle is at the value of the sample median
    • The top and bottom lines of the box (the hinges) are at the 25th and 75th percentiles
      • The distance between then is the Inter-Quartile Range (IQR)
    • The lines (or whiskers) extend to the larges values above or below the box but no more than \(1.5 \times \text{IQR}\) from the hinge line.
    • If there are observations more extreme than the whiskers, they are shown as individual points.
  • The box plot on the right is the same data and plot, but the Y axis has been rescaled to be a logarithmic scale.
    • This helps to reduce the skewness of the distribution
house_ed_gdp_joined |>
  ggplot(aes(y = population)) +
  geom_boxplot() ->
my_box_plot

my_box_plot
my_box_plot +
  scale_y_log10()

  • The plot on the left is hard to read because the distribution is so skewed, it shrinks the box to being relatively small.
  • The plot on the right shows that the log of the population is much more symmetric in shape as the median is in the middle of the box and there are single extreme values both above and below.
    • Note that the values on the Y axis are the actual values of the population, but that are arranged logarithmic and not linearly like in the plot on the left.
    • Using a log scale is quite common when working with skewed distributions such as associated with populations, financial data, or word counts in text data.
    • One caveat with the box plot is it tells you nothing about how many observations are used to make the plot. Just two observations can create a box plot (which will not be very meaningful)

Note: this code also shows that you can assign a plot a name so you can reuse it to add more layers without having to repeat all the code.

  • This makes it much faster to make additional plots while maintaining consistency.

We will see box plots again as they are very useful for plotting and comparing multiple distributions at once.

3.5.2 Plotting Categorical Variables

3.5.2.1 Bar Plots

Let’s compare the base R and the tidyverse versions of a bar plot of land$prop_type_broad.

counts <- table(land$prop_type_broad) # This counts our variable by category
barplot(counts) # a bar plot

You can see that this does the job, but it is not very inspiring, not every category is labeled, and we had to generate the counts on our own.

  • Even if you were to take the time to give it better axis labels, it would still look basically the way it does right now.

So, let’s use tidyverse to get two versions of a bar plot:

land |>
  ggplot() +
  geom_bar(aes(x = prop_type_broad))
land |>
  ggplot() +
  stat_count(aes(x = prop_type_broad))

Notice that either geom_bar or stat_count returns the same bar plot and we now have default titles for each axis.

3.5.2.2 Pie charts (round relative frequency histograms)

Some people like to see pie charts (though they can be misleading as they are measures of area)

The usual pie chart can be drawn in Base R as follows and makes use of the same table() command to get the counts we already saw for Base R for bar plots.

myLabels <- names(counts) # You could change the labels here.
pie(counts, myLabels, main = "Land Type")

A tidyverse pie chart looks quite different and is more customizable. But, it takes many more steps. Here, we follow the instructions from: https://www.tutorialspoint.com/ggplot2/ggplot2_pie_charts.htm and add some labels.

df <- as.data.frame(counts)
colnames(df) <- c("Land Type", "freq")
df <- df |>
  mutate(
    `Land Type` = factor(`Land Type`, levels = `Land Type`),
    percent = freq / sum(freq) * 100,
    label = paste0(round(percent, 1), "%"),
    ypos = cumsum(freq) - freq / 2
  )
pie <- ggplot(df, aes(x = "", y = freq, fill = `Land Type`)) +
  geom_bar(width = 1, stat = "identity") +
  #  theme(axis.line = element_blank(), plot.title = element_text(hjust = 0.5)) +
  labs(
    fill = "Land Type", x = NULL, y = NULL,
    title = "Pie Chart", caption = "Land for Rent Data: Albania Open Data"
  )

pie # Look at pie now! After all that effort, it's not even a pie yet!

# Let's convert to polar coordiantes and add the labels
pie +
  coord_polar(theta = "y") +
  geom_text(aes(x = 1.3, label = label), 
             position = position_stack(vjust = 0.5),
            color = "white", size = 4) +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5))

# Phew.I thought we'd never get an actual pie chart!

3.6 Exploratory Data Analysis (EDA)

Exploratory Data Analysis refers to what data scientists do when they are trying to understand the data set in relation to the question they are trying to answer.

  • No preconceived ideas, no hypotheses, no focus other than trying to figure out what they have to work with and what might need to be changed.

Let’s switch over to the penguins! The dataset contains observations of various types of penguins near the Palmer Station in Antarctica.

(a) Adelie
(b) Chinstrap
(c) Gentoo
Figure 3.1: Penguins near Palmer Station, Antarctica

Let’s get a summary and glimpse the data

summary(penguins)
      species          island       bill_len        bill_dep    
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NAs    :2       NAs    :2      
  flipper_len      body_mass        sex           year     
 Min.   :172.0   Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0   1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0   Median :4050   NAs   : 11   Median :2008  
 Mean   :200.9   Mean   :4202                Mean   :2008  
 3rd Qu.:213.0   3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0   Max.   :6300                Max.   :2009  
 NAs    :2       NAs    :2                                 
glimpse(penguins)
Rows: 344
Columns: 8
$ species     <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Ad…
$ island      <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgersen, Tor…
$ bill_len    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, 42.0, …
$ bill_dep    <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, 20.2, …
$ flipper_len <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186, 180,…
$ body_mass   <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, 4250, …
$ sex         <fct> male, female, female, NA, female, male, female, male, NA, …
$ year        <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
  • We can see several of the measurements have NA values and the species, island and sex variables have class factor. The year has class integer instead of date.

Let’s plot the counts for each species and then for each sex.

penguins |>
  ggplot(aes(x = species)) +
  geom_bar()

penguins |>
  ggplot(aes(x = sex)) +
  geom_bar()

  • We can see that there are different numbers of observations for each species but male and female are pretty balanced.

What if we want to see the counts for each species by sex? This is leading us into multi-variate analysis

  • We can use the fill aesthetic to explore another categorical variable within your categories.
  • We can choose which variable we want as primary and which as fill.
penguins |>
  # drop_na(sex) |>
  ggplot(aes(x = species, fill = sex)) +
  geom_bar()

penguins |>
  # drop_na(sex) |>
  ggplot(aes(fill = species, x = sex)) +
  geom_bar()

Perhaps you wish to line up all the categories by their subcategories, each in individual bars; change position argument.

penguins |>
  ggplot(aes(x = species, fill = sex)) +
  geom_bar(position = "dodge")

Or, perhaps you prefer to weight each of your primary categories to 100% so you can visually see if the proportions of the second category are roughly the same or not.

penguins |>
  ggplot(aes(x = species, fill = sex)) +
  geom_bar(position = "fill")

3.6.1 Comparing Distributions by Sex

Let’s get a histogram of the body mass observations by sex (noting that there may be multiple observations from each bird as it grows)

  • The NAs eill get in the way so let’s get rid of them for each variable with drop_na().
penguins |>
  drop_na(body_mass, sex) |> 
  ggplot(aes(x = body_mass, fill = sex)) +
  geom_histogram()

Another view is a Density plot. This is a smoothed version of the histogram.

penguins |>
  drop_na(body_mass, sex) |> 
  ggplot(aes(x = body_mass, fill = sex)) +
  geom_density()

  • Now we can’t see everything so let’s adjust alpha to increase the transparency (decrease the opaqueness)
penguins |>
  drop_na(sex, body_mass) |>
  ggplot(aes(x = body_mass, fill = sex)) +
  geom_density(alpha = .2)

  • Now we can see both Males and Females have two humped plots or what we call Bimodal (the mode is the most common value).
  • This suggests the distributions for each sex are actually a mixture of two different kinds of makes and females.

But maybe that is too busy of a plot. We can clean that up by using other geom_ choices. For instance, here is a classic box plot where we look at summaries of the individual distributions for each sex:

penguins |>
  drop_na(sex) |>
  ggplot(aes(x = sex, y = body_mass)) +
  geom_boxplot()

Notice that the aes() which selects which variables are involved is a little different between the above two commands as we are now mapping two variables to positions on the plot, not a position and a fill value.

penguins |>
  drop_na(island, body_mass) |>
  ggplot(aes(x = island, y = body_mass)) +
  geom_boxplot()

Now if you set the notch = TRUE argument in geom_boxplot() you will some notches appear.

penguins |>
  drop_na(island, body_mass) |>
  ggplot(aes(x = island, y = body_mass)) +
  geom_boxplot(notch = TRUE)

  • As the help for the geom_boxplot() notch argument says, “Notches are used to compare groups; if the notches of two boxes do not overlap, this suggests that the medians are significantly different.
  • So the penguins on Biscoe have different median mass than on the other islands

As discussed before. the box plot does not tell us how many observations are in each plot.

  • A variant that gives us some idea is the Violin plot as it combines the box plot with the histogram density plot.
penguins |>
  drop_na(island, body_mass, sex) |>
  ggplot(aes(x = island, y = body_mass, fill = sex)) +
  geom_violin()

  • Now we can see three variables at once.
  • Note: the males are generally heaver than the females on every island and but the different species have different shaped distributions of observations based on weight.

Although we’ll work with scatter plots more next lesson, here’s a taste of what you can do when exploring the data:

penguins |>
  drop_na(body_mass, bill_len, sex) |>
  ggplot(aes(x = body_mass, y = bill_len, color = sex)) +
  geom_point()

This seems to suggest that bigger penguins of both sexes generally have longer bills.

3.7 Exercise 03: Univariate Analysis

In this exercise you will explore a dataset focused on individual variables using univariate tools: descriptive statistics and basic plots. The dataset is a sample of health measurements from by work by Yani and Lercher.

  • Feel free to use Help function by searching for function names in the help pane or using ?function_name in the console.

3.7.1 Load the Packages and Read in the Data

Library the {tidyverse} package.

Show code
library(tidyverse)

Read in the yani_lercher.csv from the data/ folder and assign the resulting data frame the name yani.

  • Be sure to use the relative path “data/yani_lercher.csv” to access the data file.
  • glimpse() the data frame.
Show code
yani <- read_csv("./data/yani_lercher.csv")
glimpse(yani)
Rows: 1,786
Columns: 4
$ ID    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 1…
$ steps <dbl> 15000, 15000, 15000, 14861, 14861, 14861, 14861, 14699, 14699, 1…
$ bmi   <dbl> 16.9, 16.9, 17.0, 17.2, 17.2, 16.8, 16.8, 17.3, 16.8, 20.5, 20.6…
$ sex   <chr> "male", "male", "female", "female", "female", "male", "male", "m…

3.7.2 Classify the Variables

List all four columns (bmi, steps, sex, id) and classify each one as:

  • Numerical (continuous or discrete)
  • Categorical (nominal, ordinal, or binary)

Write your classification here as a short bullet list.

3.7.3 Generate Descriptive Statistics

Create a summary of the data frame and calculate the variance and standard deviation for numeric variables:

  • There are multiple ways to do this.
Show code
summary(yani)
       ID             steps            bmi               sex      
 Min.   :   1.0   Min.   :    0   Min.   :15.00   Length   :1786  
 1st Qu.: 447.2   1st Qu.: 5301   1st Qu.:20.93   N.unique :   2  
 Median : 893.5   Median : 7431   Median :26.80   N.blank  :   0  
 Mean   : 893.5   Mean   : 7435   Mean   :25.29   Min.nchar:   4  
 3rd Qu.:1339.8   3rd Qu.: 9699   3rd Qu.:29.30   Max.nchar:   6  
 Max.   :1786.0   Max.   :15000   Max.   :32.00                   
Show code
var(yani$bmi)
[1] 24.03905
Show code
sd(yani$bmi)
[1] 4.902963
Show code
var(yani[, 2], na.rm = TRUE)
         steps
steps 12401511
Show code
sd(yani$steps, na.rm = TRUE)
[1] 3521.578
Show code
yani |> 
  summarize(across(where(is.numeric), 
                   .fns = list(var = var, standard_dev = sd)))
# A tibble: 1 × 6
   ID_var ID_standard_dev steps_var steps_standard_dev bmi_var bmi_standard_dev
    <dbl>           <dbl>     <dbl>              <dbl>   <dbl>            <dbl>
1 265965.            516. 12401511.              3522.    24.0             4.90

3.7.4 Create Histograms for Continuous/Quantitative Data

Use ggplot with geom_histogram() to create a histogram of bmi.

  • Experiment with changing the value of geom_histogram() argument bins= 30.
  • Then set the aes() argument fill = sex to see how the two groups overlap.
  • Then set geom_histogram() argument position = "dodge".
Show code
ggplot(yani, aes(x = bmi, fill= sex)) +
  geom_histogram(bins = 30, position = "dodge")

Repeat for steps.

Show code
ggplot(yani, aes(x = steps, fill= sex)) +
  geom_histogram(bins = 30, position = "dodge")

Comment: describe the shape of each distribution (symmetric, skewed, bimodal?).

3.7.5 Create Bar charts for Discrete/Categorical/Qualitative Data

Make a bar chart showing the count of observations for each category of sex.

Show code
ggplot(yani, aes(x = sex)) +
  geom_bar()

Comment: is the dataset balanced between the two groups?

3.7.6 Create a New Variable (Feature) for Bar Charts

Use case_when() to create a bmi_category variable, e.g. Underweight < 18.5, Normal 18.5–24.9, Overweight 25–29.9, Obese ≥ 30) and save the updated data frame

Show code
yani |> 
  mutate(bmi_category = case_when(
    bmi < 18.5 ~ "Underweight",
    bmi <= 24.9 ~ "Normal",
    bmi <= 29.9 ~ "Overweight",
    bmi > 29.9 ~ "Obese",
    .default = as.character(bmi))
    ) ->
  yani

Create a bar chart with fill = bmi_category to show the proportional breakdown of bmi_categorywithin each sex group on a single plot.

Show code
ggplot(yani, aes(x = sex, fill = bmi_category)) +
  geom_bar()

Convert bmi_category to a factor with the order Obese, Overweight, Normal, and Underweight and redo the plot.

Show code
yani$bmi_category <- factor(yani$bmi_category, 
levels = c("Obese", "Overweight", "Normal", "Underweight"))

ggplot(yani, aes(x = sex, fill = bmi_category)) +
  geom_bar()

3.7.7 Create a Layered Histogram and Density Plot

Add geom_density() on top of geom_histogram() for bmi.

  • This one is a little tricky as you have to use after_stat() to rescale the heights (y axis values) for the histogram after calculating the the density plot.
Show code
ggplot(yani, aes(x = bmi)) +
  geom_histogram(aes(y = after_stat(density), fill = sex),
    bins = 30, alpha = 0.4, position = "identity" ) +
  geom_density(aes(color = sex), linewidth = 1)

3.7.8 Summary

In the space below, write 3–5 sentences summarizing what you found most interesting about this dataset.

Write your summary here.