10  Unsupervised Learning Methods

Author: Richard Ressler

Affiliation: American University

Published: April 16, 2024

10.1 Unsupervised Learning

Supervised Learning, with labeled data, allows us to check our models. While we don’t know the exact true nature of the relationship, we do know some true results.

Unsupervised learning uses unlabeled data. We don’t know the true results.

Unsupervised learning is often about “exploring the data” to find relationships that might help us understand.

  • We have to be careful about whether relationships that emerge are “true” or just artifacts of the noise in the data.

10.2 Principal Components Analysis (PCA) Review

PCA was an unsupervised method where we were looking to find relationships among the \(X\) that would allow us to reduce multicollinearity among the predictors and reduce the dimension of our model.

We used PCA for Principal Components Regression (PCR) where we were willing to trade off a little bias for a reduction in the variance of our parameter estimates by using \(K < p\) components that captured “most” of the variance in the predictors \(X\).

Recall we determined that scaling the predictors is important in PCR when we have wide ranges in the magnitudes of the predictors.

10.3 Clustering Methods (12.4)

Clustering is a broad set of unsupervised methods for trying to determine which observations are “similar” to each other based on the variables in the data.

We will use a measure of “similarity” to identify subgroups where the observations in a subgroup are more similar to each other than to members of other subgroups.

We are trying to find the groups that are most homogeneous.

  • These methods can differ based on how they measure “similar” and the purpose of the analysis.

  • A very popular method is \(K\)-means Clustering where you first determine how many subgroups you think are appropriate, \(K\), and then try to find the \(K\) subgroups within the observations.

  • A second popular method is Hierarchical Clustering where you aggregate observations together until you have created clusters of size 1 to size \(n\) and then you determine how many clusters are appropriate.

Concept: We want to maximize the distance between clusters (sub-groups) while minimizing the distances between points within each cluster in \(p\) dimensional space.

  • We could make each point its own cluster, but that has not simplified anything.

Goal: For a given \(K\), assign each point to one of the \(K\) clusters so as to minimize the following objective function, a weighted sum of the squared distances between the points within each cluster:

\[ \sum_{k=1}^{K} \frac{1}{\underbrace{|C_k|}_\text{size, points in cluster k}}\sum_{i, j \in C_k}\underbrace{||X_i - X_j||^2}_\text{L2 Norm for Distance} \]

This will help us create clusters that are the most dense, with the points in each cluster closest to each other.
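To make the objective concrete, here is a small helper (the name wss_objective is made up just for this illustration) that evaluates the objective for a given cluster assignment. It uses the fact that summing the squared distances over all ordered pairs \(i, j\) counts each unordered pair twice.

wss_objective <- function(X, cluster) {
  sum(sapply(unique(cluster), function(k) {
    Ck <- as.matrix(X[cluster == k, , drop = FALSE])
    2 * sum(dist(Ck)^2) / nrow(Ck)  # (1/|C_k|) * sum over i, j in C_k of ||X_i - X_j||^2
  }))
}

Lower values correspond to denser clusters; the clustering methods below try to drive this quantity down.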

How do we solve this? Exhaustive Search?

  • If we have a sample of size \(n\), how many ways can we split it?
  • There are roughly \(K^n\) possible assignments, so an exhaustive search is infeasible for any reasonably sized data set (see the quick check below).
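A quick back-of-the-envelope check in R shows how hopeless an exhaustive search is, even for modest sample sizes (these values of \(K\) and \(n\) are arbitrary):

K <- 3
n <- c(10, 20, 50)
K^n  # rough count of possible assignments to check for each n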

Two popular methods use a sequential (greedy) search to find a “local optimum”, as that requires much less computation than finding the true “global optimum”.

  • Hierarchical Clustering
  • \(K\)-means Clustering

10.4 Hierarchical Clustering

Concept: Build an upside-down tree starting with \(n\) leaves on the bottom and create clusters (branches) as we work our way up from the leaves, building a hierarchy of similar observations within clusters.

We have to make two choices:

  • A measure for distance, here we are using the \(L_2\) Norm but others are possible.
  • A method for determining the “linkage” between the clusters so we can calculate their “dissimilarity”.
    • Four common linkages: Complete, Single, Average, and Centroid of which Average and Complete are generally preferred.
    • There are many others.
    • The linkage can affect the shape of the clusters in \(p\) space.

An Algorithm

  1. Start with each observation as its own cluster - a leaf node.
  2. Calculate the distances between every pair of clusters, i.e., \({n \choose 2} = \frac{n(n-1)}{2}\) calculations, which is \(O(n^2)\).
  3. Calculate the chosen dissimilarity between each pair of clusters by computing all pairwise distances between the observations in cluster A and the observations in cluster B and then combining them according to the chosen “linkage” (see the short sketch after this list). If there are 3 observations in cluster A and 2 in cluster B there will be 6 distances.
  • For Complete (maximal inter-cluster dissimilarity) take the largest distance.
  • For Average (mean inter-cluster dissimilarity) take the average of the distances.
  4. Merge the two clusters with the smallest dissimilarity (most similar; the closest) into a single cluster.
  5. Repeat steps 2-4 until only one cluster remains (the root node).
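Here is a toy sketch of step 3 (the data are made up just for this illustration): cluster A has 3 observations and cluster B has 2, so there are 6 pairwise distances, and the linkage decides how to collapse them into a single dissimilarity value.

set.seed(1)
A <- matrix(rnorm(3 * 2), ncol = 2)      # 3 observations in 2 dimensions
B <- matrix(rnorm(2 * 2), ncol = 2) + 3  # 2 observations, shifted away from A
pairwise <- as.matrix(dist(rbind(A, B)))[1:3, 4:5]  # the 6 A-to-B distances
max(pairwise)   # complete linkage: largest distance
mean(pairwise)  # average linkage: mean distance
min(pairwise)   # single linkage: smallest distance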

The height of the branches from one set of clusters to their joined (or fused) cluster is based on the chosen dissimilarity.

  • The greater the height at which two clusters are fused into a common cluster, the more dissimilar the two clusters.

Now that you have a tree (known as a dendrogram, from the Greek for “tree drawing”), you can choose the number of clusters you think appropriate by making a horizontal cut across the tree.

  • The number of branches you cut with the horizontal line determines the number of clusters.

10.4.1 Example of Hierarchical Clustering in R with USArrests Data

Let’s use a small data set, datasets::USArrests.

  • This data set contains statistics, in arrests per 100,000 residents, for assault, murder, and rape in each of the 50 US states in 1973. Also given is the percent of the population living in urban areas.
  • The states are row labels and not predictors so all the data is of class numeric.
head(USArrests)
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

We use stats::hclust() for hierarchical clustering.

  • dist() first produces a distance matrix across all the points.
    • The default is the \(L_2\) or euclidean norm.
    • Here that means for two observations (Alabama and Alaska), \(\text{dist}(\text{AL, AK}) = \sqrt{\sum_{i=1}^4(USArrests[1, i] - USArrests[2, i])^2}\) (a quick check of this appears just after this list).
    • This produces a lower triangular matrix of the distance between each state.
    • If you have mixed classes of data, consider the {cluster} package to compute the distances.
  • method = "complete" is the default agglomeration method for hclust.
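As a quick sanity check (just base R, nothing new), we can confirm what dist() computes for the first two states:

sqrt(sum((USArrests[1, ] - USArrests[2, ])^2))  # Euclidean distance, Alabama vs Alaska, by hand
as.matrix(dist(USArrests))[1, 2]                # the same entry from dist()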
# View(dist(USArrests) |> as.matrix() |> as.data.frame())
HC <- hclust(dist(USArrests))
HC

Call:
hclust(d = dist(USArrests))

Cluster method   : complete 
Distance         : euclidean 
Number of objects: 50 

We can plot the dendrogram (with plot.hclust()).

Show code
plot(HC)
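To see where a horizontal cut would land, you can add a line to the dendrogram at any height (the height of 150 below is just an illustrative value):

plot(HC)
abline(h = 150, col = "red", lty = 2)  # branches crossed by this line become the clusters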

Let’s cut the tree to five clusters with stats::cutree() which produces a named vector of the clustered points.

HC5 <- cutree(HC, k = 5)
HC5
       Alabama         Alaska        Arizona       Arkansas     California 
             1              1              1              2              1 
      Colorado    Connecticut       Delaware        Florida        Georgia 
             2              3              1              4              2 
        Hawaii          Idaho       Illinois        Indiana           Iowa 
             5              3              1              3              5 
        Kansas       Kentucky      Louisiana          Maine       Maryland 
             3              3              1              5              1 
 Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
             2              1              5              1              2 
       Montana       Nebraska         Nevada  New Hampshire     New Jersey 
             3              3              1              5              2 
    New Mexico       New York North Carolina   North Dakota           Ohio 
             1              1              4              5              3 
      Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
             2              2              3              2              1 
  South Dakota      Tennessee          Texas           Utah        Vermont 
             5              2              2              3              5 
      Virginia     Washington  West Virginia      Wisconsin        Wyoming 
             2              2              5              5              2 

We can see that each state has a cluster number.

With table() we can see how many are in each cluster.

table(HC5)
HC5
 1  2  3  4  5 
14 14 10  2 10 
USArrests[HC5 == 4,]
               Murder Assault UrbanPop Rape
Florida          15.4     335       80 31.9
North Carolina   13.0     337       45 16.1

What if we used a different method, say centroid?

HCC <- hclust(dist(USArrests), method = "centroid")
plot(HCC)

HCC5 <- cutree(HCC, 5)
table(HCC5)
HCC5
 1  2  3  4  5 
14 14 20  1  1 
HCC5[HCC5 >= 4]
       Florida North Carolina 
             4              5 

We get a different hierarchy with centroid linkage. If we use “average” linkage, cutting at five clusters gives the same cluster memberships as “complete”.

HCA <- hclust(dist(USArrests), method = "average")
plot(HCA)

HCA5 <- cutree(HCA, 5)
table(HCA5)
HCA5
 1  2  3  4  5 
14 14 10  2 10 
data.frame(HC5, HCA5, HCC5)[11:20,]
          HC5 HCA5 HCC5
Hawaii      5    5    3
Idaho       3    3    3
Illinois    1    1    1
Indiana     3    3    3
Iowa        5    5    3
Kansas      3    3    3
Kentucky    3    3    3
Louisiana   1    1    1
Maine       5    5    3
Maryland    1    1    1
  • We can see some states stay in the same cluster and others change depending upon the method (the cross-tabulations below make this easier to see).
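A cross-tabulation with base table() summarizes the agreement between linkage methods more compactly than scanning the data frame:

table(complete = HC5, average = HCA5)
table(complete = HC5, centroid = HCC5)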

10.5 \(K\)-Means Clustering

Not all populations have a predictable or inherent hierarchical structure.

Example:

  • If you split the USA into 51 clusters you might wind up with the 50 states and DC. The states can then split into counties (about 3,000). That seems hierarchical.
  • If you have data with hobbies, gender, and education, that does not create a natural hierarchy.

An alternative is \(K\)-means clustering.

\(K\)-means, like most of the methods we have looked at, is non-parametric and makes no assumptions about the underlying joint distribution of the \(X\).

Algorithm:

  1. Pick a \(K\) for the number of clusters where \(1<K<n\).
  2. Initialize: Assign the sample observations to clusters \(1\) to \(K\) at random (set the seed).
  3. Calculate the centroid \(\hat{\mu}_k\), (the multivariate mean of all points in the cluster) of each of the \(K\) clusters.
  4. Calculate the “distance” from each point to the centroid of each of the \(K\) clusters.
  5. Reassign observation \(X_i\) to the cluster \(k\) with the closest centroid so as to minimize the following:

\[\min_{k \in \{1, \dots, K\}} ||X_i - \hat{\mu}_{k}|| = \min_{k}\sqrt{\sum_{j=1}^p (X_{ij} - \hat{\mu}_{kj})^2} \tag{10.1}\]

Equation 10.1 suggests we should standardize (scale) the data before attempting to cluster when the data is of different magnitudes. Why?
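To see why, here is a quick illustration that reuses the (unscaled) USArrests data from the previous section: Assault is measured in the hundreds while Murder is in the teens, so Assault dominates the unscaled distances.

as.matrix(dist(USArrests))[1, 2]         # distance dominated by the Assault column
as.matrix(dist(scale(USArrests)))[1, 2]  # each variable contributes on a comparable scale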

What is the border between two clusters in \(p\) space?

  • It is the set of points equidistant from the two centroids, i.e., the points \(x\) with \(||x - \hat{\mu}_{j}|| = ||x - \hat{\mu}_{k}||\).
  • In 2D, it would be a line, and in higher dimensions, it is a hyperplane.
  • All the clusters will be separated by hyperplanes.
  6. Now go back to step 3 and repeat steps 3-5 until the sum of the differences between the centroids at iteration \(n-1\) and iteration \(n\) is less than some \(\epsilon\). At that point we consider the algorithm to have converged on a solution, as no points will shift from one cluster to another.

This is essentially a two step process: Assign points to the closest cluster and then update the cluster centroids and check if every point is in its closest cluster. If not, reassign and repeat.
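To make the two steps concrete, here is a minimal from-scratch sketch of the assign/update loop. It is for illustration only (use stats::kmeans() in practice); the function name kmeans_sketch and its arguments are made up here, and it does not handle clusters that become empty.

kmeans_sketch <- function(X, K, max_iter = 25) {
  X <- as.matrix(X)
  n <- nrow(X)
  cluster <- sample(rep(1:K, length.out = n))  # step 2: random initial assignment
  for (iter in seq_len(max_iter)) {
    # step 3: centroid (multivariate mean) of each cluster
    centers <- t(sapply(1:K, function(k) colMeans(X[cluster == k, , drop = FALSE])))
    # steps 4-5: squared distance from every point to every centroid; keep the closest
    d2 <- sapply(1:K, function(k) colSums((t(X) - centers[k, ])^2))
    new_cluster <- max.col(-d2)                # column index of the smallest distance
    if (all(new_cluster == cluster)) break     # step 6: stop when no point changes cluster
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers, iter = iter)
}

Because the algorithm stops at a local optimum that depends on the random start, stats::kmeans() provides an nstart argument to run several random starts and keep the best solution.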

When the algorithm finishes it should have

  • minimized the Within-cluster Sum of Squares, \(\sum_k \sum_{i\in C_k} ||X_{i} - \hat{\mu}_{k}||^2\), and
  • maximized the Between-cluster Sum of Squares, \(\sum_{k} |C_k|\,||\hat{\mu}_{k}-\bar{X}||^2\), where \(\bar{X}\) is the overall mean, so the two sum to the Total Sum of Squares.

10.5.1 Example of \(K\)-means Clustering in R with College Data

Let’s load the college data from {ISLR2}.

We can use stats::kmeans() to calculate the clusters.

library(tidyverse)
library(ISLR2)
names(College)
 [1] "Private"     "Apps"        "Accept"      "Enroll"      "Top10perc"  
 [6] "Top25perc"   "F.Undergrad" "P.Undergrad" "Outstate"    "Room.Board" 
[11] "Books"       "Personal"    "PhD"         "Terminal"    "S.F.Ratio"  
[16] "perc.alumni" "Expend"      "Grad.Rate"  

With almost 800 colleges, we want to figure out if there are groups of colleges that are similar, to help us avoid comparing every single college to every other college.

The summary() does not tell us much about potential clusters but it does suggest we scale the data.

summary(College)
 Private        Apps           Accept          Enroll       Top10perc    
 No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
 Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
           Median : 1558   Median : 1110   Median : 434   Median :23.00  
           Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
           3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
           Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
   Top25perc      F.Undergrad     P.Undergrad         Outstate    
 Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
 1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
 Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
 Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
 3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
 Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
   Room.Board       Books           Personal         PhD        
 Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
 1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
 Median :4200   Median : 500.0   Median :1200   Median : 75.00  
 Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
 3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
 Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
    Terminal       S.F.Ratio      perc.alumni        Expend     
 Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
 1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
 Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
 Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
 3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
 Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
   Grad.Rate     
 Min.   : 10.00  
 1st Qu.: 53.00  
 Median : 65.00  
 Mean   : 65.46  
 3rd Qu.: 78.00  
 Max.   :118.00  

Let’s scale the numeric data using scale().

College[,-1] <- scale(College[,-1]) # don't scale Private
Important

\(K\)-Means only works with numeric data as we need to calculate means.

Let’s create a model matrix for the College data containing just the 17 numeric predictors \(X\).

X <- model.matrix(lm(as.numeric(Private) ~ . -1, data = College))
dim(X)
[1] 777  17

Use stats::kmeans() to create \(K=4\) clusters.

set.seed(123)
KM4 <- kmeans(X, centers = 4)
# KM4   # shows the cluster means and the cluster for all observations
summary(KM4) # not very helpful
             Length Class  Mode   
cluster      777    -none- numeric
centers       68    -none- numeric
totss          1    -none- numeric
withinss       4    -none- numeric
tot.withinss   1    -none- numeric
betweenss      1    -none- numeric
size           4    -none- numeric
iter           1    -none- numeric
ifault         1    -none- numeric

The $iter element tells us the algorithm ran four iterations of the outer (assign and update) loop before converging.

KM4$iter
[1] 4
KM4$centers
        Apps     Accept      Enroll  Top10perc  Top25perc F.Undergrad
1  0.5398184  0.1449172  0.02577456  1.9487413  1.5775137  -0.1109848
2  1.7914064  1.9869052  2.18089492  0.1633924  0.3669752   2.2516430
3 -0.3677656 -0.3569215 -0.32040156 -0.6402977 -0.6893761  -0.2883564
4 -0.3125754 -0.2810587 -0.35821885  0.1780618  0.2734593  -0.3829521
  P.Undergrad   Outstate Room.Board       Books   Personal        PhD
1  -0.4167289  1.7892966  1.2033598  0.27197640 -0.4800938  1.1518326
2   1.5744281 -0.5829019 -0.2020433  0.33569776  0.8414841  0.6701259
3  -0.1142392 -0.6373495 -0.5508946 -0.10197151  0.1131701 -0.7154017
4  -0.2790396  0.4736585  0.4034070 -0.06631352 -0.2897056  0.3220929
    Terminal  S.F.Ratio perc.alumni      Expend  Grad.Rate
1  1.1069760 -1.2244617   1.3265167  2.10480472  1.1615206
2  0.6614536  0.6415119  -0.5867402 -0.10629184 -0.3800179
3 -0.7312400  0.3341976  -0.5014563 -0.51170872 -0.5502098
4  0.3570762 -0.2820479   0.4392361  0.06859301  0.4749987
KM4$cluster[1:10]
Abilene Christian University           Adelphi University 
                           3                            3 
              Adrian College          Agnes Scott College 
                           3                            1 
   Alaska Pacific University            Albertson College 
                           3                            4 
     Albertus Magnus College               Albion College 
                           4                            4 
            Albright College    Alderson-Broaddus College 
                           4                            3 

The ratio of between_SS (explained) and total_SS is the \(R^2\).

  • The within cluster SS is the variation within each cluster - what we want to minimize.
  • The between_SS is the variation between the clusters - what we want to maximize.
  • The maximum it can be is the total_SS so the closer the ratio is to 1, the better.
  • The Between SS is the explained variance and the Within SS is the unexplained (perhaps irreducible) variance.
KM4$totss
[1] 13192
sum(KM4$withinss) + KM4$betweenss 
[1] 13192
KM4$betweenss / KM4$totss
[1] 0.4364856

10.5.2 Understanding the Clusters

Let’s do some partial plots to try to understand the clusters.

Show code
plot(College$Expend, College$Grad.Rate, col = KM4$cluster)
plot(College$S.F.Ratio, College$Outstate, col = KM4$cluster)
plot(College$Accept, College$Top25perc, col = KM4$cluster)
plot(College$Apps, College$PhD, col = KM4$cluster)

Show code
ggplot(College, aes(Expend, Grad.Rate, color =  as.factor(KM4$cluster))) +
  geom_point(alpha = .8, size = .5) + labs(color = 'Cluster')
ggplot(College, aes(S.F.Ratio, Outstate, color = as.factor(KM4$cluster))) +
  geom_point(alpha = .8, size = .5) + labs(color = 'Cluster')
ggplot(College, aes(Accept, Top25perc, color = as.factor(KM4$cluster))) +
  geom_point(alpha = .8, size = .5) + labs(color = 'Cluster')
ggplot(College, aes(Apps, PhD, color = as.factor(KM4$cluster))) +
  geom_point(alpha = .8, size = .5) + labs(color = 'Cluster')

By looking at each cluster, you can try to see how the clusters align with the various attributes of interest.

10.5.3 Choosing K?

Some packages will create a plot of \(R^2\) vs \(K\) (see {factoextra}).

We will just create a short for-loop to iterate through various values of \(K\), calculate the \(R^2\), and then plot.

k <- 2:700  # number of clusters to try
r2 <- vector("double", length = length(k))

for (i in seq_along(k)) {
  set.seed(123)
  tempk <- kmeans(X, k[i])
  r2[i] <- tempk$betweenss/tempk$totss
}
plot(k, r2)

Since \(R^2\) is un-adjusted, we can see that increasing \(K\) will always increase \(R^2\) (at the extreme, \(K = n\) puts every point in its own cluster and \(R^2 = 1\)).

Thus we look for a “knee” in the curve to find a reasonable solution.

Let’s look at just the first ten values and it seems \(k=4\) is not a bad choice.

Show code
plot(k[1:10], r2[1:10])
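Another way to spot the knee (just a different view of the same r2 values) is to look at the marginal gain from each additional cluster:

round(diff(r2[1:10]), 3)  # improvement in R^2 from adding one more cluster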

10.5.4 Considerations for \(K\)-Means

  1. Scaling matters. Not scaling can change the \(R^2\) by quite a bit, as high-magnitude variables will dominate the formation of the clusters (see the short illustration after this list).
  • Scaling can also help generate clusters of more similar size.
  2. Standard \(K\)-Means works best on data that can form clusters that are generally the same size, circular, and not overlapping.
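As a quick, purely illustrative check of the first point, we can reuse the unscaled USArrests data from earlier and compare the \(R^2\) with and without scaling (the choices of centers = 4 and nstart = 20 are arbitrary):

set.seed(123)
km_raw    <- kmeans(USArrests, centers = 4, nstart = 20)
km_scaled <- kmeans(scale(USArrests), centers = 4, nstart = 20)
km_raw$betweenss / km_raw$totss      # unscaled: high-magnitude variables dominate
km_scaled$betweenss / km_scaled$totss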

Even when the data “looks nice”, the “local optimum” clusters can be off when there is just enough noise.

Show code
set.seed(123)
x <- c((1:100) / 100, 3 + (1:100) / 100)
df <- data.frame(x = x, 
                 y =  x + rnorm(200, 0, 2),
                 Group = as.factor(c(rep("1", 100), rep("2", 100))))
ggplot(df,(aes(x, y ))) +
  geom_point(aes(shape = Group)) -> plot1
plot1

Show code
samp1 <- kmeans(df[, 1:2], 2)
df$cluster <- as.factor(samp1$cluster)
ggplot(df,(aes(x, y, color = cluster))) +
  geom_point(aes(shape = Group)) +
  geom_point(data = data.frame(samp1$center), mapping = aes(x, y),
             color = "purple", size = 4)

Not circular, different magnitude.

Show code
set.seed(123)
df <- data.frame(x = c((1:100) / 10000, .2 + (1:100) / 10000),
                 y = 50 * rnorm(200, 0, 1),
                 Group = as.factor(c(rep("1", 100), rep("2", 100))))
ggplot(df,(aes(x, y ))) +
  geom_point(aes(shape = Group)) 

Show code
samp2 <- kmeans(df[,1:2], 2)
df$cluster <- as.factor(samp2$cluster)
ggplot(df, aes(x, y , color = cluster)) +
  geom_point(aes(shape = Group)) +
  geom_point(data = data.frame(samp2$center), mapping = aes(x, y), 
             color = "purple", size = 4)

Overlapping.

Show code
set.seed(1234)
df <- data.frame(x = c((1:100) / 100, .7 + (1:100) / 100),
                 y = rnorm(200, 0, 1),
                 Group = as.factor(c(rep("1", 100), rep("2", 100))))
ggplot(df,(aes(x, y ))) +
  geom_point(aes(shape = Group, color = Group)) 

Show code
samp3 <- kmeans(df[,1:2], 2)
df$cluster <- as.factor(samp3$cluster)
ggplot(df, aes(x, y , color = cluster)) +
  geom_point(aes(shape = Group)) +
  geom_point(data = data.frame(samp3$center), mapping = aes(x, y), 
             color = "purple", size = 4)

Not similar in size.

Show code
set.seed(1234)
df <- data.frame(x = c((1:100) / 100, .7 + (1:100) / 100),
                 y = c(rnorm(150, 0, 1),rnorm(50, 0, .1)),
                 Group = as.factor(c(rep("1", 150), rep("2", 50))))
ggplot(df,(aes(x, y ))) +
  geom_point(aes(shape = Group, color = Group)) 

Show code
samp3 <- kmeans(df[,1:2], 2)
df$cluster <- as.factor(samp3$cluster)
ggplot(df, aes(x, y , color = cluster)) +
  geom_point(aes(shape = Group)) +
  geom_point(data = data.frame(samp3$center), mapping = aes(x, y), 
             color = "purple", size = 4)

Important

\(K\)-means is a popular method for unsupervised learning to understand how observations might be clustered into meaningful groups.

Like all methods, use it with caution, seasoned with some domain knowledge, to assess interpretation and usefulness.