PCA Example with iris Dataset

1. Load and Inspect Data

We’ll use the classic iris dataset. It contains 150 flower measurements from three species.

data(iris)
head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

The variables Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width will be used for PCA, while Species is a grouping factor.

2. Perform PCA with Base R

We’ll use prcomp() on the four numeric variables and scale them to equalize variance.

pca_fit <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca_fit)

## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion  0.7296 0.9581 0.99482 1.00000

The summary shows how much variance is explained by each principal component (PC).

3. Visualize PCA Results (Base Plot)

We can plot the first two principal components using base R. Points are colored by species.

species_colors <- c("setosa" = "red", "versicolor" = "green", "virginica" = "blue")

plot(pca_fit$x[, 1], pca_fit$x[, 2],
     col = species_colors[iris$Species],
     pch = 19,
     xlab = "PC1",
     ylab = "PC2",
     main = "Base R PCA Plot: iris dataset")
legend("topright", legend = names(species_colors), col = species_colors, pch = 19)

for (sp in unique(iris$Species)) {
  dat <- pca_fit$x[iris$Species == sp, 1:2]
  dataEllipse(dat[, 1], dat[, 2], 
              levels = 0.95, 
              add = TRUE, 
              col = species_colors[sp], 
              lwd = 2)
}

These ellipses represent the regions containing about 95% of points for each species.

5. Examine Variable Contributions (FactoMineR)

Now, let’s use FactoMineR to see which variables contribute most to each principal component.

res_pca <- PCA(iris[, 1:4], graph = FALSE)

# Contributions to PC1 and PC2
res_pca$var$contrib

##                  Dim.1       Dim.2     Dim.3     Dim.4
## Sepal.Length 27.150969 14.24440565 51.777574  6.827052
## Sepal.Width   7.254804 85.24748749  5.972245  1.525463
## Petal.Length 33.687936  0.05998389  2.019990 64.232089
## Petal.Width  31.906291  0.44812296 40.230191 27.415396

This table shows how much each original variable contributes to each principal component.

6. Visualize Contributions (FactoMineR / factoextra)

We can use fviz_pca_var() to visualize which variables load most strongly on each PC axis.

fviz_pca_var(res_pca,
             col.var = "contrib",
             gradient.cols = c("blue", "orange", "red"),
             repel = TRUE)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
##   Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

7. Compare to Raw Measurements

Finally, let’s compare visually: do the first two PCs separate the species better than any two raw measurements?

par(mfrow = c(1, 2))

# Raw variables
plot(iris$Petal.Length, iris$Petal.Width,
     col = species_colors[iris$Species],
     pch = 19,
     main = "Raw: Petal.Length vs Petal.Width")

# PCA axes
plot(pca_fit$x[, 1], pca_fit$x[, 2],
     col = species_colors[iris$Species],
     pch = 19,
     main = "PCA: PC1 vs PC2")

legend("topright", legend = names(species_colors), col = species_colors, pch = 19)

Conclusion

The PCA compresses four correlated variables into two new axes that maximize variance captured.
The species clusters above are not more distinct in PCA space than in the raw measurement combinations.
What could explain this? Do other PCs do a better job than 1 and 2?