We’ll use the classic iris dataset. It contains 150
flower measurements from three species.
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
The variables Sepal.Length, Sepal.Width,
Petal.Length, and Petal.Width will be used for
PCA, while Species is a grouping factor.
We’ll use prcomp() on the four numeric variables and
scale them to equalize variance.
pca_fit <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca_fit)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
The summary shows how much variance is explained by each principal component (PC).
We can plot the first two principal components using base R. Points are colored by species.
species_colors <- c("setosa" = "red", "versicolor" = "green", "virginica" = "blue")
plot(pca_fit$x[, 1], pca_fit$x[, 2],
col = species_colors[iris$Species],
pch = 19,
xlab = "PC1",
ylab = "PC2",
main = "Base R PCA Plot: iris dataset")
legend("topright", legend = names(species_colors), col = species_colors, pch = 19)
for (sp in unique(iris$Species)) {
dat <- pca_fit$x[iris$Species == sp, 1:2]
dataEllipse(dat[, 1], dat[, 2],
levels = 0.95,
add = TRUE,
col = species_colors[sp],
lwd = 2)
}
These ellipses represent the regions containing about 95% of points for each species.
Now, let’s use FactoMineR to see which variables contribute most to each principal component.
res_pca <- PCA(iris[, 1:4], graph = FALSE)
# Contributions to PC1 and PC2
res_pca$var$contrib
## Dim.1 Dim.2 Dim.3 Dim.4
## Sepal.Length 27.150969 14.24440565 51.777574 6.827052
## Sepal.Width 7.254804 85.24748749 5.972245 1.525463
## Petal.Length 33.687936 0.05998389 2.019990 64.232089
## Petal.Width 31.906291 0.44812296 40.230191 27.415396
This table shows how much each original variable contributes to each principal component.
We can use fviz_pca_var() to visualize which variables
load most strongly on each PC axis.
fviz_pca_var(res_pca,
col.var = "contrib",
gradient.cols = c("blue", "orange", "red"),
repel = TRUE)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## ℹ The deprecated feature was likely used in the ggpubr package.
## Please report the issue at <https://github.com/kassambara/ggpubr/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Finally, let’s compare visually: do the first two PCs separate the species better than any two raw measurements?
par(mfrow = c(1, 2))
# Raw variables
plot(iris$Petal.Length, iris$Petal.Width,
col = species_colors[iris$Species],
pch = 19,
main = "Raw: Petal.Length vs Petal.Width")
# PCA axes
plot(pca_fit$x[, 1], pca_fit$x[, 2],
col = species_colors[iris$Species],
pch = 19,
main = "PCA: PC1 vs PC2")
legend("topright", legend = names(species_colors), col = species_colors, pch = 19)