This package is specifically designed to help draw Hotelling’s T-squared ellipses on PCA or PLS score scatterplots, a crucial tool in chemometrics for multivariate data analysis and quality control. In the field of chemometrics, these ellipses serve as powerful visual aids for measurements assessment, outlier detection, and process monitoring. By superimposing Hotelling’s T-squared ellipses onto score plots, analysts can quickly identify abnormal samples and establish well-defined confidence regions for normal process operations.
The package’s functionality enables computing the Hotelling’s T-squared statistic, the semi-minor and semi-major axes of the Hotelling’s T-squared ellipse, and generates the coordinate points necessary for constructing a confidence ellipse. Specifically, the package provides two primary functions:
ellipseParam()
, is used to calculate the Hotelling’s
T-squared statistic and the semi-axes of the confidence ellipses at 99%
and 95% confidence intervals.
ellipseCoord()
, is primarily used to get the
x, y, and z coordinates for plotting 2D or 3D
confidence ellipses at user-defined confidence interval. The confidence
interval is set at 95% by default.
In this example, we use FactoMineR::PCA()
to perform the
Principal Component Analysis (PCA) on a LIBS spectral dataset
specData
, and extract the PCA scores.
set.seed(002)
pca_mod <- specData %>%
select(where(is.numeric)) %>%
PCA(scale.unit = FALSE, graph = FALSE)
pca_scores <- pca_mod %>%
pluck("ind", "coord") %>%
as_tibble() %>%
print()
#> # A tibble: 100 × 5
#> Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 25306. -10831. -1851. -83.4 -560.
#> 2 -67.3 1137. -2946. 2495. -568.
#> 3 -1822. -22.0 -2305. 1640. -409.
#> 4 -1238. 3734. 4039. -2428. 379.
#> 5 3299. 4727. -888. -1089. 262.
#> 6 5006. -49.5 2534. 1917. -970.
#> 7 -8325. -5607. 960. -3361. 103.
#> 8 -4955. -1056. 2510. -397. -354.
#> 9 -1610. 1271. -2556. 2268. -760.
#> 10 19582. 2289. 886. -843. 1483.
#> # ℹ 90 more rows
The Hotelling’s T-squared statistic can be computed using the
ellipseParam
function in two distinct ways. (1) By
specifying the parameter k
: This represents the number of
principal components to retain. (2) By setting the
threshold
parameter: This defines the cumulative explained
variance to determine the number of components.
T2
#> [1] 8.79727659 0.42574446 0.26720102 1.12758053 0.73815225 0.50178543
#> [7] 1.48895076 0.52248052 0.35995391 3.43505326 2.20554170 0.14665580
#> [13] 1.23019477 1.08187330 0.06404802 0.82350867 4.12091044 0.18901522
#> [19] 0.74937040 1.81186352 1.01387374 0.36322618 0.18201057 1.19587478
#> [25] 0.82387604 0.59000025 1.64384908 1.35023679 0.04035757 1.07207307
#> [31] 1.35289851 0.89287224 0.58566878 0.51087876 0.33557545 0.35322970
#> [37] 2.75957814 0.73510967 0.42097537 0.65499771 0.41084500 0.26551834
#> [43] 0.48229565 0.36048584 0.25235451 0.76994454 2.93735077 0.07159629
#> [49] 0.15337115 0.09562119 0.12416659 0.41598605 0.38472295 0.21898479
#> [55] 0.20860524 1.26153407 0.26389302 1.35227317 0.05781487 0.65100278
#> [61] 2.61898699 0.94048980 1.60881058 1.02169526 0.41282496 0.27235598
#> [67] 0.14854682 0.66617260 2.82560083 0.12785702 1.25327046 0.66607730
#> [73] 1.22850658 1.13312107 1.46864227 1.46171265 0.10436499 0.26006454
#> [79] 0.97128044 1.92028425 1.95885654 0.36084958 4.84793496 0.24714317
#> [85] 0.55723720 0.64080960 1.87718652 0.14398651 0.50051409 0.43680390
#> [91] 0.76213175 0.58130268 1.48307214 0.55700045 0.17208837 0.11091915
#> [97] 0.76406153 0.66296513 0.11773916 1.40404240
T2
#> [1] 5.35198031 0.80483847 0.41280666 1.09396270 0.54357239 1.05866612
#> [7] 1.54758552 0.38932531 0.85084455 3.33998923 1.33661709 2.67345992
#> [13] 1.44052380 0.74554569 0.09716023 0.63692986 3.03905923 0.14527271
#> [19] 1.65973096 2.03032292 0.68825673 0.70721780 1.05449654 0.95268942
#> [25] 0.67437285 0.51832530 1.60329970 1.65265923 0.46804197 1.61273965
#> [31] 0.90591680 0.83720307 0.57655751 1.27928674 0.40830752 0.25568007
#> [37] 1.92746313 0.43666648 0.42052805 0.53674167 0.33326678 0.31518697
#> [43] 0.53420621 0.28056547 0.39345409 1.31456541 2.34976600 0.21583160
#> [49] 0.19463398 0.06499698 0.76421962 0.82346628 0.60492739 0.25441618
#> [55] 0.44826694 0.75245708 0.84507014 0.97855109 0.25134633 0.68527571
#> [61] 2.26652396 1.59851453 1.12427685 0.84476325 0.25321959 0.44806235
#> [67] 0.74685450 0.45892445 2.81481934 0.49094953 0.80909145 0.57714113
#> [73] 0.77330414 0.74629051 1.84932002 0.88840206 0.46801132 0.33977349
#> [79] 0.58104753 1.35207024 1.73787700 0.51984216 2.91262151 0.20476197
#> [85] 0.70355158 0.55268503 1.63351195 0.24424689 0.55678461 1.86446207
#> [91] 0.45875232 0.39755343 0.98469700 0.55441164 0.20947523 0.63142698
#> [97] 0.71018219 0.51824037 0.17165582 0.88278586
T2
#> [1] 13.098377222 0.053558139 0.042763017 0.596881084 1.064949043
#> [6] 0.322634358 2.193588461 0.362201444 0.100269823 5.152202188
#> [11] 0.136476512 0.187288163 0.457018299 0.836596116 0.037948942
#> [16] 0.499423175 6.201798003 0.283938583 1.053279861 2.176093882
#> [21] 1.495054754 0.328273167 0.001077433 0.385735197 0.770015903
#> [26] 0.855099441 1.918512180 1.898354875 0.059730859 1.544579259
#> [31] 2.032241146 1.336688195 0.733955470 0.192465562 0.189870642
#> [36] 0.228768334 2.513724266 0.227842165 0.596628858 0.794052585
#> [41] 0.579569949 0.180495608 0.375303017 0.101758464 0.290449128
#> [46] 1.151911637 0.616332156 0.016266200 0.230869862 0.026046210
#> [51] 0.181580829 0.187191855 0.318096039 0.049303079 0.301571988
#> [56] 1.849456533 0.093358887 0.338259448 0.054705074 0.865871257
#> [61] 3.419405506 0.932424618 1.363364796 0.870186313 0.512986568
#> [66] 0.075112028 0.016206207 0.195475035 3.986477406 0.142755791
#> [71] 0.379374196 0.809324698 1.859972412 0.058403575 1.958324315
#> [76] 0.060646515 0.131828892 0.155979037 0.731558582 2.704001013
#> [81] 0.243720099 0.546490035 6.914605380 0.369950255 0.425961362
#> [86] 0.376793369 0.040293609 0.146837113 0.524107524 0.652320554
#> [91] 0.013362935 0.319746716 1.968145064 0.729551769 0.214084611
#> [96] 0.093838067 0.917982030 0.158779544 0.089226394 2.122042242
T2
#> [1] 6.53045498 0.77985914 0.39875666 1.27633910 0.63623773 0.64621852
#> [7] 1.94706323 0.39955561 0.65027891 2.60270069 1.68777974 2.33832862
#> [13] 1.73627591 0.89148735 0.09154971 0.61396625 3.74797974 0.14792815
#> [19] 1.16098653 1.37203684 0.75262904 0.74133604 0.39917534 1.18375135
#> [25] 0.84423117 0.49281117 2.02255319 1.09460844 0.09591725 0.96401329
#> [31] 1.02350705 1.05620821 0.58298463 1.26807473 0.51557341 0.27067599
#> [37] 2.41734914 0.54565033 0.47149967 0.56074231 0.36897968 0.39285391
#> [43] 0.52187589 0.26763215 0.48000911 1.01010296 2.54789673 0.16441988
#> [49] 0.11408734 0.07745065 0.89519603 1.01950514 0.59274496 0.25875285
#> [55] 0.50533086 0.94128973 1.06662628 1.21217577 0.13332013 0.49597035
#> [61] 2.49744440 1.05004728 1.20861993 0.82193478 0.31977530 0.36498547
#> [67] 0.91019140 0.50545945 2.13281990 0.27229890 0.93957749 0.72809042
#> [73] 0.94080967 0.89232795 1.12148274 1.12208355 0.58120095 0.24318126
#> [79] 0.72706291 1.70413391 1.59117880 0.26850066 3.67343823 0.25443017
#> [85] 0.75731684 0.48234231 1.94290842 0.11555869 0.37941945 1.20420583
#> [91] 0.56964432 0.45545274 1.11992496 0.57298359 0.14954172 0.75681961
#> [97] 0.89556581 0.49625512 0.14666752 1.05902198
To visualize the confidence region for our multivariate data, we
employ the ellipseParam()
function to generate a confidence
ellipse. Our objective is to calculate the lengths of the semi-axes for
this ellipse, focusing on the bivariate relationship within the PC2-PC4
subspace of our principal component analysis. We maintain the default
value for k
(number of components) at 2. This ensures we’re
working with a two-dimensional representation. We specify
pcx = 2
and pcy = 4
as inputs. This directs
the function to use the 2nd and 4th principal components for the
x and y axes, respectively.
str(ellipse_axes)
#> List of 5
#> $ Tsquare : tibble [100 × 1] (S3: tbl_df/tbl/data.frame)
#> ..$ value: num [1:100] 13.0984 0.0536 0.0428 0.5969 1.0649 ...
#> $ cutoff.99pct: num 9.76
#> $ cutoff.95pct: num 6.24
#> $ nb.comp : num 2
#> $ Ellipse : tibble [1 × 4] (S3: tbl_df/tbl/data.frame)
#> ..$ a.99pct: num 19369
#> ..$ b.99pct: num 8416
#> ..$ a.95pct: num 15492
#> ..$ b.95pct: num 6732
We can extract parameters for further use:
a1 <- ellipse_axes %>% pluck("Ellipse", "a.99pct")
b1 <- ellipse_axes %>% pluck("Ellipse", "b.99pct")
a2 <- ellipse_axes %>% pluck("Ellipse", "a.95pct")
b2 <- ellipse_axes %>% pluck("Ellipse", "b.95pct")
t1 <- round(as.numeric(pca_mod$eig[1,2]), 2)
t2 <- round(as.numeric(pca_mod$eig[2,2]), 2)
t3 <- round(as.numeric(pca_mod$eig[3,2]), 2)
pca_scores %>%
ggplot(aes(x = Dim.1, y = Dim.3)) +
geom_ellipse(aes(x0 = 0, y0 = 0, a = a1, b = b1, angle = 0), linewidth = .5, linetype = "solid", fill = "white") +
geom_ellipse(aes(x0 = 0, y0 = 0, a = a2, b = b2, angle = 0), linewidth = .5, linetype = "solid", fill = "white") +
geom_point(aes(fill = Tsq), shape = 21, size = 3, color = "black") +
scale_fill_viridis_c(option = "viridis") +
geom_hline(yintercept = 0, linetype = "solid", color = "black", linewidth = .2) +
geom_vline(xintercept = 0, linetype = "solid", color = "black", linewidth = .2) +
labs(
title = "Scatterplot of PCA scores",
subtitle = "PC1 vs. PC3",
x = glue("PC1 [{t1}%]"),
y = glue("PC3 [{t3}%]"),
fill = "T2"
) +
theme_grey() +
theme(
aspect.ratio = .7,
panel.grid = element_blank(),
panel.background = element_rect(
colour = "black",
linewidth = .3
)
)
An alternative method for visualizing the Hotelling’s T-squared
confidence region is to utilize the ellipseCoord()
function. This function generates the x and y
coordinates for plotting the confidence ellipse, offering greater
flexibility in visualization and further analysis. It allows users to
specify a custom confidence level via the confi.limit
parameter. The default confidence level is set at 95%, which is commonly
used in statistical analyses. The function returns a set of coordinates
that define the ellipse’s boundary in the chosen subspace, thereby
complementing the semi-axes information provided by the
ellipseParam()
function.
In the example below, we focus on the subspace spanned by the 1st and 5th components.
str(xy_coord)
#> tibble [500 × 2] (S3: tbl_df/tbl/data.frame)
#> $ x: num [1:500] 9620 9620 9617 9613 9608 ...
#> $ y: num [1:500] 5.93e-14 9.44e+01 1.89e+02 2.83e+02 3.77e+02 ...
ggplot() +
geom_polygon(data = xy_coord, aes(x, y), color = "black", fill = "white") +
geom_point(data = pca_scores, aes(x = Dim.2, y = Dim.3), shape = 21, size = 3, fill = "black", color = "black", alpha = 0.7) +
geom_hline(yintercept = 0, linetype = "solid", color = "black", linewidth = .2) +
geom_vline(xintercept = 0, linetype = "solid", color = "black", linewidth = .2) +
labs(
title = "Scatterplot of PCA scores",
subtitle = "PC2 vs. PC3",
x = glue("PC2 [{t2}%]"),
y = glue("PC3 [{t3}%]")
) +
theme_grey() +
theme(
aspect.ratio = .7,
panel.grid = element_blank(),
panel.background = element_rect(
colour = "black",
linewidth = .3
)
)
The ellipseCoord
function offers an optional parameter
pcz
. When activated, this parameter extends the function’s
capabilities from two-dimensional ellipses to three-dimensional
ellipsoids, providing a more comprehensive visualization of multivariate
data, particularly in situation where the first two principal components
do not adequately capture the data’s variability.
In the example below, the resulting 3D Hotelling’s T-squared ellipsoid serves as a volumetric confidence region in the subspace spanned by the 1st, 2nd, and 3rd components. It encapsulates a specified proportion of data points, determined by the confidence level, providing a more holistic view of the data’s distribution and outliers in three dimensions.
str(xyz_coord)
#> tibble [10,000 × 3] (S3: tbl_df/tbl/data.frame)
#> $ x: num [1:10000] 7.75e-13 7.75e-13 7.75e-13 7.75e-13 7.75e-13 ...
#> $ y: num [1:10000] -2.16e-13 -2.16e-13 -2.16e-13 -2.16e-13 -2.16e-13 ...
#> $ z: num [1:10000] 7745 7745 7745 7745 7745 ...
color_palette <- viridisLite::viridis(nrow(pca_scores))
scaled_T2 <- scales::rescale(T2, to = c(1, nrow(pca_scores)))
point_colors <- color_palette[round(scaled_T2)]
rgl::setupKnitr(autoprint = TRUE)
rgl::plot3d(
x = xyz_coord$x,
y = xyz_coord$y,
z = xyz_coord$z,
xlab = "PC1",
ylab = "PC2",
zlab = "PC3",
type = "l",
lwd = 0.5,
col = "lightgray",
alpha = 0.5)
rgl::points3d(
x = pca_scores$Dim.1,
y = pca_scores$Dim.2,
z = pca_scores$Dim.3,
col = point_colors,
size = 5,
add = TRUE)
rgl::bgplot3d({
par(mar = c(0,0,0,0))
plot.new()
color_legend <- as.raster(matrix(rev(color_palette), ncol = 1))
rasterImage(color_legend, 0.85, 0.1, 0.9, 0.9)
text(
x = 0.92,
y = seq(0.1, 0.9, length.out = 5),
labels = round(seq(min(T2), max(T2), length.out = 5), 2),
cex = 0.7)
text(x = 0.92, y = 0.95, labels = "T2", cex = 0.8)})
rgl::view3d(theta = 30, phi = 25, zoom = .8)