Title: | Hotelling’s T-Squared Statistic and Ellipse |
---|---|
Description: | Functions to calculate the Hotelling’s T-squared statistic and corresponding confidence ellipses. Provides the semi-axes of the Hotelling’s T-squared ellipses at 95% and 99% confidence levels. Enables users to obtain the coordinates in two or three dimensions at user-defined confidence levels, allowing for the construction of 2D or 3D ellipses with customized confidence levels. Bro and Smilde (2014) <DOI:10.1039/c3ay41907j>. Brereton (2016) <DOI:10.1002/cem.2763>. |
Authors: | Christian L. Goueguel [aut, cre] |
Maintainer: | Christian L. Goueguel |
License: | MIT + file LICENSE |
Version: | 1.2.0 |
Built: | 2025-03-05 06:04:34 UTC |
Source: | https://github.com/christiangoueguel/hotellingellipse |
This function calculates the coordinate points for drawing a Hotelling’s T-squared ellipse based on multivariate data. It can generate points for both 2D and 3D ellipses.
ellipseCoord(x, pcx = 1, pcy = 2, pcz = NULL, conf.limit = 0.95, pts = 200)
x | A matrix, data frame or tibble containing scores from PCA, PLS, ICA, or other dimensionality reduction methods. Each column should represent a component, and each row an observation. |
pcx | An integer specifying which component to use for the x-axis (default is 1). |
pcy | An integer specifying which component to use for the y-axis (default is 2). |
pcz | An integer specifying which component to use for the z-axis for 3D ellipsoids. If NULL (the default), a 2D ellipse is computed. |
conf.limit | A numeric value between 0 and 1 specifying the confidence level for the ellipse (default is 0.95, i.e., 95% confidence). |
pts | An integer specifying the number of points to generate for drawing the ellipse (default is 200). Higher values result in smoother ellipses. |
The function computes the shape and orientation of the ellipse based on the
Hotelling’s T-squared distribution and the specified components. It then generates a set of
points that lie on the ellipse's surface at the specified confidence level.
For 2D ellipses, the function uses two components, pcx and pcy. For 3D
ellipsoids, it uses three components: pcx, pcy, and pcz.
The conf.limit parameter determines the size of the ellipse. A higher confidence
level results in a larger ellipse that encompasses more data points.
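As a rough illustration of the computation described above, the following base-R sketch generates points on a 2D ellipse. It is not the package's source code: the helper ellipse_points_2d(), the simulated scores matrix, and the F-distribution form of the T-squared limit are assumptions made for illustration. The sketch also assumes the two components are uncorrelated (as PCA scores are), so the ellipse is axis-aligned.

```r
# Simulated stand-in for a score matrix (rows = observations, columns = components)
set.seed(42)
scores <- cbind(PC1 = rnorm(50, sd = 3), PC2 = rnorm(50, sd = 1))

# Hypothetical helper, not part of HotellingEllipse
ellipse_points_2d <- function(scores, pcx = 1, pcy = 2,
                              conf.limit = 0.95, pts = 200) {
  n <- nrow(scores)
  # Assumed T-squared limit for 2 components, based on the F distribution
  t2_limit <- (2 * (n - 1) / (n - 2)) *
    stats::qf(conf.limit, df1 = 2, df2 = n - 2)
  # Semi-axis lengths scale with the standard deviation of each component
  a <- sqrt(t2_limit * stats::var(scores[, pcx]))
  b <- sqrt(t2_limit * stats::var(scores[, pcy]))
  # Evenly spaced angles around the ellipse
  theta <- seq(0, 2 * pi, length.out = pts)
  data.frame(
    x = mean(scores[, pcx]) + a * cos(theta),
    y = mean(scores[, pcy]) + b * sin(theta)
  )
}

ell95 <- ellipse_points_2d(scores, conf.limit = 0.95)
ell99 <- ellipse_points_2d(scores, conf.limit = 0.99)

# Compare the horizontal extent of the two ellipses
diff(range(ell99$x)) > diff(range(ell95$x))
```

Comparing the two results shows the effect of conf.limit: the 99% ellipse spans a wider range than the 95% ellipse along both axes.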
A data frame containing the coordinate points of the Hotelling’s T-squared ellipse:
For 2D ellipses: columns x and y
For 3D ellipsoids: columns x, y, and z
Christian L. Goueguel
    ## Not run:
    # Load required libraries
    library(HotellingEllipse)
    library(dplyr)

    data("specData", package = "HotellingEllipse")

    # Perform PCA
    set.seed(123)
    pca_mod <- specData %>%
      select(where(is.numeric)) %>%
      FactoMineR::PCA(scale.unit = FALSE, graph = FALSE)

    # Extract PCA scores
    pca_scores <- pca_mod$ind$coord %>%
      as.data.frame()

    # Example 1: Calculate Hotelling’s T-squared ellipse coordinates
    xy_coord <- ellipseCoord(pca_scores, pcx = 1, pcy = 2)

    # Example 2: Calculate Hotelling’s T-squared ellipsoid coordinates
    xyz_coord <- ellipseCoord(pca_scores, pcx = 1, pcy = 2, pcz = 3)
    ## End(Not run)
This function calculates Hotelling’s T-squared statistic and, when applicable, the lengths of the semi-axes of the Hotelling’s ellipse. It can work with a specified number of components or use a cumulative variance threshold.
ellipseParam(x, k = 2, pcx = 1, pcy = 2, threshold = NULL, rel.tol = 0.001, abs.tol = .Machine$double.eps)
x | A matrix, data frame or tibble containing scores from PCA, PLS, ICA, or other similar methods. Each column should represent a component, and each row an observation. |
k | An integer specifying the number of components to use (default is 2). This parameter is ignored if threshold is specified. |
pcx | An integer specifying which component to use for the x-axis when k = 2 (default is 1). |
pcy | An integer specifying which component to use for the y-axis when k = 2 (default is 2). |
threshold | A numeric value between 0 and 1 specifying the desired cumulative explained variance threshold (default is NULL). |
rel.tol | A numeric value specifying the minimum proportion of total variance a component should explain to be considered non-negligible (default is 0.001, i.e., 0.1%). |
abs.tol | A numeric value specifying the minimum absolute variance a component should have to be considered non-negligible (default is .Machine$double.eps). |
When threshold is used, the function selects the minimum number of components, k,
that cumulatively explain at least the specified proportion of variance. This
parameter allows for dynamic component selection based on explained variance,
rather than using a fixed number of components. It must be greater than rel.tol.
Typical values range from 0.8 to 0.95.
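The selection rule described above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the package's implementation: select_k() is a hypothetical helper, the scores matrix is simulated, and the explained variance of each component is taken to be the sample variance of its score column.

```r
# Simulated stand-in for a score matrix with five components
set.seed(1)
scores <- sapply(c(5, 3, 1.5, 0.5, 0.2), function(s) rnorm(100, sd = s))
colnames(scores) <- paste0("PC", 1:5)

# Hypothetical helper, not part of HotellingEllipse
select_k <- function(scores, threshold = 0.95) {
  # Variance carried by each component
  comp_var <- apply(scores, 2, stats::var)
  # Cumulative proportion of the total variance
  cum_prop <- cumsum(comp_var) / sum(comp_var)
  # Smallest number of leading components reaching the threshold
  unname(which(cum_prop >= threshold)[1])
}

k_sel <- select_k(scores, threshold = 0.9)
k_sel
```

A lower threshold can never select more components than a higher one, which is why the choice of threshold directly controls how many components enter the T-squared calculation.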
The rel.tol parameter sets a minimum variance threshold for individual components.
Components with variance below this threshold are considered negligible and are
removed from the analysis. Setting rel.tol too high may remove potentially
important components, while setting it too low may retain noise or cause
computational issues. Adjust based on your data characteristics and analysis goals.
Note that components are considered to have near-zero variance and are removed
if their relative variance is below rel.tol or their absolute variance is
below abs.tol. This dual-threshold approach helps ensure numerical stability
while also accounting for the relative importance of components. The default
value for abs.tol is set to .Machine$double.eps, providing a lower bound
for detecting near-zero variance that may cause numerical instability.
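A minimal base-R sketch of this dual-threshold filter is given below. It is an illustration only: drop_negligible() is a hypothetical helper, the scores matrix is simulated, and the package may compute or apply the thresholds differently.

```r
# Simulated scores: two informative components, one tiny, one constant
set.seed(7)
scores <- cbind(
  PC1 = rnorm(100, sd = 4),
  PC2 = rnorm(100, sd = 2),
  PC3 = rnorm(100, sd = 0.01),  # negligible relative variance
  PC4 = rep(0, 100)             # zero absolute variance
)

# Hypothetical helper, not part of HotellingEllipse
drop_negligible <- function(scores, rel.tol = 0.001,
                            abs.tol = .Machine$double.eps) {
  comp_var <- apply(scores, 2, stats::var)
  rel_var <- comp_var / sum(comp_var)
  # Keep a component only if it passes both thresholds
  keep <- rel_var >= rel.tol & comp_var >= abs.tol
  scores[, keep, drop = FALSE]
}

kept <- drop_negligible(scores)
colnames(kept)
```

Here PC3 fails the relative threshold and PC4 fails both, so only the first two components are retained.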
A list containing the following elements:
Tsquare: A data frame containing the T-squared statistic for each observation.
Ellipse: A data frame containing the lengths of the semi-minor and semi-major axes (only when k = 2).
cutoff.99pct: The T-squared cutoff value at the 99% confidence level.
cutoff.95pct: The T-squared cutoff value at the 95% confidence level.
nb.comp: The number of components used in the calculation.
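To make the returned quantities more concrete, the base-R sketch below computes per-observation T-squared values, cutoffs, and semi-axis lengths for two components. It is not the package's code: the data are simulated, the names t2, t2_cutoff(), and semi_axes are hypothetical, and both the diagonal-covariance form of T-squared and the F-distribution cutoff are assumptions that may differ from what ellipseParam() does internally.

```r
# Simulated, mean-centered scores for n observations and k = 2 components
set.seed(2024)
n <- 60
scores <- cbind(PC1 = rnorm(n, sd = 3), PC2 = rnorm(n, sd = 1.5))
scores <- scale(scores, center = TRUE, scale = FALSE)
k <- ncol(scores)

# Variance of each component
comp_var <- apply(scores, 2, stats::var)

# T-squared per observation: sum of squared scores divided by component variance
# (assumes uncorrelated components)
t2 <- rowSums(sweep(scores^2, 2, comp_var, "/"))

# Assumed cutoff based on the F distribution
t2_cutoff <- function(conf, k, n) {
  (k * (n - 1) / (n - k)) * stats::qf(conf, df1 = k, df2 = n - k)
}
cutoff_95 <- t2_cutoff(0.95, k, n)
cutoff_99 <- t2_cutoff(0.99, k, n)

# Semi-axis lengths of the 95% ellipse (meaningful for k = 2)
semi_axes <- sqrt(cutoff_95 * comp_var)

# Observations whose T-squared exceeds the 95% cutoff
flagged <- which(t2 > cutoff_95)
```

Observations in flagged lie outside the 95% ellipse in this sketch, which is the usual way such cutoffs are used to screen for potential outliers.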
Christian L. Goueguel
    ## Not run:
    # Load required libraries
    library(HotellingEllipse)
    library(dplyr)

    data("specData", package = "HotellingEllipse")

    # Perform PCA
    set.seed(123)
    pca_mod <- specData %>%
      select(where(is.numeric)) %>%
      FactoMineR::PCA(scale.unit = FALSE, graph = FALSE)

    # Extract PCA scores
    pca_scores <- pca_mod$ind$coord %>%
      as.data.frame()

    # Example 1: Calculate Hotelling’s T-squared and ellipse parameters using
    # the 2nd and 4th components
    T2_fixed <- ellipseParam(x = pca_scores, pcx = 2, pcy = 4)

    # Example 2: Calculate using the first 4 components
    T2_comp <- ellipseParam(x = pca_scores, k = 4)

    # Example 3: Calculate using a cumulative variance threshold
    T2_threshold <- ellipseParam(x = pca_scores, threshold = 0.95)
    ## End(Not run)
Data set of laser-induced breakdown spectroscopy (LIBS) emission spectra of 100 soil samples measured under laboratory conditions. The samples were cleaned, dried, homogenized, sieved (10 mesh), and then pelletized prior to the LIBS measurements. The spectra were preprocessed by baseline removal.
specData
Data frame of 100 rows (soil samples) and 3152 columns (wavelengths).