Principal Component Analysis (PCA) is a statistical technique for dimensionality reduction: it simplifies high-dimensional data while retaining as much of the variation in the dataset as possible. This method transforms the original set of variables into a new set of uncorrelated variables called principal components. These components are linear combinations of the original variables and are orthogonal to each other. The first component captures the maximum possible variance, and each successive component captures the maximum remaining variance, given that it is orthogonal to the preceding components.
PCA assumes that the data points are situated in a linear space, meaning that the relationships between variables can be accurately captured using straight lines. This assumption is fundamental because PCA projects the original data onto lower-dimensional linear subspaces. If the relationships among variables are non-linear, PCA might not capture the true structure of the data, leading to misleading or incomplete findings.
PCA constructs the principal components to be orthogonal to each other. This orthogonality means that the components are linearly uncorrelated; it does not by itself make them statistically independent, which holds only under additional conditions such as jointly normal data. This property is central to PCA’s objective of reducing dimensionality by transforming correlated variables into a set of linearly uncorrelated components.
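The orthogonality property can be checked numerically. The following is a minimal sketch, assuming R’s built-in mtcars data as an illustrative dataset; the object names (pca, score_cor, vtv) are chosen here for the example only.

```r
# Illustration on R's built-in mtcars data (an assumed example dataset)
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Correlation matrix of the component scores
score_cor <- cor(pca$x)

# Largest off-diagonal correlation: zero up to floating-point error
max_offdiag <- max(abs(score_cor[upper.tri(score_cor)]))
print(max_offdiag)

# The eigenvectors are orthonormal, so V'V should be the identity matrix
vtv <- crossprod(pca$rotation)
print(all.equal(vtv, diag(ncol(mtcars)), check.attributes = FALSE))
```

A near-zero off-diagonal correlation confirms that the scores are uncorrelated; it says nothing about whether they are independent.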
Before applying PCA, it is assumed that the data has been centered around the mean. This centering is a preprocessing step where the mean of each variable is subtracted from the dataset, ensuring that the PCA focuses on the covariance among variables rather than their mean values. This assumption allows PCA to identify the directions of maximum variance without the influence of the variables’ mean levels.
PCA is sensitive to the scaling of variables. Variables with larger scales can dominate the outcome of PCA, influencing the direction of the first principal components towards these variables. Therefore, it’s often assumed that the data has been appropriately scaled (e.g., standardization to unit variance) before applying PCA. This ensures that all variables contribute equally to the analysis, preventing scale discrepancies from skewing the results.
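To see how much scaling can matter, the sketch below compares PCA with and without standardization. It assumes R’s built-in mtcars data as an example, where variables such as disp and hp are measured on much larger scales than the others; the object names are illustrative.

```r
# PCA on centered but unscaled data versus standardized data
pca_raw    <- prcomp(mtcars, center = TRUE, scale. = FALSE)
pca_scaled <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Share of total variance captured by PC1 in each case
pc1_raw    <- pca_raw$sdev[1]^2 / sum(pca_raw$sdev^2)
pc1_scaled <- pca_scaled$sdev[1]^2 / sum(pca_scaled$sdev^2)
print(c(raw = pc1_raw, scaled = pc1_scaled))

# Variable with the largest absolute PC1 loading in the unscaled analysis
top_raw <- names(which.max(abs(pca_raw$rotation[, 1])))
print(top_raw)
```

In the unscaled analysis, the first component is driven mainly by the variable with the largest raw variance rather than by shared structure across variables.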
The calculation of PCA involves several steps:
Standardization:
The original variables \(X_1, X_2, \ldots,
X_p\) are standardized to have a mean of 0 and a standard
deviation of 1. This step is essential as it ensures that each variable
contributes equally to the analysis, preventing variables with larger
scales from dominating the outcome.
\[Z = \frac{X - \mu}{\sigma}\]
where \(Z\) is the standardized data, \(X\) is the original data, \(\mu\) is the mean, and \(\sigma\) is the standard deviation.
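As a minimal sketch of this step, the code below standardizes each column by hand and compares the result with R’s scale() function. The mtcars data and the names X, mu, sigma, and Z are assumptions made for illustration.

```r
# Standardize each column manually: subtract the mean, divide by the sd
X     <- as.matrix(mtcars)
mu    <- colMeans(X)
sigma <- apply(X, 2, sd)
Z     <- sweep(sweep(X, 2, mu, "-"), 2, sigma, "/")

# scale() performs the same operation in one call
Z_ref <- scale(X)
print(all.equal(Z, Z_ref, check.attributes = FALSE))

# Each standardized column has mean 0 and standard deviation 1
print(round(colMeans(Z), 10))
print(apply(Z, 2, sd))
```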
Covariance Matrix Computation:
The covariance matrix \(\Sigma\) of the
standardized variables is computed to understand the variance and the
covariance among the variables. The formula for the covariance matrix
is:
\[\Sigma = \frac{1}{n-1} (Z^T Z)\]
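where \(Z\) is the \(n \times p\) matrix of standardized data and \(n\) is the number of observations. Because the variables are standardized, \(\Sigma\) equals the correlation matrix of the original variables. This matrix plays a crucial role in identifying the directions that maximize the variance in the data.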
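The formula can be applied directly in R. This sketch assumes the built-in mtcars data for illustration and checks that the covariance matrix of the standardized variables coincides with the correlation matrix of the original ones.

```r
# Covariance matrix of standardized data: Sigma = Z'Z / (n - 1)
Z     <- scale(as.matrix(mtcars))
n     <- nrow(Z)
Sigma <- crossprod(Z) / (n - 1)   # crossprod(Z) computes t(Z) %*% Z

# Same result as cov() on Z and as cor() on the original data
print(all.equal(Sigma, cov(Z), check.attributes = FALSE))
print(all.equal(Sigma, cor(mtcars), check.attributes = FALSE))
```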
Eigen Decomposition:
The eigenvalues \(\lambda_1, \lambda_2,
\ldots, \lambda_p\) and eigenvectors \(v_1, v_2, \ldots, v_p\) of the covariance
matrix \(\Sigma\) are calculated. The
eigenvalues indicate the amount of variance captured by each principal
component, while the eigenvectors represent the directions in the
feature space along which variance is maximized.
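A minimal sketch of the decomposition using base R’s eigen() function follows. The mtcars data is an assumed example, and lambda, V, and resid are names introduced here for illustration.

```r
# Eigen decomposition of the covariance matrix of the standardized data
# (equal to the correlation matrix of the original variables)
Sigma  <- cor(mtcars)
eig    <- eigen(Sigma, symmetric = TRUE)
lambda <- eig$values    # eigenvalues, returned in decreasing order
V      <- eig$vectors   # eigenvectors, stored as columns

# Check the defining relation Sigma v_1 = lambda_1 v_1
resid <- Sigma %*% V[, 1] - lambda[1] * V[, 1]
print(max(abs(resid)))

# The eigenvalues sum to the total variance, which is p for standardized data
print(sum(lambda))
```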
Selection of Principal Components:
The eigenvectors are sorted in descending order of their corresponding
eigenvalues. The first \(k\)
eigenvectors, corresponding to the largest \(k\) eigenvalues, are selected to form the
principal components. This selection is often based on the criterion of
explained variance, aiming to retain as much information as
possible.
Projection Onto New Features:
The original standardized data \(Z\) is
projected onto the space spanned by the top \(k\) eigenvectors to form the new features,
or principal components.
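The mathematical representation of the principal components \(Y\) is:
\[Y = ZV\]
where \(Z\) is the \(n \times p\) matrix of standardized data and \(V\) is the \(p \times k\) matrix whose columns are the top \(k\) eigenvectors \(v_1, v_2, \ldots, v_k\).
Each principal component \(Y_i\) is calculated as follows:
\[Y_i = Z v_i\]
where \(v_i\) is the eigenvector associated with the \(i\)-th largest eigenvalue \(\lambda_i\), so that \(Y_i\) is the \(i\)-th column of \(Y\).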
By retaining only the top \(k\) principal components, we reduce the dimensionality of the data while preserving as much of the data’s variation as possible.
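The selection and projection steps can be sketched by hand and compared against prcomp(). The example assumes the built-in mtcars data and an illustrative choice of \(k = 2\); the names V_k, Y, and ref are introduced only for this sketch.

```r
# Manual PCA: standardize, decompose, keep the top k eigenvectors, project
Z   <- scale(as.matrix(mtcars))
eig <- eigen(cov(Z), symmetric = TRUE)

k   <- 2                              # illustrative number of components
V_k <- eig$vectors[, 1:k]             # p x k matrix of leading eigenvectors
Y   <- Z %*% V_k                      # n x k matrix of principal components
print(dim(Y))

# Compare with prcomp(); the sign of each component is arbitrary,
# so absolute values are compared
ref <- prcomp(mtcars, center = TRUE, scale. = TRUE)$x[, 1:k]
max_diff <- max(abs(abs(Y) - abs(ref)))
print(max_diff)
```

Because an eigenvector multiplied by -1 is still an eigenvector, two correct implementations may return components with flipped signs.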
The mtcars dataset, included in R, consists of various automobile design and performance metrics for 32 automobiles. Due to its multivariate nature, it is well-suited for demonstrating PCA. We’ll use PCA to reduce dimensionality and highlight the primary variance within the dataset.
# Loading the necessary library
library(stats)
# Loading the mtcars dataset
data(mtcars)
# Performing PCA with automatic scaling
pca_result <- prcomp(mtcars, center = TRUE, scale. = TRUE)
# Summarizing PCA results
summary(pca_result)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     2.5707 1.6280 0.79196 0.51923 0.47271 0.46000 0.3678
## Proportion of Variance 0.6008 0.2409 0.05702 0.02451 0.02031 0.01924 0.0123
## Cumulative Proportion  0.6008 0.8417 0.89873 0.92324 0.94356 0.96279 0.9751
##                            PC8    PC9    PC10   PC11
## Standard deviation     0.35057 0.2776 0.22811 0.1485
## Proportion of Variance 0.01117 0.0070 0.00473 0.0020
## Cumulative Proportion  0.98626 0.9933 0.99800 1.0000
The summary(pca_result) command in R provides a concise summary of the principal components analysis results. The output includes several important pieces of information for each principal component (PC):
Standard deviation: This is the square root of the corresponding eigenvalue of the covariance matrix of the standardized data and reflects the amount of variance captured by each PC. A higher standard deviation indicates that the component accounts for a greater amount of variance in the data. For example, the first principal component (PC1) has a standard deviation of 2.5707, which is the highest among all components, indicating that it captures the most variance.
Proportion of Variance: This is the proportion of the dataset’s total variance that each PC captures. It is obtained by squaring the standard deviation (sdev^2) and dividing by the total variance, and is often expressed as a percentage. Here, PC1 captures approximately 60.08% of the variance in the data, and PC2 captures about 24.09%. The other components capture progressively less variance.
Cumulative Proportion: This indicates the cumulative variance captured by the PCs up to that point. It helps to understand how many components might be necessary to capture a substantial amount of the total variance. For instance, by combining PC1 and PC2, you capture about 84.17% of the total variance. Adding PC3 brings the total to 89.87%, and so on. When all the components are combined (up to PC11), they capture 100% of the variance, which is a characteristic of PCA since the number of components equals the number of original variables.
These metrics are critical for deciding how many principal components to retain for further analysis. You would typically retain components until the point where adding another component doesn’t significantly increase the cumulative proportion of variance explained. This is often done using a scree plot (please see section on visualization), where you look for the “elbow,” a point where the marginal gain in explained variance significantly drops, indicating that subsequent components contribute less to the explanation of variability in the dataset.
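The three rows of the summary can be reproduced from the sdev values alone. The sketch below refits the same model so that it runs on its own; prop_var, cum_var, and imp are names chosen here for illustration.

```r
# Recompute the summary rows from the standard deviations
pca_result  <- prcomp(mtcars, center = TRUE, scale. = TRUE)
eigenvalues <- pca_result$sdev^2                   # variance of each PC
prop_var    <- eigenvalues / sum(eigenvalues)      # proportion of variance
cum_var     <- cumsum(prop_var)                    # cumulative proportion

# summary() stores the same quantities (rounded) in its importance matrix
imp <- summary(pca_result)$importance
print(round(rbind(prop_var, cum_var)[, 1:4], 4))
print(imp[, 1:4])
```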
After performing PCA using the prcomp function in R, the results are returned as an object that contains several components. Each component plays a role in understanding the PCA output. Here is an overview of the key components of the PCA object, which is typically returned as a list with the class “prcomp”:
str(pca_result$sdev)
## num [1:11] 2.571 1.628 0.792 0.519 0.473 ...
The sdev component represents the standard deviations of the principal components, a measure of how much the data is spread out along each principal component axis. This spread is intricately linked to the eigenvalues derived from the covariance matrix \(\Sigma\) of the data through eigen decomposition. When this decomposition is performed, each eigenvalue \(\lambda_i\) is paired with a corresponding eigenvector \(v_i\). The eigenvector defines a direction in the feature space, and its associated eigenvalue quantifies the variance of the data along this direction. Thus, the eigenvalue for each principal component is the square of the sdev, indicating that the sdev squared gives us the variance captured by that principal component. This relationship between the eigenvalues and the standard deviations is important for understanding the proportion of total variance each principal component accounts for, enabling the calculation of the variance explained by each principal component.
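This relationship is easy to verify. A minimal sketch, assuming the same standardized mtcars analysis, compares the squared sdev values with the eigenvalues obtained directly from the correlation matrix.

```r
# Squared standard deviations from prcomp() versus eigenvalues from eigen()
pca_result <- prcomp(mtcars, center = TRUE, scale. = TRUE)
lambda     <- eigen(cor(mtcars), symmetric = TRUE)$values

print(round(pca_result$sdev^2, 4))
print(round(lambda, 4))
print(all.equal(pca_result$sdev^2, lambda))
```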
str(pca_result$rotation)
## num [1:11, 1:11] -0.363 0.374 0.368 0.33 -0.294 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:11] "mpg" "cyl" "disp" "hp" ...
## ..$ : chr [1:11] "PC1" "PC2" "PC3" "PC4" ...
The rotation matrix represents the principal component loadings. Each column corresponds to a principal component, and each row corresponds to one of the original variables. The magnitude and sign of each loading represent the contribution and direction of the variable to the principal component. Because each column is a unit-length eigenvector, the loadings are not themselves correlation coefficients; when the variables are standardized, multiplying a column by the standard deviation of its component (sdev) gives the correlations between the original variables and that component. This matrix is key to understanding the composition of the principal components.
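One way to check how loadings relate to correlations is to compute both for the first component. This is a sketch under the assumption that the variables were standardized (scale. = TRUE); the names loadings_pc1 and cor_pc1 are illustrative.

```r
# Loadings of PC1 versus correlations between each variable and PC1 scores
pca_result   <- prcomp(mtcars, center = TRUE, scale. = TRUE)
loadings_pc1 <- pca_result$rotation[, 1]
cor_pc1      <- cor(mtcars, pca_result$x[, 1])[, 1]

# For standardized variables: correlation = loading * sdev of the component
print(round(cbind(loading = loadings_pc1, correlation = cor_pc1), 3))
print(all.equal(cor_pc1, loadings_pc1 * pca_result$sdev[1]))
```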
str(pca_result$center)
## Named num [1:11] 20.09 6.19 230.72 146.69 3.6 ...
## - attr(*, "names")= chr [1:11] "mpg" "cyl" "disp" "hp" ...
The center component shows the mean of each variable that was used to center the data before performing PCA. Centering means subtracting the variable mean from each observation. It ensures that PCA operates on mean-centered data, which is necessary for the analysis.
str(pca_result$scale)
## Named num [1:11] 6.027 1.786 123.939 68.563 0.535 ...
## - attr(*, "names")= chr [1:11] "mpg" "cyl" "disp" "hp" ...
The scale component contains the scaling applied to each variable before performing PCA, if the ‘scale.’ argument in prcomp was set to TRUE. Scaling is dividing each centered variable by its standard deviation. This standardizes variables to have unit variance and is crucial when the variables are measured on different scales.
str(pca_result$x)
## num [1:32, 1:11] -0.647 -0.619 -2.736 -0.307 1.943 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## ..$ : chr [1:11] "PC1" "PC2" "PC3" "PC4" ...
The x component, often referred to as the score matrix, contains the coordinates of the original data projected onto the principal components. Each column represents a principal component, while each row represents an observation from the original dataset. The score matrix is useful for plotting the data in the reduced dimensional space defined by the principal components.
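The components of the object fit together: the scores in x can be rebuilt from center, scale, and rotation. The sketch below shows this and also uses predict(), which applies the same transformation to new observations; here the first three rows of mtcars stand in for new data purely for illustration.

```r
# Rebuild the score matrix from the stored centering, scaling, and rotation
pca_result <- prcomp(mtcars, center = TRUE, scale. = TRUE)
Z      <- scale(mtcars, center = pca_result$center, scale = pca_result$scale)
scores <- Z %*% pca_result$rotation
print(all.equal(scores, pca_result$x, check.attributes = FALSE))

# predict() projects observations using the same center, scale, and rotation
new_scores <- predict(pca_result, newdata = mtcars[1:3, ])
print(round(new_scores[, 1:2], 3))
```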
Determining the appropriate number of principal components to retain is crucial for both the interpretability and effectiveness of PCA. Various criteria and methods can guide this decision:
A scree plot graphically represents the variance explained by each principal component, ordered by size. Analysts look for an “elbow” in the plot, indicating a point where the decrease in variance explained significantly slows, suggesting that subsequent components add less explanatory value.
The Kaiser criterion recommends retaining components with eigenvalues greater than 1. This is based on the idea that a principal component should explain more variance than a single standardized variable, which has a variance of 1.
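Applied to a standardized analysis, the Kaiser criterion reduces to counting eigenvalues above 1. A minimal sketch on the mtcars data, with kaiser_k as an illustrative name:

```r
# Kaiser criterion: keep components whose eigenvalue exceeds 1
pca_result  <- prcomp(mtcars, center = TRUE, scale. = TRUE)
eigenvalues <- pca_result$sdev^2
kaiser_k    <- sum(eigenvalues > 1)
print(kaiser_k)   # 2 for this dataset (eigenvalues of about 6.61 and 2.65)
```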
The cumulative variance approach involves selecting the smallest number of components that together explain a chosen percentage of the total variance in the dataset, such as 80% or 90%. This threshold is arbitrary and may vary depending on the specific goals of the analysis or the domain-specific requirements. Higher thresholds result in more components being retained, capturing more of the total variance at the expense of a less parsimonious model.
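A cumulative variance threshold can be applied in a couple of lines. The sketch below uses the mtcars analysis with two example thresholds (80% and 90%); the variable names are illustrative.

```r
# Smallest number of components reaching a cumulative variance threshold
pca_result <- prcomp(mtcars, center = TRUE, scale. = TRUE)
cum_var    <- cumsum(pca_result$sdev^2) / sum(pca_result$sdev^2)

k80 <- which(cum_var >= 0.80)[1]
k90 <- which(cum_var >= 0.90)[1]
print(c(k80 = k80, k90 = k90))   # 2 and 4 for this dataset
```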
Parallel analysis is a more sophisticated technique that involves comparing the eigenvalues from the PCA of the actual dataset to those obtained from randomly generated datasets of the same size. Only components with eigenvalues exceeding those from the random data are retained. This method helps to distinguish between meaningful components and those that might arise from random noise.
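A simple version of parallel analysis can be sketched in base R. The choices below are assumptions for illustration rather than a standard recipe: the random data are independent standard normal draws, 200 simulations are used, the comparison is against the 95th percentile of the simulated eigenvalues, and the seed is arbitrary.

```r
# Parallel analysis sketch: compare observed eigenvalues with those from
# random data of the same dimensions
set.seed(123)                       # arbitrary seed for reproducibility
n     <- nrow(mtcars)
p     <- ncol(mtcars)
n_sim <- 200                        # assumed number of simulated datasets

observed <- prcomp(mtcars, center = TRUE, scale. = TRUE)$sdev^2

# Each column of random_eig holds the eigenvalues from one random dataset
random_eig <- replicate(n_sim, {
  noise <- matrix(rnorm(n * p), nrow = n, ncol = p)
  prcomp(noise, center = TRUE, scale. = TRUE)$sdev^2
})

# 95th percentile of the simulated eigenvalues for each component position
threshold <- apply(random_eig, 1, quantile, probs = 0.95)

# Retain leading components until the first one that fails the comparison
exceeds    <- observed > threshold
k_parallel <- if (all(exceeds)) p else which(!exceeds)[1] - 1
print(k_parallel)
```

Using the mean of the simulated eigenvalues instead of the 95th percentile is a common, slightly more lenient variant.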
In some cases, the decision on the number of components to retain may also consider the interpretability of the components. Components that have clear and meaningful interpretations in the context of the research question or data may be preferred even if they explain a smaller portion of the total variance.
Selecting the number of principal components is a balance between retaining as much information about the dataset as possible and achieving simplification through dimensionality reduction. The choice of method may depend on the specific goals of the analysis, the nature of the data, and the requirements for subsequent analyses. It is also common to apply more than one of these criteria to ensure a robust decision-making process in determining the optimal number of components to retain.
Interpreting the loadings (coefficients in the eigenvectors) of principal components is crucial for understanding the underlying structure of the data in PCA. Loadings indicate the contribution of each original variable to a principal component, providing insights into the dataset’s dimensions that capture the most variance.
Loadings are the elements of the eigenvectors that result from the eigen decomposition of the covariance or correlation matrix of the dataset. They can be interpreted as:
Identifying Key Variables: Variables with high absolute loadings on a principal component are considered key contributors to the component. These variables share a common variance captured by the component.
Understanding Dimensions: Each principal component represents a dimension within the dataset. By examining the loadings, we can interpret these dimensions in terms of the original variables. For instance, a component with high loadings from variables related to financial metrics might represent an underlying “financial” dimension of the data.
Correlation Among Variables: Loadings can also reveal correlations among variables. High loadings of the same sign on a component suggest a group of variables that vary together in the same direction.
The loading vector for the \(k\)-th principal component can be expressed as:
\[ \mathbf{l}_k = [l_{1k}, l_{2k}, \ldots, l_{pk}]^T \]
where \(l_{ik}\) is the loading of the \(i\)-th variable on the \(k\)-th component, and \(p\) is the number of variables. The squared loadings of a component sum to 1, reflecting the normalization of eigenvectors:
\[ \sum_{i=1}^{p} l_{ik}^2 = 1 \]
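The normalization can be confirmed on a fitted model, and the same loading vector can be sorted to find the variables that contribute most to a component. A minimal sketch on the mtcars analysis, with illustrative names:

```r
# Squared loadings of every component sum to 1
pca_result  <- prcomp(mtcars, center = TRUE, scale. = TRUE)
sq_loadings <- colSums(pca_result$rotation^2)
print(round(sq_loadings, 10))

# Variables with the largest absolute loadings on PC1
top_pc1 <- head(sort(abs(pca_result$rotation[, 1]), decreasing = TRUE), 3)
print(round(top_pc1, 3))
```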
Next, we plot the proportion of variance explained by each principal component and the cumulative variance explained.
# Proportion of variance explained by each principal component
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
# Cumulative variance explained
cum_var_explained <- cumsum(var_explained)
# Lay out two plots side by side, saving the previous graphics settings
old_par <- par(mfrow = c(1, 2))
# First plot: proportion of variance explained by each principal component
plot(var_explained,
     xlab = "Principal Component",
     ylab = "Proportion of Variance",
     type = "b", pch = 19, main = "Variance by Each PC", ylim = c(0, 1))
abline(h = 0, col = "gray")
# Second plot: cumulative variance explained
plot(cum_var_explained,
     xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance",
     type = "b", pch = 18, col = "red", main = "Cumulative Variance Explained", ylim = c(0, 1))
abline(h = 0, col = "gray")
# Restore the previous graphics settings so later plots use a single panel
par(old_par)
Here’s what the two plots indicate:
Variance by Each PC: Each point denotes the proportion of the total variance captured by the corresponding principal component. The first principal component (PC1) accounts for the largest share (approximately 60%), and the second (PC2) for around 24%. From PC3 onward each component captures less than 6%, so the elbow falls after the second component and the first two components together carry most of the dataset’s variation.
Cumulative Variance Explained: This plot illustrates the cumulative proportion of variance explained as more components are considered. The quick rise at the start and the eventual leveling off imply that a small number of components account for most of the information. The cumulative proportion reaches 100% at the 11th component, which is to be expected as the number of principal components equals the number of variables in the dataset.
Scree plots can also be computed based on the eigenvalues of the covariance matrix. Each eigenvalue represents the amount of variance explained by its corresponding principal component. Plotting these eigenvalues in descending order provides insight into the contribution of each principal component to the total variance in the dataset.
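An eigenvalue-based scree plot might look like the following sketch, which refits the model so that it runs on its own. The dashed reference line at 1 is an optional addition that marks the Kaiser cutoff for standardized data.

```r
# Scree plot of eigenvalues (variances of the principal components)
pca_result  <- prcomp(mtcars, center = TRUE, scale. = TRUE)
eigenvalues <- pca_result$sdev^2

plot(eigenvalues,
     xlab = "Principal Component",
     ylab = "Eigenvalue",
     type = "b", pch = 19, main = "Scree Plot (Eigenvalues)")
abline(h = 1, lty = 2, col = "gray")   # eigenvalue of 1 as a reference

# Base R also provides a built-in version
screeplot(pca_result, type = "lines", main = "Scree Plot via screeplot()")
```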
A score plot is a graphical tool that visually represents the positions of the original data points within the space defined by the principal components.
Let’s plot the scores of the first principal component against the second principal component. Each point represents an observation from the original dataset, plotted according to its scores on these two components.
# Plotting the first two principal components
plot(pca_result$x[,1], pca_result$x[,2],
xlab = "Principal Component 1",
ylab = "Principal Component 2",
main = "PCA Score Plot",
pch = 19, col = "blue")
# Adding labels
text(pca_result$x[,1], pca_result$x[,2], labels = row.names(pca_result$x), pos = 4, cex = 0.7, col = "red")
To elevate the informativeness of a PCA score plot, incorporating color coding based on a categorical variable or utilizing different symbols for distinct groups can be beneficial.
For example, grouping by the number of cylinders offers another layer of analysis, potentially correlating specific performance or design features with the principal components. Here’s the approach for visualizing the PCA score plot by cylinder categories:
# Define colors and symbols for Cylinder Categories
colors_cyl <- c("green", "orange", "purple") # One color per cylinder category
pch_cyl <- c(15, 17, 19) # Different symbols for 4, 6, and 8 cylinders
# Converting 'mtcars$cyl' to a factor for clearer categorization
cylinder_categories <- factor(mtcars$cyl, labels = c("4 Cyl", "6 Cyl", "8 Cyl"))
# Plotting with color coding and symbols based on Cylinder Categories
plot(pca_result$x[,1], pca_result$x[,2],
xlab = "Principal Component 1",
ylab = "Principal Component 2",
main = "PCA Score Plot by Cylinder Categories",
pch = pch_cyl[as.numeric(cylinder_categories)], col = colors_cyl[as.numeric(cylinder_categories)])
# Adding a more compact legend for Cylinder Categories
legend("topright", inset = .05, # Offset the legend slightly from the plot border
legend = levels(cylinder_categories),
pch = pch_cyl, col = colors_cyl,
title = "Cylinders", cex = 0.8, pt.cex = 1, bty = "n")
Here, Principal Component 1 (PC1) effectively separates vehicles based on the number of cylinders, indicating that it captures significant variance in the dataset related to this feature. The distribution along Principal Component 2 (PC2) appears to be less distinctive for grouping, suggesting that PC2 captures variance due to other features or is less influenced by the number of cylinders.
Principal Component Analysis (PCA) effectively reduces the complexity of high-dimensional data by transforming it into a set of orthogonal principal components. These components, which are linear combinations of the original variables, are structured to capture and order the data’s variance, simplifying analysis while retaining essential variation.