Canonical Correlation Analysis (CCA) is a multivariate statistical technique used to explore the relationships between two sets of variables. It seeks to identify and measure the associations between the variables by finding linear combinations of the variables in each set that are maximally correlated with each other.
Mathematically, given two sets of variables, \(X = \{x_1, x_2, \ldots, x_p\}\) and \(Y = \{y_1, y_2, \ldots, y_q\}\), CCA finds linear combinations of \(X\) and \(Y\), denoted as \(U = a_1x_1 + a_2x_2 + \ldots + a_px_p\) and \(V = b_1y_1 + b_2y_2 + \ldots + b_qy_q\). The vectors \(a = [a_1, a_2, \ldots, a_p]\) and \(b = [b_1, b_2, \ldots, b_q]\) are chosen to maximize the correlation between \(U\) and \(V\), with each variate scaled to a fixed variance (conventionally 1). Further pairs are extracted in the same way, under the constraint that each new canonical variate is uncorrelated with all canonical variates from earlier pairs. At most \(\min(p, q)\) pairs can be extracted. This property is crucial for the method’s ability to extract multiple, distinct dimensions of correlation between the two sets of variables.
Additionally, the first (largest) canonical correlation is at least as large as the absolute value of the largest bivariate correlation between any single pair of variables across the \(X\) and \(Y\) sets, because each individual variable is itself a trivial linear combination of its set. This characteristic emphasizes CCA’s capacity to uncover the strongest linear relationship possible between the two variable sets, which can exceed what simple bivariate correlation analysis reveals.
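In matrix form, the search for the first pair can be sketched as the optimization problem below. The symbols \(\Sigma_{XX}\), \(\Sigma_{YY}\), and \(\Sigma_{XY}\) (the within-set and between-set covariance matrices) and \(\rho_k\) (the \(k\)-th canonical correlation) are introduced here for illustration and are not part of the notation above.
\[\rho_1 = \max_{a,\, b} \operatorname{corr}(U, V) = \max_{a,\, b} \frac{a^\top \Sigma_{XY}\, b}{\sqrt{a^\top \Sigma_{XX}\, a}\;\sqrt{b^\top \Sigma_{YY}\, b}}\]
Because the ratio does not change when \(a\) or \(b\) is rescaled, the usual convention is to fix the variances of the variates:
\[a^\top \Sigma_{XX}\, a = 1, \qquad b^\top \Sigma_{YY}\, b = 1\]
For the \(k\)-th pair \((a_k, b_k)\), the same ratio is maximized subject to the additional constraints
\[a_k^\top \Sigma_{XX}\, a_j = 0, \qquad b_k^\top \Sigma_{YY}\, b_j = 0, \qquad j = 1, \ldots, k-1\]
Under these assumptions, the squared canonical correlations \(\rho_k^2\) are the eigenvalues of \(\Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}\), with \(\Sigma_{YX} = \Sigma_{XY}^\top\).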
For this tutorial, we will use the mtcars dataset included in R, which contains fuel consumption in miles per gallon (mpg), cylinder count, horsepower, and other aspects of automobile design and performance for 32 automobiles. We will focus on exploring the relationship between a set of variables related to car performance (mpg, hp, wt) and a set related to design specifications (disp, drat, qsec).
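First, install and load the necessary packages (the cancor() function used below ships with base R's stats package, so the CCA package is only needed for its additional tools, such as cc()):
## Uncomment the line below to install the CCA package if you haven't already
# install.packages("CCA")
library(CCA) # Additional CCA utilities; cancor() itself comes from stats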
Next, prepare the data:
data(mtcars)
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
# Check for NA values to ensure data integrity
sum(is.na(performance))
## [1] 0
sum(is.na(design))
## [1] 0
Now, perform the canonical correlation analysis:
cca_result <- cancor(performance, design)
cca_result
## $cor
## [1] 0.9295183 0.7912781 0.3069647
##
## $xcoef
##              [,1]         [,2]         [,3]
## mpg  0.0008956961 -0.011633574 -0.070645646
## hp  -0.0009510815  0.002925112 -0.002799602
## wt  -0.1274417707 -0.248515669 -0.242116355
##
## $ycoef
##               [,1]          [,2]        [,3]
## disp -0.0014145342 -0.0006212359 -0.00186371
## drat  0.0123441384  0.0151949679 -0.50735976
## qsec -0.0006053208 -0.1102697491 -0.04391710
##
## $xcenter
##       mpg        hp        wt
##  20.09062 146.68750   3.21725
##
## $ycenter
##       disp       drat       qsec
## 230.721875   3.596563  17.848750
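As a quick sanity check on this output, the sketch below compares the first canonical correlation with the largest absolute bivariate correlation across the two sets. The former should be at least as large, since any single variable is itself a linear combination of its set. The object names cross_cor, max_bivariate, and first_canonical are illustrative and do not appear elsewhere in this tutorial.

```r
data(mtcars)
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]

# 3 x 3 matrix of correlations between each performance and each design variable
cross_cor <- cor(performance, design)

# Largest bivariate correlation in absolute value
max_bivariate <- max(abs(cross_cor))

# First canonical correlation from cancor() (stats package)
first_canonical <- cancor(performance, design)$cor[1]

# Show both values side by side
c(max_bivariate = max_bivariate, first_canonical = first_canonical)
```

If the two numbers are close, the first canonical pair is dominated by a single pair of original variables; a larger gap suggests that combining variables adds information.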
The values under $cor in the output represent the correlations between pairs of canonical variates from the two sets of variables:
0.9295183: Correlation between the first canonical variates (\(CV1_X\) and \(CV1_Y\)), indicating a very strong linear relationship.
0.7912781: Correlation between the second canonical variates (\(CV2_X\) and \(CV2_Y\)), showing a strong but lesser relationship than the first pair.
0.3069647: Correlation between the third canonical variates (\(CV3_X\) and \(CV3_Y\)), indicating a relatively weak relationship.
Each correlation measures the strength of the linear relationship captured by each corresponding pair of canonical variates.
The canonical variates are linear combinations of the original variables within each set, chosen to maximize the correlation between the paired variates from the two sets. In our analysis of the mtcars dataset, we have two sets: one related to car performance (mpg, hp, wt) and the other to design specifications (disp, drat, qsec). The coefficients for each canonical variate are given by $xcoef and $ycoef in the output above. The equations for the canonical variates are crucial for interpreting the results of CCA, showing how each original variable contributes to the relationships between the two sets of variables. Because cancor() centers the data by default, the variables in these equations are the mean-centered versions of the originals.
\[CV1_X = (0.0008956961 \times mpg) - (0.0009510815 \times hp) - (0.1274417707 \times wt)\] \[CV2_X = (-0.011633574 \times mpg) + (0.002925112 \times hp) - (0.248515669 \times wt)\] \[CV3_X = (-0.070645646 \times mpg) - (0.002799602 \times hp) - (0.242116355 \times wt)\]
\[CV1_Y = (-0.0014145342 \times disp) + (0.0123441384 \times drat) - (0.0006053208 \times qsec)\] \[CV2_Y = (-0.0006212359 \times disp) + (0.0151949679 \times drat) - (0.1102697491 \times qsec)\] \[CV3_Y = (-0.00186371 \times disp) - (0.50735976 \times drat) - (0.04391710 \times qsec)\]
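The equations above can be applied to the data directly. The following sketch, which assumes the default centering in cancor(), computes the canonical variate scores by multiplying the centered data by the coefficient matrices and then checks that the correlations between paired variates reproduce $cor. The names X_centered, Y_centered, U, and V are illustrative.

```r
data(mtcars)
performance <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
design <- as.matrix(mtcars[, c("disp", "drat", "qsec")])
cca_result <- cancor(performance, design)

# Subtract the column means stored by cancor()
X_centered <- sweep(performance, 2, cca_result$xcenter)
Y_centered <- sweep(design, 2, cca_result$ycenter)

# Each column of U and V holds the scores of one canonical variate
U <- X_centered %*% cca_result$xcoef
V <- Y_centered %*% cca_result$ycoef

# Diagonal: correlations of matching pairs (should reproduce cca_result$cor)
# Off-diagonal: correlations of non-matching pairs (should be close to zero)
round(cor(U, V), 4)
```

The near-zero off-diagonal entries illustrate that each pair captures a dimension of association that is distinct from the others.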
The $xcenter and $ycenter values in the output are the means of the variables in the X-set (car performance) and Y-set (design specifications), respectively. cancor() subtracts these means from the original data before computing the coefficients. Centering removes each variable's mean but does not rescale it, so the raw coefficients still depend on each variable's measurement units.
$xcenter:
mpg: 20.09062 (average miles per gallon)
hp: 146.68750 (average horsepower)
wt: 3.21725 (average weight in thousands of pounds)
$ycenter:
disp: 230.721875 (average engine displacement in cubic inches)
drat: 3.596563 (average rear axle ratio)
qsec: 17.848750 (average quarter-mile time in seconds)
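To make the role of these values concrete, the sketch below confirms that the stored centers are the column means and that subtracting them shifts each variable to mean zero without changing its spread. The name performance_centered is illustrative.

```r
data(mtcars)
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
cca_result <- cancor(performance, design)

# The stored centers should equal the column means of each set
all.equal(cca_result$xcenter, colMeans(performance))
all.equal(cca_result$ycenter, colMeans(design))

# Centering moves each column to mean zero ...
performance_centered <- sweep(as.matrix(performance), 2, cca_result$xcenter)
round(colMeans(performance_centered), 10)

# ... but leaves the standard deviations (and hence the units) untouched
rbind(raw = sapply(performance, sd),
      centered = apply(performance_centered, 2, sd))
```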
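We will visualize the canonical correlations, which indicate the strength of the relationship within each pair of canonical variates, and the raw canonical coefficients, which show how each variable is weighted in these variates. Strictly speaking, "loadings" are the correlations between the original variables and the canonical variates; the raw coefficients plotted here depend on each variable's units, so bar lengths are not directly comparable across variables.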
library(ggplot2) # For plotting
library(reshape2) # For melting data frames
library(gridExtra) # For arranging multiple plots
# Canonical correlations plot data
can_cor_data <- data.frame(
  Canonical_Variate = paste("Pair", seq_along(cca_result$cor)),
  Correlation = cca_result$cor
)
# Prepare melted data frames for xcoef and ycoef for loadings plots
xcoef_melted <- melt(cca_result$xcoef)
colnames(xcoef_melted) <- c("Variable", "Canonical_Variate", "Loading")
xcoef_melted$Set <- "X-set"
ycoef_melted <- melt(cca_result$ycoef)
colnames(ycoef_melted) <- c("Variable", "Canonical_Variate", "Loading")
ycoef_melted$Set <- "Y-set"
# Combine xcoef and ycoef data for a unified plot of loadings
coef_melted <- rbind(xcoef_melted, ycoef_melted)
# Define distinct color sets for the canonical correlations and the sets (X-set and Y-set)
cor_colors <- c("Pair 1" = "#FF9999",
                "Pair 2" = "#9999FF",
                "Pair 3" = "#99FF99")
set_colors <- c("X-set" = "steelblue", "Y-set" = "darkorange")
# Plot canonical correlations with distinct colors
p1 <- ggplot(can_cor_data, aes(x = Canonical_Variate, y = Correlation, fill = Canonical_Variate)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = cor_colors) +
  theme_minimal() +
  labs(title = "Canonical Correlations", x = "", y = "Correlation") +
  coord_flip()
# Update the melted data to reflect the Pair labels
coef_melted$Canonical_Variate <- factor(coef_melted$Canonical_Variate,
                                        levels = c(1, 2, 3),
                                        labels = c("Pair 1", "Pair 2", "Pair 3"))
# Plot loadings with updated labels and set-specific colors
p2 <- ggplot(coef_melted, aes(x = Variable, y = Loading, fill = Set)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = set_colors) +
  facet_wrap(~Canonical_Variate, scales = "free") +
  theme_minimal() +
  # The plotted values are raw coefficients, so label them as such
  labs(title = "Canonical Coefficients (Raw)", x = "Variable", y = "Coefficient") +
  coord_flip()
# Combine plots
grid.arrange(p1, p2, ncol = 1)
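Because raw coefficients depend on the units of each variable, it can help to also look at the correlations between the original variables and the canonical variates, often called structure correlations or canonical loadings. The following is a minimal sketch of that calculation, assuming the default centering in cancor(); the names U, V, x_loadings, and y_loadings are illustrative.

```r
data(mtcars)
performance <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
design <- as.matrix(mtcars[, c("disp", "drat", "qsec")])
cca_result <- cancor(performance, design)

# Canonical variate scores computed from the centered data
U <- sweep(performance, 2, cca_result$xcenter) %*% cca_result$xcoef
V <- sweep(design, 2, cca_result$ycenter) %*% cca_result$ycoef

# Structure correlations: each original variable against each variate of its own set
x_loadings <- cor(performance, U)
y_loadings <- cor(design, V)
colnames(x_loadings) <- colnames(y_loadings) <- paste("Pair", 1:3)

round(x_loadings, 3)
round(y_loadings, 3)
```

These values always lie between -1 and 1, so they can be compared across variables in a way that the raw coefficients cannot. They could be melted and plotted with the same ggplot2 code used above.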