Canonical Correlation Analysis (CCA) is a multivariate statistical technique used to explore the relationships between two sets of variables. It seeks to identify and measure the associations between the variables by finding linear combinations of the variables in each set that are maximally correlated with each other.
Mathematically, if we have two sets of variables, \(X = \{x_1, x_2, \ldots, x_p\}\) and \(Y = \{y_1, y_2, \ldots, y_q\}\), CCA seeks to find linear combinations of \(X\) and \(Y\), denoted as \(U = a_1x_1 + a_2x_2 + \ldots + a_px_p\) and \(V = b_1y_1 + b_2y_2 + \ldots + b_qy_q\). The vectors \(a = [a_1, a_2, \ldots, a_p]\) and \(b = [b_1, b_2, \ldots, b_q]\) are chosen to maximize the correlation between \(U\) and \(V\). Further pairs are then extracted in the same way, under an orthogonality constraint: each new canonical variate must be uncorrelated with all variates from the earlier pairs. At most \(\min(p, q)\) pairs can be extracted. This property is crucial for the method’s ability to extract multiple, distinct dimensions of correlation between the two sets of variables.
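The first pair can be written as an optimization problem. The notation below is an assumption introduced here for illustration: \(\Sigma_{XX}\) and \(\Sigma_{YY}\) denote the covariance matrices within each set, and \(\Sigma_{XY}\) the cross-covariance matrix between the sets.

\[\rho_1 = \max_{a,\,b}\ \operatorname{corr}(U, V) = \max_{a,\,b}\ \frac{a^\top \Sigma_{XY}\, b}{\sqrt{a^\top \Sigma_{XX}\, a}\ \sqrt{b^\top \Sigma_{YY}\, b}}\]

Since the ratio does not change when \(a\) or \(b\) is rescaled, the variates are usually normalized, for example so that \(a^\top \Sigma_{XX}\, a = b^\top \Sigma_{YY}\, b = 1\). The \(k\)-th pair solves the same problem with the additional constraints \(a_k^\top \Sigma_{XX}\, a_j = 0\) and \(b_k^\top \Sigma_{YY}\, b_j = 0\) for all \(j < k\).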
Additionally, the first (largest) canonical correlation is always at least as large as the largest absolute bivariate correlation between any single pair of variables across the \(X\) and \(Y\) sets, because each such pair is itself one possible choice of linear combinations. This characteristic emphasizes CCA’s capacity to uncover the strongest linear relationship possible between the two variable sets, beyond what is achievable through simple bivariate correlation analysis.
For this tutorial, we will use the mtcars dataset included in R, which contains measurements such as fuel economy in miles per gallon (mpg), cylinder count, horsepower, and other aspects of automobile design and performance for 32 automobiles. We will focus on exploring the relationship between a set of variables related to car performance (mpg, hp, wt) and a set related to design specifications (disp, drat, qsec).
First, install and load the necessary packages:
## Uncomment the line below to install the CCA package if you haven't already
# install.packages("CCA")
library(CCA) # Provides cc() and related helpers; cancor(), used below, ships with base R's stats package
Next, prepare the data:
data(mtcars)
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
# Check for NA values to ensure data integrity
sum(is.na(performance))
## [1] 0
sum(is.na(design))
## [1] 0
Now, perform the canonical correlation analysis:
cca_result <- cancor(performance, design)
cca_result
## $cor
## [1] 0.9295183 0.7912781 0.3069647
##
## $xcoef
## [,1] [,2] [,3]
## mpg 0.0008956961 -0.011633574 -0.070645646
## hp -0.0009510815 0.002925112 -0.002799602
## wt -0.1274417707 -0.248515669 -0.242116355
##
## $ycoef
## [,1] [,2] [,3]
## disp -0.0014145342 -0.0006212359 -0.00186371
## drat 0.0123441384 0.0151949679 -0.50735976
## qsec -0.0006053208 -0.1102697491 -0.04391710
##
## $xcenter
## mpg hp wt
## 20.09062 146.68750 3.21725
##
## $ycenter
## disp drat qsec
## 230.721875 3.596563 17.848750
The values under $cor in the output represent the correlations between pairs of canonical variates from the two sets of variables:
0.9295183: Correlation between the first canonical variates (\(CV1_X\) and \(CV1_Y\)), indicating a very strong linear relationship.
0.7912781: Correlation between the second canonical variates (\(CV2_X\) and \(CV2_Y\)), showing a strong but lesser relationship than the first pair.
0.3069647: Correlation between the third canonical variates (\(CV3_X\) and \(CV3_Y\)), indicating a relatively weak relationship.
Each correlation measures the strength of the linear relationship captured by each corresponding pair of canonical variates.
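A useful sanity check on these values: the first canonical correlation can never be smaller than the largest absolute correlation between any single X variable and any single Y variable. The following is a minimal sketch in base R that rebuilds the two variable sets so it runs on its own; the names cross_cor, max_bivariate, and first_canonical are chosen here for illustration.

```r
# Rebuild the two variable sets used in this tutorial
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]

# All pairwise correlations between one X variable and one Y variable
cross_cor <- cor(performance, design)

# Largest absolute bivariate correlation across the two sets
max_bivariate <- max(abs(cross_cor))

# First (largest) canonical correlation
first_canonical <- cancor(performance, design)$cor[1]

# Compare the two values; the second should not be smaller than the first
c(max_bivariate = max_bivariate, first_canonical = first_canonical)
first_canonical >= max_bivariate
```

If the comparison ever returned FALSE, that would point to a problem in how the variable sets were built rather than to a property of the data.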
The canonical variates are generated through linear combinations of the original variables within each set, aiming to maximize the correlation between the pairs of canonical variates from the two sets. In our analysis of the mtcars dataset, we have two sets: one related to car performance (mpg, hp, wt) and the other to design specifications (disp, drat, qsec). The coefficients for each canonical variate are given by $xcoef and $ycoef in the output above. The equations for the canonical variates are crucial for interpreting the results of CCA, showing how each original variable contributes to the relationships between the two sets of variables. Note that cancor() applies these coefficients to the mean-centered variables, so in the equations below each variable name stands for its deviation from the sample mean.
\[CV1_X = (0.0008956961 \times mpg) - (0.0009510815 \times hp) - (0.1274417707 \times wt)\] \[CV2_X = (-0.011633574 \times mpg) + (0.002925112 \times hp) - (0.248515669 \times wt)\] \[CV3_X = (-0.070645646 \times mpg) - (0.002799602 \times hp) - (0.242116355 \times wt)\]
\[CV1_Y = (-0.0014145342 \times disp) + (0.0123441384 \times drat) - (0.0006053208 \times qsec)\] \[CV2_Y = (-0.0006212359 \times disp) + (0.0151949679 \times drat) - (0.1102697491 \times qsec)\] \[CV3_Y = (-0.00186371 \times disp) - (0.50735976 \times drat) - (0.04391710 \times qsec)\]
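These equations can be checked numerically. The sketch below, in base R, centers each set with the means stored in the result, multiplies by the coefficient matrices to get the variate scores, and confirms that the correlations of matching columns reproduce $cor. The names U, V, and paired_cor are assumptions made for this example.

```r
# Rebuild the inputs so the example runs on its own
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
cca_result <- cancor(performance, design)

# Center each column by the means stored in the result
X_centered <- sweep(as.matrix(performance), 2, cca_result$xcenter)
Y_centered <- sweep(as.matrix(design), 2, cca_result$ycenter)

# Canonical variate scores: one row per car, one column per variate
U <- X_centered %*% cca_result$xcoef
V <- Y_centered %*% cca_result$ycoef

# Correlations of matching columns should reproduce cca_result$cor
paired_cor <- diag(cor(U, V))
all.equal(paired_cor, cca_result$cor)

# Variates within one set should be mutually uncorrelated
round(cor(U), 10)
```

Because correlation is unaffected by shifting a variable, skipping the centering step would give the same correlations; centering matters only for the scores themselves.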
The $xcenter and $ycenter values in the output represent the means of the variables in the X-set (Car Performance) and Y-set (Design Specifications), respectively. These means are subtracted from the original data before the coefficients are applied, a standard preprocessing step in CCA. Centering removes each variable's mean but does not change its units or spread, so the raw coefficients still depend on the measurement scale of each variable.
$xcenter:
mpg: 20.09062 (average miles per gallon)
hp: 146.68750 (average horsepower)
wt: 3.21725 (average weight in thousands of pounds)
$ycenter:
disp: 230.721875 (average engine displacement in cubic inches)
drat: 3.596563 (average rear axle ratio)
qsec: 17.848750 (average quarter-mile time in seconds)
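As a quick check, the stored centers should match the column means of each set, and centering should leave the standard deviations unchanged. This is a small base R sketch; the object name centered is chosen for illustration.

```r
# Rebuild the inputs so the example runs on its own
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
cca_result <- cancor(performance, design)

# The centers reported by cancor() are the column means
all.equal(cca_result$xcenter, colMeans(performance))
all.equal(cca_result$ycenter, colMeans(design))

# Centering shifts every mean to zero ...
centered <- scale(performance, center = TRUE, scale = FALSE)
round(colMeans(centered), 10)

# ... but leaves the spread of each variable untouched
rbind(original_sd = sapply(performance, sd),
      centered_sd = apply(centered, 2, sd))
```

The standard deviations of hp and wt differ by a large factor, which is why raw coefficients for the two are not directly comparable in size.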
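We will visualize the canonical correlations, which indicate the strength of the relationships between canonical variates, and the canonical coefficients, which show how individual variables are weighted in these variates. The plot labels these coefficients as loadings; strictly speaking, loadings are the correlations between the original variables and the variates, whereas the values plotted here are the raw coefficients from $xcoef and $ycoef.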
library(ggplot2) # For plotting
library(reshape2) # For melting data frames
library(gridExtra) # For arranging multiple plots
# Canonical correlations plot data
can_cor_data <- data.frame(
Canonical_Variate = paste("Pair", 1:length(cca_result$cor)),
Correlation = cca_result$cor
)
# Prepare melted data frames for xcoef and ycoef for loadings plots
xcoef_melted <- melt(cca_result$xcoef)
colnames(xcoef_melted) <- c("Variable", "Canonical_Variate", "Loading")
xcoef_melted$Set <- "X-set"
ycoef_melted <- melt(cca_result$ycoef)
colnames(ycoef_melted) <- c("Variable", "Canonical_Variate", "Loading")
ycoef_melted$Set <- "Y-set"
# Combine xcoef and ycoef data for a unified plot of loadings
coef_melted <- rbind(xcoef_melted, ycoef_melted)
# Define distinct color sets for the canonical correlations and the sets (X-set and Y-set)
cor_colors <- c("Pair 1" = "#FF9999",
"Pair 2" = "#9999FF",
"Pair 3" = "#99FF99")
set_colors <- c("X-set" = "steelblue", "Y-set" = "darkorange")
# Plot canonical correlations with distinct colors
p1 <- ggplot(can_cor_data, aes(x = Canonical_Variate, y = Correlation, fill = Canonical_Variate)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = cor_colors) +
theme_minimal() +
labs(title = "Canonical Correlations", x = "", y = "Correlation") +
coord_flip()
# Update the melted data to reflect the Pair labels
coef_melted$Canonical_Variate <- factor(coef_melted$Canonical_Variate,
levels = c(1, 2, 3),
labels = c("Pair 1", "Pair 2", "Pair 3"))
# Plot loadings with updated labels and set-specific colors
p2 <- ggplot(coef_melted, aes(x = Variable, y = Loading, fill = Set)) +
geom_bar(stat = "identity", position = "dodge") +
scale_fill_manual(values = set_colors) +
facet_wrap(~Canonical_Variate, scales = "free") +
theme_minimal() +
labs(title = "Canonical Coefficients (Loadings)", x = "Variable", y = "Loading") +
coord_flip()
# Combine plots
grid.arrange(p1, p2, ncol = 1)
The top plot displays the correlation coefficients for each pair of canonical variates from the two sets of variables. The color-coded bars represent the strength of the association, with Pair 1 showing the highest correlation and Pair 3 the lowest, suggesting that the first pair captures the most significant relationship between the sets.
The bottom plot is divided into three subplots corresponding to each canonical pair. These show the weight of individual variables from both the X-set (car performance) and Y-set (design specifications) in the canonical variates. In the first canonical variate, the negative coefficient for ‘wt’ (from the X-set) and the positive coefficient for ‘drat’ (from the Y-set) suggest an inverse relationship between these variables along this dimension: as ‘wt’ increases, ‘drat’ tends to decrease, or vice versa. Because these are raw coefficients, their sizes depend on each variable's units, so magnitudes should be compared across variables with care. Each pair of canonical variates represents a distinct dimension of correlation between the sets, with the first pair capturing the strongest relationship and subsequent pairs capturing progressively weaker correlations.
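Since raw coefficients are unit-dependent, a common complement is to look at the correlations between each original variable and the canonical variates of its own set, often called structure correlations or canonical loadings. The sketch below computes them in base R; the names x_loadings and y_loadings are assumptions for this example, and the signs may differ between platforms because the sign of each variate is arbitrary.

```r
# Rebuild the inputs so the example runs on its own
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
cca_result <- cancor(performance, design)

# Canonical variate scores from the centered data
U <- sweep(as.matrix(performance), 2, cca_result$xcenter) %*% cca_result$xcoef
V <- sweep(as.matrix(design), 2, cca_result$ycenter) %*% cca_result$ycoef

# Structure correlations: each variable against the variates of its own set
x_loadings <- cor(performance, U)
y_loadings <- cor(design, V)
colnames(x_loadings) <- colnames(y_loadings) <- paste("Pair", 1:3)

round(x_loadings, 3)
round(y_loadings, 3)
```

Unlike the raw coefficients, these values always lie between -1 and 1, which makes them easier to compare across variables measured in different units.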
Let’s draw the path diagram for the first canonical variate.
library(DiagrammeR) # For path diagram
library(DiagrammeRsvg) # For SVG rendering
library(magick) # For image processing
graph <- grViz("
digraph CCA {
# Graph layout settings
rankdir=BT
# Define styles for nodes
node [fontname = Helvetica]
# Performance variables
node [shape = box, color = blue]
mpg [label='mpg\\n0.0009'];
hp [label='hp\\n-0.00095'];
wt [label='wt\\n-0.1274'];
# Design variables
node [shape = box, color = green]
disp [label='disp\\n-0.0014'];
drat [label='drat\\n0.0123'];
qsec [label='qsec\\n-0.0006'];
# Canonical Variates
node [shape = ellipse, style = filled, color = lightgrey]
CV1x [label='CV1x'];
CV1y [label='CV1y'];
# Connecting performance variables to CV1x
edge [color = blue]
mpg -> CV1x;
hp -> CV1x;
wt -> CV1x;
# Connecting design variables to CV1y
edge [color = green]
disp -> CV1y;
drat -> CV1y;
qsec -> CV1y;
# Canonical correlations between Canonical Variates
edge [color = red, constraint=false, dir=both]
CV1x -> CV1y [label = '0.93'];
}
", engine = 'neato')
## This part is additional to render the image as HTML
# Export the graph to SVG
svg <- export_svg(graph)
# Save the SVG to a file
writeLines(svg, "graph.svg")
# Read the SVG content
svg_image <- image_read_svg("graph.svg", width = 2000)
# Convert to PNG and save
image_write(svg_image, "graph.png")
knitr::include_graphics("graph.png")
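The node labels in the diagram above are typed in by hand, so they can drift out of sync with the fitted result. One possible alternative is to build the Graphviz source from cca_result with base R string functions. In the sketch below, make_nodes() is a hypothetical helper written for this example (it is not part of DiagrammeR), and the layout is simplified compared with the diagram above.

```r
# Rebuild the inputs so the example runs on its own
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
cca_result <- cancor(performance, design)

# Hypothetical helper: one DOT node line per variable, labelled with the
# variable name and its coefficient rounded to four decimals
make_nodes <- function(coefs, var_names) {
  sprintf("%s [label='%s\\n%.4f'];", var_names, var_names, coefs)
}

x_nodes <- make_nodes(cca_result$xcoef[, 1], colnames(performance))
y_nodes <- make_nodes(cca_result$ycoef[, 1], colnames(design))

# Assemble the full DOT source as a single string
dot_source <- paste(c(
  "digraph CCA {",
  "node [shape = box, color = blue]", x_nodes,
  "node [shape = box, color = green]", y_nodes,
  "node [shape = ellipse, style = filled, color = lightgrey]",
  "CV1x; CV1y;",
  paste(colnames(performance), "-> CV1x;"),
  paste(colnames(design), "-> CV1y;"),
  sprintf("CV1x -> CV1y [dir = both, label = '%.2f'];", cca_result$cor[1]),
  "}"
), collapse = "\n")

cat(dot_source)
```

The resulting string can be passed to grViz() in place of the hand-written source, so the labels always reflect the coefficients and correlation actually estimated.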
In Canonical Correlation Analysis (CCA), understanding the statistical significance of the derived canonical correlations is crucial for interpreting the results meaningfully. The process involves evaluating whether the relationships uncovered by CCA are statistically significant—i.e., not likely to have occurred by chance. This section explains the statistical calculations behind testing the significance of canonical correlations.
The significance testing of canonical correlations involves several key steps and concepts, which are briefly outlined below:
Wilks’ Lambda:
Wilks’ Lambda (\(\Lambda\)) is a
statistic used to assess the significance of the overall model in
multivariate tests, including CCA. It represents the ratio of the
determinant of the within-groups sum of squares and cross-product matrix
to the determinant of the total sum of squares and cross-product matrix.
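For CCA, it measures the proportion of variance not explained by the
canonical correlations. Values of \(\Lambda\) close to 0 indicate that
the canonical correlations account for most of the shared variance, and
therefore provide stronger evidence against the null hypothesis of no
relationship.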
Chi-Squared Transformation:
To assess the significance of \(\Lambda\), it is transformed into a
chi-squared (\(\chi^2\)) statistic.
This transformation allows us to use the chi-squared distribution to
determine the probability that the observed relationships could occur by
chance. The transformation formula involves the number of observations
(\(N\)) and the number of variables in
each set (\(k_x\) and \(k_y\)), adjusting for the degrees of
freedom.
P-Value Calculation:
The p-value is calculated from the chi-squared statistic and its degrees
of freedom, which are determined by the number of variables in the X-set
and Y-set. The p-value tells us the probability of observing a
chi-squared statistic as extreme as, or more extreme than, what was
actually observed, under the assumption that there is no relationship
between the variable sets (null hypothesis).
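The three steps above can be written compactly. The symbols here are assumptions introduced for this summary: \(\rho_i\) is the \(i\)-th canonical correlation, \(s = \min(k_x, k_y)\) is the number of canonical pairs, and \(c\) is a sample-size-based multiplier.

\[\Lambda_k = \prod_{i=k}^{s} \left(1 - \rho_i^2\right), \qquad k = 1, \ldots, s\]

\[\chi^2_k = -\,c\,\ln \Lambda_k, \qquad df_k = (k_x - k + 1)(k_y - k + 1)\]

A widely cited choice for the multiplier is Bartlett's, \(c = N - 1 - \tfrac{1}{2}(k_x + k_y + 1)\). The code that follows uses \(\max(k_x, k_y)\) in place of \(k_x + k_y\) in this multiplier, which yields somewhat larger test statistics.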
# Perform Canonical Correlation Analysis
cca_result <- cancor(performance, design)
# The number of samples
N <- nrow(performance)
# The number of variables in the X-set and Y-set
kx <- ncol(performance)
ky <- ncol(design)
# Calculate the squared canonical correlations (eigenvalues)
eigenvalues <- cca_result$cor^2
# Initialize a vector to store Wilks' lambda values
wilks_lambda <- rep(NA, length(eigenvalues))
# Calculate Wilks' lambda for each canonical correlation
for (i in seq_along(eigenvalues)) {
wilks_lambda[i] <- prod(1 - eigenvalues[i:length(eigenvalues)])
}
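# Convert Wilks' lambda into a chi-square statistic
# Note: the multiplier here uses (max(kx, ky) + 1) / 2; Bartlett's commonly
# cited approximation uses (kx + ky + 1) / 2, which gives slightly smaller statistics
chisq_stats <- -(N - 1 - (max(kx, ky) + 1) / 2) * log(wilks_lambda)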
# Compute the p-values from the chi-square distribution
p_values <- pchisq(chisq_stats, df = (kx - (0:(length(wilks_lambda)-1))) * (ky - (0:(length(wilks_lambda)-1))), lower.tail = FALSE)
# Combine the results into a data frame for easy viewing
significance_tests <- data.frame(
Canonical_Correlation = cca_result$cor,
Wilks_Lambda = wilks_lambda,
Chi_Squared = chisq_stats,
P_Value = p_values
)
# Output the results
significance_tests
## Canonical_Correlation Wilks_Lambda Chi_Squared P_Value
## 1 0.9295183 0.04605485 89.259742 2.291676e-15
## 2 0.7912781 0.33864938 31.400909 2.535756e-06
## 3 0.3069647 0.90577265 2.870041 9.024256e-02
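For comparison, the same sequence of tests can be run with Bartlett's commonly cited multiplier, N - 1 - (kx + ky + 1) / 2. The sketch below is one way to do this in base R; the names bartlett_mult, chisq_bartlett, and p_bartlett are chosen for this example, and the statistics will differ somewhat from the table above because the multiplier differs.

```r
# Rebuild the inputs so the example runs on its own
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
rho <- cancor(performance, design)$cor

N <- nrow(performance)
kx <- ncol(performance)
ky <- ncol(design)
k <- seq_along(rho)

# Wilks' lambda for test k: product of (1 - rho^2) over pairs k, ..., s
wilks_lambda <- rev(cumprod(rev(1 - rho^2)))

# Bartlett's multiplier (assumed form: uses kx + ky, not max(kx, ky))
bartlett_mult <- N - 1 - (kx + ky + 1) / 2
chisq_bartlett <- -bartlett_mult * log(wilks_lambda)

# Degrees of freedom shrink as earlier pairs are removed
df <- (kx - k + 1) * (ky - k + 1)
p_bartlett <- pchisq(chisq_bartlett, df = df, lower.tail = FALSE)

data.frame(
  Canonical_Correlation = rho,
  Wilks_Lambda = wilks_lambda,
  Chi_Squared = chisq_bartlett,
  DF = df,
  P_Value = p_bartlett
)
```

With only 32 observations, the choice of multiplier can move p-values near the threshold, so the third row in particular is worth comparing between the two versions.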
Given the CCA results for the mtcars dataset, where we analyzed relationships between car performance and design specifications, we conducted significance tests for each canonical correlation. Here’s a summary of the findings:
The tests are sequential: row k tests the null hypothesis that canonical correlations k through 3 are all zero. The first two rows have very low p-values (about 2.3e-15 and 2.5e-06), so the first two canonical correlations reflect statistically significant relationships between our sets of variables, with the first being especially strong. The squared canonical correlations give the share of variance that the two variates in each pair have in common: approximately 86.4% for the first pair and 62.6% for the second. Conversely, the third pair shares only about 9.4% of its variance, and its p-value of about 0.09 exceeds the conventional 0.05 threshold, so this correlation could plausibly have arisen by chance.
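The percentages quoted above are the squared canonical correlations. A short base R sketch to reproduce them (the name shared_variance_pct is chosen for this example):

```r
# Rebuild the inputs so the example runs on its own
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
rho <- cancor(performance, design)$cor

# Squared canonical correlations, expressed as percentages
shared_variance_pct <- round(100 * rho^2, 1)
names(shared_variance_pct) <- paste("Pair", seq_along(rho))
shared_variance_pct
```

Each value describes the variance shared within one pair of canonical variates, not the share of variance in the original variables that the variates account for.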
In this tutorial, we covered Canonical Correlation Analysis (CCA), a statistical method for exploring the relationships between two sets of variables, using the mtcars dataset as an example. We demonstrated how CCA identifies linear combinations of these variables that are maximally correlated, thereby uncovering relationships not apparent through simple correlation analyses. The tutorial covered data preparation, execution of CCA in R with cancor(), plotting with ggplot2, and the interpretation of canonical correlations and coefficients to understand the contributions of individual variables. We highlighted the importance of assessing the statistical significance of these correlations, employing Wilks’ Lambda and chi-squared tests. Additionally, we illustrated the relationships through visualizations and a path diagram.