Canonical Correlation

Introduction to Canonical Correlation

Canonical Correlation Analysis (CCA) is a multivariate statistical technique used to explore the relationships between two sets of variables. It seeks to identify and measure the associations between the variables by finding linear combinations of the variables in each set that are maximally correlated with each other.

Mathematically, if we have two sets of variables, \(X = \{x_1, x_2, \ldots, x_p\}\) and \(Y = \{y_1, y_2, \ldots, y_q\}\), CCA seeks to find linear combinations of \(X\) and \(Y\), denoted as \(U = a_1x_1 + a_2x_2 + \ldots + a_px_p\) and \(V = b_1y_1 + b_2y_2 + \ldots + b_qy_q\). The vectors \(a = [a_1, a_2, \ldots, a_p]\) and \(b = [b_1, b_2, \ldots, b_q]\) are chosen to maximize the correlation between \(U\) and \(V\). After the first pair has been found, each further pair of canonical variates (\(U\) and \(V\)) again maximizes the correlation, subject to being uncorrelated with all previous pairs. At most \(\min(p, q)\) pairs can be extracted this way. This property is crucial for the method’s ability to extract multiple, distinct dimensions of correlation between the two sets of variables.
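
In matrix form, the search for the first pair can be written as the optimization problem below. This is a sketch in standard notation: the symbols \(\Sigma_{XX}\), \(\Sigma_{YY}\) (within-set covariance matrices), \(\Sigma_{XY}\) (between-set covariance matrix) and the superscripted vectors \(a^{(k)}\), \(b^{(k)}\) are introduced here for illustration and are not used elsewhere in this tutorial.

\[\rho_1 = \max_{a,\, b} \operatorname{corr}(U, V) = \max_{a,\, b} \frac{a^\top \Sigma_{XY}\, b}{\sqrt{a^\top \Sigma_{XX}\, a}\,\sqrt{b^\top \Sigma_{YY}\, b}}\]

Because the ratio does not change when \(a\) or \(b\) is rescaled, the variates are usually normalized, for example with \(a^\top \Sigma_{XX}\, a = b^\top \Sigma_{YY}\, b = 1\). The \(k\)-th pair solves the same problem under the additional constraints

\[a^{(k)\top} \Sigma_{XX}\, a^{(j)} = 0, \qquad b^{(k)\top} \Sigma_{YY}\, b^{(j)} = 0, \qquad j = 1, \ldots, k - 1.\]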

Additionally, the first (largest) canonical correlation is at least as large as the absolute value of the largest bivariate correlation between any single pair of variables across the \(X\) and \(Y\) sets, because every single-variable pair is itself one admissible choice of linear combinations. This characteristic emphasizes CCA’s capacity to uncover the strongest linear relationship possible between the two variable sets, beyond what is achievable through simple bivariate correlation analysis.

Canonical Correlation Analysis in R

For this tutorial, we will use the mtcars dataset included in R, which records fuel consumption in miles per gallon (mpg), cylinder count, horsepower, and other aspects of automobile design and performance for 32 automobiles. We will focus on exploring the relationship between a set of variables related to car performance (mpg, hp, wt) and a set related to design specifications (disp, drat, qsec).

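First, install and load the necessary packages. The cancor() function used below ships with R in the stats package; the CCA package is optional here and provides additional tools such as cc():

## Uncomment the line below to install the CCA package if you haven't already
# install.packages("CCA")
library(CCA) # Optional: extra CCA tools; cancor() itself comes from the stats package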

Next, prepare the data:

data(mtcars)
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]

# Check for NA values to ensure data integrity
sum(is.na(performance))
## [1] 0
sum(is.na(design))
## [1] 0

Now, perform the canonical correlation analysis:

cca_result <- cancor(performance, design)
cca_result
## $cor
## [1] 0.9295183 0.7912781 0.3069647
## 
## $xcoef
##              [,1]         [,2]         [,3]
## mpg  0.0008956961 -0.011633574 -0.070645646
## hp  -0.0009510815  0.002925112 -0.002799602
## wt  -0.1274417707 -0.248515669 -0.242116355
## 
## $ycoef
##               [,1]          [,2]        [,3]
## disp -0.0014145342 -0.0006212359 -0.00186371
## drat  0.0123441384  0.0151949679 -0.50735976
## qsec -0.0006053208 -0.1102697491 -0.04391710
## 
## $xcenter
##       mpg        hp        wt 
##  20.09062 146.68750   3.21725 
## 
## $ycenter
##       disp       drat       qsec 
## 230.721875   3.596563  17.848750

$cor

The values under $cor in the output are the canonical correlations, that is, the correlations between matching pairs of canonical variates from the two sets of variables:

  • 0.9295183: Correlation between the first canonical variates (\(CV1_X\) and \(CV1_Y\)), indicating a very strong linear relationship.

  • 0.7912781: Correlation between the second canonical variates (\(CV2_X\) and \(CV2_Y\)), showing a strong but lesser relationship than the first pair.

  • 0.3069647: Correlation between the third canonical variates (\(CV3_X\) and \(CV3_Y\)), indicating a relatively weak relationship.

Each correlation measures the strength of the linear relationship captured by each corresponding pair of canonical variates.
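
As a quick sanity check, the first canonical correlation should be at least as large as the largest absolute bivariate correlation between any performance variable and any design variable. The sketch below recreates the two data frames so that it runs on its own; the names pairwise, max_pairwise and first_canonical are introduced here for illustration only.

```r
# Recreate the two variable sets from the built-in mtcars data
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]

# 3 x 3 matrix of bivariate correlations across the two sets
pairwise <- cor(performance, design)
max_pairwise <- max(abs(pairwise))

# First (largest) canonical correlation
first_canonical <- cancor(performance, design)$cor[1]

# The first canonical correlation can never fall below the best single pair
first_canonical >= max_pairwise
```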

$xcoef and $ycoef

The canonical variates are generated through linear combinations of the original variables within each set, aiming to maximize the correlation between the pairs of canonical variates from the two sets. In our analysis of the mtcars dataset, we have two sets: one related to car performance (mpg, hp, wt) and the other to design specifications (disp, drat, qsec). The coefficients for each canonical variate are given by the columns of $xcoef and $ycoef in the output above. Because cancor() centers each variable by default, the equations below describe the variates up to an additive constant, which does not affect any correlation. These equations are central to interpreting the results of CCA, as they show how each original variable enters the linear combinations that link the two sets of variables. Keep in mind that the raw coefficients depend on the units of each variable, so their magnitudes are not directly comparable across variables.

Canonical Variates for the X-set (Car Performance)

\[CV1_X = (0.0008956961 \times mpg) - (0.0009510815 \times hp) - (0.1274417707 \times wt)\]

\[CV2_X = (-0.011633574 \times mpg) + (0.002925112 \times hp) - (0.248515669 \times wt)\]

\[CV3_X = (-0.070645646 \times mpg) - (0.002799602 \times hp) - (0.242116355 \times wt)\]

Canonical Variates for the Y-set (Design Specifications)

\[CV1_Y = (-0.0014145342 \times disp) + (0.0123441384 \times drat) - (0.0006053208 \times qsec)\]

\[CV2_Y = (-0.0006212359 \times disp) + (0.0151949679 \times drat) - (0.1102697491 \times qsec)\]

\[CV3_Y = (-0.00186371 \times disp) - (0.50735976 \times drat) - (0.04391710 \times qsec)\]
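
The equations can be applied to the data to obtain a score on every canonical variate for each car. The following is a minimal sketch, assuming the variables are centered with the means that cancor() reports; the names X, Y, fit, U and V are introduced here for illustration. Correlating matching columns of the two score matrices should reproduce the values in $cor.

```r
# Build the two variable sets as numeric matrices
X <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
Y <- as.matrix(mtcars[, c("disp", "drat", "qsec")])
fit <- cancor(X, Y)

# Center each set with the stored means, then apply the coefficients
U <- scale(X, center = fit$xcenter, scale = FALSE) %*% fit$xcoef
V <- scale(Y, center = fit$ycenter, scale = FALSE) %*% fit$ycoef

# Correlations of matching pairs: these should match fit$cor
pair_cor <- diag(cor(U, V))
all.equal(unname(pair_cor), fit$cor)

# Variates from different pairs are uncorrelated (off-diagonals near zero)
round(cor(U, V), 6)
```

The off-diagonal entries of the last matrix should be zero up to rounding, which is the "uncorrelated with all other pairs" property in numerical form.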

$xcenter and $ycenter

The $xcenter and $ycenter values in the output are the means of the variables in the X-set (Car Performance) and Y-set (Design Specifications), respectively. By default, cancor() subtracts these means from the original data before computing the canonical variates, so every variable has mean zero. Centering only shifts the variables; it does not put them on a comparable scale, which would require standardizing them as well.

$xcenter:
    mpg: 20.09062 (average miles per gallon)
    hp: 146.68750 (average horsepower)
    wt: 3.21725 (average weight in thousands of pounds)

$ycenter:
    disp: 230.721875 (average engine displacement in cubic inches)
    drat: 3.596563 (average rear axle ratio)
    qsec: 17.848750 (average quarter-mile time in seconds)
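
The short check below confirms that the stored centers are the column means, and illustrates the difference between centering and standardizing. It is a sketch that recreates the data so it can run on its own; fit and fit_std are illustrative names. Standardizing both sets with scale() changes the coefficients but should leave the canonical correlations unchanged.

```r
# Recreate the two variable sets from the built-in mtcars data
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
fit <- cancor(performance, design)

# The centers stored by cancor() are the column means
all.equal(fit$xcenter, colMeans(performance))
all.equal(fit$ycenter, colMeans(design))

# Standardize every variable to mean 0 and standard deviation 1, then refit
fit_std <- cancor(scale(performance), scale(design))

# Canonical correlations do not depend on the units of the variables
all.equal(fit$cor, fit_std$cor)
```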

Visualizing the Relationship

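We will visualize the canonical correlations, which indicate the strength of the relationships between canonical variates, and the canonical coefficients, which show the weight each variable receives in these variates. Note that these are raw coefficients rather than loadings in the strict sense (correlations between the original variables and the variates), so their magnitudes depend on the units of each variable.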

library(ggplot2) # For plotting 
library(reshape2)  # For melting data frames
library(gridExtra) # For arranging multiple plots

# Canonical correlations plot data
can_cor_data <- data.frame(
  Canonical_Variate = paste("Pair", seq_along(cca_result$cor)),
  Correlation = cca_result$cor
)

# Prepare melted data frames for xcoef and ycoef for loadings plots
xcoef_melted <- melt(cca_result$xcoef)
colnames(xcoef_melted) <- c("Variable", "Canonical_Variate", "Loading")
xcoef_melted$Set <- "X-set"

ycoef_melted <- melt(cca_result$ycoef)
colnames(ycoef_melted) <- c("Variable", "Canonical_Variate", "Loading")
ycoef_melted$Set <- "Y-set"

# Combine xcoef and ycoef data for a unified plot of loadings
coef_melted <- rbind(xcoef_melted, ycoef_melted)

# Define distinct color sets for the canonical correlations and the sets (X-set and Y-set)
cor_colors <- c("Pair 1" = "#FF9999", 
                "Pair 2" = "#9999FF",  
                "Pair 3" = "#99FF99")  
set_colors <- c("X-set" = "steelblue", "Y-set" = "darkorange")

# Plot canonical correlations with distinct colors
p1 <- ggplot(can_cor_data, aes(x = Canonical_Variate, y = Correlation, fill = Canonical_Variate)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = cor_colors) + 
  theme_minimal() +
  labs(title = "Canonical Correlations", x = "", y = "Correlation") +
  coord_flip()

# Update the melted data to reflect the Pair labels
coef_melted$Canonical_Variate <- factor(coef_melted$Canonical_Variate,
                                        levels = c(1, 2, 3),
                                        labels = c("Pair 1", "Pair 2", "Pair 3"))

# Plot coefficients with updated labels and set-specific colors
# (the value column is still named "Loading" from the melt step above)
p2 <- ggplot(coef_melted, aes(x = Variable, y = Loading, fill = Set)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = set_colors) + 
  facet_wrap(~Canonical_Variate, scales = "free") + 
  theme_minimal() +
  labs(title = "Canonical Coefficients", x = "Variable", y = "Coefficient") +
  coord_flip()

# Combine plots
grid.arrange(p1, p2, ncol = 1)
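
Because raw coefficients depend on the units of each variable, it is often easier to interpret structure correlations, which is what the term loadings usually refers to: the correlation between each original variable and the canonical variates of its own set. The sketch below computes them with base R only; the names X, Y, fit, U, V, x_loadings and y_loadings are introduced here for illustration.

```r
# Build the two variable sets as numeric matrices
X <- as.matrix(mtcars[, c("mpg", "hp", "wt")])
Y <- as.matrix(mtcars[, c("disp", "drat", "qsec")])
fit <- cancor(X, Y)

# Canonical variate scores for each car
U <- scale(X, center = fit$xcenter, scale = FALSE) %*% fit$xcoef
V <- scale(Y, center = fit$ycenter, scale = FALSE) %*% fit$ycoef

# Structure correlations: original variables vs. their own set's variates
x_loadings <- cor(X, U)
y_loadings <- cor(Y, V)
colnames(x_loadings) <- colnames(y_loadings) <- paste("Pair", seq_along(fit$cor))

round(x_loadings, 3)
round(y_loadings, 3)
```

Unlike the coefficients, these values always lie between -1 and 1, so they can be compared across variables measured in different units.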