canonicalcorr – Orhan Soyuhos

Introduction to Canonical Correlation

Canonical Correlation Analysis (CCA) is a multivariate statistical technique used to explore the relationships between two sets of variables. It seeks to identify and measure the associations between the variables by finding linear combinations of the variables in each set that are maximally correlated with each other.

Mathematically, if we have two sets of variables, $X = \{x_1, x_2, \ldots, x_p\}$ and $Y = \{y_1, y_2, \ldots, y_q\}$, CCA seeks to find linear combinations of $X$ and $Y$, denoted as $U = a_1x_1 + a_2x_2 + \ldots + a_px_p$ and $V = b_1y_1 + b_2y_2 + \ldots + b_qy_q$. The vectors $a = [a_1, a_2, \ldots, a_p]$ and $b = [b_1, b_2, \ldots, b_q]$ are chosen to maximize the correlation between $U$ and $V$. Further pairs are then found in the same way, under the constraint that each new pair of canonical variates ($U$ and $V$) is uncorrelated with all earlier pairs. At most $\min(p, q)$ pairs can be extracted. This property is crucial for the method’s ability to extract multiple, distinct dimensions of correlation between the two sets of variables.
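
In matrix form, the objective for the first pair can be written compactly. The symbols $\Sigma_{XX}$, $\Sigma_{YY}$ and $\Sigma_{XY}$ are notation assumed here for the within-set and between-set covariance matrices; they are not used elsewhere in this tutorial.

\[\rho_1 = \max_{a,\, b} \; \frac{a^\top \Sigma_{XY}\, b}{\sqrt{a^\top \Sigma_{XX}\, a}\; \sqrt{b^\top \Sigma_{YY}\, b}}\]

The ratio does not change when $a$ or $b$ is multiplied by a positive constant, so software fixes the scale of the weight vectors by convention.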

Additionally, the first (largest) canonical correlation is always at least as large as the largest absolute bivariate correlation between any single pair of variables across the $X$ and $Y$ sets. This characteristic emphasizes CCA’s capacity to uncover the strongest linear relationship possible between the two variable sets, beyond what is achievable through simple bivariate correlation analysis.

Canonical Correlation Analysis in R

For this tutorial, we will use the mtcars dataset included in R, which records fuel consumption in miles per gallon (mpg), cylinder count, horsepower, and other aspects of automobile design and performance for 32 automobiles. We will explore the relationship between a set of variables related to car performance (mpg, hp, wt) and a set related to design specifications (disp, drat, qsec).

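First, load the necessary packages. The cancor() function used below is part of base R (the stats package), so the CCA package is optional for this tutorial:

## cancor() comes with base R's stats package, which is loaded by default.
## Uncomment the line below to install the optional CCA package if you haven't already
# install.packages("CCA")
library(CCA) # Optional: provides cc() as an alternative to cancor()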

Next, prepare the data:

data(mtcars)
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]

# Check for NA values to ensure data integrity
sum(is.na(performance))

## [1] 0

sum(is.na(design))

## [1] 0

Now, perform the canonical correlation analysis:

cca_result <- cancor(performance, design)
cca_result

## $cor
## [1] 0.9295183 0.7912781 0.3069647
## 
## $xcoef
##              [,1]         [,2]         [,3]
## mpg  0.0008956961 -0.011633574 -0.070645646
## hp  -0.0009510815  0.002925112 -0.002799602
## wt  -0.1274417707 -0.248515669 -0.242116355
## 
## $ycoef
##               [,1]          [,2]        [,3]
## disp -0.0014145342 -0.0006212359 -0.00186371
## drat  0.0123441384  0.0151949679 -0.50735976
## qsec -0.0006053208 -0.1102697491 -0.04391710
## 
## $xcenter
##       mpg        hp        wt 
##  20.09062 146.68750   3.21725 
## 
## $ycenter
##       disp       drat       qsec 
## 230.721875   3.596563  17.848750

$cor

The values under $cor in the output represent the correlations between pairs of canonical variates from two sets of variables:

0.9295183: Correlation between the first canonical variates ($CV1_X$ and $CV1_Y$), indicating a very strong linear relationship.
0.7912781: Correlation between the second canonical variates ($CV2_X$ and $CV2_Y$), showing a strong but lesser relationship than the first pair.
0.3069647: Correlation between the third canonical variates ($CV3_X$ and $CV3_Y$), indicating a relatively weak relationship.

Each correlation measures the strength of the linear relationship captured by each corresponding pair of canonical variates.
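
A quick sanity check on these values: the first canonical correlation can never be smaller than the largest absolute pairwise correlation between an X variable and a Y variable. The sketch below recomputes both quantities from scratch so it can be run on its own; the object names (`first_cc`, `cross_cor`, `max_bivariate`) are illustrative only.

```r
# Rebuild the two variable sets so this block is self-contained
data(mtcars)
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]

# First (largest) canonical correlation
first_cc <- cancor(performance, design)$cor[1]

# All 3 x 3 pairwise correlations between the two sets
cross_cor <- cor(performance, design)
max_bivariate <- max(abs(cross_cor))

# Compare the two quantities
c(first_canonical = first_cc, max_bivariate = max_bivariate)
first_cc >= max_bivariate
```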

$xcoef and $ycoef

The canonical variates are generated through linear combinations of the original variables within each set, aiming to maximize the correlation between the pairs of canonical variates from the two sets. In our analysis of the mtcars dataset, we have two sets: one related to car performance (mpg, hp, wt) and the other to design specifications (disp, drat, qsec). The coefficients for each canonical variate are given by $xcoef and $ycoef in the output above, with one column per variate. Because cancor() centers the data by default, the coefficients apply to the mean-centered variables. The equations for the canonical variates are crucial for interpreting the results of CCA, showing how each original variable contributes to the relationships between the two sets of variables.

Canonical Variates for the X-set (Car Performance)

\[CV1_X = (0.0008956961 \times mpg) - (0.0009510815 \times hp) - (0.1274417707 \times wt)\]
\[CV2_X = (-0.011633574 \times mpg) + (0.002925112 \times hp) - (0.248515669 \times wt)\]
\[CV3_X = (-0.070645646 \times mpg) - (0.002799602 \times hp) - (0.242116355 \times wt)\]

Canonical Variates for the Y-set (Design Specifications)

\[CV1_Y = (-0.0014145342 \times disp) + (0.0123441384 \times drat) - (0.0006053208 \times qsec)\]
\[CV2_Y = (-0.0006212359 \times disp) + (0.0151949679 \times drat) - (0.1102697491 \times qsec)\]
\[CV3_Y = (-0.00186371 \times disp) - (0.50735976 \times drat) - (0.04391710 \times qsec)\]
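
The equations above can be applied directly to the data. The following sketch, with illustrative object names (`cca`, `U`, `V`, `pair_cor`), centers each set using the means stored by cancor(), multiplies by the coefficient matrices to get the canonical variate scores, and then correlates the scores. The diagonal of the resulting matrix should reproduce $cor, and the off-diagonal entries should be essentially zero.

```r
# Rebuild the inputs so this block is self-contained
data(mtcars)
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
cca <- cancor(performance, design)

# Center each set with the means returned by cancor()
X_centered <- sweep(as.matrix(performance), 2, cca$xcenter)
Y_centered <- sweep(as.matrix(design), 2, cca$ycenter)

# Canonical variate scores: one column per variate, one row per car
U <- X_centered %*% cca$xcoef
V <- Y_centered %*% cca$ycoef

# Correlate every X-variate with every Y-variate
score_cor <- cor(U, V)
pair_cor <- diag(score_cor)

round(score_cor, 4)   # diagonal: canonical correlations
cca$cor               # for comparison
```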

$xcenter and $ycenter

The $xcenter and $ycenter values in the output represent the means of the variables in the X-set (Car Performance) and Y-set (Design Specifications), respectively. These means are used to center the variables by subtracting them from the original data, a standard preprocessing step in CCA. Centering gives every variable a mean of zero but does not change its spread, so the variables remain in their original units.

$xcenter:
    mpg: 20.09062 (average miles per gallon)
    hp: 146.68750 (average horsepower)
    wt: 3.21725 (average weight in thousands of pounds)

$ycenter:
    disp: 230.721875 (average engine displacement in cubic inches)
    drat: 3.596563 (average rear axle ratio)
    qsec: 17.848750 (average quarter-mile time in seconds)
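
As a small illustration of what the centering step does, the sketch below checks that $xcenter matches the column means and compares the standard deviations before and after centering. The names `performance_centered`, `sd_before` and `sd_after` are illustrative.

```r
# Rebuild the inputs so this block is self-contained
data(mtcars)
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
cca_result <- cancor(performance, design)

# $xcenter should match the column means of the X-set
all.equal(colMeans(performance), cca_result$xcenter)

# Subtract the means from each column
performance_centered <- sweep(as.matrix(performance), 2, cca_result$xcenter)

# Means are now zero (up to rounding error)
round(colMeans(performance_centered), 10)

# Standard deviations before and after centering
sd_before <- apply(performance, 2, sd)
sd_after <- apply(performance_centered, 2, sd)
rbind(sd_before, sd_after)
```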

Visualizing the Relationship

We will visualize the canonical correlations, indicating the strength of the relationships between canonical variates, and the canonical coefficients (labeled "loadings" in the plot) to show the contribution of individual variables to these variates. Note that these are raw coefficients, so their size depends on the units of each variable.

library(ggplot2) # For plotting 
library(reshape2)  # For melting data frames
library(gridExtra) # For arranging multiple plots

# Canonical correlations plot data
can_cor_data <- data.frame(
  Canonical_Variate = paste("Pair", 1:length(cca_result$cor)),
  Correlation = cca_result$cor
)

# Prepare melted data frames for xcoef and ycoef for loadings plots
xcoef_melted <- melt(cca_result$xcoef)
colnames(xcoef_melted) <- c("Variable", "Canonical_Variate", "Loading")
xcoef_melted$Set <- "X-set"

ycoef_melted <- melt(cca_result$ycoef)
colnames(ycoef_melted) <- c("Variable", "Canonical_Variate", "Loading")
ycoef_melted$Set <- "Y-set"

# Combine xcoef and ycoef data for a unified plot of loadings
coef_melted <- rbind(xcoef_melted, ycoef_melted)

# Define distinct color sets for the canonical correlations and the sets (X-set and Y-set)
cor_colors <- c("Pair 1" = "#FF9999", 
                "Pair 2" = "#9999FF",  
                "Pair 3" = "#99FF99")  
set_colors <- c("X-set" = "steelblue", "Y-set" = "darkorange")

# Plot canonical correlations with distinct colors
p1 <- ggplot(can_cor_data, aes(x = Canonical_Variate, y = Correlation, fill = Canonical_Variate)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = cor_colors) + 
  theme_minimal() +
  labs(title = "Canonical Correlations", x = "", y = "Correlation") +
  coord_flip()

# Update the melted data to reflect the Pair labels
coef_melted$Canonical_Variate <- factor(coef_melted$Canonical_Variate,
                                        levels = c(1, 2, 3),
                                        labels = c("Pair 1", "Pair 2", "Pair 3"))

# Plot loadings with updated labels and set-specific colors
p2 <- ggplot(coef_melted, aes(x = Variable, y = Loading, fill = Set)) +
  geom_bar(stat = "identity", position = "dodge") +
  scale_fill_manual(values = set_colors) + 
  facet_wrap(~Canonical_Variate, scales = "free") + 
  theme_minimal() +
  labs(title = "Canonical Coefficients (Loadings)", x = "Variable", y = "Loading") +
  coord_flip()

# Combine plots
grid.arrange(p1, p2, ncol = 1)

The top plot displays the correlation coefficients for each pair of canonical variates from the two sets of variables. The color-coded bars represent the strength of the association, with Pair 1 showing the highest correlation and Pair 3 the lowest, suggesting that the first pair captures the most significant relationship between the sets.

The bottom plot is divided into three subplots corresponding to each canonical pair. These show the contribution of individual variables from both the X-set (car performance) and Y-set (design specifications) to the canonical variates. In the first canonical variate, the negative coefficient for ‘wt’ (from the X-set) and the positive coefficient for ‘drat’ (from the Y-set) suggest an inverse relationship between these variables along the first canonical dimension: as ‘wt’ increases, ‘drat’ tends to decrease, and vice versa. Each pair of canonical variates represents a unique dimension of correlation between the sets, with the first pair capturing the strongest relationship and subsequent pairs capturing progressively weaker correlations.
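
Because raw coefficients depend on measurement units (wt is in thousands of pounds, disp in cubic inches), comparing bar heights across variables can mislead. One common remedy, sketched here as an optional extra rather than as part of the analysis above, is to standardize both sets with scale() before calling cancor(). The canonical correlations are unchanged, while the coefficients become comparable across variables. The names `raw_result` and `std_result` are illustrative.

```r
# Rebuild the inputs so this block is self-contained
data(mtcars)
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]

# CCA on the raw variables and on z-scored variables
raw_result <- cancor(performance, design)
std_result <- cancor(scale(performance), scale(design))

# The canonical correlations do not depend on the units
all.equal(raw_result$cor, std_result$cor)

# Coefficients for standardized variables (one column per variate)
round(std_result$xcoef, 3)
round(std_result$ycoef, 3)
```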

Path Diagram

Let’s draw the path diagram for the first canonical variate.

library(DiagrammeR) # For path diagram

library(DiagrammeRsvg) # For SVG rendering
library(magick) # For image processing

graph <- grViz("
digraph CCA {

  # Graph layout settings
  rankdir=BT

  # Define styles for nodes
  node [fontname = Helvetica]

  # Performance variables
  node [shape = box, color = blue]
  mpg [label='mpg\\n0.0009'];
  hp [label='hp\\n-0.0010'];
  wt [label='wt\\n-0.1274'];

  # Design variables
  node [shape = box, color = green]
  disp [label='disp\\n-0.0014'];
  drat [label='drat\\n0.0123'];
  qsec [label='qsec\\n-0.0006'];

  # Canonical Variates
  node [shape = ellipse, style = filled, color = lightgrey]
  CV1x [label='CV1x'];
  CV1y [label='CV1y'];

  # Connecting performance variables to CV1x
  edge [color = blue]
  mpg -> CV1x;
  hp -> CV1x;
  wt -> CV1x;

  # Connecting design variables to CV1y
  edge [color = green]
  disp -> CV1y;
  drat -> CV1y;
  qsec -> CV1y;

  # Canonical correlations between Canonical Variates
  edge [color = red, constraint=false, dir=both]
  CV1x -> CV1y [label = '0.93'];

}
", engine = 'neato')

## This part is additional to render the image as HTML
# Export the graph to SVG
svg <- export_svg(graph)
# Save the SVG to a file
writeLines(svg, "graph.svg")
# Read the SVG content
svg_image <- image_read_svg("graph.svg", width = 2000) 
# Convert to PNG and save
image_write(svg_image, "graph.png")

knitr::include_graphics("graph.png")

Statistical Significance

In Canonical Correlation Analysis (CCA), understanding the statistical significance of the derived canonical correlations is crucial for interpreting the results meaningfully. The process involves evaluating whether the relationships uncovered by CCA are statistically significant, that is, unlikely to have occurred by chance. This section explains the statistical calculations behind testing the significance of canonical correlations.

Calculation Overview

The significance testing of canonical correlations involves several key steps and concepts, which are briefly outlined below:

Wilks’ Lambda:
Wilks’ Lambda ($\Lambda$) is a statistic used to assess the significance of the overall model in multivariate tests, including CCA. For CCA, it is the product of the unexplained proportions $(1 - r_j^2)$ over the canonical correlations being tested: $\Lambda_i = \prod_{j \ge i} (1 - r_j^2)$ tests the $i$-th and all later canonical correlations jointly. It measures the proportion of variance not explained by those canonical correlations, so values of $\Lambda$ closer to 0 indicate stronger evidence of a relationship.
Chi-Squared Transformation:
To assess the significance of $\Lambda$, it is transformed into an approximate chi-squared ($\chi^2$) statistic by multiplying $-\ln \Lambda$ by a correction factor that depends on the number of observations ($N$) and the number of variables in each set ($k_x$ and $k_y$). This transformation allows us to use the chi-squared distribution to determine the probability that the observed relationships could occur by chance.
P-Value Calculation:
The p-value is calculated from the chi-squared statistic and its degrees of freedom, which for the $i$-th test are $(k_x - i + 1)(k_y - i + 1)$. The p-value tells us the probability of observing a chi-squared statistic as extreme as, or more extreme than, what was actually observed, under the assumption that there is no relationship between the variable sets (null hypothesis).

# Perform Canonical Correlation Analysis
cca_result <- cancor(performance, design)

# The number of samples
N <- nrow(performance)  

# The number of variables in the X-set and Y-set
kx <- ncol(performance)
ky <- ncol(design)

# Calculate the squared canonical correlations (eigenvalues)
eigenvalues <- cca_result$cor^2

# Initialize a vector to store Wilks' lambda values
wilks_lambda <- rep(NA, length(eigenvalues))

# Calculate Wilks' lambda for each canonical correlation
for (i in seq_along(eigenvalues)) {
  wilks_lambda[i] <- prod(1 - eigenvalues[i:length(eigenvalues)])
}

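# Convert Wilks' lambda into an approximate chi-square statistic.
# Note: this multiplier uses (max(kx, ky) + 1) / 2; Bartlett's widely
# cited approximation uses (kx + ky + 1) / 2 in its place.
chisq_stats <- -(N - 1 - (max(kx, ky) + 1) / 2) * log(wilks_lambda)

# Degrees of freedom for the i-th test: (kx - i + 1) * (ky - i + 1)
steps <- seq_along(wilks_lambda) - 1
dfs <- (kx - steps) * (ky - steps)

# Compute the p-values from the chi-square distribution
p_values <- pchisq(chisq_stats, df = dfs, lower.tail = FALSE)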

# Combine the results into a data frame for easy viewing
significance_tests <- data.frame(
  Canonical_Correlation = cca_result$cor,
  Wilks_Lambda = wilks_lambda,
  Chi_Squared = chisq_stats,
  P_Value = p_values
)

# Output the results
significance_tests

##   Canonical_Correlation Wilks_Lambda Chi_Squared      P_Value
## 1             0.9295183   0.04605485   89.259742 2.291676e-15
## 2             0.7912781   0.33864938   31.400909 2.535756e-06
## 3             0.3069647   0.90577265    2.870041 9.024256e-02
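
For comparison, the same sequential tests can be run with Bartlett's approximation, which multiplies $-\ln \Lambda$ by $N - 1 - (k_x + k_y + 1)/2$. The sketch below is an alternative under that assumption, not the calculation that produced the table above; object names such as `bartlett_mult` and `bartlett_tests` are illustrative. With $N = 32$ and three variables per set the multiplier is 27.5 rather than 29, so the statistics come out somewhat smaller.

```r
# Rebuild the inputs so this block is self-contained
data(mtcars)
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]

r <- cancor(performance, design)$cor
N <- nrow(performance)
kx <- ncol(performance)
ky <- ncol(design)

# Wilks' lambda for testing correlations i, i+1, ... jointly
wilks <- rev(cumprod(rev(1 - r^2)))

# Bartlett's multiplier (assumed form, see lead-in)
bartlett_mult <- N - 1 - (kx + ky + 1) / 2
chisq_bartlett <- -bartlett_mult * log(wilks)

# Degrees of freedom shrink as earlier pairs are removed
steps <- seq_along(r) - 1
dfs <- (kx - steps) * (ky - steps)
p_bartlett <- pchisq(chisq_bartlett, df = dfs, lower.tail = FALSE)

bartlett_tests <- data.frame(
  Canonical_Correlation = r,
  Wilks_Lambda = wilks,
  Chi_Squared = chisq_bartlett,
  DF = dfs,
  P_Value = p_bartlett
)
bartlett_tests
```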

Given the CCA results for the mtcars dataset, where we analyzed relationships between car performance and design specifications, we conducted significance tests for each canonical correlation. Here’s a summary of the findings:

First Canonical Correlation: Highly significant ($p < 0.0001$), indicating a very strong and statistically significant relationship between the first pair of canonical variates.
Second Canonical Correlation: Also significant ($p < 0.0001$), showing a strong relationship for the second pair, though less pronounced than the first.
Third Canonical Correlation: Not significant ($p = 0.090$), suggesting that the relationship captured by the third pair of canonical variates could be due to chance.

Interpretation

The significance tests highlight that the first two canonical correlations uncover significant relationships between our sets of variables, with the first demonstrating an especially strong connection. This is reflected in the squared canonical correlations, which give the proportion of variance shared within each pair of canonical variates: approximately 86.4% for the first pair and 62.6% for the second. These strong correlations are underscored by their very low p-values. Conversely, the third pair shares only 9.4% of its variance and does not significantly clarify the relationship between the variable sets, as its higher p-value suggests the correlation might merely be coincidental.
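
The percentages quoted above are the squared canonical correlations. A minimal sketch of the arithmetic, with `shared_variance_pct` as an illustrative name:

```r
# Rebuild the inputs so this block is self-contained
data(mtcars)
performance <- mtcars[, c("mpg", "hp", "wt")]
design <- mtcars[, c("disp", "drat", "qsec")]
cca_result <- cancor(performance, design)

# Squared canonical correlations, expressed as percentages
shared_variance_pct <- round(100 * cca_result$cor^2, 1)
shared_variance_pct
```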

Summary

In this tutorial, we covered Canonical Correlation Analysis (CCA), a statistical method for exploring the relationships between two sets of variables, using the mtcars dataset as an example. We demonstrated how CCA identifies linear combinations of these variables that are maximally correlated, thereby uncovering relationships not apparent through simple correlation analyses. The tutorial covered data preparation, running CCA in R with cancor(), plotting the results with ggplot2, and interpreting the canonical correlations and coefficients to understand the contributions of individual variables. We highlighted the importance of assessing the statistical significance of these correlations, employing Wilks’ Lambda and chi-squared tests. Additionally, we illustrated the relationships through visualizations and a path diagram.