Covariance and Correlation

Introduction to Covariance and Correlation

Covariance and correlation are two statistical measures used to describe the relationship between two variables. While both metrics measure the direction of the relationship, they differ in terms of their scale and interpretability.

Covariance indicates the direction of the linear relationship between two variables. Positive covariance means that two variables tend to move in the same direction, whereas negative covariance indicates that they move in opposite directions. However, the magnitude of covariance depends on the units of the two variables and is not standardized, so it is difficult to judge the strength of the relationship from covariance alone.

Correlation, specifically the Pearson correlation coefficient, measures both the strength and direction of the linear relationship between two variables. Its value ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
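The phrase "no linear relationship" is worth a small illustration: a correlation of 0 does not rule out a relationship of another shape. The sketch below uses made-up values (the vectors `x` and `y` are illustrative, not from any dataset) where `y` is fully determined by `x`, yet the Pearson coefficient is zero because the relationship is symmetric rather than linear.

```r
# Illustrative data: y depends on x perfectly, but not linearly
x <- -3:3
y <- x^2

# Pearson correlation only captures the linear component,
# which cancels out for this symmetric, U-shaped pattern
r_nonlinear <- cor(x, y)
print(r_nonlinear)
```

A scatter plot of such data would show a clear curve, which is one reason to look at the data as well as the coefficient.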

Mathematical Formulas

Covariance Formula

The covariance between two variables, \(X\) and \(Y\), with observations indexed by \(i\), can be calculated using the formula:

\[ \text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1} \]

where:

  • \(X_i\) and \(Y_i\) are the values of the \(i^{th}\) observation in the datasets \(X\) and \(Y\), respectively,
  • \(\bar{X}\) and \(\bar{Y}\) are the means of \(X\) and \(Y\),
  • \(n\) is the number of observations.

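This formula sums the products of the deviations of each pair of observations from their respective means and divides by \(n-1\). Because the divisor is \(n-1\) rather than \(n\), the result is the sample covariance, which is close to (but not exactly) the average product of deviations. A positive result indicates a tendency to move together, while a negative result indicates a tendency to move in opposite directions.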
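As a rough check of the formula, the calculation can be written out step by step in R and compared with the built-in cov() function. The vectors `x` and `y` below are small made-up values chosen only for illustration.

```r
# Illustrative data (hypothetical values, not from a real dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)
n <- length(x)

# Deviations of each observation from its variable's mean
dev_x <- x - mean(x)
dev_y <- y - mean(y)

# Sum of the products of paired deviations, divided by n - 1
cov_manual <- sum(dev_x * dev_y) / (n - 1)

# The manual result should agree with cov() up to floating-point error
print(cov_manual)
print(all.equal(cov_manual, cov(x, y)))
```

Here the products of deviations sum to 6, so the sample covariance is 6 / 4 = 1.5.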

Correlation Formula

The Pearson correlation coefficient between the same variables can be calculated as:

\[ r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \]

where:

  • \(r\) is the Pearson correlation coefficient,
  • \(\text{Cov}(X, Y)\) is the covariance between variables \(X\) and \(Y\),
  • \(\sigma_X\) and \(\sigma_Y\) are the standard deviations of \(X\) and \(Y\), respectively (for sample data, computed with the same \(n-1\) denominator as the covariance).

The Pearson correlation coefficient standardizes the covariance by the product of the standard deviations of the two variables, resulting in a value between -1 and 1.
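The standardization can be sketched directly in R by dividing the covariance by the product of the two standard deviations. The vectors `x` and `y` are made-up illustrative values; sd() and cov() both use the \(n-1\) denominator, so the two are consistent.

```r
# Illustrative data (hypothetical values, not from a real dataset)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Pearson r as covariance scaled by the product of standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# The manual result should agree with cor() up to floating-point error
print(r_manual)
print(all.equal(r_manual, cor(x, y)))
```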

Calculating Covariance and Correlation in R

Data preparation with the mtcars dataset

The mtcars dataset in R is a compilation of data extracted from the 1974 Motor Trend US magazine. It comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Key variables in the mtcars dataset include:

mpg: Miles per (US) gallon
cyl: Number of cylinders
disp: Displacement (cu.in.)
hp: Gross horsepower
drat: Rear axle ratio
wt: Weight (1,000 lbs)
qsec: 1/4 mile time
vs: Engine (0 = V-shaped, 1 = straight)
am: Transmission (0 = automatic, 1 = manual)
gear: Number of forward gears
carb: Number of carburetors

For the purpose of our tutorial, we will focus on two variables: mpg (miles per gallon) and wt (weight of the car). These variables are selected to explore the hypothesis that vehicle weight may inversely affect fuel efficiency.

# Load the mtcars dataset
data(mtcars)

# Overview of the dataset
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# We'll focus on the 'mpg' (miles per gallon) and 'wt' (weight) variables

Calculating Covariance

To calculate covariance, we use the cov() function:

# Calculating covariance between mpg and wt
covariance_mtcars <- cov(mtcars$mpg, mtcars$wt)
print(paste("Covariance between mpg and wt:", covariance_mtcars))
## [1] "Covariance between mpg and wt: -5.11668467741936"

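The negative covariance (about -5.12) indicates that as the weight of a car increases, its fuel efficiency tends to decrease, and vice versa. The magnitude itself is hard to interpret, because it is expressed in units of miles per gallon times 1,000 lbs.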

Calculating Correlation

To assess the strength in addition to the direction of the relationship, we calculate the Pearson correlation coefficient using the cor() function in R:

# Calculating correlation between mpg and wt
correlation_mtcars <- cor(mtcars$mpg, mtcars$wt)
print(paste("Correlation between mpg and wt:", correlation_mtcars))
## [1] "Correlation between mpg and wt: -0.867659376517228"

The correlation of about -0.87 indicates a strong inverse linear relationship between car weight and fuel efficiency.
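One way to see why correlation is easier to interpret than covariance is to change the units of one variable. The sketch below converts wt from thousands of pounds to kilograms (the variable name `wt_kg` and the conversion step are additions for illustration, not part of the dataset) and recomputes both statistics.

```r
# Weight in kilograms: wt is recorded in units of 1,000 lbs,
# and 1 lb = 0.45359237 kg
wt_kg <- mtcars$wt * 1000 * 0.45359237

# Covariance changes with the units of measurement
cov_klb <- cov(mtcars$mpg, mtcars$wt)
cov_kg  <- cov(mtcars$mpg, wt_kg)
print(c(cov_klb = cov_klb, cov_kg = cov_kg))

# Correlation is unit-free, so it stays the same
cor_klb <- cor(mtcars$mpg, mtcars$wt)
cor_kg  <- cor(mtcars$mpg, wt_kg)
print(c(cor_klb = cor_klb, cor_kg = cor_kg))
```

The covariance is multiplied by the same factor as the rescaled variable, while the correlation is unchanged.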

Visualizing the Relationship

A scatter plot is a common way to visualize the relationship between two continuous variables.

Creating a Scatter Plot

We can create a scatter plot in R using the plot() function:

# Scatter plot of mpg vs wt
plot(mtcars$wt, mtcars$mpg, main = "Scatter Plot of mpg vs wt", xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon", pch = 19, col = "blue")

Adding a Regression Line

To visualize the linear relationship more clearly, we can add a regression line.

# Scatter plot of mpg vs wt
plot(mtcars$wt, mtcars$mpg, main = "Scatter Plot of mpg vs wt", xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon", pch = 19, col = "blue")
# Adding a regression line to the scatter plot
abline(lm(mpg ~ wt, data = mtcars), col = "red")
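The fitted line is closely tied to the two statistics computed earlier. As a brief sketch (the object names `fit`, `slope_lm`, and `slope_cov` are chosen here for illustration), the slope of a simple linear regression equals the covariance divided by the variance of the predictor, and the model's R-squared equals the squared Pearson correlation.

```r
# Fit the same simple linear regression used for the plotted line
fit <- lm(mpg ~ wt, data = mtcars)

# Slope reported by lm() versus slope computed from covariance
slope_lm  <- unname(coef(fit)["wt"])
slope_cov <- cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)
print(all.equal(slope_lm, slope_cov))

# R-squared of the model versus squared Pearson correlation
r_squared <- summary(fit)$r.squared
print(all.equal(r_squared, cor(mtcars$mpg, mtcars$wt)^2))
```

Both comparisons hold only for regression with a single predictor and an intercept, which is the case here.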

Interactive Visualization

You should also check out this cool interactive visualization by Kristoffer Magnusson at https://rpsychologist.com/correlation/.

Summary

Covariance and correlation are fundamental statistics for understanding the relationship between two variables. While covariance indicates the direction of the relationship, correlation provides both direction and magnitude, making it easier to interpret.

In the example above, a negative covariance and correlation indicate that heavier cars generally have lower miles per gallon ratings, reflecting an inverse relationship between weight and fuel efficiency. The scatter plot with the regression line visually confirms this relationship, showing a clear trend where fuel efficiency decreases as the weight of the car increases.