Covariance and correlation are two statistical measures used to describe the relationship between two variables. While both metrics measure the direction of the relationship, they differ in terms of their scale and interpretability.
Covariance indicates the direction of the linear relationship between two variables. Positive covariance means that two variables tend to move in the same direction, whereas negative covariance indicates that they move in opposite directions. However, the magnitude of covariance is not standardized, making it difficult to interpret the strength of the relationship.
Correlation, specifically the Pearson correlation coefficient, measures both the strength and direction of the linear relationship between two variables. Its value ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship.
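As a quick illustration of these endpoints, the following sketch uses a small made-up vector (not taken from any dataset) and three hypothetical transformations of it. It is only meant to show how the coefficient behaves in the extreme cases.

```r
# Illustrative values only: x is a made-up vector, not real data
x <- -3:3

r_pos  <- cor(x,  2 * x + 1)  # points lie exactly on an increasing line
r_neg  <- cor(x, -2 * x + 1)  # points lie exactly on a decreasing line
r_quad <- cor(x, x^2)         # symmetric curve: y depends on x, but not linearly

print(c(positive = r_pos, negative = r_neg, quadratic = r_quad))
```

The last case shows why the definition says "no linear relationship" rather than "no relationship": here y is completely determined by x, yet the coefficient is 0.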
The covariance between two variables, \(X\) and \(Y\), with observations indexed by \(i\), can be calculated using the formula:
\[ \text{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1} \]
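where:
\(X_i\), \(Y_i\): the \(i\)-th paired observations of \(X\) and \(Y\)
\(\bar{X}\), \(\bar{Y}\): the sample means of \(X\) and \(Y\)
\(n\): the number of paired observations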
This formula sums the products of each pair's deviations from the respective means and divides by \(n-1\), which is roughly the average product of deviations (dividing by \(n-1\) rather than \(n\) gives the unbiased sample estimate). A positive result indicates a tendency to move together, while a negative result indicates a tendency to move in opposite directions.
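The formula can be checked step by step in R. The sketch below uses two short made-up vectors (illustrative values, not real data) and compares a manual calculation against the built-in cov() function. The variable names are chosen here for illustration.

```r
# Made-up paired observations, for illustration only
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

n <- length(x)

# Deviations of each observation from its sample mean
dev_x <- x - mean(x)
dev_y <- y - mean(y)

# Sum of the products of paired deviations, divided by n - 1
cov_manual <- sum(dev_x * dev_y) / (n - 1)

# Built-in sample covariance for comparison
cov_builtin <- cov(x, y)

print(c(manual = cov_manual, builtin = cov_builtin))
```

For these values the products of deviations sum to 16, so both results equal 16 / 4 = 4.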
The Pearson correlation coefficient between the same variables can be calculated as:
\[ r = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y} \]
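where:
\(\text{Cov}(X, Y)\): the covariance of \(X\) and \(Y\) defined above
\(\sigma_X\), \(\sigma_Y\): the standard deviations of \(X\) and \(Y\) (computed with the same \(n-1\) denominator as the covariance)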
Dividing the covariance by the product of the two standard deviations cancels the units of measurement, so the Pearson correlation coefficient is unit-free and always lies between -1 and 1.
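The same standardization can be sketched in R. The vectors below are made-up illustrative values, and the manual result is compared with the built-in cor() function.

```r
# Made-up paired observations, for illustration only
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Covariance divided by the product of the sample standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

Here the covariance is 4 and the product of the standard deviations is 5, so both approaches give r = 0.8.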
The mtcars dataset in R is a compilation of data extracted from the 1974 Motor Trend US magazine. It comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models). Key variables in the mtcars dataset include:
mpg: Miles per (US) gallon
cyl: Number of cylinders
disp: Displacement (cu.in.)
hp: Gross horsepower
drat: Rear axle ratio
wt: Weight (1,000 lbs)
qsec: 1/4 mile time
vs: Engine (0 = V-shaped, 1 = straight)
am: Transmission (0 = automatic, 1 = manual)
gear: Number of forward gears
carb: Number of carburetors
For the purpose of our tutorial, we will focus on two variables: mpg (miles per gallon) and wt (weight of the car). These variables are selected to explore the hypothesis that heavier vehicles tend to have lower fuel efficiency.
# Load the mtcars dataset
data(mtcars)
# Overview of the dataset
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
# We'll focus on the 'mpg' (miles per gallon) and 'wt' (weight) variables
To calculate covariance, we use the cov() function:
# Calculating covariance between mpg and wt
covariance_mtcars <- cov(mtcars$mpg, mtcars$wt)
print(paste("Covariance between mpg and wt:", covariance_mtcars))
## [1] "Covariance between mpg and wt: -5.11668467741936"
The negative covariance (about -5.12) indicates that as the weight of a car increases, its fuel efficiency tends to decrease, and vice versa. Its magnitude depends on the units of both variables, so on its own it says little about how strong the relationship is.
To assess the strength in addition to the direction of the relationship, we calculate the Pearson correlation coefficient using the cor() function in R:
# Calculating correlation between mpg and wt
correlation_mtcars <- cor(mtcars$mpg, mtcars$wt)
print(paste("Correlation between mpg and wt:", correlation_mtcars))
## [1] "Correlation between mpg and wt: -0.867659376517228"
A correlation of about -0.87 indicates a strong inverse linear relationship between car weight and fuel efficiency.
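One way to see why the correlation is easier to interpret than the covariance is to change the units of one variable. The sketch below re-expresses weight in pounds instead of thousands of pounds. The variable wt_lbs is a hypothetical helper introduced here and is not part of mtcars.

```r
data(mtcars)

# Hypothetical rescaling: weight in pounds instead of 1,000 lbs
wt_lbs <- mtcars$wt * 1000

# Covariance scales with the units of the inputs
cov_klbs <- cov(mtcars$mpg, mtcars$wt)
cov_lbs  <- cov(mtcars$mpg, wt_lbs)

# Correlation is unaffected by the change of units
cor_klbs <- cor(mtcars$mpg, mtcars$wt)
cor_lbs  <- cor(mtcars$mpg, wt_lbs)

print(c(cov_klbs = cov_klbs, cov_lbs = cov_lbs))
print(c(cor_klbs = cor_klbs, cor_lbs = cor_lbs))
```

The covariance becomes 1,000 times larger in magnitude even though the underlying relationship is unchanged, while the correlation stays the same.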
A scatter plot is a common way to visualize the relationship between two continuous variables.
We can create a scatter plot in R using the plot() function:
# Scatter plot of mpg vs wt
plot(mtcars$wt, mtcars$mpg, main = "Scatter Plot of mpg vs wt", xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon", pch = 19, col = "blue")
To visualize the linear relationship more clearly, we can add a regression line.
# Redraw the scatter plot of mpg vs wt (abline() adds to an existing plot)
plot(mtcars$wt, mtcars$mpg, main = "Scatter Plot of mpg vs wt", xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon", pch = 19, col = "blue")
# Adding a regression line to the scatter plot
abline(lm(mpg ~ wt, data = mtcars), col = "red")
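The regression line is closely tied to the two statistics discussed here. For a simple linear regression of mpg on wt, the least-squares slope equals the covariance divided by the variance of the predictor, or equivalently the correlation multiplied by the ratio of the standard deviations. The following sketch checks this on mtcars. The variable names are chosen for illustration.

```r
data(mtcars)

# Slope of the fitted regression line
fit <- lm(mpg ~ wt, data = mtcars)
slope_lm <- unname(coef(fit)["wt"])

# Same slope from the covariance and the variance of the predictor
slope_from_cov <- cov(mtcars$wt, mtcars$mpg) / var(mtcars$wt)

# Same slope from the correlation and the two standard deviations
slope_from_cor <- cor(mtcars$wt, mtcars$mpg) * sd(mtcars$mpg) / sd(mtcars$wt)

print(c(lm = slope_lm, from_cov = slope_from_cov, from_cor = slope_from_cor))
```

All three values agree, and the negative sign of the slope matches the negative covariance and correlation.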
You should also check out this cool interactive visualization by Kristoffer Magnusson at https://rpsychologist.com/correlation/.
Covariance and correlation are fundamental statistics for understanding the relationship between two variables. While covariance indicates only the direction of the linear relationship, correlation provides both direction and strength on a fixed scale from -1 to 1, making it easier to interpret.
In the example above, a negative covariance and correlation indicate that heavier cars generally have lower miles per gallon ratings, reflecting an inverse relationship between weight and fuel efficiency. The scatter plot with the regression line visually confirms this relationship, showing a clear trend where fuel efficiency decreases as the weight of the car increases.