Variance, Covariance, and Correlation in Python

The differences between variance, covariance, and correlation are:

  • Variance is a measure of variability from the mean
  • Covariance is a measure of the relationship between the variability of 2 variables – covariance is scale dependent because it is not standardized
  • Correlation is a measure of the relationship between the variability of 2 variables – correlation is standardized, making it not scale dependent

Each of these is discussed in more depth below. First, import the required packages and create some fake data.

import pandas as pd
import numpy as np


# Setting a seed so the example is reproducible
np.random.seed(4272018)

df = pd.DataFrame(np.random.randint(low=0, high=20, size=(5, 2)),
                  columns=['Commercials Watched', 'Product Purchases'])

df

   Commercials Watched  Product Purchases
0                   10                 13
1                   15                  0
2                    7                  7
3                    2                  4
4                   16                 11

df.agg(["mean", "std"])

      Commercials Watched  Product Purchases
mean            10.000000           7.000000
std              5.787918           5.244044

WHAT IS VARIANCE?

Variance is a measure of how much the data for a variable varies from its mean. This can be represented with the following equation:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{N - 1}$$

Where,

  • $x_i$ is the ith observation,
  • $\bar{x}$ is the mean, and
  • $N$ is the number of observations

Calculating this manually for commercials watched would produce the following results:

Variable: Commercials Watched
$\bar{x} = (10 + 15 + 7 + 2 + 16) / 5 = 10.00$
$s^2 = ((10 - 10)^2 + (15 - 10)^2 + (7 - 10)^2 + (2 - 10)^2 + (16 - 10)^2) / (5 - 1)$
$s^2 = 33.5$

This can be calculated easily within Python, particularly when using Pandas, although Pandas is not the only available package that will calculate the variance. Using Pandas, one simply needs to enter the following:

df.var()

Commercials Watched    33.5
Product Purchases      27.5
dtype: float64
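As a rough cross-check of the manual calculation, the sketch below applies the variance formula directly with NumPy and compares it with `np.var` and Pandas. The data frame is re-created by hand from the values shown in the example table, and the variable names (`manual_var`, `numpy_var`, `pandas_var`) are only illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the example data frame, typed in by hand
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

x = df["Commercials Watched"].to_numpy()

# Sample variance: sum of squared deviations from the mean, divided by N - 1
manual_var = ((x - x.mean()) ** 2).sum() / (len(x) - 1)

# np.var divides by N by default; ddof=1 switches the divisor to N - 1
numpy_var = np.var(x, ddof=1)

# Pandas uses N - 1 by default
pandas_var = df["Commercials Watched"].var()

print(manual_var, numpy_var, pandas_var)
```

All three values should agree with the 33.5 calculated by hand. Note that `np.var(x)` without `ddof=1` divides by N instead of N - 1 and would give a smaller number.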


WHAT IS COVARIANCE?

Covariance is a scale-dependent measure of the relationship between 2 variables, i.e. how much one variable changes when another variable changes. This can be represented with the following equation:

$$\operatorname{cov}(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

Where,

  • $x_i$ is the ith observation in variable x,
  • $\bar{x}$ is the mean for variable x,
  • $y_i$ is the ith observation in variable y,
  • $\bar{y}$ is the mean for variable y, and
  • $N$ is the number of observations

The formula is very similar to the formula used to calculate variance. The difference is that instead of squaring the difference between each data point and the mean for that variable, one multiplies that difference by the corresponding difference for the other variable.

The covariance between commercials watched and product purchases can be calculated manually and would produce the following results:

Variables: Commercials Watched and Product Purchases
$\operatorname{cov}(x, y) = ((10 - 10)(13 - 7) + (15 - 10)(0 - 7) + (7 - 10)(7 - 7) + (2 - 10)(4 - 7) + (16 - 10)(11 - 7)) / (5 - 1) = 3.25$

Again, this can be calculated easily within Python, particularly when using Pandas, although Pandas is not the only available package that will calculate the covariance. Using Pandas, one simply needs to enter the following:

df.cov()
                     Commercials Watched  Product Purchases
Commercials Watched                33.50               3.25
Product Purchases                   3.25              27.50
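The manual covariance calculation can be sketched in a few lines of NumPy and compared with `np.cov` and `df.cov()`. As before, the data frame is re-created by hand from the example table, and the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Same values as the example data frame, typed in by hand
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

x = df["Commercials Watched"].to_numpy()
y = df["Product Purchases"].to_numpy()

# Sum of the products of paired deviations, divided by N - 1
manual_cov = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y)
numpy_cov = np.cov(x, y)[0, 1]

# The same entry taken from the Pandas covariance matrix
pandas_cov = df.cov().loc["Commercials Watched", "Product Purchases"]

print(manual_cov, numpy_cov, pandas_cov)
```

The diagonal of the covariance matrix holds each variable's variance, which is why 33.50 and 27.50 appear there.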

Covariance is hard to interpret because its values are scale dependent and have no upper or lower bound. This is where correlation comes in.
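To make the scale dependence concrete, the sketch below rescales one column and compares the covariance and the correlation before and after. The factor of 30 (for example, treating each commercial as 30 seconds of viewing time) is a made-up assumption used purely for illustration.

```python
import pandas as pd

# Same values as the example data frame, typed in by hand
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

# Hypothetical change of units: multiply one column by a constant
scaled = df.copy()
scaled["Commercials Watched"] = scaled["Commercials Watched"] * 30

cov_original = df.cov().iloc[0, 1]
cov_scaled = scaled.cov().iloc[0, 1]

corr_original = df.corr().iloc[0, 1]
corr_scaled = scaled.corr().iloc[0, 1]

# The covariance grows by the same factor of 30; the correlation is unchanged
print(cov_original, cov_scaled)
print(corr_original, corr_scaled)
```

The relationship between the two variables is the same in both data frames, yet the covariance differs by a factor of 30, so its magnitude alone says little about the strength of the relationship.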


WHAT IS CORRELATION?

Correlation overcomes the scale dependency that is present in covariance by standardizing the values. This standardization converts the values to the same scale; the example below will use the Pearson Correlation Coefficient. The equation for converting data to Z-scores is:

$$\text{Z-score} = \frac{x_i - \bar{x}}{s_x}$$

Where,

  • $x_i$ is the ith value for the variable,
  • $\bar{x}$ is the mean for the variable, and
  • $s_x$ is the standard deviation for the variable
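As a small illustration of the Z-score formula, the sketch below standardizes both columns of the example data and then takes the covariance of the standardized values. The data frame is re-created by hand, and the names `z` and `z_cov` are only illustrative.

```python
import pandas as pd

# Same values as the example data frame, typed in by hand
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

# Z-score: subtract the column mean, divide by the sample standard deviation
# (df.std() uses N - 1 by default, matching the variance formula)
z = (df - df.mean()) / df.std()

# After standardizing, each column has mean 0 and standard deviation 1
print(z.mean())
print(z.std())

# The covariance of the Z-scores equals the Pearson correlation of the raw data
z_cov = z.cov().iloc[0, 1]
print(z_cov, df.corr().iloc[0, 1])
```

Because both columns end up on the same unitless scale, the covariance of the standardized values no longer depends on the original units.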

There is no need to convert the values before using the Pearson Correlation equation since the standardization is a part of the formula:

$$r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{(N - 1)(s_x)(s_y)}$$

Where,

  • $x_i$ is the ith observation in variable x,
  • $\bar{x}$ is the mean for variable x,
  • $y_i$ is the ith observation in variable y,
  • $\bar{y}$ is the mean for variable y,
  • $N$ is the number of observations,
  • $s_x$ is the standard deviation for variable x, and
  • $s_y$ is the standard deviation for variable y

Working through the equation manually would produce the following result:

Variables: Commercials Watched and Product Purchases
$r = ((10 - 10)(13 - 7) + (15 - 10)(0 - 7) + (7 - 10)(7 - 7) + (2 - 10)(4 - 7) + (16 - 10)(11 - 7)) / ((5 - 1)(5.787918)(5.244044)) = 0.11$

Again, this can be calculated easily within Python, particularly when using Pandas, although Pandas is not the only available package that will calculate the correlation. Using Pandas, one simply needs to enter the following:

df.corr()
                     Commercials Watched  Product Purchases
Commercials Watched             1.000000           0.107077
Product Purchases               0.107077           1.000000
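The Pearson formula can also be applied directly as a check on the Pandas result. The sketch below is a minimal version using NumPy, with the data re-created by hand from the example table; `manual_r` and `numpy_r` are illustrative names.

```python
import numpy as np
import pandas as pd

# Same values as the example data frame, typed in by hand
df = pd.DataFrame({
    "Commercials Watched": [10, 15, 7, 2, 16],
    "Product Purchases": [13, 0, 7, 4, 11],
})

x = df["Commercials Watched"].to_numpy()
y = df["Product Purchases"].to_numpy()

# Numerator: sum of the products of paired deviations
numerator = ((x - x.mean()) * (y - y.mean())).sum()

# Denominator: (N - 1) times both sample standard deviations (ddof=1)
denominator = (len(x) - 1) * x.std(ddof=1) * y.std(ddof=1)

manual_r = numerator / denominator

# np.corrcoef returns the 2x2 correlation matrix
numpy_r = np.corrcoef(x, y)[0, 1]

print(round(manual_r, 6), round(numpy_r, 6))
```

Equivalently, r is the covariance (3.25) divided by the product of the two standard deviations, which is why the same numerator appears in both the covariance and correlation calculations.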

The Pearson Correlation Coefficient will always range from -1 to 1. The closer the correlation coefficient is to -1 or 1, the stronger the relationship; the closer the correlation coefficient is to 0, the weaker the relationship.

If the correlation coefficient is positive, this indicates that as one variable increases, so does the other. However, if the correlation coefficient is negative, it indicates that as one variable increases, the other decreases. An easy way to see this relationship is to plot it using a scatter plot. Currently there is no agreed-upon threshold for how to interpret the coefficients. Akoglu (2018) provides the following table with the three most commonly used suggestions for how to interpret the correlation coefficients; the suggestions vary somewhat by field.

Correlation Coefficient   Dancey & Reidy (Psychology)   Quinnipiac University (Politics)   Chan YH (Medicine)
+1      −1                Perfect                       Perfect                            Perfect
+0.9    −0.9              Strong                        Very Strong                        Very Strong
+0.8    −0.8              Strong                        Very Strong                        Very Strong
+0.7    −0.7              Strong                        Very Strong                        Moderate
+0.6    −0.6              Moderate                      Strong                             Moderate
+0.5    −0.5              Moderate                      Strong                             Fair
+0.4    −0.4              Moderate                      Strong                             Fair
+0.3    −0.3              Weak                          Moderate                           Fair
+0.2    −0.2              Weak                          Weak                               Poor
+0.1    −0.1              Weak                          Negligible                         Poor
0       0                 Zero                          None                               None

There are other measures of correlation, such as Spearman’s rank correlation, Kendall’s tau, biserial, and point-biserial correlations. Each correlation measure makes different assumptions about the data and tests a different null hypothesis. An in-depth look at these measures is out of scope for this page.
