Stats with Python: Rank Correlation
February 06, 2021 | 9 min read
The correlation coefficient is a familiar statistic that appears everywhere from news articles to scientific papers, but it comes in several variations whose differences are worth noting. This post recaps the definitions of the common correlation coefficients, derives the simplified formula for the Spearman rank correlation coefficient, and confirms it with a quick experiment.
For measuring the linear correlation between two sets of data, it is common to use the Pearson product-moment correlation coefficient, the most well-known measure of correlation. When the term “correlation coefficient” is used without further qualification, it usually refers to this definition. Given paired data $\{(x_i, y_i)\}_{i=1}^n$, Pearson’s $r$ is defined as:

$$
r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n (y_i - \bar{y})^2}}
$$

where $\bar{x}$ and $\bar{y}$ are the sample means. Dividing both the numerator and the denominator by $n$ shows that $r$ is the covariance between $x$ and $y$ divided by the product of their standard deviations.

The correlation coefficient ranges from $-1$ to $1$. $|r| = 1$ is observed if and only if all the data points lie on a straight line with nonzero slope (perfect correlation).
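As a minimal sketch of the definition above, the following computes Pearson’s r directly from the sums of deviations and compares it with NumPy’s built-in `np.corrcoef`. The helper name `pearson_r` and the toy data are assumptions made for illustration; they are not part of the experiment later in this post.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson's r computed directly from the definition (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the sample mean of x
    dy = y - y.mean()  # deviations from the sample mean of y
    return np.sum(dx * dy) / (np.sqrt(np.sum(dx**2)) * np.sqrt(np.sum(dy**2)))


# Toy data (illustrative only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)  # the two values should agree up to floating-point error

# Points lying exactly on a line with negative slope give r = -1 (up to rounding)
r_line = pearson_r(x, -3.0 * x + 2.0)
print(r_line)
```

The second call illustrates the perfect-correlation case: the sign of r follows the sign of the slope.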
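When the data are ordinal, you should consider rank correlation. One common measure of rank correlation is the Spearman rank correlation coefficient, which is simply the Pearson correlation coefficient between the two rank variables. For the paired ranks $\{(a_i, b_i)\}_{i=1}^n$ of the raw scores $\{(x_i, y_i)\}_{i=1}^n$, Spearman’s $\rho$ is defined as:

$$
\rho = \frac{\sum_{i=1}^n (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^n (a_i - \bar{a})^2}\sqrt{\sum_{i=1}^n (b_i - \bar{b})^2}}
$$

where $\bar{a}$ and $\bar{b}$ are the sample means of the ranks. When there are no ties, this definition can be simplified to:

$$
\rho = 1 - \frac{6\sum_{i=1}^n (a_i - b_i)^2}{n(n^2 - 1)}
$$

It takes the value 1 if the orders of the raw scores match exactly, and the value -1 if the order is completely reversed.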
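To make the two forms of Spearman’s ρ concrete, here is a small sketch that computes it both as Pearson’s r of the ranks and via the shortcut 1 - 6 Σ(a_i - b_i)² / (n(n² - 1)). The helper names (`ranks`, `spearman_from_pearson`, `spearman_simplified`) and the toy data are hypothetical, and the rank helper assumes there are no ties.

```python
import numpy as np


def ranks(values):
    """Ascending ranks 1..n; assumes no ties (hypothetical helper)."""
    values = np.asarray(values)
    order = np.argsort(values)  # indices that would sort the values
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)  # smallest value gets rank 1
    return r


def spearman_from_pearson(x, y):
    """Spearman's rho as Pearson's r between the two rank vectors."""
    return np.corrcoef(ranks(x), ranks(y))[0, 1]


def spearman_simplified(x, y):
    """Spearman's rho via the shortcut formula; valid only without ties."""
    a, b = ranks(x), ranks(y)
    n = len(a)
    return 1.0 - 6.0 * np.sum((a - b) ** 2) / (n * (n**2 - 1))


# Toy data without ties (illustrative only)
x = np.array([0.3, 1.7, 0.9, 2.4, 1.1])
y = np.array([10.0, 30.0, 25.0, 40.0, 15.0])

rho_pearson = spearman_from_pearson(x, y)
rho_simple = spearman_simplified(x, y)
print(rho_pearson, rho_simple)  # both forms should agree up to rounding

# Any strictly increasing transform leaves the ranks, and hence rho, unchanged
print(spearman_simplified(x, x**3))
# Reversing the order gives the opposite extreme
print(spearman_simplified(x, -x))
```

Because only the ranks enter the computation, ρ measures monotonic rather than linear association.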
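Since the ranks $a_i$ and $b_i$ are each a permutation of $1, \dots, n$ (assuming no ties), the following equations hold:

$$
\begin{aligned}
\sum_{i=1}^n a_i &= \sum_{i=1}^n b_i = \frac{n(n+1)}{2} \\
\sum_{i=1}^n a_i^2 &= \sum_{i=1}^n b_i^2 = \frac{n(n+1)(2n+1)}{6}
\end{aligned}
$$

Using the above equations, we have $\bar{a} = \bar{b} = \frac{n+1}{2}$ and:

$$
\begin{aligned}
\sum_{i=1}^n (a_i - \bar{a})^2 &= \sum_{i=1}^n a_i^2 - n\bar{a}^2 = \frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)^2}{4} = \frac{n(n^2-1)}{12} \\
\sum_{i=1}^n (a_i - \bar{a})(b_i - \bar{b}) &= \sum_{i=1}^n a_i b_i - \frac{n(n+1)^2}{4}
\end{aligned}
$$

The same result holds for $\sum_{i=1}^n (b_i - \bar{b})^2$. Here, let’s consider the sum of the squared differences between $a_i$ and $b_i$:

$$
\sum_{i=1}^n (a_i - b_i)^2 = \sum_{i=1}^n a_i^2 + \sum_{i=1}^n b_i^2 - 2\sum_{i=1}^n a_i b_i = \frac{n(n+1)(2n+1)}{3} - 2\sum_{i=1}^n a_i b_i
$$

Solving for $\sum_{i=1}^n a_i b_i$ and substituting into the definition of $\rho$ gives:

$$
\rho = \frac{\frac{n(n^2-1)}{12} - \frac{1}{2}\sum_{i=1}^n (a_i - b_i)^2}{\frac{n(n^2-1)}{12}} = 1 - \frac{6\sum_{i=1}^n (a_i - b_i)^2}{n(n^2-1)}
$$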
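The rank identities behind this derivation can be checked numerically. The sketch below assumes ranks without ties, so each rank vector is a permutation of 1..n; the choice n = 10, the fixed seed, and the variable names are arbitrary illustrations. It checks that the ranks sum to n(n+1)/2, that their squares sum to n(n+1)(2n+1)/6, that the centered sum of squares is n(n²-1)/12, and that the definition and the shortcut 1 - 6 Σ(a_i - b_i)² / (n(n² - 1)) give the same ρ.

```python
import numpy as np

n = 10
a = np.arange(1, n + 1, dtype=float)  # ranks 1..n
rng = np.random.default_rng(0)        # fixed seed, chosen arbitrarily
b = rng.permutation(a)                # another tie-free rank vector

# Identities that hold for any permutation of 1..n
sum_ranks = a.sum()                            # n(n+1)/2
sum_sq_ranks = np.sum(a**2)                    # n(n+1)(2n+1)/6
centered_sq = np.sum((a - a.mean()) ** 2)      # n(n^2-1)/12

# rho from the definition: both square-root factors equal sqrt(centered_sq),
# so the denominator reduces to centered_sq itself
cross_term = np.sum((a - a.mean()) * (b - b.mean()))
rho_definition = cross_term / centered_sq

# rho from the shortcut formula
rho_simplified = 1 - 6 * np.sum((a - b) ** 2) / (n * (n**2 - 1))

print(sum_ranks, n * (n + 1) / 2)
print(sum_sq_ranks, n * (n + 1) * (2 * n + 1) / 6)
print(centered_sq, n * (n**2 - 1) / 12)
print(rho_definition, rho_simplified)
```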
Here, I conducted a quick experiment to confirm that the Pearson’s r of the ranks is equivalent to Spearman’s ρ. I generated 100 pairs of random samples (`x`, `y`) and calculated several types of correlation coefficients.
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set_style("darkgrid")

    # y is x plus uniform noise, so the two are positively correlated
    n = 100
    x = np.random.rand(n)
    y = x + 0.5 * np.random.rand(n)

    from scipy import stats

    print(stats.pearsonr(x, y))
    # >> (0.8863388430290433, 1.5374433582768292e-34)
    print(stats.spearmanr(x, y))
    # >> SpearmanrResult(correlation=0.8835643564356436, pvalue=4.677602781530847e-34)
    print(stats.kendalltau(x, y))
    # >> KendalltauResult(correlation=0.6981818181818182, pvalue=7.62741751146521e-25)
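By converting `x` and `y` to ranks, it is confirmed that `stats.spearmanr(x, y)` is equal to `stats.pearsonr(a, b)`.

    # Convert the raw scores to ranks (rank 1 = largest value)
    a = len(x) - stats.rankdata(x) + 1
    b = len(y) - stats.rankdata(y) + 1
    print(stats.pearsonr(a, b))
    # >> (0.8835643564356437, 4.677602781530673e-34)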
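The experiment above uses continuous random data, so ties are practically absent. As a hedged side sketch on small made-up data that does contain ties, the following illustrates two points, assuming ranks come from `stats.rankdata` with its default average-rank method: ascending and descending ranks give the same coefficient, and the shortcut 1 - 6 Σ(a_i - b_i)² / (n(n² - 1)) no longer matches Pearson’s r of the ranks once ties appear.

```python
import numpy as np
from scipy import stats

# Made-up data with ties in both variables (illustrative only)
x = np.array([1.0, 2.0, 2.0, 3.0, 5.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 4.0, 6.0, 7.0])

a = stats.rankdata(x)  # ascending ranks; tied values share the average rank
b = stats.rankdata(y)

rho_scipy = stats.spearmanr(x, y)[0]  # first element is the coefficient
rho_ranks = stats.pearsonr(a, b)[0]   # Pearson's r of the ascending ranks

# Descending ranks (rank 1 = largest value) flip both variables, so r is unchanged
a_desc = len(x) - a + 1
b_desc = len(y) - b + 1
rho_desc = stats.pearsonr(a_desc, b_desc)[0]

# The shortcut formula assumes tie-free ranks and deviates here
n = len(x)
rho_shortcut = 1 - 6 * np.sum((a - b) ** 2) / (n * (n**2 - 1))

print(rho_scipy, rho_ranks, rho_desc)  # these three should agree
print(rho_shortcut)                    # this one differs slightly because of ties
```

When ties are possible, computing Pearson’s r on the ranks (or calling `stats.spearmanr`) is the safer route.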
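The Kendall rank correlation coefficient is another common type of rank correlation coefficient. Among the $\binom{n}{2}$ pairs of indices $(i, j)$ with $i < j$, it considers the number of concordant pairs $P$, the number of discordant pairs $Q$, and the numbers of ties $T_x$ and $T_y$:

$$
\begin{aligned}
P &= |\{(i, j) : (x_i - x_j)(y_i - y_j) > 0\}| \\
Q &= |\{(i, j) : (x_i - x_j)(y_i - y_j) < 0\}| \\
T_x &= |\{(i, j) : x_i = x_j,\ y_i \neq y_j\}| \\
T_y &= |\{(i, j) : x_i \neq x_j,\ y_i = y_j\}|
\end{aligned}
$$

Given these quantities, Kendall’s $\tau_a$ is defined as:

$$
\tau_a = \frac{P - Q}{\binom{n}{2}} = \frac{2(P - Q)}{n(n-1)}
$$

Like Spearman’s $\rho$, Kendall’s $\tau_a$ takes the value 1 if the orders of the raw scores match exactly, and the value -1 if the order is completely reversed.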
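The pair counting behind Kendall’s τ (a) can be sketched with a brute-force loop over all pairs. The labels `P` (concordant), `Q` (discordant), `T_x` (tied only in x), and `T_y` (tied only in y), the helper names, and the toy data are choices made for this illustration. The loop is O(n²), which is fine for small samples but is not how an optimized library would do it.

```python
from itertools import combinations


def count_pairs(x, y):
    """Count concordant (P), discordant (Q), and tied (T_x, T_y) pairs."""
    P = Q = T_x = T_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            P += 1    # both variables move in the same direction
        elif s < 0:
            Q += 1    # the variables move in opposite directions
        elif xi == xj and yi != yj:
            T_x += 1  # tied only in x
        elif xi != xj and yi == yj:
            T_y += 1  # tied only in y
        # pairs tied in both x and y are counted in none of the four
    return P, Q, T_x, T_y


def kendall_tau_a(x, y):
    """Kendall's tau (a): (P - Q) divided by the total number of pairs."""
    n = len(x)
    P, Q, _, _ = count_pairs(x, y)
    return (P - Q) / (n * (n - 1) / 2)


# Toy data without ties (illustrative only): two adjacent swaps
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

print(count_pairs(x, y))          # 8 concordant and 2 discordant pairs out of 10
print(kendall_tau_a(x, y))        # (8 - 2) / 10
print(kendall_tau_a(x, x))        # identical order
print(kendall_tau_a(x, x[::-1]))  # completely reversed order
```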
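Kendall’s $\tau_b$ adjusts the denominator to account for ties. With $P$ concordant pairs, $Q$ discordant pairs, $T_x$ pairs tied only in $x$, and $T_y$ pairs tied only in $y$, it is defined as:

$$
\tau_b = \frac{P - Q}{\sqrt{(P + Q + T_x)(P + Q + T_y)}}
$$

Similarly, Goodman and Kruskal’s $\gamma$ is defined as:

$$
\gamma = \frac{P - Q}{P + Q}
$$

When there are no ties at all (i.e. $T_x = T_y = 0$ and $P + Q = \binom{n}{2}$), Kendall’s $\tau_a$ and $\tau_b$ are equal to Goodman and Kruskal’s $\gamma$:

$$
\tau_a = \tau_b = \gamma = \frac{P - Q}{P + Q}
$$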
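To see how the three coefficients relate, the sketch below counts pairs by brute force and computes τ (a), τ (b), and γ on small made-up data, once with ties and once without. The labels `P`, `Q`, `T_x`, `T_y` (concordant, discordant, tied only in x, tied only in y) and all helper names are choices made for this illustration. `stats.kendalltau` computes the tau-b variant by default, so it serves as a cross-check for the hand-rolled τ (b).

```python
from itertools import combinations
from math import sqrt

from scipy import stats


def pair_counts(x, y):
    """Return (P, Q, T_x, T_y) by brute force over all pairs (O(n^2))."""
    P = Q = T_x = T_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        s = (xi - xj) * (yi - yj)
        if s > 0:
            P += 1
        elif s < 0:
            Q += 1
        elif xi == xj and yi != yj:
            T_x += 1
        elif xi != xj and yi == yj:
            T_y += 1
    return P, Q, T_x, T_y


def kendall_tau_a(x, y):
    P, Q, _, _ = pair_counts(x, y)
    n = len(x)
    return (P - Q) / (n * (n - 1) / 2)


def kendall_tau_b(x, y):
    P, Q, T_x, T_y = pair_counts(x, y)
    return (P - Q) / sqrt((P + Q + T_x) * (P + Q + T_y))


def goodman_kruskal_gamma(x, y):
    P, Q, _, _ = pair_counts(x, y)
    return (P - Q) / (P + Q)


# Made-up data with ties (illustrative only)
x_ties = [1, 2, 2, 3, 4, 4]
y_ties = [1, 3, 2, 2, 5, 4]
tau_b_manual = kendall_tau_b(x_ties, y_ties)
tau_b_scipy = stats.kendalltau(x_ties, y_ties)[0]  # default variant is tau-b
gamma_ties = goodman_kruskal_gamma(x_ties, y_ties)
print(pair_counts(x_ties, y_ties))
print(kendall_tau_a(x_ties, y_ties), tau_b_manual, tau_b_scipy, gamma_ties)

# Made-up data without ties: all three coefficients coincide
x_clean = [1, 2, 3, 4, 5]
y_clean = [1, 3, 2, 5, 4]
no_tie_values = (
    kendall_tau_a(x_clean, y_clean),
    kendall_tau_b(x_clean, y_clean),
    goodman_kruskal_gamma(x_clean, y_clean),
)
print(no_tie_values)
```

With ties, the three values differ because each divides P - Q by a different count of pairs; without ties, every denominator equals the total number of pairs.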
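 Department of Statistics, College of Arts and Sciences, The University of Tokyo (ed.). “統計学入門” (Introduction to Statistics), Chapter 3. University of Tokyo Press. 1991.
 統計WEB (Statistics WEB): statistics to look up and learn. BellCurve.
* This “statistics dictionary” covers a range of concepts with LaTeX codes.
 Correlation coefficient - Wikipedia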
Written by Shion Honda.