Hippocampus's Garden

Under the sea, in the hippocampus's garden...

Stats with Python: Rank Correlation | Hippocampus's Garden

Stats with Python: Rank Correlation

February 06, 2021  |  8 min read  |  903 views

  • このエントリーをはてなブックマークに追加

The correlation coefficient is a familiar statistic that we see everywhere from news articles to scientific papers, but there are several variations whose differences should be noted. This post aims to recap the definitions of those common correlation coefficients, with the derivation of the equation and experiment regarding Spearman rank correlation coefficient.

Pearson Correlation Coefficient

For measuring the linear correlation between two sets of data, it is common to use Pearson product-moment correlation coefficient. Pearson correlation coefficient is the most well-known measure for correlation. When the term “correlation coefficient” is used without further information, it usually refers to this type of definition. Given paired data {(xi,yi)}i=1n\{(x_i,y_i)\}_{i=1}^n, Pearson’s r is defined as:

r=i=1n(xix)(yiy)i=1n(xix)2i=1n(yiy)2,r = \frac{{\displaystyle \sum_{i = 1}^n (x_i - \overline{x}) (y_i - \overline{y})}}{\sqrt{{\displaystyle \sum_{i = 1}^n (x_i - \overline{x})^2}} \sqrt{{\displaystyle \sum_{i = 1}^n (y_i - \overline{y})^2}}} ,

where xˉ\bar{x} and yˉ\bar{y} are the sample means. The numerator is the covariance between xix_i and yiy_i and the denominator is the product of their standard deviations.

The correlation coefficient ranges from 1−1 to 11. r=±1r=\plusmn1 is observed if and only if all the data points lie on a line (perfect correlation).

Spearman Rank Correlation Coefficient

When the data is ordinal variable, you should consider rank correlation. One of the common measures for rank correlation is Spearman rank correlation coefficient, which is simply the Pearson correlation coefficient between the two rank variables. For the nn paired ranks {(ai,bi)}i=1n\{(a_i,b_i)\}_{i=1}^n for the raw scores {(xi,yi)}i=1n\{(x_i,y_i)\}_{i=1}^n, the Spearman’s ρ is defined as:

ρ=i=1n(aia)(bib)i=1n(aia)2i=1n(bib)2,\rho = \frac{{\displaystyle \sum_{i = 1}^n (a_i - \overline{a}) (b_i - \overline{b})}}{\sqrt{{\displaystyle \sum_{i = 1}^n (a_i - \overline{a})^2}} \sqrt{{\displaystyle \sum_{i = 1}^n (b_i - \overline{b})^2}}} ,

where aˉ\bar{a} and bˉ\bar{b} are the sample means. This definition can be simplfied to:

ρ=16n(n21)i=1n(aibi)2.\rho = 1 - \frac{6}{n(n^2 - 1)}\sum_{i = 1}^n (a_i-b_i)^2.

It takes the value 1 if the order of the raw scores all match, and the value -1 if the order is completely reversed.

Proof for the Simplified Form

Since aia_i and bib_i are ranks of nn scores, following equations hold.

i=1nai=i=1nbi=i=1ni=n(n+1)2i=1nai2=i=1nbi2=i=1ni2=n(n+1)(2n+1)6a=b=1ni=1ni=n+12\begin{gathered} \sum_{i = 1}^n a_i = \sum_{i = 1}^n b_i=\sum_{i = 1}^n i = \frac{n(n+1)}{2}\\ \sum_{i = 1}^n a_i^2 = \sum_{i = 1}^n b_i^2 = \sum_{i = 1}^n i^2= \frac{n(n+1)(2n+1)}{6}\\ \overline{a} = \overline{b} = \frac{1}{n} \sum_{i = 1}^n i= \frac{n+1}{2} \end{gathered}

Using the above equations, we have:

ρ=i=1n(aia)(bib)i=1n(aia)2i=1n(bib)2=i=1n(ain+12)(bin+12)i=1n(ain+12)2i=1n(bin+12)2=i=1n(aibin+12(ai+bi)+(n+1)24)i=1n(ai2(n+1)ai+(n+1)24)i=1n(bi2(n+1)bi+(n+1)24)=i=1naibin(n+1)22+n(n+1)24n(n+1)(2n+1)6n(n+1)22+n(n+1)24=12i=1naibi3n(n+1)2(n1)n(n+1).\begin{aligned} \rho &= \frac{{\displaystyle \sum_{i = 1}^n (a_i - \overline{a}) (b_i - \overline{b})}}{\sqrt{{\displaystyle \sum_{i = 1}^n (a_i - \overline{a})^2}} \sqrt{{\displaystyle \sum_{i = 1}^n (b_i - \overline{b})^2}}} \\ &= \frac{{\displaystyle \sum_{i = 1}^n \Bigl(a_i - \frac{n+1}{2}\Bigr) \Bigl(b_i - \frac{n+1}{2}\Bigr)}}{\sqrt{{\displaystyle \sum_{i = 1}^n \Bigl(a_i - \frac{n+1}{2}\Bigr)^2}} \sqrt{{\displaystyle \sum_{i = 1}^n \Bigl(b_i - \frac{n+1}{2}\Bigr)^2}}} \\ &= \frac{ {\displaystyle \sum_{i = 1}^n \Bigl(a_ib_i - \frac{n+1}{2}(a_i+b_i) + \frac{(n+1)^2}{4}\Bigr)}} { \sqrt{{\displaystyle \sum_{i = 1}^n \Bigl(a_i^2 - (n+1)a_i + \frac{(n+1)^2}{4}\Bigr)}} \sqrt{{\displaystyle \sum_{i = 1}^n \Bigl(b_i^2 - (n+1)b_i + \frac{(n+1)^2}{4}\Bigr)}} } \\ &= \frac{ {\displaystyle \sum_{i = 1}^n a_ib_i - \frac{n(n+1)^2}{2} + \frac{n(n+1)^2}{4}} }{ {\displaystyle \frac{n(n+1)(2n+1)}{6} - \frac{n(n+1)^2}{2} + \frac{n(n+1)^2}{4}} }\\ &= \frac{ 12{\displaystyle \sum_{i = 1}^n a_ib_i - 3n(n+1)^2} }{ (n-1)n(n+1) }. \end{aligned}

Here, let’s consider the sum of the squared difference between aia_i and bib_i:

i=1n(aibi)2=2i=1naibi+n(n+1)(2n+1)3.\sum_{i = 1}^n (a_i-b_i)^2 = -2\sum_{i = 1}^na_ib_i +\frac{n(n+1)(2n+1)}{3}.

Therefore,

ρ=6i=1n(aibi)2+2n(n+1)(2n+1)3n(n+1)2(n1)n(n+1)=16n(n21)i=1n(aibi)2   \begin{aligned} \rho &= \frac{ -6{\displaystyle \sum_{i = 1}^n (a_i-b_i)^2 +2n(n+1)(2n+1)- 3n(n+1)^2} }{ (n-1)n(n+1) }\\ &= 1 - \frac{6}{n(n^2 - 1)}\sum_{i = 1}^n (a_i-b_i)^2 ~~~\blacksquare \end{aligned}

Experiment

Here, I conducted a quick experiment to confirm that the Pearson’s r of the ranks is equivalent to Spearman’s ρ. I generated 100 pairs of random samples (x and y) and calculated several types of correlation coefficients.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

n = 100
x = np.random.rand(n)
y = x + 0.5*np.random.rand(n)

2021 02 04 23 04 40

from scipy import stats
print(stats.pearsonr(x, y))
# >> (0.8863388430290433, 1.5374433582768292e-34)

print(stats.spearmanr(x, y))
# >> SpearmanrResult(correlation=0.8835643564356436, pvalue=4.677602781530847e-34)

print(stats.kendalltau(x, y))
# >> KendalltauResult(correlation=0.6981818181818182, pvalue=7.62741751146521e-25)

By converting x and y to ranks, it is confirmed that stats.spearmanr(x, y) is equal to stats.pearsonr(a, b).

a = len(x) - stats.rankdata(x) + 1
b = len(y) - stats.rankdata(y) + 1
print(stats.pearsonr(a, b))
# >> (0.8835643564356437, 4.677602781530673e-34)

Kendall Rank Correlation Coefficient

Kendall rank correlation coefficient is another common type of rank correlation efficients. Among NN pairs of indices {(i,j)}i<j\{(i,j)\}_{i<j}, it considers the number of concordant pairs PP, the number of discordant pairs QQ, and ties TxT_x and TyT_y.

P={(i,j)  0i<j<n, (xjxi)(yjyi)>0)}Q={(i,j)  0i<j<n, (xjxi)(yjyi)<0)}Tx={(i,j)  0i<j<n, xi=xj)}Ty={(i,j)  0i<j<n, yi=yj)}N=n(n1)2\begin{gathered} P = |\{ (i,j)~|~ 0\leq i <j<n, ~ (x_j-x_i)(y_j-y_i)>0 ) \}|\\ Q = |\{ (i,j)~|~ 0\leq i <j<n, ~ (x_j-x_i)(y_j-y_i)<0 ) \}|\\ T_x= |\{ (i,j)~|~ 0\leq i <j<n, ~ x_i=x_j ) \}|\\ T_y= |\{ (i,j)~|~ 0\leq i <j<n, ~ y_i=y_j ) \}|\\ N = \frac{n(n-1)}{2} \end{gathered}

Given these quantities, Kendall’s τ (a) is defined as:

τa=PQN.\tau_a = \frac{P - Q}{N} .

As well as Spearman’s ρ, Kendall’s τ (a) takes the value 1 if the order of the raw scores all match, and the value -1 if the order is completely reversed.

Kendall’s τ (b) cares about the case where ties are around.

τb=PQNTxNTy\tau_b = \frac{P - Q}{\sqrt{N - T_x}\sqrt{N - T_y}}

Goodman and Kruskal’s Gamma

Simlarly, Goodman and Kruskal’s γ is defined as:

γ=PQP+Q.\gamma = \frac{P - Q}{P + Q}.

When there are no ties (i.e. Tx=Ty=0T_x=T_y=0), Kendall’s τa\tau_a and τb\tau_b are equal to Goodman and Kruskal’s γ\gamma:

τa=τb=γ.\tau_a=\tau_b=\gamma.

References

[1] 東京大学教養学部統計学教室 編. ”統計学入門“(第3章). 東京大学出版会. 1991.
[2] 統計WEB - 統計学、調べる、学べる、BellCurve(ベルカーブ)
* This “statistics dictionary” covers a range of concepts with LaTeX codes.
[3] 相関(と回帰)
[4] Correlation coefficient - Wikipedia


  • このエントリーをはてなブックマークに追加
[object Object]

Written by Shion Honda. If you like this, please share!

Shion Honda

Hippocampus's Garden © 2024, Shion Honda. Built with Gatsby