
Stats with Python: Sample Correlation Coefficient is Biased

February 24, 2021  |  6 min read


Have you ever wondered how much bias the sample correlation coefficient $r$ has with respect to the population correlation coefficient $\rho$? In fact, even when the sample size is about 20, there can be a bias of up to 5%, depending on the value of $\rho$. This post visualizes how large the bias is and shows how to fix it.

Definition: Pearson’s r

As already discussed in the previous post, the Pearson product-moment correlation coefficient, or simply Pearson's $r$, of the paired sequences $\{(x_i,y_i)\}_{i=1}^n$ is given as:

$$r = \frac{\displaystyle \sum_{i = 1}^n (x_i - \overline{x}) (y_i - \overline{y})}{\sqrt{\displaystyle \sum_{i = 1}^n (x_i - \overline{x})^2}\,\sqrt{\displaystyle \sum_{i = 1}^n (y_i - \overline{y})^2}},$$

where $\bar{x}$ and $\bar{y}$ are the sample means. This is the sample correlation coefficient. For a population, the population correlation coefficient of the paired random variables $X$ and $Y$ is defined as:

$$\rho = \frac{Cov[X,Y]}{\sqrt{V[X]}\sqrt{V[Y]}}.$$
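As a quick illustration, here is a minimal sketch of computing the sample coefficient in Python, both directly from the definition and with NumPy's built-in np.corrcoef (the helper pearson_r is just for this example):

import numpy as np

def pearson_r(x, y):
    # Sample Pearson correlation computed directly from the definition above
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc**2).sum() * (yc**2).sum())

rng = np.random.default_rng(0)
x = rng.standard_normal(20)
y = 0.5 * x + 0.5 * rng.standard_normal(20)

print(pearson_r(x, y))          # from the definition
print(np.corrcoef(x, y)[0, 1])  # same value from NumPy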

Pearson’s r For a Sample is Biased!

So, is the sample correlation coefficient $r$ an unbiased estimator of the population correlation coefficient $\rho$? Unfortunately, the answer is no. $r$ is only asymptotically unbiased, so when the sample size is small, you need to account for its bias.

In this post, I assume the data $X$ and $Y$ follow a bivariate normal distribution and experiment with unbiased estimators of the correlation coefficient. Under this assumption, the exact density function of $r$ is given as:

$$f(r) = \frac{(n-2)\,\Gamma(n-1)\,(1-\rho^2)^{\frac{n-1}{2}}(1-r^2)^{\frac{n-4}{2}}}{\sqrt{2\pi}\,\Gamma(n-\frac{1}{2})\,(1-\rho r)^{n-\frac{3}{2}}}\ {}_2F_1\Bigl( \frac{1}{2},\frac{1}{2};\frac{2n-1}{2};\frac{\rho r+1}{2} \Bigr),$$

where $\Gamma(\cdot)$ is the gamma function and ${}_2F_1(\cdot,\cdot;\cdot;\cdot)$ is the Gaussian hypergeometric function.
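The density looks intimidating, but it is straightforward to evaluate numerically. As a sanity check (a sketch with arbitrarily chosen $\rho=0.5$ and $n=10$; the function name sample_r_pdf is just a label for this example), it should integrate to 1 over $[-1, 1]$:

import numpy as np
from scipy.special import gamma, hyp2f1
from scipy.integrate import quad

def sample_r_pdf(r, rho, n):
    # Exact density of the sample correlation coefficient under bivariate normality
    num = (n - 2) * gamma(n - 1) * (1 - rho**2)**((n - 1) / 2) * (1 - r**2)**((n - 4) / 2)
    den = np.sqrt(2 * np.pi) * gamma(n - 0.5) * (1 - rho * r)**(n - 1.5)
    return num / den * hyp2f1(0.5, 0.5, (2 * n - 1) / 2, (rho * r + 1) / 2)

total, _ = quad(sample_r_pdf, -1, 1, args=(0.5, 10))
print(total)  # should be very close to 1.0 if the density is normalized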

From the above scary-looking equation, Olkin and Pratt [1] derived the unique minimum variance unbiased estimator (MVUE):

$$r_{mvu} = r\ {}_2F_1\Bigl( \frac{1}{2},\frac{1}{2};\frac{n-1}{2};1-r^2 \Bigr).$$

This formula is rather complicated, so there is also an approximate version:

$$r_{adj} = r\Bigl( 1+\frac{1-r^2}{2(n-3)} \Bigr).$$
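In Python, both corrections are one-liners; scipy.special.hyp2f1 evaluates the hypergeometric term. The sketch below simply transcribes the two formulas above (the names r_mvu and r_adj are my own labels):

import scipy.special as sc

def r_mvu(r, n):
    # Minimum variance unbiased estimator (Olkin-Pratt), as given above
    return r * sc.hyp2f1(0.5, 0.5, (n - 1) / 2, 1 - r**2)

def r_adj(r, n):
    # Approximate correction of the sample correlation coefficient
    return r * (1 + (1 - r**2) / (2 * (n - 3)))

print(r_mvu(0.5, 20), r_adj(0.5, 20))  # both slightly larger than 0.5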

How large is the bias of the sample correlation coefficient? How good is the approximation? To answer these questions, I conducted the experiments described in the next section.

Experiment

The code I used for the experiments below is available at Colaboratory.

I obtained correlated sequences and their correlation coefficients in the following way:

  1. Draw $n$ samples $x_1,\ldots,x_n$ from $\mathcal{N}(0,1)$
  2. Draw $n$ samples $e_1,\ldots,e_n$ from $\mathcal{N}(0,1)$
  3. Given $p$ ($0\leq p\leq 1$), obtain $n$ samples by $y_i = px_i + (1-p)e_i$
  4. Compute the biased, the unbiased, and the adjusted correlation coefficients

Here, $p$ determines the correlation between $X$ and $Y$. That is, $X$ and $Y$ have a perfect correlation when $p=1$, and they have zero correlation when $p=0$. Formally, we have

$$\begin{gathered} V[X]=1\\ V[Y] = V[pX+(1-p)E] = p^2 + (1-p)^2 = 2p^2-2p+1\\ Cov[X,Y] = Cov[X,pX+(1-p)E] = p. \end{gathered}$$

Therefore, the population correlation coefficient $\rho$ is expressed as a function of $p$:

$$\rho = \frac{p}{\sqrt{2p^2-2p+1}}.$$
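As a quick sanity check of this formula, one can compare it against the empirical correlation of a very large sample (a sketch; the population_r helper reappears in the experiment code below):

import numpy as np

def population_r(p):
    # rho as a function of p, from the formula above
    return p / np.sqrt(2 * p**2 - 2 * p + 1)

rng = np.random.default_rng(0)
p, n = 2/3, 1_000_000
x = rng.standard_normal(n)
y = p * x + (1 - p) * rng.standard_normal(n)

print(population_r(p))          # theoretical rho
print(np.corrcoef(x, y)[0, 1])  # empirical estimate, close to the value above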

First, for different values of $\rho$ and $n$ (sample size), I plotted the biased, the unbiased, and the adjusted correlation coefficients to see their bias from the population value and their asymptotic behavior. To alleviate the effects of stochastic noise, I averaged over 10,000 trials. For calculating ${}_2F_1(\cdot,\cdot;\cdot;\cdot)$, SciPy provides the hyp2f1 function.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.special as sc
from tqdm.notebook import tqdm

sns.set_style("darkgrid")

def gen_data(p=0.5, n=100, k=10000):
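    # For each sample size from 3 to n, average the biased, unbiased (Olkin-Pratt),
    # and adjusted correlation estimates over k independent trials.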
    x = np.random.randn(n, k)
    e = np.random.randn(n, k)
    y = p * x + (1-p) * e

    biased = []
    unbiased = []
    adjusted = []
    for i in tqdm(range(3, n+1)):
        r_biased = []
        r_unbiased = []
        r_adjusted = []
        for j in range(k):
            r = np.corrcoef(x[:i, j], y[:i, j])[0, 1]
            r_biased.append(r)
            r_unbiased.append(r * sc.hyp2f1(0.5, 0.5, (i-1)/2, 1-r**2))
            r = r * (1 + (1 - r**2) / (2 * (i-3)))
            r_adjusted.append(r)
        biased.append(np.mean(r_biased))
        unbiased.append(np.mean(r_unbiased))
        adjusted.append(np.mean(r_adjusted))

    return biased, unbiased, adjusted

def population_r(p):
    return p / (2*p**2 - 2*p + 1)**0.5

n = 50  # maximum sample size per curve (assumed value; adjust as needed)
ps = [0, 1/3, 2/3, 1]
fig = plt.figure(figsize=(16,6))
for i,p in enumerate(ps):
    ax = fig.add_subplot(1, 4, i+1)
    biased, unbiased, adjusted = gen_data(p, n=n, k=10000)
    rho = population_r(p)
    x = np.arange(3, n+1)
    ax.plot(x, biased, label="Biased")
    ax.plot(x, unbiased, label="Unbiased")
    ax.plot(x, adjusted, label="Adjusted")
    ax.hlines(rho, x[0], x[-1], label="Ground truth")
    if i==1:
        ax.set_xlabel("Sample size", fontsize=12)
    if i==0:
        ax.set_ylabel("Correlation coefficient", fontsize=12)
    if i==3:
        ax.legend()
    ax.set_title(f"rho={rho:.3f}", fontsize=15)
fig.suptitle("Sample correlation coefficient (averaged over 10,000 trials)", fontsize=18)
plt.show()

[Figure: biased, unbiased, and adjusted estimates vs. sample size, averaged over 10,000 trials, for four values of $\rho$]

One can observe the following:

  • The “biased estimator” is indeed asymptotically unbiased, but the bias remains noticeable even when $n\sim 50$.
  • It looks like neither the “unbiased” estimator nor the “adjusted” estimator is completely unbiased. Still, they are significantly more accurate than the biased estimator.
  • The bias depends on $\rho$ as well as $n$. It seems that the smaller $|\rho|$ is, the larger the relative bias is.
  • When $\rho=1$ (perfect correlation), $r=r_{mvu}=r_{adj}=\rho=1$.
  • When $\rho=0$ (zero correlation), $r\simeq r_{mvu} \simeq r_{adj}$.

To get a better picture of how the bias changes with $\rho$, let's consider the absolute bias $\rho - r$ for the case where $n=10$. This is intractable due to the hypergeometric term, so I consider the approximate version:

$$r_{adj} - r = \frac{1}{2(n-3)}r(1-r^2).$$

The absolute bias reaches its maximum when $r=1/\sqrt{3}$ (i.e., $\rho \simeq 0.6$) because:

$$\begin{aligned} \frac{\partial }{\partial r}(r_{adj} - r) &= 0 \\ \frac{1-3r^2}{2(n-3)} &= 0\\ r &= \pm \frac{1}{\sqrt{3}}. \end{aligned}$$
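A quick numerical check of this stationary point, using the approximate bias formula above (a small sketch for $n=10$):

import numpy as np

n = 10
r = np.linspace(-1, 1, 20001)
approx_bias = r * (1 - r**2) / (2 * (n - 3))  # r_adj - r

print(r[np.argmax(approx_bias)], 1 / np.sqrt(3))  # both approximately 0.577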

The numerical experiment produced the expected result.

[Figure: absolute bias of the sample correlation coefficient as a function of $\rho$, for $n=10$]

In a similar way, the approximate relative bias $(r_{adj} - r)/r_{adj}$ takes its maximum value at $r=0$.

[Figure: relative bias of the sample correlation coefficient as a function of $\rho$, for $n=10$]

Conclusion

  • The sample correlation coefficient $r$ is not an unbiased estimator of the population correlation coefficient $\rho$. The bias remains non-trivial even when $n\sim 50$
  • For data that follow a bivariate normal distribution, the exact form of the minimum variance unbiased estimator $r_{mvu}$ is known
  • The approximate version $r_{adj}$ is accurate enough and far handier
  • The smaller $|\rho|$ is, the larger the relative bias of $r$ is

References

[1] Ingram Olkin and John W. Pratt. “Unbiased Estimation of Certain Correlation Coefficients”. Ann. Math. Statist., 1958.
[2] Pearson correlation coefficient - Wikipedia



Written by Shion Honda.