Hippocampus's Garden

Under the sea, in the hippocampus's garden...

Stats with Python: Finite Population Correction

January 29, 2021  |  6 min read  |  30 views


It is a common mistake to assume independence between samples drawn without replacement from a finite population. This can lead to mis-estimating, for example, the variance of the sample mean.

Suppose you have $n$ samples $X_1,\ldots,X_n$, drawn without replacement from a finite population $\{Y_i|i=1,\ldots, N \}$ with mean $\mu$ and variance $\sigma^2$, and you want to estimate the mean and variance of the sample mean $\bar{X}$. As in the previous post, the sample mean is defined as:

$$
\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i .
$$

Note that the samples are identically distributed but not independent of each other. For example, if $X_1=Y_1$, then $X_2$ is sampled from $\{Y_i|i=2,\ldots, N \}$. Even so, the mean of the sample mean is the same as for sampling with replacement:

$$
E[\bar{X}] =\mu.
$$

However, the variance of the sample mean is not $\sigma^2/n$:

$$
V[\bar{X}] = \frac{N-n}{N-1}\frac{\sigma^2}{n}.
$$

The factor $(N-n)/(N-1)$ is called the finite population correction. When $N\gg n$, this factor approaches $1$ and can be ignored. But when $n$ is not negligible compared to $N$, the correction matters and should not be dropped. In this post, I'll visualize the effect of finite population correction and then give a brief proof of the above formulae.
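To see how quickly the factor moves away from $1$, here is a minimal sketch that tabulates it for a population of size 100. The helper name `fpc` and the chosen sample sizes are my own, picked only for illustration.

```python
def fpc(N, n):
    """Finite population correction factor (N - n) / (N - 1).

    `fpc` is a hypothetical helper name used only in this sketch.
    """
    if N < 2 or not 1 <= n <= N:
        raise ValueError("need N >= 2 and 1 <= n <= N")
    return (N - n) / (N - 1)


N = 100  # population size (illustrative)
for n in (1, 5, 10, 50, 90, 100):
    # Print the sampling fraction n/N next to the correction factor
    print(f"n={n:3d}  n/N={n / N:.2f}  factor={fpc(N, n):.3f}")
```

The factor is $1$ for a single draw and falls to $0$ once the sample is the whole population.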

Visualizing Finite Population Correction

Consider a finite population $\{ Y|Y=1,\ldots,100 \}$ and $n$ samples drawn without replacement from it. In the following two figures, I plot the corrected and uncorrected variance of the sample mean against sample sizes $n \in \{2,\ldots,N \}$, using a different random seed for each figure.

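[Figure: corrected and uncorrected variance of the sample mean vs. sample size, one random seed]

[Figure: corrected and uncorrected variance of the sample mean vs. sample size, another random seed]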

In both figures, we see that as the sample size grows, the variance of the sample mean approaches $0$, as the law of large numbers suggests. Specifically, when $n=100$, the corrected variance is exactly $0$. This is natural because the mean of $n=100$ samples is always $(1+\ldots+100)/100=50.5$. The uncorrected variance does not satisfy this condition, so it is clear that you should use the finite population correction when the population is finite and samples are drawn without replacement.
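To get a feel for the scale of the two quantities, here is a short worked calculation. It assumes $\sigma^2$ is the population variance with divisor $N$ (the convention used in this post) and uses the standard closed form $(N^2-1)/12$ for the variance of $\{1,\ldots,N\}$. These are theoretical values, not numbers read off the figures.

$$
\begin{aligned}
\sigma^2 &= \frac{N^2-1}{12} = \frac{100^2-1}{12} = 833.25,\\
n=50:\quad \frac{\sigma^2}{n} &= 16.665, \qquad \frac{N-n}{N-1}\frac{\sigma^2}{n} = \frac{50}{99}\times 16.665 \approx 8.417,\\
n=100:\quad \frac{\sigma^2}{n} &= 8.3325, \qquad \frac{N-n}{N-1}\frac{\sigma^2}{n} = \frac{0}{99}\times 8.3325 = 0.
\end{aligned}
$$

So when half the population is sampled, the uncorrected value is already about twice the corrected one.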

Using the following code, I repeated this experiment 1,000 times and plotted the average values in the figure below. The difference between the corrected and uncorrected variance is clearer here.

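import matplotlib.pyplot as plt
import numpy as np

N = 100  # population size
population = np.arange(1, N+1)  # finite population {1, ..., 100}
corrected_vars = []
uncorrected_vars = []
for _ in range(1000):
  # Shuffle the population; the first n entries are a sample without replacement
  rands = np.random.choice(population, N, replace=False)
  corrected = []
  uncorrected = []
  for n in range(2, N+1):
    # Plug-in estimate of sigma^2/n (np.var divides by n by default)
    var_of_mean = np.var(rands[:n])/n
    # Apply the finite population correction (N-n)/(N-1)
    corrected.append(var_of_mean*(N-n)/(N-1))
    uncorrected.append(var_of_mean)
  corrected_vars.append(corrected)
  uncorrected_vars.append(uncorrected)

# Average the estimates over the 1,000 trials and plot them
x = np.arange(2, N+1)
plt.plot(x, np.mean(corrected_vars, axis=0), label="Corrected")
plt.plot(x, np.mean(uncorrected_vars, axis=0), label="Uncorrected")
plt.xlabel("Sample size")
plt.ylabel("Variance of sample mean")
plt.legend()
plt.title("Finite population of size 100 (averaged over 1,000 trials)")
plt.show()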

[Figure: corrected and uncorrected variance of the sample mean vs. sample size, averaged over 1,000 trials]
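The experiment above averages plug-in estimates computed from single samples. A complementary check, sketched below, measures the variance of the sample mean directly by drawing many independent samples of a fixed size and comparing the spread of their means with both formulas. This is not the code behind the figures; the seed, the sample size `n = 50`, and the number of trials are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
N, n, trials = 100, 50, 20000   # illustrative sizes, not taken from the post
population = np.arange(1, N + 1)
sigma2 = population.var()       # population variance (divides by N)

# Draw many samples without replacement and record each sample mean
means = np.array([
    rng.choice(population, size=n, replace=False).mean()
    for _ in range(trials)
])

empirical = means.var()                      # observed variance of the sample mean
uncorrected = sigma2 / n                     # formula that assumes independence
corrected = uncorrected * (N - n) / (N - 1)  # formula with the correction

print(f"empirical:   {empirical:.3f}")
print(f"corrected:   {corrected:.3f}")
print(f"uncorrected: {uncorrected:.3f}")
```

With these settings the empirical value should land close to the corrected formula and far from the uncorrected one.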

Proof

The mean of the sample mean is the same as in the case of sampling with replacement.

$$
\begin{aligned}
E[\bar{X}] &= E\Biggl[\frac{1}{n}\sum_{i=1}^nX_i\Biggr]\\
&= \frac{1}{n}\sum_{i=1}^nE[X_i]\\
&= \frac{1}{n}\sum_{i=1}^n\mu\\
&=\mu.
\end{aligned}
$$

The variance of the sample mean has an additional term $\frac{n-1}{n}Cov[X_i,X_j]$.

$$
\begin{aligned}
V[\bar{X}] &= V\Biggl[\frac{1}{n}\sum_{i=1}^nX_i\Biggr]\\
&= \frac{1}{n^2}V\Biggl[\sum_{i=1}^nX_i\Biggr]\\
&= \frac{1}{n^2}\sum_{i=1}^nV[X_i] + \frac{2}{n^2}\sum_{i=1}^n\sum_{j=1}^{i-1} Cov[X_i,X_j]\\
&= \frac{n\sigma^2}{n^2} + \frac{2}{n^2}\frac{n(n-1)}{2}Cov[X_i,X_j]\\
&= \frac{\sigma^2}{n} + \frac{n-1}{n}Cov[X_i,X_j].
\end{aligned}
$$

Here,

$$
\begin{aligned}
Cov[X_i,X_j] &= E[X_iX_j]-E[X_i]E[X_j]\\
&=\frac{2}{N(N-1)}\sum_{i=1}^N\sum_{j=1}^{i-1}Y_iY_j - \mu^2\\
&=\frac{1}{N(N-1)}\Biggl( \biggl(\sum_{i=1}^NY_i\biggr)^2 - \sum_{i=1}^NY_i^2 \Biggr)- \mu^2\\
&= \frac{N^2\mu^2}{N(N-1)} - \mu^2 - \frac{1}{N(N-1)}\sum_{i=1}^NY_i^2\\
&= \frac{N\mu^2}{N(N-1)} - \frac{1}{N(N-1)}\sum_{i=1}^NY_i^2\\
&= - \frac{1}{N(N-1)}\Biggl(\sum_{i=1}^NY_i^2 -2\sum_{i=1}^N\mu Y_i + \sum_{i=1}^N\mu^2 \Biggr)\\
&= - \frac{1}{N(N-1)}\sum_{i=1}^N(Y_i-\mu)^2\\
&= -\frac{1}{N-1}\sigma^2.
\end{aligned}
$$
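The covariance result can be checked numerically on a small population. In the sketch below, the six-element population is a made-up example; every ordered pair of distinct units is equally likely to be $(X_i, X_j)$, so averaging over all such pairs gives $E[X_iX_j]$ exactly.

```python
import itertools

import numpy as np

# Hypothetical toy population, chosen only for illustration
population = np.array([2.0, 3.0, 5.0, 7.0, 11.0, 13.0])
N = len(population)
mu = population.mean()
sigma2 = population.var()  # population variance (divides by N)

# All ordered pairs of distinct units; each has probability 1 / (N (N - 1))
pairs = np.array(list(itertools.permutations(population, 2)))
cov = np.mean(pairs[:, 0] * pairs[:, 1]) - mu * mu

print("Cov by enumeration:", cov)
print("-sigma^2 / (N - 1):", -sigma2 / (N - 1))
```

The two printed values should agree up to floating-point error, and both are negative: drawing a large value first makes the remaining draws slightly smaller on average.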

Thus,

$$
\begin{aligned}
V[\bar{X}] &= \frac{\sigma^2}{n} - \frac{n-1}{n}\frac{1}{N-1}\sigma^2\\
&= \frac{N-n}{N-1}\frac{\sigma^2}{n}.
\end{aligned}
$$
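The closed form can also be verified exactly, without any randomness, by enumerating every possible sample from a small population. This is a minimal sketch under stated assumptions: the six-element population and the helper name `var_of_mean_exact` are hypothetical, and the enumeration is only feasible for small $N$.

```python
import itertools

import numpy as np


def var_of_mean_exact(population, n):
    """Exact variance of the sample mean over all size-n samples
    drawn without replacement (each subset is equally likely)."""
    means = [np.mean(s) for s in itertools.combinations(population, n)]
    return np.var(means)


# Hypothetical toy population, chosen only for illustration
population = [2.0, 3.0, 5.0, 7.0, 11.0, 13.0]
N = len(population)
sigma2 = np.var(population)  # population variance (divides by N)

results = []
for n in range(1, N + 1):
    exact = var_of_mean_exact(population, n)
    formula = (N - n) / (N - 1) * sigma2 / n
    results.append((n, exact, formula))
    print(f"n={n}  exact={exact:.6f}  formula={formula:.6f}")
```

For every $n$ from $1$ to $N$ the two columns should match up to floating-point error, including the value $0$ at $n=N$.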

Intuition

The uncorrected version does not take the covariance term $Cov[X_i,X_j]$, which is negative, into account. This leads to the overestimation of the variance.




Written by Shion Honda. If you like this, please share!