Hippocampus's Garden

Under the sea, in the hippocampus's garden...

Stats with Python: Unbiased Variance | Hippocampus's Garden

Stats with Python: Unbiased Variance

January 17, 2021  |  6 min read  |  2,772 views

  • このエントリーをはてなブックマークに追加

Are you wondering what unbiased sample variance is? Or, why it is divided by n-1? If so, this post answers them for you with a simple simulation, proof, and an intuitive explanation.

Consider you have nn i.i.d. samples: X1,...,XnX_1,...,X_n, and you want to estimate the population mean μ\mu and the population variance σ2\sigma ^2 from these samples. The sample mean is defined as:

Xˉ=1ni=1nXi.\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i .

This looks quite natural. But what about the sample variance? This is defined as:

s2=1n1i=1n(XiXˉ)2.s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2 .

When I first saw this, it looked weird. Where does n1n-1 come from? The professor said “this term makes the estimation unbiased”, which I didn’t quite understand. But now, thanks to Python, it’s much clearer than it was then. So, in this post, I’ll make a concise and clear explanation of unbiased variance.

Visualizing How Unbiased Variance is Great

Consider a “biased” version of variance estimator:

S2=1ni=1n(XiXˉ)2.S^2 = \frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^2.

In fact, as well as unbiased variance, this estimator converges to the population variance as the sample size approaches infinity. However, the “biased variance” estimates the variance slightly smaller.

Let’s see how these esitmators are different. Suppose you are drawing samples, one by one up to 100, from a continuous uniform distribution U(0,1)\mathcal{U}(0,1). The population mean is 0.50.5 and the population variance is

01(x0.5)2 dx=112=0.0833.\int_0^1 (x-0.5)^2~\mathrm{d}x = \frac{1}{12} = 0.0833\ldots .

I simulated estimating the population variance using the above two estimators in the following code.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

n = 100
rands = np.random.rand(n)
biased = []
unbiased = []
for i in range(2, n+1):
  biased.append(np.var(rands[:i], ddof=0))
  unbiased.append(np.var(rands[:i], ddof=1))

x = np.arange(2, n+1)
plt.plot(x, biased, label="Biased")
plt.plot(x, unbiased, label="Unbiased")
plt.hlines(1/12, x[0], x[-1], label="Ground truth")
plt.xlabel("Sample size")
plt.ylabel("Variance")
plt.legend()
plt.title("Sample variance #1");

2021 01 17 09 53 52

This code gives different results every time you execute it.

2021 01 17 11 22 49

Indeed, both of these estimators seem to converge to the population variance 1/121/12 and the biased variance is slightly smaller than the unbiased estimator. However, from these results, it’s hard to see which is more “unbiased” to the ground truth. So, I repeated this experiment 10,000 times and plotted the average performance in the figure below.

n = 100
k = 10000
rands = np.random.rand(n, k)
biased = []
unbiased = []
for i in range(2, n+1):
  biased.append(np.var(rands[:i], axis=0, ddof=0).mean())
  unbiased.append(np.var(rands[:i],  axis=0, ddof=1).mean())

x = np.arange(2, n+1)
plt.plot(x, biased, label="Biased")
plt.plot(x, unbiased, label="Unbiased")
plt.hlines(1/12, x[0], x[-1], label="Ground truth")
plt.xlabel("Sample size")
plt.ylabel("Variance")
plt.legend()
plt.title("Sample variance (averaged over 10,000 trials)");

2021 01 17 09 56 03

Now it’s clear how the biased variance is biased. Even when there are 100 samples, its estimate is expected to be 1% smaller than the ground truth. In contrast, the unbiased variance is actually “unbiased” to the ground truth.

Proof

Though it is a little complicated, here is a formal explanation of the above experiment. Recall that the variance of the sample mean follows this equation:

V[Xˉ]=V[1ni=1nXi]=1n2V[i=1nXi]=1n2i=1nV[Xi]=1n2nV[X]=1nV[X].\begin{aligned} V[\bar{X}] &= V\Biggl[\frac{1}{n}\sum_{i=1}^nX_i\Biggr]\\ &= \frac{1}{n^2}V\Biggl[\sum_{i=1}^nX_i\Biggr]\\ &= \frac{1}{n^2}\sum_{i=1}^nV[X_i]\\ &= \frac{1}{n^2}nV[X]\\ &= \frac{1}{n}V[X]. \end{aligned}

Thus,

E[S2]=E[1ni=1n(XiXˉ)2]=E[1ni=1n((Xiμ)(Xˉμ))2]=E[1ni=1n((Xiμ)22(Xiμ)(Xˉμ)+(Xˉμ)2)]=E[1ni=1n(Xiμ)2]E[2ni=1n(Xiμ)(Xˉμ)]+E[1ni=1n(Xˉμ)2]=1ni=1nE[(Xiμ)2]E[2n(Xˉμ)i=1n(Xiμ)]+E[1nn(Xˉμ)2]=1nnV[X]E[2n(Xˉμ)n(Xˉμ)]+E[(Xˉμ)2]=V[X]E[(Xˉμ)2]=V[X]V[Xˉ]=V[X]1nV[X]=n1nσ2,\begin{aligned} E[S^2] &= E\Biggl[\frac{1}{n}\sum_{i=1}^n (X_i-\bar{X})^2 \Biggr]\\ &= E\Biggl[ \frac{1}{n}\sum_{i=1}^n \Bigl((X_i-\mu) -(\bar{X}-\mu) \Bigr)^2 \Biggr] \\ &= E\Biggl[ \frac{1}{n}\sum_{i=1}^n \Bigl((X_i-\mu)^2 -2(X_i-\mu)(\bar{X}-\mu) +(\bar{X}-\mu)^2 \Bigr) \Biggr] \\ &= E\Biggl[ \frac{1}{n}\sum_{i=1}^n (X_i-\mu)^2\Biggr] \\ & \quad- E\Biggl[\frac{2}{n}\sum_{i=1}^n (X_i-\mu)(\bar{X}-\mu) \Biggr] \\ & \quad+ E\Biggl[\frac{1}{n}\sum_{i=1}^n (\bar{X}-\mu)^2 \Biggr] \\ &= \frac{1}{n}\sum_{i=1}^n E[(X_i-\mu)^2] \\ & \quad- E\Biggl[\frac{2}{n}(\bar{X}-\mu)\sum_{i=1}^n (X_i-\mu) \Biggr] \\ & \quad + E\Biggl[\frac{1}{n}n (\bar{X}-\mu)^2 \Biggr]\\ &= \frac{1}{n}nV[X] - E\Biggl[\frac{2}{n}(\bar{X}-\mu)n (\bar{X}-\mu) \Biggr] + E\Biggl[ (\bar{X}-\mu)^2 \Biggr]\\ &= V[X] - E\Biggl[(\bar{X}-\mu)^2 \Biggr]\\ &= V[X] - V[\bar{X}]\\ &= V[X] - \frac{1}{n}V[X]\\ &= \frac{n-1}{n}\sigma^2, \end{aligned}

which means that the biased variance estimates the true variance (n1)/n(n-1)/n times smaller. To correct this bias, you need to estimate it by the unbiased variance:

s2=1n1i=1n(XiXˉ)2,s^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i-\bar{X})^2 ,

then,

E[s2]=σ2.E[s^2] = \sigma^2.

Here, n1n-1 is a quantity called degree of freedom. In the above example, the samples are subject to the equation:

X1+X2++Xn=nXˉ.X_1+X_2+\ldots+X_n = n\bar{X}.

So, given the sample mean Xˉ\bar{X}, the nn samples have only n1n-1 degrees of freedom. When I called the function np.var() in the experiment, I specified ddof=0 or ddof=1. This argument is short for delta degree of freedom, meaning how many degrees of freedom are reduced.

Intuition

The bias of the biased variance can be explained in a more intuitive way. By definition, the sample mean is always closer to the samples than the population mean, which leads to the smaller variance estimation if divided by the sample size nn. For more explanations, I’d recommend this video:


References

[1] 東京大学教養学部統計学教室 編. ”統計学入門“(第9章). 東京大学出版会. 1991.


  • このエントリーをはてなブックマークに追加
[object Object]

Written by Shion Honda. If you like this, please share!

Shion Honda

Hippocampus's Garden © 2024, Shion Honda. Built with Gatsby