Stats with Python: Simple Linear Regression

March 22, 2021  |  5 min read


We’ve seen several aspects of the correlation coefficient in the previous posts. The correlation coefficient treats the two variables equally; their roles are symmetric. When the two variables are not symmetric, that is, when you want to explain $y$ by $x$, correlation analysis alone is not sufficient. Instead, you might want to conduct a regression analysis.

The simplest approach, simple linear regression, considers a single explanatory variable (independent variable) $x$ for explaining the response variable (dependent variable) $y$:

$$y = \beta_0 + \beta_1 x.$$

Least Squares Estimates

How do we determine the parameters $\beta_0$ and $\beta_1$ in the above equation? Given the paired data $\{(x_i, y_i)\}_{i=1}^n$, they are determined by the method of least squares. That is, they are chosen to minimize the sum of squared errors between the predicted $\hat{y}_i = b_0 + b_1 x_i$ and the actual $y_i$:

$$\mathcal{L}(b_0, b_1) \coloneqq \sum_{i=1}^n \{ y_i - (b_0 + b_1 x_i) \}^2.$$

The least squares estimates are:

$$\begin{gathered} \hat{\beta}_1 = \frac{S_{xy}}{S_x^2},\\ \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \end{gathered}$$

where

$$\begin{gathered} S_{xy} \coloneqq \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}),\\ S_x^2 \coloneqq \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2,\\ S_y^2 \coloneqq \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2. \end{gathered}$$
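Before the proof, let’s check these formulas numerically. Below is a minimal NumPy sketch (the variable names are mine, not from the formulas above) that computes $\hat{\beta}_1 = S_{xy}/S_x^2$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$ directly and compares them against np.polyfit:

import numpy as np

x = np.random.rand(100)
y = x + 0.5*np.random.rand(100)

# Sample covariance and variance as defined above
S_xy = np.mean((x - x.mean()) * (y - y.mean()))
S_xx = np.mean((x - x.mean()) ** 2)

beta_1 = S_xy / S_xx                   # slope: S_xy / S_x^2
beta_0 = y.mean() - beta_1 * x.mean()  # intercept: y_bar - beta_1 * x_bar

# np.polyfit(x, y, 1) returns [slope, intercept]; it should agree
print(beta_1, beta_0)
print(np.polyfit(x, y, 1))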

Proof

$$\mathcal{L}(b_0, b_1) = \sum_{i=1}^n \{ (y_i - \bar{y}) + (\bar{y} - b_0 - b_1\bar{x}) - b_1(x_i - \bar{x}) \}^2.$$

Considering $\sum_{i=1}^n (x_i - \bar{x}) = 0$ and $\sum_{i=1}^n (y_i - \bar{y}) = 0$, the cross terms involving the constant $(\bar{y} - b_0 - b_1\bar{x})$ vanish, and we get

$$\begin{aligned} \mathcal{L}(b_0, b_1) &= \sum_{i=1}^n (y_i - \bar{y})^2 + n(\bar{y} - b_0 - b_1\bar{x})^2 \\ & \quad + b_1^2 \sum_{i=1}^n (x_i - \bar{x})^2 - 2b_1 \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}) \\ &= nS_y^2 + n(\bar{y} - b_0 - b_1\bar{x})^2 + n b_1^2 S_x^2 - 2n b_1 S_{xy} \\ &= nS_x^2 \Bigl( b_1 - \frac{S_{xy}}{S_x^2} \Bigr)^2 + n(\bar{y} - b_0 - b_1\bar{x})^2 + n\Bigl( S_y^2 - \frac{S_{xy}^2}{S_x^2} \Bigr). \end{aligned}$$

In the last expression, the first two terms are non-negative squares and the third does not depend on $b_0$ or $b_1$. Therefore, $\mathcal{L}(b_0, b_1)$ takes its minimum value when $b_1 = S_{xy}/S_x^2$ and $b_0 = \bar{y} - b_1\bar{x}$.
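Equivalently, setting the partial derivatives $\partial \mathcal{L}/\partial b_0$ and $\partial \mathcal{L}/\partial b_1$ to zero yields the normal equations, which say that the residuals sum to zero and are orthogonal to $x$. Continuing the sketch above (reusing x, y, beta_0, and beta_1), we can verify both conditions numerically:

# Residuals at the least squares solution
res = y - (beta_0 + beta_1 * x)

print(np.isclose(res.sum(), 0.0))       # normal equation 1: residuals sum to zero
print(np.isclose(np.dot(res, x), 0.0))  # normal equation 2: residuals orthogonal to x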

Coefficient of Determination

Now we have the predicted values $\{\hat{y}_i\}_{i=1}^n$. How good are these predictions? To evaluate the goodness of fit, the coefficient of determination $R^2$ is frequently used:

$$R^2 \coloneqq \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}.$$

The coefficient of determination is the ratio of the ESS (explained sum of squares) to the TSS (total sum of squares). As you may imagine from its notation, in simple linear regression the coefficient of determination $R^2$ is the square of the Pearson correlation coefficient $r$.
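In code, $R^2$ can be computed straight from this definition. Here is a minimal sketch continuing from the NumPy example above (the names ess and tss are mine):

y_hat = beta_0 + beta_1 * x

ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
tss = np.sum((y - y.mean()) ** 2)      # total sum of squares

print(ess / tss)                       # R^2
print(np.corrcoef(x, y)[0, 1] ** 2)    # r^2, should match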

Proof

Using the relation $\hat{y}_i - \bar{y} = \hat{\beta}_1 (x_i - \bar{x})$,

$$\begin{aligned} R^2 &= \frac{\sum_{i=1}^n \{\hat{\beta}_1 (x_i - \bar{x})\}^2}{\sum_{i=1}^n (y_i - \bar{y})^2} \\ &= \frac{n \hat{\beta}_1^2 S_x^2}{n S_y^2} \\ &= \frac{S_{xy}^2}{S_x^4} \cdot \frac{S_x^2}{S_y^2} \\ &= \Bigl( \frac{S_{xy}}{S_x S_y} \Bigr)^2 \\ &= r^2. \end{aligned}$$

Experiment

Lastly, let’s confirm that $R^2 = r^2$ while introducing how to run linear regression in Python. As in the previous post, I generated 100 pairs of correlated random samples x and y.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

# Generate 100 correlated samples: y is linear in x plus uniform noise
n = 100
x = np.random.rand(n)
y = x + 0.5*np.random.rand(n)

sns.scatterplot(x=x, y=y)

[Figure: scatter plot of the generated samples]

Scikit-learn implements linear regression as LinearRegression and the coefficient of determination as r2_score. After fitting the model, we can plot the regression line as below:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# scikit-learn expects the features as a 2D array of shape (n_samples, n_features)
model = LinearRegression()
model.fit(x.reshape(-1, 1), y)

sns.scatterplot(x=x, y=y)
plt.plot(x, model.predict(x.reshape(-1, 1)), color="k")

[Figure: scatter plot with the fitted regression line]

As expected, $R^2 = r^2$ is confirmed:

r2_score(y, model.predict(x.reshape(-1, 1)))
# >> 0.7922606713476185

np.corrcoef(x, y)[0,1]**2
# >> 0.7922606713476184
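
As an aside, scipy.stats.linregress fits the same model and returns the correlation coefficient directly, so the check can also be written in a single call (a sketch assuming SciPy is available):

from scipy.stats import linregress

result = linregress(x, y)
print(result.rvalue ** 2)  # equals the R^2 computed above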




Written by Shion Honda. If you like this, please share!
