Hippocampus's Garden

Under the sea, in the hippocampus's garden...

    Search by
    Stats with Python: Simple Linear Regression | Hippocampus's Garden

    Stats with Python: Simple Linear Regression

    March 22, 2021  |  5 min read  |  94 views

    We’ve seen several aspects of the correlation coefficient in the previous posts. The correlation coefficient treats two variables equally; they are symmetrical. When two variables are not symmetrical, that is, when you want to explain yy by xx, correlation analysis alone is not sufficient. Instead, you might want to conduct a regression analysis.

    The simplest approach, simple linear regression, considers a single explanatory variable (independent variable) xx for explaining the objective variable (dependent variable) yy.

    y=β0+β1xy = \beta_0+\beta_1x

    Least Square Estimates

    How to determine the parameters β0\beta_0 and β1\beta_1 in the above equation? Given the paired data {(xi,yi)}i=1n\{(x_i,y_i)\}_{i=1}^n, they are determined by the method of least squares. That is, they are chosen to minimize the sum of the squared error between the predicted y^i=b0+b1xi\hat{y}_i = b_0+b_1x_i and the actual yiy_i:

    L(b0,b1)i=1n{yi(b0+b1xi)}2.\mathcal{L}(b_0,b_1) \coloneqq \sum_{i=1}^n \{ y_i - (b_0+b_1x_i) \}^2.

    Therefore, the least squares estimates are:

    β^1=SxySx2,β^0=yˉβ1xˉ.\begin{gathered} \hat{\beta}_1 = \frac{S_{xy}}{S_{x}^2},\\ \hat{\beta}_0 = \bar{y}-\beta_1\bar{x}. \end{gathered}

    , where

    Sxy1ni=1n(xix)(yiy),Sx21ni=1n(xix)2,Sy21ni=1n(yiy)2.\begin{gathered} S_{xy} \coloneqq \frac{1}{n}\sum_{i = 1}^n (x_i - \overline{x}) (y_i - \overline{y}),\\ S_{x}^2 \coloneqq \frac{1}{n}\sum_{i = 1}^n (x_i - \overline{x})^2,\\ S_{y}^2 \coloneqq \frac{1}{n}\sum_{i = 1}^n (y_i - \overline{y})^2. \end{gathered}


    L(b0,b1)=i=1n{(yiyˉ)+(yˉb0b1xˉ)b1(xixˉ)}2\mathcal{L}(b_0,b_1) = \sum_{i=1}^n \{ (y_i-\bar{y}) + (\bar{y}-b_0-b_1\bar{x}) -b_1(x_i-\bar{x})\}^2

    Considering i=1n(xixˉ)=0\sum_{i=1}^n(x_i-\bar{x})=0 and i=1n(yiyˉ)=0\sum_{i=1}^n(y_i-\bar{y})=0,

    L(b0,b1)=i=1n(yiyˉ)2+n(yˉb0b1xˉ)2+b12i=1n(xixˉ)22b1i=1n(xixˉ)(yiyˉ)=nSy2+n(yˉb0b1xˉ)2+nb12Sx22nb1Sxy=nSx2(b1SxySx2)2+n(yˉb0b1xˉ)2+n(Sy2Sxy2Sx2)\begin{aligned} \mathcal{L}(b_0,b_1) &= \sum_{i=1}^n(y_i-\bar{y})^2 + n(\bar{y}-b_0-b_1\bar{x})^2 \\ & \quad + b_1^2\sum_{i=1}^n(x_i-\bar{x})^2 - 2b_1\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})\\ &= nS_{y}^2 + n(\bar{y}-b_0-b_1\bar{x})^2 + nb_1^2S_{x}^2 - 2nb_1S_{xy}\\ &= nS_x^2\Bigl(b_1 - \frac{S_{xy}}{S_x^2}\Bigr)^2 + n(\bar{y}-b_0-b_1\bar{x})^2 + n\Bigl(S_y^2 - \frac{S_{xy}^2}{S_x^2}\Bigr) \end{aligned}

    From the above calculation, L(b0,b1)\mathcal{L}(b_0,b_1) takes its minimum value when b1=Sxy/Sx2b_1=S_{xy}/S_x^2 and b0=yˉb1xˉb_0 = \bar{y}-b_1\bar{x}.

    Coefficient of Determination

    Now we have the predicted values {y^i}i=1n\{\hat{y}_i\}_{i=1}^n. How good are these predictions? To evaluate the goodness, coefficient of determination R2R^2 is frequently used.

    R2ESSTSS=i=1n(yi^yˉ)2i=1n(yiyˉ)2.R^2 \coloneqq \frac{ESS}{TSS} = \frac{ \sum_{i=1}^n (\hat{y_i}-\bar{y})^2}{\sum_{i=1}^n (y_i-\bar{y})^2}.

    The coefficient of determination is the ratio of ESS (explained sum of squares) to TSS (total sum of squares). As you may imagine from its notation, the coefficient of determination R2R^2 is the square of the Pearson correlation coefficient rr.


    Using the equation: y^iyˉ=β1(xixˉ)\hat{y}_i - \bar{y} = \beta_1(x_i - \bar{x}),

    R2=i=1n(β1(xixˉ))2i=1n(yiyˉ)2=nβ12Sx2nSy2=Sxy2Sx4Sx2Sy2=(SxySxSy)2=r2.\begin{aligned} R^2 &= \frac{ \sum_{i=1}^n (\beta_1(x_i - \bar{x}))^2}{\sum_{i=1}^n (y_i-\bar{y})^2}\\ &= \frac{n\beta_1^2 S_x^2}{nS_y^2}\\ &= \frac{S_{xy}^2}{S_{x}^4}\frac{S_x^2}{S_y^2}\\ &= \Bigl( \frac{S_{xy}}{S_xS_y} \Bigr)^2\\ &= r^2. \end{aligned}


    Lastly, let’s confirm that R2=r2R^2=r^2, introducing how to use linear regression with Python. As done in the previous post, I generated 100 pairs of correlated random samples (x and y).

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    n = 100
    x = np.random.rand(n)
    y = x + 0.5*np.random.rand(n)

    2021 03 22 22 19 16

    Scikit-learn implements linear regression as LinearRegression and coefficient of determination as r2_score. After fitting the model, we can plot the regression line like below:

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score
    model = LinearRegression()
    model.fit(x.reshape(-1, 1), y.reshape(-1, 1))
    sns.scatterplot(x, y)
    plt.plot(x, model.predict(x.reshape(-1, 1)), color="k")


    As expected, R2=r2R^2=r^2 is confirmed.

    r2_score(y, model.predict(x.reshape(-1, 1)))
    # >> 0.7922606713476185
    np.corrcoef(x, y)[0,1]**2
    # >> 0.7922606713476184


    [1] 倉田 博史, 星野 崇宏. ”入門統計解析“(第3章). 新世社. 2009.

    [object Object]

    Written by Shion Honda. If you like this, please share!

    Shion Honda

    Hippocampus's Garden © 2023, Shion Honda. Built with Gatsby