Hippocampus's Garden

Under the sea, in the hippocampus's garden...


    Stats with Python: Multiple Linear Regression

    March 31, 2021  |  4 min read


    The last post discussed simple linear regression, which explains the objective variable $y$ by a single explanatory variable $x$. This post introduces regression with multiple explanatory variables: multiple linear regression.

    $$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_m x_m = \boldsymbol{\beta}^\mathrm{T}\boldsymbol{x}$$

    Least Square Estimates, Revisited!

    Just like simple linear regression, multiple linear regression finds its parameters $\boldsymbol{\beta}$ by the method of least squares. Given the paired data $\{(\boldsymbol{x}_i, y_i)\}_{i=1}^n$, the sum of squared errors between the predicted $\hat{y}_i = \boldsymbol{\beta}^\mathrm{T}\boldsymbol{x}_i$ and the actual $y_i$ is written in matrix form:

    $$\mathcal{L}(\boldsymbol{\beta}) \coloneqq ( \boldsymbol{y} - X\boldsymbol{\beta} )^2.$$

    Therefore, the least squares estimates are:

    $$\hat{\boldsymbol{\beta}} = (X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}\boldsymbol{y}$$


    $$\begin{aligned}
    \mathcal{L}(\boldsymbol{b}) &= ( \boldsymbol{y} - X\boldsymbol{b} )^2 \\
    \frac{\partial\mathcal{L}(\boldsymbol{b})}{\partial\boldsymbol{b}} &= 0 \\
    -2X^{\mathrm{T}}(\boldsymbol{y} - X\boldsymbol{b}) &= 0 \\
    X^{\mathrm{T}}X\boldsymbol{b} &= X^{\mathrm{T}}\boldsymbol{y} \\
    \hat{\boldsymbol{b}} &= (X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}\boldsymbol{y}
    \end{aligned}$$
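    As a sanity check (not in the original post), the normal-equation estimate can be computed directly in NumPy. The data and variable names below are illustrative; a column of ones is prepended to the design matrix so that $\beta_0$ is estimated along with the other coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 100, 2

# Design matrix with a leading column of ones for the intercept beta_0
X = np.column_stack([np.ones(n), rng.random((n, m))])
true_beta = np.array([0.5, 1.0, -2.0])
y = X @ true_beta + 0.1 * rng.standard_normal(n)

# beta_hat = (X^T X)^{-1} X^T y; solve() is numerically preferable
# to forming the inverse explicitly
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [0.5, 1.0, -2.0]
```

    With only mild noise, the estimate recovers the true coefficients closely.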

    Coefficient of Determination

    The coefficient of determination $R^2$ is defined in exactly the same way as for simple linear regression.

    $$R^2 \coloneqq \frac{ESS}{TSS} = \frac{\sum_{i=1}^n (\hat{y}_i-\bar{y})^2}{\sum_{i=1}^n (y_i-\bar{y})^2}.$$

    In multiple linear regression, the correlation coefficient is defined as the square root of the coefficient of determination, that is, $r \coloneqq R$.

    As extra explanatory variables are added, $R^2$ spuriously increases. For feature selection, you might want to use the adjusted coefficient of determination $\bar{R}^2$ to counteract this effect.

    $$\bar{R}^2 \coloneqq 1 - \frac{n-1}{n-m-1}(1-R^2)$$

    Here, $n-1$ is the degree of freedom of the totals $y_i-\bar{y}$ and $n-m-1$ is the degree of freedom of the residuals $y_i-\hat{y}_i$.
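    The adjustment is easy to wrap in a small helper (the function name is my own). The numbers below illustrate the point: adding 8 extra variables that only nudge $R^2$ from 0.90 to 0.905 actually lowers the adjusted value:

```python
def adjusted_r2(r2, n, m):
    """Adjusted coefficient of determination for n samples and m explanatory variables."""
    return 1 - (n - 1) / (n - m - 1) * (1 - r2)

print(adjusted_r2(0.90, n=100, m=2))    # ≈ 0.8979
print(adjusted_r2(0.905, n=100, m=10))  # ≈ 0.8943, lower despite the higher R^2
```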


    It’s easy to use multiple linear regression because scikit-learn’s LinearRegression supports it by default. You can use it in exactly the same way as in the last post.

    First, let’s prepare the dataset.

    import numpy as np
    import matplotlib.pyplot as plt

    # Generate n samples with m explanatory variables
    n = 100
    m = 2
    X = np.random.rand(n, m)
    y = X[:, 0] - 2 * X[:, 1] + 0.5 * np.random.rand(n)

    # Scatter plot of the data in 3D space
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(X[:, 0], X[:, 1], y, color='k')

    [Figure: 3D scatter plot of the generated data]

    Next, call LinearRegression, fit the model, and plot the regression plane in 3D space.

    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    model = LinearRegression()
    model.fit(X, y.reshape(-1, 1))

    # Evaluate the fitted model on a 20x20 grid over [0, 1]^2
    xx, yy = np.meshgrid(np.linspace(0, 1, 20), np.linspace(0, 1, 20))
    model_viz = np.array([xx.flatten(), yy.flatten()]).T
    predicted = model.predict(model_viz)

    # Plot the regression plane together with the data points
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(xx.flatten(), yy.flatten(), predicted, facecolor=(0, 0, 0, 0), s=20, edgecolor='#70b3f0')
    ax.scatter(X[:, 0], X[:, 1], y, color='k')


    $R^2$ is also calculated in the same way.

    r2_score(y, model.predict(X))
    # >> 0.9640837457111405
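    As a final cross-check (my addition, using a seeded generator so the run is reproducible), the coefficients scikit-learn finds should coincide with the normal-equation estimate $(X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}\boldsymbol{y}$ once an intercept column is prepended:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 100
X = rng.random((n, 2))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.random(n)

model = LinearRegression().fit(X, y)

# Closed-form least squares with an explicit intercept column
X1 = np.column_stack([np.ones(n), X])
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)

print(np.allclose(beta_hat[0], model.intercept_))  # True
print(np.allclose(beta_hat[1:], model.coef_))      # True
```

    LinearRegression solves the very same least-squares problem internally, so the two answers agree up to floating-point tolerance.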




    Written by Shion Honda. If you like this, please share!
