
Stats with Python: Multiple Linear Regression

March 31, 2021  |  3 min read


The last post discussed simple linear regression, which explains the objective variable $y$ by a single explanatory variable $x$. This post introduces regression with multiple explanatory variables $x_1,\ldots,x_m$: multiple linear regression.

$$y = \beta_0 + \beta_1 x_1 + \ldots + \beta_m x_m = \boldsymbol{\beta}^\mathrm{T}\boldsymbol{x}$$

Least Squares Estimates, Revisited!

Just like simple linear regression, multiple linear regression finds its parameters $\boldsymbol{\beta}$ by the method of least squares. Given the paired data $\{(\boldsymbol{x}_i,y_i)\}_{i=1}^n$, the sum of the squared errors between the predicted $\hat{y}_i = \boldsymbol{\beta}^\mathrm{T}\boldsymbol{x}_i$ and the actual $y_i$ is written in matrix form:

$$\mathcal{L}(\boldsymbol{\beta}) \coloneqq \| \boldsymbol{y} - X\boldsymbol{\beta} \|^2.$$

Minimizing this loss gives the least squares estimate:

$$\hat{\boldsymbol{\beta}} = (X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}\boldsymbol{y}$$

Proof

Minimizing $\mathcal{L}(\boldsymbol{b}) = \| \boldsymbol{y} - X\boldsymbol{b} \|^2$ with respect to $\boldsymbol{b}$, set the derivative to zero and solve (assuming $X^{\mathrm{T}}X$ is invertible):

$$\begin{aligned} \frac{\partial\mathcal{L}(\boldsymbol{b})}{\partial\boldsymbol{b}} &= 0\\ -2X^{\mathrm{T}}(\boldsymbol{y} - X\boldsymbol{b}) &= 0\\ X^{\mathrm{T}}X\boldsymbol{b} &= X^{\mathrm{T}}\boldsymbol{y}\\ \hat{\boldsymbol{b}} &= (X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}\boldsymbol{y} \end{aligned}$$
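
This closed-form solution can be written in a few lines of NumPy; here is a minimal sketch (the toy data and variable names are just for illustration). It prepends a column of ones to the data so that the intercept $\beta_0$ is estimated together with the other coefficients, and solves the normal equations directly:

import numpy as np

# Toy data: n samples, m explanatory variables
n, m = 100, 2
X_toy = np.random.rand(n, m)
y_toy = 0.3 + X_toy @ np.array([1.0, -2.0])  # arbitrary "true" parameters

# Design matrix with a leading column of ones for the intercept beta_0
X_design = np.hstack([np.ones((n, 1)), X_toy])

# Solve the normal equations X^T X b = X^T y
beta_hat = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_toy)
print(beta_hat)  # recovers [0.3, 1.0, -2.0] up to floating-point error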

Coefficient of Determination

The coefficient of determination $R^2$ is defined in exactly the same way as for simple linear regression.

$$R^2 \coloneqq \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\sum_{i=1}^n (\hat{y}_i-\bar{y})^2}{\sum_{i=1}^n (y_i-\bar{y})^2}.$$

In multiple linear regression, the multiple correlation coefficient is defined as the square root of the coefficient of determination, that is, $r \coloneqq \sqrt{R^2}$.

As extra explanatory variables are added, $R^2$ spuriously increases. For feature selection, you might want to use the adjusted coefficient of determination $\bar{R}^2$ to compensate for this effect.

$$\bar{R}^2 \coloneqq 1 - \frac{n-1}{n-m-1}(1-R^2)$$

Here, $n-1$ is the degrees of freedom of the total deviations $y_i-\bar{y}$, and $n-m-1$ is the degrees of freedom of the residuals $y_i-\hat{y}_i$.
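
The formula translates directly into Python; here is a minimal sketch (the helper name adjusted_r2 is just for illustration):

def adjusted_r2(r2, n, m):
    # Adjusted R^2 = 1 - (n-1)/(n-m-1) * (1-R^2)
    return 1 - (n - 1) / (n - m - 1) * (1 - r2)

adjusted_r2(0.96, n=100, m=2)  # about 0.959, slightly below the raw R^2 of 0.96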

Experiment

It’s easy to run multiple linear regression because scikit-learn’s LinearRegression supports it by default. You can use it in exactly the same way as in the last post.

First, let’s prepare the dataset.

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

# Generate n samples with m explanatory variables
n = 100
m = 2
X = np.random.rand(n, m)
# True relationship: y = x1 - 2*x2 + uniform noise
y = X[:, 0] - 2 * X[:, 1] + 0.5 * np.random.rand(n)

# Scatter plot of the data in 3D
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], y, color='k')
plt.show()

[Figure: 3D scatter plot of the generated data]

Next, call LinearRegression, fit the model, and plot the regression plane in 3D space.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Fit the multiple linear regression model
model = LinearRegression()
model.fit(X, y.reshape(-1, 1))

# Evaluate the fitted model on a grid over the unit square
# to visualize the regression plane
xx, yy = np.meshgrid(np.linspace(0, 1, 20), np.linspace(0, 1, 20))
model_viz = np.array([xx.flatten(), yy.flatten()]).T
predicted = model.predict(model_viz)

# Plot the regression plane (light blue) together with the data (black)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(xx.flatten(), yy.flatten(), predicted, facecolor=(0,0,0,0), s=20, edgecolor='#70b3f0')
ax.scatter(X[:, 0], X[:, 1], y, color='k')
plt.show()

[Figure: fitted regression plane with the data points]

$R^2$ is also calculated in the same way.

r2_score(y, model.predict(X))
# >> 0.9640837457111405
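
As a quick sanity check, the fitted parameters can be compared with the values used to generate the data above (these expected values are implied by the data-generating code, not reported in the original experiment): the coefficients should be close to 1 and -2, and the intercept should be close to the mean of the noise term 0.5 * Uniform(0, 1), i.e. roughly 0.25.

# Coefficients should be close to (1, -2); the intercept absorbs the
# mean of the noise term 0.5 * Uniform(0, 1), i.e. roughly 0.25
print(model.coef_, model.intercept_)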


