Stats with Python: Simple Linear Regression
March 22, 2021 | 5 min read
We’ve seen several aspects of the correlation coefficient in the previous posts. The correlation coefficient treats the two variables symmetrically. When the relationship is not symmetrical — that is, when you want to explain $y$ by $x$ — correlation analysis alone is not sufficient. Instead, you might want to conduct a regression analysis.
The simplest approach, simple linear regression, considers a single explanatory variable (independent variable) $x$ for explaining the objective variable (dependent variable) $y$:

$$ y = \beta_0 + \beta_1 x + \varepsilon, $$

where $\varepsilon$ is an error term.
Least Square Estimates
How do we determine the parameters $\beta_0$ and $\beta_1$ in the above equation? Given the paired data $\{(x_i, y_i)\}_{i=1}^{n}$, they are determined by the method of least squares. That is, they are chosen to minimize the sum of the squared errors between the predicted $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and the actual $y_i$:

$$ L(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2. $$
Therefore, the least squares estimates are:

$$ \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, $$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$, $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$, and $S_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$.
Proof

Setting the partial derivatives of $L$ to zero,

$$ \frac{\partial L}{\partial \beta_0} = -2 \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i) = 0 \;\Longrightarrow\; \beta_0 = \bar{y} - \beta_1 \bar{x}, $$

$$ \frac{\partial L}{\partial \beta_1} = -2 \sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0. $$

Substituting $\beta_0 = \bar{y} - \beta_1 \bar{x}$ into the second equation gives

$$ \sum_{i=1}^{n} x_i \bigl( (y_i - \bar{y}) - \beta_1 (x_i - \bar{x}) \bigr) = 0. $$

Considering $\sum_{i=1}^{n} \bar{x}(y_i - \bar{y}) = 0$ and $\sum_{i=1}^{n} \bar{x}(x_i - \bar{x}) = 0$,

$$ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) - \beta_1 \sum_{i=1}^{n} (x_i - \bar{x})^2 = S_{xy} - \beta_1 S_{xx} = 0. $$

From the above calculation, $L$ takes its minimum value when $\beta_1 = S_{xy}/S_{xx}$ and $\beta_0 = \bar{y} - \beta_1 \bar{x}$. $\blacksquare$
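The closed-form estimates can be computed directly in NumPy. Below is a minimal sketch (the variable names and the synthetic data are my own, not from the post), cross-checked against `np.polyfit`, which fits the same degree-1 least squares problem:

```python
import numpy as np

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = rng.random(100)
y = x + 0.5 * rng.random(100)

x_bar, y_bar = x.mean(), y.mean()
s_xy = np.sum((x - x_bar) * (y - y_bar))  # S_xy
s_xx = np.sum((x - x_bar) ** 2)           # S_xx

# Closed-form least squares estimates.
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Cross-check against NumPy's degree-1 polynomial fit.
slope, intercept = np.polyfit(x, y, 1)
print(np.allclose([beta1_hat, beta0_hat], [slope, intercept]))  # True
```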
Coefficient of Determination
Now we have the predicted values $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. How good are these predictions? To evaluate the goodness of fit, the coefficient of determination $R^2$ is frequently used.

The coefficient of determination is the ratio of ESS (explained sum of squares) to TSS (total sum of squares):

$$ R^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}. $$

As you may imagine from its notation, the coefficient of determination $R^2$ is the square of the Pearson correlation coefficient $r$.
Proof
Using the equation $\hat{y}_i - \bar{y} = \hat{\beta}_1 (x_i - \bar{x})$ and writing $S_{yy} = \sum_{i=1}^{n} (y_i - \bar{y})^2$,

$$ R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{\hat{\beta}_1^2 S_{xx}}{S_{yy}} = \left( \frac{S_{xy}}{S_{xx}} \right)^2 \frac{S_{xx}}{S_{yy}} = \frac{S_{xy}^2}{S_{xx} S_{yy}} = r^2. \qquad \blacksquare $$
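The identity $R^2 = r^2$ can also be checked numerically without any library beyond NumPy. This is a sketch with made-up data and variable names of my own choosing:

```python
import numpy as np

# Synthetic data, for illustration only.
rng = np.random.default_rng(42)
x = rng.random(100)
y = x + 0.5 * rng.random(100)

# Closed-form least squares fit.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_hat = beta0 + beta1 * x

ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
r_squared = ess / tss

r = np.corrcoef(x, y)[0, 1]            # Pearson correlation coefficient
print(np.isclose(r_squared, r ** 2))   # True
```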
Experiment
Lastly, let’s confirm that $R^2 = r^2$, introducing how to use linear regression with Python. As done in the previous post, I generated 100 pairs of correlated random samples ($x$ and $y$):
```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

n = 100
x = np.random.rand(n)
y = x + 0.5 * np.random.rand(n)
```
Scikit-learn implements linear regression as `LinearRegression` and the coefficient of determination as `r2_score`. After fitting the model, we can plot the regression line as below:
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

model = LinearRegression()
model.fit(x.reshape(-1, 1), y)  # scikit-learn expects a 2D feature array

sns.scatterplot(x=x, y=y)  # recent seaborn requires keyword arguments
plt.plot(x, model.predict(x.reshape(-1, 1)), color="k")
plt.show()
```
As expected, $R^2 = r^2$ is confirmed:
```python
r2_score(y, model.predict(x.reshape(-1, 1)))
# >> 0.7922606713476185

np.corrcoef(x, y)[0, 1] ** 2
# >> 0.7922606713476184
```
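The fitted coefficients can also be compared against the closed-form least squares estimates derived earlier. Below is a self-contained sketch (the seed and variable names are my own, not from the post):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = rng.random(100)
y = x + 0.5 * rng.random(100)

model = LinearRegression().fit(x.reshape(-1, 1), y)

# Closed-form least squares estimates for comparison.
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print(np.allclose([model.coef_[0], model.intercept_], [beta1, beta0]))  # True
```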
Written by Shion Honda. If you like this, please share!