Stats with Python: Simple Linear Regression
March 22, 2021 | 5 min read
We’ve seen several aspects of the correlation coefficient in the previous posts. The correlation coefficient treats two variables equally; they are symmetrical. When two variables are not symmetrical, that is, when you want to explain $y$ by $x$, correlation analysis alone is not sufficient. Instead, you might want to conduct a regression analysis.
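The simplest approach, simple linear regression, considers a single explanatory variable (independent variable) $x$ for explaining the objective variable (dependent variable) $y$.

$$
y = a + bx
$$

How are the parameters $a$ and $b$ in the above equation determined? Given the paired data $\{(x_i, y_i)\}_{i=1}^{n}$, they are determined by the method of least squares. That is, they are chosen to minimize the sum of the squared errors between the predicted values $\hat{y}_i = a + bx_i$ and the actual values $y_i$:

$$
L(a, b) = \sum_{i=1}^{n} \left( y_i - (a + bx_i) \right)^2
$$

Therefore, the least squares estimates are:

$$
(\hat{a}, \hat{b}) = \mathop{\mathrm{arg\,min}}_{a,\, b} L(a, b)
$$

Let $\bar{x}$ and $\bar{y}$ be the sample means, $S_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$ and $S_y^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$ the variances, and $S_{xy} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$ the covariance. Considering $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n}(y_i - \bar{y}) = 0$,

$$
\begin{aligned}
L(a, b) &= \sum_{i=1}^{n} \left[ (y_i - \bar{y}) - b(x_i - \bar{x}) + (\bar{y} - a - b\bar{x}) \right]^2 \\
&= n \left[ S_y^2 - 2bS_{xy} + b^2 S_x^2 + (\bar{y} - a - b\bar{x})^2 \right] \\
&= n \left[ S_x^2 \left( b - \frac{S_{xy}}{S_x^2} \right)^2 + (\bar{y} - a - b\bar{x})^2 + S_y^2 - \frac{S_{xy}^2}{S_x^2} \right]
\end{aligned}
$$

From the above calculation, $L(a, b)$ takes its minimum value when $b = S_{xy} / S_x^2$ and $a = \bar{y} - b\bar{x}$.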
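As a quick check of the closed-form result, here is a minimal NumPy sketch that computes the slope as the covariance of x and y divided by the variance of x, and the intercept as the mean of y minus the slope times the mean of x. The function name `least_squares_fit` and the demo data are illustrative assumptions, not part of the original post; `np.polyfit` is used only as an independent reference.

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (intercept, slope) of the least squares line y = intercept + slope * x.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope = covariance / variance of x (the 1/n factors cancel)
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept follows from the fact that the line passes through the means
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Demo data generated the same way as in this post (seed chosen arbitrarily)
rng = np.random.default_rng(0)
x_demo = rng.random(100)
y_demo = x_demo + 0.5 * rng.random(100)

intercept, slope = least_squares_fit(x_demo, y_demo)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope_ref, intercept_ref = np.polyfit(x_demo, y_demo, deg=1)
print(np.isclose(slope, slope_ref), np.isclose(intercept, intercept_ref))
```

The two approaches should agree up to floating-point error, since a degree-1 polynomial fit solves the same least squares problem.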
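Now we have the predicted values $\hat{y}_i = \hat{a} + \hat{b}x_i$. How good are these predictions? To evaluate the goodness of fit, the coefficient of determination $R^2$ is frequently used.

$$
R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
$$

The coefficient of determination is the ratio of ESS (explained sum of squares) to TSS (total sum of squares). As you may imagine from its notation, the coefficient of determination $R^2$ is the square of the Pearson correlation coefficient $r$.

Using the equation $\hat{y}_i - \bar{y} = \hat{b}(x_i - \bar{x})$, which follows from $\hat{a} = \bar{y} - \hat{b}\bar{x}$, together with the least squares slope $\hat{b} = \sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \big/ \sum_{i=1}^{n}(x_i - \bar{x})^2$,

$$
R^2 = \frac{\hat{b}^2 \sum_{i=1}^{n} (x_i - \bar{x})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
= \frac{\left[ \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \right]^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}
= r^2
$$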
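The identity can also be checked numerically without any regression library. The sketch below computes the ratio of ESS to TSS by hand and compares it with the squared Pearson correlation from `np.corrcoef`. The function name `ess_over_tss` and the sample data are assumptions made for illustration.

```python
import numpy as np

def ess_over_tss(x, y):
    """Coefficient of determination as ESS / TSS for a least squares line.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Closed-form least squares estimates
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    y_hat = intercept + slope * x
    ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
    tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
    return ess / tss

# Arbitrary correlated sample (seed chosen only for reproducibility)
rng = np.random.default_rng(1)
x_chk = rng.random(100)
y_chk = x_chk + 0.5 * rng.random(100)

r_squared = ess_over_tss(x_chk, y_chk)
pearson_r = np.corrcoef(x_chk, y_chk)[0, 1]
print(np.isclose(r_squared, pearson_r ** 2))
```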
Lastly, let’s confirm that $R^2 = r^2$ while introducing how to use linear regression with Python. As in the previous post, I generated 100 pairs of correlated random samples (x and y).
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    sns.set_style("darkgrid")

    # 100 pairs of correlated random samples
    n = 100
    x = np.random.rand(n)
    y = x + 0.5*np.random.rand(n)
Scikit-learn implements linear regression as LinearRegression and the coefficient of determination as r2_score. After fitting the model, we can plot the regression line as shown below:
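    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import r2_score

    # scikit-learn expects a 2-D feature array, hence the reshape
    model = LinearRegression()
    model.fit(x.reshape(-1, 1), y.reshape(-1, 1))

    # Pass x and y as keywords; recent seaborn versions reject positional arguments
    sns.scatterplot(x=x, y=y)
    plt.plot(x, model.predict(x.reshape(-1, 1)), color="k")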
As expected, $R^2 = r^2$ is confirmed.
    r2_score(y, model.predict(x.reshape(-1, 1)))
    # >> 0.7922606713476185
    np.corrcoef(x, y)[0,1]**2
    # >> 0.7922606713476184
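One detail worth noting: scikit-learn's r2_score is defined through the residuals, as 1 minus RSS (residual sum of squares) over TSS, rather than as ESS over TSS. For a least squares line with an intercept the two definitions coincide, because TSS = ESS + RSS in that case. The NumPy-only sketch below illustrates this on assumed sample data; the variable names are illustrative.

```python
import numpy as np

# Arbitrary correlated sample (seed chosen only for reproducibility)
rng = np.random.default_rng(42)
x_s = rng.random(100)
y_s = x_s + 0.5 * rng.random(100)

# Least squares line; np.polyfit returns [slope, intercept] for deg=1
slope_s, intercept_s = np.polyfit(x_s, y_s, deg=1)
y_pred = intercept_s + slope_s * x_s

tss = np.sum((y_s - y_s.mean()) ** 2)     # total sum of squares
ess = np.sum((y_pred - y_s.mean()) ** 2)  # explained sum of squares
rss = np.sum((y_s - y_pred) ** 2)         # residual sum of squares

r2_from_residuals = 1.0 - rss / tss  # the residual-based definition
r2_from_explained = ess / tss        # the definition used in this post
print(np.isclose(r2_from_residuals, r2_from_explained))
```

The equality is specific to least squares fits with an intercept; for arbitrary predictions the residual-based value can differ from ESS over TSS and can even be negative.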
Hiroshi Kurata, Takahiro Hoshino. 入門統計解析 (Introduction to Statistical Analysis), Chapter 3. Shinseisha. 2009.
Written by Shion Honda.