We’ve seen several aspects of the correlation coefficient in the previous posts. The correlation coefficient treats two variables equally; they are symmetrical. When two variables are not symmetrical, that is, when you want to explain y by x, correlation analysis alone is not sufficient. Instead, you might want to conduct a regression analysis.
The simplest approach, simple linear regression, considers a single explanatory variable (independent variable) x for explaining the objective variable (dependent variable) y.
$$y = \beta_0 + \beta_1 x$$
Least Squares Estimates
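How are the parameters $\beta_0$ and $\beta_1$ in the above equation determined? Given the paired data $\{(x_i, y_i)\}_{i=1}^{n}$, they are estimated by the method of least squares. That is, the estimates $b_0$ and $b_1$ are chosen to minimize the sum of squared errors between the predicted values $\hat{y}_i = b_0 + b_1 x_i$ and the actual values $y_i$:
$$L(b_0, b_1) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2.$$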
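Setting the partial derivatives of $L(b_0, b_1)$ with respect to $b_0$ and $b_1$ to zero shows that $L$ takes its minimum value when $b_1 = S_{xy} / S_x^2$ and $b_0 = \bar{y} - b_1 \bar{x}$, where $S_{xy}$ is the covariance of $x$ and $y$ and $S_x^2$ is the variance of $x$.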
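The closed-form estimates can be checked numerically. The following is a minimal sketch, not the author's code: the helper name `least_squares_fit`, the demo arrays, and the seed are all assumptions made for illustration, and the covariance and variance use the $1/n$ normalization (any consistent normalization gives the same ratio).

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (b0, b1) minimizing the sum of squared errors of y ≈ b0 + b1 * x.

    Hypothetical helper for illustration; not part of any library.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    s_xy = np.mean((x - x_mean) * (y - y_mean))  # covariance of x and y (1/n)
    s_x2 = np.mean((x - x_mean) ** 2)            # variance of x (1/n)
    b1 = s_xy / s_x2                             # slope
    b0 = y_mean - b1 * x_mean                    # intercept
    return b0, b1

# Demo data (assumed): correlated uniform samples with an arbitrary seed
rng = np.random.default_rng(0)
x_demo = rng.random(100)
y_demo = x_demo + 0.5 * rng.random(100)

b0, b1 = least_squares_fit(x_demo, y_demo)

# np.polyfit returns coefficients from the highest degree down: [slope, intercept]
slope, intercept = np.polyfit(x_demo, y_demo, deg=1)
print(f"closed form: b0={b0:.6f}, b1={b1:.6f}")
print(f"np.polyfit : b0={intercept:.6f}, b1={slope:.6f}")
```

Both lines should print the same values up to floating-point error, since `np.polyfit` with `deg=1` solves the same least squares problem.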
Coefficient of Determination
Now we have the predicted values $\{\hat{y}_i\}_{i=1}^{n}$. How good are these predictions? The coefficient of determination $R^2$ is frequently used to evaluate the goodness of fit.
$$R^2 := \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.$$
The coefficient of determination is the ratio of ESS (explained sum of squares) to TSS (total sum of squares). As its notation suggests, in simple linear regression the coefficient of determination $R^2$ equals the square of the Pearson correlation coefficient $r$.
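One way to see why this holds is sketched below, using the least squares estimates from the previous section. The symbol $S_y^2$ (the variance of $y$) is introduced here for the argument, and $S_{xy}$, $S_x^2$, and $S_y^2$ are all assumed to use the same $1/n$ normalization.

$$
\begin{aligned}
\hat{y}_i - \bar{y} &= b_0 + b_1 x_i - \bar{y} = b_1 (x_i - \bar{x}) && \text{since } b_0 = \bar{y} - b_1 \bar{x}, \\
\mathrm{ESS} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 = b_1^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 = n\, b_1^2 S_x^2, \\
\mathrm{TSS} &= \sum_{i=1}^{n} (y_i - \bar{y})^2 = n\, S_y^2, \\
R^2 &= \frac{b_1^2 S_x^2}{S_y^2} = \frac{S_{xy}^2}{S_x^2 S_y^2} = \left( \frac{S_{xy}}{S_x S_y} \right)^2 = r^2 && \text{since } b_1 = S_{xy} / S_x^2.
\end{aligned}
$$

The argument relies on the model having an intercept and a single explanatory variable; it does not carry over unchanged to other settings.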
Lastly, let’s confirm that $R^2 = r^2$ while introducing how to run linear regression in Python. As in the previous post, I generated 100 pairs of correlated random samples (x and y).
```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")

n = 100
x = np.random.rand(n)            # uniform samples on [0, 1)
y = x + 0.5 * np.random.rand(n)  # x plus uniform noise, so x and y are correlated
```
Scikit-learn implements linear regression as LinearRegression and the coefficient of determination as r2_score. After fitting the model, we can plot the regression line as follows:
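```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
model = LinearRegression()
model.fit(x.reshape(-1, 1), y.reshape(-1, 1))

# Scatter plot of the data with the fitted regression line drawn on top.
# x and y are passed as keywords; recent seaborn versions no longer accept them positionally.
sns.scatterplot(x=x, y=y)
plt.plot(x, model.predict(x.reshape(-1, 1)), color="k")
```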
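The comparison of $R^2$ with $r^2$ itself is not shown in this section, so here is a minimal sketch of one way it could be done. It regenerates the data so the block runs on its own; the seeded generator and the variable names `y_pred`, `r2`, and `r` are assumptions, not taken from the original post.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Regenerate correlated samples (assumed setup; the seed is arbitrary)
rng = np.random.default_rng(0)
n = 100
x = rng.random(n)
y = x + 0.5 * rng.random(n)

# Fit simple linear regression and get the predicted values
model = LinearRegression()
model.fit(x.reshape(-1, 1), y)
y_pred = model.predict(x.reshape(-1, 1))

# Coefficient of determination from scikit-learn
r2 = r2_score(y, y_pred)

# Pearson correlation coefficient: off-diagonal entry of the 2x2 correlation matrix
r = np.corrcoef(x, y)[0, 1]

print(f"R^2 = {r2:.6f}")
print(f"r^2 = {r ** 2:.6f}")
```

The two printed values should agree up to floating-point error. Note that `r2_score` takes the true values first and the predictions second.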