Linear Regression Using Normal Equations and Polynomial Regression.
Linear Regression is a machine learning algorithm based on supervised learning. It performs a regression task. It is mostly used for finding out the relationship between variables and for forecasting. This blog gives a brief idea of two regression algorithms, linear and polynomial regression, and of how the linear model is derived mathematically using the normal equations.
Data on two variables recorded simultaneously for a group of individuals is called bivariate data. Examples of bivariate data are the heights and weights of the students in a class, the rainfall and the yield of paddy in a state for several consecutive years, etc.
When we have bi-variate data, we can, no doubt, consider the values of each variable separately to know the different measures like the mean and standard deviation of the variable; but here we are mainly concerned with two other problems.
Firstly, we want to study the nature and extent of association, if any, between the variables.
Secondly, if the variables are found to be associated, we express one of them (regarded as the dependent variable) as a mathematical function of the other (considered as the independent variable), so that we can predict the value of the dependent variable when the value of the independent variable is known.
The first problem is called correlation analysis and the second, regression analysis.
To find the relationship between continuous correlated variables we use linear regression. Linear regression looks for a statistical relationship between a set of correlated values. The representation is a linear equation that combines a specific set of input values (x), the solution to which is the predicted output for that set of input values (y). As such, both the input values (x) and the output value (y) are numeric. When there is a single input variable (x), the method is referred to as simple linear regression. When there are multiple input variables, the method is referred to as multiple linear regression.
Derivation of linear regression equation:
Let the linear regression equation of y on x be
y = a + bx
Since we would like to use this equation for prediction purposes, the constants a and b have to be estimated on the basis of observed values of x and y. Suppose we are given n pairs of values, (xi, yi), i = 1(1)n, of x and y. From among different methods that are available for the determination of a and b, we use the method of least squares which has many desirable properties.
When x = xi, the observed value of y is yi and the predicted value of y is a + bxi. So,
ei = yi - (a + bxi)
This ei is the error in taking a + bxi for yi. This is called the error of estimation. The method of least squares requires that a and b be so determined that
∑ei² = ∑(yi - a - bxi)²
is a minimum. Setting the partial derivatives of this sum with respect to a and b equal to zero gives the normal equations
∑yi = na + b∑xi
∑xiyi = a∑xi + b∑xi²
Whence we get
b = r (sy / sx) and a = ȳ - b x̄,
where x̄ and ȳ are the means, sx and sy the standard deviations of x and y, and r the correlation coefficient between them. The quantity r (sy / sx), usually denoted by byx, is called the regression coefficient of y on x. It gives the increment in y for a unit increase in x.
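To make the result concrete, here is a minimal NumPy sketch (the data values below are made up purely for illustration) that computes a and byx from these formulas and checks them against np.polyfit:

import numpy as np

# toy data, purely illustrative
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()

# slope byx = r * (sy / sx), which simplifies to cov(x, y) / var(x)
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar          # intercept: a = y_bar - b * x_bar
print(a, b)

# np.polyfit with degree 1 returns [slope, intercept]; it should agree
print(np.polyfit(x, y, 1))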
Modelling Simple Linear Regression
The very first step is to import the libraries.
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn import metrics
import statsmodels.api as sm
import matplotlib.pyplot as plt
%matplotlib inline
We import a data set having (x, y) pairs of values.
df = pd.read_csv('test.csv', index_col=False)
We use matplotlib, a popular Python plotting library, to make a scatter plot.
plt.figure(figsize=(16, 8))
plt.scatter(df['x'], df['y'], c='black')
plt.xlabel("x")
plt.ylabel("y")
plt.show()
As you can see, there is a clear relationship between the variables ‘x’ and ‘y’.
Now we will focus on getting a linear approximation of the data.
X = df['x'].values.reshape(-1, 1)
y = df['y'].values.reshape(-1, 1)
reg = LinearRegression()
reg.fit(X, y)
print("The linear model is: Y = {:.5} + {:.5}X".format(reg.intercept_[0], reg.coef_[0][0]))

This prints:

The linear model is: Y = -0.46181 + 1.0143X
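With the model fitted, predicting y for a new input is a single call to reg.predict. The input value below is arbitrary and only meant to illustrate the call:

# predict y for a new x value (10.0 is an arbitrary, illustrative input)
new_x = np.array([[10.0]])
print(reg.predict(new_x))  # roughly -0.46181 + 1.0143 * 10 for this particular fit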
Next, we visualize how the line fits the data.
predictions = reg.predict(X)
plt.figure(figsize=(16, 8))
plt.scatter(df['x'], df['y'], c='black')
plt.plot(df['x'], predictions, c='blue', linewidth=2)
plt.xlabel("x")
plt.ylabel("y")
plt.show()
How relevant is my model?
The relevancy of the model is judged by the R² value. The R² metric measures the proportion of variability in the target y that can be explained using the feature X. Therefore, assuming a linear relationship, if the feature X explains (predicts) the target well, the proportion is high and the R² value will be close to 1. If the opposite is true, the R² value is closer to 0.
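As a quick illustration of the definition, here is a small sketch (the arrays are hypothetical) that computes R² as one minus the ratio of the residual sum of squares to the total sum of squares, and compares it with sklearn's r2_score:

import numpy as np
from sklearn.metrics import r2_score

# illustrative observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variability
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variability around the mean
print(1 - ss_res / ss_tot)                        # manual R²
print(r2_score(y_true, y_pred))                   # should match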
Here is how this is done using statsmodels.
X = df['x']
y = df['y']
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())
These lines of code print a regression summary, which includes the R² value for the fitted model.
R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 0% indicates that the model explains none of the variability of the response data around its mean.
In this case, an R² value of 0.989 indicates that about 99% of the variability of ‘y’ is explained by ‘x’.
Linear Regression works on data where the dependent and independent variables have a linear relationship.
But when the data do not have a linear relationship and instead exhibit a more complex one, Polynomial Regression is used.
What is Polynomial Regression?
Polynomial Regression is a form of linear regression in which the relationship between the independent variable x and dependent variable y is modeled as an nth degree polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y |x).
Polynomial Regression is used to overcome the problems of underfitting of data found in simple linear regression.
The linear equation used earlier:
y = a + bx
is now converted to:
y = a + bx + cx²
This is still considered to be a linear model, as the coefficients/weights associated with the features are still linear; x² is only another feature. However, the curve that we are fitting is quadratic in nature.
The equation of Polynomial Regression can be generalized (up to the nth degree) as:
y = a + b1x + b2x² + … + bnxⁿ
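One way to see that this is still a linear model is to look at what the feature expansion does: each power of x simply becomes a new column, and ordinary linear regression is then fit on those columns. A small sketch (with made-up values) using scikit-learn's PolynomialFeatures:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[1.0], [2.0], [3.0]])   # illustrative single-feature input
poly = PolynomialFeatures(degree=3)
print(poly.fit_transform(x))
# each row is [1, x, x², x³]; the model stays linear in the coefficients
# [[ 1.  1.  1.  1.]
#  [ 1.  2.  4.  8.]
#  [ 1.  3.  9. 27.]]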
Modeling Polynomial Regression
To understand polynomial regression, we first generate a data set. The following code is used to generate a random set of values.
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-5, 5, 20)
plt.scatter(x, y, s=10)
plt.show()
First we apply the linear regression model; this step gives us an idea of the drawback of using a linear regression model in this case.
x = x[:, np.newaxis]
y = y[:, np.newaxis]
model = LinearRegression()
model.fit(x, y)
y_pred = model.predict(x)
plt.scatter(x, y, s=10)
plt.plot(x, y_pred, color='r')
plt.show()
We observe a case of under-fitting of the data here. The R² value is also calculated and found to be 0.605. To overcome this drawback of under-fitting, we increase the model complexity and thereby aim at fitting a higher-order equation.
To convert the original features into their higher order terms we will use the PolynomialFeatures class provided by scikit-learn. Next, we train the model using Linear Regression.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error
import operator

# x and y are already column vectors from the previous step
polynomial_features = PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(x)

model = LinearRegression()
model.fit(x_poly, y)
y_poly_pred = model.predict(x_poly)

rmse = np.sqrt(mean_squared_error(y, y_poly_pred))
r2 = r2_score(y, y_poly_pred)
print(rmse)
print(r2)

plt.scatter(x, y, s=10)
# sort the (x, prediction) pairs by x before drawing the line plot
sort_axis = operator.itemgetter(0)
sorted_zip = sorted(zip(x, y_poly_pred), key=sort_axis)
x_sorted, y_poly_pred_sorted = zip(*sorted_zip)
plt.plot(x_sorted, y_poly_pred_sorted, color='m')
plt.show()
Similarly, graphs for degree 3 and an arbitrarily chosen degree 20 are also plotted.
While observing these graphs, the key question that comes to mind is: which curve is the best fit?
Degree 2 does solve the problem of the underfitting of data better than a simple linear regression model. However, the R² value can be improved even more.
Degree 3 covers more of the data points than degree 2. The degree-3 curve is the best fit in this case, with low variance and low bias.
Degree 20 covers most of the data points. However, this is a case of overfitting of data. Thereby, it will fail to generalize on unseen data.
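As a quick sanity check, we can hold out some of the generated points with train_test_split and compare training and test R² for a few degrees; a large gap between the two is a sign of overfitting. A minimal sketch (the degrees and split parameters are arbitrary choices for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# regenerate the same toy data as above so the sketch is self-contained
np.random.seed(0)
x = 2 - 3 * np.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-5, 5, 20)
x = x[:, np.newaxis]
y = y[:, np.newaxis]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)

for degree in (1, 2, 3, 20):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    # a training R² far above the test R² is the signature of overfitting
    print(degree, model.score(x_train, y_train), model.score(x_test, y_test))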
To prevent over-fitting, we can add more training samples so that the algorithm doesn’t learn the noise in the system and can become more generalized.
To understand the best fit line, Bias vs Variance Trade-off must be understood.
Bias refers to the simplifying assumptions made by a model to make the target function easier to learn, and variance is the amount by which the estimate of the target function would change if different training data were used.
The goal of any supervised machine learning algorithm is to achieve low bias and low variance. In turn, the algorithm should achieve good prediction performance.
For detailed codes, head over to my Github Repository on Regression!