This blog post is designed to help you:
- Understand the concepts of simple and multiple linear regression
- Understand the equations and calculations behind linear regression
- Understand evaluation metrics
- Get a brief overview of other forms of regression and of topics that will be covered in subsequent posts
The objectives of simple linear regression are to assess the significance of the predictor variable in explaining the variability or behavior of the response variable and to predict the values of the response variable given the values of the predictor variable.
The response variable is the variable of primary interest for which we want to build the model.
The predictor variable is used to explain the variability in the response variable.
The relationship between the response variable and the predictor variable can be characterized by the equation Y = β0 + β1X + ε where
- Y response variable
- X predictor variable
- β0 intercept parameter, which corresponds to the value of the response variable when the predictor is 0
- β1 slope parameter, which corresponds to the magnitude of change in the response variable given a one unit change in the predictor variable
- ε error term representing deviations of Y about β0 + β1X
- Because our goal in simple linear regression is usually to characterize the relationship between the response and predictor variables in our population, we begin with a sample of data.
- From this sample, we estimate the unknown population parameters (β0, β1) that define the assumed relationship between our response and predictor variables.
- Estimates of the unknown population parameters β0 and β1 are obtained by the method of least squares. This method provides the estimates by determining the line that minimizes the sum of the squared vertical distances between the observations and the fitted line. In other words, the fitted or regression line is as close as possible to all the data points.
- The method of least squares produces parameter estimates with certain optimum properties.
- If the assumptions of simple linear regression are valid, the least squares estimates are unbiased estimates of the population parameters and have minimum variance.
- The least squares estimators are often called BLUE (Best Linear Unbiased Estimators). The term best is used because of the minimum variance property. Because of these optimum properties, the method of least squares is used by many data analysts to investigate the relationship between continuous predictor and response variables.
- With a large and representative sample, the fitted regression line should be a good approximation of the relationship between the response and predictor variables in the population. The estimated parameters obtained using the method of least squares should be good approximations of the true population parameters.
- To determine whether the predictor variable explains a significant amount of variability in the response variable, the simple linear regression model is compared to the baseline model.
- The fitted regression line in a baseline model is a horizontal line across all values of the predictor variable. The slope of the regression line is 0 and the intercept is the sample mean of the response variable, Ȳ. In a baseline model, there is no association between the response variable and the predictor variable: knowing the mean of the response variable is as good at predicting the response variable as knowing the values of the predictor variable.
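As a quick illustration (the data below are made up for this sketch), the least-squares estimates can be computed directly from the closed-form formulas, and the fitted line compared against the baseline mean model:

```python
# Least-squares fit of a simple linear regression, compared to the
# baseline (horizontal mean) model. The data points are made up.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 4.3, 5.9, 8.2, 9.8]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Closed-form least-squares estimates for simple linear regression
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

# Sum of squared errors for the fitted line vs. the baseline line
sse_fit = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sse_base = sum((y - y_bar) ** 2 for y in ys)

print(b0, b1)              # fitted intercept and slope
print(sse_fit < sse_base)  # the fitted line does at least as well
```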
Before we move into the further mathematics behind linear regression, let us quickly review the assumptions behind a linear model.
Assumptions of linear regression
There are four principal assumptions which justify the use of linear regression models for purposes of inference or prediction:
- linearity and additivity of the relationship between dependent and independent variables:
- The expected value of the dependent variable is a straight-line function of each independent variable, holding the others fixed.
- The slope of that line does not depend on the values of the other variables.
- The effects of different independent variables on the expected value of the dependent variable are additive.
- statistical independence of the errors (in particular, no correlation between consecutive errors in the case of time series data)
- homoscedasticity (constant variance) of the errors
- versus time (in the case of time series data)
- versus the predictions
- versus any independent variable
- normality of the error distribution.
Violation of linearity
Violations of linearity or additivity are extremely serious: if a linear model is fit to data which are nonlinearly or non-additively related, predictions are likely to be seriously in error, especially when extrapolations are beyond the range of the sample data.
How to diagnose: nonlinearity is usually most evident in a plot of observed versus predicted values or a plot of residuals versus predicted values, which are part of standard regression output. The points should be symmetrically distributed around a diagonal line in the former plot, or around a horizontal line in the latter plot, with roughly constant variance. (The residual-versus-predicted plot is better than the observed-versus-predicted plot for this purpose, because it eliminates the visual distraction of a sloping pattern.) Look carefully for evidence of a “bowed” pattern, indicating that the model makes systematic errors whenever it is making unusually large or small predictions.
How to fix: Consider applying a nonlinear transformation to the dependent and/or independent variables. For example, if the data are strictly positive, the log transformation is an option. If a log transformation is applied to the dependent variable only, this is equivalent to assuming that it grows (or decays) exponentially as a function of the independent variables. If a log transformation is applied to both the dependent variable and the independent variables, this is equivalent to assuming that the effects of the independent variables are multiplicative rather than additive in their original units.
Violation of independence of errors
Violations of independence are potentially very serious in time series regression models. Serial correlation (also known as “autocorrelation”) is sometimes a byproduct of a violation of the linearity assumption, as in the case of a simple (i.e., straight) trend line fitted to data which are growing exponentially over time.
Independence can also be violated in non-time-series models if errors tend to always have the same sign under specific conditions, i.e., if the model systematically underpredicts or overpredicts what will happen when the independent variables have a specific configuration.
How to diagnose: the best test for serial correlation is to look at a residual time series plot (residuals vs. row number) and a table or plot of residual autocorrelations. The Durbin-Watson statistic provides a test for significant residual autocorrelation at lag 1: the DW statistic is approximately equal to 2(1 − a), where a is the lag-1 residual autocorrelation, so ideally it should be close to 2.0, say between 1.4 and 2.6 for a sample size of 50.
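The Durbin-Watson statistic is easy to compute directly from a residual series. A minimal sketch, using made-up residuals whose alternating signs produce negative serial correlation and therefore a DW above 2:

```python
# Durbin-Watson statistic from a residual series (made-up residuals).
residuals = [0.5, -0.3, 0.8, -0.6, 0.2, -0.4, 0.7, -0.1]

# DW = sum of squared successive differences / sum of squared residuals
num = sum((residuals[i] - residuals[i - 1]) ** 2
          for i in range(1, len(residuals)))
den = sum(e ** 2 for e in residuals)
dw = num / den

# DW ≈ 2(1 - a), where a is the lag-1 autocorrelation: values near 2
# suggest no serial correlation; the alternating signs here push DW above 2.
print(dw)
```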
How to fix: Minor cases of positive serial correlation (say, lag-1 residual autocorrelation in the range 0.2 to 0.4, or a Durbin-Watson statistic between 1.2 and 1.6) indicate that there is some room for fine-tuning in the model. Consider adding lags of the dependent variable and/or lags of some of the independent variables.
Violation of homoscedasticity
Violations of homoscedasticity (which are called “heteroscedasticity”) make it difficult to gauge the true standard deviation of the forecast errors, usually resulting in confidence intervals that are too wide or too narrow. If the variance of the errors is increasing over time, confidence intervals for out-of-sample predictions will tend to be unrealistically narrow. Heteroscedasticity may also have the effect of giving too much weight to a small subset of the data (namely the subset where the error variance was largest) when estimating coefficients.
How to diagnose: Look at a plot of residuals versus predicted values and, in the case of time series data, a plot of residuals versus time. Be alert for evidence of residuals that grow larger either as a function of time or as a function of the predicted value. To be really thorough, you should also generate plots of residuals versus independent variables to look for consistency there as well. What you hope not to see are errors that systematically get larger in one direction by a significant amount.
How to fix: If the dependent variable is strictly positive and if the residual-versus-predicted plot shows that the size of the errors is proportional to the size of the predictions (i.e., if the errors seem consistent in percentage rather than absolute terms), a log transformation applied to the dependent variable may be appropriate. In time series models, heteroscedasticity often arises due to the effects of inflation and/or real compound growth. Some combination of logging and/or deflating will often stabilize the variance in this case.
Violation of normality
Sometimes the error distribution is “skewed” by the presence of a few large outliers. Since parameter estimation is based on the minimization of squared error, a few extreme observations can exert a disproportionate influence on parameter estimates. Calculation of confidence intervals and various significance tests for coefficients are all based on the assumption of normally distributed errors. If the error distribution is significantly non-normal, confidence intervals may be too wide or too narrow.
How to diagnose: the best test for normally distributed errors is a normal probability plot or normal quantile plot of the residuals. These are plots of the fractiles of the error distribution versus the fractiles of a normal distribution having the same mean and variance. If the distribution is normal, the points on such a plot should fall close to the diagonal reference line. A bow-shaped pattern of deviations from the diagonal indicates that the residuals have excessive skewness (i.e., they are not symmetrically distributed, with too many large errors in one direction). An S-shaped pattern of deviations indicates that the residuals have excessive kurtosis, i.e., there are either too many or too few large errors in both directions.
There are also a variety of statistical tests for normality, including the Kolmogorov-Smirnov test, the Shapiro-Wilk test, the Jarque-Bera test, and the Anderson-Darling test.
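The idea behind the normal quantile plot can be sketched without any plotting: compare sorted residuals against the quantiles of a normal distribution with the same mean and standard deviation, and summarize how close the points are to the diagonal with a correlation. This is an informal sketch using only the standard library, not a substitute for the formal tests above:

```python
import random
from statistics import NormalDist, mean, stdev

# Simulated residuals (normal by construction, seeded for reproducibility)
random.seed(0)
residuals = sorted(random.gauss(0, 1) for _ in range(200))

# Theoretical quantiles of a normal with the same mean and sd
nd = NormalDist(mean(residuals), stdev(residuals))
n = len(residuals)
theoretical = [nd.inv_cdf((i + 0.5) / n) for i in range(n)]

# Pearson correlation between observed and theoretical quantiles;
# values close to 1 are consistent with normal errors
mx, my = mean(residuals), mean(theoretical)
num = sum((a - mx) * (b - my) for a, b in zip(residuals, theoretical))
den = (sum((a - mx) ** 2 for a in residuals) *
       sum((b - my) ** 2 for b in theoretical)) ** 0.5
r = num / den
print(r)
```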
How to fix: violations of normality often arise either because (a) the distributions of the dependent and/or independent variables are themselves significantly non-normal, and/or (b) the linearity assumption is violated. In such cases, a nonlinear transformation of variables might cure both problems. In the case of the two normal quantile plots referred to above, the second model was obtained by applying a natural log transformation to the variables in the first one.
Mathematics behind regression
We estimate the unknown population parameters (β0, β1) that define the assumed relationship between our response and predictor variables.
Estimates of the unknown population parameters β0 and β1 are obtained by the method of least squares.
This method provides the estimates by determining the line that minimizes the sum of the squared vertical distances between the observations and the fitted line.
The fitted or regression line is as close as possible to all the data points.
Method of Ordinary Least Squares (OLS)
The estimates b0 and b1 are the values that minimize the sum of the squared residuals.
Are there any advantages of minimizing the squared errors?
- Why don’t we take the sum?
- Why don’t we take absolute values instead?
Suppose we have three candidate lines (y=2.3x+4, y=1.8x+3.5 and y=2x+8) for the relationship between y and x.
The table below calculates the error of each data point and the total error value (E) using three methods: sum of errors, sum of absolute errors, and sum of squared errors.
Sum of all errors (∑error) leads to cancellation of positive and negative errors
In ∑error^2, we penalize large errors much more than in ∑|error|. Two of the equations have almost the same value of ∑|error|, whereas their values of ∑error^2 differ significantly.
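The original table comes from an image, so the sketch below recomputes the three error measures for the three lines on a set of assumed data points. It shows how signed errors can cancel while absolute and squared errors cannot:

```python
# The data points here are assumed purely for illustration; the original
# post's table (from an image) used its own points.
points = [(1, 7.0), (2, 8.0), (3, 11.0), (4, 12.0), (5, 16.0)]
lines = {
    "y=2.3x+4":   lambda x: 2.3 * x + 4,
    "y=1.8x+3.5": lambda x: 1.8 * x + 3.5,
    "y=2x+8":     lambda x: 2 * x + 8,
}

for name, f in lines.items():
    errors = [y - f(x) for x, y in points]
    print(name,
          round(sum(errors), 2),                  # ∑error: signs can cancel
          round(sum(abs(e) for e in errors), 2),  # ∑|error|
          round(sum(e * e for e in errors), 2))   # ∑error²: penalizes big misses
```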
How is a line represented mathematically?
Y(x) represents the line mathematically. Since for now we have only one input feature, the equation is linear and resembles the familiar line equation “Y = mx + c”. Now we will see what effect the choice of parameter values (β0, β1) has on the line.
To best fit our data, we have to choose the parameter values such that the difference between Y(x) and y is minimized. To measure this, we define an error function:
- The error function is based on the difference Y(x) − y, where Y(x) is the fitted value and y is the actual value.
- We square the error rather than keep its sign, because some points lie above the line and some below.
- To combine the errors of all the points, we sum them.
- We then average and divide by 2 to simplify later derivative calculations; this has no effect on where the minimum lies.
We will first assume that β0 is zero, which means the line always passes through the origin.
Take the points (1,1), (2,2), (3,3), and set β0 = 0 and β1 = 1. Calculating the error, it is obviously zero.
Repeating the same calculation with β1 = 0.5 gives an error of about 0.58; that line is not a good fit to the given points.
If we compute the error for more values of β1, we trace out a curve whose minimum is at β1 = 1.
Unfortunately, we cannot always set β0 = 0: for many datasets we need an intercept, or we can never reach the best fit. When β0 is also allowed to vary, plotting the error against (β0, β1) gives a 3D surface, which is always bowl shaped (convex).
Hence our cost or error function is:

J(β0, β1) = (1/2n) ∑ᵢ (Y(xᵢ) − yᵢ)²
Our final objective is to minimize this function. This can be done using two methods mentioned below:
- Gradient Descent
- Normal Equations
Imagine we are standing at the top of a hill and look 360 degrees around us. We want to take small steps in the direction that takes us downhill; the best direction is the direction of steepest descent. We then repeat these steps until we reach the ground.
This method of gradient descent is used to minimize the cost function.
Let’s understand some terms and notations as follows :
- alpha is the learning rate, which controls how big a step you take
- the derivative gives the slope of the cost function at the current value of β; whether it is positive or negative tells us whether to decrease or increase β
The derivation of the equation is given below:
Taking the partial derivatives of the cost function with respect to β0 and β1 gives the update rule for each parameter.
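A minimal gradient-descent sketch for this cost function, using the example points (1,1), (2,2), (3,3); with these points it should converge to β0 ≈ 0 and β1 ≈ 1:

```python
# Gradient descent on J = (1/2n) * sum((b0 + b1*x - y)^2)
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
n = len(points)

b0, b1 = 0.0, 0.0   # initial guesses
alpha = 0.05        # learning rate

for _ in range(5000):
    # Partial derivatives of J with respect to b0 and b1
    grad0 = sum((b0 + b1 * x - y) for x, y in points) / n
    grad1 = sum((b0 + b1 * x - y) * x for x, y in points) / n
    b0 -= alpha * grad0
    b1 -= alpha * grad1

print(round(b0, 3), round(b1, 3))
```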
While gradient descent is an iterative process, the normal equations find the optimum solution in one go. They use matrix multiplication. In our example, the first column of X is always 1, because it is multiplied by β0, which we know is the intercept.
The derivation of the normal equation uses matrix notation and matrix properties.
In matrix notation:
- β is the parameter vector
- X is the matrix of input features (with a leading column of 1s)
- the hypothesis expands to β0 + β1x1 + … + βkxk for each observation
- using these matrices, the hypothesis can be rewritten compactly as Y(x) = Xβ
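A minimal normal-equation sketch for the same example points (note the leading column of 1s for the intercept):

```python
import numpy as np

# Design matrix for the points (1,1), (2,2), (3,3); first column is all
# ones so that it multiplies the intercept β0.
X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 2.0, 3.0])

# β = (XᵀX)⁻¹ Xᵀ y
beta = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta)
```

In practice `np.linalg.lstsq` (or a QR/Cholesky solve) is preferred over explicitly inverting XᵀX, which is numerically less stable.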
Comparison Between Gradient Descent and Normal Equations
- With gradient descent we need to choose a learning rate α and initial values of β; with the normal equations we don't have to choose either.
- Gradient descent is an iterative process, while the normal equation gives the solution right away.
- The normal-equation computation becomes slow as the number of features increases (it requires inverting a k×k matrix), while gradient descent performs well even when the number of features is very large.
How do we know whether the values of b0 and b1 that we have found are meaningful?
To determine whether the predictor variable explains a significant amount of variability in the response variable, the simple linear regression model is compared to the baseline model.
The fitted regression line in a baseline model is a horizontal line across all values of the predictor variable. The slope of the regression line is 0 and the intercept is the sample mean of the response variable, Ȳ.
Explained variability is related to the difference between the regression line and the mean of the response variable. The model sum of squares (SSM) is the amount of variability explained by your model.
Unexplained variability is related to the difference between the observed values and the regression line. The error sum of squares (SSE) is the amount of variability unexplained by your model.
Total variability is related to the difference between the observed values and the mean of the response variable. The corrected total sum of squares is the sum of the explained and unexplained variability.
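The decomposition can be verified numerically: for a least-squares fit with an intercept, the explained and unexplained sums of squares add up to the total. A sketch with made-up data:

```python
# Variability decomposition SST = SSM + SSE for a least-squares fit
# (the data points are made up).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.5, 5.5, 8.5, 9.5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
     sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

ssm = sum((f - y_bar) ** 2 for f in fitted)          # explained
sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # unexplained
sst = sum((y - y_bar) ** 2 for y in ys)              # total

print(ssm, sse, sst)  # sst equals ssm + sse (up to rounding)
```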
Multiple linear regression
The values of the slope coefficients don't, by themselves, tell us anything about their significance in explaining the dependent variable.
Even an unrelated variable, when regressed, would give some value for the slope coefficient.
To exclude cases where an independent variable doesn't significantly explain the dependent variable, we need hypothesis tests on the coefficients to check whether they contribute significantly.
The t-statistic is used to check the significance of the coefficients.
The t-statistic used here is the same as that used in the hypothesis test of the coefficient in simple linear regression.
Following are the null and alternative hypotheses for checking the statistical significance of bk:
- Null hypothesis (H0): bk = 0
- Alternative hypothesis (Ha): bk ≠ 0
- The t-statistic for testing bk, computed as the estimate of bk divided by its standard error, has (n−k−1) degrees of freedom.
In simple linear regression, the dependent variable was assumed to be dependent on only one variable (independent variable)
In the general multiple linear regression model, the dependent variable derives its value from two or more independent variables.
The general multiple linear regression model takes the following form:
Yi = b0 + b1X1i + b2X2i + … + bkXki + εi, where
- Yi = ith observation of dependent variable Y ,Xki = ith observation of kth independent variable X
- b0 = intercept term, bk = slope coefficient of kth independent variable
- εi = error term of ith observation
- n = number of observations, k = total number of independent variables
If there is no relationship among Y and X1 and X2, the model is a horizontal plane passing through the point (Y = β0, X1 = 0, X2 = 0).
If there is a relationship among Y and X1 and X2, the model is a sloping plane passing through three points:
• (Y = β0, X1 = 0, X2 = 0)
• (Y = β0 + β1, X1 = 1, X2 = 0)
• (Y = β0 + β2, X1 = 0, X2 = 1)
The coefficient of determination (R²) can also be used, apart from the F-test, to test the significance of the coefficients collectively.
The drawback of the coefficient of determination is that its value always increases as the number of independent variables is increased, even if the marginal contribution of the incoming variable is statistically insignificant.
To take care of this drawback, the coefficient of determination is adjusted for the number of independent variables used. This adjusted measure is called adjusted R².
Adjusted R² is given by the following formula:

Adjusted R² = 1 − [(n − 1) / (n − k − 1)] × (1 − R²)

where
- n = number of observations
- k = number of independent variables
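A small sketch of the adjusted R² computation, showing how the same raw R² looks less impressive as the number of predictors grows (the numbers are made up):

```python
# Adjusted R² penalizes the raw R² for the number of predictors k.
def adjusted_r2(r2, n, k):
    """n = number of observations, k = number of independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# The same R² = 0.80 with more predictors gives a lower adjusted value:
print(adjusted_r2(0.80, 50, 2))   # ≈ 0.791
print(adjusted_r2(0.80, 50, 10))  # ≈ 0.749
```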
Collinearity is redundant information among the independent variables. Collinearity is not a violation of the assumptions of multiple linear regression.
Suppose X1 and X2 almost follow the straight line X1 = X2 in the (X1, X2) plane. Consequently, one variable provides nearly as much information as the other does. They are redundant.
Why is this a problem? Two reasons exist.
- Neither can appear to be significant when both are in the model; however, both can be significant when only one is in the model. Thus, collinearity can hide significant variables.
- Collinearity also increases the variance of the parameter estimates and consequently increases prediction error. When collinearity is a problem, the estimates of the coefficients are unstable. This means that they have a large variance.
The coefficient ry1 represents the correlation between Y and X1.
Consider the simple linear regression of Y on X1, and suppose X1 accounts for 25% of the variance in Y, shown by the dark blue area of overlap.
Now suppose X1 and X2 are correlated with one another. The coefficient ry(1.2) reflects the correlation of Y with X1, controlling for the variable X2.
R2 increases when X2 is added to the model, but the individual effects of X1 and X2 appear smaller because the effects tests are based on partial correlation.
The coefficient ry(1.23) reflects the correlation between Y and X1 controlling for the variables X2 and X3.
The independent effect of X1 is no longer statistically significant.
The R² for this model has, however, increased with each new term added to the model.
Indicators that multicollinearity may be present in a model include the following:
- Large changes in the estimated regression coefficients when a predictor variable is added or deleted
- Insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the joint hypothesis that those coefficients are all zero (using an F-test)
- Some authors have suggested a formal detection rule based on tolerance or the variance inflation factor (VIF): tolerance = 1 − Rj² and VIF = 1/(1 − Rj²), where Rj² is the coefficient of determination of a regression of explanator j on all the other explanators. A tolerance of less than 0.20 or 0.10 and/or a VIF of 5 or 10 and above indicates a multicollinearity problem.
- Construction of a correlation matrix among the explanatory variables will yield indications as to the likelihood that any given pair of right-hand-side variables is creating multicollinearity problems. Correlation values (off-diagonal elements) of at least 0.4 are sometimes interpreted as indicating a multicollinearity problem. This procedure is, however, highly problematic and cannot be recommended: correlation describes a bivariate relationship, whereas collinearity is a multivariate phenomenon.
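The VIF computation described above can be sketched directly: regress each explanator on all the others and take 1/(1 − Rj²). The data are simulated so that X2 is nearly a copy of X1:

```python
import numpy as np

# Simulated predictors: x2 is nearly collinear with x1, x3 is unrelated.
rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: regress it on the other columns, return 1/(1 - Rj^2)."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])  # add intercept
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(3)])  # large for x1 and x2
```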
Steps to be followed to build a linear model
The following steps are followed during a multiple linear regression analysis:
- Variable identification
- Identifying the dependent (response) variable and independent (explanatory) variables.
- Variable categorization (e.g. Numeric, Categorical, Discrete, Continuous etc.)
- Distribution analysis
- Frequency distribution
- Outlier treatment
- Identify the outliers/threshold limit
- Cap/floor the values at the thresholds
- Independent variables analyses
- Identify the prospective independent variables (that can explain response variable)
- Bivariate analysis of response variable against independent variables
- Variable treatment /transformation
- Grouping of distinct values/levels
- Mathematical transformation e.g. log, splines etc.
- Check in a univariate manner by individual variables
- Easy for univariate linear regression. Can be done manually.
- Too cumbersome to do manually for multivariate case
- The tools (R, SAS etc.) have in-built features to tackle it.
- Fitting the regression
- Check for correlation between independent variables
- This is to take care of Multicollinearity
- Fix Heteroskedasticty
- By suitable transformation of the response variable (a bit tricky)
- Using inbuilt features of statistical packages like R
- Variable selection
- Check for the most suitable transformed variable
- Select the transformation giving the best fit
- Reject the statistically insignificant variables
- Fitting the regression
- Analysis of results
- Model comparison
- Model performance check
- Lift/Gains chart and Gini coefficient
- Actual vs Predicted comparison
Evaluation metrics for regression
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)
- Mean Absolute Error (MAE)
- R Squared (R²)
- Adjusted R Squared (R²)
- Mean Square Percentage Error (MSPE)
- Mean Absolute Percentage Error (MAPE)
- Root Mean Squared Logarithmic Error (RMSLE)
Mean Squared Error (MSE)
It is perhaps the simplest and most common metric for regression evaluation, but also probably the least useful. It is defined by the equation

MSE = (1/N) ∑ᵢ (yᵢ − ŷᵢ)²

where yᵢ is the actual expected output and ŷᵢ is the model’s prediction.
MSE measures the average squared error of our predictions: for each point, it calculates the squared difference between the prediction and the target, and then averages those values.
The higher this value, the worse the model. It is never negative, since we square the individual prediction-wise errors before summing them, and it is zero only for a perfect model.
Advantage: useful if we have unexpected values that we should care about, i.e., very high or low values that deserve attention.
Disadvantage: if we make a single very bad prediction, the squaring makes the error even worse, and it may skew the metric towards overestimating the model’s badness. That is particularly problematic if we have noisy data (that is, data that for whatever reason is not entirely reliable): even a “perfect” model may have a high MSE in that situation, so it becomes hard to judge how well the model is performing. On the other hand, if all the errors are small (smaller than 1), then the opposite effect is felt: we may underestimate the model’s badness.
Note that if we want a constant prediction, the best one is the mean of the target values. It can be found by setting the derivative of the total error with respect to that constant to zero and solving the resulting equation.
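A small sketch of MSE with made-up targets, checking numerically that among constant predictions the mean minimizes MSE:

```python
# MSE of constant predictions: the mean is the minimizer.
ys = [3.0, 5.0, 7.0, 9.0, 100.0]  # note the outlier

def mse(targets, preds):
    return sum((y - p) ** 2 for y, p in zip(targets, preds)) / len(targets)

mean_y = sum(ys) / len(ys)
const = lambda c: [c] * len(ys)

# Perturbing the constant away from the mean always increases MSE
print(mse(ys, const(mean_y)))
print(mse(ys, const(mean_y + 1)) > mse(ys, const(mean_y)))
print(mse(ys, const(mean_y - 1)) > mse(ys, const(mean_y)))
```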
Root Mean Squared Error (RMSE)
RMSE is just the square root of MSE. The square root is introduced so that the scale of the errors matches the scale of the targets.
Now, it is important to understand in what sense RMSE is similar to MSE, and what the difference is.
First, they are similar in terms of their minimizers: every minimizer of MSE is also a minimizer of RMSE, and vice versa, since the square root is a non-decreasing function. For example, if we have two sets of predictions, A and B, and the MSE of A is greater than the MSE of B, then we can be sure that the RMSE of A is greater than the RMSE of B, and it also works in the opposite direction.
What does it mean for us?
It means that if the target metric is RMSE, we can still compare our models using MSE, since MSE will order the models in the same way as RMSE. Thus we can optimize MSE instead of RMSE.
In fact, MSE is a little easier to work with, so many people use MSE instead of RMSE. There is, however, a small difference between the two for gradient-based models.
Travelling along the MSE gradient is equivalent to travelling along the RMSE gradient, but with a different effective step size, and that step size depends on the MSE score itself.
So even though RMSE and MSE are very similar in terms of model scoring, they are not immediately interchangeable for gradient-based methods; we will probably need to adjust some parameters like the learning rate.
Mean Absolute Error (MAE)
In MAE the error is calculated as the average of absolute differences between the target values and the predictions. MAE is a linear score, which means that all the individual differences are weighted equally in the average. For example, the difference between 10 and 0 counts exactly twice as much as the difference between 5 and 0; the same is not true for RMSE. Mathematically, it is calculated using this formula:

MAE = (1/N) ∑ᵢ |yᵢ − ŷᵢ|
What is important about this metric is that it does not penalize huge errors as badly as MSE does. Thus, it is not as sensitive to outliers as mean squared error.
MAE is widely used in finance, where a $10 error is usually considered exactly twice as bad as a $5 error. MSE, on the other hand, treats a $10 error as four times worse than a $5 error. MAE is also easier to justify than RMSE.
Another important thing about MAE is its gradient with respect to the predictions. The gradient is a step function: it takes −1 when the prediction is smaller than the target and +1 when it is larger.
The gradient is not defined when the prediction is perfect, because when the prediction equals the target, we cannot evaluate it.
So formally, MAE is not differentiable, but in practice, how often do your predictions match the target exactly? Even if they do, we can write a simple IF condition that returns zero in that case and the gradient otherwise. Note also that the second derivative is zero everywhere and undefined at zero.
Note that if we want a constant prediction, the best one is the median of the target values. It can be found by setting the derivative of the total error with respect to that constant to zero and solving the resulting equation.
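A companion sketch for MAE, checking that with an outlier present the median beats the mean as a constant prediction (targets are made up):

```python
import statistics

# MAE of constant predictions: the median is the minimizer.
ys = [3.0, 5.0, 7.0, 9.0, 100.0]

def mae(targets, preds):
    return sum(abs(y - p) for y, p in zip(targets, preds)) / len(targets)

median_y = statistics.median(ys)  # 7.0
mean_y = statistics.mean(ys)      # 24.8, dragged up by the outlier

const = lambda c: [c] * len(ys)
print(mae(ys, const(median_y)))  # smaller
print(mae(ys, const(mean_y)))    # larger
```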
R Squared (R²)
Now, what if I told you that the MSE of my model’s predictions is 32? Should I improve my model, or is it good enough? What if my MSE was 0.4? It is actually hard to tell whether a model is good just by looking at the absolute values of MSE or RMSE. We would probably want to measure how much better our model is than a constant baseline.
The coefficient of determination, or R² (read as R-squared), is another metric we may use to evaluate a model. It is closely related to MSE, but has the advantage of being scale-free: it doesn’t matter whether the output values are very large or very small, R² always lies between −∞ and 1.
When R² is negative it means that the model is worse than predicting the mean.
The MSE of the model is computed as above, while the MSE of the baseline is defined as

MSE(baseline) = (1/N) ∑ᵢ (yᵢ − ȳ)²

where ȳ is the mean of the observed yᵢ. R² is then defined as 1 − MSE(model) / MSE(baseline).
To make it more clear, this baseline MSE can be thought of as the MSE that the simplest possible model would get. The simplest possible model would be to always predict the average of all samples. A value close to 1 indicates a model with close to zero error, and a value close to zero indicates a model very close to the baseline.
In conclusion, R² tells us how much better our model is than the naive mean model.
Common misconception: many articles on the web state that R² lies between 0 and 1, which is not actually true. The maximum value of R² is 1, but the minimum can be minus infinity.
For example, consider a very poor model that predicts a highly negative value for every observation even though the actual y is positive. In this case, R² will be less than 0. This is an unlikely scenario, but the possibility exists.
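A final sketch of R² with made-up numbers, including the negative case just described:

```python
# R² = 1 - MSE(model) / MSE(baseline); a model worse than always
# predicting the mean gets a negative R².
ys = [2.0, 4.0, 6.0, 8.0]

def r_squared(targets, preds):
    y_bar = sum(targets) / len(targets)
    sse = sum((y - p) ** 2 for y, p in zip(targets, preds))
    sst = sum((y - y_bar) ** 2 for y in targets)
    return 1 - sse / sst

good = [2.1, 3.9, 6.2, 7.8]     # close to the targets
bad = [-2.0, -4.0, -6.0, -8.0]  # predicts highly negative values

print(r_squared(ys, good))  # close to 1
print(r_squared(ys, bad))   # negative
```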