What Is Linear Regression?
Linear regression is one of the most fundamental and widely used statistical techniques. It models the relationship between two variables by fitting a straight line (the line of best fit) through your data points. The goal is to find the line that best predicts the dependent variable (Y) from the independent variable (X).
The equation of a simple linear regression line is:
ŷ = b₀ + b₁x
Where ŷ is the predicted value, b₀ is the y-intercept (the predicted value when x = 0), and b₁ is the slope (the change in y for each one-unit change in x). The slope tells you the direction and strength of the relationship: a positive slope means y increases as x increases, a negative slope means y decreases, and a slope near zero means there's little linear relationship.
Linear regression does more than draw a line through data. It provides a mathematical model you can use for prediction, quantifies the strength of relationships, and forms the foundation for more advanced techniques like logistic regression, polynomial regression, and neural networks.
The Least Squares Method
The ordinary least squares (OLS) method is the standard way to find the line of best fit. It works by minimizing the sum of the squared vertical distances (residuals) between each data point and the regression line. Squaring the distances ensures that positive and negative deviations don't cancel each other out, and it penalizes larger errors more heavily.
Calculating the Slope (b₁)
b₁ = Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)²
The numerator is the sum of products of deviations — it measures how x and y vary together (it is proportional to their covariance). The denominator is the sum of squared deviations of x — it measures how much x varies on its own. The ratio gives the slope of the best-fit line.
Calculating the Intercept (b₀)
b₀ = ȳ - b₁x̄
The intercept is determined by the requirement that the regression line passes through the point (x̄, ȳ) — the mean of both variables. This is a mathematical property of OLS regression.
Worked Example
Suppose we have five data points: (1, 2), (2, 3), (3, 5), (4, 4), (5, 6).
Step 1: Calculate the means: x̄ = (1 + 2 + 3 + 4 + 5)/5 = 3, ȳ = (2 + 3 + 5 + 4 + 6)/5 = 4
Step 2: Calculate the slope:
b₁ = [(1-3)(2-4) + (2-3)(3-4) + (3-3)(5-4) + (4-3)(4-4) + (5-3)(6-4)] / [(1-3)² + (2-3)² + (3-3)² + (4-3)² + (5-3)²]
   = [4 + 1 + 0 + 0 + 4] / [4 + 1 + 0 + 1 + 4]
   = 9/10 = 0.9
Step 3: Calculate intercept: b₀ = 4 - 0.9(3) = 4 - 2.7 = 1.3
Result: ŷ = 1.3 + 0.9x
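The steps above can be sketched in a few lines of plain Python, using the same five data points:

```python
# Minimal OLS-by-hand sketch for the worked example above.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]

n = len(xs)
x_bar = sum(xs) / n          # 3.0
y_bar = sum(ys) / n          # 4.0

# Slope: sum of products of deviations over sum of squared x-deviations.
num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
den = sum((x - x_bar) ** 2 for x in xs)
b1 = num / den               # 0.9
b0 = y_bar - b1 * x_bar      # 1.3

print(f"y-hat = {b0:.1f} + {b1:.1f}x")  # y-hat = 1.3 + 0.9x
```

In practice you would reach for a library routine, but writing out the two sums once makes the formulas concrete.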
Understanding Residuals
Residuals (eᵢ = yᵢ - ŷᵢ) are the differences between observed and predicted values. Analyzing residuals is crucial for validating your regression model. Ideally, residuals should be randomly scattered around zero with no discernible pattern. Patterns in residuals indicate that your model is missing something — perhaps a nonlinear relationship or an important variable.
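As a quick illustration, here are the residuals for the fitted line ŷ = 1.3 + 0.9x from the example above:

```python
# Residuals e = y - y_hat for the fitted line y-hat = 1.3 + 0.9x.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
b0, b1 = 1.3, 0.9

residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
# With an intercept in the model, OLS residuals always sum to
# (numerically) zero.
print([round(e, 2) for e in residuals])  # [-0.2, -0.1, 1.0, -0.9, 0.2]
```

These residuals look reasonably scattered around zero, which is what you want to see; a clear trend or funnel shape would be a warning sign.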
Understanding R² (R-Squared)
R², also called the coefficient of determination, measures how much of the variance in the dependent variable is explained by the independent variable. It ranges from 0 to 1 (or 0% to 100%).
R² = 1 - (SS_res / SS_tot)
Where: SS_res = Σ(yᵢ - ŷᵢ)² (sum of squared residuals)
SS_tot = Σ(yᵢ - ȳ)² (total sum of squares)
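Applying these definitions to the worked example is a one-screen computation:

```python
# R^2 for the worked example, straight from the definitions of
# SS_res and SS_tot.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
b0, b1 = 1.3, 0.9
y_bar = sum(ys) / len(ys)

ss_res = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # residual SS
ss_tot = sum((y - y_bar) ** 2 for y in ys)                      # total SS
r_squared = 1 - ss_res / ss_tot
print(round(r_squared, 3))  # 0.81
```

So the example line explains about 81% of the variance in y.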
| R² Value | Interpretation |
|---|---|
| 0.00 | The model explains none of the variance |
| 0.25 | Weak explanatory power |
| 0.50 | Moderate explanatory power |
| 0.75 | Strong explanatory power |
| 1.00 | Perfect fit (all points on the line) |
Important caveats when interpreting R²:
- R² always increases when you add more variables, even random ones. Use adjusted R² for multiple regression.
- High R² doesn't mean the model is correct. You can get high R² with a spurious (nonsense) correlation.
- R² doesn't indicate whether the regression coefficients are statistically significant.
- The interpretation of R² depends heavily on the context and field of study.
Correlation vs. Regression
While related, correlation (Pearson's r) and regression serve different purposes. Correlation measures the strength and direction of a linear relationship between two variables — it's symmetric (r of X and Y equals r of Y and X) and unitless. Regression quantifies the relationship as an equation you can use for prediction — it's asymmetric (predicting Y from X differs from predicting X from Y) and the coefficients have units.
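A quick numeric check of this relationship, using the same five illustrative points: for simple regression, squaring Pearson's r recovers R², while the slope depends on which variable you predict.

```python
# Pearson's r is symmetric; regression slopes generally are not.
# For simple linear regression, r**2 equals R^2.
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # 9
sxx = sum((x - x_bar) ** 2 for x in xs)                        # 10
syy = sum((y - y_bar) ** 2 for y in ys)                        # 10

r = sxy / (sxx * syy) ** 0.5     # symmetric in x and y
slope_y_on_x = sxy / sxx         # 0.9
slope_x_on_y = sxy / syy         # also 0.9 here, only because sxx == syy
print(round(r ** 2, 3))          # 0.81, matching R^2
```

In general the two slopes differ; they coincide for this data set only because x and y happen to have equal sums of squared deviations.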
Regression Assumptions
For OLS regression results to be valid, several assumptions must be met. Violating these assumptions can lead to unreliable estimates and incorrect conclusions.
1. Linearity
The relationship between X and Y should be linear. If the true relationship is curved (e.g., exponential or logarithmic), a straight line will be a poor fit. Always plot your data first to check for obvious nonlinearities. If you see a curve, consider transformations (log, square root) or polynomial regression.
2. Independence of Errors
Residuals should be independent of each other. This is often violated in time series data, where today's value depends on yesterday's. The Durbin-Watson test checks for autocorrelation in residuals. If present, consider time series models like ARIMA instead of simple regression.
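The Durbin-Watson statistic itself is simple to compute by hand: it is the sum of squared successive differences of the residuals divided by the sum of squared residuals. Values near 2 suggest no first-order autocorrelation; values near 0 suggest positive autocorrelation, near 4 negative. A sketch with illustrative residuals:

```python
# Durbin-Watson statistic computed directly from a residual sequence.
# DW = sum of squared successive differences / sum of squared residuals.
# The residuals below are illustrative.
residuals = [-0.2, -0.1, 1.0, -0.9, 0.2]

num = sum((residuals[i] - residuals[i - 1]) ** 2
          for i in range(1, len(residuals)))
den = sum(e ** 2 for e in residuals)
dw = num / den
print(round(dw, 2))  # 3.18
```

A real check would use the residuals from your fitted model and consult Durbin-Watson critical values for your sample size.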
3. Homoscedasticity
The variance of residuals should be constant across all levels of X. If the spread of residuals increases or decreases with X (heteroscedasticity), your standard errors will be unreliable. The Breusch-Pagan test detects heteroscedasticity. Remedies include weighted least squares or robust standard errors.
4. Normality of Errors
Residuals should be approximately normally distributed. This is most important for small samples and when constructing confidence intervals. Use a Q-Q plot or the Shapiro-Wilk test to check. For large samples (n > 30), the Central Limit Theorem provides some protection against non-normality.
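As a sketch of the normality check, assuming SciPy is available, the Shapiro-Wilk test can be run on a set of residuals (the numbers below are illustrative); a small p-value (below 0.05, say) would cast doubt on the normality assumption:

```python
# Shapiro-Wilk normality test on illustrative residuals, assuming SciPy.
from scipy import stats

residuals = [-0.2, -0.1, 1.0, -0.9, 0.2]
stat, p = stats.shapiro(residuals)
print(f"W = {stat:.3f}, p = {p:.3f}")
```

Keep in mind that with very small samples the test has little power, and with very large samples it flags trivial departures from normality; the Q-Q plot is usually more informative.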
5. No Multicollinearity (Multiple Regression)
In multiple regression, independent variables should not be highly correlated with each other. Multicollinearity inflates standard errors, making it hard to determine which variables are truly important. Check the Variance Inflation Factor (VIF) — values above 5-10 indicate problematic multicollinearity.
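The VIF can be computed from first principles: regress each predictor on the remaining predictors and take 1/(1 - R²) of that auxiliary regression. A sketch with NumPy and made-up data (two deliberately correlated predictors):

```python
# Variance Inflation Factor from first principles with NumPy.
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on the other predictors. Data are invented.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = 0.9 * x1 + rng.normal(scale=0.3, size=100)  # strongly correlated with x1
X = np.column_stack([x1, x2])

def vif(X, j):
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # intercept + other predictors
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    return 1 / (1 - r2)

print([round(vif(X, j), 1) for j in range(X.shape[1])])
```

With only two predictors, both VIFs are equal (each auxiliary R² is the same squared correlation); here they land well above the usual 5-10 warning threshold, as the construction of x2 suggests.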
Multiple Linear Regression
Simple linear regression uses one independent variable. Multiple linear regression extends this to two or more independent variables:
ŷ = b₀ + b₁x₁ + b₂x₂ + … + bₖxₖ
Each coefficient bⱼ represents the change in Y for a one-unit change in xⱼ, holding all other variables constant. This "holding constant" interpretation is what makes regression so powerful — it allows you to isolate the effect of each variable.
For example, in a model predicting house prices, the coefficient for square footage represents the effect of size on price, controlling for the number of bedrooms, age of the house, location, and any other variables in the model. This is far more informative than a simple correlation between size and price.
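A minimal sketch of such a model, fit by least squares with NumPy; the house data below are invented for illustration:

```python
# Multiple linear regression via least squares with NumPy.
# Invented data: square footage, bedrooms, and price (in $1000s).
import numpy as np

sqft = np.array([1400, 1600, 1700, 1875, 2100, 2350])
beds = np.array([2, 3, 3, 3, 4, 4])
price = np.array([245, 312, 279, 308, 355, 405])

# Design matrix: intercept column plus one column per predictor.
A = np.column_stack([np.ones(len(price)), sqft, beds])
coef, *_ = np.linalg.lstsq(A, price, rcond=None)
b0, b_sqft, b_beds = coef
# b_sqft is the change in predicted price per extra square foot,
# holding the number of bedrooms constant.
print(f"price = {b0:.1f} + {b_sqft:.3f}*sqft + {b_beds:.1f}*beds")
```

A dedicated statistics package (e.g. statsmodels) would additionally report standard errors and p-values for each coefficient, which this bare least-squares fit does not.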
Real-World Applications
Business & Economics
Businesses use regression to forecast sales, optimize pricing, and understand customer behavior. Economists use it to model the relationship between education and income, interest rates and investment, or supply and demand. Elasticities in economics are typically estimated from the coefficients of log-log regressions — an elasticity measures the percentage change in one variable resulting from a 1% change in another.
Science & Engineering
In chemistry, the Beer-Lambert law relates absorbance to concentration through a linear relationship validated by regression. In civil engineering, regression models predict concrete strength based on mix proportions. In environmental science, regression quantifies the relationship between pollution levels and health outcomes.
Machine Learning
Linear regression is often the first algorithm taught in machine learning courses, and it remains a competitive baseline even for complex problems. Regularized versions (Ridge, Lasso, Elastic Net) handle high-dimensional data and prevent overfitting. Linear regression is also the foundation of more sophisticated models: logistic regression for classification, generalized linear models for non-normal outcomes, and even the linear layers in neural networks.
Medicine & Public Health
Epidemiologists use regression to identify risk factors for diseases, controlling for confounding variables. Clinical trials use regression to adjust for baseline differences between treatment groups. Dose-response relationships in pharmacology are modeled using regression to find optimal dosages.
Common Pitfalls
- Extrapolation: Never predict values far outside your data range. A regression model calibrated for temperatures between 50°F and 90°F may produce absurd predictions at 150°F.
- Confusing correlation with causation: Ice cream sales and drowning deaths are positively correlated, but eating ice cream doesn't cause drowning. A confounding variable (hot weather) explains both.
- Ignoring outliers: A single extreme data point can dramatically pull the regression line toward it. Always examine your data visually before running regression.
- Overfitting: Adding too many variables makes your model fit the noise in your sample rather than the true underlying pattern. Use adjusted R², cross-validation, or information criteria (AIC, BIC) to find the right balance.
- Omitted variable bias: Leaving out an important variable that correlates with both X and Y can bias your coefficient estimates. For example, estimating the effect of education on income without controlling for ability may overstate education's effect.
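The outlier pitfall is easy to demonstrate numerically: adding a single extreme point to the five example points more than triples the fitted slope.

```python
# One extreme point can swing the OLS slope dramatically.
def ols_slope(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 4, 6]
print(round(ols_slope(xs, ys), 2))               # 0.9 on the clean data

# Add one outlier, (6, 20), and refit.
print(round(ols_slope(xs + [6], ys + [20]), 2))  # 2.8
```

This sensitivity is a direct consequence of squaring the errors: the outlier's large residual dominates the objective, so the line is pulled toward it.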
Conclusion
Linear regression is deceptively simple to compute but rich in interpretation. Understanding the least squares method, R², the assumptions that underpin valid inference, and the common pitfalls that lead to misuse is essential for anyone working with data. Whether you're a student learning statistics for the first time or a data scientist building predictive models, linear regression remains an indispensable tool in your analytical toolkit.
📈 Calculate Linear Regression Instantly
Our free Linear Regression Calculator finds the line of best fit, computes R², and shows step-by-step work with visualizations.
Try the Calculator →