Simple Linear Regression
Regression & Correlation

Models the linear relationship between a single continuous predictor and a continuous outcome, estimating the slope and intercept of the best-fit line.
When to Use
Use this test when you want to predict a continuous outcome from a single continuous predictor and describe their linear relationship. For example, predicting body weight from height, or estimating how enzyme activity changes with substrate concentration.
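The height-and-weight example above can be sketched in a few lines. This is a minimal illustration, assuming SciPy is available; the data values are hypothetical, invented only to show the workflow.

```python
import numpy as np
from scipy import stats

# Hypothetical data: height (cm) as predictor, weight (kg) as outcome.
height = np.array([160, 165, 170, 175, 180, 185, 190], dtype=float)
weight = np.array([55.0, 60.5, 63.0, 68.5, 74.0, 78.5, 83.0])

result = stats.linregress(height, weight)
print(f"slope     = {result.slope:.3f} kg per cm")
print(f"intercept = {result.intercept:.3f} kg")
print(f"R-squared = {result.rvalue**2:.3f}")
print(f"p-value   = {result.pvalue:.4g}")

# Predict weight for a new height within the observed range.
new_height = 172.0
predicted = result.intercept + result.slope * new_height
```

Predictions should only be made for heights inside the observed range (160 to 190 cm here); see the extrapolation warning under Common Pitfalls.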
Assumptions
- The relationship between predictor and outcome is linear.
- Observations are independent.
- Residuals (errors) are normally distributed.
- Residuals have constant variance across all levels of the predictor (homoscedasticity).
- No influential outliers that distort the fitted line.
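A couple of these assumptions can be screened numerically. The sketch below (hypothetical data, assuming SciPy) runs a Shapiro-Wilk test on the residuals and compares residual spread across the two halves of X; it complements, but does not replace, a residual-vs-fitted plot.

```python
import numpy as np
from scipy import stats

# Hypothetical data for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1])

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

# Normality of residuals: Shapiro-Wilk (p > 0.05 gives no evidence
# against normality; note small samples have little power).
stat, p_normal = stats.shapiro(residuals)

# Rough homoscedasticity check: compare residual spread in the lower
# and upper halves of X. Similar SDs are consistent with constant
# variance; a residual-vs-fitted plot is more informative.
lower = residuals[x <= np.median(x)]
upper = residuals[x > np.median(x)]
print(f"Shapiro-Wilk p = {p_normal:.3f}")
print(f"residual SD (lower half) = {lower.std(ddof=1):.3f}")
print(f"residual SD (upper half) = {upper.std(ddof=1):.3f}")
```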
Required Inputs
| Input | Type | Notes |
|---|---|---|
| Predictor (X) | Numeric | Continuous independent variable |
| Outcome (Y) | Numeric | Continuous dependent variable |
Output Metrics
| Metric | What it means |
|---|---|
| R-Square | Proportion of variance in Y explained by X. Ranges from 0 to 1. |
| Adj R-Square | R-squared adjusted for the number of predictors. Slightly lower than R-squared even in simple regression with one predictor, unless the fit is perfect. |
| Root MSE | Root mean squared error of the residuals. Measures typical prediction error in the units of Y. |
| F Value | Overall F-statistic testing whether the model explains significant variance. |
| Pr > F | P-value for the overall model F-test. |
| N Observations | Number of data points used in the model. |
| Intercept — Estimate | Predicted value of Y when X = 0. |
| Intercept — Std Error | Standard error of the intercept estimate. |
| Intercept — t Value | Test statistic for the intercept. |
| Intercept — Pr > \|t\| | P-value testing whether the intercept differs from zero. |
| Slope — Estimate | Change in Y for a one-unit increase in X. |
| Slope — Std Error | Standard error of the slope estimate. |
| Slope — t Value | Test statistic for the slope. |
| Slope — Pr > \|t\| | P-value testing whether the slope differs from zero (i.e., whether X predicts Y). |
| 95% CL Lower | Lower bound of the 95% confidence interval for each parameter. |
| 95% CL Upper | Upper bound of the 95% confidence interval for each parameter. |
| Fitted Mean | Mean of fitted values from the model. |
| Fitted Min | Minimum fitted value from the model. |
| Fitted Max | Maximum fitted value from the model. |
| Residual Mean | Mean of model residuals (should be near zero). |
| Residual SD | Standard deviation of model residuals. |
| Residual Min | Minimum residual value. |
| Residual Max | Maximum residual value. |
Interpretation
- R-squared tells you what fraction of the variability in Y is accounted for by X. An R-squared of 0.60 means 60% of the variation is explained.
- The slope is the key result: it quantifies how much Y changes per unit change in X. If the slope is statistically significant (p < alpha), X is a significant predictor of Y.
- The intercept is the predicted Y when X = 0. It may or may not be meaningful depending on whether X = 0 is within the range of your data.
- Always examine a scatter plot with the fitted line and residual plots. The numbers alone cannot tell you if the linear model is appropriate.
- A high R-squared does not imply causation. Regression describes association, not causal mechanisms.
Common Pitfalls
- Fitting a straight line to a curved relationship produces misleading results. Always check for non-linearity in the residual plot.
- Extrapolating predictions beyond the observed range of X is unreliable. The linear relationship may not hold outside your data.
- A single outlier with high leverage (extreme X value) can dramatically change the slope. Check for influential points.
- Correlation does not imply causation. A significant regression does not prove that X causes Y.
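The leverage pitfall is easy to demonstrate. In this sketch (hypothetical data, assuming SciPy), a single point with an extreme X value and an off-line Y value pulls the fitted slope far from the value the other five points support.

```python
import numpy as np
from scipy import stats

# Five points lying close to a line with slope ~2.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

slope_clean = stats.linregress(x, y).slope

# Add one high-leverage point: extreme X, with Y far below the trend
# (the original line would predict roughly 40 at x = 20).
x_out = np.append(x, 20.0)
y_out = np.append(y, 10.0)
slope_out = stats.linregress(x_out, y_out).slope

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_out:.2f}")
```

One discordant point out of six drags the slope from about 2 down to well under 1, which is why checking for influential points belongs in every analysis.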
How It Works
- Find the line Y = a + bX that minimises the sum of squared vertical distances (residuals) between the observed and predicted Y values (ordinary least squares).
- The slope b = sum((Xi - mean(X)) * (Yi - mean(Y))) / sum((Xi - mean(X))^2), and the intercept a = mean(Y) - b * mean(X).
- Test whether the slope is significantly different from zero using a t-test with N - 2 degrees of freedom.
- Compute R-squared as 1 - (residual sum of squares / total sum of squares).
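The steps above translate directly into code. This is a from-scratch sketch of the same formulas using only the standard library; in practice a vetted routine such as `scipy.stats.linregress` is preferable.

```python
import math

def simple_ols(x, y):
    """Ordinary least squares for one predictor, following the
    formulas above: slope b, intercept a, R-squared, and the
    t statistic for the slope on N - 2 degrees of freedom."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    # Slope b = Sxy / Sxx, intercept a = mean(Y) - b * mean(X).
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    # R-squared = 1 - SS_residual / SS_total.
    fitted = [a + b * xi for xi in x]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    # Standard error of the slope and its t statistic (df = n - 2).
    se_b = math.sqrt(ss_res / (n - 2) / sxx)
    t = b / se_b
    return {"slope": b, "intercept": a, "r_squared": r_squared,
            "t": t, "df": n - 2}

# Hypothetical data: nearly linear with slope ~2.
res = simple_ols([1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 4.1, 5.9, 8.2, 9.9])
```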