Pearson Correlation
Regression & Correlation
Measures the strength and direction of the linear relationship between two continuous variables, producing a correlation coefficient (r) that ranges from -1 to +1.
When to Use
Use this test when you want to quantify how strongly two continuous variables are linearly related. For example, measuring the association between study hours and exam scores, or between temperature and ice cream sales.
Assumptions
- Both variables are continuous (interval or ratio scale).
- The relationship between the two variables is linear.
- The data follow an approximate bivariate normal distribution.
- The data contain no extreme outliers that could distort the correlation.
- Observations are independent.
Required Inputs
| Input | Type | Notes |
|---|---|---|
| Variable 1 | Numeric | First continuous variable |
| Variable 2 | Numeric | Second continuous variable |
Output Metrics
| Metric | What it means |
|---|---|
| Pearson r | Correlation coefficient. Ranges from -1 (perfect negative) to +1 (perfect positive). 0 indicates no linear relationship. |
| r-squared | Coefficient of determination: proportion of variance shared between the two variables. |
| t-statistic | Test statistic for the null hypothesis that r = 0. |
| DF | Degrees of freedom (N - 2). |
| p-value | P-value for the two-tailed test of r = 0. |
| 95% CI Lower | Lower bound of the 95% confidence interval for r (Fisher z-transform). |
| 95% CI Upper | Upper bound of the 95% confidence interval for r (Fisher z-transform). |
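As a sketch of how these metrics could be computed with standard scientific Python (the data below are invented for illustration; `scipy.stats.pearsonr` returns r and the two-tailed p-value):

```python
import numpy as np
from scipy import stats

# Hypothetical example data: study hours vs. exam scores
hours = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
score = np.array([52, 55, 61, 58, 70, 72, 75, 80], dtype=float)

r, p_value = stats.pearsonr(hours, score)  # Pearson r and two-tailed p-value
n = len(hours)
df = n - 2                                 # degrees of freedom
t_stat = r * np.sqrt(df / (1 - r**2))      # t-statistic for H0: r = 0
r_squared = r**2                           # coefficient of determination
```

The remaining metrics (the 95% CI bounds) come from the Fisher z-transformation described under Interpretation below.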
Interpretation
- The sign of r indicates the direction: positive means both variables increase together; negative means one increases as the other decreases.
- Effect size thresholds for |r|: weak (0.1-0.3), moderate (0.3-0.5), strong (> 0.5). These are guidelines, not strict cutoffs.
- r-squared tells you the proportion of variance shared. An r of 0.5 means r-squared = 0.25, so only 25% of the variance is shared.
- The confidence interval for r is computed using the Fisher z-transformation, which is necessary because r has a bounded and skewed sampling distribution.
- A significant correlation does not imply causation. Two variables can be correlated because they share a common cause.
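The Fisher z-transform CI mentioned above can be sketched in a few lines (a minimal sketch; the critical value 1.96 assumes a 95% interval and a large-sample normal approximation on the z scale):

```python
import math

def fisher_ci_95(r, n):
    """Approximate 95% CI for Pearson r via the Fisher z-transformation."""
    z = math.atanh(r)                        # Fisher z-transform of r
    se = 1.0 / math.sqrt(n - 3)              # standard error on the z scale
    lo_z, hi_z = z - 1.96 * se, z + 1.96 * se
    return math.tanh(lo_z), math.tanh(hi_z)  # back-transform to the r scale
```

Note that the resulting interval is asymmetric around r on the original scale, which is exactly why the transformation is needed.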
Common Pitfalls
- Pearson r only measures linear association. Two variables can have a strong non-linear relationship with r near zero.
- A single outlier can dramatically inflate or deflate the correlation, especially with small samples. Always plot your data.
- Restriction of range (when one variable has limited variability) attenuates the observed correlation below its true value.
- Correlating aggregated data (group means rather than individual observations) produces inflated correlations (ecological fallacy).
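The first pitfall is easy to demonstrate: a perfect quadratic relationship that is symmetric about zero produces r of essentially zero, even though y is completely determined by x (an illustrative sketch):

```python
import numpy as np

x = np.linspace(-1, 1, 101)
y = x**2                     # y is a deterministic function of x
r = np.corrcoef(x, y)[0, 1]  # yet the *linear* correlation is ~0
```

Plotting x against y would reveal the relationship instantly, which is why the advice above is to always plot your data.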
How It Works
- Standardise each variable by subtracting its mean and dividing by its standard deviation.
- Compute r as the average product of the standardised scores: r = sum(z_x * z_y) / (N - 1).
- Test significance with t = r * sqrt((N-2) / (1-r^2)), which follows a t-distribution with N-2 degrees of freedom.
- Construct the confidence interval by transforming r to Fisher z, computing the CI on the z scale, then back-transforming to the r scale.
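The computational steps above can be sketched end to end in plain Python (illustrative only; the example data are invented):

```python
import math

def pearson_r(x, y):
    """Pearson r via standardised scores, plus its t-statistic (steps 1-3)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample standard deviations (N - 1 denominator)
    sd_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    sd_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    # Average product of the standardised scores
    r = sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y)
            for a, b in zip(x, y)) / (n - 1)
    t = r * math.sqrt((n - 2) / (1 - r ** 2))  # t with N - 2 df
    return r, t

r, t = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

The p-value then comes from the t-distribution with N - 2 degrees of freedom, and the CI from the Fisher z procedure shown under Interpretation.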
References
- Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society A, 187, 253-318.
- Fisher, R. A. (1921). On the "probable error" of a coefficient of correlation deduced from a small sample. Metron, 1, 3-32.