Pearson Correlation

Regression & Correlation

Measures the strength and direction of the linear relationship between two continuous variables, producing a correlation coefficient (r) that ranges from -1 to +1.

When to Use

Use this test when you want to quantify how strongly two continuous variables are linearly related. For example, measuring the association between study hours and exam scores, or between temperature and ice cream sales.

Assumptions

  • Both variables are continuous (interval or ratio scale).
  • The relationship between the two variables is linear.
  • The data follow an approximate bivariate normal distribution.
  • No extreme outliers that could distort the correlation.
  • Observations are independent.

Required Inputs

Input        Type     Notes
-----------  -------  ---------------------------
Variable 1   Numeric  First continuous variable
Variable 2   Numeric  Second continuous variable

Output Metrics

Metric        What it means
------------  -----------------------------------------------------------------
Pearson r     Correlation coefficient. Ranges from -1 (perfect negative) to +1 (perfect positive); 0 indicates no linear relationship.
r-squared     Coefficient of determination: the proportion of variance shared between the two variables.
t-statistic   Test statistic for the null hypothesis that r = 0.
DF            Degrees of freedom (N - 2).
p-value       Two-tailed p-value for the test of r = 0.
95% CI Lower  Lower bound of the 95% confidence interval for r (Fisher z-transform).
95% CI Upper  Upper bound of the 95% confidence interval for r (Fisher z-transform).
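As a sketch of how these metrics can be obtained in practice, the snippet below uses `scipy.stats.pearsonr` on hypothetical study-hours/exam-score data (the `confidence_interval` method on the result requires SciPy 1.9 or later); the t-statistic and degrees of freedom are computed by hand alongside it.

```python
import numpy as np
from scipy import stats

# Hypothetical data: study hours vs. exam scores
hours = np.array([2, 4, 5, 6, 8, 9, 10, 12])
scores = np.array([52, 58, 60, 68, 71, 75, 80, 85])

res = stats.pearsonr(hours, scores)
r, p = res.statistic, res.pvalue
df = len(hours) - 2                      # degrees of freedom, N - 2
t = r * np.sqrt(df / (1 - r**2))         # t-statistic for H0: r = 0
ci = res.confidence_interval(0.95)       # Fisher z-based 95% CI

print(f"r = {r:.3f}, r^2 = {r**2:.3f}")
print(f"t({df}) = {t:.3f}, p = {p:.4f}")
print(f"95% CI [{ci.low:.3f}, {ci.high:.3f}]")
```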

Interpretation

  • The sign of r indicates the direction: positive means both variables increase together; negative means one increases as the other decreases.
  • Effect size thresholds for |r|: weak (0.1-0.3), moderate (0.3-0.5), strong (> 0.5). These are guidelines, not strict cutoffs.
  • r-squared tells you the proportion of variance shared. An r of 0.5 means r-squared = 0.25, so only 25% of the variance is shared.
  • The confidence interval for r is computed using the Fisher z-transformation, which is necessary because r has a bounded and skewed sampling distribution.
  • A significant correlation does not imply causation. Two variables can be correlated because they share a common cause.

Common Pitfalls

  • Pearson r only measures linear association. Two variables can have a strong non-linear relationship with r near zero.
  • A single outlier can dramatically inflate or deflate the correlation, especially with small samples. Always plot your data.
  • Restriction of range (when one variable has limited variability) attenuates the observed correlation below its true value.
  • Correlating aggregated data (group means rather than individual observations) produces inflated correlations (ecological fallacy).
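The first two pitfalls are easy to demonstrate numerically. This sketch (with made-up data) shows a deterministic quadratic relationship yielding r near zero, and a single outlier manufacturing a sizeable correlation out of pure noise:

```python
import numpy as np
from scipy import stats

# Pitfall 1: strong non-linear relationship, r near zero
x = np.linspace(-3, 3, 201)            # symmetric around 0
y = x**2                               # deterministic U-shape
r_quad, _ = stats.pearsonr(x, y)

# Pitfall 2: one extreme point inflates r in otherwise unrelated noise
rng = np.random.default_rng(42)
x2 = rng.normal(size=20)
y2 = rng.normal(size=20)               # independent noise
r_clean, _ = stats.pearsonr(x2, y2)
r_outlier, _ = stats.pearsonr(np.append(x2, 10.0), np.append(y2, 10.0))

print(f"quadratic:         r = {r_quad:.4f}")
print(f"noise:             r = {r_clean:.4f}")
print(f"noise + 1 outlier: r = {r_outlier:.4f}")
```

Plotting either dataset makes the problem obvious at a glance, which is why the scatterplot check above is worth the habit.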

How It Works

  1. Standardise each variable by subtracting its mean and dividing by its standard deviation.
  2. Compute r as the average product of the standardised scores: r = sum(z_x * z_y) / (N - 1), where the z-scores use the sample standard deviation (N - 1 divisor).
  3. Test significance with t = r * sqrt((N-2) / (1-r^2)), which follows a t-distribution with N-2 degrees of freedom.
  4. Construct the confidence interval by transforming r to Fisher z, computing the CI on the z scale, then back-transforming to the r scale.
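The four steps above can be sketched as a small self-contained function (an illustrative implementation, not a validated statistical routine; `pearson_with_ci` is a hypothetical name):

```python
import math
from statistics import NormalDist

import numpy as np

def pearson_with_ci(x, y, alpha=0.05):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)

    # Step 1: standardise each variable (sample SD, so the N-1 divisor is exact)
    zx = (x - x.mean()) / x.std(ddof=1)
    zy = (y - y.mean()) / y.std(ddof=1)

    # Step 2: r is the average product of the standardised scores
    r = float(np.sum(zx * zy) / (n - 1))

    # Step 3: t-statistic with N-2 degrees of freedom
    t = r * math.sqrt((n - 2) / (1 - r**2))

    # Step 4: Fisher z-transform, normal-theory CI on the z scale, back-transform
    z = math.atanh(r)                     # 0.5 * ln((1 + r) / (1 - r))
    se = 1.0 / math.sqrt(n - 3)
    zcrit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (math.tanh(z - zcrit * se), math.tanh(z + zcrit * se))
    return r, t, n - 2, ci

print(pearson_with_ci([2, 4, 5, 6, 8, 9, 10, 12],
                      [52, 58, 60, 68, 71, 75, 80, 85]))
```

Computing the p-value additionally requires the t-distribution CDF (e.g. `scipy.stats.t.sf`), which is why the sketch stops at the t-statistic.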

References

  • Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society A, 187, 253-318.
  • Fisher, R. A. (1921). On the "probable error" of a coefficient of correlation deduced from a small sample. Metron, 1, 3-32.