Chi-Square Goodness-of-Fit Test

Categorical Analysis

Tests whether the observed frequency distribution of a single categorical variable matches a hypothesised (expected) distribution.

When to Use

Use this test when you have a single categorical variable and want to test whether the observed proportions match a theoretical distribution. For example, testing whether a die is fair (each face equally likely), or whether the distribution of blood types in your sample matches the expected population proportions.

Assumptions

  • The variable is categorical with two or more categories.
  • Observations are independent.
  • Expected frequency in each category is at least 5.
  • The hypothesised proportions are specified before examining the data.
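
The expected-frequency assumption can be checked before running the test. The following is a minimal Python sketch, assuming the hypothesised proportions are held in a dict and the sample size is known. The function name `small_expected_categories` and the blood-type proportions are illustrative assumptions, not part of this tool.

```python
def small_expected_categories(proportions, n, minimum=5.0):
    """Return categories whose expected count (proportion * n) is below `minimum`."""
    total = sum(proportions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("hypothesised proportions must sum to 1")
    return {
        category: p * n
        for category, p in proportions.items()
        if p * n < minimum
    }


# Made-up blood-type proportions (illustrative only), for a sample of 60 people.
hypothesised = {"O": 0.45, "A": 0.40, "B": 0.11, "AB": 0.04}
flagged = small_expected_categories(hypothesised, n=60)
print(flagged)  # categories that fall short of the rule of thumb
```

With these made-up numbers only the AB category falls below 5, so it would be a candidate for merging with another category or for collecting a larger sample.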

Required Inputs

Input | Type | Notes
Categorical Variable | Categorical | Column with two or more categories to test

Output Metrics

Metric | What it means
Chi-Square | Goodness-of-fit test statistic: sum of (observed - expected)^2 / expected.
DF | Degrees of freedom: number of categories - 1.
p-value | P-value for the null hypothesis that the observed distribution matches the expected distribution.
Observed Frequencies | Actual counts in each category.
Expected Frequencies | Counts expected under the hypothesised distribution.

Interpretation

  • If the p-value is less than alpha (the significance level chosen before the test), reject the null hypothesis: the observed distribution differs significantly from the expected distribution.
  • Examine which categories have the largest discrepancies between observed and expected counts to understand where the deviation occurs.
  • A non-significant result means you cannot reject the hypothesised distribution, but it does not prove the distribution is correct.
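
One common way to see where a deviation occurs is to look at a per-category residual, (observed - expected) / sqrt(expected), often called a Pearson residual. The sketch below assumes made-up counts from 120 rolls of a die. The name `pearson_residuals` is illustrative, and this is not necessarily how this tool reports discrepancies.

```python
import math


def pearson_residuals(observed, expected):
    """Return (observed - expected) / sqrt(expected) for each category."""
    return {
        category: (observed[category] - expected[category]) / math.sqrt(expected[category])
        for category in observed
    }


# Made-up counts from 120 rolls; a fair die expects 20 per face.
observed = {1: 22, 2: 17, 3: 20, 4: 26, 5: 22, 6: 13}
expected = {face: 20.0 for face in observed}

residuals = pearson_residuals(observed, expected)
# Sort by absolute size so the largest discrepancies come first.
ranked = sorted(residuals.items(), key=lambda item: abs(item[1]), reverse=True)
for face, r in ranked:
    print(f"face {face}: residual {r:+.2f}")
```

The squared residuals sum to the chi-square statistic, so the categories at the top of the ranking are the ones contributing most to it.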

Common Pitfalls

  • The test requires specifying expected proportions a priori. Fitting proportions from the data and then testing them invalidates the test.
  • Categories with very small expected counts (< 5) make the chi-square approximation unreliable and can inflate the statistic. Combine categories if necessary.
  • The test is sensitive to sample size. With very large samples, even practically trivial departures from the expected distribution can come out statistically significant.
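
The sample-size pitfall can be illustrated with a small Python sketch. It assumes made-up counts over four equally likely categories and compares a sample of 100 with a sample of 10,000 that has exactly the same observed proportions. The quantity w = sqrt(chi-square / n), usually called Cohen's w, is added here as one possible size-independent measure. It is an assumption of this example, not an output of this tool.

```python
import math


def chi_square_statistic(observed, proportions):
    """Sum of (observed - expected)^2 / expected, with expected = proportion * n."""
    n = sum(observed)
    return sum((o - p * n) ** 2 / (p * n) for o, p in zip(observed, proportions))


proportions = [0.25, 0.25, 0.25, 0.25]
small = [26, 24, 25, 25]            # n = 100 (made-up counts)
large = [c * 100 for c in small]    # n = 10,000, same observed proportions

results = {}
for label, counts in (("small", small), ("large", large)):
    n = sum(counts)
    stat = chi_square_statistic(counts, proportions)
    w = math.sqrt(stat / n)         # effect size that does not grow with n
    results[label] = (n, stat, w)
    print(f"n={n}: chi-square={stat:.2f}, w={w:.4f}")
```

The statistic grows a hundredfold (from about 0.08 to about 8) while w stays the same. With 3 degrees of freedom the 0.05 critical value is roughly 7.81, so only the larger sample would be flagged as significant even though the departure from the hypothesised proportions is identical.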

How It Works

  1. Specify the expected proportion for each category under the null hypothesis.
  2. Multiply each expected proportion by the total sample size to get expected frequencies.
  3. Compute chi-square = sum of (observed - expected)^2 / expected across all categories.
  4. Compare to the chi-square distribution with (k-1) degrees of freedom, where k is the number of categories.
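
The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not this tool's implementation: the die counts are made up, and the names `chi2_sf` and `goodness_of_fit` are hypothetical. The p-value helper uses a standard recurrence for the chi-square upper tail with integer degrees of freedom. In practice a statistics library (for example `scipy.stats.chisquare`) would normally be used instead.

```python
import math


def chi2_sf(stat, df):
    """Upper-tail probability P(X >= stat) for a chi-square variable, integer df >= 1."""
    if stat <= 0:
        return 1.0
    x = stat / 2.0
    # Starting values: Q(1/2, x) = erfc(sqrt(x)) for odd df, Q(1, x) = exp(-x) for even df.
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(x))
    else:
        a, q = 1.0, math.exp(-x)
    # Recurrence: Q(a + 1, x) = Q(a, x) + x^a * exp(-x) / Gamma(a + 1).
    while a < df / 2.0:
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q


def goodness_of_fit(observed, proportions):
    """Return (chi-square, df, p-value, expected counts) for the given counts."""
    n = sum(observed)
    expected = [p * n for p in proportions]                            # step 2
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))   # step 3
    df = len(observed) - 1                                             # step 4
    return stat, df, chi2_sf(stat, df), expected


# Step 1: a fair die assigns proportion 1/6 to each face. Counts are made up.
observed = [22, 17, 20, 26, 22, 13]
proportions = [1 / 6] * 6

stat, df, p_value, expected = goodness_of_fit(observed, proportions)
print(f"chi-square = {stat:.2f}, df = {df}, p = {p_value:.3f}")
```

For these counts the statistic is about 5.1 on 5 degrees of freedom, giving a p-value of roughly 0.40, so the data give no reason to reject the hypothesis that the die is fair.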

References

  • Pearson, K. (1900). On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, 50(302), 157-175.