Binary Logistic Regression

Regression & Correlation

Models the probability of a binary outcome (yes/no, 0/1) as a function of one or more predictors using the logistic function.

When to Use

Use this test when your outcome has exactly two categories and you want to model which predictors influence the probability of one outcome versus the other. For example, predicting whether a patient develops a disease (yes/no) based on age, BMI, and smoking status.

Assumptions

  • The dependent variable is binary (two categories).
  • Observations are independent.
  • There is a linear relationship between the predictors and the log-odds of the outcome (linearity of the logit).
  • No multicollinearity among predictors.
  • Sufficiently large sample size (often cited as at least 10 events per predictor).
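The events-per-predictor rule of thumb is easy to check before fitting. A minimal sketch (the helper name and the default of 10 are illustrative conventions, not a fixed standard):

```python
def enough_events(n_events, n_predictors, events_per_predictor=10):
    """Rule-of-thumb sample-size check for logistic regression.

    n_events: count of the *rarer* outcome category, since that is
    what limits how many parameters can be estimated reliably.
    """
    return n_events >= events_per_predictor * n_predictors

# 42 events with 4 predictors satisfies the 10-events-per-predictor guideline
print(enough_events(42, 4))  # True
print(enough_events(25, 4))  # False
```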

Required Inputs

Input | Type | Notes
----- | ---- | -----
Outcome (Y) | Binary (0/1) | Dependent variable with two categories
Predictors | Numeric / Categorical | One or more independent variables

Output Metrics

Metric | What it means
------ | -------------
-2 Log L | -2 times the log-likelihood of the fitted model. Used for model comparison.
AIC | Akaike Information Criterion (lower is better).
BIC | Bayesian Information Criterion (lower is better).
Accuracy | Proportion of correctly classified observations.
Likelihood Ratio Chi-Square | Global test of whether the model is better than an intercept-only model.
Wald Chi-Square (Global) | Alternative global test based on Wald statistics.
Score Chi-Square | Score test (Lagrange multiplier) for the overall model.
McFadden R-Square | Pseudo R-squared based on the log-likelihood ratio.
Cox & Snell R-Square | Pseudo R-squared that cannot reach 1.0.
Nagelkerke R-Square | Adjusted pseudo R-squared that can reach 1.0.
Estimate (log-odds) | Estimated coefficient in log-odds units for each predictor.
Std Error | Standard error of each coefficient.
Wald Chi-Sq | Wald statistic for testing whether each coefficient is zero.
Pr > ChiSq | P-value for each predictor.
Odds Ratio | Exponentiated coefficient: exp(estimate). The multiplicative change in odds per unit change in the predictor.
OR 95% CL Lower | Lower confidence limit for the odds ratio.
OR 95% CL Upper | Upper confidence limit for the odds ratio.
Hosmer-Lemeshow Chi-Sq | Goodness-of-fit test comparing observed and expected frequencies across decile groups.
Hosmer-Lemeshow Pr > ChiSq | P-value for the Hosmer-Lemeshow test. A non-significant result suggests adequate model fit.
Sensitivity | Proportion of true positives correctly identified.
Specificity | Proportion of true negatives correctly identified.
Precision | Proportion of predicted positives that are truly positive.
F1 Score | Harmonic mean of precision and sensitivity.
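The classification metrics at the bottom of the table all derive from the 2x2 confusion matrix. A small sketch computing them from raw counts (the function name is illustrative):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-dependent metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1, "accuracy": accuracy}

m = classification_metrics(tp=40, fp=5, tn=45, fn=10)
print(m["sensitivity"])  # 0.8
print(m["accuracy"])     # 0.85
```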

Interpretation

  • An odds ratio > 1 means the predictor increases the odds of the outcome; < 1 means it decreases them. An OR of 2.5 means the odds are multiplied by 2.5 for each one-unit increase in the predictor, holding other predictors constant.
  • The odds ratio confidence interval is key: if it includes 1.0, the predictor is not statistically significant at the corresponding level (e.g., 0.05 for a 95% interval).
  • Pseudo R-squared values are not directly comparable to linear regression R-squared. Nagelkerke R-squared of 0.3 can indicate a good model in practice.
  • The Hosmer-Lemeshow test evaluates calibration. A significant result (p < 0.05) indicates poor fit, but the test has low power with small samples.
  • Classification metrics (accuracy, sensitivity, specificity) depend on the classification threshold (default 0.5). Adjust the threshold based on the relative cost of false positives versus false negatives.
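The odds ratio and its confidence limits come straight from the coefficient and its standard error on the log-odds scale. A sketch assuming a Wald-type interval (z = 1.96 for 95%; the function name is illustrative):

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Exponentiate a log-odds coefficient and its Wald confidence limits."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# A coefficient of 0.9163 corresponds to an odds ratio of about 2.5
or_, lo, hi = odds_ratio_with_ci(0.9163, 0.30)
print(round(or_, 2), round(lo, 2), round(hi, 2))
# The interval excludes 1.0, so this predictor is significant at the 5% level
```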

Common Pitfalls

  • Complete or quasi-complete separation occurs when a predictor (or combination of predictors) perfectly predicts the outcome. Maximum likelihood estimates then diverge to infinity, producing huge coefficients and standard errors; penalized approaches such as Firth's logistic regression are a common remedy.
  • Odds ratios are often misinterpreted as risk ratios (relative risk). They are only similar when the outcome is rare (< 10%).
  • Multicollinearity inflates standard errors and makes individual predictors appear non-significant even when the overall model is significant.
  • Accuracy alone is misleading for imbalanced classes. If 95% of cases are negative, predicting "negative" for everything gives 95% accuracy.
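The accuracy pitfall is easy to demonstrate with a toy example: a classifier that always predicts the majority class looks accurate while detecting nothing.

```python
# 95 negatives, 5 positives; the "classifier" always predicts negative
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
sensitivity = tp / sum(y_true)  # true positives / all actual positives

print(accuracy)     # 0.95
print(sensitivity)  # 0.0 -- every actual positive was missed
```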

How It Works

  1. Model the log-odds of the outcome as a linear combination of predictors: log(p / (1-p)) = b0 + b1*X1 + b2*X2 + ...
  2. Estimate coefficients using maximum likelihood estimation, which finds the parameter values that make the observed data most probable.
  3. Test individual coefficients using Wald statistics and the overall model using the likelihood ratio test.
  4. Exponentiate each coefficient to obtain odds ratios for interpretation.
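The steps above can be sketched end to end in pure Python: gradient ascent on the log-likelihood of a one-predictor model, followed by exponentiating the slope. This is a teaching sketch, not a production fitter; real software uses Newton-type iteration, and all names here are illustrative.

```python
import math

def fit_logistic(xs, ys, lr=0.1, n_iter=20000):
    """Fit log(p / (1-p)) = b0 + b1*x by gradient ascent on the
    log-likelihood; each observation contributes (y - p) to the
    intercept gradient and (y - p)*x to the slope gradient."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Toy data: the outcome becomes more likely as x grows, but the classes are
# not perfectly separated, so the maximum likelihood estimate exists
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
print(math.exp(b1))  # odds ratio per unit increase in x
```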

References

  • Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society B, 20(2), 215-242.
  • Hosmer, D. W., & Lemeshow, S. (1980). Goodness of fit tests for the multiple logistic regression model. Communications in Statistics, 9(10), 1043-1069.