Binary Logistic Regression

Regression & Correlation

Models the probability of a binary outcome (yes/no, 0/1) as a function of one or more predictors using the logistic function.

When to Use

Use this test when your outcome has exactly two categories and you want to model which predictors influence the probability of one outcome versus the other. For example, predicting whether a patient develops a disease (yes/no) based on age, BMI, and smoking status.

Assumptions

  • The dependent variable is binary (two categories).
  • Observations are independent.
  • There is a linear relationship between the predictors and the log-odds of the outcome (linearity of the logit).
  • No multicollinearity among predictors.
  • Sufficiently large sample size (often cited as at least 10 events per predictor).
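The events-per-predictor rule of thumb is easy to check before fitting. A minimal sketch (the helper name and the default of 10 are illustrative conventions, not a fixed standard):

```python
def enough_events(n_events, n_predictors, events_per_predictor=10):
    """Rule-of-thumb sample-size check for logistic regression.

    n_events: count of the *rarer* outcome category, since that is
    what limits how many parameters can be estimated reliably.
    """
    return n_events >= events_per_predictor * n_predictors

# 42 events with 4 predictors satisfies the 10-events-per-predictor guideline
print(enough_events(42, 4))  # True
print(enough_events(25, 4))  # False
```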

Required Inputs

Input | Type | Notes
----- | ---- | -----
Outcome (Y) | Binary (0/1) | Dependent variable with two categories
Predictors | Numeric / Categorical | One or more independent variables

Output Metrics

Metric | What it means
------ | -------------
-2 Log L | -2 times the log-likelihood of the fitted model. Used for model comparison.
AIC | Akaike Information Criterion (lower is better).
BIC | Bayesian Information Criterion (lower is better).
Accuracy | Proportion of correctly classified observations.
Likelihood Ratio Chi-Square | Global test of whether the model is better than an intercept-only model.
Wald Chi-Square (Global) | Alternative global test based on Wald statistics.
Score Chi-Square | Score test (Lagrange multiplier) for the overall model.
McFadden R-Square | Pseudo R-squared based on the log-likelihood ratio.
Cox & Snell R-Square | Pseudo R-squared that cannot reach 1.0.
Nagelkerke R-Square | Adjusted pseudo R-squared that can reach 1.0.
Estimate (log-odds) | Estimated coefficient in log-odds units for each predictor.
Std Error | Standard error of each coefficient.
Wald Chi-Sq | Wald statistic for testing whether each coefficient is zero.
Pr > ChiSq | P-value for each predictor.
Odds Ratio | Exponentiated coefficient: exp(estimate). The multiplicative change in odds per unit change in the predictor.
OR 95% CL Lower | Lower confidence limit for the odds ratio.
OR 95% CL Upper | Upper confidence limit for the odds ratio.
Hosmer-Lemeshow Chi-Sq | Goodness-of-fit test comparing observed and expected frequencies across decile groups.
Hosmer-Lemeshow Pr > ChiSq | P-value for the Hosmer-Lemeshow test. A non-significant result suggests adequate model fit.
Sensitivity | Proportion of true positives correctly identified.
Specificity | Proportion of true negatives correctly identified.
Precision | Proportion of predicted positives that are truly positive.
F1 Score | Harmonic mean of precision and sensitivity.
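The classification metrics at the bottom of the table all derive from the 2x2 confusion matrix. A small sketch computing them from raw counts (the function name is illustrative):

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute threshold-dependent metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)          # true positive rate (recall)
    specificity = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)            # positive predictive value
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1, "accuracy": accuracy}

m = classification_metrics(tp=40, fp=5, tn=45, fn=10)
print(m["sensitivity"])  # 0.8
print(m["accuracy"])     # 0.85
```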

Interpretation

  • An odds ratio > 1 means the predictor increases the odds of the outcome; < 1 means it decreases them. An OR of 2.5 means the odds are multiplied by 2.5 for each one-unit increase in the predictor, holding other predictors constant.
  • The odds ratio confidence interval is key: if it includes 1.0, the predictor is not statistically significant at the corresponding level (e.g., 0.05 for a 95% interval).
  • Pseudo R-squared values are not directly comparable to linear regression R-squared. Nagelkerke R-squared of 0.3 can indicate a good model in practice.
  • The Hosmer-Lemeshow test evaluates calibration. A significant result (p < 0.05) indicates poor fit, but the test has low power with small samples.
  • Classification metrics (accuracy, sensitivity, specificity) depend on the classification threshold (default 0.5). Adjust the threshold based on the relative cost of false positives versus false negatives.
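The odds ratio and its confidence limits come straight from the coefficient and its standard error on the log-odds scale. A sketch assuming a Wald-type interval (z = 1.96 for 95%; the function name is illustrative):

```python
import math

def odds_ratio_with_ci(beta, se, z=1.96):
    """Exponentiate a log-odds coefficient and its Wald confidence limits."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)

# A coefficient of 0.9163 corresponds to an odds ratio of about 2.5
or_, lo, hi = odds_ratio_with_ci(0.9163, 0.30)
print(round(or_, 2), round(lo, 2), round(hi, 2))
# The interval excludes 1.0, so this predictor is significant at the 5% level
```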

Common Pitfalls

  • Complete or quasi-complete separation occurs when a predictor (or combination of predictors) perfectly predicts the outcome. Maximum likelihood estimates then diverge to infinity, producing huge coefficients and standard errors; penalized approaches such as Firth's logistic regression are a common remedy.
  • Odds ratios are often misinterpreted as risk ratios (relative risk). They are only similar when the outcome is rare (< 10%).
  • Multicollinearity inflates standard errors and makes individual predictors appear non-significant even when the overall model is significant.
  • Accuracy alone is misleading for imbalanced classes. If 95% of cases are negative, predicting "negative" for everything gives 95% accuracy.
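The accuracy pitfall is easy to demonstrate with a toy example: a classifier that always predicts the majority class looks accurate while detecting nothing.

```python
# 95 negatives, 5 positives; the "classifier" always predicts negative
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
sensitivity = tp / sum(y_true)  # true positives / all actual positives

print(accuracy)     # 0.95
print(sensitivity)  # 0.0 -- every actual positive was missed
```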

How It Works

  1. Model the log-odds of the outcome as a linear combination of predictors: log(p / (1-p)) = b0 + b1*X1 + b2*X2 + ...
  2. Estimate coefficients using maximum likelihood estimation, which finds the parameter values that make the observed data most probable.
  3. Test individual coefficients using Wald statistics and the overall model using the likelihood ratio test.
  4. Exponentiate each coefficient to obtain odds ratios for interpretation.
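The steps above can be sketched end to end in pure Python: gradient ascent on the log-likelihood of a one-predictor model, followed by exponentiating the slope. This is a teaching sketch, not a production fitter; real software uses Newton-type iteration, and all names here are illustrative.

```python
import math

def fit_logistic(xs, ys, lr=0.1, n_iter=20000):
    """Fit log(p / (1-p)) = b0 + b1*x by gradient ascent on the
    log-likelihood; each observation contributes (y - p) to the
    intercept gradient and (y - p)*x to the slope gradient."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            g0 += y - p
            g1 += (y - p) * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

# Toy data: the outcome becomes more likely as x grows, but the classes are
# not perfectly separated, so the maximum likelihood estimate exists
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
b0, b1 = fit_logistic(xs, ys)
print(math.exp(b1))  # odds ratio per unit increase in x
```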

References

  • Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society B, 20(2), 215-242.
  • Hosmer, D. W., & Lemeshow, S. (1980). Goodness of fit tests for the multiple logistic regression model. Communications in Statistics, 9(10), 1043-1069.