Generalized linear models
Why we need GLMs
Ordinary linear regression assumes the response is continuous, roughly Gaussian around its mean, and lives on the entire real line. Many of the outcomes that applied workers actually care about violate every one of those assumptions. A loan either defaults or it does not. A patient experiences zero, one, two, or three adverse events in a month. A factory produces a count of defective units that depends on production volume. In all of these cases, fitting a linear model directly produces predictions that can fall outside the support of the response (negative probabilities, negative counts), residuals that are heteroskedastic by construction, and standard errors that are wrong in ways no robust correction can fully repair.
The generalized linear model (GLM), introduced by Nelder and Wedderburn (1972), keeps the appealing structure of linear regression (a linear combination of predictors) but expands it to handle outcomes whose distributions belong to a wider family. The mean of the response is no longer the linear predictor itself. It is a known monotone transformation of the linear predictor, chosen so that predictions respect the support of the distribution. The variance is allowed to depend on the mean in a structured way, which removes the need to assume constant variance.
A GLM is not a single model. It is a recipe. Specify the three ingredients and the rest follows.
The three pieces of a GLM
A generalized linear model consists of three components.
Random component. The response Y_i comes from a distribution in the exponential dispersion family. The density or mass function can be written as f(y_i; \theta_i, \phi) = \exp\left\{ \frac{y_i \theta_i - b(\theta_i)}{a(\phi)} + c(y_i, \phi) \right\} where \theta_i is the natural (canonical) parameter, \phi is the dispersion parameter, and a, b, c are known functions. The mean is \mu_i = b'(\theta_i) and the variance is \text{Var}(Y_i) = b''(\theta_i) a(\phi). Familiar members include the Bernoulli, binomial, Poisson, negative binomial, gamma, and normal distributions.
Systematic component. A vector of predictors x_i enters the model only through the linear predictor \eta_i = x_i^\top \beta This is the part the GLM shares with ordinary regression.
Link function. A monotone, differentiable function g connects the mean of the response to the linear predictor: g(\mu_i) = \eta_i = x_i^\top \beta The inverse link g^{-1} maps the real line back to the support of the response. Choosing g to be the canonical link (the one for which \theta_i = \eta_i) gives some convenient algebraic and inferential properties, but other choices are sometimes preferred for interpretation.
The familiar linear model is the special case where Y is normal, g is the identity, and \phi = \sigma^2.
Logistic regression
When Y_i \in \{0, 1\} is binary, the natural random component is Bernoulli with mean \pi_i = P(Y_i = 1 \mid x_i). The canonical link is the logit: g(\pi_i) = \log\frac{\pi_i}{1 - \pi_i} = x_i^\top \beta Solving for \pi_i gives the logistic (sigmoid) mean function: \pi_i = \frac{\exp(x_i^\top \beta)}{1 + \exp(x_i^\top \beta)}
Interpretation
A coefficient \beta_j is the change in the log-odds of Y = 1 for a one-unit increase in x_j, holding the other predictors fixed. Exponentiating, \exp(\beta_j) is the odds ratio. If \exp(\beta_j) = 1.5, a one-unit increase in x_j multiplies the odds of the event by 1.5.
A common mistake (more on this in the traps section) is to read \beta_j as a percentage-point change in probability. It is not. The marginal effect on probability depends on where you evaluate it: \frac{\partial \pi_i}{\partial x_{ij}} = \pi_i (1 - \pi_i) \beta_j At \pi_i = 0.5 this is 0.25 \beta_j. Near \pi_i = 0 or \pi_i = 1 it is close to zero. The convention in most applied work is to report the average marginal effect, \frac{1}{n} \sum_i \pi_i (1 - \pi_i) \beta_j, with standard errors computed by the delta method or by bootstrap.
A second interpretation, useful when the predictor is itself binary, is the odds-ratio difference. Suppose x_j is a dummy for treatment. The odds of Y = 1 under treatment are \exp(\beta_j) times the odds without treatment. If the baseline probability is 0.10, the corresponding odds are 0.10 / 0.90 \approx 0.11. Multiplying by \exp(\beta_j) = 1.5 gives odds of about 0.167, which translates back to a probability of 0.167 / 1.167 \approx 0.143. A 4.3 percentage-point increase, not a 50 percent increase. The two summaries differ in important ways and the audience determines which to lead with.
Likelihood
For independent Bernoulli observations, the log-likelihood is \ell(\beta) = \sum_{i=1}^n \big[ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \big] which after substituting \pi_i = \exp(x_i^\top \beta) / (1 + \exp(x_i^\top \beta)) becomes \ell(\beta) = \sum_{i=1}^n \big[ y_i x_i^\top \beta - \log(1 + \exp(x_i^\top \beta)) \big] This is concave in \beta, so any local maximum is the global maximum. The score equations are \nabla \ell(\beta) = X^\top (y - \pi) = 0 which says that the residuals y - \pi are orthogonal to every column of X at the optimum. This is the analog of the OLS normal equations for the binary outcome.
Probit regression
The probit model uses the standard normal CDF \Phi as the inverse link: \pi_i = \Phi(x_i^\top \beta) Equivalently, g(\pi_i) = \Phi^{-1}(\pi_i). The motivation is a latent-variable story. There is an unobserved continuous Y_i^* with Y_i^* = x_i^\top \beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, 1) and we observe Y_i = \mathbf{1}\{Y_i^* > 0\}. Then P(Y_i = 1 \mid x_i) = \Phi(x_i^\top \beta).
Logit and probit produce nearly identical fitted probabilities in the central region (say 0.1 < \pi < 0.9). They diverge in the tails. The logistic distribution has heavier tails than the normal, so probit pulls extreme probabilities toward zero or one more aggressively than logit. Coefficients are not on the same scale: \beta^{\text{logit}} \approx 1.6 \beta^{\text{probit}} is the rough conversion (\pi / \sqrt{3} is the exact factor for the standard deviations).
When should you prefer one over the other? Logit if you want odds ratios that you can report cleanly. Probit if the latent-variable interpretation is substantively meaningful (a common choice in classical econometrics for choice models). For prediction alone the difference is rarely material.
Poisson regression for counts
When Y_i \in \{0, 1, 2, \ldots\} is a count, the natural starting point is Y_i \mid x_i \sim \text{Poisson}(\mu_i), \quad \log \mu_i = x_i^\top \beta The log link guarantees \mu_i > 0. A coefficient \beta_j is the change in \log \mu for a one-unit increase in x_j, so \exp(\beta_j) - 1 is the percentage change in the expected count. If \exp(\beta_j) = 1.20, increasing x_j by one is associated with a 20 percent rise in the mean count.
The defining feature of the Poisson is that \text{Var}(Y_i) = E(Y_i) = \mu_i. This is a strong assumption. In practice counts are usually overdispersed: the variance exceeds the mean. Failing to check this is one of the most common mistakes in applied count modeling.
An informal diagnostic is the Pearson dispersion statistic, \hat\phi = \frac{1}{n - p} \sum_i \frac{(y_i - \hat\mu_i)^2}{\hat\mu_i} which should be close to 1 under a correctly specified Poisson. Values above 1.5 or so are evidence of overdispersion.
Negative binomial for overdispersed counts
If the Poisson assumption fails, the negative binomial provides a flexible alternative. One useful parameterization writes Y_i \mid x_i \sim \text{NB}(\mu_i, \alpha), \quad \log \mu_i = x_i^\top \beta with variance \text{Var}(Y_i) = \mu_i + \alpha \mu_i^2. The extra parameter \alpha \geq 0 inflates the variance above the mean, and Poisson is recovered at \alpha = 0. A likelihood ratio test of \alpha = 0 against \alpha > 0 is a clean way to decide between Poisson and negative binomial.
The interpretation of \beta is the same as in Poisson regression: an exponent is a multiplicative effect on the mean count. The standard errors, by contrast, are typically larger under NB, reflecting the additional variance.
Estimation by maximum likelihood and IRLS
There is no closed form for \hat\beta in a GLM. We maximize the log-likelihood numerically. The standard algorithm is iteratively reweighted least squares (IRLS), which dates to Nelder and Wedderburn’s original paper. The key idea is that the score equations of a GLM can be rearranged to look like the normal equations of a weighted least squares problem, with weights that themselves depend on \beta.
At iteration t, given the current estimate \beta^{(t)}:
- Compute the linear predictor \eta_i^{(t)} = x_i^\top \beta^{(t)} and the fitted mean \mu_i^{(t)} = g^{-1}(\eta_i^{(t)}).
- Build the working response z_i^{(t)} = \eta_i^{(t)} + (y_i - \mu_i^{(t)}) \frac{d\eta_i}{d\mu_i}\bigg|_{\mu_i^{(t)}}
- Build the working weights w_i^{(t)} = \left[ \left(\frac{d\eta_i}{d\mu_i}\right)^2 \text{Var}(Y_i) \right]^{-1}\bigg|_{\mu_i^{(t)}}
- Solve the weighted least squares problem \beta^{(t+1)} = (X^\top W^{(t)} X)^{-1} X^\top W^{(t)} z^{(t)}
- Iterate until \beta stops moving.
IRLS is Fisher scoring in disguise. It converges quickly for well-behaved GLMs, typically in five to ten iterations. The final (X^\top W X)^{-1} is the asymptotic covariance matrix of \hat\beta.
For logistic regression specifically, the working weights take the simple form w_i = \pi_i (1 - \pi_i) and the working response is z_i = x_i^\top \beta + (y_i - \pi_i) / [\pi_i (1 - \pi_i)]. The Newton-Raphson update reduces to a weighted regression of z on X at each iteration. For Poisson regression with the log link, w_i = \mu_i and z_i = x_i^\top \beta + (y_i - \mu_i) / \mu_i. Different link functions produce different weight and working-response formulas, but the algorithmic skeleton is the same.
Starting values matter. The standard choice for logistic regression sets \pi_i^{(0)} = (y_i + 0.5) / 2 (a shrinkage toward 0.5 that avoids zeros and ones), then computes the initial \beta^{(0)} from a single weighted least squares step. For Poisson, \mu_i^{(0)} = y_i + 0.1 is common. Bad starting values can stall IRLS or push it toward a flat region of the log-likelihood.
Inference
Three asymptotically equivalent tests appear in every GLM output.
Wald test. The familiar z or \chi^2 statistic built from \hat\beta and its standard error. Easy to compute. Can misbehave with separation or in small samples. W = \hat\beta_j / \text{SE}(\hat\beta_j) \sim N(0, 1)
Likelihood ratio test. Compare twice the difference of log-likelihoods between a nested pair of models: \text{LR} = 2[\ell(\hat\beta_{\text{full}}) - \ell(\hat\beta_{\text{null}})] \sim \chi^2_q where q is the number of restrictions. Generally the best behaved of the three.
Score (Lagrange multiplier) test. Built from the gradient of the log-likelihood evaluated at the restricted estimate. Useful when fitting the full model is expensive.
For overall model fit, several pseudo-R^2 measures exist. McFadden’s R^2 is the most commonly reported: R^2_{\text{McF}} = 1 - \frac{\ell(\hat\beta)}{\ell(\hat\beta_0)} where \ell(\hat\beta_0) is the log-likelihood of the null (intercept-only) model. Cox and Snell’s variant is R^2_{\text{CS}} = 1 - [L_0 / L_{\text{full}}]^{2/n}. Neither is bounded above by 1 in the same way that ordinary R^2 is, so the magnitudes do not translate directly. Values of 0.2 to 0.4 in a McFadden R^2 are generally considered excellent fit for a binary outcome.
For binary classifiers the ROC curve plots the true positive rate against the false positive rate as the classification threshold varies. The area under that curve (AUC) is a threshold-free summary of discrimination: an AUC of 0.5 is no better than a coin flip, an AUC of 1.0 is perfect discrimination.
Worked example: loan default
We simulate a dataset of n = 500 borrowers. The outcome is whether the borrower defaulted within a year (y \in \{0, 1\}). The predictors are credit score (x1, standardized) and annual income in tens of thousands of dollars (x2). The true data-generating process is P(Y_i = 1 \mid x_i) = \frac{1}{1 + \exp[-(-1.2 - 0.8 x_{1i} - 0.3 x_{2i})]} so higher credit scores and higher incomes both reduce the probability of default.
Excel
Excel does not ship with a built-in logistic regression command, so we fit it by hand using Solver to maximize the log-likelihood. This is a useful exercise. It strips away the black-box feel of glm() and shows what the algorithm is doing.
A1:A500 =y (0/1 outcome)
B1:B500 =x1 (credit score)
C1:C500 =x2 (income)
E1 =intercept guess (start 0)
E2 =beta1 guess (start 0)
E3 =beta2 guess (start 0)
D1 =1/(1+EXP(-($E$1+$E$2*B1+$E$3*C1))) 'fitted pi_i
F1 =A1*LN(D1)+(1-A1)*LN(1-D1) 'log-likelihood contribution
D1:F1 filled down to row 500
G1 =SUM(F1:F500) 'total log-likelihood
Tools > Solver:
Set Objective: G1
To: Max
By changing variable cells: E1:E3
Solving Method: GRG Nonlinear
Solver converges to estimates close to the true coefficients. To get standard errors by hand, compute the Hessian of the log-likelihood at the solution numerically (small perturbations of E1:E3) and take the square roots of the diagonal of its negative inverse. This is feasible but tedious. For real applied work, do logistic regression in a statistics package.
Stata
* fit
logit y x1 x2, robust
* convert log-odds to odds ratios
logit, or
* average marginal effects on the probability scale
margins, dydx(*)
* classification table at the default 0.5 cutoff
estat classification
* ROC curve and AUC
lrocThe robust option gives Huber-White standard errors, which protect against mild misspecification of the variance structure. margins, dydx(*) is the workhorse for translating logit coefficients into something interpretable to a non-technical audience.
R
library(broom)
library(marginaleffects)
library(pROC)
fit <- glm(y ~ x1 + x2, data = df, family = binomial(link = "logit"))
# coefficient table with confidence intervals
tidy(fit, conf.int = TRUE, exponentiate = TRUE)
# average marginal effects on probability scale
avg_slopes(fit)
# pseudo R^2 (McFadden)
1 - fit$deviance / fit$null.deviance
# ROC curve and AUC
roc_obj <- roc(df$y, predict(fit, type = "response"))
auc(roc_obj)
plot(roc_obj)For probit, switch to family = binomial(link = "probit"). For Poisson with the log link, family = poisson(). For negative binomial, use MASS::glm.nb() because the dispersion parameter is not in the exponential family and needs its own iteration.
Julia
using GLM, DataFrames, StatsBase
fit_logit = glm(@formula(y ~ x1 + x2), df, Binomial(), LogitLink())
coef(fit_logit)
stderror(fit_logit)
deviance(fit_logit)
loglikelihood(fit_logit)
# McFadden pseudo-R^2
fit_null = glm(@formula(y ~ 1), df, Binomial(), LogitLink())
1 - loglikelihood(fit_logit) / loglikelihood(fit_null)
# predicted probabilities
predict(fit_logit)For probit, swap LogitLink() for ProbitLink(). Poisson is Poisson() with LogLink(). Negative binomial is in GLM.jl as NegativeBinomial(theta), but the dispersion parameter must be supplied or estimated separately.
Common traps
Misreading logistic coefficients. A coefficient of 0.8 in a logistic regression does not mean “an 80 percent increase in the probability.” It means an 80 percent increase in the log-odds, which translates to multiplying the odds by \exp(0.8) \approx 2.23. To get a probability change, compute marginal effects. Always report marginal effects alongside coefficients when the audience is not statistically trained.
Ignoring overdispersion in Poisson. The textbook Poisson model collapses if the variance exceeds the mean. Standard errors will be too small and p-values will be too aggressive. Always compute the Pearson dispersion statistic. If it exceeds 1.5, fit a negative binomial or a quasi-Poisson with an inflation factor on the standard errors.
Rare events and separation. If the event is very rare (fewer than ~20 events) or if one predictor perfectly predicts the outcome, maximum likelihood breaks down. Coefficients drift toward \pm \infty and standard errors explode. Solutions include Firth’s penalized likelihood (logistf in R), exact logistic regression for small samples, or a Bayesian model with a weakly informative prior.
Treating logit and probit as the same model. They are not. Their coefficients are not on the same scale, and the implied tail probabilities differ. If you switch from logit to probit in revisions of a paper without rescaling, your readers will be confused.
Comparing pseudo-R^2 across models with different samples. McFadden’s R^2 is sample-dependent. A model that explains 30 percent of the deviance on one dataset does not necessarily explain 30 percent on another. Pseudo-R^2 is for within-sample model comparison only.
Predicting outside the support of the training data. GLMs extrapolate linearly on the link scale, which can produce wild predictions when a new observation lies far from the support of X used to fit the model. A logistic model trained on credit scores between 550 and 750 will happily produce a probability for a score of 400, but the prediction has no empirical basis. Always check the range of predictors at prediction time and flag observations that lie outside the calibration range.
Confusing within-sample and out-of-sample performance. AUC on the same data used to fit the model overstates the model’s discrimination. Split the sample, cross-validate, or hold out a test set. The size of the gap between in-sample and out-of-sample AUC is itself a diagnostic for overfitting.
Reporting checklist
- State the random component, link function, and the predictors in the model. Do not write “we ran a logistic regression” and stop there.
- Report coefficients with standard errors and either confidence intervals or p-values. For logit, also report odds ratios.
- For logit and probit, report average marginal effects. Audiences understand “a one-unit increase in income reduces default probability by 4 percentage points” far better than “a coefficient of -0.3 on the log-odds scale.”
- Report a measure of fit. For binary outcomes, AUC and a confusion matrix at a meaningful threshold. For counts, the Pearson dispersion statistic and, if relevant, the result of an overdispersion test.
- Identify the standard error type (model-based, robust, clustered).
- If you used a penalized or Bayesian fit because of separation or sparse data, say so and explain the prior or penalty.
- Provide the sample size, the event rate, and any exclusions.
References
- McCullagh, P., & Nelder, J. A. (1989). Generalized Linear Models (2nd ed.). Chapman and Hall. The original definitive treatment.
- Agresti, A. (2013). Categorical Data Analysis (3rd ed.). Wiley. The standard reference for binary, multinomial, and count models.
- Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data (2nd ed.). MIT Press. The applied-econometrics view, including quasi-likelihood and robust inference.
- Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized linear models. Journal of the Royal Statistical Society A, 135(3), 370 to 384.
- Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27 to 38. The original paper on penalized likelihood for separation.