Regression diagnostics

The job of diagnostics

Fitting a regression model is merely the starting point of a rigorous statistical analysis. Once a model is estimated, we must determine whether the underlying assumptions of the linear regression framework are satisfied. Regression diagnostics are a set of procedures and tests used to evaluate the validity of these assumptions. If the assumptions are violated, our results may be misleading. For instance, violations of exogeneity lead to biased coefficients, while violations of homoskedasticity result in incorrect standard errors and invalid p-values. A thorough diagnostic phase is what distinguishes a professional analysis from a cursory one. The goal is not to find a perfect model, but to understand the limitations of the chosen model and ensure that the inferences drawn from it are robust.

Residual analysis

The residuals e_i = Y_i - \hat{Y}_i are the primary tools for diagnosing model performance. They represent the part of the dependent variable that the model failed to explain.

Residuals vs. Fitted plot

A scatter plot of residuals against the fitted values (\hat{Y}) is the first line of defense against model misspecification. In an ideal model, the residuals should be randomly and evenly distributed around the horizontal line at zero, showing no discernible pattern.

- Non-linearity: A curved or “U-shaped” pattern in the residuals suggests that the relationship between Y and X is not linear and that the model might require polynomial terms or logarithmic transformations. This often indicates that the marginal effect of X is not constant.
- Heteroskedasticity: A funnel-shaped pattern, where the spread of residuals increases or decreases with the fitted values, indicates that the variance of the error term is not constant. This violates the Gauss-Markov assumption of homoskedasticity.

QQ plot

The Quantile-Quantile (QQ) plot is used to assess the normality of the error terms. It plots the ordered residuals against the corresponding quantiles of a standard normal distribution. If the errors are normal, the points will fall along a straight line; when the residuals are standardized, this is the 45-degree line. Deviations at both ends of the line suggest “heavy tails” (excess kurtosis), while a bend in one direction suggests skewness. While OLS inference is robust to non-normality in large samples due to the Central Limit Theorem, normality is required for the exact validity of t-tests and F-tests in small samples.
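The coordinates behind a QQ plot can be computed without any plotting library. The sketch below is a minimal illustration, not a reference implementation: the function name `qq_points`, the plotting positions (i - 0.5)/n, and the residual values are all assumptions made for the example.

```python
from statistics import NormalDist, mean, stdev

def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal QQ plot."""
    n = len(residuals)
    m, s = mean(residuals), stdev(residuals)
    # Standardize so that normal residuals line up with the 45-degree line.
    sample = sorted((r - m) / s for r in residuals)
    # Plotting positions (i - 0.5) / n avoid infinite quantiles at 0 and 1.
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sample))

# Hypothetical residuals with one large positive value (a heavy right tail).
residuals = [-1.2, -0.8, -0.5, -0.3, -0.1, 0.0, 0.2, 0.4, 0.7, 4.0]
points = qq_points(residuals)

for t, s in points:
    print(f"theoretical {t:6.2f}   sample {s:6.2f}")
```

In this made-up sample the largest standardized residual sits well above its theoretical quantile, which is the pattern a heavy right tail produces at the top end of the plot.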

Heteroskedasticity

Heteroskedasticity occurs when Var(\epsilon_i | X) is not a constant \sigma^2 for all observations. This is extremely common in cross-sectional data where the scale of the dependent variable varies widely.

Formal Tests

Breusch-Pagan Test: This test checks for a linear relationship between the error variance and the regressors. It involves regressing the squared residuals on the original X variables. The test statistic follows a chi-square distribution under the null hypothesis of homoskedasticity.
White Test: This is a more general test that accounts for non-linear forms of heteroskedasticity by regressing the squared residuals on the regressors, their squares, and their cross-products. A significant result suggests that the conventional OLS standard errors are unreliable.
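The auxiliary regression behind the Breusch-Pagan test can be sketched in a few lines of NumPy. This is an illustration under stated assumptions: it uses the n R^2 form of the statistic (the studentized variant usually attributed to Koenker), the function name `breusch_pagan_lm` is hypothetical, and the data are simulated with an error standard deviation proportional to x.

```python
import numpy as np

def breusch_pagan_lm(y, X):
    """LM statistic n * R^2 from regressing squared OLS residuals on X.

    X must include a column of ones. Under homoskedasticity the statistic is
    asymptotically chi-square with (columns of X - 1) degrees of freedom.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    e2 = (y - X @ beta) ** 2                      # squared residuals
    gamma, *_ = np.linalg.lstsq(X, e2, rcond=None)
    ss_res = np.sum((e2 - X @ gamma) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot                    # R^2 of the auxiliary regression
    return len(y) * r2, X.shape[1] - 1

# Simulated data (assumption): the error spread grows with x.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(1.0, 10.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n) * x
X = np.column_stack([np.ones(n), x])

lm_stat, df = breusch_pagan_lm(y, X)
print(f"LM = {lm_stat:.1f} on {df} df (5% critical value for 1 df is about 3.84)")
```

The White test follows the same recipe with squares and cross-products of the regressors added to the auxiliary regression.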

Solutions: Robust Standard Errors

If heteroskedasticity is present, the conventional OLS formula for the variance-covariance matrix, \sigma^2 (X^\top X)^{-1}, is no longer valid. Instead, we use “sandwich” estimators of the form (X^\top X)^{-1} (X^\top \Omega X) (X^\top X)^{-1}, where \Omega is a diagonal matrix of squared residuals (the basic HC0 estimator). The common refinements rescale those squared residuals:

- HC1: A simple degrees-of-freedom correction (n / (n-k)).
- HC2: Weights each squared residual by 1/(1-h_{ii}), where h_{ii} is the leverage. This provides better performance when the errors are truly homoskedastic.
- HC3: Weights by 1/(1-h_{ii})^2. This is even more conservative and is generally recommended for smaller samples (n < 250) as it more effectively guards against over-optimistic standard errors and better approximates the jackknife estimator.
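The sandwich formula and its HC variants translate almost directly into matrix code. The following is a minimal sketch rather than a substitute for a tested library routine; the function name `hc_standard_errors` and the simulated data are assumptions made for illustration.

```python
import numpy as np

def hc_standard_errors(y, X, kind="HC3"):
    """OLS coefficients and heteroskedasticity-consistent standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    # Leverage h_ii = x_i' (X'X)^{-1} x_i, without forming the n x n hat matrix.
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    if kind == "HC0":
        w = e**2
    elif kind == "HC1":
        w = e**2 * n / (n - k)
    elif kind == "HC2":
        w = e**2 / (1.0 - h)
    elif kind == "HC3":
        w = e**2 / (1.0 - h) ** 2
    else:
        raise ValueError(f"unknown estimator: {kind}")
    meat = X.T @ (w[:, None] * X)          # X' Omega X with diagonal Omega
    cov = XtX_inv @ meat @ XtX_inv         # the "sandwich"
    return beta, np.sqrt(np.diag(cov))

# Simulated heteroskedastic data (assumption).
rng = np.random.default_rng(1)
n = 60
x1, x2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 1.0 + 2.0 * x1 - x2 + rng.normal(size=n) * (1.0 + np.abs(x1))

se = {kind: hc_standard_errors(y, X, kind)[1] for kind in ("HC0", "HC1", "HC2", "HC3")}
for kind, values in se.items():
    print(kind, np.round(values, 4))
```

Because each leverage lies between 0 and 1, the HC3 standard errors are never smaller than the HC2 ones, which in turn are never smaller than HC0.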

Multicollinearity

Multicollinearity exists when independent variables are highly correlated. While it does not bias the coefficients, it inflates their variances. The variance of a coefficient \hat{\beta}_j can be expressed as:

Var(\hat{\beta}_j) = \frac{\sigma^2}{\sum_i (X_{ij} - \bar{X}_j)^2 (1 - R^2_j)}

where R^2_j is the R-squared from regressing X_j on the other regressors. As R^2_j approaches 1, the variance approaches infinity.

Metrics

Variance Inflation Factor (VIF): For each regressor j, the VIF is 1 / (1 - R^2_j). A VIF of 10 indicates that the variance of the coefficient is ten times larger than it would be if the regressor were uncorrelated with the others.
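Condition Number: The square root of the ratio of the largest to the smallest eigenvalue of X^\top X, usually computed after scaling each column of X to unit length so that the result does not depend on units of measurement. Values above 30 suggest severe instability.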
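Both metrics are easy to compute by hand, which helps make clear what they measure. The sketch below assumes a regressor matrix without the intercept column; the function names `vif` and `condition_number` and the simulated, deliberately correlated regressors are illustrative choices, not part of any library.

```python
import numpy as np

def vif(X):
    """VIF for each column of X (regressors only, no intercept column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        # Regress X_j on an intercept and the remaining regressors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        r2 = 1.0 - resid @ resid / np.sum((X[:, j] - X[:, j].mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

def condition_number(X):
    """Condition number of the design matrix with unit-length columns."""
    Z = np.column_stack([np.ones(len(X)), X])
    Z = Z / np.linalg.norm(Z, axis=0)
    eig = np.linalg.eigvalsh(Z.T @ Z)      # eigenvalues in ascending order
    return np.sqrt(eig[-1] / eig[0])

# Simulated regressors (assumption): x2 is x1 plus a little noise.
rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.3 * rng.normal(size=n)
X = np.column_stack([x1, x2])

vifs = vif(X)
kappa = condition_number(X)
print("VIFs:", np.round(vifs, 2), " condition number:", round(float(kappa), 2))
```

With only two regressors, both VIFs reduce to 1 / (1 - r^2), where r is their sample correlation, which is a convenient sanity check.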

Influence and leverage

Not all observations are equally important. Some may have an outsized impact on the regression results.

Leverage (h_{ii}): Measures how far an observation’s regressor values are from the mean. High-leverage points act as “anchor points” that can pivot the regression line.
Influence: An observation is influential if its removal causes a substantial change in the estimated parameters.

Metrics

Cook’s Distance (D_i): Measures the aggregate change in all fitted values when observation i is removed. D_i > 4/n is a common threshold for further investigation.
DFBETAs: Measures the change in a specific coefficient \beta_j when observation i is removed, normalized by the standard error.
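Leverage, Cook's distance, and DFBETAS all have closed forms, so no model needs to be refit n times. The following sketch uses those standard identities; the function name `influence_measures` and the small data set, with one point placed far from the others in both x and y, are assumptions made for illustration.

```python
import numpy as np

def influence_measures(y, X):
    """Leverage, Cook's distance, and DFBETAS for an OLS fit (X includes ones)."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    e = y - X @ beta
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)          # leverage h_ii
    s2 = e @ e / (n - k)
    cooks = e**2 / (k * s2) * h / (1.0 - h) ** 2
    # Row i holds beta - beta_(i), the coefficient change from dropping obs i.
    dbeta = (X @ XtX_inv) * (e / (1.0 - h))[:, None]
    # Residual variance with observation i left out.
    s2_loo = ((n - k) * s2 - e**2 / (1.0 - h)) / (n - k - 1)
    dfbetas = dbeta / np.sqrt(np.outer(s2_loo, np.diag(XtX_inv)))
    return h, cooks, dfbetas

# Hypothetical data: 19 well-behaved points plus one high-leverage outlier.
x = np.append(np.arange(1.0, 20.0), 40.0)
y = 2.0 + 0.5 * x + 0.3 * np.sin(x)
y[-1] -= 15.0
X = np.column_stack([np.ones(len(x)), x])

h, cooks, dfbetas = influence_measures(y, X)
flagged = np.where(cooks > 4 / len(y))[0]
print("observations with D_i > 4/n:", flagged)
print("leverage of last point:", round(float(h[-1]), 3))
```

The leverages sum to the number of estimated coefficients, so an average leverage of k/n is a natural baseline when judging individual h_{ii} values.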

Outliers and Robust Regression

Outliers with large residuals but low leverage may not change the slope much but will inflate standard errors. Outliers that also have high leverage are the influential points: they can drastically alter the coefficients.

If outliers are problematic but legitimate, “Robust Regression” methods are superior to simply deleting data.

- M-Estimation: Replaces the squared residual loss function with a function that grows more slowly, such as the Huber loss, which is quadratic for small residuals and linear for large ones. This downweights the influence of outliers.
- Trimmed Mean / Winsorizing: Reduces the impact of extreme values by removing them or capping them at a certain percentile before running OLS.
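Huber M-estimation is commonly computed by iteratively reweighted least squares (IRLS). The sketch below is one simple way to do it and makes several assumptions: the tuning constant c = 1.345, a MAD-based scale re-estimated at every iteration, and the function name `huber_irls` are illustrative choices, and library implementations may differ in these details.

```python
import numpy as np

def huber_irls(y, X, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of the regression coefficients via IRLS."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]           # start from OLS
    w = np.ones(len(y))
    for _ in range(max_iter):
        r = y - X @ beta
        # Robust scale: median absolute deviation, rescaled for normal errors.
        scale = np.median(np.abs(r - np.median(r))) / 0.6745
        u = np.abs(r) / max(scale, 1e-12)
        # Huber weights: 1 inside the threshold, c/|u| outside it.
        w = np.where(u <= c, 1.0, c / np.maximum(u, 1e-12))
        sw = np.sqrt(w)
        new_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta, w

# Hypothetical data: a clean line with slope 2 and one contaminated response.
x = np.arange(20.0)
y = 1.0 + 2.0 * x + 0.1 * np.sin(x)
y[17] += 25.0
X = np.column_stack([np.ones(len(x)), x])

ols_beta = np.linalg.lstsq(X, y, rcond=None)[0]
huber_beta, weights = huber_irls(y, X)
print("OLS slope:  ", round(float(ols_beta[1]), 3))
print("Huber slope:", round(float(huber_beta[1]), 3))
print("weight on contaminated point:", round(float(weights[17]), 4))
```

The contaminated observation stays in the data but receives a small weight, so the fit is driven by the remaining points rather than by a deletion decision.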

Functional form and specification

A model is misspecified if it omits important variables or uses an incorrect functional form.

Ramsey RESET Test

The Regression Equation Specification Error Test (RESET) adds powers of the fitted values (e.g., \hat{Y}^2, \hat{Y}^3) to the model and tests their joint significance. If significant, it suggests that the model is missing non-linearities or interaction terms.
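The RESET statistic is an ordinary F-test comparing the original model with the model augmented by powers of the fitted values. The sketch below assumes that formulation; the function name `reset_f_stat` and the simulated data, where the true relationship is quadratic but the fitted model is linear, are illustrative.

```python
import numpy as np

def reset_f_stat(y, X, powers=(2, 3)):
    """F statistic for the joint significance of powers of the fitted values."""
    n, k = X.shape
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X @ beta
    rss_restricted = np.sum((y - fitted) ** 2)
    # Augment the design matrix with fitted^2, fitted^3, ...
    Z = np.column_stack([X] + [fitted**p for p in powers])
    gamma = np.linalg.lstsq(Z, y, rcond=None)[0]
    rss_unrestricted = np.sum((y - Z @ gamma) ** 2)
    q = len(powers)
    df2 = n - k - q
    f_stat = ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df2)
    return f_stat, q, df2

# Simulated data (assumption): the true model contains x^2, the fit does not.
rng = np.random.default_rng(3)
x = np.linspace(0.0, 4.0, 60)
y = 1.0 + x + 0.5 * x**2 + rng.normal(0.0, 0.2, size=x.size)
X = np.column_stack([np.ones(x.size), x])

f_stat, q, df2 = reset_f_stat(y, X)
print(f"RESET F({q}, {df2}) = {f_stat:.1f}")
```

A large F statistic relative to the F(q, n - k - q) critical value points to neglected non-linearity, although it does not say which term is missing.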

Link Test

Often used in GLMs but applicable to OLS, the link test regresses Y on \hat{Y} and \hat{Y}^2. If the squared term is significant, the “link” between the linear predictor and the dependent variable is likely misspecified.

Worked example: Diagnostics in Action

We fit a model and find evidence of heteroskedasticity and one highly influential outlier. We re-estimate with HC3 errors and check the stability of results after downweighting the outlier.

Excel

Excel requires using LINEST for basic stats and manual array formulas for more advanced diagnostics.

A1:A100: [X1]
B1:B100: [X2]
C1:C100: [Y]
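H1: =LINEST(C1:C100, A1:B100, TRUE, TRUE) (Fills H1:J5; coefficients are returned in reverse order: X2, X1, intercept)
D1: =C1 - (INDEX($H$1:$J$1, 1, 1)*B1 + INDEX($H$1:$J$1, 1, 2)*A1 + INDEX($H$1:$J$1, 1, 3)) (Residuals; fill down to D100)
E1: =D1^2 (Squared residuals; fill down to E100)
F1: =1/(1 - INDEX(LINEST(A1:A100, B1:B100, TRUE, TRUE), 3, 1)) (VIF for X1; R-squared is in row 3, column 1 of the LINEST output)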

Stata

Stata’s diagnostic suite is highly integrated.

regress y x1 x2
predict res, resid
predict cookd, cooksd
estat hettest      // BP test
estat vif          // VIFs
estat ovtest       // Ramsey RESET

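* Robust SEs (HC3, as in the worked example; vce(robust) would give HC1)
regress y x1 x2, vce(hc3)

* Robust regression (rreg runs Huber iterations followed by biweight iterations)
rreg y x1 x2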

R

R offers the most flexibility for robust methods.

library(car)
library(lmtest)
library(sandwich)
library(MASS)

model <- lm(y ~ x1 + x2, data = df)

# Diagnostics
vif(model)
bptest(model)
resettest(model)

# Robust SEs (HC3)
coeftest(model, vcov = vcovHC(model, type = "HC3"))

# Huber M-estimator
robust_model <- rlm(y ~ x1 + x2, data = df)
summary(robust_model)

Julia

In Julia, we use GLM.jl and CovarianceMatrices.jl.

using GLM, DataFrames, CovarianceMatrices, LinearAlgebra

model = lm(@formula(y ~ x1 + x2), df)

# HC3 standard errors
vcov_matrix = vcov(model, HC3())
se_robust = sqrt.(diag(vcov_matrix))

# Cook's Distance calculation
X = modelmatrix(model)
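# Leverage h_ii is the diagonal of the hat matrix X (X'X)^{-1} X';
# compute it row by row without forming the full n x n matrix
leverage = vec(sum((X / (X'X)) .* X, dims=2))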
e = residuals(model)
s2 = sum(e.^2) / (size(X,1) - size(X,2))
cooks_d = (e.^2 ./ (size(X,2) * s2)) .* (leverage ./ (1 .- leverage).^2)

Common traps

A frequent mistake is the binary “pass/fail” approach to diagnostics. Statistical tests are sensitive to sample size. In a very large sample, a Breusch-Pagan test might reject homoskedasticity even if the actual violation is trivial and has no impact on the results. Conversely, in small samples, tests may fail to detect serious problems due to low power.

Another trap is using robust standard errors as a panacea. Robust SEs correct the inference for heteroskedasticity, but they do nothing for omitted variable bias or incorrect functional form. If the residuals show a clear non-linear pattern, you should fix the model specification (e.g., add X^2) rather than just switching to robust standard errors.

Finally, researchers often delete outliers to “clean” the data. This can lead to overfitting and biased results that do not generalize. Every exclusion must be theoretically justified (e.g., the observation is a clear coding error).

Reporting checklist

Discuss the findings from residual plots, specifically noting any patterns that suggest non-linearity or heteroskedasticity.
Report VIF values to confirm the absence of severe multicollinearity.
State the results of formal heteroskedasticity tests.
Clearly identify the type of robust standard errors used (e.g., “HC3 errors were used to ensure valid inference”).
Report any observations identified as highly influential and explain how they were handled (e.g., “Results were stable even after removing the observation with high Cook’s distance”).
Report the results of specification tests like the Ramsey RESET.
Provide a clear justification for any functional form choices (e.g., “Log transformation was applied to address skewness in the residuals”).

References

Belsley, D. A., Kuh, E., & Welsch, R. E. (2004). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. Wiley.
Cook, R. D., & Weisberg, S. (1982). Residuals and Influence in Regression. Chapman and Hall.
Long, J. S., & Ervin, L. H. (2000). Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model. The American Statistician, 54(3), 217-224.
Huber, P. J. (1981). Robust Statistics. Wiley.
Wooldridge, J. M. (2015). Introductory Econometrics: A Modern Approach (6th ed.). Cengage Learning.