Linear regression

The linear model

The linear regression model is the foundational tool of applied statistics and econometrics. It provides a formal framework for describing how a dependent variable Y changes in response to one or more independent variables X. We write the model as:

Y = X\beta + \epsilon

In this matrix notation, Y is an n \times 1 vector of observations, X is an n \times k matrix of regressors (typically including a column of ones to represent the intercept), \beta is a k \times 1 vector of unknown coefficients we wish to estimate, and \epsilon is an n \times 1 vector of unobserved error terms.

Classical assumptions

For the Ordinary Least Squares (OLS) estimator to be unbiased and to have the smallest variance among linear unbiased estimators, we invoke the Gauss-Markov assumptions. These assumptions define the “Classical Linear Regression Model”:

A1: Linearity. The model is linear in the parameters \beta. This means Y is a linear combination of the columns of X plus an error term. Note that X itself can contain non-linear transformations of variables (like X^2 or \ln X).
A2: Full Rank. The matrix X has full column rank, meaning rank(X) = k. This ensures that there is no perfect multicollinearity; no regressor can be written as an exact linear combination of the others.
A3: Strict Exogeneity. The errors have conditional mean zero given the entire regressor matrix: E[\epsilon | X] = 0. This is the most critical assumption for unbiasedness, implying that the regressors contain no information about the expected value of the error.
A4: Homoskedasticity. The errors have constant variance: Var(\epsilon_i | X) = \sigma^2 for all i.
A5: No Autocorrelation. The errors are uncorrelated across different observations: Cov(\epsilon_i, \epsilon_j | X) = 0 for i \neq j.

If assumptions A1 through A5 hold, the Gauss-Markov theorem states that the OLS estimator \hat{\beta} is the Best Linear Unbiased Estimator (BLUE).

OLS as projection

Beyond simple calculus, OLS has a profound geometric interpretation. Consider Y as a point in n-dimensional space. The columns of X span a subspace of dimension k within that space. Our goal is to find a vector in that subspace, \hat{Y} = X\hat{\beta}, that is as close as possible to Y in terms of Euclidean distance.

The point in the subspace closest to Y is the orthogonal projection of Y onto the column space of X. This implies that the residual vector e = Y - \hat{Y} must be orthogonal to the entire subspace spanned by X. Mathematically, this means:

X^\top e = X^\top (Y - X\hat{\beta}) = 0

These are the orthogonality conditions that define the OLS estimator. The projection matrix P = X(X^\top X)^{-1}X^\top, often called the “hat matrix,” transforms Y into its fitted values \hat{Y}. The matrix M = I - P is the residual-maker matrix, which transforms Y into the residuals e. Both P and M are symmetric and idempotent.
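The properties of P and M can be checked numerically. The following is a minimal sketch on simulated data; the sample size, seed, and variable names are illustrative assumptions, and the explicit inverse is used only because the example is small.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
# Design matrix: a column of ones plus two simulated regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = rng.normal(size=n)

# Hat matrix P and residual-maker M
P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(n) - P

Y_hat = P @ Y   # fitted values
e = M @ Y       # residuals

print(np.allclose(P, P.T), np.allclose(P @ P, P))  # P symmetric and idempotent
print(np.allclose(M, M.T), np.allclose(M @ M, M))  # M symmetric and idempotent
print(np.allclose(X.T @ e, 0))                     # orthogonality conditions X'e = 0
print(np.isclose(np.trace(P), k))                  # trace of P equals the number of regressors
```

Because P projects onto a k-dimensional subspace, its trace equals k, which is the last check above.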

The normal equations and properties

Solving the orthogonality condition X^\top (Y - X\hat{\beta}) = 0 leads directly to the normal equations:

X^\top X \hat{\beta} = X^\top Y

Assuming X has full column rank, the matrix X^\top X is invertible, giving the closed-form solution:

\hat{\beta} = (X^\top X)^{-1} X^\top Y
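In numerical work the normal equations are usually solved as a linear system rather than by forming the inverse explicitly. A small sketch, assuming simulated data with illustrative coefficient values, compares that approach with NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, -0.5])  # illustrative values, not from the text
Y = X @ beta_true + rng.normal(size=n)

# Solve the normal equations X'X b = X'Y as a linear system
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Generic least-squares solver, which does not form X'X
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_normal)
print(np.allclose(beta_normal, beta_lstsq))
```

Both routes give the same estimate here; a least-squares solver tends to be the safer choice when the columns of X are nearly collinear.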

Unbiasedness

Substituting Y = X\beta + \epsilon into the closed-form solution gives:

\hat{\beta} = (X^\top X)^{-1} X^\top (X\beta + \epsilon) = \beta + (X^\top X)^{-1} X^\top \epsilon

Taking the expectation conditional on X:

E[\hat{\beta} | X] = \beta + (X^\top X)^{-1} X^\top E[\epsilon | X]

Under the exogeneity assumption (E[\epsilon | X] = 0), E[\hat{\beta} | X] = \beta, so OLS is unbiased conditional on X. By the law of iterated expectations, E[\hat{\beta}] = \beta unconditionally as well.
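Unbiasedness is a statement about the average of \hat{\beta} over repeated samples, which a small Monte Carlo experiment can illustrate. The sketch below holds X fixed and redraws the errors; the true coefficients, sample size, and number of replications are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 2000
beta_true = np.array([1.0, 2.0])  # illustrative values
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # held fixed across replications

estimates = np.empty((reps, 2))
for r in range(reps):
    eps = rng.normal(size=n)   # errors drawn independently of X, so E[eps | X] = 0
    Y = X @ beta_true + eps
    estimates[r] = np.linalg.solve(X.T @ X, X.T @ Y)

mean_estimate = estimates.mean(axis=0)
print(mean_estimate)  # should be close to beta_true
```

Individual estimates vary from sample to sample, but their average settles near the true coefficients.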

Variance and Gauss-Markov

Under A4 and A5, Var(\epsilon | X) = \sigma^2 I, and the variance-covariance matrix of the OLS estimator is:

Var(\hat{\beta} | X) = \sigma^2 (X^\top X)^{-1}

The Gauss-Markov theorem states that for any other linear unbiased estimator \tilde{\beta} = AY, the variance Var(\tilde{\beta}|X) is greater than or equal to Var(\hat{\beta}|X), in the sense that the difference matrix Var(\tilde{\beta}|X) - Var(\hat{\beta}|X) is positive semi-definite. This confirms the efficiency of OLS in the class of linear unbiased estimators.
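The variance formula follows in a few lines from the expression \hat{\beta} - \beta = (X^\top X)^{-1} X^\top \epsilon obtained in the unbiasedness argument. A sketch of the derivation, writing E[\epsilon \epsilon^\top | X] = \sigma^2 I for the combined content of A4 and A5:

```latex
\begin{aligned}
\operatorname{Var}(\hat{\beta} \mid X)
  &= E\left[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^\top \mid X\right] \\
  &= (X^\top X)^{-1} X^\top \, E[\epsilon \epsilon^\top \mid X] \, X (X^\top X)^{-1} \\
  &= (X^\top X)^{-1} X^\top (\sigma^2 I) X (X^\top X)^{-1} \\
  &= \sigma^2 (X^\top X)^{-1}
\end{aligned}
```

The second line is the general "sandwich" form. It only collapses to the final line when the errors are homoskedastic and uncorrelated.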

Frisch-Waugh-Lovell Theorem

The Frisch-Waugh-Lovell (FWL) theorem is essential for understanding “control variables” in regression. Suppose we partition our regressors into two groups, X_1 and X_2. The model is Y = X_1\beta_1 + X_2\beta_2 + \epsilon. The FWL theorem states that we can obtain the exact same \hat{\beta}_2 as in the full model by following these steps:

1. Regress Y on X_1 and save the residuals, e_{Y|X1}.
2. Regress each column of X_2 on X_1 and save those residuals, e_{X2|X1}.
3. Regress e_{Y|X1} on e_{X2|X1}.

This result shows that \hat{\beta}_2 represents the relationship between Y and X_2 after purging both variables of any linear relationship they have with X_1. It formalizes the idea of “holding other factors constant” and provides a bridge between simple and multiple regression.
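The three FWL steps can be verified numerically. The sketch below uses simulated data in which X_2 is deliberately correlated with X_1; the helper names ols and residuals and all numeric values are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300
X1 = np.column_stack([np.ones(n), rng.normal(size=n)])       # constant + one control
X2 = (0.5 * X1[:, 1] + rng.normal(size=n)).reshape(-1, 1)    # correlated with the control
Y = X1 @ np.array([1.0, 2.0]) + 3.0 * X2[:, 0] + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients from regressing y on X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def residuals(X, y):
    """Residuals from regressing y (a vector or a matrix of columns) on X."""
    return y - X @ ols(X, y)

# Full regression: the coefficient on X2 is the last element
beta2_full = ols(np.column_stack([X1, X2]), Y)[-1:]

# FWL steps 1-3
e_y = residuals(X1, Y)       # step 1: purge Y of X1
e_x2 = residuals(X1, X2)     # step 2: purge X2 of X1
beta2_fwl = ols(e_x2, e_y)   # step 3: regress residuals on residuals

print(beta2_full, beta2_fwl)
print(np.allclose(beta2_full, beta2_fwl))
```

No intercept is needed in step 3 because X_1 already contains the constant, so both sets of residuals have mean zero.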

Interpretation of coefficients

Proper interpretation of coefficients depends on the functional form and the units of the variables. In a multiple regression, every coefficient is interpreted as the effect of a one-unit change in that regressor, holding all other included regressors fixed (the ceteris paribus condition).

Level-Level: Y = \beta_0 + \beta_1 X + \epsilon. A one-unit increase in X is associated with a \beta_1 unit change in Y.
Log-Level: \ln Y = \beta_0 + \beta_1 X + \epsilon. A one-unit increase in X is associated with approximately a 100 \cdot \beta_1 percent change in Y; the approximation is accurate when \beta_1 is small. This form is common for outcomes such as wages or prices.
Log-Log: \ln Y = \beta_0 + \beta_1 \ln X + \epsilon. A one-percent increase in X is associated with a \beta_1 percent change in Y. In economics, \beta_1 is an elasticity, such as the price elasticity of demand.
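For the log-level form, the exact percentage change implied by a change of \Delta X is 100 \cdot (e^{\beta_1 \Delta X} - 1), which the rule of thumb 100 \cdot \beta_1 approximates. A short sketch comparing the two (the function name exact_pct_change is a hypothetical helper, not part of any library):

```python
import math

def exact_pct_change(beta1, dx=1.0):
    """Exact percent change in Y for a change dx in X in a log-level model."""
    return 100.0 * (math.exp(beta1 * dx) - 1.0)

for b in (0.01, 0.05, 0.20, 0.50):
    approx = 100.0 * b
    exact = exact_pct_change(b)
    print(f"beta1={b:.2f}  approx={approx:.2f}%  exact={exact:.2f}%")
```

The gap between the two columns is negligible for small coefficients and grows quickly once \beta_1 exceeds roughly 0.1 in absolute value.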

Dummy variables and interactions

Categorical variables are incorporated using dummy (indicator) variables. To avoid the “dummy variable trap” (perfect multicollinearity), one category must be omitted. The coefficients on the remaining dummies represent the expected difference in Y relative to the omitted base category.

Interaction terms allow the effect of one variable to vary depending on the level of another. For a model Y = \beta_0 + \beta_1 X + \beta_2 D + \beta_3 (X \cdot D) + \epsilon, where D is a dummy:

- If D=0, the effect of X is \beta_1.
- If D=1, the effect of X is \beta_1 + \beta_3.

The coefficient \beta_3 is the “difference-in-slopes.” Interactions can also be formed between two continuous variables. In a model Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 (X_1 \cdot X_2) + \epsilon, the effect of X_1 is \beta_1 + \beta_3 X_2, meaning the marginal effect of X_1 depends linearly on X_2.
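One way to see what the dummy interaction does is to compare the pooled, fully interacted regression with separate regressions on each group: the group-specific slopes coincide with \beta_1 and \beta_1 + \beta_3. A minimal sketch on simulated data, with coefficient values and names chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400
x = rng.normal(size=n)
d = rng.integers(0, 2, size=n)   # dummy taking values 0 or 1
y = 5 + 2 * x + 3 * d + 1.5 * x * d + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients from regressing y on X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Pooled model with main effects and the interaction
b0, b1, b2, b3 = ols(np.column_stack([np.ones(n), x, d, x * d]), y)

# Separate simple regressions within each group
g0, g1 = d == 0, d == 1
a0, s0 = ols(np.column_stack([np.ones(g0.sum()), x[g0]]), y[g0])
a1, s1 = ols(np.column_stack([np.ones(g1.sum()), x[g1]]), y[g1])

print(np.isclose(s0, b1))        # slope when D = 0
print(np.isclose(s1, b1 + b3))   # slope when D = 1
print(np.isclose(a1, b0 + b2))   # intercept when D = 1
```

The point estimates match exactly; the standard errors generally differ, because the pooled model imposes a common error variance across groups.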

Joint hypothesis testing

To test whether a group of q coefficients are all zero, we use an F-test. We compare the sum of squared residuals from the unrestricted model (SSR_u) to that of the restricted model in which those coefficients are set to zero (SSR_r):

F = \frac{(SSR_r - SSR_u)/q}{SSR_u / (n-k)}

Here k is the number of coefficients in the unrestricted model. Under the null hypothesis and the classical assumptions with normally distributed errors, this statistic follows an F(q, n-k) distribution; without normality the result holds only approximately in large samples. The test is used to assess the overall significance of the model (where q = k-1 and the restricted model only has an intercept) or the significance of a set of related variables, such as a group of regional dummies.
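The F statistic can be computed directly from the two sums of squared residuals. The sketch below tests two regressors that are irrelevant by construction; the data-generating values and the helper name ssr are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 250
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 0.8 * x1 + rng.normal(size=n)   # x2 and x3 do not enter the true model

def ssr(X, y):
    """Sum of squared residuals from regressing y on X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return float(resid @ resid)

X_u = np.column_stack([np.ones(n), x1, x2, x3])   # unrestricted model
X_r = np.column_stack([np.ones(n), x1])           # restricted: x2 and x3 dropped
k = X_u.shape[1]
q = k - X_r.shape[1]                              # number of restrictions

ssr_u, ssr_r = ssr(X_u, y), ssr(X_r, y)
F = ((ssr_r - ssr_u) / q) / (ssr_u / (n - k))
print(q, n - k, F)
```

To turn F into a p-value, one option, if SciPy is available, is `scipy.stats.f.sf(F, q, n - k)`, which gives the upper-tail probability of the F(q, n-k) distribution.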

Model selection criteria

While R^2 measures the proportion of variance explained, it is a poor tool for model selection because it never decreases as more variables are added.

Adjusted R^2: 1 - (1 - R^2) \frac{n-1}{n-k}. It penalizes the addition of unnecessary regressors.
Akaike Information Criterion (AIC): 2k - 2\ln(L), where L is the maximized value of the likelihood and lower values are preferred. AIC targets predictive accuracy: it is derived as an estimate of the relative Kullback-Leibler divergence between the fitted model and the data-generating process. Its penalty of 2 per parameter is smaller than that of BIC whenever \ln(n) > 2, that is, for n \geq 8.
Bayesian Information Criterion (BIC): k \ln(n) - 2\ln(L). BIC is derived from a Bayesian perspective, and its penalty per parameter grows with n. It is consistent for finding the “true” model if the true model is among those being compared.
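For a linear model with Gaussian errors, the maximized log-likelihood has the closed form \ln(L) = -\frac{n}{2}(\ln(2\pi) + \ln(SSR/n) + 1), so all three criteria can be computed from the residuals. The sketch below assumes that Gaussian likelihood and counts only the k regression coefficients as parameters; some software also counts the error variance, which shifts AIC and BIC by a constant across models. The function name fit_criteria is hypothetical.

```python
import numpy as np

def fit_criteria(X, y):
    """R^2, adjusted R^2, AIC and BIC for an OLS fit (Gaussian likelihood assumed)."""
    n, k = X.shape
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    ssr = float(resid @ resid)
    sst = float(((y - y.mean()) ** 2).sum())
    r2 = 1 - ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
    loglik = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    return {
        "r2": r2,
        "adj_r2": adj_r2,
        "aic": 2 * k - 2 * loglik,
        "bic": k * np.log(n) - 2 * loglik,
    }

rng = np.random.default_rng(6)
n = 200
x = rng.normal(size=n)
noise = rng.normal(size=n)            # regressor unrelated to y
y = 1 + 2 * x + rng.normal(size=n)

small = fit_criteria(np.column_stack([np.ones(n), x]), y)
large = fit_criteria(np.column_stack([np.ones(n), x, noise]), y)
print(small)
print(large)
```

Adding the unrelated regressor cannot lower R^2, while adjusted R^2, AIC, and BIC each charge for the extra parameter, with BIC charging the most at this sample size.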

Worked example: Simulation and Estimation

We simulate a model with an interaction: Y = 5 + 2X + 3D + 1.5(X \cdot D) + \epsilon.

Excel

Excel’s Data Analysis Toolpak provides a user-friendly interface, while LINEST allows for dynamic calculations.

A1:A200: [X]
B1:B200: [D] (0 or 1)
C1:C200: =A1*B1, filled down (Interaction term)
D1:D200: [Y]
E1: =LINEST(D1:D200, A1:C200, TRUE, TRUE)

The first row of the array returned by LINEST contains the coefficients in reverse column order (\beta_3, \beta_2, \beta_1, \beta_0).

Stata

Stata’s factor-variable notation (i. and c.) makes interactions effortless and reduces the risk of manual coding errors.

clear
set obs 200
set seed 123
gen x = rnormal()
gen d = (runiform() > 0.5)
gen y = 5 + 2*x + 3*d + 1.5*x*d + rnormal()

* Interaction using ## to include main effects and interaction
regress y c.x##i.d

* Linear combination after regression: slope of x when d = 1 (beta_1 + beta_3)
lincom x + 1.d#c.x

R

In R, the * operator in a formula automatically includes both main effects and the interaction, which is the standard practice.

set.seed(123)
n <- 200
x <- rnorm(n)
d <- rbinom(n, 1, 0.5)
y <- 5 + 2*x + 3*d + 1.5*x*d + rnorm(n)

# factor(d) ensures d is treated as categorical
model <- lm(y ~ x * factor(d))
summary(model)

# Extract tidy results
library(broom)
results <- tidy(model)
print(results)

Julia

Julia’s GLM.jl supports the same formula syntax for interactions and provides high-performance estimation.

using GLM, DataFrames, Random

Random.seed!(123)
n = 200
df = DataFrame(x = randn(n), d = rand(0:1, n))
df.y = 5.0 .+ 2.0.*df.x .+ 3.0.*df.d .+ 1.5.*df.x.*df.d .+ randn(n)

# Use * for interactions in the formula
model = lm(@formula(y ~ x * d), df)
println(model)

# Access coefficients and standard errors
b = coef(model)
se = stderror(model)

Common traps

Omitted Variable Bias (OVB) is a pervasive issue in non-experimental data. If a relevant variable that influences Y is correlated with an included regressor X and is left out of the model, the OLS estimate of X’s coefficient will be biased. For example, in a regression of earnings on education, “ability” is a likely omitted variable. If ability raises earnings and is positively correlated with education, omitting it leads to an overestimate of the returns to education.
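The size of the bias follows the standard omitted-variable formula: the coefficient in the short regression equals the coefficient in the long regression plus the coefficient on the omitted variable times the slope from regressing the omitted variable on the included one. The simulation below is a sketch with made-up numbers; the variables educ, ability, and earn are hypothetical and stand in for the earnings example.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5000
ability = rng.normal(size=n)
educ = 0.6 * ability + rng.normal(size=n)                 # education correlated with ability
earn = 1.0 + 0.5 * educ + 0.4 * ability + rng.normal(size=n)

def ols(X, y):
    """Least-squares coefficients from regressing y on X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(n)
b_long = ols(np.column_stack([const, educ, ability]), earn)   # ability included
b_short = ols(np.column_stack([const, educ]), earn)           # ability omitted
delta = ols(np.column_stack([const, educ]), ability)[1]       # slope of ability on educ

implied_short = b_long[1] + b_long[2] * delta
print(b_long[1], b_short[1], implied_short)
```

The short-regression coefficient on educ exceeds the long-regression coefficient, and the decomposition holds exactly in the sample, not just on average.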

Perfect Multicollinearity is often caused by the “dummy variable trap.” This happens when you include a dummy for every possible category plus a constant. Because the sum of all dummies equals 1 (the constant), one variable is perfectly predictable from the others.
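The trap shows up as a rank deficiency in the design matrix. A minimal sketch with three hypothetical categories:

```python
import numpy as np

category = np.arange(12) % 3          # three categories, four observations each
dummies = np.eye(3)[category]         # one indicator column per category
const = np.ones((12, 1))

X_trap = np.hstack([const, dummies])          # constant + all three dummies
X_ok = np.hstack([const, dummies[:, 1:]])     # category 0 omitted as the base

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print(X_trap.shape[1], rank_trap)   # 4 columns, rank 3: X'X is singular
print(X_ok.shape[1], rank_ok)       # 3 columns, rank 3: full column rank
```

With the full set of dummies the columns sum to the constant, so assumption A2 fails and X^\top X cannot be inverted. Dropping one category restores full column rank.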

Including “bad controls” or post-treatment variables can introduce selection bias. For instance, if you are studying the effect of a training program on employment, controlling for “job satisfaction” is a bad idea: satisfaction is observed only for people who are employed, and employment is itself an outcome of the program, so conditioning on it compares groups that the program has already selected differently.

Finally, confusing correlation with causation remains the most significant conceptual error. Regression identifies associations. Claiming that X causes Y requires a design that addresses endogeneity, such as an experiment, natural experiment, or a robust instrumental variables strategy.

Reporting checklist

Include a clear table of coefficients with standard errors.
Report the R-squared and Adjusted R-squared to show model fit.
State the total number of observations (n).
Clearly describe the omitted base category for all sets of dummy variables.
Report the F-statistic and its p-value for the overall model significance.
Specify the type of standard errors used (e.g., standard OLS, robust, or clustered).
Explicitly list all control variables included in the specification.
