Chapter 3. OLS, panel fixed effects, and clustering

Why this is the foundation

If you map out every regression you will run over the next decade of blended-finance work, roughly 90% will be one of two things. An OLS regression with a few controls. Or that same regression with fixed effects and clustered standard errors. The fancier methods (DiD, RDD, IV, synthetic control) all sit on top of this. None of them rescue you if the underlying regression is wrong, and most of them collapse to OLS or OLS-with-FE in their identification step anyway.

So you need to be able to do two things without thinking about them: write down the model cleanly, and compute the right standard errors for the data structure you have. The model is the easy part. Most published mistakes are about the second part. Most desk rejections at applied journals (AEJ:Applied, JDE, World Development) trace back to clustering at the wrong level, clustering at too few clusters without acknowledging it, or absorbing the variation the researcher claims to identify.

This chapter walks through OLS, then heteroskedasticity-robust standard errors, then cluster-robust standard errors, then fixed effects (within transformation, FWL, and high-dimensional absorption), then the trap list, then a worked Portuguese municipality-year example you can copy.

OLS review in two minutes

Write the linear model in matrix form:

y = X\beta + \varepsilon

where y is n \times 1, X is n \times k (including a constant), \beta is k \times 1, and \varepsilon is n \times 1. The four classical assumptions:

Linearity in parameters. The model is linear in \beta, not necessarily in X (you can include \log X, X^2, interactions).
Exogeneity. E[\varepsilon \mid X] = 0. The regressors carry no information about the error in expectation. This is the assumption that fails when you have omitted variables, simultaneity, or measurement error.
No perfect multicollinearity. X has full column rank, so X'X is invertible.
Homoskedasticity (classical only). \mathrm{Var}(\varepsilon \mid X) = \sigma^2 I_n. We do not assume this in modern applied work. We compute robust SEs instead.

The OLS estimator solves \min_{\beta} (y - X\beta)'(y - X\beta) and yields

\hat\beta = (X'X)^{-1} X'y.

Under assumptions 1–3 plus E[\varepsilon \mid X] = 0, \hat\beta is unbiased: E[\hat\beta \mid X] = \beta. Under all four classical assumptions plus normality of \varepsilon, the Gauss-Markov theorem says \hat\beta is the best linear unbiased estimator (BLUE), where “best” means lowest variance in the class of linear unbiased estimators.

When do the assumptions fail?

Exogeneity fails when there are omitted confounders, reverse causality, or selection. This is what most of causal inference is built to solve. Fixed effects help partially (they absorb time-invariant confounders). DiD, IV, and RDD are the other tools.
Homoskedasticity fails almost always in real data. Variance scales with covariates, with cluster size, with treatment status. The fix is robust standard errors, not a transformation.
No multicollinearity fails when you include dummies that are linear combinations of each other (the dummy variable trap) or when you absorb fixed effects that span all your variation.

The bias-variance tradeoff is the reason we cluster, robustify, and trim our covariates rather than ignore problems. Each of these decisions is a tradeoff, not a free upgrade.

Heteroskedasticity-robust standard errors

The classical OLS variance estimator is

\widehat{\mathrm{Var}}(\hat\beta) = \hat\sigma^2 (X'X)^{-1}, \qquad \hat\sigma^2 = \frac{\hat\varepsilon'\hat\varepsilon}{n - k}.

This assumes homoskedasticity. When variance is heteroskedastic, the right object is the sandwich estimator of White (1980):

\widehat{\mathrm{Var}}_{\mathrm{HC}}(\hat\beta) = (X'X)^{-1} \left( \sum_{i=1}^n \hat\varepsilon_i^2 x_i x_i' \right) (X'X)^{-1}.

The bread is (X'X)^{-1}, the meat is the middle term. The middle is just OLS with each squared residual weighting its own row. That is why it is called sandwich.

White’s original form is HC0. It is biased downward in finite samples because it does not adjust for degrees of freedom or for the influence of high-leverage observations. Several corrections exist:

HC1. Multiply HC0 by n / (n - k). This is Stata’s robust default.
HC2. Replace each \hat\varepsilon_i^2 with \hat\varepsilon_i^2 / (1 - h_{ii}), where h_{ii} is the leverage of observation i.
HC3. Replace each \hat\varepsilon_i^2 with \hat\varepsilon_i^2 / (1 - h_{ii})^2. MacKinnon and White (1985) recommend this for small samples, and it is the safest default when n is modest (say, n < 250).

In practice:

Stata. regress y x, robust gives HC1. regress y x, vce(hc3) gives HC3.
R fixest. feols(y ~ x, data = df, vcov = "hetero") gives HC1 by default. Pass vcov = list("hetero", "HC3") for HC3.
R base. coeftest(lm(y ~ x), vcov = vcovHC(model, type = "HC3")) via the sandwich and lmtest packages.

A practical rule. If you have clustering, the HC choice barely matters because the cluster correction dominates. If you have no clustering and modest n, use HC3.

Cluster-robust standard errors

In most blended-finance data, observations are not independent. Households in the same village share unobserved shocks: a local drought, a corrupt mayor, the same loan officer. Students in the same school share a teacher. Firms in the same industry share a regulatory environment. Repeated observations of the same unit over time share that unit’s unobserved type.

When residuals are correlated within groups, OLS standard errors are wrong. They are almost always too small, sometimes by a factor of three or five. Bertrand, Duflo, and Mullainathan (2004) show that ignoring within-state-over-time correlation in DiD designs led to false rejection rates above 40% at a nominal 5% level. That one paper changed how the field thinks about inference.

The fix is the cluster-robust variance estimator of Liang and Zeger (1986):

\widehat{\mathrm{Var}}_{\mathrm{CR}}(\hat\beta) = (X'X)^{-1} \left( \sum_{g=1}^G X_g' \hat\varepsilon_g \hat\varepsilon_g' X_g \right) (X'X)^{-1}

where g indexes clusters and X_g, \hat\varepsilon_g are the rows for cluster g. The middle term sums outer products of cluster-level residual vectors instead of squared scalar residuals. Same sandwich, bigger middle.

Where do you cluster?

This is where most applied researchers go wrong. The clean modern answer is in Abadie, Athey, Imbens, and Wooldridge (originally circulated 2017, published in QJE 2023): “When Should You Adjust Standard Errors for Clustering?”

Their two-part rule:

Cluster at the level of treatment assignment. If the treatment is randomly assigned at the village level, cluster at the village. If it varies by state-year, cluster at the state-year (or two-way by state and year).
Cluster at the level of sampling. If you sampled villages and then surveyed everyone within them, the sampling design induces within-village dependence. Cluster at the village.

When the two levels differ, cluster at the higher of the two, meaning the one that contains the other. If you randomly assigned treatment at the village level but sampled at the household level, cluster at the village.

Software:

Stata. regress y x, vce(cluster id) for one-way. reghdfe y x, absorb(fe) vce(cluster id year) for two-way.
R fixest. feols(y ~ x, data = df, cluster = "id") for one-way. cluster = c("id", "year") for two-way.

Few clusters and the wild cluster bootstrap

Cluster-robust SEs rely on G \to \infty asymptotics. When G is small (under 40 is the conventional warning line, under 20 is the panic line), the asymptotic approximation fails. Your t-stats look fine but your true rejection rate is much higher than 5%.

The fix is the wild cluster bootstrap of Cameron, Gelbach, and Miller (2008). It resamples not the observations but the residuals within clusters, multiplied by a Rademacher draw (+1 or -1) at the cluster level. The result is a bootstrap distribution that respects the cluster structure even when G is tiny.

Stata. boottest treated, cluster(municipality) after a regression.
R. fwildclusterboot::boottest(model, param = "treated", clustid = "municipality", B = 9999).

Report both the asymptotic cluster-robust p-value and the wild bootstrap p-value when G < 40. Referees increasingly require it.

Fixed effects intuition

A unit fixed effect is a dummy variable for each unit. Mechanically, it absorbs every time-invariant characteristic of that unit, observed or not. If the unit is a municipality, the unit FE soaks up climate, geography, founding institutions, ethnic composition (if stable), and long-run trade routes. You no longer need to measure these things because they cannot vary across the observations of one unit.

A time fixed effect is a dummy for each time period. It absorbs every shock common to all units in that period: a national policy change, an exchange-rate shift, a global commodity-price spike. You no longer need a regressor for “the year of the global financial crisis” because the year 2008 dummy soaks it up.

A two-way fixed effects specification has both:

y_{it} = \alpha_i + \gamma_t + \beta x_{it} + \varepsilon_{it}.

Identification of \beta now comes from variation in x_{it} that is not explained by the unit’s mean (\alpha_i) or by the time period’s mean (\gamma_t). Put another way, you are exploiting deviations of x_{it} from its unit average and its time average, simultaneously.

This is powerful and limited. Powerful because the bias from any omitted variable that lives at the unit level or the time level is gone. Limited because if your treatment x_{it} varies only at the unit level (some units treated, some not, throughout) then \alpha_i absorbs the treatment entirely and you cannot identify \beta. The lesson: include FEs at every level below the level at which your treatment actually varies, and not at the level itself.

The within transformation and Frisch-Waugh-Lovell

Mechanically, the unit FE estimator is equivalent to demeaning y and x within each unit and then running OLS on the demeaned variables:

\tilde y_{it} = y_{it} - \bar y_i, \qquad \tilde x_{it} = x_{it} - \bar x_i.

Then \hat\beta_{\mathrm{FE}} = (\tilde X' \tilde X)^{-1} \tilde X' \tilde y.

This is a special case of the Frisch-Waugh-Lovell theorem: in a multiple regression y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon, the coefficient \hat\beta_1 equals the coefficient from regressing M_{X_2} y on M_{X_2} X_1, where M_{X_2} = I - X_2(X_2'X_2)^{-1}X_2' is the residual-maker. Demeaning is exactly residualizing on unit dummies. FWL says doing this in one step or two gives the same answer.

The practical implication: you never literally have to invert a matrix with thousands of dummy columns. You demean. That is what every fixed-effects estimator does internally.

High-dimensional FE absorption

For one-way demeaning the within transformation is closed-form. For two-way (unit and time), you can still do it analytically when the panel is balanced, but it gets ugly. For three or more high-dimensional FEs (say firm \times year, worker \times firm, sector \times region \times year), there is no closed form. You need iterative algorithms.

The two main contributions:

Guimarães and Portugal (2010, Stata Journal). An iterative procedure for absorbing two high-dimensional FEs, originally written for the linked employer-employee data common in matched-firm-worker datasets.
Correia (2015). Extended this to arbitrarily many FEs with much better numerical performance. This is the engine behind Stata’s reghdfe and R’s fixest.

The reason this matters: do not, ever, write xi: regress y x i.firm i.year in Stata for hundreds of firms. Stata will try to create a dummy for every firm, store the resulting design matrix in memory, and either crash or take an hour. Use reghdfe y x, absorb(firm year). In R, never write lm(y ~ x + factor(firm) + factor(year)) for the same reason. Use feols(y ~ x | firm + year, data = df).

fixest is the fastest open-source implementation on the planet right now. On a panel of ten million observations with two high-dimensional FEs, it runs in under a second. Use it.

Common traps

Clustering at too low a level. A village has thirty households; you cluster at the household. You now have thousands of clusters but each cluster is too small to learn anything. Worse, you have failed to account for the village-level shocks. Always cluster at or above the level of treatment assignment.
Clustering at the wrong level. Treatment is assigned at the district level but you cluster at the household. SEs are too small, rejection rates inflated. Move up to the district.
Including FEs that absorb the variation you want. You include a firm FE and your treatment is whether the firm received a loan in 2015 (firm-level, time-invariant). The firm FE absorbs the treatment. You will see a dropped coefficient or a nonsensical estimate. The fix is to drop the firm FE, or to change the treatment definition so it varies within firm over time.
Using xtreg, fe for more than one FE. Stata’s xtreg, fe does only the unit demeaning. If you write xtreg y x i.year, fe it will technically include year dummies, but the standard errors will be wrong because the degrees-of-freedom adjustment does not account for the year FEs the way reghdfe does. For anything beyond one FE, use reghdfe.
Two-way clustering when treatment varies in one dimension. Two-way clustering is appropriate when shocks correlate across both rows and columns. If your treatment varies only at the unit level, one-way clustering at the unit is enough. Two-way clustering throws away information.
SEs that are tiny because n is huge but G is small. You have 50,000 observations and 8 clusters. Your t-stats are 12, your p-values are zero. They are wrong. The effective sample size for inference is closer to G than to n. Wild cluster bootstrap, or move up.
Forgetting to check for multicollinearity after absorbing FEs. When you absorb many FEs, the remaining variation can be tiny. Run reghdfe with verbose(2) or check feols summary output for “the variable was dropped”. If your treatment has been absorbed and you did not notice, your coefficient is meaningless.

Worked example: a Portuguese municipality-year panel

Setup. 308 municipalities, ten years (2014–2023), so n = 3{,}080. Outcome is log_consumption, the log of average household consumption from a microsimulation merged with municipal accounts. Treatment is treated, a dummy for whether the municipality received a rural credit program (rolled out in a quasi-random way across municipalities and years). Controls are log_assets and log_hh_size.

The right specification absorbs municipality and year fixed effects, and clusters at the municipality level, which is also the level of treatment assignment.

Stata

* Install once
ssc install reghdfe
ssc install ftools
ssc install boottest

* Load
use portugal_panel.dta, clear

* Two-way FE, cluster at municipality
reghdfe log_consumption treated log_hh_size log_assets, ///
    absorb(municipality year) ///
    vce(cluster municipality)

* Wild cluster bootstrap on the treatment coefficient
boottest treated, cluster(municipality) nograph

* For comparison, with two-way clustering
reghdfe log_consumption treated log_hh_size log_assets, ///
    absorb(municipality year) ///
    vce(cluster municipality year)

R

library(fixest)
library(fwildclusterboot)
library(modelsummary)

df <- readRDS("portugal_panel.rds")

# Two-way FE, cluster at municipality
m1 <- feols(
  log_consumption ~ treated + log_hh_size + log_assets |
    municipality + year,
  cluster = "municipality",
  data = df
)
summary(m1)

# Wild cluster bootstrap
set.seed(42)
boot_treated <- boottest(
  m1,
  param = "treated",
  clustid = "municipality",
  B = 9999
)
boot_treated

# Two-way clustering for comparison
m2 <- feols(
  log_consumption ~ treated + log_hh_size + log_assets |
    municipality + year,
  cluster = c("municipality", "year"),
  data = df
)

# Nicely formatted table
modelsummary(
  list("One-way" = m1, "Two-way" = m2),
  stars = TRUE,
  gof_omit = "AIC|BIC|RMSE"
)

The point estimates are identical across Stata and R, down to numerical noise. The cluster-robust SEs are identical to four decimal places. If they differ by more than that, something is wrong, usually a different degrees-of-freedom correction. Stata uses (G / (G-1)) \cdot ((n-1)/(n-k)) by default. fixest matches this when you set ssc = ssc(adj = TRUE, cluster.adj = TRUE), which is the default.

Reading the output

You will see something like:

                  Estimate  Std. Error  t value  Pr(>|t|)
treated           0.0421      0.0173    2.43      0.016 *
log_hh_size       0.5210      0.0381    13.67     <2e-16 ***
log_assets        0.2147      0.0224    9.58      <2e-16 ***

Number of observations: 3080
Fixed-effects: municipality (308), year (10)
Standard errors: Clustered (municipality)
Within R²: 0.187

Several things to read off:

Coefficient on treated is 0.042, so consumption rises about 4.2% (since the outcome is in logs) in treated municipality-years relative to control. The t-stat is 2.43; the cluster-robust p-value is 0.016. The wild bootstrap p-value will likely be similar because we have 308 clusters, well above the few-cluster warning line.
Within R^2 of 0.187 says the regressors, after partialing out the FEs, explain about 19% of the residual variation in y. The overall R^2 would include the variation explained by the FEs themselves, and is usually much higher (often above 0.9 in panel models). Within R^2 is the number to report, because it tells the reader how much the regressors of interest do beyond the FE skeleton.
F-stat, when reported, is the joint test that all the coefficients (excluding the FEs) are zero. In a single-treatment design, this is similar to the t-stat squared on the treatment.
Standard error of 0.0173 on treated is the cluster-robust SE. If you re-ran with vcov = "hetero" instead of cluster = "municipality", you would get a smaller SE (perhaps 0.011) and a larger t-stat. The smaller SE would be wrong, because it ignores within-municipality correlation over time.

When to use Driscoll-Kraay (1998)

The Liang-Zeger cluster-robust estimator handles arbitrary correlation within a cluster. It assumes independence between clusters. For most panel data this is fine: you cluster by municipality, and you assume municipalities are independent of each other.

Suppose instead that you have wide panels (large T) and you worry about cross-sectional dependence. A shock that hits municipality A in 2018 also hits municipality B in 2018 because they are both in the same region with the same governor. Then within-municipality clustering does not capture this cross-municipality, same-year correlation.

Driscoll and Kraay (1998) propose a heteroskedasticity-and-autocorrelation-consistent (HAC) estimator that is also robust to cross-sectional dependence. It is appropriate when T is large (at least 20, often more), and when you have reason to believe in cross-sectional dependence.

Stata. xtscc y x, fe lag(.) runs a Driscoll-Kraay SE after a fixed-effects regression. The lag option sets the bandwidth.
R. plm::vcovSCC after a plm model.

For panels with T < 20, do not use Driscoll-Kraay. The lag-truncation bandwidth is unstable. Stick with cluster-robust or two-way clustering. For the Portugal panel with T = 10, cluster-robust at the municipality level is the right call.

Two-way clustering

When shocks correlate across both rows and columns of your panel (within unit over time, AND within time across units), Cameron, Gelbach, and Miller (2011) extended cluster-robust SEs to a two-way version. The variance estimator is

\widehat{\mathrm{Var}}_{\mathrm{2W}} = V_{\mathrm{cluster\ by\ }g} + V_{\mathrm{cluster\ by\ }h} - V_{\mathrm{cluster\ by\ }g \cap h}

In words: the two one-way clustered variances added, minus the variance from the intersection (which would otherwise be double-counted).

When to use it:

Treatment varies at one level (say, state-year), and there are shocks correlated within state over time and within year across states.
Common in studies using state-year variation, country-industry-year variation, school-cohort variation.

In Stata, reghdfe ..., vce(cluster id year). In R fixest, cluster = c("id", "year"). The default behavior in both is the CGM (2011) estimator with proper degrees-of-freedom corrections.

A caution. Two-way clustering needs both dimensions to have enough clusters. If you have 50 states and 5 years, two-way clustering with only 5 year clusters will misbehave. Use Bell-McCaffrey or wild bootstrap in that case, or accept one-way at the state level and acknowledge the residual within-year correlation as a limitation in your write-up.

Reporting checklist

When you write up a regression, include:

Sample size and unit of observation (n, what each row represents).
Number of clusters (G), separately for each dimension if two-way.
Which SE type (HC1, cluster-robust, two-way cluster, Driscoll-Kraay, wild bootstrap).
Why you clustered where you did (treatment assignment level, sampling level, both).
Wild bootstrap p-values if G < 40 for any clustering dimension.
Within-R^2 for FE models (not overall R^2, which is misleading).
Which fixed effects you absorb, including the number of FE categories for each.
A note on multicollinearity if any observations or variables were dropped during absorption.

A clean footnote example: “Standard errors are clustered at the municipality level (308 clusters), which corresponds to the level of treatment assignment in the credit program rollout. Wild cluster bootstrap p-values (Cameron-Gelbach-Miller 2008, 9,999 replications) are reported in brackets for the main coefficient and yield the same inference.”

Why this matters for an applied researcher

Every paper in applied development and rural finance lives or dies on whether the standard errors are right. The most common desk-reject sentence at AEJ:Applied, JDE, and World Development is some variation of: the standard errors are not clustered at the appropriate level for the design described in the data section. A desk rejection takes two weeks of your life and resets the publication clock. It is the most avoidable mistake in applied econometrics, and it is still the most common one.

The fix is not glamorous. Read your panel description carefully. Decide what level treatment is assigned at, what level sampling is at, and choose the higher one. Compute cluster-robust or two-way clustered SEs. Check with a wild bootstrap if the cluster count is below forty. Write a footnote that explains the choice. Five sentences in your draft. Two lines of code. A career’s worth of papers that do not get rejected for inference errors.

This chapter is the foundation. The DiD, RDD, IV, and synthetic control chapters all rely on you having internalized it. When you read someone else’s paper, the first thing to check is the SE specification. When you write your own, get this right before you do anything else.

References

Foundational SE theory

White, H. (1980). “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity.” Econometrica 48(4): 817–838.
MacKinnon, J. G., and H. White (1985). “Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties.” Journal of Econometrics 29(3): 305–325.
Liang, K.-Y., and S. L. Zeger (1986). “Longitudinal Data Analysis Using Generalized Linear Models.” Biometrika 73(1): 13–22.

Clustering and inference

Bertrand, M., E. Duflo, and S. Mullainathan (2004). “How Much Should We Trust Differences-in-Differences Estimates?” Quarterly Journal of Economics 119(1): 249–275.
Cameron, A. C., J. B. Gelbach, and D. L. Miller (2008). “Bootstrap-Based Improvements for Inference with Clustered Errors.” Review of Economics and Statistics 90(3): 414–427.
Cameron, A. C., J. B. Gelbach, and D. L. Miller (2011). “Robust Inference with Multiway Clustering.” Journal of Business and Economic Statistics 29(2): 238–249.
Abadie, A., S. Athey, G. W. Imbens, and J. M. Wooldridge (2023). “When Should You Adjust Standard Errors for Clustering?” Quarterly Journal of Economics 138(1): 1–35. (Earlier NBER version, 2017.)

Panel and HDFE

Driscoll, J. C., and A. C. Kraay (1998). “Consistent Covariance Matrix Estimation with Spatially Dependent Panel Data.” Review of Economics and Statistics 80(4): 549–560.
Guimarães, P., and P. Portugal (2010). “A Simple Feasible Procedure to Fit Models with High-Dimensional Fixed Effects.” Stata Journal 10(4): 628–649.
Correia, S. (2015). “Singletons, Cluster-Robust Standard Errors and Fixed Effects: A Bad Mix.” Technical Note, Duke University. (The technical foundation of reghdfe.)

Applied work to read

Banerjee, A. V., E. Duflo, R. Glennerster, and C. Kinnan (2015). “The Miracle of Microfinance? Evidence from a Randomized Evaluation.” AEJ:Applied 7(1): 22–53. (Reference for SE reporting in rural credit RCTs.)
Burgess, R., M. Hansen, B. A. Olken, P. Potapov, and S. Sieber (2012). “The Political Economy of Deforestation in the Tropics.” Quarterly Journal of Economics 127(4): 1707–1754. (Two-way clustered SEs across district and year in a developing-country panel.)