Chapter 4. Difference-in-Differences

Inference Lab · Applied causal inference for the spatial social sciences · Dr. Ian Helfrich

Interactive: parallel-trends visualizer

The identifying assumption of DiD is that, absent treatment, the treated and control groups would have evolved on parallel paths. The visualizer below simulates two groups under a common linear trend, with an optional slope differential on the treated group and a true ATT. Move the sliders and compare what a naive TWFE estimate would pick up with what is actually going on underneath.

viewof base_slope = Inputs.range([-0.05, 0.05], {value: 0.02, step: 0.005, label: "Common trend slope (per period)"})
viewof treat_extra = Inputs.range([-0.03, 0.03], {value: 0.0, step: 0.001, label: "Extra slope on treated (pre-period violation)"})
viewof tx_effect = Inputs.range([-0.5, 1.0], {value: 0.4, step: 0.05, label: "True post-treatment ATT"})
viewof n_periods = Inputs.range([6, 20], {value: 10, step: 1, label: "Number of periods"})
viewof noise_sd = Inputs.range([0, 0.2], {value: 0.05, step: 0.01, label: "Noise SD"})

sim_did = {
  const data = []
  const treat_at = Math.floor(n_periods / 2)
  for (let t = 0; t < n_periods; t++) {
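    // Uniform noise rescaled by sqrt(3) so its standard deviation equals noise_sd
    const noise_c = (Math.random() - 0.5) * 2 * Math.sqrt(3) * noise_sd
    const noise_t = (Math.random() - 0.5) * 2 * Math.sqrt(3) * noise_sd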
    const treated_now = t >= treat_at
    const c_val = base_slope * t + noise_c
    const t_val = base_slope * t + treat_extra * t + (treated_now ? tx_effect : 0) + noise_t
    data.push({t, group: "Control", y: c_val})
    data.push({t, group: "Treated", y: t_val})
  }
  return data
}

Plot.plot({
  width: 720, height: 360, marginLeft: 60,
  x: {label: "Period →", grid: true},
  y: {label: "↑ Outcome", grid: true},
  color: {legend: true, domain: ["Control", "Treated"], range: ["#0d3b66", "#b85c38"]},
  marks: [
    Plot.ruleX([Math.floor(n_periods / 2) - 0.5], {stroke: "#888", strokeDasharray: "4 4"}),
    Plot.line(sim_did, {x: "t", y: "y", stroke: "group", strokeWidth: 2}),
    Plot.dot(sim_did, {x: "t", y: "y", stroke: "group", fill: "white", r: 3})
  ]
})

{
  const violation = Math.abs(treat_extra) > 0.005
  return html`<div style="background: ${violation ? '#fff4ec' : '#eef6ee'}; border-left: 4px solid ${violation ? '#b85c38' : '#4f7942'}; padding: 0.7rem 1rem; margin: 1rem 0; border-radius: 4px;">
    <strong>${violation ? 'Parallel trends violated.' : 'Parallel trends look reasonable.'}</strong>
    Pre-period slope differential: ${treat_extra.toFixed(3)} per period.
    ${violation
      ? 'A naive TWFE estimate will conflate the pre-trend slope difference with the true ATT. Reach for Honest pre-trends (Rambachan-Roth 2023) or argue substantively that the pre-trend would have stopped at treatment.'
      : 'TWFE / Callaway-Sant’Anna will recover the ATT cleanly. Include this kind of plot in your robustness section.'}
  </div>`
}
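The banner flags a violation but does not show its size. A minimal Python sketch, assuming the noise-free version of the same data-generating process as the simulator (the function name `naive_did` is hypothetical; the arguments mirror the sliders), computes what a two-group before/after DiD would report.

```python
def naive_did(base_slope, treat_extra, tx_effect, n_periods):
    """2x2 DiD on the noise-free version of the simulated paths."""
    treat_at = n_periods // 2
    pre = range(0, treat_at)
    post = range(treat_at, n_periods)

    def control(t):
        return base_slope * t

    def treated(t):
        bump = tx_effect if t >= treat_at else 0.0
        return base_slope * t + treat_extra * t + bump

    def mean(path, periods):
        return sum(path(t) for t in periods) / len(periods)

    d_treated = mean(treated, post) - mean(treated, pre)
    d_control = mean(control, post) - mean(control, pre)
    return d_treated - d_control


clean = naive_did(0.02, 0.00, 0.4, 10)    # parallel trends hold
biased = naive_did(0.02, 0.01, 0.4, 10)   # treated group drifts up by 0.01 per period
print(f"no violation:     {clean:.3f}")
print(f"extra slope 0.01: {biased:.3f}")
```

In this setup the bias equals the extra slope times the gap between the average post period and the average pre period (five periods when there are ten in total), so a longer window makes the same slope differential more damaging.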

4.1 Why DiD is the workhorse of program evaluation in blended finance

You will rarely get to randomize. A development finance institution rolls out a credit-guarantee facility across districts in three waves. A government partner activates a rural electrification program in selected municipalities first, then expands. A philanthropic anchor seeds a women-owned-business loan window in two states and adds two more eighteen months later. The eligibility rule was not a coin flip. It was politics, capacity, and timing.

What you do have is variation in when units get treated, and outcomes measured before and after. Difference-in-Differences (DiD) is the tool that turns that variation into a causal estimate. The logic is almost embarrassingly simple. Compare the change in outcomes (not the levels) for units that got the program against the change for units that did not. The first difference (post minus pre) sweeps out anything time-invariant about each unit: institutional quality, baseline poverty, soil type, ethnic composition, distance to ports. The second difference (treated minus control) sweeps out anything that affected everyone in that period: a commodity-price shock, a national policy, a drought, a pandemic.

The classic citations are worth knowing by name. Card and Krueger (1994) on the New Jersey minimum-wage increase, comparing fast-food employment changes in NJ to eastern Pennsylvania. Bertrand, Duflo, and Mullainathan (2004) on the standard-error problem when serial correlation is ignored. Duflo (2001) on the Indonesian school construction program (INPRES), where differences in program intensity across districts and in exposure across birth cohorts were used to estimate the returns to schooling. That last one is the template for almost every paper you will read in rural development. Master the logic and you can read most of the empirical blended-finance literature with real comprehension.

The promise of DiD is also its danger. Because the recipe sounds so clean (subtract, subtract, done), people apply it mechanically to settings where the underlying assumptions are violated. The canonical regression specification (two-way fixed effects, or TWFE) has been shown since 2018 to be biased, when rollout is staggered and effects are heterogeneous, in ways that can flip the sign of the answer. That is the centerpiece of this chapter. Get the intuition for why. Learn one of the modern estimators well enough to run it without supervision.

4.2 The 2x2 case in one paragraph, and the parallel-trends picture

Start with the cleanest version. Two periods (pre and post), two groups (treated T and control C), one outcome Y. The DiD estimator is

\hat\delta_{DD} \;=\; \big(\bar Y^T_{\text{post}} - \bar Y^T_{\text{pre}}\big) \;-\; \big(\bar Y^C_{\text{post}} - \bar Y^C_{\text{pre}}\big).

That is it. The point estimate is a difference of differences. Geometrically, picture two lines on a chart with time on the x-axis and the outcome on the y-axis. Draw the treated group’s line. Draw the control group’s line. They start at different levels, which is fine; that is what unit fixed effects are for. The control line tilts upward between pre and post by some amount, call it \Delta^C. The treated line tilts upward by \Delta^T. The DiD estimate is \Delta^T - \Delta^C.

The identifying assumption (parallel trends) is the claim that, in the absence of treatment, the treated group’s outcome would have moved by exactly \Delta^C. The control’s observed trend is the counterfactual for what the treated group would have done. We never observe that counterfactual, so we cannot test the assumption directly. We can only check whether the two groups moved in parallel before treatment started, and reason about whether the same dynamics would have continued absent the program.

Two consequences worth internalizing now. First, parallel trends is an assumption about trends, not levels. The treated and control groups can sit at wildly different baseline levels. Second, the assumption is scale-dependent. If parallel trends holds for Y, it generally does not hold for \log Y, and the other way around. Pick the scale you want to defend before you run anything.
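The scale-dependence point is easy to see with numbers. The sketch below uses made-up group means (illustrative only, not from any study) in which both groups grow by 30 percent: the DiD in levels is large, while the DiD in logs is zero.

```python
import math

# Hypothetical group means, chosen so that both groups grow by 30 percent
means = {
    ("treated", "pre"): 200.0, ("treated", "post"): 260.0,
    ("control", "pre"): 100.0, ("control", "post"): 130.0,
}


def did(m, transform=lambda y: y):
    """Difference-in-differences of four group means on a chosen scale."""
    d_treated = transform(m[("treated", "post")]) - transform(m[("treated", "pre")])
    d_control = transform(m[("control", "post")]) - transform(m[("control", "pre")])
    return d_treated - d_control


did_levels = did(means)           # (260 - 200) - (130 - 100)
did_logs = did(means, math.log)   # log(1.3) - log(1.3)
print(f"DiD in levels: {did_levels:.1f}")
print(f"DiD in logs:   {did_logs:.4f}")
```

Neither answer is wrong arithmetically. They correspond to different parallel-trends assumptions, which is why the scale has to be chosen and defended before estimation.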

4.3 The regression representation

In a panel of units i observed in periods t, with a binary treatment indicator D_{it} equal to 1 once unit i has been treated, the textbook DiD regression is

Y_{it} \;=\; \alpha_i \;+\; \lambda_t \;+\; \delta \, D_{it} \;+\; \varepsilon_{it},

where \alpha_i are unit fixed effects (one per municipality, district, firm) and \lambda_t are time fixed effects (one per period). The coefficient \delta is the estimand of interest. With two periods and two groups, OLS on this specification reproduces the 2x2 DiD exactly.
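To check the equivalence claim by hand, the following sketch computes the TWFE coefficient on a balanced panel by double-demeaning (subtract unit means and period means, add back the grand mean) and compares it with the 2x2 difference of mean changes. The four units and their outcomes are hypothetical, and the helper name `twfe_coefficient` is mine.

```python
def twfe_coefficient(panel):
    """panel maps (unit, period) -> (y, d) and must be balanced."""
    units = sorted({i for i, _ in panel})
    periods = sorted({t for _, t in panel})

    def double_demean(k):
        # x_it - unit mean - period mean + grand mean, for tuple component k
        unit_mean = {i: sum(panel[(i, t)][k] for t in periods) / len(periods) for i in units}
        time_mean = {t: sum(panel[(i, t)][k] for i in units) / len(units) for t in periods}
        grand = sum(unit_mean.values()) / len(units)
        return {(i, t): panel[(i, t)][k] - unit_mean[i] - time_mean[t] + grand
                for i in units for t in periods}

    y_dd, d_dd = double_demean(0), double_demean(1)
    numerator = sum(d_dd[key] * y_dd[key] for key in panel)
    denominator = sum(v * v for v in d_dd.values())
    return numerator / denominator


# Hypothetical outcomes: two treated units (T1, T2) and two controls (C1, C2)
panel = {
    ("T1", 0): (5.0, 0), ("T1", 1): (8.0, 1),
    ("T2", 0): (6.0, 0), ("T2", 1): (8.5, 1),
    ("C1", 0): (2.0, 0), ("C1", 1): (3.0, 0),
    ("C2", 0): (3.0, 0), ("C2", 1): (4.5, 0),
}

treated_change = ((8.0 - 5.0) + (8.5 - 6.0)) / 2
control_change = ((3.0 - 2.0) + (4.5 - 3.0)) / 2
did_2x2 = treated_change - control_change
delta_twfe = twfe_coefficient(panel)
print(did_2x2, delta_twfe)
```

The double-demeaning shortcut relies on the panel being balanced; with missing cells you would need a proper fixed-effects routine such as `feols` or `reghdfe`.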

This is the two-way fixed effects (TWFE) specification. It is the workhorse because it generalizes the 2x2 case in obvious ways. You can have many units, many periods, and (the thing that matters) units that get treated at different points in time (staggered adoption). For decades, applied researchers treated \hat\delta_{\text{TWFE}} as if it were “the” average treatment effect on the treated (ATT), and reported it without much further thought.

That is where the wheels came off.

4.4 The TWFE crisis

A wave of papers that began circulating in 2018 (Goodman-Bacon 2021; de Chaisemartin and d’Haultfœuille 2020; Sun and Abraham 2021; Callaway and Sant’Anna 2021; Borusyak, Jaravel, and Spiess 2024) showed that when treatment timing is staggered and treatment effects are heterogeneous (across cohorts, across time-since-treatment, or both), the TWFE estimator does not deliver a sensibly weighted average of unit-level treatment effects. It can deliver something with the wrong sign.

The reason is a forbidden comparison. Inside the TWFE machine, \hat\delta_{\text{TWFE}} is a weighted average of many 2x2 DiD comparisons. Some of those comparisons are exactly what you want: a newly treated cohort against a not-yet-treated cohort, or against a never-treated control. But some are perverse: an already-treated cohort is used as the “control” for a newly-treated cohort. The change in the already-treated cohort over the relevant window includes the dynamic part of its treatment effect (effects that grow or fade over time). When you subtract that off, you are subtracting treatment from treatment. The 2x2 comparison itself still enters the average with a positive weight, but the treatment effects of the already-treated cohort enter it with a negative sign, so some unit-period effects end up with negative weight in \hat\delta_{\text{TWFE}}.

Goodman-Bacon (2021) decomposes the TWFE estimator into all of these underlying 2x2 building blocks. The decomposition is one of the most useful diagnostic tools of the last decade. You can run it on your data and see, in concrete terms, how much weight is being put on forbidden comparisons.

The intuition for why this matters in development is direct. In any realistic rural-finance rollout, treatment effects grow over time. Credit access builds slowly, investments mature, network effects compound. They also differ across cohorts. The first municipalities chosen are often the most prepared. Later cohorts include the harder cases. Under those conditions, TWFE is biased. The sign can flip when the dynamics are sharp enough, which is exactly when you most want a reliable answer. In Goodman-Bacon’s running example, the effect of state-level unilateral (no-fault) divorce laws on female suicide, the comparisons that use already-treated states as controls produce estimates of the opposite sign from the clean comparisons.

The fix is to refuse the forbidden comparisons. The modern estimators do that explicitly, each in its own way.
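A small noise-free simulation makes the forbidden comparison visible. The sketch assumes two cohorts (treated in periods 2 and 4 of a six-period panel), no never-treated units, and effects that grow with time since treatment; all of these choices are illustrative. TWFE is computed by double-demeaning, which is valid here because the panel is balanced.

```python
def twfe(y, d):
    """TWFE coefficient on d for a balanced panel given as unit-by-period lists."""
    n, T = len(y), len(y[0])

    def double_demean(x):
        row = [sum(r) / T for r in x]
        col = [sum(x[i][t] for i in range(n)) / n for t in range(T)]
        grand = sum(row) / n
        return [[x[i][t] - row[i] - col[t] + grand for t in range(T)] for i in range(n)]

    y_dd, d_dd = double_demean(y), double_demean(d)
    num = sum(d_dd[i][t] * y_dd[i][t] for i in range(n) for t in range(T))
    den = sum(d_dd[i][t] ** 2 for i in range(n) for t in range(T))
    return num / den


def simulate(effect, cohorts=(2, 4), T=6):
    """Return (TWFE estimate, true ATT) for one unit per cohort, no noise."""
    y, d, effects = [], [], []
    for i, g in enumerate(cohorts):
        y_i, d_i = [], []
        for t in range(T):
            treated = t >= g
            tau = effect(t - g) if treated else 0.0
            y_i.append(1.0 * i + 0.5 * t + tau)   # unit effect + common trend + effect
            d_i.append(1.0 if treated else 0.0)
            if treated:
                effects.append(tau)
        y.append(y_i)
        d.append(d_i)
    return twfe(y, d), sum(effects) / len(effects)


twfe_lin, att_lin = simulate(lambda e: e + 1.0)            # effects 1, 2, 3, ...
twfe_quad, att_quad = simulate(lambda e: (e + 1.0) ** 2)   # effects 1, 4, 9, ...
print(f"linear growth:    TWFE = {twfe_lin:.3f}, true ATT = {att_lin:.3f}")
print(f"quadratic growth: TWFE = {twfe_quad:.3f}, true ATT = {att_quad:.3f}")
```

Every individual effect in both scenarios is at least 1. With linear growth TWFE lands below the smallest of them, and with sharper growth it turns negative, purely because the early cohort's growing effect is differenced out as if it were a control trend.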

4.5 The modern toolkit, pick one

Four estimators dominate. They differ in what comparison groups they use and how they aggregate, but they agree on the core principle: never let an already-treated unit play the role of a control.

Callaway and Sant’Anna (2021). Estimate a separate ATT(g,t) for each combination of treatment cohort g (the period in which a unit was first treated) and calendar time t \ge g. Each ATT(g,t) uses either never-treated or not-yet-treated units as the comparison group. You then aggregate ATT(g,t) into whatever summary you want: an overall ATT, an event-study path by length of exposure, or heterogeneity by cohort. R: the did package. Stata: csdid. Learn this one first.

Sun and Abraham (2021). A correction to the standard event-study regression. Replace the usual lead-lag dummies with cohort-specific interactions, then weight by cohort share to recover a clean event-study path. R: fixest::sunab(). Stata: eventstudyinteract. Easiest drop-in replacement if you are already running event studies in TWFE-land.

de Chaisemartin and d’Haultfœuille (2020) and 2024 extensions. A family of estimators (the DID_M estimator and its dynamic extensions) that handles non-absorbing treatment, treatment intensity, and dynamic effects. R: DIDmultiplegt. Stata: did_multiplegt_dyn. Use this when treatment turns on and off, or when intensity varies.

Borusyak, Jaravel, and Spiess (2024). An imputation estimator. Estimate unit and time fixed effects on untreated observations only, predict the counterfactual for treated observations, average the gaps. Often the most efficient when parallel trends actually holds. R: didimputation. Stata: did_imputation.

Estimator	Use when	R	Stata
Callaway-Sant’Anna	Default for staggered absorbing treatment	`did`	`csdid`
Sun-Abraham	Want a quick event-study correction	`fixest::sunab()`	`eventstudyinteract`
de Chaisemartin-d’Haultfœuille	Non-absorbing or variable-intensity treatment	`DIDmultiplegt`	`did_multiplegt_dyn`
Borusyak-Jaravel-Spiess	High efficiency under parallel trends	`didimputation`	`did_imputation`

Pick one. Learn it well. Report it alongside TWFE so reviewers see that you know about the bias, and so the magnitude of the correction in your specific case is visible.
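The building block behind the Callaway-Sant’Anna approach can be sketched in a few lines. This is a stripped-down illustration, not the did package: one unit per cohort, no noise, no covariates, never-treated controls only, and the period just before treatment as the base. The function name `att_gt_2x2` and the data-generating process are assumptions made for the example.

```python
def outcome(unit_effect, g, t, effect):
    """Unit effect + common trend + treatment effect (if treated by period t)."""
    tau = effect(t - g) if g is not None and t >= g else 0.0
    return unit_effect + 0.5 * t + tau


def att_gt_2x2(y, g, t, control="never"):
    """Clean 2x2 comparison: cohort g versus never-treated, base period g - 1."""
    base = g - 1
    return (y[g][t] - y[g][base]) - (y[control][t] - y[control][base])


def growing_effect(e):
    return e + 1.0   # effect of 1 on impact, rising by 1 each period


T = 6
y = {
    2: [outcome(0.0, 2, t, growing_effect) for t in range(T)],
    4: [outcome(1.0, 4, t, growing_effect) for t in range(T)],
    "never": [outcome(2.0, None, t, growing_effect) for t in range(T)],
}

estimates = {(g, t): att_gt_2x2(y, g, t) for g in (2, 4) for t in range(g, T)}
overall = sum(estimates.values()) / len(estimates)   # equal cohort sizes here
for (g, t), value in sorted(estimates.items()):
    print(f"ATT(g={g}, t={t}) = {value:.1f}")
print(f"simple average = {overall:.3f}")
```

Because no already-treated unit ever serves as a control, each cell recovers the true effect for that cohort and period, and the simple average matches the true ATT of 13/6. The real estimator adds covariate adjustment, cohort-size weights, and inference on top of this comparison.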

4.6 The event study plot

If you take away one habit from this chapter, take this one. Always produce an event-study plot before you produce a single number.

The event study regresses the outcome on a set of dummies for time relative to treatment (event time e = t - g_i, where g_i is the period unit i was first treated):

Y_{it} \;=\; \alpha_i \;+\; \lambda_t \;+\; \sum_{e \ne -1} \beta_e \, \mathbf{1}\{t - g_i = e\} \;+\; \varepsilon_{it}.

The coefficients \beta_e for e < 0 are leads, placebo periods before treatment. They should sit near zero. That is the visual evidence for parallel trends. The coefficients for e \ge 0 are lags, the dynamic ATT path. Normalize on e = -1, the period immediately before treatment, so every \beta_e is interpreted as a difference from that anchor.
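The indicator set in the event-study equation can be built mechanically. The sketch below is one way to do it, assuming a window of -5 to 5 with the end points binned (a common convention, not something the equation itself requires) and never-treated units coded as `None`. The function name is hypothetical.

```python
def event_dummies(year, cohort, min_e=-5, max_e=5, ref=-1):
    """Return {event_time: 0 or 1}, omitting the reference period.

    cohort is the first treatment year, or None for never-treated units.
    Event times outside [min_e, max_e] are binned into the end points.
    """
    columns = [e for e in range(min_e, max_e + 1) if e != ref]
    if cohort is None:
        return {e: 0 for e in columns}
    e_it = max(min_e, min(max_e, year - cohort))
    return {e: int(e == e_it) for e in columns}


print(event_dummies(2021, 2018))   # e = 3 switched on
print(event_dummies(2019, 2020))   # e = -1 is the omitted anchor, so all zeros
print(event_dummies(2014, 2022))   # e = -8 binned into the -5 end point
```

Rows at the anchor and rows for never-treated units both come out as all zeros, which is what makes every \beta_e a difference relative to e = -1.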

When estimated with TWFE, this event study is contaminated by the same forbidden comparisons that contaminate the scalar TWFE coefficient. Use Sun-Abraham (fixest::sunab()) or Callaway-Sant’Anna’s event-time aggregation. The plot you show in your paper should come from one of those, not from raw TWFE.

Visually: pre-period coefficients clustered around zero with confidence intervals overlapping zero, then a step up (or down) at e = 0, then a dynamic path. If the pre-period coefficients trend, you have a parallel-trends problem and you must address it head-on. See 4.7.

4.7 Validation

Three layers of validation, in order of importance.

Visual pre-trends. Look at the event-study leads. Are they flat? Do they trend up or down before treatment? A formal Wald test for joint zero pre-period coefficients exists, but it is often severely underpowered in typical sample sizes. Roth (2022) shows that conditioning on passing a pre-trends test can make the bias of your DiD estimate worse, not better, because you are selecting on a noisy diagnostic. Use the visual inspection as your primary check, and supplement with formal tools.

Honest pre-trends. Rambachan and Roth (2023) give you a way to report bounds on the DiD estimate that explicitly allow for some pre-trend continuation into the post period. You assume the post-treatment violation of parallel trends is bounded by some constant times the worst pre-treatment violation, and you back out the implied bounds on the ATT. R package: HonestDiD. It is rapidly becoming standard in development economics and impact evaluation. If your pre-trends are not perfectly flat (they almost never are), this is how you defend your estimates when a referee pushes back.

Placebo and alternative controls. Pick a placebo outcome, something the program should not affect, and run the same DiD. Pick a placebo treatment date, a fake rollout a few years before the real one, and rerun. Re-estimate the ATT using never-treated only as your control, then again using not-yet-treated only. If the answer moves a lot across these robustness cuts, your identification is fragile.

Synthetic control (the topic of Chapter 5) is a natural complement when you have a small number of treated units and a long pre-period. DiD and synthetic control answer slightly different questions, but they should agree in spirit. If they do not, find out why before you publish.
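The relative-magnitudes idea behind the honest pre-trends bounds can be illustrated with simple arithmetic. The sketch below is a deliberate simplification: it treats the estimated pre-period coefficients as the true violations, ignores sampling error, and bounds only the first post-period effect. HonestDiD handles the general case with proper inference. The coefficient values are made up.

```python
def relative_magnitude_bounds(pre_coefs, beta_0, mbar):
    """Bounds on the e = 0 effect under a relative-magnitudes restriction.

    pre_coefs: lead coefficients ordered by event time, ending at e = -2
               (e = -1 is normalized to zero and appended here).
    mbar:      how large the post-treatment violation may be, as a multiple
               of the largest period-to-period change seen before treatment.
    """
    path = list(pre_coefs) + [0.0]
    max_pre_step = max(abs(b - a) for a, b in zip(path, path[1:]))
    slack = mbar * max_pre_step
    return beta_0 - slack, beta_0 + slack


# Hypothetical leads for e = -4, -3, -2 and a hypothetical impact estimate
pre = [0.004, -0.002, 0.006]
beta_0 = 0.030

for mbar in (0.0, 1.0, 2.0, 4.0):
    lo, hi = relative_magnitude_bounds(pre, beta_0, mbar)
    print(f"Mbar = {mbar:.1f}: [{lo:.3f}, {hi:.3f}]")
```

The useful summary is the breakdown value: the smallest Mbar at which the bounds first include zero. Here the largest pre-period step is 0.008, so the sign of the impact effect survives until Mbar reaches 0.030 / 0.008 = 3.75.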

4.8 Common traps

A short list of things that get DiD papers desk-rejected.

Running TWFE on staggered rollouts without checking for negative weights. Always run the Goodman-Bacon decomposition (bacondecomp in R and Stata) when you have staggered timing. If the share of weight on forbidden comparisons is non-trivial, switch to a modern estimator. Report both. Show the reader you know.
Wrong clustering. When treatment is assigned at the district level and you have multiple observations per district, you must cluster standard errors at the district level. Bertrand, Duflo, and Mullainathan (2004) showed how serial correlation within units leads to dramatically understated standard errors when it is ignored. Abadie, Athey, Imbens, and Wooldridge (2017, published 2023) refined the theory: cluster at the level of treatment assignment, not at the level of the outcome. If districts were assigned to treatment, cluster at the district. Period. Two-way clustering (district and year) is rarely necessary if you have unit fixed effects; one-way at the assignment level usually suffices.
Treating D_{it} as a single dummy when effects are dynamic. If the program’s effect grows over four years and you compress it to a single post-treatment indicator, you are reporting an average that hides the entire story. Always run the event study. Always show the dynamic path.
Forgetting to anchor the event study. If you do not drop one event-time dummy (typically e = -1), the coefficients are not identified, and your software will either drop one for you silently at a different anchor or return collinear nonsense. Always anchor at e = -1.
Population weights ignored. When your unit is a municipality and your outcome is a per-capita variable, an unweighted regression gives equal weight to a town of 800 and a city of 800,000. If your estimand is a population-weighted ATT, which it usually should be in policy work, weight accordingly. Solon, Haider, and Wooldridge (2015) is the right reference for the weighting question.
Confusing absorbing with reversible treatment. Standard DiD assumes treatment, once received, stays on. If your program can be revoked (or municipalities can drop out of eligibility), you are in non-absorbing territory and you need de Chaisemartin-d’Haultfœuille’s estimators.
Compositional changes over time. If the panel is unbalanced because units enter or exit, the fixed effects do less work than you think. Check whether the composition of i’s observed in each period is stable. If it is not, balance the panel or use estimators that handle unbalanced data explicitly.
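The population-weights trap is worth seeing in numbers. The sketch below uses the town of 800 and the city of 800,000 from the list above, with made-up effects for each, and compares the equal-weight average with the population-weighted one.

```python
# Hypothetical municipality-level effects on a per-capita outcome
effects = {"town": 0.10, "city": 0.02}
population = {"town": 800, "city": 800_000}

unweighted = sum(effects.values()) / len(effects)
weighted = (sum(effects[m] * population[m] for m in effects)
            / sum(population.values()))

print(f"equal weight per municipality: {unweighted:.4f}")
print(f"population-weighted:           {weighted:.4f}")
```

The two numbers answer different questions: the first is the effect for the average municipality, the second is the effect for the average person. Decide which one your estimand is before choosing weights.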

4.9 Worked example: a hypothetical rural microcredit rollout in Brazil

Setup. A panel of 500 municipalities in Brazil observed annually from 2014 through 2023. A blended-finance facility, co-funded by a DFI and the national development bank, opens a rural microcredit window in waves. 100 municipalities are treated starting in 2018, another 150 in 2020, another 120 in 2022, and the remaining 130 are never treated during the sample window. The outcome is \log per-capita household consumption, measured annually by a household survey aggregated to the municipal level. Cohort assignment was based on a combination of capacity scores and political negotiation, so it is not random, but conditional on municipality fixed effects we are willing to assume parallel trends in the absence of treatment. Population weights are available.
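If you want a file to practice on, a panel with this structure can be simulated. The sketch below reproduces only the cohort sizes, years, and column names from the setup; the baseline level, population distribution, trend, noise, and the growing treatment effect are all invented for illustration, and `treated` is assumed to mean ever treated.

```python
import csv
import random

COHORT_SIZES = {2018: 100, 2020: 150, 2022: 120, 0: 130}   # 0 = never treated
YEARS = range(2014, 2024)


def simulate_panel(seed=1):
    """Return one dict per municipality-year with the columns used in the pipeline."""
    rng = random.Random(seed)
    rows, mun_id = [], 0
    for cohort, size in COHORT_SIZES.items():
        for _ in range(size):
            mun_id += 1
            unit_effect = rng.gauss(6.0, 0.3)                   # assumed baseline
            pop = max(1, int(rng.lognormvariate(10.0, 1.0)))    # assumed population
            for year in YEARS:
                exposed = cohort > 0 and year >= cohort
                # Assumed effect: 0.01 on impact, growing by 0.01 per year
                effect = 0.01 * (year - cohort + 1) if exposed else 0.0
                y = unit_effect + 0.015 * (year - 2014) + effect + rng.gauss(0.0, 0.05)
                rows.append({"mun_id": mun_id, "year": year,
                             "log_cons_pc": round(y, 4), "pop": pop,
                             "treated": int(cohort > 0), "cohort": cohort})
    return rows


def write_csv(rows, path="br_microcredit_panel.csv"):
    """Write the simulated panel to disk."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


panel = simulate_panel()
print(len(panel), "rows,", len({r["mun_id"] for r in panel}), "municipalities")
# write_csv(panel)   # uncomment to create the file
```

Because the simulated effect grows with exposure and the panel is balanced, this file should let you see the TWFE and modern-estimator answers diverge.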

The pipeline below walks through five steps: a TWFE regression as a baseline; the Goodman-Bacon decomposition as a diagnostic; a Sun-Abraham event study; Callaway-Sant’Anna ATT(g,t) and aggregated ATT; and an honest pre-trends robustness check.

4.9.1 R pipeline

# Required packages
library(fixest)        # TWFE, Sun-Abraham via sunab()
library(did)           # Callaway-Sant'Anna
library(bacondecomp)   # Goodman-Bacon decomposition
library(HonestDiD)     # Rambachan-Roth bounds
library(ggplot2)
library(data.table)

# --- 1. Load and inspect ---
dt <- fread("br_microcredit_panel.csv")
# Columns: mun_id, year, log_cons_pc, pop, treated, cohort (first treatment year, 0 if never)
setorder(dt, mun_id, year)

# Construct D_it (1 if year >= cohort and cohort > 0)
dt[, D := as.integer(cohort > 0 & year >= cohort)]
# Event time (NA for never-treated)
dt[, event := ifelse(cohort > 0, year - cohort, NA_integer_)]

# --- 2. TWFE baseline ---
m_twfe <- feols(log_cons_pc ~ D | mun_id + year,
                data = dt, weights = ~pop, cluster = ~mun_id)
summary(m_twfe)

# --- 3. Goodman-Bacon decomposition ---
# Requires a balanced panel and a numeric D
bgd <- bacon(log_cons_pc ~ D, data = dt,
             id_var = "mun_id", time_var = "year")
# Inspect weights on "Earlier vs Later Treated" (forbidden comparisons)
print(bgd)
# Rough rule of thumb, not a formal threshold: if the weight on forbidden
# comparisons exceeds about 5%, treat the TWFE estimate as suspect.

# --- 4. Sun-Abraham event study ---
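# Code never-treated units with a cohort value outside the sample window so that
# sunab() uses them as the reference cohort; 10000 follows the fixest examples.
dt[, cohort_sa := ifelse(cohort == 0, 10000, cohort)]
# Anchor at e = -1 (ref.p = -1 is also the sunab default)
m_sa <- feols(log_cons_pc ~ sunab(cohort_sa, year, ref.p = -1) | mun_id + year,
              data = dt, weights = ~pop, cluster = ~mun_id)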
# Plot
iplot(m_sa, main = "Event study (Sun-Abraham)",
      xlab = "Years since treatment", ref.line = -0.5)
abline(h = 0, lty = 2)

# --- 5. Callaway-Sant'Anna ---
# Make never-treated have cohort = 0 (this is the convention for the did package)
dt[, gname := ifelse(cohort == 0, 0, cohort)]
cs <- att_gt(yname = "log_cons_pc",
             tname = "year",
             idname = "mun_id",
             gname = "gname",
             control_group = "notyettreated",   # or "nevertreated"
             weightsname = "pop",
             clustervars = "mun_id",
             data = dt)
summary(cs)

# Aggregate to overall ATT
agg_overall <- aggte(cs, type = "simple")
summary(agg_overall)

# Aggregate to event-study path (dynamic ATT)
agg_dyn <- aggte(cs, type = "dynamic", min_e = -5, max_e = 5)
summary(agg_dyn)
ggdid(agg_dyn) +
  labs(title = "Dynamic ATT (Callaway-Sant'Anna)",
       x = "Years since treatment",
       y = "Effect on log per-capita consumption") +
  theme_minimal()

# Aggregate to cohort heterogeneity
agg_group <- aggte(cs, type = "group")
summary(agg_group)

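# --- 6. Honest pre-trends (Rambachan-Roth) ---
# HonestDiD needs all event-time coefficients measured against one reference
# period, so re-estimate with a universal base period (normalized at e = -1).
cs_u <- att_gt(yname = "log_cons_pc", tname = "year", idname = "mun_id",
               gname = "gname", control_group = "notyettreated",
               weightsname = "pop", clustervars = "mun_id",
               base_period = "universal", data = dt)
es <- aggte(cs_u, type = "dynamic", min_e = -5, max_e = 5)
# Covariance from the influence function (aggte objects carry no V_analytical)
inf_fn <- es$inf.function$dynamic.inf.func.e
V <- t(inf_fn) %*% inf_fn / nrow(inf_fn)^2
keep <- es$egt != -1                      # drop the normalized reference period
betahat <- es$att.egt[keep]; sigma <- V[keep, keep]
n_pre <- sum(es$egt[keep] < 0); n_post <- sum(es$egt[keep] >= 0)
# Bound the post-treatment violation by Mbar times the worst pre-treatment violation
sens <- createSensitivityResults_relativeMagnitudes(
  betahat = betahat, sigma = sigma,
  numPrePeriods = n_pre, numPostPeriods = n_post,
  Mbarvec = seq(0, 2, by = 0.5))
orig <- constructOriginalCS(betahat = betahat, sigma = sigma,
                            numPrePeriods = n_pre, numPostPeriods = n_post)
createSensitivityPlot_relativeMagnitudes(sens, orig)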

A note on reading the output. The att_gt summary prints one row per (g, t) pair: the ATT for cohort g at calendar year t, with t ranging over the observed sample. The aggte(..., type = "dynamic") collapse averages these along event time, weighting by cohort size. The type = "group" collapse gives you one ATT per cohort, which is useful for heterogeneity narratives in a development paper. For example, the 2018 cohort (recipients with higher baseline capacity) might show a 6.2% lift, while the 2022 cohort (the harder cases) shows 2.1%. The type = "simple" collapse is the overall ATT.

Interpretation. Suppose the overall aggregated ATT is \hat\delta = 0.041 (SE 0.012), interpreted as a 4.1 log-point (roughly 4.2 percent) increase in per-capita consumption attributable to the microcredit window, averaged over treated municipality-years. Compare it to a TWFE estimate of \hat\delta_{\text{TWFE}} = 0.018 (SE 0.009). If parallel trends holds, the gap of 0.023 is the TWFE bias in this hypothetical. The Goodman-Bacon decomposition will tell you how much of TWFE’s underestimate came from forbidden comparisons, that is, already-treated cohorts whose growing effects acted as the wrong control for newly-treated cohorts.
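Converting the hypothetical estimates from log points to percent changes is a one-line calculation. The sketch below applies 100 × (exp(δ) − 1) to both point estimates and to a normal-approximation 95 percent interval for the aggregated ATT; the numbers are the illustrative ones from the paragraph above.

```python
import math


def log_points_to_percent(delta):
    """Exact percent change implied by a coefficient on a log outcome."""
    return 100.0 * (math.exp(delta) - 1.0)


att, se = 0.041, 0.012   # hypothetical aggregated ATT and its standard error
twfe = 0.018             # hypothetical TWFE estimate

cs_pct = log_points_to_percent(att)
twfe_pct = log_points_to_percent(twfe)
ci_pct = (log_points_to_percent(att - 1.96 * se),
          log_points_to_percent(att + 1.96 * se))

print(f"Aggregated ATT: {cs_pct:.2f}% (95% CI {ci_pct[0]:.2f}% to {ci_pct[1]:.2f}%)")
print(f"TWFE:           {twfe_pct:.2f}%")
```

For coefficients this small the log-point and percent readings are close, but the gap widens quickly as effects grow, so report the exact conversion whenever a policy audience will read the number as a percentage.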

4.9.2 Stata pipeline

* Load and prep
import delimited using "br_microcredit_panel.csv", clear
xtset mun_id year

gen byte D = (cohort > 0 & year >= cohort)
gen int event = year - cohort if cohort > 0

*** 1. TWFE baseline (reghdfe with population weights, clustered SEs) ***
ssc install reghdfe
ssc install ftools
reghdfe log_cons_pc D [aw=pop], absorb(mun_id year) vce(cluster mun_id)

*** 2. Goodman-Bacon decomposition ***
ssc install bacondecomp
bacondecomp log_cons_pc D, ddetail

*** 3. Callaway-Sant'Anna ***
ssc install csdid
ssc install drdid

* csdid requires gvar = first treatment period, 0 for never treated
gen gvar = cond(cohort == 0, 0, cohort)
* notyet adds not-yet-treated units to the control group, matching the R call
csdid log_cons_pc [iw=pop], ivar(mun_id) time(year) gvar(gvar) ///
      method(dripw) notyet agg(event) wboot rseed(42)
* Overall ATT
csdid_estat simple
* Event-study aggregation
csdid_estat event, window(-5 5)
* Cohort heterogeneity
csdid_estat group

*** 4. Sun-Abraham event study (eventstudyinteract) ***
ssc install eventstudyinteract
ssc install avar
gen never_treated = (cohort == 0)
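* Event-time dummies, omitting e = -1. Event time runs from -8 to 5 in this panel,
* so bin leads at -5 and stop lags at 5 (a lag6 dummy would be all zeros).
gen lead5 = (event <= -5)
forvalues k = 4(-1)2 { gen lead`k' = (event == -`k') }
forvalues k = 0/5 { gen lag`k' = (event == `k') }
eventstudyinteract log_cons_pc lead* lag* [aw=pop], ///
      cohort(cohort) control_cohort(never_treated) ///
      absorb(mun_id year) vce(cluster mun_id)
* Plot the interaction-weighted estimates stored in e(b_iw) and e(V_iw)
ssc install coefplot
matrix C = e(b_iw)
mata st_matrix("A", sqrt(diagonal(st_matrix("e(V_iw)"))))
matrix C = C \ A'
coefplot matrix(C[1]), se(C[2]) vertical yline(0) xline(4.5, lpattern(dash)) ///
      xlabel(, angle(45)) title("Event study (Sun-Abraham)")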

*** 5. Event-study plot via eventdd ***
ssc install eventdd
eventdd log_cons_pc [aw=pop], timevar(event) method(hdfe, absorb(mun_id year)) ///
      cluster(mun_id) graph_op(title("Event study (TWFE, for comparison only)") ///
      ytitle("Effect on log per-capita consumption"))

The Stata pipeline mirrors the R one: TWFE baseline, Goodman-Bacon diagnostic, Callaway-Sant’Anna as the headline estimator, Sun-Abraham as a robustness check, and a TWFE event-study plot via eventdd only to show the contrast. Your published plot should come from csdid_estat event or eventstudyinteract, not from eventdd on raw TWFE.

4.9.3 Reading the ATT(g,t) heatmap

The Callaway-Sant’Anna output is a matrix of ATT(g,t) for each combination of cohort g (rows) and calendar time t (columns). The cells with t \ge g are the treatment effects; the pre-treatment cells that att_gt also reports are placebo estimates. Visualize the matrix as a heatmap with cohorts on the y-axis (2018, 2020, 2022) and calendar time on the x-axis. The cells where t = g (a staircase rather than a true diagonal, because the cohorts start two years apart) are the impact effects, the effect of treatment in the first year of treatment. Moving right along a row traces the dynamic path for that cohort. Moving down a column compares cohorts at the same calendar time.

What you are looking for is a coherent story. Does the impact effect (the diagonal) line up across cohorts, suggesting a stable program effect across waves? Or does it shrink as the program expands to harder municipalities? Do dynamics (movement to the right within a row) grow, peak, or fade? The heatmap is the richest single visual in modern DiD. Put it in your paper.
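Before reaching for a plotting library, it helps to lay the ATT(g,t) estimates out as a plain grid. The sketch below arranges a dictionary of estimates into cohort rows and calendar-year columns and prints it as text. The values are placeholders generated by a formula, not results, and the helper names are hypothetical.

```python
def att_grid(att, cohorts, years):
    """Arrange {(g, t): estimate} into rows (cohorts) by columns (calendar years)."""
    return [[att.get((g, t)) for t in years] for g in cohorts]


def render(grid, cohorts, years):
    """Text version of the heatmap; cells before treatment are shown as dots."""
    lines = ["cohort " + " ".join(f"{t:>6}" for t in years)]
    for g, row in zip(cohorts, grid):
        cells = " ".join(f"{v:6.3f}" if v is not None else "     ." for v in row)
        lines.append(f"{g:<6} {cells}")
    return "\n".join(lines)


cohorts = [2018, 2020, 2022]
years = list(range(2018, 2024))
# Placeholder estimates: 0.01 on impact, growing by 0.01 per year of exposure
att = {(g, t): round(0.01 * (t - g + 1), 3)
       for g in cohorts for t in years if t >= g}

grid = att_grid(att, cohorts, years)
print(render(grid, cohorts, years))
```

Reading along a row gives one cohort's dynamic path, and the first filled cell in each row is that cohort's impact effect. The same grid can be passed to any heatmap function once you are happy with the layout.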

4.10 Reporting checklist

For every DiD paper you write or referee, the minimum acceptable contents:

The event-study plot from a modern estimator (Sun-Abraham or Callaway-Sant’Anna), normalized at e = -1, with confidence intervals.
TWFE and modern-estimator point estimates reported side by side, so the reader sees the magnitude of the correction.
Goodman-Bacon decomposition reported in an appendix table, with the weight on forbidden comparisons disclosed.
Cohort heterogeneity (CS type = "group" aggregation) if rollout is staggered.
Robustness to alternative control groups: never-treated only versus not-yet-treated only. Report both.
Standard errors clustered at the treatment-assignment level. State this explicitly.
Honest pre-trends bounds (Rambachan-Roth) if visual pre-trends are not flat, or if a referee can plausibly object that they are not flat.
Population weights when the estimand is a population-weighted ATT.
A clear statement of what the estimand is (overall ATT? ATT for the 2018 cohort? Dynamic ATT at e = 3?).

4.11 Why this matters for your career

You are walking into a field where the foundational empirical method is DiD. Every World Bank impact evaluation, every J-PAL pre-analysis plan, every IFC results framework, every DFI annual report on additionality is, somewhere in its plumbing, a DiD. The papers you will be asked to read at orientation, the papers you will be asked to summarize for board memos, the papers you will eventually write yourself, are mostly DiD all the way down.

The TWFE crisis is the most consequential econometrics finding of the past decade for development and blended finance. It changed how the World Bank’s Development Impact Evaluation group writes its reports, how J-PAL teaches its evaluation training courses, and how AEA-affiliated journals desk-screen submissions. A paper submitted in 2026 that runs TWFE on staggered rollout data without disclosing the bias gets desk-rejected. A paper that quietly reports TWFE without addressing the bias and gets published anyway will be retracted-by-citation. Future authors footnote it as subject to the staggered-treatment bias documented in Goodman-Bacon (2021), and its policy weight collapses.

Your competitive advantage as a junior researcher entering this space is not that you can run a regression. Everyone can. It is that you can read a paper, immediately identify whether the authors used TWFE on staggered data, look at their event-study plot, and form a defensible view on whether their estimate is biased. That is a five-minute skill that takes most senior practitioners weeks of catching up to acquire, because they learned DiD before 2018. You are starting after. Use that.

One more thing. In blended finance, where the “treated” unit is often a fund-of-funds, a guarantee program, or a specific deal, you will frequently have very small N, a handful of treated units and dozens of years. DiD with small N is hard. Cluster bootstraps, wild cluster bootstraps (Cameron, Gelbach, and Miller 2008), and synthetic control are your friends. The next chapter is synthetic control. Read it knowing that it is the answer to “what do I do when I only have three treated countries?”

4.12 References

Foundational and methodological

Abadie, A., Athey, S., Imbens, G. W., & Wooldridge, J. M. (2023). When should you adjust standard errors for clustering? Quarterly Journal of Economics, 138(1), 1-35. (Originally NBER WP, 2017.)
Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? Quarterly Journal of Economics, 119(1), 249-275.
Borusyak, K., Jaravel, X., & Spiess, J. (2024). Revisiting event-study designs: Robust and efficient estimation. Review of Economic Studies, 91(6), 3253-3285.
Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.
Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2008). Bootstrap-based improvements for inference with clustered errors. Review of Economics and Statistics, 90(3), 414-427.
Card, D., & Krueger, A. B. (1994). Minimum wages and employment: A case study of the fast-food industry in New Jersey and Pennsylvania. American Economic Review, 84(4), 772-793.
de Chaisemartin, C., & d’Haultfœuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9), 2964-2996.
Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254-277.
Rambachan, A., & Roth, J. (2023). A more credible approach to parallel trends. Review of Economic Studies, 90(5), 2555-2591.
Roth, J. (2022). Pretest with caution: Event-study estimates after testing for parallel trends. American Economic Review: Insights, 4(3), 305-322.
Solon, G., Haider, S. J., & Wooldridge, J. M. (2015). What are we weighting for? Journal of Human Resources, 50(2), 301-316.
Sun, L., & Abraham, S. (2021). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225(2), 175-199.

Applied development and blended finance

Banerjee, A., Karlan, D., & Zinman, J. (2015). Six randomized evaluations of microcredit: Introduction and further steps. American Economic Journal: Applied Economics, 7(1), 1-21.
Burgess, R., & Pande, R. (2005). Do rural banks matter? Evidence from the Indian social banking experiment. American Economic Review, 95(3), 780-795.
Duflo, E. (2001). Schooling and labor market consequences of school construction in Indonesia: Evidence from an unusual policy experiment. American Economic Review, 91(4), 795-813.
Kaboski, J. P., & Townsend, R. M. (2012). The impact of credit on village economies. American Economic Journal: Applied Economics, 4(2), 98-133.

Software

did (R), Callaway & Sant’Anna: https://bcallaway11.github.io/did/
fixest (R), Bergé: https://lrberge.github.io/fixest/
HonestDiD (R), Rambachan & Roth: https://github.com/asheshrambachan/HonestDiD
csdid (Stata), Rios-Avila, Callaway, & Sant’Anna: SSC.
eventstudyinteract (Stata), Sun: SSC.
did_multiplegt_dyn (Stata), de Chaisemartin et al.: SSC.

End of Chapter 4. Chapter 5 takes up Synthetic Control and its modern generalizations (synthetic DiD, matrix completion). Read this chapter and the next together; they are the two halves of the same problem.