Chapter 12. Practice problems
The format mirrors the macroprep question bank. Each problem has a prompt, a solution, and a one-line “key insight.” Tiers escalate Easy → Medium → Hard → Twist. The Twist tier is where junior researchers tend to lose points in real-world contexts: at a J-PAL methods retreat, in a referee report, or during a World Bank seminar.
Work each one on paper first. Then check yourself.
Block 1. OLS, FE, clustering
Q1.1 (Easy)
You run reg log_consumption treated, robust on a panel of 500 households across 10 years (n=5000). What’s wrong?
Solution. The same household appears in 10 rows. Errors are correlated within household. robust SEs assume independence across rows, so the SE is too small. Cluster at the household level: reg log_consumption treated, vce(cluster household_id).
Key insight. Robust SE handles heteroskedasticity. Cluster SE handles within-group correlation. They are different problems.
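The gap between the two standard errors in Q1.1 can be seen in a small simulation. This is a minimal sketch in plain Python, not the Stata command from the problem: it assumes treatment is fixed within household, an intra-household correlation of 0.5, and a true effect of 0.1, none of which come from the problem itself. The helper ols_slope_with_ses is a hypothetical name, and both variance formulas omit finite-sample corrections.

```python
import math
import random


def ols_slope_with_ses(y, x, groups):
    """Slope of y on x (with intercept) plus HC0-robust and cluster-robust SEs."""
    n = len(y)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    xd = [xi - x_bar for xi in x]
    sxx = sum(d * d for d in xd)
    slope = sum(d * (yi - y_bar) for d, yi in zip(xd, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for yi, xi in zip(y, x)]
    # Heteroskedasticity-robust (HC0): every row is treated as independent.
    robust_var = sum((d * e) ** 2 for d, e in zip(xd, resid)) / sxx ** 2
    # Cluster-robust: sum the scores within each group before squaring.
    score_by_group = {}
    for d, e, g in zip(xd, resid, groups):
        score_by_group[g] = score_by_group.get(g, 0.0) + d * e
    cluster_var = sum(s * s for s in score_by_group.values()) / sxx ** 2
    return slope, math.sqrt(robust_var), math.sqrt(cluster_var)


# Simulated panel: 500 households x 10 years (assumed data-generating process).
rng = random.Random(12)
y, x, household = [], [], []
for h in range(500):
    treated = 1.0 if h % 2 == 0 else 0.0   # assumption: treatment fixed within household
    alpha = rng.gauss(0.0, 1.0)            # persistent household-level shock
    for _ in range(10):
        x.append(treated)
        household.append(h)
        y.append(5.0 + 0.1 * treated + alpha + rng.gauss(0.0, 1.0))

slope, se_robust, se_cluster = ols_slope_with_ses(y, x, household)
print(f"slope={slope:.3f}  robust SE={se_robust:.3f}  cluster SE={se_cluster:.3f}")
```

Under these assumptions the cluster-robust SE should come out at roughly twice the robust one, which is the understatement the solution warns about.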
Q1.2 (Medium)
You have 12 clusters in a cluster-randomized trial. You run reghdfe ... vce(cluster cluster_id). Why might this give the wrong inference?
Solution. Cluster-robust asymptotics require many clusters (typically \geq 40 as a rule of thumb). With 12, the sampling distribution of the cluster-robust t-statistic is not well approximated by a standard normal, so you over-reject. Use the wild cluster bootstrap (Cameron-Gelbach-Miller 2008). In Stata: boottest treated, cluster(cluster_id) reps(9999). In R: fwildclusterboot::boottest(model, clustid = "cluster_id", param = "treated", B = 9999).
Key insight. Cluster count matters more than total sample size for cluster-robust inference.
Q1.3 (Hard)
A reviewer says: “your treatment is assigned at the district level but you cluster at the village level. Re-do.” You have 50 districts, 500 villages, 50000 households. What changes?
Solution. Cluster at district. Liang-Zeger SEs only handle correlation within clusters, so village clustering misses the correlation across villages that share a district-level assignment. Abadie-Athey-Imbens-Wooldridge (2023) is clear: cluster at the level of treatment assignment, not below. Your SEs will grow (probably by 1.5–3x, depending on how strongly outcomes correlate within a district) because n_clusters falls from 500 to 50.
Key insight. Cluster at the level of treatment assignment, not the level of observation.
Q1.4 (Twist)
You cluster at the district level (50 districts) and your treatment effect is 0.12 (SE 0.04, p<0.01). A discussant points out that your DGP also has time-correlated shocks at the year level (severe drought in 2019 affected all districts). What is the fix?
Solution. Two-way clustering at district AND year. In Stata: reghdfe ... vce(cluster district year). In R fixest: cluster = c("district", "year"). Cameron-Gelbach-Miller (2011) showed that if shocks correlate in two non-nested dimensions, you need to cluster on both. Two-way clustering needs enough clusters in both dimensions, so with only a handful of years treat the year dimension with caution.
Key insight. Use two-way clustering when treatment and shocks vary along different, non-nested dimensions. One-way clustering will understate the SE.
Block 2. Difference-in-Differences
Q2.1 (Easy)
2x2 DiD on a simulated 2-period 2-group panel. Treated group goes from mean log-cons 5.0 to 5.4; control from 5.0 to 5.2. What’s the DiD estimate?
Solution. \hat\delta = (5.4 - 5.0) - (5.2 - 5.0) = 0.4 - 0.2 = 0.2.
Key insight. DiD is just two-difference subtraction. Don’t overthink.
Q2.2 (Medium)
On the same data, you run reg y treated post (treated × post). The coefficient on the interaction is 0.2. Why is this the same as Q2.1?
Solution. In the 2x2 case, the coefficient on the interaction term is algebraically identical to the DiD estimate from group means. The regression representation is equivalent.
Key insight. Regression DiD = mean-differences DiD when there are no covariates and two periods.
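The arithmetic in Q2.1 and the equivalence in Q2.2 can be checked together. The sketch below uses the four cell means from Q2.1 and the fact that a saturated 2x2 regression fits each cell mean exactly, so solving the 4x4 system gives the OLS coefficients. The function name solve_linear_system is a hypothetical helper, not part of any package.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


# Cell means from Q2.1, keyed by (treated, post).
cell_means = {(0, 0): 5.0, (0, 1): 5.2, (1, 0): 5.0, (1, 1): 5.4}

# DiD from group means: (treated change) minus (control change).
did_from_means = (cell_means[(1, 1)] - cell_means[(1, 0)]) - (
    cell_means[(0, 1)] - cell_means[(0, 0)]
)

# Saturated regression: columns are constant, treated, post, treated x post.
cells = sorted(cell_means)
design = [[1.0, float(t), float(p), float(t * p)] for t, p in cells]
outcome = [cell_means[c] for c in cells]
intercept, b_treated, b_post, b_interaction = solve_linear_system(design, outcome)

print(round(did_from_means, 6), round(b_interaction, 6))
```

The interaction coefficient matches the subtraction because, with two groups, two periods, and no covariates, the four regression coefficients are just a re-parameterization of the four cell means.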
Q2.3 (Hard)
You run TWFE with staggered rollout: 5 cohorts, treatment at years 2018, 2019, 2020, 2021, 2022. Goodman-Bacon (2021) decomposition. Why might the TWFE coefficient be misleading?
Solution. Already-treated units serve as controls for newly-treated cohorts. If treatment effects are heterogeneous across cohorts or evolve over time, these later-vs-earlier 2x2 comparisons subtract the change in the earlier cohort's effect, so TWFE implicitly puts negative weights on some cohort-period treatment effects. The estimate can have the wrong sign relative to the true average ATT. Diagnose with the bacondecomp package (R) or the bacondecomp ado (Stata).
Key insight. Staggered rollout + heterogeneous TEs → TWFE is biased. Use Callaway-Sant’Anna, Sun-Abraham, or Borusyak-Jaravel-Spiess.
Q2.4 (Twist)
A reviewer says: “your pre-treatment event-study coefficients fail the joint test of zero (p=0.04). Your parallel-trends assumption is violated. Reject the paper.” Are they right?
Solution. Not necessarily. A pre-trend test is a blunt instrument: it is often under-powered against violations that matter (Roth 2022), and with many leads and small SEs it rejects “zero” even for violations that are economically trivial. A rejection tells you the pre-trend is nonzero, not that it is large enough to overturn the result. The right move is the honest DiD approach (Rambachan-Roth 2023): bound the post-treatment effect under partial-identification assumptions about how much the pre-trend could have continued. Use the HonestDiD R package. Show that under “the pre-trend continues linearly”, your estimated ATT is still positive and significant. Or: argue substantively that the small pre-trend is due to a specific known confounder you can control for.
Key insight. “Pre-trends fail” does not mean the paper is dead. Use honest DiD bounds to show what the post-treatment effect could be.
Block 3. Regression Discontinuity
Q3.1 (Easy)
Sharp RDD: census tracts with poverty rate \geq 20\% are eligible for a tax credit. Outcome: log investment per capita. You fit rdrobust log_inv_pc poverty_rate, c(20) and get \hat\tau = 0.18 (SE 0.06), CCT bandwidth 4.2, n_eff = 320. What does \hat\tau mean?
Solution. Local-to-the-cutoff causal effect of being eligible on log investment per capita. Tracts at 20% poverty have investment about 0.18 log points (roughly 20%) higher than tracts at 20-\varepsilon. It generalizes only to tracts near the cutoff, not to all tracts.
Key insight. RDD identifies a LATE at the cutoff, not the average treatment effect for everyone.
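Because the outcome in Q3.1 is in logs, the estimate of 0.18 is in log points, and converting it to a percentage change takes one line. The sketch below uses the point estimate and SE from the problem; the interval is a conventional normal-approximation one for illustration only, not the robust bias-corrected interval rdrobust reports.

```python
import math


def log_points_to_percent(beta):
    """Exact percentage change implied by a coefficient on a log outcome."""
    return 100.0 * (math.exp(beta) - 1.0)


tau_hat, se = 0.18, 0.06   # from Q3.1

pct_effect = log_points_to_percent(tau_hat)
# Conventional 95% interval in log points, then converted to percent.
ci_low = log_points_to_percent(tau_hat - 1.96 * se)
ci_high = log_points_to_percent(tau_hat + 1.96 * se)

print(f"{pct_effect:.1f}% (95% CI {ci_low:.1f}% to {ci_high:.1f}%)")
```

For small coefficients the log-point value and the percentage are close; at 0.18 the gap is already almost two percentage points.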
Q3.2 (Medium)
McCrary density test rejects (p<0.01) with a visible spike just above 20%. What do you do?
Solution. Manipulation suspected. Either: (a) drop a “donut” of observations in a narrow window around the cutoff and re-run, (b) instrument the eligibility threshold with something exogenous (rare), or (c) acknowledge the manipulation as a threat and bound the result. If the heaping is due to rounding in the running variable (poverty rates rounded to the nearest 5%), this is mechanical and the McCrary test will mislead. Look at the histogram first; if you see heaping at 15%, 20%, 25%, the issue is rounding, not strategic manipulation.
Key insight. McCrary tells you density is discontinuous. Inspect the histogram before deciding it’s manipulation.
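The "look at the histogram first" step in Q3.2 can start with a quick heaping count. This is a minimal sketch on made-up poverty rates (the values below are hypothetical, not from any dataset); heaping_share is an illustrative helper name.

```python
from collections import Counter


def heaping_share(values, step=5.0, tol=1e-9):
    """Share of running-variable values sitting exactly on a multiple of `step`."""
    hits = sum(1 for v in values if abs(v / step - round(v / step)) < tol)
    return hits / len(values)


# Hypothetical poverty rates (percent) for ten tracts.
poverty_rate = [15.0, 20.0, 20.0, 25.0, 18.3, 21.7, 20.0, 19.4, 22.1, 15.0]

share = heaping_share(poverty_rate)
# Which round numbers attract mass? Heaps at 15, 20 and 25 point to rounding.
heaps = Counter(v for v in poverty_rate if abs(v / 5.0 - round(v / 5.0)) < 1e-9)

print(share, dict(heaps))
```

If mass piles up at every multiple of 5 rather than only just above the cutoff, rounding is the more likely explanation and the density test needs to be read with that in mind.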
Q3.3 (Hard)
A reviewer says: “your CCT bandwidth is 4.2 but your point estimate flips sign at bandwidth 2.0 and bandwidth 7.0. Robustness fails.” What’s the principled response?
Solution. Show the full bandwidth-sensitivity curve. If the estimate at the CCT-optimal bandwidth is the only one that is positive and significant while the “obvious” alternatives flip, your result is fragile. Diagnose: is the running-variable distribution sparse near the cutoff at h=2? Is the underlying regression nonlinear at h=7? Re-run rdrobust over a grid of manually set bandwidths (the h() option) and plot the point estimates with their confidence intervals to show the entire curve, then defend the CCT choice with the methodological literature (the CCT bandwidth is MSE-optimal). If your result really does depend on h, you have a weaker RDD than you thought.
Key insight. RDD point estimates SHOULD be roughly stable across nearby bandwidths. If they flip sign, you have a problem.
Q3.4 (Twist)
You apply RDD to EU cohesion policy (Objective 1 funding): NUTS-2 regions with GDP per capita below 75\% of the EU average are eligible. You find a positive effect on regional growth. A discussant asks: “what about regions that crossed the threshold via re-classification at EU enlargement in 2004?” What’s the threat?
Solution. Re-classification at enlargement is a quasi-manipulation: many regions crossed because the denominator (EU average) shifted, not because the regions changed. This means the post-enlargement treated group is selected on a different mechanism than the pre-enlargement group. Either: (a) restrict your sample to pre-2004 (loses statistical power but cleaner), (b) include enlargement-year fixed effects interacted with the running variable (captures the shifted threshold), (c) do a fuzzy RDD treating enlargement as a partial compliance shock.
Key insight. RDD assumes the cutoff is stable. If the cutoff itself moves due to a policy event, you have a different problem.
Block 4. Quantile regression
Q4.1 (Easy)
You run qreg log_cons treated, q(0.5) on rural household data and get \hat\beta = 0.04. You run reg log_cons treated, robust and get \hat\beta = 0.06. Why are they different?
Solution. OLS estimates the conditional mean effect; median regression estimates the conditional median effect. If the outcome distribution is right-skewed (long upper tail), the mean is pulled by the top while the median is not. The 0.04 vs 0.06 difference reflects that the program raises the upper part of the distribution more than the middle.
Key insight. The OLS coefficient and the median-regression coefficient are not the same parameter and won’t generally coincide.
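With a single binary regressor and no covariates, the OLS coefficient in Q4.1 is the difference in group means and the median-regression coefficient is the difference in group medians, so the gap can be reproduced by hand. The ten log-consumption values below are hypothetical, chosen so the treated group gains more in the upper part of the distribution.

```python
from statistics import mean, median

# Hypothetical log consumption; odd group sizes keep each median unique.
control = [4.60, 4.80, 5.00, 5.20, 5.40]
treated = [4.62, 4.83, 5.04, 5.29, 5.52]   # gains grow toward the top

ols_coef = mean(treated) - mean(control)        # what `reg` would report
median_coef = median(treated) - median(control)  # what `qreg, q(0.5)` would report

print(round(ols_coef, 4), round(median_coef, 4))
```

The mean difference (0.06) exceeds the median difference (0.04) because the larger gains at the top pull the mean but leave the middle observation's gain unchanged.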
Q4.2 (Medium)
You run sqreg log_cons treated, q(0.1 0.5 0.9) reps(500) and find \hat\beta(0.1) = 0.12, \hat\beta(0.5) = 0.04, \hat\beta(0.9) = -0.01. What’s the policy interpretation?
Solution. The program raises the 10th percentile of the conditional consumption distribution by about 0.12 log points (roughly 12%), the median by about 4%, and leaves the top essentially flat or slightly compressed. Net effect: a progressive shift in the distribution. This is the kind of headline a Gates Foundation officer wants: “Program lifts the bottom.”
Key insight. Quantile regression coefficients across \tau describe the distributional shape of the treatment effect.
Q4.3 (Hard)
Quantile regression with fixed effects is not as clean as OLS with fixed effects. Why, and what do you do?
Solution. The within-transformation that works for OLS doesn’t have a clean quantile analogue: quantiles, unlike expectations, are not linear operators, so the conditional quantile of the demeaned outcome y_{it} - \bar y_i is not the demeaned conditional quantile, and demeaning does not remove the fixed effects. Two options: (a) penalized quantile regression (Koenker 2004), (b) the Machado-Santos-Silva (2019) method-of-moments quantile regression, which uses an auxiliary location/scale model. Both have caveats; neither is as clean as feols(y ~ x | id). R: quantreg::rq with penalty; Stata: mmqreg user-written ado.
Key insight. QR with FE is an active research problem. Be transparent about which estimator you used and why.
Q4.4 (Twist)
A discussant says: “your \hat\beta(0.1) > \hat\beta(0.9) shows the program helped the poor most.” Why is this not quite right under standard assumptions?
Solution. Quantile regression coefficients describe the effect on the conditional \tau-th quantile, not on the people who would have been at the \tau-th quantile without treatment. Without rank invariance, the people at the 10th percentile after treatment are not the same people who would have been at the 10th percentile without treatment. The cleaner question, “what does this program do to the lower tail of the unconditional outcome distribution,” is answered by RIF regression (Firpo-Fortin-Lemieux 2009), not by classical quantile regression.
Key insight. Conditional-quantile coefficients and unconditional-distribution effects are different parameters. Pick the one that matches your policy question.
Block 5. IV
Q5.1 (Easy)
You use distance to nearest bank branch as an instrument for credit takeup. First-stage F = 4.2. What’s wrong?
Solution. Weak instrument. Rule of thumb (now superseded): F > 10. Modern: use the Montiel Olea-Pflueger (2013) effective F, or report the Anderson-Rubin CI, which is robust to weak instruments. With F=4.2 your 2SLS estimate is heavily biased toward OLS and its conventional confidence interval has the wrong coverage.
Key insight. Weak first stage = weak result. Don’t rely on 2SLS if your first-stage F is below the Montiel Olea-Pflueger threshold for your setting.
Q5.2 (Medium)
First-stage F = 35. Your 2SLS estimate is 0.45, OLS is 0.12. Why might 2SLS be 4x larger than OLS, and what does that suggest?
Solution. Three possibilities: (a) OLS is downward-biased by reverse causality or attenuation from measurement error, and 2SLS is closer to truth; (b) IV identifies a LATE for compliers, who happen to be more responsive than the population average; (c) the exclusion restriction is violated and Z affects Y through a channel other than D. Defend the exclusion restriction substantively; that is the only way to rule out (c).
Key insight. Large IV/OLS gap deserves explanation. Could be bias correction, could be LATE on responsive compliers, could be exclusion violation.
Q5.3 (Hard)
A referee notes: your instrument is distance to nearest extension office. But more remote villages also have less infrastructure, lower education, and worse markets. Does the exclusion restriction hold?
Solution. Probably not, in isolation. Distance to extension office likely correlates with other rural-isolation channels that affect log farm income directly. The fix: control for those other channels (distance to nearest road, paved-road density, school count, etc.) in both the structural and the first stage. The instrument now becomes “the residual variation in extension-office distance after controlling for other isolation measures”. This is a valid identification strategy if you can argue the residual is as good as random, e.g., extension offices were placed in 1985 based on a now-defunct administrative-region map.
Key insight. Exclusion restriction = there is no channel from Z to Y other than through D, conditional on controls. Defend with theory and narrative, not regressions.
Q5.4 (Twist)
You use rainfall in year t-1 as an IV for farm income in t. What if rainfall in t-1 affects nutrition in t directly (not through income)?
Solution. Exclusion fails. Rainfall affects food prices, which affect consumption, which affects nutrition directly, not through reported farm income. The right move is either to use a different instrument (a non-agricultural shock), to model the multiple channels explicitly with a multi-equation system, or to report the reduced-form effect of weather on the outcome rather than interpreting the IV estimate as the structural effect of income.
Key insight. Weather IVs are everywhere in development economics, but their exclusion restrictions are often violated through multiple direct channels. Be very careful.
Block 6. Synthetic Control and spatial
Q6.1 (Easy)
You apply SC to evaluate a national rural credit reform in Portugal, with 12 EU donor countries. Pre-period RMSPE = 0.005, post-period RMSPE = 0.04. The ratio is 8. Is this big?
Solution. Yes, large. The synthetic Portugal tracks actual Portugal well in pre-period (RMSPE 0.005 is essentially noise), and the post-period gap is 8x larger. Permutation test: run SC on each of the 12 donors as if it were treated, build the distribution of post/pre ratios. Your treated unit’s ratio of 8 should rank near the top to give a small p-value. If it’s the 1st of 13, p ≈ 1/13 ≈ 0.077.
Key insight. Post/pre RMSPE ratio is the SC analogue of a t-stat. Permutation gives the p-value.
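The permutation logic in Q6.1 reduces to a ranking exercise. In the sketch below the treated unit's RMSPE values come from the problem, while the twelve placebo ratios are invented for illustration; permutation_p_value is a hypothetical helper name.

```python
def permutation_p_value(treated_ratio, placebo_ratios):
    """Share of units (treated included) with a ratio at least as large as the treated one."""
    all_ratios = [treated_ratio] + list(placebo_ratios)
    at_least_as_large = sum(1 for r in all_ratios if r >= treated_ratio)
    return at_least_as_large / len(all_ratios)


pre_rmspe, post_rmspe = 0.005, 0.04        # from Q6.1
treated_ratio = post_rmspe / pre_rmspe     # about 8

# Hypothetical post/pre ratios from re-running SC on each of the 12 donors.
placebo_ratios = [1.2, 0.9, 2.4, 1.7, 3.1, 1.1, 0.8, 2.0, 1.5, 4.2, 1.3, 2.7]

p_value = permutation_p_value(treated_ratio, placebo_ratios)
print(round(treated_ratio, 2), round(p_value, 3))
```

With 12 donors the smallest attainable p-value is 1/13, so this design cannot reach the conventional 5% threshold no matter how large the treated ratio is.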
Q6.2 (Medium)
Your synthetic Portugal puts 70% weight on Greece and 30% on Spain. A reviewer says: “your counterfactual is mostly Greece, but Greece had a debt crisis in the post-period that contaminates your control.” What do you do?
Solution. Restrict the donor pool, and consider augmented SC (Ben-Michael-Feller-Rothstein 2021) if pre-period fit deteriorates as a result. The simplest fix is to drop Greece from the donor pool and re-fit. If the result holds with a Greece-free donor set, the Greek debt-crisis contamination is not driving the headline. If the result vanishes, the design is in trouble: the “rural credit effect” is partly “absence of Greek debt crisis.”
Key insight. Inspect SC weights. If a donor with concurrent unrelated shocks gets large weight, robustness-check by dropping it.
Q6.3 (Hard)
A spatial-econometrics referee says: “your municipality-level regression has Moran’s I = 0.34 in residuals. Cluster SEs at municipality don’t fix this. Use Conley SEs.” Are they right?
Solution. Yes. Cluster SEs handle correlation WITHIN cluster but not BETWEEN nearby clusters. Spatial autocorrelation across municipalities (correlated outcomes across geographic neighbors) means the effective sample size is smaller than cluster SEs assume, so those SEs are too small. Use Conley (1999) SEs with a distance cutoff equal to the typical spatial correlation length (e.g., 30 km). Stata: the user-written acreg command with its spatial option and a 30 km distance cutoff. R: conleyreg::conleyreg(y ~ x, data = df, dist_cutoff = 30, lat = "lat", lon = "lon").
Key insight. Cluster SEs handle within-cluster correlation. Conley SEs handle across-cluster spatial correlation. They are not substitutes.
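The referee's statistic in Q6.3 is easy to compute by hand on a toy example. The sketch below assumes four hypothetical municipalities arranged in a line, with a binary contiguity weight matrix and made-up residuals; real applications would build the weights from distances or shared borders.

```python
def morans_i(values, weights):
    """Moran's I for a list of values and a square spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    return (n / total_weight) * cross / sum(d * d for d in dev)


# Hypothetical residuals for four municipalities lying along a line.
residuals = [1.0, 2.0, 3.0, 4.0]
# Binary contiguity: each municipality neighbors the one next to it.
contiguity = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

moran = morans_i(residuals, contiguity)
print(round(moran, 4))
```

A clearly positive value, as here, says neighbors have similar residuals, which is the pattern one-municipality clusters cannot absorb.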
Q6.4 (Twist)
You apply geographic RDD to the boundary of Portuguese NUTS-3 regions: the treated region has a rural-credit program, the adjacent region doesn’t. You find a 12% jump in farm income at the boundary. A discussant says: “but the soil quality changes at the boundary too, and that’s not random.” What do you do?
Solution. This is the Dell (2010) “mita” problem. The right fix is two-dimensional running-variable control (Keele-Titiunik 2015): include flexible functions of latitude and longitude on both sides, plus the signed distance to the boundary. Then test for balance on soil quality, elevation, road density, etc., just inside vs. just outside the boundary. If you cannot get balance, run a placebo test on the “pure” boundary effect using a soil-quality outcome. If you see a jump in soil quality at the same boundary, your design is contaminated.
Key insight. Geographic RDD assumes the boundary is otherwise smooth in covariates. Test it. If it isn’t, you have a confounded design.
Block 7. Heterogeneous TEs and causal ML
Q7.1 (Easy)
You fit a causal forest with grf::causal_forest(). The CATE prediction at one observation is 0.15. What does this mean?
Solution. The model predicts that for units with this covariate profile (baseline assets, household size, distance to market, etc.), the average treatment effect on the outcome is 0.15 in the outcome’s units. It is a conditional average for that profile, not a measurement of that one unit’s own effect.
Key insight. Causal forest = a per-unit treatment-effect prediction conditional on covariates.
Q7.2 (Medium)
A reviewer says: “your variable importance plot shows distance to market is the top driver of CATE. So programs should target distant villages.” What’s the danger in this inference?
Solution. Variable importance in grf tells you which covariates partition the CATE most cleanly, not necessarily which cause the heterogeneity. Distance might be correlated with other unobserved drivers (market access, institutional quality, ethnicity). Use the best linear projection (BLP) of the CATE on a few pre-specified covariates of policy interest (Semenova-Chernozhukov 2021) to get interpretable, defensible heterogeneity statements. Variable importance is for exploratory data analysis, not for policy.
Key insight. Variable importance is descriptive, not causal. Use BLP for policy-relevant heterogeneity statements.
Q7.3 (Hard)
You’re considering using policy learning (Athey-Wager 2021) to design a treatment-assignment rule that maximizes welfare. What’s the main implementability concern?
Solution. The learned policy rule depends on covariates that the implementing agency may not have at decision time. If your rule says “give credit to households with baseline assets < median, distance to market > 80th percentile, and household size > 5”, the rural credit officer in a Brazilian municipality needs to know those covariates to apply the rule, and they may not. Also: regret bounds in Athey-Wager assume large training samples; with n=2000 you may not have a stable policy. Practical fix: report both the optimal-policy welfare gain AND the simpler interaction-based heuristic welfare gain, and let the implementer choose.
Key insight. Policy learning gives you the rule. The rule has to be implementable in the field.
Q7.4 (Twist)
A senior researcher critiques your causal-forest analysis: “you split the sample 50/50 for fitting and inference, but the test sample has very different covariate distributions from the training sample, and the CATEs are unreliable.” What’s the diagnostic?
Solution. Check covariate overlap between training and test splits via propensity-score balance or the energy distance. If the test sample is meaningfully outside the training-sample support, your CATE predictions extrapolate. Two fixes: (a) use cross-fitting (every observation gets a prediction from a model that didn’t see it; eliminates the train/test asymmetry), (b) trim the test sample to the training-sample support before reporting CATEs. The first is what DoubleML does by default; the second is good hygiene.
Key insight. Causal-forest CATEs are reliable only on the training-sample support. Cross-fitting solves the split problem.
Block 8. Pre-analysis plans and design
Q8.1 (Easy)
You’re designing an RCT with 200 treated and 200 control individuals, 50/50 split, with a baseline standard deviation of 1.0 in log consumption. What’s the MDE (80% power, 5% size, two-sided)?
Solution. \text{MDE} \approx (1.96 + 0.84) \cdot \sqrt{\frac{1}{0.5 \cdot 0.5}} \cdot \frac{1.0}{\sqrt{400}} = 2.80 \cdot 2 \cdot 0.05 = 0.28 SD.
Key insight. With n=400 and equal split, you can detect ~0.28 SD effects. Smaller effects need more sample.
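The MDE formula in Q8.1 can be wrapped in a small function to try other sample sizes. This is a sketch of the same normal-approximation formula; the function name mde and its argument names are illustrative choices.

```python
import math
from statistics import NormalDist


def mde(n, sd=1.0, share_treated=0.5, alpha=0.05, power=0.80):
    """Minimum detectable effect for an individually randomized two-arm trial."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # about 1.96
    z_power = NormalDist().inv_cdf(power)               # about 0.84
    allocation = math.sqrt(1.0 / (share_treated * (1.0 - share_treated)))
    return (z_alpha + z_power) * allocation * sd / math.sqrt(n)


mde_400 = mde(400)     # the design in Q8.1
mde_1600 = mde(1600)   # quadrupling n halves the MDE

print(round(mde_400, 3), round(mde_1600, 3))
```

Because the MDE scales with one over the square root of n, halving the detectable effect requires four times the sample.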
Q8.2 (Medium)
Same MDE calculation, but it’s cluster-randomized: 20 clusters of 20 each. ICC = 0.15. What’s the new MDE?
Solution. Design effect = 1 + (20 - 1) \cdot 0.15 = 1 + 2.85 = 3.85. Effective sample n_{eff} = 400 / 3.85 \approx 104. MDE \approx 0.28 \cdot \sqrt{3.85} \approx 0.55 SD.
Key insight. Clustering kills power. A design with 20 clusters of 20 is worth ~100 independent observations when ICC is moderate.
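The design-effect adjustment in Q8.2 is three lines of arithmetic. The sketch below takes the 0.28 SD baseline from Q8.1 as given; note as a caveat that it keeps the normal critical values, which is optimistic with only 20 clusters.

```python
import math


def design_effect(cluster_size, icc):
    """Variance inflation from randomizing equal-sized clusters."""
    return 1.0 + (cluster_size - 1) * icc


n_total, cluster_size, icc = 400, 20, 0.15   # from Q8.2
base_mde = 0.28                              # individual-level MDE from Q8.1

deff = design_effect(cluster_size, icc)
n_effective = n_total / deff
clustered_mde = base_mde * math.sqrt(deff)

print(round(deff, 2), round(n_effective), round(clustered_mde, 2))
```

Setting icc to zero returns a design effect of 1 and the original 0.28, which is a quick sanity check on the formula.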
Q8.3 (Hard)
Your RCT has 25% attrition, mostly in the treated arm (30% T vs 20% C). What threat does this raise, and how do you address it?
Solution. Differential attrition is the deadliest threat. If treatment causes high-consumption people to drop out (e.g., they migrate after receiving credit), your retained sample is selected and the estimated treatment effect is biased. Lee (2009) bounds: assuming treatment moves attrition in one direction only (monotonicity), trim the arm with higher retention (here control, 80% vs 70%) by the excess share, (0.80 - 0.70)/0.80 = 12.5%, once from the top and once from the bottom of its outcome distribution, and recompute the treatment-control difference each way. The reported result is the [lower, upper] bound on the effect for units that would be observed under either assignment. If the bounds straddle zero, you cannot sign the effect. Otherwise the lower bound becomes the conservative headline.
Key insight. Differential attrition \geq 5pp is a paper-killer unless you bound it.
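A trimming-bounds calculation for Q8.3 can be sketched on a toy sample. The code below follows the usual Lee (2009) recipe of trimming the arm with higher retention by the excess retention share; the outcome values are hypothetical, the arm sizes are scaled down to 10 per arm to match the 30% and 20% attrition rates, and lee_bounds is an illustrative helper that ignores covariates and inference.

```python
def lee_bounds(y_treated, y_control, n_treated_assigned, n_control_assigned):
    """Trimming bounds from observed outcomes and the number assigned to each arm."""
    mean = lambda v: sum(v) / len(v)
    retain_t = len(y_treated) / n_treated_assigned
    retain_c = len(y_control) / n_control_assigned
    trim_treated = retain_t > retain_c
    # Trim whichever arm kept a larger share of its sample.
    trim_arm = sorted(y_treated if trim_treated else y_control)
    other_mean = mean(y_control if trim_treated else y_treated)
    trim_share = abs(retain_t - retain_c) / max(retain_t, retain_c)
    k = round(trim_share * len(trim_arm))   # observations to drop
    mean_drop_lowest = mean(trim_arm[k:])
    mean_drop_highest = mean(trim_arm[: len(trim_arm) - k])
    if trim_treated:
        return mean_drop_highest - other_mean, mean_drop_lowest - other_mean
    return other_mean - mean_drop_lowest, other_mean - mean_drop_highest


# Hypothetical log consumption: 10 assigned per arm, 7 treated and 8 controls observed.
y_treated = [5.1, 5.2, 5.3, 5.4, 5.5, 5.6, 5.7]
y_control = [4.8, 4.9, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5]

lower, upper = lee_bounds(y_treated, y_control, 10, 10)
print(round(lower, 3), round(upper, 3))
```

In this toy sample the untrimmed difference is 0.25 and the bounds are 0.2 to 0.3, so the sign survives; with real data the interval is usually much wider.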
Q8.4 (Twist)
You pre-specified one primary outcome (log consumption) and three secondary outcomes (food security, school attendance, mental health). Primary is null (p=0.40). One secondary is significant (mental health, p=0.02). A funder asks you to lead with the mental-health result. What do you do?
Solution. Don’t. Lead with the primary (null) and report the secondary as exploratory, with the BH-corrected p-value. The funder wants a headline; your job is to report what the design was powered to detect. Reframing exploratory findings as primary is exactly the spec-search problem PAPs are designed to prevent. The professional move: report the null cleanly, note the mental-health signal with appropriate hedging, and propose a follow-up trial powered for that outcome. Your future credibility depends on this.
Key insight. A null is data. Reframing it kills your future credibility with any methodologist who reads your paper.
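The BH correction mentioned in Q8.4 is simple to apply by hand. In the sketch below only the mental-health p-value (0.02) comes from the problem; the other two secondary p-values are hypothetical placeholders, and benjamini_hochberg is an illustrative implementation of the standard step-up adjustment.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


secondary = ["mental_health", "food_security", "school_attendance"]
raw_p = [0.02, 0.30, 0.60]   # 0.30 and 0.60 are hypothetical

adjusted_p = benjamini_hochberg(raw_p)
for name, p_raw, p_adj in zip(secondary, raw_p, adjusted_p):
    print(f"{name}: raw p={p_raw:.2f}, BH-adjusted p={p_adj:.2f}")
```

Under these placeholder values the mental-health result moves from 0.02 to 0.06 after adjustment, which is one more reason not to lead with it.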
Solutions checked
Now go run the code in code/ and redo these by hand. The point of the curriculum is fluency. Fluency comes from doing, not from reading.