Chapter 9. Heterogeneous Treatment Effects and Spatial Methods
Inference Lab · Dr. Ian Helfrich
Two big topics live in this chapter, and they are connected in a way that matters for applied work. Rural blended finance research has two unavoidable features. The treatment hits different people differently, and the units sit in space. Ignore the first and you tell donors a story about a program that “worked” or “didn’t” when the truth is that it worked for some and failed for others. Ignore the second and the standard errors are wrong and the average treatment effect mixes in spillovers no one ever modelled. Both mistakes are easy to make and easy to fix once you know the tools.
The chapter has two parts. Part A handles heterogeneity (interaction terms, causal forests, double machine learning, policy learning). Part B handles space (autocorrelation, Conley standard errors, spatial regression discontinuity, SUTVA violations). Each ends with a worked example tied to a rural credit application.
Part A. Heterogeneous treatment effects
1. Why averages are not enough
Consider a rural microcredit pilot in a Portuguese district. The headline result the donor wants is “did the program raise household income?” Run the regression, the ATT comes back at \hat\tau = 0.002 with a wide standard error, and the donor’s program officer asks for a one-page brief saying the intervention failed.
But suppose the truth looks like this:
\tau(x) = \begin{cases} +0.10 & \text{if the household had high baseline assets} \\ -0.10 & \text{if the household had low baseline assets} \end{cases}
The ATT is roughly zero because the two effects cancel. The policy implication is the opposite of “the program failed.” The program transfers welfare from the poorest to the richer rural households. That is a finding worth a paper. It is also the finding the donor most needs to know before scaling.
The conditional average treatment effect \tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x] is the object the analyst actually cares about. The ATT is just the population average of \tau(x). For rural blended finance, the question “who responded?” is the entire policy question.
2. Classical heterogeneity: interaction terms
The oldest and still most useful approach is to interact treatment with a covariate the analyst has a prior about. In a difference-in-differences setting:
Y_{it} = \alpha_i + \gamma_t + \tau D_{it} + \delta (D_{it} \times Z_i) + \varepsilon_{it}
Here Z_i is the moderator (baseline assets, household size, distance to market). The coefficient \delta says whether the treatment effect varies with Z. Test H_0: \delta = 0, report the conditional effect at meaningful values of Z, and the heterogeneity analysis is clean and transparent.
The cost of this approach is that the analyst is stuck with whatever moderators were specified in advance. If the real heterogeneity runs along some interaction of three covariates no one thought to interact, the specification will miss it. This is where causal machine learning earns its keep.
3. Modern causal machine learning
Three tools, in roughly the order to learn them.
Causal forests (Athey, Tibshirani, and Wager 2019) extend the random forest to estimate \tau(x) directly. The idea is to grow trees that split on the treatment effect heterogeneity rather than on the outcome. At each leaf, the local treatment effect is estimated by a doubly robust score. The forest then averages many such trees, producing a smooth function \hat\tau(x) that can be evaluated at any point.
Two features make causal forests practically valuable. First, no need to specify which covariates drive heterogeneity. The forest tells you. Second, the package implements honest splitting: half the sample picks the tree structure, the other half estimates the leaf values. This gives valid pointwise confidence intervals on \hat\tau(x). The R package is grf (generalized random forests) by Tibshirani, Athey, and Wager.
Double or debiased machine learning (Chernozhukov, Chetverikov, Demirer, Duflo, Hansen, Newey, and Robins 2018) solves a different problem. The setting is a low-dimensional treatment effect of interest with a high-dimensional vector of controls. The analyst wants to use flexible ML (random forests, lasso, neural nets) for the nuisance functions (the propensity score \pi(x) = \mathbb{P}(D=1 \mid X=x) and the outcome regressions) without contaminating the inference on \tau.
The trick is two-fold. (i) Use a Neyman-orthogonal moment condition, so first-stage estimation error in the nuisance functions cancels out at first order. (ii) Cross-fit, meaning the nuisance functions are estimated on one half of the sample and used to predict on the other half, then swap. The resulting estimator is \sqrt{n}-consistent and asymptotically normal even when the nuisance ML estimators converge at n^{-1/4} rates. The R package is DoubleML; the Python equivalent is also DoubleML.
Best linear projection of the CATE (Semenova and Chernozhukov 2021) is the interpretation layer on top of a causal forest. Given \hat\tau(x_i) for each unit from grf::causal_forest(), project it onto a small set of covariates Z_i of interest:
\hat\tau(x_i) = Z_i' \beta + u_i
The fitted \hat\beta describes, in plain regression language, how the treatment effect varies with the chosen moderators. The nonparametric flexibility of the forest is preserved while results read in a way a policymaker can follow. This is the bridge between “the forest found heterogeneity” and “here is what it means.”
Policy learning (Athey and Wager 2021) closes the loop. Given \hat\tau(x), a rule \pi: x \mapsto \{0,1\} can be derived that assigns treatment to maximize welfare:
\pi^*(x) = \mathbf{1}\{\hat\tau(x) > c\}
where c is the per-unit cost of treatment. The Athey-Wager paper provides regret bounds: how much welfare is lost by using the estimated \hat\tau instead of the true \tau. The R package is policytree.
4. Worked example: rural microcredit CATE
Imagine a randomized rural credit pilot from a Portuguese mutualist cooperative. Outcome: household consumption one year post-treatment. Treatment: received the credit line. Covariates: baseline assets, household size, distance to market town, age of head, whether the household has migrants abroad. Sample size is 4,800 households.
library(grf)
library(tidyverse)
# Data: Y outcome, W treatment, X covariates
X <- df %>% select(baseline_assets, hh_size, dist_market, age_head, has_migrant) %>%
as.matrix()
Y <- df$consumption_post
W <- df$treated
# Honest causal forest. num.trees high enough to stabilize.
cf <- causal_forest(X, Y, W,
num.trees = 4000,
honesty = TRUE,
tune.parameters = "all",
seed = 42)
# Predict CATE for each unit (out-of-bag estimates by default)
tau.hat <- predict(cf)$predictions
tau.se <- sqrt(predict(cf, estimate.variance = TRUE)$variance.estimates)
# Variable importance: which covariates drive heterogeneity?
var_imp <- variable_importance(cf)
rownames(var_imp) <- colnames(X)
print(var_imp)
# Best linear projection onto a small set of moderators
blp <- best_linear_projection(cf,
A = X[, c("baseline_assets", "dist_market")])
print(blp)
# Diagnostic: omnibus calibration test
test_calibration(cf)The test_calibration output gives a “mean prediction” coefficient (should be near 1 if the forest is unbiased on average) and a “differential prediction” coefficient (should be positive and significant if the forest has detected genuine heterogeneity). If the differential prediction coefficient is insignificant, there is no detectable heterogeneity to talk about. Report that honestly.
To visualize:
# Plot CATE by baseline assets
df$tau_hat <- tau.hat
df$tau_se <- tau.se
ggplot(df, aes(x = baseline_assets, y = tau_hat)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "loess", color = "darkred") +
geom_hline(yintercept = 0, linetype = "dashed") +
labs(x = "Baseline assets (EUR)",
y = expression(hat(tau)(x)),
title = "CATE by baseline assets, rural credit pilot",
caption = "Source: cooperative pilot data, N=4,800. Honest causal forest, 4,000 trees. Shaded band: loess 95% CI.")For Stata users: there is a honestrf ado and a causalforest package, but both are noticeably less mature than grf. Several diagnostics from grf (omnibus calibration, best linear projection, doubly robust ATE) are not directly available. For the full causal-forest toolkit, use R. When Stata is the only option, fall back to interaction terms and a Romano-Wolf adjustment for testing many moderators.
5. Common traps in heterogeneity analysis
Five traps to internalize.
HARKing (Hypothesizing After Results are Known). Look at 50 candidate subgroups, find that the “single mothers in coastal districts with primary education” subgroup has a p < 0.05 effect, and write that up as the headline. With 50 tests at \alpha = 0.05, you expect 2.5 false positives by pure chance. The fix is either pre-registration (commit to subgroups before seeing data) or multiple-testing correction. Romano-Wolf (2005) is the right choice when tests are correlated, which subgroup tests usually are. Benjamini-Hochberg (1995) controls the false discovery rate and is simpler.
Confusing CATE heterogeneity with effect modification by an outcome-relevant covariate. If X predicts Y directly (it is a strong predictor) and \hat\tau(x) varies with X, this might be genuine effect modification or it might be that the noise structure differs across X levels. The omnibus calibration test in grf helps. So does looking at whether the heterogeneity replicates on a held-out sample.
Fitting and inferring on the same sample. This is the most common error in applied causal ML. Grow a forest on the full sample, then predict \hat\tau(x_i) for the same units, and the predictions are overfitted and the confidence intervals understate uncertainty. The fix is honest splitting (built into grf::causal_forest() with honesty = TRUE) or out-of-bag prediction (also default).
Cream-skimming. The subgroup with the biggest measured \hat\tau(x) may be the subgroup that would have done well without treatment. If a microcredit program produces the largest “effect” for households with high baseline assets, that could be because those households were already on an upward trajectory. The CATE is identified only if the unconfoundedness assumption holds conditional on X. If high-asset households have different unobserved trajectories, no amount of forest tuning rescues you.
Reporting too many decimal places. A causal forest with 4,000 trees on 4,800 households gives a \hat\tau(x) that is, at best, a smooth approximation. Reporting \hat\tau = 0.1247 implies a precision the method does not deliver. Round, and report the uncertainty.
Part B. Spatial methods
6. Why spatial matters for rural blended finance
Rural data is geographic by construction. The unit of observation is a household in a village, a cooperative in a parish, a farm parcel in a freguesia, a borrower in a Portuguese municipality. These units are organized in space, and that organization carries information the econometric model has to handle.
Two failures follow from ignoring space. First, spatial autocorrelation in the residuals makes the OLS standard errors too small. The naive count is n = 1{,}200 municipalities of independent variation. The effective count is closer to 200 because nearby municipalities share unobserved shocks (weather, market access, ethnic composition). Second, spillovers across units violate SUTVA. If treating Village A raises consumption in neighboring Village B through trade or imitation, the control units are contaminated and the ATT is biased.
Both problems are solved with tools that have become standard in the last decade. The hardest part is recognizing which problem you have.
7. Spatial autocorrelation: the diagnostic
Moran’s I is the workhorse statistic for spatial autocorrelation. Given residuals e_i and a row-standardized spatial weights matrix W (where w_{ij} encodes the proximity between units i and j):
I = \frac{n}{S_0} \cdot \frac{\sum_i \sum_j w_{ij} (e_i - \bar e)(e_j - \bar e)}{\sum_i (e_i - \bar e)^2}
with S_0 = \sum_i \sum_j w_{ij}. Under the null of no spatial dependence, \mathbb{E}[I] = -1/(n-1), which is near zero for large n. Significantly positive I means nearby units have similar residuals. Geary’s C is a closely related statistic; the two usually agree.
Always run Moran’s I on the residuals from the fitted regression, not on the outcome. A spatially clustered outcome is fine if the controls absorb it. Spatial clustering in residuals is the diagnostic that something is missing.
Pair the statistic with a Moran scatterplot: plot e_i against \sum_j w_{ij} e_j (the spatial lag of the residual). A positive slope is positive autocorrelation. The plot also flags spatial outliers: units whose residual differs sharply from their neighbors.
8. Spatial econometric models
Two canonical models.
Spatial autoregressive (SAR), also called spatial lag:
Y = \rho W Y + X \beta + \varepsilon
The outcome at unit i depends on the outcomes at i’s neighbors. This is the right specification when the underlying mechanism is genuine spillover (a neighbor’s microcredit uptake raises your consumption directly through trade, demonstration, or labor markets).
Spatial error model (SEM):
Y = X \beta + u, \qquad u = \lambda W u + \varepsilon
The errors are spatially correlated, but the outcomes are not directly. This is the right specification when there are unobserved spatially-correlated shocks (weather, soil, ethnic composition) that affect outcomes but no theory says one unit’s outcome causes another’s.
The Lagrange Multiplier tests (Anselin 1988; Anselin, Bera, Florax, and Yoon 1996) help with the choice. Run robust LM-lag and LM-error tests. If LM-lag is significant and LM-error is not, fit SAR. If the reverse, fit SEM. If both are significant, fit the more general SARAR model and report both \rho and \lambda.
A serious caveat. Gibbons and Overman (2012) argue that the parameters of SAR are not credibly identified without an exclusion restriction, because W Y is mechanically correlated with W X and with any spatially correlated shock. Their position is that applied researchers should be skeptical of any policy conclusion that hinges on the estimated \rho. The author agrees. Use SAR for description and pattern detection. Do not use it to make causal claims about spillovers without an instrument or experimental variation.
9. The modern preferred alternative: Conley standard errors
Most applied work in rural development has moved away from estimating \rho and \lambda explicitly and toward Conley (1999) standard errors. The idea is to allow the residuals to be spatially correlated up to some distance threshold d without specifying the functional form of that correlation. The variance estimator is:
\hat V_{\text{Conley}} = (X'X)^{-1} \left( \sum_i \sum_j K(d_{ij}/d) \, x_i x_j' e_i e_j \right) (X'X)^{-1}
where K is a kernel that downweights distant pairs and d_{ij} is the geographic distance between units i and j. The result is a heteroskedasticity- and autocorrelation-consistent variance estimator that works in space the way Newey-West works in time.
Hsiang (2010) provided the implementation that made Conley SEs accessible. In R, the package is conleyreg. In Stata, the dominant packages are acreg (Colella, Lalive, Sakalli, Thoenig) and reg2hdfespatial (Fetzer). All of them take regression output, unit coordinates, and a distance cutoff, and return Conley-corrected SEs.
Two practical points. First, the cutoff d matters. Common practice is to try several cutoffs (50 km, 100 km, 200 km) and report the largest, on the principle that under-clustering is worse than over-clustering. Second, Conley SEs typically increase standard errors by 30 to 100 percent relative to OLS. If a “significant” result depends on OLS SEs and dies under Conley, question whether the result is real.
Goldsmith-Pinkham, Sorkin, and Swift (2020) is the modern reference on what spatial clustering actually buys you and where it falls short. Read it before submitting a paper.
10. Geographic regression discontinuity
When a treatment is defined by crossing a geographic boundary (a colonial district line, a province border, an EU funding cutoff between NUTS-2 regions), the running variable is signed distance to the boundary. The geographic RDD framework follows the standard RDD logic in Chapter 5, with two modifications.
First, the running variable is two-dimensional. Distance alone is not enough; you need to control for location along the boundary as well. Keele and Titiunik (2015) developed the two-dimensional running variable approach: include latitude and longitude (or two coordinates of position along the boundary) as additional running variables, and estimate locally.
Second, the boundary itself must be “as good as random” conditional on observables. The classic application is Dell (2010) on the Peruvian mita. The Spanish colonial mita conscripted labor from districts inside a bounded region for the Potosi silver mines from 1573 to 1812. Dell shows that the boundary was drawn on administrative grounds (river basins, ethnic boundaries) that are plausibly orthogonal to long-run economic outcomes once you control for elevation and ethnic composition. Households on opposite sides of the boundary, separated by only a few kilometers, faced different colonial labor regimes. Today, the mita side is 25 percent poorer.
For rural blended finance, the geographic RDD shows up whenever a program treats one side of an administrative boundary and not the other. EU LEADER funding cutoffs at NUTS-2 borders. National rural credit eligibility tied to municipality classification. Watershed-based agricultural subsidies. If the treatment can be written as a sharp geographic discontinuity, the design is available.
11. Spatial spillovers and SUTVA violations
The Stable Unit Treatment Value Assumption requires that one unit’s potential outcomes do not depend on another unit’s treatment. In rural settings, this often fails. A microcredit program in Village A raises consumption in Village B because cooperatives trade across villages. A pesticide subsidy in Parcel A affects Parcel B through wind drift. An agricultural extension visit to one farmer raises productivity for the neighbors who watch.
Three approaches.
Saturation designs (Baird, Bohren, McIntosh, and Özler 2018) randomize at two levels. First, randomly assign clusters (villages, parishes) to high, medium, or low saturation. Second, within each cluster, randomly assign households to treatment. This produces variation in both whether a household is treated and how saturated its environment is. The direct effect (own treatment) and the spillover effect (neighbors’ treatment) can then be estimated separately.
Explicit modeling of W in the regression. Add a “share of neighbors treated” variable on the right-hand side:
Y_{it} = \alpha_i + \gamma_t + \tau D_{it} + \delta \overline{D}_{N(i),t} + \varepsilon_{it}
where \overline{D}_{N(i),t} is the fraction of unit i’s neighbors who are treated at time t. The coefficient \delta is the spillover. Identification requires exogenous variation in neighbor treatment, which is why this approach pairs naturally with cluster-randomized or saturation designs.
Cluster-randomized designs. If spillovers operate within villages but not across them, randomize at the village level. The cluster contains its own spillovers, and across-cluster comparisons recover the total program effect. This is the cleanest design when villages are well-separated; it falls apart when villages trade across boundaries.
For rural credit specifically, the spillover problem is large and underappreciated. Banerjee, Chandrasekhar, Duflo, and Jackson (2013) on the diffusion of microfinance in Indian villages showed that information about a microfinance program spreads through household networks; treatment of central nodes is much more effective than treatment of peripheral nodes. Ignore the network and you underestimate the program’s reach and miss the right targeting rule.
12. Worked example: Portuguese municipal credit data
Imagine administrative data on a rural credit program covering 308 Portuguese municipalities (concelhos), with random eligibility determined by a tier classification cutoff. Outcome: agricultural value added per capita, three years post-program.
library(sf)
library(spdep)
library(spatialreg)
library(conleyreg)
library(tidyverse)
# Load shapefile, merge with outcome data
muni <- st_read("concelhos_pt.shp") %>%
st_transform(crs = 3763) # ETRS89 / Portugal TM06, units in meters
muni <- muni %>% left_join(df, by = "DICO")
# Construct spatial weights: queen contiguity, row-standardized
nb <- poly2nb(muni, queen = TRUE)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
# Step 1: OLS baseline
ols <- lm(agri_va_pc ~ treated + log_pop + elev_mean + dist_lisbon, data = muni)
summary(ols)
# Step 2: Moran's I on residuals
moran.test(residuals(ols), lw, zero.policy = TRUE)
# Step 3: LM tests to choose SAR vs SEM
lm.LMtests(ols, lw, test = c("LMerr", "LMlag", "RLMerr", "RLMlag"))
# Step 4a: Fit SAR
sar <- lagsarlm(agri_va_pc ~ treated + log_pop + elev_mean + dist_lisbon,
data = muni, listw = lw, zero.policy = TRUE)
summary(sar)
# Step 4b: Fit SEM
sem <- errorsarlm(agri_va_pc ~ treated + log_pop + elev_mean + dist_lisbon,
data = muni, listw = lw, zero.policy = TRUE)
summary(sem)
# Step 5: Conley standard errors at multiple cutoffs
muni_xy <- muni %>%
mutate(centroid = st_centroid(geometry),
lon = st_coordinates(centroid)[,1],
lat = st_coordinates(centroid)[,2]) %>%
st_drop_geometry()
conley_50 <- conleyreg(agri_va_pc ~ treated + log_pop + elev_mean + dist_lisbon,
data = muni_xy, dist_cutoff = 50,
lat = "lat", lon = "lon", crs = 3763)
conley_100 <- conleyreg(agri_va_pc ~ treated + log_pop + elev_mean + dist_lisbon,
data = muni_xy, dist_cutoff = 100,
lat = "lat", lon = "lon", crs = 3763)
conley_200 <- conleyreg(agri_va_pc ~ treated + log_pop + elev_mean + dist_lisbon,
data = muni_xy, dist_cutoff = 200,
lat = "lat", lon = "lon", crs = 3763)
# Report all three. The largest is the conservative choice.In Stata, the equivalent Conley step uses acreg:
* Load coordinates
use "concelhos_pt.dta", clear
* OLS with Conley SEs, 100km cutoff
acreg agri_va_pc treated log_pop elev_mean dist_lisbon, ///
latitude(lat) longitude(lon) dist(100) ///
kernel(uniform)The gettable package (Frey 2022) helps you export Conley-clustered results into LaTeX tables alongside OLS and SAR specifications, which is the standard table format for a rural development empirical paper.
13. Career relevance
Two practical claims. First, rural blended finance papers get torn apart in peer review when authors fail to cluster spatially. Referees in development economics now expect Conley SEs on any rural geographic dataset by default. Submit a panel of municipalities with OLS or even state-clustered SEs and expect a referee report demanding Conley clustering as a robustness check. Build it into the first draft.
Second, program evaluation papers under-state effects when they ignore spillovers. Rural credit, agricultural extension, and microfinance all generate neighbor effects, and most evaluation designs are built around the assumption that the control group is untreated. In a saturated rural ecosystem, the control group is partly contaminated, and the naive ATT is biased toward zero. A saturation design or explicit W modeling that recovers both direct and spillover effects makes a stronger case for scaling.
The combination of causal forest CATE estimates plus Conley-corrected SEs plus a thoughtful treatment of spillovers is the modern frontier for rural impact evaluation. The first paper that ties all three together cleanly tends to be the most cited in an early portfolio.
14. References
Anselin, L. (1988). Spatial Econometrics: Methods and Models. Kluwer Academic.
Anselin, L., Bera, A. K., Florax, R., and Yoon, M. J. (1996). Simple diagnostic tests for spatial dependence. Regional Science and Urban Economics, 26(1), 77 to 104.
Athey, S., Tibshirani, J., and Wager, S. (2019). Generalized random forests. Annals of Statistics, 47(2), 1148 to 1178.
Athey, S., and Wager, S. (2021). Policy learning with observational data. Econometrica, 89(1), 133 to 161.
Baird, S., Bohren, J. A., McIntosh, C., and Özler, B. (2018). Optimal design of experiments in the presence of interference. Review of Economics and Statistics, 100(5), 844 to 860.
Banerjee, A., Chandrasekhar, A. G., Duflo, E., and Jackson, M. O. (2013). The diffusion of microfinance. Science, 341(6144), 1236498.
Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. JRSS B, 57(1), 289 to 300.
Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. Econometrics Journal, 21(1), C1 to C68.
Conley, T. G. (1999). GMM estimation with cross sectional dependence. Journal of Econometrics, 92(1), 1 to 45.
Dell, M. (2010). The persistent effects of Peru’s mining mita. Econometrica, 78(6), 1863 to 1903.
Gibbons, S., and Overman, H. G. (2012). Mostly pointless spatial econometrics? Journal of Regional Science, 52(2), 172 to 191.
Goldsmith-Pinkham, P., Sorkin, I., and Swift, H. (2020). Bartik instruments: what, when, why, and how. American Economic Review, 110(8), 2586 to 2624. (And the related working-paper note on spatial clustering practice.)
Hsiang, S. M. (2010). Temperatures and cyclones strongly associated with economic production in the Caribbean and Central America. Proceedings of the National Academy of Sciences, 107(35), 15367 to 15372.
Keele, L. J., and Titiunik, R. (2015). Geographic boundaries as regression discontinuities. Political Analysis, 23(1), 127 to 155.
Romano, J. P., and Wolf, M. (2005). Stepwise multiple testing as formalized data snooping. Econometrica, 73(4), 1237 to 1282.
Semenova, V., and Chernozhukov, V. (2021). Debiased machine learning of conditional average treatment effects and other causal functions. Econometrics Journal, 24(2), 264 to 289.
End of Chapter 9. Next: Chapter 10, Panel Data with Many Fixed Effects.