Chapter 10. Pre-Analysis Plans and RCT Design

Inference Lab · Dr. Ian Helfrich




10.0 Why this chapter sits where it does

A pre-analysis plan is the document that turns an RCT from a polished story into a credible one. Every major development funder now expects one at signing: the World Bank’s DIME group, J-PAL, IPA, IDB-Invest, the EIB impact-evaluation team, the AfDB. The AER and AEJ:Applied desk-reject RCT papers without a registered trial ID. The shift happened in roughly a decade, and it is permanent.

A clean PAP is also the difference between being read as a junior implementer and being read as the person leading the methodology. The senior economist on your team will ask, in the kickoff meeting, what your pre-specified primary outcome is and how you are correcting for multiple testing. If the answer is already in the deck, the conversation moves on. If the answer is “we will work that out closer to endline,” the conversation slows down and you have lost the room.

This chapter teaches you how to write a PAP and why each section matters. Boilerplate that you wrote without understanding gets torn apart by a referee who has read more PAPs than you have. The point of the chapter is to make sure you are not that author.


10.1 Why pre-register at all

A pre-analysis plan is a document filed publicly, usually on the AEA RCT Registry at socialscienceregistry.org, before you see your endline data. It pins down:

  1. The hypotheses you set out to test.
  2. The exact specifications you will run.
  3. The corrections you will apply for multiple comparisons.
  4. The handling of attrition, missingness, and deviations.

The reason is straightforward. Without pre-registration, an empirical economist with a flexible regression specification, a modest number of covariates, and a willingness to try many subgroups can almost always find a p < 0.05 result somewhere. That is specification search, and it is fatal to inference. A pre-specified analysis plan binds your future self to the specification you committed to while the data were still blind.

Max Kasy (2018) has argued that strict pre-registration can be costly when the optimal estimator depends on data features you only see at endline. He is right that mechanical pre-registration is not optimal in a decision-theoretic sense. The institutional answer is that pre-registration is what funders and journals now require, because the system does not trust researchers, or their future selves, to honestly distinguish confirmatory from exploratory work without a paper trail. You write a primary-outcome PAP and label exploratory work as exploratory.

J-PAL requires it. IPA requires it. World Bank DIME requires it. The AER and AEJ:Applied will not review an RCT paper without an AEA Registry trial ID and a filed PAP. The question is no longer whether to pre-register. The question is how to write one that will hold up.


10.2 What goes in a PAP: the checklist

A serviceable PAP has the following sections. Memorize this list; you’ll write it dozens of times.

1. Background and theory of change. Two or three pages. What is the intervention, what is the causal mechanism you expect, what does prior literature suggest about effect sizes? A theory of change is just a diagram or paragraph mapping inputs (the intervention) through mediators (behavioral or structural channels) to outcomes (what you measure).

2. Primary, secondary, and exploratory outcomes. Tiered. Primary outcomes are the ones you will correct for multiple testing and on which the study lives or dies. Keep this list to one to three outcomes. Secondary outcomes are pre-specified but acknowledged as underpowered. Exploratory outcomes are unrestricted, but must be labeled honestly as exploratory.

3. Sample. Population, eligibility criteria, sampling frame, target sample size. If you’re sampling municipalities, state which administrative list you drew from and why.

4. Treatment assignment. Unit of randomization (individual, household, village, municipality). Stratification variables. Treatment fraction P. Randomization seed and software (R randomizr, Stata randtreat).

5. Estimation strategy. The regression you will run, written down with the FE structure and the SE clustering choice. For a cluster-randomized municipality-level trial:

Y_{ij} = \alpha + \beta T_j + \gamma X_{ij} + \delta_s + \varepsilon_{ij}

where T_j is the treatment indicator for municipality j, X_{ij} is a vector of baseline covariates, \delta_s are strata fixed effects, and SEs are clustered at j (the unit of randomization).

6. Multiple-hypothesis correction. Specify the family, the method (Romano-Wolf, Benjamini-Hochberg, or family-wise Bonferroni), and the software call. Specify it before you see the data.

7. Heterogeneity analyses. Pre-specified subgroups, not “we’ll run causal forests and report what’s significant.” If you want to test heterogeneity by gender, region, and baseline poverty tertile, name those three, name the interaction tests, and stop.

8. Attrition handling. State the method: Lee (2009) trimming bounds, Manski (1990) worst-case bounds, inverse probability weighting on baseline covariates, or another differential-attrition adjustment. State at what attrition rate (typically >10%) you escalate to bounds.

9. Power calculation showing MDE. A formal calculation, ideally with code and assumptions stated. We do this in §10.3.

10. Deviation protocol. Under what circumstances will you deviate from the PAP, and how will deviations be documented? The honest answer is: deviations are flagged in an addendum filed before unblinding, with reasons.

That’s the skeleton. A good PAP runs 15-25 pages including appendices.


10.3 Power calculations and the minimum detectable effect

Suppose you propose a study of a microcredit intervention in 200 Brazilian municipalities. Before you collect a single survey, your funder asks: “What effect size can this study detect, with 80% power, at the 5% significance level?” The answer is the minimum detectable effect, or MDE.

For an individually randomized two-arm trial with continuous outcomes, the MDE is:

\text{MDE} = (t_{1-\kappa} + t_{\alpha/2}) \cdot \sqrt{\frac{1}{P(1-P)}} \cdot \frac{\sigma}{\sqrt{n}}

The pieces:

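  • \kappa = 0.8 is the desired power. t_{1-\kappa} is the critical value that leaves probability 1 - \kappa = 0.2 in the upper tail, i.e. the 80th percentile, so t_{1-\kappa} \approx 0.84 (one-sided).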
  • \alpha = 0.05 is the significance level, so t_{\alpha/2} = t_{0.025} \approx 1.96 (two-sided test).
  • P is the share of the sample assigned to treatment. With P = 0.5 (balanced), \sqrt{1/[P(1-P)]} = 2.
  • \sigma is the outcome standard deviation.
  • n is the total sample size.

Plug in \kappa = 0.8, \alpha = 0.05, P = 0.5: the multiplier is (0.84 + 1.96) \cdot 2 = 5.60, so \text{MDE} = 5.60 \cdot \sigma / \sqrt{n}. With n = 1000 and \sigma = 1, that’s 0.177 \sigma. Translation: a true effect of about 0.18 SD would be detected with 80% power. Smaller true effects can still be detected, but with less than 80% power, so failing to reject the null tells you little about them.
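
As a rough check on this arithmetic, the formula can be coded directly. The sketch below uses Python with only the standard library; the function name mde and its argument names are illustrative choices rather than anything from a package, and it relies on standard-normal critical values (the large-sample approximation behind the 0.84 and 1.96 above).

```python
from math import sqrt
from statistics import NormalDist


def mde(n, sigma=1.0, power=0.80, alpha=0.05, p_treat=0.5):
    """Minimum detectable effect for an individually randomized two-arm trial.

    Normal approximation: (z_power + z_{alpha/2}) * sqrt(1 / (P(1-P))) * sigma / sqrt(n).
    """
    z = NormalDist()
    z_power = z.inv_cdf(power)           # about 0.84 for 80% power
    z_alpha = z.inv_cdf(1 - alpha / 2)   # about 1.96 for a two-sided 5% test
    allocation = sqrt(1.0 / (p_treat * (1.0 - p_treat)))  # equals 2 when P = 0.5
    return (z_power + z_alpha) * allocation * sigma / sqrt(n)


balanced = mde(n=1000)                  # the n = 1000, sigma = 1 case from the text
unbalanced = mde(n=1000, p_treat=0.3)   # same n, 30% treated
print(round(balanced, 3), round(unbalanced, 3))
```

The unbalanced call illustrates why P = 0.5 is the default: for a fixed n, any other split raises the MDE.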

What does 0.18 SD mean for log per-capita consumption? The SD of log consumption in rural Brazilian household surveys is typically around 0.6-0.9, so a 0.18-SD effect on log consumption translates to about 0.11-0.16 log points, or 12-17% in levels. That is a large microcredit effect; most well-identified microcredit studies (Banerjee, Karlan, Zinman 2015 and the six-country consensus) find consumption effects in the 0-5% range. So a study with n = 1000 is underpowered for realistic microcredit effects.

The honest move when you discover this ex ante: go back to the funder, ask for more money, or pick a more powerful design (more clusters, fewer per cluster, baseline measurement for ANCOVA).

Cluster designs: the loss from clustering

When you randomize at the cluster level (municipality, school, village) and measure at the individual level, the effective sample size shrinks. The reason is intra-cluster correlation (ICC): individuals in the same municipality are more similar to each other than two random individuals would be, so each additional individual in a cluster gives you less new information than an additional individual in a new cluster.

The design effect (Kish 1965) is:

\text{DE} = 1 + (\bar{m} - 1)\rho

where \bar{m} is the average cluster size and \rho is the ICC. Your MDE scales up by \sqrt{\text{DE}}:

\text{MDE}_{\text{cluster}} = \text{MDE}_{\text{individual}} \cdot \sqrt{1 + (\bar{m} - 1)\rho}

Effective sample size:

n_{\text{eff}} = \frac{n}{1 + (\bar{m} - 1)\rho}

If you have 200 municipalities of 50 households each (n = 10{,}000) with ICC \rho = 0.15:

\text{DE} = 1 + (50 - 1)(0.15) = 1 + 7.35 = 8.35

n_{\text{eff}} = 10{,}000 / 8.35 \approx 1{,}198

You spent 10,000 surveys and got the statistical power of about 1,200 individual-level observations. The MDE inflates by \sqrt{8.35} \approx 2.89. With \sigma = 1, that’s an MDE of about 0.16 \sigma.

This is the Bloom (1995) result formalized: cluster-randomized designs are expensive in power. The fix is to increase the number of clusters, not the number per cluster. Adding 50 more municipalities does more for power than adding 50 households to each existing municipality. The intuition: variation across clusters is what identifies treatment effects in cluster-randomized designs, and per-cluster sample size only buys you precision in estimating each cluster mean.
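
The clusters-versus-households comparison can be checked numerically. This is a minimal Python sketch of the design-effect formulas above; the function names are illustrative, the multiplier 5.60 is the rounded value from the text, and equal cluster sizes are assumed.

```python
from math import sqrt

MULTIPLIER = 5.60  # (0.84 + 1.96) * 2: 80% power, 5% two-sided test, P = 0.5


def design_effect(cluster_size, icc):
    """Kish design effect: 1 + (m - 1) * rho."""
    return 1.0 + (cluster_size - 1.0) * icc


def cluster_mde(n_clusters, cluster_size, icc, sigma=1.0):
    """MDE for a two-arm cluster-randomized trial with equal cluster sizes."""
    n = n_clusters * cluster_size
    n_eff = n / design_effect(cluster_size, icc)
    return MULTIPLIER * sigma / sqrt(n_eff)


baseline = cluster_mde(200, 50, 0.15)          # 200 municipalities x 50 households
more_clusters = cluster_mde(250, 50, 0.15)     # add 50 municipalities (+2,500 surveys)
bigger_clusters = cluster_mde(200, 100, 0.15)  # add 50 households each (+10,000 surveys)

print(round(baseline, 3), round(more_clusters, 3), round(bigger_clusters, 3))
```

Under these assumptions the 2,500 extra surveys spent on new municipalities lower the MDE by more than the 10,000 extra surveys spent inside existing ones.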

Power calc in R

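# install.packages(c("pwr", "clusterPower"))
library(pwr)
library(clusterPower)

# Individual-level, two-arm balanced.
# n is the sample size PER ARM, so this is 1,000 observations in total.
pwr.t.test(n = 500, d = 0.18, sig.level = 0.05, power = NULL,
           type = "two.sample", alternative = "two.sided")
# power ~ 0.81

# Cluster-randomized: solve for MDE given clusters, n_per_cluster, ICC.
# The function solves for whichever argument is left as NA (here d).
# Function and argument names have changed across clusterPower releases
# (later versions expose cpa.normal()); check the help page of your installed version.
crtpwr.2mean(
  alpha = 0.05,
  power = 0.80,
  m = 100,        # clusters per arm (200 total)
  n = 50,         # individuals per cluster
  cv = 0,         # no variation in cluster size
  d = NA,         # detectable difference: the unknown being solved for
  icc = 0.15,
  varw = 1        # within-cluster variance (sigma^2)
)
# Returns the MDE on the original scale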

Power calc in Stata

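* Individual-level: give only the control mean plus n() and power() to solve for the effect size
power twomeans 0, sd(1) n(1000) power(0.80) alpha(0.05)

* Cluster-randomized (Stata 15+): k#() = clusters per arm, m#() = cluster size
* Again only the control mean is given, so Stata solves for the detectable difference
power twomeans 0, sd(1) power(0.80) alpha(0.05) ///
    cluster k1(100) k2(100) m1(50) m2(50) rho(0.15)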

* Alternative for older Stata: -clustersampsi- (SSC)
* Option names below should be confirmed with -help clustersampsi-
ssc install clustersampsi
clustersampsi, detectabledifference mu1(0) sd1(1) rho(0.15) m(50) k(100)

The output gives you the detectable difference. Cross-check against the closed-form formula above; if they disagree by more than rounding, you’ve fed the function the wrong \sigma.


10.4 When to cluster

You cluster the randomization when you can’t randomize at the individual level without contamination. Examples relevant to blended-finance evaluations:

  • Information spillovers. Training one household in a 50-household village on financial literacy contaminates the control (they teach their neighbors).
  • General-equilibrium effects. Giving credit to 30% of farmers in a market changes the price of inputs and outputs that the other 70% face.
  • Political resistance. Municipal authorities won’t accept a program that randomizes within their municipality; the political unit demands a binary in-or-out.
  • Logistical constraints. A mobile health unit serves a village at a time, not households.

The cost of clustering is the power loss above. The benefit is internal validity: you’re not biasing your control by inadvertently treating it.

A practical rule: if your ICC is below 0.05, the cluster cost is small. If it’s above 0.15, the cost is large and you need to think carefully about whether you have enough clusters. A cluster-randomized trial with fewer than 30 clusters per arm is in wild-cluster-bootstrap territory (Cameron, Gelbach, Miller 2008; see Roodman et al. 2019 for the Stata boottest implementation). Don’t run a 12-cluster-per-arm trial and report cluster-robust SEs; they are wildly anti-conservative.


10.5 Stratification

Stratification (also called blocking) means: before randomizing, divide the sample into strata defined by baseline variables that predict the outcome, then randomize within each stratum.

Why bother? Three reasons:

  1. Variance reduction. Within-stratum residual variance is smaller than overall variance, so the same treatment effect is estimated more precisely.
  2. Balance. Stratification guarantees that treatment and control are balanced on the stratification variables, by construction.
  3. Power for heterogeneity. If you stratify by region and you’ve pre-specified heterogeneity by region, you’ve also pre-built the comparison.

The cost: in a fully stratified design with strata fixed effects in the regression, you must adjust standard errors and the degrees of freedom. Standard tools handle this if you just include stratum dummies as fixed effects.
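
A stratified draw is short enough to write out in full, which also makes the seed and the procedure easy to record in the PAP. The sketch below is Python with the standard library only; the function name, the unit labels, and the stratum labels are hypothetical, and in practice you would use the randomization tool named in your PAP.

```python
import random
from collections import defaultdict


def stratified_assignment(units, stratum_of, seed):
    """Assign half of each stratum to treatment (1) and the rest to control (0).

    units      : list of unit identifiers (e.g. municipality codes)
    stratum_of : dict mapping each unit to its stratum label
    seed       : integer recorded in the PAP so the draw can be reproduced
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for u in units:
        by_stratum[stratum_of[u]].append(u)

    assignment = {}
    for label in sorted(by_stratum):
        members = sorted(by_stratum[label])  # fixed order before shuffling
        rng.shuffle(members)
        n_treat = len(members) // 2
        # In odd-sized strata, a coin flip decides which arm gets the extra unit
        if len(members) % 2 == 1 and rng.random() < 0.5:
            n_treat += 1
        for i, u in enumerate(members):
            assignment[u] = 1 if i < n_treat else 0
    return assignment


# Hypothetical example: 12 municipalities in 3 strata of 4
units = [f"muni_{i:02d}" for i in range(12)]
stratum_of = {u: f"stratum_{i % 3}" for i, u in enumerate(units)}
assignment = stratified_assignment(units, stratum_of, seed=20260511)
print(sum(assignment.values()), "treated of", len(units))
```

Sorting units and strata before shuffling means the result depends only on the seed and the data, not on the order in which rows happened to be loaded.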

When you have many candidate stratification variables (say, 8 baseline covariates), classical stratification breaks down (too many strata, too few units per stratum). The modern fix is re-randomization (Athey and Imbens 2017, building on Morgan and Rubin 2012). Procedure:

  1. Define a balance criterion (e.g., a Mahalanobis distance over baseline covariates, or the max t-statistic across covariate balance tests).
  2. Draw a random treatment assignment.
  3. Compute the balance criterion. If it’s worse than a pre-specified threshold (e.g., 90th percentile of the criterion’s null distribution), discard and redraw.
  4. Iterate until acceptable balance.

The cost is that inference must be conditional on the re-randomization scheme; you can’t just run OLS with cluster-robust SEs and pretend it was simple random assignment. Athey-Imbens (2017) gives the inference procedure (randomization inference or covariate-adjusted regression).
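
The four-step procedure can be sketched as a simple accept/reject loop. This Python example (standard library only) uses the largest absolute standardized difference in means as the balance criterion, a simpler stand-in for the Mahalanobis distance or max t-statistic mentioned above; the function names, the fixed threshold of 0.10, and the simulated covariates are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev


def max_std_diff(assign, covariates):
    """Largest absolute standardized difference in means across covariates."""
    worst = 0.0
    for values in covariates:  # one list of unit-level values per covariate
        treated = [v for v, a in zip(values, assign) if a == 1]
        control = [v for v, a in zip(values, assign) if a == 0]
        sd = pstdev(values)
        if sd > 0:
            worst = max(worst, abs(mean(treated) - mean(control)) / sd)
    return worst


def rerandomize(covariates, n_units, threshold, seed, max_draws=10_000):
    """Redraw a balanced assignment until the criterion is at or below threshold."""
    rng = random.Random(seed)
    base = [1] * (n_units // 2) + [0] * (n_units - n_units // 2)
    for draw in range(1, max_draws + 1):
        assign = base[:]
        rng.shuffle(assign)
        if max_std_diff(assign, covariates) <= threshold:
            return assign, draw
    raise RuntimeError("no acceptable draw; loosen the threshold")


# Simulated baseline data: 100 units, 3 covariates (illustrative only)
data_rng = random.Random(0)
n_units = 100
covariates = [[data_rng.gauss(0, 1) for _ in range(n_units)] for _ in range(3)]
assignment, draws = rerandomize(covariates, n_units, threshold=0.10, seed=20260511)
print(draws, round(max_std_diff(assignment, covariates), 3))
```

Recording the criterion, the threshold, and the seed in the PAP is what lets randomization inference later redraw from the same restricted set of assignments.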


10.6 Balance tests: do them, but ignore them mostly

Tradition demands a “Table 1” showing baseline means by treatment arm with t-tests for each row. Reviewers expect it. So produce it.

But understand: this table is not really a test of randomization. If you randomized correctly, baseline imbalance is by definition due to chance. The right move per Mutz, Pemantle, and Pham (2019) and Athey-Imbens (2017) is:

  1. Report baseline means by arm in a table.
  2. Don’t run formal balance tests as if a p-value of 0.04 on one row indicates broken randomization; with 20 covariates, you’d expect about one to come up significant by chance.
  3. If you find a meaningful imbalance on a covariate that predicts the outcome, control for it in the regression. (You should pre-specify “if X is imbalanced at p < 0.05 we control for it” in the PAP.)
  4. Stratify or re-randomize ex ante on the most important predictors so balance is by construction.

The combination of pre-stratification, baseline covariates in the regression, and a transparent baseline-balance table handles 99% of referee concerns.


10.7 Attrition

Attrition is the deadliest threat to internal validity in field experiments. People move, refuse to be re-surveyed, die, drop out of the program. If attrition is differential (different rates or different types of attriters across arms), the endline sample is no longer randomized.

Three responses, in increasing severity:

Report attrition rates by arm. Always. If treatment attrition is 12% and control is 18%, that 6-point gap is informative; you’ve selectively retained people in treatment who differ from those you lost in control.

Inverse probability weighting. Model the probability of retention as a function of baseline covariates, then weight retained observations by the inverse of their predicted retention probability. Valid only if attrition is “on observables” (selection-on-observables for the attrition process). State the assumption clearly; it’s strong. If you are unwilling to make it, Manski (1990) worst-case bounds are the assumption-free alternative, at the cost of very wide intervals.
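
To make the weighting concrete, here is a minimal Python sketch (standard library only) in which the retention model is just the retention rate within each baseline cell, a deliberately simple stand-in for a logit on baseline covariates. The function name, the record layout, and the toy data are hypothetical.

```python
from collections import defaultdict


def ipw_mean(records):
    """Inverse-probability-weighted mean outcome for one treatment arm.

    records: list of (cell, observed, outcome) tuples, where cell is a baseline
    covariate cell, observed is 1 if surveyed at endline (else 0), and outcome
    is None for attriters.
    """
    n_cell = defaultdict(int)
    n_obs = defaultdict(int)
    for cell, observed, _ in records:
        n_cell[cell] += 1
        n_obs[cell] += observed

    num = den = 0.0
    for cell, observed, y in records:
        if observed:
            weight = n_cell[cell] / n_obs[cell]  # 1 / estimated retention probability
            num += weight * y
            den += weight
    return num / den


# Toy arm: cell A fully retained, cell B loses half its households
records = (
    [("A", 1, 1.0)] * 4
    + [("B", 1, 3.0)] * 2
    + [("B", 0, None)] * 2
)
observed_outcomes = [y for _, obs, y in records if obs]
naive = sum(observed_outcomes) / len(observed_outcomes)
print(round(naive, 3), ipw_mean(records))
```

The naive mean under-represents cell B; the weighted mean restores its baseline share. Applying the same function to each arm and differencing gives the IPW treatment-effect estimate, which is only as good as the assumption that attrition within a cell is unrelated to outcomes.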

Lee (2009) bounds. Distribution-free. Assumes only monotone attrition (treatment can affect retention in one direction only). Procedure: take the arm with lower attrition and trim its outcome distribution, first from the top and then from the bottom, until its retention rate matches that of the other arm. The two trimmed comparisons bound the average treatment effect for units that would respond under either assignment. Lee bounds are wide but defensible; J-PAL papers use them routinely.
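
The trimming step is easier to see in code. The following Python sketch (standard library only) computes point bounds without covariates or standard errors; the function name and the toy data are hypothetical, and the number of trimmed observations is rounded to the nearest integer rather than handled fractionally.

```python
from statistics import mean


def lee_bounds(y_treat, y_control, n_treat_assigned, n_control_assigned):
    """Lee (2009) trimming bounds, returned as (lower, upper).

    y_treat, y_control : outcomes of endline respondents in each arm
    n_*_assigned       : number of units originally randomized to each arm
    """
    r_t = len(y_treat) / n_treat_assigned      # retention rate, treatment
    r_c = len(y_control) / n_control_assigned  # retention rate, control
    if r_t >= r_c:
        # Treatment arm retained more: trim it to match control
        k = round((r_t - r_c) / r_t * len(y_treat))
        ys = sorted(y_treat)
        lower = mean(ys[: len(ys) - k]) - mean(y_control)  # drop the k largest
        upper = mean(ys[k:]) - mean(y_control)             # drop the k smallest
    else:
        # Control arm retained more: trim it to match treatment
        k = round((r_c - r_t) / r_c * len(y_control))
        ys = sorted(y_control)
        lower = mean(y_treat) - mean(ys[k:])               # drop the k smallest
        upper = mean(y_treat) - mean(ys[: len(ys) - k])    # drop the k largest
    return lower, upper


# Toy data: 10 assigned per arm; 9 treated and 8 control units respond
y_t = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y_c = [1, 2, 3, 4, 5, 6, 7, 8]
lower, upper = lee_bounds(y_t, y_c, 10, 10)
print(lower, upper)
```

In this toy case one treated observation is trimmed, and the bounds run from 0 to 1: an interval that includes zero even though the untrimmed difference in means is positive.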

If your differential attrition is bad enough that the Lee bounds straddle zero, your study has not credibly identified the treatment effect, no matter what your point estimate says.

A practical PAP rule: “If endline retention is below 80% in any arm or differs by more than 5 percentage points across arms, we will report Lee bounds and treat the headline ITT as suggestive rather than definitive.”

A useful but expensive addition: a “track-down” subsample. Pick a random sample of the attriters and pay extra (door-to-door visits, phone calls, incentives) to recover their endline outcomes. This lets you estimate selection bias directly and re-weight.


10.8 Multiple hypothesis testing

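If you test 20 outcomes at \alpha = 0.05 and every null hypothesis is true, you expect 1 spurious rejection by chance, and with independent tests the probability of at least one is 1 - 0.95^{20} \approx 0.64. The conventional fixes: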

Bonferroni. Divide \alpha by the number of tests. Simple, but conservative (controls family-wise error rate at the cost of power; ignores correlation across tests).

Benjamini-Hochberg (1995). Controls the false discovery rate (FDR): the expected fraction of rejections that are false positives. Less conservative than Bonferroni. Recommended for exploratory work where you expect multiple true positives.
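
The difference between the two corrections is easiest to see on a small set of p-values. This Python sketch (standard library only) computes Bonferroni- and BH-adjusted p-values; the function names and the five p-values are hypothetical.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg (1995) adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.001, 0.012, 0.025, 0.040, 0.200]  # hypothetical secondary outcomes
bh = benjamini_hochberg(pvals)
bonf = bonferroni(pvals)

n_reject_bh = sum(q <= 0.10 for q in bh)
n_reject_bonf = sum(p <= 0.10 for p in bonf)
print([round(q, 3) for q in bh], n_reject_bh)
print([round(p, 3) for p in bonf], n_reject_bonf)
```

At a 0.10 threshold BH rejects four of the five hypotheses here and Bonferroni rejects two, which is the power gain you buy by controlling the FDR instead of the FWER.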

Romano-Wolf (2005). Controls FWER but is far less conservative than Bonferroni because it uses the dependence structure of the test statistics (bootstrapped). Recommended for primary multiple outcomes in RCTs. Implementations:

* Stata: rwolf (Clarke, Romano, Wolf 2020), installed from SSC. (wyoung is also a Stata command.)
* In R, see the multcomp package, or code the stepdown bootstrap by hand.
* cluster() sets the bootstrap resampling unit; vce() clusters the regression SEs.
ssc install rwolf
rwolf y1 y2 y3, indepvar(treatment) controls(baseline_y region_dummies) ///
    vce(cluster municipality) cluster(municipality) reps(1000) seed(20260511)
* Returns Romano-Wolf adjusted p-values for each outcome

The right move, almost always: pre-specify a small primary-outcome family (1 to 3 outcomes) and apply correction only within that family. Then list secondary outcomes separately and apply BH within them. Exploratory outcomes get no correction but are labeled exploratory.

The mistake to avoid: declaring 12 primary outcomes, applying Bonferroni, and finding nothing significant. You’ve corrected away your power. Better to have 2 well-defined primary outcomes with strong prior justification.


10.9 A worked example: a Brazilian rural microcredit RCT

Suppose you are designing a study of a graduation-program-style microcredit intervention for 10,000 rural households in northeast Brazil. You have a budget for 200 municipalities. Here is what the relevant section of your PAP looks like.

Sample and randomization

Sample: 200 municipalities in the northeast region (Maranhão, Piauí, Ceará, Pernambuco, Bahia), drawn from the IBGE list of municipalities with population <50,000 and rural population share >40%. Within each municipality, 50 households are sampled from the Cadastro Único registry of eligible low-income households. Total n = 10{,}000.

Unit of randomization: municipality. Justification: information spillovers and political resistance to within-municipality randomization.

Stratification: 5 regional strata (one per state) × 3 baseline credit-takeup tertiles (computed from PNAD Contínua municipal aggregates) = 15 strata. Within each stratum, half the municipalities are assigned to treatment.

Treatment fraction: P = 0.5 (100 treatment, 100 control municipalities).

Estimation

Primary outcome: log per-capita household consumption at 18 months post-baseline. ANCOVA specification:

\log C_{ij,18} = \alpha + \beta T_j + \gamma \log C_{ij,0} + \delta_s + \varepsilon_{ij}

where \log C_{ij,0} is baseline log per-capita consumption, \delta_s are the 15 stratum fixed effects, and SEs are clustered at the municipality level j.

Secondary outcomes (pre-specified, BH-corrected within the family):

  1. Log monthly business revenue.
  2. An asset index (Filmer-Pritchett PCA on durables).
  3. Food security index (HFIAS, 9-item scale).

Power

Assumptions:

  • \sigma of log per-capita consumption = 0.75 (from PNAD 2019, northeast rural subsample).
  • ICC \rho = 0.15 at the municipality level (estimated from PNAD).
  • 50 households per municipality, 100 municipalities per arm.
  • Anticipated attrition: 15%. Adjust n accordingly: post-attrition n = 8500, or about 42.5 households per municipality.

Design effect: 1 + (50 \cdot 0.85 - 1) \cdot 0.15 = 1 + 41.5 \cdot 0.15 = 7.22.

Effective n: 8500 / 7.22 \approx 1{,}177.

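MDE: 5.60 \cdot 0.75 / \sqrt{1177} \approx 0.122 log points (about 0.16 \sigma), or roughly 13% in levels.

This is near the top of what graduation programs have been shown to produce (Banerjee et al. 2015 six-country consensus found ~5-15% consumption effects). The calculation ignores the variance reduction from the baseline-consumption control in the ANCOVA specification, so it is conservative, but as it stands the study is powered only for effects in the upper part of that range. You note this honestly in the PAP and to the funder.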
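
The arithmetic in this power section can be reproduced in a few lines, which is also a convenient form in which to attach it to the PAP. The Python sketch below (standard library only) uses the assumptions listed above and the rounded multiplier 5.60; the variable names are illustrative. Note the units: 5.60 \cdot \sigma / \sqrt{n_{\text{eff}}} with \sigma = 0.75 is measured in log points, and dividing by \sigma converts it to SD units.

```python
from math import exp, sqrt

MULTIPLIER = 5.60      # (0.84 + 1.96) * 2: 80% power, 5% two-sided test, P = 0.5
sigma = 0.75           # SD of log per-capita consumption
icc = 0.15             # intra-cluster correlation at the municipality level
municipalities = 200
households = 50        # sampled per municipality at baseline
retention = 0.85       # 15% anticipated attrition

m_retained = households * retention           # households per municipality at endline
n_retained = municipalities * m_retained      # post-attrition sample
deff = 1 + (m_retained - 1) * icc             # design effect
n_eff = n_retained / deff                     # effective sample size

mde_log_points = MULTIPLIER * sigma / sqrt(n_eff)
mde_sd_units = mde_log_points / sigma
mde_percent = 100 * (exp(mde_log_points) - 1)  # log points converted to % in levels

print(round(deff, 3), round(n_eff), round(mde_log_points, 3),
      round(mde_sd_units, 3), round(mde_percent, 1))
```

Changing retention, icc, or households in this script shows immediately how sensitive the MDE is to each assumption, which is the sensitivity table a funder will usually ask for.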

Attrition

We will report attrition rates by arm. If differential attrition exceeds 5 percentage points, we will report Lee (2009) bounds on the headline estimate. Tracking calls and home visits will be made for all non-respondents at 6 and 12 months to minimize 18-month attrition.

Multiple hypothesis correction

The primary outcome is a single outcome (log per-capita consumption), so no correction is required for primary inference. Secondary outcomes (revenue, assets, food security) will be corrected via Benjamini-Hochberg FDR at q = 0.10 within the secondary-outcome family. Heterogeneity tests (by gender of household head, by baseline poverty tertile) will be reported with both unadjusted and Romano-Wolf adjusted p-values.

Heterogeneity

Pre-specified subgroups:

  1. By gender of household head (female-headed vs. male-headed).
  2. By baseline poverty tertile (bottom third vs. top two-thirds).

Interaction tests reported alongside the main ATE. No additional subgroups will be reported as confirmatory; further heterogeneity is explicitly exploratory.

Deviation protocol

Any deviation from this PAP will be filed as an addendum to the AEA Registry entry prior to unblinding and will be flagged in the published manuscript.


10.10 Common traps

The mistakes below recur often enough that they are worth flagging directly. Each one can damage the credibility of an otherwise serviceable study.

Filing the PAP after data collection. The AEA Registry timestamps every submission. Reviewers check the timestamp against the survey dates. If the PAP timestamp falls after the endline survey, there is no PAP; there is a post-hoc analysis dressed up as one. The reputational damage from doing this once is hard to undo.

Pre-specifying too many primary outcomes. A PAP with 14 primary outcomes corrects away its own power. Pick 1-3, with strong theoretical justification. Send everything else to secondary or exploratory tiers.

Heterogeneity fishing dressed up as causal forest. Wager-Athey causal forests are a wonderful tool, but a PAP that says “we will run a causal forest and report subgroups with the highest CATE” is fishing. Pre-specify which subgroups, or pre-specify a single honest data-driven discovery step (a single best subgroup, reported with a Romano-Wolf adjustment over the search space).

Power calculation that ignores attrition. Field attrition is 15-30% in 12-18 month studies. If your power calc assumes 100% retention, you’ve overstated your endline sample by 15-30% and understated your MDE by roughly 8-20%, since the MDE scales with 1/\sqrt{n}. Inflate n accordingly, or shorten the follow-up window.

Cluster RCT with too few clusters. Below 30 clusters per arm, cluster-robust SEs are anti-conservative. You need wild-cluster bootstrap (Cameron-Gelbach-Miller; Roodman et al. 2019 boottest). Don’t run a 10-cluster-per-arm trial without writing the bootstrap into the PAP.

Forgetting to register on AEA RCT Registry. Free, takes about an hour. Required by AER, AEJ:Applied, REStat, JDE, ReStud for any RCT submitted after 2017. Not registering means your paper goes to desk rejection.


10.11 Post-trial deliverables

When the data come in, three things are owed:

  1. An updated PAP if there are deviations. Filed as an addendum with timestamps, before unblinding wherever possible.
  2. A pre-registered “headline figure” and “headline table” so reviewers can see exactly what was committed to. This is standard now at AEJ:Applied and increasingly elsewhere.
  3. A public replication package. Data (where ethically possible) plus code plus a README that runs end-to-end. The AER, AEJ:Applied, and REStat have all required this since 2019. The replication package is checked by the journal’s data editor before acceptance.

A replication package is a deliverable, not an afterthought. Build it as the analysis is built, not at the end. Use R Markdown or Quarto so the manuscript and the code share a single source. Version it on a public GitHub repo, with a frozen Zenodo DOI for the publication version.


10.12 Why this matters for an applied researcher

Pre-registered impact evaluation is the price of admission to the field. Multilateral teams expect a PAP at the proposal stage. A grant officer reviewing a health-financing or rural-credit RCT proposal will send it back if there is no power calculation with explicit ICC assumptions, or no statement of how attrition will be handled. The reviewer is doing the same triage you would do; they are looking for whether the proposer has understood what their own study can and cannot deliver.

The pool of researchers who can write a defensible PAP from scratch is small. The pool who can also run the cluster-randomized power calculation themselves, code the Romano-Wolf correction, and explain to a non-econometrician why the attrition adjustment matters is smaller still. Every step up that ladder buys you more autonomy on the next contract, and more credibility with the people whose work you want to be published next to.

The reason to learn this carefully now is that the cost of getting it wrong at signing is enormous. A study with a sloppy PAP at registration produces a paper that gets discounted at every later stage: by referees, by replication editors, and by anyone reading the AEA Registry timestamp against the survey dates. A study with a tight PAP, by contrast, is bankable. You can put it in a JMP packet, on a CV, or in a policy memo and the reader does not have to do extra checking. That is the asset you are building when you learn this material.


10.13 References

Core methodology:

  • Athey, S., and G.W. Imbens (2017). “The Econometrics of Randomized Experiments.” Handbook of Economic Field Experiments, Vol. 1.
  • Benjamini, Y., and Y. Hochberg (1995). “Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing.” Journal of the Royal Statistical Society B, 57(1).
  • Bloom, H.S. (1995). “Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs.” Evaluation Review, 19(5).
  • Cameron, A.C., J. Gelbach, and D. Miller (2008). “Bootstrap-Based Improvements for Inference with Clustered Errors.” Review of Economics and Statistics, 90(3).
  • Casey, K., R. Glennerster, and E. Miguel (2012). “Reshaping Institutions: Evidence on Aid Impacts Using a Preanalysis Plan.” Quarterly Journal of Economics, 127(4).
  • Kasy, M. (2018). “Optimal Taxation and Insurance Using Machine Learning, and Related Methodological Discussion of Pre-Analysis Plans.” (Working paper; see also Kasy’s broader critique of mechanical pre-registration.)
  • Lee, D.S. (2009). “Training, Wages, and Sample Selection: Estimating Sharp Bounds on Treatment Effects.” Review of Economic Studies, 76(3).
  • Manski, C.F. (1990). “Nonparametric Bounds on Treatment Effects.” American Economic Review Papers and Proceedings, 80(2).
  • Morgan, K.L., and D.B. Rubin (2012). “Rerandomization to Improve Covariate Balance in Experiments.” Annals of Statistics, 40(2).
  • Mutz, D.C., R. Pemantle, and P. Pham (2019). “The Perils of Balance Testing in Experimental Design: Messy Analyses of Clean Data.” American Statistician, 73(1).
  • Olken, B. (2015). “Promises and Perils of Pre-Analysis Plans.” Journal of Economic Perspectives, 29(3).
  • Romano, J.P., and M. Wolf (2005). “Exact and Approximate Stepdown Methods for Multiple Hypothesis Testing.” Journal of the American Statistical Association, 100(469).
  • Roodman, D., M.Ø. Nielsen, J.G. MacKinnon, and M.D. Webb (2019). “Fast and Wild: Bootstrap Inference in Stata Using boottest.” Stata Journal, 19(1).

Applied development PAPs and replication packages worth reading:

  • Banerjee, A., D. Karlan, and J. Zinman (2015). “Six Randomized Evaluations of Microcredit: Introduction and Further Steps.” AEJ:Applied, 7(1). The consensus paper for microcredit RCTs.
  • Banerjee, A., E. Duflo, N. Goldberg, D. Karlan, R. Osei, W. Parienté, J. Shapiro, B. Thuysbaert, and C. Udry (2015). “A Multifaceted Program Causes Lasting Progress for the Very Poor: Evidence from Six Countries.” Science, 348(6236). The graduation-program RCT, cluster-randomized across six countries.
  • Crépon, B., F. Devoto, E. Duflo, and W. Parienté (2015). “Estimating the Impact of Microcredit on Those Who Take It Up: Evidence from a Randomized Experiment in Morocco.” AEJ:Applied, 7(1).

Templates and tools:

  • J-PAL Pre-Analysis Plan template: povertyactionlab.org/research-resources
  • AEA RCT Registry: socialscienceregistry.org
  • World Bank DIME wiki on impact evaluation methods
  • Stata: rwolf (Clarke, Romano, Wolf 2020), boottest (Roodman et al. 2019), power twomeans with cluster() option
  • R: pwr (Champely), clusterPower (Kleinman, Reich, Esserman), boottest-equivalent via fwildclusterboot (Fischer & Roodman)


End of Chapter 10.