Chapter 6. Instrumental Variables
Ian Helfrich, PhD (Georgia Tech, 2024). Trade, networks, development.
6.2 The two assumptions, in plain language
Let Y_i be the outcome, D_i the endogenous treatment, Z_i the candidate instrument, and \varepsilon_i the structural error (everything unobserved that drives Y). The structural equation is Y_i = \beta_0 + \beta_1 D_i + \varepsilon_i, and you cannot just run OLS, because \mathrm{Cov}(D_i, \varepsilon_i) \neq 0. The instrument Z_i has to satisfy two conditions.
The first is relevance: \mathrm{Cov}(Z_i, D_i) \neq 0. The instrument actually moves the treatment. This is testable. Run the first-stage regression D_i = \pi_0 + \pi_1 Z_i + u_i and inspect \hat{\pi}_1 and the F-statistic on the excluded instruments. If the first stage is flat, IV is hopeless. We come back to weak instruments in section 6.4.
The second is exclusion: \mathrm{Cov}(Z_i, \varepsilon_i) = 0. The instrument affects the outcome only through the treatment, not through any other channel. This is the killer assumption, and it is untestable, because \varepsilon_i is unobserved by construction. You cannot regress the IV residual on Z and call that a test: with a single instrument the residual is orthogonal to Z by construction, because the estimator imposes exclusion in order to compute it. The only way to defend exclusion is with theory, institutional knowledge, and narrative. You have to argue, in prose, that the only way Z could matter for Y runs through D.
In Miguel-Satyanath-Sergenti, exclusion is the claim that rainfall affects civil conflict only through its effect on agricultural income. Critics ask: what about rainfall directly affecting the ease of military mobilization (wet roads), or the timing of harvests as a coordinating signal for rebellion? Those would be exclusion-restriction violations, and no statistic in the paper can rule them out; they have to be argued on the basis of field knowledge. In Burgess-Pande, exclusion is the claim that the 1977 branch licensing formula affected rural poverty only through the branches it generated, not through other regulatory channels packaged with the rule. Again, defended in prose.
The exclusion-restriction defense belongs as a paragraph in the paper. Treat it like a small piece of literary criticism. List the alternative channels. Address each. Say why the data, the context, or auxiliary evidence rules them out. Reviewers will read it carefully, and they will reject the paper on this paragraph more than on anything else.
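To see why these two conditions are exactly what identification needs, it may help to take the covariance of the structural equation with Z_i. The derivation below is a sketch that uses only the symbols defined above and assumes the constant-effect linear model of this section.

```latex
\begin{aligned}
\mathrm{Cov}(Z_i, Y_i)
  &= \mathrm{Cov}(Z_i,\ \beta_0 + \beta_1 D_i + \varepsilon_i) \\
  &= \beta_1\,\mathrm{Cov}(Z_i, D_i)
     + \underbrace{\mathrm{Cov}(Z_i, \varepsilon_i)}_{=\,0 \text{ by exclusion}} \\
\Longrightarrow\quad
\beta_1 &= \frac{\mathrm{Cov}(Z_i, Y_i)}{\mathrm{Cov}(Z_i, D_i)}
  \qquad \text{(well defined because } \mathrm{Cov}(Z_i, D_i) \neq 0 \text{ by relevance)}
\end{aligned}
```

Exclusion removes the term that contaminates OLS, and relevance keeps the denominator away from zero. If exclusion fails, the same ratio equals \beta_1 + \mathrm{Cov}(Z_i, \varepsilon_i) / \mathrm{Cov}(Z_i, D_i), so even a small violation is magnified when the first stage is weak.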
6.3 2SLS in two equations, and what LATE really means
Two-stage least squares is the most common IV estimator and the one used in the large majority of applied papers. The mechanics are exactly what the name says.
First stage. Regress the endogenous treatment on the instrument and any exogenous controls W_i: D_i = \pi_0 + \pi_1 Z_i + \pi_2' W_i + u_i. Compute the fitted values \hat{D}_i = \hat{\pi}_0 + \hat{\pi}_1 Z_i + \hat{\pi}_2' W_i.
Second stage. Substitute \hat{D}_i for D_i in the structural equation and run OLS: Y_i = \beta_0 + \beta_1 \hat{D}_i + \beta_2' W_i + \nu_i.
The coefficient \hat{\beta}_1^{\text{2SLS}} is the IV estimate. Never report the second-stage standard errors from a hand-rolled two-step procedure: they are computed from residuals that use \hat{D}_i in place of D_i, so they estimate the wrong error variance. Any real software package (R’s ivreg, fixest, AER; Stata’s ivreg2, ivreghdfe) computes the correct asymptotic variance internally. Use the package.
The 2SLS estimator has a closed form. With one instrument, one endogenous regressor, and no controls, it reduces to the Wald ratio \hat{\beta}_1^{\text{2SLS}} = \frac{\widehat{\mathrm{Cov}}(Z_i, Y_i)}{\widehat{\mathrm{Cov}}(Z_i, D_i)}. Dividing numerator and denominator by \widehat{\mathrm{Var}}(Z_i) shows what the ratio is: the numerator becomes the slope from regressing Y on Z (the reduced form), and the denominator becomes the slope from regressing D on Z (the first stage). IV is, mechanically, the reduced form scaled by the first stage.
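The equivalence between the Wald ratio, hand-rolled 2SLS, and the reduced form scaled by the first stage can be checked numerically. The sketch below uses simulated data in base R; every parameter value and variable name is a hypothetical choice for illustration, and the standard-error comparison assumes homoskedastic errors.

```r
# Simulated data (all parameter values are hypothetical)
set.seed(1)
n <- 5000
z <- rnorm(n)                      # instrument
v <- rnorm(n)                      # unobserved confounder
d <- 0.5 * z + v + rnorm(n)        # endogenous treatment, first-stage slope 0.5
y <- 1.0 * d + 2 * v + rnorm(n)    # true beta_1 = 1; v makes OLS biased

# Wald ratio: sample covariance of (Z, Y) over sample covariance of (Z, D)
beta_wald <- cov(z, y) / cov(z, d)

# Hand-rolled 2SLS: first stage, fitted values, then second stage
first_stage  <- lm(d ~ z)
d_hat        <- fitted(first_stage)
second_stage <- lm(y ~ d_hat)
beta_2sls    <- unname(coef(second_stage)["d_hat"])

# Reduced-form slope divided by first-stage slope
beta_ratio <- unname(coef(lm(y ~ z))["z"] / coef(first_stage)["z"])

# Naive second-stage SE uses residuals built from d_hat (wrong);
# the correct residuals use the actual treatment d
se_naive   <- summary(second_stage)$coefficients["d_hat", "Std. Error"]
resid_iv   <- y - coef(second_stage)[1] - beta_2sls * d
se_correct <- sqrt(sum(resid_iv^2) / (n - 2) / sum((d_hat - mean(d_hat))^2))

beta_ols <- unname(coef(lm(y ~ d))["d"])
print(c(ols = beta_ols, wald = beta_wald, tsls = beta_2sls, ratio = beta_ratio))
print(c(se_naive = se_naive, se_correct = se_correct))
```

The three IV numbers agree to floating-point precision, while the two standard errors do not. The point estimate from a manual two-step is fine; its reported standard error is not.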
Now the punch line that gets glossed over in too many empirical papers. Imbens and Angrist (1994) showed that under heterogeneous treatment effects (the realistic case), \hat{\beta}_1^{\text{2SLS}} does not estimate the average treatment effect over the whole population. It estimates the Local Average Treatment Effect (LATE): the average treatment effect among compliers, units whose treatment status would change in response to the instrument.
Angrist, Imbens, and Rubin (1996) formalized this with a four-way partition. Each unit has a potential-treatment value under each value of the instrument: D_i(z=1) and D_i(z=0). There are four types.
- Compliers: D_i(1) = 1, D_i(0) = 0. They take treatment if the instrument pushes them, otherwise not. Your IV estimate is their average treatment effect.
- Always-takers: D_i(1) = D_i(0) = 1. Take treatment regardless. The instrument does not move them.
- Never-takers: D_i(1) = D_i(0) = 0. Refuse treatment regardless. The instrument does not move them.
- Defiers: D_i(1) = 0, D_i(0) = 1. Do the opposite of what the instrument suggests. Assumed not to exist (the monotonicity assumption).
Under monotonicity plus the standard IV assumptions, 2SLS identifies the average treatment effect among compliers only. For an always-taker or never-taker, the instrument has no effect on D, so the data are silent about their treatment effect. You can have a perfectly clean instrument and still know nothing about the average effect on the whole population.
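A small simulation can make the complier result concrete. The sketch below assumes a randomly assigned binary instrument and hypothetical type shares and treatment effects; in real data the compliance type is never observed, and it appears here only so the estimate can be compared with the truth.

```r
set.seed(2)
n <- 200000
# Latent compliance types (shares are hypothetical)
type <- sample(c("complier", "always", "never"), n, replace = TRUE,
               prob = c(0.4, 0.3, 0.3))
z <- rbinom(n, 1, 0.5)                 # randomly assigned binary instrument
# Treatment follows the type; monotonicity holds because there are no defiers
d <- ifelse(type == "always", 1, ifelse(type == "never", 0, z))

# Heterogeneous effects: compliers gain 2, always-takers 5, never-takers 0
effect <- unname(c(complier = 2, always = 5, never = 0)[type])
y0 <- rnorm(n) + (type == "always")    # baseline outcome differs by type
y  <- y0 + effect * d

late_true <- mean(effect[type == "complier"])   # 2 by construction
ate_true  <- mean(effect)                       # population average effect

# Wald estimator: difference in mean outcomes over difference in takeup
wald <- (mean(y[z == 1]) - mean(y[z == 0])) /
        (mean(d[z == 1]) - mean(d[z == 0]))

print(c(wald = wald, late = late_true, ate = ate_true))
```

The Wald estimate tracks the complier effect, not the population average, even though the instrument is perfectly valid. The always-takers' large effect never enters, because the instrument does not change their treatment status.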
Why this matters for policy. If the instrument is “distance to nearest agricultural extension office” and the treatment is “took out an agricultural credit line,” the compliers are farmers whose borrowing decision flipped because the office was close. They are marginal borrowers: intermediate risk tolerance, intermediate creditworthiness, intermediate need. The always-takers (large established farmers who would have borrowed anyway) and never-takers (subsistence farmers who would never have entered formal credit) do not contribute to the estimate. If a policymaker asks “Should we expand this program nationally?”, the LATE answers a narrower question: “What was the effect on the marginal borrowers our instrument generated?” Honest reporting tells the policymaker exactly that.
When reporting a 2SLS estimate, also report a description of the compliers: their average baseline covariates, the share of sample they represent. Abadie (2003) gives a clean method. This is the kind of reporting that distinguishes a careful applied paper from a careless one.
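One way to sketch a complier description is with Abadie's kappa weights. The example below assumes the simplest case, a randomly assigned binary instrument, so that P(Z = 1) does not depend on covariates; with a continuous instrument such as distance, or an instrument that is valid only conditional on covariates, the weights need the conditional probability instead. The covariate x and the rule linking it to compliance type are hypothetical.

```r
set.seed(3)
n <- 200000
x <- rnorm(n)                           # baseline covariate (hypothetical)
# Compliance type depends on x: high-x units always take treatment,
# low-x units never do, and the middle responds to the instrument
type <- ifelse(x > 1, "always", ifelse(x < -0.5, "never", "complier"))
z <- rbinom(n, 1, 0.5)                  # randomized binary instrument
d <- ifelse(type == "always", 1, ifelse(type == "never", 0, z))

p_z <- mean(z)                          # P(Z = 1), constant because Z is randomized
# Kappa averages to the complier share; kappa-weighted covariate means
# recover complier means
kappa <- 1 - d * (1 - z) / (1 - p_z) - (1 - d) * z / p_z

share_compliers   <- mean(kappa)
share_first_stage <- mean(d[z == 1]) - mean(d[z == 0])  # same quantity via the first stage
x_mean_compliers  <- sum(kappa * x) / sum(kappa)

print(c(share_kappa = share_compliers, share_fs = share_first_stage,
        x_compliers = x_mean_compliers, x_all = mean(x)))
# `type` is unobserved in real data; it is used here only to check the estimates
print(c(true_share = mean(type == "complier"),
        true_x_compliers = mean(x[type == "complier"])))
```

In this binary case the complier share is just the first-stage difference in takeup rates. Reporting it next to the kappa-weighted covariate means tells the reader how large and how typical the complier group is.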
6.4 Weak instruments, and why the 10/16 rule is dead
Weak instruments are the silent killer of IV papers. When the first stage is weak (the instrument barely moves the treatment), three bad things happen at once. First, 2SLS becomes biased in finite samples, and the bias is toward OLS, precisely the bias IV was supposed to fix. Second, the conventional asymptotic distribution of the 2SLS estimator becomes a poor approximation. Third, the confidence intervals understate true uncertainty, sometimes by a lot.
Staiger and Stock (1997) gave the first formal weak-instrument asymptotic framework and proposed a rule of thumb: a first-stage F above 10 is “strong enough.” Stock and Yogo (2005) made this rigorous, deriving critical values such that the worst-case bias of 2SLS relative to OLS is below a chosen threshold, or the worst-case actual size of a nominal 5 percent Wald test is below a chosen level (10, 15, 20, or 25 percent). The 10 threshold corresponds roughly to the relative-bias criterion at 10 percent, which is only defined with at least three instruments; the 16 is the size-based critical value (16.38) for one instrument and one endogenous regressor when actual size is capped at 10 percent. This is where the “F above 10 or 16” folklore comes from.
For fifteen years, applied papers cited “first-stage F above 10” and moved on. We now know this is not adequate. Two problems. The Stock-Yogo cutoffs were derived under homoskedasticity and a non-robust F; almost no applied paper uses non-robust SEs anymore. And even with the right F, the 10 cutoff is too generous: weak-instrument bias and size distortion persist well above that threshold in many realistic settings.
Montiel Olea and Pflueger (2013) developed the modern fix: the effective F-statistic, which extends weak-instrument testing to settings with heteroskedasticity, clustering, or autocorrelation. Compute the effective F and compare it to the Olea-Pflueger critical values, which depend on the number of instruments and the desired worst-case bias level. With a single instrument, the effective F coincides with the heteroskedasticity- or cluster-robust first-stage F. For one instrument and one endogenous regressor, the Olea-Pflueger cutoff for worst-case 10 percent bias at the 5 percent significance level is around 23, not 10. That is a much harder bar.
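The gap between a conventional and a robust first-stage F can be seen in a few lines of base R. The sketch below assumes a single instrument, where the robust first-stage F is the effective F, and a hypothetical first stage whose error variance grows with z^2. The HC0 sandwich is computed by hand and is an illustrative choice; with clustered data the cluster-robust analogue would be used.

```r
set.seed(4)
n <- 5000
z <- rnorm(n)
u <- z * rnorm(n)              # first-stage error whose variance grows with z^2
d <- 0.1 * z + u               # hypothetical first stage with slope 0.1

fs  <- lm(d ~ z)
pi1 <- unname(coef(fs)["z"])

# Conventional (homoskedastic) F: with one instrument it is the squared t-statistic
f_conventional <- (pi1 / summary(fs)$coefficients["z", "Std. Error"])^2

# Heteroskedasticity-robust (HC0) sandwich variance, computed by hand
X     <- cbind(1, z)
e     <- resid(fs)
bread <- solve(crossprod(X))
meat  <- crossprod(X * e)      # sum over i of x_i x_i' e_i^2
v_hc0 <- bread %*% meat %*% bread
f_robust <- pi1^2 / v_hc0[2, 2]

print(c(conventional = f_conventional, robust = f_robust))
```

Because E[z^4] = 3 for a standard normal, the robust variance in this design is about three times the conventional one, so the robust F is roughly a third of the conventional F. A first stage that looks strong under the non-robust statistic can fall below the cutoff once the variance is estimated correctly.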
In Stata, weakivtest (Pflueger and Wang 2015) gives the effective F and the critical values. In R, ivDiag implements Olea-Pflueger; fixest::fitstat reports first-stage F and Wald statistics (types ivf and ivwald), which are not the effective F in general. Report the effective F, not the conventional F.
A complementary approach when the instrument is suspect is to abandon Wald-type confidence intervals entirely and use Anderson-Rubin (AR) confidence intervals (Anderson and Rubin 1949; modern treatments in Chernozhukov and Hansen 2008, Andrews, Stock, and Sun 2019). AR intervals invert a test that does not require a strong first stage. They have correct coverage even when the instrument is weak. They are sometimes wide (the data honestly cannot pin down \beta) and sometimes unbounded (the data have nothing to say). That honesty is the point. A 95 percent AR interval of [-2.1, 4.7] says you cannot reject a wide range of effects; the corresponding 2SLS interval might be a misleadingly narrow [0.3, 1.1] if the instrument is weak.
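The test-inversion idea behind the AR interval is short enough to sketch by hand. The example below assumes one instrument, no controls, and homoskedastic errors, and uses hypothetical simulated data; a real application would use a robust or clustered variance and a packaged routine. For each candidate value beta0, it asks whether Z still predicts Y - beta0 * D.

```r
set.seed(5)
n <- 2000
z <- rnorm(n)
v <- rnorm(n)
d <- 0.5 * z + v + rnorm(n)        # hypothetical first stage
y <- 1.0 * d + 2 * v + rnorm(n)    # true beta_1 = 1

# AR statistic at beta0: squared t-statistic on z in a regression of
# (y - beta0 * d) on z. Under H0: beta_1 = beta0 and exclusion, z has no
# explanatory power for y - beta0 * d, whatever the first-stage strength.
ar_stat <- function(beta0) {
  resp <- y - beta0 * d
  fit  <- lm(resp ~ z)
  summary(fit)$coefficients["z", "t value"]^2
}

grid     <- seq(-1, 3, by = 0.01)
stats    <- vapply(grid, ar_stat, numeric(1))
accepted <- grid[stats <= qchisq(0.95, df = 1)]  # beta0 values not rejected at 5%

beta_iv <- cov(z, y) / cov(z, d)
# range() summarizes the set only when it is a single bounded interval inside
# the grid; with a weak instrument the accepted set can be unbounded or disjoint
ar_ci <- range(accepted)
print(c(beta_iv = beta_iv, ar_lower = ar_ci[1], ar_upper = ar_ci[2]))
```

With one instrument the AR statistic is exactly zero at the IV point estimate, so the estimate always lies inside the AR set. What changes with instrument strength is how far the set extends around it.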
The rule. If the Olea-Pflueger effective F is above the relevant cutoff for the design, report the standard 2SLS confidence interval and note the F. If it is below, report the AR interval as primary, the 2SLS interval as secondary with a warning, and discuss the weak-instrument problem in prose. Do not hide it.
6.5 Common traps
Bias toward OLS under weak instruments. When Z barely predicts D, the fitted value \hat{D} is mostly noise plus a small signal. The second-stage regression of Y on \hat{D} acts like a noisy version of the OLS regression of Y on D, the thing IV was meant to correct. The cure is a strong first stage, defended with a real F-statistic.
Order condition violations with multiple endogenous regressors. With k endogenous variables you need at least k instruments. With more instruments than endogenous variables, the system is over-identified and you get the Hansen J-test for over-identifying restrictions (Hansen 1982). Passing the J-test does not mean exclusion holds; it means your instruments are mutually consistent under the assumption that at least one is exogenous. With one instrument, you cannot test exclusion at all.
Clustering at the wrong level. Standard errors must be clustered at the level of variation in Z. If Z varies at the district level, cluster at the district level, not the household level. Failure to do this gives SEs that are too small, sometimes by an order of magnitude. Burgess-Pande clusters at the state level because the 1977 RBI rule varies across states.
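The size of the clustering problem is easy to see in a simulated first stage. The sketch below assumes a hypothetical design of 50 districts with 40 households each, an instrument that varies only across districts, and a district-level shock in the error. The cluster-robust sandwich is computed by hand with no small-sample correction, which is a simplification relative to what packages report.

```r
set.seed(6)
G <- 50; m <- 40                       # 50 districts, 40 households each (hypothetical)
district <- rep(seq_len(G), each = m)
z <- rnorm(G)[district]                # instrument varies only across districts
a <- rnorm(G)[district]                # district-level shock shared by all households
d <- 0.5 * z + a + rnorm(G * m)        # first stage with within-district correlation

fs <- lm(d ~ z)
se_naive <- summary(fs)$coefficients["z", "Std. Error"]

# Cluster-robust sandwich, computed by hand
X      <- cbind(1, z)
e      <- resid(fs)
bread  <- solve(crossprod(X))
scores <- rowsum(X * e, group = district)  # G x 2 within-district sums of x_i * e_i
meat   <- crossprod(scores)
v_cl   <- bread %*% meat %*% bread
se_cluster <- sqrt(v_cl[2, 2])

print(c(naive = se_naive, clustered = se_cluster, ratio = se_cluster / se_naive))
```

The naive standard error treats 2,000 households as independent draws, when the instrument only takes 50 distinct values. The clustered standard error is several times larger in this design, and the same logic carries over to the reduced form and the 2SLS estimate.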
Many weak instruments. Bekker (1994) showed that when there are many instruments, each individually weak, the sheer number of them creates its own bias and the standard 2SLS asymptotics fail. This appears in shift-share designs and network/leave-one-out settings. If you have more than five or so instruments, use LIML or jackknife IV (JIVE; Angrist, Imbens, Krueger 1999) instead of 2SLS. LIML has better small-sample properties when instruments are weak and numerous.
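A short Monte Carlo illustrates the many-weak-instruments bias and the LIML alternative. The sketch below assumes a hypothetical design with ten instruments, each with a small first-stage coefficient, and a true effect of 1. All variables are generated with mean zero, so no intercept is included. LIML is computed as a k-class estimator by hand; applied work would use a packaged implementation.

```r
set.seed(7)
n <- 1000; K <- 10; reps <- 500
beta <- 1; pi_j <- 0.05; rho <- 0.8    # hypothetical design values

one_draw <- function() {
  Z   <- matrix(rnorm(n * K), n, K)
  u   <- rnorm(n)
  eps <- rho * u + sqrt(1 - rho^2) * rnorm(n)  # Cov(u, eps) = rho, so D is endogenous
  D   <- as.vector(Z %*% rep(pi_j, K)) + u
  Y   <- beta * D + eps
  W   <- cbind(Y, D)
  MW  <- W - Z %*% solve(crossprod(Z), crossprod(Z, W))  # residuals after projecting on Z
  # k-class estimator: k = 1 gives 2SLS, k = kappa_liml gives LIML
  k_class <- function(k) {
    (sum(D * Y) - k * sum(MW[, 2] * MW[, 1])) /
      (sum(D * D) - k * sum(MW[, 2]^2))
  }
  # LIML kappa: smallest eigenvalue of (W' M_Z W)^{-1} W' W
  kappa_liml <- min(Re(eigen(solve(crossprod(MW)) %*% crossprod(W),
                             only.values = TRUE)$values))
  c(ols = sum(D * Y) / sum(D * D), tsls = k_class(1), liml = k_class(kappa_liml))
}

draws   <- t(replicate(reps, one_draw()))
medians <- apply(draws, 2, median)
print(round(medians, 2))
```

In this design the median 2SLS estimate is pulled away from the true value of 1 toward OLS, while the median LIML estimate stays much closer to 1. LIML pays for that with a more dispersed sampling distribution, which is why medians are compared here and not means.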
Shift-share instruments without an identification argument. The Bartik (1991) instrument was applied for decades without a careful identification argument. Goldsmith-Pinkham, Sorkin, and Swift (2020) showed that validity of a Bartik IV is equivalent to validity of the industry shares as instruments (the shares-based view), so you defend exogeneity of the shares. Borusyak, Hull, and Jaravel (2022) give a complementary shocks-based view: validity requires the national shocks to be exogenous conditional on local exposure. Either argument can be made, but you must make one. Mostly Harmless Econometrics (Angrist and Pischke 2009) warned about this in a generic form years before the formal results.
Forbidden regressions. Do not use 2SLS when the second stage is nonlinear (probit, logit, nonlinear GMM) by simply plugging in first-stage fitted values. That is the “forbidden regression” (Wooldridge 2010, ch. 9) and is inconsistent. Use control-function methods, proper nonlinear IV/GMM, or a linear probability model with IV.
6.6 A worked example: rural credit and farm income
A standard blended-finance question. Does taking up a formal agricultural credit line raise farm income? OLS is biased because credit takeup is endogenous. Candidate instrument: distance from the household to the nearest agricultural extension office. The institutional story: extension offices are where farmers learn about formal credit programs and are nudged through the paperwork. Closer offices, more takeup. Exclusion: distance affects farm income only through credit takeup, not through direct technical assistance, proximity to markets, or general infrastructure.
This exclusion is debatable, and a real paper would address those alternative channels with controls (district fixed effects, distance to nearest market, distance to paved road) and auxiliary evidence (showing extension offices in this context only do credit outreach). For teaching purposes, the IV regression and the surrounding workflow are shown below.
Setup. Outcome Y_i: log farm income. Treatment D_i: indicator for whether the household took out a credit line in the survey year. Instrument Z_i: kilometers to the nearest extension office. Controls W_i: household head age, education, household size, district fixed effects, distance to the nearest paved road.
R code with fixest::feols
```r
library(fixest)
library(ivDiag)

# Data assumed in df with columns:
#   log_farm_income, took_credit, dist_extension_km,
#   head_age, head_educ, hh_size, dist_road_km, district_id

# IV with district fixed effects, clustering at the district level.
# fixest formula: outcome ~ controls | fixed effects | endogenous ~ instrument
mod_iv <- feols(
  log_farm_income ~ head_age + head_educ + hh_size + dist_road_km |
    district_id |
    took_credit ~ dist_extension_km,
  data = df,
  cluster = ~district_id
)
summary(mod_iv, stage = 1)         # first stage
summary(mod_iv)                    # 2SLS second stage
fitstat(mod_iv, ~ ivf1 + ivwald1)  # first-stage F and Wald statistics

# OLS for comparison
mod_ols <- feols(
  log_farm_income ~ took_credit + head_age + head_educ + hh_size + dist_road_km |
    district_id,
  data = df,
  cluster = ~district_id
)

# Reduced form (Z directly on Y)
mod_rf <- feols(
  log_farm_income ~ dist_extension_km + head_age + head_educ + hh_size + dist_road_km |
    district_id,
  data = df,
  cluster = ~district_id
)

# Olea-Pflueger effective F and Anderson-Rubin CI.
# FE = "district_id" keeps the district fixed effects used in mod_iv.
ivDiag::ivDiag(
  data = df,
  Y = "log_farm_income",
  D = "took_credit",
  Z = "dist_extension_km",
  controls = c("head_age", "head_educ", "hh_size", "dist_road_km"),
  FE = "district_id",
  cl = "district_id",
  bootstrap = TRUE
)
```

A typical output table from this workflow (numbers illustrative).
| Specification | \hat{\beta} (effect on log income) | SE | 95% CI |
|---|---|---|---|
| OLS | 0.083 | 0.022 | [0.040, 0.126] |
| Reduced form (Z on Y) | -0.0041 per km | 0.0013 | [-0.0066, -0.0016] |
| First-stage F (effective) | 31.7 | | |
| 2SLS | 0.241 | 0.078 | [0.088, 0.394] |
| Anderson-Rubin 95% CI | | | [0.094, 0.421] |
What this means. OLS gives an income effect of about 0.08 log points (roughly 9 percent), which a careless analyst would report. The 2SLS estimate of 0.24 log points (roughly 27 percent) is about three times larger. The direction of the gap is consistent with classical selection: borrowers are negatively selected on unobservables (financial distress, weak collateral, recent shocks), so OLS underestimates the causal effect. The effective F of 31.7 sits comfortably above the Olea-Pflueger cutoff of about 23, so 2SLS is reliable, and the AR confidence interval is close to the 2SLS Wald interval, confirming the inference.
The LATE interpretation. The 2SLS estimate is the average effect among compliers: farmers whose decision to take credit was changed by being closer to an extension office. These are marginal borrowers, not the large established farms (always-takers) or subsistence-only households (never-takers). When advising a policymaker about national expansion, report this LATE and name who the compliers are.
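The illustrative table can be cross-checked with a little arithmetic. The first-stage coefficient is not reported in the table, so the value below is backed out from the just-identified identity (reduced form divided by 2SLS), which holds when all three regressions use the same sample and controls. The interval assumes the usual normal approximation.

```latex
\begin{aligned}
\text{Implied first stage:}\quad
  \hat{\pi}_1 &= \frac{\text{reduced form}}{\hat{\beta}_1^{\text{2SLS}}}
              = \frac{-0.0041}{0.241} \approx -0.017 \ \text{per km} \\
\text{2SLS Wald interval:}\quad
  & 0.241 \pm 1.96 \times 0.078 = [0.088,\ 0.394] \\
\text{Log points to percent:}\quad
  & e^{0.083} - 1 \approx 0.087, \qquad e^{0.241} - 1 \approx 0.273
\end{aligned}
```

Read this way, each additional kilometer from an extension office lowers the probability of credit takeup by about 1.7 percentage points in the illustrative numbers. A coefficient of 0.241 on log income corresponds to an income gain of about 27 percent.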
Stata code with ivreghdfe, weakivtest, and weakiv
```stata
* Cluster-robust IV with fixed effects and weak-instrument test
ivreghdfe log_farm_income (took_credit = dist_extension_km) ///
    head_age head_educ hh_size dist_road_km, ///
    absorb(district_id) cluster(district_id) first

* Olea-Pflueger effective F
weakivtest

* Anderson-Rubin confidence interval
* Note: ivreg2 has no absorb() option, so this call as written
* does not include the district fixed effects used above
weakiv ivreg2 log_farm_income (took_credit = dist_extension_km) ///
    head_age head_educ hh_size dist_road_km, ///
    cluster(district_id) ar
```

The first option prints the first-stage regression. weakivtest (Pflueger and Wang 2015) gives the Olea-Pflueger effective F and critical values. weakiv (Finlay, Magnusson, Schaffer 2014) computes the AR confidence region. Report OLS, reduced form, first-stage F, 2SLS, and AR interval side by side.
6.7 Modern instruments: a brief tour
Three families of modern IV designs worth knowing by name.
Shift-share (Bartik) instruments. The classic Bartik instrument predicts local labor demand using a weighted average of national industry-level growth rates, with weights from local industry shares at a baseline period. The recent identification literature (Goldsmith-Pinkham, Sorkin, Swift 2020; Borusyak, Hull, Jaravel 2022; Adao, Kolesar, Morales 2019) clarifies what assumptions the design requires. For rural development, shift-share instruments built from commodity-price shocks weighted by local crop composition (e.g., Dube and Vargas 2013 on coffee prices and conflict in Colombia) are common.
Judge fixed effects (examiner IV). When treatment decisions are made by randomly assigned case workers, examiner identity is a valid instrument for the decision. Used heavily in criminal justice (Kling 2006), bankruptcy (Dobbie and Song 2015), and disability insurance appeals (Dahl, Kostol, Mogstad 2014). For blended finance, this design appears whenever loan applications are routed to loan officers with random assignment within branches. Identification rests on monotonicity (no defiers), recently questioned (Frandsen, Lefgren, Leslie 2023).
Network and gravity-style instruments. In networks, the canonical IV uses the characteristics of i’s neighbors’ neighbors (those not themselves linked to i) to instrument for the average outcome of i’s neighbors, which requires that the network not be fully transitive (Bramoulle, Djebbari, Fortin 2009); leave-one-out averages that exclude i are a related construction. In gravity-of-trade settings, third-country shocks instrument for bilateral flows. These designs raise subtle identification issues (Borusyak and Hull 2020 on “recentering” non-random exposure); use with care.
A word of caution. Shift-share and examiner IV are powerful but have enough subtle identification requirements that they are easier to learn second, after a clean one-instrument IV paper is under the belt.
6.8 Reporting checklist
A clean IV paper reports the following, in order, in the main table.
- OLS estimate, with SEs clustered at the level of variation in Z. Your benchmark.
- Reduced form (Y on Z), same controls, same clustering. If the reduced form is null, your IV is null. Full stop.
- First stage (D on Z): coefficient, SE, and the Olea-Pflueger effective F (not the conventional F).
- 2SLS estimate, SEs clustered at the level of Z.
- Anderson-Rubin 95% CI, especially if the effective F is near the cutoff. If AR and 2SLS Wald disagree, AR wins.
- Hansen J statistic if over-identified, with p-value.
- A paragraph defending exclusion. Name the alternative channels. Address each. This paragraph is the heart of the paper.
- A complier characterization (Abadie 2003 or similar) so the reader knows whose LATE you report.
Reviewers will look for each item. Build the table once and reuse the template.
6.9 Why this chapter matters for an applied researcher
Rural lending studies are full of endogeneity. People choose to borrow, to join cooperatives, to migrate, to adopt, to attend. Every blended-finance program that runs on voluntary takeup has the same identification problem. DiD timing is rarely clean (anticipation, lagged takeup, switching), and RDD eligibility is rarely sharp (rules are bent, applications filtered). When those two tools fail, IV is what is left.
Two practical points beyond the mechanics.
First, an instrument that works in one country and period is not automatically valid in another. Rainfall has been used as an instrument for income in dozens of papers; in some the exclusion restriction is plausible (subsistence agriculture, no other channel), in others it is not (mining or industrial economies where rain affects production through many channels). When adopting an instrument from the literature, the job is not to cite the original paper as authority. The job is to argue that exclusion is plausible in this setting. That is real intellectual work.
Second, IV is not a magic wand. A failed paper with OLS becomes a failed paper with IV if the instrument is weak or the exclusion is implausible. Sometimes three months on an instrument end with the conclusion that it does not work. That is fine. The point of these tools is to produce a credible answer to a real question. Sometimes the credible answer is “we cannot tell from this data,” and saying that clearly is more valuable than reporting a polished number that does not survive scrutiny.
A useful reference point from the author’s own work: in Helfrich (2026) on the New Markets Tax Credit (NMTC), the raw rural penalty of -0.262*** on log investment per capita collapses to -0.047 (not significant, p=0.64) once Community Development Entity (CDE) fixed effects are introduced. The CDE allocation pattern is doing the work that a naive specification would have attributed to rurality. The lesson generalizes: before reaching for an instrument, check whether the variation you are worried about is absorbed by a fixed effect you have not yet imposed.
For any blended-finance work, the IV-shaped question to keep in the back pocket is always: what is a source of variation in program takeup that is not driven by the unobserved characteristics that also drive the outcome? If a clean source exists, there is a paper. If not, write up the descriptive evidence honestly and look for a different empirical handle. Knowing when IV will not save you is half of using IV well.
6.10 References
Methodology and identification.
- Anderson, T. W., and Herman Rubin. 1949. “Estimation of the Parameters of a Single Equation in a Complete System of Stochastic Equations.” Annals of Mathematical Statistics 20(1): 46-63.
- Imbens, Guido W., and Joshua D. Angrist. 1994. “Identification and Estimation of Local Average Treatment Effects.” Econometrica 62(2): 467-475.
- Angrist, Joshua D., Guido W. Imbens, and Donald B. Rubin. 1996. “Identification of Causal Effects Using Instrumental Variables.” JASA 91(434): 444-455.
- Staiger, Douglas, and James H. Stock. 1997. “Instrumental Variables Regression with Weak Instruments.” Econometrica 65(3): 557-586.
- Stock, James H., and Motohiro Yogo. 2005. “Testing for Weak Instruments in Linear IV Regression.” In Identification and Inference for Econometric Models, ed. Andrews and Stock, 80-108. CUP.
- Montiel Olea, Jose Luis, and Carolin Pflueger. 2013. “A Robust Test for Weak Instruments.” JBES 31(3): 358-369.
- Andrews, Isaiah, James H. Stock, and Liyang Sun. 2019. “Weak Instruments in IV Regression: Theory and Practice.” Annual Review of Economics 11: 727-753.
- Abadie, Alberto. 2003. “Semiparametric IV Estimation of Treatment Response Models.” Journal of Econometrics 113(2): 231-263.
- Hansen, Lars Peter. 1982. “Large Sample Properties of GMM Estimators.” Econometrica 50(4): 1029-1054.
- Bekker, Paul A. 1994. “Alternative Approximations to the Distributions of IV Estimators.” Econometrica 62(3): 657-681.
- Angrist, Joshua D., Guido W. Imbens, and Alan B. Krueger. 1999. “Jackknife IV Estimation.” Journal of Applied Econometrics 14(1): 57-67.
- Chernozhukov, Victor, and Christian Hansen. 2008. “The Reduced Form: A Simple Approach to Inference with Weak Instruments.” Economics Letters 100(1): 68-71.
Shift-share and modern designs.
- Bartik, Timothy J. 1991. Who Benefits from State and Local Economic Development Policies? Upjohn Institute.
- Goldsmith-Pinkham, Paul, Isaac Sorkin, and Henry Swift. 2020. “Bartik Instruments: What, When, Why, and How.” AER 110(8): 2586-2624.
- Borusyak, Kirill, Peter Hull, and Xavier Jaravel. 2022. “Quasi-Experimental Shift-Share Research Designs.” RES 89(1): 181-213.
- Adao, Rodrigo, Michal Kolesar, and Eduardo Morales. 2019. “Shift-Share Designs: Theory and Inference.” QJE 134(4): 1949-2010.
- Bramoulle, Yann, Habiba Djebbari, and Bernard Fortin. 2009. “Identification of Peer Effects through Social Networks.” Journal of Econometrics 150(1): 41-55.
Applied papers cited.
- Miguel, Edward, Shanker Satyanath, and Ernest Sergenti. 2004. “Economic Shocks and Civil Conflict: An IV Approach.” JPE 112(4): 725-753.
- Burgess, Robin, and Rohini Pande. 2005. “Do Rural Banks Matter? Evidence from the Indian Social Banking Experiment.” AER 95(3): 780-795.
- Dube, Oeindrila, and Juan F. Vargas. 2013. “Commodity Price Shocks and Civil Conflict: Evidence from Colombia.” RES 80(4): 1384-1421.
Texts and software.
- Angrist, Joshua D., and Jorn-Steffen Pischke. 2009. Mostly Harmless Econometrics. Princeton.
- Wooldridge, Jeffrey M. 2010. Econometric Analysis of Cross Section and Panel Data, 2nd ed. MIT Press.
- Pflueger, Carolin E., and Su Wang. 2015. “A Robust Test for Weak Instruments in Stata.” Stata Journal 15(1): 216-225.
- Finlay, Magnusson, and Schaffer. 2014. weakiv: Stata package.