Chapter 2. R vs. Stata: workflow, translation, when to use which

You need both. That is the short answer. The long answer is below.

Why both still matter

Stata is the default at the World Bank, the IMF, the IDB, the AfDB, the IFC, and most J-PAL, IPA, and DIME field-RCT pipelines. The institutional code base is in .do files. Senior researchers will hand you .do files and expect you to extend them. When a DFI evaluation contract comes in, the deliverable is .do files plus a .dta and a PDF, not an .R script.

R has been the default in academic development economics since roughly 2018. It is also the default for causal machine learning, geospatial work, and anything that touches modern reproducibility tooling (Quarto, GitHub, RMarkdown). Many frontier-methods papers (Callaway and Sant’Anna; Athey and Wager) ship their reference implementations in R first, and sometimes only in R. Borusyak, Jaravel, and Spiess is a counterexample: its did_imputation command appeared in Stata first.

Python is the third leg, mostly for data engineering (raster processing, large administrative-data ETL) and for anything ML-heavy beyond what grf can do. You do not need to be fluent for headline econometrics, but a working level helps. It is not the focus of this chapter.

The split is roughly:

| Task | Stata native | R native | Either |
|---|---|---|---|
| Two-way FE workhorse regressions | reghdfe | fixest::feols() | ✓ |
| Callaway-Sant’Anna DiD | csdid | did::att_gt() | ✓ |
| Sun-Abraham event study | eventstudyinteract | fixest::sunab() | ✓ |
| RDD with CCT bandwidth | rdrobust | rdrobust::rdrobust() | ✓ |
| IV with weak-instrument robust CI | ivreghdfe + weakivtest | fixest::feols(... \| ... ~ Z) | ✓ |
| Synthetic control | synth (Abadie ado) | Synth, augsynth | ✓ |
| Synthetic DiD | (no clean port) | synthdid | R only |
| Quantile regression | qreg, sqreg, grqreg | quantreg::rq() | ✓ |
| Causal forests | (none mature) | grf::causal_forest() | R only |
| Double / debiased ML | (none mature) | DoubleML | R only |
| Spatial methods (SAR/SEM/Moran) | limited (spreg, splagvar) | sf, spdep, spatialreg | R dominates |
| Conley spatial SEs | acreg | conleyreg | ✓ |
| Honest pre-trends | (none) | HonestDiD | R only |
| Policy-learning trees | (none) | policytree | R only |
| Exporting tables to LaTeX | estout | modelsummary, kableExtra | ✓ |
| Beautiful figures | hand-built / coefplot | ggplot2 | R wins |
| Reproducible reports | very limited (dyndoc) | Quarto / RMarkdown | R wins |
| Geospatial / GIS | none | sf, terra, tmap | R only |
| Survey weighting (complex designs) | svy: prefix is excellent | survey package is excellent | Tie |

The pattern: Stata for the standard linear-model and FE workhorse code, R for everything modern and for figures. In practice your portfolio code will be a mix, and you will hand off between them via .dta and .csv.

Project structure (do this once, save yourself a year of pain)

Whatever you do, do not keep a project as a single folder with forty .do files named final_v2_USE_THIS.do. The structure below is what the World Bank’s iefolder ado generates and what most J-PAL teams use.

portugal-blended-finance/
  data/
    raw/                 # source files, untouched, never commit if PII
    intermediate/        # mid-cleaning outputs
    processed/           # final analysis-ready, committed if non-PII
    documentation/       # codebooks, data-dictionary, fieldwork notes
  code/
    01_clean_raw.do      # raw -> intermediate
    02_build_panel.do    # intermediate -> processed
    03_main_tables.do    # main regressions
    04_robustness.do     # robustness checks
    05_figures.R         # ggplot figures (use R here even if rest is Stata)
    06_appendix.do
    run_all.do           # master script that calls 01-06 in order
  output/
    tables/              # .tex output from estout/esttab
    figures/             # .pdf, .png
    logs/                # .log files for replication
  paper/
    main.tex
    appendix.tex
    refs.bib
  pre_analysis_plan/
    pap_v1.pdf
    aea_registry_id.txt  # if you registered with AEA RCT registry
  presentations/
    neudc_2026.pdf
    ifc_brief_2026.pdf
  README.md              # how to run, what version of Stata/R/python
  .gitignore

run_all.do (or run_all.R) is the single command that rebuilds every output from scratch given the raw data. If you cannot run run_all.do from a fresh clone of the repo and reproduce your tables, your project is not reproducible. Test this monthly.
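A minimal sketch of what a run_all.R for a mixed Stata/R project could look like, using base R only. The helper names, the "stata-mp" executable name, and the Unix-style batch call are assumptions for illustration, not something the chapter or iefolder prescribes; adjust them to your installation.

```r
# run_all.R: sketch of a master script for a mixed Stata/R project.
# ASSUMPTIONS (not from the chapter): the helper names, the "stata-mp"
# executable name, and the Unix-style batch call `stata-mp -b do <file>`.

# Numbered pipeline scripts (01_..., 02_..., ...) in execution order.
# run_all.do itself is skipped because it has no numeric prefix.
list_pipeline_scripts <- function(code_dir = "code") {
  files <- list.files(code_dir, pattern = "^[0-9]{2}_.*\\.(do|R)$",
                      full.names = TRUE)
  files[order(basename(files))]
}

# Run one script: .R files are sourced, .do files go to Stata in batch mode.
run_script <- function(path, stata_bin = "stata-mp") {
  message("Running ", path)
  ext <- tolower(tools::file_ext(path))
  if (ext == "r") {
    source(path, local = new.env())   # fresh environment per script
  } else if (ext == "do") {
    status <- system2(stata_bin, c("-b", "do", shQuote(path)))
    if (status != 0) stop("Stata exited with status ", status, " on ", path)
  } else {
    stop("Unknown script type: ", path)
  }
  invisible(path)
}

# Run every numbered script in order, stopping at the first R-side failure.
run_all <- function(code_dir = "code", stata_bin = "stata-mp") {
  scripts <- list_pipeline_scripts(code_dir)
  for (s in scripts) run_script(s, stata_bin = stata_bin)
  invisible(scripts)
}
```

Calling run_all() from the project root would then walk through 01 to 06 in order. Stata's exit status in batch mode may not reflect an error inside the .do file, so it is safer to also inspect the batch log after each step rather than rely on the status check alone.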

A common-commands translation table

The two languages call the same statistical concept by different names. The table below covers the operations you will do daily.

| Operation | Stata | R (tidyverse / fixest) |
|---|---|---|
| Load CSV | import delimited file.csv | df <- readr::read_csv("file.csv") |
| Load Stata data | use file.dta, clear | df <- haven::read_dta("file.dta") |
| Save Stata data | save file.dta, replace | haven::write_dta(df, "file.dta") |
| Filter rows | keep if year >= 2010 | df <- df %>% filter(year >= 2010) |
| Drop rows | drop if missing(income) | df <- df %>% filter(!is.na(income)) |
| Keep columns | keep id year income | df <- df %>% select(id, year, income) |
| Rename column | rename oldname newname | df <- df %>% rename(newname = oldname) |
| New column | gen log_inc = log(income) | df <- df %>% mutate(log_inc = log(income)) |
| Replace values | replace x = 0 if missing(x) | df <- df %>% mutate(x = replace_na(x, 0)) |
| Sort | sort id year | df <- df %>% arrange(id, year) |
| Group summary | collapse (mean) income, by(id) | df %>% group_by(id) %>% summarise(income = mean(income, na.rm = TRUE)) |
| Merge | merge 1:1 id using other.dta | df <- df %>% full_join(other, by = "id") (Stata keeps unmatched rows by default; with keep(match) the equivalent is inner_join) |
| Append | append using other.dta | df <- bind_rows(df, other) |
| OLS | reg y x1 x2, robust | feols(y ~ x1 + x2, data = df, vcov = "hetero") |
| FE | reghdfe y x1, absorb(id year) vce(cluster id) | feols(y ~ x1 \| id + year, cluster = "id", data = df) |
| Logit | logit y x1 x2 | glm(y ~ x1 + x2, family = binomial, data = df) |
| IV (2SLS) | ivreghdfe y (D = Z) x1, absorb(id) cluster(id) | feols(y ~ x1 \| id \| D ~ Z, cluster = "id", data = df) |
| Tabulate | tab x | df %>% count(x) or table(df$x) |
| Summary stats | sum y, detail | df %>% summarise(across(y, list(mean=mean, sd=sd, ...))) |
| Regression table | estout est1 est2 using out.tex, style(tex) | modelsummary(list(m1, m2), output = "out.tex") |
| Coefficient plot | coefplot est1, drop(_cons) | modelplot(m1) or hand-ggplot |
| Event study plot | eventdd | iplot(feols(... ~ sunab(...))) |
| Loop | forvalues i = 1/10 { ... } | for (i in 1:10) { ... } or purrr::map(1:10, ~ ...) |
| Macros / variables | local x = 5, then di `x' | x <- 5; print(x) |
| Conditional | if (cond) { ... } | if (cond) { ... } (identical) |
| Save figure | graph export fig.pdf, replace | ggsave("fig.pdf", width=6, height=4) |
| Comment | * one line or /* block */ | # one line |

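The mental shift is this. Stata operates on one dataset in memory at a time (frames, added in Stata 16, allow more, but most code does not use them). R operates on named data frames, many at once. Stata is command-first (“regress y x”): the verb comes first and the dataset is implicit. R and the tidyverse are data-first and pipe a named data frame through verbs (df %>% filter(...) %>% mutate(...), with the model then fitted on the result).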
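Beyond syntax, two defaults differ between the languages and are easy to miss when translating line by line. Stata's merge keeps unmatched rows from both files and records the match status in _merge, and Stata treats a missing numeric value as larger than any number. The sketch below reproduces both behaviours in base R on toy data; the helper name stata_style_merge() and the data are made up for illustration.

```r
# Sketch: two default behaviours that differ between Stata and R.
# ASSUMPTIONS: the toy data and the helper name stata_style_merge() are
# illustrative and not part of the chapter.

master <- data.frame(id = c(1, 2, 3), income = c(100, NA, 300))
using  <- data.frame(id = c(2, 3, 4), year = c(2009, NA, 2012))

# Full join plus a Stata-like match indicator:
# 1 = master only, 2 = using only, 3 = matched.
stata_style_merge <- function(master, using, by) {
  master$.in_master <- TRUE
  using$.in_using   <- TRUE
  out <- merge(master, using, by = by, all = TRUE)   # all = TRUE: keep unmatched rows
  in_m <- !is.na(out$.in_master)
  in_u <- !is.na(out$.in_using)
  out$merge_status <- ifelse(in_m & in_u, 3L, ifelse(in_m, 1L, 2L))
  out$.in_master <- NULL
  out$.in_using  <- NULL
  out
}

merged <- stata_style_merge(master, using, by = "id")
print(merged)

# Missing values: an R comparison with NA returns NA, and subset() drops those
# rows (dplyr::filter() does the same). Stata treats missing as larger than any
# number, so `keep if year >= 2010` would keep rows with a missing year.
recent_r     <- subset(merged, year >= 2010)                  # R-style result
recent_stata <- subset(merged, is.na(year) | year >= 2010)    # mimics Stata's result
```

Here recent_r keeps only the row with a year of 2012, while recent_stata also keeps the rows whose year is missing. When a translated script changes the sample size, these two defaults are a good first place to look.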

Concrete starter scripts

Stata: clean + regress + export, end to end

* setup
clear all
set more off
cd "~/projects/portugal-blended-finance"
log using output/logs/main.log, replace

* load
use data/processed/panel.dta, clear

* basic checks
describe
sum log_consumption log_income treated, detail
tab year treated

* main TWFE regression
reghdfe log_consumption treated, absorb(municipality year) vce(cluster municipality)
est store m1

* with controls
reghdfe log_consumption treated log_hh_size log_assets, absorb(municipality year) vce(cluster municipality)
est store m2

* Callaway-Sant'Anna for the modern DiD estimator
* notyet: use not-yet-treated units as controls (the csdid default is never-treated only)
csdid log_consumption, ivar(municipality) time(year) gvar(first_treated) method(dripw) notyet
estat all

* export
esttab m1 m2 using output/tables/main_table.tex, replace ///
    se star(* 0.10 ** 0.05 *** 0.01) booktabs ///
    title("Effect of credit on log consumption") ///
    keep(treated log_hh_size log_assets)

log close

R: same pipeline in fixest + did + modelsummary

# setup
library(tidyverse)
library(haven)
library(fixest)
library(did)
library(modelsummary)

setwd("~/projects/portugal-blended-finance")

# load
df <- read_dta("data/processed/panel.dta")

# basic checks
glimpse(df)
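# na.rm = TRUE mirrors Stata's summarize, which ignores missing values
df %>% summarise(across(c(log_consumption, log_income, treated),
                        list(mean = ~mean(.x, na.rm = TRUE),
                             sd   = ~sd(.x, na.rm = TRUE),
                             n    = ~sum(!is.na(.x)))))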
df %>% count(year, treated)

# main TWFE
m1 <- feols(log_consumption ~ treated | municipality + year,
            cluster = "municipality", data = df)

# with controls
m2 <- feols(log_consumption ~ treated + log_hh_size + log_assets |
              municipality + year,
            cluster = "municipality", data = df)

# Callaway-Sant'Anna
cs <- att_gt(yname = "log_consumption",
             tname = "year",
             idname = "municipality",
             gname = "first_treated",
             data = df,
             control_group = "notyettreated",
             est_method = "dr")
overall <- aggte(cs, type = "simple")
event_study <- aggte(cs, type = "dynamic")
ggdid(event_study)

# export
modelsummary(list("Baseline" = m1, "With controls" = m2),
             output = "output/tables/main_table.tex",
             coef_omit = "Intercept",
             stars = TRUE,
             title = "Effect of credit on log consumption")

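The two TWFE regressions produce the same point estimates (up to numerical precision), and their clustered standard errors should agree as long as both tools apply the same small-sample correction. The Callaway-Sant’Anna estimates match only if both scripts use the same comparison group (never-treated or not-yet-treated), so check that option on each side. Both scripts run from the same .dta file. The Stata version reads more like a recipe; the R version reads more like an essay. Pick the one that matches the job, and the team you are handing the code to.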
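Agreement between the two pipelines is worth checking mechanically rather than by eye. The sketch below compares two named coefficient vectors within a tolerance. compare_estimates() is a hypothetical helper, and the numbers are placeholders chosen to show one match and one mismatch, not results from any dataset.

```r
# Sketch: check that two sets of estimates agree within a tolerance.
# ASSUMPTIONS: compare_estimates() is a hypothetical helper; the numbers below
# are placeholders for illustration, not results from any dataset.
compare_estimates <- function(a, b, tol = 1e-6) {
  shared <- intersect(names(a), names(b))
  if (length(shared) == 0) stop("No coefficient names in common")
  abs_diff <- abs(a[shared] - b[shared])
  data.frame(term     = shared,
             a        = unname(a[shared]),
             b        = unname(b[shared]),
             abs_diff = unname(abs_diff),
             agree    = unname(abs_diff <= tol),
             row.names = NULL)
}

stata_coefs <- c(treated = 0.1234567, log_hh_size = -0.0456789)  # placeholder values
r_coefs     <- c(treated = 0.1234567, log_hh_size = -0.0456889)  # placeholder values

check <- compare_estimates(stata_coefs, r_coefs, tol = 1e-6)
print(check)
```

In a real project the first vector could be read from a file of coefficients written out by the Stata script and the second taken from coef(m2) in R. A disagreement in the third or fourth decimal usually points to a different sample or a different default, not to numerical noise.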

Reproducibility checklist

This is the bar that separates a researcher from a person who happens to run regressions.

  1. Versioned dependencies. Stata: pin the version (version 17.0 at the top of .do) and document the SSC package versions in README.md. R: use renv to lock package versions; commit the lockfile.

  2. Single entry point. run_all.do or run_all.R rebuilds every output from raw data.

  3. No manual steps. Never have an output that requires “now open Excel and click here”. If something can only be done manually, write it down in README.md and explain why.

  4. Git from day one. Commit code, commit small derived datasets, do not commit raw PII. .gitignore should exclude data/raw/, *.dta if PII, .DS_Store, *.log (or commit logs separately).

  5. Seed your randomness. set seed 20260509 in Stata, set.seed(20260509) in R. Without this, bootstrap CIs and any sampling step will change run-to-run.

  6. Document the data. Every column in your processed dataset should appear in data/documentation/codebook.md with its source, units, missing-value conventions, and any transformations.

  7. Snapshot the raw. When you receive raw data, immediately save an untouched copy with a date stamp. Never edit it. All cleaning lives in code/01_clean_raw.do.

  8. Test on a fresh clone. Every month or so, clone your repo to a new directory and run run_all end-to-end. If anything is missing, fix it now, not at deadline.
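Item 5 is easy to demonstrate. The sketch below runs a percentile bootstrap for a mean in base R, on simulated data, and shows that resetting the seed reproduces the interval exactly while continuing the random-number stream does not. boot_ci_mean() is a hypothetical helper written for this illustration.

```r
# Sketch: why the seed matters for bootstrap confidence intervals.
# ASSUMPTIONS: simulated data; boot_ci_mean() is a hypothetical helper.
boot_ci_mean <- function(x, reps = 999, level = 0.95) {
  n <- length(x)
  # Resample with replacement and record the mean of each resample.
  draws <- replicate(reps, mean(sample(x, size = n, replace = TRUE)))
  alpha <- 1 - level
  quantile(draws, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

set.seed(1)
x <- rnorm(200, mean = 5, sd = 2)   # simulated outcome

set.seed(20260509)
ci_a <- boot_ci_mean(x)

set.seed(20260509)
ci_b <- boot_ci_mean(x)             # same seed: identical interval

ci_c <- boot_ci_mean(x)             # no reseed: the interval moves slightly

print(identical(ci_a, ci_b))
print(rbind(ci_a, ci_b, ci_c))
```

The same logic applies in Stata: set the seed once near the top of the master .do file so that every bootstrap and sampling step downstream is reproducible.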

When to use which (decision rules)

Use Stata when:

  - The contracting institution requires it (most DFI work).
  - You are extending a code base that is already in Stata.
  - You need csdid, rdrobust, or ivreghdfe and do not have time to set up an R environment.
  - The output is destined for a Stata-using audience (most government statistical offices, central banks, ministries of finance).

Use R when:

  - The method is causal ML, spatial, or one of the modern-DiD-only-in-R tools (Honest pre-trends, synthetic DiD, policy learning).
  - The deliverable is a Quarto or RMarkdown report (which most of your academic-side outputs should be).
  - The figure needs to be polished. ggplot beats Stata’s graph engine every time.
  - You are working with anyone in academic development economics post-2020.

Use Python when:

  - The task is data engineering at scale (administrative data, satellite imagery).
  - The method requires modern ML beyond grf (deep learning, transformer text models, gradient boosting at scale).
  - You are integrating with a production data pipeline at a DFI’s tech team.

Use a hand-coded estimator when:

  - The published implementation is buggy or out of date, and you have verified it is wrong.
  - The estimator is brand new and not yet packaged.
  - The dataset is so large that the standard package cannot fit it in memory.

That last case is rare for blended-finance work. Do not reinvent unless you have to.

A note on speed

Modern fixest is faster than Stata’s reghdfe for the same model in most cases, often by 5x to 20x for large panels. If your regression takes more than a few minutes in Stata, it is worth porting to R for the iteration speed alone. The Stata hdfe-suffixed packages (reghdfe, ivreghdfe) already implement fast fixed-effect absorption, and they are themselves roughly 10x faster than the original xtreg, fe. fixest in R is the next generation.
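To make "absorbing" fixed effects concrete, the sketch below removes unit and year means by alternating demeaning until nothing changes, then regresses the residualized outcome on the residualized regressor. It is a toy version of the idea on simulated, unbalanced data, not the algorithm that reghdfe or fixest actually implements; absorb_two_way() is a hypothetical helper. The result is compared with the brute-force regression on a full set of dummies.

```r
# Sketch: absorb two sets of fixed effects by alternating demeaning, then
# compare with the dummy-variable regression.
# ASSUMPTIONS: simulated data; absorb_two_way() is a toy helper, not the
# algorithm that reghdfe or fixest actually implements.
absorb_two_way <- function(v, g1, g2, tol = 1e-10, max_iter = 1000) {
  for (iter in seq_len(max_iter)) {
    before <- v
    v <- v - ave(v, g1)   # subtract group-1 means (ave() defaults to mean)
    v <- v - ave(v, g2)   # subtract group-2 means
    if (max(abs(v - before)) < tol) break   # stop once a full sweep changes nothing
  }
  v
}

set.seed(20260509)
n_id <- 50
n_year <- 8
panel <- expand.grid(id = seq_len(n_id), year = seq_len(n_year))
panel <- panel[runif(nrow(panel)) > 0.2, ]     # drop about 20% of rows: unbalanced panel
id_fe   <- rnorm(n_id)[panel$id]
year_fe <- rnorm(n_year)[panel$year]
panel$x <- id_fe + year_fe + rnorm(nrow(panel))
panel$y <- 0.5 * panel$x + 2 * id_fe - year_fe + rnorm(nrow(panel))

# Residualize y and x on both sets of fixed effects, then run plain OLS.
y_tilde <- absorb_two_way(panel$y, panel$id, panel$year)
x_tilde <- absorb_two_way(panel$x, panel$id, panel$year)
beta_absorbed <- unname(coef(lm(y_tilde ~ x_tilde - 1)))

# Brute force: one dummy per unit and per year.
beta_dummies <- unname(coef(lm(y ~ x + factor(id) + factor(year), data = panel))["x"])

print(c(absorbed = beta_absorbed, dummies = beta_dummies))
```

The two coefficients agree up to the convergence tolerance. The dummy-variable route builds a design matrix with one column per unit and year, which is what becomes infeasible on large panels; the absorption route only ever touches vectors of length N.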

For the largest jobs (millions of observations, dozens of FEs), neither Stata nor R is the right tool. That is where you write a parquet pipeline in Python or DuckDB and only pass the regression-ready panel to your econometrics tool. You will not hit this scale for a long time in blended-finance work, but it is worth knowing it exists.


The worked examples in subsequent chapters give both R and Stata code, side by side. Run them both. Get fluent in the translation. That fluency is what makes you the person on the team who can bridge between the World Bank’s Stata pipeline and the Athey-lab causal-forest workflow. That bridging skill is worth its weight in salary band.