Chapter 2. R vs. Stata: workflow, translation, when to use which
You need both. That is the short answer. The long answer is below.
Why both still matter
Stata is the default at the World Bank, the IMF, the IDB, the AfDB, the IFC, and most J-PAL, IPA, and DIME field-RCT pipelines. The institutional code base is in .do files. Senior researchers will hand you .do files and expect you to extend them. When an evaluation contract from a development finance institution (DFI) comes in, the deliverable is .do files plus a .dta and a PDF, not an .R script.
R has been the default in academic development economics since roughly 2018, and in causal machine learning, geospatial work, and anything that touches modern reproducibility (Quarto, GitHub, RMarkdown). Many frontier-methods papers (Callaway and Sant’Anna; Athey and Wager) ship their reference implementations in R first, and sometimes only. There are exceptions: Borusyak, Jaravel, and Spiess released did_imputation for Stata first.
Python is the third leg, mostly for data engineering (raster processing, large administrative-data ETL) and for anything ML-heavy beyond what grf can do. You do not need to be fluent for headline econometrics, but a working level helps. It is not the focus of this chapter.
The split is roughly:
| Task | Stata native | R native | Either works? |
|---|---|---|---|
| Two-way FE workhorse regressions | reghdfe | fixest::feols() | Yes |
| Callaway-Sant’Anna DiD | csdid | did::att_gt() | Yes |
| Sun-Abraham event study | eventstudyinteract | fixest::sunab() | Yes |
| RDD with CCT bandwidth | rdrobust | rdrobust::rdrobust() | Yes |
| IV with weak-instrument robust CI | ivreghdfe + weakivtest | fixest::feols(... \| ... ~ Z) | Yes |
| Synthetic control | synth (Abadie ado) | Synth, augsynth | Yes |
| Synthetic DiD | (no clean port) | synthdid | R only |
| Quantile regression | qreg, sqreg, grqreg | quantreg::rq() | Yes |
| Causal forests | (none mature) | grf::causal_forest() | R only |
| Double / debiased ML | (none mature) | DoubleML | R only |
| Spatial methods (SAR/SEM/Moran) | limited (spreg, splagvar) | sf, spdep, spatialreg | R dominates |
| Conley spatial SEs | acreg | conleyreg | Yes |
| Honest pre-trends | honestdid (community port) | HonestDiD | R first |
| Policy-learning trees | (none) | policytree | R only |
| Exporting tables to LaTeX | estout | modelsummary, kableExtra | Yes |
| Beautiful figures | hand-built / coefplot | ggplot2 | R wins |
| Reproducible reports | very limited (dyndoc) | Quarto / RMarkdown | R wins |
| Geospatial / GIS | none | sf, terra, tmap | R only |
| Survey weighting (complex designs) | svy: prefix is excellent | survey package is excellent | Tie |
The pattern: Stata for the standard linear-model and FE workhorse code, R for everything modern and for figures. In practice your portfolio code will be a mix, and you will hand off between them via .dta and .csv.
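The .dta handoff mentioned above can be sketched from the R side with haven, the same package used for loading data later in this chapter. The toy panel and its values are made up for illustration; only write_dta() and read_dta() are real API here.

```r
library(haven)

# Toy panel with hypothetical values. Stata variable names cannot contain
# dots, so stick to letters, digits, and underscores.
panel <- data.frame(
  municipality = c(1, 1, 2, 2),
  year         = c(2019, 2020, 2019, 2020),
  log_income   = c(6.1, 6.3, 5.8, NA)
)

path <- file.path(tempdir(), "panel.dta")
write_dta(panel, path)   # R -> Stata: NA is written as Stata's system missing (.)
back <- read_dta(path)   # Stata -> R: system missing comes back as NA

# Compare the values column by column, ignoring Stata format attributes.
same_income <- all.equal(as.numeric(back$log_income), panel$log_income)
same_year   <- all.equal(as.numeric(back$year), panel$year)
same_income
same_year
```

A round trip like this is a cheap check to run once before a handoff, since value labels and variable formats are the usual places where .dta exchanges surprise people.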
Project structure (do this once, save yourself a year of pain)
Whatever you do, do not keep a project as a single folder with forty .do files named final_v2_USE_THIS.do. The structure below follows the same logic as the folders the World Bank’s iefolder ado generates and as the layouts most J-PAL teams use.
    portugal-blended-finance/
        data/
            raw/                  # source files, untouched, never commit if PII
            intermediate/         # mid-cleaning outputs
            processed/            # final analysis-ready, committed if non-PII
            documentation/        # codebooks, data-dictionary, fieldwork notes
        code/
            01_clean_raw.do       # raw -> intermediate
            02_build_panel.do     # intermediate -> processed
            03_main_tables.do     # main regressions
            04_robustness.do      # robustness checks
            05_figures.R          # ggplot figures (use R here even if rest is Stata)
            06_appendix.do
            run_all.do            # master script that calls 01-06 in order
        output/
            tables/               # .tex output from estout/esttab
            figures/              # .pdf, .png
            logs/                 # .log files for replication
        paper/
            main.tex
            appendix.tex
            refs.bib
        pre_analysis_plan/
            pap_v1.pdf
            aea_registry_id.txt   # if you registered with AEA RCT registry
        presentations/
            neudc_2026.pdf
            ifc_brief_2026.pdf
        README.md                 # how to run, what version of Stata/R/python
        .gitignore
run_all.do (or run_all.R) is the single command that rebuilds every output from scratch given the raw data. If you cannot run run_all.do from a fresh clone of the repo and reproduce your tables, your project is not reproducible. Test this monthly.
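As an illustration of the single-entry-point idea, here is a minimal sketch of what a run_all.R could look like for a mixed Stata/R project. The executable name stata-mp and the batch flags (-b do) are assumptions about a Unix-style Stata install, and build_command() and run_all() are hypothetical helper names; adjust all of them to your machine.

```r
# run_all.R: rebuild every output from raw data (sketch, run from project root).
# ASSUMPTION: a Stata executable called "stata-mp" is on the PATH.
stata_bin <- "stata-mp"

scripts <- c(
  "code/01_clean_raw.do",
  "code/02_build_panel.do",
  "code/03_main_tables.do",
  "code/04_robustness.do",
  "code/05_figures.R",
  "code/06_appendix.do"
)

# Build the command for one script, dispatching on the file extension.
build_command <- function(script) {
  ext <- tolower(tools::file_ext(script))
  switch(ext,
    "do" = list(cmd = stata_bin, args = c("-b", "do", script)),
    "r"  = list(cmd = "Rscript", args = script),
    stop("Unknown script type: ", script)
  )
}

# Run the scripts in order and stop at the first non-zero exit status.
run_all <- function(scripts) {
  for (s in scripts) {
    job <- build_command(s)
    message("Running ", s)
    status <- system2(job$cmd, job$args)
    if (status != 0) stop("Failed: ", s)
  }
  invisible(TRUE)
}

# run_all(scripts)  # uncomment to rebuild everything
```

Stata in batch mode may not report a failing do-file through its exit status, so it is worth checking the .log files as well rather than trusting the status code alone.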
A common-commands translation table
The two languages call the same statistical concept by different names. The table below covers the operations you will do daily.
| Operation | Stata | R (tidyverse / fixest) |
|---|---|---|
| Load CSV | import delimited file.csv | df <- readr::read_csv("file.csv") |
| Load Stata data | use file.dta, clear | df <- haven::read_dta("file.dta") |
| Save Stata data | save file.dta, replace | haven::write_dta(df, "file.dta") |
| Filter rows | keep if year >= 2010 | df <- df %>% filter(year >= 2010) |
| Drop rows | drop if missing(income) | df <- df %>% filter(!is.na(income)) |
| Keep columns | keep id year income | df <- df %>% select(id, year, income) |
| Rename column | rename oldname newname | df <- df %>% rename(newname = oldname) |
| New column | gen log_inc = log(income) | df <- df %>% mutate(log_inc = log(income)) |
| Replace values | replace x = 0 if missing(x) | df <- df %>% mutate(x = replace_na(x, 0)) |
| Sort | sort id year | df <- df %>% arrange(id, year) |
| Group summary | collapse (mean) income, by(id) | df %>% group_by(id) %>% summarise(income = mean(income, na.rm = TRUE)) |
| Merge (Stata keeps unmatched rows by default) | merge 1:1 id using other.dta | df <- df %>% full_join(other, by = "id") |
| Append | append using other.dta | df <- bind_rows(df, other) |
| OLS | reg y x1 x2, robust | feols(y ~ x1 + x2, data = df, vcov = "hetero") |
| FE | reghdfe y x1, absorb(id year) vce(cluster id) | feols(y ~ x1 \| id + year, cluster = "id", data = df) |
| Logit | logit y x1 x2 | glm(y ~ x1 + x2, family = binomial, data = df) |
| IV (2SLS) | ivreghdfe y (D = Z) x1, absorb(id) cluster(id) | feols(y ~ x1 \| id \| D ~ Z, cluster = "id", data = df) |
| Tabulate | tab x | df %>% count(x) or table(df$x) |
| Summary stats | sum y, detail | df %>% summarise(across(y, list(mean=mean, sd=sd, ...))) |
| Regression table | estout est1 est2 using out.tex, style(tex) | modelsummary(list(m1, m2), output = "out.tex") |
| Coefficient plot | coefplot est1, drop(_cons) | modelplot(m1) or hand-ggplot |
| Event study plot | eventdd | iplot(feols(... ~ sunab(...))) |
| Loop | forvalues i = 1/10 { ... } | for (i in 1:10) { ... } or purrr::map(1:10, ~ ...) |
| Macros / variables | local x = 5; di `x' | x <- 5; print(x) |
| Conditional | if (cond) { … } | if (cond) { … } (identical) |
| Save figure | graph export fig.pdf, replace | ggsave("fig.pdf", width=6, height=4) |
| Comment | * one line or /* block */ | # one line |
The mental shift is this. Stata operates on the dataset, one in memory at a time, mostly. R operates on named data frames, many at once. Stata syntax is command-first (“regress y x”): the verb comes first and the dataset is implicit. R and the tidyverse are data-first and pipe through verbs (df %>% filter %>% mutate %>% regress).
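To make the "many data frames at once" point concrete, here is a small sketch with two toy data frames (hypothetical values) held in memory together. It also mimics the bookkeeping Stata does on a merge: Stata's merge 1:1 keeps matched and unmatched rows and records the source in _merge, which in dplyr corresponds to a full join plus a hand-built flag. The column name merge_status is an illustrative choice, not a dplyr convention.

```r
library(dplyr)

# Two data frames in memory at the same time (hypothetical values).
households <- data.frame(id = c(1, 2, 3), income = c(100, 250, NA))
loans      <- data.frame(id = c(2, 3, 4), loan = c(10, 20, 30))

# Full join keeps every id from both sides, like Stata's default merge.
merged <- households %>%
  mutate(in_master = TRUE) %>%
  full_join(loans %>% mutate(in_using = TRUE), by = "id") %>%
  mutate(
    merge_status = case_when(
      !is.na(in_master) & !is.na(in_using) ~ "matched",
      !is.na(in_master)                    ~ "master only",
      TRUE                                 ~ "using only"
    )
  ) %>%
  select(-in_master, -in_using)

# Rough equivalent of `tab _merge` after a Stata merge.
merged %>% count(merge_status)
```

If you only want matched rows, the pairing is keep(match) on the Stata side and inner_join() on the R side.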
Concrete starter scripts
Stata: clean + regress + export, end to end
* setup
clear all
set more off
cd "~/projects/portugal-blended-finance"
log using output/logs/main.log, replace
* load
use data/processed/panel.dta, clear
* basic checks
describe
sum log_consumption log_income treated, detail
tab year treated
* main TWFE regression
reghdfe log_consumption treated, absorb(municipality year) vce(cluster municipality)
est store m1
* with controls
reghdfe log_consumption treated log_hh_size log_assets, absorb(municipality year) vce(cluster municipality)
est store m2
* Callaway-Sant'Anna for the modern DiD estimator
* notyet uses not-yet-treated units as controls, matching the R script below
csdid log_consumption, ivar(municipality) time(year) gvar(first_treated) method(dripw) notyet
estat all
* export
esttab m1 m2 using output/tables/main_table.tex, replace ///
se star(* 0.10 ** 0.05 *** 0.01) booktabs ///
title("Effect of credit on log consumption") ///
keep(treated log_hh_size log_assets)
log close
R: same pipeline in fixest + did + modelsummary
# setup
library(tidyverse)
library(haven)
library(fixest)
library(did)
library(modelsummary)
setwd("~/projects/portugal-blended-finance")
# load
df <- read_dta("data/processed/panel.dta")
# basic checks
glimpse(df)
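# na.rm = TRUE mirrors Stata's sum, which ignores missing values
df %>% summarise(across(c(log_consumption, log_income, treated),
                        list(mean = ~mean(.x, na.rm = TRUE),
                             sd   = ~sd(.x, na.rm = TRUE),
                             n    = ~sum(!is.na(.x)))))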
df %>% count(year, treated)
# main TWFE
m1 <- feols(log_consumption ~ treated | municipality + year,
cluster = "municipality", data = df)
# with controls
m2 <- feols(log_consumption ~ treated + log_hh_size + log_assets |
municipality + year,
cluster = "municipality", data = df)
# Callaway-Sant'Anna
cs <- att_gt(yname = "log_consumption",
tname = "year",
idname = "municipality",
gname = "first_treated",
data = df,
control_group = "notyettreated",
est_method = "dr")
overall <- aggte(cs, type = "simple")
event_study <- aggte(cs, type = "dynamic")
ggdid(event_study)
# export
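# star thresholds set explicitly to match the esttab call in the Stata script
modelsummary(list("Baseline" = m1, "With controls" = m2),
             output = "output/tables/main_table.tex",
             coef_omit = "Intercept",
             stars = c("*" = 0.10, "**" = 0.05, "***" = 0.01),
             title = "Effect of credit on log consumption")
Both produce the same TWFE point estimates (up to numerical precision), and both run from the same .dta file. The clustered standard errors should agree as well, but verify this on your own data: reghdfe drops singleton observations by default and the two packages expose different small-sample adjustments, so small discrepancies are possible. The Stata version reads more like a recipe; the R version reads more like an essay. Pick the one that matches the job, and the team you are handing the code to.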
Reproducibility checklist
This is the bar that separates a researcher from a person who happens to run regressions.
- Versioned dependencies. Stata: pin the version (version 17.0 at the top of .do) and document the SSC package versions in README.md. R: use renv to lock package versions; commit the lockfile.
- Single entry point. run_all.do or run_all.R rebuilds every output from raw data.
- No manual steps. Never have an output that requires “now open Excel and click here”. If something can only be done manually, write it down in README.md and explain why.
- Git from day one. Commit code, commit small derived datasets, do not commit raw PII. .gitignore should exclude data/raw/, *.dta if PII, .DS_Store, *.log (or commit logs separately).
- Seed your randomness. set seed 20260509 in Stata, set.seed(20260509) in R. Without this, bootstrap CIs and any sampling step will change run-to-run.
- Document the data. Every column in your processed dataset should appear in data/documentation/codebook.md with its source, units, missing-value conventions, and any transformations.
- Snapshot the raw. When you receive raw data, immediately save an untouched copy with a date stamp. Never edit it. All cleaning lives in code/01_clean_raw.do.
- Test on a fresh clone. Every month or so, clone your repo to a new directory and run run_all end-to-end. If anything is missing, fix it now, not at deadline.
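
The seeding item is the easiest one to demonstrate. The sketch below bootstraps a confidence interval for a mean twice and shows that fixing the seed makes the two runs identical. The data values and the helper name boot_ci() are hypothetical; the seed is the one used in the checklist.

```r
# Percentile bootstrap CI for the mean, with the seed fixed inside the function.
boot_ci <- function(x, reps = 500, seed = 20260509) {
  set.seed(seed)
  means <- replicate(reps, mean(sample(x, replace = TRUE)))
  quantile(means, c(0.025, 0.975))
}

x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9)  # hypothetical sample

ci_1 <- boot_ci(x)
ci_2 <- boot_ci(x)                 # same seed: same resamples, same interval
ci_3 <- boot_ci(x, seed = 1)       # different seed: interval will generally move

identical(ci_1, ci_2)
```

For the versioned-dependencies item, the usual renv workflow is renv::init() once per project, renv::snapshot() after adding or updating packages, and renv::restore() on a fresh clone.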
When to use which (decision rules)
Use Stata when:
- The contracting institution requires it (most DFI work)
- You are extending a code base that is already in Stata
- You need csdid, rdrobust, or ivreghdfe and do not have time to set up an R environment
- The output is destined for a Stata-using audience (most government statistical offices, central banks, ministries of finance)

Use R when:
- The method is causal ML, spatial, or one of the modern tools that are R-only or R-first (Honest pre-trends, synthetic DiD, policy learning)
- The deliverable is a Quarto or RMarkdown report (which most of your academic-side outputs should be)
- The figure needs to be polished. ggplot beats Stata’s graph engine every time.
- You are working with anyone in academic development economics post-2020

Use Python when:
- The task is data engineering at scale (administrative data, satellite imagery)
- The method requires modern ML beyond grf (deep learning, transformer text models, gradient boosting at scale)
- You are integrating with a production data pipeline at a DFI’s tech team

Use a hand-coded estimator when:
- The published implementation is buggy or out of date, and you have verified it is wrong
- The estimator is brand new and not yet packaged
- The dataset is so large that the standard package cannot fit it in memory

That last case is rare for blended-finance work. Do not reinvent unless you have to.
A note on speed
Modern fixest is faster than Stata’s reghdfe for the same model in most cases, often by 5x to 20x for large panels. If your regression takes more than a few minutes in Stata, it is worth porting to R for the iteration speed alone. The hdfe-suffixed Stata packages (reghdfe, ivreghdfe) already implement fast FE absorption, and they are themselves roughly 10x faster than the original xtreg, fe. fixest in R is the next generation.
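It helps to see what "FE absorption" means before comparing speeds. The base-R sketch below, on simulated data with made-up parameters, shows that demeaning the outcome and regressor within each unit gives the same slope as estimating one dummy per unit. This is the one-way case only; reghdfe and fixest generalise the idea to several fixed effects with iterative algorithms, which this sketch does not attempt to reproduce.

```r
set.seed(20260509)

# Simulated panel: 50 units, 6 years, true slope of 2 (hypothetical setup).
n_units <- 50
n_years <- 6
df <- data.frame(
  id   = rep(1:n_units, each = n_years),
  year = rep(1:n_years, times = n_units)
)
unit_effect <- rnorm(n_units)[df$id]
df$x <- rnorm(nrow(df)) + 0.5 * unit_effect   # x is correlated with the unit effect
df$y <- 2 * df$x + unit_effect + rnorm(nrow(df))

# Slow route: estimate one dummy per unit.
b_dummies <- coef(lm(y ~ x + factor(id), data = df))[["x"]]

# Absorption: subtract unit means, then run a regression with no dummies.
df$y_w <- df$y - ave(df$y, df$id)
df$x_w <- df$x - ave(df$x, df$id)
b_within <- coef(lm(y_w ~ x_w - 1, data = df))[["x_w"]]

all.equal(b_dummies, b_within)
```

The dummy route builds a design matrix with one column per unit, which is what becomes slow with many units; the demeaned route never does, and that is where the speed gain comes from.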
For the largest jobs (millions of observations, dozens of FEs), neither Stata nor R is the right tool. That is where you write a parquet pipeline in Python or DuckDB and only pass the regression-ready panel to your econometrics tool. You will not hit this scale for a long time in blended-finance work, but it is worth knowing it exists.
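DuckDB can also be driven from an R session, so the "aggregate first, regress second" pattern can be sketched without leaving the language used in this chapter. This is a minimal sketch assuming the DBI and duckdb packages are installed. The table, its values, and the parquet path mentioned in the comment are hypothetical; a toy data frame stands in for the large source file.

```r
library(DBI)

# In-memory DuckDB database (requires the duckdb package).
con <- dbConnect(duckdb::duckdb())

# Toy transaction-level data standing in for a large administrative file.
# With real data you would query the file in place instead, for example
# FROM read_parquet('data/raw/transactions.parquet')  (hypothetical path).
transactions <- data.frame(
  municipality = c(1, 1, 1, 2, 2),
  year         = c(2019, 2019, 2020, 2019, 2020),
  amount       = c(10, 30, 25, 5, 15)
)
dbWriteTable(con, "transactions", transactions)

# Collapse to the municipality-year panel inside the database, so only the
# regression-ready table comes back into R.
panel <- dbGetQuery(con, "
  SELECT municipality, year,
         SUM(amount) AS total_amount,
         COUNT(*)    AS n_tx
  FROM transactions
  GROUP BY municipality, year
  ORDER BY municipality, year
")

dbDisconnect(con, shutdown = TRUE)
panel
```

The resulting panel is small enough to pass straight to feols() or to save as a .dta for a Stata colleague.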
The worked examples in subsequent chapters give both R and Stata code, side by side. Run them both. Get fluent in the translation. That fluency is what makes you the person on the team who can bridge between the World Bank’s Stata pipeline and the Athey-lab causal-forest workflow. That bridging skill is worth its weight in salary band.