Optimal transport and effective distance

Why an applied researcher should care

Most spatial-econometric questions are flow problems. A drought in one province pushes labor to another. A sanctions package on one country redirects trade through a third. A financial shock at one bank cascades through correlated counterparties. Each of these is a transmission problem on a network, and each has a conservation law sitting underneath it (what leaves one node must arrive somewhere, eventually).

Two practical issues come out of this immediately.

First, geographic distance is the wrong variable in a connected economy. A Lisbon-Maputo trade flow does not travel along the great-circle path. It travels through a Lusophone network of credit lines, shipping schedules, language-of-contract, diaspora remittance pipes, and historical commercial relationships. A regression that “controls for distance” using kilometers is mis-specified relative to the actual transmission geometry.

Second, comparing two distributions without imposing a parametric form is harder than it looks. If you want to know whether a blended-finance intervention shifted the lower tail of the rural income distribution, a t-test on means will miss it, a quantile regression at the 10th percentile will pick up some of it, and a full Kolmogorov-Smirnov test will tell you the distributions differ but not by how much. The Wasserstein distance gives you a single, interpretable number with units: the average distance you have to move probability mass to convert one distribution into the other.

Both problems have the same mathematical answer, which is optimal transport (OT). The chapter develops the intuition, the algorithm that made it tractable, and two worked examples drawn from problems you are likely to face.

Optimal transport in one paragraph

Given two distributions \mu (source) and \nu (target), and a ground cost d(x,y) that measures the cost of moving a unit of mass from point x to point y, the optimal-transport problem asks for the cheapest way to rearrange \mu into \nu. Formally:

W_p(\mu, \nu) = \left(\inf_{\pi \in \Pi(\mu, \nu)} \int d(x,y)^p \, d\pi(x,y)\right)^{1/p}

where \Pi(\mu, \nu) is the set of joint distributions (couplings) with marginals \mu and \nu. The infimum is taken over all valid transport plans, and a transport plan \pi(x,y) specifies how much mass moves from x to y. For p=1 this is the Earth-mover’s distance (the 1-Wasserstein distance, W_1). For p=2 it is the 2-Wasserstein distance, which has nicer geometric properties: the quadratic cost is strictly convex, so when \mu is absolutely continuous the optimal plan is unique and is given by a map (Brenier’s theorem).

The two ingredients are the marginal constraints (the plan must have the right row and column sums) and the cost. Everything else is bookkeeping.
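To make the two ingredients concrete, here is a minimal sketch that builds a transport plan for two small discrete distributions on a line and checks the marginal constraints. The support points, the masses, and the helper name monotone_coupling are illustrative assumptions, not part of any library. The north-west-corner construction is optimal only in one dimension, with sorted supports and a convex cost such as |x - y|.

```python
import numpy as np

# Illustrative data (assumed): three support points shared by both distributions.
x = np.array([0.0, 1.0, 2.0])
mu = np.array([0.5, 0.3, 0.2])            # source masses
nu = np.array([0.2, 0.3, 0.5])            # target masses
cost = np.abs(x[:, None] - x[None, :])    # ground cost d(x_i, y_j) = |x_i - y_j|

def monotone_coupling(a, b, tol=1e-12):
    """North-west-corner rule: fill the plan starting from the top-left cell.

    With both supports sorted ascending this gives the monotone coupling,
    which is optimal in 1-D for convex costs. Hypothetical helper.
    """
    a, b = a.astype(float).copy(), b.astype(float).copy()
    plan = np.zeros((len(a), len(b)))
    i = j = 0
    while i < len(a) and j < len(b):
        moved = min(a[i], b[j])     # ship as much as source i has and target j needs
        plan[i, j] = moved
        a[i] -= moved
        b[j] -= moved
        if a[i] <= tol:             # source i is exhausted
            i += 1
        if b[j] <= tol:             # target j is filled
            j += 1
    return plan

plan = monotone_coupling(mu, nu)
independent = np.outer(mu, nu)      # also a valid coupling, but not the cheapest

print(plan.sum(axis=1), plan.sum(axis=0))   # row sums match mu, column sums match nu
print((plan * cost).sum())                  # cost of the monotone plan
print((independent * cost).sum())           # cost of the independent coupling
```

Both matrices satisfy the marginal constraints; only the cost separates them. In higher dimensions there is no such shortcut and a linear-programming or Sinkhorn solver is needed.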

The Kantorovich dual

The primal problem above is a linear program when the distributions are discrete. Its dual involves two Kantorovich potentials \phi(x) and \psi(y) subject to \phi(x) + \psi(y) \leq d(x,y):

W_1(\mu, \nu) = \sup_{\phi,\psi}\left( \int \phi \, d\mu + \int \psi \, d\nu \right).

The economic reading is direct. Imagine a third-party shipper who charges \phi(x) to pick up a unit of mass at source x and \psi(y) to deliver it at destination y, with the constraint that the total charge \phi(x) + \psi(y) never exceeds the cost d(x,y) of shipping directly. The shipper’s maximum revenue equals the producer’s minimum cost: that is the strong-duality result. Many OT solvers, Sinkhorn included, work with the dual potentials, since the constraint set is convenient and the variables decouple along the marginals.

A second reading, useful for econometricians: the Kantorovich potentials \phi, \psi are the (Lagrangian) shadow prices on the marginal constraints. They tell you how a small perturbation to the source or target distribution changes the total transport cost. In a structural model that produces \mu and \nu as equilibrium objects, the potentials are exactly the welfare derivatives with respect to local mass shifts. Galichon’s 2016 Optimal Transport Methods in Economics develops this connection across labor matching, hedonic models, and trade.
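A small numerical check of strong duality may help fix ideas. The sketch below uses an assumed three-point example on a line with cost |x - y|, a hand-written feasible plan, and a hand-picked pair of potentials. None of this comes from a solver; the point is only that a feasible dual pair whose value matches the cost of a feasible plan certifies that both are optimal.

```python
import numpy as np

# Illustrative data (assumed): same support for source and target.
x = np.array([0.0, 1.0, 2.0])
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
cost = np.abs(x[:, None] - x[None, :])

# A feasible transport plan, written by hand: rows sum to mu, columns to nu.
plan = np.array([[0.2, 0.3, 0.0],
                 [0.0, 0.0, 0.3],
                 [0.0, 0.0, 0.2]])

# Candidate Kantorovich potentials: phi(x) = -x at the source, psi(y) = y at the target.
phi = -x
psi = x

# Dual feasibility: phi(x_i) + psi(y_j) <= d(x_i, y_j) for every pair.
feasible = bool(np.all(phi[:, None] + psi[None, :] <= cost + 1e-12))

primal_value = (plan * cost).sum()     # cost of the plan
dual_value = phi @ mu + psi @ nu       # value of the dual objective

print(feasible, primal_value, dual_value)
```

Because the two values coincide and both candidates are feasible, neither can be improved. The potentials also show the shadow-price reading: shifting a little mass of \mu from x = 0 to x = 1 changes the dual value by \phi(1) - \phi(0) = -1 per unit moved.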

Entropic regularization (Sinkhorn-Cuturi 2013)

The primal LP scales as O(n^3 \log n) in the number of support points using interior-point methods. That is fine for n = 200, slow at n = 2{,}000, and hopeless at n = 200{,}000. Marco Cuturi’s 2013 paper changed this by adding an entropic regularizer:

W_\epsilon(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \int d(x,y) \, d\pi(x,y) - \epsilon H(\pi),

where H(\pi) = -\int \pi(x,y) \log \pi(x,y) \, dx \, dy is the entropy of the coupling. The regularized problem is strictly convex and admits a closed-form structure: the optimal plan is \pi^*_{ij} = u_i K_{ij} v_j where K_{ij} = e^{-d(x_i, y_j)/\epsilon} and u, v are scaling vectors satisfying the marginal constraints. The Sinkhorn-Knopp algorithm alternates between fixing u to satisfy the row marginal and fixing v to satisfy the column marginal. Each iteration is O(n^2) matrix-vector multiplication, and convergence is geometric.
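The alternating scaling described above fits in a few lines of NumPy. This is a minimal sketch under stated assumptions: the function name sinkhorn_plan, the toy marginals, and the value of \epsilon are all illustrative, and there is no log-domain stabilization, so it should not be used with very small \epsilon.

```python
import numpy as np

def sinkhorn_plan(a, b, C, eps, n_iter=5000, tol=1e-10):
    """Entropic OT plan via Sinkhorn-Knopp scaling (hypothetical helper, unstabilized)."""
    K = np.exp(-C / eps)                  # Gibbs kernel K_ij = exp(-d_ij / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)                   # rescale rows to match the source marginal
        v = b / (K.T @ u)                 # rescale columns to match the target marginal
        plan = u[:, None] * K * v[None, :]
        if np.abs(plan.sum(axis=1) - a).max() < tol:
            break                         # row marginal is satisfied too: converged
    return plan

# Illustrative data (assumed): three points on a line, cost |x - y|.
x = np.array([0.0, 1.0, 2.0])
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
C = np.abs(x[:, None] - x[None, :])

plan_eps = sinkhorn_plan(mu, nu, C, eps=0.5)
print(plan_eps.round(4))
print(plan_eps.sum(axis=1), plan_eps.sum(axis=0))   # both marginals are matched
print((plan_eps * C).sum())                         # transport cost of the blurred plan
```

Every entry of the regularized plan is strictly positive, which is the visible footprint of the entropy term. For real data, a library routine such as the ot.sinkhorn function mentioned in the text is the safer choice.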

This is the algorithmic shift that made OT a workable tool in applied work. In R the relevant packages are transport (exact and entropic) and T4transport (more solvers and statistical utilities). In Python the canonical library is POT (Python Optimal Transport) by Flamary and co-authors, which exposes ot.sinkhorn, ot.emd, ot.wasserstein_1d, and barycenter routines.

A practical warning that we return to at the end: the regularized cost is biased. As \epsilon \to 0 you recover the true Wasserstein distance, but at finite \epsilon the optimal plan is blurred toward the independent coupling, so W_\epsilon(\mu, \mu) is no longer zero and the size of the distortion varies with the pair of distributions being compared. For statistical inference use Sinkhorn divergences (Feydy et al. 2019), which correct for this bias by subtracting self-transport terms.

Wasserstein distance as a tool in applied econometrics

Four uses of W_p show up regularly in applied papers.

Distributional comparisons across regions or time. If you have household income distributions for 2010 and 2020 from the same survey, W_1 between them is the average dollar amount of redistribution required to move the 2010 distribution to the 2020 distribution. This is a real, interpretable quantity. A mean shift would tell you the average grew by some amount but not whether the growth was concentrated in the upper tail. The Wasserstein number absorbs the entire shape change.

Two-sample tests of equality. Under the null that two samples come from the same distribution, the bootstrap distribution of W_2 (or its square) has a known asymptotic behavior. This gives a distributional analog of the two-sample t-test that is robust to the shape of the underlying distributions. Several packages now implement this directly.
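One simple way to calibrate such a test, sketched here as an assumption rather than as the method of any particular package, is a permutation test: pool the two samples, reshuffle the labels, and recompute the statistic. The example uses W_1 rather than W_2 purely to keep the helper short. The function names and the simulated samples are illustrative.

```python
import numpy as np

def w1_1d(x, y):
    """1-D Wasserstein-1 between two empirical samples, as the area between their CDFs."""
    x, y = np.sort(x), np.sort(y)
    grid = np.sort(np.concatenate([x, y]))
    widths = np.diff(grid)
    # Empirical CDFs evaluated on each interval between consecutive pooled points.
    Fx = np.searchsorted(x, grid[:-1], side="right") / len(x)
    Fy = np.searchsorted(y, grid[:-1], side="right") / len(y)
    return float(np.sum(np.abs(Fx - Fy) * widths))

def permutation_pvalue(x, y, n_perm=199, seed=0):
    """Permutation p-value for H0: both samples come from the same distribution."""
    rng = np.random.default_rng(seed)
    observed = w1_1d(x, y)
    pooled = np.concatenate([x, y])
    exceed = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(pooled)          # reassign group labels at random
        if w1_1d(shuffled[:len(x)], shuffled[len(x):]) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (1 + n_perm)

# Simulated samples (assumed): a one-unit location shift.
rng = np.random.default_rng(1)
sample_a = rng.normal(0.0, 1.0, size=200)
sample_b = rng.normal(1.0, 1.0, size=250)

stat, p_value = permutation_pvalue(sample_a, sample_b)
print(stat, p_value)
```

The permutation approach is exact under exchangeability and needs no asymptotic theory, at the cost of recomputing the statistic n_perm times.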

Treatment-effect comparisons that go beyond mean and quantile. If a program raised the income of the middle of the distribution but did nothing for the bottom, the Wasserstein number between treatment and control captures this even if the mean shift is small. Quantile treatment effects give you the shape; the Wasserstein number gives you a single summary you can put in a table.

Goodness-of-fit testing for structural models. If a calibrated trade or labor model produces a distribution of firm sizes or wage levels, you can compute the Wasserstein distance between the model-implied distribution and the empirical one. This is a stricter test than matching moments because it accounts for the entire shape. Several recent papers in the structural-estimation literature use Wasserstein-based minimum-distance estimators in place of GMM when the moment selection feels ad hoc.

The price of all of this is computational. Exact OT in two dimensions or higher is expensive. Sinkhorn solves the cost problem at the price of bias, as described in the entropic-regularization subsection above; the Common traps subsection returns to the bias correction.

Effective distance (Brockmann-Helbing 2013)

Brockmann and Helbing’s 2013 Science paper introduced a network-theoretic notion of distance that has proven useful well beyond their epidemiological setting. The construction is short.

Take a network with directed edges. For each ordered pair (i, j) let P_{ij} be the probability of direct transmission from i to j, computed by normalizing the flow weights so that each source row sums to one:

P_{ij} = \frac{F_{ij}}{\sum_k F_{ik}}.

The effective distance from i to j is

d^{\text{eff}}_{ij} = 1 - \log P_{ij}.

The constant 1 is a convention that keeps every link’s distance strictly positive (the -\log P_{ij} term alone is already non-negative), so each additional link in a path adds at least one unit. The information content is in the -\log P_{ij} term, which is the surprisal of the transition: a high-probability link has effective distance close to the minimum of 1, a low-probability link has large effective distance, and a link with zero probability has infinite effective distance.

The pairwise definition extends to the network through shortest paths. The multi-step effective distance from i to j is

D^{\text{eff}}_{ij} = \min_{\text{paths } i \to j} \sum_{(u,v) \in \text{path}} d^{\text{eff}}_{uv}.

Brockmann and Helbing showed that for global air-travel data, D^{\text{eff}} predicted the arrival times of SARS and H1N1 with much higher fidelity than great-circle distance. The reason is that disease spreads along the actual flux of people, which the effective-distance construction approximates closely.

The translation to economic settings is direct. Replace passenger flows with trade flows, capital flows, remittance flows, or migration flows. The effective distance from Lisbon to Maputo is not the great-circle distance; it is the cheapest path through the network of language, colonial history, banking correspondence, and existing trade routes. Empirically the effective distance is often a much better predictor of bilateral economic outcomes than physical distance, especially for outcomes that depend on network intermediation.

The Helfrich-Gonchar contribution (SSRN 5202676)

The “Trade in the Spotlight” paper (Helfrich and Gonchar 2025, SSRN 5202676) extends the effective-distance logic to bilateral economic exposure. The setting is sanctions and shock transmission: when a target country gets hit, how do third countries’ exposures depend on their position in the trade network?

The headline observation is that two countries with low geographic distance but few intermediating network ties are more exposed to bilateral shocks than two countries with high geographic distance but rich intermediation. Intermediation is a buffer. A country with many indirect paths into the global system absorbs a direct shock to one of its partners by re-routing. A country with few indirect paths gets the full impact.

We operationalize this by computing effective-distance matrices over BACI bilateral trade flows year-by-year and measuring how a country’s effective-distance ranking shifts after a major sanctions event. The empirical strategy is a difference-in-differences on effective-distance positions, with country fixed effects and time effects. The reduced form is clean enough that you can see the network rearrangement in the raw data.

The mechanism is the network analog of the spatial-heterogeneity story in my thesis (Helfrich 2024, Georgia Tech): heterogeneity in network position matters for shock transmission in the same way heterogeneity in spatial position matters for trade gravity. Network-effective distance is the right covariate to use in place of great-circle distance when the outcome depends on flow rather than physical proximity. Trade gravity at the bilateral level is one such outcome. Migration flows are another. Cross-border lending is a third.

A short remark on the relationship to optimal transport proper. Effective distance is not the same object as the Wasserstein cost. It is a ground metric on the node space of a network, derived from observed flows. Once you have it, you can use it as the ground cost d(\cdot, \cdot) in an optimal-transport problem on the same node space. That nested structure is what makes the toolkit composable: effective distance gives you the right geometry, OT gives you the right way to compare distributions on that geometry.

The Penumbra / EffDist V2026 dataset

The EffDist V2026 dataset (in preparation, Zenodo deposit pending) builds effective distances at a global gridded population level. The inputs are WorldPop population grids and GHS-SMOD settlement classifications, aggregated to roughly 2° \times 2° cells. The output is a pairwise effective-distance matrix where edge weights come from gravity-style flow predictions calibrated against bilateral data where available.

For a rural-finance researcher this matters because you can ask whether a Portuguese rural region’s effective-distance to Lisbon is what predicts its financial-inclusion outcomes, rather than the great-circle distance. The two are not the same. A rural municipality in the Alto Alentejo has a low great-circle distance to Lisbon but a moderately high effective distance because the road network is sparse and most branch banking goes through Évora first. A municipality in the Algarve has a higher great-circle distance but a lower effective distance because of the seasonal-tourism road and rail capacity. In a regression of financial-inclusion outcomes on distance, swapping in effective distance often flips signs and changes magnitudes.

Worked example 1: comparing rural vs urban income distributions

Suppose we want to compare a rural household income sample to an urban one. We simulate the data so the answer is known.

# R, using `transport`
library(transport)

set.seed(2025)
rural <- rlnorm(2000, meanlog = 8.5, sdlog = 0.7)
urban <- rlnorm(3000, meanlog = 9.4, sdlog = 0.5)

# Wasserstein-1 between two 1-D empirical distributions
W1 <- wasserstein1d(rural, urban, p = 1)
W1
# Expect a value near 7,400 (same units as income, here euros);
# the exact figure depends on the random draw

# Bootstrap a 95% CI
B <- 500
W1_boot <- replicate(B, {
  r_b <- sample(rural, replace = TRUE)
  u_b <- sample(urban, replace = TRUE)
  wasserstein1d(r_b, u_b, p = 1)
})
quantile(W1_boot, c(0.025, 0.975))

The interpretation: W_1 is the average amount, in euros, by which a household’s income must be shifted to turn the rural distribution into the urban one; in this simulation that is roughly €7,400. The number is sensitive to the right tail, where the euro gap between rural and urban quantiles is widest.

The Python equivalent:

# Python, using POT
import numpy as np
import ot

rng = np.random.default_rng(2025)
rural = rng.lognormal(mean=8.5, sigma=0.7, size=2000)
urban = rng.lognormal(mean=9.4, sigma=0.5, size=3000)

# 1-D Wasserstein
W1 = ot.wasserstein_1d(rural, urban, p=1)
print(W1)  # roughly 7,400; the exact value depends on the draw

# Bootstrap CI
B = 500
boot = np.empty(B)
for b in range(B):
    rb = rng.choice(rural, size=len(rural), replace=True)
    ub = rng.choice(urban, size=len(urban), replace=True)
    boot[b] = ot.wasserstein_1d(rb, ub, p=1)
print(np.quantile(boot, [0.025, 0.975]))

For a figure, plot the two empirical CDFs on the same axes and annotate the Wasserstein distance in the corner. The visual reading is that W_1 is the integrated absolute difference between the two CDFs: W_1 = \int_0^\infty |F_{\text{rural}}(x) - F_{\text{urban}}(x)| \, dx. This is a useful identity to remember because it gives a direct geometric interpretation as the area between the two CDFs.
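The identity can be checked numerically. The sketch below computes W_1 two ways on simulated lognormal samples: as the area between the empirical CDFs, and as the mean gap between order statistics. The helper name cdf_area is hypothetical, and the two samples are given equal sizes (an assumption, unlike the 2000 and 3000 used above) so that the quantile form reduces to a one-line formula.

```python
import numpy as np

rng = np.random.default_rng(0)
rural = rng.lognormal(mean=8.5, sigma=0.7, size=2000)
urban = rng.lognormal(mean=9.4, sigma=0.5, size=2000)   # equal size: an assumption

def cdf_area(x, y):
    """Area between two empirical CDFs (hypothetical helper)."""
    x, y = np.sort(x), np.sort(y)
    grid = np.sort(np.concatenate([x, y]))
    widths = np.diff(grid)
    Fx = np.searchsorted(x, grid[:-1], side="right") / len(x)
    Fy = np.searchsorted(y, grid[:-1], side="right") / len(y)
    return float(np.sum(np.abs(Fx - Fy) * widths))

# CDF form: integrated absolute difference between the two step functions.
w1_area = cdf_area(rural, urban)

# Quantile form: with equal sample sizes, W_1 is the mean gap between order statistics.
w1_quantile = float(np.mean(np.abs(np.sort(rural) - np.sort(urban))))

# For reference: the absolute difference in sample means.
mean_gap = abs(urban.mean() - rural.mean())

print(w1_area, w1_quantile, mean_gap)
```

The two forms agree up to floating-point error. Since the area between the CDFs is at least the absolute difference in means, the printed W_1 cannot fall below the mean gap.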

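A mean-only summary would report E[\text{urban}] - E[\text{rural}] \approx \text{€}7{,}400 (the population means are about €13,700 and €6,300). In this simulation the Wasserstein number is essentially the same, because the urban quantile function lies above the rural one almost everywhere, and W_1 can never be smaller than the absolute difference in means. The two diverge when the CDFs cross: a mean-preserving change in shape leaves the mean gap at zero but produces a strictly positive W_1. A median comparison would miss the upper-tail story entirely.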

Worked example 2: effective-distance computation for a small trade network

A 5-country toy network suffices to make the construction concrete. The countries are labeled A, B, C, D, E. We hard-code a hypothetical matrix of bilateral trade flows.

# R, using `igraph`
library(igraph)

# Hypothetical bilateral flows; rows are origins, columns are destinations
flow <- matrix(c(
   0,  120,   40,    8,   20,
  90,    0,  150,   30,   12,
  35,  140,    0,  200,   60,
   5,   28,  220,    0,  180,
  18,   10,   50,  175,    0
), nrow = 5, byrow = TRUE,
dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# Row-normalize to get transition probabilities
P <- flow / rowSums(flow)

# Pairwise effective distance (direct links only); zero-flow pairs come out as Inf
d_eff_direct <- 1 - log(P)
diag(d_eff_direct) <- 0
d_eff_direct[!is.finite(d_eff_direct)] <- Inf

# Build a directed graph with edge weights = 1 - log P.
# A zero in the adjacency matrix means "no edge", so map non-finite entries to 0.
W <- d_eff_direct
W[!is.finite(W)] <- 0
g <- graph_from_adjacency_matrix(W, mode = "directed",
                                  weighted = TRUE, diag = FALSE)

# Shortest-path effective distance (multi-step)
D_eff <- distances(g, mode = "out")

The trick to remember: igraph::distances() operates on edge weights, and the right edge weight for effective distance is the full per-link term 1 - \log P_{ij}. Because the constant is added once per link, a path with m links carries +m, so it acts as a per-hop penalty rather than a single additive offset. Running shortest paths on -\log P_{ij} alone can therefore select different routes with more hops. Some practitioners drop the constant; if your reference does, the resulting distances are not comparable with the Brockmann-Helbing definition above.

A comparison to great-circle distance is informative. Suppose A and B are physically close but their bilateral trade probability P_{AB} is small (because trade re-routes through C, say). Then d^{\text{eff}}_{AB} may be much larger than d^{\text{eff}}_{AC} + d^{\text{eff}}_{CB}, meaning the shortest path from A to B goes through C. The multi-step effective distance correctly identifies C as the natural broker, even though geography would suggest a direct link.
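A Python sketch of the same computation makes the brokerage point checkable. It uses the toy flow matrix from the R block, the per-link distance 1 - \log P_{ij} defined earlier, and a plain Floyd-Warshall pass in NumPy instead of a graph library. That swap is a choice made for self-containment, not part of the original method.

```python
import numpy as np

labels = ["A", "B", "C", "D", "E"]
flow = np.array([
    [  0, 120,  40,   8,  20],
    [ 90,   0, 150,  30,  12],
    [ 35, 140,   0, 200,  60],
    [  5,  28, 220,   0, 180],
    [ 18,  10,  50, 175,   0],
], dtype=float)

# Row-normalize: P[i, j] is the share of i's outflow that goes to j.
P = flow / flow.sum(axis=1, keepdims=True)

# Direct effective distance; zero-flow pairs (here only the diagonal) become +inf.
with np.errstate(divide="ignore"):
    d_direct = 1.0 - np.log(P)
np.fill_diagonal(d_direct, 0.0)

# Floyd-Warshall: allow each node k in turn as an intermediate stop.
D_eff = d_direct.copy()
for k in range(len(labels)):
    D_eff = np.minimum(D_eff, D_eff[:, [k]] + D_eff[[k], :])

i, j, c = labels.index("D"), labels.index("A"), labels.index("C")
print(d_direct[i, j])                    # direct D -> A link
print(d_direct[i, c] + d_direct[c, j])   # route D -> C -> A
print(D_eff[i, j])                       # multi-step effective distance D -> A
```

In this toy matrix the route from D to A through C is shorter than the direct link, so C acts as the broker for that pair. Most other pairs are best served by their direct link because every extra hop costs at least one unit.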

This same code structure scales to the full 200-country bilateral matrix once you swap in BACI data. The pre-processing steps (handling missing flows, deciding what to do with zero entries, choosing a year aggregation) are the bulk of the work in practice.

Why this is useful for blended-finance research

Three places where these tools earn their keep in a blended-finance setting.

Banking access at the rural-region level. A rural municipality’s effective distance to the financial center predicts its banking access better than great-circle distance because a loan officer travels along an actual branch network rather than a straight line. If you regress a financial-inclusion outcome (account ownership, credit access, loan-to-deposit ratio) on great-circle distance with region fixed effects (municipality fixed effects would absorb a time-invariant distance entirely), you get noisy results. Swap in effective distance and the signal sharpens. This is especially true in countries with sparse rural infrastructure: Portugal’s interior, much of Mozambique’s central districts, parts of rural Brazil.

Cross-border capital flows. A Portuguese Sociedade Gestora de Investimentos may have substantially more deployment to a Mozambican project than to a closer Spanish project. Great-circle distance predicts the opposite. The effective distance, which weights the Lusophone network (shared language, shared accounting standards, established correspondent banking, diaspora and migration ties), gets the sign right. For replication: build an effective-distance matrix over historical capital-flow data and use it in a gravity regression of bilateral deployment by these management companies.

Distributional impact of programs. A blended-finance program that subsidizes rural lending may shift the bottom of the rural income distribution without moving the mean much. A Wasserstein test on pre-post or treatment-control distributions captures this. The right pipeline is: estimate the empirical distributions in treated and control samples, compute W_1 between them, bootstrap a confidence interval, report both the number (in dollars or local currency) and a CDF-overlay plot. This is a more honest answer to the “did the program shift the bottom” question than a quantile regression at the 10th percentile alone.

Common traps

Mistaking Wasserstein for a centering measure. W_p(\mu, \nu) is a metric (distance) in the space of probability distributions. It is not an estimator of a parameter, and it has no “null value” except zero (when \mu = \nu). When you report it, report it as a distance with units, not as a point estimate of some underlying quantity.

Different supports. If \mu has support on incomes from 0 to 50,000 and \nu has support from 0 to 500,000, the Wasserstein distance is large because the upper tail of \nu is far from any point in \mu. This is correct but can mislead if you wanted to compare the bulk. Pre-process to a common support (truncate at a common quantile) or use a truncated variant if you only care about a region of the distribution.

Sinkhorn bias. The entropic-regularized Wasserstein is biased. As \epsilon increases, the regularizer penalizes low-entropy plans more heavily, so the optimal plan spreads mass toward the independent coupling: the transport cost evaluated at that plan overstates the true distance, and W_\epsilon(\mu, \mu) is no longer zero. For statistical inference (CIs, hypothesis tests) use Sinkhorn divergences (Feydy et al. 2019), which subtract the self-transport bias:

\text{SD}_\epsilon(\mu, \nu) = W_\epsilon(\mu, \nu) - \tfrac{1}{2} W_\epsilon(\mu, \mu) - \tfrac{1}{2} W_\epsilon(\nu, \nu).

This has the property of being zero when \mu = \nu and is the right object to bootstrap or to use as a test statistic.
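The debiasing formula can be sketched directly from the definition of W_\epsilon given earlier (transport cost minus \epsilon times the entropy of the plan). The helper names, the three-point example, and the value of \epsilon are assumptions for illustration; the Sinkhorn loop is unstabilized and runs a fixed number of iterations.

```python
import numpy as np

def entropic_cost(a, b, C, eps, n_iter=5000):
    """W_eps(a, b) = <C, plan> - eps * H(plan), with the plan from Sinkhorn scaling."""
    K = np.exp(-C / eps)
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    plan = u[:, None] * K * v[None, :]
    neg_entropy = np.sum(plan * np.log(plan))   # equals -H(plan); plan > 0 entrywise here
    return float(np.sum(plan * C) + eps * neg_entropy)

def sinkhorn_divergence(a, b, C, eps):
    """SD_eps(a, b) for two distributions on the same support (hypothetical helper)."""
    return (entropic_cost(a, b, C, eps)
            - 0.5 * entropic_cost(a, a, C, eps)
            - 0.5 * entropic_cost(b, b, C, eps))

# Illustrative data (assumed): three points on a line, cost |x - y|.
x = np.array([0.0, 1.0, 2.0])
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])
C = np.abs(x[:, None] - x[None, :])
eps = 0.5

self_cost = entropic_cost(mu, mu, C, eps)        # not zero: this is the bias
sd_same = sinkhorn_divergence(mu, mu, C, eps)    # zero by construction
sd_diff = sinkhorn_divergence(mu, nu, C, eps)

print(self_cost, sd_same, sd_diff)
```

The self-transport term is the quantity the divergence removes; comparing self_cost with sd_same shows why the raw regularized cost is a poor test statistic.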

Effective distance vs great-circle distance in regressions. If a paper says “controls for distance”, check the data appendix to see which one. Many published trade and migration papers use great-circle distance as a default. If the question is about network-mediated transmission (sanctions, finance, supply chains, contagion), the great-circle control is often the wrong one. The estimated coefficient on the variable of interest may absorb network-mediation effects that should have been controlled for separately.

Zero flows. Effective distance is undefined when P_{ij} = 0. Common fixes: add a small uniform prior so all entries are positive (with a sensitivity check at multiple smoothing strengths), or restrict to the largest connected component of the graph and treat unreachable pairs as having infinite effective distance. The choice matters for sparse networks and should be documented in the data appendix.

Confusing Wasserstein-1 and Wasserstein-2. W_1 has the CDF identity above and is less sensitive to outliers than W_2, which squares the transport distance. W_2 is the metric that has a Riemannian structure on the space of distributions (the McCann interpolation, Brenier’s theorem). For two-sample testing W_2^2 is the standard choice because its asymptotic distribution is better understood. For interpretable reporting in dollars or kilometers, W_1 is usually what you want.

References

Acemoglu, D., Carvalho, V., Ozdaglar, A., and Tahbaz-Salehi, A. (2012). “The Network Origins of Aggregate Fluctuations.” Econometrica 80(5), 1977-2016.

Brockmann, D., and Helbing, D. (2013). “The hidden geometry of complex network-driven contagion phenomena.” Science 342, 1337-1342.

Cuturi, M. (2013). “Sinkhorn Distances: Lightspeed Computation of Optimal Transport.” NeurIPS 26.

Feydy, J., Séjourné, T., Vialard, F.-X., Amari, S., Trouvé, A., and Peyré, G. (2019). “Interpolating between optimal transport and MMD using Sinkhorn divergences.” AISTATS.

Flamary, R., Courty, N., et al. (2021). “POT: Python Optimal Transport.” Journal of Machine Learning Research 22(78), 1-8.

Galichon, A. (2016). Optimal Transport Methods in Economics. Princeton University Press.

Helfrich, I. (2024). Network Centrality, Spatial Heterogeneity, and Optimal Transport in International Trade. PhD thesis, Georgia Institute of Technology.

Helfrich, I., and Gonchar, E. (2025). “Trade in the Spotlight: Effective-Distance Exposure and Bilateral Shock Transmission.” SSRN Working Paper 5202676.

Peyré, G., and Cuturi, M. (2019). Computational Optimal Transport. Foundations and Trends in Machine Learning 11(5-6), 355-607.

Pooley, J. (2022). “Network Distance and Bilateral Migration: Evidence from Effective-Distance Gravity.” Journal of International Economics (working paper draft cited in the network-migration literature).

Santambrogio, F. (2015). Optimal Transport for Applied Mathematicians. Birkhäuser.

Villani, C. (2009). Optimal Transport: Old and New. Grundlehren der mathematischen Wissenschaften, Springer.

Wang, R., and Wang, X. (2023). “Wasserstein-Based Inference for Distributional Treatment Effects.” Working paper, applied in development economics and program evaluation.