Time series: ARMA, stationarity, cointegration

Why time series is different

A cross-section of countries, firms, or people gives us n observations that we treat as roughly independent draws from some population. The standard tools (OLS standard errors, t-tests, the central limit theorem) rest on that independence. Time series breaks it. A measurement today depends on yesterday’s measurement. The unit of observation (one year of GDP, one month of unemployment, one minute of stock price) carries information about its neighbors. Treating these as independent draws produces standard errors that are too small, t-statistics that are too large, and “significant” findings that vanish on out-of-sample data.

Serial correlation is the rule, not a pathology. The whole apparatus of time series methods is built around two questions. First, how do we describe the dependence between y_t and its past? Second, how do we estimate parameters and quantify uncertainty when the observations are not independent? The autoregressive moving-average (ARMA) family answers the first question parametrically. Maximum likelihood (under stationarity) and method-of-moments (Yule-Walker) answer the second.

Stationarity

A stochastic process \{y_t\} is strictly stationary if the joint distribution of (y_{t_1}, \ldots, y_{t_k}) is invariant to time shifts: identical for (y_{t_1 + h}, \ldots, y_{t_k + h}) for any h. Strict stationarity is rarely testable directly. The working concept is weaker.

A process is weakly stationary (or second-order stationary, or covariance stationary) if its first two moments do not depend on time: E[y_t] = \mu \text{ for all } t, \quad \text{Var}(y_t) = \sigma^2 \text{ for all } t, \quad \text{Cov}(y_t, y_{t+h}) = \gamma(h) the covariance depending only on the lag h and not on the time origin t. Most identification and estimation in the ARMA framework presupposes weak stationarity. When it fails (a unit root, a structural break, a trending mean), the inferential apparatus collapses, and standard t-tests can produce spurious significance even when no relationship exists.

Two routes restore stationarity in practice. Differencing y_t \to \Delta y_t = y_t - y_{t-1} removes a unit root. Detrending (regressing on t and taking residuals) removes a deterministic trend. The two operations are not equivalent and lead to different long-run implications, which is why distinguishing trend-stationary from difference-stationary processes is a central diagnostic task.

ACF and PACF

The autocovariance function is \gamma(h) = \text{Cov}(y_t, y_{t+h}), and the autocorrelation function (ACF) is its standardization \rho(h) = \gamma(h) / \gamma(0). The sample ACF replaces population quantities with their empirical analogs: \hat\rho(h) = \frac{\sum_{t=1}^{n-h} (y_t - \bar y)(y_{t+h} - \bar y)}{\sum_{t=1}^{n} (y_t - \bar y)^2}

The partial autocorrelation function (PACF) at lag h is the correlation between y_t and y_{t+h} after removing the linear effect of the intermediate lags. Equivalently, it is the last coefficient in a regression of y_t on y_{t-1}, \ldots, y_{t-h}.

Reading the two functions together is the Box-Jenkins identification heuristic. For a pure AR(p) process, the PACF cuts off after lag p (the partial correlations are zero beyond lag p) and the ACF decays geometrically. For a pure MA(q) process, the ACF cuts off after lag q and the PACF decays geometrically. For a mixed ARMA, both decay. The cutoff patterns guide the initial choice of (p, q). The choice is then refined by comparing residual diagnostics and information criteria (AIC, BIC) across nearby candidate orders.

Approximate \pm 1.96 / \sqrt{n} confidence bands are routinely overlaid on the sample ACF and PACF to indicate which lags are statistically distinguishable from zero. The bands are derived under the null that the data are white noise, so they over-state precision for series that have any genuine autocorrelation.

AR, MA, ARMA

The autoregressive process of order p is y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t with \varepsilon_t i.i.d. white noise of mean zero and variance \sigma^2. Stationarity requires the roots of 1 - \phi_1 z - \cdots - \phi_p z^p = 0 to lie outside the unit circle. For AR(1), this reduces to |\phi_1| < 1. When \phi_1 = 1, the process is a random walk with no mean reversion; this is a unit root.

The moving-average process of order q is y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} An MA process is stationary by construction (it is a finite linear combination of white noise). Invertibility (the ability to write \varepsilon_t as a convergent series in past y’s) requires the roots of 1 + \theta_1 z + \cdots + \theta_q z^q = 0 to lie outside the unit circle.

The ARMA(p, q) model combines both: y_t = c + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} ARMA models are dense in the space of stationary stochastic processes (Wold’s decomposition), which is the theoretical reason they appear in nearly every applied time-series analysis. The Box-Jenkins heuristic identifies a candidate (p, q) from sample ACF/PACF patterns, estimates the coefficients, and diagnoses the residuals.

Estimation

Three estimation routes are standard.

Conditional MLE. Treat the first p observations as given and maximize the conditional likelihood of y_{p+1}, \ldots, y_n. For an AR(p) model with Gaussian innovations, this reduces to OLS of y_t on its p lags. Fast and consistent.

Full (unconditional) MLE. Use the joint density of all n observations, including the marginal distribution of the initial values implied by stationarity. Slightly more efficient than conditional MLE, especially in small samples. The Kalman filter computes the likelihood efficiently for general ARMA models.

Yule-Walker (method of moments). For an AR(p) model, the autocovariance function satisfies the Yule-Walker equations \gamma(h) = \phi_1 \gamma(h - 1) + \cdots + \phi_p \gamma(h - p), \quad h \geq 1 Replace population autocovariances with their sample analogs and solve the linear system for (\hat\phi_1, \ldots, \hat\phi_p). The Yule-Walker estimator is consistent and asymptotically normal. It is the default in some older software (and in many textbook examples) because it is computationally trivial.

In modern practice, conditional or full MLE via state-space representation (the Kalman filter) is the default. The three estimators agree asymptotically; differences in finite samples are small for moderately long series.

Forecasting

The optimal h-step-ahead forecast (in minimum mean squared error) is the conditional expectation \hat y_{t+h \mid t} = E[y_{t+h} \mid y_t, y_{t-1}, \ldots] For an AR(1) with mean \mu = c / (1 - \phi), \hat y_{t+h \mid t} = \mu + \phi^h (y_t - \mu) The forecast converges geometrically to the unconditional mean as h grows. The forecast variance is \text{Var}(y_{t+h} \mid y_t) = \sigma^2 \frac{1 - \phi^{2h}}{1 - \phi^2} which converges to the unconditional variance \sigma^2 / (1 - \phi^2) as h \to \infty. A 95 percent forecast interval is approximately \hat y_{t+h \mid t} \pm 1.96 \sqrt{\text{Var}(y_{t+h} \mid y_t)}, with the variance plugged in from the estimated \hat\sigma^2 and \hat\phi.

Forecast intervals that condition on a fixed parameter estimate underestimate true uncertainty because they ignore estimation error in \hat\phi and \hat\sigma^2. Bayesian methods propagate this uncertainty automatically. Bootstrap methods can do the same for frequentist intervals. The discrepancy matters most for long horizons and small samples.

A practical truth: ARMA forecasts mean-revert quickly. At long horizons, the forecast is just the unconditional mean and the interval width is the unconditional standard deviation. ARMA is good at short-horizon point forecasting and weak at long-horizon prediction except when the long-run mean is itself the quantity of interest.

Unit roots and ARIMA

A process with a unit root has no finite long-run mean. The classic example is the random walk y_t = y_{t-1} + \varepsilon_t which has \phi = 1 and accumulates innovations forever. Its variance grows linearly in t, and y_t wanders without drift. Differencing once produces a stationary white noise series; differencing more is overkill and induces a non-invertible MA structure.

The autoregressive integrated moving average model ARIMA(p, d, q) is an ARMA(p, q) applied to the d-th difference of y_t. The integer d is the order of integration: d = 0 for stationary, d = 1 for a single unit root, rarely d = 2 for the difference of a unit root. Pre-testing for the order of integration is a routine first step.

Dickey-Fuller and ADF tests. The Dickey-Fuller (DF) test regresses \Delta y_t = (\phi - 1) y_{t-1} + \varepsilon_t and tests H_0: \phi - 1 = 0 (unit root) against H_1: \phi - 1 < 0 (stationary). Critical values are nonstandard because the t-statistic does not have a standard normal distribution under the null. The augmented Dickey-Fuller (ADF) test adds lagged differences to absorb serial correlation in \varepsilon_t: \Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \sum_{i=1}^{p} \delta_i \Delta y_{t-i} + \varepsilon_t with the null H_0: \gamma = 0. The number of lags is chosen by AIC, BIC, or a sequential testing rule (Schwert’s p_{\max}).

KPSS test. The Kwiatkowski-Phillips-Schmidt-Shin test reverses the null: H_0 is stationarity, H_1 is a unit root. The two tests complement each other. ADF and KPSS pointing in the same direction is decisive; disagreement signals borderline cases where the result depends on assumptions about the deterministic trend.

Both tests are notorious for low power against close-to-stationary alternatives (\phi near but below 1). A series with \phi = 0.95 over n = 100 observations will fail to reject the unit root null much of the time even though it is stationary. Confirming the test result with the visual diagnostic of the ACF (which decays slowly for unit-root processes and quickly for stationary ones) is good practice.

Cointegration

Two non-stationary series can be linked by a stationary linear combination. Engle and Granger (1987) named this cointegration. If x_t and y_t are both I(1) but z_t = y_t - \beta x_t is I(0), the series are cointegrated. The long-run equilibrium y_t \approx \beta x_t holds despite both quantities individually drifting without bound.

Engle-Granger two-step procedure. First, run OLS of y_t on x_t and obtain residuals \hat u_t. Second, test \hat u_t for a unit root using ADF with adjusted critical values (the residuals are estimated, which changes the distribution under the null). If the unit root is rejected, the series are cointegrated.

Johansen procedure. A multivariate generalization based on vector autoregressions. For k-dimensional y_t, write the VAR in error-correction form \Delta y_t = \Pi y_{t-1} + \sum_{i=1}^{p-1} \Gamma_i \Delta y_{t-i} + \varepsilon_t The rank of \Pi equals the number of cointegrating relations. Johansen’s trace and maximum-eigenvalue tests evaluate the null hypotheses of rank r against rank > r, sequentially. The procedure delivers all cointegrating vectors simultaneously and is the default for systems of more than two series.

Error-correction model. Once cointegration is established, the dynamics of adjustment back to equilibrium are captured by \Delta y_t = \alpha (y_{t-1} - \beta x_{t-1}) + \gamma \Delta x_t + \varepsilon_t The coefficient \alpha < 0 on the lagged equilibrium error measures the speed of adjustment. The model embodies the Granger representation theorem: cointegrated I(1) variables admit an error-correction representation, and any error-correction model implies cointegration.

Spurious regression (Granger and Newbold, 1974) is the cautionary partner to cointegration. Regressing two independent random walks on each other produces an R^2 that wanders toward 1 as the sample grows, with t-statistics that diverge. The cointegration framework distinguishes the spurious case (no cointegration, regression meaningless) from the meaningful case (cointegration present, OLS coefficients superconsistent).

Volatility: a preview

ARMA models the conditional mean. Many financial and macroeconomic series exhibit time-varying conditional variance: quiet periods alternate with turbulent ones, and squared returns are autocorrelated even when returns themselves are not. The GARCH(p, q) model (Bollerslev, 1986) writes \sigma_t^2 = \omega + \sum_{i=1}^{q} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{p} \beta_j \sigma_{t-j}^2 and lets \varepsilon_t = \sigma_t z_t with z_t i.i.d. standard normal. Extensions include EGARCH (asymmetric response to positive and negative shocks), GJR-GARCH, IGARCH (integrated volatility), and multivariate GARCH (DCC, BEKK). The standard book-length reference is Tsay’s Analysis of Financial Time Series. We mention these models here only to flag that the conditional-mean apparatus of ARMA is not the end of the story for financial data.

Worked example: AR(1) fit, forecast, and unit-root test

Simulate n = 200 observations from an AR(1) with \phi = 0.7 and mean zero. Fit an AR(1), forecast 10 steps ahead with a 95 percent interval, and then simulate a separate random walk and test for a unit root.

Excel

Excel has no built-in ARIMA estimation. The simulation, OLS estimate of \phi, and forecast can be built in cells, but the unit-root test requires non-standard critical values that have to be entered manually.

Step 1: simulate AR(1) with phi = 0.7
B1 (assumption): phi = 0.7
A1   =NORM.S.INV(RAND())                 ' y_1 ~ N(0,1)
A2   =$B$1*A1 + NORM.S.INV(RAND())       ' y_2 = phi*y_1 + eps
fill A2:A200 down (Ctrl+D)

Step 2: estimate phi by OLS of y_t on y_{t-1}
C1   =SLOPE(A2:A200, A1:A199)            ' phi_hat
C2   =STEYX(A2:A200, A1:A199)            ' residual SE -> sigma_hat

Step 3: forecast 10 steps ahead, starting from y_200 = A200
D1   =$C$1 * A200                        ' yhat_201
D2   =$C$1 * D1                          ' yhat_202
fill D1:D10
the forecast converges geometrically to 0 (the unconditional mean)

Step 4: forecast variance and 95% bounds
sigma^2 / (1 - phi^2) is the unconditional variance
E1   =C2^2 * (1 - C1^(2*1)) / (1 - C1^2)
E2   =C2^2 * (1 - C1^(2*2)) / (1 - C1^2)
fill E1:E10
F1   =D1 - 1.96*SQRT(E1)                 ' lower 95% bound
G1   =D1 + 1.96*SQRT(E1)                 ' upper 95% bound

Step 5: ACF at lag 1
H1   =CORREL(A1:A199, A2:A200)           ' approx 0.7

Random-walk simulation for the unit-root demo
I1   =NORM.S.INV(RAND())
I2   =I1 + NORM.S.INV(RAND())
fill I2:I200
the ADF test on column I cannot be done in Excel without
custom critical values; export to a real package

This is enough to teach what an AR(1) does. For inferential time-series work, use Stata, R, or Julia.

Stata

* Simulate AR(1), phi = 0.7
clear
set obs 200
set seed 20260512
gen t = _n
gen y = .
replace y = rnormal() in 1
forvalues i = 2/200 {
    replace y = 0.7*y[`i'-1] + rnormal() in `i'
}
tsset t

* Fit AR(1)
arima y, ar(1) nolog

* 10-step dynamic forecast with 95% interval
predict yhat, dynamic(201)
predict sehat, mse  // forecast MSE; sqrt gives forecast SE

* Unit-root tests on a separate random walk
set seed 20260513
gen u = .
replace u = rnormal() in 1
forvalues i = 2/200 {
    replace u = u[`i'-1] + rnormal() in `i'
}

* ADF and KPSS
dfuller u, lags(4)        // ADF
dfuller u, lags(4) trend  // ADF with deterministic trend
kpss u, maxlag(4)         // KPSS, opposite null

* For the AR(1) series y, both should reject the unit root
dfuller y, lags(4)

Stata’s arima command uses the Kalman filter for full MLE. The forecast standard error is reported via predict ..., mse. The dfuller and kpss commands take the maximum lag length as the main tuning parameter; an automatic selection rule (varsoc followed by judgment) is also available.

R

library(forecast)
library(tseries)
set.seed(20260512)

# Simulate AR(1)
n <- 200
y <- arima.sim(list(ar = 0.7), n = n)

# Fit
fit <- Arima(y, order = c(1, 0, 0), include.mean = FALSE)
summary(fit)            # estimated phi, sigma^2, AIC

# 10-step forecast with 95% PI
fc <- forecast(fit, h = 10, level = 95)
print(fc)
autoplot(fc)            # plot of forecast and interval

# Unit-root tests on a random walk
set.seed(20260513)
u <- cumsum(rnorm(n))
adf.test(u)             # null: unit root  -> typically fail to reject
kpss.test(u)            # null: stationary -> typically reject

# Should reject the unit root for the stationary y
adf.test(y)
kpss.test(y)

# A tidyverts alternative
library(fable)
library(tsibble)
ts_df <- tsibble(t = 1:n, y = as.numeric(y), index = t)
fit2 <- ts_df |> model(arima = ARIMA(y ~ pdq(1, 0, 0)))
fc2  <- fit2 |> forecast(h = 10)
fc2

The forecast package (Hyndman) and the newer fable package (tidyverts) are the workhorse R libraries. forecast::Arima keeps the same parameterization across auto.arima and manual specifications. auto.arima automates the Box-Jenkins identification step by searching over candidate orders and picking by AICc.

Julia

using Random, Distributions, Statistics, LinearAlgebra
Random.seed!(20260512)

# Simulate AR(1)
n, phi, sigma = 200, 0.7, 1.0
y = zeros(n)
y[1] = randn()
for t in 2:n
    y[t] = phi * y[t-1] + sigma * randn()
end

# Estimate phi by OLS on lagged y
X = y[1:end-1]
Y = y[2:end]
phi_hat = (X' * Y) / (X' * X)
resid = Y .- phi_hat .* X
sigma2_hat = sum(resid.^2) / (length(resid) - 1)

# 10-step forecast
h = 10
fc = zeros(h)
fc[1] = phi_hat * y[end]
for k in 2:h
    fc[k] = phi_hat * fc[k-1]
end

# Forecast variance at each horizon
fvar = [sigma2_hat * (1 - phi_hat^(2k)) / (1 - phi_hat^2) for k in 1:h]
lower = fc .- 1.96 .* sqrt.(fvar)
upper = fc .+ 1.96 .* sqrt.(fvar)

# A higher-level alternative with StateSpaceModels.jl
# using StateSpaceModels
# model = SARIMA(y; order = (1, 0, 0), include_mean = false)
# fit!(model)
# forec = forecast(model, h)

# ADF test on a random walk
Random.seed!(20260513)
u = cumsum(randn(n))
# Manual ADF: regress Delta u on u_{t-1} and report the t-stat on the lagged level.
# Critical values are nonstandard; compare to MacKinnon (2010).
# Or use HypothesisTests.jl:
# using HypothesisTests
# ADFTest(u, :constant, 4)

Julia’s time-series stack is younger than R’s or Stata’s. StateSpaceModels.jl provides ARIMA via state-space methods; ARCHModels.jl handles GARCH; HypothesisTests.jl implements the ADF test. For pedagogical purposes, writing the AR(1) recursion and OLS step by hand reveals exactly what is happening, which is the point of the exercise.

Common traps

Applying a t-test to serially correlated data. A regression with autocorrelated residuals has standard errors that are systematically wrong (usually too small). Newey-West (HAC) standard errors are the standard correction for moderate autocorrelation: they are consistent under heteroskedasticity and autocorrelation of unknown form. For strong persistence, even Newey-West can be inadequate, and cluster-robust or block-bootstrap methods are appropriate.

Spurious regression. Regressing one random walk on another, unrelated random walk produces “significant” t-statistics in roughly 75 percent of cases with n = 100 even though no real relationship exists. The remedy is to test for unit roots before running level regressions, difference the data if both series are I(1) and not cointegrated, or use the cointegration framework if they are.

Forgetting to test for a unit root before differencing. Over-differencing a stationary series induces a non-invertible MA component, inflates forecast variance, and discards low-frequency information. Under-differencing a unit-root series produces invalid inference. The order of integration is a fundamental property of the data and should be checked explicitly before estimation.

Reading the ACF and PACF too literally. Sample autocorrelations have substantial variance, and the \pm 1.96 / \sqrt{n} bands are constructed under the white-noise null, not under the null that the relevant cutoff lag is exactly zero. A “borderline” lag at the cutoff is often best handled by fitting both candidate models and comparing AIC or BIC.

Treating Box-Jenkins identification as automatic. Multiple seasonal cycles, structural breaks, and outliers can swamp ACF/PACF patterns. Inspect the series visually, account for known interventions, and consider seasonal differences (SARIMA) when a clear annual cycle is present.

Trusting one unit-root test in isolation. ADF has notoriously low power against near-unit-root alternatives. KPSS has the opposite vulnerability. Reporting both, and noting when they disagree, prevents over-confidence about the order of integration.

Reporting checklist

Show a time plot of the raw series. Always.
State the order of integration and the test(s) used to determine it (ADF, KPSS, Phillips-Perron).
For the fitted model: report (p, d, q), coefficient estimates with standard errors, residual variance, and the information criterion used to compare candidates.
Provide ACF and Ljung-Box diagnostics of the residuals to verify that no autocorrelation remains.
For forecasts: state the horizon, the type of interval (analytic or simulated), and whether parameter uncertainty was incorporated.
For multivariate work: state the lag order of the VAR, the cointegrating rank (if applicable), and the procedure used to determine it (Johansen, Engle-Granger).
Note the sample period and any data transformations (logs, growth rates, seasonal adjustment).

References

Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press. The standard graduate reference. Rigorous, comprehensive, and dense.
Box, G. E. P., Jenkins, G. M., Reinsel, G. C., & Ljung, G. M. (2015). Time Series Analysis: Forecasting and Control (5th ed.). Wiley. The original Box-Jenkins methodology, updated.
Tsay, R. S. (2010). Analysis of Financial Time Series (3rd ed.). Wiley. The standard reference for GARCH and financial econometrics.
Hyndman, R. J., & Athanasopoulos, G. (2021). Forecasting: Principles and Practice (3rd ed.). OTexts. Free online. Practical, modern, and aligned with the R forecast and fable packages.
Engle, R. F., & Granger, C. W. J. (1987). Co-integration and error correction: Representation, estimation, and testing. Econometrica, 55(2), 251 to 276.
Johansen, S. (1991). Estimation and hypothesis testing of cointegration vectors in Gaussian vector autoregressive models. Econometrica, 59(6), 1551 to 1580.
Dickey, D. A., & Fuller, W. A. (1979). Distribution of the estimators for autoregressive time series with a unit root. Journal of the American Statistical Association, 74(366), 427 to 431.
Kwiatkowski, D., Phillips, P. C. B., Schmidt, P., & Shin, Y. (1992). Testing the null hypothesis of stationarity against the alternative of a unit root. Journal of Econometrics, 54(1-3), 159 to 178.
Granger, C. W. J., & Newbold, P. (1974). Spurious regressions in econometrics. Journal of Econometrics, 2(2), 111 to 120.