Multivariate methods

Why multivariate

Most introductory statistics treats one variable at a time, or one outcome with several predictors. Many real problems have several outcomes at once, or so many predictors that the variables themselves are correlated and partly redundant. A consumer survey records dozens of attitudes about a product. A clinical study records biomarkers, vital signs, and patient-reported outcomes on every visit. A satellite returns 100 spectral bands at every pixel. Treating these dimensions one at a time throws away information about how they move together and risks compounding test-level error rates beyond recognition.

Multivariate methods address two problems at once. The first is dimension reduction: finding a small number of derived variables that capture most of the variation in the original ones. The second is structure discovery: grouping similar observations, identifying latent factors that drive observed correlations, or testing whether group means differ across a panel of correlated outcomes.

We work primarily in matrix algebra in this chapter. The data are n \times p: n observations on p variables, with columns standardized to mean zero (and often unit variance). The sample covariance matrix is S = \frac{1}{n - 1} (X - \bar X)^\top (X - \bar X) where \bar X is the row vector of column means broadcast across rows. Most of what follows reduces to algebra on S.

Principal component analysis

Principal component analysis (PCA) finds linear combinations of the original variables that capture maximum variance, subject to orthogonality constraints. The first principal component is the direction w_1 (a unit vector in \mathbb{R}^p) such that the projected data X w_1 has maximum variance. The second component is the direction with maximum variance among all directions orthogonal to w_1. And so on.

Derivation via the eigendecomposition

The variance of X w subject to \|w\| = 1 is w^\top S w. Using a Lagrange multiplier, \mathcal{L}(w, \lambda) = w^\top S w - \lambda (w^\top w - 1) Setting the gradient to zero gives S w = \lambda w. So w is an eigenvector of S and \lambda is its eigenvalue. The largest eigenvalue corresponds to the direction of maximum variance, the next largest to the second component, and so on.

The eigendecomposition S = W \Lambda W^\top gives us everything at once. The columns of W are the principal directions (loadings). The diagonal entries of \Lambda are the variances along each principal direction. The principal component scores are Z = X W The total variance is preserved: \sum_j \lambda_j = \text{tr}(S) = \sum_j \text{Var}(X_j). The fraction of variance explained by the first k components is \sum_{j=1}^k \lambda_j / \sum_{j=1}^p \lambda_j.

A practical equivalent uses the singular value decomposition X = U D V^\top. Then S = (n-1)^{-1} V D^2 V^\top, so the columns of V are the principal directions and the squared singular values are proportional to the eigenvalues. Most software computes PCA via SVD because it is numerically more stable than forming S explicitly.

How many components

Three heuristics dominate.

Scree plot. Plot eigenvalues in descending order. Look for an “elbow” where the curve flattens out. Keep components above the elbow. The criterion is subjective but easy to defend in writing.

Kaiser rule. Keep components with eigenvalues greater than 1 (after standardizing the variables). The rationale is that a component with \lambda < 1 explains less variance than a single original variable does. Often suggested in psychometric texts. Sometimes too aggressive in dropping components.

Cumulative variance. Keep enough components to exceed a target (often 80 or 90 percent of total variance).

In modern practice with p \gg 30 or so, cross-validation on a downstream task (regression, classification) often dominates the heuristics. Pick the number of components that optimizes a meaningful out-of-sample metric.

Factor analysis

Factor analysis (FA) and PCA produce visually similar output but answer different questions. PCA is a transformation: given the data, find directions of maximum variance. FA is a model: assume the observed variables are linear combinations of a small number of unobserved latent factors plus noise.

Formally, FA writes X = \mu + \Lambda F + \varepsilon where F is a k-vector of latent factors with mean zero and identity covariance, \Lambda is a p \times k matrix of factor loadings, and \varepsilon is a p-vector of idiosyncratic noise with diagonal covariance \Psi. The implied covariance structure of X is \Sigma = \Lambda \Lambda^\top + \Psi FA fits \Lambda and \Psi by maximum likelihood (assuming joint normality) or by iterative least squares, and rotates the factors after the fact (varimax, promax) for interpretability.

When to prefer FA over PCA: when you believe the data really are driven by a small number of latent constructs (intelligence, personality, attitude) and you want to recover those constructs as cleanly as possible. The diagonal \Psi formalizes the idea that some variance in each observed variable is idiosyncratic and should not contaminate the factor scores.

When to prefer PCA: when you want a tool for dimension reduction rather than a model of latent structure. PCA components are deterministic functions of the data; FA scores are estimated conditional on the model. PCA is simpler to defend in writing because it makes no distributional assumption.

MANOVA

Multivariate analysis of variance generalizes ANOVA to a vector-valued response. The setup: g groups, n_j observations in group j, p-dimensional response in each observation. The model is y_{ij} = \mu_j + \varepsilon_{ij}, \quad \varepsilon_{ij} \sim N_p(0, \Sigma) and the null is H_0: \mu_1 = \cdots = \mu_g.

The test statistic compares two p \times p sums of squares. The between-group matrix B measures dispersion of group means around the grand mean. The within-group matrix W measures pooled scatter around group means. In one-dimensional ANOVA, the F-statistic is the ratio of between to within. In multivariate ANOVA we need a scalar summary of a matrix ratio. Several are commonly reported:

Wilks’s lambda. \Lambda = |W| / |B + W|. Small values reject. A likelihood-ratio test in the normal model.

Pillai’s trace. V = \text{tr}(B (B + W)^{-1}). Approximately the average R^2 across canonical dimensions. Often the most robust to violations of normality.

Hotelling-Lawley trace. T = \text{tr}(B W^{-1}). Most powerful when groups separate along a single dimension.

Roy’s largest root. Largest eigenvalue of B W^{-1}. Most powerful when the alternative is a single dominant dimension of separation.

For two groups, all four reduce to Hotelling’s T^2: T^2 = n_1 n_2 / (n_1 + n_2) \cdot (\bar y_1 - \bar y_2)^\top S_{\text{pool}}^{-1} (\bar y_1 - \bar y_2) which has an F-distribution under the null.

Why bother? Because running p separate t-tests, one per dimension, inflates the family-wise error rate and ignores the correlation structure across outcomes. A single multivariate test holds error rate to the nominal level and gains power when group differences are correlated across dimensions.

Cluster analysis

Cluster analysis seeks groups of observations that are similar within and different across, without a label or outcome to learn from. Two families dominate applied work.

k-means

Fix the number of clusters k. Initialize k centroids (random data points, k-means++ initialization, etc.). Iterate:

  1. Assign each observation to the nearest centroid.
  2. Update each centroid to the mean of the observations assigned to it.

Continue until assignments stop changing. The algorithm minimizes the within-cluster sum of squares \sum_{j=1}^k \sum_{i \in C_j} \|x_i - \bar x_j\|^2 locally. It does not guarantee a global minimum: different starting points lead to different solutions. Standard practice is to run k-means many times with random starts and keep the best fit.

Choosing k is the hard part. The elbow method plots within-cluster SS against k and looks for a kink. The silhouette score combines two distances per observation (mean distance to its own cluster, mean distance to the nearest other cluster) into a value in [-1, 1]; higher is better. The gap statistic compares the within-cluster SS to that of a null reference distribution. None of these is perfect. In practice, examine the cluster structure at several values of k and pick the one that produces clusters you can interpret.

Hierarchical clustering

Build a tree (dendrogram) by either splitting (divisive) or merging (agglomerative). Agglomerative is more common: start with n singleton clusters, repeatedly merge the two closest clusters, until one cluster remains. The output is a dendrogram, which you cut at a chosen height to produce a flat clustering.

The choice of linkage determines what “closest” means.

  • Single linkage uses the minimum distance between cluster members. Produces long, stringy clusters (“chaining”). Often a bad default but useful when clusters are elongated.
  • Complete linkage uses the maximum distance. Produces compact, spherical clusters. Sensitive to outliers.
  • Average linkage uses the mean pairwise distance. A reasonable compromise.
  • Ward’s method merges the pair that minimizes the increase in within-cluster sum of squares. Tends to produce balanced clusters of similar size.

Hierarchical clustering does not require choosing k up front. It also does not require a Euclidean metric: any pairwise dissimilarity matrix works. The downside is computational cost (O(n^2) space, O(n^3) time for naive implementations) and instability to small perturbations of the data.

Multidimensional scaling

Given a matrix of pairwise dissimilarities D = (d_{ij}) among n objects, multidimensional scaling (MDS) finds a low-dimensional configuration z_1, \ldots, z_n \in \mathbb{R}^k whose pairwise Euclidean distances approximate D. Classical MDS solves this exactly when D is a Euclidean distance matrix: double-center D^2 to form a matrix B, then Z comes from the eigendecomposition of B.

If the data are in Euclidean coordinates to begin with, classical MDS on the Euclidean distance matrix produces exactly the PCA scores. The two methods are dual: PCA starts from the variable view, MDS starts from the distance view. For genuinely nonEuclidean dissimilarities (graph distances, perceptual differences), classical MDS approximates rather than reproduces. Nonmetric MDS (Kruskal, Shepard) targets the rank order of distances rather than their values and handles ordinal dissimilarity data.

Discriminant analysis

Discriminant analysis is supervised: each observation has a class label, and we want a rule that classifies new observations. Linear discriminant analysis (LDA) assumes each class density is multivariate normal with class-specific mean \mu_k and a common covariance \Sigma. The Bayes-optimal classification rule is \hat k(x) = \arg\max_k \, x^\top \Sigma^{-1} \mu_k - \tfrac{1}{2} \mu_k^\top \Sigma^{-1} \mu_k + \log \pi_k where \pi_k is the prior probability of class k. The decision boundaries are linear in x.

Quadratic discriminant analysis (QDA) drops the common-covariance assumption. Each class has its own \Sigma_k. The decision boundaries become quadratic. QDA is more flexible but requires more data per class to estimate the class-specific covariances reliably.

LDA also doubles as a dimension reduction technique: project to the k - 1 dimensions that maximize between-class variance relative to within-class variance, then classify in the reduced space. This is sometimes called Fisher’s discriminant.

When normality holds, LDA is optimal. When it does not, LDA still often performs well because the linear-boundary structure is reasonable in many real problems. For complex boundaries, logistic regression with interaction terms or a flexible machine learning model (random forest, gradient boosting) typically dominates.

Modern preview: t-SNE and UMAP

PCA and classical MDS are linear: the embedded coordinates are linear combinations of the original variables (or known transformations of distances). For highly nonlinear structure, two newer techniques dominate visualization workflows.

t-SNE (t-distributed stochastic neighbor embedding, van der Maaten and Hinton 2008) preserves local neighborhoods. It converts pairwise distances to conditional probabilities in the original space and in the embedding, and minimizes the KL divergence between the two. Excellent for visualizing clusters in high-dimensional data. The cost: global structure is not preserved, so cluster sizes and distances between clusters in a t-SNE plot are not interpretable.

UMAP (uniform manifold approximation and projection, McInnes et al. 2018) is conceptually similar but built on topological ideas. It typically runs faster than t-SNE on large data, preserves global structure better, and has become the default in single-cell genomics and many other applied fields.

Both are visualization-first methods. The embedded coordinates are not meaningful in absolute terms. The output is a map for the eye, not a feature representation for downstream modeling. If you need stable, interpretable low-dimensional features, use PCA.

Worked example: PCA on iris

We use Fisher’s iris dataset: n = 150 flowers, p = 4 measurements (sepal length, sepal width, petal length, petal width), three species (setosa, versicolor, virginica). The species label is set aside; we run PCA on the four numeric variables and then color the resulting PC1-versus-PC2 plot by species to check that the unsupervised projection separates groups we know to exist.

Before running PCA, we standardize each variable to mean 0 and variance 1. This is essential when variables are on different scales (a few millimeters of petal width versus a few centimeters of sepal length).

Excel

Excel is awkward for PCA. The matrix algebra is doable with MMULT, MINVERSE, and TRANSPOSE, but Excel has no built-in eigendecomposition. The most common workaround is the power method: iterate v \leftarrow S v / \|S v\| to extract the leading eigenvector, deflate, repeat. We sketch the workflow as a teaching exercise. For real PCA work, use any of the next three tools.

A1:D150  raw iris data: SepalLength, SepalWidth, PetalLength, PetalWidth

Step 1: standardize each column
F1   =(A1 - AVERAGE($A$1:$A$150)) / STDEV($A$1:$A$150)
fill F1:I150 with analogous formulas for B, C, D columns

Step 2: compute the 4 x 4 covariance matrix of F:I (call it S)
K1:N4  =MMULT(TRANSPOSE(F1:I150), F1:I150) / 149

Step 3: power method for leading eigenvector
P1:P4    starting guess, e.g., {1; 1; 1; 1}/2
Q1:Q4    =MMULT($K$1:$N$4, P1:P4)             'multiply by S
R1       =SQRT(SUMSQ(Q1:Q4))                  'norm
S1:S4    =Q1:Q4 / $R$1                        'normalize
Copy S into P and repeat. After 20 iterations P stops moving.

The eigenvalue is approximately T1 = SUMPRODUCT(P1:P4, MMULT(K1:N4, P1:P4)).

Step 4: PC1 scores
U1   =MMULT(F1:I1, P1:P4)
U1:U150 filled down

Deflate by subtracting \lambda_1 v_1 v_1^\top from S and repeat to extract the second component. Visualize PC1 versus PC2 with a scatter plot, color by species. This works but is fiddly and error-prone. Use it as a one-time demonstration of what the algorithm does, then move to a real tool.

Stata

sysuse auto  // replace with iris if you have it loaded
pca seplen sepwid petlen petwid, components(4)

* scree plot
screeplot

* save first two PCs as variables
predict pc1 pc2, score

* visualize
scatter pc2 pc1, mlabel(species)

The default Stata pca command standardizes automatically when called without the cov option. Add , cov to run on the raw covariance matrix (rarely what you want).

R

data(iris)
X <- iris[, 1:4]

# scale = TRUE standardizes each variable
pcr <- prcomp(X, scale. = TRUE)

# inspect
summary(pcr)         # variance explained per PC
pcr$rotation         # loadings: columns are PCs, rows are original variables
pcr$x[, 1:2]         # scores on first two PCs

# scree
plot(pcr, type = "l", main = "Scree plot")

# biplot
biplot(pcr, scale = 0)

# ggplot of PC1 vs PC2 colored by species
library(ggplot2)
df <- data.frame(PC1 = pcr$x[, 1], PC2 = pcr$x[, 2], Species = iris$Species)
ggplot(df, aes(PC1, PC2, color = Species)) +
  geom_point(alpha = 0.7, size = 2) +
  theme_minimal() +
  labs(title = "PCA of iris", x = "PC1", y = "PC2")

prcomp uses SVD under the hood. princomp exists as an older alternative that uses the eigendecomposition of the covariance matrix directly; prefer prcomp for numerical stability.

Julia

using MultivariateStats, DataFrames, CSV, Statistics

# load iris (e.g., from the RDatasets package)
using RDatasets
iris = dataset("datasets", "iris")
X = Matrix(iris[:, 1:4])'   # MultivariateStats wants p x n

# standardize
X_std = (X .- mean(X, dims=2)) ./ std(X, dims=2)

# fit PCA, keep 2 components
M = fit(PCA, X_std; maxoutdim = 2)

# scores
Z = MultivariateStats.transform(M, X_std)

# loadings (projection matrix)
projection(M)

# variance explained
principalvars(M) ./ tvar(M)

MultivariateStats.jl follows the convention of p \times n matrices (features in rows, observations in columns), which is the opposite of R’s. Transpose at the boundary and you avoid surprises.

Common traps

PCA without standardization. If sepal length is measured in centimeters and petal width in millimeters, the variable with the larger scale dominates the first component regardless of its substantive importance. Always standardize unless you have a strong reason not to (e.g., all variables are already on the same physical scale, like log-returns on a panel of assets).

Interpreting components as real variables. Principal components are linear combinations chosen for statistical convenience, not for substantive meaning. A loading vector (0.5, 0.5, -0.5, -0.5) may not correspond to any named concept. Resist the temptation to call PC1 “size” and PC2 “shape” unless the loadings unambiguously support that reading.

Choosing k for k-means by inspection. Eyeballing a scatter plot in two dimensions for a clustering done in p dimensions can mislead. Use silhouette or gap statistic alongside any visual judgment. And run the algorithm with at least 20 random starts; the answer often depends on the initialization.

Comparing absolute PC scores across samples. The sign of a principal component is arbitrary (eigenvectors are defined up to a sign), and the units depend on the data being decomposed. Comparing “the PC1 score of country A in 2010 to the PC1 score of country B in 2020” requires that both come from the same PCA fit on a common reference sample. Running PCA separately on each sample and then comparing is meaningless.

MANOVA assumptions. All four standard test statistics assume multivariate normality and equal covariance matrices across groups. Mild violations are tolerable. Severe ones (one group has dramatically different scatter than another) can invalidate the test. A permutation-based MANOVA is a safer alternative when the normality assumption is doubtful.

Hierarchical clustering with the wrong linkage. Single linkage produces chains and is rarely the right default. Ward’s method is a more conservative starting point. Always plot the dendrogram and inspect the merge heights before cutting.

Reporting checklist

  • State whether variables were standardized before the analysis, and why.
  • For PCA: report eigenvalues (or proportion of variance explained) for the components retained, and the loading matrix.
  • For factor analysis: report the rotation used, the loadings, and the communalities (the proportion of each variable’s variance accounted for by the factors).
  • For MANOVA: report which test statistic (Wilks’s lambda, Pillai’s trace, etc.) and the corresponding F approximation and p-value.
  • For cluster analysis: report the algorithm, the distance metric, the linkage (if hierarchical), the value of k, and how k was chosen.
  • For discriminant analysis: report the prior probabilities, the misclassification rate (ideally on a held-out test set), and a confusion matrix.
  • Mention dimension reduction methods only as visualizations unless you genuinely depend on the embedded coordinates for inference.

References

  • Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. Chapter 14 on unsupervised learning is the standard modern reference.
  • Mardia, K. V., Kent, J. T., & Bibby, J. M. (1979). Multivariate Analysis. Academic Press. The classic theoretical treatment.
  • Jolliffe, I. T. (2002). Principal Component Analysis (2nd ed.). Springer. Encyclopedic coverage of PCA.
  • van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9, 2579 to 2605.
  • McInnes, L., Healy, J., & Melville, J. (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv:1802.03426.
  • Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179 to 188. The paper that introduced the iris dataset and linear discriminant analysis.