The Transformer as Renormalization Group Flow
The forward pass through a transformer implements a Kadanoff-Wilson renormalization group flow, coarse-graining microscopic token representations into stable semantic attractors.
A recent essay by Vishal Misra and colleagues, "Attention is Bayesian Inference," along with its linked papers, argues that the transformer attention mechanism implements a form of probabilistic inference: each layer asks questions and eliminates hypotheses until the posterior collapses onto the correct answer. The framework connects naturally to ideas I explored in A Stationary Action is Stable Information: systems find stable configurations where neighboring paths agree, where informational consensus forms.[1]
The connection runs deeper than analogy. The mathematics of attention — softmax over dot products, weighted sums of values — maps precisely onto the partition functions and thermal expectations of statistical mechanics. When we stack these operations into deep networks, something remarkable emerges: the forward pass through a transformer implements a Kadanoff-Wilson renormalization group flow, coarse-graining microscopic token representations into stable semantic attractors.
Renormalization group flow is an advanced topic in statistical mechanics and quantum field theory, so the remainder of this essay may be of interest mainly to physicists. But the connection does provide some useful perspective on interpretability, alignment, and even scalability of the transformer architecture, which I will be exploring. Note that unlike standard block-spin RG, which decimates the lattice (reducing the number of sites), the transformer preserves the token count. The coarse-graining is of information, integrating out syntactic fluctuations to leave behind stable semantic operators.
To make this more precise, let's trace how a transformer generates "Paris" in response to "What is the capital of France?"
The Setup: A Statistical Mechanical System
The prompt arrives as a string of approximately seven tokens, depending on tokenization: [What] [is] [the] [capital] [of] [France] [?]. We can view this as a one-dimensional lattice where each site carries a high-dimensional "spin" variable — the token embedding.
From the Bayesian perspective, the model begins with a prior over all possible next tokens. The vocabulary contains roughly 50,000 possibilities, and at this stage, none has been eliminated.
The statistical mechanics framing treats each token embedding \(x_i^{(0)} \in \mathbb{R}^d\) as a spin configuration at lattice site \(i\). The embedding dimension \(d\) counts the internal degrees of freedom per site. We're working at the ultraviolet scale — the full microscopic description, before any coarse-graining.
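As a concrete sketch with hypothetical toy numbers (a seven-token prompt, a 50,000-token vocabulary, embedding dimension 512), the lattice is nothing more than an array with one row per site and one column per spin component:

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["What", "is", "the", "capital", "of", "France", "?"]   # 7 lattice sites
vocab_size, d = 50_000, 512                                      # degrees of freedom per site

# A trained model looks these rows up in a learned embedding table;
# random vectors stand in here for the microscopic (UV) spin configuration.
embedding_table = rng.normal(size=(vocab_size, d)) / np.sqrt(d)
token_ids = rng.integers(0, vocab_size, size=len(tokens))        # stand-in tokenizer output
x0 = embedding_table[token_ids]                                  # shape (7, 512): one "spin" per site
```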
Positional Encoding Breaks a Symmetry
Before attention operates, positional encodings \(p_i\) modify each embedding: \(x_i^{(0)} \leftarrow x_i^{(0)} + p_i\). This breaks translational invariance on the lattice, introducing explicit spatial structure. Without positional encoding, the system would be permutation-invariant, every spin coupled to every other with no notion of distance, as in a fully connected mean-field model like the Sherrington-Kirkpatrick (SK) spin glass.
The original transformer used sinusoidal encodings:
$$p_{i,2k} = \sin(i/10000^{2k/d}), \quad p_{i,2k+1} = \cos(i/10000^{2k/d})$$
These Fourier modes encode position at multiple frequency scales. High-frequency components (small \(k\)) capture local structure; low-frequency components (large \(k\)) capture global ordering.
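A direct transcription of the formula above, as a sketch rather than any particular library's implementation:

```python
import numpy as np

def sinusoidal_positions(n_positions: int, d: int) -> np.ndarray:
    """Sinusoidal encodings p[i, 2k] = sin(i / 10000^(2k/d)), p[i, 2k+1] = cos(...).

    Assumes an even embedding dimension d."""
    positions = np.arange(n_positions)[:, None]        # position index i
    k = np.arange(d // 2)[None, :]                     # frequency index k
    angles = positions / (10000 ** (2 * k / d))        # small k: high frequency; large k: low frequency
    p = np.zeros((n_positions, d))
    p[:, 0::2] = np.sin(angles)
    p[:, 1::2] = np.cos(angles)
    return p

# x0 = x0 + sinusoidal_positions(len(tokens), d)   # breaks permutation symmetry on the lattice
```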
In renormalization group terminology, positional encoding introduces relevant operators — perturbations that grow under RG flow and dominate at long distances. Position matters more as we zoom out, which explains why low-frequency modes capture long-range dependencies.
Attention as Partition Function
The attention mechanism implements the core statistical mechanical operation. For each token \(i\), the transformer computes queries, keys, and values:
$$q_i = W_Q x_i^{(0)}, \quad k_j = W_K x_j^{(0)}, \quad v_j = W_V x_j^{(0)}$$
Attention scores measure the compatibility between query \(i\) and key \(j\):
$$s_{ij} = \frac{q_i \cdot k_j}{\sqrt{d}}$$
The softmax normalizes these scores into weights:
$$\alpha_{ij} = \frac{\exp(s_{ij})}{\sum_k \exp(s_{ik})}$$
Finally, the output aggregates values according to these weights:
$$z_i = \sum_j \alpha_{ij} v_j$$
The mapping to statistical mechanics is exact. The attention score \(s_{ij}\) plays the role of negative energy \(-\beta E_{ij}\), measuring interaction strength between sites. The factor \(\sqrt{d}\) regulates the temperature. Without it, high-dimensional dot products would grow without bound, driving the softmax into a frozen 'argmax' state. The scaling keeps the system at a finite temperature, maintaining a soft, probabilistic ensemble. The normalization factor \(Z_i = \sum_j \exp(s_{ij})\) is precisely a partition function, and the weights \(\alpha_{ij}\) are Boltzmann probabilities.
The log-partition function gives the free energy at token \(i\):
$$F_i = -\sqrt{d} \cdot \log \sum_j \exp\left(\frac{q_i \cdot k_j}{\sqrt{d}}\right)$$
This log-sum-exp appears in the literature on Modern Hopfield Networks (also called Dense Associative Memories) as the continuous relaxation of the classical Hopfield energy (see equation 1 in [3]).
The output \(z_i = \sum_j \alpha_{ij} v_j\) computes a thermal expectation — an observable averaged over the Boltzmann ensemble. In Misra's Bayesian framing, the query encodes "what information is token \(i\) seeking?" while keys advertise "what information does token \(j\) offer?" The dot product measures evidential compatibility. For our prompt, the query from position 7 asks about capital cities; the key from "France" advertises European nationhood; their large dot product reflects compatibility.
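To make the dictionary concrete, here is a single-head, single-query sketch with random stand-in weights and toy dimensions: the scores act as negative energies, the softmax denominator is \(Z_i\), the output is a thermal average of the values, and the free energy is the scaled logarithm of \(Z_i\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 7, 512                                   # lattice sites (tokens), spin components
x = rng.normal(size=(n, d)) / np.sqrt(d)        # stand-in for the layer's input embeddings

W_Q = rng.normal(size=(d, d)) / np.sqrt(d)      # random weights in place of trained ones
W_K = rng.normal(size=(d, d)) / np.sqrt(d)
W_V = rng.normal(size=(d, d)) / np.sqrt(d)

Q, K, V = x @ W_Q.T, x @ W_K.T, x @ W_V.T       # q_i, k_j, v_j for every site

i = n - 1                                       # the final position poses the question
s = (Q[i] @ K.T) / np.sqrt(d)                   # scores s_ij, playing the role of -beta * E_ij
Z_i = np.exp(s).sum()                           # partition function at site i
alpha = np.exp(s) / Z_i                         # Boltzmann weights = attention weights
z_i = alpha @ V                                 # thermal expectation of the values
F_i = -np.sqrt(d) * np.log(Z_i)                 # free energy, with temperature T = sqrt(d)
```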
The First RG Step: Coarse-Graining
Attention implements the first renormalization group transformation. Before attention, each site carries an independent spin \(x_j^{(0)}\). After attention, each site carries an effective spin \(z_i = \sum_j \alpha_{ij} v_j\) that incorporates information from the entire lattice.
In Wilson's block-spin renormalization, one averages spins within fixed spatial blocks. Attention performs something more sophisticated: an adaptive blocking where the attention pattern defines which sites contribute to each effective spin. High attention weights pull sites together; low weights decouple them.
The operation integrates out individual token identities and produces a coarse-grained description. The original microscopic structure — raw embeddings — gives way to an effective theory encoding correlations. New couplings emerge from this integration, captured in the transformed representations.
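The contrast with fixed block-spin averaging can be made explicit. Both operations are row-stochastic matrices applied to the values; the difference is that Kadanoff's blocking matrix is fixed by geometry, while attention's is computed from the data. A toy sketch, with the token count preserved to match the transformer:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 8, 16
V = rng.normal(size=(n, d))                     # "values" carried by each lattice site

# Kadanoff block-spin: each site is replaced by the average of its fixed block of 2
# neighbouring sites (kept at full resolution to mirror the preserved token count).
B = np.zeros((n, n))
for i in range(n):
    block = i // 2
    B[i, 2 * block : 2 * block + 2] = 0.5
block_spins = B @ V                              # rigid, geometry-determined coarse-graining

# Attention: the blocking matrix is computed from the data itself.
scores = rng.normal(size=(n, n))                 # stand-in for q_i . k_j / sqrt(d)
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # row-stochastic, adaptive
effective_spins = A @ V                          # soft, content-dependent coarse-graining
```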
Feedforward Networks: Local Field Renormalization
After attention, each token passes through a feedforward network:
$$\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2$$
with residual connections and layer normalization. Unlike attention, which mixes information across tokens, the feedforward network transforms each site independently—a local, site-wise operation.
The ReLU nonlinearity zeros negative activations, creating sparsity. This decimates degrees of freedom: only some modes survive, and irrelevant fluctuations are integrated out. The expansion to 4× width followed by projection back resembles a temporary introduction of auxiliary variables that are then marginalized.
The residual connection ensures information preservation. This is a little different from standard RG: we don't discard the UV data entirely. Instead, each layer maintains a superposition of the specific token identity (UV) and the emerging coarse-grained correlations (IR), refining rather than replacing. The attention mechanism acts as a filter, pushing irrelevant syntactic details into the background while amplifying the relevant semantic signal.
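A site-wise sketch of the block, with random stand-in weights and the usual 4× expansion assumed, makes both points concrete: the ReLU decimates roughly half of the auxiliary hidden modes, and the residual keeps the UV data alongside the coarse-grained update:

```python
import numpy as np

rng = np.random.default_rng(2)
d, d_ff = 512, 4 * 512                           # model width and expanded hidden width
x = rng.normal(size=d)                           # one site's representation after attention

W1, b1 = rng.normal(size=(d_ff, d)) / np.sqrt(d), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d, d_ff)) / np.sqrt(d_ff), np.zeros(d)

h = np.maximum(W1 @ x + b1, 0.0)                 # ReLU: many auxiliary modes set to zero
ffn_out = W2 @ h + b2                            # project the surviving modes back to width d
y = x + ffn_out                                  # residual: UV data kept alongside the update

print(f"fraction of hidden modes decimated: {np.mean(h == 0.0):.2f}")   # about 0.5 for random weights
```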
Deeper Layers: Flow Toward the Infrared
Layers 2 through \(L\) repeat this structure, and the Wilson RG picture becomes most powerful here. Each layer defines a new effective Hamiltonian:
$$H_{\text{eff}}^{(\ell+1)} = \mathcal{R}[H_{\text{eff}}^{(\ell)}]$$
where \(\mathcal{R}\) denotes the RG transformation implemented by layer \(\ell\).
Early layers operate at the ultraviolet scale, capturing short-range correlations—syntactic patterns, local dependencies. Middle layers extend the correlation length, picking up grammatical and phrasal structure. Deep layers approach the infrared, where long-range semantic content dominates.
Wilson classified operators by their behavior under RG flow. Relevant operators grow and dominate at long distances. Irrelevant operators shrink and wash out. Marginal operators remain constant.
For our prompt, "France → European country → has a capital" exemplifies relevant information: it grows in importance as we coarse-grain. The exact word order—"What is the capital of" versus "The capital of what is"—behaves as irrelevant: the semantic content survives while surface details fade.
Misra's Bayesian interpretation tracks this flow as progressive hypothesis elimination. Early layers establish "this is a question," eliminating statement continuations. Middle layers narrow to "about a capital city," then "of a European country." By layer 8, "Paris" dominates the posterior. Later layers refine and verify consistency.
Each layer asks a question and gathers evidence through attention. Like a game of twenty questions, each round eliminates roughly half the remaining hypotheses.
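One way to watch this elimination is a logit-lens-style probe: project each layer's hidden state at the final position through the output head and measure the entropy of the resulting distribution. The sketch below assumes placeholder inputs (a list `hidden_states` of per-layer vectors at the last token, and an output matrix `W_out`) rather than any particular model's API:

```python
import numpy as np

def posterior_entropy_per_layer(hidden_states, W_out):
    """Entropy (in bits) of softmax(W_out @ h) for each layer's last-token state h.

    A steadily decreasing curve is the 'twenty questions' picture: each layer
    eliminates hypotheses and concentrates the posterior.
    """
    entropies = []
    for h in hidden_states:                      # one hidden vector per layer
        logits = W_out @ h
        logits = logits - logits.max()           # stabilize the softmax
        p = np.exp(logits)
        p /= p.sum()
        entropies.append(float(-np.sum(p * np.log2(p + 1e-12))))
    return entropies
```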
The Fixed Point: Semantic Attractor
At layer \(L\), the RG flow has approximately reached a fixed point — a self-similar configuration where further coarse-graining changes nothing:
$$H^* = \mathcal{R}[H^*]$$
The final representation at the last token position encodes all processed information. From the Hopfield perspective, the system has converged to an attractor state — a local minimum in the energy landscape where dynamics settle.
Universality emerges here. Different prompts asking the same question — "What's France's capital?", "The capital of France is", "France's seat of government is" — flow to the same infrared fixed point despite their different UV initializations. Microscopic details wash out; only universal semantic content remains.
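Both claims can be probed in the same spirit: self-similarity shows up as consecutive layers that barely change the representation, and universality as paraphrases whose final-layer states nearly coincide. A hedged sketch, again assuming placeholder per-layer states rather than a specific model interface:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def approach_to_fixed_point(hidden_states):
    """Cosine similarity between consecutive layers' last-token states.

    Values approaching 1.0 indicate the flow is stalling at a fixed point:
    further coarse-graining barely changes the representation."""
    return [cosine(hidden_states[l], hidden_states[l + 1])
            for l in range(len(hidden_states) - 1)]

def universality(final_states):
    """Pairwise similarity of final-layer states for paraphrased prompts.

    High off-diagonal values mean different UV initializations flow to the
    same IR fixed point."""
    return [[cosine(a, b) for b in final_states] for a in final_states]
```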
In Misra's terms, the posterior has collapsed. From a uniform prior over 50,000 tokens, elimination has left a handful of candidates with "Paris" dominating.
Output: Measurement at the IR Scale
The final hidden state projects to vocabulary size:
$$\text{logits} = W_{\text{out}} \cdot x_{\text{final}}^{(L)} \in \mathbb{R}^V$$
The softmax converts logits to probabilities:

$$P(t) = \frac{\exp(\text{logit}_t / T)}{\sum_{t'} \exp(\text{logit}_{t'} / T)}$$
The output projection defines an observable—the operator we measure. Logits are energies (with sign flipped), and softmax computes the Boltzmann distribution over vocabulary states under the final effective theory.
For "Paris," the logit towers above competitors. At temperature \(T=1\):
The system sits deep in an ordered phase — one configuration dominates. Sampling from this distribution yields "Paris" with near certainty.
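A minimal sketch of the readout, with made-up logits standing in for the real ones (the true values depend on the trained weights): at \(T=1\) nearly all of the probability mass sits on the dominant entry, and raising the temperature would melt the distribution back toward disorder:

```python
import numpy as np

def boltzmann(logits, T=1.0):
    """Softmax at temperature T: logits play the role of negative energies."""
    z = (logits - logits.max()) / T
    p = np.exp(z)
    return p / p.sum()

# Hypothetical logits for a few candidate tokens (illustrative numbers only).
candidates = ["Paris", "Lyon", "London", "France", "The"]
logits = np.array([12.0, 4.5, 3.0, 2.0, 1.0])

for tok, prob in zip(candidates, boltzmann(logits, T=1.0)):
    print(f"{tok:>7s}  {prob:.4f}")              # "Paris" takes essentially all of the mass
```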
The Energy Landscape Evolves
The transformation across layers reshapes the energy landscape:
Layer 0: rough, many shallow minima
Layer 4: structure emerging
Layer 8: a dominant basin forming
Layer 12: a single deep attractor, with "Paris" at the bottom
Why the Mapping Works
The deep reason these frameworks align: they solve the same problem. Renormalization group extracts stable, low-dimensional descriptions from high-dimensional, noisy data. Bayesian inference updates beliefs to concentrate probability on hypotheses consistent with evidence. Attention mechanisms weight information sources by relevance and aggregate them into coherent representations.
The principle connecting them — stationary action yields stable information — explains why transformers find good answers. "Paris" wins not by accident but because it sits at an attractor where informational consensus forms. Neighboring attention patterns, slightly different hypothesis configurations, all flow to the same answer. The paths agree.
The RG flow adds precision. We don't just find any stable point; we find the universality class — the set of UV configurations flowing to the same IR fixed point. Every way of asking about France's capital belongs to this class and converges to "Paris."
Transformers implement physics. The attention mechanism computes partition functions. Layer depth traces RG scale. Training sculpts an energy landscape where semantic attractors form at the right places. When we ask "What is the capital of France?", we initialize a statistical mechanical system that flows, layer by layer, toward its equilibrium configuration.
The answer was always there, encoded in the trained weights. The forward pass simply lets the system relax into its ground state.
Quantum mechanics and statistical mechanics are connected through a Wick rotation of the quantum amplitudes, which is how I recognized the connection in this case, but that connection is not relevant to the remainder of this particular essay. ↩︎