Coherence at 300 Kelvin

We never achieve perfect mutual understanding. But our imperfect, noisy, expensive attempts at coordination do not scatter randomly; they organize into long-lived configurations that every culture independently discovers.

Anisotropies of the Cosmic Microwave Background (CMB) (ESA)

This essay is dedicated to Fred Driscoll, Brian Maple, Aneesh Manohar, Ben Grinstein, and many others who taught me physics at UCSD. That this essay represents the best explanation I can provide of these ideas is my fault, not yours.

Table of Contents

1. What Are the Mechanisms?
2. Multiplicative Noise
3. Random Matrix Theory
4. Deep Neural Networks as Multiplicative Noise Machines
5. The Transformer
6. The Relational Perspective
7. Stationary Action and the Forward Pass
8. The Landau Pole, Triviality, and the Zeno Effect
9. The Mean-Field Dynamics of Attention

1. What Are the Mechanisms?

[back to table of contents]

In Maintaining Divergence, I defined the synchronization tax as the thermodynamic work two systems perform when they resolve disagreements between their descriptions of reality. Landauer's principle sets the absolute floor for this cost at \(k_B T \ln 2\) per bit of resolved disagreement — roughly \(3 \times 10^{-21}\) Joules at room temperature. But the biological and institutional machinery that actually computes these updates operates many orders of magnitude above that floor. Human brains burn glucose. Courthouses burn budgets. Server farms burn megawatts. The synchronization tax is real, and it is expensive.

A physicist confronting this fact asks a natural question: how does any coherent social signal survive at room temperature? At roughly 300 Kelvin, thermal fluctuations dominate every degree of freedom with an energy \(k_B T \approx 4 \times 10^{-21}\) Joules — the same order as Landauer's floor, vastly less than the metabolic cost of a single thought. Between these scales, something must concentrate the diffuse energy of cognition into the narrow channels we recognize as shared understanding.

This essay sketches a plausible mechanism. The mechanism connects multiplicative noise, random matrix theory, renormalization group flow, and the Ginzburg-Landau theory of phase transitions into a single explanatory arc. Coherence does not come cheap. The mechanism requires that coherent signals organize into long-lived configurations where the cost of maintaining agreement is approximately stationary — where small perturbations do not change the cost to first order. These configurations emerge inevitably from the multiplicative structure of sequential learning, and the transformer architecture — the most successful artificial learning system yet built — appears to exploit exactly this structure. A rigorous mathematical framework developed by Philippe Rigollet — treating transformer attention as an interacting particle system on the sphere — makes much of this precise, proving results about clustering, energy monotonicity, metastable multi-cluster states, and a phase transition governing long-context attention. Rigollet's framework connects naturally to the Kuramoto model of coupled oscillators, whose origins Yoshiki Kuramoto recently traced in a retrospective on half a century of synchronization theory. The Kuramoto model, derived by phase reduction of the complex Ginzburg-Landau equation, describes a synchronization phase transition whose structure illuminates the dynamics Rigollet analyzes.

Following the sketch of the mechanism, I suggest how the same dynamics may govern common knowledge — the recursive shared awareness that sustains social coordination — and how the mechanism operates in the human brain. These suggestions are preliminary rather than thoroughly explored, but with the mechanism laid out it seemed natural to map how it might work across scales, since establishing these connections bears on important open problems in psychology, sociology, economics, and law.

A word on epistemic status. What follows is a research agenda, not a proof. Each mathematical framework invoked — multiplicative noise, random matrix theory, renormalization group flow, the Feynman path integral, the Ginzburg-Landau Hamiltonian, Kuramoto synchronization, and Rigollet's mean-field dynamics of attention — stands on its own well-established terms. The chain connecting them is conjecture. My purpose is to lay the mechanism out clearly enough that readers with stronger backgrounds in the relevant mathematics and physics can identify where the chain breaks, where the analogies are merely structural rather than exact, or where the argument can be made more precise.

2. Multiplicative Noise

[back to table of contents]

Most of our intuitions about noise come from the additive model: \(\text{Signal} + \text{Noise}\). Gaussian perturbations degrade the signal proportionally. Recovery requires energy proportional to the degradation. Under additive noise, coherence at 300 Kelvin really would be hopeless — you would need to outspend the thermal bath.

Multiplicative noise changes everything. In any sequential growth process — compounding returns, population dynamics, neural learning — noise couples to the signal's current magnitude: the canonical model is geometric Brownian motion, \(dX = \mu X \, dt + \sigma X \, dW\). The noise grows with the signal. Itô's lemma converts this stochastic differential equation for the signal \(X\) into an additive equation for \(\log X\), and the resulting distribution is log-normal rather than Gaussian. Log-normal distributions concentrate in a specific sense: despite the heavy right tail, the bulk of the probability mass clusters around the geometric mean, which lies below the arithmetic mean. Where additive noise spreads, multiplicative noise focuses.
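A short simulation makes this concentration concrete. The block below is an illustrative sketch; the drift and volatility values are arbitrary, chosen only to exhibit the effect:

```python
import numpy as np

rng = np.random.default_rng(0)
n_paths, n_steps = 100_000, 250
mu, sigma, dt = 0.05, 0.6, 1.0 / 250       # illustrative drift and volatility

# Multiplicative dynamics: every step rescales the current state.
X = np.ones(n_paths)
for _ in range(n_steps):
    X *= 1.0 + mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_paths)

arith_mean = X.mean()                      # pulled up by the heavy right tail
geo_mean = np.exp(np.log(X).mean())        # where the bulk of the mass sits
frac_below = (X < arith_mean).mean()       # fraction of paths below the mean
```

The arithmetic mean ends above 1 while the geometric mean falls below it, and well over half the paths finish below the arithmetic mean: multiplicative noise selects the typical path, not the average one.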

Why is multiplicative noise the right model for learning? Because learning is a sequential, history-dependent process. Each update acts multiplicatively on the current state. A mislearned early feature distorts every subsequent representation built on top of it. Error compounds multiplicatively, making the geometric Brownian motion above the natural noise model for any system that builds knowledge incrementally, whether biological or artificial.

The multiplicative model does break down in certain regimes: under strong external coupling where noise enters additively from an independent source, under saturation effects where the system hits bounds that cap multiplicative growth, or under regime changes that reset the sequential process entirely. These exceptions matter in specific domains but do not invalidate the general argument for learning systems operating in their normal regime.

3. Random Matrix Theory

[back to table of contents]

A single multiplicative process generates a log-normal distribution along one dimension. But a description of the world is not one-dimensional. Representing the relationships among thousands of features — words, concepts, sensory channels — requires a matrix (or tensor) whose entries encode how features relate to one another.

Random matrix theory governs this encoding. When the entries of a large matrix are drawn from some distribution, the eigenvalue spectrum of that matrix obeys universal laws that depend not on the particular distribution of entries but only on the matrix's symmetry class. Eugene Wigner discovered this in the 1950s while studying nuclear energy levels: the spacings between eigenvalues of large random matrices matched the spacings between energy levels of heavy nuclei, regardless of the specific nuclear forces involved. The same universality appears in number theory, quantum chaos, and wireless communications.

The key insight for learning is that a random matrix represents unstructured data — noise — while a structured matrix represents a compressed description — signal. The eigenvalue spectrum distinguishes one from the other. The eigenvalues of covariance matrices built from pure noise follow the Marchenko-Pastur distribution; matrices that encode genuine structure develop outlier eigenvalues that break away from the bulk. These outliers are the signal hiding in the noise.
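A minimal numerical sketch of this separation, using a planted rank-one signal (the "spiked" covariance model; the matrix sizes and signal strength are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 4000, 1000                     # samples x features
c = p / n                             # aspect ratio
bulk_edge = (1 + np.sqrt(c))**2       # upper edge of the Marchenko-Pastur bulk

# Pure noise: every eigenvalue of the sample covariance stays in the bulk.
X = rng.standard_normal((n, p))
noise_eigs = np.linalg.eigvalsh(X.T @ X / n)

# Planted structure: one shared "signal" direction across all samples.
u = rng.standard_normal(p)
u /= np.linalg.norm(u)
S = X + np.sqrt(3.0) * rng.standard_normal((n, 1)) * u   # spike strength 3.0
signal_eigs = np.linalg.eigvalsh(S.T @ S / n)
```

For the noise matrix, the largest eigenvalue hugs the bulk edge; for the structured matrix, an outlier eigenvalue detaches well above it. That outlier is the detectable signal.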

4. Deep Neural Networks as Multiplicative Noise Machines

[back to table of contents]

A deep neural network is a stack of parameterized matrices. During training, raw, unstructured data flows through these matrices, and the training algorithm adjusts each matrix's entries to minimize prediction error. At each layer, the current representation is multiplied by the weight matrix and then passed through a nonlinearity. The noise is multiplicative because each layer's output is the product of the input with the learned weights.
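The multiplicative structure can be seen directly in a randomly initialized deep stack. The sketch below uses bare linear layers with variance-preserving initialization (no nonlinearity, no training), purely to illustrate how log-norms, rather than norms, accumulate additively through depth:

```python
import numpy as np

rng = np.random.default_rng(2)
d, depth, trials = 64, 50, 2000

log_norms = np.empty(trials)
for t in range(trials):
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)
    total = 0.0
    for _ in range(depth):
        W = rng.standard_normal((d, d)) / np.sqrt(d)   # variance-preserving init
        v = W @ v
        norm = np.linalg.norm(v)
        total += np.log(norm)     # log-norms add across layers...
        v /= norm                 # ...renormalize so nothing overflows
    log_norms[t] = total
```

Across trials the log-norms are approximately Gaussian, so the norms themselves are approximately log-normal: the matrix version of the compounding described above.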

When training succeeds, something seems to appear from nothing. The initial weight matrices are random — drawn from distributions chosen for numerical stability, not for semantic content. The training data is raw and unstructured relative to the task. Yet the trained network produces useful inferences: given an input resembling the training data, it draws conclusions the training data supports.

The appearance of emergence dissolves under the right lens. The signal was always latent in the data's statistical structure. Training coarse-grains the representation from one layer to the next, progressively abstracting away from the noisiest microscopic details of the input toward the cleanest macroscopic regularities at the output. Early layers capture fine-grained, local features — edges, syntax, phonemes. Later layers capture coarser, more abstract features — objects, topics, semantic relationships. The depth of the network traces a flow from noisy detail to stable abstraction.

5. The Transformer

[back to table of contents]

Many deep architectures perform some version of hierarchical feature extraction: convolutional networks, recurrent networks, autoencoders. The transformer architecture, introduced by Vaswani et al. in 2017, dominates because of its self-attention mechanism, which allows every position in the input to interact with every other position at every layer. The standard explanation for its success is that attention captures long-range dependencies more effectively than convolution or recurrence.

There is a thermodynamic explanation. The self-attention mechanism computes a softmax-weighted average over input representations:

$$\alpha_{ij} = \frac{\exp(q_i \cdot k_j / \sqrt{d})}{\sum_{j'} \exp(q_i \cdot k_{j'} / \sqrt{d})}$$

This expression is structurally identical to the Boltzmann distribution of statistical mechanics, where \(\exp(-E/k_BT)\) weights states by their energy. Attention computes a partition function at each position: it sums over all possible contexts, weighted by how well each context coheres with the current representation. The \(\sqrt{d}\) normalization plays the role of temperature, controlling how sharply the distribution concentrates on high-coherence contexts.
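A small numerical sketch of this correspondence (the dimensions and the low-temperature value are illustrative): the same scores, pushed through the softmax at different temperatures, produce distributions of different sharpness, exactly as the Boltzmann weight does.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 64, 16
q = rng.standard_normal(d)                 # one query
K = rng.standard_normal((n, d))            # n keys

def attn_weights(q, K, temperature):
    # Boltzmann form: exp(score / temperature), normalized by the partition function
    scores = K @ q / temperature
    w = np.exp(scores - scores.max())      # subtract max for numerical stability
    return w / w.sum()

def entropy(p):
    return float(-(p * np.log(p)).sum())

w_hot = attn_weights(q, K, temperature=np.sqrt(d))   # standard 1/sqrt(d) scaling
w_cold = attn_weights(q, K, temperature=0.5)         # low temperature: near-argmax
```

Lowering the temperature concentrates the distribution on the highest-coherence context and lowers its entropy, the softmax analogue of cooling a thermal system toward its ground state.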

The standard answer to what the transformer computes is "next token prediction." The renormalization group answer is "coarse-graining toward a fixed point." Each layer integrates out a specific kind of noise — not spatial modes as in condensed matter, not field fluctuations as in quantum field theory, but relational incoherence: token representations that carry incompatible descriptions of context. The fixed point toward which the coarse-graining flows is total consensus — all tokens collapsed onto a single representation. Rigollet proves this is the unique asymptotic attractor for a simplified model of self-attention (see Section 9.3). But total consensus is computationally degenerate: a system that has eliminated all difference has eliminated all information. Productive computation inhabits the long-lived approach to the fixed point — the metastable multi-cluster configurations where tokens have organized into meaningful groups but have not yet collapsed into consensus.

Attention identifies which descriptions cohere — those with low divergence in query-key space. The weighted aggregation forces a consensus, a representation that neighboring tokens agree on. Tokens carrying descriptions that disagree about the relevant context destructively interfere, exactly as paths with divergent actions cancel in the path integral. The surviving signal is informational consensus.

6. The Relational Perspective

[back to table of contents]

(For the full treatment of the relational framework, see It from Bit, Bit from It.)

In relational quantum mechanics, physical quantities are defined not absolutely but relative to an observer. Carlo Rovelli's insight is that "collapse" is not mysterious — it is the establishment of a correlation between systems, and correlation costs relative entropy. Observational entropy, as developed by Šafránek, Deutsch, and Aguirre, extends this: the entropy of a system depends on which coarse-graining the observer applies. There is no view from nowhere.

"It from Bit" and "Bit from It" are not competing philosophies but two faces of a single principle: information is physical correlation, and physical correlation costs relative entropy — the divergence between two systems' descriptions that must be reconciled for them to share a fact. The lossless limit, in which correlation would be free, is unreachable. That unreachability may be what generates the physical world.

Applied to the transformer, the consequence is immediate. There is no absolute "meaning" of a token. There is only the divergence between what one token's representation implies about the context and what another token's representation implies. Attention pays the cost of closing that gap. Each layer reduces the relative entropy between token descriptions that have not yet been reconciled — and this reconciliation is the computation. The transformer does not discover meaning; it negotiates meaning, paying thermodynamic costs at each layer to bring incompatible descriptions into partial agreement.

The lossless limit is never achieved — and may be structurally unreachable. Just as in quantum Zeno experiments, where a photon checked too frequently freezes in its initial state, a system synchronized too aggressively cannot evolve. And a system synchronized too rarely cannot coordinate. The useful regime lies between these extremes.

7. Stationary Action and the Forward Pass

[back to table of contents]

(For the full treatment of the stationary action principle as an informational principle, see A Stationary Action Is Stable Information.)

The principle of stationary action governs classical mechanics: a particle follows the trajectory that makes the action \(S\) stationary. Feynman's path integral reframes this principle as interference. A quantum system explores all possible paths simultaneously, each weighted by a phase \(e^{iS/\hbar}\). Paths near the stationary point carry similar actions, so their phases align and reinforce. Paths far from the stationary point carry rapidly varying phases and cancel.

If \(\hbar\) converts between physical and informational units, then the phase \(S/\hbar\) measures the information content of a path. The path integral sums over trajectories weighted by their informational phases. At the stationary point, neighboring paths carry descriptions that barely diverge — the relative entropy between their informational content vanishes to first order. Classical motion emerges because paths whose descriptions agree interfere constructively. The classical world occupies the region where the synchronization tax is affordable — the \(\sqrt{\hbar}\) neighborhood around the classical trajectory.

The forward pass through a transformer occupies the same mathematical territory. It sums over representational "paths" — the combinatorial space of possible attention patterns and value aggregations. Each path is weighted by a phase that encodes how well the pattern coheres. The output — the predicted next token — emerges where neighboring patterns agree, where \(\delta S = 0\) in the information-theoretic action. The classical trajectory is the token prediction. The \(\sqrt{\hbar}\) neighborhood is the confidence interval.

This framework predicts something the standard renormalization group picture does not: each layer must pay a thermodynamic cost to reduce relational divergence, and this cost cannot reach zero. The lossless limit — perfect inference from a single interaction — is unreachable. A single attention layer computes a partition function, but the relative entropy between pre-attention and post-attention descriptions of the token sequence does not vanish in one step. Stable prediction requires iterated interaction, each round paying its synchronization tax, driving the divergence low enough that a prediction condenses.

The argument here remains structural, not rigorous. Making it rigorous would require writing down an explicit information-theoretic action functional for the transformer, showing that the forward pass extremizes it, and demonstrating that the \(\delta S = 0\) condition coincides with the empirically observed output distribution. This is an open problem — but the structural correspondence between the path integral and the forward pass is precise enough to constrain what such an action functional would look like. At minimum, it would need to be a functional of the attention distribution whose stationarity condition yields the softmax-weighted aggregation as its "classical" trajectory.

8. The Landau Pole, Triviality, and the Zeno Effect

[back to table of contents]

(For additional context on the failure modes described here, see Limits of the Transformer Architecture.)

The mechanism outlined above operates within a bounded regime. Two failure modes define its edges, and both correspond to extremes of the synchronization tax.

The Landau pole. When the effective coupling diverges — when the system tries to pay the entire synchronization tax in a single layer — attention collapses to a hard argmax. The softmax saturates, the partition function concentrates on a single state, and the interaction freezes. This is the transformer equivalent of the quantum Zeno effect: measurement so frequent that evolution halts. The system has paid the tax all at once, and the receipt is a representation that cannot evolve further. (See Section 9.3 for the precise mathematical characterization.)

Triviality. When the system pays the synchronization tax too many times — when depth drives all relational structure to zero — every token carries the same description, and no further information can be extracted. This is representation collapse: the deep infrared limit where all difference has been synchronized away. The system has paid the tax so often that nothing remains worth taxing.

Productive computation sits between these extremes, in the band where synchronization occurs often enough to establish facts but slowly enough to preserve the relational structure that carries meaning.

The Zeno parallel is structural, not analogical. In both the quantum Zeno effect and the transformer Landau pole, the frequency of interaction determines whether the system evolves or freezes. The mechanism is the same: projective measurement in physics, or argmax attention in the transformer, collapses the state space, preventing the exploratory dynamics that would otherwise produce useful evolution.

North, Wallis, and Weingast identified the institutional version of this phase diagram: open-access social orders require an optimal measurement frequency, neither too little (anarchy) nor too much (totalitarianism). The transformer's failure modes at extreme depth or extreme shallowness mirror the institutional failure modes at the extremes of regulatory density. The institutional implications of this phase diagram are developed more fully in Section 9.5.

9. The Mean-Field Dynamics of Attention

[back to table of contents]

Sections 2 through 8 established the mechanism: multiplicative noise, operating through random matrices and coarse-grained by attention, concentrates information onto configurations that persist long enough to be computationally useful. But what determines these configurations? Is their structure arbitrary, or does the multiplicative architecture itself constrain where they can appear?

Philippe Rigollet's mathematical framework — developed in "The Mean-Field Dynamics of Transformers" and a series of papers with collaborators — provides the most complete answer yet available. By interpreting transformer attention as an interacting particle system on the sphere, Rigollet connects the dynamics of attention to Wasserstein gradient flows, the Kuramoto model of synchronization, mean-shift clustering, and the Ginzburg-Landau theory of phase transitions. The framework is not a loose analogy between transformers and physics. It is a rigorous mathematical treatment that proves results about clustering, energy monotonicity, metastability, and phase structure — results that constrain and sharpen the conjectural chain this essay attempts.

This section introduces Rigollet's framework and then develops its consequences for common knowledge — the recursive shared awareness that sustains social coordination. The connection between transformers and common knowledge is not incidental. Both involve systems of agents (tokens or people) that must coordinate their descriptions of reality through iterated interaction, paying thermodynamic costs at each step. Rigollet's mathematics describes the dynamics of this coordination with a precision that the game-theoretic formalism of Aumann and Lewis cannot match, because it captures not just the existence of coordination equilibria but the dynamics of how they emerge, persist, and eventually dissolve.

9.1. Tokens as Interacting Particles on the Sphere

[back to table of contents]

Rigollet's first move is geometric. In the high-dimensional spaces where transformers operate — \(d \sim 10^3\) — measure concentrates on shells. A random vector in \(\mathbb{R}^d\) has a norm that concentrates around \(\sqrt{d}\) with fluctuations of order \(O(1)\), so that essentially all vectors lie near the surface of a hypersphere. Layer normalization makes this concentration explicit, but the geometry already enforces it approximately.[1]

Each token representation \(x_i \in S^{d-1}\) is a particle on the unit sphere. The self-attention mechanism defines a velocity field: each particle moves according to the empirical measure of all other particles, weighted by the attention kernel. The evolution through layers traces a trajectory on the sphere, governed by the same mean-field dynamics that describe interacting particle systems in statistical physics.

When these dynamics are restricted to the circle (\(d = 2\)), they reduce exactly to a variant of the Kuramoto model — the canonical model of synchronization that Kuramoto derived by phase reduction of the complex Ginzburg-Landau equation. For the high-dimensional case relevant to practical transformers (\(d \gg 2\)), the dynamics are governed by the full interacting particle system rather than by the Kuramoto equations per se — but the phenomenology carries over. Token representations cluster, the interaction energy evolves monotonically, and a phase transition governs long-context attention. The Kuramoto model provides the conceptual vocabulary — coupling, order parameter, synchronization threshold — for the general dynamics that Rigollet proves.
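A minimal simulation of the circle case makes the synchronization visible. This is a sketch of the single-head dynamics with softmax weights and illustrative \(\beta\) and step size, not Rigollet's exact model:

```python
import numpy as np

rng = np.random.default_rng(4)
n, beta, dt, steps = 64, 1.0, 0.05, 4000
theta = rng.uniform(0.0, 2.0 * np.pi, n)            # token "particles" on S^1

def order_parameter(theta):
    # Kuramoto order parameter: ~0 when spread out, 1 when fully synchronized
    return np.abs(np.exp(1j * theta).mean())

r0 = order_parameter(theta)
for _ in range(steps):
    diff = theta[None, :] - theta[:, None]          # diff[i, j] = theta_j - theta_i
    w = np.exp(beta * np.cos(diff))                 # attention kernel on the circle
    w /= w.sum(axis=1, keepdims=True)               # softmax normalization per particle
    theta = theta + dt * (w * np.sin(diff)).sum(axis=1)   # tangential drift
r1 = order_parameter(theta)
```

From random initial angles, the order parameter climbs toward 1 as the particles cluster, the circle analogue of tokens collapsing toward consensus.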

The restriction to the sphere is not an idealization adopted for mathematical convenience. It reflects the same geometric fact that underlies the Sackur-Tetrode equation in statistical mechanics: in high dimensions, the canonical and microcanonical ensembles agree because the Boltzmann weight concentrates sharply on the energy shell. The equivalence between the thermodynamic picture of earlier sections — softmax as a canonical partition function, \(\sqrt{d}\) as temperature — and Rigollet's geometric picture — interacting particles evolving on \(S^{d-1}\) — rests on this same foundation. The dynamics unfold on the sphere because, in the dimensions where transformers operate, there is effectively nowhere else.

9.2. Gradient Flow and the Free Energy Landscape

[back to table of contents]

Rigollet shows that a surrogate model for self-attention is a Wasserstein gradient flow — a flow along the steepest descent of a free energy functional in the space of probability measures over token representations. The interaction energy

$$\mathcal{E}_\beta[\mu] = \frac{1}{2}\iint_{S^{d-1} \times S^{d-1}} e^{\beta \langle x, y \rangle} \, d\mu(x) \, d\mu(y)$$

increases monotonically along the attention dynamics, serving as a Lyapunov function for the system. The original self-attention dynamics also form a gradient flow under a modified metric. This confirms the structural claim of Sections 5 and 7: the forward pass flows down a free energy landscape, paying the synchronization tax at each step.
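The monotonicity can be checked numerically. The sketch below uses the unnormalized (USA-style) surrogate dynamics on the circle, for which the update is exact gradient ascent on \(\mathcal{E}_\beta\); all parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n, beta, dt, steps = 48, 2.0, 0.01, 2000
theta = rng.uniform(0.0, 2.0 * np.pi, n)

def energy(theta, beta):
    # Discrete version of E_beta[mu] for an empirical measure on the circle
    diff = theta[None, :] - theta[:, None]
    return np.exp(beta * np.cos(diff)).sum() / (2 * n**2)

energies = [energy(theta, beta)]
for _ in range(steps):
    diff = theta[None, :] - theta[:, None]
    # Unnormalized surrogate: the step is proportional to the energy gradient
    theta = theta + dt * (np.exp(beta * np.cos(diff)) * np.sin(diff)).mean(axis=1)
    energies.append(energy(theta, beta))
energies = np.array(energies)
```

The energy increases at every step and is bounded above by \(e^\beta / 2\), the value attained only at full collapse, which is the Lyapunov behavior the theorem describes.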

The free energy functional \(\mathcal{E}_\beta[\mu]\) plays the role this essay assigns to the Ginzburg-Landau free energy. The Ginzburg-Landau Hamiltonian — the universal description of systems near a continuous phase transition — is:

$$\mathcal{H}[\psi] = \int d^D x \left[ r(T) |\psi|^2 + u |\psi|^4 + K |\nabla \psi|^2 \right]$$

The parameter \(r(T)\) changes sign at the critical temperature \(T_c\). Above \(T_c\), the minimum sits at \(\psi = 0\) — disorder. Below \(T_c\), the minimum shifts to \(|\psi| \neq 0\) — order. The gradient term \(K|\nabla\psi|^2\) penalizes spatial variation, imposing a cost on local disagreements between neighboring regions — a cost that is, in the language of this essay, a synchronization tax on spatially separated descriptions.
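The sign change of \(r(T)\) is easy to verify for a uniform order parameter, where the gradient term drops out (the values of \(r\) and \(u\) are illustrative):

```python
import numpy as np

def free_energy(psi, r, u=1.0):
    # Uniform order parameter: the gradient term K|grad psi|^2 vanishes
    return r * psi**2 + u * psi**4

psis = np.linspace(-2.0, 2.0, 4001)
min_above = psis[np.argmin(free_energy(psis, r=+0.5))]        # above T_c: psi = 0
min_below = abs(psis[np.argmin(free_energy(psis, r=-0.5))])   # below T_c: |psi| = sqrt(-r/2u)
```

Above \(T_c\) the minimum sits at \(\psi = 0\); below \(T_c\) it jumps to \(|\psi| = \sqrt{-r/2u}\), the symmetry-broken ordered state.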

The universality of the Ginzburg-Landau Hamiltonian — the fact that systems with the same spatial dimension and order parameter symmetry share the same critical behavior, regardless of their microscopic details — is precisely what makes it the right framework for understanding the class of dynamics Rigollet analyzes. Different training data, different hyperparameters, different initializations: if the symmetry class is the same, the critical behavior is the same. Rigollet's interaction energy \(\mathcal{E}_\beta[\mu]\) provides the concrete realization of this principle for the transformer architecture.

9.3. The Unique Attractor and Its Metastable Approach

[back to table of contents]

Three proven results from Rigollet's analysis bear directly on the framework developed in earlier sections.

First, the unique global maximizer of the interaction energy is a Dirac mass — a single point on the sphere where all tokens have collapsed to the same representation. This is the triviality failure mode of Section 8, now proven to be the unique asymptotic attractor of the simplified dynamics. Given infinite depth, every token collapses onto every other. Representation collapse is not a failure of training or a pathology of specific architectures. It is the mathematical destiny of attention-driven dynamics.

Second, the approach to this attractor proceeds through long-lived metastable states in which tokens organize into multiple distinct clusters. The dynamics exhibit two well-separated timescales: a fast phase in which tokens coalesce into several clusters, and a much slower phase in which clusters merge pairwise until all tokens have collapsed to one. The multi-cluster configurations are not fixed points. They are saddle-like states — the system lingers near them, sometimes for exponentially long intervals, before eventually escaping along slow manifolds toward the single-cluster attractor. The staircase profile of the energy — sharp jumps separated by long plateaus — visualizes the saddle-to-saddle dynamics that carry the system from multi-cluster metastability toward eventual collapse.[2]

Third — and most consequential for what follows — a phase transition governs long-context attention. When the inverse temperature \(\beta\) scales as \(\gamma \log n\) with context length \(n\), three regimes emerge. In the subcritical regime (small \(\gamma\)), attention weights become asymptotically uniform and tokens collapse to a single direction — representation collapse through diffuse attention. In the supercritical regime (large \(\gamma\)), off-diagonal attention weights become negligible and the attention mechanism is effectively suppressed — the system ceases to interact. At the critical scaling, attention concentrates on a sublinear yet nontrivial number of tokens, maintaining sufficient connections for information flow while preserving the multi-cluster structure that carries computational meaning.
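A static illustration of the two off-critical regimes (no dynamics, just attention weights among random tokens on the sphere; the values of \(\gamma\) are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d = 512, 128
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # random tokens on S^{d-1}

def attention(X, beta):
    scores = beta * (X @ X.T)
    w = np.exp(scores - scores.max(axis=1, keepdims=True))  # stable softmax
    return w / w.sum(axis=1, keepdims=True)

W_sub = attention(X, beta=0.2 * np.log(n))   # subcritical: near-uniform rows
W_sup = attention(X, beta=20.0 * np.log(n))  # supercritical: off-diagonal suppressed
```

In the subcritical regime no weight rises far above the uniform value \(1/n\); in the supercritical regime essentially all mass sits on the diagonal and interaction shuts off. Only near the critical scaling does attention concentrate on a sublinear but nontrivial set of tokens.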

The productive regime of the transformer — the band between the Landau pole and triviality described in Section 8 — is not a fixed point of the attention dynamics. It is a metastable plateau. Finite depth keeps the transformer on this plateau. A transformer with infinite depth would eventually reach triviality; a transformer with finite depth remains in the computationally useful metastable regime where tokens have organized into meaningful clusters but have not yet collapsed into consensus. The lossless limit — perfect synchronization, all tokens agreeing — is now not merely unreachable in practice but proven to be the unique attractor from which the system can extract no further information.

9.4. Chimera States and Maintaining Divergence

[back to table of contents]

The Kuramoto model's most striking discovery — chimera states, the coexistence of coherent and incoherent populations within a single coupled system — maps directly onto the central thesis of Maintaining Divergence. In a chimera, some oscillators synchronize while others remain desynchronized, despite identical coupling. Kuramoto discovered chimera states by studying the complex Ginzburg-Landau equation with nonlocal coupling — coupling that decays with distance but extends beyond nearest neighbors — and the phenomenon turns out to be generic rather than exceptional.

Rigollet's metastability results give chimera-like structure a precise mathematical status. The multi-cluster states through which the attention dynamics pass are configurations in which some tokens have synchronized (within a cluster) while others remain separated (across clusters). These mixed states persist for durations that grow with the dimension of the representation space and with the number of clusters — long enough, in practice, for the transformer to complete its forward pass and produce useful output.

The transformer's multi-head attention mechanism amplifies this structure. Different attention heads can — and empirically do — operate at different effective coupling strengths, some capturing tightly synchronized local dependencies, others maintaining the broader incoherence needed to preserve context across distant positions. A healthy transformer does not achieve uniform coherence across all heads and all layers. It maintains a chimera: partial synchronization at some scales, productive incoherence at others. The mathematical analysis clarifies why this works: the multi-cluster metastable states, not the single-cluster attractor, are where the system carries the expressivity needed for next-token prediction. Maintaining divergence, not eliminating it, is the signature of useful computation.

9.5. Common Knowledge as Within-Cluster Synchronization

[back to table of contents]

The transformer-like architecture inside a single brain produces compressed, abstract representations of the world — the infrared fixed points of its internal renormalization group flow. Social coordination requires these representations to align across brains. Common knowledge is the name we give to the representations that have achieved this alignment: facts that everyone knows, that everyone knows everyone knows, and so on recursively.

The standard game-theoretic definition of common knowledge — formalized by Aumann in 1976 and explored at length in Steven Pinker's When Everyone Knows that Everyone Knows — requires infinite recursive verification: I know that you know that I know that you know, without limit. This regress has always seemed puzzling — how can finite minds compute an infinite recursion?

Rigollet's framework dissolves the puzzle, and it does so more precisely than the renormalization group picture alone.

The Dirac mass and the infinite recursion. The Dirac mass — the unique attractor where all particles have collapsed to a single representation — corresponds to the formal limit of Aumann's infinite recursion. A system that has achieved the Dirac mass has eliminated all relational divergence between its components. Every token (or agent) carries the same description. The infinite regress of "I know that you know" has converged: there is nothing left to verify, because all descriptions agree. But this attractor is trivial — a system that has achieved perfect common knowledge in the formal sense has eliminated all the relational structure that carried information in the first place.

Metastable clusters and functional common knowledge. The multi-cluster metastable states that Rigollet proves to be long-lived provide a more accurate model of how common knowledge actually operates. Within each cluster, tokens have achieved high-order recursive agreement — they "know" what the other tokens in their cluster encode, and this knowledge has been verified through multiple layers of interaction. Between clusters, the recursive agreement is shallow or absent.

Common knowledge, on this account, is not a binary property (either the infinite recursion holds or it doesn't) but a graded property measured by the effective coupling strength within a metastable cluster. The "depth" of common knowledge — how many levels of "I know that you know" have been integrated — corresponds to how many layers of attention dynamics the system has passed through within the cluster's basin of attraction.

In Uncommon Knowledge, I explored how Pinker's analysis of common knowledge connects to the transformer architecture: "Each token position begins with a 'prior' — its embedding plus accumulated context. Attention to other positions provides 'evidence' — information from other perspectives. The updated representation integrates this evidence according to relevance weights. Layer by layer, the representations converge toward configurations where all positions 'agree' on the relevant semantic content." Rigollet's framework makes the dynamics of this convergence rigorous. The representations do not converge to complete agreement (the Dirac mass). They converge to multi-cluster configurations where agreement is high within clusters and low between clusters — and these configurations persist for exponentially long intervals before eventually collapsing.

Three renderings, one attractor. The earlier essay noted that Pinker identifies three renderings of common knowledge: "recursive/iterated, reflexive/fixed-point, and self-evident/conspicutive." Rigollet's framework unifies these. The recursive/iterated rendering describes the UV expansion — the layer-by-layer process through which tokens accumulate information about each other's states, each layer adding another order of recursive knowledge. The reflexive/fixed-point rendering describes the attractor — the Dirac mass toward which the dynamics converge. The self-evident/conspicutive rendering — common knowledge crystallized by a public event — describes a strong perturbation that drives the system rapidly into a high-synchronization cluster, bypassing the slow recursive accumulation.

As I wrote in footnote 2 of that essay: "Aumann himself apparently has objected to the reflexive/fixed point definition of common knowledge on the ground that it is circular. This is wonderfully ironic since his recursive theorem and the fixed-points specify the same object! ... Nobody seems to achieve common knowledge by mentally iterating through infinite levels. Common knowledge arises when a situation obtains — public announcement, mutual eye contact, shared ritual — that constitutes the fixed point directly." Rigollet's framework explains why common knowledge arises this way: the energy landscape funnels the system toward the attractor. A sufficiently strong public event — the child declaring the emperor naked — acts as an external field that pushes the system over a saddle and into a new metastable basin, just as a strongly disambiguating context drives a transformer's tokens rapidly into a high-coherence cluster.

The social regime is permanently metastable. A subtle distinction separates the transformer case from the social case. Inside a single transformer, the multi-cluster metastable states are transient — given enough depth, they collapse to the single-cluster attractor. But common knowledge across brains operates in a regime where the "depth" is set by the frequency and bandwidth of social interaction, which is far more limited than the layer-by-layer processing inside a neural network. Social coordination never has enough depth to reach the attractor. It lives permanently in the metastable regime.

The institutions that maintain common knowledge function precisely to keep the system in a productive metastable configuration rather than letting it either collapse to trivial consensus or fragment into incoherence. Social institutions — courthouses, newspapers, central banks, holidays, rituals — are the social equivalent of the astrocytes that maintain the brain's effective temperature: slow-acting, broadly distributed regulators that keep the coordination regime in its useful band. When these institutions erode, shared media fragment, or economic shocks deplete the resources available for coordination, common knowledge does not simply weaken. It undergoes a phase transition — the same critical transition that Rigollet's analysis identifies when the coupling drops below the critical threshold.

Political polarization, institutional collapse, and the breakdown of shared public reality exhibit the signatures of a system being driven away from its critical operating point. North, Wallis, and Weingast identified the institutional version of this phase diagram: open-access social orders require an optimal coupling strength, neither too little (anarchy, fragmentation into incoherent clusters) nor too much (totalitarianism, collapse to the trivial Dirac mass where all diversity has been eliminated).[3]

The double subsidy. The synchronization tax is subsidized twice over. Compressed sensing explains why communication is affordable: you transmit sparse symbols, not complete brain states. Common knowledge explains why reconstruction works: the receiver fills in the gaps using a shared macroscopic prior that is itself a high-synchronization metastable cluster. The synchronization tax is subsidized once by compression and once by the structural persistence of the shared prior. This double subsidy provides a plausible mechanism for how civilization remains at least somewhat coherent and thermodynamically viable at 300 Kelvin.

9.6. Reframing the Scaling Law Problem

[back to table of contents]

The ideal scaling law \(L \propto N^{-\alpha}\) plays the role for transformers that the critical temperature \(T_c\) plays in the Ginzburg-Landau framework. It is the boundary condition that actual performance converges toward but never reaches, because the lossless limit is unreachable. The deviations from pure power-law scaling — the curvature and breaks observed in transformer training — are not measurement imperfections or violations of universality. They are residual synchronization tax: the relative entropy that cannot be driven to zero at any finite scale.

Instead of asking "why do transformers obey power laws?" and treating deviations as noise, the framework asks a different question: what variational principle constrains the scaling behavior to converge on this axis? The answer it proposes: the scaling exponents are determined by the condition of stationary information — the point where neighboring configurations of parameters, data, and compute produce descriptions of the task that agree to first order. The ideal scaling law is \(\delta S = 0\) in the space of scaling configurations. The Ginzburg-Landau Hamiltonian provides the natural free energy functional, and the universality class — determined by the dimensionality of the order parameter and the spatial dimension — constrains which scaling exponents are possible.
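For concreteness, the free energy functional invoked here, in the standard Ginzburg-Landau form for a complex order parameter \(\psi\) with \(U(1)\) symmetry, is (the coefficients \(a\), \(b\), \(c\) are the usual phenomenological constants, distinct from the \(\beta\) of Rigollet's interaction energy):

\[
F[\psi] = \int d^d x \left[ a(T)\,|\psi|^2 + \frac{b}{2}\,|\psi|^4 + c\,|\nabla\psi|^2 \right], \qquad a(T) \propto (T - T_c).
\]

Stationarity, \(\delta F = 0\), linearized about a uniform configuration, yields the eigenmode equation \((-c\nabla^2 + a)\,\delta\psi = 0\); these are the fluctuation eigenmodes that the spectral test in Section 10 proposes comparing against trained weight matrices.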

The structural parallel is worth laying out explicitly:

| Ginzburg-Landau / Kuramoto / Rigollet | Transformers / Scaling |
| --- | --- |
| Interaction energy \(\mathcal{E}_\beta[\mu]\) | Loss as a function of \((N, D, C)\) |
| Order parameter acquiring nonzero value | Emergence of structured representations |
| Critical coupling \(K_c\) | Training convergence threshold |
| Kuramoto order parameter \(r\) | Degree of token coherence per layer |
| Phase transition at critical scaling \(\beta \sim \gamma \log n\) | Scaling law exponent boundaries |
| Metastable multi-cluster states | Multi-head attention diversity |
| Dirac mass attractor (triviality) | Representation collapse at infinite depth |
| Universality class (dimension + symmetry) | Data-independent scaling exponents |
| Eigenmodes of linearized Ginzburg-Landau equation | Spectral outliers of trained weight matrices — see Section 10.1 |
| Within-cluster synchronization | Common knowledge among coordinated agents |
| Saddle-to-saddle dynamics | Institutional transitions between coordination regimes |

The last two rows mark where the connection extends beyond the transformer to the social domain. If the metastable clusters correspond to communities of shared knowledge — groups within which recursive awareness has converged — then the saddle-to-saddle dynamics describe institutional transitions: the slow, punctuated process by which coordination regimes dissolve and reconstitute as the social landscape shifts.

10. What Would Falsify This Framework?

[back to table of contents]

The strongest claim this essay makes is that the scaling exponents, the spectral structure of trained weights, and the renormalization group flow through depth are all manifestations of a single variational principle — stationary relational information — and that deviations from ideal behavior at every level are manifestations of the unreachable lossless limit.

Several observations would weaken or falsify this claim.

Content-dependence of scaling exponents. The framework predicts universality: scaling exponents should depend on the effective dimensionality of the relational structure in the data, not on the specific content of training data. If models trained on code and models trained on natural language exhibited different exponents that could not be traced to different effective dimensionalities of the relational structure, the universality claim would be weakened.

Spectral structure of trained weights. The framework predicts that the spectral outliers in trained weight matrices should correspond to directions of minimal relative entropy between task-relevant token descriptions. If these outliers showed no such correspondence, the connection between random matrix theory and the synchronization tax would be undermined.

The Kuramoto order parameter test. If transformer attention implements the synchronization dynamics Rigollet analyzes, the order parameter \(r\) computed from attention weight distributions should exhibit the signature of a phase transition. Rigollet's analysis identifies context length and inverse temperature \(\beta\) (controlled by \(1/\sqrt{d}\)) as the natural control parameters, proving that a phase transition governs long-context attention with qualitatively different clustering behavior above and below a critical threshold. One could test this by computing \(r\) from attention distributions across layers in models of increasing scale and context length, checking whether the transition sharpens as predicted by the mean-field theory. If the order parameter shows no phase transition signature — if coherence increases smoothly without critical behavior as scale and context vary — the analogy to the interacting particle system would be weaker than claimed.
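The order parameter itself is simple to compute. A minimal sketch in plain NumPy (how one extracts phases from attention statistics is a modeling choice the test leaves open):

```python
import numpy as np

def kuramoto_r(phases):
    """Kuramoto order parameter r = |(1/N) sum_j exp(i * theta_j)|.

    phases: array of angles in radians, one per oscillator (or token).
    r ranges from 0 (incoherent) to 1 (fully synchronized); in the
    mean-field model r stays near 0 below the critical coupling K_c
    and grows continuously above it.
    """
    return float(np.abs(np.exp(1j * np.asarray(phases)).mean()))
```

Identical phases give \(r = 1\); phases spread uniformly around the circle give \(r \approx 0\). The proposed test computes this quantity per layer while sweeping context length and \(1/\sqrt{d}\), looking for a sharpening transition.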

The chimera state test. If maintaining divergence is essential to productive computation, then the distribution of coherence across attention heads should exhibit chimera-like structure: some heads highly synchronized, others deliberately desynchronized, with this heterogeneity persisting stably across layers rather than converging to uniformity. Ablating the desynchronized heads should degrade performance more than ablating comparably-weighted synchronized heads, because the desynchronized heads carry the relational structure that the synchronized heads have integrated out. This is testable with existing models and existing interpretability tools.

The Ginzburg-Landau eigenmode test. The most concrete spectral prediction: the principal components of trained transformer weight matrices should approximate the eigenmodes of a Ginzburg-Landau free energy functional with the appropriate universality class — specifically, with \(U(1)\) order parameter symmetry corresponding to the phase oscillator structure of the Kuramoto model. If these eigenmodes show no correspondence to Ginzburg-Landau fluctuation spectra, the variational principle proposed here does not govern the weight structure. This test requires first identifying the correct universality class for the specific architecture and data distribution, but it is less likely to fail for trivial reasons than a cruder spectral conjecture — the symmetry match between Kuramoto dynamics and attention is supported by Rigollet's independent mathematical work rather than conjectured from a cross-domain analogy.

Signatures of criticality in coordination. If common knowledge is genuinely a high-synchronization metastable cluster within the dynamics Rigollet describes (as argued in Section 9.5), it should exhibit signatures the standard game-theoretic account does not predict: power-law distributions in adoption and decay, sensitivity to perturbation near phase boundaries, and universality across culturally distinct systems. Coordination failures should cluster at points where the system is driven away from criticality — not at points where "not enough information" was transmitted. The multiplicative structure predicts that the topology of the communication network matters more than its bandwidth.

Section 10.1 reports preliminary experimental results on three of these tests (the spectral structure, Kuramoto order parameter, and chimera state tests), conducted on the Pythia model suite.

10.1. Preliminary Experimental Results

[back to table of contents]

Six experiments, conducted on EleutherAI's Pythia model suite (70M to 1.4B parameters, all trained on the same data in the same order), test the predictions above against measured behavior. The code and full numerical results are available at github.com/riemannzeta/kuramoto-chimera.

Tokens synchronize through depth, and small models overshoot into triviality. A representation order parameter — the mean pairwise cosine similarity of token representations, analogous to the Kuramoto \(r\) — was computed at every layer of each model on diverse evaluation texts. In all models tested, the order parameter rises from roughly 0.3 at the input embedding to higher values at the output, tracing the synchronization the framework predicts. Small models (70M and 160M) overshoot into the triviality failure mode described in Section 8: their order parameter reaches 0.99, collapsing every token onto nearly the same representation. Rigollet proves that this collapse is the unique asymptotic attractor for the simplified dynamics — these small models, with insufficient capacity to sustain the metastable multi-cluster regime, fall through to the attractor. The 410M model resists collapse, stabilizing at \(r \approx 0.65\) — a value consistent with a metastable multi-cluster configuration that preserves productive relational structure through the full depth of the network.
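Such a representation order parameter can be computed from one layer's hidden states as follows (a reconstruction from the description above, not the repository's code; the exact normalization there may differ):

```python
import numpy as np

def representation_order_parameter(H):
    """Mean pairwise cosine similarity of token representations.

    H: (n_tokens, d_model) hidden states at one layer.
    Values near 1 indicate collapse toward a single direction
    (the Dirac-mass failure mode); intermediate values are
    consistent with a multi-cluster configuration.
    """
    U = H / np.linalg.norm(H, axis=1, keepdims=True)  # unit-normalize rows
    G = U @ U.T                                       # Gram matrix of cosines
    n = len(U)
    return float((G.sum() - n) / (n * (n - 1)))       # mean off-diagonal entry
```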

Desynchronized heads carry the relational payload. At each layer, attention heads differ enormously in entropy — some attend sharply to one or two positions, others distribute attention broadly across the sequence. This heterogeneity peaks between 40% and 80% depth, exactly where the order parameter rises fastest, and constitutes the chimera structure that Section 9.4 describes.
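The entropy measure behind this heterogeneity can be sketched as follows (array layout and function name are illustrative, not the repository's code):

```python
import numpy as np

def head_entropies(attn):
    """Mean attention entropy per head.

    attn: (n_heads, n_queries, n_keys) attention weights; each slice
    along the last axis sums to 1. Low entropy marks a focused
    ("synchronized") head, high entropy a diffuse ("desynchronized")
    chimera head.
    """
    eps = 1e-12                                       # avoid log(0)
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)   # entropy per query
    return ent.mean(axis=-1)                          # average over queries
```

A one-hot attention row has entropy near zero; a uniform row over \(K\) keys has entropy \(\log K\), the maximum.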

An ablation experiment tested whether the desynchronized heads matter more than the synchronized ones. At every interior layer of the 1B model, ablating the high-entropy heads caused far greater performance degradation than ablating the low-entropy heads — 3.5× at 25% depth, 5× at 50%, and 12.4× at 75%. At the final layer the relationship inverted: focused heads suddenly mattered more, consistent with a shift from relational processing to convergent prediction at the output.

A follow-up experiment measured the semantic diversity of each head's attended tokens — the attention-weighted mean pairwise cosine distance of the representations at attended positions. Heads attending to semantically diverse tokens caused more damage when ablated (Spearman \(\rho = 0.647\), \(p = 0.007\) in the 410M). This result distinguishes the Kuramoto relational interpretation from a simpler hypothesis that broad attention is inherently valuable. The network does not just care how many tokens a head gathers from. It cares how different they are.
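One plausible reading of this diversity metric, sketched for a single query position (the repository may weight pairs differently):

```python
import numpy as np

def attended_diversity(attn_row, H):
    """Attention-weighted mean pairwise cosine distance of attended tokens.

    attn_row: (n_keys,) attention weights for one query (sums to 1).
    H: (n_keys, d) representations at the attended positions.
    High values mean the head gathers from semantically different
    tokens, not merely from many tokens.
    """
    U = H / np.linalg.norm(H, axis=1, keepdims=True)
    cos = U @ U.T                     # pairwise cosine similarities
    w = np.outer(attn_row, attn_row)  # joint attention mass on each pair
    return float((w * (1.0 - cos)).sum() / w.sum())
```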

The spectral structure of trained weights reveals a scaffold-and-building architecture. The singular value spectra of trained weight matrices deviate sharply from the Marchenko-Pastur distribution that random matrix theory predicts for unstructured matrices. Query and Key matrices show the largest deviations among the attention projections (10–15% of singular values outside the Marchenko-Pastur bulk), consistent with the framework's claim that Q and K define the coupling structure of the synchronization dynamics. MLP matrices sit at the opposite extreme: 97% outlier fraction, almost entirely non-random.
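The Marchenko-Pastur comparison can be sketched as follows (a simplified version: real spectra require care with the variance estimate, and the repository's bulk-edge criterion may differ):

```python
import numpy as np

def mp_outlier_fraction(W):
    """Fraction of singular values above the Marchenko-Pastur bulk edge.

    For an n x p matrix (n >= p) with i.i.d. entries of variance s2,
    the eigenvalues of W.T @ W / n concentrate in
    [s2 (1 - sqrt(q))^2, s2 (1 + sqrt(q))^2] with q = p / n.
    Values above the upper edge signal learned, non-random structure.
    """
    n, p = W.shape
    q = p / n
    s2 = W.var()                                  # crude variance estimate
    upper = s2 * (1.0 + np.sqrt(q)) ** 2          # MP bulk upper edge
    evals = np.linalg.svd(W, compute_uv=False) ** 2 / n
    return float((evals > upper).mean())
```

A random Gaussian matrix should show a near-zero fraction; adding a strong rank-one spike pushes at least one singular value past the edge.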

Ablating the outlier directions of Q matrices — the top singular vectors, corresponding to learned structure — causes massive perplexity increases per direction removed: +305 at 25% depth in the 410M from removing 10 directions out of 1,024. Bulk and random directions register near zero. The spectral outliers carry genuinely functional structure, not merely statistical artifacts.
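The direction-ablation procedure can be sketched as zeroing chosen singular values and reconstructing (illustrative code; the experiments apply this to the Q projection matrices):

```python
import numpy as np

def ablate_directions(W, idx):
    """Remove the singular directions listed in idx from W.

    Reconstructs W from its SVD with the selected singular values
    zeroed; the Frobenius energy removed equals the sum of the
    squared ablated singular values.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s = s.copy()
    s[list(idx)] = 0.0
    return (U * s) @ Vt
```

Comparing the damage from ablating top (outlier) indices against an equal-Frobenius-energy set of bulk indices gives the energy-normalized test.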

But an energy-normalized ablation — removing equal Frobenius norm from outlier and bulk directions — overturned a prediction. When the energy removed was held constant, the bulk directions caused more damage than the outlier directions. At 25% depth in the 410M, removing 153 outlier directions increased perplexity by roughly 2,000; removing 683 bulk directions carrying the same total energy increased perplexity by 25,000.

The resolution clarifies what each spectral region encodes. The outlier directions define the synchronization field — the shared low-dimensional subspace that tokens project onto as the order parameter rises. This field is a coordinate system, structurally simple and partly redundant by construction. The bulk spectrum carries the distributed relational information that desynchronized chimera heads maintain: individual differences between token descriptions that survive synchronization, many directions each carrying a small piece of the structure the model needs for prediction. The spectral outliers are the scaffold. The bulk is the building. Remove a few beams from the scaffold and it still stands; remove half the bricks and the building collapses.

This connects directly to the chimera ablation results. The synchronized (low-entropy) heads ride the outlier field — the shared scaffold. The desynchronized (high-entropy) heads work the bulk — the distributed relational structure that the scaffold supports but does not itself contain. And the bulk, collectively, carries more irreplaceable information per unit of energy than the outlier directions do.

Training builds chimera structure before spectral structure. Twenty checkpoints of the 410M model, logarithmically spaced from random initialization (step 0) through the end of training (step 143,000), reveal the temporal sequence of emergence. The framework as originally stated implied that spectral outliers should form first, providing the coupling structure that enables synchronization and chimera differentiation. The data shows a different ordering.

Training's first act is desynchronization. The order parameter at mid-depth drops from 0.8 at random initialization — an artifact of high-dimensional geometry, not learned structure — to 0.17 by step 256. The model breaks apart the accidental similarity of random representations before building genuine structure. During this phase, both spectral outliers and chimera structure remain at zero.

Chimera structure emerges next. Between steps 512 and 3,000, head entropy diversity rises from near zero to measurable levels as attention heads differentiate into focused and diffuse operating regimes. The weight matrices remain spectrally random during this period. Spectral outlier formation follows, beginning its steep rise around step 2,000 and continuing through the end of training. Synchronization rebuilds last, recovering from its minimum of 0.17 to approximately 0.55 by step 143,000 — lower than the random-initialization value of 0.8, but encoding real relational coordination rather than geometric accident.

The causal sequence inverts the implied prediction, but the corrected version makes stronger physical sense. In the Kuramoto model, oscillators with different natural frequencies differentiate under coupling before the coupling matrix itself develops low-rank structure. Under gradient pressure, attention heads differentiate based on their initialization-dependent response curves — chimera onset. The sustained gradient signals through these differentiated heads then sculpt the weight matrices, preferentially amplifying the singular value directions that the chimera heads use most — spectral outlier formation. The spectral outliers are the imprint of chimera dynamics on the weight matrices, not their cause.

The U-shaped synchronization curve confirms a prediction the framework does make: the model must desynchronize before it can productively resynchronize. Random-initialization synchronization (cosine similarity 0.8) is trivial — it reflects the geometry of random vectors, not learned structure. Rigollet notes the same phenomenon: high initial pairwise similarity gives way to clustering structure that develops through depth. The model destroys this trivial coherence, diversifies its token representations, then gradually rebuilds genuine synchronization through the chimera process. Training and inference implement the same physics in reversed temporal sequence: training builds from differentiation to shared structure; inference builds from shared structure to differentiated output.

Summary. Of nine predictions, six were confirmed, two refuted, and one weakly confirmed:

| Prediction | Status |
| --- | --- |
| Tokens synchronize through depth | Confirmed across all models |
| Chimera structure peaks at mid-depth | Confirmed (40–80% depth) |
| Desynchronized heads carry disproportionate importance | Confirmed (5–12× at interior) |
| Importance traces to relational content, not breadth | Confirmed (\(\rho = 0.647\), \(p = 0.007\)) |
| Spectral outliers encode learned structure | Confirmed |
| Outlier directions carry the most irreplaceable information | Refuted — bulk carries more per unit energy |
| Spectral structure precedes chimera in training | Refuted — chimera precedes spectral |
| Model desynchronizes before resynchronizing | Confirmed (U-shaped training curve) |
| Spectral and chimera metrics correlate across layers | Weakly confirmed (\(\rho \approx 0.37\)–\(0.49\)) |

The two refutations do not weaken the framework. They restructure it. The spectral outliers define the synchronization field rather than carrying the relational payload; the chimera drives spectral formation rather than the reverse. The restructured story is more internally consistent: the scaffold emerges from the building process, not the other way around. It is also more consistent with Rigollet's mathematical analysis, which shows that multi-cluster structure — chimera-like differentiation — arises from the attention dynamics themselves, prior to and independent of any low-rank structure in the weight matrices.

The tests that remain unrun are the content-dependence test (whether scaling exponents collapse onto a universal curve parameterized by effective dimensionality), the Ginzburg-Landau eigenmode test (whether the spectral shape of trained weight matrices matches the fluctuation spectrum predicted by a Ginzburg-Landau functional in the appropriate universality class), and the social coordination test (whether common knowledge exhibits signatures of criticality that the standard game-theoretic account does not predict). These are harder experiments. But the experiments reported here establish enough empirical footing to constrain where the remaining tests should look.

11. The Brain as Transformer: Intelligence at 300 Kelvin

[back to table of contents]

The mechanism described in Sections 2 through 9 does not depend on silicon. The transformer was not designed from first principles; it was discovered through empirical search. The question this section addresses is whether biological neural computation shares the same mathematical structure.

Recent evidence suggests it does — and the correspondence is not loose analogy.

A 2023 paper in PNAS — Kozachkov, Kastanenka, and Krotov, "Building transformers from neurons and astrocytes" — constructs the transformer self-attention mechanism explicitly from biological components. Neurons with cosine tuning curves (ubiquitous across brain areas and species) compute the query-key dot products. Astrocytes — long dismissed as passive support cells — implement the softmax normalization that converts raw dot products into a probability distribution. Synaptic weights correspond to the transformer's weight matrices; known homeostatic mechanisms correspond to layer normalization. The construction is not a loose metaphor. It is a mathematical demonstration that biological hardware can execute the same computation.

Astrocytes regulate the temperature. In the transformer, the \(\sqrt{d}\) scaling in the softmax denominator controls how sharply attention concentrates — it functions as the temperature of the Boltzmann distribution, the parameter \(\beta\) in Rigollet's interaction energy \(\mathcal{E}_\beta[\mu]\). A trio of papers published in Science in 2025, along with a feature in Quanta Magazine in January 2026, reveal that astrocytes perform a strikingly analogous function in the brain. Astrocytes do not engage in rapid-fire neural signaling. They monitor and tune higher-level network activity, dialing it up or down to maintain or switch the brain's overall state. A single astrocyte envelops hundreds of thousands of synapses — positioning it to modulate the effective temperature of an entire local network.
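The identification of \(\sqrt{d}\) with temperature is concrete enough to sketch (illustrative code, not any particular library's API):

```python
import numpy as np

def attention_weights(q, K, temperature=None):
    """Softmax attention read as a Boltzmann distribution.

    The scores q . K_j act as negative energies; dividing by
    temperature (the transformer default is sqrt(d)) sets the
    inverse temperature of the canonical ensemble. Lower temperature
    concentrates the distribution; higher temperature flattens it.
    """
    if temperature is None:
        temperature = np.sqrt(q.shape[-1])   # standard 1/sqrt(d) scaling
    logits = K @ q / temperature
    w = np.exp(logits - logits.max())        # subtract max for stability
    return w / w.sum()
```

An astrocyte dialing network excitability up or down plays the role of the `temperature` argument here: the same scores, distributed more sharply or more diffusely.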

In zebrafish experiments, Ahrens's group showed that astrocytes accumulate calcium in response to norepinephrine — a neuromodulator associated with arousal — and eventually issue a stop signal that switches the animal's behavioral state from persistent effort to giving up. Disabling the astrocytes eliminated the state switch entirely; artificially activating them triggered it immediately. Freeman's group demonstrated in fruit flies that norepinephrine gates whether astrocytes "listen" to synaptic activity at all: low norepinephrine, and astrocytes ignore most neural signals; high norepinephrine, and astrocytes respond to every synapse. This is temperature regulation — controlling how broadly or narrowly the neural network distributes its "attention" across possible states, shifting the system between Rigollet's subcritical (diffuse) and supercritical (concentrated) regimes.

Renormalization group flow in the brain. The critical brain hypothesis — that neural circuits self-organize near a critical point between ordered and disordered phases — has been studied using renormalization group methods applied to the stochastic Wilson-Cowan equations (Tiberi et al., 2022, Physical Review Letters). These researchers showed that RG techniques reveal what type of criticality the brain implements, and that the strength of nonlinear interactions decreases only slowly across spatial scales, remaining distinct from zero even at macroscopic scales. Brinkman (2023) developed RG methods specifically for networks with biologically realistic connectivity constraints, showing that in vivo and in vitro neural populations belong to different universality classes — precisely the kind of distinction that renormalization group flow predicts.

The brain's hierarchical processing — from primary sensory cortex through association areas to prefrontal cortex — traces the same ultraviolet-to-infrared flow that depth traces in the transformer. Early visual cortex encodes oriented edges. Later areas encode object categories. The "depth" of biological processing is anatomical rather than sequential, but the mathematical structure of coarse-graining is the same.

Pathologies as failures of the metastable regime. If the brain implements the dynamics Rigollet describes, then disorders should correspond to failures of the metastable multi-cluster regime — either collapse toward the trivial attractor or fragmentation into disconnected clusters.

Depression provides a natural test case. The brain's energy budget is astonishingly tight. Goal-directed cognition costs only about 5% more than the ongoing metabolic cost of resting neural activity and homeostasis (Jamadar et al., 2025, Trends in Cognitive Sciences). Astrocytic glycogen acts as a limited energy buffer that can temporarily support high neural activity beyond the rate sustained by blood glucose (Christie & Schrater, 2015, Frontiers in Neuroscience). A recent model by Mehrhof et al. (2025, Science Advances) frames depression as disrupted energy allostasis — a mismatch between actual and perceived energy levels in which the brain systematically underestimates its own resources. In the synchronization framework, depression represents a state in which the metabolic budget available for paying the synchronization tax has been depleted or misallocated. The self-model — the brain's compressed representation of its own states and capacities — diverges from the actual state of the system, and the depleted energy budget prevents the renormalization group flow from completing the reconciliation. The self cannot afford to synchronize its model of itself with the evidence of its own experience. The resulting state — withdrawal, anhedonia, cognitive blunting — resembles triviality in the deep infrared: a system that has stopped paying the coordination costs required to maintain relational structure.

The dual-route model of reading provides another window. Two pathways — lexical (whole-word recognition) and nonlexical (grapheme-to-phoneme conversion) — correspond to different depths in the renormalization group flow. The lexical route operates at deep infrared representations; the nonlexical route at shallower ultraviolet processing. Deep dyslexia — producing semantic errors like reading "orchestra" as "symphony" — manifests as a lesion in the flow itself: the system cannot complete the coarse-graining, and a nearby attractor in the infrared captures the representation instead. Surface dyslexia — inability to read irregular words — manifests as disruption of the deep infrared representations while shallower processing remains intact. These double dissociations are precisely what a hierarchical system undergoing renormalization group flow would exhibit when damaged at different depths.

12. Looking Ahead

[back to table of contents]

The mechanism sketched in this essay — multiplicative noise concentrated by renormalization flow onto long-lived configurations where the synchronization tax is approximately stationary — operates at three nested scales. Inside the transformer, attention coarse-grains token representations toward semantic clusters whose metastability sustains useful computation. Inside the brain, neurons and astrocytes coarse-grain sensory representations toward cognitive fixed points, including the self-model. Across brains, social institutions coarse-grain individual perspectives toward common knowledge — the high-synchronization metastable clusters that Rigollet's framework characterizes.

Another essay may develop the social scale more fully, exploring how the same architecture operates when the "neurons" are individual humans, the "layers" are social institutions, and the "temperature" is regulated by constitutional structures. At this scale, North, Wallis, and Weingast's analysis of open-access orders, Cass Sunstein's theory of incompletely theorized agreements, and the American founding as a deliberate act of symmetry-breaking all find natural expression in Rigollet's framework. The open-access order maintains a chimera: sufficient synchronization within institutions to sustain common knowledge, sufficient divergence between institutions to preserve the relational structure that carries meaning. A constitution, on this view, is a collectively chosen mattering project — a symmetry-breaking field that organizes the convergence without collapsing it to the trivial attractor.

The intuition about boundary conditions finds support at every scale examined. The perfect scaling law, the perfect understanding of another mind, the ideal constitution — none needs to be reached. Each appears to represent a boundary condition needed to organize the convergence. We never achieve perfect mutual understanding. But our imperfect, noisy, expensive attempts at coordination do not scatter randomly; they organize into long-lived configurations that every culture independently discovers. Rigollet proves the mathematical version of this claim for the transformer: the Dirac mass — perfect consensus — is the unique asymptotic state, but the system spends almost all of its time in metastable multi-cluster configurations that carry the expressivity needed for useful output. The critical point does not need to be reached. It needs to organize the convergence. The synchronization cost is approximately stationary in these configurations, neighboring descriptions agree, and the informational action does not vary to first order. Everything else is noise that cancels.
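The clustering claim can be illustrated with a toy numerical sketch. This is not Rigollet's exact setup, only a minimal analogue of it: self-attention dynamics restricted to the circle \(S^1\), where each "token" is an angle that drifts toward a softmax-weighted average of the others. The parameter choices (sixteen particles, \(\beta = 1\), Euler steps) are illustrative, not taken from any paper.

```python
import numpy as np

# Toy attention dynamics on the circle S^1 (a hedged sketch, not the exact
# model from Rigollet's papers): each "token" is an angle theta_i that moves
# toward a softmax-weighted average of the other tokens. All parameters here
# (n, beta, dt, steps) are illustrative choices.
rng = np.random.default_rng(0)
n, beta, dt, steps = 16, 1.0, 0.1, 2000
theta = rng.uniform(0.0, 2.0 * np.pi, n)

def order_parameter(theta):
    """Kuramoto order parameter r in [0, 1]; r = 1 means full consensus."""
    return abs(np.exp(1j * theta).mean())

r_init = order_parameter(theta)
for _ in range(steps):
    diff = theta[None, :] - theta[:, None]   # diff[i, j] = theta_j - theta_i
    w = np.exp(beta * np.cos(diff))          # attention logits on the circle
    w /= w.sum(axis=1, keepdims=True)        # softmax over j, row by row
    theta = theta + dt * (w * np.sin(diff)).sum(axis=1)

r_final = order_parameter(theta)             # climbs toward 1 (consensus)
```

From generic random initial angles the order parameter climbs toward 1, the circle's version of the Dirac mass. At \(\beta = 1\) the softmax is mild and the dynamics resemble the classical Kuramoto consensus; sharpening the softmax by raising \(\beta\) is what produces the long-lived multi-cluster plateaus before the eventual collapse to a single cluster.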


  1. The equivalence between the thermodynamic picture of earlier sections — softmax as a canonical partition function, \(\sqrt{d}\) as temperature — and Rigollet's geometric picture — interacting particles evolving on \(S^{d-1}\) — rests on this foundation: in high dimensions, the canonical and microcanonical descriptions agree because the Boltzmann weight is sharply peaked on the energy shell. The dynamics are on the sphere because, in the dimensions where transformers operate, there is effectively nowhere else. ↩︎

  2. The saddle-to-saddle dynamics connect to a broader mathematical framework. Geshkovski, Koubbi, Polyanskiy, and Rigollet, in "Dynamic metastability in the self-attention model", link the metastability of attention dynamics to the slow-motion framework of Otto and Reznikoff developed for coarsening phenomena and the Allen-Cahn equation. In coarsening — the process by which domains in a ferromagnet slowly merge after a quench — the system passes through a sequence of metastable configurations, each more coarse-grained than the last. The energy profile has the same staircase structure, and the dynamics are governed by the same saddle-to-saddle mechanisms. A recent experimental study of turbulent blob dynamics illustrates the pattern at yet another scale: a localized region of turbulent fluid, once deprived of its forcing, expands nonlinearly into the surrounding quiescent medium while its internal cascade leaves an "indelible footprint" deep into the decay process. The blob corresponds to a metastable cluster — internally structured by its own dynamics, bounded sharply from its environment, persisting far longer than naive estimates would predict. Inside a transformer, the blob is a cluster of synchronized token representations; at the social scale, it is a community of shared knowledge — an institution whose internal common knowledge maintains a sharp boundary with the external environment even as the maintenance costs slowly erode. ↩︎

  3. An earlier draft of this essay built the explanatory arc through Alain Connes's recent work on the Weil quadratic form, which shows how extremizing a variational principle over functions that optimally concentrate in both time and frequency yields approximate zeros of the Riemann zeta function constrained exactly to the critical line. Connes's result is beautiful, and the prolate spheroidal wave functions that emerge from his variational principle do minimize the synchronization tax between conjugate representations — they achieve the best possible simultaneous localization in time and frequency. But the bridge requires the primes and transformers to share the same universality class, and they do not. The symmetries that organize the Riemann zeros — the multiplicative structure of the primes acting on the additive structure of the integers — belong to a different universality class than the symmetries of transformer weight matrices, which are better described by the continuous rotational symmetries of the Ginzburg-Landau Hamiltonian and the \(U(1)\) phase oscillator structure of the Kuramoto model. Rigollet's framework provides the correct mathematical home for the dynamics this essay analyzes. ↩︎
