Asymmetric Evolution

Selection and mutation are not merely biological observations, but the gradient and the noise of a dissipative flow on a space of measures — the same mathematics that governs the forward pass of a transformer.

Bullet Cluster (NASA Webb)

This essay is part of a series that includes Maintaining Divergence, A More Perfect Union, and Coherence at 300 Kelvin. It assumes familiarity with the synchronization tax framework developed in those essays.

I. The Hypothesis

Philippe Rigollet's mathematical framework for transformer attention — interacting particles on the sphere, governed by Wasserstein gradient flows, exhibiting phase transitions and metastable clustering — may not be merely analogous to Darwinian evolution. It may actually be the mathematics of evolution, instantiated in a different substrate.

That claim deserves careful scrutiny. A framework so general that it captures everything predicts nothing. If the same variational principle describes transformers, brains, markets, legal systems, and thermostats, it may explain none of them. This can be called the "spherical cow" objection — a description so abstract that it fits any phenomenon, illuminating none. In multiplicative-noise systems, for example, tail exponents depend on non-universal parameters: systems that share structural similarity can still belong to different universality classes, producing different critical exponents and different quantitative behavior near their respective phase transitions. Calling one system an "archetype" of another is metaphor dressed as mathematics unless the identification generates predictions that neither system makes on its own.

Three specific achievements might distinguish a genuine formalization of Darwin from a thermodynamic redescription. The first would derive Price's equation — evolutionary biology's closest approximation to a fundamental law, and a tautology in its standard form — as a consequence of the variational principle, showing that the covariance between fitness and trait value emerges from the KL divergence structure of competing descriptions. The second would predict calculable phase boundaries for evolutionary collapse — not just "the environment changed too fast," which Darwin already predicts, but a specific dimensionless threshold derivable from the coupling-to-noise ratio. The third would resolve the units-of-selection debate — whether selection operates on genes, organisms, or groups — through coupling-dependent cluster scale, with empirical confirmation from bioenergetic data.

This essay sketches out the tentative evidence on all three. Each attempt is examined for where it succeeds, where it remains conjectural, and what observations would destroy it.[1]

II. The Mathematical Correspondence

Rigollet's Framework

Readers of Coherence at 300 Kelvin may recall the core results. Rigollet interprets transformer attention as an interacting particle system on the unit sphere \(S^{d-1}\), where each token representation is a particle and the self-attention mechanism defines a velocity field coupling each particle to the empirical measure of all others. Five proven results bear on the argument that follows.

The interaction energy

$$\mathcal{E}_\beta[\mu] = \frac{1}{2}\iint_{S^{d-1} \times S^{d-1}} e^{\beta \langle x, y \rangle} \, d\mu(x) \, d\mu(y)$$

increases monotonically along the attention dynamics, serving as a Lyapunov function for the system. The unique asymptotic attractor of these dynamics is a Dirac mass — all tokens collapsed to the same representation, a state the earlier essay called the triviality failure mode. But the approach to this attractor proceeds through metastable multi-cluster states: tokens synchronize rapidly within clusters (fast timescale) while clusters merge only slowly (slow timescale). On the circle (\(d = 2\)), these dynamics reduce exactly to the Kuramoto model of synchronization. And a phase transition at \(\beta \sim \gamma \log n\) separates a subcritical regime — where attention becomes uniform and structure dissolves — from a supercritical regime where off-diagonal attention vanishes and the system ceases to interact.
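These dynamics are simple enough to simulate directly. The sketch below is a minimal numerical illustration, not Rigollet's code: the particle count, step size, and \(\beta\) are arbitrary choices. Particles on \(S^2\) follow the attention-weighted velocity field, projected onto the sphere's tangent plane, and the interaction energy is tracked along the flow.

```python
import numpy as np

def energy(X, beta):
    # E_beta[mu] = (1/2) * average of exp(beta * <x_i, x_j>) over all pairs
    return 0.5 * np.exp(beta * (X @ X.T)).mean()

def step(X, beta, dt=1e-3):
    G = X @ X.T                                    # pairwise inner products
    V = np.exp(beta * G) @ X / len(X)              # attention-weighted velocity
    V -= np.sum(V * X, axis=1, keepdims=True) * X  # project onto tangent space
    X = X + dt * V                                 # Euler step
    return X / np.linalg.norm(X, axis=1, keepdims=True)  # retract to sphere

rng = np.random.default_rng(0)
n, d, beta = 64, 3, 2.0
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

energies = [energy(X, beta)]
for _ in range(5000):
    X = step(X, beta)
    energies.append(energy(X, beta))

# the interaction energy acts as a Lyapunov function and rises along the flow
assert energies[-1] > energies[0]
print(f"energy: {energies[0]:.3f} -> {energies[-1]:.3f}")
```

Running it shows the energy rising as particles gather into clusters; the slow merging of clusters at higher \(\beta\) is the metastability described above.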

The Evolutionary Translation

Each organism carries a description \(p_i\) — a probability distribution over the environmental macrostates it can distinguish and respond to. A polar bear's description of the Arctic is sharply peaked around a narrow band of temperature and prey density. A generalist scavenger's description is diffuse, covering a broad range of conditions at the cost of specificity. The environment presents its own description \(q\) — the actual distribution over macrostates. Survival requires that these descriptions not diverge too far: an organism whose description \(p_i\) assigns high probability to states the environment \(q\) renders unlikely will find itself consistently wrong about its circumstances, and consistent wrongness is another name for death.

Selection operates like attention. The environment weights each organism's description by how well it coheres with actual conditions — the fitness function plays the role of the query-key dot product, and differential reproduction performs the softmax-weighted aggregation. Organisms whose descriptions agree with environmental reality constructively interfere; those whose descriptions diverge are selected against.

The control parameters translate directly. Rigollet's inverse temperature \(\beta\) — which in the transformer is \(1/\sqrt{d}\), controlling how sharply attention concentrates — becomes the sharpness of selection, the degree to which the environment discriminates between phenotypes. A stable, predictable environment produces sharp selection (high \(\beta\)): only organisms with tightly matched descriptions survive. A volatile, noisy environment produces diffuse selection (low \(\beta\)): many descriptions are approximately equally viable, and drift dominates over directed change. Context length \(n\) — the number of tokens in the transformer — becomes ecosystem diversity, the number of distinct phenotypic strategies present. And coupling strength \(K\) — the Kuramoto parameter governing synchronization — becomes the bandwidth of the organism-environment interaction channel through which selection signals propagate.

The phase transition translates. Subcritical coupling — too little selection pressure relative to environmental noise — drives genetic drift toward uniformity, erasing the population's structure. This is Kimura's neutral regime, but derived from a different principle. The standard account explains neutrality through small population size; the framework derives it from the coupling-to-noise ratio, predicting that neutral evolution should dominate not only when populations are small but whenever environmental volatility is high relative to coupling strength, regardless of population size. Supercritical coupling — too strong selection — eliminates variation and freezes adaptation. Each lineage evolves in isolation, unable to respond to novel conditions. This is the over-specialization trap that precedes mass extinction vulnerability, and it is the evolutionary analog of the Zeno effect described in Coherence at 300 Kelvin: measurement so frequent that evolution halts. Productive evolution sits between these extremes, in the band where synchronization occurs often enough to establish adaptive structure but slowly enough to preserve the variation that carries adaptive potential.

The Independent Mathematical Bridge

The correspondence between transformer dynamics and evolutionary dynamics does not need to rest on analogy. It can rest on established mathematics that predates the connection.

The replicator equation — the foundational dynamical system of evolutionary game theory — is a gradient flow on the probability simplex with respect to the Shahshahani metric. Siamak Shahshahani demonstrated in 1979 that the natural metric for evolutionary dynamics on the simplex is not the Euclidean distance between population frequencies but the Fisher information metric — the same metric that defines the geometry of statistical manifolds in information theory. Under this metric, the replicator dynamics follow the steepest ascent of mean fitness. The KL divergence — the same quantity that drives the synchronization tax framework — serves as a Lyapunov function for the replicator dynamics at evolutionarily stable states.
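A few lines of code make the Lyapunov property concrete. The example below uses a toy payoff matrix \(A = -I\) (chosen for convenience; it appears in none of the cited papers), whose interior evolutionarily stable state is the uniform mix, and checks that \(D_{\text{KL}}(x^* \| x(t))\) decreases along a discretized replicator trajectory.

```python
import numpy as np

A = -np.eye(3)                 # toy payoff: each strategy penalized against itself
x_star = np.ones(3) / 3        # interior evolutionarily stable state (uniform mix)

def replicator_step(x, dt=0.01):
    f = A @ x                            # per-strategy fitness
    return x + dt * x * (f - x @ f)      # Euler step of the replicator flow

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

x = np.array([0.7, 0.2, 0.1])
kls = [kl(x_star, x)]
for _ in range(5000):
    x = replicator_step(x)
    kls.append(kl(x_star, x))

# D_KL(x* || x(t)) acts as a Lyapunov function: it decays toward zero
assert kls[-1] < kls[0] and kls[-1] < 1e-6
```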

A 2021 paper by Chalub, Monsaingeon, Ribeiro, and Souza pushed the connection further, reformulating all three classical models of biological evolution — the Moran process (a Markov chain), the Kimura equation (a Fokker-Planck diffusion), and the replicator equation (a deterministic flow) — as gradient flows. They showed that the gradient structures converge: the Moran process approximates the Kimura equation, and the Kimura equation reduces to the replicator dynamics in the appropriate limit. Each reformulation minimizes a free energy functional along its respective flow. The title of their paper captures the point precisely: "Gradient Flow Formulations of Discrete and Continuous Evolutionary Models: A Unifying Perspective."

These results establish that evolutionary dynamics and transformer dynamics both live in the same mathematical house. Both are gradient flows on spaces of probability measures. Both use the KL divergence (or its localization, the Fisher information metric) to define the geometry of their respective state spaces. Both exhibit monotonically increasing Lyapunov functions — interaction energy in Rigollet's framework, mean fitness in the Shahshahani framework — that govern the direction of flow.

Pearl's causal tools strengthen this connection from an independent direction. The Causal Arrow of Altruism analysis used Pearl's PC algorithm, instrumental variables, and sensitivity analysis to distinguish genuine synchronization (shared genetic descriptions driving eusociality) from spurious correlation (environmental proximity creating the illusion of causal connection). Ecological pressure — an external shock that forced colony formation without altering genetics — served as an instrumental variable, satisfying Pearl's exclusion restriction: it affected eusociality only through colony formation, allowing identification of the causal effect. The result was decisive. Forced grouping without shared descriptions failed to produce cooperation. The physical act of bringing agents into proximity paid the transmission cost but not the synchronization cost, and transmission alone — the first rung of Pearl's ladder — could not produce the coordination that synchronization — the second rung — provides.

III. Attempt to Falsify #1: Derive Price's Equation

The Derivation

George Price's equation

$$\bar{w}\,\Delta\bar{z} = \text{Cov}(w, z) + E(w\Delta z)$$

decomposes the change in mean trait value into two terms: the covariance between fitness \(w\) and trait value \(z\) (selection), and the expected fitness-weighted transmission bias \(E(w\Delta z)\) (imperfect inheritance). The equation holds for any population with heritable variation and differential reproduction. It is, as noted by those who admire it and those who dismiss it, a tautology — a mathematical identity that follows from the definitions of covariance and expectation. Its power lies in its generality; its weakness lies in the same place. Price's equation describes what happens without explaining why the decomposition takes the form it does. The synchronization tax framework might supply an explanation.
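Because the equation is an identity, it can be verified on arbitrary synthetic data. The sketch below (all numbers invented) works with relative fitness \(v_i = w_i/\bar{w}\), in which case the two-term decomposition holds exactly for any finite population:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200
z = rng.normal(size=N)                                # parental trait values
w = np.exp(0.5 * z + rng.normal(scale=0.3, size=N))   # absolute fitness
v = w / w.mean()                                      # relative fitness, mean 1
dz = rng.normal(scale=0.05, size=N)                   # transmission bias per lineage

# LHS: change in mean trait, each offspring weighted by its parent's fitness share
z_bar_next = np.sum(w * (z + dz)) / np.sum(w)
lhs = z_bar_next - z.mean()

# RHS: selection term Cov(v, z) plus transmission term E(v * dz)
cov_vz = np.mean((v - 1.0) * (z - z.mean()))
rhs = cov_vz + np.mean(v * dz)
assert abs(lhs - rhs) < 1e-12   # exact identity, up to floating point
```

The assertion holds for any choice of trait values, fitnesses, and transmission biases, which is precisely the tautology the text describes.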

Consider a population of \(N\) organisms, each carrying a description \(p_i\) and a trait value \(z_i\). The environment presents its description \(q\). Each organism pays a synchronization tax against the environment — the KL divergence between its model and reality:

$$F_i = D_{\text{KL}}(p_i \| q)$$

An organism whose description closely matches the environment persists and reproduces. An organism whose description diverges from reality fails to persist. Fitness must therefore be a decreasing function of the synchronization tax. In the linearized regime — small deviations around the population mean — this relationship takes a specific form:

$$w_i \approx \bar{w}\left(1 - \beta(F_i - \bar{F})\right)$$

where \(\beta\) plays the same role it plays throughout the framework: an inverse temperature controlling how sharply the population discriminates between better and worse descriptions. This is structurally identical to the Boltzmann weighting in transformer attention — fitness weights organisms by the negative of their free energy deviation, just as attention weights tokens by the negative of their relational incoherence.

With fitness as a linear function of negative free energy deviation, the selection covariance becomes:

$$\text{Cov}(w, z) = -\beta\bar{w} \cdot \text{Cov}\left(D_{\text{KL}}(p_i \| q),\; z_i\right)$$

This covariance is nonzero precisely when trait values correlate with the divergence between individual descriptions and environmental reality. Organisms whose traits produce descriptions closer to \(q\) — lower synchronization tax — reproduce more. The selection term drives the population's aggregate description toward configurations that minimize free energy. Selection, in this reading, is the population performing gradient descent on its collective free energy functional — the same gradient descent that Rigollet proves governs the transformer's forward pass.

The transmission bias term \(E(w\Delta z)\) captures something different. Reproduction creates a new channel between parent and offspring descriptions. The fidelity of that channel is bounded by maintenance costs — the energy available to protect the replicated description against noise during copying. Maintaining Divergence identifies this cost as irreducible: even after two systems have established a shared description, that description begins degrading immediately as external perturbations introduce noise and internal states drift. The transmission bias measures the fitness-weighted average of these maintenance failures across the population — the residual noise that survives the organism's investment in copying fidelity.

Price's equation, under this derivation, reads:

$$\bar{w}\,\Delta\bar{z} = \underbrace{-\beta\bar{w} \cdot \text{Cov}(D_{\text{KL}}(p_i \| q), z_i)}_{\text{synchronization: gradient descent on } \mathcal{F}} + \underbrace{E(w\Delta z)}_{\text{maintenance failure: replication noise}}$$

The first term points in the direction that reduces population-level free energy. The second term adds noise from imperfect maintenance of descriptions across generations. The derivation reveals something the bare tautology does not: why Price's equation decomposes into exactly two terms. The decomposition is not an artifact of how biologists define selection and mutation. It reflects the two components that any variational principle with imperfect state transfer must produce — a directed gradient and a stochastic residual. Selection and transmission bias are not two independently observed phenomena that happen to sum to the total change. They are the gradient and the noise of a dissipative flow on a space of measures.
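The linearized selection term can be checked against an exact computation on synthetic descriptions. In the sketch below (distributions and parameters invented for illustration), each organism's categorical description is tilted by its trait \(z_i\), fitness is the Boltzmann weight \(e^{-\beta F_i}\), and the exact covariance \(\text{Cov}(w, z)\) is compared with the linearized form \(-\beta\bar{w}\,\text{Cov}(F_i, z_i)\):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
N, beta = 2000, 1.0
direction = np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # axis along which traits tilt p_i

q = softmax(0.3 * direction)               # environment's description
z = rng.normal(scale=0.1, size=N)          # trait values, small spread
P = softmax(z[:, None] * direction)        # organism descriptions p_i

F = np.sum(P * np.log(P / q), axis=1)      # synchronization tax D_KL(p_i || q)
w = np.exp(-beta * F)                      # Boltzmann fitness
w_bar = w.mean()

def cov(a, b):
    return np.mean((a - a.mean()) * (b - b.mean()))

exact = cov(w, z)                           # selection covariance, exact
linearized = -beta * w_bar * cov(F, z)      # the derivation's linearized form
assert abs(exact - linearized) < 0.15 * abs(exact)
```

For small trait spread the two agree closely; widening the spread or raising \(\beta\) breaks the linearization, which is the rugged-landscape caveat taken up later in this section.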

Novel Predictions

The derivation generates predictions that Price's equation alone cannot make.

Selection strength should scale with available energy. Populations with larger free energy budgets can sustain stronger selection — faster synchronization with the environment. The naked mole rat colony described in Maintaining Divergence illustrates the tradeoff at its extreme: the colony economized on maintenance costs by sacrificing individual divergence, giving up the diversity of independent descriptions in exchange for the energetic efficiency of shared ones.

Transmission bias should scale with the maintenance deficit. Populations investing less energy in replication fidelity — smaller maintenance budgets — should exhibit higher transmission bias. This is testable: mutation rates should correlate with the metabolic resources available for DNA repair, and the framework predicts the functional form of this correlation through the maintenance cost structure.

Frequency-dependent selection should exhibit the interaction-energy structure that Rigollet proves. When fitness depends on population composition — when \(w_i = w(p_i, \{p_j\}_{j \neq i}, q)\) — the selection term acquires dependence on the full population measure \(\mu\), precisely as Rigollet's interaction energy \(\mathcal{E}_\beta[\mu]\) depends on the measure over token representations. The Lotka-Volterra dynamics of competing species should correspond to the saddle-to-saddle coarsening dynamics that Rigollet proves for the attention system — fast within-niche synchronization, slow between-niche merging.

Where It Might Break

The derivation's principal vulnerability lies in the smoothness assumption. Fitness must vary smoothly with \(D_{\text{KL}}(p_i \| q)\) for the linearized regime to hold. On rugged fitness landscapes — where small genotypic changes produce large fitness jumps — the mapping from genotype to description to fitness becomes so nonlinear that the linearization breaks down. The framework predicts that evolution on rugged landscapes should exhibit the same failure modes as transformers at extreme coupling: the Landau pole, where selection tries to pay the entire synchronization tax at once and the system freezes. Whether this prediction holds is testable — it implies that organisms on rugged landscapes should show signatures of evolutionary stasis interspersed with punctuated jumps, a pattern that Eldredge and Gould's punctuated equilibrium proposes for independent reasons.

A second vulnerability: the derivation assumes the environment's description \(q\) is approximately stationary. In coevolutionary dynamics — where the environment includes other organisms whose descriptions are themselves evolving — \(q\) becomes a moving target. The population faces a Red Queen problem: the synchronization tax never approaches zero because the target keeps shifting. The variational structure becomes a pursuit problem rather than a minimization, and the connection to Price's equation becomes more tenuous. Whether the gradient-flow formulation extends cleanly to pursuit dynamics is an open mathematical question.

IV. Attempt to Falsify #2: Predict Mass Extinction Thresholds

The Prediction

Standard evolutionary theory predicts that populations fail when environments change "too fast." The synchronization tax framework attempts something more precise: a calculable threshold.

The Kuramoto model gives a critical coupling for the onset of synchronization. For a population of oscillators with frequency distribution \(g(\omega)\), the critical coupling is:

$$K_c = \frac{2}{\pi\,g(0)}$$

where \(g(0)\) is the density of the frequency distribution at its center. For a Gaussian distribution with standard deviation \(\sigma\):

$$K_c = 2\sigma\sqrt{\frac{2}{\pi}} \approx 1.6\sigma$$

Translating to ecology: \(g(\omega)\) becomes the distribution of life-history rates across species — metabolic rates, generation times, niche widths. The coupling strength \(K\) becomes the bandwidth of the organism-environment interaction channel. Environmental perturbation acts as a spread in the effective frequency distribution, desynchronizing species from their ecological niches.

The dimensionless prediction: mass extinction occurs when the ratio of environmental perturbation spread to ecological coupling strength exceeds approximately 0.63. That ratio should remain roughly constant across extinction events despite enormous variation in absolute perturbation magnitudes — whether the perturbation is an asteroid impact, volcanic outgassing, or glaciation.
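The arithmetic is short enough to show in full. The sketch below computes \(K_c\) for a Gaussian frequency distribution, the dimensionless ratio \(\sigma/K_c \approx 0.63\) quoted above, and the coupling bandwidth implied by the 5.2°C figure discussed below:

```python
import numpy as np

sigma = 1.0                                   # spread of the frequency distribution
g0 = 1.0 / (sigma * np.sqrt(2 * np.pi))       # Gaussian density at its center
K_c = 2.0 / (np.pi * g0)                      # Kuramoto critical coupling

# identical to the closed form 2 * sigma * sqrt(2/pi)
assert abs(K_c - 2 * sigma * np.sqrt(2 / np.pi)) < 1e-12
print(round(K_c, 3), round(sigma / K_c, 3))   # 1.596 0.627

# applied to the 5.2 deg C threshold: implied ecological coupling bandwidth
print(round(1.596 * 5.2, 1))                  # 8.3
```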

Testing Against Empirical Data

Song et al. (2021) in Nature Communications found that major mass extinctions in the Phanerozoic correlate with temperature change thresholds: magnitudes exceeding 5.2°C and rates exceeding 10°C per million years. These are empirically observed thresholds. The framework asks: can we derive them?

If 5.2°C represents the critical perturbation spread \(\sigma_{\text{env}}\), then \(K_{\text{eco}} \approx 1.6 \times 5.2 \approx 8.3\)°C — the ecological coupling bandwidth of the biosphere. This number is at least plausible: it is roughly the range of global temperature variation between Phanerozoic glacial and greenhouse states. The biosphere can track approximately 8°C of gradual change through selection; perturbations beyond that overwhelm the coupling channel.

The framework generates four predictions that standard theory does not.

First, the transition should be sharp, not gradual. Standard theory predicts a monotonically increasing extinction rate with increasing environmental stress. The synchronization tax framework predicts a phase transition — below the critical threshold, the ecosystem adapts through the metastable multi-cluster regime; above it, the metastable structure catastrophically collapses. The paleontological observation that large extinctions often result when a biosphere under long-term stress undergoes a short-term shock is consistent with a system driven near its phase boundary by chronic stress, then pushed across by acute perturbation. The framework predicts critical slowing down before the transition: increased fluctuation in species composition, longer recovery times after small perturbations, increased autocorrelation in diversity time series. These signatures are measurable in high-resolution fossil records.

Second, generalist species should contribute disproportionately to ecosystem resilience. The chimera ablation experiments from Coherence at 300 Kelvin showed that ablating desynchronized (high-entropy) attention heads in Pythia language models caused 5–12× more performance degradation than ablating synchronized heads. The evolutionary analog: eliminating rare generalists — species with diffuse, desynchronized descriptions — should damage ecosystem function more per unit of biomass than eliminating abundant specialists, because the generalists carry the relational payload, the phenotypic diversity that enables adaptation to unforeseen perturbations. Jablonski's finding that extinction selectivity shifts qualitatively during mass extinctions — that different traits predict survival under background versus crisis regimes — supports this prediction. The normal regime depends on specialist fitness. The crisis regime depends on the generalist structure that survives the phase boundary.

Third, genetically homogeneous populations should fail catastrophically rather than gradually. The naked mole rat colony, described in Maintaining Divergence, has approached the Dirac mass attractor — nearly all members carry the same description. The framework predicts that such systems should be catastrophically vulnerable to novel perturbation. The colony thrives because its environment is stable — underground, buffered, predictable. A novel pathogen or environmental shift that defeats the single shared description should produce total collapse, not gradual attrition. This is testable: introduce novel stressors to colonies of varying genetic diversity and measure whether the extinction curve is gradual (as standard theory predicts for small populations) or catastrophically sharp (as the phase transition predicts). Jablonski's work on the spatial dynamics of biodiversity is suggestive.

Fourth, interaction network topology should predict extinction vulnerability independently of perturbation magnitude. Ecosystems with scale-free or small-world interaction topologies — food webs, mutualistic networks, competitive hierarchies — should have lower critical coupling thresholds (greater resilience) than ecosystems with random or regular structures, because the network topology concentrates coupling where it matters most. This is the same principle by which multi-head attention with heterogeneous coupling strengths outperforms uniform attention.

Beyond these four predictions, the framework makes one that standard theory is structurally unable to produce: extinction thresholds should scale logarithmically with ecosystem diversity. More diverse ecosystems (larger \(n\)) should tolerate slightly larger perturbations, but the gain is logarithmic — \(\beta \sim \gamma \log n\) — not linear and not a power law. This is testable by comparing mass extinction thresholds across geological periods of varying standing biodiversity, controlling for perturbation type. The relatively species-poor early Paleozoic should have been more vulnerable to smaller perturbations than the highly diverse mid-Cretaceous.
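The claimed scaling is easy to make concrete. A minimal sketch, assuming \(\beta_c = \gamma \log n\) with an arbitrary constant \(\gamma\) (the framework fixes only the functional form, not the constant):

```python
import numpy as np

gamma = 1.0   # unknown proportionality constant; only the log form is claimed
for n in (10**3, 10**6, 10**9):
    print(n, round(gamma * np.log(n), 2))
# each thousand-fold jump in diversity adds the same increment (~6.91 * gamma),
# so a million-fold gain in diversity merely doubles the tolerable threshold
```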

Where the Chain Breaks

The Kuramoto model assumes weak coupling and sinusoidal interaction. Ecological interactions are neither weak nor sinusoidal. Rigollet's extension to exponential kernels on the sphere provides some evidence that the qualitative phenomenology — clustering, metastability, phase transition — survives changes in the interaction kernel. But extending this to realistic ecological coupling functions remains an open problem.

"Effective temperature" in evolution is not a single parameter. Environmental volatility acts along many dimensions simultaneously — temperature, ocean chemistry, atmospheric composition, biotic interactions. Mapping these onto a single scalar \(\beta\) requires a dimensional reduction whose validity is not guaranteed. The Ginzburg-Landau universality argument suggests that only the symmetry class matters, not the specific dimensions — but identifying the correct universality class for ecological dynamics is itself an unsolved problem.

Sinha and Parthasarathy (1996) showed that phase transition behavior already emerges in simpler extinction models without the full Kuramoto/Rigollet machinery. The synchronization tax framework must demonstrate that its additional structure — the specific predictions about logarithmic scaling, chimera-dependent resilience, and topology-dependent thresholds — produces results these simpler models cannot. Without that demonstration, the framework may be adding elaborate mathematics to a phenomenon that simpler models already capture.

Fossil record sampling biases — preservation artifacts, the rarity of well-resolved mass extinctions, and the difficulty of estimating standing diversity deep in geological time — limit the statistical power available to test logarithmic scaling and transition sharpness. These predictions may be correct but effectively untestable with current data.

V. Attempt to Falsify #3: Resolve the Units of Selection

The Prediction

Whether natural selection operates on genes, organisms, or groups has been one of evolutionary biology's most persistent disputes. Dawkins argued for the gene as the fundamental unit; Maynard Smith and Szathmáry cataloged major transitions at which selection shifted to higher organizational scales; Wilson and Sober defended multilevel selection; and the debate has generated more heat than resolution for half a century.

The synchronization tax framework offers a structural answer: selection operates at whatever scale the metastable clusters form. The cluster scale depends on coupling strength — the same parameter governing the phase transition in Rigollet's analysis. Strong coupling — high relatedness, tight ecological interaction — pushes metastable clusters to higher organizational scales: cells synchronized within organisms, organisms synchronized within colonies. Weak coupling fragments clusters downward: selfish genetic elements synchronized within genomes but competing between them. The scale of selection is not fixed by biology. It is set by physics — by the coupling strength at which metastable states persist long enough to be the relevant units of differential reproduction.

This is a falsifiable prediction. If the scale of selection shifted with coupling strength in the way the framework predicts — upward with stronger coupling, downward with weaker — the long-running debate would dissolve into a parameter identification problem. If the scale of selection turned out to be invariant to coupling strength, the framework would be wrong.

Lynch's Empirical Test

Michael Lynch's bioenergetic data provide the missing empirical anchor for the prediction's most accessible test case: the transition from unicellular to multicellular life.

Lynch's central finding is that multicellular organisms pay a bioenergetic cost per unit biomass produced that exceeds that of unicellular organisms of comparable size by 10- to 60-fold. This premium is not a smooth extrapolation from unicellular costs. It jumps — what Lynch calls a "quantum leap" — at the boundary between unicellular and multicellular life. Even the simplest metazoans already pay costs elevated by an order of magnitude over protists of comparable mass.

The synchronization tax framework predicts four features of this transition, and Lynch's data confirm all four.

The cost jump should be discrete, not continuous — a thermodynamic barrier crossed rather than a slope climbed. Lynch measures a quantum leap.

The premium should pay for coupling infrastructure — the physical substrate that holds the organism-level cluster together. Lynch identifies exactly these costs: cell adhesion, intercellular signaling, support tissue, neural coordination, and transport networks.

Costs should rise within ontogeny but fall across species — a countergradient that fingerprints two forces the framework predicts. Within development, the growing cluster demands ever more coupling infrastructure; each new cell must be connected, communicated with, and coordinated. Across evolutionary time, selection at the organism scale has optimized that infrastructure, driving the cost per unit biomass down. Lynch confirms both patterns.

And somatic cell turnover — the continuous metabolic investment in replacing aged and damaged cells — is what Maintaining Divergence calls anti-synchronization cost: the ongoing thermodynamic work of keeping the organism's internal description coherent against entropic degradation. Lynch confirms that multicellular species invest heavily in recurrent somatic-cell turnover, imposing replacement costs not incurred by unicellular species.

The framework further predicts analogous discontinuities at every major transition in Maynard Smith and Szathmáry's catalog: independent replicators to genomes, prokaryotes to eukaryotes, unicellular to multicellular life, solitary organisms to colonies. Whether each transition shows a comparable quantum leap in coupling cost remains an open empirical question — Lynch's data cover only the unicellular-multicellular boundary.

The conversion remains incomplete in one important respect. Lynch measures ATP, not KL divergence. Bridging from oxygen-consumption measurements to informational divergences would require decomposing the 10–60× cost increase into components: how much of the premium resolves genuine informational disagreement between cells (synchronization) versus maintaining purely mechanical integrity (structural maintenance)? That decomposition has not been done. Without it, the correspondence between Lynch's bioenergetic data and the framework's information-theoretic predictions remains quantitatively open, even as the qualitative predictions are confirmed.

Pearl's Causal Contribution

The Causal Arrow of Altruism analysis provides independent support for the framework's prediction about cluster scale, using Pearl's causal inference tools rather than thermodynamic arguments.

The analysis used an agent-based model with ecological pressure as an instrumental variable — an external shock that forced colony formation without directly altering genetics. In Pearl's language, ecological pressure satisfies the exclusion restriction: it affects eusociality only through colony formation, allowing identification of the causal effect. The result was telling: forced grouping did not produce eusociality. Colony formation without genetic relatedness was like running a cable between two computers that speak different protocols — the channel exists, but no communication occurs.

The sensitivity analysis quantified the fragility of the group-selection link. A sensitivity parameter of \(\Gamma = 1.0\) meant the association had no robustness margin at all: an unobserved confounder of even minimal strength could explain away the apparent link between colony formation and eusociality. The grouping-altruism correlation was thermodynamically cheap to produce through shared environmental exposure; no direct synchronization between the variables was required. Relatedness, by contrast, survived every confounding test. Its causal link to eusociality was robust precisely because shared genetic descriptions cannot be mimicked by environmental correlation alone.

This result translates directly into the framework's prediction about cluster scale. Grouping — physical proximity without synchronization and shared descriptions — fails to form a metastable cluster. Lynch's bioenergetic premium measures the cost of the synchronization that grouping alone cannot provide. Pearl's tools confirm that the cost is not just thermodynamically real but causally necessary: without it, the cluster does not form, and the higher scale of selection does not emerge.

VI. What Would Destroy the Framework?

Five observations would weaken or destroy the identification between transformer dynamics and evolutionary dynamics.

Different critical exponents. If evolutionary dynamics and transformer dynamics exhibit different critical exponents at their respective phase transitions — if speciation and extinction rates near mass extinction events scale differently from the attention order parameter near the critical coupling threshold — the shared universality claim fails. The systems would share a family resemblance but not a mathematical identity. Different exponents would mean different symmetry classes, and different symmetry classes would mean the gradient-flow correspondence, however elegant, captures form but not substance.

Dispensable chimera structure. If populations that maintained uniform genetic similarity performed as well as or better than those maintaining productive divergence — across a range of environmental conditions, not just in stable environments where homogeneity is expected to succeed — the framework's central claim about maintaining divergence would reduce to a special case rather than a general principle. The naked mole rat succeeds because its environment is stable. If genetically uniform populations also succeeded in volatile environments, the chimera prediction would fail.

Non-inferential persistence. If persistent biological systems could be shown to operate through mechanisms fundamentally unlike inference or model-maintenance — requiring no internal description and no Markov blanket — then Friston's bridge between thermodynamics and biology would collapse, and the transformer-evolution connection would go with it. This is the deepest falsification test, because it challenges not just the specific framework but the premise that organisms are usefully described as carrying and updating models of their environment.

Failure of KL divergence to predict competition. If organisms with highly divergent models of their environment did not face measurably different synchronization costs in ecological competition — if the thermodynamic language added no explanatory power over standard fitness differences — the framework's claim to have formalized rather than merely redescribed Darwinian evolution would be empty.
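The asymmetry that makes this test non-trivial is easy to exhibit. The distributions below are made up for illustration; the point is only that the cost of updating one description toward another depends on direction, so "divergent models" is never a single symmetric number.

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) in bits: in the framework's terms, the cost of updating
    a system holding description q until it matches description p."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log2(p / q)))

# Two hypothetical organisms' models over four environmental states
p = [0.70, 0.10, 0.10, 0.10]   # specialist: concentrated description
q = [0.25, 0.25, 0.25, 0.25]   # generalist: uniform description

print(f"D_KL(p||q) = {kl(p, q):.3f} bits")   # updating the generalist toward p
print(f"D_KL(q||p) = {kl(q, p):.3f} bits")   # updating the specialist toward q
```

The two directions give different costs — the same asymmetry the footnote identifies with Pearl's directed arrows. The falsification test asks whether these directed quantities, not mere model distance, track realized competitive costs.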

The acyclicity problem. Pearl's framework requires directed acyclic graphs. Evolutionary systems are pervasively cyclic: coevolution creates feedback loops, Red Queen dynamics sustain perpetual arms races, and perception-action loops define the organism-environment interface. If the cyclic structure of ecological dynamics turns out to be essential rather than idealizable — if the acyclic approximation that makes the framework's variational structure tractable fundamentally misrepresents the dynamics — the formalization would need a deeper mathematical foundation than gradient flows on probability measures currently provide. Pearl addresses cycles through equilibrium analysis of structural equations, and the synchronization tax framework treats metastable states as quasi-equilibria. Whether these approximations suffice, or whether genuinely cyclic dynamics require a different mathematical framework entirely, remains an open question.
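The quasi-equilibrium move can be made concrete. The pair of structural equations below is a deliberately minimal coevolution loop with hypothetical coefficients: Pearl-style equilibrium analysis replaces the cycle with its fixed point, which exists here because the loop gain \(|ab|\) is below one. The open question above is whether real ecological cycles sit in this contractive regime at all.

```python
# Minimal cyclic structural-equation pair -- a coevolution feedback loop
# treated the way Pearl treats cycles: by solving for the equilibrium.
a, b = 0.6, -0.4       # hypothetical cross-couplings; |a*b| < 1 gives a contraction
u_x, u_y = 1.0, 0.5    # exogenous inputs (environment, mutation pressure)

x, y = 0.0, 0.0
for _ in range(200):                       # fixed-point iteration around the cycle
    x, y = a * y + u_x, b * x + u_y

# Closed-form equilibrium of the same structural equations
x_star = (u_x + a * u_y) / (1 - a * b)
y_star = (u_y + b * u_x) / (1 - a * b)
print(f"iterated:    x = {x:.6f}, y = {y:.6f}")
print(f"closed form: x = {x_star:.6f}, y = {y_star:.6f}")
```

When \(|ab| \ge 1\) the iteration diverges and no quasi-equilibrium exists — Red Queen dynamics that never settle are exactly the case the acyclic approximation cannot absorb.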

VII. Current Assessment

The framework is closer to a map than a proof, translating between evolutionary biology and statistical physics in ways that illuminate both, generating predictions neither discipline makes on its own.

Price's equation emerges as a variational consequence rather than a tautology, with selection reinterpreted as gradient descent on a free energy functional and transmission bias as irreducible maintenance noise — the two components that any dissipative flow on a space of measures must produce. Mass extinction thresholds become calculable from coupling-to-noise ratios, with specific predictions — logarithmic scaling with diversity, sharpness of transition, critical slowing down, chimera-dependent resilience — that standard evolutionary theory does not derive and that simpler phase-transition models do not generate. The units-of-selection debate dissolves structurally: selection operates at whatever scale metastable clusters form, with Lynch's bioenergetic data confirming the discrete jump in coupling costs at the unicellular-multicellular boundary. And Pearl's causal tools, applied independently of the thermodynamic apparatus, confirm that synchronization (shared description) rather than proximity (physical grouping) drives coordination at every scale examined.
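The flavor of "calculable from coupling-to-noise ratios" can be shown with the simplest toy that has the right phase structure — a noisy mean-field Kuramoto model, offered as an illustration of the claim, not as the framework's actual evolutionary model. For near-identical oscillators the order parameter stays near zero until the coupling-to-noise ratio \(K/D\) crosses roughly 2, then jumps:

```python
import numpy as np

def order_parameter(K, n=500, steps=2000, dt=0.05, noise=1.0, seed=0):
    """Final synchronization r for mean-field Kuramoto oscillators with
    thermal noise. For near-identical oscillators the transition sits near
    K_c ~ 2 * noise, i.e. at a fixed coupling-to-noise ratio."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    omega = rng.normal(0.0, 0.1, n)          # small natural-frequency spread
    for _ in range(steps):
        z = np.exp(1j * theta).mean()        # complex order parameter
        r, psi = np.abs(z), np.angle(z)
        theta += dt * (omega + K * r * np.sin(psi - theta))
        theta += np.sqrt(2.0 * noise * dt) * rng.standard_normal(n)
    return float(np.abs(np.exp(1j * theta).mean()))

results = {K: order_parameter(K) for K in (0.5, 1.5, 2.5, 3.5)}
for K, r in results.items():
    print(f"K = {K:.1f}  ->  r = {r:.2f}")   # disordered below K_c, ordered above
```

The threshold is dimensionless and derivable from a self-consistency condition rather than fitted — the structural analogue of the extinction thresholds the framework claims are calculable.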

Whether this translation reveals a shared deep structure or merely exploits a shared mathematical vocabulary is an empirical question — not a philosophical one. The framework currently translates between domains rather than deriving one from the other. The three derivations attempted here push toward formalization but have not completed it. The most honest description available: the transformer architecture and biological evolution are both gradient flows on spaces of probability measures, both driven by multiplicative noise concentrated through coupling onto metastable multi-cluster configurations between triviality and chaos. Whether they share a universality class — whether the critical exponents match, whether the symmetries are truly identical rather than superficially similar — is testable. Testing it would settle the question.
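Testing the exponent match is, mechanically, a curve-fitting exercise. The sketch below uses synthetic noiseless data with invented exponents purely to show the procedure: estimate \(\beta\) from the log-log slope of each order parameter against distance from its critical coupling, and compare.

```python
import numpy as np

K_c = 1.0
K = np.linspace(1.01, 1.5, 40)

beta_evo, beta_attn = 0.50, 0.33          # hypothetical exponents for the two systems
r_evo  = (K - K_c) ** beta_evo            # stand-in: evolutionary order parameter
r_attn = (K - K_c) ** beta_attn           # stand-in: attention order parameter

def fit_exponent(K, r, K_c):
    """Estimate beta as the log-log slope of r against (K - K_c)."""
    slope, _ = np.polyfit(np.log(K - K_c), np.log(r), 1)
    return slope

b_evo, b_attn = fit_exponent(K, r_evo, K_c), fit_exponent(K, r_attn, K_c)
print(f"fitted exponents: evolution {b_evo:.2f}, attention {b_attn:.2f}")
```

With real data the hard part is everything this sketch assumes away — locating \(K_c\), finite-size effects, noise — but mismatched fitted exponents of exactly this kind are what the first falsification test above asks for.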


  1. Throughout, Judea Pearl's structural causal models provide independent evidence that the correspondence between transformer dynamics and evolutionary dynamics is causal rather than associational. Pearl's asymmetric directed graphs encode the same structure the synchronization tax encodes in asymmetric KL divergence — both formalize the fact that the cost of updating one description to match another is not symmetric. Pearl's do-operator \(do(X = x)\), which severs incoming causal arrows and imposes a value, formalizes what the framework calls maximally asymmetric synchronization: one description imposed on another, the full cost borne by the target.

    The three-part decomposition from Maintaining Divergence — transmission, synchronization, maintenance — maps onto Pearl's "ladder of causation." Transmission corresponds to Pearl's first rung (association): signals propagate through existing channels, and patterns become visible. Synchronization corresponds to the second rung (intervention): the do-operator physically intervenes on a system, severing incoming causal arrows and imposing a new value, just as one system forces another to update its description. Maintenance corresponds to the third rung (counterfactuals): Pearl's counterfactuals ask what would have happened under different structural equations; maintenance asks what will happen if we stop paying the costs of keeping a shared description aligned. Both probe the structure that sustains an observed state by asking what changes when the sustaining mechanism is removed. If the framework merely redescribed evolution in thermodynamic language, Pearl's tools would not independently confirm the causal structure. But they do — as the "Causal Arrow of Altruism" analysis demonstrated and as the falsification tests below confirm. ↩︎
