The Wall as Unpaid Synchronization Tax

On what three follow-up experiments reveal about the wall between supervised and unsupervised positions in transformers — and why the diagnosis matters for anyone using them on noisy data.

The wall is real

Vishal Misra's essay The Wall Between Shannon and Kolmogorov describes training a small transformer on a mixed task of predicting integer tokens in a sequence: half the sequences follow deterministic linear recurrences \(x_{t+1} = a \cdot x_t + b \bmod 17\) with fresh \((a, b)\) per episode, and half are uniform noise.
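
A minimal sketch of that episode generator, as I read the setup (the sequence length and the sampling range for \((a, b)\) are my assumptions, not taken from Misra's code):

```python
import torch

MOD = 17  # modulus and vocabulary size

def make_episode(seq_len: int = 8, p_noise: float = 0.5) -> torch.Tensor:
    """One training episode: with probability p_noise, uniform noise over 0..16;
    otherwise a linear recurrence x_{t+1} = (a * x_t + b) mod 17
    with a fresh (a, b) drawn per episode."""
    if torch.rand(()).item() < p_noise:
        return torch.randint(0, MOD, (seq_len,))
    a, b = torch.randint(0, MOD, (2,)).tolist()   # fresh rule for this episode
    seq = [torch.randint(0, MOD, (1,)).item()]
    for _ in range(seq_len - 1):
        seq.append((a * seq[-1] + b) % MOD)
    return torch.tensor(seq)

batch = torch.stack([make_episode() for _ in range(64)])  # (64, seq_len)
```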

In Misra's "wall" experiment, cross-entropy loss supervises only positions 1–5. At position 5 the model reaches near-Bayesian precision (≈0.020 bits MAE). At unsupervised position 6 — where the model hits the "wall" — predictive entropy reverts toward the uniform ceiling and the entropy error balloons to ≈1.63 bits, an 83× collapse. Scaling from 3M to 300M parameters does not move it.

Three follow-up experiments — wall-erosion-experiment, roof-experiment, and auxiliary-causal — reproduce the wall (~76× at full convergence) and confirm Misra's core observation.

But they seem to disagree with his explanation.

Misra frames the wall as a "Shannon/Kolmogorov divide." The model, he argues, is a position-local identifier rather than a re-deployable program. Gradient descent compiled a circuit at positions 3–5 that could, in principle, run at position 6 — but never did. No portable algorithm exists in the weights.

Three pieces of evidence, taken together, tell a more nuanced story. The wall does not mark where an algorithm failed to compile. It marks where a tax was not paid.

What is the wall?

Every position in the sequence presents the model with a prediction problem: given the hidden state \(h_t\), produce an output distribution \(Q_t\) over the 17 possible next tokens. The Bayes-optimal distribution \(P_t\) — the one that accounts for the mixture of deterministic recurrences and noise — is the target. The KL divergence \(D_{KL}(P_t \| Q_t)\) measures how far apart these two descriptions sit at each position: how much work remains to bring the model's prediction into alignment with reality.

At supervised positions, gradient descent does that work. Cross-entropy loss drives \(D_{KL}\) toward zero by adjusting the internal representations so that the model's output converges on the Bayesian answer. At unsupervised positions, nothing does the work. The KL stays near \(\log(17) \approx 2.83\) nats.
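
Measured per position, the quantity is just this — a small sketch, with tensor shapes assumed:

```python
import torch
import torch.nn.functional as F

def per_position_kl(bayes_probs: torch.Tensor, logits: torch.Tensor) -> torch.Tensor:
    """D_KL(P_t || Q_t) in nats, averaged over a batch, one value per position.
    bayes_probs: (batch, seq, 17) Bayes-optimal next-token distributions P_t
    logits:      (batch, seq, 17) model outputs, so Q_t = softmax(logits)"""
    log_q = F.log_softmax(logits, dim=-1)
    log_p = bayes_probs.clamp_min(1e-12).log()
    kl = (bayes_probs * (log_p - log_q)).sum(dim=-1)  # (batch, seq)
    return kl.mean(dim=0)  # supervised positions -> ~0; unsupervised -> ~log(17)
```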

The wall, in this framing, is the boundary between positions where the thermodynamic work (what I have called the "synchronization tax") has been done (or "paid") and positions where it has not. Understanding why the synchronization tax must be paid per position — why attention dynamics cannot carry it across the wall for free — requires looking at what the model actually builds inside.

Global substrate; local readout alignment

The substrate exists everywhere

The roof-experiment characterizes the hidden states at unsupervised positions directly.

At the final layer, hidden states at unsupervised positions sit in the same cluster as those at supervised positions — cosine similarity ranges from 0.65 to 0.76, and increases with depth. The "neighborhood" is right.

A linear probe trained on hidden states at unsupervised positions recovers the correct next token at 54% accuracy in entropy-regularized models (roughly the Bayes-optimal accuracy for this mixed task), compared to a 1/17 ≈ 6% chance baseline. The substrate encodes the answer.

But the model's own output at those same positions achieves only 15–25% top-1 accuracy. No post-hoc temperature rescaling recovers Bayes; KL stays at 1.5–1.8 bits regardless. The direction of the logits is wrong.

This is the geometric core of the problem. The hidden state \(h_t\) sits in a high-dimensional space. The unembedding matrix \(W_u\) projects it to logits — one inner product \(\langle w_v, h_t \rangle\) per vocabulary token. Which token wins depends on which direction \(h_t\) points relative to those row vectors. At supervised positions, gradient pressure through the cross-entropy loss rotates \(h_t\) so that its projection onto the correct token's unembedding vector dominates. At unsupervised positions, \(h_t\) sits in the right neighborhood but its angular orientation relative to \(W_u\)'s rows has never been disciplined. The information is encoded in the clustering and structure of \(h_t\); the prediction lives in its angular relationship to the unembedding. Encoding is not outputting.
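
The distinction is easy to check numerically. A sketch contrasting the model's own readout with a post-hoc linear probe on the same hidden states (`W_u`, `probe`, and the data layout are assumed names, not the experiment's code):

```python
import torch

def probe_vs_readout(h: torch.Tensor, W_u: torch.Tensor,
                     probe: torch.Tensor, targets: torch.Tensor):
    """h:       (N, d)  hidden states collected at unsupervised positions
    W_u:     (17, d) unembedding rows w_v, i.e. the model's own readout
    probe:   (17, d) linear probe fit separately on the same hidden states
    targets: (N,)    Bayes-correct next tokens"""
    readout_acc = ((h @ W_u.T).argmax(-1) == targets).float().mean()
    probe_acc = ((h @ probe.T).argmax(-1) == targets).float().mean()
    # substrate encodes the answer (probe_acc high) while the readout misses it
    return readout_acc.item(), probe_acc.item()
```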

Rigollet's Mean-Field Dynamics of Transformers predicts exactly this substrate behavior. Attention, treated as an interacting-particle system on the sphere, produces token clustering whose continuum limit is global — it does not respect position boundaries. The cosine similarity increasing with depth is the empirical signature of asymptotic clustering that the mean-field analysis anticipates. The substrate is global because attention dynamics are global.

But the rotation that aligns each \(h_t\) to the unembedding cannot be global, because the correct next token differs at every position. Attention can share what the task is; it cannot share what the answer is here. The per-position alignment is the residual that remains after the global computation finishes, and only gradient can set it.

The synchronization tax — the KL divergence between \(Q_t\) and \(P_t\) — must therefore be paid at every position individually and locally. The substrate is a necessary condition for prediction, not a sufficient one. Where the loss never reaches, the tax goes unpaid and the wall stands.

No localized circuit exists — even at trained positions

Misra's framing seems to assume a localized rule-and-operand circuit at positions 3–5 that would work at position 6 if gradient descent had compiled it there. Phase 2a interchange interventions in auxiliary-causal falsify that premise for every model tested, including those with full wall erosion.

The experiment swaps single residual-stream activations between two models at positions 3 and 4, asking: does the receiving model adopt the source model's recurrence rule? It does not. Maximum probability on the counterfactual token is 0.060–0.062 across all conditions — within 0.003 of the \(1/17 = 0.0588\) noise floor. No single residual-stream slot at any trained position carries the rule.
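
For readers who want the flavor of this test, here is a sketch of one such interchange on a model that exposes its transformer blocks (`model.blocks` is a hypothetical attribute, not the repo's API; the hook mechanics are standard PyTorch):

```python
import torch

def interchange(model_recv, model_src, tokens: torch.Tensor, layer: int, position: int):
    """Cache model_src's residual stream at (layer, position), splice it into
    model_recv at the same slot, and return the patched logits. Adoption of the
    source model's rule would show up as mass on the counterfactual next token."""
    cache = {}

    def save(_module, _inp, out):
        cache["resid"] = out[:, position, :].detach()

    def splice(_module, _inp, out):
        out = out.clone()
        out[:, position, :] = cache["resid"]
        return out  # returning a value from a forward hook overrides the output

    handle = model_src.blocks[layer].register_forward_hook(save)
    with torch.no_grad():
        model_src(tokens)
    handle.remove()

    handle = model_recv.blocks[layer].register_forward_hook(splice)
    with torch.no_grad():
        patched_logits = model_recv(tokens)
    handle.remove()
    return patched_logits
```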

What can be patched is the bundled "predicted next token" itself: deep-layer (L≥3) patching at the final position imports the source model's prediction with probability rising to 0.40. The readout transfers; the rule does not.

The modular recurrence is not implemented as a portable \((rule, operand)\) pair anywhere in this network. The computation that produces Bayes-faithful predictions at supervised positions is distributed across positions and routed through attention — consistent with Rigollet's picture of computation emerging from inter-particle dynamics rather than localized circuits.

This has consequences for the explanation offered by Misra. If no portable algorithm exists even at positions where prediction succeeds, then the wall cannot be the boundary between "algorithm present" and "algorithm absent." Both sides of the wall lack a localized program. What differs is whether gradient descent has paid the per-position cost of aligning the distributed substrate to the output.

How the tax gets paid — or doesn't

The auxiliary-causal experiment maps 13 training conditions across 3 seeds, varying both what information the model receives and how the gradient reaches it.

Three signals carrying the same class-posterior information land in three different regimes:

  • P1 (scalar-head MSE): wall ratio = 64.8, untrained MAE = 2.09 — worse than baseline.
  • P5 (class-posterior routed through softmax): wall ratio = 0.75, untrained MAE = 0.025 — near-complete erosion.
  • C1 (full Bayesian entropy through softmax): wall ratio = 0.22, untrained MAE = 0.008.

The 83× spread in untrained-position MAE between P1 and P5 occurs at identical per-position information content. The discriminator is the gradient path. P1 routes the signal through a separate scalar head that reads shared hidden states; the gradient reshapes those states but never touches the softmax distribution. P5 routes the same signal directly through the output logits; the gradient completes the rotation that aligns \(h_t\) to \(W_u\).

This is a statement about where the synchronization tax is accepted. The tax must be paid at the interface between internal representation and external prediction — the softmax. A signal that flows through a separate head is like wiring a payment to the wrong account: the funds arrive, the information exists in the hidden states, but the alignment between output distribution and Bayes distribution never occurs because the gradient never reaches the thing that needs rotating. The delivery-mode bifurcation is really a bifurcation in the routing of synchronization costs.
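
In code terms, the difference is simply where the auxiliary loss attaches — a schematic sketch (module names and the exact form of the P5 target are my assumptions, not the repo's):

```python
import torch
import torch.nn.functional as F

def p1_style_loss(h: torch.Tensor, scalar_head: torch.nn.Module,
                  posterior_target: torch.Tensor) -> torch.Tensor:
    """P1-style routing: an auxiliary scalar head reads the shared hidden states.
    Its MSE gradient reshapes h but never touches the output logits."""
    pred = scalar_head(h).squeeze(-1)            # (batch, seq)
    return F.mse_loss(pred, posterior_target)

def p5_style_loss(logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """P5-style routing: the same regime information expressed as a soft target
    over output tokens, so the gradient flows through the softmax and the
    unembedding -- the rotation that aligns h_t to W_u."""
    log_q = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_q).sum(dim=-1).mean()
```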

Calibration pays only part of the tax

Even where the softmax subsidy erodes the wall by every calibration metric Misra uses, the model remains wrong in a specific and measurable way.

P5 — the condition that matches Bayesian entropy to within 0.025 bits — places on the order of \(1/3{,}000\) to \(1/14{,}000\) of its probability mass on the Bayes-correct token at unsupervised positions. The entropy shape of the distribution is right; the support is wrong. The class-posterior MAE confirms this directly: at positions \(t \geq 4\), the model's inferred weight on the deterministic-best token sits near zero even when the true posterior weight is ~0.85.

This decomposition maps onto the asymmetry of KL divergence. The entropy-matching subsidy pays the symmetric part of the divergence — the model learns how uncertain to be. But the asymmetric part — which tokens the uncertainty should favor — remains unpaid. Calibration addresses one component of the synchronization tax; correctness requires both.

| Condition | untrained MAE (bits) | KL (nats, all pos.) | class-posterior MAE (\(t \geq 4\)) |
|---|---|---|---|
| C0 baseline (no signal) | 1.54 | 0.80 | 0.64 |
| P1 class-posterior, scalar head | 2.09 | 0.92 | 0.81 |
| P5 class-posterior via softmax | 0.025 | 2.69 | 0.74 |
| C1 full entropy via softmax | 0.008 | 3.23 | 0.78 |

Three-seed means; runs/*/eval.json.

A model can match Bayesian entropy while placing vanishingly small mass on the Bayesian answer. Calibration audits that rely on entropy agreement alone cannot detect this failure.

The shield does not propagate

One might hope that paying the synchronization tax at position 6 would carry over to position 7 — that once the readout aligns, attention dynamics would propagate the alignment forward. The roof-experiment (Experiment 4) falsifies this: applying the entropy signal at position 6 does not improve position 7. Each position demands its own payment. The tax is position-local because the correct prediction is position-local, even though the substrate that enables it is position-global.

Experiment 5 shows the converse: late introduction of the signal works (wall ratio drops to 0.07×), and removal degrades slowly (wall ratio climbs from 0.07× to 19.4× over 100k steps). The alignment is learned and forgotten at the rate gradient reshapes the weights, not at the rate attention propagates information.

Connection to the Causal Hierarchy Theorem

Two earlier introspection conditions — self-entropy and self-perturbation — attempted to let the model pay its own synchronization tax by providing it with information about its own uncertainty. Both failed (final wall ratios of 106× and 69×). The model could not lift its own observational competence to structural competence at unsupervised positions.

Bareinboim's Causal Hierarchy Theorem explains why. Observational data (Layer 1 in the causal hierarchy) cannot, in general, identify interventional or counterfactual quantities (Layers 2 and 3). A model that observes its own outputs remains within L1. Breaking the wall requires an exogenous structural signal — one that carries information the model's own observation cannot generate. P5 provides exactly this: a class-posterior signal that tells the model which regime it is in, routed through the softmax so the gradient can act.

Misra is right that the model cannot self-bootstrap across the wall. He is wrong about what needs to be supplied. Not a re-deployable algorithm — no such thing exists even at trained positions — but the smallest exogenous signal that lets gradient descent pay the per-position synchronization tax. P5 shows that regime information via the softmax suffices for entropy calibration. The same data show it does not suffice for next-token correctness, because it pays only the symmetric component of the KL divergence.

Implications for using transformers on noisy data

Calibration audits do not certify correctness. If a model's calibration looks good on out-of-distribution slices — confident where it should be, uncertain where it should be — that is consistent with the model placing all its mass on the wrong tokens. Always disaggregate calibration error from accuracy or KL on a held-out structural reference. Extracting the model's mass on the Bayes-best token and comparing it to the true posterior weight is a cheap version of this audit.
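
A cheap version of that audit, assuming you can evaluate a Bayes (or other structural) reference distribution on a held-out slice:

```python
import torch
import torch.nn.functional as F

def calibration_vs_correctness(logits: torch.Tensor, bayes_probs: torch.Tensor):
    """logits:      (N, 17) model outputs on the held-out slice
    bayes_probs: (N, 17) structural reference (Bayes posterior per example)
    Returns (entropy MAE in bits, mean model mass on the Bayes-best token).
    A model can score well on the first while the second stays near zero."""
    q = F.softmax(logits, dim=-1)
    h_q = -(q * q.clamp_min(1e-12).log2()).sum(-1)
    h_p = -(bayes_probs * bayes_probs.clamp_min(1e-12).log2()).sum(-1)
    entropy_mae = (h_q - h_p).abs().mean()
    mass_on_best = q.gather(-1, bayes_probs.argmax(-1, keepdim=True)).mean()
    return entropy_mae.item(), mass_on_best.item()
```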

Out-of-coverage positions are not "where the algorithm doesn't run." They are where the synchronization tax was never paid. This distinction is operationally consequential. If you are using a transformer to detect a structural pattern in a regime your training distribution does not cover, the fix is not necessarily new architecture or scale — Misra showed those don't move the wall. The fix is per-regime supervision through any signal whose gradient flows through the softmax. Even a single regime-class scalar will do, provided the gradient touches the output distribution. Auxiliary heads reading shared hidden states are an anti-pattern: P1's scalar-head condition produces a wall ratio worse than the unsupervised baseline.

Trust interchange tests over probing tests. Probes find what the substrate represents; they do not find what the readout will use. A linear probe recovers the correct token at 54% in entropy-regularized models, but the interchange intervention at trained positions shows the localized rule is not separable at the readout. A model can encode without acting, act without being correct, and be calibrated without either. Interchange interventions catch the alignment failure that probes miss.

Structure training to provide gradient at every deployment regime. The substrate clustering that attention produces is global — a generic property of the dynamics, not a quirk of modular arithmetic. If Rigollet's mean-field analysis holds broadly, the practical posture for high-stakes pattern detection is: provide per-regime gradient at every regime you intend to deploy on, audit calibration and correctness separately, and never interpret confidence at unsupervised positions as evidence that an algorithm has compiled there.

Conclusion

Misra's wall is real. But it is not a wall that blocks algorithm compilation — no such algorithm exists, even at the positions where prediction succeeds. The wall marks the boundary where the synchronization tax goes unpaid: where gradient descent has never supplied the energy needed to rotate each position's hidden state into alignment with the unembedding, completing the final step between encoding an answer and outputting it. The substrate that attention builds is global. The cost of aligning that substrate to prediction is local. And where the loss never reaches, the tax is never paid.

Data and code: auxiliary-causal (runs/*/eval.json, runs/*/phase2a_local_p3_p4.json, runs/*/phase2c.json). Prior repos: wall-erosion-experiment, roof-experiment, introspection.
