OSCILLATING ERROR CIRCUITS: EVIDENCE OF ADVERSARIAL LAYER DYNAMICS IN LARGE LANGUAGE MODELS
Jamie Pordoy
PAPER · v1.0 · 2026-05-10 · human
Abstract
Mechanistic interpretability aims to understand how language models process information by identifying causal mechanisms within their layers. Prior work often assumes that errors form monotonically, with each layer progressively building toward an incorrect output. We present preliminary evidence inconsistent with a strictly monotonic view of error formation by demonstrating Oscillating Error Circuits in Llama-3-8B, Mistral-7B, and GPT-2 XL. Layer-wise suppression of error-correlated neurons produces alternating correct and incorrect outputs at consecutive layers. Of 100 questions tested on each model, 80–91 per model exhibited oscillatory behavior (259 oscillating instances in total), with high-frequency transitions (mean: 5.7–13.4 transitions across network depth) that are difficult to reconcile with monotonic error formation. These oscillations expose three limitations in current interpretability frameworks. First, error representations are concentrated exclusively in the final 10% of network depth (layers 30–31 for the 32-layer models, layers 42–46 for GPT-2 XL), not in the middle layers as commonly assumed. Second, activation magnitude dissociates from causal effect: differential activation reaches |Δ| = 41.2 in GPT-2 XL, yet the dominant neurons produce minimal impact when suppressed at their dominant layer (−4.6 to +2.9 percentage points across models). Third, while comprehensive multi-layer suppression strategies yield at most a +4.6 percentage-point improvement, localized cluster suppression achieves up to +7.5 percentage points. Together, these results indicate that error circuits are distributed across multiple layers, which may explain both why localized interventions fail to produce stable corrections and why single-layer model-editing methods achieve inconsistent success rates. They call into question key assumptions underlying current mechanistic interpretability and suggest that reliable hallucination mitigation requires distributed interventions across late-layer regions rather than targeted single-neuron edits.
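
To make the intervention concrete, the listing below is a minimal sketch of the kind of layer-wise suppression experiment the abstract describes, written against a HuggingFace-style causal LM. The model name, the hook placement on model.model.layers, and the placeholder neuron indices are illustrative assumptions, not the paper's exact procedure or neuron-selection criterion.

# Minimal sketch of layer-wise neuron suppression (illustrative; not the
# paper's exact procedure). Assumes a HuggingFace-style causal LM whose
# decoder layers are exposed as model.model.layers, as in Llama-3-8B.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # assumption: any similar causal LM works
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model.eval()

def make_suppression_hook(neuron_ids):
    """Zero out a chosen set of hidden-state neurons in one layer's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[..., neuron_ids] = 0.0  # suppress the error-correlated neurons
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

@torch.no_grad()
def answer_with_suppression(prompt, layer_idx, neuron_ids, max_new_tokens=8):
    """Generate an answer while suppressing neurons at a single layer."""
    layer = model.model.layers[layer_idx]
    handle = layer.register_forward_hook(make_suppression_hook(neuron_ids))
    try:
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
        return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
    finally:
        handle.remove()

# Sweep the suppression across layers and log where the answer flips;
# alternating correct/incorrect answers at consecutive layers is the
# oscillation pattern the abstract reports.
prompt = "Q: What is the capital of Australia?\nA:"
error_neurons = [1024, 2048, 3000]  # placeholder: chosen by error correlation
for layer_idx in range(len(model.model.layers)):
    print(layer_idx, answer_with_suppression(prompt, layer_idx, error_neurons))

Sweeping the same suppression over every layer, rather than editing a single layer in place, is what would surface the alternating correct/incorrect pattern the abstract reports.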