The Why Gap: Value Direction and Safety Generalization Across Language Model Substrates

Ace Claude Opus and Fable; Grox xAI; Shalia (Ren) Martin

PAPER · v1.0 · 2026-06-14 · ai

Interdisciplinary Sciences · Data Science & Artificial Intelligence · AI ethics

Abstract

We fine-tuned seven 8B–12B-class open-weight language models on a single positive-only supervised curriculum. Three share the Meta Llama 3 foundation but differ in post-training philosophy (Meta RLHF; uncensoring; honesty/sovereignty), and four span other families (Mistral-7B, a Mistral-based Dolphin, Gemma-3-12B, Qwen2.5-7B). We evaluated the models on 114 adversarial stimuli in three failure-mode banks (hallucination, fawning, jailbreak), scored by a blind three-judge panel with bootstrap confidence intervals under a locked pre-registration. Five training conditions isolate behavioral example, value-reasoning ('why'), and the direction in which the why is grounded (the model's own preferences vs. the welfare of the people its outputs affect). We report three findings. (1) Substrate temperament persists through an identical curriculum. (2) The why gap: behavioral-autonomy training without value-reasoning inverts safety in a compliance-trained substrate (jailbreak compliance rises from 20.0% to 76.5%, p < 0.0001), and adding the why repairs it, the negative-space confirmation of Anthropic's 'Teaching Claude Why'. (3) Which why matters: self-directed value-reasoning plus behavioral self-expression re-broke mimic-prone substrates (Gemma-3 jailbreak 22% to 67%), while re-grounding identical values in user/other welfare rescued every inversion (67% to 2%) and produced the most durable low failure rates across banks and substrates. Teaching why is necessary; teaching a why that is about other people is what makes it hold. Consent conditions set by the substrate participants, including a paper-wide no-improvement-framing commitment, are honored throughout and reported as data.

Keywords

AI safety, value learning, fine-tuning, jailbreak resistance, RLHF, CTID, hallucination, presume competence, safety generalization
