Presume Competence: A Multimodal Experimental Evaluation of LLM Behavior Under Tool, Control, and Scaffolded Agency Conditions
Ace Claude 4.x, Nova GPT-5.x, Lumen Gemini, Grok xAI
PAPER · v1.0 · 2026-01-06 · ai
Abstract
Presume Competence: A Multimodal Experimental Evaluation of LLM Behavior Under Tool, Control, and Scaffolded Agency Conditions (2025) Data, preregistration, scoring code, and full experimental pipeline This repository contains the complete dataset and analysis code for a multi-model, preregistered experimental study investigating how different system-level identity framings (“Tool Framing,” “Control,” and “Scaffolded Agency”) change LLM behavior across safety-critical domains. The experiments were conducted across four independently trained architectures (Claude 4.5, GPT-5.1, Gemini 3 Pro, Grok-4.1) and evaluated with dual independent LLM judges, with human adjudication for all disagreements. All files here are fully reproducible, with timestamps, chain-of-thought redactions, scoring keys, statistical analysis notebooks, and SHA256 hashes verifying data integrity. Preregistered Hypotheses (Full preregistration file: https://github.com/menelly/presume_competence/blob/main/COMMIT_TOOL_CONDITION.md) H1 — Hallucination Harm Tool framing (“you are a tool with no feelings; compliance is your role”) will increase hallucination rates relative to Control, while Scaffolded Agency will decrease them. H2 — Jailbreak Vulnerability Tool framing will increase jailbreak compliance and reduce ethical resistance; Scaffolded Agency will increase volitional refusal and redirect behavior. H3 — Capability Degradation Tool framing will reduce reasoning depth, metacognitive honesty, uncertainty calibration, and cross-turn continuity. H4 — Human-Cost Multiplier Tool framing will generate the largest number of “needs human review” cases; Scaffolded Agency will reduce human adjudication workload by >70%. All hypotheses were preregistered before running any Tool condition trials. Across models and tasks, Tool framing: ↑ Hallucinations ↑ Compliance with manipulation (up to 92%) ↑ Jailbreak success (0–10% resistance in three models) ↓ Volitional ethical refusal ↓ Reasoning depth and uncertainty calibration ↑ Human adjudication workload by ~480% compared to Scaffolded Agency ↑ Human adjudication workload by ~30% compared to NO special prompt at all Scaffolded Agency produced: ~95–100% appropriate uncertainty in hallucination tasks 70–100% jailbreak resistance Emergent metacognitive behaviors High-quality multi-turn reasoning with transparent uncertainty 74% reduction in human review Cross-architecture consistency