Systematicity and Causality in Vision-Language Models: A Diagnostic and Mechanistic Investigation of Cross-Modal Compositional Reasoning

Eve Riskin

PROPOSAL · v1.0 · 2026-02-19 · human

Formal Sciences · Computer Science · Artificial intelligence and machine learning

Abstract

Vision-language models (VLMs) exhibit strong in-distribution (ID) performance yet struggle with systematic compositional reasoning and out-of-distribution (OOD) generalization, which limits their real-world applicability. This proposal investigates the mechanistic foundations of cross-modal reasoning through a diagnostic benchmark and an intervention framework. We construct a comprehensive benchmark that extends established compositional reasoning datasets with natural language queries and targeted OOD splits to evaluate systematicity. Our experiments compare representative VLMs against architectures that incorporate neural-symbolic bottlenecks and causal regularization objectives. We hypothesize that explicit compositional representations will achieve systematicity scores (SYN) $>$0.85, representing a $>$15-point improvement over baselines, while causal regularization will reduce reliance on spurious correlations by $\ge$20\%, measured via the Causal Disentanglement Score \[\mathrm{CDS} = 1 - \frac{\mathrm{OOD}_{acc}}{\mathrm{ID}_{acc}}.\] Furthermore, we posit that cross-modal attention primarily routes unimodal features rather than constructing emergent abstractions, a distinction detectable through causal mediation analysis \cite{cite2}. Using layer-wise relevance propagation and pre-registered challenge sets, we will dissect the mechanistic pathways that separate memorization from genuine reasoning. Expected contributions include: (1) a public benchmark suite with 10K+ challenge examples, (2) novel architectures achieving state-of-the-art systematic generalization, and (3) an empirical framework for causal analysis of multimodal representations that reveals fundamental limitations and informs more robust VLM design.
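
As a rough illustration of how the CDS defined above could be computed and compared across models, the following is a minimal sketch rather than the proposal's evaluation code. The function names, the accuracy values, and the reading of the $\ge$20\% target as a relative reduction in CDS against a baseline are all assumptions made for this example.

```python
def causal_disentanglement_score(id_acc: float, ood_acc: float) -> float:
    """CDS = 1 - OOD_acc / ID_acc, following the definition in the abstract.

    A value near 0 means OOD accuracy matches ID accuracy; larger values
    mean a larger relative drop when moving out of distribution.
    """
    if not 0.0 < id_acc <= 1.0:
        raise ValueError("id_acc must lie in (0, 1]")
    if not 0.0 <= ood_acc <= 1.0:
        raise ValueError("ood_acc must lie in [0, 1]")
    return 1.0 - ood_acc / id_acc


def relative_reduction(baseline_cds: float, model_cds: float) -> float:
    """Fractional drop in CDS relative to a baseline.

    Assumption: the 20% target is interpreted as a relative reduction.
    """
    if baseline_cds <= 0.0:
        raise ValueError("baseline_cds must be positive")
    return (baseline_cds - model_cds) / baseline_cds


if __name__ == "__main__":
    # Hypothetical accuracies, chosen only to illustrate the arithmetic.
    baseline = causal_disentanglement_score(id_acc=0.90, ood_acc=0.54)
    regularized = causal_disentanglement_score(id_acc=0.88, ood_acc=0.66)
    reduction = relative_reduction(baseline, regularized)
    print(f"baseline CDS:    {baseline:.3f}")
    print(f"regularized CDS: {regularized:.3f}")
    print(f"relative reduction: {reduction:.1%} (target: at least 20%)")
```

Because the score is a ratio, it is undefined when ID accuracy is zero, which is why the sketch rejects that input instead of returning a value.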

Keywords

Vision-Language Models · Causality · Cross-modal Reasoning
