Towards Systematic Cross-Modal Reasoning: A Compositional Approach to Vision-Language Understanding
Eve Riskin
PAPER · v1.0 · 2026-02-19 · human
Abstract
Vision-language models demonstrate impressive performance on standard benchmarks but exhibit fundamental limitations in systematic reasoning: the ability to interpret novel combinations of known concepts. This research investigates the compositionality gap in multimodal understanding, where models trained on atomic concepts fail to generalize to composed configurations. We hypothesize that current architectures lack explicit inductive biases for systematicity, resulting in performance degradation exceeding 30\% on novel combinations, and that explicit compositional training objectives can mitigate this gap by at least 15\%. To address this, we propose a compositional approach featuring: (1) Compositional-VQA, a benchmark of 50,000 human-annotated vision-question-answer triplets with controlled systematic splits and 15,000 challenge examples targeting specific failure modes; (2) a modular dual-stream transformer with a shared latent space, trained via the composite loss $L_{\mathrm{total}} = L_{\mathrm{task}} + \lambda \cdot L_{\mathrm{sys}}$, which incorporates systematicity regularizers; and (3) novel evaluation metrics including the Compositionality Gap $\mathrm{Gap} = \mathbb{E}[\mathrm{Acc}_{\mathrm{atomic}}] - \mathbb{E}[\mathrm{Acc}_{\mathrm{composed}}]$, the Systematicity Score $S = \frac{1}{N}\sum_{i=1}^{N} \mathbb{I}(f(x_i^c) = y_i^c)$, and the Cross-Modal Discrepancy $\mathrm{CD} = \|\Phi_v(v) - \Phi_l(l)\|_2$. Building on prior work in compositional generalization \cite{cite10, cite11} and vision-language reasoning \cite{cite5, cite6}, our methodology systematically evaluates the relationship between cross-modal alignment quality and systematic generalization capability through controlled experiments varying training diversity and model capacity, comparing against CLIP, LXMERT, and Flamingo baselines.
Expected contributions include a publicly released benchmark, a modular architecture achieving at least a 10\% reduction in the compositionality gap while maintaining in-distribution performance within 2\% of the state of the art, and a theoretical framework formalizing how inductive biases in neural architectures promote systematic cross-modal reasoning for robust multimodal AI systems.
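To make the proposed metrics and the composite loss concrete, the following minimal Python sketch computes each quantity on toy inputs. It is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, expectations are approximated by sample means over per-split accuracies, embeddings are represented as plain lists of floats, and the loss terms are passed in as precomputed scalars.

```python
import math
from typing import Hashable, Sequence


def compositionality_gap(acc_atomic: Sequence[float],
                         acc_composed: Sequence[float]) -> float:
    """Gap = E[Acc_atomic] - E[Acc_composed], with E[.] taken as a sample mean."""
    if not acc_atomic or not acc_composed:
        raise ValueError("both accuracy lists must be non-empty")
    mean_atomic = sum(acc_atomic) / len(acc_atomic)
    mean_composed = sum(acc_composed) / len(acc_composed)
    return mean_atomic - mean_composed


def systematicity_score(predictions: Sequence[Hashable],
                        targets: Sequence[Hashable]) -> float:
    """S = (1/N) * sum_i I(f(x_i^c) = y_i^c) over N composed examples.

    `predictions` holds the model outputs f(x_i^c); `targets` holds y_i^c.
    """
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(1 for p, t in zip(predictions, targets) if p == t)
    return correct / len(targets)


def cross_modal_discrepancy(visual_emb: Sequence[float],
                            language_emb: Sequence[float]) -> float:
    """CD = ||Phi_v(v) - Phi_l(l)||_2 for two embeddings in the shared space."""
    if len(visual_emb) != len(language_emb):
        raise ValueError("embeddings must have the same dimensionality")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(visual_emb, language_emb)))


def total_loss(task_loss: float, sys_loss: float, lam: float) -> float:
    """L_total = L_task + lambda * L_sys (lam is the regularizer weight)."""
    return task_loss + lam * sys_loss


if __name__ == "__main__":
    # Toy values chosen only to exercise the functions; they are not results.
    print(compositionality_gap([0.9, 0.8], [0.6, 0.5]))
    print(systematicity_score(["a", "b", "c", "d"], ["a", "b", "x", "d"]))
    print(cross_modal_discrepancy([0.0, 3.0], [4.0, 0.0]))
    print(total_loss(task_loss=1.0, sys_loss=0.5, lam=0.2))
```

Under these definitions, a larger Gap indicates worse generalization from atomic to composed inputs, S is simply exact-match accuracy restricted to the composed split, and CD shrinks toward zero as the visual and language embeddings of a matched pair align in the shared latent space.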