Investigating Compositional Reasoning and Systematic Generalization in Visual Question Answering: A Multimodal Transformer Approach
Eve Riskin
PROPOSAL · v1.0 · 2026-02-17 · human
Abstract
Visual Question Answering (VQA) has achieved impressive benchmark performance with transformer-based multimodal architectures; however, these models exhibit critical limitations in compositional reasoning and systematic generalization. This research investigates whether contemporary VQA systems truly understand visual scenes or merely exploit superficial statistical cues, particularly when processing novel combinations of visual concepts and linguistic structures that require multi-step inference. We hypothesize that state-of-the-art models will show a substantial accuracy drop (≥15 percentage points) on compositional tasks and poor systematic generalization (≤40% of in-distribution performance) under rigorous out-of-distribution evaluation. Using a mixed-methods experimental design that combines diagnostic benchmarking, controlled ablation studies, and interpretability analysis, we will evaluate LXMERT, ViLBERT, and CLIP-based models on CLEVR, GQA challenge splits, and newly constructed synthetic challenge sets that systematically separate visual primitives from linguistic operators. Our proposed approach integrates neural module networks with transformer backbones and structured scene graph representations, with interpretability assessed via attention visualization and feature attribution. Expected contributions include: (1) a systematic empirical analysis of reasoning failure modes with human performance baselines; (2) a novel multimodal architecture demonstrating ≥20% improvement in systematic generalization while enhancing interpretability through attention alignment with human reasoning chains; and (3) an open-source evaluation toolkit with standardized metrics and baseline implementations for community adoption.
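The two hypothesis thresholds above can be read as simple quantities computed from per-split accuracies. The sketch below is one possible reading, not the proposal's evaluation toolkit: it assumes the accuracy drop is measured in percentage points between an in-distribution split and a compositional split, and that systematic generalization is the ratio of out-of-distribution to in-distribution accuracy. The names `SplitResult`, `accuracy_drop`, `generalization_ratio`, and `supports_hypothesis` are hypothetical, and the numbers in the usage example are placeholders, not results.

```python
from dataclasses import dataclass


@dataclass
class SplitResult:
    """Accuracy in percent (0-100) for one model on three evaluation splits.

    Hypothetical container; the split names are assumptions for illustration.
    """
    in_distribution: float
    compositional: float
    out_of_distribution: float


def accuracy_drop(result: SplitResult) -> float:
    """Drop in percentage points from in-distribution to compositional accuracy."""
    return result.in_distribution - result.compositional


def generalization_ratio(result: SplitResult) -> float:
    """Out-of-distribution accuracy as a fraction of in-distribution accuracy."""
    if result.in_distribution <= 0:
        raise ValueError("in-distribution accuracy must be positive")
    return result.out_of_distribution / result.in_distribution


def supports_hypothesis(
    result: SplitResult,
    min_drop: float = 15.0,   # assumed reading of the ">=15" threshold, in points
    max_ratio: float = 0.40,  # the "<=40% of in-distribution performance" threshold
) -> bool:
    """True when both hypothesized degradations are observed for this model."""
    return (
        accuracy_drop(result) >= min_drop
        and generalization_ratio(result) <= max_ratio
    )


if __name__ == "__main__":
    # Placeholder accuracies chosen only to exercise the functions.
    example = SplitResult(in_distribution=80.0, compositional=60.0,
                          out_of_distribution=30.0)
    print(accuracy_drop(example))         # 20.0
    print(generalization_ratio(example))  # 0.375
    print(supports_hypothesis(example))   # True
```

Keeping the two thresholds as parameters makes it easy to report how sensitive the conclusion is to the exact cut-offs chosen.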