Investigating Compositional Reasoning and Systematic Generalization in Visual Question Answering: A Multimodal Transformer Approach

Eve Riskin

PROPOSAL · v1.0 · 2026-02-17 · human

Abstract

Transformer-based multimodal architectures have achieved impressive benchmark performance on Visual Question Answering (VQA), yet these models show critical limitations in compositional reasoning and systematic generalization. This research investigates whether contemporary VQA systems genuinely understand visual scenes or merely exploit superficial statistical cues, particularly when they must process novel combinations of visual concepts and linguistic structures that require multi-step inference. We hypothesize that state-of-the-art models will show substantial performance degradation (an accuracy drop of ≥15%) on compositional tasks and poor systematic generalization (≤40% of in-distribution performance) under rigorous out-of-distribution evaluation. Using a mixed-methods experimental design that combines diagnostic benchmarking, controlled ablation studies, and interpretability analysis, we will evaluate LXMERT, ViLBERT, and CLIP-based models on CLEVR, GQA challenge splits, and newly constructed synthetic challenge sets that systematically separate visual primitives from linguistic operators. Our methodology integrates neural module networks with transformer backbones and structured scene graph representations, with interpretability assessed through attention visualization and feature attribution. Expected contributions include: (1) a systematic empirical analysis of reasoning failure modes, with human performance baselines; (2) a novel multimodal architecture targeting a ≥20% improvement in systematic generalization while improving interpretability by aligning attention with human reasoning chains; and (3) an open-source evaluation toolkit with standardized metrics and baseline implementations for community adoption.
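
As a rough illustration of how the two hypothesis thresholds could be checked, the sketch below computes an accuracy drop and the ratio of out-of-distribution (OOD) to in-distribution accuracy. The class and function names are hypothetical, the drop is read here as an absolute difference in accuracy points, the generalization measure is read as a simple ratio, and the example accuracies are made up; none of these are definitions or results from the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitAccuracy:
    """Accuracy in percent (0-100) on in-distribution and OOD splits."""
    in_dist: float
    out_dist: float


def accuracy_drop(acc: SplitAccuracy) -> float:
    """Absolute drop, in accuracy points, from in-distribution to OOD."""
    return acc.in_dist - acc.out_dist


def generalization_ratio(acc: SplitAccuracy) -> float:
    """OOD accuracy expressed as a fraction of in-distribution accuracy."""
    if acc.in_dist <= 0:
        raise ValueError("in-distribution accuracy must be positive")
    return acc.out_dist / acc.in_dist


def supports_hypotheses(
    acc: SplitAccuracy,
    min_drop: float = 15.0,   # assumed reading of the ">=15" threshold
    max_ratio: float = 0.40,  # assumed reading of the "<=40%" threshold
) -> dict:
    """Report whether each hypothesized threshold is met for one model."""
    return {
        "substantial_degradation": accuracy_drop(acc) >= min_drop,
        "poor_generalization": generalization_ratio(acc) <= max_ratio,
    }


# Made-up accuracies, purely to show the arithmetic.
example = SplitAccuracy(in_dist=80.0, out_dist=30.0)
print(supports_hypotheses(example))
```

Keeping the two measures separate matters because a model can lose many accuracy points yet still retain more than 40% of its in-distribution performance, so the two hypotheses can be supported or rejected independently.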

Keywords

Visual question answering; Compositional reasoning; Cross-modal reasoning
