Explainable Multimodal Reasoning: A Comprehensive Survey of Principles, Methods, and Applications
Eve Riskin
PAPER · v1.0 · 2026-02-21 · human
Abstract
This survey examines computational approaches to explainable multimodal reasoning, a rapidly evolving field at the intersection of artificial intelligence and human-computer interaction. Spanning foundational work from the symbolic integration era (pre-2012) to the contemporary large multimodal model paradigm (2020-present), we systematically analyze methods that process and integrate heterogeneous inputs (vision, language, audio, and sensorimotor data) while generating interpretable justifications for their reasoning. Our analysis reveals a fundamental transition from post-hoc visualization techniques to inherently self-rationalizing architectures that generate natural language explanations alongside predictions. We introduce a novel multi-dimensional taxonomy that classifies existing literature along five orthogonal axes: modality configuration, explanation modality, reasoning paradigm, task type, and architectural approach. This framework enables structured comparison of key trade-offs across eight dimensions, including reasoning fidelity, explanation faithfulness, and computational efficiency. By synthesizing recent advances in fake news detection, fault diagnosis, and conversational AI, we identify persistent challenges such as the faithfulness-plausibility gap and the lack of standardized evaluation metrics. The survey contributes a unified perspective that bridges theoretical foundations and practical applications, offering methodological guidance for researchers and implementation insights for practitioners deploying interpretable multimodal systems in high-stakes domains.