Unveiling the Cognitive Compass:
Theory-of-Mind–Guided Multimodal Emotion Reasoning

Meng Luo1, Bobo Li1, Shize Zhang1, Qiuchan Chen2, Menglu Han2,
Wenhao Chen1, Yanxiang Huang3, Hao Fei1, Mong-Li Lee1, Wynne Hsu1
1National University of Singapore   2Huazhong University of Science and Technology   3Hong Kong Polytechnic University

ICLR 2026
Correspondence: Bobo Li (libobo@nus.edu.sg)

Abstract

Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we first introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Second, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. Third, we introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks. In our evaluation, the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful and more coherent rationales. Overall, our work provides the research community with a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs.

Dataset and code: GitHub

What We Propose

We tackle two core shortcomings in current multimodal emotional intelligence research: (1) evaluation lacks a unified cognitive framework to pinpoint model breakpoints, and (2) reasoning chains are often coherent but unfaithful without explicit mental state tracking.

Contribution 1: HitEmotion Benchmark.
A hierarchical, Theory-of-Mind grounded benchmark that structures multimodal emotional reasoning into three levels of increasing cognitive depth: Emotion Perception and Recognition, Emotion Understanding and Analysis, and Emotion Cognition and Reasoning.

Contribution 2: ToM-guided Reasoning and TMPO.
A ToM-guided reasoning chain that explicitly tracks intermediate mental states, together with TMPO, a reinforcement learning method that uses those intermediate states as process-level supervision and reward signals to improve robustness, faithfulness, and logical consistency.

HitEmotion Benchmark

We curated and aligned 24 datasets spanning sentiment, humor, sarcasm, and causal reasoning, then restructured them under a Theory-of-Mind hierarchy to enable fine-grained diagnosis of cognitive depth.

The benchmark is designed to expose not just overall performance, but also where and why models fail as task demands move from perception to deeper mental simulation and recursive reasoning.
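
To make the idea of a "capability breakpoint" concrete, the sketch below computes accuracy per hierarchy level and reports the steepest decline between adjacent levels. Only the three level names come from the benchmark description; the record layout and the function names `accuracy_by_level` and `largest_drop` are assumptions made for illustration, not the released evaluation code.

```python
from collections import defaultdict

# Level names are taken from the benchmark description; the record format
# and helper names below are hypothetical.
LEVELS = [
    "Emotion Perception and Recognition",
    "Emotion Understanding and Analysis",
    "Emotion Cognition and Reasoning",
]


def accuracy_by_level(records):
    """records: iterable of (level, is_correct) pairs -> {level: accuracy}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        correct[level] += int(is_correct)
    # Keep the hierarchy order and skip levels with no examples.
    return {lvl: correct[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}


def largest_drop(acc):
    """Return (from_level, to_level, drop) for the steepest decline
    between adjacent levels, or None if fewer than two levels are present."""
    pairs = [
        (a, b, acc[a] - acc[b])
        for a, b in zip(LEVELS, LEVELS[1:])
        if a in acc and b in acc
    ]
    return max(pairs, key=lambda p: p[2]) if pairs else None
```

Reporting the adjacent-level drop rather than a single aggregate score is one simple way to show where performance degrades as cognitive depth increases.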

Method: ToM-guided Reasoning and TMPO

The ToM-guided reasoning chain serves as a cognitive scaffold: it prompts models to explicitly track beliefs, intentions, and other intermediate mental states, while calibrating cross-modal evidence for more faithful emotion reasoning.
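
A scaffold of this kind could be expressed as a structured prompt. The following is a minimal sketch under stated assumptions: the step wording, the step order, and the function name `build_tom_prompt` are hypothetical and are not the prompt used in the paper. It only illustrates asking for beliefs, intentions, and cross-modal calibration before the final emotion label.

```python
def build_tom_prompt(question, modalities):
    """Assemble a step-by-step prompt that asks for intermediate mental
    states before the final answer. All step text is illustrative."""
    steps = [
        "Describe the relevant evidence in each modality: "
        + ", ".join(modalities) + ".",
        "Infer the beliefs the person holds about the situation.",
        "Infer the person's intentions given those beliefs.",
        "Check whether cues from different modalities agree; "
        "if they conflict, state which cue you trust and why.",
        "State the final emotion, citing the beliefs and intentions above.",
    ]
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"Question: {question}\n\nReason through these steps:\n{numbered}"
```

For example, `build_tom_prompt("How does the speaker feel?", ["text", "audio", "video"])` returns a single string that can be passed to a model alongside the multimodal input.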

TMPO further optimizes this process by using intermediate mental states as process-level supervision and reinforcement learning rewards, turning emotional reasoning from a generic emergent behavior into a domain-acquired skill.
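
The exact reward used by TMPO is not given in this summary, so the sketch below is only one plausible way to combine an outcome reward with a process-level reward over intermediate mental states. The exact-match scoring, the dictionary layout of the states, the weight `alpha`, and the function names are all assumptions made for illustration.

```python
def process_reward(predicted_states, reference_states):
    """Fraction of reference mental-state fields matched by the model
    (case-insensitive exact match; a deliberately simple, assumed scorer)."""
    if not reference_states:
        return 0.0
    hits = sum(
        1
        for key, ref in reference_states.items()
        if predicted_states.get(key, "").strip().lower() == ref.strip().lower()
    )
    return hits / len(reference_states)


def total_reward(pred_answer, gold_answer, predicted_states,
                 reference_states, alpha=0.5):
    """Blend the outcome reward (final answer correct or not) with the
    process reward. alpha is a hypothetical mixing weight in [0, 1]."""
    outcome = 1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    return (1 - alpha) * outcome + alpha * process_reward(
        predicted_states, reference_states
    )
```

Under this scheme, a rollout that reaches the right label through wrong intermediate states scores lower than one whose beliefs and intentions also match, which is the sense in which the supervision is process-level rather than outcome-only.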