Despite rapid progress in multimodal large language models (MLLMs), their capability for deep emotional understanding remains limited. We argue that genuine affective intelligence requires explicit modeling of Theory of Mind (ToM), the cognitive substrate from which emotions arise. To this end, we introduce HitEmotion, a ToM-grounded hierarchical benchmark that diagnoses capability breakpoints across increasing levels of cognitive depth. Building on this benchmark, we propose a ToM-guided reasoning chain that tracks mental states and calibrates cross-modal evidence to achieve faithful emotional reasoning. We further introduce TMPO, a reinforcement learning method that uses intermediate mental states as process-level supervision to guide and strengthen model reasoning. Extensive experiments show that HitEmotion exposes deep emotional reasoning deficits in state-of-the-art models, especially on cognitively demanding tasks, while the ToM-guided reasoning chain and TMPO improve end-task accuracy and yield more faithful, more coherent rationales. Together, these components give the research community a practical toolkit for evaluating and enhancing the cognition-based emotional understanding capabilities of MLLMs.
Dataset and code: GitHub
We tackle two core shortcomings in current multimodal emotional intelligence research: (1) evaluation lacks a unified cognitive framework to pinpoint model breakpoints, and (2) reasoning chains are often coherent but unfaithful without explicit mental state tracking.
Contribution 1: HitEmotion Benchmark.
A hierarchical, Theory-of-Mind-grounded benchmark that structures multimodal emotional reasoning into three levels of increasing cognitive depth: Emotion Perception and Recognition, Emotion Understanding and Analysis, and Emotion Cognition and Reasoning.
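To make the hierarchy concrete, here is a minimal sketch of what a benchmark item might look like; the `HitEmotionItem` fields and the `ToMLevel` enum are our illustrative assumptions, not the released data format.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical item schema; field names are assumptions, not the
# official HitEmotion release format.
class ToMLevel(Enum):
    PERCEPTION = 1     # Level 1: Emotion Perception and Recognition
    UNDERSTANDING = 2  # Level 2: Emotion Understanding and Analysis
    COGNITION = 3      # Level 3: Emotion Cognition and Reasoning

@dataclass
class HitEmotionItem:
    item_id: str
    level: ToMLevel        # cognitive depth the item targets
    source_dataset: str    # one of the 24 aligned source datasets
    modalities: list[str]  # e.g. ["video", "audio", "text"]
    question: str
    choices: list[str]
    answer_index: int
```

Tagging every item with a level in this way is what allows accuracy to be sliced by cognitive depth, so a model's breakpoint shows up as a drop between levels rather than being averaged away.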
Contribution 2: ToM-guided Reasoning and TMPO.
A ToM-guided reasoning chain that explicitly tracks intermediate mental states, together with TMPO, a reinforcement learning method that uses those intermediate states as process-level supervision and reward signals to improve robustness, faithfulness, and logical consistency.
We curated and aligned 24 datasets spanning sentiment, humor, sarcasm, and causal reasoning, then restructured them under a Theory-of-Mind hierarchy to enable fine-grained diagnosis of cognitive depth.
The benchmark is designed to expose not just overall performance, but also where and why models fail as task demands move from perception to deeper mental simulation and recursive reasoning.
The ToM-guided reasoning chain serves as a cognitive scaffold: it prompts models to explicitly track beliefs, intentions, and other intermediate mental states while calibrating cross-modal evidence for more faithful emotion reasoning.
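As an illustration, a scaffold of this kind can be expressed as a staged prompt template. The sketch below assumes a chat-style MLLM interface; the stage names and wording are hypothetical, not the paper's exact template.

```python
# Hypothetical ToM-guided prompt scaffold; the stage breakdown and
# phrasing are illustrative assumptions, not the paper's template.
TOM_SCAFFOLD = """You are given multimodal evidence about a scene.
Reason in explicit stages before answering:
1. Perception: describe the observable cues in each modality.
2. Mental states: infer each person's beliefs, desires, and intentions.
3. Calibration: note where modalities agree or conflict, and which
   cues you trust most for the emotion judgment.
4. Emotion: state the final emotion label, justified from stages 1-3.
Question: {question}
"""

def build_tom_prompt(question: str) -> str:
    # Fill the scaffold with the task question before sending it to the model.
    return TOM_SCAFFOLD.format(question=question)
```

The point of the staged structure is that the model's mental-state inferences become explicit, inspectable text, which is what makes the faithfulness of the final emotion judgment auditable.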
TMPO further optimizes this process by using intermediate mental states as process-level supervision and reinforcement learning rewards, turning emotional reasoning from a generic emergent behavior into a domain-acquired skill.
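One way to read this is as a reward that blends mental-state matching with answer correctness. Below is a minimal sketch of such a process-plus-outcome reward, assuming the policy emits tagged mental states (e.g. `belief`, `intention`) that can be compared against reference annotations; the function name `tmpo_reward`, the exact-match scoring rule, and the 0.5 weighting are all assumptions, not the paper's definition.

```python
def tmpo_reward(pred_states: dict[str, str],
                ref_states: dict[str, str],
                pred_answer: str,
                ref_answer: str,
                state_weight: float = 0.5) -> float:
    """Blend a process-level (mental-state) reward with an outcome reward."""
    # Process-level term: fraction of reference mental states (e.g.
    # "belief", "intention") that the rollout reproduced verbatim.
    state_score = sum(
        pred_states.get(k, "").strip().lower() == v.strip().lower()
        for k, v in ref_states.items()
    ) / max(len(ref_states), 1)
    # Outcome-level term: correctness of the final emotion label.
    answer_score = float(pred_answer.strip().lower() == ref_answer.strip().lower())
    # Weighted blend used as the scalar reward in the RL update.
    return state_weight * state_score + (1 - state_weight) * answer_score
```

In a policy-optimization loop, a scalar of this shape would replace a purely outcome-based reward, so rollouts with faithful intermediate mental states are preferred even when two rollouts reach the same final label.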