ViMU: Benchmarking Video Metaphorical Understanding

xML Lab, National University of Singapore
“The most important thing in communication is hearing what isn’t said.”
- Peter Drucker

Key Idea

Video meaning often lives beneath the literal content.

ViMU evaluates whether models can recover rhetorical, social, and culturally grounded subtext from short-form videos without explicit hints.

Figure: Two ViMU examples showing the gap between observable content and intended subtext.

Abstract

We introduce ViMU, a benchmark for Video Metaphorical Understanding. ViMU tests whether multimodal models can move beyond literal video description and infer implicit meaning, rhetorical devices, social signals, target subjects, and culturally grounded subtext. The benchmark contains 588 curated videos and 2,352 questions across open-ended and multiple-choice tasks. All questions are designed to be hint-free, so models must infer the intended meaning from multimodal evidence rather than from leaked answer cues. Experiments on 16 MLLMs show that current frontier models still struggle with video subtext understanding.

Benchmark

A hint-free benchmark for implicit video interpretation.

588 curated videos
2,352 questions
4 evaluation tasks
16 evaluated MLLMs

Design

Visible content vs. hidden subtext

ViMU separates literal video content from the intended meaning behind it, covering rhetoric mechanisms, social value signals, evidence sources, and target subjects.

Figure: Distributions over the four annotation dimensions: rhetoric mechanisms, social value signals, evidence sources, and target subjects.

Tasks

Four complementary evaluation tasks.

Each task probes a different aspect of video subtext understanding.

OE: Open-Ended Interpretation. Explain the intended meaning of the video without answer hints.

EG: Evidence Grounding. Select the multimodal cues that support the interpretation.

RM: Rhetoric Mechanisms. Identify how the video constructs implicit meaning.

SV: Social Value Signals. Identify the social stance or value signal conveyed by the video.

Figure: Example items for each task: open-ended interpretation, evidence grounding, rhetoric classification, and social-value classification.
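
For concreteness, one possible layout of a multiple-choice item is sketched below; the field names, option labels, and check are illustrative assumptions, not the released data format.

# Hypothetical ViMU-style item; field names and option labels are illustrative only.
item = {
    "video_id": "vimu_0001",
    "task": "RM",  # one of OE, EG, RM, SV
    # Hint-free prompt: it names the task but does not restate or leak the answer.
    "question": "Which rhetorical mechanism does this video use to convey its implicit meaning?",
    "options": ["irony", "exaggeration", "metaphor", "contrast"],
    "answer": "irony",
}

def is_hint_free(item):
    # Crude check: the prompt must not contain the gold option verbatim.
    return item["answer"].lower() not in item["question"].lower()

assert is_hint_free(item)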

Results

Frontier models remain far from robust video subtext understanding.

Even strong models often describe the video correctly but miss its rhetorical or social meaning.

Model Date OE EG RM SV SSU-Avg All-Avg
Open-weight Models
Ministral-8B 2024-10 48.25 48.60 31.87 10.45 21.16 34.79
Ministral-14B 2025-12 52.19 55.73 27.29 6.57 16.93 35.45
Gemma-3-4B-it 2025-03 39.43 25.41 21.10 7.17 14.13 23.28
Gemma-3-27B-it 2025-03 55.90 49.38 32.47 7.95 20.21 36.43
Qwen3-VL-32B-Instruct 2025-10 64.09 59.64 27.65 15.17 21.41 41.64
Qwen3.5-27B 2026-02 62.80 60.28 38.18 22.40 30.29 45.91
Closed-source / API Models
Claude-3-Haiku 2024-03 50.41 34.55 2.99 3.64 3.32 22.90
GLM-4.5v 2025-08 62.52 23.11 8.87 9.26 9.06 25.94
Grok-4.1-Fast 2025-09 57.62 63.84 34.91 28.73 31.82 46.28
Gemini-3-Flash-Preview 2025-12 62.54 52.80 33.63 28.26 30.94 44.31
Mimo-V2-Omni 2026-03 64.07 48.94 21.04 18.52 19.78 38.14
Seed-2.0-Lite 2026-03 60.84 66.16 18.75 16.73 17.74 40.62
o4-mini 2025-04 65.27 59.63 33.21 29.51 31.36 46.91
GPT-4.1-nano 2025-04 50.12 22.31 2.32 9.02 5.67 20.94
GPT-5.2 2025-12 73.15 67.83 16.55 21.15 18.85 44.67
GPT-5.4-mini 2026-03 66.19 64.45 4.17 11.77 7.97 36.64

Main results on ViMU. OE: open-ended interpretation; EG: evidence grounding; RM: rhetoric mechanisms; SV: social value signals. SSU-Avg averages RM and SV; All-Avg averages all four tasks.
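
The two aggregate columns follow directly from the per-task scores. A minimal sketch of that arithmetic, using the o4-mini row above (not the official evaluation code):

# Per-task scores for one model (the o4-mini row from the table).
scores = {"OE": 65.27, "EG": 59.63, "RM": 33.21, "SV": 29.51}

ssu_avg = (scores["RM"] + scores["SV"]) / 2   # SSU-Avg: mean of RM and SV
all_avg = sum(scores.values()) / len(scores)  # All-Avg: mean of all four tasks

print(f"SSU-Avg ~ {ssu_avg:.2f}, All-Avg ~ {all_avg:.2f}")
# Matches the table up to rounding: SSU-Avg 31.36, All-Avg 46.91.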

Analysis

What ViMU reveals.

Finding 1: Surface perception is not enough. Models that perform well on open-ended description can still fail on rhetoric and social-signal classification.

Finding 2: Evidence is often incomplete. Many grounding errors come from missing necessary evidence rather than over-selecting irrelevant cues (see the sketch after these findings).

Finding 3: Models have systematic biases. They tend to prefer generic or safer categories and under-predict more implicit or socially coded meanings.
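
One way to make Finding 2 concrete is to compare the cue set a model selects against the annotated one and separate missing cues from spurious ones. A minimal sketch with hypothetical cue labels (not the paper's metric):

# Gold (annotated) and predicted evidence cues for one item; labels are hypothetical.
gold = {"sarcastic tone", "exaggerated gesture", "on-screen caption"}
pred = {"exaggerated gesture", "background music"}

missing = gold - pred    # necessary cues the model failed to select
spurious = pred - gold   # irrelevant cues the model over-selected

print(f"missing: {len(missing)}, spurious: {len(spurious)}")
# Finding 2 says that, aggregated over items, missing cues dominate spurious ones.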

Figure: Evidence grounding behavior and model-level error similarity: conservatism vs. performance, grounding error composition, pairwise evidence relation distortion, and a PCA of models by taxonomy error signatures.

Figure: Option-affinity bias and taxonomy-geometry distortion: rhetoric and social-value option-affinity biases, and average taxonomy geometry distortion across models.
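
One simple reading of option-affinity bias is the gap between how often a model predicts a category and how often that category is actually correct. A minimal sketch with hypothetical social-value categories (not the paper's exact measure):

from collections import Counter

# Hypothetical predictions and gold labels for the SV task; categories are illustrative.
preds = ["neutral", "neutral", "critique", "neutral", "praise", "neutral"]
golds = ["critique", "satire", "critique", "praise", "praise", "satire"]

pred_freq, gold_freq = Counter(preds), Counter(golds)
categories = set(preds) | set(golds)

# Positive bias: the model over-predicts the category; negative: it under-predicts it.
bias = {c: (pred_freq[c] - gold_freq[c]) / len(golds) for c in categories}
print(bias)  # "neutral" is over-predicted; "satire" is never predicted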

BibTeX

@article{li2026vimu,
  title={ViMU: Benchmarking Video Metaphorical Understanding},
  author={Li, Qi and Wang, Xinchao},
  journal={arXiv preprint arXiv:2605.14607},
  year={2026}
}
