$\mathcal{RTV}\text{-}Bench$: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video

Shuhang Xun1*, Sicheng Tao2*, Jungang Li2,3*†, Yibo Shi4, Zhixin Lin5, Zhanhui Zhu1, Yibo Yan2,3, Hanqian Li2, Linghao Zhang5, Shikang Wang6, Yixin Liu1, Hanbo Zhang7, Ying Ma1‡, Xuming Hu2,3
1HIT, 2HKUST(GZ), 3HKUST, 4XJTU, 5SDU, 6CityU, 7HUST
*Equal Contribution †Project Leader ‡Corresponding Author
πŸŽ‰ Accepted to the 39th Conference on Neural Information Processing Systems (NeurIPS 2025), Datasets and Benchmarks Track
RTV-Bench Examples

If our project helps you, please give us a star ⭐ on GitHub to support us. πŸ₯ΈπŸ₯Έ

πŸ”₯ News

  • 2025-09-20 πŸŽ‰πŸŽ‰πŸŽ‰ Our paper has been accepted by NeurIPS 2025! We will update the dataset and code for the community as soon as possible.
  • 2025-06-27 πŸŽ‰ We released the core evaluation code.
  • 2025-05-17 πŸŽ‰ We released the annotation file, QA.json.
  • 2025-05-04 πŸŽ‰ We released the paper $\mathcal{RTV}\text{-}Bench$: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video.
  • 2025-05-03 🌟 We are happy to release the $\mathcal{RTV}\text{-}Bench$. You can find it on HuggingFace or ModelScope.

πŸ‘€ $\mathcal{RTV}\text{-}Bench$ Overview

We introduce $\mathcal{RTV}\text{-}Bench$, a fine-grained benchmark for MLLM real-time video analysis, containing 552 videos (167.2 hours) and 4,631 high-quality QA pairs. We evaluated leading MLLMs, including proprietary models (e.g., GPT-4o, Gemini 2.0), open-source offline models (e.g., Qwen2.5-VL, VideoLLaMA3), and open-source real-time models (e.g., VITA-1.5, InternLM-XComposer2.5-OmniLive). Experimental results show that open-source real-time models largely outperform open-source offline ones but still trail the top proprietary models. Our analysis also reveals that larger model sizes or higher frame sampling rates do not significantly boost performance on $\mathcal{RTV}\text{-}Bench$ and sometimes cause slight decreases. This underscores the need for model architectures better optimized for video stream processing and long sequences to advance real-time video analysis with MLLMs.
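As a rough illustration of how predictions might be scored against the released annotations, here is a minimal sketch assuming a simple multiple-choice schema. The field names (`question_id`, `answer`) and the prediction file format are assumptions made for this sketch; please refer to QA.json and the released evaluation code for the actual format.

```python
# Minimal scoring sketch (assumed schema): compare predicted options
# against ground-truth answers from QA.json and report overall accuracy.
import json

def score_predictions(qa_path: str, pred_path: str) -> float:
    """Return accuracy of predictions over the QA pairs.

    Assumes QA.json is a list of records with 'question_id' and 'answer'
    fields, and that predictions map question_id -> chosen option letter.
    The real schema may differ; see QA.json in this repository.
    """
    with open(qa_path, encoding="utf-8") as f:
        qa_pairs = json.load(f)
    with open(pred_path, encoding="utf-8") as f:
        predictions = json.load(f)

    correct = sum(
        1 for qa in qa_pairs
        if predictions.get(str(qa["question_id"])) == qa["answer"]
    )
    return correct / len(qa_pairs)

if __name__ == "__main__":
    print(f"Accuracy: {score_predictions('QA.json', 'preds.json'):.3f}")
```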

$\mathcal{RTV}\text{-}Bench$ is built on three key principles (an illustrative record is sketched after the list):

  • Multi-Timestamp Question Answering (MTQA), where answers evolve with scene changes;
  • Hierarchical Question Structure, combining basic and advanced queries; and
  • Multi-dimensional Evaluation, assessing continuous perception, understanding, and reasoning abilities.
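To make the multi-timestamp idea concrete, the snippet below sketches what an MTQA-style record could look like, with the correct option changing as the scene evolves. The keys, timestamps, and values are invented for illustration and are not the actual QA.json schema.

```python
# Illustrative multi-timestamp QA record (invented fields, not the real schema):
# the same question is asked at several timestamps, and the correct option
# evolves as the scene changes.
example_mtqa_record = {
    "video_id": "sports_0001",  # hypothetical identifier
    "question": "Which team is currently leading?",
    "options": ["A. Red team", "B. Blue team", "C. Tied", "D. Cannot tell"],
    "answers_by_timestamp": [
        {"timestamp_s": 120, "answer": "C"},    # early in the game: tied
        {"timestamp_s": 900, "answer": "A"},    # mid-game: red team leads
        {"timestamp_s": 2400, "answer": "B"},   # late comeback: blue team leads
    ],
}
```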

Video Categories and Distribution of Question Difficulty and Query Characteristics

Dataset Statistics

(Left) RTV-Bench covers 3 key domains and 16 sub-class video types.

(Center) Distribution of question difficulty levels across eight representative task types, with difficulty measured by percentage-based performance ranges.

(Right) Distribution of question queries by video length, categorized into Shallow, Moderate, and Deep levels. The bar heights indicate counts, while the line chart overlays query proportions for each duration bucket.

πŸ”– Evaluation Results


πŸ“Š Visualization


πŸ“‘ Citation

If you find $\mathcal{RTV}\text{-}Bench$ useful for your research and applications, please cite using this BibTeX:

@inproceedings{xun2025rtv,
  title={RTV-Bench: Benchmarking MLLM Continuous Perception, Understanding and Reasoning through Real-Time Video},
  author={Xun, Shuhang and Tao, Sicheng and Li, Jungang and Shi, Yibo and Lin, Zhixin and Zhu, Zhanhui and Yan, Yibo and Li, Hanqian and Zhang, Linghao and Wang, Shikang and Liu, Yixin and Zhang, Hanbo and Ma, Ying and Hu, Xuming},
  booktitle={Advances in Neural Information Processing Systems},
  volume={38},
  year={2025},
  organization={NeurIPS}
}