


Qualitative example of AVP. Given a multiple-choice query about the Tombstone monument's first on-screen appearance, Round 1 performs a coarse scan of the entire video (0.5 FPS, low resolution) and identifies a candidate interval [1:00-1:10], but the Reflector judges the evidence insufficient. In Round 2, the system re-plans a targeted pass over this window (2 FPS, medium resolution), enabling the Observer to localize the monument in the upper-left background and allowing the Reflector to confidently select the correct answer (option D) and halt.
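
The two rounds in this example can be read as a coarse-to-fine loop: observe under the current plan, let a reflection step judge the evidence, and either halt or re-plan over a narrower window at higher fidelity. The following is a minimal sketch of that control flow only, not the authors' implementation. All names (`Plan`, `Verdict`, `run_rounds`, `toy_observe`, `toy_reflect`), the 120-second video length, and the reuse of the last sampling setting for any later round are illustrative assumptions. Only the two sampling settings and the 1:00-1:10 window come from the caption.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Plan(NamedTuple):
    start_s: float      # window start, in seconds
    end_s: float        # window end, in seconds
    fps: float          # sampling rate for this pass
    resolution: str     # "low" or "medium" in the caption's example


class Verdict(NamedTuple):
    sufficient: bool                           # is the evidence enough to answer?
    answer: Optional[str]                      # chosen option when sufficient
    next_window: Optional[Tuple[float, float]] # candidate interval to revisit


# Per-round sampling settings. The two entries are taken from the caption;
# reusing the last one for any further round is an assumption.
SCHEDULE = [(0.5, "low"), (2.0, "medium")]


def run_rounds(
    video_len_s: float,
    observe: Callable[[Plan], str],
    reflect: Callable[[str], Verdict],
    max_rounds: int = 3,
) -> Tuple[Optional[str], List[Plan]]:
    """Iterate plan -> observe -> reflect until the evidence is judged sufficient."""
    plan = Plan(0.0, video_len_s, *SCHEDULE[0])  # Round 1: scan the whole video
    history: List[Plan] = []
    for round_idx in range(max_rounds):
        history.append(plan)
        verdict = reflect(observe(plan))
        if verdict.sufficient:
            return verdict.answer, history
        if verdict.next_window is None:
            break  # nothing left to zoom into
        fps, res = SCHEDULE[min(round_idx + 1, len(SCHEDULE) - 1)]
        start_s, end_s = verdict.next_window
        plan = Plan(start_s, end_s, fps, res)    # targeted pass over the window
    return None, history


# Toy stand-ins for the Observer and Reflector (hypothetical; they only mimic
# the behaviour described in the caption).
def toy_observe(plan: Plan) -> str:
    if plan.fps >= 2.0:
        return "monument visible in upper-left background"
    return "possible monument between 60s and 70s"


def toy_reflect(evidence: str) -> Verdict:
    if "visible" in evidence:
        return Verdict(True, "D", None)
    return Verdict(False, None, (60.0, 70.0))


# 120.0 is a made-up video length; the caption does not state one.
answer, history = run_rounds(120.0, toy_observe, toy_reflect)
```

With these stand-ins the loop halts after the second plan, which covers 60 s to 70 s at 2 FPS and medium resolution, mirroring the Round 2 pass described above.
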
@misc{wang2025activevideoperceptioniterative,
      title={Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding},
      author={Ziyang Wang and Honglu Zhou and Shijie Wang and Junnan Li and Caiming Xiong and Silvio Savarese and Mohit Bansal and Michael S. Ryoo and Juan Carlos Niebles},
      year={2025},
      eprint={2512.05774},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.05774},
}