Active Video Perception: Iterative Evidence Seeking
for Agentic Long Video Understanding

Ziyang Wang¹,²* Honglu Zhou¹ Shijie Wang¹ Junnan Li¹ Caiming Xiong¹

Silvio Savarese¹ Mohit Bansal² Michael S. Ryoo¹ Juan Carlos Niebles¹

¹Salesforce AI Research

²University of North Carolina at Chapel Hill

*: Work done during internship at Salesforce

Motivation

Motivation of Active Video Perception. Prior methods follow a passive perception paradigm that leverages query-agnostic captioners to perceive video information, leading to low efficiency and imprecise visual grounding. Instead, our model AVP actively perceives query-relevant content by treating the long video as an interactive environment to be explored in a goal-directed manner.

Method

Framework of Active Video Perception (AVP). AVP operates through an iterative plan-observe-reflect process with MLLM agents. At each round, the planner decides what, where, and how to interact with the video; the observer executes the plan to extract structured, query-relevant evidence; and the reflector evaluates the evidence to determine whether another round is needed. This closed-loop design steers computation toward informative segments, enables revisiting ambiguous moments, and adaptively allocates computational budget on long, complex videos.
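The plan-observe-reflect loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Plan`, `Evidence`, `Verdict`, and `run_avp_loop` are hypothetical, the three agents are toy stand-ins for MLLM calls, and the 600-second duration and `max_rounds=3` cap are assumptions made only for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Plan:
    what: str                    # what to look for
    where: Tuple[float, float]   # (start_s, end_s) segment to inspect
    fps: float                   # how densely to sample frames
    resolution: str              # how much spatial detail to request


@dataclass
class Evidence:
    interval: Tuple[float, float]
    note: str


@dataclass
class Verdict:
    sufficient: bool
    answer: Optional[str] = None


def run_avp_loop(
    query: str,
    duration_s: float,
    planner: Callable[[str, float, List[Evidence]], Plan],
    observer: Callable[[Plan], List[Evidence]],
    reflector: Callable[[str, List[Evidence]], Verdict],
    max_rounds: int = 3,  # assumed budget cap, not taken from the paper
):
    """Iterate plan -> observe -> reflect until the evidence is sufficient."""
    evidence: List[Evidence] = []
    verdict = Verdict(sufficient=False)
    rounds = 0
    for _ in range(max_rounds):
        rounds += 1
        plan = planner(query, duration_s, evidence)   # decide what/where/how
        evidence.extend(observer(plan))               # gather query-relevant evidence
        verdict = reflector(query, evidence)          # enough to answer?
        if verdict.sufficient:
            break
    return verdict, evidence, rounds


# Toy stand-ins for the three MLLM agents (hypothetical behavior).
def toy_planner(query: str, duration_s: float, evidence: List[Evidence]) -> Plan:
    if not evidence:
        # First round: coarse scan of the whole video.
        return Plan(query, (0.0, duration_s), fps=0.5, resolution="low")
    # Later rounds: revisit the most recent candidate interval in more detail.
    return Plan(query, evidence[-1].interval, fps=2.0, resolution="medium")


def toy_observer(plan: Plan) -> List[Evidence]:
    if plan.fps < 1.0:
        return [Evidence((60.0, 70.0), "candidate interval, target unclear")]
    return [Evidence(plan.where, "target localized")]


def toy_reflector(query: str, evidence: List[Evidence]) -> Verdict:
    if any(e.note == "target localized" for e in evidence):
        return Verdict(sufficient=True, answer="D")
    return Verdict(sufficient=False)


verdict, evidence, rounds = run_avp_loop(
    "When does the monument first appear?", 600.0,
    toy_planner, toy_observer, toy_reflector,
)
print(rounds, verdict)
```

With these toy agents the loop stops after the second round, because the reflector only accepts evidence gathered by the targeted pass. In a real system each callable would wrap an MLLM prompt, and the reflector's stopping rule would be the main lever for trading accuracy against compute.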

Results

Comparison with general-purpose MLLMs, video-specific MLLMs, and agentic video frameworks on five long video understanding benchmarks (MINERVA, LVBench, MLVU, Video-MME, LongVideoBench). The best result in each column is in bold and the second-best is underlined. AVP achieves the best performance on all five benchmarks and delivers significant improvements over its backbone model (highlighted in blue) on each of them.

Qualitative Analysis

Qualitative example of AVP. Given a multiple-choice query about the Tombstone monument's first on-screen appearance, Round 1 performs a coarse scan of the entire video (0.5 FPS, low resolution) and identifies a candidate interval [1:00-1:10], but the Reflector judges the evidence insufficient. In Round 2, the system re-plans a targeted pass over this window (2 FPS, medium resolution), enabling the Observer to localize the monument in the upper-left background and allowing the Reflector to confidently select the correct answer (option D) and halt.
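To make the cost of the two passes concrete, the sketch below computes which timestamps a plan would sample for a given interval and frame rate. The helpers `to_seconds` and `sample_timestamps` are hypothetical, uniform sampling over a half-open interval is an assumption, and the 10-minute video length used for the coarse pass is invented for illustration, since the example does not state the video's duration.

```python
import math
from typing import List


def to_seconds(mmss: str) -> float:
    """Convert an 'M:SS' timestamp such as '1:10' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + float(seconds)


def sample_timestamps(start_s: float, end_s: float, fps: float) -> List[float]:
    """Uniformly sample timestamps in the half-open interval [start_s, end_s)."""
    if fps <= 0 or end_s <= start_s:
        raise ValueError("need fps > 0 and end_s > start_s")
    # Small epsilon guards against floating-point error just below an integer.
    n_frames = math.floor((end_s - start_s) * fps + 1e-9)
    return [start_s + i / fps for i in range(n_frames)]


# Round 2 of the example: targeted pass over [1:00-1:10] at 2 FPS.
targeted = sample_timestamps(to_seconds("1:00"), to_seconds("1:10"), fps=2.0)

# Round 1: coarse scan at 0.5 FPS over an ASSUMED 10-minute (600 s) video.
coarse = sample_timestamps(0.0, 600.0, fps=0.5)

print(len(targeted), targeted[0], targeted[-1])
print(len(coarse))
```

Under these assumptions the targeted pass covers the 10-second window with 20 frames, one every half second, while the coarse pass looks at one frame every two seconds across the whole video.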

BibTeX

@misc{wang2025activevideoperceptioniterative,
      title={Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding}, 
      author={Ziyang Wang and Honglu Zhou and Shijie Wang and Junnan Li and Caiming Xiong and Silvio Savarese and Mohit Bansal and Michael S. Ryoo and Juan Carlos Niebles},
      year={2025},
      eprint={2512.05774},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.05774}, 
}