“Visuals as Precise Control”
A video generated with the proposed In-Video Instructions. The textual prompt is fixed as “Follow the instructions step by step,” while the model synthesizes content purely from the visual signals embedded in the input frames.
Abstract
Large-scale video generative models have recently demonstrated strong visual capabilities. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction.
In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects.
Figure 1. Overview of the proposed In-Video Instruction framework.
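To make the encoding concrete, below is a minimal sketch of how such visual instructions could be composed onto an input frame using Pillow. The model call generate_video, the file name, the coordinates, and the instruction strings are illustrative assumptions, not the paper's actual interface.

# A minimal sketch of preparing an In-Video Instruction input frame.
# Drawing uses Pillow (PIL); `generate_video` is a hypothetical stand-in
# for any image-to-video model, NOT the authors' API.

from PIL import Image, ImageDraw

def add_instruction(frame: Image.Image,
                    text: str,
                    anchor_xy: tuple[int, int],
                    arrow_to_xy: tuple[int, int]) -> Image.Image:
    """Overlay a textual instruction and an arrow onto the frame,
    binding the instruction to the object located near `anchor_xy`."""
    frame = frame.copy()
    draw = ImageDraw.Draw(frame)
    # Text label placed next to the target object.
    draw.text(anchor_xy, text, fill="red")
    # Arrow shaft from the label toward the intended motion endpoint.
    draw.line([anchor_xy, arrow_to_xy], fill="red", width=4)
    # Fixed-orientation arrowhead: two short strokes at the endpoint
    # (a real renderer would rotate these to match the shaft direction).
    x, y = arrow_to_xy
    draw.line([(x - 12, y - 12), (x, y)], fill="red", width=4)
    draw.line([(x - 12, y + 12), (x, y)], fill="red", width=4)
    return frame

# Usage: each object receives its own spatially anchored instruction
# in image space, while the textual prompt stays fixed.
frame = Image.open("input.png")  # placeholder input frame
frame = add_instruction(frame, "jump onto the table", (120, 80), (300, 200))
frame = add_instruction(frame, "walk left", (400, 300), (250, 300))
# video = generate_video(first_frame=frame,
#                        prompt="Follow the instructions step by step")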
Generative Results
BibTeX
@article{fang2025invideo,
  title={In-Video Instructions: Visual Signals as Generative Control},
  author={Fang, Gongfan and Ma, Xinyin and Wang, Xinchao},
  journal={arXiv preprint arXiv:2511.19401},
  year={2025}
}