“Visuals as Precise Control”
A video generated with the proposed In-Video Instructions. The textual prompt is fixed as “Follow the instructions step by step,” while the model synthesizes content purely from the visual signals embedded in the input frames.
Abstract
Large-scale video generative models have recently demonstrated strong visual capabilities. In this work, we investigate whether such capabilities can be harnessed for controllable image-to-video generation by interpreting visual signals embedded within the frames as instructions, a paradigm we term In-Video Instruction.
In contrast to prompt-based control, which provides textual descriptions that are inherently global and coarse, In-Video Instruction encodes user guidance directly into the visual domain through elements such as overlaid text, arrows, or trajectories. This enables explicit, spatially aware, and unambiguous correspondences between visual subjects and their intended actions by assigning distinct instructions to different objects.
Figure 1. Overview of the proposed In-Video Instruction framework.
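To make the encoding concrete, below is a minimal sketch of how such visual instructions could be composed onto an input frame using Pillow. The model call generate_video, the file name, the coordinates, and the instruction strings are illustrative assumptions, not the paper's actual interface.

# A minimal sketch of preparing an In-Video Instruction input frame.
# Drawing uses Pillow (PIL); `generate_video` is a hypothetical stand-in
# for any image-to-video model, NOT the authors' API.

from PIL import Image, ImageDraw

def add_instruction(frame: Image.Image,
                    text: str,
                    anchor_xy: tuple[int, int],
                    arrow_to_xy: tuple[int, int]) -> Image.Image:
    """Overlay a textual instruction and an arrow onto the frame,
    binding the instruction to the object located near `anchor_xy`."""
    frame = frame.copy()
    draw = ImageDraw.Draw(frame)
    # Text label placed next to the target object.
    draw.text(anchor_xy, text, fill="red")
    # Arrow shaft from the label toward the intended motion endpoint.
    draw.line([anchor_xy, arrow_to_xy], fill="red", width=4)
    # Fixed-orientation arrowhead: two short strokes at the endpoint
    # (a real renderer would rotate these to match the shaft direction).
    x, y = arrow_to_xy
    draw.line([(x - 12, y - 12), (x, y)], fill="red", width=4)
    draw.line([(x - 12, y + 12), (x, y)], fill="red", width=4)
    return frame

# Usage: each object receives its own spatially anchored instruction
# in image space, while the textual prompt stays fixed.
frame = Image.open("input.png")  # placeholder input frame
frame = add_instruction(frame, "jump onto the table", (120, 80), (300, 200))
frame = add_instruction(frame, "walk left", (400, 300), (250, 300))
# video = generate_video(first_frame=frame,
#                        prompt="Follow the instructions step by step")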
Generative Results
BibTeX
@article{fang2025invideo,
  title={In-Video Instructions: Visual Signals as Generative Control},
  author={Fang, Gongfan and Ma, Xinyin and Wang, Xinchao},
  journal={arXiv preprint arXiv:2511.19401},
  year={2025}
}