OSCBench: Benchmarking Object State Change in Text-to-Video Generation

1 National University of Singapore 2 Singapore Management University 3 Carnegie Mellon University 4 Fudan University
arXiv 2026
*Equal Contribution

Current text-to-video (T2V) models generate high-quality videos
but still struggle with object state changes.

Regular scenario: A woman is slicing an apple
(Discontinuous state changes)

Minimal Prompt: Slicing apple
(Objects appear suddenly)

Novel scenario: A chef is sauteing pineapple on the grass
(Objects appear suddenly)

Compositional scenario: A chef is mincing and sauteing onion in a kitchen
(Visual artifacts)

Why Object State Change Matters

  • Object state change (OSC) is common in daily life and indicates whether a task has been completed.
  • Existing text-to-video evaluations mainly focus on semantic alignment, visual quality, and physical plausibility, but overlook whether the object reaches the intended target state.
  • T2V-generated videos may look plausible but still fail to produce correct and temporally consistent object state changes.
Radar plot showing the gap between semantic alignment and OSC performance

What is OSCBench

OSCBench evaluates whether T2V models can correctly reason about and render action-induced object state changes.
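The three scenario types differ only in how action–object pairs are placed in context. A minimal sketch of how such prompts can be assembled, using hypothetical templates (the exact wording OSCBench uses may differ):

```python
# Hypothetical prompt templates for the three OSC scenario types;
# the benchmark's actual phrasing may differ.

def regular_prompt(agent, action, obj, place):
    # Familiar action-object pairing in a plausible setting
    return f"{agent} is {action} {obj} {place}"

def novel_prompt(agent, action, obj, place):
    # Same template, but with an unusual object or setting
    return f"{agent} is {action} {obj} {place}"

def compositional_prompt(agent, action1, action2, obj, place):
    # Two successive state-changing actions on the same object
    return f"{agent} is {action1} and {action2} {obj} {place}"

print(regular_prompt("A woman", "slicing", "an apple", "in the kitchen"))
print(compositional_prompt("A chef", "mincing", "sauteing", "onion", "in a kitchen"))
```

A compositional prompt thus requires the model to render two state changes in sequence on the same object, which is where the artifact failures above tend to appear.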

OSCBench construction and evaluation pipeline

Overview of the OSCBench construction and evaluation pipeline. We build unified action and object categories from instructional cooking data via a human-in-the-loop process, and construct regular, novel, and compositional OSC scenarios as text prompts for video generation. The generated videos are evaluated by both humans and MLLMs, and we analyze the correlation between the two to assess the reliability of automatic evaluation.
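The human–MLLM agreement step can be sketched as a rank correlation. Below is a minimal Spearman's ρ over hypothetical per-video scores; the paper's actual correlation measure and scores are not reproduced here:

```python
# Spearman's rho between hypothetical human and MLLM scores,
# implemented from scratch (average ranks for ties, then Pearson on ranks).

def rank(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend the tie group
        avg = (i + j) / 2 + 1  # average rank for the tie group (1-based)
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    ra, rb = rank(a), rank(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)

human = [4, 2, 5, 3, 1]          # hypothetical human ratings per video
mllm = [3.5, 2.0, 4.5, 3.0, 1.5]  # hypothetical MLLM scores per video
print(round(spearman(human, mllm), 3))  # → 1.0 (perfect rank agreement)
```

A high ρ between human and MLLM rankings would support using the MLLM as an automatic judge; a low ρ would indicate the automatic scores cannot be trusted for OSC evaluation.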

Benchmark Statistics

20 action elements → 8 action categories

134 object elements → 28 object categories

108 regular scenarios

20 novel scenarios

12 compositional scenarios

Each scenario contains 8 action–object pairs

1,120 prompts overall ((108 + 20 + 12) scenarios × 8 action–object pairs)

Benchmark statistics visualization

Example prompts and failure cases from regular, novel, and compositional OSC scenarios.

Main Results

Main result figure 1

Overall performance of T2V models from human and MLLM-based evaluators.

Main result figure 2

OSC performance across action categories by human evaluation.

Main results table

Interesting Results

Regular scenario: A chef is slicing leek at a street food stand
(Visual artifacts)

Regular scenario (minimal prompt): Peeling zucchini
(Visual artifacts)

Novel scenario: A woman is zesting grapefruit in the kitchen
(Memorization rather than understanding)

Compositional scenario: A robot with robotic hands is mincing and sauteing ginger in the kitchen
(Only one state change)

BibTeX

@article{OSCBench2026,
  title={OSCBench: Benchmarking Object State Change in Text-to-Video Generation},
  author={Han, Xianjing and Zhu, Bin and Hu, Shiqi and Li, Mingzhe Franklin and Carrington, Patrick and Zimmermann, Roger and Chen, Jingjing},
  year={2026},
  url={https://arxiv.org/abs/2603.11698}
}