
SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation

A specification-guided framework that keeps semantic commitments identifiable across the generation lifecycle.

Tianfei Ren¹, Zhipeng Yan¹, Yiming Zhao¹, Zhen Fang¹*, Yu Zeng¹*, Guohui Zhang¹, Hang Xu¹, Xiaoxiao Ma¹, Shiting Huang¹, Ke Xu¹, Wenxuan Huang, Lionel Z. Wang²,³, Lin Chen¹, Zehui Chen¹, Jie Huang¹, Feng Zhao¹†

¹MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China

²The Hong Kong Polytechnic University   ³Nanyang Technological University

*Project lead   †Corresponding author

Figure: SCOPE teaser with diverse complex image generation examples.
Existing text-to-image pipelines can retrieve facts, verify outputs, and repair images, but they often lose track of which semantic commitment each operation is resolving, a discontinuity we term the Conceptual Rift. SCOPE decomposes each prompt into an evolving structured specification of entities, constraints, and unknowns. It conditionally invokes retrieval, reasoning, and repair skills for unresolved or violated commitments, and uses verification to route the generation lifecycle. The examples above illustrate faithful generation across knowledge-intensive events, reference-heavy intellectual properties, and multi-entity compositions with precise relation, layout, and attribute constraints.

Abstract

Text-to-image models have made strong progress in visual fidelity, but faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift. SCOPE addresses this problem by maintaining semantic commitments in an evolving structured specification and conditionally invoking retrieval, reasoning, and repair skills around unresolved or violated commitments. We further introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and also achieves strong results on WISE-V and MindBench.

Problem and Contributions

Conceptual Rift

Multi-step generation can retrieve facts, verify outputs, or repair images, but these operations often lose track of which semantic commitment they are resolving or fixing.

Persistent Specification

SCOPE represents each prompt as entities, constraints, and unknowns, then keeps this specification as the shared interface across all stages.
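
As a rough illustration of what such a shared specification could look like, the following Python sketch models entities, constraints, and unknowns as plain records. The class names, the `Status` values, the `open_commitments` helper, and the example prompt are assumptions made for this sketch, not SCOPE's actual schema; only the three item types and the attribute/relation/layout constraint kinds come from the text.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Status(Enum):
    """Hypothetical lifecycle states for a commitment."""
    UNRESOLVED = "unresolved"  # evidence or detail is still missing
    RESOLVED = "resolved"      # ready to be used for synthesis
    VIOLATED = "violated"      # a verifier reported a failure


@dataclass
class Entity:
    id: str
    description: str
    status: Status = Status.RESOLVED


@dataclass
class Constraint:
    id: str
    kind: str                  # "attribute", "relation", or "layout"
    entity_ids: List[str]
    text: str
    status: Status = Status.RESOLVED


@dataclass
class Unknown:
    id: str
    question: str
    answer: Optional[str] = None  # filled in once the unknown is resolved


@dataclass
class Specification:
    prompt: str
    entities: List[Entity] = field(default_factory=list)
    constraints: List[Constraint] = field(default_factory=list)
    unknowns: List[Unknown] = field(default_factory=list)

    def open_commitments(self) -> List[str]:
        """Return the ids of all items that still need attention."""
        open_ids = [u.id for u in self.unknowns if u.answer is None]
        open_ids += [e.id for e in self.entities if e.status is not Status.RESOLVED]
        open_ids += [c.id for c in self.constraints if c.status is not Status.RESOLVED]
        return open_ids


# Illustrative prompt and ids only.
spec = Specification(
    prompt="The winning team lifts the trophy on the podium",
    entities=[Entity("e1", "winning team", Status.UNRESOLVED), Entity("e2", "trophy")],
    constraints=[Constraint("c1", "relation", ["e1", "e2"], "team lifts trophy")],
    unknowns=[Unknown("u1", "Which team won?")],
)
print(spec.open_commitments())  # ['u1', 'e1']
```

Because every item carries a stable id, later stages can refer to the same commitment rather than to free-form text.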

Conditional Skills

Retrieval, reasoning, and repair are invoked only when the current specification exposes unresolved or violated commitments.
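
One way to picture this conditional invocation is a small planner that emits skill calls only for items that are unresolved or violated. The sketch below is an assumption-laden illustration: the `select_skills` function, the `"needs"` tag on each unknown, and the tuple format are hypothetical, and the source does not describe how SCOPE chooses between retrieval and reasoning.

```python
# Hypothetical skill names; the source only names the three skill families.
RESOLVING_SKILLS = {"retrieval", "reasoning"}


def select_skills(unknowns, violations):
    """Plan skill calls for the current specification state.

    unknowns:   dicts such as {"id": "u1", "needs": "retrieval"}
                (the "needs" tag is an assumption of this sketch)
    violations: ids of commitments a verifier reported as failed
    """
    calls = []
    for unknown in unknowns:
        if unknown["needs"] not in RESOLVING_SKILLS:
            raise ValueError(f"unsupported skill: {unknown['needs']}")
        calls.append((unknown["needs"], unknown["id"]))
    # Violated commitments are routed to repair, keyed by commitment id.
    calls.extend(("repair", cid) for cid in violations)
    return calls


# Nothing unresolved or violated: no skill is invoked.
print(select_skills([], []))  # []
# One missing fact and one failed constraint: two targeted calls.
print(select_skills([{"id": "u1", "needs": "retrieval"}], ["c3"]))
# [('retrieval', 'u1'), ('repair', 'c3')]
```

The point of the sketch is the empty case: when the specification exposes nothing to fix, the plan is empty and generation can proceed directly.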

Commitment-Level Evaluation

Gen-Arena evaluates strict intent fulfillment: an instance passes only when all required entities and all gated constraints are satisfied.

Method Overview

SCOPE iterates over decomposition, conditional skill invocation, synthesis, generation, verification, and repair. The Decomposer makes entities, constraints, and unknowns explicit in a structured specification. When the specification exposes missing evidence or unresolved requirements, SCOPE invokes retrieval and reasoning skills to fill them before synthesis. After generation, the Verifier maps item-level failures back to the specification, allowing repair skills to target the violated commitments instead of regenerating blindly.

Figure: Overview of the SCOPE framework.
SCOPE decomposes prompts into entities, constraints, and unknowns, invokes skills for unresolved or failed commitments, and feeds the updated specification into generation and verification.
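
The lifecycle described in this section can be sketched as a short control loop. This is a toy under stated assumptions, not the authors' implementation: `run_scope`, the `skills` dictionary, the `max_rounds` bound, and the stub behaviours are all hypothetical, and an "image" is represented here simply as the set of commitment ids it satisfies.

```python
def run_scope(prompt, skills, max_rounds=3):
    """Toy lifecycle: decompose, resolve if needed, generate, verify, repair."""
    spec = skills["decompose"](prompt)
    if spec["unknowns"]:                      # invoke resolution only when needed
        spec = skills["resolve"](spec)
    image = skills["generate"](skills["synthesize"](spec))
    for _ in range(max_rounds):               # max_rounds is an assumed bound
        failed = skills["verify"](image, spec)
        if not failed:
            return image, spec, True
        # Repair targets the failed commitment ids instead of regenerating blindly.
        image = skills["repair"](image, spec, failed)
    return image, spec, not skills["verify"](image, spec)


# Stub skills so the sketch runs end to end.
toy = {
    "decompose": lambda p: {"prompt": p, "commitments": ["e1", "c1"], "unknowns": ["u1"]},
    "resolve": lambda s: {**s, "unknowns": []},
    "synthesize": lambda s: s["prompt"],
    "generate": lambda text: {"e1"},          # the first draft misses c1
    "verify": lambda img, s: [c for c in s["commitments"] if c not in img],
    "repair": lambda img, s, failed: img | set(failed),
}

image, spec, ok = run_scope("a cat to the left of a dog", toy)
print(ok, sorted(image))  # True ['c1', 'e1']
```

In this toy run the verifier reports `c1` as failed after the first draft, repair adds it, and the second verification pass succeeds.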

Gen-Arena and EGIP

Gen-Arena contains 300 human-annotated instances across cartoon, game, sports, entertainment, competition, and ceremony categories. The benchmark includes 1,954 required entities, 2,533 atomic constraints, and 310 reference images. Constraints are grouped into attribute, relation, and layout commitments.

300 instances · 1,954 entities · 2,533 constraints · 310 reference images
Figure: Gen-Arena construction and EGIP evaluation.
EGIP uses an entity-first gate: missing or incorrect required entities fail the instance before dependent constraints are evaluated.
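
The entity-first gate can be illustrated with a minimal sketch. The function names and the boolean inputs are assumptions for illustration; the averaging over instances is also an assumption about how the per-instance pass criterion becomes a benchmark-level rate, since the page states only the pass rule itself.

```python
def egip_pass(entities_ok, constraints_ok):
    """Return True only if all required entities and all gated constraints hold.

    entities_ok:    one bool per required entity
    constraints_ok: one bool per gated constraint
    """
    if not all(entities_ok):
        return False          # entity gate: constraints are not consulted
    return all(constraints_ok)


def egip(instances):
    """Fraction of instances that pass (assumed aggregation)."""
    return sum(egip_pass(e, c) for e, c in instances) / len(instances)


# Four made-up instances, for illustration only.
instances = [
    ([True, True], [True, True]),   # passes
    ([True, False], [True]),        # fails at the entity gate
    ([True], [True, False]),        # entities fine, one constraint fails
    ([True], [True]),               # passes
]
print(egip(instances))  # 0.5
```

In a real evaluator the constraint judgments for a gated-out instance would presumably be skipped entirely rather than computed and ignored.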

Main Results

These numbers follow the current manuscript setting. The public paper link will be attached after release.

Full Gen-Arena Category Results

Method           Cartoon  Game  Sports  Entertainment  Competition  Ceremony  Overall
Nano Banana      0.02     0.00  0.16    0.14           0.10         0.00      0.07
Nano Banana Pro  0.04     0.00  0.34    0.16           0.18         0.54      0.21
Qwen-Image       0.00     0.00  0.10    0.02           0.00         0.00      0.02
Z-Image-Turbo    0.00     0.00  0.09    0.02           0.00         0.00      0.01
FLUX.1-dev       0.00     0.00  0.03    0.02           0.00         0.00      0.01
SCOPE            0.52     0.46  0.72    0.62           0.52         0.74      0.60

Gen-Arena EGIP

Method           Overall  Sports  Ceremony
Nano Banana      0.07     0.16    0.00
Nano Banana Pro  0.21     0.34    0.54
Qwen-Image       0.02     0.10    0.00
Z-Image-Turbo    0.01     0.09    0.00
SCOPE            0.60     0.72    0.74

Diagnostic Pass Rates

Method           EGIP  Entity  Gated Constraint
Nano Banana Pro  0.21  0.82    0.59
Qwen-Image       0.02  0.83    0.49
Z-Image-Turbo    0.01  0.84    0.48
SCOPE            0.60  0.92    0.83

Ablation

Method                             EGIP  Gated Constraint
Direct single                      0.21  0.59
Direct best-of-3                   0.40  0.71
SCOPE w/o retrieval and reasoning  0.22  0.60
SCOPE w/o repair                   0.42  0.72
SCOPE                              0.60  0.83

External Benchmarks

Benchmark            SCOPE  Best compared baseline
WISE-V Overall       0.907  0.876
MindBench Knowledge  0.59   0.40
MindBench Reasoning  0.63   0.44
MindBench Overall    0.61   0.41
Figure: Qualitative comparison between SCOPE and direct generation baselines.
Qualitative comparisons show that plausible images can still fail under exact entity, attribute, relation, or layout commitments.

BibTeX

@misc{ren2026scope,
  title  = {SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation},
  author = {Ren, Tianfei and Yan, Zhipeng and Zhao, Yiming and Fang, Zhen and Zeng, Yu and others},
  year   = {2026},
  eprint = {2605.08043},
  archivePrefix = {arXiv},
  url    = {https://arxiv.org/abs/2605.08043}
}