Conceptual Rift
Multi-step generation can retrieve facts, verify outputs, or repair images, but these operations often lose track of which semantic commitment they are resolving or fixing.
SCOPE: a specification-guided framework that keeps semantic commitments identifiable across the generation lifecycle.
1MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China
2The Hong Kong Polytechnic University 3Nanyang Technological University
*Project lead †Corresponding author
Text-to-image models have made strong progress in visual fidelity, but faithfully realizing complex visual intents remains challenging because many requirements must be tracked across grounding, generation, and verification. We refer to these requirements as semantic commitments and formalize their lifecycle discontinuity as the Conceptual Rift. SCOPE addresses this problem by maintaining semantic commitments in an evolving structured specification and conditionally invoking retrieval, reasoning, and repair skills around unresolved or violated commitments. We further introduce Gen-Arena, a human-annotated benchmark with entity- and constraint-level specifications, together with the Entity-Gated Intent Pass Rate (EGIP), a strict entity-first pass criterion. SCOPE substantially outperforms the evaluated baselines on Gen-Arena, achieving 0.60 EGIP, and also achieves strong results on WISE-V and MindBench.
SCOPE represents each prompt as entities, constraints, and unknowns, then keeps this specification as the shared interface across all stages.
Retrieval, reasoning, and repair are invoked only when the current specification exposes unresolved or violated commitments.
Gen-Arena evaluates strict intent fulfillment: an instance passes only when all required entities and all gated constraints are satisfied.
SCOPE iterates over decomposition, conditional skill invocation, synthesis, generation, verification, and repair. The Decomposer makes entities, constraints, and unknowns explicit in a structured specification. When the specification exposes missing evidence or unresolved requirements, SCOPE invokes retrieval and reasoning skills to fill them before synthesis. After generation, the Verifier maps item-level failures back to the specification, allowing repair skills to target the violated commitments instead of regenerating blindly.
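The conditional invocation described above can be sketched as a small state machine over the specification. All class and function names below are illustrative stand-ins, not SCOPE's actual API; the point is only that retrieval, repair, and generation are each triggered by the current specification state.

```python
from dataclasses import dataclass, field

# Illustrative sketch: a structured specification of entities,
# constraints, and unknowns that every stage reads and updates.

@dataclass
class Constraint:
    kind: str              # "attribute" | "relation" | "layout"
    text: str              # atomic commitment, e.g. "the trophy is gold"
    status: str = "open"   # "open" | "satisfied" | "violated"

@dataclass
class Spec:
    entities: list[str] = field(default_factory=list)
    constraints: list[Constraint] = field(default_factory=list)
    unknowns: list[str] = field(default_factory=list)  # missing evidence

    def unresolved(self) -> list[Constraint]:
        return [c for c in self.constraints if c.status != "satisfied"]

def scope_step(spec: Spec) -> str:
    """Choose the next action purely from the specification state."""
    if spec.unknowns:
        return "retrieve"   # fill missing evidence before synthesis
    if any(c.status == "violated" for c in spec.constraints):
        return "repair"     # target violated commitments, not full regen
    if spec.unresolved():
        return "generate"   # synthesize, then verify open commitments
    return "done"

spec = Spec(entities=["trophy"], unknowns=["appearance of the trophy"])
print(scope_step(spec))  # -> retrieve
```

Because the verifier writes failures back into the same `Spec`, a subsequent `scope_step` call naturally routes to `"repair"` for the violated commitments only.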
Gen-Arena contains 300 human-annotated instances across cartoon, game, sports, entertainment, competition, and ceremony categories. The benchmark includes 1,954 required entities, 2,533 atomic constraints, and 310 reference images. Constraints are grouped into attribute, relation, and layout commitments.
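EGIP's strict entity-first criterion can be sketched as follows. The variable names and toy instances are ours, not Gen-Arena's annotation format; the gating logic matches the stated rule that an instance passes only when every required entity is present and every gated constraint is satisfied.

```python
# Hedged sketch of the Entity-Gated Intent Pass Rate (EGIP) criterion.

def instance_passes(entities_found: dict[str, bool],
                    constraints_ok: dict[str, bool]) -> bool:
    # Entity-first gate: any missing required entity fails the
    # instance outright, regardless of constraint scores.
    if not all(entities_found.values()):
        return False
    return all(constraints_ok.values())

def egip(instances) -> float:
    """Fraction of (entities, constraints) instances that fully pass."""
    passes = [instance_passes(e, c) for e, c in instances]
    return sum(passes) / len(passes)

results = [
    ({"referee": True, "trophy": True},  {"trophy left of referee": True}),
    ({"referee": True, "trophy": False}, {"trophy left of referee": True}),
]
print(egip(results))  # -> 0.5
```

Note that the second instance fails despite satisfying its constraint, because a required entity is absent; this is what makes EGIP stricter than averaging entity and constraint scores separately.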
These statistics follow the current manuscript; the public paper link will be added after release.
EGIP on Gen-Arena by category (Ent. = entertainment, Comp. = competition; higher is better).

| Method | Cartoon | Game | Sports | Ent. | Comp. | Ceremony | Overall |
|---|---|---|---|---|---|---|---|
| Nano Banana | 0.02 | 0.00 | 0.16 | 0.14 | 0.10 | 0.00 | 0.07 |
| Nano Banana Pro | 0.04 | 0.00 | 0.34 | 0.16 | 0.18 | 0.54 | 0.21 |
| Qwen-Image | 0.00 | 0.00 | 0.10 | 0.02 | 0.00 | 0.00 | 0.02 |
| Z-Image-Turbo | 0.00 | 0.00 | 0.09 | 0.02 | 0.00 | 0.00 | 0.01 |
| FLUX.1-dev | 0.00 | 0.00 | 0.03 | 0.02 | 0.00 | 0.00 | 0.01 |
| SCOPE | 0.52 | 0.46 | 0.72 | 0.62 | 0.52 | 0.74 | 0.60 |
| Method | EGIP | Entity pass | Gated-constraint pass |
|---|---|---|---|
| Nano Banana Pro | 0.21 | 0.82 | 0.59 |
| Qwen-Image | 0.02 | 0.83 | 0.49 |
| Z-Image-Turbo | 0.01 | 0.84 | 0.48 |
| SCOPE | 0.60 | 0.92 | 0.83 |
| Method | EGIP | Gated Constraint |
|---|---|---|
| Direct single | 0.21 | 0.59 |
| Direct best-of-3 | 0.40 | 0.71 |
| SCOPE w/o retrieval and reasoning | 0.22 | 0.60 |
| SCOPE w/o repair | 0.42 | 0.72 |
| SCOPE | 0.60 | 0.83 |
| Benchmark | SCOPE | Best compared baseline |
|---|---|---|
| WISE-V Overall | 0.907 | 0.876 |
| MindBench Knowledge | 0.59 | 0.40 |
| MindBench Reasoning | 0.63 | 0.44 |
| MindBench Overall | 0.61 | 0.41 |
The gallery shows selected qualitative outputs from SCOPE runs, including curated showcase examples and additional Gen-Arena samples from all six categories. Each card shows the exact prompt used for that case, and the gallery can be filtered by commitment type and Gen-Arena category.
Some examples use recognizable public characters, brands, or venues as research stress tests for reference-grounded generation.
@misc{ren2026scope,
title = {SCOPE: Structured Decomposition and Conditional Skill Orchestration for Complex Image Generation},
author = {Ren, Tianfei and Yan, Zhipeng and Zhao, Yiming and Fang, Zhen and Zeng, Yu and others},
year = {2026},
eprint = {2605.08043},
archivePrefix = {arXiv},
url = {https://arxiv.org/abs/2605.08043}
}