InterleaveThinker: Reinforcing Agentic Interleaved Generation


Summary

Generating a coherent mix of text and images in a single output — think a step-by-step visual recipe, an illustrated storybook, or a robotics manipulation guide — is called interleaved generation. While modern image generators are impressively photorealistic, they are architecturally constrained to produce images one at a time, and even the best open-source Unified Multimodal Models (UMMs) struggle to plan and execute long sequences of alternating text and images coherently. InterleaveThinker fills this gap as the first multi-agent pipeline that can bolt interleaved generation capability onto any existing image generator, without retraining it.

How It Works

The system is built around two cooperating language model agents:

Planner agent: Given a user prompt, the planner decomposes the task into a sequence of steps — deciding what text to write and what image to generate at each point, and writing precise instructions for the image generator for every step.
Critic agent: After each generator call, the critic inspects the output, checks whether it faithfully follows the planner's instruction, and — if it doesn't — rewrites the instruction and triggers regeneration. This creates a closed, self-correcting loop.
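The planner-critic loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the names `Step`, `Verdict`, `run_interleaved`, and the `max_retries` cap are all hypothetical, and the planner, generator, and critic are passed in as plain callables so that any image generator could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Step:
    text: str         # text shown to the reader at this step
    instruction: str  # instruction handed to the image generator


@dataclass
class Verdict:
    accepted: bool
    revised_instruction: Optional[str] = None


def run_interleaved(
    prompt: str,
    plan: Callable[[str], List[Step]],
    generate: Callable[[str], str],
    critique: Callable[[str, str], Verdict],
    max_retries: int = 3,  # assumed cap; the source does not state one
) -> List[Tuple[str, str]]:
    """Return (text, image) pairs, regenerating images the critic rejects."""
    output = []
    for step in plan(prompt):
        instruction = step.instruction
        image = generate(instruction)
        for _ in range(max_retries):
            # The critic judges the image against the planner's original instruction.
            verdict = critique(step.instruction, image)
            if verdict.accepted:
                break
            # Rejected: use the critic's rewritten instruction and regenerate.
            instruction = verdict.revised_instruction or instruction
            image = generate(instruction)
        # If retries run out, the latest image is kept without a further check.
        output.append((step.text, image))
    return output


# Toy stand-ins for the three components (images are just strings here).
def toy_plan(prompt: str) -> List[Step]:
    return [Step("Boil water.", "pot of water"), Step("Add pasta.", "pasta in pot")]


def toy_generate(instruction: str) -> str:
    return f"<image: {instruction}>"


def toy_critique(target: str, image: str) -> Verdict:
    if "steam" in image:
        return Verdict(True)
    return Verdict(False, target + ", with steam")


result = run_interleaved("pasta recipe", toy_plan, toy_generate, toy_critique)
```

Because the generator is only ever called through `generate`, swapping in a different image model does not touch the loop, which mirrors the plug-and-play idea in the text.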

Training proceeds in two stages. The first is a supervised fine-tuning (SFT) cold start: the team constructed Interleave-Planner-SFT-80k (80k samples that teach the planning format) and Interleave-Critic-SFT-112k (112k samples that teach the critique format). In the second stage, the critic is further strengthened with reinforcement learning: it is trained on Interleave-Critic-RL-13k with GRPO (Group Relative Policy Optimization) to sharpen step-wise instruction correction inside a generation trajectory. Because a single trajectory can span more than 25 generator calls, optimizing the whole trajectory end-to-end is computationally prohibitive. The paper's key training innovation is therefore a pair of targeted reward signals, an accuracy reward (does the final output match intent?) and a step-wise reward (is each individual correction useful?), that let single-step RL propagate learning across the full trajectory efficiently.
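To make the GRPO step concrete, here is a small sketch of the group-relative advantage computation applied to a single critic step. How the paper combines its two rewards is not stated in this summary, so the weighted sum in `combined_reward` and the weight `w_step` are assumptions made purely for illustration; the use of the population standard deviation is likewise a choice, and implementations differ on it.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(accuracy: float, stepwise: float, w_step: float = 0.5) -> float:
    # Assumed combination: a weighted sum of the accuracy and step-wise rewards.
    return accuracy + w_step * stepwise


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of sampled critic responses.

    GRPO scores each sample relative to its group's mean and spread
    instead of using a learned value model as the baseline.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical critic rewrites sampled for the same generation step,
# each scored as (accuracy reward, step-wise reward).
scores = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0), (0.0, 0.0)]
rewards = [combined_reward(a, s) for a, s in scores]
advantages = group_relative_advantages(rewards)
```

Samples that beat the group average get a positive advantage and are reinforced; a group in which every sample earns the same reward yields all-zero advantages and contributes no learning signal.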

Why It Matters

The plug-and-play design is the standout contribution: no changes to the underlying image generator are needed. The paper demonstrates gains with multiple generators, including FLUX variants. On interleaved generation benchmarks the system reaches performance comparable to Nano Banana and GPT-5, two strong closed-source reference points. Perhaps the most surprising result is a spillover benefit: InterleaveThinker also significantly enhances the base model on reasoning-based benchmarks. For example, with 4-step FLUX.2-klein, substantial gains are observed on WISE and RISE. This suggests the critic's correction loop adds useful reasoning even for single-image tasks.

Related Work

The OpenING benchmark (CVPR 2025) introduced a comprehensive evaluation suite comprising 5,400 human-annotated instances across 56 real-world interleaved tasks, covering scenarios such as travel guides, design, and brainstorming; it serves as one of the key evaluation testbeds for InterleaveThinker. The IRG (Interleaving Reasoning for Generation) line of work also targets reasoning-enhanced image synthesis, reporting absolute gains of 5–10 points across benchmarks including GenEval, WISE, and TIIF. Wan-Weaver takes a decoupled training approach to interleaved image-text generation, evaluating on OpenING's seven metrics against integrated and pipeline-based methods. On the RL side, GRPO replaces the learned value model of traditional actor-critic setups (unrelated to InterleaveThinker's critic agent) with group-wise standardized advantage estimators, improving sample efficiency and alignment stability, properties that InterleaveThinker relies on for its step-wise critic training. Nano Banana refers to Google's Gemini-based image generation system capable of automatic interleaved image and text output, serving as a key closed-source baseline.

Implementations

An official open-source implementation is available on GitHub at github.com/zhengdian1/InterleaveThinker, including the training datasets (Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k) and code for GRPO-based critic training with accuracy and step-wise rewards. No third-party independent re-implementations were found at the time of writing.

Applications

Visual narratives: Automatically generating illustrated stories, comics, or educational explainers where text and images must interlock meaningfully.
Step-by-step guides: How-to content (cooking, DIY, assembly) where each instruction step is paired with a matching image.
Embodied manipulation: Providing robots with grounded, multi-step visual+language plans — an explicit motivation cited in the paper.
Creative design: Producing multi-panel mood boards, travel itineraries, or product concept decks with coherent visual–textual flow.
Enhanced single-image reasoning: The critic's correction loop improves performance on reasoning-based image generation benchmarks such as WISE and RISE, making the system useful even outside pure interleaved settings.