---
title: "InterleaveThinker: Reinforcing Agentic Interleaved Generation"
description: "InterleaveThinker is a plug-and-play multi-agent pipeline (planner + RL-trained critic) that endows any existing image generator with coherent interleaved text-image sequence generation, achieving performance on par with GPT-5 and Nano Banana on interleaved benchmarks."
type: research-paper-digest
arxiv_id: 2606.13679v1
source_url: http://arxiv.org/abs/2606.13679v1
pdf_url: http://arxiv.org/pdf/2606.13679v1
authors: ["Dian Zheng", "Harry Lee", "Manyuan Zhang", "Kaituo Feng", "Zoey Guo", "Ray Zhang"]
published: 2026-06-11T17:59:50Z
retrieved: 2026-06-14
code_url: https://github.com/zhengdian1/InterleaveThinker
has_code: true
canonical: https://flawedquote.com/media/research/arxiv-2606-13679v1.html
review_of: "InterleaveThinker: Reinforcing Agentic Interleaved Generation"
document_kind: third-party-review
affiliation: none; not affiliated with the paper's authors
provenance: ai-generated-digest; web-grounded; verify against source
---

# InterleaveThinker: Reinforcing Agentic Interleaved Generation

**This is an AI-generated review/digest of a third-party paper by Dian Zheng, Harry Lee, Manyuan Zhang, Kaituo Feng, Zoey Guo, Ray Zhang (http://arxiv.org/abs/2606.13679v1) — not the original paper, and not affiliated with its authors. Treat it as a secondary source and verify against the original.**

> InterleaveThinker is a plug-and-play multi-agent pipeline (planner + RL-trained critic) that endows any existing image generator with coherent interleaved text-image sequence generation, achieving performance on par with GPT-5 and Nano Banana on interleaved benchmarks.

## What this answers

- What is this paper about and what problem does it solve?
- What are the concrete results / benchmark numbers?
- How does the method work?
- Is there an official open-source implementation?
- What are the limitations and what is NOT shown?
- What related/prior work does it build on?
- How do I cite this paper? (BibTeX below)

## Key results

- Achieves interleaved generation performance comparable to Nano Banana (Google Gemini) and GPT-5 on interleaved benchmarks
- Substantial gains on WISE and RISE reasoning-based benchmarks when applied to 4-step FLUX.2-klein
- Critic RL training uses only 13k samples (Interleave-Critic-RL-13k) yet guides trajectories of 25+ generator calls
- SFT cold-start uses 80k planner samples and 112k critic samples

## Method at a glance

- **Problem:** Existing image generators cannot produce interleaved (alternating text and image) sequences, and open-source unified multimodal models show limited performance on this task despite its importance for visual narratives, how-to guides, and embodied manipulation.
- **Method:** A two-agent pipeline: a planner LM decomposes a task into a step-by-step text+image plan; a critic LM evaluates each generator output, identifies deviations, and rewrites instructions for regeneration. The critic is trained with GRPO using accuracy reward (end-to-end correctness) and step-wise reward (per-step correction quality), enabling single-step RL to efficiently guide long multi-call trajectories.
- **Data:** Interleave-Planner-SFT-80k (planner SFT), Interleave-Critic-SFT-112k (critic SFT), Interleave-Critic-RL-13k (critic GRPO RL); evaluated on OpenING benchmark and reasoning benchmarks WISE and RISE
- **Metrics:** OpenING benchmark scores (interleaved generation quality), WISE and RISE benchmark scores (reasoning-based image generation), comparison against Nano Banana and GPT-5 as closed-source baselines

## Resources

- **Paper:** http://arxiv.org/abs/2606.13679v1
- **PDF:** http://arxiv.org/pdf/2606.13679v1
- **Code:** https://github.com/zhengdian1/InterleaveThinker

## Limitations

- Single trajectory may require 25+ generator calls, making end-to-end RL optimization computationally intractable — the step-wise reward is a practical workaround but may not be globally optimal
- Performance depends on the quality of the underlying image generator
- Critic training data (13k) is relatively small; generalization to highly out-of-distribution tasks is untested
- The abstract reports no ablation showing how much each reward component contributes

## Applications

- Visual narrative and illustrated story generation
- Step-by-step how-to guides with matched images
- Embodied robot manipulation planning (multi-step visual+language plans)
- Creative design (mood boards, travel itineraries, product concept decks)
- Enhanced single-image reasoning generation (spillover to the reasoning-based benchmarks WISE and RISE)

## Related work

- [OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation](https://arxiv.org/abs/2411.18499)
- [Interleaving Reasoning for Better Text-to-Image Generation (IRG)](https://arxiv.org/abs/2509.06945)
- [Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training](https://arxiv.org/abs/2603.25706)
- [Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking](https://arxiv.org/abs/2602.21435)

## Citation

```bibtex
@misc{arxiv_2606_13679,
  title={InterleaveThinker: Reinforcing Agentic Interleaved Generation},
  author={Dian Zheng and Harry Lee and Manyuan Zhang and Kaituo Feng and Zoey Guo and Ray Zhang},
  year={2026},
  eprint={2606.13679},
  archivePrefix={arXiv},
  url={http://arxiv.org/abs/2606.13679v1}
}
```

## Optional — full explainer

*Everything above is self-contained; skip this if you are context-limited.*

## Summary

Generating a coherent mix of text and images in a single output — think a step-by-step visual recipe, an illustrated storybook, or a robotics manipulation guide — is called interleaved generation. While modern image generators are impressively photorealistic, they are architecturally constrained to produce images one at a time, and even the best open-source Unified Multimodal Models (UMMs) struggle to plan and execute long sequences of alternating text and images coherently. InterleaveThinker fills this gap as the first multi-agent pipeline that can bolt interleaved generation capability onto any existing image generator, without retraining it.

## How It Works

The system is built around two cooperating language model agents:

  - Planner agent: Given a user prompt, the planner decomposes the task into a sequence of steps — deciding what text to write and what image to generate at each point, and writing precise instructions for the image generator for every step.

  - Critic agent: After each generator call, the critic inspects the output, checks whether it faithfully follows the planner's instruction, and — if it doesn't — rewrites the instruction and triggers regeneration. This creates a closed, self-correcting loop.
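
The planner/critic loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `PlanStep`, `Verdict`, `run_interleaved`, the `max_retries` budget, and the toy generator and critic are all hypothetical names introduced here. In the real system the planner and critic are language models and the generator is an image model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PlanStep:
    text: str               # narrative text shown to the reader for this step
    image_instruction: str  # prompt the planner writes for the image generator


@dataclass
class Verdict:
    ok: bool                  # True if the image follows the planner's intent
    revised_instruction: str  # critic's rewrite, used only when ok is False


def run_interleaved(
    plan: List[PlanStep],
    generate: Callable[[str], str],
    critique: Callable[[str, str], Verdict],
    max_retries: int = 3,  # hypothetical budget; not a value from the paper
) -> Tuple[List[Tuple[str, str]], int]:
    """Return interleaved (text, image) pairs and the number of generator calls."""
    output: List[Tuple[str, str]] = []
    calls = 0
    for step in plan:
        image = generate(step.image_instruction)
        calls += 1
        for _ in range(max_retries):
            # The critic always judges against the planner's original intent.
            verdict = critique(step.image_instruction, image)
            if verdict.ok:
                break
            image = generate(verdict.revised_instruction)
            calls += 1
        # If the retry budget runs out, the last image is kept as-is.
        output.append((step.text, image))
    return output, calls


# Toy stand-ins (assumptions): images are strings, and the critic
# insists that every image mentions "red".
def toy_generate(instruction: str) -> str:
    return f"img({instruction})"


def toy_critique(intent: str, image: str) -> Verdict:
    return Verdict(ok="red" in image, revised_instruction=f"{intent}, red")


plan = [
    PlanStep("Step 1: pick an apple.", "a red apple"),
    PlanStep("Step 2: drive home.", "a car"),
]
pairs, calls = run_interleaved(plan, toy_generate, toy_critique)
```

In this toy run the first step passes on the first try and the second needs one regeneration, which shows how a long plan with several corrections per step can add up to many generator calls.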

Training proceeds in two stages. The first is a supervised fine-tuning (SFT) cold start: the team constructed Interleave-Planner-SFT-80k (80k samples that teach the planning format) and Interleave-Critic-SFT-112k (112k samples that teach the critique format). In the second stage, the critic is further trained with reinforcement learning: GRPO (Group Relative Policy Optimization) on Interleave-Critic-RL-13k sharpens step-wise instruction correction inside a generation trajectory.

Because a single trajectory can span more than 25 generator calls, optimizing the whole trajectory end-to-end is computationally prohibitive. The paper's key training innovation is therefore a pair of targeted reward signals: an accuracy reward (does the final output match intent?) and a step-wise reward (is each individual correction useful?). Together they let single-step RL guide the full trajectory efficiently.
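
One way the accuracy and step-wise rewards might be combined into a single scalar for one critic decision is a weighted sum. This is a sketch under assumptions: this digest does not state how the paper scores or weights the two terms, so `critic_reward`, the `[0, 1]` range for the step score, and the weights `w_acc` and `w_step` are hypothetical.

```python
def critic_reward(
    final_correct: bool,
    step_score: float,
    w_acc: float = 1.0,   # hypothetical weight on the accuracy reward
    w_step: float = 1.0,  # hypothetical weight on the step-wise reward
) -> float:
    """Scalar reward for one sampled critic correction.

    final_correct: whether the end result matched the intended output
                   (the accuracy reward).
    step_score:    how useful this individual correction was, assumed to
                   lie in [0, 1] (the step-wise reward).
    """
    if not 0.0 <= step_score <= 1.0:
        raise ValueError("step_score is assumed to lie in [0, 1]")
    accuracy = 1.0 if final_correct else 0.0
    return w_acc * accuracy + w_step * step_score
```

With the default weights, `critic_reward(True, 0.5)` gives `1.5`. Because both terms are computed per correction, each critic step can be scored without optimizing the whole 25+ call trajectory jointly.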

## Why It Matters

The plug-and-play design is the standout contribution: no changes to the underlying image generator are needed. The paper demonstrates gains with multiple generators, including FLUX variants. On interleaved generation benchmarks the system reaches performance comparable to Nano Banana and GPT-5, both closed-source reference points. Perhaps the most surprising result is the spillover benefit: the pipeline also significantly enhances the base model on reasoning-based benchmarks. For example, on 4-step FLUX.2-klein, substantial gains are observed on WISE and RISE. This suggests that the critic's correction loop helps on reasoning-heavy prompts even in single-image tasks; the generator itself is not retrained.

## Related Work

The OpenING benchmark (CVPR 2025) introduced a comprehensive evaluation suite comprising 5,400 human-annotated instances across 56 real-world interleaved tasks, covering scenarios such as travel guides, design, and brainstorming; it serves as one of the key evaluation beds for InterleaveThinker. The IRG (Interleaving Reasoning for Generation) line of work also targets reasoning-enhanced image synthesis, reporting absolute gains of 5–10 points across benchmarks including GenEval, WISE, and TIIF. Wan-Weaver takes a decoupled training approach to interleaved image-text generation, evaluating on OpenING's seven metrics against integrated and pipeline-based methods. On the RL side, GRPO drops the learned value network of traditional actor-critic setups (a different "critic" from InterleaveThinker's critic agent) and instead estimates advantages by standardizing rewards within a group of sampled responses, which is credited with improving sample efficiency and alignment stability. InterleaveThinker relies on this for its step-wise critic training. Nano Banana refers to Google's Gemini-based image generation system capable of automatic interleaved image and text output, serving as a key closed-source baseline.
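
The group-wise standardized advantage used by GRPO can be illustrated in a few lines. This is a generic sketch of the estimator, not code from the InterleaveThinker repository; the function name, the `eps` stabilizer, and the use of the population standard deviation are choices made here for illustration (implementations differ on the last point).

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so a
    response is rewarded only for being better than its siblings and no
    learned value network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an illustrative choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled critic rewrites for the same step, two good and two bad.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Responses above the group mean receive positive advantages and those below receive negative ones. When every response in the group earns the same reward, all advantages are zero and that group contributes no gradient signal.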

## Implementations

An official open-source implementation is available on GitHub at [github.com/zhengdian1/InterleaveThinker](https://github.com/zhengdian1/InterleaveThinker), including the training datasets (Interleave-Planner-SFT-80k, Interleave-Critic-SFT-112k, and Interleave-Critic-RL-13k) and code for GRPO-based critic training with accuracy and step-wise rewards. No third-party independent re-implementations were found at the time of writing.

## Applications

  - Visual narratives: Automatically generating illustrated stories, comics, or educational explainers where text and images must interlock meaningfully.

  - Step-by-step guides: How-to content (cooking, DIY, assembly) where each instruction step is paired with a matching image.

  - Embodied manipulation: Providing robots with grounded, multi-step visual+language plans — an explicit motivation cited in the paper.

  - Creative design: Producing multi-panel mood boards, travel itineraries, or product concept decks with coherent visual–textual flow.

  - Enhanced single-image reasoning: The critic's correction loop improves performance on reasoning-based image generation benchmarks such as WISE and RISE, making it useful even outside pure interleaved settings.

## Sources

  - [InterleaveThinker: Reinforcing Agentic Interleaved Generation (arXiv 2606.13679)](http://arxiv.org/abs/2606.13679v1)

  - [Interleaving Reasoning for Better Text-to-Image Generation (IRG, arXiv 2509.06945)](https://arxiv.org/html/2509.06945v1)

  - [Wan-Weaver: Interleaved Multi-modal Generation via Decoupled Training (arXiv 2603.25706)](https://arxiv.org/pdf/2603.25706)

  - [GRPO Training for Generative Model Alignment – Emergent Mind](https://www.emergentmind.com/topics/grpo-training)

  - [OpenING: A Comprehensive Benchmark for Judging Open-ended Interleaved Image-Text Generation (arXiv 2411.18499)](https://arxiv.org/html/2411.18499)

  - [Uni-MMMU: A Massive Multi-discipline Multimodal Unified Benchmark (arXiv 2510.13759)](https://arxiv.org/pdf/2510.13759)

  - [InterleaveThinker – Official GitHub Repository](https://github.com/zhengdian1/InterleaveThinker)

---
*Pre-computed research digest — AI-generated, web-grounded and cited above. Verify against the linked source before relying on it.*