---
title: "Modality Forcing for Scalable Spatial Generation"
description: "Modality Forcing is a scalable post-training recipe that adapts text-to-image Diffusion Transformers for joint image-depth generation using only sparse real-world depth data, achieving competitive monocular depth estimation that improves with model scale."
type: research-paper-digest
arxiv_id: 2606.13676v1
source_url: http://arxiv.org/abs/2606.13676v1
pdf_url: http://arxiv.org/pdf/2606.13676v1
authors: ["Bardienus Pieter Duisterhof", "Deva Ramanan", "Jeffrey Ichnowski", "Justin Johnson", "Keunhong Park"]
published: 2026-06-11T17:59:45Z
retrieved: 2026-06-14
project_url: https://modality-forcing.github.io/
has_code: false
canonical: https://flawedquote.com/media/research/arxiv-2606-13676v1.html
review_of: "Modality Forcing for Scalable Spatial Generation"
document_kind: third-party-review
affiliation: none; not affiliated with the paper's authors
provenance: ai-generated-digest; web-grounded; verify against source
---

# Modality Forcing for Scalable Spatial Generation

**This is an AI-generated review/digest of a third-party paper by Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski, Justin Johnson, Keunhong Park (http://arxiv.org/abs/2606.13676v1) — not the original paper, and not affiliated with its authors. Treat it as a secondary source and verify against the original.**

> Modality Forcing is a scalable post-training recipe that adapts text-to-image Diffusion Transformers for joint image-depth generation using only sparse real-world depth data, achieving competitive monocular depth estimation that improves with model scale.

## What this answers

- What is this paper about and what problem does it solve?
- What are the concrete results / benchmark numbers?
- How does the method work?
- Is code available? (no official repo found — see Limitations)
- What are the limitations and what is NOT shown?
- What related/prior work does it build on?
- How do I cite this paper? (BibTeX below)

## Key results

- 57% relative reduction in AbsRel vs. existing joint image-depth generative models
- Depth accuracy scales consistently from 370M to 3.3B parameter DiT models
- Competitive with state-of-the-art monocular depth estimators despite sparse-only supervision
- Enables all generation permutations: image→depth, depth→image, and unconditional joint generation

## Method at a glance

- **Problem:** Prior methods adapting T2I models for depth prediction require dense depth supervision and complex multi-stage training pipelines, limiting scalability and real-world applicability.
- **Method:** Post-training a single DiT with separate per-modality noise levels and per-modality decoders, enabling joint or conditional image-depth generation trained on sparse LiDAR depth data.
- **Data:** Sparse real-world LiDAR depth data; T2I pre-training on large-scale internet image datasets; evaluated against standard monocular depth benchmarks (specific benchmark names not disclosed in abstract).
- **Metrics:** Absolute Relative Error (AbsRel) for depth estimation; qualitative joint generation quality.

## Resources

- **Paper:** http://arxiv.org/abs/2606.13676v1
- **PDF:** http://arxiv.org/pdf/2606.13676v1
- **Project:** https://modality-forcing.github.io/
- **Related code (Marigold pipeline in 🤗 Diffusers; not a dataset and not from this paper):** https://huggingface.co/docs/diffusers/en/api/pipelines/marigold

## Limitations

- No official public code or model weights released at time of writing
- Specific benchmark datasets and numerical results beyond AbsRel are not detailed in the abstract
- Requires a pre-trained T2I DiT as a starting point, so compute cost of full pipeline is non-trivial
- Joint generation quality metrics not compared against pure generative baselines in the abstract
- Does not address video or temporal consistency

## Applications

- Monocular depth estimation for robotics and manipulation
- Depth completion from sparse LiDAR in autonomous driving
- 3D scene reconstruction and novel-view synthesis from text prompts
- Augmented/mixed reality occlusion handling
- Synthetic geometrically-consistent image-depth pair generation for downstream training

## Related work

- [Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation (CVPR 2024)](https://marigoldmonodepth.github.io/)
- [GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image](https://arxiv.org/abs/2403.12013)
- [JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers](https://byungki-k.github.io/JointDiT/)
- [Depth Anything V2](https://depth-anything-v2.github.io/)
- [Pixel-Perfect Visual Geometry Estimation](https://arxiv.org/abs/2601.05246)

## Citation

```bibtex
@misc{arxiv_2606_13676,
  title={Modality Forcing for Scalable Spatial Generation},
  author={Bardienus Pieter Duisterhof and Deva Ramanan and Jeffrey Ichnowski and Justin Johnson and Keunhong Park},
  year={2026},
  eprint={2606.13676},
  archivePrefix={arXiv},
  url={http://arxiv.org/abs/2606.13676v1}
}
```

## Optional — full explainer

*Everything above is self-contained; skip this if you are context-limited.*

## Summary

Modern text-to-image (T2I) models implicitly learn a rich model of the physical world — perspective, relative scale, occlusion, lighting — simply by training on billions of images paired with captions. Modality Forcing is a lightweight post-training recipe that unlocks that spatial knowledge for depth estimation, without requiring expensive dense depth annotations or elaborate training pipelines. By extending a Diffusion Transformer (DiT) to simultaneously generate photorealistic images and their corresponding depth maps, the method turns a generative model into a powerful, scalable geometric reasoner.

The core problem it addresses: prior approaches that adapt T2I models for depth all demand dense depth supervision and multi-stage training recipes, making them costly to scale. Modality Forcing sidesteps both constraints — it trains on sparse, real-world LiDAR sweeps and folds depth generation cleanly into the existing T2I framework.

## How It Works

  - Separate noise levels per modality. Instead of adding the same amount of diffusion noise to both the image and the depth map, Modality Forcing assigns each modality its own noise level. At inference, one modality can be supplied clean as conditioning while the other is generated from noise, or both can be generated jointly. This single design choice unlocks all generation permutations: image→depth, depth→image, and unconditional joint generation.

  - Per-modality decoders. Because real-world depth sensors (e.g., LiDAR) produce sparse, irregular point clouds rather than dense pixel-aligned maps, the authors attach separate decoder heads for each modality. This decoupling lets the model train on naturally sparse depth without needing synthetic densification or interpolation, improving generalization to in-the-wild scenes.

  - Scalable DiT backbone. The recipe is applied as a post-training stage on top of T2I DiTs ranging from 370M to 3.3B parameters. The authors report that larger models trained on more image data consistently produce more accurate depth, which suggests that depth quality tracks image-generation capacity rather than the volume of depth data.
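
The separate noise levels and the sparse-depth training signal described above can be illustrated with a toy sketch. This is not the authors' code. It assumes a linear interpolation between data and Gaussian noise, a noise level in [0, 1] per modality, and a masked squared-error loss on depth. The paper's actual noise parameterization and loss may differ, and the names `sample_noise_levels`, `add_noise`, and `masked_depth_loss` are hypothetical.

```python
import random


def sample_noise_levels(task, rng):
    """Return (t_image, t_depth) in [0, 1]; 0 = clean, 1 = pure noise.

    Assumed convention for illustration, not taken from the paper.
    """
    if task == "image_to_depth":   # image is clean conditioning
        return 0.0, rng.random()
    if task == "depth_to_image":   # depth is clean conditioning
        return rng.random(), 0.0
    if task == "joint":            # both modalities noised independently
        return rng.random(), rng.random()
    raise ValueError(f"unknown task: {task}")


def add_noise(values, t, rng):
    """Linear interpolation between data and Gaussian noise (assumed form)."""
    return [(1.0 - t) * v + t * rng.gauss(0.0, 1.0) for v in values]


def masked_depth_loss(pred, target, valid):
    """Mean squared error over pixels that have a depth measurement."""
    errors = [(p - d) ** 2 for p, d, ok in zip(pred, target, valid) if ok]
    return sum(errors) / len(errors) if errors else 0.0


rng = random.Random(0)
image = [0.2, 0.8, 0.5, 0.1]      # toy flattened image
depth = [2.0, 0.0, 5.0, 0.0]      # 0.0 marks "no LiDAR return" in this toy
valid = [d > 0.0 for d in depth]

t_img, t_dep = sample_noise_levels("image_to_depth", rng)
noisy_image = add_noise(image, t_img, rng)   # unchanged because t_img == 0
noisy_depth = add_noise(depth, t_dep, rng)   # input-side handling of missing
                                             # depth is simplified here

# A real model would predict depth from (noisy_image, noisy_depth, t_img, t_dep).
# A fixed stand-in prediction shows that pixels without a measurement are ignored.
prediction = [2.5, 99.0, 5.0, -7.0]
loss = masked_depth_loss(prediction, depth, valid)
print(t_img, round(t_dep, 3), loss)
```

Switching the task string to `"depth_to_image"` or `"joint"` changes only which modality receives noise, which is the sense in which a single model can cover all three generation modes.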

## Why It Matters

T2I models contain rich spatial priors; synthesizing photorealistic, cluttered scenes requires understanding geometry, including perspective and relative scale. Modality Forcing provides strong empirical evidence that image generation is a scalable pre-training objective for spatial perception — an insight with broad implications for how the community should think about collecting supervision for 3D tasks.

Concretely, the strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. Unlike discriminative depth networks that need carefully curated dense ground truth, Modality Forcing inherits most of its geometric understanding from internet-scale image pre-training and needs only sparse LiDAR depth during post-training.
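
To make the headline metric concrete, here is a minimal sketch of how AbsRel is typically computed when ground-truth depth is sparse, and what a 57% relative reduction means arithmetically. The function names and all numeric values are illustrative assumptions. In particular, the baseline and improved scores below are made up and are not results from the paper.

```python
def abs_rel(pred, gt):
    """Absolute relative error: mean of |pred - gt| / gt over valid pixels.

    Pixels with gt <= 0 are treated as missing (no measurement) and skipped.
    """
    ratios = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0.0]
    if not ratios:
        raise ValueError("no valid ground-truth depth")
    return sum(ratios) / len(ratios)


def relative_reduction(baseline, new):
    """Fraction by which `new` is lower than `baseline`."""
    return (baseline - new) / baseline


# Toy example: the second pixel has no ground truth and is ignored.
gt = [2.0, 0.0, 4.0, 10.0]
pred = [2.2, 3.0, 3.6, 10.0]
score = abs_rel(pred, gt)          # roughly 0.067

# Hypothetical scores chosen only to illustrate a 57% relative reduction.
hypothetical_baseline = 0.100
hypothetical_new = 0.043
reduction = relative_reduction(hypothetical_baseline, hypothetical_new)
print(round(score, 3), round(reduction, 2))
```

A relative reduction is a ratio of errors, so the same 57% figure corresponds to different absolute gains depending on how large the baseline error is.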

## Related Work

The most direct predecessor is Marigold, which repurposes the generative prior of text-to-image latent diffusion models for monocular depth estimation by fine-tuning Stable Diffusion on synthetic data. It demonstrated that generative priors transfer to depth, but it relies on dense synthetic supervision and a U-Net backbone rather than a DiT.

GeoWizard takes a related approach: given an input image, it jointly generates a paired depth map and surface normal map with a single diffusion model, relying on a cross-domain geometry switcher to enhance geometric consistency.

JointDiT is the closest architectural cousin: a diffusion transformer that models the joint distribution of RGB and depth, enabling joint generation, depth estimation, and depth-conditioned image generation within a single unified model. Modality Forcing distinguishes itself by training on sparse real-world depth, using per-modality noise levels (rather than a shared noise process), and reporting depth accuracy that improves consistently with model size.

On the discriminative side, Depth Anything V2 and Depth Pro remain strong baselines. Depth Anything V2 is a monocular depth estimation model that produces robust, fine-grained depth maps across diverse and complex scenes. Depth Pro estimates metric depth without requiring camera intrinsics. Modality Forcing is notable for closing the gap with these specialized discriminative systems using a generative training objective.

## Implementations

The authors maintain a [project page](https://modality-forcing.github.io/) with results and a citation. As of this writing, no official code repository has been released: searches returned no matching GitHub repository, and the project page does not link to code. Readers should check the project page and the arXiv listing for future releases.

For related open implementations: Marigold is [integrated into 🤗 Diffusers](https://huggingface.co/docs/diffusers/en/api/pipelines/marigold), and JointDiT has a [project page](https://byungki-k.github.io/JointDiT/) with details on its approach.

## Applications

  - Robotics and manipulation: Generalizable depth from a single camera could let robots understand cluttered workspaces without dedicated depth sensors.

  - Autonomous driving: Sparse LiDAR supervision aligns naturally with how self-driving systems already capture depth data; Modality Forcing could be used to densify and complete those sparse maps.

  - 3D reconstruction and novel-view synthesis: Joint image-depth generation can bootstrap 3D scene reconstructions from text prompts or reference images alone.

  - Augmented and mixed reality: Depth generated from monocular camera feeds could drive occlusion handling and scene-aware object placement, provided inference is fast enough for interactive use.

  - Synthetic data generation: The joint generation mode can produce geometrically consistent image-depth pairs at scale, providing training data for downstream depth-dependent models.

## Sources

  - [Modality Forcing for Scalable Spatial Generation (arXiv)](http://arxiv.org/abs/2606.13676v1)

  - [Modality Forcing — Project Page](https://modality-forcing.github.io/)

  - [Best Depth Estimation Models: Depth Anything V2 & More (Roboflow)](https://blog.roboflow.com/depth-estimation-models/)

  - [GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation (arXiv)](https://arxiv.org/pdf/2403.12013)

  - [GeoWizard Literature Review (Moonlight)](https://www.themoonlight.io/en/review/geowizard-unleashing-the-diffusion-priors-for-3d-geometry-estimation-from-a-single-image)

  - [Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation](https://marigoldmonodepth.github.io/)

  - [Marigold in 🤗 Diffusers (Hugging Face)](https://huggingface.co/docs/diffusers/en/api/pipelines/marigold)

  - [JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers](https://byungki-k.github.io/JointDiT/)

  - [Depth-Conditioned Generation Capability Survey (EmergentMind)](https://www.emergentmind.com/topics/depth-conditioned-generation-capability)

---
*Pre-computed research digest — AI-generated, web-grounded and cited above. Verify against the linked source before relying on it.*
