Modality Forcing for Scalable Spatial Generation

Summary

Modern text-to-image (T2I) models implicitly learn a rich model of the physical world (perspective, relative scale, occlusion, lighting) simply by training on billions of captioned images. Modality Forcing is a lightweight post-training recipe that unlocks that spatial knowledge for depth estimation, without requiring expensive dense depth annotations or elaborate training pipelines. By extending a Diffusion Transformer (DiT) to generate photorealistic images and their corresponding depth maps together, the method turns a generative model into a scalable geometric reasoner.

The core problem it addresses is cost: prior approaches that adapt T2I models for depth all demand dense depth supervision and multi-stage training recipes, which makes them expensive to scale. Modality Forcing sidesteps both constraints. It trains on sparse, real-world LiDAR sweeps and folds depth generation into the existing T2I framework.
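To make "training on sparse depth" concrete, the sketch below shows one common way to supervise with LiDAR: compute the loss only at pixels that have a return. This is an illustrative assumption, not the paper's loss; the summary does not say how Modality Forcing handles missing pixels, and the function name `masked_l1_loss` and the toy values are hypothetical.

```python
def masked_l1_loss(pred, target, valid):
    """Mean absolute error over pixels that have a LiDAR return.

    pred, target: flat lists of depth values (e.g. in metres).
    valid: flat list of booleans, True where `target` holds a measurement.
    """
    errors = [abs(p - t) for p, t, v in zip(pred, target, valid) if v]
    if not errors:
        return 0.0  # no supervised pixels in this sample
    return sum(errors) / len(errors)


# Toy 2x3 "image" flattened row by row; 0.0 marks pixels with no return.
pred = [2.0, 5.0, 9.0, 4.0, 7.5, 1.0]
target = [2.5, 0.0, 10.0, 0.0, 0.0, 1.0]
valid = [t > 0.0 for t in target]

loss = masked_l1_loss(pred, target, valid)
print(loss)
```

Only three of the six pixels contribute here, so predictions at unmeasured pixels are neither rewarded nor penalized. That is what lets sparse sweeps stand in for dense ground truth in this kind of setup.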

How It Works

Why It Matters

T2I models contain rich spatial priors: synthesizing photorealistic, cluttered scenes requires understanding geometry, including perspective and relative scale. Modality Forcing provides strong empirical evidence that image generation is a scalable pre-training objective for spatial perception, an insight with broad implications for how the community should think about collecting supervision for 3D tasks.

Concretely, the strongest model is competitive with state-of-the-art monocular depth estimators and lowers AbsRel (absolute relative error, where lower is better) by 57% compared with existing joint image-depth generative models. Unlike discriminative depth networks that need carefully curated dense ground truth, Modality Forcing inherits its geometric understanding "for free" from internet-scale image pre-training.
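For readers unfamiliar with the metric, AbsRel is the mean of |predicted depth - true depth| / true depth over pixels with valid ground truth. The sketch below computes it and shows how a relative reduction is derived from two scores. The depth values are made up for illustration and are not results from the paper; the helper names are hypothetical.

```python
def abs_rel(pred, gt):
    """Absolute relative error: mean of |pred - gt| / gt over pixels with gt > 0."""
    terms = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0]
    if not terms:
        raise ValueError("no valid ground-truth pixels")
    return sum(terms) / len(terms)


def relative_reduction(baseline, new):
    """Fractional drop of `new` relative to `baseline` (0.57 means 57% lower)."""
    return (baseline - new) / baseline


# Toy data: 0.0 in `gt` means "no measurement" and is skipped.
gt = [2.0, 4.0, 0.0, 10.0]
baseline_pred = [2.4, 3.2, 5.0, 12.0]
improved_pred = [2.1, 3.8, 5.0, 10.5]

baseline_err = abs_rel(baseline_pred, gt)   # every valid pixel is off by 20%
improved_err = abs_rel(improved_pred, gt)   # every valid pixel is off by 5%
print(relative_reduction(baseline_err, improved_err))
```

Because each error is divided by the true depth, a 0.5 m miss at 10 m counts the same as a 0.1 m miss at 2 m, which keeps the metric comparable across near and far scenes.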

Related Work

The most direct predecessor is Marigold. Marigold repurposes the generative prior of text-to-image latent diffusion models for monocular depth estimation, fine-tuning Stable Diffusion with synthetic data. It demonstrated that generative priors transfer to depth, but relies on dense synthetic supervision and a U-Net backbone that doesn't scale as cleanly as DiTs.

GeoWizard takes a related approach. Given an input image, it jointly generates a paired depth map and surface normal map with a single diffusion model, relying on a cross-domain geometry switcher to enhance geometric consistency.

JointDiT is the closest architectural cousin: a diffusion transformer that models the joint distribution of RGB and depth, enabling joint generation, depth estimation, and depth-conditioned image generation within a single unified model. Modality Forcing distinguishes itself by training on sparse real-world depth, using per-modality noise levels (rather than a shared noise process), and demonstrating a cleaner scaling story across model sizes.
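The difference between per-modality and shared noise levels can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a linear blend between clean latent and Gaussian noise (a rectified-flow-style schedule), uniform sampling of noise levels, and tiny list-valued "latents". All names are hypothetical.

```python
import random


def add_noise(clean, t, rng):
    """Blend a clean latent with Gaussian noise: t=0 is clean, t=1 is pure noise.

    The linear blend is an assumed schedule, chosen for simplicity.
    """
    return [(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in clean]


def sample_noise_levels(rng, shared=False):
    """Return (t_image, t_depth), each in [0, 1)."""
    t_image = rng.random()
    # Shared process: depth reuses the image level. Per-modality: its own draw.
    t_depth = t_image if shared else rng.random()
    return t_image, t_depth


rng = random.Random(0)
image_latent = [0.3, -1.2, 0.8]
depth_latent = [1.5, 0.1, -0.4]

# Training-style draw with an independent noise level per modality.
t_image, t_depth = sample_noise_levels(rng)
noisy_image = add_noise(image_latent, t_image, rng)
noisy_depth = add_noise(depth_latent, t_depth, rng)

# One corner of the (t_image, t_depth) square: a clean image and fully
# noised depth, which corresponds to estimating depth from a given image.
cond_image = add_noise(image_latent, 0.0, rng)
start_depth = add_noise(depth_latent, 1.0, rng)
```

With a shared level, the model only ever sees both modalities equally corrupted. Independent levels also expose it to pairs such as "clean image, noisy depth", which is plausibly what lets a single network serve both joint generation and image-conditioned depth estimation.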

On the discriminative side, Depth Anything V2 and DepthPro remain strong baselines. Depth Anything V2 is a monocular depth estimation model that produces robust, fine-grained depth maps across diverse and complex scenes. DepthPro delivers true metric depth estimation without requiring camera intrinsics. Modality Forcing is notable for closing the gap with these specialized discriminative systems using a purely generative training objective.

Implementations

The authors maintain a project page with results and a citation. As of this writing, no official open-source code repository has been released: searches returned no matching GitHub repository, and the project page does not yet link to code. Readers should check the project page and the arXiv listing for future releases.

For related open implementations: Marigold is integrated into 🤗 Diffusers, and JointDiT has a project page with details on its approach.

Applications

Sources

  1. Modality Forcing for Scalable Spatial Generation (arXiv)
  2. Modality Forcing — Project Page
  3. Best Depth Estimation Models: Depth Anything V2 & More (Roboflow)
  4. GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation (arXiv)
  5. GeoWizard Literature Review (Moonlight)
  6. Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
  7. Marigold in 🤗 Diffusers (Hugging Face)
  8. JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers
  9. Depth-Conditioned Generation Capability Survey (EmergentMind)