Modality Forcing for Scalable Spatial Generation


Summary

Modern text-to-image (T2I) models implicitly learn a rich model of the physical world — perspective, relative scale, occlusion, lighting — simply by training on billions of images paired with captions. Modality Forcing is a lightweight post-training recipe that unlocks that spatial knowledge for depth estimation, without requiring expensive dense depth annotations or elaborate training pipelines. By extending a Diffusion Transformer (DiT) to simultaneously generate photorealistic images and their corresponding depth maps, the method turns a generative model into a powerful, scalable geometric reasoner.

The core problem it addresses: prior approaches that adapt T2I models for depth all demand dense depth supervision and multi-stage training recipes, making them costly to scale. Modality Forcing sidesteps both constraints — it trains on sparse, real-world LiDAR sweeps and folds depth generation cleanly into the existing T2I framework.

How It Works

Separate noise levels per modality. Instead of adding the same amount of diffusion noise to both the image and the depth map, Modality Forcing assigns each modality its own independent noise level. At inference, one modality can be held clean so that it acts as the condition while the other is generated from pure noise, or both can be generated jointly. This single design choice covers every generation mode: image→depth, depth→image, and unconditional joint generation.
Per-modality decoders. Because real-world depth sensors (e.g., LiDAR) produce sparse, irregular point clouds rather than dense pixel-aligned maps, the authors attach a separate decoder head to each modality. This decoupling lets the model train on naturally sparse depth without synthetic densification or interpolation, improving generalization to in-the-wild scenes.
Scalable DiT backbone. The recipe is applied as a post-training stage on top of T2I DiTs ranging from 370 M to 3.3 B parameters. Crucially, the authors show that larger models trained on more image data consistently produce more accurate depth — depth quality scales with image generation capacity, not with depth data volume.
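
The first two ideas above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the linear noising rule, the uniform sampling of noise levels, the masked squared-error loss, and every function name here are assumptions chosen for illustration, since the summary does not specify these details.

```python
import numpy as np

def add_noise(clean, t, rng):
    """Blend clean data with Gaussian noise: t=0 keeps it clean, t=1 is pure noise.
    (Assumed linear interpolation; the actual noising rule may differ.)"""
    noise = rng.standard_normal(clean.shape)
    return (1.0 - t) * clean + t * noise, noise

def training_noise_levels(rng):
    """Draw one noise level per modality, independently of each other."""
    t_image, t_depth = rng.uniform(0.0, 1.0, size=2)
    return float(t_image), float(t_depth)

def inference_start_levels(mode):
    """Starting (t_image, t_depth) for each generation mode (hypothetical names)."""
    levels = {
        "image_to_depth": (0.0, 1.0),  # image is given, depth starts as noise
        "depth_to_image": (1.0, 0.0),  # depth is given, image starts as noise
        "joint": (1.0, 1.0),           # both start as noise
    }
    return levels[mode]

def masked_depth_loss(pred, target, valid):
    """Mean squared error over pixels that have a depth measurement.
    Pixels without a LiDAR return contribute nothing (assumed loss form)."""
    valid = np.asarray(valid, dtype=bool)
    if not valid.any():
        return 0.0
    return float(np.mean((pred[valid] - target[valid]) ** 2))

rng = np.random.default_rng(0)
image = rng.uniform(size=(8, 8, 3))           # toy stand-in for an image
depth = rng.uniform(1.0, 50.0, size=(8, 8))   # toy stand-in for a depth map
valid = rng.uniform(size=(8, 8)) < 0.1        # roughly 10% of pixels are measured

# Training step: each modality gets its own noise level.
t_image, t_depth = training_noise_levels(rng)
noisy_image, _ = add_noise(image, t_image, rng)
noisy_depth, _ = add_noise(depth, t_depth, rng)
print("noise levels:", round(t_image, 3), round(t_depth, 3))
print("valid depth pixels:", int(valid.sum()))
```

In this sketch, image→depth inference would keep the image at noise level 0 for the whole sampling loop while the depth level is stepped from 1 down to 0; the other modes swap or share those roles.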

Why It Matters

T2I models contain rich spatial priors; synthesizing photorealistic, cluttered scenes requires understanding geometry, including perspective and relative scale. Modality Forcing provides strong empirical evidence that image generation is a scalable pre-training objective for spatial perception — an insight with broad implications for how the community should think about collecting supervision for 3D tasks.

Concretely, the strongest model is competitive with state-of-the-art monocular depth estimators and reduces absolute relative error (AbsRel) by 57% relative to existing joint image-depth generative models. Unlike discriminative depth networks that need carefully curated dense ground truth, Modality Forcing inherits its geometric understanding "for free" from internet-scale image pre-training.
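
For readers unfamiliar with the metric, AbsRel is the mean of |predicted depth − true depth| / true depth over pixels with valid ground truth. The sketch below computes it and shows what a 57% relative reduction means arithmetically. The function names are hypothetical, the numbers are illustrative and not taken from the paper, and any scale or shift alignment that an evaluation protocol might apply beforehand is omitted.

```python
import numpy as np

def abs_rel(pred, gt, valid=None):
    """Absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    has_gt = gt > 0  # zero depth is treated as "no measurement"
    valid = has_gt if valid is None else np.asarray(valid, dtype=bool) & has_gt
    if not valid.any():
        raise ValueError("no valid ground-truth pixels")
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers `baseline`."""
    return (baseline - improved) / baseline

# Illustrative values only: the third pixel has no ground truth and is skipped.
gt = np.array([2.0, 4.0, 0.0])
pred = np.array([2.2, 3.0, 1.0])
print(abs_rel(pred, gt))

# A hypothetical baseline AbsRel of 0.100 dropping to 0.043 is a 57% reduction.
print(relative_reduction(0.100, 0.043))
```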

Related Work

The most direct predecessor is Marigold. Marigold repurposes the generative prior of text-to-image latent diffusion models for monocular depth estimation, fine-tuning Stable Diffusion with synthetic data. It demonstrated that generative priors transfer to depth, but relies on dense synthetic supervision and a U-Net backbone that doesn't scale as cleanly as DiTs.

GeoWizard takes a related approach. Given an input image, GeoWizard jointly generates a paired depth map and surface normal map with a single diffusion model, relying on a cross-domain geometry switcher to enhance geometric consistency.

JointDiT is the closest architectural cousin. JointDiT is a diffusion transformer that models the joint distribution of RGB and depth, enabling joint generation, depth estimation, and depth-conditioned image generation within a single unified model. Modality Forcing distinguishes itself by training on sparse real-world depth, using per-modality noise levels (rather than a shared noise process), and demonstrating a cleaner scaling story across model sizes.

On the discriminative side, Depth Anything V2 and DepthPro remain strong baselines. Depth Anything V2 is a monocular depth estimation model that generates robust, fine-grained depth maps, capturing intricate details and performing reliably across diverse and complex scenes. DepthPro delivers true metric depth estimation without requiring camera intrinsics. Modality Forcing is notable for closing the gap with these specialized discriminative systems using a purely generative training objective.

Implementations

The authors maintain a project page with results and a citation. As of this writing, no official open-source code repository has been released, and the project page does not yet link to code. Readers should check the project page and the arXiv listing for future releases.

For related open implementations: Marigold is integrated into 🤗 Diffusers, and JointDiT has a project page with details on its approach.

Applications

Robotics and manipulation: Fast, generalizable depth from a single camera enables robots to understand cluttered workspaces without dedicated depth sensors.
Autonomous driving: Sparse LiDAR supervision aligns naturally with how self-driving systems already capture depth data; Modality Forcing can densify and complete those sparse maps.
3D reconstruction and novel-view synthesis: Joint image-depth generation can bootstrap 3D scene reconstructions from text prompts or reference images alone.
Augmented and mixed reality: Real-time depth generation from monocular camera feeds can drive occlusion handling and scene-aware object placement.
Synthetic data generation: The joint generation mode can produce geometrically consistent image-depth pairs at scale, providing training data for downstream depth-dependent models.