Dens3R: A Foundation Model for 3D Geometry Prediction

Xianze Fang1*     Jingnan Gao2*     Zhe Wang1     Zhuo Chen2     Xingyu Ren2     Jiangjing Lyu1†     Qiaomu Ren1     Zhonglei Yang1     Xiaokang Yang2     Yichao Yan2‡     Chengfei Lyu1    
1Alibaba Group       2Shanghai Jiao Tong University       

*Equal contribution     †Project leader     ‡Corresponding author

 

arXiv 2025

 


Dens3R is a feed-forward visual foundation model that takes unposed images as input and outputs high-quality 3D pointmaps together with unified geometric dense predictions.
Dens3R also generalizes across input configurations, supporting both multi-view and multi-resolution inputs.
As a versatile backbone, Dens3R achieves robust dense prediction across diverse scenarios and can be easily extended to downstream applications.


Abstract

Recent advances in dense 3D reconstruction have led to significant progress, yet achieving accurate unified geometric prediction remains a major challenge. Most existing methods are limited to predicting a single geometry quantity from input images. However, geometric quantities such as depth, surface normals, and point maps are inherently correlated, and estimating them in isolation often fails to ensure consistency, thereby limiting both accuracy and practical applicability. This motivates us to explore a unified framework that explicitly models the structural coupling among different geometric properties to enable joint regression. In this paper, we present Dens3R, a 3D foundation model designed for joint geometric dense prediction and adaptable to a wide range of downstream tasks. Dens3R adopts a two-stage training framework to progressively build a pointmap representation that is both generalizable and intrinsically invariant. Specifically, we design a lightweight shared encoder-decoder backbone and introduce position-interpolated rotary positional encoding to maintain expressive power while enhancing robustness to high-resolution inputs. By integrating image-pair matching features with intrinsic invariance modeling, Dens3R accurately regresses multiple geometric quantities such as surface normals and depth, achieving consistent geometry perception from single-view to multi-view inputs. Additionally, we propose a post-processing pipeline that supports geometrically consistent multi-view inference. Extensive experiments demonstrate the superior performance of Dens3R across various dense 3D prediction tasks and highlight its potential for broader applications.
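The exact formulation used in Dens3R is not given on this page, so the snippet below is only a minimal PyTorch sketch of the general position-interpolation idea for rotary positional encoding: when the test-time patch sequence is longer than the one seen during training, the patch indices are rescaled back into the training position range before the rotation is applied. All names here (build_rope_freqs, interpolated_positions, rope_rotate) are illustrative placeholders, not the released implementation.

```python
import torch

def build_rope_freqs(dim, theta=10000.0):
    # Standard RoPE inverse frequencies for half of the head dimension.
    return 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))

def interpolated_positions(n_patches, n_train_patches):
    # Rescale patch indices so a longer (higher-resolution) sequence is
    # squeezed back into the position range seen at training time.
    positions = torch.arange(n_patches).float()
    if n_patches > n_train_patches:
        positions = positions * (n_train_patches / n_patches)
    return positions

def rope_rotate(x, positions, freqs):
    # x: (..., seq, dim); rotate feature pairs by position-dependent angles.
    angles = positions[:, None] * freqs[None, :]      # (seq, dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: 1024 patches at test time, 576 patch positions seen during training.
q = torch.randn(1, 8, 1024, 64)                       # (batch, heads, seq, head_dim)
pos = interpolated_positions(1024, n_train_patches=576)
q_rot = rope_rotate(q, pos, build_rope_freqs(64))
```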

High-quality geometric predictions for 2K-resolution inputs. Please zoom in to better observe the fine-grained details.

Video


Overview

This work aims to use a single model to predict diverse geometric quantities from unconstrained images, including 3D pointmaps, depth maps, normal maps, and image-pair matches. To this end, we build a backbone network based on dense visual transformers and design input configurations that accommodate multi-resolution and multi-view requirements. In the first stage, we train the backbone and heads to obtain scale-invariant pointmaps. In the second stage, we fine-tune the backbone on this foundation to obtain intrinsic-invariant pointmaps. Finally, we fine-tune the prediction heads for each downstream task to adapt to different application scenarios.
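For a concrete picture of this staged schedule, the outline below is a hypothetical sketch only: the module and loss names (scale_invariant_loss, intrinsic_invariant_loss, the per-task heads) are placeholders and do not correspond to the released Dens3R code.

```python
import torch

def train_stage(modules, dataloader, loss_fn, epochs, lr=1e-4):
    # Optimize only the modules passed in; everything else stays frozen.
    params = [p for m in modules for p in m.parameters() if p.requires_grad]
    optim = torch.optim.AdamW(params, lr=lr)
    for _ in range(epochs):
        for batch in dataloader:
            loss = loss_fn(batch)
            optim.zero_grad()
            loss.backward()
            optim.step()

# Stage 1: train backbone + pointmap head with a scale-invariant pointmap loss.
#   train_stage([backbone, pointmap_head], loader, scale_invariant_loss, epochs=...)
# Stage 2: fine-tune the backbone toward intrinsic-invariant pointmaps.
#   train_stage([backbone], loader, intrinsic_invariant_loss, epochs=...)
# Stage 3: freeze the backbone and fine-tune per-task heads (normals, depth, matching).
#   train_stage([normal_head], loader, normal_loss, epochs=...)
```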



Results

High-quality Geometric Predictions

Normal Comparison

Depth and Pointmap Comparison

Image-pair Matching

High-Resolution Inference Comparison

BibTeX

@article{dens3r,
      title={Dens3R: A Foundation Model for 3D Geometry Prediction}, 
      author={Xianze Fang and Jingnan Gao and Zhe Wang and Zhuo Chen and Xingyu Ren and Jiangjing Lyu and Qiaomu Ren and Zhonglei Yang and Xiaokang Yang and Yichao Yan and Chengfei Lyu},
      journal={arXiv preprint arXiv:},
      year={2025}
}