UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler

Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepthV2, capable of reconstructing metric 3D scenes solely from single images across domains. Departing from the existing MMDE paradigm, UniDepthV2 directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepthV2 implements a self-promptable camera module predicting a dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles the camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. UniDepthV2 improves on its predecessor UniDepth via a new edge-guided loss, which enhances the localization and sharpness of edges in the metric depth outputs, a revisited, simplified, and more efficient architectural design, and an additional uncertainty-level output which enables downstream tasks requiring confidence. Thorough evaluations on ten depth datasets in a zero-shot regime consistently demonstrate the superior performance and generalization of UniDepthV2. Code and models are available at https://github.com/lpiccinelli-eth/UniDepth
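To make the pseudo-spherical idea concrete, the sketch below shows one way such a disentangled output can be formed: per-pixel azimuth and elevation angles encode the camera, a log-depth map encodes the scene, and the two combine into metric 3D points. This is a minimal illustration assuming a simple pinhole camera; the function names, tensor shapes, and the choice of log-depth along the optical axis are assumptions for this sketch, not the exact UniDepthV2 interface.

```python
import torch

def pixels_to_angles(height: int, width: int, K: torch.Tensor) -> torch.Tensor:
    """Map every pixel to (azimuth, elevation) angles via pinhole intrinsics K (3x3)."""
    v, u = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    # Unproject pixel coordinates onto the normalized image plane (z = 1).
    x = (u - K[0, 2]) / K[0, 0]
    y = (v - K[1, 2]) / K[1, 1]
    azimuth = torch.atan2(x, torch.ones_like(x))           # angle around the vertical axis
    elevation = torch.atan2(y, torch.sqrt(1.0 + x ** 2))   # angle off the horizontal plane
    return torch.stack([azimuth, elevation], dim=0)        # (2, H, W) dense camera rays

def pseudo_spherical_to_points(angles: torch.Tensor, log_depth: torch.Tensor) -> torch.Tensor:
    """Combine the dense camera representation (angles) with log-depth into 3D points."""
    azimuth, elevation = angles[0], angles[1]
    z = torch.exp(log_depth)                               # metric depth along the optical axis
    x = z * torch.tan(azimuth)
    y = z * torch.tan(elevation) / torch.cos(azimuth)      # inverse of the elevation mapping
    return torch.stack([x, y, z], dim=0)                   # (3, H, W) metric 3D points
```

The key property this buys is separability: the angle channels depend only on the camera, and the log-depth channel only on the scene, so the camera module can condition the depth head without the two outputs entangling.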

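The geometric invariance loss can likewise be pictured as a consistency term: camera-prompted depth features computed under two geometric augmentations of the same image should agree once warped into a common frame. The sketch below is a hedged illustration of that idea; the sampling grid, the stop-gradient on one branch, and the MSE penalty are common design choices assumed here, not details confirmed by the abstract.

```python
import torch
import torch.nn.functional as F

def geometric_invariance_loss(
    feats_a: torch.Tensor,       # (B, C, H, W) camera-prompted features from augmented view A
    feats_b: torch.Tensor,       # (B, C, H, W) camera-prompted features from augmented view B
    grid_b_to_a: torch.Tensor,   # (B, H, W, 2) sampling grid mapping view B into view A's frame
) -> torch.Tensor:
    # Warp view-B features into view A's frame so the two are spatially comparable.
    feats_b_in_a = F.grid_sample(feats_b, grid_b_to_a, align_corners=False)
    # Penalize disagreement; detaching one branch treats it as a fixed target.
    return F.mse_loss(feats_a, feats_b_in_a.detach())
```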