STAG4D: Spatial-Temporal Anchored Generative 4D Gaussians

Recent progress in pre-trained diffusion models and 3D generation has spurred interest in 4D content creation. However, achieving high-fidelity 4D generation with spatial-temporal consistency remains a challenge. In this work, we propose STAG4D, a novel framework that combines pre-trained diffusion models with dynamic 3D Gaussian splatting for high-fidelity 4D generation. Drawing inspiration from 3D generation techniques, we utilize a multi-view diffusion model to initialize multi-view images anchored on the input video frames, where the video can be either real-world captured or generated by a video diffusion model. To ensure the temporal consistency of the multi-view sequence initialization, we introduce a simple yet effective fusion strategy that leverages the first frame as a temporal anchor in the self-attention computation. With the nearly consistent multi-view sequences, we then apply score distillation sampling to optimize the 4D Gaussian point cloud. The 4D Gaussian splatting is specially crafted for the generation task, with an adaptive densification strategy proposed to mitigate unstable Gaussian gradients for robust optimization. Notably, the proposed pipeline does not require any pre-training or fine-tuning of diffusion networks, offering a more accessible and practical solution for the 4D generation task. Extensive experiments demonstrate that our method outperforms prior 4D generation works in rendering quality, spatial-temporal consistency, and generation robustness, setting a new state of the art for 4D generation from diverse inputs, including text, image, and video.
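
The abstract does not spell out how the first frame serves as a temporal anchor; one common way to realize such a fusion strategy is to concatenate the anchor frame's keys and values into each frame's self-attention so that every frame attends to the anchor as well as to itself. The PyTorch sketch below illustrates that idea under this assumption; the function name and tensor layout are illustrative and not taken from the authors' implementation.

```python
import torch
import torch.nn.functional as F

def anchored_self_attention(q, k, v, k_anchor, v_anchor):
    """Hypothetical sketch of temporal-anchor fusion in self-attention.

    q, k, v:            (batch, heads, tokens, dim) for the current frame
    k_anchor, v_anchor: (batch, heads, tokens, dim) from the first (anchor) frame

    The current frame attends jointly to its own tokens and to the anchor
    frame's tokens, which encourages temporally consistent multi-view outputs
    across the generated sequence.
    """
    # Extend the key/value sets with the anchor frame's keys and values.
    k_fused = torch.cat([k, k_anchor], dim=2)
    v_fused = torch.cat([v, v_anchor], dim=2)
    # Standard scaled dot-product attention over the fused key/value set.
    return F.scaled_dot_product_attention(q, k_fused, v_fused)
```

Because the anchor tokens only enlarge the key/value set, this kind of fusion requires no retraining of the diffusion network, which is consistent with the abstract's claim that no pre-training or fine-tuning is needed.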