Free4D: Tuning-free 4D Scene Generation
with Spatial-Temporal Consistency

Tianqi Liu^1,2^* Zihao Huang^1,2^* Zhaoxi Chen² Guangcong Wang³ Shoukang Hu²
Liao Shen^1,2 Huiqiang Sun^1,2 Zhiguo Cao¹ Wei Li²^\(\dagger\) Ziwei Liu²^\(\dagger\)

¹Huazhong University of Science and Technology ²Nanyang Technological University ³Great Bay University
^*Equal Contribution ^\(\dagger\)Corresponding Authors

arXiv

YouTube

Code

Cite

Abstract

We present Free4D, a novel tuning-free framework for 4D scene generation from a single image. Existing methods either focus on object-level generation, making scene-level generation infeasible, or rely on large-scale multi-view video datasets for expensive training, with limited generalization ability due to the scarcity of 4D scene data. In contrast, our key insight is to distill pre-trained foundation models for consistent 4D scene representation, which offers promising advantages such as efficiency and generalizability. 1) To achieve this, we first animate the input image using image-to-video diffusion models followed by 4D geometric structure initialization. 2) To turn this coarse structure into spatial-temporal consistent multiview videos, we design an adaptive guidance mechanism with a point-guided denoising strategy for spatial consistency and a novel latent replacement strategy for temporal coherence. 3) To lift these generated observations into consistent 4D representation, we propose a modulation-based refinement to mitigate inconsistencies while fully leveraging the generated information. The resulting 4D representation enables real-time, controllable spatial-temporal rendering, marking a significant advancement in single-image-based 4D scene generation.

Method

Given an input image or text prompt, we first generate a dynamic video \( \mathcal{V} = \{I(t,1)\}_{t=1}^{T} \) using an off-the-shelf video generation model.
Then, we employ MonST3R with a progressive static point cloud aggregation strategy for dynamic reconstruction, obtaining a 4D geometric structure.
Next, guided by this structure, we render a coarse multi-view video \( \mathcal{S}^{\prime} = \{\{I^{\prime}(t,k)\}_{t=1}^{T}\}_{k=1}^{K} \) along a predefined camera trajectory and refine it into \( \mathcal{S} = \{\{I(t,k)\}_{t=1}^{T}\}_{k=1}^{K} \) using ViewCrafter. To ensure spatial-temporal consistency, we introduce Adaptive Classifier-Free Guidance (CFG) and Point Cloud Guided Denoising for spatial coherence, along with Reference Latent Replacement for temporal coherence.
Finally, we propose an efficient training strategy with a Modulation-Based Refinement to lift the generated multi-view video \( \mathcal{S} \) into a consistent 4D representation \( \mathcal{R} \).

Related Works

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models
Dream-in-4D: A Unified Approach for Text- and Image-guided 4D Scene Generation
4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling
GenXD: Generating Any 3D and 4D Scenes
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion
CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models
DreamGaussian4D: Generative 4D Gaussian Splatting
L4GM: Large 4D Gaussian Reconstruction Model

Free4D: Tuning-free 4D Scene Generation
with Spatial-Temporal Consistency

TL;DR

Free4D is a tuning-free framework for 4D scene generation from a single image or text.

4D Interactive Viewer

Demo

Overview Video

Abstract

Method

Text-to-4D Comparisons

4Real (Left) vs. Ours (Right)

Dream-in-4D (Left) vs. Ours (Right)

4Dfy (Left) vs. Ours (Right)

Image-to-4D Comparisons

GenXD^* (Left) vs. Ours (Right)

DimensionX (Left) vs. Ours (Right)

Animate124 (Left) vs. Ours (Right)

More Results

Citation

Related Works

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Dream-in-4D: A Unified Approach for Text- and Image-guided 4D Scene Generation

4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling

GenXD: Generating Any 3D and 4D Scenes

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

DreamGaussian4D: Generative 4D Gaussian Splatting

L4GM: Large 4D Gaussian Reconstruction Model

Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency

TL;DR

Free4D is a tuning-free framework for 4D scene generation from a single image or text.

4D Interactive Viewer

Demo

Overview Video

Abstract

Method

Text-to-4D Comparisons

4Real (Left) vs. Ours (Right)

Dream-in-4D (Left) vs. Ours (Right)

4Dfy (Left) vs. Ours (Right)

Image-to-4D Comparisons

GenXD* (Left) vs. Ours (Right)

DimensionX (Left) vs. Ours (Right)

Animate124 (Left) vs. Ours (Right)

More Results

Citation

Related Works

4Real: Towards Photorealistic 4D Scene Generation via Video Diffusion Models

Dream-in-4D: A Unified Approach for Text- and Image-guided 4D Scene Generation

4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling

GenXD: Generating Any 3D and 4D Scenes

DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion

CAT4D: Create Anything in 4D with Multi-View Video Diffusion Models

DreamGaussian4D: Generative 4D Gaussian Splatting

L4GM: Large 4D Gaussian Reconstruction Model

Free4D: Tuning-free 4D Scene Generation
with Spatial-Temporal Consistency

GenXD^* (Left) vs. Ours (Right)