Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion

Anonymous.

A novel framework for high-quality text-to-3D generation in 50 seconds.

a blue and white vase with a lid

a brown paper bag

a chair that looks like a tree

a pair of shorts

a penguin

a chair that looks like a zebra

a women's fashion bag

a designer dress

a dog

a donut with pink icing

a fast car

a large metal bell on a red wooden stand

a large tree stump

a pair of sunglasses

a rusty padlock

a palm tree with a brown trunk and green leaves

a stone water well with a wooden shed

a traffic cone

a white and black police SUV car

a wooden chair with a red cushion

a wooden chest with golden trim

a wooden church with a cross on top

a cushion of rest

a large banyan tree with extended vines

a small round wooden table

High-quality meshes generated by our method.

The inference code can be found in our supplementary material.

Abstract

We present Dual3D, a novel text-to-3D generation framework that generates high-quality 3D assets from text in only 1 minute. The key component is a dual-mode multi-view latent diffusion model. Given noisy multi-view latents, the 2D mode efficiently denoises them with a single latent denoising network, while the 3D mode generates a tri-plane neural surface for consistent rendering-based denoising. Most modules for both modes are tuned from a pre-trained text-to-image latent diffusion model, which avoids the expense of training from scratch. To overcome the high rendering cost during inference, we propose a dual-mode toggling inference strategy that uses the 3D mode for only 1/10 of the denoising steps, successfully generating a 3D asset in just 10 seconds without sacrificing quality. The texture of the 3D asset can be further enhanced in a short time by our efficient texture refinement process. Extensive experiments demonstrate that our method delivers state-of-the-art performance while significantly reducing generation time.
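The dual-mode toggling idea can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `toggling_schedule` and `run_toggled_denoising`, the choice to place the 3D-mode step at the end of each block of steps, the rule that the last step always uses the 3D mode, and the 50-step default are all assumptions made for illustration. The `denoise_2d` and `denoise_3d` arguments are hypothetical placeholders for the two modes of the model.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def toggling_schedule(num_steps: int, every: int = 10) -> List[str]:
    """Return the mode ('2d' or '3d') used at each denoising step.

    Assumption: one step in every `every` steps uses the costly 3D mode,
    and the final step is always 3D so the result is a 3D representation.
    """
    if num_steps <= 0 or every <= 0:
        raise ValueError("num_steps and every must be positive")
    modes = ["3d" if (i + 1) % every == 0 else "2d" for i in range(num_steps)]
    modes[-1] = "3d"  # assumed: finish in 3D mode
    return modes


def run_toggled_denoising(
    latents: T,
    denoise_2d: Callable[[T, int], T],  # hypothetical fast 2D-mode step
    denoise_3d: Callable[[T, int], T],  # hypothetical rendering-based 3D-mode step
    num_steps: int = 50,
    every: int = 10,
) -> T:
    """Apply the per-step mode schedule to the multi-view latents."""
    for step, mode in enumerate(toggling_schedule(num_steps, every)):
        step_fn = denoise_3d if mode == "3d" else denoise_2d
        latents = step_fn(latents, step)
    return latents
```

With `num_steps=50` and `every=10`, this schedule calls the 3D mode on 5 of the 50 steps, matching the 1/10 ratio stated in the abstract; the remaining 45 steps use the cheaper 2D mode.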

Framework:

Compositional Generation

The videos are rendered with Blender, and all visible 3D assets are generated by our framework.

Diverse Generation

a hamburger

a house

Fine-grained Generation

a supercar

a truck car

a car made of sushi

a SUV car

a bar chair

a esports chair

a sofa chair

a traditional chinese chair

BibTeX


@article{Dual3D2024,
  author  = {Anonymous},
  title   = {Dual3D: Efficient and Consistent Text-to-3D Generation with Dual-mode Multi-view Latent Diffusion},
  journal = {Under Review},
  year    = {2024},
}

More results

a bald head of a man

a bag of crisps

a black fish with blue eyes

a bottle of Coke

a brown couch

a campfire

a cartoon tiger wearing a red shirt

a chair that looks like fruit

a coffee house

a couple of books

a tank with a gun on top

a dog with a backpack on its back

a small trash can

a high-rise building with multiple floors

a man wearing a blue shirt and gray pants

a office building

a plate of juicy cooked steak

a white switch

a red and black toy horse

a women's fashion bag

a rusty old car

a small Christmas tree made of Legos

a stone fountain with a spout

a tall brown lighthouse with a light on top

a white bust of a man with a stern expression

a white helmet with a strap

a wooden chair with a red cushion

incandescent light bulb

coffee maker

cutting board

a wooden nightstand

fresh carrots