MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion

Abstract

This paper introduces MVDiffusion, a simple yet effective method for generating consistent multi-view images from text prompts given pixel-to-pixel correspondences (e.g., perspective crops from a panorama or multi-view images given depth maps and poses). Unlike prior methods that rely on iterative image warping and inpainting, MVDiffusion simultaneously generates all images with a global awareness, effectively addressing the prevalent error accumulation issue. At its core, MVDiffusion processes perspective images in parallel with a pre-trained text-to-image diffusion model, while integrating novel correspondence-aware attention layers to facilitate cross-view interactions. For panorama generation, although trained on only 10k panoramas, MVDiffusion is able to generate high-resolution photorealistic images for arbitrary texts or extrapolate one perspective image to a 360-degree view. For multi-view depth-to-image generation, MVDiffusion demonstrates state-of-the-art performance for texturing a scene mesh.
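To make the notion of pixel-to-pixel correspondence between perspective crops of a panorama more concrete, here is a minimal sketch. It is not the authors' implementation. It assumes pinhole cameras that share one optical center and differ only by a yaw angle, and the function names (`yaw_rotation`, `correspond`), the 512x512 image size, and the 90-degree field of view are all illustrative assumptions rather than values taken from this page.

```python
import math


def yaw_rotation(deg):
    """Camera-to-world rotation for a camera turned `deg` degrees about the vertical (y) axis.

    Convention (an assumption of this sketch): x right, y down, z forward.
    """
    t = math.radians(deg)
    c, s = math.cos(t), math.sin(t)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def matvec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[r][k] * v[k] for k in range(3)) for r in range(3)]


def transpose(m):
    """Transpose a 3x3 matrix (the inverse, for a rotation)."""
    return [list(col) for col in zip(*m)]


def correspond(u, v, yaw_src, yaw_dst, width=512, height=512, fov_deg=90.0):
    """Return the pixel in the destination view that sees the same ray as (u, v)
    in the source view, or None if that ray falls outside the destination view."""
    # Pinhole intrinsics derived from the horizontal field of view.
    f = (width / 2) / math.tan(math.radians(fov_deg) / 2)
    cx, cy = width / 2, height / 2

    # Back-project the source pixel to a ray, rotate it into the world frame,
    # then into the destination camera frame.
    ray_src = [(u - cx) / f, (v - cy) / f, 1.0]
    ray_world = matvec(yaw_rotation(yaw_src), ray_src)
    x, y, z = matvec(transpose(yaw_rotation(yaw_dst)), ray_world)

    if z <= 1e-9:  # ray points away from (or parallel to) the destination image plane
        return None
    u_dst, v_dst = f * x / z + cx, f * y / z + cy
    if 0 <= u_dst < width and 0 <= v_dst < height:
        return (u_dst, v_dst)
    return None
```

Under these assumptions, `correspond(384, 256, 0, 45)` maps a pixel right of center in the first view to roughly (170.7, 256.0) in a view turned 45 degrees to the right, while `correspond(256, 256, 0, 180)` returns `None` because the two views face opposite directions.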

Given text descriptions, MVDiffusion generates holistically consistent multi-view images with high resolution and rich content, which benefits practical tasks such as panorama generation and multi-view depth-to-image generation.

Results showcase

We show the capabilities of MVDiffusion on two challenging multi-view image generation tasks: 1) panorama generation and 2) multi-view depth-to-image generation.

Task 1: Panorama generation


Example results of panorama generation from text prompts using MVDiffusion. Check out our online demo to generate panorama images using your own descriptions.

Task 2: Multi-view depth-to-image generation

Given a sequence of depth maps from a raw mesh, MVDiffusion can generate a sequence of RGB images while preserving the underlying geometry and maintaining multi-view consistency. The generation results can be further exported to a textured mesh. Check out more results in the gallery page.
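When depth maps and camera poses are available, a pixel in one view can be related to a pixel in another by unprojecting it to 3D and reprojecting it. The sketch below illustrates that standard pinhole-camera computation only; it is not taken from the MVDiffusion code. The function name `reproject`, the shared intrinsics tuple `(f, cx, cy)`, and the relative pose arguments `R_rel` and `t_rel` are hypothetical names chosen for this example.

```python
def reproject(u, v, depth, intrinsics, R_rel, t_rel):
    """Map pixel (u, v) with known depth in a source view into a target view.

    intrinsics: (f, cx, cy), assumed identical for both views.
    R_rel, t_rel: rotation (3x3 nested list) and translation (3-list) that take
                  source-camera coordinates to target-camera coordinates.
    Returns (u_dst, v_dst, z_dst), or None if the point is behind the target camera.
    """
    f, cx, cy = intrinsics

    # Unproject the pixel to a 3D point in the source camera frame.
    p_src = [(u - cx) / f * depth, (v - cy) / f * depth, depth]

    # Rigidly transform the point into the target camera frame.
    x, y, z = [
        sum(R_rel[r][k] * p_src[k] for k in range(3)) + t_rel[r]
        for r in range(3)
    ]
    if z <= 0:
        return None

    # Project with the same pinhole intrinsics; z is returned so a caller can
    # compare it against the target depth map to detect occlusion.
    return (f * x / z + cx, f * y / z + cy, z)
```

Returning the reprojected depth is a design choice of this sketch: comparing it with the depth stored at the target pixel is a common way to decide whether the correspondence is visible or occluded.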

Mesh w/o texture

Input depth | Pred texture

Prompt: An office with a computer desk, a bookcase, a couch, chairs and a trash can. Two monitors and a keyboard are on the desk. Couches is sitting next to the desk.

Mesh w/o texture

Input depth | Pred texture

Prompt: A living room with multiple couches and a coffee table. A wooden book shelf filled with lots of books next to a door. A white refrigerator sitting next to a wooden bench.

Mesh w/o texture

Input depth | Pred texture

Prompt: A bed with a blue comforter and a vase with purple flowers. Two backpacks are sitting on the floor next to the bed. A window with a curtain is next to the bed.

Mesh w/o texture

Input depth | Pred texture

Prompt: A white stove top oven and a white refrigerator freezer sitting inside of a kitchen. The kitchen with a sink, dishwasher, and a microwave. A wooden table and chairs in the kitchen.

Citation


@article{Tang2023mvdiffusion,
  author  = {Tang, Shitao and Zhang, Fuyang and Chen, Jiacheng and Wang, Peng and Furukawa, Yasutaka},
  title   = {MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion},
  journal = {arXiv},
  year    = {2023},
}

Acknowledgements

This research is partially supported by NSERC Discovery Grants with Accelerator Supplements and DND/NSERC Discovery Grant Supplement, NSERC Alliance Grants, and John R. Evans Leaders Fund (JELF). We thank the Digital Research Alliance of Canada and BC DRI Group for providing computational resources.

MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion

NeurIPS 2023 (spotlight)

Shitao Tang*¹

Fuyang Zhang*¹

Jiacheng Chen¹

Peng Wang²

Yasutaka Furukawa¹

¹Simon Fraser University

²Bytedance
