MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion

NeurIPS 2023 (spotlight)
arXiv PDF · Project Code · Gallery · Hugging Face Demo

Abstract

This paper introduces MVDiffusion, a simple yet effective method for generating consistent multi-view images from text prompts given pixel-to-pixel correspondences (e.g., perspective crops from a panorama or multi-view images given depth maps and poses). Unlike prior methods that rely on iterative image warping and inpainting, MVDiffusion simultaneously generates all images with a global awareness, effectively addressing the prevalent error accumulation issue. At its core, MVDiffusion processes perspective images in parallel with a pre-trained text-to-image diffusion model, while integrating novel correspondence-aware attention layers to facilitate cross-view interactions. For panorama generation, while trained on only 10k panoramas, MVDiffusion is able to generate high-resolution photorealistic images for arbitrary texts or extrapolate one perspective image to a 360-degree view. For multi-view depth-to-image generation, MVDiffusion demonstrates state-of-the-art performance for texturing a scene mesh.
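
To make the core idea concrete, below is a minimal PyTorch sketch of what a correspondence-aware attention layer can look like, under several simplifying assumptions: correspondences are precomputed as integer (view, pixel) index pairs, each pixel attends to a fixed number K of them, and the zero-initialized output projection (a common trick, assumed here rather than taken from the paper) keeps the pretrained diffusion UNet intact at the start of fine-tuning. The paper's layer is richer, e.g. attending to a local neighborhood around each correspondence, so treat this as an illustration rather than the authors' implementation.

```python
# A minimal sketch of a correspondence-aware attention (CAA) layer: every
# pixel of every view attends only to its precomputed corresponding pixels in
# the other views, and the result is added back residually. The class name,
# the integer (view, pixel) correspondence layout, and the nearest-pixel
# simplification are illustrative assumptions.
import torch
import torch.nn as nn

class CorrespondenceAwareAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)  # zero-init so the layer starts as
        nn.init.zeros_(self.proj.bias)    # an identity (assumed convention)

    def forward(self, feats: torch.Tensor, corr: torch.Tensor) -> torch.Tensor:
        """
        feats: (N, HW, C)    flattened feature maps of the N views
        corr:  (N, HW, K, 2) for each pixel, K correspondences given as
                             integer (source_view_index, source_pixel_index)
        """
        N, HW, C = feats.shape
        K = corr.shape[2]
        q = self.to_q(feats)
        # Gather features at the corresponding pixels across views.
        view_idx = corr[..., 0].reshape(-1)
        pix_idx = corr[..., 1].reshape(-1)
        gathered = feats[view_idx, pix_idx].view(N, HW, K, C)
        k, v = self.to_k(gathered), self.to_v(gathered)
        # Per-pixel attention: one query against its K correspondences.
        d = C // self.heads
        q = q.view(N, HW, 1, self.heads, d).transpose(2, 3)  # (N, HW, h, 1, d)
        k = k.view(N, HW, K, self.heads, d).transpose(2, 3)  # (N, HW, h, K, d)
        v = v.view(N, HW, K, self.heads, d).transpose(2, 3)
        attn = (q @ k.transpose(-1, -2)) * d ** -0.5         # (N, HW, h, 1, K)
        out = (attn.softmax(dim=-1) @ v).transpose(2, 3).reshape(N, HW, C)
        return feats + self.proj(out)

# Toy usage: 8 views, 32x32 feature maps, 4 correspondences per pixel.
layer = CorrespondenceAwareAttention(dim=64)
feats = torch.randn(8, 32 * 32, 64)
corr = torch.stack([torch.randint(0, 8, (8, 1024, 4)),
                    torch.randint(0, 1024, (8, 1024, 4))], dim=-1)
print(layer(feats, corr).shape)  # torch.Size([8, 1024, 64])
```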


Given text descriptions, MVDiffusion generates holistically consistent multi-view images with high resolution and rich content, which benefits practical tasks such as panorama generation and multi-view depth-to-image generation.

Results showcase

We demonstrate the capability of MVDiffusion on two challenging multi-view image generation tasks: 1) panorama generation and 2) multi-view depth-to-image generation.


Task 1: Panorama generation


Example results of panorama generation from text prompts using MVDiffusion. Check out our online demo to generate panorama images from your own descriptions.
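
For readers curious how the pixel-to-pixel correspondences arise in this setting: the perspective crops share a single camera center, so pixels of view i map to view j through the planar homography K R_j R_i^T K^-1. Below is a minimal NumPy sketch; the eight-view, 90-degree-FoV, 45-degree-yaw-interval setup follows the paper, while the helper names and yaw-only rotations are illustrative assumptions.

```python
# A minimal sketch of panorama-setting correspondences: perspective crops
# share a camera center, so view i maps to view j by a planar homography.
import numpy as np

def intrinsics(fov_deg: float, size: int) -> np.ndarray:
    """Pinhole intrinsics for a square image with the given field of view."""
    f = 0.5 * size / np.tan(0.5 * np.radians(fov_deg))
    return np.array([[f, 0.0, size / 2], [0.0, f, size / 2], [0.0, 0.0, 1.0]])

def yaw_rotation(deg: float) -> np.ndarray:
    """World-to-camera rotation about the vertical axis (assumed convention)."""
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def correspondences(K: np.ndarray, R_i: np.ndarray, R_j: np.ndarray,
                    size: int) -> np.ndarray:
    """Map every pixel of view i into view j; returns (size, size, 2) coords."""
    H = K @ R_j @ R_i.T @ np.linalg.inv(K)
    u, v = np.meshgrid(np.arange(size), np.arange(size))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, HW)
    mapped = H @ pix.astype(np.float64)
    z = np.maximum(mapped[2:], 1e-9)  # non-positive depth means no valid match
    # Coordinates outside [0, size) likewise have no counterpart in view j.
    return (mapped[:2] / z).T.reshape(size, size, 2)

K = intrinsics(90.0, 512)
corr_01 = correspondences(K, yaw_rotation(0.0), yaw_rotation(45.0), 512)
```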



Task 2: Multi-view depth-to-image generation

Given a sequence of depth maps from a raw mesh, MVDiffusion generates a sequence of RGB images that preserves the underlying geometry and maintains multi-view consistency. The results can further be exported as a textured mesh. Check out more results on the gallery page.
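
As a rough sketch of where the correspondences come from in this setting, the snippet below unprojects each pixel of one view with its depth, applies the relative pose, and reprojects into a neighboring view; the resulting (typically sub-pixel) coordinates tell the correspondence-aware attention layers which source pixels to attend to. The function name and the world-to-camera pose convention are assumptions for illustration.

```python
# A minimal sketch of depth-based correspondences: unproject each pixel of
# view i with its depth, move it through the relative pose, and reproject
# into view j. `depth_correspondences` and the world-to-camera convention
# for T_i, T_j are illustrative assumptions.
import numpy as np

def depth_correspondences(depth_i: np.ndarray, K: np.ndarray,
                          T_i: np.ndarray, T_j: np.ndarray) -> np.ndarray:
    """
    depth_i: (H, W) depth map of view i
    K:       (3, 3) shared pinhole intrinsics
    T_i,T_j: (4, 4) world-to-camera extrinsics of views i and j
    returns: (H, W, 2) pixel coordinates in view j (may fall out of bounds)
    """
    h, w = depth_i.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, HW)
    cam_i = np.linalg.inv(K) @ pix * depth_i.reshape(1, -1)            # unproject
    cam_i_h = np.vstack([cam_i, np.ones((1, cam_i.shape[1]))])         # homogeneous
    cam_j = (T_j @ np.linalg.inv(T_i) @ cam_i_h)[:3]                   # relative pose
    proj = K @ cam_j
    z = np.clip(proj[2:], 1e-6, None)  # points behind view j are invalid anyway
    return (proj[:2] / z).T.reshape(h, w, 2)
```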

Each example below shows the untextured input mesh alongside an input depth | predicted texture comparison.

Prompt: An office with a computer desk, a bookcase, a couch, chairs and a trash can. Two monitors and a keyboard are on the desk. Couches is sitting next to the desk.

Prompt: A living room with multiple couches and a coffee table. A wooden book shelf filled with lots of books next to a door. A white refrigerator sitting next to a wooden bench.

Prompt: A bed with a blue comforter and a vase with purple flowers. Two backpacks are sitting on the floor next to the bed. A window with a curtain is next to the bed.

Prompt: A white stove top oven and a white refrigerator freezer sitting inside of a kitchen. The kitchen with a sink, dishwasher, and a microwave. A wooden table and chairs in the kitchen.


Citation

@inproceedings{Tang2023mvdiffusion,
  author    = {Tang, Shitao and Zhang, Fuyang and Chen, Jiacheng and Wang, Peng and Furukawa, Yasutaka},
  title     = {MVDiffusion: Enabling Holistic Multi-view Image Generation with Correspondence-Aware Diffusion},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2023},
}

Acknowledgements

This research is partially supported by NSERC Discovery Grants with Accelerator Supplements and DND/NSERC Discovery Grant Supplement, NSERC Alliance Grants, and John R. Evans Leaders Fund (JELF). We thank the Digital Research Alliance of Canada and BC DRI Group for providing computational resources.