OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction

Video demonstration: We enable multi-modal instruction: users draw masks and supply text or an image as a reference to generate images with the desired content.

Teaser Image

We represent our conditions as a high-dimensional latent feature that seamlessly incorporates mask guidance and multi-modal instruction.

Teaser Image 1

Users can choose text or image as the condition to control image synthesis as needed.

Abstract

We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization. For all instances, the multimodal instruction can be described through text prompts or image references. Given a set of user-defined masks and associated text or image guidance, our objective is to generate an image, where multiple objects are positioned at specified coordinates and their attributes are precisely aligned with the corresponding guidance. This approach significantly expands the scope of text-to-image generation, and elevates it to a more versatile and practical dimension in controllability.
In this paper, our core contribution lies in the proposed latent control signals, a high-dimensional spatial feature that provides a unified representation to integrate the spatial, textual, and image conditions seamlessly. The text condition extends ControlNet to provide instance-level open-vocabulary generation. The image condition further enables fine-grained control with personalized identity. In practice, our method empowers users with more flexibility in controllable generation, as users can choose multi-modal conditions from text or images as needed. Furthermore, thorough experiments demonstrate our enhanced performance in image synthesis fidelity and alignment across different tasks and datasets.
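
To make the idea of a unified spatial feature concrete, here is a minimal sketch of one way such a latent control signal could be assembled. It is not the authors' implementation. It assumes each instance's text or image guidance has already been encoded into a single fixed-length vector (the encoders are not shown, and the real image condition may carry more than one vector per instance). The function name `paint_latent_control` and all shapes are hypothetical.

```python
import numpy as np


def paint_latent_control(masks, embeddings, height, width):
    """Scatter one embedding per instance into its mask region.

    masks:      list of (H, W) boolean arrays, one per instance
    embeddings: list of (C,) float arrays, one per instance
                (hypothetical precomputed text or image features)
    returns:    (C, H, W) float32 feature map, zero where nothing is drawn
    """
    if len(masks) != len(embeddings):
        raise ValueError("need exactly one embedding per mask")
    channels = embeddings[0].shape[0]
    control = np.zeros((channels, height, width), dtype=np.float32)
    for mask, emb in zip(masks, embeddings):
        # Broadcast the (C,) vector over every pixel the mask covers.
        # Where masks overlap, later instances overwrite earlier ones.
        control[:, mask] = emb[:, None]
    return control


# Toy example: two instances on a 4x6 canvas with 8-dimensional features.
H, W, C = 4, 6, 8
rng = np.random.default_rng(0)
left = np.zeros((H, W), dtype=bool)
left[:, :3] = True            # instance 1: left half of the canvas
right = np.zeros((H, W), dtype=bool)
right[1:3, 4:] = True         # instance 2: small box on the right
text_emb = rng.standard_normal(C).astype(np.float32)   # stand-in for a text feature
image_emb = rng.standard_normal(C).astype(np.float32)  # stand-in for an image feature

control = paint_latent_control([left, right], [text_emb, image_emb], H, W)
print(control.shape)  # (8, 4, 6)
```

Because text-derived and image-derived vectors land in the same feature map, a downstream generator could consume both kinds of guidance through a single input, which is the flexibility the abstract describes.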

Below, we display the generated images alongside their input masks.

Results

We provide a panoptic mask and instance-level descriptions as conditional input. Our method generates realistic and precisely aligned images.

Video demonstration

Citation

@article{li2024omnibooth,
  title={OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction},
  author={Li, Leheng and Qiu, Weichao and Yan, Xu and He, Jing and Zhou, Kaiqiang and Cai, Yingjie and Lian, Qing and Liu, Bingbing and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:2410.04932},
  year={2024}
}