OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction

arXiv preprint

Leheng Li¹, Weichao Qiu³, Xu Yan³, Jing He¹, Kaiqiang Zhou³, Yingjie Cai³, Qing Lian², Bingbing Liu³, Ying-Cong Chen¹,²

¹HKUST(GZ) ²HKUST ³HUAWEI Noah's Ark Lab


Video demonstration: Multi-modal instruction lets users draw masks and provide text or images as references to generate images with the desired content.

We represent our conditions as a high-dimensional latent feature that seamlessly incorporates mask guidance and multi-modal instruction.

Users can choose text or image as the condition to control image synthesis as needed.
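
To make the idea of a latent control feature more concrete, here is a minimal sketch of one way such a feature map could be assembled: each instance's embedding is written into the pixels covered by its mask. This is an illustration only, not the authors' implementation. The function name `paint_latent_control`, the nested-list layout, the zero-filled background, and the rule that later instances overwrite earlier ones are all assumptions made for this example. The embeddings stand in for the output of whatever text or image encoder is used.

```python
from typing import List, Sequence

Mask = List[List[int]]        # H x W grid of 0/1 values (hypothetical layout)
Embedding = Sequence[float]   # D-dimensional instance embedding


def paint_latent_control(
    masks: Sequence[Mask],
    embeddings: Sequence[Embedding],
    dim: int,
) -> List[List[List[float]]]:
    """Return an H x W x D feature map built from instance masks.

    Assumption: every pixel inside mask k receives embedding k, and
    uncovered pixels stay at zero. Where masks overlap, the instance
    that appears later in the list wins.
    """
    if not masks:
        raise ValueError("at least one mask is required")
    if len(masks) != len(embeddings):
        raise ValueError("need exactly one embedding per mask")

    height, width = len(masks[0]), len(masks[0][0])
    # Start from an all-zero canvas: background carries no instruction.
    latent = [[[0.0] * dim for _ in range(width)] for _ in range(height)]

    for mask, emb in zip(masks, embeddings):
        if len(emb) != dim:
            raise ValueError("embedding length must equal dim")
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    # Copy so that pixels do not share one list object.
                    latent[y][x] = list(emb)
    return latent


# Two instances on a 2 x 2 canvas, each with a 3-dimensional embedding.
masks = [
    [[1, 0], [0, 0]],  # instance 0 covers the top-left pixel
    [[0, 0], [0, 1]],  # instance 1 covers the bottom-right pixel
]
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
latent = paint_latent_control(masks, embeddings, dim=3)
```

Because text and image references both reduce to embedding vectors in this sketch, the same canvas can mix the two modalities per instance, which is the flexibility the description above refers to.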

Abstract

We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization. For all instances, the multimodal instruction can be described through text prompts or image references. Given a set of user-defined masks and associated text or image guidance, our objective is to generate an image, where multiple objects are positioned at specified coordinates and their attributes are precisely aligned with the corresponding guidance. This approach significantly expands the scope of text-to-image generation, and elevates it to a more versatile and practical dimension in controllability.
In this paper, our core contribution lies in the proposed latent control signals, a high-dimensional spatial feature that provides a unified representation to integrate the spatial, textual, and image conditions seamlessly. The text condition extends ControlNet to provide instance-level open-vocabulary generation. The image condition further enables fine-grained control with personalized identity. In practice, our method empowers users with more flexibility in controllable generation, as users can choose multi-modal conditions from text or images as needed. Furthermore, thorough experiments demonstrate our enhanced performance in image synthesis fidelity and alignment across different tasks and datasets.

Below, we display generated images alongside their input masks.

Results

We provide panoptic masks and instance-level descriptions as conditional input. Our method generates realistic and precisely aligned images.

Video demonstration

Citation

@article{li2024omnibooth,
  title={OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction},
  author={Li, Leheng and Qiu, Weichao and Yan, Xu and He, Jing and Zhou, Kaiqiang and Cai, Yingjie and Lian, Qing and Liu, Bingbing and Chen, Ying-Cong},
  journal={arXiv preprint arXiv:2410.04932},
  year={2024}
}