OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction
Video demonstration: We enable multi-modal instruction, allowing users to draw masks and use text or images as references to generate images with the desired content.
We represent our conditions as a high-dimensional latent feature that seamlessly incorporates mask guidance and multi-modal instruction.
Users can choose text or image as the condition to control image synthesis as needed.
Abstract
We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization. For each instance, the multi-modal instruction can be described through text prompts or image references. Given a set of user-defined masks and associated text or image guidance, our objective is to generate an image in which multiple objects are positioned at their specified coordinates and their attributes precisely align with the corresponding guidance. This approach significantly expands the scope of text-to-image generation and elevates it to a more versatile and practical level of controllability.
In this paper, our core contribution lies in the proposed latent control signal, a high-dimensional spatial feature that provides a unified representation integrating spatial, textual, and image conditions seamlessly. The text condition extends ControlNet to instance-level open-vocabulary generation. The image condition further enables fine-grained control with personalized identity. In practice, our method gives users more flexibility in controllable generation, as they can choose multi-modal conditions from text or images as needed. Furthermore, thorough experiments demonstrate our enhanced performance in image synthesis fidelity and alignment across different tasks and datasets.
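To make the idea of a latent control signal concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of how per-instance conditions could be fused into one high-dimensional spatial feature: each instance mask is painted with either a text embedding or an image embedding, and the resulting map can then be fed to a ControlNet-style conditioning branch. The helper names `encode_text` and `encode_image`, and the feature dimension `dim`, are assumptions for illustration.

```python
import torch

def build_latent_control(masks, conditions, encode_text, encode_image,
                         height, width, dim):
    """Assemble a unified latent control feature (sketch, not the official API).

    masks:      list of (H, W) boolean tensors, one per instance.
    conditions: list of dicts, each either {"text": str} or {"image": tensor}.
    Returns a (dim, H, W) tensor that carries spatial, textual, and image
    conditions in a single representation.
    """
    latent = torch.zeros(dim, height, width)
    for mask, cond in zip(masks, conditions):
        if "text" in cond:
            emb = encode_text(cond["text"])      # (dim,) text embedding
        else:
            emb = encode_image(cond["image"])    # (dim,) identity embedding
        # Paint the instance embedding into its masked spatial region.
        latent[:, mask] = emb.unsqueeze(-1)
    return latent
```

In this sketch, choosing text or image guidance for an instance only changes which encoder produces the embedding; the downstream generator always receives the same kind of spatial feature, which is what allows the two modalities to be mixed freely across instances.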
Below, we display generated images and their input masks.