In recent years, diffusion models have shown great success in text-to-image generation, achieving high image quality, improved inference performance, and expanding our creative inspiration. Nevertheless, it is still challenging to control the generation efficiently, especially with conditions that are difficult to describe in text.
Today we're announcing MediaPipe diffusion plugins, which enable controllable text-to-image generation on-device. Expanding upon our prior work on GPU inference for on-device large generative models, we present low-cost solutions for controllable text-to-image generation that can be plugged into existing diffusion models and their Low-Rank Adaptation (LoRA) variants.
Text-to-image generation with control plugins running on-device.
Background
With diffusion models, image generation is modeled as an iterative denoising process. Starting from a noise image, at each step the diffusion model gradually denoises the image to reveal an image of the target concept. Research shows that leveraging language understanding via text prompts can greatly improve image generation. For text-to-image generation, the text embedding is connected to the model via cross-attention layers. Yet some information is difficult to describe by text prompts, e.g., the position and pose of an object. To solve this problem, researchers add additional models into the diffusion to inject control information from a condition image.
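To make the iterative denoising idea concrete, below is a deliberately minimal Python sketch. It is a toy illustration only: `toy_denoiser`, the simple subtraction update, and the 768-dimensional embedding are stand-ins rather than the actual Stable Diffusion sampler or text encoder, which use cross-attention and a proper noise schedule.

```python
import numpy as np

def toy_denoiser(x, t, text_emb):
    """Stand-in for the diffusion U-Net. In a real model, text_emb enters the
    network through cross-attention layers rather than a simple mean."""
    return 0.1 * x + 0.001 * text_emb.mean()

def generate(text_emb, steps=20, shape=(64, 64), seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)              # start from pure noise
    for t in reversed(range(steps)):
        predicted_noise = toy_denoiser(x, t, text_emb)
        # Simplified update; real samplers (DDPM/DDIM) rescale by a noise schedule.
        x = x - predicted_noise
    return x

image = generate(text_emb=np.ones(768))         # 768-dim stand-in for a text embedding
```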
Common approaches for controlled text-to-image generation include Plug-and-Play, ControlNet, and T2I Adapter. Plug-and-Play applies a widely used denoising diffusion implicit model (DDIM) inversion approach, which reverses the generation process starting from an input image to derive an initial noise input, and then employs a copy of the diffusion model (860M parameters for Stable Diffusion 1.5) to encode the condition from the input image. Plug-and-Play extracts spatial features with self-attention from the copied diffusion model and injects them into the text-to-image diffusion. ControlNet creates a trainable copy of the encoder of the diffusion model, which connects through a convolution layer with zero-initialized parameters to encode conditioning information that is conveyed to the decoder layers. However, as a result, its size is large, half that of the diffusion model (430M parameters for Stable Diffusion 1.5). T2I Adapter is a smaller network (77M parameters) that achieves similar effects in controllable generation. T2I Adapter only takes the condition image as input, and its output is shared across all diffusion iterations. Still, the adapter model is not designed for portable devices.
MediaPipe Diffusion Plugins
To make conditioned generation efficient, customizable, and scalable, we design the MediaPipe diffusion plugin as a separate network that is:
- Pluggable: It can be easily connected to a pre-trained base model.
- Trained from scratch: It does not use pre-trained weights from the base model.
- Portable: It runs on mobile devices, outside the base model, with negligible cost compared to the base model inference.
| Method | Parameter size | Pluggable | Trained from scratch | Portable |
| --- | --- | --- | --- | --- |
| Plug-and-Play | 860M* | ✔️ | ❌ | ❌ |
| ControlNet | 430M* | ✔️ | ❌ | ❌ |
| T2I Adapter | 77M | ✔️ | ✔️ | ❌ |
| MediaPipe Plugin | 6M | ✔️ | ✔️ | ✔️ |
Comparison of Plug-and-Play, ControlNet, T2I Adapter, and the MediaPipe diffusion plugin. * The number varies depending on the particulars of the diffusion model.
The MediaPipe diffusion plugin is a portable on-device model for text-to-image generation. It extracts multiscale features from a conditioning image, which are added to the encoder of a diffusion model at the corresponding levels. When paired with a text-to-image diffusion model, the plugin model provides an extra conditioning signal for image generation. We design the plugin network to be lightweight, with only 6M parameters. It uses depthwise convolutions and inverted bottlenecks from MobileNetV2 for fast inference on mobile devices.
Overview of the MediaPipe diffusion plugin. The plugin is a separate network whose output can be fed into a pre-trained text-to-image diffusion model. Features extracted by the plugin are applied to the associated downsampling layers (blue) of the diffusion model.
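The post does not include the plugin's architecture code, but the description above (multiscale features, depthwise convolutions, MobileNetV2-style inverted bottlenecks) suggests something along the lines of the following PyTorch sketch. The channel widths, number of levels, and block layout here are assumptions for illustration, not the published 6M-parameter network.

```python
import torch
from torch import nn

class InvertedBottleneck(nn.Module):
    """MobileNetV2-style block: 1x1 expand -> 3x3 depthwise -> 1x1 project."""
    def __init__(self, in_ch, out_ch, stride=1, expand=4):
        super().__init__()
        mid = in_ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid, 1, bias=False), nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False),  # depthwise
            nn.BatchNorm2d(mid), nn.ReLU6(inplace=True),
            nn.Conv2d(mid, out_ch, 1, bias=False), nn.BatchNorm2d(out_ch),
        )
        self.skip = stride == 1 and in_ch == out_ch

    def forward(self, x):
        y = self.block(x)
        return x + y if self.skip else y

class ConditionPlugin(nn.Module):
    """Extracts multiscale features from the conditioning image.

    Channel counts are illustrative; the real plugin matches the channel widths
    of the diffusion U-Net encoder levels its outputs are added to.
    """
    def __init__(self, cond_ch=3, widths=(32, 64, 128, 256)):
        super().__init__()
        self.stem = nn.Conv2d(cond_ch, widths[0], 3, stride=2, padding=1)
        self.stages = nn.ModuleList(
            InvertedBottleneck(widths[i - 1], widths[i], stride=2)
            for i in range(1, len(widths))
        )

    def forward(self, cond_image):
        x = self.stem(cond_image)
        feats = [x]
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats  # one feature map per U-Net encoder level

plugin = ConditionPlugin()
features = plugin(torch.randn(1, 3, 512, 512))   # e.g., a rendered face-mesh image
print([f.shape for f in features])
```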
Unlike ControlNet, we inject the same control features in all diffusion iterations. That is, we run the plugin only once per image generation, which saves computation. We show some intermediate results of the diffusion process below. The control is effective at every diffusion step and enables controlled generation even at early steps. More iterations improve the alignment of the image with the text prompt and generate more detail.
Illustration of the generation process using the MediaPipe diffusion plugin.
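The following minimal sketch shows this run-once design: the plugin output is computed once and reused unchanged at every diffusion step. The `toy_unet`, the update rule, and the random feature maps are stand-ins, not the actual MediaPipe or Stable Diffusion code.

```python
import torch

def toy_unet(x, t, text_emb, cond_feats):
    # Stand-in for the diffusion U-Net. In the real model, each cond_feats[i]
    # is added to the matching encoder feature map inside the network.
    return 0.1 * x + 0.001 * text_emb.mean() + 0.001 * sum(f.mean() for f in cond_feats)

def sample(cond_feats, text_emb, steps=20, latent_shape=(1, 4, 64, 64)):
    x = torch.randn(latent_shape)                   # start from noise
    for t in reversed(range(steps)):
        eps = toy_unet(x, t, text_emb, cond_feats)  # same cond_feats at every step
        x = x - eps                                 # simplified update; real samplers
                                                    # follow a noise schedule
    return x

# The plugin runs once per generated image; its multiscale output is reused at
# every diffusion step (unlike ControlNet, which runs at each step).
cond_feats = [torch.randn(1, c, s, s) for c, s in [(32, 256), (64, 128), (128, 64), (256, 32)]]
latent = sample(cond_feats, text_emb=torch.ones(77, 768))
```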
Examples
In this work, we developed plugins for a diffusion-based text-to-image generation model conditioned on MediaPipe face landmarks, MediaPipe holistic landmarks, depth maps, and Canny edges. For each task, we select about 100K images from a web-scale image-text dataset and compute the control signals using the corresponding MediaPipe solutions. We use refined captions from PaLI to train the plugins.
Face landmark
The MediaPipe Face Landmarker task computes 478 landmarks (with attention) of a human face. We use the drawing utilities in MediaPipe to render a face, including the face contour, mouth, eyes, eyebrows, and irises, in different colors. The following figure shows randomly generated samples conditioned on the face mesh and prompts. For comparison, both ControlNet and the plugin can control text-to-image generation with the given conditions.
Face-landmark plugin for text-to-image generation, compared with ControlNet.
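As an illustration of how such a condition image might be produced with MediaPipe's Python Tasks API, here is a hedged sketch. The file names are placeholders, and drawing plain white points is a simplification of the colored per-region rendering described above.

```python
import cv2
import numpy as np
import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision

# "face_landmarker.task" is the downloadable Face Landmarker model bundle;
# "portrait.jpg" is a placeholder input photo.
options = vision.FaceLandmarkerOptions(
    base_options=mp_tasks.BaseOptions(model_asset_path="face_landmarker.task"),
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

image = mp.Image.create_from_file("portrait.jpg")
result = landmarker.detect(image)

# Render the 478 landmarks onto a black canvas to form the condition image.
# The published plugin colors the contour, mouth, eyes, eyebrows, and irises
# differently; plain white points are a simplification here.
canvas = np.zeros((image.height, image.width, 3), dtype=np.uint8)
if result.face_landmarks:
    for lm in result.face_landmarks[0]:
        x, y = int(lm.x * image.width), int(lm.y * image.height)
        cv2.circle(canvas, (x, y), 1, (255, 255, 255), -1)
cv2.imwrite("face_condition.png", canvas)
```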
Holistic landmark
The MediaPipe Holistic Landmarker task includes landmarks of body pose, hands, and face mesh. Below, we generate various stylized images by conditioning on the holistic features.
Holistic-landmark plugin for text-to-image generation.
Depth
Depth plugin for text-to-image generation.
Canny edge
Canny-edge plugin for text-to-image generation.
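For reference, a Canny-edge condition image can be computed with OpenCV, as in the sketch below. The 100/200 hysteresis thresholds and the file names are assumptions; the post does not state the exact preprocessing used to produce the training conditions.

```python
import cv2

# Build a Canny-edge condition image from a photo.
gray = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(gray, 100, 200)          # illustrative thresholds
cv2.imwrite("canny_condition.png", edges)
```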
Evaluation
We conduct a quantitative study of the face landmark plugin to demonstrate the model's performance. The evaluation dataset contains 5K human images. We compare generation quality using the widely adopted metrics Fréchet Inception Distance (FID) and CLIP score. The base model is a pre-trained text-to-image diffusion model; we use Stable Diffusion v1.5 here.
As shown in the following table, both ControlNet and the MediaPipe diffusion plugin produce much better sample quality than the base model in terms of FID and CLIP scores. Unlike ControlNet, which needs to run at every diffusion step, the MediaPipe plugin runs only once per generated image. We measured the performance of the three models on a server machine (with an Nvidia V100 GPU) and on a mobile phone (Galaxy S23). On the server, we run all three models with 50 diffusion steps; on mobile, we run 20 diffusion steps using the MediaPipe image generation app. Compared with ControlNet, the MediaPipe plugin shows a clear advantage in inference efficiency while preserving sample quality.
| Model | FID↓ | CLIP↑ | Inference time (s), Nvidia V100 | Inference time (s), Galaxy S23 |
| --- | --- | --- | --- | --- |
| Base | 10.32 | 0.26 | 5.0 | 11.5 |
| Base + ControlNet | 6.51 | 0.31 | 7.4 (+48%) | 18.2 (+58.3%) |
| Base + MediaPipe Plugin | 6.50 | 0.30 | 5.0 (+0.2%) | 11.8 (+2.6%) |
Quantitative comparison of FID, CLIP score, and inference time.
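The post does not include evaluation code. As a hedged sketch, FID and CLIP score can be computed with the torchmetrics library along the following lines; the random tensors stand in for the real and generated image sets, which in the study contain 5K human images.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.multimodal.clip_score import CLIPScore

# Toy batches standing in for real and generated images (uint8, NCHW).
real = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)
generated = torch.randint(0, 256, (8, 3, 256, 256), dtype=torch.uint8)
prompts = ["a portrait photo of a person"] * 8   # placeholder prompts

# FID compares feature statistics of real vs. generated sets
# (meaningful values require thousands of images, e.g., the 5K evaluation set).
fid = FrechetInceptionDistance(feature=2048)
fid.update(real, real=True)
fid.update(generated, real=False)
print("FID:", fid.compute().item())

# CLIP score measures image-text alignment.
clip_score = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")
print("CLIP score:", clip_score(generated, prompts).item())
```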
We test the effectiveness of the plugin on a wide range of mobile devices, from mid-tier to high-end. We list results on some representative devices, covering both Android and iOS, in the following table.
| Device | Pixel 4 | Pixel 6 | Pixel 7 | Galaxy S23 | iPhone 12 Pro | iPhone 13 Pro |
| --- | --- | --- | --- | --- | --- | --- |
| Time (ms) | 128 | 68 | 50 | 48 | 73 | 63 |
Plugin inference time (ms) on different mobile devices.
Conclusion
In this work, we present MediaPipe diffusion plugins, portable plugins for conditioned text-to-image generation. A plugin injects features extracted from a conditioning image into a diffusion model, and consequently controls the image generation. Portable plugins can be connected to pre-trained diffusion models running on servers or on devices. By running text-to-image generation and the plugins fully on-device, we enable more flexible applications of generative AI.
Acknowledgments
We would like to thank all team members who contributed to this work: Raman Sarokin and Juhyun Lee for the GPU inference solution; Khanh LeViet, Chuo-Ling Chang, Andrei Kulik, and Matthias Grundmann for leadership. Special thanks to Jiuqiang Tang, Joe Zhu, and Lu Wang, who made this technology and all the demos run on-device.