The proliferation of large diffusion models for image generation has led to a significant increase in model size and inference workload. On-device ML inference in mobile environments requires meticulous performance optimization and careful consideration of trade-offs due to resource constraints. Running inference of large diffusion models (LDMs) on-device, driven by the need for cost efficiency and user privacy, presents even greater challenges due to the substantial memory requirements and computational demands of these models.
We address this challenge in our paper "Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations" (to be presented at the CVPR 2023 Workshop for Efficient Deep Learning for Computer Vision), which focuses on the optimized execution of a foundational LDM model on a mobile GPU. In this blog post, we summarize the core techniques we employed to successfully run large diffusion models, such as Stable Diffusion at full resolution (512×512 pixels and 20 iterations), on modern smartphones with high-performing inference speed of the original model, without distillation, in under 12 seconds. As discussed in our previous blog post, GPU-accelerated ML inference is often limited by memory performance, and running LDMs is no exception. Therefore, a central theme of our optimization is efficient memory input/output (I/O), even if it means choosing memory-efficient algorithms over those that prioritize arithmetic logic unit efficiency. Ultimately, our primary objective is to reduce the overall latency of the ML inference.
An example of running an LDM on a mobile GPU with the text prompt: "A photo of a cute puppy with realistic and high resolution images with flowers around."
Enhanced attention modules for memory efficiency
An ML inference engine typically provides a variety of optimized ML operations. Despite this, achieving optimal performance can still be challenging, as there is a certain amount of overhead for executing individual neural network operators on a GPU. To mitigate this overhead, ML inference engines incorporate extensive operator fusion rules that consolidate multiple operators into a single operator, thereby reducing the number of iterations over tensor elements while maximizing the compute per iteration. For example, TensorFlow Lite utilizes operator fusion to combine computationally expensive operations, such as convolutions, with subsequent activation functions, such as the rectified linear unit (ReLU), into one.
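To illustrate why fusion helps, here is a minimal NumPy sketch of the data movement (not the actual TensorFlow Lite kernels): the unfused version writes the full pre-activation tensor and reads it back to apply ReLU, while the fused version applies the activation before each output element is written.

```python
import numpy as np

def conv3x3_then_relu(x, w):
    """Unfused: write the full convolution output, then re-read it to apply ReLU."""
    h, wd = x.shape
    out = np.zeros((h - 2, wd - 2))
    for i in range(h - 2):
        for j in range(wd - 2):
            out[i, j] = np.sum(x[i:i + 3, j:j + 3] * w)  # pass 1: full write of pre-activation
    return np.maximum(out, 0.0)                          # pass 2: full read + write for ReLU

def conv3x3_relu_fused(x, w):
    """Fused: activate each output element while it is still in registers, write it once."""
    h, wd = x.shape
    out = np.zeros((h - 2, wd - 2))
    for i in range(h - 2):
        for j in range(wd - 2):
            acc = np.sum(x[i:i + 3, j:j + 3] * w)        # compute one output element
            out[i, j] = acc if acc > 0.0 else 0.0        # apply ReLU before the single write
    return out
```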
A notable opportunity for optimization is the widely used attention block adopted in the denoiser model of the LDM. Attention blocks allow the model to focus on specific parts of the input by assigning higher weights to important regions. There are multiple ways to optimize the attention modules, and we selectively employ one of the two optimizations explained below depending on which one works best.
The first optimization, which we call partially fused softmax, removes the need for extensive memory writes and reads between the softmax and the matrix multiplication in the attention module. Let the attention block be simply a matrix multiplication of the form Y = softmax(X) * V, where X and V are 2D matrices of shape a×b and b×c, respectively (shown in the top half of the figure below).
For numerical stability, T = softmax(X) is typically computed in three passes:
- Determine the maximum value in the list, i.e., for each row of matrix X
- Sum up the exponentials of the differences between each list item and the maximum value from pass 1
- Divide the exponential of each item minus the maximum value by the sum from pass 2
Carrying out these passes naively would result in a huge memory write for the temporary intermediate tensor T holding the output of the entire softmax function. We bypass this large memory write by storing only the results of passes 1 and 2, labeled m and s respectively, which are small vectors with a elements each, compared to T, which has a·b elements. With this technique, we are able to reduce the memory consumption by tens or even hundreds of megabytes, i.e., by multiple orders of magnitude (shown in the bottom half of the figure below).
Attention modules. Top: a naive attention block, consisting of a SOFTMAX (with all three passes) and a MATMUL, requires a large memory write for the big intermediate tensor T. Bottom: our memory-efficient attention block with partially fused softmax in MATMUL only needs to store the two small intermediate tensors m and s.
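The following NumPy sketch illustrates the data-movement argument (it is illustrative only, not the GPU shader code): the naive version materializes the full a×b tensor T, whereas the partially fused version keeps only the per-row vectors m and s and folds the normalization into the matrix multiplication.

```python
import numpy as np

def naive_attention_matmul(x, v):
    """Naive: materialize the full softmax output T (a*b elements) before the MATMUL."""
    m = x.max(axis=1, keepdims=True)           # pass 1: per-row maximum
    e = np.exp(x - m)
    s = e.sum(axis=1, keepdims=True)           # pass 2: per-row sum of exponentials
    t = e / s                                  # pass 3: large intermediate tensor T
    return t @ v

def partially_fused_attention_matmul(x, v):
    """Partially fused sketch: only the small per-row vectors m and s (a elements each)
    are kept between the softmax passes and the MATMUL; pass 3 is folded into the
    matrix multiplication, so T never has to be written out."""
    m = x.max(axis=1)                          # pass 1
    s = np.exp(x - m[:, None]).sum(axis=1)     # pass 2
    a, b = x.shape
    y = np.zeros((a, v.shape[1]))
    for i in range(a):
        for j in range(b):
            # pass 3 fused into the MATMUL: exponentiate, normalize, accumulate
            y[i] += (np.exp(x[i, j] - m[i]) / s[i]) * v[j]
    return y
```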
The other optimization involves employing FlashAttention, which is an I/O-aware, exact attention algorithm. This algorithm minimizes the number of GPU high-bandwidth memory accesses, making it a good fit for our limited memory bandwidth budget. However, we found that this technique only works for SRAM of certain sizes and requires a large number of registers. Therefore, we only leverage this technique for attention matrices of a certain size on a select set of GPUs.
Fast Winograd convolution for 3×3 convolution layers
The backbone of common LDMs relies heavily on 3×3 convolution layers (convolutions with filter size 3×3), which make up more than 90% of the layers in the decoder. Despite increased memory consumption and numerical errors, we found that Winograd fast convolution is effective at speeding up these convolutions. Distinct from the 3×3 filter size used in the convolutions, the tile size refers to the size of a sub-region of the input tensor processed at a time. Increasing the tile size improves the efficiency of the convolution in terms of arithmetic logic unit (ALU) usage, but this improvement comes at the expense of increased memory consumption. Our tests indicate that a tile size of 4×4 achieves the optimal trade-off between computational efficiency and memory usage.
| Tile size | FLOPS savings | Memory usage: intermediate tensors | Memory usage: weights |
| --- | --- | --- | --- |
| 2×2 | 2.25× | 4.00× | 1.77× |
| 4×4 | 4.00× | 2.25× | 4.00× |
| 6×6 | 5.06× | 1.80× | 7.12× |
| 8×8 | 5.76× | 1.56× | 11.1× |

Impact of Winograd fast convolution with varying tile sizes for 3×3 convolutions.
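These ratios can be reconstructed from the tile arithmetic of Winograd F(m×m, 3×3): each m×m output tile is computed from an (m+2)×(m+2) transformed input tile using one multiplication per transformed element, versus 9 multiplications per output element for direct convolution. The short script below is a sketch of that counting (small deviations from the published figures, e.g., 1.78 vs. 1.80, presumably come from rounding or bookkeeping details).

```python
def winograd_ratios(m, r=3):
    """Ratios for Winograd F(m x m, r x r) relative to direct convolution.
    An m x m output tile is produced from a (m + r - 1) x (m + r - 1) transformed tile,
    with one multiplication per transformed element instead of r*r per output element."""
    t = m + r - 1
    flops_savings = (m * m * r * r) / (t * t)   # multiplications: direct vs. Winograd
    intermediates = (t * t) / (m * m)           # transformed-input elements per output element
    weights = (t * t) / (r * r)                 # transformed-filter elements per original weight
    return flops_savings, intermediates, weights

for m in (2, 4, 6, 8):
    f, i, w = winograd_ratios(m)
    print(f"{m}x{m}: FLOPS {f:.2f}x, intermediates {i:.2f}x, weights {w:.2f}x")
```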
Specialized operator fusion for memory efficiency
We found that performant inference of LDMs on a mobile GPU requires significantly larger fusion windows for the layers and units commonly employed in LDMs than current off-the-shelf GPU-accelerated ML inference engines provide. Consequently, we developed specialized implementations that can execute a larger range of neural operators than typical fusion rules would permit. Specifically, we focused on two specializations: the Gaussian Error Linear Unit (GELU) and the group normalization layer.
The approximation of GELU with the hyperbolic tangent function requires writing to and reading from seven auxiliary intermediate tensors (shown as light orange rounded rectangles in the figure below), reading from the input tensor x three times, and writing to the output tensor y once, across eight GPU programs each implementing one labeled operation (light blue rectangles). A custom GELU implementation that performs the eight operations in a single shader (shown in the bottom of the figure) can bypass all of the memory I/O for the intermediate tensors.
Implementations of GELU. Top: a naive implementation with built-in operations would require 8 memory writes and 10 reads. Bottom: our custom GELU requires only 1 memory read (for x) and 1 write (for y).
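For reference, here is a NumPy sketch of the tanh-based GELU approximation (illustrative only; the exact grouping into GPU programs in the figure may differ): the unfused version produces seven auxiliary intermediates, while the fused version evaluates the whole expression at once, which is the CPU analogue of doing all the work in a single shader.

```python
import numpy as np

SQRT_2_OVER_PI = np.sqrt(2.0 / np.pi)

def gelu_unfused(x):
    """Naive decomposition: each built-in op writes one auxiliary intermediate tensor (t1..t7)."""
    t1 = x * x
    t2 = t1 * x                      # x^3
    t3 = 0.044715 * t2
    t4 = x + t3
    t5 = SQRT_2_OVER_PI * t4
    t6 = np.tanh(t5)
    t7 = 1.0 + t6
    return 0.5 * x * t7              # y

def gelu_fused(x):
    """Fused analogue of a single shader: the whole approximation is evaluated in one
    expression, so no intermediate ever needs to be written to or read back from memory."""
    return 0.5 * x * (1.0 + np.tanh(SQRT_2_OVER_PI * (x + 0.044715 * x * x * x)))
```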
Results
After applying all of these optimizations, we conducted benchmark tests of Stable Diffusion 1.5 (image resolution 512×512, 20 iterations) on high-end mobile devices. Running Stable Diffusion with our GPU-accelerated ML inference model uses 2,093 MB for the weights and 84 MB for the intermediate tensors. On the latest high-end smartphones, Stable Diffusion can be run in under 12 seconds.
Stable Diffusion runs in under 12 seconds on a modern smartphone. Note that running the decoder after each iteration to display the intermediate output in this animated GIF results in a roughly 2× slowdown.
Conclusion
Performing ML inference of large models on-device has proven to be a substantial challenge, encompassing limitations in model file size, extensive runtime memory requirements, and prolonged inference latency. Recognizing memory bandwidth usage as the primary bottleneck, we directed our efforts toward optimizing memory bandwidth utilization and striking a delicate balance between ALU efficiency and memory efficiency. As a result, we achieved state-of-the-art inference latency for large diffusion models. You can learn more about this work in the paper.
Acknowledgments
We would like to thank Yu-Hui Chen, Jiuqiang Tang, Frank Barchard, Yang Zhao, Joe Zou, Khanh LeViet, Chuo-Ling Chang, Andrei Kulik, Lu Wang, and Matthias Grundmann.