Vision-language foundation models are built on the premise of a single pre-training followed by adaptation to many downstream tasks. Two main, disjoint training scenarios are popular: CLIP-style contrastive learning and next-token prediction. Contrastive learning trains the model to predict whether image-text pairs correctly match, effectively building visual and textual representations for the corresponding image and text inputs, whereas next-token prediction predicts the most likely next text token in a sequence, thus learning to generate text. Contrastive learning enables image-text and text-image retrieval tasks, such as finding the image that best matches a given description, while next-token prediction enables text-generative tasks, such as image captioning and visual question answering (VQA). Although both approaches have demonstrated strong results, when a model is pre-trained contrastively it typically does not perform well on text-generative tasks, and vice versa. In addition, adaptation to other tasks is often complex or inefficient. For example, to extend a vision-language model to videos, some models must run inference on each video frame separately. This limits the length of video that can be processed to only a few frames and does not fully exploit the motion information available across frames.
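As a rough illustration of the two objectives, the following is a minimal PyTorch sketch (not the authors' code; the tensor names `image_emb`, `text_emb`, `token_logits`, and `token_ids` are hypothetical):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """CLIP-style loss: each image should match its paired text and vice versa."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def next_token_loss(token_logits, token_ids):
    """Generative loss: predict token t from tokens < t (labels shifted by one)."""
    return F.cross_entropy(token_logits[:, :-1].reshape(-1, token_logits.size(-1)),
                           token_ids[:, 1:].reshape(-1))
```

The contrastive loss only needs one pooled embedding per image and per text, while the generative loss needs per-token predictions conditioned on the preceding tokens; this is exactly the tension a single shared model has to resolve.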
Motivated by this, we present MaMMUT, a simple architecture for joint learning across multimodal tasks, which can be trained jointly on these competing objectives and which serves as a foundation for many vision-language tasks, either directly or via simple adaptation. MaMMUT is a compact, 2B-parameter multimodal model trained on contrastive, text-generative, and localization-aware objectives. It consists of a single image encoder and a single text decoder, which allows both components to be reused directly. Moreover, direct adaptation to video-text tasks requires running the image encoder only once and can handle many more frames than prior work. In line with recent language models (e.g., PaLM, GLaM, GPT-3), our architecture uses a decoder-only text model and can be regarded as a simple extension of language models. Although modest in size, our model outperforms the state of the art or achieves competitive performance on image-text and text-image retrieval, video question answering (VideoQA), video captioning, open-vocabulary detection, and VQA.
The MaMMUT model enables a wide range of tasks, such as image-text and text-image retrieval (top left and top right), VQA (middle left), open-vocabulary detection (middle right), and VideoQA (bottom).
Decoder-only model architecture
One surprising finding is that a single language decoder suffices for all these tasks, avoiding the need for the complex constructs and training procedures proposed before. For example, our model (shown on the left in the figure below) consists of a single visual encoder and a single text decoder, connected via cross-attention, and simultaneously trains the contrastive and text-generative losses. In comparison, prior work either cannot handle image-text retrieval tasks, or applies certain losses to only parts of the model. To accommodate multimodal tasks and take full advantage of the decoder-only model, we need to jointly train both the contrastive losses and the text-generative, captioning-like losses.
The MaMMUT architecture (left) is a simple construction consisting of one vision encoder and one text decoder. Compared to other popular vision-language models, e.g., PaLI (middle) and ALBEF or CoCa (right), it trains jointly and efficiently on multiple vision-language tasks with both contrastive and text-generative losses, fully sharing the weights between the tasks.
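To make the weight sharing concrete, a joint training step could look roughly like the following sketch. It assumes a placeholder `model` object exposing a `vision_encoder` and a `text_decoder` with `causal` and `cross_attend` flags, reuses the hypothetical loss helpers sketched earlier, and simplifies pooling to a mean; none of these interface details are taken from the paper:

```python
def train_step(model, images, token_ids):
    """One joint step: the same vision encoder and text decoder serve both losses."""
    image_feats = model.vision_encoder(images)                  # (B, N_patches, D)

    # Generative pass: causal masking plus cross-attention to image features.
    token_logits = model.text_decoder(token_ids, image_feats,
                                      causal=True, cross_attend=True)
    gen_loss = next_token_loss(token_logits, token_ids)

    # Contrastive pass: text-only, bidirectional, pooled to a single embedding.
    text_feats = model.text_decoder(token_ids, image_feats=None,
                                    causal=False, cross_attend=False)
    con_loss = contrastive_loss(image_feats.mean(dim=1), text_feats.mean(dim=1))

    # Both objectives update the same shared encoder and decoder weights.
    return gen_loss + con_loss
```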
Decoder two-pass learning
Decoder-only models for language learning show clear performance advantages at smaller model sizes (roughly half the parameters). The main challenge in applying them to multimodal settings is unifying contrastive learning (which uses an unconditional, sequence-level representation) with captioning (which optimizes the likelihood of a token conditioned on the previous tokens). We propose a two-pass approach to jointly learn these two conflicting types of text representation within the decoder. In the first pass, we use cross-attention and causal masking to learn the caption-generation task: the text features can attend to the image features and predict the tokens in sequence. In the second pass, we disable cross-attention and causal masking to learn the contrastive task. The text features cannot see the image features, but can attend bidirectionally to all text tokens at once to produce the final text-based representation. Carrying out this two-pass approach within the same decoder accommodates both types of tasks, which were previously hard to reconcile. Despite its simplicity, we show that this model architecture can serve as a foundation for many multimodal tasks.
Two-pass learning with the MaMMUT decoder enables both contrastive and generative learning paths within the same model.
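The only thing that changes between the two passes is the attention pattern. A minimal sketch of the text self-attention masks (assuming boolean masks where `True` means "may attend"):

```python
import torch

def text_attention_mask(seq_len: int, generative_pass: bool) -> torch.Tensor:
    """Self-attention mask for the decoder's two passes (True = may attend)."""
    if generative_pass:
        # First pass: causal mask, so each token only sees earlier tokens;
        # cross-attention to image features is enabled in this pass.
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Second pass: fully bidirectional text self-attention, with cross-attention
    # disabled, yielding a sequence-level representation for the contrastive loss.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)
```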
Another advantage of our architecture is that, since it is trained on these disjoint tasks, it can be applied seamlessly to many applications, such as image-text and text-image retrieval, VQA, and captioning.
In addition, MaMMUT adapts easily to video-language tasks. Previous approaches used a vision encoder to process each frame individually, which requires applying it multiple times. This is slow and limits the number of frames the model can handle, typically to only 6-8. With MaMMUT, we use sparse video tubes for lightweight adaptation directly from the spatio-temporal information in the video, as sketched below. Furthermore, adapting the model to open-vocabulary detection is done by simply training it to predict bounding boxes via an object-detection head.
Adaptation of the MaMMUT architecture to video tasks (left) is simple and fully reuses the model. This is done by generating a feature representation of video "tubes", similar to image patches, which are projected to lower-dimensional tokens and passed through the vision encoder. Unlike prior approaches (right) that need to run multiple individual frames through the vision encoder, we use it only once.
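As a simplified illustration of the tube idea (the paper's sparse sampling of tube locations is not reproduced here, and the `SparseVideoTubes` name, tube size, and embedding dimension are assumptions), a 3D convolution can turn a whole clip into one token sequence for a single encoder pass:

```python
import torch
import torch.nn as nn

class SparseVideoTubes(nn.Module):
    """Hypothetical sketch: project spatio-temporal video 'tubes' to tokens
    that the (unchanged) vision encoder consumes in a single pass."""
    def __init__(self, embed_dim=768, tube_size=(4, 16, 16)):
        super().__init__()
        # A 3D convolution whose kernel and stride span several frames at once,
        # so each output token summarizes a spatio-temporal tube, not a single patch.
        self.to_tokens = nn.Conv3d(3, embed_dim,
                                   kernel_size=tube_size, stride=tube_size)

    def forward(self, video):                      # video: (B, 3, T, H, W)
        tokens = self.to_tokens(video)             # (B, D, T', H', W')
        return tokens.flatten(2).transpose(1, 2)   # (B, T'*H'*W', D) token sequence

# Usage: 16 frames of 224x224 video become one token sequence, so the vision
# encoder runs once per clip instead of once per frame.
tubes = SparseVideoTubes()
tokens = tubes(torch.randn(2, 3, 16, 224, 224))    # -> shape (2, 784, 768)
```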
Results
Our model achieves excellent zero-shot image-text and text-image retrieval results without any adaptation, outperforming all previous state-of-the-art models. The VQA results are competitive with the state of the art, which is achieved by much larger models. The PaLI model (17B parameters) and the Flamingo model (80B) have the best performance on the VQA2.0 dataset, but MaMMUT (2B) matches the accuracy of the 15B PaLI.
MaMMUT outperforms the state of the art (SOTA) on zero-shot image-to-text (I2T) and text-to-image (T2I) retrieval on both the MS-COCO (top) and Flickr (bottom) benchmarks.
Performance on the VQA2.0 dataset is competitive with, but does not surpass, larger models such as Flamingo-80B and PaLI-17B. Performance is evaluated in the more challenging open-ended text-generation setting.
MaMMUT also outperforms the state of the art on VideoQA, as shown below on the MSRVTT-QA and MSVD-QA datasets. Note that it outperforms much larger models such as Flamingo, which is specifically designed for image+video pre-training and is pre-trained with both image-text and video-text data.
MaMMUT outperforms SOTA models on VideoQA tasks (MSRVTT-QA dataset, top; MSVD-QA dataset, bottom), including much larger models, e.g., the 5B GIT2 or Flamingo, which uses 80B parameters and is pre-trained on both image-language and video-language tasks.
Our results also exceed the state of the art on open-vocabulary detection, as shown below.
Key ingredients
We show that jointly training the contrastive and text-generative objectives is not an easy task, and in our ablations we find that these tasks are served better by different design choices. For example, fewer cross-attention connections are better for retrieval tasks, whereas more are preferred for VQA tasks. Yet, while this shows that our design choices may be suboptimal for individual tasks, our model is more effective than more complex, or larger, models.
Ablation studies show that fewer cross-attention connections (1-2) are better for retrieval (top), while more connections favor text-generative tasks such as VQA (bottom).
Conclusion
We present MaMMUT, a simple and compact vision-encoder, language-decoder model that jointly trains a number of conflicting objectives to reconcile contrastive and text-generative tasks. Our model also serves as a foundation for many other vision-language tasks, achieving state-of-the-art or competitive performance on image-text and text-image retrieval, VideoQA, video captioning, open-vocabulary detection, and VQA. We hope it can be further used for more multimodal applications.
Acknowledgments
The co-authors of the work described are: Weichen Kuo, AJ Piergiovanni, Dahun Kim, Xiang Luo, Ben Kane, Wei Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifen Chen, Claire Cui, and Anelia Angelova. We would like to thank Mojtaba Seyedosain, Vijay Vasudevan, Priya Goyal, Jiahui Yu, Zhirui Wang, Yonghui Wu, Runze Li, Ji Mei, Radu Sorikut, Qing Qing Huang, Andy Li, Nan Du, Yuxin Wu, Tom Duerig, Paul Natsev, Zubin Ghahramani for help and support.