Detection is a fundamental vision task that aims to localize and recognize objects in an image. However, collecting detection data by manually annotating bounding boxes or instance masks is tedious and expensive, which limits the vocabulary of modern detection datasets to roughly 1,000 object classes. This is far smaller than the vocabulary people use to describe the visual world and leaves out many categories. Recent vision and language models (VLMs), such as CLIP, have demonstrated improved open vocabulary visual recognition capabilities by learning from Internet-scale image-text pairs. These VLMs are applied to zero-shot classification with frozen model weights and without the need for fine-tuning, which stands in stark contrast to the existing paradigms that retrain or fine-tune VLMs for open vocabulary detection.
Intuitively, by learning to match image content to textual descriptions, VLMs may acquire region-sensitive and discriminative features that transfer to object detection. Surprisingly, features from a frozen VLM contain rich information that is both region sensitive for describing object shapes (second column below) and discriminative for region classification (third column below). In fact, feature grouping can nicely delineate object boundaries without any supervision. This motivates us to explore the use of frozen VLMs for open vocabulary object detection, extending detection beyond a limited set of annotated categories.
We explore the potential of frozen vision and language features for open vocabulary detection. K-Means clustering of the features reveals rich semantic and region-sensitive information, with object boundaries nicely delineated (column 2). The same frozen features can classify ground-truth (GT) regions well without fine-tuning (column 3).
In “F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models,” presented at ICLR 2023, we introduce a simple and scalable open vocabulary detection approach built upon frozen VLMs. F-VLM reduces the training complexity of an open vocabulary detector to that of a standard detector, eliminating the need for knowledge distillation, detection-specific pre-training, or weakly supervised learning. By fully preserving the knowledge of pre-trained VLMs, F-VLM maintains a similar philosophy to ViTDet and decouples detector-specific learning from the more task-agnostic vision knowledge in the detector backbone. We are also releasing the F-VLM code along with a demo on our project page.
Learning frozen vision and language models
We want to preserve the knowledge of pre-trained VLMs as much as possible to minimize the effort and cost needed to adapt them for open vocabulary detection. We use a frozen VLM image encoder as the detector backbone, and the text encoder to cache the text embeddings of the detection vocabulary offline. We take this VLM backbone and attach a detector head, which predicts object regions for localization and outputs detection scores that indicate the probability of a detected box belonging to a certain category. The detection scores are the cosine similarity between the region features (a set of bounding boxes output by the detector head) and the category text embeddings. The category text embeddings are obtained by feeding the category names through the text model of the pre-trained VLM (which has both an image and a text model).
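To make the scoring concrete, the sketch below shows one way cosine-similarity detection scores against cached category text embeddings could be computed. This is a minimal PyTorch illustration, not the released F-VLM code; the function name, tensor shapes, and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def detection_scores(region_features, text_embeddings, temperature=0.01):
    """Classify region features against cached category text embeddings
    by cosine similarity (a minimal sketch).

    region_features: [num_regions, dim] features from the detector head.
    text_embeddings: [num_categories, dim] cached VLM text embeddings.
    Returns: [num_regions, num_categories] class probabilities.
    """
    # L2-normalize both sides so the dot product equals cosine similarity.
    region_features = F.normalize(region_features, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    logits = region_features @ text_embeddings.t() / temperature
    return logits.softmax(dim=-1)

# Example: 5 proposals, a 512-d embedding space, 1,203 LVIS categories.
scores = detection_scores(torch.randn(5, 512), torch.randn(1203, 512))
```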
A VLM image encoder consists of two parts: 1) a feature extractor and 2) a feature pooling layer. We adopt the feature extractor for detector head training, which is the only step we train (on standard detection data), allowing us to directly use the frozen weights and inherit rich semantic knowledge (e.g., long-tailed categories like martini, fedora hat, pennant) from the VLM backbone. The detection losses include box regression and classification losses.
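A rough sketch of this training setup is shown below, assuming generic stand-in modules rather than the actual F-VLM components: the backbone parameters are frozen and kept in inference mode, and only the detector head is handed to the optimizer.

```python
import torch

# Hypothetical stand-ins for the frozen VLM feature extractor and the
# trainable detector head; the module choices are illustrative only.
backbone = torch.nn.Conv2d(3, 256, 3, padding=1)   # frozen VLM feature extractor
detector_head = torch.nn.Conv2d(256, 5, 1)         # box / class predictions

# Freeze the backbone so it keeps the pre-trained VLM knowledge.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()  # run in inference mode (e.g., frozen normalization statistics)

# Only the detector head is optimized, with the usual box regression
# and classification detection losses driving the gradients.
optimizer = torch.optim.SGD(detector_head.parameters(), lr=0.1, momentum=0.9)
```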
During training, F-VLM is simply a detector with the last classification layer replaced by base-category text embeddings.
Region-level open vocabulary recognition
The ability to perform open vocabulary recognition at the region level (i.e., the bounding box level, as opposed to the image level) is integral to F-VLM. Because the backbone features are frozen, they do not overfit the training categories (e.g., donut, zebra) and can be directly cropped for region-level classification. F-VLM performs this open vocabulary classification only at test time. To obtain the VLM features for a region, we apply the VLM feature pooling layer to the cropped backbone output features. Because the pooling layer requires fixed-size inputs, e.g., 7x7 for the ResNet50 (R50) CLIP backbone, we crop and resize the region features with the ROI-Align layer (shown below). Unlike existing open vocabulary detection approaches, we do not crop and resize RGB image regions and cache their embeddings in a separate offline process, but train the detector head in a single stage. This is simpler and makes more efficient use of disk storage. In addition, we do not crop the VLM region features during training because the backbone features are frozen.
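The following sketch illustrates this crop-and-resize step with torchvision's ROI-Align, pooling frozen backbone features to the fixed size expected by the VLM's own pooling layer. It is an illustration under assumed shapes; the stride, pooling size, and function name are placeholders rather than F-VLM's exact configuration.

```python
import torch
from torchvision.ops import roi_align

def crop_region_features(backbone_features, boxes, output_size=7, stride=32):
    """Crop and resize frozen backbone features for region-level
    classification (a sketch; stride and pooling size are assumptions
    matching a typical R50 top-level feature map).

    backbone_features: [batch, channels, H, W] frozen VLM feature map.
    boxes: list of [num_boxes, 4] tensors (one per image) with boxes in
        image coordinates (x1, y1, x2, y2).
    Returns: [total_boxes, channels, output_size, output_size] crops,
        ready for the VLM's feature pooling layer.
    """
    return roi_align(
        backbone_features,
        boxes,
        output_size=(output_size, output_size),
        spatial_scale=1.0 / stride,  # map image coordinates to feature map coordinates
        aligned=True,
    )

# Example: one image, a 2048-channel top-level feature map, two proposals.
feats = torch.randn(1, 2048, 32, 32)
proposals = [torch.tensor([[0.0, 0.0, 128.0, 128.0], [256.0, 256.0, 512.0, 512.0]])]
region_feats = crop_region_features(feats, proposals)  # -> [2, 2048, 7, 7]
```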
Despite never being trained on regions, the cropped region features retain good open vocabulary recognition capability. However, we observe that the cropped region features are not sensitive enough to the localization quality of the regions, i.e., a loosely localized box and a tightly localized box have similar features. This may be good for classification, but it is problematic for detection, because we need the detection scores to reflect localization quality. To remedy this, we apply the geometric mean to combine the VLM scores with the detection scores for each region and category. The VLM scores indicate the probability of a detection box belonging to a certain category according to the pre-trained VLM. The detection scores indicate the class probability distribution of each box based on the similarity of region features and input text embeddings.
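As a sketch of this combination, the snippet below takes a weighted geometric mean of the two score matrices, with separate weights for base (training) and novel categories as described in the paper. The specific alpha and beta values and the function interface are placeholders, not the paper's tuned settings.

```python
import torch

def combine_scores(det_scores, vlm_scores, base_mask, alpha=0.35, beta=0.65):
    """Weighted geometric mean of detection and VLM scores per region
    and category (a sketch; alpha/beta are placeholder weights).

    det_scores, vlm_scores: [num_regions, num_categories], values in [0, 1].
    base_mask: [num_categories] boolean, True for base (training) categories.
    """
    # Per-category weight on the VLM score: smaller for base categories,
    # larger for novel ones, so novel classes lean more on the frozen VLM.
    weight = torch.where(base_mask, torch.tensor(alpha), torch.tensor(beta))
    # Geometric mean: det^(1 - w) * vlm^w, broadcast over regions.
    return det_scores.pow(1.0 - weight) * vlm_scores.pow(weight)
```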
At test time, F-VLM uses the region proposals to crop out the top-level features of the VLM backbone and compute the VLM score per region. The trained detector head provides the detection boxes and masks, and the final detection scores are a combination of the detection and VLM scores.
Evaluation
We apply F-VLM to the popular LVIS open vocabulary detection benchmark. At the system level, the best F-VLM achieves 32.8 average precision (AP) on rare categories (APr), outperforming the state of the art by 6.5 mask APr as well as many other approaches based on knowledge distillation, pre-training, or joint training with weak supervision. F-VLM shows strong scaling behavior with the capacity of the frozen model while the number of trainable parameters stays fixed. Moreover, F-VLM generalizes and scales well to transfer detection tasks (e.g., the Objects365 and Ego4D datasets) by simply replacing the vocabulary, without fine-tuning the model. We test the LVIS-trained models on the popular Objects365 dataset and show that they can perform very well without training on in-domain detection data.
F-VLM outperforms the state of the art (SOTA) on the LVIS open vocabulary detection benchmark and on transfer object detection. On the x-axis, we show the LVIS metric mask AP on rare categories (APr) and the Objects365 (O365) metric box AP on all categories. The detector backbone sizes are: Small (R50), Base (R50x4), Large (R50x16), Huge (R50x64). The naming follows the CLIP convention.
We visualize F-VLM on the open vocabulary detection and transfer detection tasks (shown below). On LVIS and Objects365, F-VLM correctly detects both novel and common objects. A key benefit of open vocabulary detection is the ability to test on out-of-distribution data with categories given by users on the fly. See the F-VLM paper for additional visualizations on the LVIS, Objects365, and Ego4D datasets.
F-VLM open vocabulary and transfer detection. Top: open vocabulary detection on LVIS. We only show the novel categories for clarity. Bottom: transfer to the Objects365 dataset shows accurate detection of many categories. Novel categories detected: fedora, martini, pennant, football helmet (LVIS); slide (Objects365).
Training efficiency
We show in the table below that F-VLM can achieve top performance with much less computational resources. Compared to the state-of-the-art approach, F-VLM can achieve better performance with 226x fewer resources and 57x faster wall-clock time. Apart from saving training resources, F-VLM has the potential for substantial memory savings at training time by running the backbone in inference mode. The F-VLM system runs almost as fast as a standard detector at inference time, because the only addition is a single attention pooling layer on the detected region features.
method | APr | training epochs | training cost (per-core-hours) | training cost savings
SOTA | 26.3 | 460 | 8,000 | 1x
F-VLM | 32.8 | 118 | 565 | 14x
F-VLM | 31.0 | 14.7 | 71 | 113x
F-VLM | 27.7 | 7.4 | 35 | 226x
We provide additional results using the shorter Detectron2 training recipes (12 and 36 epochs) and show similarly strong performance with a frozen backbone. The default setting is marked in gray.
backbone | large-scale jitter | #epochs | batch size | APr
R50 | | 12 | 16 | 18.1
R50 | | 36 | 64 | 18.5
R50 | ✓ | 100 | 256 | 18.6
R50x64 | | 12 | 16 | 31.9
R50x64 | | 36 | 64 | 32.6
R50x64 | ✓ | 100 | 256 | 32.8
Conclusion
We present F-VLM, a simple open vocabulary detection method that harnesses the power of frozen pre-trained large vision and language models to detect novel objects. This is done without knowledge distillation, detection-specific pre-training, or weakly supervised learning. Our approach offers significant computational savings and eliminates the need for image-level labels. F-VLM achieves the new state of the art in open vocabulary detection on the LVIS benchmark at the system level and demonstrates very competitive transfer detection on other datasets. We hope this study can contribute to further research on novel object detection and help the community explore frozen VLMs for a wider range of vision tasks.
Acknowledgments
This work was conducted by Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova. We would like to thank our colleagues at Google Research for their advice and helpful discussions.