Visual question answering (VQA) is a machine learning task that requires a model to answer a question about an image or a set of images. Conventional VQA approaches need large amounts of labeled training data consisting of thousands of human-annotated question-answer pairs associated with images. In recent years, advances in large-scale pre-training have led to VQA methods that perform well with fewer than fifty training examples (few-shot) and without any human-annotated VQA training data (zero-shot). However, there is still a significant performance gap between these methods and state-of-the-art fully supervised VQA methods, such as MaMMUT and VinVL. In particular, few-shot methods struggle with spatial reasoning, counting, and multi-hop reasoning. Furthermore, few-shot methods have generally been limited to answering questions about single images.
To improve accuracy on VQA examples that involve complex reasoning, in “Modular Visual Question Answering via Code Generation,” appearing at ACL 2023, we introduce CodeVQA, a framework that answers visual questions using program synthesis. Specifically, when given a question about an image or set of images, CodeVQA generates a Python program (code) with simple visual functions that allow it to process images, and executes this program to determine the answer. We demonstrate that in the few-shot setting, CodeVQA outperforms prior work by roughly 3% on the COVR dataset and 2% on the GQA dataset.
CodeVQA
The CodeVQA approach uses a code-writing large language model (LLM), such as PaLM, to generate Python programs (code). We guide the LLM to use the visual functions properly by crafting a prompt consisting of a description of these functions and fewer than fifteen “in-context” examples of visual questions paired with their associated Python code. To select these examples, we compute embeddings for the input question and for all of the questions for which we have annotated programs (a randomly selected set of fifty). Then, we select the questions with the highest similarity to the input and use them as in-context examples. Given the prompt and the question that we want to answer, the LLM generates a Python program representing that question.
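As a concrete illustration of this retrieval step, the following is a minimal sketch (not the exact implementation) of selecting in-context examples by cosine similarity of question embeddings; embed_fn is an assumed stand-in for whatever question encoder is used.

import numpy as np

def select_in_context_examples(input_question, annotated_examples, embed_fn, k=12):
    # annotated_examples: list of (question, program) pairs with human-written programs.
    # embed_fn: assumed wrapper mapping a question string to a 1-D numpy embedding.
    q = embed_fn(input_question)
    q = q / np.linalg.norm(q)
    scored = []
    for question, program in annotated_examples:
        e = embed_fn(question)
        e = e / np.linalg.norm(e)
        scored.append((float(q @ e), question, program))
    # Keep the k most similar annotated questions as in-context examples.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(question, program) for _, question, program in scored[:k]]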
We build the CodeVQA framework using three visual functions: (1) query, (2) get_pos, and (3) find_matching_image.
query, which answers a question about a single image, is implemented with the few-shot Plug-and-Play VQA (PnP-VQA) method. PnP-VQA generates captions using BLIP, an image-captioning transformer pre-trained on millions of image-caption pairs, and feeds them to an LLM, which outputs an answer to the question.
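To make this flow concrete, here is a minimal sketch of how a query-style function could be structured; caption_fn and llm_fn are assumed stand-ins for the BLIP captioner and the LLM, not their actual APIs.

def query(image, question, caption_fn, llm_fn, num_captions=5):
    # Caption the image several times, then ask the LLM to answer the
    # question conditioned on those captions (PnP-VQA-style sketch).
    captions = [caption_fn(image) for _ in range(num_captions)]
    prompt = "Captions:\n" + "\n".join(captions)
    prompt += "\nQuestion: " + question + "\nAnswer:"
    return llm_fn(prompt).strip()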
get_pos, an object localizer that takes a description of an object as input and returns its position in the image, is implemented with GradCAM. Specifically, the description and the image are passed through the BLIP joint text-image encoder, which predicts an image-text matching score, and GradCAM takes the gradient of this score with respect to the image features to find the region most relevant to the text.
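The localization step can be sketched roughly as follows, under simplifying assumptions (patch-level image features with gradients enabled and a scalar matching score computed from them); this is illustrative rather than the actual BLIP/GradCAM code.

import torch

def get_pos(image_feats, match_score, patch_grid=(24, 24)):
    # image_feats: [num_patches, dim] tensor with requires_grad=True (assumed).
    # match_score: scalar image-text matching score computed from image_feats.
    grads = torch.autograd.grad(match_score, image_feats, retain_graph=True)[0]
    # Gradient-weighted activations give one relevance weight per patch.
    relevance = (grads * image_feats).sum(dim=-1).relu()
    idx = int(relevance.argmax())
    row, col = divmod(idx, patch_grid[1])
    return row, col  # grid position of the most relevant patch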
find_matching_image, which is used in multi-image questions to find the image that best matches a given input phrase, is implemented using the BLIP text and image encoders to compute a text embedding for the phrase and an image embedding for each image. The dot products of the text embedding with each image embedding represent the relevance of each image to the phrase, and we pick the image that maximizes this relevance.
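The dot-product selection described above can be sketched as follows; embed_text_fn and embed_image_fn are assumed wrappers around the BLIP text and image encoders rather than their actual interfaces.

import numpy as np

def find_matching_image(images, phrase, embed_text_fn, embed_image_fn):
    # Score each image by the dot product of its embedding with the phrase
    # embedding, and return the highest-scoring image.
    text_emb = embed_text_fn(phrase)
    scores = [float(np.dot(embed_image_fn(img), text_emb)) for img in images]
    return images[int(np.argmax(scores))]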
All three functions can be implemented using models that require very little annotation (e.g., text and image-text pairs collected from the web and a small number of VQA examples). Furthermore, the CodeVQA framework can be easily generalized beyond these functions to others that a user might need (e.g., object detection, image segmentation, or knowledge base retrieval).
An illustration of the CodeVQA method. First, a large language model generates a Python program (code) that invokes visual functions representing the question. In this example, a simple VQA method (query) is used to answer one part of the question, and an object localizer (get_pos) is used to find the positions of the objects mentioned. The program then produces an answer to the original question by combining the outputs of these functions.
Results
The CodeVQA framework correctly generates and executes Python programs not only for single-image questions, but also for multi-image questions. For example, given two images, each showing two pandas, one might ask, “Is it true that there are four pandas?” In this case, the LLM converts the counting question about the pair of images into a program that obtains the object count for each image (using the query function). The counts from the two images are then summed to compute a total count, which is compared to the number in the original question to yield a “yes” or “no” answer.
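A generated program for this kind of question might look like the following sketch (the file names are placeholders, and the program follows the same conventions as the example shown later in the post):

img1 = open_image("Image1.jpg")  # placeholder file name
img2 = open_image("Image2.jpg")  # placeholder file name
count1 = int(query(img1, "How many pandas are there?"))
count2 = int(query(img2, "How many pandas are there?"))
if count1 + count2 == 4:
    answer = "yes"
else:
    answer = "no"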
We evaluate CodeVQA on three visual reasoning datasets: GQA (single-image), COVR (multi-image), and NLVR2 (multi-image). For GQA, we provide 12 in-context examples to each method, and for COVR and NLVR2, we provide six in-context examples to each method. The table below shows that CodeVQA consistently improves over the baseline few-shot VQA method on all three datasets.
Method | GQA | COVR | NLVR2
Few-shot PnP-VQA | 46.56 | 49.06 | 63.37
CodeVQA | 49.03 | 54.11 | 64.04
Results on the GQA, COVR, and NLVR2 datasets, showing that CodeVQA consistently improves over few-shot PnP-VQA. The metric is exact-match accuracy, i.e., the percentage of examples in which the predicted answer exactly matches the ground-truth answer.
We find that on GQA, CodeVQA’s accuracy is roughly 30% higher than the baseline on spatial reasoning questions, 4% higher on “and” questions, and 3% higher on “or” questions. The third category includes multi-hop questions such as “Are there salt shakers or skateboards in the picture?”, for which the generated program is shown below.
img = open_image("Image13.jpg")
salt_shakers_exist = query(img, "Are there any salt shakers?")
skateboards_exist = query(img, "Are there any skateboards?")
if salt_shakers_exist == "yes" or skateboards_exist == "yes":
answer = "yes"
else:
answer = "no"
On COVR, we find that CodeVQA’s gain over the baseline is larger when the number of input images is larger, as shown in the table below. This trend suggests that breaking the problem down into single-image questions is beneficial.
Number of images
Method | 1 | 2 | 3 | 4 | 5
Few-shot PnP-VQA | 91.7 | 51.5 | 48.3 | 47.0 | 46.9
CodeVQA | 75.0 | 53.3 | 48.7 | 53.2 | 53.4
Conclusion
We present CodeVQA, a framework for few-shot visual question answering that relies on code generation to perform multi-step visual reasoning. Interesting directions for future work include expanding the set of modules used and building a similar framework for visual tasks beyond VQA. We note that care should be taken when considering whether to deploy a system such as CodeVQA, since vision-language models like the ones used in our visual functions have been shown to exhibit social biases. At the same time, compared to monolithic models, CodeVQA offers additional interpretability (through the Python program) and controllability (by modifying the prompts or the visual functions), both of which are useful in production systems.
Acknowledgments
This research was a collaboration between UC Berkeley’s Berkeley Artificial Intelligence Research (BAIR) lab and Google Research, and was conducted by Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, and Dan Klein.