Visual language is a form of communication that relies on pictorial symbols rather than text to convey information. It is everywhere in our digital lives in the form of iconography, infographics, tables, plots, and charts, and it extends into the real world through street signs, comics, food labels, and more. Helping computers better understand this type of media can aid scientific communication and discovery, accessibility, and data transparency.
Although computer vision models have made tremendous progress using learning-based solutions since the advent of ImageNet, the focus has been on natural images, where all kinds of tasks, such as classification, visual question answering (VQA), captioning, detection, and segmentation, have been defined, studied, and in some cases advanced to reach human performance. Visual language, however, has not received a similar level of attention, perhaps due to the lack of large-scale training sets in this space. In the last few years, though, new academic datasets have been created with the goal of evaluating question-answering systems on visual-language images, such as PlotQA, InfographicsVQA, and ChartQA.
An example from ChartQA. Answering the question requires reading the information and computing the sum and the difference.
Existing models built for these tasks relied on integrating optical character recognition (OCR) information and the corresponding coordinates into larger pipelines, but this process is error-prone, slow, and generalizes poorly. These methods were prevalent because existing computer vision models based on convolutional neural networks (CNNs) or transformers pre-trained on natural images could not be easily adapted to visual language. But such models are ill-prepared for the challenges of answering questions on charts, which include reading the relative heights of bars or the angles of slices in pie charts, understanding the scale of axes, correctly mapping pictograms to their legend values using colors, sizes, and textures, and finally performing numerical operations with the extracted numbers.
With these challenges in mind, we propose “MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering”. MatCha, which stands for math and charts, is a pixels-to-text foundation model (a pre-trained model with built-in inductive biases that can be fine-tuned for many applications) trained on two complementary tasks: (a) chart de-rendering and (b) math reasoning. In chart de-rendering, given a plot or chart, the image-to-text model is required to generate its underlying data table or the code used to render it. For math reasoning pre-training, we pick textual numerical reasoning datasets and render the inputs into images, which the image-to-text model needs to decode to produce the answer. We also propose “DePlot: One-shot visual language reasoning by plot-to-table translation”, a model built on top of MatCha for one-shot reasoning on charts via translation to tables. With these methods we surpass the previous state of the art on ChartQA by more than 20% and match the best summarization systems that have 1,000 times more parameters. Both papers will be presented at ACL 2023.
Chart de-rendering
Plots and charts are usually generated by an underlying data table and a piece of code. The code defines the overall layout of the figure (e.g., type, direction, color/shape scheme), and the data table determines the actual numbers and their groupings. Both the data and the code are sent to a compiler/rendering engine to create the final image. Understanding a chart requires discovering the visual patterns in the image and effectively parsing and grouping them to extract the key information. Reversing the plot-rendering process demands all of these capabilities and can thus serve as an ideal pre-training task.
A chart created from a table on the Airbus A380 Wikipedia page using randomized plotting options. The pre-training task of MatCha consists of recovering the source table or the source code from the image.
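To make the forward rendering direction concrete, below is a minimal sketch (using matplotlib purely as an illustration; the actual pre-training pairs come from GitHub notebooks, synthetic renderings of Wikipedia tables, and web-crawled charts, as described next) of a data table plus plotting code producing a chart image. Chart de-rendering asks the model to reverse this mapping.

```python
# Illustrative only: the forward process that chart de-rendering reverses.
# A data table plus plotting code are compiled into an image; the model
# must recover the table (or the code) from the image alone.
import matplotlib.pyplot as plt

# The "data table": entity names and their values (made-up numbers).
table = {"2018": 12.3, "2019": 14.1, "2020": 9.8, "2021": 15.6}

# The "code": layout choices such as chart type, orientation, and colors.
fig, ax = plt.subplots(figsize=(4, 3))
ax.bar(list(table.keys()), list(table.values()), color="steelblue")
ax.set_ylabel("Deliveries")
ax.set_title("Annual deliveries (synthetic example)")
fig.savefig("chart.png", dpi=150)

# De-rendering task: given only chart.png, generate `table` (or the code above).
```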
In practice, it is difficult to simultaneously obtain charts, their underlying data tables, and their rendering code. To collect sufficient pre-training data, we independently accumulate [chart, code] and [chart, table] pairs. For [chart, code] pairs, we crawl all GitHub IPython notebooks with appropriate licenses and extract the blocks that produce figures: a figure and the code block right before it are saved as a [chart, code] pair. For [chart, table] pairs, we explore two sources. The first source is synthetic data: we manually write code to convert web-crawled Wikipedia tables from the TaPas codebase into charts, sampling and combining several plotting options depending on the column types. In addition, we add the [chart, table] pairs generated in PlotQA to diversify the pre-training corpus. The second source is web-crawled [chart, table] pairs: we directly use the pairs crawled in the ChartQA training set, which contains about 20,000 pairs from four websites: Statista, Pew, Our World in Data, and the OECD.
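For de-rendering, the target data table has to be serialized into text for an image-to-text model to generate. Below is a hedged sketch of one plausible linearization; the exact serialization format used in MatCha's pre-training corpus may differ.

```python
# A possible way to linearize a data table into a text target for the
# chart de-rendering task (format is illustrative, not the exact one
# used during pre-training).
def linearize_table(header, rows):
    """Serialize a table row by row with explicit cell separators."""
    lines = [" | ".join(header)]
    for row in rows:
        lines.append(" | ".join(str(cell) for cell in row))
    return "\n".join(lines)

target = linearize_table(
    header=["Year", "Deliveries"],
    rows=[["2018", 12.3], ["2019", 14.1], ["2020", 9.8], ["2021", 15.6]],
)
print(target)
# Year | Deliveries
# 2018 | 12.3
# 2019 | 14.1
# ...
```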
Mathematical reasoning
We incorporate numerical reasoning knowledge into MatCha by learning math reasoning skills from textual math datasets. We use two existing textual math reasoning datasets, MATH and DROP, for pre-training. MATH is synthetically created, containing two million training examples per module (type) of question. DROP is a reading-comprehension-style QA dataset where the input is a paragraph of context and a question.
To solve the questions in DROP, the model needs to read the paragraph, extract the relevant numbers, and perform numerical computation. We found the two datasets to be complementary: MATH contains large quantities of questions across different categories, which helps us identify the math operations we need to explicitly inject into the model, while DROP's reading-comprehension format resembles the typical QA setting, where models simultaneously perform information extraction and reasoning. In practice, we render the inputs of both datasets into images, and the model is trained to decode the answer.
To improve MatCha’s math reasoning skills, we incorporate examples from MATH and DROP into the pre-training objective, rendering the input text as images.
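Below is a minimal sketch of how a textual math-reasoning example could be rendered into an image so that an image-to-text model learns to decode the answer. The question, layout, and fonts here are illustrative assumptions, not the actual rendering pipeline.

```python
# Illustrative rendering of a textual math question into a training image.
from PIL import Image, ImageDraw

question = (
    "Janet has 12 apples. She gives away 5 and buys 7 more.\n"
    "How many apples does she have now?"
)

img = Image.new("RGB", (640, 120), color="white")
draw = ImageDraw.Draw(img)
draw.multiline_text((10, 10), question, fill="black")
img.save("math_example.png")

# Pre-training pair: (math_example.png, "14") -- the model must read the
# rendered text from pixels and decode the numeric answer as text.
```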
End-to-end results
We use the Pix2Struct model backbone, an image-to-text transformer tailored for website understanding, and pre-train it with the two tasks described above. We demonstrate the strengths of MatCha by fine-tuning it on several visual language tasks, namely tasks involving charts and plots for question answering and summarization where no underlying data table can be accessed. MatCha surpasses previous models by a large margin and also outperforms the prior state of the art, which assumes access to the underlying tables.
In the figure below, we first evaluate two baseline models that incorporate information from an OCR pipeline, which until recently was the standard approach for working with charts: the first is based on T5, the second on VisionTaPas. We also compare against PaLI-17B, a large (roughly 1,000 times larger than the other models) image-plus-text-to-text transformer trained on a diverse set of tasks but with limited capability for reading text and other forms of visual language. Finally, we report results for the Pix2Struct and MatCha models.
Experimental results on two chart QA benchmarks, ChartQA and PlotQA (using relaxed accuracy), and a chart summarization benchmark, Chart-to-Text (using BLEU4). MatCha surpasses the state of the art on QA by a large margin compared to much larger models, and matches these larger models on summarization.
For the QA datasets, we use the official relaxed accuracy metric, which allows for small relative errors in numerical outputs. For chart-to-text summarization, we report BLEU scores. MatCha achieves much stronger performance than the baselines for question answering, and comparable results to PaLI in summarization, where PaLI's large size and extensive long-form text/captioning pre-training are advantageous for this kind of long-form text generation.
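For intuition, here is a sketch of the relaxed accuracy idea used for chart QA: numeric predictions may deviate from the gold answer by a small relative error (a 5% tolerance is the common choice, taken here as an assumption), while non-numeric answers require an exact match.

```python
# A sketch of a relaxed-accuracy check: small relative tolerance for
# numbers, exact (case-insensitive) match for strings.
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    try:
        pred, gold = float(prediction), float(target)
    except ValueError:
        # Non-numeric answers: fall back to exact string match.
        return prediction.strip().lower() == target.strip().lower()
    if gold == 0.0:
        return pred == gold
    return abs(pred - gold) / abs(gold) <= tolerance

print(relaxed_match("103", "100"))            # True  (3% relative error)
print(relaxed_match("110", "100"))            # False (10% relative error)
print(relaxed_match("Statista", "statista"))  # True  (exact string match)
```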
Derendering plus large language model chains
Although extremely performant for their number of parameters, particularly on extraction tasks, we observed that fine-tuned MatCha models can still struggle with end-to-end complex reasoning (e.g., mathematical operations involving large numbers or many steps). We therefore also propose a two-step method to tackle this: (1) a model reads a chart and outputs the underlying table, then (2) a large language model (LLM) reads this output and tries to answer the question based solely on the textual input.
For the first step, we fine-tuned MatCha solely on the chart-to-table task, increasing the output sequence length to guarantee that it can recover all or most of the information in the chart. The resulting model is DePlot. In the second step, any LLM (such as FlanPaLM or Codex) can perform the task, and we can rely on standard methods to improve LLM performance, such as chain-of-thought prompting and self-consistency. We also experimented with program-of-thoughts prompting, where the model generates executable Python code to offload complex computations.
An illustration of the DePlot+LLM method. This is a real example using FlanPaLM and Codex. The blue boxes are the input to the LLM and the red boxes contain the answers generated by the LLMs. Within each answer, we highlight some of the key reasoning steps.
As shown in the example above, the DePlot model combined with LLMs outperforms fine-tuned models by a significant margin, especially so on the human-sourced portion of ChartQA, where the questions are more natural but demand more complex reasoning. Furthermore, DePlot+LLM can do so without access to any training data.
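The sketch below outlines the two-step pipeline described above. The function names, prompt wording, and one-shot template are placeholders introduced here for illustration; only the overall structure (chart-to-table de-rendering followed by LLM prompting) comes from the method itself.

```python
# Hedged sketch of the two-step DePlot+LLM pipeline:
#   (1) de-render the chart into a textual table,
#   (2) prompt an LLM to answer the question from that table.
# `derender_chart` and `call_llm` are hypothetical callables standing in
# for a plot-to-table model (DePlot) and any large language model API.

PROMPT_TEMPLATE = """Read the table below and answer the question.
Think step by step, then give the final answer.

Table:
{table}

Question: {question}
Answer:"""

def answer_chart_question(chart_image, question, derender_chart, call_llm):
    table_text = derender_chart(chart_image)   # step 1: chart -> table text
    prompt = PROMPT_TEMPLATE.format(table=table_text, question=question)
    return call_llm(prompt)                    # step 2: table + question -> answer
```

Techniques such as self-consistency can be layered on top by sampling several LLM answers from the same prompt and taking a majority vote.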
We have released the new models and code in our GitHub repo, where you can try them out yourself in Colab. Check out the MatCha and DePlot papers for more details on the experimental results. We hope our results can benefit the research community and make the information in charts and plots more accessible to everyone.
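As a quick-start sketch for trying a fine-tuned MatCha checkpoint on a chart question, the snippet below uses the Hugging Face transformers Pix2Struct classes. The checkpoint name "google/matcha-chartqa" is an assumption; consult the repository and Colab for the exact released artifacts.

```python
# Sketch: chart QA inference with a MatCha checkpoint via transformers.
# Checkpoint name below is assumed; verify against the released models.
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

processor = Pix2StructProcessor.from_pretrained("google/matcha-chartqa")
model = Pix2StructForConditionalGeneration.from_pretrained("google/matcha-chartqa")

image = Image.open("chart.png")
question = "Which year has the highest value?"

inputs = processor(images=image, text=question, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(outputs[0], skip_special_tokens=True))
```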
Acknowledgments
This work was carried out by Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen and Yasemin Altun from our language team as part of Fangyu's internship project. Nigel Collier from Cambridge was also a collaborator. We would like to thank Joshua Howland, Alex Polozov, Shrestha Basu Mallick, Massimo Nicosia, and William Cohen for their valuable comments and suggestions.