Imagine sitting on a park bench and watching someone stroll by. While the scene may constantly change as the person walks, the human brain can transform that dynamic visual information into a more stable representation over time. This ability, known as perceptual straightening, helps us predict the walking person's trajectory.
Unlike humans, computer vision models don't typically exhibit perceptual straightening, so they learn to represent visual information in a highly unpredictable way. But if machine-learning models had this ability, it might enable them to better estimate how objects or people will move.
MIT researchers have discovered that a specific training method can help computer vision models learn more perceptually straight representations, the way humans do. Training involves showing a machine-learning model millions of examples so it can learn a task.
The researchers found that training computer vision models using a technique called adversarial training, which makes them less reactive to tiny perturbations added to images, improves the models' perceptual straightness.
The team also discovered that perceptual straightness is affected by the task one trains a model to perform. Models trained to perform abstract tasks, like classifying images, learn more perceptually straight representations than those trained to perform more fine-grained tasks, like assigning every pixel in an image to a category.
For example, a model's internal activations might represent "dog," which allows the model to detect a dog whenever it sees any image of a dog. Perceptually straight representations maintain a more stable "dog" representation when there are small changes in the image. This makes them more robust.
By gaining a better understanding of perceptual straightening in computer vision, the researchers hope to uncover insights that could help them build models that make more accurate predictions. For instance, this property might improve the safety of autonomous vehicles that use computer vision models to predict the trajectories of pedestrians, cyclists, and other vehicles.
"One of the take-home messages here is that taking inspiration from biological systems, such as human vision, can give you insight about why certain things work the way they do, and also inspire ideas to improve neural networks," says Vasha DuTell, an MIT postdoc and co-author of a paper exploring perceptual straightening in computer vision.
Joining DuTell on the paper are lead author Anne Harrington, a graduate student in the Department of Electrical Engineering and Computer Science (EECS); Ayush Tewari, a postdoc; Mark Hamilton, a graduate student; Simon Stent, research manager at Woven Planet; Ruth Rosenholtz, principal research scientist in the Department of Brain and Cognitive Sciences and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author William T. Freeman, the Thomas and Gerd Perkins Professor of Electrical Engineering and Computer Science and a member of CSAIL. The research is being presented at the International Conference on Learning Representations.
Studying straightening
After reading a 2019 paper from a team of New York University researchers about perceptual straightening in humans, DuTell, Harrington, and their colleagues wondered whether that property might be present in computer vision models, too.
They set out to determine whether different types of computer vision models straighten the visual representations they learn. They fed each model frames of a video and then examined its representation at different stages of processing.
If the model's representation changes in a predictable way across the frames of the video, that model is straightening. At the end, its output representation should be more stable than the input representation.
"You can think of the representation as a line, which starts off very curvy. A model that straightens can take that curvy line from the video and smooth it out through its processing steps," DuTell explains.
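The article doesn't spell out the metric, but in the straightening literature a trajectory's curviness is commonly measured as the average angle between successive steps of the representation path. Here is a minimal NumPy sketch of that idea; the function name and the `frames`/`activations` variables are illustrative, not from the paper:

```python
import numpy as np

def mean_curvature(trajectory):
    """Average turning angle (in radians) along a trajectory of per-frame
    representations. Lower values mean a straighter trajectory."""
    trajectory = np.asarray(trajectory, dtype=np.float64)
    diffs = np.diff(trajectory, axis=0)                    # steps between frames
    diffs /= np.linalg.norm(diffs, axis=1, keepdims=True)  # unit-length steps
    cosines = np.sum(diffs[:-1] * diffs[1:], axis=1)       # cos of each turning angle
    return float(np.mean(np.arccos(np.clip(cosines, -1.0, 1.0))))

# A model "straightens" if its output trajectory is straighter (lower
# curvature) than the trajectory traced by the raw pixels themselves.
# `frames` and `activations` stand in for real per-frame data.
pixel_curvature = mean_curvature(frames.reshape(len(frames), -1))
model_curvature = mean_curvature(activations)
print("straightening:", pixel_curvature - model_curvature)  # positive = straightened
```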
Most of the models they tested didn't straighten. Of the few that did, those that straightened most effectively had been trained for classification tasks using a technique known as adversarial training.
Adversarial training involves subtly modifying images by slightly altering each pixel. While a human wouldn't notice the difference, these minor changes can fool a machine so it misclassifies the image. Adversarial training makes the model more robust, so it won't be fooled by these manipulations.
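The article doesn't name the exact adversarial scheme the team used; one common minimal recipe is FGSM-style training, sketched below in PyTorch under that assumption (`model`, `optimizer`, and the perturbation budget `eps` are placeholders):

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, images, labels, eps=4/255):
    """One FGSM-style adversarial training step: nudge every pixel slightly
    in the direction that most increases the loss, then train the model on
    the perturbed images so these manipulations stop fooling it."""
    # Gradient of the loss with respect to the input pixels.
    images = images.detach().clone().requires_grad_(True)
    F.cross_entropy(model(images), labels).backward()

    # Imperceptible perturbation: shift each pixel by +/- eps.
    adv_images = (images + eps * images.grad.sign()).clamp(0, 1).detach()

    # Standard training step, but on the adversarial images.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(adv_images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```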
Because adversarial training teaches the model to be less reactive to slight changes in images, it helps the model learn a representation that is more predictable over time, Harrington explains.
"People have already had this idea that adversarial training might help you make your model more human-like, and it was interesting to see that carry over to another property that people hadn't tested before," she says.
But the researchers found that adversarially trained models only learn to straighten when they are trained for broad tasks, like classifying entire images into categories. Models tasked with segmentation, labeling every pixel in an image as a certain class, did not straighten, even when they were trained adversarially.
Consistent classification
The researchers tested these image classification models by showing them videos. They found that the models which learned more perceptually straight representations tended to classify objects in the videos more consistently.
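One simple way to quantify that consistency (not necessarily the paper's exact protocol) is to count how often a model's predicted label flips between consecutive frames of a video:

```python
import torch

@torch.no_grad()
def label_flip_rate(model, video_frames):
    """Fraction of consecutive frame pairs whose predicted class differs.
    Lower means the model classifies the video more consistently.

    video_frames: tensor of shape (num_frames, channels, height, width).
    """
    preds = model(video_frames).argmax(dim=1)  # one label per frame
    return (preds[1:] != preds[:-1]).float().mean().item()
```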
"To me, it is amazing that these adversarially trained models, which have never even seen a video and have never been trained on temporal data, still show some amount of straightening," says DuTell.
The researchers don't know exactly what it is about the adversarial training process that enables a computer vision model to straighten, but their results suggest that stronger training schemes cause the models to straighten more, she explains.
Building off this work, the researchers want to use what they've learned to create new training schemes that would explicitly give a model this property. They also want to dig deeper into adversarial training to understand why the process helps a model straighten.
"From a biological standpoint, adversarial training doesn't necessarily make sense. It's not how humans understand the world. There are still a lot of questions about why this training process seems to help models act more like humans," says Harrington.
"Understanding the representations learned by deep neural networks is critical for improving properties such as robustness and generalization," says Bill Lotter, an assistant professor at the Dana-Farber Cancer Institute and Harvard Medical School, who was not involved with this research. "Harrington et al. perform an extensive evaluation of how the representations of computer vision models change over time when processing natural videos, showing that the curvature of these trajectories varies widely depending on model architecture, training properties, and task. These findings can inform the development of improved models and also offer insights into biological visual processing."
"The paper confirms that straightening natural videos is a fairly unique property displayed by the human visual system. Only adversarially trained networks display it, which provides an interesting connection with another signature of human perception: its robustness to various image transformations, whether natural or artificial," says Olivier Hénaff, a research scientist at DeepMind who was not involved with this research. "That even scene segmentation models do not straighten their inputs raises important questions for future work: Do humans parse natural scenes in the same way as computer vision models? How to represent and predict the trajectories of objects in motion while remaining sensitive to their spatial detail? In connecting the straightening hypothesis with other aspects of visual behavior, the paper lays the groundwork for more unified theories of perception."
The research is funded in part by the Toyota Research Institute, the MIT CSAIL METEOR Fellowship, the National Science Foundation, the US Air Force Research Laboratory, and the US Air Force Artificial Intelligence Accelerator.