Humans have a remarkable ability to take in vast amounts of information (roughly 10^10 bits/s entering the retina) and selectively attend to a few task-relevant and interesting regions for further processing (e.g., memory, comprehension, action). Modeling human attention (the result of which is often called a saliency model) has therefore been of interest in neuroscience, psychology, human-computer interaction (HCI), and computer vision. The ability to predict which regions are likely to attract attention has many important applications in fields such as graphics, photography, image compression and processing, and the measurement of visual quality.
We previously discussed the possibility of accelerating eye movement research using machine learning and smartphone-based gaze estimation, which earlier required specialized hardware costing up to $30,000 per unit. Related research includes “Look to Speak”, which helps users with accessibility needs (e.g., people with ALS) communicate with their eyes, and the recently published differentially private heatmaps technique for computing heatmaps, such as those for attention, while protecting users’ privacy.
In this blog post, we present two papers (one from CVPR 2022 and one recently accepted to CVPR 2023) that highlight our recent research on modeling human attention: “Deep Saliency Prior for Reducing Visual Distraction” and “Learning from Unique Perspectives: User-aware Saliency Modeling”, together with recent work on saliency-driven progressive loading for image compression (1, 2). We show how predictive models of human attention can improve user experiences, for example through image editing that reduces visual clutter, distraction, or artifacts, through image compression that loads web pages or apps faster, and by guiding ML models toward more intuitive, human-like interpretation and better performance. We focus on image editing and image compression, and discuss recent advances in modeling in the context of these applications.
Attention-guided image editing
Human attention models typically take an image as input (e.g., a natural image or a screenshot of a webpage) and predict a heatmap as output. The predicted heatmap is evaluated against ground-truth attention data, which are typically collected with an eye tracker or approximated via mouse movements and clicks. Earlier models relied on hand-crafted features for visual cues such as color/luminance contrast, edges, and shape, while more recent approaches automatically learn discriminative features with deep neural networks, from convolutional and recurrent neural networks to the latest vision transformers.
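As a concrete illustration, the sketch below (Python/NumPy) shows two metrics commonly used in the saliency literature to score a predicted heatmap against ground-truth attention data: the Pearson correlation coefficient and KL divergence. The function names and normalization choices are illustrative, not the exact evaluation code used in the papers discussed here.

```python
# Minimal sketch of scoring a predicted saliency heatmap against ground truth.
import numpy as np

def normalize(map_2d):
    """Scale a heatmap to zero mean and unit variance."""
    m = map_2d.astype(np.float64)
    return (m - m.mean()) / (m.std() + 1e-8)

def correlation_coefficient(pred, gt):
    """Pearson correlation between predicted and ground-truth heatmaps."""
    p, g = normalize(pred).ravel(), normalize(gt).ravel()
    return float(np.dot(p, g) / len(p))

def kl_divergence(pred, gt, eps=1e-8):
    """KL divergence, treating both heatmaps as probability distributions."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float(np.sum(g * np.log(g / (p + eps) + eps)))

# Example with stand-in arrays in place of a model prediction and fixation data.
pred_heatmap = np.random.rand(240, 320)
gt_heatmap = np.random.rand(240, 320)
print(correlation_coefficient(pred_heatmap, gt_heatmap))
print(kl_divergence(pred_heatmap, gt_heatmap))
```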
In “Deep Saliency Prior for Reducing Visual Distraction” (more information on this project site), we leverage deep saliency models for dramatic yet visually realistic edits that can significantly change a viewer’s attention to different regions of an image. For example, removing distracting objects in the background can reduce clutter in photos, leading to increased user satisfaction. Similarly, in video conferencing, reducing background clutter can increase focus on the main speaker (example demo).
To investigate what types of editing effects can be achieved and how they affect viewer attention, we developed an optimization framework for guiding visual attention in images using a differentiable, predictive saliency model. Our method uses a state-of-the-art deep saliency model. Given an input image and a binary mask marking the distractor regions, the pixels inside the mask are edited under the guidance of the predictive saliency model so that the saliency within the masked region is reduced. To make sure the edited image remains natural and realistic, we carefully choose four image editing operators: two standard image editing operations, namely recolorization and image warping (shift), and two learned operators for which we do not define the editing operation explicitly, namely a multi-layer convolution filter and a generative model (GAN).
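To make the setup concrete, here is a minimal PyTorch-style sketch of the optimization loop using the simplest kind of operator, a per-pixel recolor offset applied inside the mask. The operator, loss weights, and function names are illustrative simplifications under these assumptions, not our exact implementation.

```python
# Sketch: optimize an edit inside the masked region so that a frozen,
# differentiable saliency model predicts lower saliency there.
import torch

def reduce_distractor_saliency(image, mask, saliency_model, steps=200, lr=0.05):
    """image: (1, 3, H, W) in [0, 1]; mask: (1, 1, H, W), 1 inside the distractor."""
    saliency_model.eval()
    for p in saliency_model.parameters():
        p.requires_grad_(False)                            # keep the saliency prior frozen

    delta = torch.zeros_like(image, requires_grad=True)    # recolor offsets to learn
    optimizer = torch.optim.Adam([delta], lr=lr)

    for _ in range(steps):
        edited = (image + delta * mask).clamp(0.0, 1.0)    # only edit inside the mask
        saliency = saliency_model(edited)                  # (1, 1, H, W) heatmap
        loss = (saliency * mask).mean()                    # push saliency down in the mask
        loss = loss + 0.1 * delta.abs().mean()             # keep the edit subtle
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    return (image + delta.detach() * mask).clamp(0.0, 1.0)
```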
With these operators, our framework can produce a variety of powerful effects, with examples in the figure below, including recoloring, inpainting, camouflaging, editing or inserting objects, and editing facial attributes. Importantly, all of these effects are driven solely by a single, pre-trained saliency model, without any additional supervision or training. Note that our goal is not to compete with dedicated methods for producing each effect, but rather to demonstrate how multiple editing operations can be guided by the knowledge embedded in deep saliency models.
Examples of reducing visual distractions, guided by the saliency model with several operators. The distractor region is highlighted on top of the saliency map (red border) in each example.
Enriching experiences with user-aware saliency modeling
Previous research assumes a single saliency model for the whole population. However, human attention varies between individuals: although the detection of salient cues is fairly consistent, their order, interpretation, and gaze distributions can differ substantially. This offers opportunities to create personalized user experiences for individuals or groups. In “Learning from Unique Perspectives: User-aware Saliency Modeling”, we introduce a user-aware saliency model, the first that can predict the attention of a single user, a group of users, and the general population with a single model.
As shown in the figure below, the core of the model is the combination of each participant’s visual preferences with a per-user attention map and user-adaptive masks. This requires per-user attention annotations to be available in the training data, e.g., the OSIE mobile gaze dataset for natural images and the FiWI and WebSaliency datasets for web pages. Instead of predicting a single saliency map representing the attention of all users, the model predicts per-user attention maps to encode individuals’ attention patterns. In addition, the model takes a user mask (a binary vector whose size equals the number of participants) indicating which participants are present in the current sample, which makes it possible to select a group of participants and combine their preferences into a single heatmap.
Overview of the user-aware saliency modeling framework. The example image is from the OSIE image set.
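As a rough sketch of the idea described above, per-user attention maps can be combined with a binary user mask into a single group heatmap. The shapes and the simple masked average below are illustrative; in the actual model, the per-user maps and the aggregation are learned jointly rather than computed with a fixed rule.

```python
# Sketch: combine per-user attention maps for a selected group of participants.
import numpy as np

def group_heatmap(per_user_maps, user_mask):
    """
    per_user_maps: (num_users, H, W) predicted attention, one map per participant.
    user_mask: (num_users,) binary vector selecting the participants in the group.
    Returns a single (H, W) heatmap for the selected group.
    """
    mask = user_mask.astype(np.float64)[:, None, None]
    combined = (per_user_maps * mask).sum(axis=0) / (mask.sum() + 1e-8)
    return combined / (combined.max() + 1e-8)   # normalize for visualization

# Example: combine users 0 and 2 out of 3 participants.
maps = np.random.rand(3, 240, 320)
print(group_heatmap(maps, np.array([1, 0, 1])).shape)   # (240, 320)
```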
In effect, the user mask enables predictions for any combination of participants. In the figure below, the first two rows show predicted attention for two different groups of participants (three people in each group) on an image. A conventional attention model would predict identical attention heatmaps; our model can distinguish the two groups (e.g., the second group pays less attention to the faces and more attention to the food than the first). Similarly, the last two rows show predictions for two different participants on a webpage, where our model captures their different preferences (e.g., the second participant pays more attention to the left region than the first).
Predicted attention versus ground truth (GT). EML-NET: predictions from a state-of-the-art model, which gives identical predictions for the two participants/groups. Ours: predictions from our user-aware model, which correctly captures the unique preferences of each participant/group. The first image is from the OSIE image set and the second is from FiWI.
Progressive image decoding centered on salient features
In addition to image editing, human attention models can also improve users’ browsing experience. One of the most frustrating user experiences while browsing is waiting for web pages with images to load, especially under poor network conditions. One way to improve the user experience in such cases is progressive image decoding, which decodes and displays increasingly higher-resolution image sections as data is downloaded, until the full-resolution image is ready. Progressive decoding usually proceeds in a sequential order (e.g., left to right, top to bottom). With a predictive attention model (1, 2), we can instead decode images based on saliency, making it possible to send the data needed to display details of the most salient regions first. For example, in a portrait, bytes for the face can be prioritized over bytes for the out-of-focus background. Consequently, users perceive better image quality earlier and experience significantly reduced latency. More details can be found in our open source blog posts (post 1, post 2). Thus, predictive attention models can help with image compression and faster loading of web pages with images, and can improve rendering for large images and streaming/VR applications.
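To illustrate the ordering idea in isolation, the sketch below splits an image into tiles and ranks them by average saliency so that the most salient regions would be refined first. The tile size and ranking rule are assumptions made for illustration; the actual approach integrates saliency into the codec bitstream (see the linked posts) rather than reordering whole tiles.

```python
# Sketch: rank image tiles by mean saliency for saliency-driven transmission order.
import numpy as np

def tile_order_by_saliency(saliency_map, tile=32):
    """Return (row, col) tile indices sorted by decreasing mean saliency."""
    h, w = saliency_map.shape
    scores = []
    for r in range(0, h, tile):
        for c in range(0, w, tile):
            scores.append(((r // tile, c // tile),
                           saliency_map[r:r + tile, c:c + tile].mean()))
    return [idx for idx, _ in sorted(scores, key=lambda x: -x[1])]

# Example: the most salient 32x32 tiles of a 256x256 saliency map come first.
order = tile_order_by_saliency(np.random.rand(256, 256))
print(order[:5])
```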
Conclusion
We have shown how predictive models of human attention can enable delightful user experiences via applications such as image editing, which can reduce clutter, distractions, or artifacts in images or photos, and progressive image decoding, which can significantly reduce the time users spend waiting for images to fully render. Our user-aware saliency model can further personalize the above applications for individual users or groups, enabling richer and more unique experiences.
Another interesting direction for predictive attention models is whether they can help improve the robustness of computer vision models on tasks such as object classification or detection. For example, in “Teacher-generated spatial-attention labels boost robustness and accuracy of contrastive models”, we show that a predictive model of human attention can guide contrastive learning models to achieve better performance and improved accuracy/robustness on classification tasks (on the ImageNet and ImageNet-C datasets). Further research in this direction could enable applications such as using radiologists’ attention on medical images to improve health screening or diagnosis, or using human attention to guide autonomous driving systems in complex scenarios.
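As a hedged sketch of one plausible way to use such signals (not the cited paper’s exact method), a contrastive encoder can be regularized with an auxiliary loss that encourages its spatial attention to match a teacher-generated attention map. All names below are illustrative assumptions.

```python
# Sketch: auxiliary loss aligning a student's spatial attention with a teacher
# attention heatmap, added alongside a standard contrastive loss.
import torch
import torch.nn.functional as F

def attention_consistency_loss(student_features, teacher_attention):
    """
    student_features: (B, C, h, w) spatial features from the contrastive encoder.
    teacher_attention: (B, 1, H, W) attention heatmaps from a saliency teacher.
    """
    # Derive a coarse spatial attention map from the student's feature grid.
    student_attn = student_features.pow(2).mean(dim=1, keepdim=True)   # (B, 1, h, w)
    student_attn = F.interpolate(student_attn, size=teacher_attention.shape[-2:],
                                 mode="bilinear", align_corners=False)
    # Normalize both maps to distributions and compare them with KL divergence.
    s = F.log_softmax(student_attn.flatten(1), dim=1)
    t = F.softmax(teacher_attention.flatten(1), dim=1)
    return F.kl_div(s, t, reduction="batchmean")
```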
Acknowledgments
This work involved a collaborative effort from a multidisciplinary team of software engineers, researchers, and cross-functional contributors. We would like to thank all co-authors of the papers/research, including Kfir Aberman, Gamaleldin F. Elsayed, Moritz Firsching, Shi Chen, Nachiappan Valliappan, Yushi Yao, Chang Ye, Yossi Gandelsman, Inbar Mosseri, David E. Jacobs, Yael Pritch, Shaolei Shen, and Xinyu Ye. We would also like to thank team members Oscar Ramirez, Venky Ramachandran, and Tim Fujita for their help. Finally, we thank Vidhya Navalpakkam for her technical leadership in initiating and overseeing this work.