8 Papers of the Key Laboratory of Intelligent Algorithm Security (Chinese Academy of Sciences) were Accepted by CVPR 2024

Date: May 22, 2024
CVPR, the IEEE/CVF Conference on Computer Vision and Pattern Recognition, is the premier conference in computer vision and pattern recognition; the 2024 edition will be held in Seattle, USA, from June 17 to 21, 2024. The Key Laboratory of Intelligent Algorithm Security (Chinese Academy of Sciences) has had 8 papers accepted by CVPR 2024.

1.HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention
Authors: Xiaolong Tang, Meina Kan, Shiguang Shan, Zhilong Ji, Jinfeng Bai, Xilin Chen
Content Introduction: Predicting the trajectories of road participants is critical to autonomous driving systems. Recent mainstream approaches follow a static paradigm, using a fixed number of historical frames to predict future trajectories. However, the predictions these methods make at adjacent timestamps are independent of each other, which can cause instability and temporal inconsistency. Since predictions at successive timestamps share a large number of overlapping historical frames, they should be intrinsically correlated: the overlapping portions of the predicted trajectories should be consistent, or vary with the road situation while maintaining the same motion goal. Based on this observation, we propose HPNet, a new dynamic trajectory prediction method. To achieve stable and accurate trajectory prediction, our method exploits not only historical frames (including map and agent-state information) but also historical predictions. Specifically, we design a Historical Prediction Attention module to automatically encode the dynamic relationships between successive predictions; benefiting from historical predictions, it also extends attention beyond the currently visible window. We combine the proposed Historical Prediction Attention with Agent Attention and Mode Attention to form the Triple Factorized Attention module, the core design of HPNet. Experiments on the Argoverse motion forecasting benchmarks show that HPNet achieves state-of-the-art performance and generates accurate and stable future trajectories.
  • Figure 1: Schematic diagram of the HPNet trajectory prediction model
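
The factorization described above, attending along the mode, agent, and historical-prediction axes in turn, can be sketched in a few lines of NumPy. This is a minimal illustration of axis-factorized attention, not HPNet's implementation; the tensor layout `(steps, agents, modes, features)` and the single-head, projection-free attention are simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention over the second-to-last axis
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def triple_factorized_attention(x):
    # x: (T, A, M, D) = historical prediction steps, agents, modes, features.
    # Apply attention along one axis at a time by moving it into the
    # sequence position, instead of full joint attention over T*A*M tokens.
    x = attention(x, x, x)                      # mode attention, over M
    y = np.swapaxes(x, 1, 2)                    # (T, M, A, D)
    y = attention(y, y, y)                      # agent attention, over A
    x = np.swapaxes(y, 1, 2)                    # back to (T, A, M, D)
    z = np.transpose(x, (1, 2, 0, 3))           # (A, M, T, D)
    z = attention(z, z, z)                      # historical prediction attention, over T
    return np.transpose(z, (2, 0, 1, 3))        # back to (T, A, M, D)
```

Factorizing the three axes keeps the cost linear in each axis length rather than quadratic in their product, which is why this layout is practical for dense multi-agent, multi-mode prediction.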

2.Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness
Authors: Sibo Wang, Jie Zhang, Zheng Yuan, Shiguang Shan
Content Introduction: In recent years, large-scale pre-trained vision-language models such as CLIP have shown superior performance on a variety of tasks and remarkable zero-shot generalization ability, yet they remain susceptible to imperceptible adversarial examples. As ever-larger models are deployed in safety-critical downstream tasks, enhancing their robustness becomes essential. Current research focuses on improving accuracy, with comparatively little work on robustness; existing work typically uses adversarial training (fine-tuning) as a defense against adversarial examples. However, applying it directly to CLIP can lead to overfitting and impair the model's generalization ability. Inspired by the strong generalization of the original pre-trained model, we propose Pre-trained Model Guided Adversarial Fine-Tuning (PMG-AFT), which leverages supervision from the original pre-trained model through an auxiliary branch to enhance the model's zero-shot adversarial robustness. Specifically, PMG-AFT minimizes the distance between the features of adversarial examples in the target model and their features in the pre-trained model, aiming to retain the generalizable features already captured by the pre-trained model. Extensive experiments on 15 zero-shot datasets show that PMG-AFT significantly outperforms state-of-the-art methods, improving top-1 robust accuracy by 4.99% on average; our method simultaneously improves clean-sample accuracy by 8.72% on average.

  • Figure 2: Schematic diagram of the pre-trained model guided adversarial fine-tuning (PMG-AFT) method
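
The core idea, augmenting the adversarial fine-tuning loss with a feature-distance term against the frozen pre-trained model, can be sketched as follows. This is a minimal NumPy illustration under assumed names (`pmg_aft_loss`, the weight `alpha`, squared-L2 distance), not the paper's exact formulation.

```python
import numpy as np

def pmg_aft_loss(ce_adv, feat_target_adv, feat_pretrained_adv, alpha=1.0):
    """Illustrative PMG-AFT-style objective.

    ce_adv:              adversarial classification loss of the model being fine-tuned
    feat_target_adv:     (N, D) features of adversarial examples under the fine-tuned model
    feat_pretrained_adv: (N, D) features of the same examples under the frozen pre-trained model
    """
    # Auxiliary-branch term: pull the fine-tuned model's adversarial features
    # back toward the frozen pre-trained model's features, so the generalizable
    # representations of the pre-trained model are retained.
    dist = np.mean(np.sum((feat_target_adv - feat_pretrained_adv) ** 2, axis=1))
    return ce_adv + alpha * dist
```

When the fine-tuned features drift away from the pre-trained ones, the second term grows, counteracting the overfitting that plain adversarial fine-tuning of CLIP can cause.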

3.ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations
Authors: Yuanhang Zhang, Shuang Yang, Shiguang Shan, Xilin Chen
Content Introduction: In recent years, audio-visual speech representation learning has attracted much attention for its applications in lip reading, multi-modal speech recognition, speech enhancement, and more. Most current methods (such as AV-HuBERT) rely solely on guidance from the audio modality and focus on learning information shared between audio and vision. Considering the inherent asymmetry between the two modalities, this paper proposes ES3, a new self-supervised learning strategy for robust audio-visual speech representations, which revisits the task as learning modality-shared, modality-specific (unique), and synergistic-gain information, progressively building robust audio/visual unimodal representations and audio-visual joint representations. Specifically, we first learn audio-specific information, then continue with visual-specific (i.e., lip) information to form a preliminary audio-visual joint representation, and finally maximize the total audio-visual information, including the synergistic-gain information. We implement this strategy with a simple Siamese network and verify its effectiveness on two English datasets (LRS2-BBC, LRS3-TED) and our newly collected large-scale Chinese sentence-level dataset CAS-VSR-S101; on LRS2-BBC in particular, our smallest self-supervised model is effective with only 1/2 the parameters and 1/8 the unlabeled data (223 hours).
  • Figure 3: Schematic diagram of the ES3 self-supervised learning strategy

4.Video Harmonization with Triplet Spatio-Temporal Variation Patterns
Authors: Zonghui Guo, Xinyu Han, Jie Zhang, Shiguang Shan, Haiyong Zheng
Content Introduction: Video harmonization is an important and challenging visual task. It aims to produce visually realistic composite videos by automatically adjusting the appearance of the composite foreground to align with the background. Inspired by the gradual short-term and long-term frame adjustments in manual harmonization, we propose a video triplet Transformer framework that models three spatio-temporal variation patterns in video, namely short-term spatial, long-term global, and long-term dynamic, for video-to-video tasks such as video harmonization. Specifically, for short-term harmonization, we exploit the subtle changes between adjacent frames to adjust the foreground appearance so that it aligns spatially with the background; for long-term harmonization, we not only exploit global appearance variation to enhance the temporal consistency of the video, but also dynamically align the appearance of similar contexts under motion offsets. Extensive experiments demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on video harmonization, video enhancement, and video demoiréing tasks. We also propose a temporal consistency metric to better evaluate harmonized videos.

  • Figure 4: Schematic diagram of the video triplet Transformer framework

5.Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation
Authors: Zihan Wang, Xiangyang Li, Jiahao Yang, Yeqi Liu, Junjie Hu, Ming Jiang, Shuqiang Jiang
Content Introduction: Vision-Language Navigation (VLN) requires an agent to navigate to a target location in a 3D environment following a given natural-language instruction. Previous VLN methods make only single-step action predictions and lack long-term motion planning ability. To achieve more accurate navigation path planning, this paper proposes a lookahead exploration strategy based on neural radiance fields, which renders future environments for more accurate action prediction. Unlike previous work that predicts RGB images of unseen environments, which suffers from prediction distortion and high computational cost, our method trains a hierarchical neural radiance representation model at scale and predicts visual representations of future environments in a 3D feature space, which is more efficient and more robust than pixel-level image prediction. Furthermore, using the predicted future environment representations, the proposed lookahead navigation model constructs a navigable future path tree and selects the optimal navigation path by evaluating the action value of each branch in parallel. We verify the effectiveness of the proposed method on multiple continuous-environment VLN datasets.

  • Figure 5: Schematic diagram of the lookahead navigation model

6.A Category Agnostic Model for Visual Rearrangement
Authors: Yuyi Liu, Xinhang Song, Weijie Li, Xiaohan Wang, Shuqiang Jiang
Content Introduction: This paper proposes a category-agnostic model for the visual rearrangement task, which helps an embodied agent restore a scene from a shuffled state to a target state without relying on any category concepts. Existing methods often follow a similar framework, performing rearrangement by matching the semantic scene graphs of the target and shuffled environments. However, constructing scene graphs requires inferring category labels, which not only reduces accuracy on the whole task but also limits applicability in real-world scenarios. Therefore, this paper examines the essence of rearrangement and focuses on its two most fundamental problems: scene change detection and scene change matching. We leverage the movement and prominence of point clouds to accurately identify scene changes and match them based on the similarity of category-agnostic appearance features. Furthermore, to help the agent explore the environment more efficiently and comprehensively, we propose a closer-aligned exploration strategy that moves the agent closer to observe more details of the scene. We conduct experiments on the AI2THOR Rearrangement Challenge based on the RoomR dataset and on MrMiR, a new multi-room multi-instance dataset we collected ourselves. The experimental results fully demonstrate the effectiveness of our proposed method.

  • Figure 6: Schematic diagram of scene change detection and scene change matching

7.Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation
Authors: Sixian Zhang, Xinyao Yu, Xinhang Song, Xiaohan Wang, Shuqiang Jiang
Content Introduction: The object-goal navigation task requires an agent to navigate to a specified target object in an unseen environment. Since the environment layout is unknown, the agent must infer unobserved context objects from partial observations to reason about the likely location of the target object. Previous end-to-end reinforcement learning methods learn these contextual relations through implicit representations, but these lack geometric relations and limit generalization. Modular approaches, on the other hand, build local semantic maps of the observed navigation environment, which contain the observed geometric relations; however, they limit exploration efficiency because they lack reasoning about contextual relations. In this work, we present the Self-Supervised Generative Map (SGM), a modular method that explicitly learns contextual object relations through self-supervised learning. SGM is trained to reconstruct the masked pixels of a cropped global map using episodic observations and general knowledge. During navigation, the agent maintains an incomplete local semantic map, and the unknown regions of the local map are generated by the pre-trained SGM. Based on the expanded local map, the agent sets the predicted position of the target object as the goal and moves toward it. We validate our method with experiments on Gibson, MP3D, and HM3D.

  • Figure 7: Schematic diagram of the SGM model
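
The navigation-time use of SGM described above, completing the partial map and then heading for the predicted target location, can be sketched as follows. A minimal NumPy illustration: `generator` is a stand-in for the pre-trained SGM, and the map layout `(H, W, classes)` is an assumption for the sketch.

```python
import numpy as np

def expand_local_map(local_map, observed_mask, generator):
    """Fill the unobserved cells of a local semantic map with a generative model.

    local_map:     (H, W, C) semantic map; C per-class channels
    observed_mask: (H, W) boolean mask of observed cells
    generator:     callable standing in for the pre-trained SGM
    """
    pred = generator(local_map, observed_mask)        # generated full map, (H, W, C)
    completed = local_map.copy()
    completed[~observed_mask] = pred[~observed_mask]  # keep observed cells as-is
    return completed

def select_goal(completed_map, target_channel):
    # The cell with the highest predicted score for the target class becomes the goal.
    scores = completed_map[..., target_channel]
    return np.unravel_index(np.argmax(scores), scores.shape)
```

Keeping observed cells untouched and only generating the unknown region mirrors the described pipeline: geometry comes from observation where available, context-driven imagination fills in the rest.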

8.CLIP-KD: An Empirical Study of CLIP Model Distillation
Authors: Chuanguang Yang, Zhulin An, Libo Huang, Junyu Bi, Xinqiang Yu, Han Yang, Boyu Diao, Yongjun Xu
Content Introduction: The contrastive language-image pre-training model CLIP has become a widely used multimodal foundation model, showing strong zero-shot generalization ability and improving performance on downstream visual recognition tasks. Existing CLIP models are usually large: they perform well but their high complexity makes them difficult to apply in resource-constrained scenarios. Small CLIP models have low complexity, but their performance is poor and often fails to meet task accuracy requirements. To address this problem, this study proposes CLIP knowledge distillation, which transfers knowledge from a large model to a small one so that the small model significantly improves in performance without any change in complexity. We explore the forms of CLIP knowledge in terms of relations, features, gradients, and contrastive patterns, and find that feature alignment and interactive contrastive learning significantly improve the effect of CLIP knowledge distillation. The underlying reason is that effective CLIP distillation methods increase the feature similarity between the teacher and student models. On zero-shot ImageNet classification, we show that a ViT-L/14 teacher pre-trained on the large-scale Laion-400M dataset improves the classification accuracy of a ViT-B/16 student trained on the CC3M+12M dataset by 20.5%.
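
The two distillation signals found most effective, feature alignment and interactive contrastive learning, can be sketched as follows. This is an illustrative NumPy version under assumed names and an assumed softmax temperature of 0.07; it is not the paper's implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def feature_alignment_loss(student_feats, teacher_feats):
    # Mean-squared error between normalized student and teacher embeddings:
    # the student mimics the teacher's feature space directly.
    s = l2_normalize(student_feats)
    t = l2_normalize(teacher_feats)
    return float(np.mean((s - t) ** 2))

def interactive_contrastive_loss(student_img, teacher_txt, tau=0.07):
    # Cross-model contrast: student image embeddings are contrasted against
    # the frozen teacher's text embeddings, with matched pairs on the diagonal.
    s = l2_normalize(student_img)
    t = l2_normalize(teacher_txt)
    logits = s @ t.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # InfoNCE on the diagonal
```

Both losses push the student's embeddings toward the teacher's, consistent with the observation that better CLIP distillation correlates with higher teacher-student feature similarity.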