VLG-Net: Video-Language Graph Matching Network for Video Grounding

ICCVW'2021 - Best Paper Award


Figure 1. The video-language matching graph. The nodes represent video snippets and query tokens. Ordering Edges model the sequential nature of both modalities. Semantic Edges connect graph nodes within the same modality according to their feature similarity. Matching Edges capture the cross-modality relationships. We apply graph convolution on the video-language graph for cross-modal context modeling and multi-modal fusion. The neighborhood shown is that of the node highlighted in red.
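The three edge types in the figure can be illustrated with a minimal sketch. It is not the released implementation: the function names are hypothetical, Semantic Edges are assumed to link each node to its k most cosine-similar neighbors, and Matching Edges are assumed to connect every snippet to every token.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def ordering_edges(n):
    """Ordering Edge: link each node to its successor in the sequence."""
    return [(i, i + 1) for i in range(n - 1)]

def semantic_edges(feats, k):
    """Semantic Edge (assumed k-NN): link each node to its k most similar
    nodes of the same modality."""
    edges = []
    for i, f in enumerate(feats):
        others = [j for j in range(len(feats)) if j != i]
        others.sort(key=lambda j: cosine(f, feats[j]), reverse=True)
        edges.extend((i, j) for j in others[:k])
    return edges

def matching_edges(n_video, n_query):
    """Matching Edge (assumed dense): link every snippet to every token."""
    return [(v, q) for v in range(n_video) for q in range(n_query)]

# Toy example: three 2-D snippet features and two query tokens.
video = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(ordering_edges(len(video)))
print(semantic_edges(video, k=1))
print(matching_edges(len(video), 2))
```

Edges are returned as (source, target) index pairs; a graph convolution would then aggregate features over each node's neighborhood.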


Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. The solution to this challenging task demands understanding the semantic content of videos and queries, as well as fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable the mutual exchange of information across the modalities, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients include representation graphs built atop video snippets and query tokens separately and used to model intra-modality relationships. A Graph Matching layer is adopted for cross-modal context modeling and multi-modal fusion. Finally, moment candidates are created using masked moment attention pooling by fusing the moment's enriched snippet features. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with language queries: ActivityNet Captions, TACoS, and DiDeMo.
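The masked moment attention pooling step can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's exact layer: the per-snippet attention scores are taken as a given input (in a trained model they would be learned), and `masked_attention_pool` is a hypothetical name.

```python
import math

def masked_attention_pool(features, scores, start, end):
    """Pool the snippet features inside the moment [start, end] (inclusive).

    Snippets outside the moment are masked out; the scores of the remaining
    snippets are normalized with a softmax and used as pooling weights.
    """
    if not 0 <= start <= end < len(features):
        raise ValueError("moment must lie inside the video")
    inside = range(start, end + 1)
    peak = max(scores[t] for t in inside)  # subtract the max for stability
    exp = {t: math.exp(scores[t] - peak) for t in inside}
    total = sum(exp.values())
    dim = len(features[0])
    return [sum(exp[t] / total * features[t][d] for t in inside)
            for d in range(dim)]

# Toy example: three snippets with 2-D features and assumed attention scores.
features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
scores = [0.0, 0.0, 10.0]
print(masked_attention_pool(features, scores, 0, 1))
```

Because the mask restricts the softmax to the moment's own snippets, the high score of the third snippet has no influence on the moment spanning snippets 0 to 1.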


Table 1. State-of-the-art comparison on ActivityNet Captions. We report results for different Recall@K values and different IoU thresholds. VLG-Net reaches the highest scores at IoU 0.7 for both R@1 and R@5.

Table 2. State-of-the-art comparison on TACoS. Our model outperforms all previous methods by significant margins on all metrics.

Table 3. State-of-the-art comparison on DiDeMo. Our proposed model outperforms the top-ranked methods ROLE and ACRN at IoU 0.5 and 0.7 for both R@1 and R@5 by clear margins. It also reaches the highest performance for R@1 at IoU 1.0.
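For readers unfamiliar with the metric used in the tables above, the sketch below shows one common way to compute Recall@K at a temporal IoU threshold. The function names and the (start, end) interval format are assumptions made for illustration; this is not the evaluation code behind the reported numbers.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two time intervals given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_k(ranked_preds, gts, k, iou_threshold):
    """Fraction of queries for which at least one of the top-k predicted
    moments reaches the IoU threshold with the ground-truth moment."""
    hits = sum(
        any(temporal_iou(p, gt) >= iou_threshold for p in preds[:k])
        for preds, gt in zip(ranked_preds, gts)
    )
    return hits / len(gts)

# Toy example: two queries, each with two ranked moment predictions (seconds).
ranked = [[(0, 10), (5, 15)], [(0, 4), (20, 30)]]
gts = [(5, 15), (20, 30)]
print(recall_at_k(ranked, gts, k=1, iou_threshold=0.5))
print(recall_at_k(ranked, gts, k=2, iou_threshold=0.5))
```

In this toy example neither top-1 prediction reaches IoU 0.5, while the second-ranked prediction matches the ground truth exactly for both queries, so R@1 is 0.0 and R@2 is 1.0.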



@inproceedings{soldan2021vlg,
  title={VLG-Net: Video-Language Graph Matching Network for Video Grounding},
  author={Soldan, Mattia and Xu, Mengmeng and Qu, Sisi and Tegner, Jesper and Ghanem, Bernard},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={3224--3234},
  year={2021}
}