Abstract

Video moment retrieval, i.e., localizing a specific moment within a video given a description query, has attracted substantial attention over the past several years. Although great progress has been achieved thus far, most existing methods are fully supervised and require moment-level temporal annotations. In contrast, weakly-supervised methods, which only need video-level annotations, remain largely unexplored. In this paper, we propose a novel end-to-end Siamese Alignment Network (SAN) for weakly-supervised video moment retrieval. Specifically, we design a multi-scale Siamese module, which not only generates multi-scale moment candidates in a single pass, but also progressively reduces the semantic gap between the visual and textual modalities through its Siamese structure. In addition, we present a context-aware MIL module that considers the influence of adjacent contexts, enhancing moment-query and video-query alignment simultaneously. By promoting both moment-level and video-level matching, our model effectively improves retrieval performance even with only weak video-level annotations. Extensive experiments on two benchmark datasets, i.e., ActivityNet-Captions and Charades-STA, verify the superiority of our model compared with several state-of-the-art baselines.

[Figure: Task_00.png]
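To make the weak-supervision idea above concrete, below is a minimal PyTorch sketch of how per-candidate moment-query scores could be smoothed with their adjacent contexts and then aggregated into a single video-query score under multiple-instance learning. The function name, neighborhood size, log-sum-exp aggregation, and margin loss are illustrative assumptions, not the exact formulation used in SAN.

```python
import torch
import torch.nn.functional as F

def context_aware_mil_score(moment_scores: torch.Tensor, context: int = 1) -> torch.Tensor:
    """Aggregate per-candidate moment-query scores into one video-query score.

    moment_scores: shape (N,), similarity of each moment candidate to the query.
    context: number of adjacent candidates on each side used for smoothing.
    """
    # Context-aware step: smooth every candidate score with its temporal neighbours.
    kernel = 2 * context + 1
    smoothed = F.avg_pool1d(
        moment_scores.view(1, 1, -1),
        kernel_size=kernel, stride=1, padding=context,
        count_include_pad=False,
    ).view(-1)
    # MIL aggregation: a soft maximum (log-sum-exp) over all candidates,
    # so the best-matching moment dominates the video-level score.
    return torch.logsumexp(smoothed, dim=0)


# With only video-level labels, a margin ranking loss can push the aggregated
# score of a matched video-query pair above that of a mismatched pair.
pos = context_aware_mil_score(torch.randn(32))   # matched video-query pair
neg = context_aware_mil_score(torch.randn(32))   # mismatched pair
loss = F.relu(0.5 + neg - pos)
```

Because only the aggregated video-level score receives supervision, gradients still flow to every candidate, which is what allows moment-level alignment to improve without moment-level labels.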
Model

  Our proposed SAN comprises the following components: 1) the feature representation component encodes both visual and textual information and projects the output embeddings into a joint embedding space; 2) the multi-scale Siamese network generates multi-scale moment candidates online and enhances fine-grained cross-modal semantic alignment; and 3) the context-aware MIL module leverages adjacent moment candidates to facilitate video-query similarity estimation.

[Figure: Pipeline_00.png]
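As a rough illustration of how the first two components might fit together, the sketch below projects clip-level video features and a sentence embedding into a joint space and generates multi-scale moment candidates in a single pass via sliding-window pooling. The class name, feature dimensions, candidate scales, and strides are illustrative assumptions rather than the configuration described in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleCandidates(nn.Module):
    """Illustrative sketch: project clip and query features into a joint space
    and generate multi-scale moment candidates in a single pass."""

    def __init__(self, video_dim=500, query_dim=300, joint_dim=256, scales=(4, 8, 16)):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)
        self.query_proj = nn.Linear(query_dim, joint_dim)
        self.scales = scales  # candidate lengths, measured in clips

    def forward(self, clip_feats, query_feat):
        # clip_feats: (T, video_dim) clip-level features; query_feat: (query_dim,)
        v = F.normalize(self.video_proj(clip_feats), dim=-1)   # (T, joint_dim)
        q = F.normalize(self.query_proj(query_feat), dim=-1)   # (joint_dim,)

        candidates, scores = [], []
        for s in self.scales:
            # Sliding-window pooling yields every candidate of length s at once.
            pooled = F.avg_pool1d(v.t().unsqueeze(0), kernel_size=s, stride=s // 2)
            pooled = pooled.squeeze(0).t()                     # (num_windows, joint_dim)
            candidates.append(pooled)
            scores.append(pooled @ q)                          # moment-query similarities
        return torch.cat(candidates), torch.cat(scores)


# Usage with random stand-in features: 64 clips and one sentence embedding.
model = MultiScaleCandidates()
moments, scores = model(torch.randn(64, 500), torch.randn(300))
```

The candidate scores produced here are exactly what a MIL-style aggregation (as sketched in the Abstract section) would consume to obtain a video-query score.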
Examples

  We display two moment retrieval examples on the Charades-STA dataset. To analyze the effectiveness of each component, we also show the retrieval results of two variants: SAN-LSE and w/o MSN. For comparison, we further include the results of TGA, the first weakly-supervised approach. All of the above figures show R@1 results. The orange, green, and blue bars denote the timelines of the ground truth, the result of TGA, and our result, respectively.

[Figure: Examples_00.png]
Data & Code

Click here to download the data and code:

(Password: d779)
