FILS: Self-Supervised Video Feature Prediction In Semantic Language Space

Mona Ahmadian (University of Surrey), Frank Guerin (University of Surrey), Andrew Gilbert (University of Surrey)

The 35th British Machine Vision Conference

Abstract

This paper presents a self-supervised approach for learning semantic video representations. Recent vision studies show that masking strategies and natural language supervision have both contributed to transferable visual pretraining. Our goal is to achieve a more semantic video representation by leveraging the text related to the video content during pretraining in a fully self-supervised manner. To this end, we present FILS, a novel self-supervised method for video Feature prediction In semantic Language Space. By correctly predicting the semantics of masked features in language space, the vision model can capture valuable structured information. The model is learned using a patch-wise video-text contrastive strategy, in which the text representations act as prototypes for transforming vision features into a language space; the transformed features are then used as targets for semantically meaningful feature prediction using our masked encoder-decoder structure. FILS demonstrates remarkable transferability on downstream action recognition tasks, achieving state-of-the-art results on challenging egocentric datasets such as Epic-Kitchens, Something-SomethingV2, Charades-Ego, and EGTEA, using ViT-Base. Our efficient method requires less computation and smaller batches than previous works.
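
The idea of using text representations as prototypes can be illustrated with a small toy sketch. This is not the authors' implementation: the function names, the temperature of 0.07, the random toy features, and the choice of cross-entropy as the prediction loss are all assumptions made for illustration, since the abstract does not specify them. The sketch maps each patch feature to a softmax distribution over text prototypes (its "language space" representation) and then compares predicted and target distributions on the masked patches only.

```python
import math
import random

def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def to_language_space(feature, text_prototypes, temperature=0.07):
    """Softmax over cosine similarities between one patch feature and each text prototype.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    f = normalize(feature)
    logits = [sum(a * b for a, b in zip(f, normalize(p))) / temperature
              for p in text_prototypes]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def masked_prediction_loss(predicted, targets, text_prototypes, masked_idx):
    """Mean cross-entropy between target and predicted language-space
    distributions, computed on masked patches only (loss choice is an assumption)."""
    losses = []
    for i in masked_idx:
        p_target = to_language_space(targets[i], text_prototypes)
        p_pred = to_language_space(predicted[i], text_prototypes)
        losses.append(-sum(t * math.log(q + 1e-12) for t, q in zip(p_target, p_pred)))
    return sum(losses) / len(losses)

# Toy data: random vectors stand in for text and patch features.
random.seed(0)
dim, num_patches, num_prototypes = 8, 6, 3
text_prototypes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_prototypes)]
targets = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
# Hypothetical decoder output: the target features plus a little noise.
predicted = [[x + 0.1 * random.gauss(0, 1) for x in row] for row in targets]
masked_idx = [0, 2, 4]  # indices of the patches treated as masked

loss = masked_prediction_loss(predicted, targets, text_prototypes, masked_idx)
print(f"masked language-space prediction loss: {loss:.4f}")
```

In a real system the target features would come from a vision encoder and the prototypes from a text encoder; the sketch only shows how text prototypes can turn feature prediction into prediction over a shared semantic space.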

Citation

@inproceedings{Ahmadian_2024_BMVC,
author    = {Mona Ahmadian and Frank Guerin and Andrew Gilbert},
title     = {FILS: Self-Supervised Video Feature Prediction In Semantic Language Space},
booktitle = {35th British Machine Vision Conference 2024, {BMVC} 2024, Glasgow, UK, November 25-28, 2024},
publisher = {BMVA},
year      = {2024},
url       = {https://papers.bmvc2024.org/0790.pdf}
}

Copyright © 2024 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
