MotionMAE: Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders


Haosen Yang (University of Surrey), Deng Huang (Meituan), Bin Wen (Beijing University of Aeronautics and Astronautics), Jiannan Wu (University of Hong Kong), Hongxun Yao (Harbin Institute of Technology), Yi Jiang (ByteDance), Xiatian Zhu (University of Surrey), Zehuan Yuan (ByteDance Inc.)
The 35th British Machine Vision Conference

Abstract

Apart from learning to reconstruct individual masked patches of video frames, our model is designed to additionally predict the corresponding motion structure information over time. This motion information is available from the temporal difference of nearby frames. As a result, our model can effectively extract both static appearance and dynamic motion spontaneously, leading to superior spatiotemporal representation learning capability. Extensive experiments show that MotionMAE significantly outperforms both the supervised learning baseline and state-of-the-art MAE alternatives, under both domain-specific and domain-generic pretraining-then-finetuning settings. In particular, with ViT-B as the backbone, MotionMAE surpasses the prior art model by a margin of 1.2% on Something-Something V2. Encouragingly, it also surpasses the competing MAEs by a large margin of over 3% on the challenging video object segmentation task.
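The idea of pairing an appearance target with a motion target taken from the temporal difference of nearby frames can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the frame stride, the pixel-level boolean mask, the mean-squared-error objective, and the `motion_weight` factor are all assumptions made for illustration, and the ViT encoder and decoder are omitted entirely.

```python
import numpy as np


def temporal_difference(frames, stride=1):
    """Return frames[t + stride] - frames[t] for every valid t.

    frames: array of shape (T, H, W, C). The stride is an assumed
    hyperparameter, not a value taken from the paper.
    """
    return frames[stride:] - frames[:-stride]


def masked_mse(prediction, target, mask):
    """Mean squared error averaged over masked positions only.

    mask: boolean array broadcastable to target; True marks a masked
    position. At least one position must be masked.
    """
    mask = np.broadcast_to(mask, target.shape)
    squared_error = (prediction - target) ** 2
    return squared_error[mask].mean()


def combined_loss(pred_frames, pred_motion, frames, mask, motion_weight=1.0):
    """Appearance loss plus weighted motion loss (hypothetical weighting).

    pred_frames: predicted frames, shape (T, H, W, C).
    pred_motion: predicted frame differences, shape (T - 1, H, W, C).
    mask: boolean array of shape (T, H, W, 1).
    """
    motion_target = temporal_difference(frames)
    appearance = masked_mse(pred_frames, frames, mask)
    # Assumption: the motion term reuses the mask of the earlier frame
    # in each pair, hence mask[:-1].
    motion = masked_mse(pred_motion, motion_target, mask[:-1])
    return appearance + motion_weight * motion
```

In this sketch a static clip yields an all-zero motion target, so the motion term only contributes where nearby frames actually differ. A real masked autoencoder would typically mask whole patches (or space-time tubes) rather than individual pixels.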

Citation

@inproceedings{Yang_2024_BMVC,
author    = {Haosen Yang and Deng Huang and Bin Wen and Jiannan Wu and Hongxun Yao and Yi Jiang and Xiatian Zhu and Zehuan Yuan},
title     = {MotionMAE: Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders},
booktitle = {35th British Machine Vision Conference 2024, {BMVC} 2024, Glasgow, UK, November 25-28, 2024},
publisher = {BMVA},
year      = {2024},
url       = {https://papers.bmvc2024.org/0499.pdf}
}


Copyright © 2024 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
