SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning


Hao Chen (Department of Computer Science and Engineering, The Chinese University of Hong Kong), Jiaze Wang (The Chinese University of Hong Kong), Ziyu Guo (Department of Computer Science and Engineering, The Chinese University of Hong Kong), Jinpeng Li (The Chinese University of Hong Kong), Donghao Zhou (The Chinese University of Hong Kong), Bian Wu (Zhejiang University), Chenyong Guan (Gudsen Technology Co. Ltd), Guangyong Chen (Zhejiang Lab), Pheng-Ann Heng (The Chinese University of Hong Kong)
The 35th British Machine Vision Conference

Abstract

Sign language recognition (SLR) plays a vital role in facilitating communication for the hearing-impaired community. A significant challenge in SLR arises from its weakly supervised nature: each video is annotated only with a sequence of glosses for the whole clip, which makes it particularly difficult to identify the gloss that corresponds to a specific video segment. To address this challenge, we present SignVTCL, a multi-modal continuous sign language recognition framework enhanced by visual-textual contrastive learning, which leverages the full potential of multi-modal data and the generalization ability of language models. First, SignVTCL consolidates multi-modal data to train a unified visual feature extractor, resulting in more robust visual representations. Subsequently, it employs a visual-textual alignment approach that integrates gloss-level and sentence-level alignments, establishing precise correspondences between visual and textual features to enhance SLR accuracy. Experiments on three datasets, Phoenix-2014, Phoenix-2014T, and CSL-Daily, demonstrate that SignVTCL achieves state-of-the-art results compared with previous methods. Project page: https://jiazewang.com/projects/signvtcl.html.
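
The abstract does not state the exact loss used for visual-textual alignment. As a rough illustration of the general idea only, the sketch below computes a generic symmetric InfoNCE-style contrastive loss over paired visual and textual embeddings. The function names, the temperature value of 0.07, and the assumption of one text embedding per visual embedding are all hypothetical and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax_at(scores, idx):
    """Log-softmax of scores[idx], using the max-shift trick for stability."""
    top = max(scores)
    log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[idx] - log_sum_exp


def symmetric_contrastive_loss(visual, textual, temperature=0.07):
    """Generic symmetric InfoNCE loss (an assumption, not the paper's loss).

    visual[i] and textual[i] are treated as a matching pair; every other
    combination in the batch is treated as a negative.
    """
    if not visual or len(visual) != len(textual):
        raise ValueError("need equally sized, non-empty batches")
    v = [l2_normalize(x) for x in visual]
    t = [l2_normalize(x) for x in textual]
    n = len(v)
    # Cosine similarity between every visual/text pair, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Visual-to-text direction: each row should peak on its diagonal entry.
    v2t = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Text-to-visual direction: each column should peak on its diagonal entry.
    t2v = -sum(
        log_softmax_at([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (v2t + t2v)


if __name__ == "__main__":
    visual_feats = [[1.0, 0.0], [0.0, 1.0]]
    aligned_text = [[1.0, 0.0], [0.0, 1.0]]
    swapped_text = [[0.0, 1.0], [1.0, 0.0]]
    print("aligned loss:", symmetric_contrastive_loss(visual_feats, aligned_text))
    print("swapped loss:", symmetric_contrastive_loss(visual_feats, swapped_text))
```

Under these assumptions, the same function could in principle be applied at two granularities, once with segment-level visual features against gloss embeddings and once with video-level features against sentence embeddings, but how SignVTCL actually combines the two alignments is described in the paper itself rather than here.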

Citation

@inproceedings{Chen_2024_BMVC,
author    = {Hao Chen and Jiaze Wang and Ziyu Guo and Jinpeng Li and Donghao Zhou and Bian Wu and Chenyong Guan and Guangyong Chen and Pheng-Ann Heng},
title     = {SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning},
booktitle = {35th British Machine Vision Conference 2024, {BMVC} 2024, Glasgow, UK, November 25-28, 2024},
publisher = {BMVA},
year      = {2024},
url       = {https://papers.bmvc2024.org/0335.pdf}
}


Copyright © 2024 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
