GLCM-Adapter: Global-Local Content Matching for Few-shot CLIP Adaptation


Shuo Wang (University of Science and Technology of China), Xieenlong (University of Science and Technology of China), Jinda Lu (University of Science and Technology of China), Jinghan Li (University of Science and Technology of China), Yanbin Hao (University of Science and Technology of China)
The 35th British Machine Vision Conference

Abstract

Recent adaptation methods aim to boost the few-shot capability of Contrastive Vision-Language Pre-training (CLIP) by transferring textual knowledge into the image recognition procedure. However, these methods usually operate on the global view of an input image, and thus bias recognition away from partial details of the image. To address this issue, we propose a Global-Local Content Matching (GLCM) strategy, which attends to both the global and local views of the image. Specifically, we first extract global and local features from the input image using the CLIP visual encoder. Meanwhile, we embed the corresponding textual knowledge into features with the CLIP textual encoder. Then, we construct a local representation from the textual features by selectively combining discriminative local content. This local representation preserves sufficient local detail and helps the classifier focus on fine-grained parts of the image. Finally, we match the global and local content to construct a robust classifier, namely the GLCM-Adapter. Our GLCM-Adapter attends to information from different views and thus achieves robust recognition. We evaluate our method on the popular few-shot classification task across 11 benchmark datasets and achieve significant improvements over state-of-the-art methods. For example, our method achieves more than a 1% average gain over Tip-Adapter-F and obtains more than 76.5% average accuracy in the 16-shot setting.
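
To make the global-local matching idea concrete, the following is a minimal sketch of how such a classifier could be assembled from CLIP features. It is not the authors' implementation: the patch-selection rule (top-k patch-to-text similarities), the fusion weight alpha, and top_k are illustrative assumptions, and random tensors stand in for real CLIP encoder outputs.

# Hypothetical sketch of global-local content matching; NOT the paper's code.
# Shapes follow CLIP ViT-B/16 conventions: one global [CLS] embedding plus
# 196 patch embeddings, all projected to the 512-d joint space.
import torch
import torch.nn.functional as F

def glcm_logits(global_feat, local_feats, text_feats, top_k=16, alpha=0.5):
    """Combine global and local CLIP features into class logits.

    global_feat: (B, D)    image [CLS] embedding
    local_feats: (B, P, D) patch embeddings
    text_feats:  (C, D)    class text embeddings
    top_k and alpha are illustrative hyper-parameters, not paper values.
    """
    global_feat = F.normalize(global_feat, dim=-1)
    local_feats = F.normalize(local_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    # Global branch: standard CLIP zero-shot logits.
    global_logits = global_feat @ text_feats.t()    # (B, C)

    # Local branch: score every patch against every class, then keep only
    # the most discriminative patches per class -- one plausible reading of
    # "selectively combining discriminative local content".
    patch_logits = local_feats @ text_feats.t()     # (B, P, C)
    topk_vals, _ = patch_logits.topk(top_k, dim=1)  # (B, top_k, C)
    local_logits = topk_vals.mean(dim=1)            # (B, C)

    # Match the two views: here, a simple weighted sum of both branches.
    return alpha * global_logits + (1 - alpha) * local_logits

# Usage with random stand-ins for CLIP encoder outputs.
B, P, D, C = 4, 196, 512, 11
logits = glcm_logits(torch.randn(B, D), torch.randn(B, P, D), torch.randn(C, D))
print(logits.shape)  # torch.Size([4, 11])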

Citation

@inproceedings{Wang_2024_BMVC,
author    = {Shuo Wang and Xieenlong and Jinda Lu and Jinghan Li and Yanbin Hao},
title     = {GLCM-Adapter: Global-Local Content Matching for Few-shot CLIP Adaptation},
booktitle = {35th British Machine Vision Conference 2024, {BMVC} 2024, Glasgow, UK, November 25-28, 2024},
publisher = {BMVA},
year      = {2024},
url       = {https://papers.bmvc2024.org/0425.pdf}
}


Copyright © 2024 The British Machine Vision Association and Society for Pattern Recognition
