CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning


Emanuele Frascaroli (University of Modena and Reggio Emilia), Aniello Panariello (University of Modena and Reggio Emilia), Pietro Buzzega (University of Modena and Reggio Emilia), Lorenzo Bonicelli (University of Modena and Reggio Emilia), Angelo Porrello (University of Modena and Reggio Emilia, AimageLab), Simone Calderara (University of Modena and Reggio Emilia)
The 35th British Machine Vision Conference

Abstract

With the emergence of Transformers and Vision-Language Models (VLMs) such as CLIP, fine-tuning large pre-trained models has recently become a prevalent strategy in Continual Learning (CL). This has led to the development of numerous prompting strategies to adapt transformer-based models without incurring catastrophic forgetting. However, these strategies often compromise the original zero-shot capabilities of the pre-trained CLIP model and struggle to adapt to domains that significantly deviate from the pre-training data. In this work, we propose **Continual Generative training for Incremental prompt-Learning**, a simple and novel approach to mitigate forgetting while adapting CLIP. Briefly, we employ Variational Autoencoders (VAEs) to learn class-conditioned distributions within the embedding space of the visual encoder. We then exploit these distributions to sample new synthetic visual embeddings and train the corresponding class-specific textual prompts during subsequent tasks. Through extensive experiments on different domains, we show that such a generative replay approach can adapt to new tasks while improving zero-shot capabilities, which we evaluate using a novel metric tailored for CL scenarios. Notably, further analysis reveals that our approach can bridge the gap with joint prompt tuning. The codebase is available at https://github.com/aimagelab/mammoth.
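To make the generative latent replay idea in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It rests on several loudly stated assumptions: a per-class diagonal Gaussian stands in for the class-conditioned VAE, one learnable vector per class stands in for the output of CLIP's text encoder on a learned prompt, logits are plain dot products, and the "visual embeddings" are synthetic 4-dimensional clusters. All function names (`fit_generator`, `train_prompts`, `run_demo`, and so on) are hypothetical and do not come from the paper or the mammoth codebase.

```python
import math
import random

def fit_generator(embeddings):
    """Fit a diagonal Gaussian to one class's embeddings (assumed stand-in for a VAE)."""
    n, dim = len(embeddings), len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    std = [math.sqrt(sum((e[d] - mean[d]) ** 2 for e in embeddings) / n)
           for d in range(dim)]
    return mean, std

def sample(generator, count, rng):
    """Draw synthetic embeddings from a fitted class generator."""
    mean, std = generator
    return [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(count)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_prompts(prompts, generators, rng, steps=50, lr=0.1, per_class=8):
    """SGD on cross-entropy, using only synthetic embeddings of all seen classes."""
    classes = sorted(generators)
    for _ in range(steps):
        for label in classes:
            for x in sample(generators[label], per_class, rng):
                probs = softmax([dot(prompts[c], x) for c in classes])
                for c, p in zip(classes, probs):
                    grad = p - (1.0 if c == label else 0.0)
                    prompts[c] = [w - lr * grad * v for w, v in zip(prompts[c], x)]
    return prompts

def predict(prompts, x):
    """Pick the class whose prompt vector scores highest against the embedding."""
    return max(prompts, key=lambda c: dot(prompts[c], x))

def real_embeddings(label, count, rng, dim=4):
    """Toy 'visual encoder' output (assumption): class k clusters around 3 * e_k."""
    return [[rng.gauss(3.0 if d == label else 0.0, 0.3) for d in range(dim)]
            for _ in range(count)]

def run_demo(seed=0):
    rng = random.Random(seed)
    generators, prompts = {}, {}
    for task_classes in [(0, 1), (2, 3)]:  # two sequential tasks
        for label in task_classes:
            # Real embeddings are only used while the class's own task is active.
            generators[label] = fit_generator(real_embeddings(label, 50, rng))
            prompts[label] = [0.0] * 4
        # Prompts for old and new classes are trained on sampled embeddings only.
        train_prompts(prompts, generators, rng)
    test = [(label, x) for label in prompts for x in real_embeddings(label, 25, rng)]
    correct = sum(predict(prompts, x) == label for label, x in test)
    return correct / len(test), prompts

accuracy, prompts = run_demo()
print(f"accuracy over all seen classes: {accuracy:.2f}")
```

The point of the sketch is the data flow: once a class's generator is fitted, its real embeddings are no longer needed, and later tasks replay that class purely through sampled embeddings while the prompt vectors for every seen class are updated together.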

Citation

@inproceedings{Frascaroli_2024_BMVC,
author    = {Emanuele Frascaroli and Aniello Panariello and Pietro Buzzega and Lorenzo Bonicelli and Angelo Porrello and Simone Calderara},
title     = {CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning},
booktitle = {35th British Machine Vision Conference 2024, {BMVC} 2024, Glasgow, UK, November 25-28, 2024},
publisher = {BMVA},
year      = {2024},
url       = {https://papers.bmvc2024.org/0863.pdf}
}


Copyright © 2024 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
