Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization


Nicholas Moratelli (University of Modena and Reggio Emilia), Davide Caffagni (University of Modena and Reggio Emilia), Marcella Cornia (University of Modena and Reggio Emilia), Lorenzo Baraldi (University of Modena and Reggio Emilia), Rita Cucchiara (University of Modena and Reggio Emilia)
The 35th British Machine Vision Conference

Abstract

The conventional training approach for image captioning involves pre-training a network using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to maximize hand-crafted captioning metrics. However, when attempting to optimize modern and higher-quality metrics like CLIP-Score and PAC-Score, this training method often encounters instability and fails to acquire the genuine descriptive capabilities needed to produce fluent and informative captions. In this paper, we propose a new training paradigm termed Direct CLIP-Based Optimization (DiCO). Our approach jointly learns and optimizes a reward model that is distilled from a learnable captioning evaluator with high human correlation. This is done by solving a weighted classification problem directly inside the captioner. At the same time, DiCO prevents divergence from the original model, ensuring that fluency is maintained. DiCO not only exhibits improved stability and enhanced quality in the generated captions but also aligns more closely with human preferences compared to existing methods, especially in modern metrics. Additionally, it maintains competitive performance in traditional metrics. Our source code and trained models are publicly available at https://github.com/aimagelab/DiCO.

Citation

@inproceedings{Moratelli_2024_BMVC,
  author    = {Nicholas Moratelli and Davide Caffagni and Marcella Cornia and Lorenzo Baraldi and Rita Cucchiara},
  title     = {Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization},
  booktitle = {35th British Machine Vision Conference 2024, {BMVC} 2024, Glasgow, UK, November 25-28, 2024},
  publisher = {BMVA},
  year      = {2024},
  url       = {https://papers.bmvc2024.org/0754.pdf}
}


Copyright © 2024 The British Machine Vision Association and Society for Pattern Recognition
The British Machine Vision Conference is organised by The British Machine Vision Association and Society for Pattern Recognition. The Association is a Company limited by guarantee, No.2543446, and a non-profit-making body, registered in England and Wales as Charity No.1002307 (Registered Office: Dept. of Computer Science, Durham University, South Road, Durham, DH1 3LE, UK).
