Enhancing Radiology Report Generation: The Impact of Locally Grounded Vision and Language Training


Sergio Sanchez Santiesteban (University of Surrey), Muhammad Awais (University of Surrey), Yi-Zhe Song (University of Surrey), Josef Kittler (University of Surrey)
The 35th British Machine Vision Conference

Abstract

In the medical domain, the integration of multimodal data, specifically radiology images paired with corresponding reports, presents a valuable opportunity for enhanced diagnostics. Recently, there has been growing interest in using Multimodal Large Language Models (MLLMs) for this purpose, owing to their ability to learn effectively from the limited examples typical of specialized fields such as radiology. Traditionally, radiologists generate reports by scrutinizing specific regions of an x-ray for changes, which are then systematically described in the report’s text with references to anatomical structures. Existing methodologies, however, often process the radiographic image as a whole, requiring the fine-grained alignment to be learnt during training through predominantly global optimization objectives. During pretraining, this approach overlooks the subtleties of local image-to-text correspondences, yielding automatically generated reports that lack critical grounding elements and thereby impeding the explanation of model predictions. In this paper, we introduce a novel dataset of interleaved radiology images with locally aligned phrase grounding annotations provided by radiologists. Drawing on grounding techniques employed in general-domain MLLMs, our methodology introduces learnable location tokens to enhance the model’s understanding of spatial relationships. We structure our training samples as sequences that encompass entire x-ray images, the corresponding report texts, and region anchors, where a region anchor is a sequence of the aforementioned location tokens denoting a specific anatomical area of interest. Combined with a grounding prompt-tuning strategy, this dataset fosters a direct connection between the radiology report’s text and specific regions of the x-ray image. Our evaluation, conducted on large-scale public datasets, demonstrates that our proposed approach significantly refines the capabilities of existing MLLMs for radiology report generation.
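To make the notion of a region anchor concrete, the sketch below shows one plausible way to serialize a bounding box as a sequence of discrete location tokens interleaved with report text. The abstract does not specify the paper's exact tokenization, so the token names (<locN>, <ref>), the bin count, and the helper functions here are illustrative assumptions, not the authors' scheme.

# Illustrative sketch only: the <locN> / <ref> tokens, the bin count,
# and the helper names below are assumptions for exposition.

NUM_BINS = 1000  # assumed number of quantization bins per coordinate

def box_to_location_tokens(box, width, height, num_bins=NUM_BINS):
    """Quantize a pixel-space box (x1, y1, x2, y2) into location tokens."""
    x1, y1, x2, y2 = box
    normalized = (x1 / width, y1 / height, x2 / width, y2 / height)
    bins = [min(int(c * num_bins), num_bins - 1) for c in normalized]
    return "".join(f"<loc{b}>" for b in bins)

def build_grounded_sample(report_phrase, box, width, height):
    """Interleave a report phrase with its region anchor, i.e. the
    location-token sequence marking the referenced anatomical area."""
    anchor = box_to_location_tokens(box, width, height)
    return f"{report_phrase} <ref>{anchor}</ref>"

# Example: grounding a finding on a 1024x1024 chest x-ray.
print(build_grounded_sample(
    "Patchy opacity in the left lower lobe.",
    box=(620, 700, 930, 980), width=1024, height=1024))

A grounded training sequence would then interleave the full image, such phrase-anchor pairs, and the remaining report text, so that the global report-generation objective is accompanied by explicit local image-to-text correspondences.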

Citation

@inproceedings{Santiesteban_2024_BMVC,
author    = {Sergio Sanchez Santiesteban and Muhammad Awais and Yi-Zhe Song and Josef Kittler},
title     = {Enhancing Radiology Report Generation: The Impact of Locally Grounded Vision and Language Training},
booktitle = {35th British Machine Vision Conference 2024, {BMVC} 2024, Glasgow, UK, November 25-28, 2024},
publisher = {BMVA},
year      = {2024},
url       = {https://papers.bmvc2024.org/0857.pdf}
}


