MedFocusCLIP : improving few shot classification in medical datasets using pixel wise attention


dc.contributor.author Arora, Aadya
dc.contributor.author Namboodiri, Vinay
dc.coverage.spatial United States of America
dc.date.accessioned 2025-01-16T14:18:51Z
dc.date.available 2025-01-16T14:18:51Z
dc.date.issued 2025-01
dc.identifier.citation Arora, Aadya and Namboodiri, Vinay, "MedFocusCLIP : improving few shot classification in medical datasets using pixel wise attention", arXiv, Cornell University Library, arXiv:2501.03839, Jan. 2025.
dc.identifier.uri http://arxiv.org/abs/2501.03839
dc.identifier.uri https://repository.iitgn.ac.in/handle/123456789/10949
dc.description.abstract With the popularity of foundation models, parameter-efficient fine-tuning has become the de facto approach to leveraging pretrained models for downstream tasks. Taking inspiration from recent advances in large language models, Visual Prompt Tuning and similar techniques learn an additional prompt to efficiently fine-tune a pretrained vision foundation model. However, we observe that such prompting is insufficient for fine-grained visual classification tasks such as medical image classification, where there is large inter-class variance and small intra-class variance. Hence, in this paper we propose to leverage the advanced segmentation capabilities of the Segment Anything Model 2 (SAM2) as a visual prompting cue to help the visual encoder in CLIP (Contrastive Language-Image Pretraining) by guiding the attention in the CLIP visual encoder to relevant regions in the image. This helps the model focus on highly discriminative regions without getting distracted by visually similar background features, an essential requirement in a few-shot, fine-grained classification setting. We evaluate our method on diverse medical datasets including X-rays, CT scans, and MRI images, and report an accuracy of (71%, 81%, 86%, 58%) from the proposed approach on (COVID, lung-disease, brain-tumor, breast-cancer) datasets against (66%, 70%, 68%, 29%) from a pretrained CLIP model after few-shot training. The proposed approach also allows interpretable explanations of the classification performance to be obtained through the localization provided by the segmentation.
dc.description.statementofresponsibility by Aadya Arora and Vinay Namboodiri
dc.language.iso en_US
dc.publisher Cornell University Library
dc.title MedFocusCLIP : improving few shot classification in medical datasets using pixel wise attention
dc.type Article
dc.relation.journal arXiv
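
The abstract describes guiding attention in the CLIP visual encoder toward regions identified by SAM2. Below is a minimal, purely illustrative sketch of one way such pixel-wise guidance could be wired in; it is not the paper's implementation. The placeholder segment_roi (standing in for SAM2), the 14x14 ViT patch grid, and the additive attention-bias formulation are all assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def segment_roi(image: torch.Tensor) -> torch.Tensor:
        # Hypothetical placeholder: in the paper this role is played by SAM2,
        # which would return a binary (H, W) foreground mask for the image.
        raise NotImplementedError

    def mask_to_attention_bias(mask_hw: torch.Tensor, grid: int = 14,
                               penalty: float = -10.0) -> torch.Tensor:
        # Pool the pixel-wise mask down to the ViT patch grid: each entry is the
        # fraction of region-of-interest pixels covered by that patch.
        patch_frac = F.adaptive_avg_pool2d(mask_hw[None, None].float(), grid)
        keep = (patch_frac.flatten() > 0.5).float()      # (grid*grid,)
        bias = (1.0 - keep) * penalty                    # penalize background patches
        cls_bias = torch.zeros(1)                        # never mask the CLS token
        return torch.cat([cls_bias, bias])               # (1 + grid*grid,)

    # Conceptual use inside a ViT attention layer:
    #   logits = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    #   logits = logits + bias          # broadcast over the key dimension
    #   attn = logits.softmax(dim=-1)   # background patches receive little weight

Under these assumptions, the bias would be added to the pre-softmax attention logits of each transformer block, so that background patches contribute little to the CLS representation used for few-shot classification, while the mask itself provides the localization cited as an interpretable explanation.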

