Foundational Vision-Language (V-L) models, such as CLIP, have advanced computer vision research by providing a general "shared" latent space for the image and text modalities. However, training these models, or even fine-tuning them, demands substantial computational resources and carries a high environmental cost. In this work, we propose CLIP-DoRA, a method that leverages weight-decomposed low-rank adaptation for parameter-efficient fine-tuning (PEFT) of V-L models. We investigate the relatively underexplored family of low-rank methods for V-L fine-tuning and provide empirical evidence of the benefits of weight-decomposed adaptation. Extensive experiments across 11 few-shot datasets and 4 domain generalization benchmarks show that CLIP-DoRA outperforms existing PEFT methods, with average improvements of up to 0.28% over the previous state of the art. Furthermore, CLIP-DoRA achieves competitive results on complex tasks such as medical image segmentation using only 1.5% of the trainable parameters, demonstrating its potential as a more sustainable and accessible solution for V-L model adaptation. These findings highlight the robustness and versatility of CLIP-DoRA for developing efficient and environmentally friendly computer vision solutions.
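For reference, a minimal sketch of the weight-decomposed low-rank update underlying this approach, following the standard DoRA formulation (the symbols W_0 for the frozen pretrained weight, B and A for the rank-r low-rank factors, m for the learnable magnitude vector, and \lVert\cdot\rVert_c for the column-wise norm are introduced here for illustration and are not defined in the abstract):

W' = m \odot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}

Only m, B, and A are trained while W_0 stays frozen, which is what keeps the number of trainable parameters small relative to full fine-tuning.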