Foundational Vision-Language (V-L) models, such as CLIP, have advanced computer vision research by providing a general "shared" latent space for the image and text modalities. However, training these models, or even fine-tuning them, demands substantial computational resources and carries a high environmental cost. In this work, we propose CLIP-DoRA, a method that leverages weight-decomposed low-rank adaptation for parameter-efficient fine-tuning (PEFT) of V-L models. We investigate the relatively underexplored family of low-rank methods for V-L fine-tuning and provide empirical evidence of the benefits of weight-decomposed adaptation. Extensive experiments across 11 few-shot datasets and 4 domain generalization benchmarks show that CLIP-DoRA outperforms existing PEFT methods, with average improvements of up to 0.28% over the previous state of the art. Furthermore, CLIP-DoRA achieves competitive results on complex tasks such as medical image segmentation using only 1.5% of the trainable parameters, demonstrating its potential as a more sustainable and accessible solution for V-L model adaptation. These findings highlight the robustness and versatility of CLIP-DoRA for developing efficient and environmentally friendly computer vision solutions.
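For reference, a minimal sketch of the weight-decomposed low-rank update underlying this approach, following the standard DoRA formulation (the symbols W_0 for the frozen pretrained weight, B and A for the rank-r low-rank factors, m for the learnable magnitude vector, and \lVert\cdot\rVert_c for the column-wise norm are introduced here for illustration and are not defined in the abstract):

W' = m \odot \frac{W_0 + BA}{\lVert W_0 + BA \rVert_c}

Only m, B, and A are trained while W_0 stays frozen, which is what keeps the number of trainable parameters small relative to full fine-tuning.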