Low-Rank Few-Shot Adaptation of Vision-Language Models
Recent progress in the few-shot adaptation of Vision-Language Models (VLMs) has further pushed their generalization capabilities, at the expense of just a few labeled samples within the target downstream task. However, this promising, already quite abundant few-shot literature has focused principally on prompt learning and, to a lesser extent, on adapters, overlooking the recent advances in Parameter-Efficient Fine-Tuning (PEFT). Furthermore, existing few-shot learning methods for VLMs often rely on heavy training procedures and/or carefully chosen, task-specific hyper-parameters, which might impede their applicability. In response, we introduce Low-Rank Adaptation (LoRA) in few-shot learning for VLMs, and show its potential on 11 datasets, in comparison to current state-of-the-art prompt- and adapter-based approaches. Surprisingly, our simple CLIP-LoRA method exhibits substantial improvements, while reducing the training times and keeping the same hyper-parameters in all the target tasks, i.e., across all the datasets and numbers of shots. Certainly, our surprising results do not dismiss the potential of prompt-learning and adapter-based research. However, we believe that our strong baseline could be used to evaluate progress in these emergent subjects in few-shot VLMs.
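For readers unfamiliar with LoRA, the sketch below illustrates the general idea behind the kind of adaptation the abstract refers to: a pretrained linear layer is kept frozen and only a low-rank residual update is trained. This is a minimal, hypothetical example; the class name `LoRALinear`, the rank, the scaling, and the 512-dimensional layer are illustrative assumptions, not the paper's CLIP-LoRA implementation.

```python
# Minimal sketch of Low-Rank Adaptation (LoRA) on a frozen linear layer,
# assuming a PyTorch setup. Names and hyper-parameters are illustrative only.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        # A is initialized with small random values, B with zeros, so the
        # adapted layer starts out identical to the pretrained one.
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank  # common LoRA scaling convention

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank residual; only lora_A and lora_B get gradients.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


# Example: a projection with the same width as a typical CLIP ViT-B/16 layer.
layer = LoRALinear(nn.Linear(512, 512), rank=4)
out = layer(torch.randn(8, 512))  # (batch, dim) -> (batch, dim)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)  # only the low-rank matrices are trainable
```

The appeal in the few-shot setting is that only the small `A` and `B` matrices are optimized, which keeps training light and leaves the pretrained VLM weights untouched.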
Further reading
- Access the paper on arXiv.org