Explore the Potential of CLIP for Training-Free Open Vocabulary Semantic Segmentation
CLIP, as a vision-language model, has significantly advanced Open-Vocabulary Semantic Segmentation (OVSS) with its zero-shot capabilities. Despite its success, its application to OVSS faces challenges due to its initial image-level alignment training, which affects its performance in tasks requiring detailed local context. Our study delves into the impact of CLIP's [CLS] token on patch feature correlations, revealing a dominance of "global" patches that hinders local feature discrimination. To overcome this, we propose CLIPtrase, a novel training-free semantic segmentation strategy that enhances local feature awareness through recalibrated self-correlation among patches. This approach demonstrates notable improvements in segmentation accuracy and the ability to maintain semantic coherence across objects. Experiments show that we are 22.3% ahead of CLIP on average on 9 segmentation benchmarks, outperforming existing state-of-the-art training-free methods.
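To make the core idea concrete, below is a minimal PyTorch sketch of correlation-based patch refinement. This is not the authors' CLIPtrase implementation: the `recalibrated_self_correlation` helper, the softmax temperature, and the random stand-in tensors for CLIP patch and text embeddings are all illustrative assumptions. Only the high-level recipe, using patch-to-patch self-correlation (rather than [CLS]-dominated attention) to refine patch features before matching them against class-name text embeddings, comes from the abstract.

```python
# Illustrative sketch of self-correlation-based patch refinement.
# Shapes, temperature, and helper names are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def recalibrated_self_correlation(patch_feats: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """Refine patch features by attending over their pairwise similarity.

    patch_feats: (N, D) patch embeddings from a ViT backbone.
    Returns (N, D) features where each patch aggregates semantically similar
    patches, sharpening local discrimination.
    """
    feats = F.normalize(patch_feats, dim=-1)       # cosine geometry
    corr = feats @ feats.t()                       # (N, N) self-correlation
    attn = F.softmax(corr / temperature, dim=-1)   # row-wise weighting
    return attn @ patch_feats                      # correlation-guided mixing

# Toy usage with random stand-ins for CLIP patch and text embeddings.
N, D, C = 14 * 14, 512, 3                          # patches, feature dim, classes
patch_feats = torch.randn(N, D)
text_embeds = F.normalize(torch.randn(C, D), dim=-1)

refined = F.normalize(recalibrated_self_correlation(patch_feats), dim=-1)
logits = refined @ text_embeds.t()                 # (N, C) patch-text scores
seg = logits.argmax(dim=-1).reshape(14, 14)        # per-patch label map
print(seg.shape)                                   # torch.Size([14, 14])
```

In a real pipeline the random tensors would be replaced by patch tokens extracted from CLIP's visual encoder and text embeddings of the candidate class names, with the resulting patch-level label map upsampled to pixel resolution.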
Further reading
- Access the paper on arXiv.org