What If We Recaption Billions of Web Images with LLaMA-3?

Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/
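As a rough illustration of the recaptioning step described above, the sketch below runs a LLaVA-style captioner over a single image with Hugging Face transformers. The model id, prompt template, and generation settings are placeholders for illustration only, not the exact configuration or released checkpoint from the paper.

```python
# Minimal sketch of captioning one image with a LLaVA-style model via transformers.
# NOTE: "org/llava-llama-3-8b" is a hypothetical model id, not the paper's checkpoint.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "org/llava-llama-3-8b"  # placeholder; substitute a real LLaVA checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Prompt template is an assumption (LLaVA-1.5 style); adjust to the checkpoint's format.
prompt = "USER: <image>\nPlease describe this image in detail. ASSISTANT:"
image = Image.open("example.jpg")

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```

Scaling this to 1.3 billion images would additionally require sharding the dataset and batching inference across many accelerators, which the sketch does not cover.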
