COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

Remarkable progress on English instruction tuning has facilitated the efficacy and reliability of large language models (LLMs). However, there remains a noticeable gap in instruction tuning for Chinese, where complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well-aligned with Chinese users' interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world resources and subjected to rigorous human verification. We conduct extensive experiments on COIG-CQIA and compare them with strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance on diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data-mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.