A Novel Paradigm Boosting Translation Capabilities of Large Language Models

This paper presents a study on strategies to enhance the translation capabilities of large language models (LLMs) in the context of machine translation (MT) tasks. The paper proposes a novel paradigm consisting of three stages: Secondary Pre-training using Extensive Monolingual Data, Continual Pre-training with Interlinear Text Format Documents, and Leveraging Source-Language Consistent Instruction for Supervised Fine-Tuning. Previous research on LLMs has focused on various strategies for supervised fine-tuning (SFT), but its effectiveness has been limited. While traditional machine translation approaches rely on vast amounts of parallel bilingual data, our paradigm highlights the importance of using smaller sets of high-quality bilingual data. We argue that the focus should be on augmenting LLMs' cross-lingual alignment abilities during pre-training rather than solely relying on extensive bilingual data during SFT. Experiments conducted with the Llama2 model, particularly on Chinese-Llama2 after monolingual augmentation, demonstrate the improved translation capabilities of LLMs. A significant contribution of our approach lies in Stage 2: Continual Pre-training with Interlinear Text Format Documents, which requires less than 1B of training data, making our method highly efficient. Additionally, in Stage 3, we observed that setting instructions consistent with the source language benefits the supervised fine-tuning process. Experimental results demonstrate that our approach surpasses previous work and achieves superior performance compared to models such as NLLB-54B and GPT3.5-text-davinci-003, despite having a significantly smaller parameter count of only 7B or 13B. This achievement establishes our method as a pioneering strategy in the field of machine translation.
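To make Stages 2 and 3 more concrete, the following minimal Python sketch shows one plausible way to serialize a bilingual sentence pair into an interlinear text-format document for continual pre-training, and to build an SFT example whose instruction is written in the source language. The templates, function names, and language codes are illustrative assumptions, not the exact formats used in the paper.

```python
# Hypothetical sketch of data formatting for Stage 2 (interlinear text format)
# and Stage 3 (source-language consistent instructions).
# The prompt templates below are assumptions, not the paper's exact prompts.

def make_interlinear_document(pairs):
    """Stage 2: interleave source and target sentences line by line."""
    lines = []
    for src, tgt in pairs:
        lines.append(src)   # source-language sentence
        lines.append(tgt)   # its translation placed directly underneath
    return "\n".join(lines)

def make_sft_example(src, tgt, src_lang="zh"):
    """Stage 3: build an SFT example whose instruction matches the source language."""
    # Instruction written in the source language (Chinese for zh->en, English for en->zh).
    instructions = {
        "zh": f"请将下面的中文句子翻译成英文：\n{src}",
        "en": f"Translate the following English sentence into Chinese:\n{src}",
    }
    return {"instruction": instructions[src_lang], "output": tgt}

if __name__ == "__main__":
    pairs = [("今天天气很好。", "The weather is nice today.")]
    print(make_interlinear_document(pairs))
    print(make_sft_example(*pairs[0], src_lang="zh"))
```

In this sketch, the interlinear document simply alternates source and target lines so the model sees aligned sentences in context during continual pre-training, while the SFT builder keeps the instruction in the same language as the input sentence.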
