JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models

Mathematical reasoning is an important capability of large language models (LLMs) for real-world applications. To enhance this capability, existing work either collects large-scale math-related texts for pre-training, or relies on stronger LLMs (e.g., GPT-4) to synthesize massive math problems. Both types of work generally incur large costs in training or synthesis. To reduce the cost, based on openly available texts, we propose an efficient way that trains a small LLM for math problem synthesis, to efficiently generate sufficient high-quality pre-training data. To achieve this, we create a dataset using GPT-4 to distill its data synthesis capability into the small LLM. Concretely, we craft a set of prompts based on human education stages to guide GPT-4 to synthesize problems covering diverse math knowledge and difficulty levels. Besides, we adopt a gradient-based influence estimation method to select the most valuable math-related texts. Both are fed into GPT-4 for creating the knowledge distillation dataset to train the small LLM. We leverage it to synthesize 6 million math problems for pre-training our JiuZhang3.0 model, which only needs to invoke the GPT-4 API 9.3k times and pre-train on 4.6B tokens of data. Experimental results show that JiuZhang3.0 achieves state-of-the-art performance on several mathematical reasoning datasets, under both natural language reasoning and tool manipulation settings. Our code and data will be publicly released at https://github.com/RUCAIBox/JiuZhang3.0.
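
The abstract mentions gradient-based influence estimation for selecting the most valuable math-related texts. The following is a minimal sketch of one common form of that idea, not the paper's implementation: each candidate text is scored by the cosine similarity between its loss gradient and the average gradient of a small reference set, and the top-scoring texts are kept. The proxy model (`gpt2`), the reference problems, and the candidate corpus below are all placeholder assumptions for illustration.

```python
# Sketch of gradient-based influence scoring for data selection.
# Assumptions: a small proxy LM, a handful of reference math problems,
# and a tiny candidate corpus. The paper's actual setup may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder proxy model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def loss_gradient(text: str) -> torch.Tensor:
    """Flattened gradient of the LM loss on a single text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    model.zero_grad()
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    grads = [p.grad.detach().flatten() for p in model.parameters() if p.grad is not None]
    return torch.cat(grads)

# Hypothetical reference problems representing the target distribution.
reference_texts = [
    "Solve for x: 2x + 3 = 11.",
    "A triangle has sides 3, 4, and 5. What is its area?",
]
ref_grad = torch.stack([loss_gradient(t) for t in reference_texts]).mean(dim=0)

# Hypothetical candidate corpus of web texts to filter.
candidates = [
    "The quadratic formula gives the roots of ax^2 + bx + c = 0.",
    "Today's weather forecast predicts light rain in the afternoon.",
    "Integration by parts: the integral of u dv equals uv minus the integral of v du.",
]
scores = [
    torch.nn.functional.cosine_similarity(loss_gradient(t), ref_grad, dim=0).item()
    for t in candidates
]

# Keep the highest-influence texts for the downstream distillation pipeline.
top_k = 2
selected = [t for _, t in sorted(zip(scores, candidates), reverse=True)[:top_k]]
print(selected)
```

In this toy version the math-related candidates should score higher than the off-topic one; at scale, the same scoring can be applied to a large corpus and only the top fraction retained.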