Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy by semi-automatically decomposing human knowledge and capabilities into various fields, sub-fields, and ultimately distinct disciplines, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions, from mathematical reasoning, coding, academic exams, and logical reasoning to general instruction following, without using task-specific training data for these tasks. In addition, GLAN allows for easy customization: new fields or skills can be added by simply incorporating a new node into our taxonomy.
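The abstract describes a taxonomy-to-instructions pipeline: discipline → subjects → syllabus → class-session key concepts → instructions. The sketch below illustrates one possible reading of that pipeline; the `llm` helper, the prompt wording, and the line-based parsing are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a GLAN-style generation pipeline (assumptions, not the
# paper's code): a generic `llm(prompt) -> str` completion helper and simple
# line-based parsing of LLM outputs.

def llm(prompt: str) -> str:
    """Placeholder for any chat/completion API; replace with a real client."""
    raise NotImplementedError("Wire this up to your LLM provider.")

def parse_lines(text: str) -> list[str]:
    """Split an LLM-generated listing into non-empty items."""
    return [line.strip("-* ").strip() for line in text.splitlines() if line.strip()]

def build_instructions(discipline: str) -> list[str]:
    # 1. Enumerate subjects for a discipline drawn from the taxonomy.
    subjects = parse_lines(llm(f"List the core subjects taught in {discipline}."))
    instructions: list[str] = []
    for subject in subjects:
        # 2. Design a syllabus: class sessions, each with its key concepts.
        syllabus = parse_lines(llm(
            f"Design a syllabus for '{subject}' ({discipline}). "
            "One class session per line, each listing its key concepts."
        ))
        for session in syllabus:
            # 3. Generate diverse instructions grounded in the session's concepts.
            instructions += parse_lines(llm(
                f"Write 3 varied homework questions testing these key concepts: {session}"
            ))
    return instructions
```

Because each discipline is an independent node, customization amounts to calling `build_instructions` on a newly added node, matching the paper's claim that new fields or skills can be added by inserting a node into the taxonomy.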
