BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models

Generative Large Language Models (LLMs) have made significant strides acrossvarious tasks, but they remain vulnerable to backdoor attacks, where specifictriggers in the prompt cause the LLM to generate adversary-desired responses.While most backdoor research has focused on vision or text classificationtasks, backdoor attacks in text generation have been largely overlooked. Inthis work, we introduce BackdoorLLM, the first comprehensive benchmarkfor studying backdoor attacks on LLMs. BackdoorLLM features: 1) arepository of backdoor benchmarks with a standardized training pipeline, 2)diverse attack strategies, including data poisoning, weight poisoning, hiddenstate attacks, and chain-of-thought attacks, 3) extensive evaluations with over200 experiments on 8 attacks across 7 scenarios and 6 model architectures, and4) key insights into the effectiveness and limitations of backdoors in LLMs. Wehope BackdoorLLM will raise awareness of backdoor threats andcontribute to advancing AI safety. The code is available athttps://github.com/bboylyg/BackdoorLLM.

Further reading