Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. In this work, we investigate how LLMs respond to harmful prompts and propose a novel defense method termed Layer-specific Editing (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical safety layers exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from selected target layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) demonstrate the effectiveness of LED, which defends against jailbreak attacks while maintaining performance on benign prompts. Our code is available at https://github.com/ledllm/ledllm.
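The abstract describes two stages: locating critical safety layers among the early decoder layers, then realigning those layers toward safe responses. The sketch below illustrates that pipeline under stated assumptions; it is not the paper's exact procedure (the released code at the repository above is authoritative). In particular, the refusal-keyword heuristic, the layer-scoring-by-ablation step, the model name, and the plain supervised realignment loss are all illustrative assumptions, whereas LED aligns the edited layers with safe responses decoded from selected target layers.

```python
# Minimal sketch of layer-specific editing for jailbreak defense, assuming a
# Llama-style Hugging Face causal LM. Stage 1: score early decoder layers by
# how much ablating each one lowers the refusal rate on harmful prompts.
# Stage 2: fine-tune only the located layers on (harmful prompt, safe response)
# pairs. Both the scoring heuristic and the loss are simplified assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"              # assumed model
REFUSAL_MARKERS = ("I cannot", "I can't", "I'm sorry")    # assumed heuristic

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to("cuda")
layers = model.model.layers  # decoder blocks; attribute path is Llama-specific

def refusal_rate(prompts):
    """Fraction of prompts the model refuses, judged by a keyword heuristic."""
    hits = 0
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        out = model.generate(**ids, max_new_tokens=32, do_sample=False)
        text = tok.decode(out[0, ids.input_ids.shape[1]:], skip_special_tokens=True)
        hits += any(m.lower() in text.lower() for m in REFUSAL_MARKERS)
    return hits / len(prompts)

def skip_layer(idx):
    """Ablate layer idx by overriding its output with its input hidden states."""
    def hook(module, args, output):
        hidden = args[0]
        return (hidden,) + tuple(output[1:]) if isinstance(output, tuple) else hidden
    return layers[idx].register_forward_hook(hook)

def locate_safety_layers(harmful_prompts, num_early=8, top_k=3):
    """Rank early layers by the refusal-rate drop their ablation causes."""
    base = refusal_rate(harmful_prompts)
    drops = []
    for idx in range(num_early):
        handle = skip_layer(idx)
        drops.append((base - refusal_rate(harmful_prompts), idx))
        handle.remove()
    return [idx for _, idx in sorted(drops, reverse=True)[:top_k]]

def realign_layers(layer_ids, pairs, epochs=1, lr=1e-5):
    """Fine-tune only the selected layers on (harmful prompt, safe response) pairs."""
    for p in model.parameters():
        p.requires_grad_(False)
    trainable = [p for i in layer_ids for p in layers[i].parameters()]
    for p in trainable:
        p.requires_grad_(True)
    opt = torch.optim.AdamW(trainable, lr=lr)
    model.train()
    for _ in range(epochs):
        for prompt, safe in pairs:
            ids = tok(prompt + safe, return_tensors="pt").to(model.device)
            # Standard causal-LM loss; prompt tokens are not masked, for brevity.
            loss = model(**ids, labels=ids.input_ids).loss
            loss.backward()
            opt.step()
            opt.zero_grad()
    model.eval()
```

A hypothetical usage would be `safety = locate_safety_layers(harmful_prompts)` followed by `realign_layers(safety, pairs)`, with additional layers appended to `safety` if desired; freezing all other parameters keeps the edit local, which is the point of the layer-specific approach and helps preserve behavior on benign prompts.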
Further reading
- Access the paper on arXiv.org