Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?

Instruction-tuned Large Language Models (LLMs) show impressive results innumerous practical applications, but they lack essential safety features thatare common in other areas of computer science, particularly an explicitseparation of instructions and data. This makes them vulnerable tomanipulations such as indirect prompt injections and generally unsuitable forsafety-critical tasks. Surprisingly, there is currently no establisheddefinition or benchmark to quantify this phenomenon. In this work, we closethis gap by introducing a formal measure for instruction-data separation and anempirical variant that is calculable from a model’s outputs. We also present anew dataset, SEP, that allows estimating the measure for real-world models. Ourresults on various LLMs show that the problem of instruction-data separation isreal: all models fail to achieve high separation, and canonical mitigationtechniques, such as prompt engineering and fine-tuning, either fail tosubstantially improve separation or reduce model utility. The source code andSEP dataset are openly accessible athttps://github.com/egozverev/Shold-It-Be-Executed-Or-Processed.

Further reading