Do Membership Inference Attacks Work on Large Language Models?

Membership inference attacks (MIAs) attempt to predict whether a particular datapoint is a member of a target model’s training data. Despite extensive research on traditional machine learning models, there has been limited work studying MIA on the pre-training data of large language models (LLMs). We perform a large-scale evaluation of MIAs over a suite of language models (LMs) trained on the Pile, ranging from 160M to 12B parameters. We find that MIAs barely outperform random guessing for most settings across varying LLM sizes and domains. Our further analyses reveal that this poor performance can be attributed to (1) the combination of a large dataset and few training iterations, and (2) an inherently fuzzy boundary between members and non-members. We identify specific settings where LLMs have been shown to be vulnerable to membership inference and show that the apparent success in such settings can be attributed to a distribution shift, such as when members and non-members are drawn from the seemingly identical domain but with different temporal ranges. We release our code and data as a unified benchmark package that includes all existing MIAs, supporting future work.
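
As background for the kinds of attacks such a benchmark covers, the sketch below illustrates the simplest family of MIAs: a loss-based attack that predicts "member" when the target model's average per-token loss on a candidate text falls below a threshold. This is an illustrative sketch only, not the paper's released benchmark code; the Pythia checkpoint named here is one of the Pile-trained models in the evaluated size range, and the threshold value is a hypothetical placeholder.

```python
# Minimal sketch of a loss-thresholding membership inference attack.
# Assumptions: transformers + torch installed; the threshold (3.0) is an
# arbitrary placeholder, not a value from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-160m"  # smallest Pile-trained model in the evaluated suite

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def average_loss(text: str) -> float:
    """Mean per-token negative log-likelihood of `text` under the target model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss.item()


def predict_member(text: str, threshold: float = 3.0) -> bool:
    """Lower loss => higher model confidence => predicted training-set member."""
    return average_loss(text) < threshold  # threshold is a hypothetical placeholder


if __name__ == "__main__":
    candidate = "The quick brown fox jumps over the lazy dog."
    print(f"loss={average_loss(candidate):.3f}  member={predict_member(candidate)}")
```

Stronger attacks in the literature refine this score, for example by calibrating against a reference model or aggregating only the lowest-probability tokens; the paper's central finding is that even such refinements stay close to random guessing in most evaluated settings.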

Further reading