Elephants Never Forget: Memorization and Learning of Tabular Data in Large Language Models
While many have shown how Large Language Models (LLMs) can be applied to a diverse set of tasks, the critical issues of data contamination and memorization are often glossed over. In this work, we address this concern for tabular data. Specifically, we introduce a variety of different techniques to assess whether a language model has seen a tabular dataset during training. This investigation reveals that LLMs have memorized many popular tabular datasets verbatim. We then compare the few-shot learning performance of LLMs on datasets that were seen during training to the performance on datasets released after training. We find that LLMs perform better on datasets seen during training, indicating that memorization leads to overfitting. At the same time, LLMs show non-trivial performance on novel datasets and are surprisingly robust to data transformations. We then investigate the in-context statistical learning abilities of LLMs. While LLMs are significantly better than random at solving statistical classification problems, the sample efficiency of few-shot learning lags behind traditional statistical learning algorithms, especially as the dimension of the problem increases. This suggests that much of the observed few-shot performance on novel real-world datasets is due to the LLM’s world knowledge. Overall, our results highlight the importance of testing whether an LLM has seen an evaluation dataset during pre-training. We release the https://github.com/interpretml/LLM-Tabular-Memorization-Checker Python package to test LLMs for memorization of tabular datasets.
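To make the idea of a memorization test concrete, below is a minimal sketch of one such technique in the spirit of the abstract: a row-completion test that shows the model a run of consecutive dataset rows and checks whether it reproduces the next row verbatim. The `query_llm` helper, the prompt wording, and the parameters are illustrative assumptions, not the released package's actual API.

```python
import csv
import random


def query_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM completion endpoint (illustrative assumption)."""
    raise NotImplementedError


def row_completion_test(csv_path: str, num_trials: int = 25, prefix_rows: int = 10) -> float:
    """Show the model `prefix_rows` consecutive CSV rows and ask it to
    reproduce the row that follows; return the fraction of verbatim matches."""
    with open(csv_path, newline="") as f:
        rows = [",".join(r) for r in csv.reader(f)]
    header, body = rows[0], rows[1:]

    matches = 0
    for _ in range(num_trials):
        # Pick a random window of consecutive rows and hold out the row that follows it.
        start = random.randint(0, len(body) - prefix_rows - 1)
        prefix = body[start : start + prefix_rows]
        target = body[start + prefix_rows]
        prompt = (
            "Complete the next row of this dataset exactly as it appears:\n"
            + header + "\n" + "\n".join(prefix) + "\n"
        )
        lines = query_llm(prompt).strip().splitlines()
        predicted = lines[0] if lines else ""
        if predicted == target:
            matches += 1
    return matches / num_trials


# Example (hypothetical dataset path): a high exact-match rate is evidence
# that the model has seen the table verbatim during training.
# rate = row_completion_test("adult.csv")
```

A completion rate far above what chance or column statistics would allow suggests the dataset was part of the training corpus; the released package implements this style of test, among others.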
Further reading
- Access the paper on arXiv.org