Datasheet for the Pile

This datasheet describes the Pile, an 825 GiB dataset of human-authored text compiled by EleutherAI for use in large-scale language modeling. The Pile is composed of 22 different text sources, ranging from original scrapes done for this project, to text data made available by the data owners, to third-party scrapes available online.

Further reading