Croissant: A Metadata Format for ML-Ready Datasets
Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, enabling easy loading into the most commonly used ML frameworks, regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, and complete, yet concise.
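To give a rough sense of what such a shared representation might look like, the sketch below builds an abridged, Croissant-style JSON-LD description of a single CSV file using only the Python standard library. It is an illustration under stated assumptions, not a complete or validated Croissant document: the helper `build_croissant_sketch`, the dataset name, the column names, and the `example.com` URL are all hypothetical, and several properties that a full description would carry are left out.

```python
import json


def build_croissant_sketch(name, description, csv_url, columns):
    """Return an abridged, Croissant-style JSON-LD dict for one CSV file.

    `columns` maps a column name to a data type such as "sc:Text".
    This is a simplified illustration, not a validated Croissant document.
    """
    file_id = "data.csv"  # hypothetical identifier for the file resource

    # One field per CSV column, each pointing back at the file it is read from.
    fields = [
        {
            "@type": "cr:Field",
            "name": column,
            "dataType": data_type,
            "source": {
                "fileObject": {"@id": file_id},
                "extract": {"column": column},
            },
        }
        for column, data_type in columns.items()
    ]

    return {
        "@context": {
            "@vocab": "https://schema.org/",
            "sc": "https://schema.org/",
            "cr": "http://mlcommons.org/croissant/",
        },
        "@type": "sc:Dataset",
        "name": name,
        "description": description,
        # Resource layer: where the raw bytes live and how they are encoded.
        "distribution": [
            {
                "@type": "cr:FileObject",
                "@id": file_id,
                "name": file_id,
                "contentUrl": csv_url,
                "encodingFormat": "text/csv",
            }
        ],
        # Structure layer: how the raw file maps to records and fields.
        "recordSet": [
            {"@type": "cr:RecordSet", "name": "default", "field": fields}
        ],
    }


# Hypothetical dataset used purely for illustration.
metadata = build_croissant_sketch(
    name="toy_reviews",
    description="A toy table of review texts and labels.",
    csv_url="https://example.com/data.csv",
    columns={"text": "sc:Text", "label": "sc:Integer"},
)

print(json.dumps(metadata, indent=2))
```

The point of the sketch is the separation of concerns: the `distribution` entry says where the file is stored, while the `recordSet` entry says how to read it as records, which is what would let a loader consume the data regardless of where it is hosted.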
Further reading
- Access Paper in arXiv.org