The OLMo Cookbook: Open Recipes for Language Model Data Curation

Title: The OLMo Cookbook: Open Recipes for Language Model Data Curation

Abstract: Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it can be challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities, risks and limitations. In this talk, I'll present how we approach data curation research for OLMo, our project to develop and share fully open language models. Reflecting on our journey from OLMo 1 to our latest release of OLMo 2, I'll explore how data curation practices have matured across our work and the broader open data research ecosystem. Finally, I'll examine key challenges and opportunities for open data amid a rapidly changing language model landscape.

Bio: Kyle Lo is a research scientist at the Allen Institute for AI (Ai2), where he co-leads the OLMo project on open language modeling research. His current work focuses on data-driven approaches to model behavior and efficient language model experimentation. His research on language model development and adaptation, evaluation methods, and human-AI interaction has won awards at ACL, EMNLP and CHI. Kyle’s work on language models for science, including fact checking, summarization, and augmented reading, has been featured in Nature, Science, TechCrunch and other publications. Kyle holds a degree in Statistics from the University of Washington. Outside of work, he enjoys board games, boba tea, D&D, and spending time with his cat Belphegor.