IC Distinguished Lecture Series with Guest Speaker Noah Smith

Abstract

Neural language models with billions of parameters, trained on trillions of words, are powering the fastest-growing computing applications in history and generating discussion and debate around the world. Yet most scientists cannot study or improve those state-of-the-art models because the organizations deploying them keep their data and machine learning processes secret. I believe that the path to models that are usable by all, at low cost, customizable for areas of critical need like the sciences, and whose capabilities and limitations are made transparent and understandable, is radically open development, with academic and not-for-profit researchers empowered to do reproducible science. Projects like Falcon, Llama, MPT, and Pythia provide glimmers of hope. In this talk, I’ll share the story of the work our team is doing to radically open up the science of language modeling. As of April 2024, we’ve released Dolma, a three-trillion-token open dataset curated for training language models, and used it to pretrain OLMo v1, also publicly released. We’ve also built and released Tülu, a series of open instruction-tuned models. All of these come with open-source code and extensive documentation, including new tools for evaluation. Together, these artifacts make it possible to explore new scientific questions and democratize control of the future of this fascinating and important technology.

The work I’ll present was carried out primarily by a large team at the Allen Institute for Artificial Intelligence in Seattle, with collaboration from the Paul G. Allen School at the University of Washington and various kinds of support and coordination from many organizations, including the Kempner Institute for the Study of Natural and Artificial Intelligence at Harvard University, AMD, CSC - IT Center for Science (Finland), Databricks, and Together.ai.