Title: Language Generation in the Limit 

Abstract: Although current large language models are complex, the most basic specifications of the underlying language generation problem itself are simple to state: given a finite set of training samples from an unknown language, produce valid new strings from the language that don't already appear in the training data. Here we ask what we can conclude about language generation using only this specification, without any further properties or distributional assumptions. In particular, we consider models in which an adversary enumerates the strings of an unknown target language that is known only to come from a possibly infinite list of candidate languages, and we show that it is possible to give certain non-trivial guarantees for language generation in this setting. The resulting guarantees contrast dramatically with negative results due to Gold and Angluin in a well-studied model of language learning where the goal is to identify an unknown language from samples; the difference between these results suggests that identifying a language is a fundamentally different problem than generating from it. (This is joint work with Sendhil Mullainathan.) 
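To make the model in the abstract concrete, here is a minimal Python sketch of the generation game, under strong simplifying assumptions: the candidate languages form a nested chain L_1 ⊇ L_2 ⊇ ..., in which case the naive strategy "generate an unseen string from the smallest candidate still consistent with the adversary's sample" already succeeds. All names (Language, make_chain_language, generate_step) are invented for this illustration; this is not the paper's general algorithm for arbitrary countable collections, which is substantially more involved.

    # Toy sketch of "generation in the limit" for a nested chain of
    # candidate languages L_k = { a^n : n >= k } over the alphabet {a}.
    # The adversary enumerates an unknown target K = L_i; at each step the
    # generator must output a string that is in K but not yet in the sample.

    from dataclasses import dataclass
    from typing import Callable, Iterator, List, Set

    @dataclass
    class Language:
        """A candidate language: a membership test plus an enumerator."""
        name: str
        contains: Callable[[str], bool]
        enumerate: Callable[[], Iterator[str]]

    def make_chain_language(k: int) -> Language:
        """L_k = all strings a^n with n >= k; the L_k form a descending chain."""
        return Language(
            name=f"L_{k}",
            contains=lambda s, k=k: set(s) <= {"a"} and len(s) >= k,
            enumerate=lambda k=k: ("a" * n for n in range(k, 10**9)),
        )

    def generate_step(candidates: List[Language], sample: Set[str]) -> str:
        """Toy strategy: among candidates consistent with the sample, take the
        smallest language (the highest index in the chain) and emit an unseen
        string from it. Because the chain is nested and the target itself is
        always consistent, that smallest consistent language is contained in
        the target, so every output is guaranteed valid."""
        consistent = [L for L in candidates if all(L.contains(s) for s in sample)]
        smallest = consistent[-1]  # candidates are listed largest-to-smallest
        for w in smallest.enumerate():
            if w not in sample:
                return w
        raise RuntimeError("language exhausted")  # cannot happen: each L_k is infinite

    # The adversary enumerates the target K = L_3 in some arbitrary order.
    candidates = [make_chain_language(k) for k in range(1, 6)]  # L_1 ⊇ ... ⊇ L_5
    target = candidates[2]  # K = L_3
    adversary_order = ["a" * n for n in (7, 3, 5, 4, 10)]

    sample: Set[str] = set()
    for s in adversary_order:
        sample.add(s)
        w = generate_step(candidates, sample)
        assert target.contains(w) and w not in sample
        print(f"saw {s!r:8} -> generated {w!r} (valid and unseen)")

Running this prints, at each step, the adversary's string alongside a generated string that is both in the target language and absent from the sample. For general collections the consistent candidates need not be comparable by containment, which is where the construction in the paper does its real work.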

Bio: Jon Kleinberg is a professor in both Computer Science and Information Science. His research focuses on issues at the interface of networks and information, with an emphasis on the social and information networks that underpin the Web and other on-line media. His work has been supported by an NSF CAREER Award, an ONR Young Investigator Award, a MacArthur Foundation Fellowship, a Packard Foundation Fellowship, a Sloan Foundation Fellowship, and grants from Google, Yahoo!, and the NSF. He is a member of the National Academy of Sciences, the National Academy of Engineering, and the American Academy of Arts and Sciences.