Date Posted: 9/23/2024

The Association for Computational Linguistics (ACL) has chosen a paper by Lillian Lee '93, the Charles Roy Davis Professor in the Ann S. Bowers College of Computing and Information Science, for its 25-year Test of Time Paper Award.

Each year, ACL selects up to two papers published 25 years earlier, and up to two papers published 10 years earlier, for contributions that have had a "long-lasting impact on the field of natural language processing (NLP) and computational linguistics." 

Lee's 1999 paper, "Measures of Distributional Similarity," was published in the Proceedings of the 37th Annual Meeting of the ACL. It was an early contribution in the field of language modeling.

Today, large language models (LLMs), like ChatGPT, generate text by predicting the statistically most likely next word based on the data they were trained on – buckets and buckets of text scraped from across the internet. 

In the late 1990s, researchers faced a similar language prediction challenge, but with more modest immediate goals. At the time, they were trying to differentiate between ambiguous sentences in so-called "decoding" tasks for speech recognition, machine translation, and other language applications. Lee gives this example: Did someone say, "It's hard to recognize speech"? Or did they actually say, "It's hard to wreck a nice beach"? Those sentences sound similar but mean very different things. A language model of the era would choose between them based on which word sequence was statistically more likely given the surrounding words.
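A minimal sketch of that kind of decision, assuming a toy bigram model with add-one smoothing (the corpus, counts, and candidate sentences below are invented for illustration and are not from Lee's work):

```python
# Toy sketch (not Lee's actual system): a bigram language model that
# scores two acoustically similar transcriptions.
from collections import Counter

corpus = (
    "it is hard to recognize speech . "
    "speech recognition is hard . "
    "we recognize speech with a model . "
    "a storm can wreck a ship . "
    "they walk on a nice beach ."
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = len(unigrams)

def bigram_prob(prev, word):
    # Add-one (Laplace) smoothing: unseen pairs get a small, nonzero probability.
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)

def sentence_prob(sentence):
    # Multiply the conditional probabilities of each word given the previous one.
    words = sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

for candidate in ("hard to recognize speech", "hard to wreck a nice beach"):
    print(candidate, sentence_prob(candidate))
```

Under this toy model, the transcription whose word pairs were seen more often in training gets the higher score, which is how a decoder would break the tie.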

But how could a model from the '90s make this call when it had been trained on only a few million words, instead of the entire internet? To address this "sparse data" problem, Lee and other NLP researchers used an approach called distributional similarity – the idea that words that appear in the same contexts tend to have similar meanings. By identifying words with similar contexts, the language model could use one as a guide for the other when estimating the odds that it would be used.

As an example, "cat" and "dog" tend to appear in similar sentences (e.g., ones that involve "vet," "play," or "pet"), whereas "cat" and "tree" do not, Lee said. If a language model doesn't know how to handle "cat," it can make a guess by treating it like "dog" instead of like "tree." 
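A rough illustration of that idea in code, using invented co-occurrence counts and cosine similarity as a stand-in for the many measures Lee's paper actually compares:

```python
# Hedged illustration of distributional similarity: represent each word by
# the words seen in its context, then compare those profiles. The tiny
# co-occurrence counts below are made up for this example.
import math

contexts = {
    "cat":  {"vet": 8, "play": 5, "pet": 7, "leaf": 0},
    "dog":  {"vet": 9, "play": 6, "pet": 6, "leaf": 0},
    "tree": {"vet": 0, "play": 1, "pet": 0, "leaf": 9},
}

def cosine(u, v):
    # Cosine similarity between two sparse count vectors.
    keys = set(u) | set(v)
    dot = sum(u.get(k, 0) * v.get(k, 0) for k in keys)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)

print(cosine(contexts["cat"], contexts["dog"]))   # high: similar contexts
print(cosine(contexts["cat"], contexts["tree"]))  # low: different contexts
```

A model lacking reliable counts for "cat" could then borrow statistics from its nearest neighbor (treating "cat" like "dog") to fill in estimates it can't compute directly.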

In her 1999 paper, Lee evaluated several previously proposed measures of distributional similarity and, guided by that analysis, introduced a measure of her own. 

“I was looking at various different kinds of mathematical ways you could try to decide that two sentences or two words or two pieces of language are similar, so that you can use them to help estimate the probabilities of one versus the other,” Lee said. 

Her new measure, called the skew divergence, quantifies the difference in usage between two words. Since then, the skew divergence has found applications not just in language modeling, but also in quantum computing, image recognition, and graph analysis.
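In rough terms, the skew divergence compares one word's usage distribution against a slightly smoothed copy of another's, so it stays finite even when one word appears in contexts the other never does. A minimal sketch, with invented distributions and α = 0.99 (a value close to 1 is the usual choice):

```python
# Sketch of the alpha-skew divergence: the KL divergence of r against a
# mixture of q and r. The mixture keeps every term finite wherever r > 0.
# The two distributions below are toy examples, not data from the paper.
import math

def kl(p, q):
    # Kullback-Leibler divergence D(p || q); assumes q[i] > 0 wherever p[i] > 0.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def skew_divergence(q, r, alpha=0.99):
    # s_alpha(q, r) = D(r || alpha*q + (1 - alpha)*r)
    mixture = [alpha * qi + (1 - alpha) * ri for qi, ri in zip(q, r)]
    return kl(r, mixture)

# Two toy word-usage distributions over the same four contexts.
q = [0.5, 0.3, 0.2, 0.0]
r = [0.4, 0.3, 0.2, 0.1]
print(skew_divergence(q, r))
```

Like the KL divergence it builds on, the measure is asymmetric: skew_divergence(q, r) generally differs from skew_divergence(r, q).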

Lee considers this Test of Time Paper Award to be a joint award, in recognition of a collection of language modeling papers published by her and colleagues around that time. During the summers before and after receiving her undergraduate degree in math and computer science from Cornell, Lee interned at AT&T, working with Fernando Pereira, Naftali Tishby, and Ido Dagan. This experience sparked her interest in natural language processing. Pereira, Dagan, and Tishby – all pioneers in the field – helped launch her career.

Following her early work in NLP, Lee's research focus shifted to applying computational techniques to understand social aspects of language – but she said it has been fascinating to see how LLMs have evolved.

So, does NLP still have a sparse data problem? Yes and no, Lee said. 

“We certainly have lots and lots more data than we used to have – that's one half of why large language models work so well," she said. "Back then, we only had 20 million words!” 

But endangered languages often lack enough written data to train an LLM. Additionally, new slang and novel uses for old words are constantly emerging, meaning there will always be unfamiliar words. 

“The sparse data problem is still with us," Lee said. "It's in a very different format.”

By Patricia Waldron, a writer for the Cornell Ann S. Bowers College of Computing and Information Science.