Date Posted: 1/23/2025

When provided with vast amounts of information, contemporary artificial intelligence (AI) models can pass the bar exam, beat the world's best Go player and generate photorealistic images. Unlike humans, however, the technology still struggles to work through novel problems when given little starting data.

In 2019, French software engineer François Chollet highlighted this weakness by challenging the computer science community to create models that could solve a series of simple visual puzzles. His Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) problem set includes hundreds of puzzles, each testing a different concept, and serves as a benchmark of how well AI systems generalize "on the fly" to solve novel problems from only a few examples. Five years and many attempts later, the $1 million prize for a model that can score at least 85% remains unclaimed.

Now, a team from Cornell led by Kevin Ellis, assistant professor of computer science in the Cornell Ann S. Bowers College of Computing and Information Science, has developed a set of AI models that together solve about 56% of the problems, scoring within 4 percentage points of the average human.

They submitted their paper, “Combining Induction and Transduction for Abstract Reasoning,” to the 2024 ARC Prize competition, where their solution received a first-place paper award. Lead authors Wen-Ding Li, a doctoral student in the field of computer science, and Keya Hu, a visiting undergraduate student in Ellis’ group, will share a $50,000 prize.

“It felt like a milestone for me, because this ARC competition really withstood five years with so many other good AI models being proposed,” Li said.

Chollet’s ARC-AGI set has hundreds of problems in the form of colored grids. Each puzzle provides two or more before-and-after examples that demonstrate the solution; the solver must detect the pattern and apply it to a final “before” grid. Solutions can be anything from filling in open shapes with a specific color to repeating a design to highlighting a path through a maze.
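For readers who want the format concretely: ARC-AGI tasks are published as JSON, with "train" demonstration pairs and a "test" input, and each grid is an array of integer color codes from 0 to 9. The Python sketch below shows a toy task in that shape; the task and its hidden rule (recolor every 1 to 2) are invented here for illustration, not drawn from the benchmark.

```python
# A toy task in the ARC-AGI JSON shape; grids are lists of rows of
# color codes 0-9. This task and its rule are invented for illustration.
task = {
    "train": [
        {"input": [[0, 1], [1, 0]], "output": [[0, 2], [2, 0]]},
        {"input": [[1, 1], [0, 1]], "output": [[2, 2], [0, 2]]},
    ],
    "test": [{"input": [[1, 0], [0, 1]]}],
}

def apply_rule(grid):
    """The hidden pattern of this toy task: recolor every 1 to 2."""
    return [[2 if cell == 1 else cell for cell in row] for row in grid]

# A correct solver must infer the rule from the train pairs alone,
# then apply it to the test input.
assert all(apply_rule(ex["input"]) == ex["output"] for ex in task["train"])
print(apply_rule(task["test"][0]["input"]))  # [[2, 0], [0, 2]]
```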

“It occurred to us that there were two ways of doing this,” said Ellis, whose lab members began training two neural networks, a type of AI model inspired by the brain that learns across millions or billions of connections between nodes. Hu's model simply spat out an answer, directly solving problems without explaining the solution in code. Li's model, meanwhile, was more methodical: it searched for an explanation of how to solve each puzzle in the form of a computer program.
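The paper's title names these two routes: transduction (predict the output grid directly) and induction (first find a program, then run it). The schematic below, reusing the toy `task` above, sketches the two interfaces; `toy_sampler` and `toy_predictor` are hand-written stand-ins for the team's neural networks, not their actual models.

```python
def solve_by_transduction(task, predict_grid):
    """Transduction: a model maps the demonstrations and the test
    input straight to an output grid, with no intermediate code."""
    return predict_grid(task["train"], task["test"][0]["input"])

def solve_by_induction(task, sample_programs):
    """Induction: search candidate programs, keep one that reproduces
    every demonstration, then run it on the test input."""
    for program in sample_programs(task["train"]):
        if all(program(ex["input"]) == ex["output"] for ex in task["train"]):
            return program(task["test"][0]["input"])
    return None  # no consistent program found within the search budget

# Hand-written stand-ins for the two neural networks:
def toy_sampler(train_pairs):
    yield lambda g: [[2 if c == 1 else c for c in row] for row in g]

def toy_predictor(train_pairs, test_input):
    return [[2 if c == 1 else c for c in row] for row in test_input]

print(solve_by_induction(task, toy_sampler))       # [[2, 0], [0, 2]]
print(solve_by_transduction(task, toy_predictor))  # [[2, 0], [0, 2]]
```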

These approaches are analogous to the two ways that humans are believed to form thoughts: System 1, or "fast" thinking, which is based on intuition; and System 2, or "slow" thinking, which involves logic and deliberation. 

The researchers trained their models to solve the problem set using hundreds of thousands of variations of the example solutions.
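The article does not detail how those variations were produced, but one standard way to multiply ARC-style training data is to transform a task's surface appearance while preserving its underlying rule, for example by rotating grids and permuting colors. A toy sketch of that general idea, with helper names invented for illustration:

```python
def recolor(grid, perm):
    """Swap colors according to a permutation, e.g. {1: 3, 2: 4}."""
    return [[perm.get(cell, cell) for cell in row] for row in grid]

def rotate(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def make_variant(task, perm):
    """One surface-level variant of a task: every grid in every
    example is rotated and recolored the same way, so the hidden
    rule survives in transformed form."""
    return {
        split: [{key: recolor(rotate(grid), perm) for key, grid in ex.items()}
                for ex in task[split]]
        for split in task
    }

# e.g. make_variant(task, {1: 3, 2: 4}) turns the toy task above into
# a rotated variant whose hidden rule is now "recolor every 3 to 4"
```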

Individually, each model solved roughly 40% of the problems. Ellis then proposed combining them: start with the slow model and switch to the fast one if the first times out. Together, the pair surpassed all previously published models.
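In code, that fallback scheme might look like the minimal sketch below; the solver interfaces and the time budget are illustrative assumptions, not the paper's actual settings.

```python
import time

def solve(task, induction_solver, transduction_solver, budget_s=60.0):
    """Run the deliberate, program-searching solver first; fall back
    to the direct predictor if the search times out or comes up empty.
    The interfaces and 60-second budget are assumptions."""
    deadline = time.monotonic() + budget_s
    answer = induction_solver(task, deadline=deadline)  # "slow" System 2
    if answer is None:  # timed out, or no program fit the demonstrations
        answer = transduction_solver(task)              # "fast" System 1
    return answer

# Trivial demo with stand-in solvers: the slow search fails,
# so the fast model's direct guess is returned.
print(solve({"test": [{"input": [[1]]}]},
            lambda t, deadline: None,
            lambda t: [[2]]))  # [[2]]
```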

“They have different, unique advantages, so when we combine these two approaches together, it further boosts our performance,” Hu said. “We get 56.75% accuracy, and it's very, very close to the average human performance of 60%.”

A closer look at the results showed the models were solving different types of problems. The researchers were surprised, because they trained both models on the same problems and expected them to give similar results.

“That was not what we expected,” Ellis said. “They actually complement each other really nicely.” 

[Image: Problems solved by the "slow" System 2 model, but not the System 1 model.]

[Image: A problem solved by the "fast" System 1 model, but not by the System 2 model.]

Ellis sees connections between this work and classic cognitive science studies finding that certain problems actually become harder when people are encouraged to deliberate, such as when they must learn rules with exceptions.

"We discovered within ARC-AGI that certain problems are similarly made more challenging when systematically searching for symbolic explanations," Ellis said.

Currently, the two models function independently, but the researchers are already investigating ways of alternating between the two modes of problem-solving. The approach could have applications in several other domains, the researchers said, including enabling robots to learn novel skills, organize new scenes and objects into visual categories, and understand cause and effect from very few starting examples.

The research team included Cornell undergraduates Carter Larsen ’25 and Yuqing Wu ’26 (both Bowers Undergraduate Research Experience summer students), Caleb Woo ’26 and Spencer Dunn ’24, as well as Simon Alford and Hao Tang, both doctoral students in computer science. Additional authors are Wei-Long Zheng of Shanghai Jiao Tong University; Michelangelo Naim, Dat Nguyen and Zenna Tavares of Basis; and Evan Pu, a graduate school classmate of Ellis who works at Autodesk.

Funding from a National Science Foundation CAREER Award to Ellis helped support this work.

By Patricia Waldron, a writer for the Cornell Ann S. Bowers College of Computing and Information Science