In the ongoing evolution of human-robot interaction, one of the central goals of research remains enabling robots to understand instructions specified in natural language. Most current real-world robot systems are either built to solve one particular task in a specific way (e.g., robot vacuum cleaners), or use specialized controllers that require expertise and training to use (e.g., manufacturing robots). For robots to be useful companions in our everyday lives, a layperson, without prior training, should be able to tell the robot in natural human speech what he or she wants the robot to do. And the robot should then do it.
Achieving this vision requires robots to “reason” about sentence structure and, moreover, about how words and phrases correspond to objects and places in the world. It also requires robots to “infer” what changes in the environment are needed to satisfy the user’s goal, and to determine what sequence of actions will achieve it. Recently, a team of Cornell researchers entered a competition aimed at improving human-robot interaction via natural language, and they won.
The team, which includes Yoav Artzi, Associate Professor of Computer Science at Cornell Tech, and Valts Blukis, a fifth-year Ph.D. candidate in Computer Science, along with Chris Paxton, Dieter Fox, and Animesh Garg at NVIDIA, won first place in the ALFRED Challenge (Action Learning From Realistic Environments and Directives) at the 2021 EAI@CVPR (Embodied Artificial Intelligence workshop at the Conference on Computer Vision and Pattern Recognition). Teams were invited to compete on “embodied visual tasks that require the grounding of language to actions in real-world settings.” “ALFRED,” as explained in this video, offers “a new benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks.”
With their prize-winning machine learning solution, Artzi, Blukis, and their teammates have made new strides in helping to bridge disparate computer science fields, and have improved the capacity of robots to interact with humans and be helpful companions in our everyday lives. Their work will be presented at the Conference on Robot Learning 2021, which brings together the world’s leading researchers at the intersection of machine learning and robotics.
Embodied artificial intelligence (AI), which involves vision-and-language and vision-and-robotics, sits at the interface of three different fields of inquiry: vision, robotics, and natural language processing. At the intersection of these fields, the problems researchers and designers face include partial observability, continuous state spaces, and irrevocable actions for language-guided agents in visual environments. Current datasets do not capture these kinds of instructions and operations. Thus, this workshop-as-competition aims to encourage the further development of embodied vision and language.