A Vision-Language-Action Flow Model for General Robot Control

Abstract: Robot learning has the potential to unlock flexible, general, and dexterous systems while addressing key AI challenges. However, achieving the generality needed for real-world applications faces obstacles in data availability, generalization, and robustness. This talk will describe the journey of building our flagship model, Pi_0 [1]. We propose a novel flow-matching architecture built on a pre-trained vision-language model to leverage Internet-scale semantic knowledge. The model is trained on diverse datasets from various dexterous robots, including single-arm, dual-arm, and mobile manipulators. We evaluate its zero-shot performance, its ability to follow language instructions, and its capacity to learn new skills through fine-tuning on tasks such as laundry folding, table cleaning, and box assembly.
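For readers unfamiliar with flow matching, the following is a minimal sketch of a generic conditional flow-matching objective; the notation (action chunk A, observation o, velocity network v_theta, interpolation time tau) is chosen here for illustration and is not taken from the talk or the paper. Given an action chunk A from the data, noise epsilon ~ N(0, I), and the interpolated sample A^tau = tau * A + (1 - tau) * epsilon, the network is trained to regress the velocity of the interpolation path:

    L(theta) = E_{tau, A, epsilon, o} [ || v_theta(A^tau, o, tau) - (A - epsilon) ||^2 ]

At inference time, actions are produced by starting from noise and numerically integrating v_theta toward the data distribution, conditioned on the observation o.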

Bio: Quan Vuong is a co-founder at Physical Intelligence. His research focuses on generalist robotics and algorithms that enable intelligent behaviors through large-scale learning. His work has been featured in popular news outlets such as The New York Times and TechCrunch. He received his Ph.D. in Computer Science from the University of California, San Diego.