Hello friend,
In this series we have covered individual topics, from balance and movement to learning and simulating the real world. Now we look at how all of these pieces come together to build robots that can do many things.
This is the frontier of task learning and foundation models in robotics.
What is a Foundation Model for Humanoids?
A robotic foundation model is a single model trained on large amounts of data, including simulated and real-world examples, that can be reused across many different tasks.
The most common type is the Vision-Language-Action (VLA) model, which takes as input:
- Pictures from cameras
- Instructions in natural language
- The robot's own senses, such as joint positions and balance
and outputs actions, such as target joint commands.
These models aim to be general-purpose: one model can walk, manipulate objects, use tools, interact with people, and adapt to new environments.
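To make that input/output structure concrete, here is a minimal sketch in NumPy. The encoders are random linear maps standing in for trained networks, and every dimension (a 32×32 camera image, a 300-dimensional instruction embedding, 17 joints) is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random weight matrix standing in for a trained layer."""
    return rng.standard_normal((in_dim, out_dim)) * 0.01

NUM_JOINTS = 17
EMBED = 64

# Stand-ins for the three input encoders of a VLA model
W_vision = linear(3 * 32 * 32, EMBED)           # camera image -> embedding
W_lang = linear(300, EMBED)                     # instruction embedding -> shared space
W_proprio = linear(2 * NUM_JOINTS + 3, EMBED)   # joint pos/vel + balance -> embedding
W_head = linear(3 * EMBED, NUM_JOINTS)          # fused features -> joint targets

def vla_forward(image, instruction_vec, proprio_state):
    v = image.reshape(-1) @ W_vision
    l = instruction_vec @ W_lang
    p = proprio_state @ W_proprio
    fused = np.concatenate([v, l, p])           # combine all modalities
    return fused @ W_head                       # target joint positions

action = vla_forward(
    rng.standard_normal((3, 32, 32)),           # camera frame
    rng.standard_normal(300),                   # embedded "pick up the cup"
    rng.standard_normal(2 * NUM_JOINTS + 3),    # joint pos/vel + balance sensors
)
print(action.shape)  # (17,)
```

Real systems replace each random matrix with a large trained network (a vision transformer, a language model), but the data flow is the same: three streams in, one action out.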
Multi-Task Learning: The Core Training Strategy
Instead of training separate models for each task, multi-task learning trains one big model on many tasks at the same time.
The benefits of this include:
- Skills can reinforce each other: learning to walk, for example, helps with carrying objects
- The model learns how the world actually works, instead of just memorizing what to do
- The model can sometimes generalize to tasks it was never explicitly trained on
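Here is a toy sketch of the core idea: one shared policy trained by sampling across several task datasets, so every task's gradient updates the same parameters. The tasks and numbers are invented for illustration:

```python
import random

random.seed(0)

# Hypothetical per-task datasets: (observation, target action) pairs.
datasets = {
    "walk":  [([0.1, 0.2], 0.5), ([0.3, 0.1], 0.4)],
    "grasp": [([0.9, 0.8], 0.1), ([0.7, 0.6], 0.2)],
    "carry": [([0.5, 0.5], 0.3), ([0.4, 0.6], 0.35)],
}

# One shared linear policy for all tasks (weights + bias)
w = [0.0, 0.0]
b = 0.0
lr = 0.1

def predict(obs):
    return w[0] * obs[0] + w[1] * obs[1] + b

for step in range(500):
    # Key idea: sample a task at random each step, so gradients
    # from every task shape the same shared parameters
    task = random.choice(list(datasets))
    obs, target = random.choice(datasets[task])
    err = predict(obs) - target
    w[0] -= lr * err * obs[0]
    w[1] -= lr * err * obs[1]
    b -= lr * err

avg_err = sum(
    abs(predict(obs) - target)
    for samples in datasets.values()
    for obs, target in samples
) / 6
print(f"average error across all three tasks: {avg_err:.3f}")
```

A single linear policy is far too small to be interesting on its own; the point is the training loop, where one set of parameters absorbs data from every task instead of each task getting its own model.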
Leading Approaches
There are many teams working on this:
- RT-X and Open X-Embodiment: Large collaborative projects that pool data from many robots; models trained on this data transfer across different robot bodies.
- π0 (Physical Intelligence): A leading vision-language-action model built as a general-purpose robot policy.
- Google DeepMind, Covariant and Tesla Optimus: Teams building large models that combine language and action.
- Figure: A company whose Figure 01 humanoid combines foundation models with classical controllers.
Most of these models start with imitation learning from demonstrations. They are then fine-tuned with human feedback or through trial and error on the robot itself.
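A toy two-stage sketch of that recipe, with invented linear demonstration data: stage one fits a policy to expert demonstrations (behavior cloning via least squares), and stage two nudges it with gradient steps on a stand-in reward, playing the role of human feedback or on-robot trial and error:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical teleoperation demos: observation -> expert action
obs = rng.standard_normal((200, 4))
true_w = np.array([0.5, -0.3, 0.8, 0.1])    # the "expert" the demos came from
expert_actions = obs @ true_w

# Stage 1: behavior cloning -- least-squares fit to the demos
w_bc, *_ = np.linalg.lstsq(obs, expert_actions, rcond=None)

# Stage 2: fine-tuning -- nudge the policy toward higher reward.
# Here the toy reward is -mean(action^2), i.e. prefer gentler actions,
# standing in for human preference feedback or on-robot rewards.
w = w_bc.copy()
for _ in range(100):
    a = obs @ w
    grad = -2 * obs.T @ a / len(obs)         # gradient of -mean(a^2) w.r.t. w
    w += 0.01 * grad                         # small gradient-ascent step

print(np.allclose(w_bc, true_w))             # True: cloning recovers the expert
print(np.abs(obs @ w).mean() < np.abs(obs @ w_bc).mean())  # True: gentler actions
```

The two stages mirror the real pipeline: cloning gets the policy into a sensible region quickly, then a reward signal refines behavior the demos alone cannot teach.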
Architecture Trends
Modern models usually use:
- A vision encoder that interprets camera images
- A language model that understands instructions
- A fusion module that combines all the information
- An action decoder that outputs what the robot should do
Because controlling a robot in real time is demanding, most systems combine:
- Foundation models for high-level thinking and planning
- Classical controllers for low-level actions and safety
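A minimal sketch of that split, assuming a single unit-mass joint: a slow "planner" standing in for the foundation model updates the target once per second, while a fast PD controller with a torque clamp, standing in for the classical safety layer, tracks it at 1 kHz. All gains and rates are invented for the example:

```python
def high_level_planner(second):
    """Stand-in for a foundation model: runs slowly (~1 Hz),
    outputs a coarse goal such as a target joint position."""
    return 1.0 if second < 5 else 0.0   # goal changes halfway through the run

def low_level_controller(target, position, velocity):
    """Classical PD controller: runs fast (1 kHz), tracks the
    planner's target and enforces a hard torque limit."""
    KP, KD, MAX_TORQUE = 20.0, 4.0, 2.0
    torque = KP * (target - position) - KD * velocity
    return max(-MAX_TORQUE, min(MAX_TORQUE, torque))  # safety clamp

position, velocity, dt = 0.0, 0.0, 0.001
target = 0.0
for tick in range(10_000):              # 10 s of simulated 1 kHz control
    if tick % 1000 == 0:                # planner fires once per second
        target = high_level_planner(tick // 1000)
    torque = low_level_controller(target, position, velocity)
    velocity += torque * dt             # unit-mass joint dynamics
    position += velocity * dt
print(round(position, 3))               # settles near the final target of 0.0
```

Note the two time scales: the planner's output only changes every 1000 ticks, but the controller reacts every tick, so even a slow or briefly wrong plan cannot command an unsafe torque.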
Challenges Remaining
- Controlling a humanoid's many joints safely is a hard, high-dimensional problem
- The robot needs to plan over long time horizons
- It must operate safely around people, without causing damage or injury
- Training requires large amounts of high-quality data
My Personal Take
We are at the beginning of something big. For the first time, we have a way to build robots that can learn, adapt, and do many things without task-specific engineering.
The combination of:
- Understanding physics and how things move
- Having good hardware
- Learning from examples and trying things
- Using foundation models for thinking and planning
is very powerful. I think the teams that can put all of these pieces together will be the winners in the next few years.
The age of robots that can only do one thing is ending. The age of general-purpose robots that are genuinely useful is beginning.