Hello friend,
In this series we have covered individual topics, from balance and movement to learning and simulating the real world. Now we look at how all of these pieces come together to build robots that can do many things.
This is the frontier of task learning and foundation models in robotics.
What is a Foundation Model for Humanoids?
A robotic foundation model is a single model trained on large amounts of data, including simulated and real-world examples, that can be reused across many different tasks.
The most common type is the Vision-Language-Action (VLA) model, which takes as input:
- Pictures from cameras
- Instructions in natural language
- The robot's own senses, such as joint positions and balance
and outputs actions, such as target joint commands.
These models aim to be general-purpose: one model can walk, manipulate objects, use tools, interact with people, and adapt to new environments.
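To make that input/output structure concrete, here is a minimal sketch in NumPy. The encoders are random linear maps standing in for trained networks, and every dimension (a 32×32 camera image, a 300-dimensional instruction embedding, 17 joints) is invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    """Random weight matrix standing in for a trained layer."""
    return rng.standard_normal((in_dim, out_dim)) * 0.01

NUM_JOINTS = 17
EMBED = 64

# Stand-ins for the three input encoders of a VLA model
W_vision = linear(3 * 32 * 32, EMBED)           # camera image -> embedding
W_lang = linear(300, EMBED)                     # instruction embedding -> shared space
W_proprio = linear(2 * NUM_JOINTS + 3, EMBED)   # joint pos/vel + balance -> embedding
W_head = linear(3 * EMBED, NUM_JOINTS)          # fused features -> joint targets

def vla_forward(image, instruction_vec, proprio_state):
    v = image.reshape(-1) @ W_vision
    l = instruction_vec @ W_lang
    p = proprio_state @ W_proprio
    fused = np.concatenate([v, l, p])           # combine all modalities
    return fused @ W_head                       # target joint positions

action = vla_forward(
    rng.standard_normal((3, 32, 32)),           # camera frame
    rng.standard_normal(300),                   # embedded "pick up the cup"
    rng.standard_normal(2 * NUM_JOINTS + 3),    # joint pos/vel + balance sensors
)
print(action.shape)  # (17,)
```

Real systems replace each random matrix with a large trained network (a vision transformer, a language model), but the data flow is the same: three streams in, one action out.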
Multi-Task Learning: The Core Training Strategy
Instead of training separate models for each task, multi-task learning trains one big model on many tasks at the same time.
The benefits of this include:
- Skills can reinforce each other: learning to walk, for example, helps with carrying objects
- The model learns how the world actually works, instead of just memorizing what to do
- The model can sometimes generalize to tasks it was never explicitly trained on
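Here is a toy sketch of the core idea: one shared policy trained by sampling across several task datasets, so every task's gradient updates the same parameters. The tasks and numbers are invented for illustration:

```python
import random

random.seed(0)

# Hypothetical per-task datasets: (observation, target action) pairs.
datasets = {
    "walk":  [([0.1, 0.2], 0.5), ([0.3, 0.1], 0.4)],
    "grasp": [([0.9, 0.8], 0.1), ([0.7, 0.6], 0.2)],
    "carry": [([0.5, 0.5], 0.3), ([0.4, 0.6], 0.35)],
}

# One shared linear policy for all tasks (weights + bias)
w = [0.0, 0.0]
b = 0.0
lr = 0.1

def predict(obs):
    return w[0] * obs[0] + w[1] * obs[1] + b

for step in range(500):
    # Key idea: sample a task at random each step, so gradients
    # from every task shape the same shared parameters
    task = random.choice(list(datasets))
    obs, target = random.choice(datasets[task])
    err = predict(obs) - target
    w[0] -= lr * err * obs[0]
    w[1] -= lr * err * obs[1]
    b -= lr * err

avg_err = sum(
    abs(predict(obs) - target)
    for samples in datasets.values()
    for obs, target in samples
) / 6
print(f"average error across all three tasks: {avg_err:.3f}")
```

A single linear policy is far too small to be interesting on its own; the point is the training loop, where one set of parameters absorbs data from every task instead of each task getting its own model.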
Leading Approaches
There are many teams working on this:
- RT-X and Open X-Embodiment: Large collaborative projects that pool data from many robots; models trained on this data transfer across different robot bodies.
- π0 (Physical Intelligence): A leading vision-language-action model built as a general-purpose robot policy.
- Google DeepMind, Covariant and Tesla Optimus: Teams building large models that combine language and action.
- Figure: A company whose Figure 01 humanoid combines foundation models with classical controllers.
Most of these models start with imitation learning from demonstrations. They are then fine-tuned with human feedback or through trial and error on the robot itself.
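A toy two-stage sketch of that recipe, with invented linear demonstration data: stage one fits a policy to expert demonstrations (behavior cloning via least squares), and stage two nudges it with gradient steps on a stand-in reward, playing the role of human feedback or on-robot trial and error:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical teleoperation demos: observation -> expert action
obs = rng.standard_normal((200, 4))
true_w = np.array([0.5, -0.3, 0.8, 0.1])    # the "expert" the demos came from
expert_actions = obs @ true_w

# Stage 1: behavior cloning -- least-squares fit to the demos
w_bc, *_ = np.linalg.lstsq(obs, expert_actions, rcond=None)

# Stage 2: fine-tuning -- nudge the policy toward higher reward.
# Here the toy reward is -mean(action^2), i.e. prefer gentler actions,
# standing in for human preference feedback or on-robot rewards.
w = w_bc.copy()
for _ in range(100):
    a = obs @ w
    grad = -2 * obs.T @ a / len(obs)         # gradient of -mean(a^2) w.r.t. w
    w += 0.01 * grad                         # small gradient-ascent step

print(np.allclose(w_bc, true_w))             # True: cloning recovers the expert
print(np.abs(obs @ w).mean() < np.abs(obs @ w_bc).mean())  # True: gentler actions
```

The two stages mirror the real pipeline: cloning gets the policy into a sensible region quickly, then a reward signal refines behavior the demos alone cannot teach.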
Architecture Trends
Modern models usually use:
- A vision encoder that interprets camera images
- A language model that understands instructions
- A fusion module that combines all the information
- An action decoder that outputs what the robot should do
Because controlling a robot in real time is demanding, most systems combine:
- Foundation models for high-level thinking and planning
- Classical controllers for low-level actions and safety
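A minimal sketch of that split, assuming a single unit-mass joint: a slow "planner" standing in for the foundation model updates the target once per second, while a fast PD controller with a torque clamp, standing in for the classical safety layer, tracks it at 1 kHz. All gains and rates are invented for the example:

```python
def high_level_planner(second):
    """Stand-in for a foundation model: runs slowly (~1 Hz),
    outputs a coarse goal such as a target joint position."""
    return 1.0 if second < 5 else 0.0   # goal changes halfway through the run

def low_level_controller(target, position, velocity):
    """Classical PD controller: runs fast (1 kHz), tracks the
    planner's target and enforces a hard torque limit."""
    KP, KD, MAX_TORQUE = 20.0, 4.0, 2.0
    torque = KP * (target - position) - KD * velocity
    return max(-MAX_TORQUE, min(MAX_TORQUE, torque))  # safety clamp

position, velocity, dt = 0.0, 0.0, 0.001
target = 0.0
for tick in range(10_000):              # 10 s of simulated 1 kHz control
    if tick % 1000 == 0:                # planner fires once per second
        target = high_level_planner(tick // 1000)
    torque = low_level_controller(target, position, velocity)
    velocity += torque * dt             # unit-mass joint dynamics
    position += velocity * dt
print(round(position, 3))               # settles near the final target of 0.0
```

Note the two time scales: the planner's output only changes every 1000 ticks, but the controller reacts every tick, so even a slow or briefly wrong plan cannot command an unsafe torque.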
Challenges Remaining
- Controlling a humanoid's many joints safely is a hard, high-dimensional problem
- The robot needs to plan over long time horizons
- It must operate safely around people, without causing damage or injury
- Training requires large amounts of high-quality data
My Personal Take
We are at the beginning of something big. For the first time, we have a way to build robots that can learn, adapt, and do many things without task-specific engineering.
The combination of:
- Understanding physics and how things move
- Having good hardware
- Learning from examples and trying things
- Using foundation models for thinking and planning
is very powerful. I think the teams that can put all of these pieces together will be the winners in the next few years.
The age of robots that can only do one thing is ending. The age of general-purpose robots that are genuinely useful is beginning.