Reinforcement Learning 101 for Humanoids: Reward Design for Walking

Hello, I want to tell you a story. I still remember the first time I tried to teach a robot to walk using reinforcement learning. After hours of training, my robot had learned one thing perfectly: how to fall over in creative ways.

This experience taught me something: the reward function is everything. When we use reinforcement learning for humanoids, a poorly designed reward can make even the most powerful algorithm fail. Today I will explain the basics of reinforcement learning for walking in a simple way.

What Is Reinforcement Learning in This Context?

Reinforcement learning is like teaching a dog a trick. You do not give the dog step-by-step instructions. Instead, you let the robot try actions and give it rewards when it does something good, like staying upright and moving forward, and penalties when it does something bad, like falling over.

The robot, called the agent, learns by trial and error to maximize its reward over time. For humanoid walking, the agent controls the robot's joints, and the environment, a physics simulator, tells it what happens next.
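
To make that loop concrete, here is a minimal sketch in Python using the Gymnasium API (the maintained successor to OpenAI Gym, covered in the next section). The random action is just a stand-in for whatever policy the agent eventually learns.

import gymnasium as gym

# A minimal agent-environment loop. Requires the MuJoCo extras:
# pip install "gymnasium[mujoco]"
env = gym.make("Humanoid-v4")
obs, info = env.reset(seed=0)

for step in range(1000):
    action = env.action_space.sample()  # stand-in for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:  # the robot fell, or time ran out
        obs, info = env.reset()

env.close()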

Popular Simulation Environments

There are two popular tools for humanoid reinforcement learning:

  • OpenAI Gym (now maintained as Gymnasium), which is great for beginners
  • MuJoCo, which is widely used in research and very good at simulating contact physics

Many modern humanoid projects, including research behind Optimus and Figure, use MuJoCo-based environments because they handle foot-ground contact and dynamics very realistically.
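
As a quick sketch, here is how you would load each kind of environment through the Gymnasium package; BipedalWalker-v3 is the classic beginner-friendly 2D walker, and Humanoid-v4 is the standard MuJoCo humanoid.

import gymnasium as gym

# Beginner-friendly 2D walker (Box2D physics): pip install "gymnasium[box2d]"
walker_env = gym.make("BipedalWalker-v3")

# Full 3D humanoid with MuJoCo contact physics: pip install "gymnasium[mujoco]"
humanoid_env = gym.make("Humanoid-v4")

# Inspect what the agent observes and controls.
print(humanoid_env.observation_space)  # joint positions, velocities, contacts
print(humanoid_env.action_space)       # one torque per actuated joint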

The Simplest Reinforcement Learning Problem for Bipedal Walking

Let's define a walking task:

Goal: Make the robot walk forward as far as possible without falling.

The robot can see:

  • Its joint angles and velocities
  • The position and velocity of its center of mass
  • Whether its feet are touching the ground
  • The orientation of its torso

The robot can do:

  • Control its joint angles or torques for its hips, knees and ankles

The task ends when the robot falls or reaches a time limit.
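
In code, that termination condition might look like the following sketch; torso_height and torso_angle are hypothetical observation fields, and the thresholds are rough values you would tune for your specific robot.

def is_fallen(obs):
    # Hypothetical fall check: the torso is too low or tilted too far.
    # Thresholds are illustrative and robot-specific.
    torso_too_low = obs["torso_height"] < 0.8         # meters
    torso_too_tilted = abs(obs["torso_angle"]) > 0.7  # radians, about 40 degrees
    return torso_too_low or torso_too_tilted

def episode_done(obs, step_count, max_steps=1000):
    # The task ends when the robot falls or reaches the time limit.
    return is_fallen(obs) or step_count >= max_steps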

Reward Design: The Important Part

This is where most people go wrong. Here is a simple reward function I recommend:

import numpy as np

def compute_reward(self, obs, action):
    # Forward velocity reward: the main goal
    forward_vel = obs["base_vel"][0]  # x-velocity of the base
    reward = 1.5 * forward_vel  # encourage moving forward

    # Alive bonus: small constant reward for not falling
    reward += 2.0

    # Energy penalty: don't waste power on large torques
    energy_cost = np.sum(np.square(action)) * 0.005
    reward -= energy_cost

    # Penalty for falling
    if self.is_fallen():
        reward -= 10.0

    # Torso orientation penalty: keep the torso upright
    torso_tilt = abs(obs["torso_angle"])
    reward -= 0.5 * torso_tilt

    return reward
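
To sanity-check the reward in isolation, here is a small sketch that wraps it in a dummy class; DummyRobot and the observation values are made up for the test, with keys matching what compute_reward reads.

import numpy as np

class DummyRobot:
    # Hypothetical stand-in so we can test the reward without a simulator.
    compute_reward = compute_reward  # reuse the function above as a method

    def is_fallen(self):
        return False  # pretend the robot is still upright

robot = DummyRobot()
obs = {"base_vel": np.array([0.9, 0.0, 0.0]),  # walking forward at 0.9 m/s
       "torso_angle": 0.1}                      # slight lean, in radians
action = np.zeros(6)                            # torques for six leg joints

print(robot.compute_reward(obs, action))
# 1.5 * 0.9 + 2.0 - 0.0 - 0.5 * 0.1 = 3.3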

Why This Reward Works Well

  • Forward velocity is the goal.
  • Alive bonus prevents the robot from learning to fall immediately.
  • Energy penalty encourages smooth, efficient gaits.
  • Torso tilt penalty helps keep the robot balanced.
  • Big fall penalty discourages falling.

You can start with a minimal reward, then add more terms as the robot improves.
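
One way to do that incrementally is to keep each term and its weight in a dictionary, enabling terms one at a time as training stabilizes. This is just an illustrative sketch; the term names and weights are mine, not from any particular library.

import numpy as np

# Start with only the essentials; uncomment entries as the robot improves.
REWARD_WEIGHTS = {
    "forward_vel": 1.5,
    "alive": 2.0,
    # "energy": -0.005,     # enable once the robot takes steps
    # "torso_tilt": -0.5,   # enable once it walks forward reliably
}

def reward_terms(obs, action):
    # Compute every candidate term; the weights decide which ones count.
    return {
        "forward_vel": obs["base_vel"][0],
        "alive": 1.0,
        "energy": float(np.sum(np.square(action))),
        "torso_tilt": abs(obs["torso_angle"]),
    }

def weighted_reward(obs, action):
    terms = reward_terms(obs, action)
    return sum(weight * terms[name] for name, weight in REWARD_WEIGHTS.items())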

Real-World Examples

Many successful humanoid walking policies started with reward functions like this. Researchers often begin with a simple flat-ground task, then increase the difficulty by adding terrain or obstacles.

My Personal Take

Reward design is both an art and a science. When I started, I made the mistake of making the reward function too complicated. The robot got confused and learned nothing.

My advice is to start simple. Get the robot to take one step forward. Once that works, add terms for smoothness and energy efficiency.

Modern approaches often combine reinforcement learning with imitation learning and human feedback. But the core idea remains the same: you need a well-designed reward to guide the learning.

Understanding reinforcement learning for walking helps you appreciate how impressive even simple walking demos are. Behind that gait are thousands of training episodes where the robot fell over until it finally figured it out.
