Useful humanoid robots will require a long list of competencies. They will need the ability to manipulate a diverse range of objects (e.g. hard/soft, heavy/delicate, rigid/articulated, large/small), as well as to coordinate their entire bodies to reconfigure themselves and their environments, avoid obstacles, and maintain balance while responding to surprises. We believe that building AI generalist robots is the most viable path to creating these competencies and achieving automation at scale with humanoids.

We are excited to share some of our progress on developing Large Behavior Models (LBMs) for Atlas®. This work is part of a collaboration between AI research teams at Toyota Research Institute (TRI) and Boston Dynamics. We have been building end-to-end language-conditioned policies that enable Atlas to accomplish long-horizon manipulation tasks.

These policies are able to take full advantage of the humanoid form factor, including taking steps, precisely positioning the feet, crouching, shifting the center of mass, and avoiding self-collisions, all of which we have found to be critical to solving realistic mobile manipulation tasks.


The process for building policies includes four basic steps:

  1. Collect embodied behavior data using teleoperation on both the real-robot hardware and in simulation.
  2. Process, annotate, and curate data so that we can easily incorporate it into a machine learning pipeline.
  3. Train a neural-network policy using all of the data across all tasks.
  4. Evaluate the policy using a test suite of tasks.

These stages form a continuous, iterative loop: the results of step 4 guide decisions about what additional data to collect and which network architecture or inference strategies improve performance.
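
As a rough sketch of how these stages compose into a loop (every function name here is a hypothetical placeholder, not our actual tooling):

```python
def data_flywheel_iteration(tasks, dataset, collect_teleop_data, curate,
                            train_policy, evaluate):
    """One pass through the four stages; callers supply the stage implementations."""
    raw_episodes = collect_teleop_data(tasks)     # 1. teleoperation on hardware + in sim
    dataset = dataset + curate(raw_episodes)      # 2. process, annotate, curate
    policy = train_policy(dataset)                # 3. train one policy on all tasks
    results = evaluate(policy, tasks)             # 4. score against the test suite
    # 'results' drives what data to collect next and which design changes to try.
    return policy, dataset, results
```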

Our policy maps inputs consisting of images, proprioception, and language prompts to actions that control the full Atlas robot at 30 Hz. We leverage a diffusion transformer together with a flow matching loss to train our model.
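
For readers unfamiliar with flow matching, the sketch below shows the general shape of such an objective for action generation (written with PyTorch); the `policy(obs_tokens, x_t, t)` call signature is a hypothetical placeholder, not our production interface.

```python
import torch

def flow_matching_loss(policy, obs_tokens, action_chunk):
    """Minimal conditional flow-matching objective (rectified-flow form).

    action_chunk: (batch, horizon, action_dim) demonstrated actions.
    The network learns to predict the constant velocity that carries a noise
    sample to the demonstrated chunk along a straight-line probability path.
    """
    noise = torch.randn_like(action_chunk)
    t = torch.rand(action_chunk.shape[0], 1, 1)              # one flow time per sample
    x_t = (1.0 - t) * noise + t * action_chunk               # point on the linear path
    target_velocity = action_chunk - noise                   # velocity of that path
    pred_velocity = policy(obs_tokens, x_t, t.reshape(-1))   # hypothetical signature
    return torch.mean((pred_velocity - target_velocity) ** 2)
```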

In implementing this process, we’ve followed three core principles:

  • Maximizing task coverage: Humanoids can, in principle, tackle a tremendous breadth of manipulation tasks, but collecting data beyond stationary manipulation tasks while preserving high-quality, responsive motion is challenging. We built a state-of-the-art teleoperation system that combines Atlas’ model predictive controller (MPC) with a custom VR-based interface to cover tasks requiring anything from finger-level dexterity to whole-body reaching and locomotion.
  • Training generalist policies: The field is steadily accumulating evidence that policies trained on a large corpus of diverse task data can generalize and recover better than specialist policies that are trained to solve one or a small number of tasks. We use multi-task language-conditioned policies to accomplish a diverse set of tasks on multiple embodiments, incorporating pretraining data from Atlas, the upper-body-only Atlas Manipulation Test Stand (MTS), and TRI Ramen data. Building general policies also enables us to simplify deployment, share policy improvements across tasks and embodiments, and move closer towards unlocking emergent behaviors.
  • Building infrastructure to support fast iteration and rigorous science: Being able to quickly iterate on design choices is critical, but actually measuring with confidence when one policy is better or worse than another is the key ingredient to making steady progress. By leveraging the combination of simulation, hardware tests, and ML infrastructure built for production scale, we have been able to efficiently explore the data and policy design space while continuously improving on-robot performance.

Long-Horizon, End-to-End Manipulation


The “Spot Workshop” task demonstrates coordinated locomotion—stepping, setting a wide stance, and squatting—and dexterous manipulation including part picking, regrasping, articulating, placing, and sliding. It consists of three subtasks:

  1. Grasping Spot legs from the cart, folding them, and placing them on a shelf.
  2. Grasping face plates from the cart, then pulling out a bin on the bottom shelf and putting the face plates in the bin.
  3. Once the cart is fully cleared, turning to the blue bin behind the robot, clearing it of the remaining Spot parts, and placing handfuls of them in the blue tilt truck.

In this uncut end-to-end video, we show a single language-conditioned policy performing the full sequence of tasks, where each of the three subtasks are triggered by passing a high-level language prompt to the policy.


A key goal was for our policies to react intelligently when things go wrong, such as a part falling on the ground or the bin lid closing. The initial versions of our policies didn’t have these capabilities. By demonstrating examples of the robot recovering from such disturbances and retraining our network, we were able to quickly deploy new reactive policies with no algorithmic or engineering changes. This is because the policies effectively estimate the state of the world from the robot’s sensors and react accordingly, purely through the experiences observed in training. As a result, programming new manipulation behaviors no longer requires an advanced degree and years of experience, which creates a compelling opportunity to scale up behavior development for Atlas.

Additional Manipulation Capabilities

We have studied dozens of tasks, which we used both for benchmarking and for pushing the boundaries of manipulation. Using a single language-conditioned policy on the Atlas MTS, we can perform tasks ranging from simple pick-and-place to more complex tasks such as rope tying, flipping a barstool, unfurling and spreading a tablecloth, and manipulating a 22 lb car tire. Rope, cloth, and tire manipulation are examples of tasks that would be extremely difficult to perform with traditional robot programming techniques due to deformable geometry and the complex manipulation sequences involved. But with LBMs, the training process is the same whether it’s stacking rigid blocks or folding a t-shirt: if you can demonstrate it, the robot can learn it.

Adapting Performance of Policies After Learning

One notable feature of our policies is that we can speed up execution at inference time without requiring any training-time changes. Specifically, since our policies predict a trajectory of future actions along with the times at which those actions should be taken, we can adjust this timing to control execution speed. In the video below, we compare the policy rolled out at 1x (i.e. the speed at which the task was demonstrated during data collection) as well as at 2x and 3x speed. In general, we found that we were able to speed up policies by 1.5x-2x without significantly affecting policy performance on both the MTS and full Atlas platforms. While the task dynamics can sometimes preclude this kind of inference-time speedup, it does suggest that, in some cases, we can exceed the speed limits of human teleoperation.
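
To make the idea concrete, here is a minimal sketch of such inference-time retiming, assuming the policy's output is represented as per-action timestamps plus action vectors (an illustrative interface, not our exact one):

```python
import numpy as np

def retime_action_chunk(times, actions, speed=2.0):
    """Compress the time axis of a predicted action chunk to play it back faster.

    times:   (horizon,) seconds at which each action should be applied.
    actions: (horizon, action_dim) predicted actions.
    speed:   playback multiplier; 1.0 reproduces data-collection speed.
    """
    return np.asarray(times) / speed, actions

def resample_to_control_rate(times, actions, rate_hz=30.0):
    """Interpolate the retimed chunk back onto the fixed 30 Hz control clock."""
    query = np.arange(times[0], times[-1], 1.0 / rate_hz)
    resampled = np.stack(
        [np.interp(query, times, actions[:, d]) for d in range(actions.shape[1])],
        axis=1,
    )
    return query, resampled
```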

Approach

Platform

Atlas contains 78 degrees of freedom (DoF) that provide a wide range of motion and a high degree of dexterity; the Atlas MTS has 29 DoF for exploring pure manipulation tasks. The grippers each have 7 DoF, which enables us to use a wide range of grasp strategies (power grasps, pinch grasps, etc.). We rely on a pair of HDR stereo cameras mounted in the head to provide both situational awareness for teleoperation and visual input for our policies.

Teleoperation: High-Quality Data Collection for Model Training

Controlling the robot in a fluid, dynamic, and dexterous manner is crucial, and we have invested heavily in our teleoperation system to address these needs. The system is built on Boston Dynamics’ MPC system, which has previously been deployed for use cases ranging from parkour and dance to both practical and impractical manipulation. This control system allows us to perform precise manipulation while maintaining balance and avoiding self-collisions, enabling us to push the boundaries of what we can do with the Atlas hardware.

The teleoperation setup leverages a VR headset so the operator can fully immerse themselves in the robot’s workspace and access the same information as the policy, with spatial awareness bolstered by a stereoscopic view rendered from Atlas’ head-mounted cameras and reprojected to the user’s viewpoint. Custom VR software provides the teleoperator with a rich interface to command the robot, delivering real-time feeds of robot state, control targets, sensor readings, tactile feedback, and system state via augmented reality, controller haptics, and heads-up display elements. This enables teleoperators to make full use of the robot’s hardware and capabilities, synchronizing their body and senses with those of the robot.


The initial version of the VR teleoperation application used the headset, base stations, controllers, and one tracker on the chest to control Atlas while standing still. This system employed a one-to-one mapping between the user and the robot (i.e. moving your hand 1 cm causes the robot’s hand to also move 1 cm), which yields an intuitive control experience, especially for bi-manual tasks. With this version, the operator was already able to perform a wide range of tasks, such as crouching down low to reach an object on the ground or standing tall to reach a high shelf. However, one limitation of this system was that it didn’t allow the operator to dynamically reposition the feet and take steps, which significantly limited the tasks we could perform.
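
As a simplified illustration of the positional part of that one-to-one mapping (orientation handling and frame registration are omitted, and the function is a hypothetical sketch rather than our actual implementation):

```python
import numpy as np

def one_to_one_hand_target(robot_hand_start, operator_hand_start, operator_hand_now):
    """Positional 1:1 mapping: 1 cm of operator motion commands 1 cm of robot motion.

    All inputs are 3D positions expressed in a common frame (e.g. the operator
    frame registered to the robot frame when teleoperation is engaged).
    """
    delta = np.asarray(operator_hand_now) - np.asarray(operator_hand_start)
    return np.asarray(robot_hand_start) + delta
```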

To support mobile manipulation, we incorporated two additional trackers for one-to-one tracking of the feet and extended the teleoperation control such that Atlas’ stance mode, support polygon, and stepping intent matched the operator’s. In addition to supporting locomotion, this setup allowed us to take full advantage of Atlas’ workspace. For instance, when opening the blue tote on the ground and picking items from inside, the operator must be able to configure the robot with a wide stance and bent knees to reach the objects in the bin without colliding with it.

Our neural network policies use the same control interface to the robot as the teleoperation system, which made it easy to reuse model architectures we had developed previously (for policies that didn’t involve locomotion), simply by augmenting the action representation.

Policy

Building upon Toyota Research Institute’s Large Behavior Models, which scale Diffusion Policy-like architectures, our policy uses a 450M-parameter Diffusion Transformer-based architecture together with a flow-matching objective. The policy is conditioned on proprioception and images, and also accepts a language prompt that specifies the objective to the robot. Image data comes in at 30 Hz, and our network uses a history of observations to predict an action chunk of length 48 (corresponding to 1.6 seconds), of which generally the first 24 actions (0.8 seconds when running at 1x speed) are executed each time policy inference is run.
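
Schematically, that execution loop looks roughly like the following; `policy` and `robot` are hypothetical interfaces standing in for the real inference and control stack:

```python
CONTROL_HZ = 30       # action streaming rate
CHUNK_LEN = 48        # predicted actions per inference call (1.6 s)
EXECUTE_LEN = 24      # actions executed before re-planning (0.8 s at 1x speed)

def run_policy(policy, robot, language_prompt):
    """Receding-horizon rollout: predict a chunk, execute part of it, repeat."""
    while not robot.task_done():
        obs = robot.get_observation()                  # images + proprioception history
        chunk = policy.predict(obs, language_prompt)   # (CHUNK_LEN, action_dim)
        for action in chunk[:EXECUTE_LEN]:
            robot.send_action(action)                  # streamed at CONTROL_HZ
```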

The policy’s observation space for Atlas consists of the images from the robot’s head-mounted cameras along with proprioception. The action space includes the joint positions for the left and right grippers, neck yaw, torso pose, left and right hand pose, and the left and right foot poses.
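
Written as (hypothetical) data structures, that interface looks roughly like this; the field groupings are illustrative rather than the exact tensor layout we use:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class AtlasObservation:
    head_camera_images: np.ndarray    # stereo images from the head-mounted cameras
    proprioception: np.ndarray        # robot state (joint positions, etc.)

@dataclass
class AtlasAction:
    left_gripper_joints: np.ndarray   # 7 DoF
    right_gripper_joints: np.ndarray  # 7 DoF
    neck_yaw: float
    torso_pose: np.ndarray            # e.g. position + orientation
    left_hand_pose: np.ndarray
    right_hand_pose: np.ndarray
    left_foot_pose: np.ndarray
    right_foot_pose: np.ndarray
```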

The Atlas MTS is identical to the upper body of Atlas, both mechanically and from a software perspective. Its observation and action spaces are the same as those of Atlas, simply with the torso and lower-body components omitted. This shared hardware and software aids in training multi-embodiment policies that function across both platforms, allowing us to pool data from both embodiments.
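
One common way to pool data across such embodiments is to embed the smaller action space into the full one and mask the missing components during training. The sketch below illustrates that idea with made-up dimensions; it is not necessarily the exact scheme used here.

```python
import numpy as np

# Hypothetical dimension split: the full Atlas action vector is assumed to be
# the upper-body (MTS) components followed by the torso / lower-body components.
UPPER_DIM = 30
LOWER_DIM = 18
FULL_DIM = UPPER_DIM + LOWER_DIM

def embed_mts_action(mts_action):
    """Lift an MTS (upper-body-only) action into the full Atlas action layout.

    Missing lower-body components are zero-filled, and a boolean mask records
    which dimensions are actually supervised for this embodiment.
    """
    full = np.zeros(FULL_DIM)
    full[:UPPER_DIM] = np.asarray(mts_action)
    mask = np.zeros(FULL_DIM, dtype=bool)
    mask[:UPPER_DIM] = True
    return full, mask
```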

These policies were trained on data that the team continuously collected and iterated upon; high-quality demonstrations were critical to producing successful policies. We relied heavily on our quality assurance tooling, which allowed us to review, filter, and provide feedback on the collected data.

Simulation

Simulation is a critical tool that allows us to quickly iterate on the teleoperation system, write unit and integration tests to ensure we can move forward without breakages, and perform informative training and evaluations that would otherwise be slower, more expensive, and harder to perform repeatably on hardware. Because our simulation stack is a faithful representation of the hardware and on-robot software stack, we are able to share our data pipeline, visualization tools, training code, VR software, and interfaces across both simulation and hardware platforms.

In addition to using simulation to benchmark our policy and architecture choices, we also incorporate our simulation as a significant co-training data source for our multi-task and multi-embodiment policies that we deploy on the hardware.

Conclusion and Next Steps

We have shown that we can train multi-task, language-conditioned policies that control Atlas to accomplish long-horizon tasks involving both locomotion and dexterous whole-body manipulation. Our data-driven approach is general and can be used for practically any downstream task that can be demonstrated via teleoperation.

While we are encouraged by our results so far, there is still much work to be done. With our established baseline of tasks and performance, we will focus on scaling our data flywheel to increase throughput, quality, task diversity, and difficulty while also exploring new algorithmic ideas. 

We are pursuing several research directions, including performance-related robotics topics (e.g. gripper force control with tactile feedback, fast dynamic manipulation), incorporating diverse data sources (cross-embodiment data, egocentric human data, etc.), RL-based improvement of VLAs, and deploying VLM/VLA architectures to enable more complex long-horizon tasks and open-ended reasoning.

If these topics excite you and you want to work with world class researchers, engineers, and robots, please reach out to us at Boston Dynamics and Toyota Research Institute.

Authorship

This article was written with the support of teams from Boston Dynamics and TRI. Contributors are listed below in alphabetical order; organization affiliation is indicated in parentheses (B for Boston Dynamics and T for TRI), project leads are indicated with one asterisk (*), and organization leaders with two (**).

LBM Team: Jun Ahn (B), Alex Alspach (T)**, Kevin Bergamin (B), Benjamin Burchfiel (T)**, Eric Cousineau (T)*, Aidan Curtis (B), Siyuan Feng (T)**, Kerri Fetzer-Borelli (T)**, Dion Gonano (B), Rachel Han (B), Scott Kuindersma (B)** – BD Lead, Lucas Manuelli (B)*, Pat Marion (B)*, Daniel Martin (B), Aykut Onol (T), Russ Tedrake (T)** – TRI Lead, Russell Wong (B), Mengchao Zhang (T), Mark Zolotas (T)

Data Operations: Keelan Boyle (B), Matthew Ferreira (T), Cole Glynn (B), Brendan Hathaway (T)**, Allison Henry (T)**, Phoebe Horgan (T)**, Connor Keane (B)**, Ben Strong (B), ThienTran Le (B), Dominick Leite (B), David Tago (T), Matthew Tran (T)

Blog Authors: Eric Cousineau, Scott Kuindersma, Lucas Manuelli, Pat Marion