Boston Dynamics and TRI Research Team
Useful humanoid robots will require a long list of competencies. They will need the ability to manipulate a diverse range of objects (e.g., hard/soft, heavy/delicate, rigid/articulated, large/small), as well as the ability to coordinate their entire bodies to reconfigure themselves and their environments, avoid obstacles, and maintain balance while responding to surprises. We believe that building AI generalist robots is the most viable path to creating these competencies and achieving automation at scale with humanoids.
We are excited to share some of our progress on developing Large Behavior Models (LBMs) for Atlas®. This work is part of a collaboration between AI research teams at Toyota Research Institute (TRI) and Boston Dynamics. We have been building end-to-end language-conditioned policies that enable Atlas to accomplish long-horizon manipulation tasks.
These policies are able to take full advantage of the capabilities of the humanoid form factor, including taking steps, precisely positioning the feet, crouching, shifting the center of mass, and avoiding self-collisions, all of which we have found to be critical to solving realistic mobile manipulation tasks.
The process for building policies includes four basic steps:
1. Collect demonstration data by teleoperating the robot (and its simulated counterpart) through the target tasks.
2. Process, annotate, and curate that data for training.
3. Train a language-conditioned policy on data pooled across tasks and embodiments.
4. Evaluate the policy on test scenarios in simulation and on hardware.
The results of step 4 guide decision making about what additional data to collect and what network architecture or inference strategies lead to improved performance.
In implementing this process, we’ve followed three core principles:
The “Spot Workshop” task demonstrates coordinated locomotion—stepping, setting a wide stance, and squatting—and dexterous manipulation including part picking, regrasping, articulating, placing, and sliding. It consists of three subtasks:
In this uncut end-to-end video, we show a single language-conditioned policy performing the full sequence of tasks, where each of the three subtasks is triggered by passing a high-level language prompt to the policy.
A key goal was for our policies to react intelligently when things go wrong, such as a part falling on the ground or the bin lid closing. The initial versions of our policies didn't have these capabilities. By demonstrating examples of the robot recovering from such disturbances and retraining our network, we were able to quickly deploy new reactive policies with no algorithmic or engineering changes. This is because the policies can effectively estimate the state of the world from the robot's sensors and react accordingly, purely from the experiences observed in training. As a result, programming new manipulation behaviors no longer requires an advanced degree and years of experience, which creates a compelling opportunity to scale up behavior development for Atlas.
We have studied dozens of tasks that we used both for benchmarking and for pushing the boundaries of manipulation. Using a single language-conditioned policy on the Atlas MTS (manipulation test stand), we can perform tasks ranging from simple pick-and-place to more complex tasks such as rope tying, flipping a barstool, unfurling and spreading a tablecloth, and manipulating a 22 lb car tire. Rope, cloth, and tire manipulation are examples of tasks that would be extremely difficult to achieve with traditional robot programming techniques due to their deformable geometry and the complex manipulation sequences involved. But with LBMs, the training process is the same whether it's stacking rigid blocks or folding a t-shirt: if you can demonstrate it, the robot can learn it.
One notable feature of our policies is that we can speed up execution at inference time without any training-time changes. Specifically, since our policies predict a trajectory of future actions along with the times at which those actions should be taken, we can adjust this timing to control execution speed. In the video shown below, we compare the policy rolled out at 1x (i.e., the speed at which the task was performed during data collection) as well as at 2x and 3x speed. In general, we found that we could speed up policies by 1.5x-2x without significantly affecting policy performance on both the MTS and full Atlas platforms. While the task dynamics can sometimes preclude this kind of inference-time speedup, it does suggest that, in some cases, we can exceed the speed limits of human teleoperation.
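To make the retiming concrete, here is a minimal sketch, not the code running on Atlas, of how a predicted chunk's timestamps can be compressed at inference time. The 30 Hz rate and 48-step chunk match the numbers discussed later in this post; the function name and the placeholder action dimension are purely illustrative:

```python
import numpy as np

def retime_chunk(actions: np.ndarray, times: np.ndarray, speedup: float):
    """Compress the time axis of a predicted action chunk.

    Because the policy predicts both actions and the times at which they
    should be applied, running faster only requires rescaling the
    timestamps; the actions themselves are unchanged, so no retraining
    is needed.
    """
    return actions, times / speedup

# A 48-action chunk at 30 Hz spans ~1.6 s at 1x speed.
actions = np.zeros((48, 7))      # placeholder actions (dimension is illustrative)
times = np.arange(48) / 30.0     # seconds from the start of the chunk
_, times_2x = retime_chunk(actions, times, speedup=2.0)
# times_2x now spans ~0.8 s: the same behavior executes twice as fast.
```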
Atlas has 50 degrees of freedom (DoF) that provide a wide range of motion and a high degree of dexterity; the Atlas MTS has 29 DoF for exploring pure manipulation tasks. The grippers each have 7 DoF, enabling a wide range of grasp strategies (power grasps, pinch grasps, etc.). A pair of HDR stereo cameras mounted in the head provides both situational awareness for teleoperation and visual input for our policies.
Controlling the robot in a fluid, dynamic, and dexterous manner is crucial, and we have invested heavily in our teleoperation system to address these needs. The system is built on Boston Dynamics' model-predictive control (MPC) system, which has previously been deployed for use cases ranging from parkour and dance to both practical and impractical manipulation. This control system allows us to perform precise manipulation while maintaining balance and avoiding self-collisions, enabling us to push the boundaries of what we can do with the Atlas hardware.
The teleoperation setup uses a VR headset so the operator can fully immerse themselves in the robot's workspace and access the same information as the policy, with spatial awareness bolstered by a stereoscopic view rendered from Atlas's head-mounted cameras and reprojected to the user's viewpoint. Custom VR software gives the teleoperator a rich interface for commanding the robot, providing real-time feeds of robot state, control targets, sensor readings, tactile feedback, and system state via augmented reality, controller haptics, and heads-up display elements. This lets teleoperators make full use of the robot's hardware and capabilities, synchronizing their body and senses with those of the robot.
The initial version of the VR teleoperation application used the headset, base stations, controllers, and one tracker on the chest to control Atlas while standing still. This system employed a one-to-one mapping between the user and the robot (i.e., moving your hand 1 cm causes the robot's hand to move 1 cm), which yields an intuitive control experience, especially for bimanual tasks. With this version, the operator could already perform a wide range of tasks, such as crouching down low to reach an object on the ground or standing tall to reach a high shelf. However, one limitation of this system was that it didn't allow the operator to dynamically reposition the feet and take steps, which significantly limited the tasks we could perform.
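As a rough illustration, here is a minimal sketch of the one-to-one mapping for position; the function name is hypothetical, and orientation, which the real system also tracks, is omitted for brevity:

```python
import numpy as np

def hand_target(operator_pos: np.ndarray,
                operator_ref: np.ndarray,
                robot_ref: np.ndarray,
                scale: float = 1.0) -> np.ndarray:
    """Map the operator's hand displacement to a robot hand target.

    With scale=1.0, a 1 cm operator motion commands a 1 cm robot motion,
    which is the intuitive one-to-one mapping described above. Orientation
    would be handled analogously with relative rotations.
    """
    return robot_ref + scale * (operator_pos - operator_ref)
```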
To support mobile manipulation, we incorporated two additional trackers for one-to-one tracking of the feet and extended the teleoperation control so that Atlas's stance mode, support polygon, and stepping intent matched the operator's. In addition to supporting locomotion, this setup allowed us to take full advantage of Atlas's workspace. For instance, when opening the blue tote on the ground and picking items from inside, the operator must be able to put the robot in a wide stance with bent knees to reach the objects without colliding with the tote.
Our neural network policies use the same control interface to the robot as the teleoperation system, which makes it easy to reuse model architectures we had developed previously (for policies that didn't involve locomotion) simply by augmenting the action representation.
Building on Toyota Research Institute's Large Behavior Models, which scale Diffusion Policy-like architectures, our policy uses a 450M-parameter Diffusion Transformer-based architecture together with a flow-matching objective. The policy is conditioned on proprioception and images, and also accepts a language prompt that specifies the objective to the robot. Image data comes in at 30 Hz, and our network uses a history of observations to predict an action chunk of length 48 (corresponding to 1.6 seconds), of which generally 24 actions (0.8 seconds at 1x speed) are executed each time policy inference runs.
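These numbers imply a simple receding-horizon loop. The sketch below illustrates that control flow; the `policy` and `robot` interfaces are hypothetical stand-ins rather than actual APIs:

```python
CONTROL_HZ = 30     # action rate implied by the 48-step / 1.6 s chunk
CHUNK_LEN = 48      # actions predicted per inference call (1.6 s)
EXECUTE_LEN = 24    # actions executed before re-predicting (0.8 s at 1x)

def run(policy, robot, prompt: str):
    """Receding-horizon rollout: predict a chunk, execute its first half,
    then re-predict from fresh observations so the policy stays reactive."""
    while not robot.task_done():
        obs = robot.observation_history()    # images + proprioception
        chunk = policy.predict(obs, prompt)  # shape: (CHUNK_LEN, action_dim)
        for action in chunk[:EXECUTE_LEN]:
            robot.apply(action)              # one control step at CONTROL_HZ
```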
The policy’s observation space for Atlas consists of the images from the robot’s head-mounted cameras along with proprioception. The action space includes the joint positions for the left and right grippers, neck yaw, torso pose, left and right hand pose, and the left and right foot poses.
Atlas MTS is identical to Atlas's upper body, both mechanically and in software. The observation and action spaces are the same as for Atlas, simply with the torso and lower-body components omitted. This shared hardware and software across Atlas and Atlas MTS aids in training multi-embodiment policies that function across both platforms, allowing us to pool data from both embodiments.
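One way to picture this shared action space is as a single record in which the lower-body fields are simply absent on the MTS; the field names and shapes below are illustrative rather than the actual interface:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class AtlasAction:
    """Sketch of the action space described above; names and shapes are ours."""
    left_gripper_joints: np.ndarray    # 7 joint positions
    right_gripper_joints: np.ndarray   # 7 joint positions
    neck_yaw: float
    left_hand_pose: np.ndarray         # e.g., position + orientation
    right_hand_pose: np.ndarray
    # Present on Atlas, omitted on Atlas MTS; sharing everything else is
    # what lets data from both embodiments flow through one policy.
    torso_pose: Optional[np.ndarray] = None
    left_foot_pose: Optional[np.ndarray] = None
    right_foot_pose: Optional[np.ndarray] = None
```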
These policies were trained on data that the team continuously collected and iterated on; high-quality demonstrations were a critical part of getting successful policies. We relied heavily on quality-assurance tooling that allowed us to review, filter, and provide feedback on the data collected.
Simulation is a critical tool that allows us to iterate quickly on the teleoperation system, write unit and integration tests so we can move forward without breakages, and run informative training and evaluations that would otherwise be slower, more expensive, and harder to perform repeatably on hardware. Because our simulation stack is a faithful representation of the hardware and on-robot software stack, we can share our data pipeline, visualization tools, training code, and VR software and interfaces across both simulation and hardware platforms.
In addition to using simulation to benchmark our policy and architecture choices, we also incorporate our simulation as a significant co-training data source for our multi-task and multi-embodiment policies that we deploy on the hardware.
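As a sketch of what such co-training can look like, the snippet below mixes examples from both sources at a fixed ratio; the 50/50 default and the interface are assumptions for illustration, not our actual training configuration:

```python
import random

def cotraining_stream(hardware_examples, sim_examples, sim_fraction=0.5):
    """Yield an endless stream of training examples drawn from hardware
    and simulation data, mixed at a fixed ratio (illustrative only)."""
    while True:
        source = sim_examples if random.random() < sim_fraction else hardware_examples
        yield random.choice(source)
```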
We have shown that we can train multi-task, language-conditioned policies that control Atlas through long-horizon tasks involving both locomotion and dexterous whole-body manipulation. Our data-driven approach is general and can be applied to practically any downstream task that can be demonstrated via teleoperation.
While we are encouraged by our results so far, there is still much work to be done. With our established baseline of tasks and performance, we will focus on scaling our data flywheel to increase throughput, quality, task diversity, and difficulty while also exploring new algorithmic ideas.
We are pursuing several research directions, including performance-related robotics topics (e.g., gripper force control with tactile feedback, fast dynamic manipulation), incorporating diverse data sources (cross-embodiment data, egocentric human data, etc.), reinforcement-learning (RL) improvement of vision-language-action models (VLAs), and deploying vision-language model (VLM)/VLA architectures to enable more complex long-horizon tasks and open-ended reasoning.
If these topics excite you and you want to work with world-class researchers, engineers, and robots, please reach out to us at Boston Dynamics and Toyota Research Institute.
Authorship
This article was written with the support of teams from Boston Dynamics and TRI. Contributors are listed below in alphabetical order; organization affiliation is indicated with a superscript marker (^B for Boston Dynamics, ^T for TRI), project leads are indicated with one asterisk (*), and organization leaders with two (**).
LBM Team: Jun Ahn^B, Alex Alspach^T**, Kevin Bergamin^B, Benjamin Burchfiel^T**, Eric Cousineau^T*, Aidan Curtis^B, Siyuan Feng^T**, Kerri Fetzer-Borelli^T**, Dion Gonano^B, Rachel Han^B, Scott Kuindersma^B** – BD Lead, Lucas Manuelli^B*, Pat Marion^B*, Daniel Martin^B, Aykut Onol^T, Russ Tedrake^T** – TRI Lead, Russell Wong^B, Mengchao Zhang^T, Mark Zolotas^T
Data Operations: Keelan Boyle^B, Matthew Ferreira^T, Cole Glynn^B, Brendan Hathaway^T**, Allison Henry^T**, Phoebe Horgan^T**, Connor Keane^B**, ThienTran Le^B, Dominick Leite^B, Ben Strong^B, David Tago^T, Matthew Tran^T
Blog Authors: Eric Cousineau, Scott Kuindersma, Lucas Manuelli, Pat Marion