The vast array of tasks that a humanoid robot could potentially do in a factory, warehouse, or even at home requires an understanding of the geometric and semantic properties of the world—that is, both the shape and the context of the objects it interacts with. To do those tasks with agility and adaptability, Atlas needs an equally agile and adaptable perception system.

Even a seemingly simple task—pick up a car part and put it in the correct slot—breaks down into multiple steps, each requiring extensive knowledge about the environment. First, Atlas detects and identifies the object. Many parts in a factory are either shiny and metallic or low-contrast and dark, making them hard for the robot’s camera to distinguish clearly. Then Atlas needs to reason about the location of the object to grasp it. Is it out in the open on a table? Or inside a container with limited line of sight? After picking the object, Atlas decides where to place it and how to get it there.

Finally, Atlas needs to place the object accurately—just a couple of centimeters off in any direction and the object might get stuck or simply fall. Atlas therefore also needs to be able to take corrective actions when things go wrong. If an insertion fails, for example, the robot can search for and pick up the dropped part from the ground—leveraging the generality of a foundation vision model conditioned on factory parts and Atlas’s large range of motion.

These challenges require new methods and directly impact the design of Atlas’s perception system, built from strong components: well-calibrated sensing and kinematics, state-of-the-art machine learning perception models, and robust state estimation. In this blog, we explore the main components of the Atlas vision system that empower autonomous sequencing behavior.

2D awareness – What objects are in the environment?

Perception starts with determining what is around the robot—are there obstacles? Relevant objects? Hazards on the floor? Our 2D object detection system provides this information in the form of object identities, bounding boxes, and points of interest (or keypoints).

In this particular application, we detect fixtures—the large shelving units storing various automotive parts. These fixtures come in a variety of shapes and sizes, and Atlas needs to know both their type and the volume they occupy to avoid colliding with them. Along with detecting and identifying all fixtures, Atlas perceives their corners as keypoints, which makes it possible to align the perceived world to its internal model of what fixtures look like.

Fixture keypoints are 2D pixel points that come in two flavors: outer (green) and inner (red). Outer keypoints capture the envelope of the fixture. In this case, they are the four corners that roughly outline the fixture’s front face. The inner keypoints are more plentiful and varied. They capture the internal distribution of shelves and cubbies within a particular fixture. These provide the ability to localize individual slots with precision.
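
To make the output of this stage concrete, the sketch below shows one way the detection results described above could be represented in code. This is an illustrative data structure, not Atlas’s actual interface; the class and field names are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class KeypointKind(Enum):
    OUTER = "outer"  # envelope corners of the fixture's front face
    INNER = "inner"  # corners of the internal shelves and slot dividers


@dataclass
class Keypoint:
    u: float                  # pixel column
    v: float                  # pixel row
    kind: KeypointKind
    confidence: float = 1.0


@dataclass
class FixtureDetection:
    fixture_class: str                        # detected fixture type
    bbox: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max) in pixels
    keypoints: List[Keypoint] = field(default_factory=list)

    def outer(self) -> List[Keypoint]:
        return [k for k in self.keypoints if k.kind is KeypointKind.OUTER]

    def inner(self) -> List[Keypoint]:
        return [k for k in self.keypoints if k.kind is KeypointKind.INNER]
```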

To perform fixture classification and keypoint prediction, Atlas uses a lightweight network architecture that strikes a tradeoff between accuracy and real-time performance, which is essential for Atlas’s agility.
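
As a rough illustration of what such a tradeoff can look like, here is a deliberately small PyTorch model with a shared convolutional backbone, a fixture-classification head, and a keypoint-heatmap head. It is a hypothetical stand-in, not the network Atlas runs; the layer sizes and keypoint count are arbitrary.

```python
import torch
import torch.nn as nn


class LightweightFixtureNet(nn.Module):
    """Small backbone with a fixture-class head and a keypoint-heatmap head."""

    def __init__(self, num_classes: int = 8, num_keypoints: int = 16):
        super().__init__()
        # A shallow strided backbone keeps the parameter count low so
        # inference fits a real-time budget on the robot.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.class_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, num_classes)
        )
        # One heatmap channel per keypoint; the peak of each channel gives a pixel.
        self.keypoint_head = nn.Conv2d(128, num_keypoints, kernel_size=1)

    def forward(self, image: torch.Tensor):
        features = self.backbone(image)
        return self.class_head(features), self.keypoint_head(features)


# Example: one 320x240 RGB frame -> class logits and keypoint heatmaps.
logits, heatmaps = LightweightFixtureNet()(torch.randn(1, 3, 240, 320))
```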

3D awareness – Where are the objects relative to Atlas?

To manipulate objects inside a fixture, Atlas first estimates its own position relative to that particular fixture. Atlas does this with a keypoint-based fixture localization module that estimates the robot’s position and orientation relative to every fixture in its vicinity.

The fixture localization system ingests both the inner and outer keypoints from the object detection pipeline and aligns them with a prior model of their expected spatial distribution by minimizing their reprojection error. The system also ingests kinematic odometry—a measure of how much and in what direction Atlas is moving—to fuse fixture pose estimates in a consistent frame and achieve higher robustness to keypoint prediction noise.
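
The core of this alignment step is a classic perspective-n-point problem: find the pose that minimizes the reprojection error between the fixture’s known 3D corner layout and its detected 2D keypoints. The snippet below illustrates that idea with OpenCV’s solvePnP on synthetic, fronto-parallel data; the corner layout, intrinsics, and pixel coordinates are made-up values, not Atlas’s.

```python
import numpy as np
import cv2

# Hypothetical 3D corner layout of a fixture's front face, in the fixture frame (meters).
model_corners_3d = np.array([
    [0.0, 0.0, 0.0],
    [1.2, 0.0, 0.0],
    [1.2, 1.8, 0.0],
    [0.0, 1.8, 0.0],
], dtype=np.float64)

# Matching detected 2D keypoints (pixels), in the same order as the model corners.
detected_keypoints_2d = np.array([
    [520.0, 540.0],
    [760.0, 540.0],
    [760.0, 180.0],
    [520.0, 180.0],
], dtype=np.float64)

# Pinhole intrinsics (fx, fy, cx, cy) -- placeholder values.
K = np.array([[600.0, 0.0, 640.0],
              [0.0, 600.0, 360.0],
              [0.0, 0.0, 1.0]])

# Recover the fixture pose in the camera frame by minimizing reprojection error.
ok, rvec, tvec = cv2.solvePnP(model_corners_3d, detected_keypoints_2d, K, None)
if ok:
    # Residual check: how well the recovered pose explains the observed keypoints.
    reprojected, _ = cv2.projectPoints(model_corners_3d, rvec, tvec, K, None)
    errors = np.linalg.norm(reprojected.reshape(-1, 2) - detected_keypoints_2d, axis=1)
    print("camera-frame translation:", tvec.ravel(),
          "mean reprojection error (px):", errors.mean())
```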

A key challenge in achieving reliable fixture pose estimates is dealing with frequent occlusions and out-of-view keypoints. For example, when Atlas is close to a fixture, some of the outer keypoints might not be in view. Angled views are also challenging, because the more distant keypoints are usually unreliable. The localization system addresses this by perceiving a much larger number of keypoints on the inside of the fixture, the corners between the slot dividers, which are directly relevant to how objects are inserted or extracted. This creates an association challenge between 2D keypoints and 3D corners: which corner corresponds to each keypoint in the image? Atlas makes an initial pose approximation from the outer keypoints, which provides a first guess of the inner keypoint associations. The combination of inner and outer keypoints then produces a more reliable estimate of the pose of the fixture and all of its slots.
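
One simple way to resolve that association, once a coarse pose from the outer keypoints is available, is to project the model’s inner corners into the image and match each detected inner keypoint to its nearest projection. The function below sketches that greedy matching under the assumption that the coarse pose is roughly correct; a production system would likely handle double assignments and outliers more carefully.

```python
import numpy as np
import cv2


def associate_inner_keypoints(inner_corners_3d, inner_keypoints_2d, rvec, tvec, K,
                              max_pixel_dist=25.0):
    """Match each detected inner keypoint to the nearest projected model corner.

    rvec/tvec is the coarse fixture pose recovered from the outer keypoints; the
    returned (keypoint_index, corner_index) pairs can feed a refined pose solve
    that uses every keypoint.
    """
    projected, _ = cv2.projectPoints(inner_corners_3d, rvec, tvec, K, None)
    projected = projected.reshape(-1, 2)

    matches = []
    for i, keypoint in enumerate(inner_keypoints_2d):
        distances = np.linalg.norm(projected - keypoint, axis=1)
        j = int(np.argmin(distances))
        if distances[j] < max_pixel_dist:   # drop keypoints with no nearby corner
            matches.append((i, j))
    return matches
```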

Some fixtures are visually identical to each other; we call these a fixture class. This is very common in factories and poses an additional challenge in realistic settings. Atlas addresses it through a combination of temporal coherence and an initial prior on the relative position between different fixtures—e.g., expect fixture A half a meter to the right of fixture B.
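
A minimal version of this disambiguation logic might look like the sketch below: each candidate instance has an expected position from the layout prior, and the assignment from the previous frame gets a small bonus to encourage temporal coherence. The layout, distances, and "stickiness" bonus are illustrative assumptions.

```python
import numpy as np

# Hypothetical layout prior: expected fixture positions in a shared map frame (meters).
expected_positions = {
    "fixture_A": np.array([0.0, 0.0, 0.0]),
    "fixture_B": np.array([0.5, 0.0, 0.0]),   # half a meter to the right of A
}


def disambiguate(detected_position, last_assignment=None, stickiness=0.2):
    """Pick the fixture instance whose expected position best explains a detection.

    Temporal coherence is approximated by a small bonus ("stickiness") for the
    instance this detection was assigned to in the previous frame.
    """
    best_name, best_cost = None, float("inf")
    for name, expected in expected_positions.items():
        cost = float(np.linalg.norm(detected_position - expected))
        if name == last_assignment:
            cost -= stickiness
        if cost < best_cost:
            best_name, best_cost = name, cost
    return best_name


# An ambiguous detection roughly midway between the two stays with its previous assignment.
print(disambiguate(np.array([0.24, 0.02, 0.0]), last_assignment="fixture_B"))
```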

All these features combine into a reliable and agile fixture perception system. In the video, when someone moves the fixture behind Atlas, the robot quickly recognizes the disparity between expectation and reality, snaps the estimated location of the fixture back into place, and replans its behavior accordingly.

An engineer moves a target container to test Atlas’s ability to adapt to changes in the environment

Object pose estimation – How should Atlas interact with an object?

Atlas’s robust object manipulation skills rely on accurate, real-time, object-centric perception. Atlas’s object pose tracking system, SuperTracker, fuses different streams of information: robot kinematics, vision, and, when necessary, forces. Kinematic information from Atlas’s joint encoders allows us to determine where Atlas’s grippers are in space. When Atlas recognizes it has grasped an object, this information provides a strong prior about where the object should be as Atlas moves its body. By fusing kinematic data, Atlas can handle situations where objects are visually occluded or out of view of its cameras, and can also tell when an object slips from its grasp.
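
The kinematic prior itself is just rigid-transform composition: once an object is grasped, its predicted pose is the gripper pose (from forward kinematics) composed with the grasp transform, and a large disagreement with the visual estimate suggests a slip. The helper functions below sketch that idea with 4x4 homogeneous matrices; the 3 cm slip tolerance is an arbitrary placeholder.

```python
import numpy as np


def predicted_object_pose(T_world_gripper: np.ndarray,
                          T_gripper_object: np.ndarray) -> np.ndarray:
    """Kinematic prior: a grasped object moves rigidly with the hand."""
    return T_world_gripper @ T_gripper_object


def grasp_slipped(T_world_object_visual: np.ndarray,
                  T_world_object_kinematic: np.ndarray,
                  tol_m: float = 0.03) -> bool:
    """Flag a likely slip when vision and kinematics disagree by more than tol_m."""
    delta = np.linalg.inv(T_world_object_kinematic) @ T_world_object_visual
    return bool(np.linalg.norm(delta[:3, 3]) > tol_m)
```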

When the object is in view of the cameras, Atlas uses an object pose estimation model that uses a render-and-compare approach to estimate pose from monocular images. The model is trained with large-scale synthetic data, and generalizes zero-shot to novel objects given a CAD model. When initialized with a 3D pose prior, the model iteratively refines it to minimize the discrepancy between the rendered CAD model and the captured camera image. Alternatively, the pose estimator can be initialized from a 2D region-of-interest prior (such as an object mask). Atlas then generates a batch of pose hypotheses that are fed to a scoring model, and the best fit hypothesis is subsequently refined. Atlas’s pose estimator works reliably on hundreds of factory assets which we have previously modeled and textured in-house.
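
Structurally, a render-and-compare estimator of this kind boils down to a hypothesize-score-refine loop. The sketch below captures that control flow only; the render, score, and refine callables stand in for the heavy lifting (rasterizing the CAD model, comparing it to the image, and nudging the pose) and are assumptions rather than real components.

```python
from typing import Callable, Sequence
import numpy as np


def render_and_compare(initial_poses: Sequence[np.ndarray],
                       render: Callable[[np.ndarray], np.ndarray],
                       score: Callable[[np.ndarray, np.ndarray], float],
                       refine: Callable[[np.ndarray, np.ndarray], np.ndarray],
                       image: np.ndarray,
                       iterations: int = 3) -> np.ndarray:
    """Score a batch of pose hypotheses, keep the best fit, then iteratively refine it.

    `render` rasterizes the CAD model at a candidate pose, `score` compares that
    rendering to the captured camera image, and `refine` nudges a pose to reduce
    the remaining discrepancy.
    """
    # 1. Score every hypothesis against the captured image.
    scored = [(score(render(pose), image), pose) for pose in initial_poses]
    _, best_pose = max(scored, key=lambda pair: pair[0])

    # 2. Iteratively refine the best-fit hypothesis.
    for _ in range(iterations):
        best_pose = refine(best_pose, image)
    return best_pose
```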

SuperTracker receives the visual pose estimates as a 3D prior. In the type of manipulation scenarios that Atlas confronts, visual pose estimates can be ambiguous due to occlusion, partial visibility, and lighting variations. We use a series of filters to validate these pose estimates (both checks are sketched in code after the list):

  • self-consistency: rather than a single pose prior, we use a batch of perturbed initializations and validate the outputs using a maximum-clique-based consensus algorithm to ensure they converge to the same predicted pose;
  • kinematic consistency: as a proxy for enforcing contact, we reject any predicted pose that leads to an unusually large finger-to-object distance.
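
Here is a simplified sketch of both filters. The self-consistency check builds a graph whose nodes are the perturbed pose estimates and whose edges connect pairs that agree within a tolerance, then accepts the result only if a large clique converged together; the kinematic check uses distance to the object center as a crude stand-in for finger-to-surface distance. The tolerances and the networkx dependency are illustrative choices, not Atlas’s.

```python
import numpy as np
import networkx as nx


def poses_agree(T_a, T_b, trans_tol_m=0.01, rot_tol_deg=5.0):
    """Two 4x4 poses agree if both their translation and rotation differences are small."""
    delta = np.linalg.inv(T_a) @ T_b
    trans_err = np.linalg.norm(delta[:3, 3])
    cos_angle = np.clip((np.trace(delta[:3, :3]) - 1.0) / 2.0, -1.0, 1.0)
    rot_err_deg = np.degrees(np.arccos(cos_angle))
    return trans_err < trans_tol_m and rot_err_deg < rot_tol_deg


def self_consistent(pose_batch, min_fraction=0.7):
    """Accept the estimate only if a large clique of perturbed runs converged together."""
    graph = nx.Graph()
    graph.add_nodes_from(range(len(pose_batch)))
    for i in range(len(pose_batch)):
        for j in range(i + 1, len(pose_batch)):
            if poses_agree(pose_batch[i], pose_batch[j]):
                graph.add_edge(i, j)
    largest_clique = max(nx.find_cliques(graph), key=len)
    return len(largest_clique) >= min_fraction * len(pose_batch)


def kinematically_consistent(finger_tip_positions, T_world_object, max_gap_m=0.05):
    """Crude contact proxy: at least one fingertip should be near the object center."""
    center = T_world_object[:3, 3]
    gaps = [np.linalg.norm(tip - center) for tip in finger_tip_positions]
    return min(gaps) < max_gap_m
```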

The kinematics and camera inputs are processed asynchronously using a fixed-lag smoother. The smoother takes a history of high-rate kinematics inputs from Atlas’s joint encoders, along with the lower-rate visual pose estimates from the machine learning model, and determines the best-fit 6-DoF object trajectory.
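
To show the flavor of a fixed-lag smoother without the full 6-DoF machinery, the sketch below estimates only an object’s translation over a sliding window: high-rate kinematic increments act as motion constraints between consecutive states, and the sparser visual estimates act as absolute measurements. The window size, weights, and scipy-based solver are illustrative simplifications, not the actual smoother.

```python
import numpy as np
from scipy.optimize import least_squares


def fixed_lag_smooth(kinematic_deltas, visual_measurements, window=20,
                     w_kin=1.0, w_vis=5.0):
    """Sliding-window least squares over an object's 3D translation.

    kinematic_deltas:    list of (k, delta_xyz) high-rate motion increments
                         between state k and state k + 1
    visual_measurements: list of (k, xyz) lower-rate visual position estimates
    Only the most recent `window` states are re-estimated (the fixed lag).
    """
    n_states = max(k for k, _ in kinematic_deltas) + 2
    start = max(0, n_states - window)
    n_window = n_states - start

    def residuals(flat):
        x = flat.reshape(-1, 3)
        res = []
        for k, delta in kinematic_deltas:            # motion constraints
            if start <= k and k + 1 < start + n_window:
                res.append(w_kin * (x[k + 1 - start] - x[k - start] - delta))
        for k, z in visual_measurements:             # absolute measurements
            if start <= k < start + n_window:
                res.append(w_vis * (x[k - start] - z))
        return np.concatenate(res)

    solution = least_squares(residuals, np.zeros(n_window * 3))
    return solution.x.reshape(-1, 3)


# Example: 30 small kinematic steps along x, anchored by one visual estimate.
deltas = [(k, np.array([0.01, 0.0, 0.0])) for k in range(30)]
measurements = [(25, np.array([0.26, 0.0, 0.0]))]
trajectory = fixed_lag_smooth(deltas, measurements)
print(trajectory[-1])   # latest smoothed position, roughly (0.31, 0, 0)
```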

Calibration – Is Atlas actually where it thinks it is?

When performing precise manipulation tasks like sequencing, we should not underestimate the importance of well-calibrated hand-eye coordination—a precise and reliable mapping between what Atlas sees and how Atlas acts.

The above graphic shows Atlas’s internal model of its body overlaid on the live camera feed. The alignment is nearly perfect, with the arms, legs, and torso lining up exactly where the robot believes they are. Behind this is a set of carefully designed camera and kinematic calibration procedures that compensate for imprecision in the manufacturing and assembly of the robot’s body, as well as for physical changes that happen over time due to external factors such as temperature swings or repeated physical impacts.
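
For readers who want to see the underlying idea, OpenCV ships a standard hand-eye calibration routine that recovers the camera-to-gripper transform from paired gripper poses (kinematics) and target observations (vision). The example below exercises it on synthetic, noise-free data so the ground-truth transform can be recovered; it is a textbook illustration, not Atlas’s calibration procedure.

```python
import numpy as np
import cv2


def to_T(R, t):
    """Pack a rotation matrix and translation into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, np.asarray(t).ravel()
    return T


rng = np.random.default_rng(0)
rand_R = lambda: cv2.Rodrigues(rng.uniform(-0.5, 0.5, 3).reshape(3, 1))[0]

# Ground-truth camera-to-gripper transform that the calibration should recover.
T_cam2gripper_true = to_T(rand_R(), [0.05, 0.02, 0.10])
# A fixed calibration target expressed in the robot base frame.
T_target2base = to_T(rand_R(), [1.0, 0.2, 0.5])

R_g2b, t_g2b, R_t2c, t_t2c = [], [], [], []
for _ in range(10):
    # Gripper pose from kinematics (synthetic here).
    T_g2b_i = to_T(rand_R(), rng.uniform(-0.3, 0.3, 3))
    # What the camera would observe: the target's pose in the camera frame.
    T_t2c_i = (np.linalg.inv(T_cam2gripper_true)
               @ np.linalg.inv(T_g2b_i) @ T_target2base)
    R_g2b.append(T_g2b_i[:3, :3]); t_g2b.append(T_g2b_i[:3, 3].reshape(3, 1))
    R_t2c.append(T_t2c_i[:3, :3]); t_t2c.append(T_t2c_i[:3, 3].reshape(3, 1))

# Recover the hand-eye transform from the paired gripper / target motions.
R_cam2gripper, t_cam2gripper = cv2.calibrateHandEye(R_g2b, t_g2b, R_t2c, t_t2c)
print("recovered translation:", t_cam2gripper.ravel())   # ~ (0.05, 0.02, 0.10)
```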

In our experience, accurate hand-eye calibration is a key enabler of high-performance manipulation and perception-driven autonomy.

What’s next?

The work doesn’t stop here. Agility and adaptability continue to be the goals, which increasingly require accounting for some fundamental truths about the geometry, semantics, and physics of how the world moves. Our team is focused on moving toward a unified foundation model for Atlas. The future points beyond perception on its own: a shift in which perception and action are no longer separate processes, and a shift from spatial AI to Athletic Intelligence.