This blog was written by Michael McDonald and Jeffrey Yu, robotics engineers on the Spot team.
On a typical day, our customers are operating the Spot® robot in factories, foundries, substations, sub-basements, and beyond. Spot encounters all kinds of obstacles and environmental changes, but it still needs to safely complete its mission without getting stuck, falling, or breaking anything. Over the past few years, we’ve added capabilities to help Spot more effectively cross through busy areas, avoid other moving objects, keep its balance on slippery floors, and more.
However, while there are challenges and obstacles that we can anticipate and plan for, like stairs or forklifts, there are many more that are difficult to predict. To help tackle these edge cases, we used AI foundation models to give Spot a better semantic understanding of the world, so it can plan its path based not only on the geometry of its environment but also on additional context.
Spot’s perception starts with the five stereo cameras built into its body. We use the depth data from those cameras to generate a 3D map of the surroundings, using the geometry of the space to detect walls, empty spaces, and other objects. From there, we refine the map to determine where Spot is able to step, avoiding steep slopes or high steps, for example.
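As a simplified illustration of the geometric side of this pipeline, the sketch below marks which cells of a 2.5D height map look safe to step on by thresholding local slope and step height. The grid resolution and thresholds are illustrative assumptions, not Spot’s actual limits.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def steppability_mask(height_map: np.ndarray,
                      cell_size_m: float = 0.05,
                      max_slope_deg: float = 30.0,
                      max_step_m: float = 0.15) -> np.ndarray:
    """Mark cells of a 2.5D elevation grid (meters) that look safe to step on.

    The thresholds and grid resolution are illustrative, not Spot's real limits.
    """
    # Local slope from finite differences of the elevation grid.
    dz_dy, dz_dx = np.gradient(height_map, cell_size_m)
    slope_deg = np.degrees(np.arctan(np.hypot(dz_dx, dz_dy)))

    # Step height: elevation change relative to the local 3x3 neighborhood minimum.
    step_m = height_map - minimum_filter(height_map, size=3)

    return (slope_deg <= max_slope_deg) & (step_m <= max_step_m)

# Example: a flat floor with a 50 cm box; cells on and around the box are rejected.
grid = np.zeros((40, 40))
grid[10:15, 10:15] = 0.5
mask = steppability_mask(grid)
```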
This works well for basic autonomous navigation and path planning, but seeing the world primarily through geometry has a limitation: certain obstacles and hazards don’t show up well in 3D data.
Four types of challenges commonly arise when relying on geometry alone.
Using foundation models, we were able to train Spot to recognize these types of hazards and apply a contextual understanding to its behavior—modifying its path planning to account for more nuanced types of obstacles.
In recent years, AI technologies such as deep learning, neural networks, and foundation models have advanced rapidly. These technologies offer new ways to tackle old challenges in robotics, and various teams at Boston Dynamics have been testing ways to push the boundaries of robot intelligence with machine learning. In particular, we saw the potential to use visual foundation models to give Spot a more semantic understanding of its environment, adding context to geometry and enabling safer, more predictable performance.
At a broad level, a foundation model is exactly what it sounds like: a foundation on which to build other applications. For an AI model to gain a general understanding of the world, it needs enormous amounts of data, and gathering data at that scale for an individual application can be prohibitive.
A foundation model speeds up the process by letting you start with something that has already learned from a broad dataset. Typically these models are trained to make associations: this text relates to that text, this image relates to that text, this image relates to that image. You can then train another model on top of the foundation model, or fine-tune it for a downstream task, using far less data.
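As a rough illustration of that pattern, the sketch below freezes a pretrained backbone and trains only a small task-specific head. A torchvision ResNet stands in for the foundation model, and the three-class hazard head, optimizer settings, and training step are illustrative assumptions rather than the setup described in this post.

```python
import torch
import torch.nn as nn
from torchvision import models

# ResNet-50 stands in for a visual foundation model; the 3-class hazard head,
# optimizer settings, and training step are illustrative placeholders.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
for p in backbone.parameters():
    p.requires_grad = False                      # keep the pretrained weights fixed

backbone.fc = nn.Linear(backbone.fc.in_features, 3)   # small task-specific head

optimizer = torch.optim.Adam(backbone.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One gradient step that updates only the head, so far less labeled data
    is needed than training the whole network from scratch."""
    logits = backbone(images)
    loss = loss_fn(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```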
In our case, we tested a few visual foundation models and settled on an open-set object detection model; this means we could give it any text or image prompt and ask it to find all matching instances in the input, in this case the feeds from Spot’s cameras. This allowed us to identify the specific kinds of hazards we wanted Spot to recognize and avoid more intelligently. Additionally, the broader knowledge learned by the model makes it possible for us to quickly adapt that recognition to new environments and bespoke hazards, simply by providing a few images or short text descriptions.
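As one concrete illustration (not necessarily the model we settled on), the sketch below prompts OWL-ViT, a publicly available open-vocabulary detector, with a few text descriptions of hazards in a single camera frame. The prompts, score threshold, and file name are placeholders.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

# OWL-ViT is used here only as a publicly available stand-in for an open-set
# detector; the prompts, threshold, and image path are illustrative.
processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("camera_frame.png").convert("RGB")   # e.g. one body-camera frame
prompts = [["a ladder", "a cable on the floor", "a puddle", "a rolling cart"]]

inputs = processor(text=prompts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits and boxes into per-image detections above a score threshold.
target_sizes = torch.tensor([image.size[::-1]])          # (height, width)
detections = processor.post_process_object_detection(
    outputs=outputs, target_sizes=target_sizes, threshold=0.2)[0]

for score, label, box in zip(detections["scores"], detections["labels"],
                             detections["boxes"]):
    print(f"{prompts[0][int(label)]}: {score.item():.2f} at {box.tolist()}")
```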
Once we had a model, we needed to test and refine it. The main challenges in testing were speed and flexibility. The robot has a job to do and we don’t want to add so many restrictions that Spot can’t complete its missions. This meant it was important to run these models quickly on the robot, avoid false positives, and ensure Spot can avoid hazards without getting stuck.
ML models need a lot of resources and time to process the data you give them. If you want them to be valuable on a robot, they have to work in real time. So much of our testing focused on building a pipeline that can recognize hazards and feed those detections back into navigation in real time, and on finding and fine-tuning a model that was robust enough to reliably detect hazards but lightweight enough to run efficiently.
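One common pattern for this kind of pipeline, sketched below with hypothetical detector and planner hooks rather than Spot’s actual interfaces, is to run the heavy model in its own thread at whatever rate it can sustain, while the fast navigation loop always consumes the most recent cached detections.

```python
import threading
import time

class LatestDetectionCache:
    """Share the most recent (possibly slightly stale) detections between a slow
    perception thread and a fast navigation loop."""

    def __init__(self):
        self._lock = threading.Lock()
        self._detections = []

    def publish(self, detections):
        with self._lock:
            self._detections = detections

    def latest(self):
        with self._lock:
            return list(self._detections)

cache = LatestDetectionCache()

def perception_loop(detect_hazards, get_camera_frame):
    # Runs as fast as the model allows (a few Hz, say) without blocking navigation.
    while True:
        cache.publish(detect_hazards(get_camera_frame()))

def navigation_loop(plan_step, rate_hz=50.0):
    # Runs at a fixed, higher rate and always uses whatever detections are newest.
    while True:
        plan_step(cache.latest())
        time.sleep(1.0 / rate_hz)

# Example wiring (detector_fn, camera_fn, planner_fn are hypothetical callables):
# threading.Thread(target=perception_loop, args=(detector_fn, camera_fn), daemon=True).start()
# navigation_loop(planner_fn)
```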
We also tested different ways to incorporate detected hazards back into Spot’s navigation. If you are too restrictive, you block Spot’s path and the robot gets stuck; if you are too lenient, Spot may still interact with the hazard in an undesirable way.
We wanted to deliver a more fine-grained understanding of how Spot should change its behavior, given what it sees in the space. In the example of a ladder or a glass door, Spot should steer clear of the obstacle altogether. But often there is a more nuanced option: Spot can step over a wire if needed, as long as it doesn’t step on the wire. For a puddle, Spot may want to go around, but can go through if no other path is clear.
In addition to mapping the obstacles, we also want to map the outputs of these models to navigational affordances for the robot, essentially giving it an understanding of what it can do, given what it sees.
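As a sketch of what that mapping might look like in code, the snippet below pairs each hazard class with an affordance and an added planning cost; the classes, categories, and cost values are illustrative, not the actual product configuration.

```python
from enum import Enum, auto
from typing import Optional

class Affordance(Enum):
    AVOID_ENTIRELY = auto()   # e.g. ladders, glass doors: keep a wide berth
    STEP_OVER_ONLY = auto()   # e.g. wires: may cross, but never step on
    PREFER_AVOID = auto()     # e.g. puddles: go around if another path exists

# Illustrative mapping from detected hazard class to an affordance plus an added
# path-planning cost (arbitrary units, not real tuning values).
HAZARD_AFFORDANCES = {
    "ladder": (Affordance.AVOID_ENTIRELY, float("inf")),
    "glass door": (Affordance.AVOID_ENTIRELY, float("inf")),
    "wire": (Affordance.STEP_OVER_ONLY, 50.0),
    "puddle": (Affordance.PREFER_AVOID, 10.0),
}

def cell_cost(hazard_label: Optional[str]) -> float:
    """Extra cost for planning a footstep in a map cell touched by this hazard."""
    if hazard_label is None:
        return 0.0
    _, cost = HAZARD_AFFORDANCES.get(hazard_label, (Affordance.PREFER_AVOID, 5.0))
    return cost
```

In a scheme like this, a planner can treat an infinite cost as a hard keep-out region and a finite cost as a soft penalty it only pays when no clearer path exists.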
It’s not enough to create a proof of concept in a lab. We need Spot to work in the real world. After rigorous testing, this workflow is now available to all customers with a Spot Core I/O. Using multiple models and tools to detect objects, refine segmentation masks around the identified objects, and plan behaviors based on that information, we have made it easier for Spot to navigate more safely and efficiently in messy, busy, realistic workplaces.
As of our 4.1 release, in addition to moving objects, Spot can now detect and avoid common hazards in industrial environments, including carts, wires, and ladders. This latest release immediately delivers these improvements in how Spot “sees” the world by integrating visual semantic context into its navigation system. Anyone with the extension installed on their robot will be able to see these changes in action.
Of course, this work doesn’t end with semantic hazard avoidance being deployed in the real world. Instead, that is the first step toward training even more capable models. We are able to use performance data from customers to learn what is working well and what needs to be fine-tuned. Real-world operational data also enables us to train even more reliable and accurate models to detect other types of hazards or important objects in the environments where Spot works.
This kind of semantic understanding not only makes Spot more reliable, it also helps Spot act more like a person would in a given situation. Using foundation models to train contextual behaviors makes our robots more predictable to people, more intuitive, and easier both to use and to work around.