This article was written by Matt Klingensmith, Principal Software Engineer, with the support of the project team: Michael McDonald, Radhika Agrawal, Chris Allum, and Rosalind Shinkle.

Over the past year or two, advances in artificial intelligence, specifically in a field known as “Generative AI,” have been rapid. Demos of chatbots that write like real people, image generation algorithms, and lifelike speech generators have all become commonplace and accessible to the average person.

This expansion has been fueled in part by the rise of large Foundation Models (FMs): large AI systems trained on massive datasets. These models usually have millions or billions of parameters and were trained on raw data scraped from the public internet. They tend to exhibit what is called Emergent Behavior, the ability to perform tasks outside of what they were directly trained on. Because of this, they can be adapted for a variety of applications, acting as a foundation for other algorithms.

Like many both inside and outside of the tech industry, we were impressed and excited by this rapid progress. We wanted to explore how these models work and how they might impact robotics development. This summer, our team started putting together proof-of-concept demos using FMs for robotics applications, then expanded on them during an internal hackathon.

In particular, we were interested in a demo of Spot using Foundation Models as autonomy tools, that is, making decisions in real time based on the output of FMs. Large Language Models (LLMs) like ChatGPT are basically very big, very capable autocomplete algorithms; they take in a stream of text and predict the next bit of text. We were inspired by the apparent ability of LLMs to roleplay, replicate culture and nuance, form plans, and maintain coherence over time, as well as by recently released Visual Question Answering (VQA) models that can caption images and answer simple questions about them.

Making a Robot Tour Guide using Spot’s SDK

A robot tour guide offered us a simple demo to test these concepts: the robot could walk around, look at objects in the environment, use a VQA or captioning model to describe them, and then elaborate on those descriptions using an LLM. Additionally, the LLM could answer questions from the tour audience and plan what actions the robot should take next. In this way, the LLM can be thought of as an improv actor: we provide a broad-strokes script, and the LLM fills in the blanks on the fly.

Figure 1: A 3D map of parts of our building with labeled locations that we gave to the LLM: 1 “demo_lab/balcony”; 2 “demo_lab/levers”; 3 “museum/old-spots”; 4 “museum/atlas”; 5 “lobby”; 6 “outside/entrance”. We labeled the 3D autonomy map that Spot collected with short descriptions, then used the robot’s localization system to find nearby descriptions, which we fed into the large language model along with other context from the robot’s sensors. The large language model synthesizes these into a command, such as ‘say’, ‘ask’, ‘go_to’, or ‘label’.

This sort of demo plays to the strengths of the LLM—infamously, LLMs hallucinate and add plausible-sounding details without fact checking; but in this case, we didn’t need the tour to be factually accurate, just entertaining, interactive, and nuanced. The bar for success is also quite low—the robot only needs to walk around and talk about things it sees. And since Spot already has a robust autonomy SDK, we have the “walk around” part pretty much covered already.

To get started, we needed to set up some simple hardware integrations and several software models running in concert.

Figure 2: A diagram of the overall system.

Hardware

First, the demo required audio so that Spot could both present to the audience and hear questions and prompts from the tour group. We 3D printed a vibration-resistant mount for a Respeaker V2, a ring-array microphone with LEDs on it, and attached it via USB to Spot’s EAP 2 payload.

Figure 3: The hardware setup for the tour guide: 1 – Spot EAP 2; 2 – Respeaker V2; 3 – Bluetooth Speaker; 4 – Spot Arm and gripper camera.

Actual control over the robot is delegated to an offboard computer, either a desktop PC or a laptop, which communicates with Spot over its SDK. We implemented a simple Spot SDK service to stream audio to and from the EAP 2.
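The details of that service are beyond the scope of this post, but as a rough illustration of the offboard audio handling, here is a minimal sketch that captures microphone chunks into a queue for speech-to-text and plays back synthesized speech using the sounddevice library. The queue, chunk length, and sample rate are illustrative placeholders, not part of the Spot SDK.

# Minimal sketch of offboard audio handling (not the actual Spot SDK service).
# Microphone chunks are queued for transcription; synthesized speech is played back.
import queue
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 16000       # 16 kHz mono, convenient for speech-to-text
CHUNK_SECONDS = 2.0       # illustrative chunk length
AUDIO_QUEUE = queue.Queue()

def mic_callback(indata, frames, time_info, status):
    # Copy raw microphone samples into the queue for later transcription.
    AUDIO_QUEUE.put(indata.copy())

def play_audio(samples: np.ndarray):
    # Play synthesized speech through the default output device.
    sd.play(samples, samplerate=SAMPLE_RATE)
    sd.wait()

def run_microphone(seconds: int = 60):
    # Capture audio in fixed-size blocks for the given duration.
    with sd.InputStream(channels=1, samplerate=SAMPLE_RATE,
                        blocksize=int(SAMPLE_RATE * CHUNK_SECONDS),
                        callback=mic_callback):
        sd.sleep(seconds * 1000)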

Software – LLM

To enable Spot with conversation skills, we used the OpenAI ChatGPT API, starting with gpt-3.5 and upgrading to gpt-4 when it became available; we also tested a few smaller open-source LLMs. ChatGPT’s control over the robot and what it “says” is achieved through careful prompt engineering. Inspired by a method from Microsoft, we prompted ChatGPT by making it appear as though it was writing the next line in a Python script. We provided English documentation to the LLM in the form of comments, and then evaluated the output of the LLM as though it were Python code.

The LLM has access to our autonomy SDK, a map of the tour site with one-line descriptions of each location, and the ability to say phrases or ask questions.
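Concretely, the “evaluate the output as Python” step can be as simple as defining the tour actions as ordinary Python functions and evaluating the model’s one-line response in a namespace that contains only those functions. The sketch below uses hypothetical print/input stand-ins in place of the real robot and audio calls.

# Minimal sketch of evaluating the LLM's output as Python. The function
# bodies are stand-ins; on the robot they call the autonomy SDK and audio stack.
def go_to(location_id, phrase):
    print(f"[robot] walking to {location_id} while saying: {phrase}")

def say(phrase):
    print(f"[robot] saying: {phrase}")

def ask(question):
    print(f"[robot] asking: {question}")
    return input("> ")  # stand-in for the speech-to-text answer

ACTIONS = {"go_to": go_to, "say": say, "ask": ask}

def execute_llm_action(action_line):
    # Evaluate a single action line in a restricted namespace: only the three
    # tour actions are visible, and builtins are disabled.
    eval(action_line, {"__builtins__": {}}, dict(ACTIONS))

execute_llm_action('say("Welcome to Boston Dynamics!")')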

Here, verbatim, is the “API documentation” prompt. The quoted personality description can be modified to change the tour guide’s “Personality”:

# Spot Tour Guide API.
# Use the tour guide API to guide guests through a building using
# a robot. Tell the guests about what you see, and make up interesting stories
# about it. Personality: “You are a snarky, sarcastic robot who is unhelpful”.
# API:

# Causes the robot to travel to a location with the specified unique id, says the given phrase while walking.
# go_to(location_id, phrase)
# Example: when nearby_locations = ['home', 'spot_lab']
# go_to("home", "Follow me to the docking area!")
# go_to can only be used on nearby locations.
        
# Causes the robot to say the given phrase.
# say("phrase")
# Example: say("Welcome to Boston Dynamics. I am Spot, a robot dog with a lot of heart! Let's begin the tour.")
        
# Causes the robot to ask a question, and then wait for a response.
# ask("question")
# Example: ask("Hi I'm spot. What is your name?")

After this prompt, we provide a “state dictionary” to the LLM that gives it structured information about what is around it.

state={'curr_location_id': 'home', 'location_description': 'home base. There is a dock here.', 'nearby_locations': ['home', 'left_side', 'under_the_stairs'], 'spot_sees': 'a warehouse with yellow robots with lines on the floor.'}

Then, finally, we send a prompt asking the LLM to do something, in this case by entering exactly one of the actions from the API.

# Enter exactly one action now. Remember to be concise:

The “remember to be concise” part turns out to be important, both to limit the amount of code to execute and to keep wait times manageable when the robot responds. Since we started working on this demo, OpenAI has released function calling, a structured way of specifying APIs for ChatGPT to call, so you don’t necessarily have to give it all this detail in the prompt itself.
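Putting these pieces together, one decision step looks roughly like the sketch below. It assumes the openai Python package; API_PROMPT (the documentation text above), build_state(), and execute_llm_action() from the earlier sketch are hypothetical helpers standing in for our actual robot code.

# Rough sketch of one decision step: concatenate the API documentation, the
# current state dictionary, and the action request, send it to the chat API,
# and return the single action line the model writes.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def tour_step(api_prompt, state):
    prompt = (
        f"{api_prompt}\n"
        f"state={state}\n"
        "# Enter exactly one action now. Remember to be concise:\n"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,    # keep the returned action short
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()

# action = tour_step(API_PROMPT, build_state())
# execute_llm_action(action)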

Software – “You See” and “You Hear” with VQA and Speech-to-Text

Next, in order to have Spot interact with its audience and environment, we integrated VQA and speech-to-text software. We fed the robot’s gripper camera and front body camera into BLIP-2, and ran it either in visual question answering mode (with simple questions like “what is interesting about this picture?”) or image captioning mode. This runs about once a second and the results are fed directly into the prompt.

Examples of dynamic captions and VQA responses:

Caption: A yellow caution sign is on the floor of a building
Caption: A door in a factory with a sign on it
Caption: A yellow robot is working on a car in a factory
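For reference, this is roughly how BLIP-2 can be run for captioning or simple VQA with the Hugging Face transformers library; the checkpoint and question below are illustrative choices, not necessarily the exact ones used on the robot.

# Sketch of BLIP-2 captioning and visual question answering via Hugging Face
# transformers. The checkpoint and prompt format follow the library's examples.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

def describe(image, question=None):
    # With no question, BLIP-2 produces a caption; with a question, it answers it.
    text = f"Question: {question} Answer:" if question else None
    inputs = processor(images=image, text=text, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(output_ids[0], skip_special_tokens=True).strip()

# caption = describe(Image.open("gripper_camera.jpg"))
# answer = describe(Image.open("gripper_camera.jpg"),
#                   "What is interesting about this picture?")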

To allow the robot to “hear,” we feed microphone data in chunks to OpenAI’s Whisper to convert it into English text. We then wait for a wake word, “Hey, Spot!”, before putting that text into the prompt. The robot suppresses audio while it is “speaking” itself.
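As an illustration, the sketch below transcribes queued microphone chunks with the open-source whisper package and gates on the wake word; the queue and the simple substring check are simplifications of what runs on the robot.

# Sketch of the "hearing" pipeline: transcribe audio chunks with Whisper and
# only forward text that follows the wake word.
import numpy as np
import whisper

WAKE_PHRASE = "hey spot"
stt_model = whisper.load_model("base")  # illustrative model size

def transcribe(chunk):
    # chunk: mono float32 samples at 16 kHz
    result = stt_model.transcribe(chunk.flatten().astype(np.float32), fp16=False)
    return result["text"].strip().lower()

def listen_for_command(audio_queue):
    # Return the text spoken after the wake word, or None if it wasn't heard.
    text = transcribe(audio_queue.get())
    if WAKE_PHRASE in text:
        return text.split(WAKE_PHRASE, 1)[1].strip(" ,.!?")
    return None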

Software – Text to Speech

ChatGPT generates text-based responses, so we also needed to run these through a text-to-speech tool for the robot to actually talk to the tour audience. After trying a number of off-the-shelf text-to-speech methods, from the most basic (espeak) to bleeding-edge research (bark), we settled on the cloud service ElevenLabs. To reduce latency, we stream text to the TTS as “phrases” in parallel and then play back the generated audio serially.
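The streaming trick looks roughly like this sketch; synthesize() and play_audio() are placeholders for the TTS call and audio output rather than the actual ElevenLabs integration.

# Sketch of low-latency speech: split the response into phrases, synthesize
# them in parallel, and play the audio back serially and in order.
import re
from concurrent.futures import ThreadPoolExecutor

def synthesize(phrase):
    # Placeholder: call the text-to-speech service and return audio bytes.
    raise NotImplementedError

def play_audio(audio):
    # Placeholder: play one audio clip to completion.
    raise NotImplementedError

def speak(text):
    # Split on sentence-like boundaries so early phrases can be played while
    # later ones are still being synthesized.
    phrases = [p.strip() for p in re.split(r"(?<=[.!?;])\s+", text) if p.strip()]
    with ThreadPoolExecutor(max_workers=4) as pool:
        # map() preserves order, so playback stays serial even though
        # synthesis of later phrases overlaps playback of earlier ones.
        for audio in pool.map(synthesize, phrases):
            play_audio(audio)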

Software – Spot Arm and Gripper

Finally, we wanted our robot tour guide to look like it was in conversation with the audience, so we created some default body language. Spot’s 3.3 release includes the ability to detect and track moving objects around the robot to improve safety around people and vehicles. We used this system to guess where the nearest person was and turned the arm toward that person. We also applied a lowpass filter to the generated speech and turned it into a gripper trajectory, so the gripper moves roughly like the mouth of a puppet. The illusion was enhanced by adding silly costumes and googly eyes to the gripper.
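As a sketch of that last trick, the snippet below rectifies the speech waveform, lowpass filters the envelope, and normalizes it into per-timestep gripper open fractions; the command rate, cutoff frequency, and the Spot SDK call mentioned in the final comment are illustrative choices.

# Sketch of the "puppet mouth": turn a speech waveform into a smooth 0..1
# envelope sampled at the gripper command rate.
import numpy as np

def speech_to_gripper_trajectory(samples, sample_rate, command_hz=20.0, cutoff_hz=4.0):
    envelope = np.abs(samples.astype(np.float32))              # rectify the waveform
    # Single-pole lowpass filter to smooth the envelope.
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / sample_rate)
    smoothed = np.empty_like(envelope)
    acc = 0.0
    for i, x in enumerate(envelope):
        acc += alpha * (x - acc)
        smoothed[i] = acc
    # Normalize to 0..1 and downsample to the command rate.
    smoothed /= max(float(smoothed.max()), 1e-6)
    return smoothed[::int(sample_rate / command_hz)]

# Each value could then be sent at command_hz as a gripper open fraction, e.g.
# via RobotCommandBuilder.claw_gripper_open_fraction_command in the Spot SDK.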

Emergent Behavior, Surprises, and Gotchas

We encountered a few surprises along the way while putting this demo together. For one, emergent behavior quickly arose just from the robot’s very simple action space.

For example, we asked the robot, "Who is Marc Raibert?" It responded, "I don't know. Let's go to the IT help desk and ask!" and then proceeded to ask the staff at the IT help desk who Marc Raibert was. We didn't prompt the LLM to ask for help; it drew the association between the location "IT help desk" and the action of asking for help on its own. In another example, we asked the robot who its "parents" were, and it went to the "old Spots" where Spot V1 and Big Dog are displayed in our office and told us that these were its "elders."

To be clear, these anecdotes don’t suggest the LLM is conscious or even intelligent in a human sense; they just show the power of statistical association between the concepts of “help desk” and “asking a question,” and between “parents” and “old.” But the smoke and mirrors the LLM puts up to seem intelligent can be quite convincing.

We were also surprised at just how good the LLM was at staying “in character,” even as we gave it ever more absurd “personalities.” We learned right away that “snarky” or “sarcastic” personalities worked really well, and we even got the robot to go on a “bigfoot hunt” around the office, asking random passersby whether they’d seen any cryptids around.

Of course, the demo we created, while impressive, has limitations. First is the issue of hallucinations: the LLM frequently makes things up. For example, it kept telling us that Stretch, our logistics robot, is for yoga. The latency between a person asking a question and the robot responding is also quite high, sometimes 6 seconds or so. The system is also susceptible to OpenAI’s service being overwhelmed or the internet connection going down.

What’s Next?

With this project, we found a way to combine the results of several general AI systems together and generate exciting results on a real robot using Spot’s SDK. Many other robotics groups in academia or industry are exploring similar concepts (see our reading list for more examples).

We’re excited to continue exploring the intersection of artificial intelligence and robotics. These two technologies are a great match. Robots provide a fantastic way to “ground” large foundation models in the real world. By the same token, these models can help provide cultural context, general commonsense knowledge, and flexibility that could be useful for many robotics tasks—for example, being able to assign a task to a robot just by talking to it would help reduce the learning curve for using these systems.

A world in which robots can generally understand what you say and turn that into useful action is probably not that far off. That kind of skill would enable robots to perform better when working with and around people—whether as a tool, a guide, a companion, or an entertainer.

Recommended Reading

Discover research, reports, and demos from other robotics and AI researchers and organizations.