
MIT develops multimodal technology for robot training


Researchers filmed several instances of a robotic arm feeding a dog. The videos were included in the datasets used to train the robot.

Training general-purpose robots remains a major challenge. Typically, engineers collect data specific to a particular robot and task and use it to train the robot in a controlled environment. However, collecting this data is expensive and time-consuming, and the robot will likely struggle to adapt to environments or tasks it has not encountered before.

To train better general-purpose robots, MIT researchers have developed a versatile technique that combines vast amounts of heterogeneous data from many sources into one system that can teach any robot a wide range of tasks.

Their method involves combining data from different domains, such as simulations and real robots, as well as various modalities, including vision sensors and robotic arm position encoders, into a common “language” that a generative AI model can process.

By combining such a huge amount of data, this approach can be used to train a robot to perform a wide variety of tasks without having to start training it from scratch each time.

This method can be faster and cheaper than traditional approaches because it requires far less task-specific data. Additionally, it outperformed training from scratch by more than 20% in both simulation and real-world experiments.

“In robotics, people often argue that we don’t have enough training data. But I think the other big challenge is that the data comes from so many different domains, modalities, and robot hardware. Our work shows how a robot can be trained with all of them put together,” said Lirui Wang, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on the technique.

Wang’s co-authors include fellow EECS graduate student Jialiang Zhao; Xinlei Chen, a research scientist at Meta; and senior author Kaiming He, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL).

MIT researchers have developed a multimodal technique to help robots learn new skills.

This figure shows how the new technique combines data from different domains, such as simulation and real robots, as well as from multiple modalities, including vision sensors and robotic arm position encoders, into a common “language” that a generative AI model can process.

Inspired by LLMs

A robot “policy” takes in sensor observations, such as camera images or proprioceptive measurements that track the speed and position of a robotic arm, and then tells the robot how and where to move.
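In code, a policy is simply a mapping from observations to actions. The sketch below is a minimal, hedged illustration of that interface; the class names, array shapes, and placeholder logic are assumptions for illustration, not the authors’ implementation.

```python
# Minimal sketch of a robot "policy": a function from sensor observations to
# an action command. Names and shapes here are illustrative assumptions.
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    camera_rgb: np.ndarray      # e.g. a (224, 224, 3) image from a scene or wrist camera
    proprioception: np.ndarray  # e.g. joint positions and velocities of the arm

@dataclass
class Action:
    joint_deltas: np.ndarray    # how far and in which direction each joint should move

def policy(obs: Observation) -> Action:
    """Map the current observation to the next action (placeholder logic)."""
    # A learned policy would run a neural network here; this stub just holds still.
    return Action(joint_deltas=np.zeros_like(obs.proprioception))
```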

Policies are typically trained with imitation learning: a human demonstrates actions or teleoperates a robot to generate data, which is then fed to an AI model that learns the policy. Because this approach uses a small amount of task-specific data, robots often fail when their environment or task changes.

To develop a better approach, Wang and his colleagues took inspiration from large language models such as GPT-4.

These models are pre-trained on a huge amount of diverse language data and then fine-tuned with small amounts of task-specific data. Pre-training on such a large amount of data helps the models adapt and perform well across different tasks.

“In the realm of language, all data is simply sentences. In robotics, given all the heterogeneity of the data, if you want to do pretraining in this way, you need a different architecture,” Wang said.

Robotic data takes many forms, from camera images to language instructions to depth maps. At the same time, each robot is mechanically unique: it has a different number and orientation of arms, grippers and sensors. Additionally, the environments in which data are collected vary greatly.


MIT researchers have developed a new architecture called Heterogeneous Pretrained Transformers (HPT) that integrates data from these different modalities and domains.

At the heart of their architecture is a machine-learning model known as a transformer, which processes vision and proprioception inputs. A transformer is the same type of model that forms the basis of large language models.

The researchers convert vision and proprioception data into the same type of input, called a token, which the transformer can process. Each input is represented by the same fixed number of tokens.

The transformer then maps all the input data into one common space, growing into a huge pre-trained model as it processes and trains on more data. The larger the transformer becomes, the better it will perform.
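The sketch below illustrates this idea in PyTorch: small per-modality “stems” convert proprioception and vision inputs into the same fixed number of tokens, and a shared transformer trunk maps them into one common space. It is a hedged, minimal reading of the approach; the module names, token count, and dimensions are assumptions, not the released HPT code.

```python
# Hedged sketch (not the authors' code) of the HPT idea: per-modality stems
# produce a fixed number of tokens, and a shared transformer trunk maps all
# tokens into a common representation space.
import torch
import torch.nn as nn

NUM_TOKENS = 16   # each modality contributes the same fixed number of tokens (assumption)
DIM = 256         # width of the shared token space (assumption)

class ProprioStem(nn.Module):
    """Project a raw proprioception vector into NUM_TOKENS tokens."""
    def __init__(self, proprio_dim: int):
        super().__init__()
        self.proj = nn.Linear(proprio_dim, NUM_TOKENS * DIM)

    def forward(self, x):                      # x: (batch, proprio_dim)
        return self.proj(x).view(x.shape[0], NUM_TOKENS, DIM)

class VisionStem(nn.Module):
    """Turn image features (e.g. from a vision encoder) into NUM_TOKENS tokens."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, DIM)
        self.pool = nn.AdaptiveAvgPool1d(NUM_TOKENS)

    def forward(self, feats):                  # feats: (batch, n_patches, feat_dim)
        tokens = self.proj(feats)              # (batch, n_patches, DIM)
        return self.pool(tokens.transpose(1, 2)).transpose(1, 2)  # (batch, NUM_TOKENS, DIM)

class SharedTrunk(nn.Module):
    """Transformer trunk shared across robots, domains, and tasks."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, vision_tokens, proprio_tokens):
        tokens = torch.cat([vision_tokens, proprio_tokens], dim=1)
        return self.encoder(tokens)            # (batch, 2 * NUM_TOKENS, DIM)
```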

A user only needs to provide HPT with a small amount of data about their robot’s design, setup, and the task they want it to perform. HPT then transfers the knowledge the transformer acquired during pre-training to learn the new task.
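Below is a hedged sketch of that adaptation step, reusing the modules defined in the previous sketch: the pretrained trunk is frozen and only a small task-specific action head is trained on the user’s data. The head architecture, learning rate, and 7-DoF action size are illustrative assumptions.

```python
# Hedged sketch of adapting a pretrained HPT-style trunk to a new robot and task:
# freeze the shared trunk and train only a small action head on the user's data.
# SharedTrunk and DIM refer to the sketch above, not the released HPT code.
import torch
import torch.nn as nn

trunk = SharedTrunk()                           # pretrained weights would be loaded here
for p in trunk.parameters():
    p.requires_grad = False                     # keep the pretrained trunk frozen

action_head = nn.Sequential(                    # small task-specific head, trained from scratch
    nn.Linear(DIM, 128), nn.ReLU(), nn.Linear(128, 7)  # e.g. a 7-DoF arm command
)

optimizer = torch.optim.Adam(action_head.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

def finetune_step(vision_tokens, proprio_tokens, expert_action):
    """One imitation-learning update on a small task-specific dataset."""
    features = trunk(vision_tokens, proprio_tokens).mean(dim=1)  # pool tokens
    loss = loss_fn(action_head(features), expert_action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```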

Enabling dexterous movements

One of the biggest challenges in developing HPT was building the huge dataset used to pre-train the transformer, which spans 52 datasets with more than 200,000 robot trajectories across four categories, including human demonstration videos and simulations.

The researchers also needed to develop an efficient way to convert raw proprioceptive signals from an array of sensors into data that the transformer could process.

“Proprioception is key to many dexterous movements. Since the number of tokens in our architecture is always the same, we give equal weight to proprioception and vision,” Wang explained.

When they tested HPT, the robot’s performance on simulated and real-world tasks improved by more than 20% compared to training it from scratch each time. Even when the task was very different from the pre-training data, HPT still improved performance.

“This paper presents a novel approach to training a single policy across many robot embodiments. This enables training on diverse datasets, allowing robot learning methods to significantly scale up the data they can train on. It also allows the model to quickly adapt to new robot designs, which is important because new designs are being created all the time,” said David Held, an assistant professor at Carnegie Mellon University’s Robotics Institute who was not involved in this work.

In the future, researchers want to explore how data diversity can improve HPT performance. They also want to improve HPT so it can handle unlabeled data like GPT-4 and other large language models.

“Our dream is to create a universal robot brain that you can download and use for your robot without any training. Although we are only in the early stages, we are going to keep working hard and hope that scaling leads to a breakthrough in robotic policies, just like it did with large language models,” said Wang.

Editor’s Note: This article has been republished from MIT News.