Conference Paper
MuTT: A Multimodal Trajectory Transformer for Robot Skills
We are happy to announce that our paper “MuTT: A Multimodal Trajectory Transformer for Robot Skills” has been accepted at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
The full text of the paper is available on arXiv (Link).
The MuTT architecture: modality-specific embedding of the simulated trajectory (blue), the skill parameters (green), and the environment image (red) into tokens, which are concatenated into a single token sequence. All tokens are enriched with modality-specific positional and token-type encodings and passed through a transformer encoder. The decoder predicts the real-world trajectory (purple) autoregressively.
In the paper, we introduce MuTT (Multimodal Trajectory Transformer), a novel encoder-decoder transformer architecture that predicts robot skill executions by integrating vision, trajectory, and robot skill parameters. MuTT is the first transformer to successfully fuse the vision and trajectory modalities, featuring a new trajectory projection method that preserves precise pose and force information. The model can serve as a predictor within robot skill optimizers, enabling skill parameters to be optimized for the current environment without real-world executions during optimization.

We demonstrate MuTT’s effectiveness in three comprehensive experiments (two real-world industrial tasks and one simulation benchmark) using two different skill representations. The results show that MuTT accurately predicts robot trajectories and forces, leading to significant improvements in task success rate and efficiency when used within model-based optimization frameworks.
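To make the fusion scheme described above concrete, here is a minimal sketch of the idea in PyTorch. This is a hypothetical illustration, not the authors’ implementation: the class name, dimensions, layer counts, and the use of simple linear embeddings per modality are all assumptions; fixed positional encodings are omitted for brevity, with only learned token-type embeddings shown.

```python
import torch
import torch.nn as nn


class MuTTSketch(nn.Module):
    """Hypothetical sketch: embed each modality into tokens, fuse them into
    one sequence, encode, and decode the real-world trajectory."""

    def __init__(self, d_model=64, traj_dim=9, param_dim=8, patch_dim=48):
        super().__init__()
        # Modality-specific embeddings into a shared token space.
        self.traj_embed = nn.Linear(traj_dim, d_model)    # simulated trajectory
        self.param_embed = nn.Linear(param_dim, d_model)  # skill parameters
        self.img_embed = nn.Linear(patch_dim, d_model)    # flattened image patches
        # Learned token-type encodings: 0=trajectory, 1=parameters, 2=image.
        self.type_embed = nn.Embedding(3, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.out = nn.Linear(d_model, traj_dim)  # per-step trajectory prediction

    def forward(self, sim_traj, params, img_patches, tgt_prefix):
        # Embed each modality and tag it with its token-type encoding.
        t = self.traj_embed(sim_traj) + self.type_embed.weight[0]
        p = self.param_embed(params) + self.type_embed.weight[1]
        i = self.img_embed(img_patches) + self.type_embed.weight[2]
        # Concatenate all modalities into one fused token sequence.
        memory = self.encoder(torch.cat([t, p, i], dim=1))
        # Autoregressive decoding: causal mask over the target prefix.
        tgt = self.traj_embed(tgt_prefix)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        return self.out(self.decoder(tgt, memory, tgt_mask=mask))


# Usage: batch of 2, 10 trajectory steps, 1 parameter token, 16 image patches.
model = MuTTSketch()
pred = model(torch.randn(2, 10, 9), torch.randn(2, 1, 8),
             torch.randn(2, 16, 48), torch.randn(2, 10, 9))
```

The key design choice illustrated here is that all three modalities live in a single token sequence, so the encoder’s self-attention can relate image patches, parameters, and trajectory steps directly, rather than fusing the modalities late through separate feature vectors.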