Diffusion Forcing for Multi-Agent Interaction Sequence Modeling
Vongani H. Maluleke*§, Kie Horiuchi*†§, Lea Wilken§, Evonne Ng‡, Jitendra Malik§, Angjoo Kanazawa§
Sony Group Corporation†, Meta‡, UC Berkeley§
*Equal contribution
Abstract
Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, capturing this complexity in generative models remains an open problem. Current motion generation methods are specialized: they handle only isolated tasks such as dyadic reactions or partner prediction, and lack unified capabilities for comprehensive multi-agent motion generation. We introduce MAGNeT (Multi-Agent Diffusion Forcing Transformer), a unified autoregressive diffusion framework that bridges this gap. The model handles interactions among multiple people and supports flexible sampling strategies for various tasks, including dyadic prediction, partner inpainting, and multi-agent motion generation. It can autoregressively generate hundreds of realistic motions for ultra-long sequences, capturing both synchronized activities (e.g., dancing, boxing) and diverse social interactions. Building on Diffusion Forcing, we introduce key modifications to enable effective modeling of inter-agent interactions. Our model performs on par with specialized methods on dyadic interactions while naturally extending to polyadic scenarios with three or more people, a capability enabled by our framework's scalable architecture.
Method
Technical overview of the MAGNeT architecture and training process
MAGNeT Training Pipeline
Temporal Denoising Schedule
Visualizing how tokens are denoised across time steps for different sampling strategies
All agents' future motion is jointly generated from a single distribution, ensuring coordinated predictions.
$P(A_{t:t+L}, B_{t:t+L} \mid A_{0:t-1}, B_{0:t-1})$
Joint noise sampling preserves spatial and temporal correlations and naturally extends to n ≥ 2 agents. Each token is denoised progressively, subject to causal dependencies.
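The progressive, causally ordered denoising described above can be illustrated with a toy noise-level schedule. The following is a minimal sketch under stated assumptions, not the MAGNeT implementation: the function names `pyramid_schedule` and `add_joint_noise`, the staggered "pyramid" layout, the linear signal fraction, and the choice to share one noise level per time step across all agents are illustrative guesses that the page does not confirm.

```python
import numpy as np


def pyramid_schedule(horizon: int, num_levels: int) -> np.ndarray:
    """Toy per-token noise schedule (hypothetical, for illustration only).

    Returns an integer array of shape (num_levels + horizon, horizon).
    Entry [m, i] is the noise level of future token i after m denoising
    steps: num_levels means pure noise, 0 means fully clean. Token i starts
    denoising i steps after token 0, so an earlier token is never noisier
    than a later one.
    """
    steps = num_levels + horizon
    m = np.arange(steps)[:, None]    # denoising step index (rows)
    i = np.arange(horizon)[None, :]  # future token index (columns)
    return np.clip(num_levels - (m - i), 0, num_levels)


def add_joint_noise(motion, levels, num_levels, rng):
    """Noise every agent's tokens using a shared per-time-step noise level.

    motion: array of shape (num_agents, horizon, dim).
    levels: array of shape (horizon,), integer noise level per time step.
    Assumption: a linear signal fraction alpha = 1 - level / num_levels.
    """
    alpha = 1.0 - np.asarray(levels, dtype=float) / num_levels
    alpha = alpha[None, :, None]  # broadcast over agents and feature dims
    noise = rng.standard_normal(motion.shape)
    return np.sqrt(alpha) * motion + np.sqrt(1.0 - alpha) * noise


# Example: three agents, a four-token horizon, three noise levels.
rng = np.random.default_rng(0)
schedule = pyramid_schedule(horizon=4, num_levels=3)
motion = rng.standard_normal((3, 4, 2))
noisy = add_joint_noise(motion, schedule[2], num_levels=3, rng=rng)
```

Each row of `schedule` is one denoising step: the first row is all noise, the last row is all clean, and in between the earlier tokens are cleaner than the later ones. Because the noise level is indexed by time step only, the same code applies unchanged to any number of agents.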