Diffusion Forcing for Multi-Agent Interaction Sequence Modeling
Vongani H. Maluleke*§, Kie Horiuchi*†§, Lea Wilken§, Evonne Ng‡, Jitendra Malik§, Angjoo Kanazawa§
Sony Group Corporation†, Meta‡, UC Berkeley§
*Equal contribution
Abstract
Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, capturing this complexity in generative models remains an open problem. Current motion generation methods are specialized, handling only isolated tasks such as dyadic reactions or partner prediction, and lack unified capabilities for comprehensive multi-agent motion generation. We introduce MAGNet, a Multi-Agent Diffusion Forcing Transformer Network: a unified autoregressive diffusion framework that bridges this gap. The model seamlessly handles interactions among multiple people and supports flexible sampling strategies for various tasks, including dyadic prediction, partner inpainting, and multi-agent motion generation. It can autoregressively generate ultra-long sequences comprising hundreds of realistic motions, capturing both synchronized activities (e.g., dancing, boxing) and diverse social interactions. Building on Diffusion Forcing, we introduce key modifications that enable effective modeling of inter-agent interactions. Our model performs on par with specialized methods on dyadic interactions while naturally extending to polyadic scenarios with three or more people, a capability enabled by our framework's scalable architecture.
Model Capabilities
Explore diverse multi-agent motion generation tasks
Dyadic Prediction
Dyadic Prediction: Jointly predict future motion of both agents from past observations, ensuring coordinated and realistic two-person interactions.
Polyadic Prediction (N=3)
Polyadic Prediction (N=3): Jointly predict the next actions of all three agents given the current and historical motion of every agent.
Polyadic Prediction (N=4)
Polyadic Prediction (N=4): Jointly predict the next actions of all four agents given the current and historical motion of every agent.
Synchronous Agentic Sampling
Synchronous Agentic Sampling: All agents generate motion simultaneously in parallel, each responding to the observed actions of others at the same time step.
Asynchronous Agentic Sampling
Asynchronous Agentic Sampling: Agents generate motion sequentially in turn-taking fashion, enabling highly reactive behaviors and distributed inference.
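The difference between the two agentic sampling modes can be sketched as two update loops. This is a toy illustration, not the MAGNet model: `denoise_step` below is a hypothetical stand-in (it simply pulls a noisy token toward the mean of the context tokens) used only to show the parallel versus turn-taking control flow.

```python
import numpy as np

def denoise_step(token, context):
    # Hypothetical stand-in for one model denoising update:
    # pull the noisy token toward the mean of the context tokens.
    return 0.5 * token + 0.5 * context.mean(axis=0)

def synchronous_step(tokens):
    """All agents denoise in parallel from the same snapshot of the others."""
    snapshot = tokens.copy()
    return np.stack([denoise_step(tokens[i], snapshot)
                     for i in range(len(tokens))])

def asynchronous_step(tokens):
    """Agents denoise one at a time; each sees the others' latest tokens,
    which enables reactive behavior and distributed (per-agent) inference."""
    tokens = tokens.copy()
    for i in range(len(tokens)):
        tokens[i] = denoise_step(tokens[i], tokens)
    return tokens
```

In the synchronous case every agent reacts to the same frozen snapshot, while in the asynchronous case later agents already see the updated tokens of earlier agents within the same step.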
Partner Prediction
Partner Prediction: Predict one agent's future motion given both agents' past motion, generating reactive and contextually appropriate responses.
Partner Inpainting
Partner Inpainting: Generate one agent's complete motion sequence given the full motion of their partner, reconstructing coordinated interactions.
Motion In-Betweening
Motion In-Betweening: Generate smooth transitions between specified keyframes for all agents, supporting flexible motion control and editing.
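The conditioning tasks above (partner prediction, partner inpainting, in-betweening) can all be phrased as masked sampling: frames that are observed stay clamped to their ground-truth values while the remaining frames are iteratively denoised. The sketch below, a loose illustration of that pattern rather than the actual MAGNet sampler, uses a toy temporal smoother in place of the learned denoiser.

```python
import numpy as np

def masked_sample(observed, known_mask, n_iters=20, seed=0):
    """Fill in unknown motion conditioned on known frames.
    observed:   (agents, frames, dims) motion array
    known_mask: boolean array of the same shape; True entries are kept.
    The 'denoiser' here is a toy temporal smoother standing in for the model.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(observed.shape)
    x[known_mask] = observed[known_mask]
    for _ in range(n_iters):
        # Toy denoising update: smooth each agent's trajectory along time.
        x = 0.5 * x + 0.25 * (np.roll(x, 1, axis=1) + np.roll(x, -1, axis=1))
        x[known_mask] = observed[known_mask]  # re-impose observations each step
    return x
```

Choosing the mask selects the task: masking only past frames of both agents gives dyadic prediction, masking one agent's entire sequence as known gives partner inpainting, and masking scattered keyframes gives in-betweening.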
Method
Technical overview of the MAGNet architecture and training process
MAGNet Training Pipeline
Temporal Denoising Schedule
Visualizing how tokens are denoised across time steps for different sampling strategies
All agents' future motion is jointly generated from a single distribution, ensuring coordinated predictions.
P(A_{t:t+L}, B_{t:t+L} | A_{0:t-1}, B_{0:t-1})
Joint noise sampling preserves spatial and temporal correlations and naturally extends to n ≥ 2 agents. Each token denoises progressively with causal dependencies.
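Diffusion Forcing assigns each token its own noise level, so earlier tokens can be denoised ahead of later ones and serve as (partially denoised) causal context. A common way to realize this is a "pyramid" schedule; the sketch below constructs one such per-token noise-level table. The exact schedule used by MAGNet is not specified here, so treat this as an assumed illustrative variant.

```python
import numpy as np

def pyramid_schedule(horizon, n_levels):
    """Per-token noise levels for a causal pyramid denoising schedule.
    Returns an (n_levels + horizon - 1, horizon) array in [0, 1]:
    row s gives each token's noise level at sampling step s, with
    earlier tokens always at or below the noise of later tokens."""
    n_steps = n_levels + horizon - 1
    levels = np.zeros((n_steps, horizon))
    for s in range(n_steps):
        for t in range(horizon):
            # Token t lags the schedule by t steps, clipped to valid levels.
            levels[s, t] = (np.clip(n_levels - 1 - s + t, 0, n_levels - 1)
                            / (n_levels - 1))
    return levels
```

All tokens start fully noisy (level 1), the leading token reaches level 0 first, and each later token trails by one step, which is what gives every token causal access to less-noisy past context during sampling.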
Baseline Comparisons
Comparison with state-of-the-art methods on key tasks
Dyadic Prediction: Ready-to-React vs MAGNet (Ours)
Ready-to-React vs MAGNet (Ours): Side-by-side comparison of dyadic future prediction results between the Ready-to-React baseline and our MAGNet model.
Partner Prediction: Ready-to-React vs MAGNet (Ours)
Ready-to-React vs MAGNet (Ours): Side-by-side comparison of partner prediction results between the Ready-to-React baseline and our MAGNet model.
Partner Inpainting: Duolando vs MAGNet (Ours)
Duolando vs MAGNet (Ours): Side-by-side comparison of partner inpainting results between the Duolando baseline and our MAGNet model.
Partner Inpainting: ReMoS vs MAGNet (Ours)
ReMoS vs MAGNet (Ours): Side-by-side comparison of partner inpainting results between the ReMoS baseline and our MAGNet model.
Ultra Long Motion Generation
MAGNet can autoregressively generate ultra-long sequences comprising hundreds of realistic motions