Diffusion Forcing for Multi-Agent Interaction Sequence Modeling
Vongani H. Maluleke*§, Kie Horiuchi*†§, Lea Wilken§, Evonne Ng‡, Jitendra Malik§, Angjoo Kanazawa§
Sony Group Corporation†, Meta‡, UC Berkeley§
*Equal contribution
Abstract
Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, modeling such interactions remains difficult due to long temporal horizons, strong inter-agent dependencies, and variable group sizes. Existing motion generation methods are largely task-specific and do not generalize to flexible multi-agent generation. We introduce MAGNet (Multi-Agent Generative Network), a unified autoregressive diffusion framework for multi-agent motion generation that supports a wide range of interaction tasks through flexible conditioning and sampling. MAGNet performs dyadic and polyadic prediction, partner inpainting, partner prediction, and agentic generation all within a single model, and can autoregressively generate ultra-long sequences spanning hundreds of motion steps. We explicitly model inter-agent coupling during autoregressive denoising, enabling coherent coordination across agents. As a result, MAGNet captures both tightly synchronized activities (e.g., dancing, boxing) and loosely structured social interactions. Our approach performs on par with specialized methods on dyadic benchmarks while naturally extending to polyadic scenarios involving three or more interacting people.
Model Capabilities
Explore diverse multi-agent motion generation tasks
Dyadic Prediction
Dyadic Prediction: Jointly predict future motion of both agents from past observations, ensuring coordinated and realistic two-person interactions.
Polyadic Prediction (N=3)
Polyadic Prediction (N=3): Predict the next action of all agents given the current and historical motion of all agents.
Polyadic Prediction (N=4)
Polyadic Prediction (N=4): Predict the next action of all agents given the current and historical motion of all agents.
Synchronous Agentic Sampling
Synchronous Agentic Sampling: All agents generate motion simultaneously in parallel, each responding to the observed actions of others at the same time step.
Asynchronous Agentic Sampling
Asynchronous Agentic Sampling: Agents generate motion sequentially in turn-taking fashion, enabling highly reactive behaviors and distributed inference.
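The contrast between the two agentic sampling modes can be sketched in a few lines. This is an illustrative toy, not the project's code: `policies` and the per-agent state values are hypothetical stand-ins for the learned motion model and its token states.

```python
def synchronous_step(policies, states):
    """Synchronous sampling (sketch): every agent acts in parallel,
    each seeing only the states from the previous time step."""
    frozen = list(states)  # snapshot so no agent sees another's new action
    return [p(frozen, i) for i, p in enumerate(policies)]

def asynchronous_step(policies, states):
    """Asynchronous sampling (sketch): agents act in turn, so agent i
    already observes the new actions of agents 0..i-1 within the same
    step, which is what makes this mode more reactive."""
    states = list(states)
    for i, p in enumerate(policies):
        states[i] = p(states, i)  # later agents react to updated states
    return states
```

With a toy policy that sums everyone's state, the two modes visibly diverge: synchronous agents all respond to the same snapshot, while asynchronous agents compound each other's updates within one step.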
Partner Prediction
Partner Prediction: Predict one agent's future motion given both agents' past motion, generating reactive and contextually appropriate responses.
Partner Inpainting
Partner Inpainting: Generate one agent's complete motion sequence given the full motion of their partner, reconstructing coordinated interactions.
Motion In-Betweening
Motion In-Betweening: Generate smooth transitions between specified keyframes for all agents, supporting flexible motion control and editing.
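Conditioning tasks like partner inpainting and in-betweening can be posed as masked diffusion sampling: observed tokens (a partner's full motion, or keyframes) are clamped to their known values at every denoising step while the remaining tokens are generated. The sketch below illustrates that general recipe under stated assumptions; `denoise_fn` is a hypothetical stand-in for the learned denoiser, and this is not necessarily MAGNet's exact mechanism.

```python
import numpy as np

def masked_sampling(denoise_fn, known, mask, num_steps, seed=0):
    """Sketch of conditioning by masking: `mask` marks observed tokens,
    which stay fixed at their known values; only unobserved tokens are
    denoised from random noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(known.shape)  # start all tokens from noise
    for s in range(num_steps):
        x = np.where(mask, known, x)      # clamp observed tokens
        x = denoise_fn(x, s)              # update the free tokens
    return np.where(mask, known, x)
```

The same loop covers both tasks: for partner inpainting the mask selects one agent's entire trajectory; for in-betweening it selects keyframes across all agents.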
Method
Technical overview of the MAGNet architecture and training process
MAGNet Training Pipeline
Temporal Denoising Schedule
Visualizing how tokens are denoised across time steps for different sampling strategies
All agents' future motion is jointly generated from a single distribution, ensuring coordinated predictions.
P(A_{t:t+L}, B_{t:t+L} | A_{0:t-1}, B_{0:t-1})
Joint noise sampling preserves spatial and temporal correlations and naturally extends to n ≥ 2 agents. Each token denoises progressively with causal dependencies.
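The joint denoising idea can be sketched as follows. This is a minimal toy, not the released implementation: `denoise_fn` stands in for the learned network, and the key point is that a single call sees the stacked tokens of every agent, so each step can couple one agent's motion to the others'.

```python
import numpy as np

def joint_denoise(denoise_fn, past, num_agents, horizon, dim,
                  num_steps, seed=0):
    """Sketch of jointly denoising all agents' future tokens.
    One joint noise sample of shape (num_agents, horizon, dim) is
    refined by a denoiser that observes every agent at once, which
    preserves inter-agent correlations and extends to any n >= 2."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_agents, horizon, dim))
    for s in range(num_steps):
        x = denoise_fn(x, past, s)  # all agents updated together
    return x

def toy_denoiser(x, past, s):
    """Crude stand-in for learned coordination: pull each agent's
    tokens toward the group mean at every step."""
    return 0.5 * x + 0.5 * x.mean(axis=0, keepdims=True)
```

Running the loop with the toy denoiser shows the coupling effect directly: the cross-agent spread shrinks each step because every agent's update depends on the others.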
Baseline Comparisons
Comparison with state-of-the-art methods on key tasks
Dyadic Prediction: Ready-to-React vs MAGNet (Ours)
Ready-to-React vs MAGNet (Ours): Side-by-side comparison of dyadic future prediction results between the Ready-to-React baseline and our MAGNet model.
Partner Prediction: Ready-to-React vs MAGNet (Ours)
Ready-to-React vs MAGNet (Ours): Side-by-side comparison of partner prediction results between the Ready-to-React baseline and our MAGNet model.
Partner Inpainting: Duolando vs MAGNet (Ours)
Duolando vs MAGNet (Ours): Side-by-side comparison of partner inpainting results between the Duolando baseline and our MAGNet model.
Partner Inpainting: ReMoS vs MAGNet (Ours)
ReMoS vs MAGNet (Ours): Side-by-side comparison of partner inpainting results between the ReMoS baseline and our MAGNet model.
Ultra-Long Motion Generation
MAGNet can autoregressively generate realistic motion for ultra-long sequences spanning hundreds of motion steps