Diffusion Forcing for Multi-Agent Interaction Sequence Modeling


Vongani H. Maluleke, Kie Horiuchi*†§, Lea Wilken§, Evonne Ng, Jitendra Malik§, Angjoo Kanazawa§

Sony Group Corporation, Meta, UC Berkeley§

*Equal contribution

MAGNeT: Multi-Agent Diffusion Forcing Transformer, a unified autoregressive diffusion framework for modeling and generating realistic motion of multiple interacting humans.

Abstract

Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, capturing this complexity in generative models remains an open problem. Current motion generation methods are specialized, handling only isolated tasks like dyadic reactions or partner prediction, and lack unified capabilities for comprehensive multi-agent motion generation. We introduce MAGNeT, Multi-Agent Diffusion Forcing Transformer, a unified autoregressive diffusion framework that bridges this gap. The model seamlessly handles interactions among multiple people and supports flexible sampling strategies for various tasks, including dyadic prediction, partner inpainting, and multi-agent motion generation. It can autoregressively generate hundreds of realistic motions for ultra-long sequences, capturing both synchronized activities (e.g., dancing, boxing) and diverse social interactions. Building on Diffusion Forcing, we introduce key modifications to enable effective modeling of inter-agent interactions. Our model performs on-par with specialized methods on dyadic interactions while naturally extending to polyadic scenarios with three or more people—a capability enabled by our framework's scalable architecture.

Method

Technical overview of the MAGNeT architecture and training process

MAGNeT Training Pipeline

Figure: input motion for agents A and B is converted into per-agent motion representations, encoded into latent tokens by a VQ-VAE encoder, corrupted with noise (M̃(τ) = M0 + ε), denoised over N steps by the Diffusion Forcing Transformer (DFoT) denoiser Fφ, and passed through the VQ-VAE decoder to reconstruct the motion X̃.

Motion components: β — body shape; Θ — joint angles; Tcan→root — canonical-to-root transform; ΔTcan — canonical delta; z — latent code; Ts→p — self-to-partner transform.

Motion tokens: each agent's per-frame token concatenates its latent code, canonical delta, and transform to its partner, i.e. mA = [zA, ΔTcan, TA→B] and mB = [zB, ΔTcan, TB→A].
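To make the pipeline concrete, here is a minimal PyTorch sketch of a single training step. The module interfaces (vqvae.encode, the dfot denoiser call), tensor shapes, and the linear noise schedule are illustrative assumptions rather than the paper's exact code; the key Diffusion Forcing ingredient it shows is the independent noise level drawn for every token.

# A minimal sketch of one MAGNeT-style training step, assuming a pre-trained
# VQ-VAE (`vqvae`) and a Diffusion Forcing Transformer denoiser (`dfot`).
# Interfaces, shapes, and the linear noise schedule are illustrative
# assumptions, not the authors' exact implementation.
import torch
import torch.nn.functional as F

def training_step(vqvae, dfot, motion_A, motion_B,
                  delta_can_A, delta_can_B, T_A_to_B, T_B_to_A, K=1000):
    """motion_*: per-agent motion features (bsz, T, D); delta_can_* / T_*_to_*:
    per-frame canonical deltas and self-to-partner transforms; K: noise levels."""
    bsz, T, _ = motion_A.shape

    # 1) Encode each agent's motion into latent codes with the frozen VQ-VAE encoder.
    with torch.no_grad():
        z_A = vqvae.encode(motion_A)                       # (bsz, T, Dz)
        z_B = vqvae.encode(motion_B)                       # (bsz, T, Dz)

    # 2) Per-agent motion tokens: latent code + canonical delta + partner transform,
    #    i.e. mA = [zA, ΔTcan, TA→B] and mB = [zB, ΔTcan, TB→A].
    m_A = torch.cat([z_A, delta_can_A, T_A_to_B], dim=-1)
    m_B = torch.cat([z_B, delta_can_B, T_B_to_A], dim=-1)
    tokens = torch.stack([m_A, m_B], dim=1)                # (bsz, 2, T, Dm)

    # 3) Diffusion forcing: draw an independent noise level for every token
    #    (per agent and per frame) instead of one level for the whole sequence.
    tau = torch.randint(0, K, (bsz, 2, T), device=tokens.device)
    alpha = (1.0 - tau.float() / K)[..., None]             # toy linear schedule
    eps = torch.randn_like(tokens)
    noisy = alpha.sqrt() * tokens + (1.0 - alpha).sqrt() * eps

    # 4) The transformer attends over both agents' noisy tokens jointly and
    #    predicts the noise at each token's own level.
    eps_hat = dfot(noisy, tau)                             # (bsz, 2, T, Dm)
    return F.mse_loss(eps_hat, eps)

Training with independent per-token noise levels is what later enables the flexible sampling schedules described below, such as joint future prediction and partner inpainting.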

Temporal Denoising Schedule

Visualizing how tokens are denoised across time steps for different sampling strategies

Joint Future Prediction

All agents' future motion is jointly generated from a single distribution, ensuring coordinated predictions.

P(At:t+L, Bt:t+L | A0:t−1, B0:t−1)

Joint noise sampling preserves spatial and temporal correlations and naturally extends to n ≥ 2 agents. Each token is denoised progressively, with causal dependencies on earlier tokens.
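As an illustration, the sketch below realizes such a schedule with a causal, pyramid-style assignment of noise levels to future tokens, denoising both agents jointly while conditioning on clean history. The dfot interface and linear noise schedule are carried over from the training sketch above, and the DDIM-style update is a simplification; none of this is claimed to match the authors' exact sampler.

# Sketch of joint future prediction with a causal per-token denoising schedule
# (a common Diffusion Forcing choice). Reuses the hypothetical `dfot` denoiser
# and linear schedule from the training sketch; the DDIM-style update is an
# illustrative simplification.
import torch

@torch.no_grad()
def sample_joint_future(dfot, history, horizon, K=1000):
    """history: clean tokens for all agents, shape (bsz, n_agents, T_hist, Dm)."""
    bsz, n_agents, T_hist, Dm = history.shape
    device = history.device
    frames = torch.arange(horizon, device=device)

    def noise_level(step):
        # Causal pyramid: future frame f lags frame f-1 by one denoising step,
        # so earlier future tokens are always cleaner than later ones.
        return torch.clamp(K - 1 - step + frames, min=0, max=K - 1)   # (horizon,)

    def alpha(tau):
        return (1.0 - tau.float() / K)[..., None]                     # (horizon, 1)

    # Joint sampling: both agents' future tokens start from noise and are
    # denoised together, so cross-agent correlations are preserved.
    future = torch.randn(bsz, n_agents, horizon, Dm, device=device)

    for step in range(K + horizon):
        tau, tau_next = noise_level(step), noise_level(step + 1)
        a, a_next = alpha(tau), alpha(tau_next)

        # Per-token noise-level map: history tokens are kept clean (level 0).
        tau_full = torch.cat([tau.new_zeros(T_hist), tau]).view(1, 1, -1)
        tau_full = tau_full.expand(bsz, n_agents, -1)

        tokens = torch.cat([history, future], dim=2)     # (bsz, n_agents, T_hist+horizon, Dm)
        eps_hat = dfot(tokens, tau_full)[:, :, T_hist:]  # predicted noise on future tokens

        # DDIM-style deterministic step from level tau to tau_next.
        x0_hat = (future - (1.0 - a).sqrt() * eps_hat) / a.sqrt()
        future = a_next.sqrt() * x0_hat + (1.0 - a_next).sqrt() * eps_hat

    return future

Other sampling strategies, such as partner inpainting, follow by changing which tokens are held at noise level 0 during sampling.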

Animation: per-token denoising schedule (denoising steps over time; legend: Agent A (clean), Agent B (clean), Noisy).