Diffusion Forcing for Multi-Agent Interaction Sequence Modeling


Vongani H. Maluleke, Kie Horiuchi*†§, Lea Wilken§, Evonne Ng, Jitendra Malik§, Angjoo Kanazawa§

Sony Group Corporation, Meta, UC Berkeley§

*Equal contribution

MAGNeT: Multi-Agent Diffusion Forcing Transformer, a unified autoregressive diffusion framework for modeling and generating realistic motion of multiple interacting humans.

Abstract

Understanding and generating multi-person interactions is a fundamental challenge with broad implications for robotics and social computing. While humans naturally coordinate in groups, capturing this complexity in generative models remains an open problem. Current motion generation methods are specialized, handling only isolated tasks like dyadic reactions or partner prediction, and lack unified capabilities for comprehensive multi-agent motion generation. We introduce MAGNeT, Multi-Agent Diffusion Forcing Transformer, a unified autoregressive diffusion framework that bridges this gap. The model seamlessly handles interactions among multiple people and supports flexible sampling strategies for various tasks, including dyadic prediction, partner inpainting, and multi-agent motion generation. It can autoregressively generate hundreds of realistic motions for ultra-long sequences, capturing both synchronized activities (e.g., dancing, boxing) and diverse social interactions. Building on Diffusion Forcing, we introduce key modifications to enable effective modeling of inter-agent interactions. Our model performs on-par with specialized methods on dyadic interactions while naturally extending to polyadic scenarios with three or more people—a capability enabled by our framework's scalable architecture.

Method

Technical overview of the MAGNeT architecture and training process

MAGNeT Training Pipeline

Figure: input motion for agents A and B is converted into per-agent motion representations, encoded into latent tokens by a VQ-VAE encoder, corrupted with noise (M̃(τ) = M0 + ε), denoised over N steps by the Diffusion Forcing Transformer (DFoT) denoiser Fφ, and passed through the VQ-VAE decoder to reconstruct the motion X̃.

Motion components: β — body shape; Θ — joint angles; Tcan→root — canonical-to-root transform; ΔTcan — canonical delta; z — latent code; Ts→p — self-to-partner transform.

Motion tokens: each agent's per-frame token concatenates its latent code, canonical delta, and transform to its partner, i.e. mA = [zA, ΔTcan, TA→B] and mB = [zB, ΔTcan, TB→A].
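To make the pipeline concrete, here is a minimal PyTorch sketch of a single training step. The module interfaces (vqvae.encode, the dfot denoiser call), tensor shapes, and the linear noise schedule are illustrative assumptions rather than the paper's exact code; the key Diffusion Forcing ingredient it shows is the independent noise level drawn for every token.

# A minimal sketch of one MAGNeT-style training step, assuming a pre-trained
# VQ-VAE (`vqvae`) and a Diffusion Forcing Transformer denoiser (`dfot`).
# Interfaces, shapes, and the linear noise schedule are illustrative
# assumptions, not the authors' exact implementation.
import torch
import torch.nn.functional as F

def training_step(vqvae, dfot, motion_A, motion_B,
                  delta_can_A, delta_can_B, T_A_to_B, T_B_to_A, K=1000):
    """motion_*: per-agent motion features (bsz, T, D); delta_can_* / T_*_to_*:
    per-frame canonical deltas and self-to-partner transforms; K: noise levels."""
    bsz, T, _ = motion_A.shape

    # 1) Encode each agent's motion into latent codes with the frozen VQ-VAE encoder.
    with torch.no_grad():
        z_A = vqvae.encode(motion_A)                       # (bsz, T, Dz)
        z_B = vqvae.encode(motion_B)                       # (bsz, T, Dz)

    # 2) Per-agent motion tokens: latent code + canonical delta + partner transform,
    #    i.e. mA = [zA, ΔTcan, TA→B] and mB = [zB, ΔTcan, TB→A].
    m_A = torch.cat([z_A, delta_can_A, T_A_to_B], dim=-1)
    m_B = torch.cat([z_B, delta_can_B, T_B_to_A], dim=-1)
    tokens = torch.stack([m_A, m_B], dim=1)                # (bsz, 2, T, Dm)

    # 3) Diffusion forcing: draw an independent noise level for every token
    #    (per agent and per frame) instead of one level for the whole sequence.
    tau = torch.randint(0, K, (bsz, 2, T), device=tokens.device)
    alpha = (1.0 - tau.float() / K)[..., None]             # toy linear schedule
    eps = torch.randn_like(tokens)
    noisy = alpha.sqrt() * tokens + (1.0 - alpha).sqrt() * eps

    # 4) The transformer attends over both agents' noisy tokens jointly and
    #    predicts the noise at each token's own level.
    eps_hat = dfot(noisy, tau)                             # (bsz, 2, T, Dm)
    return F.mse_loss(eps_hat, eps)

Training with independent per-token noise levels is what later enables the flexible sampling schedules described below, such as joint future prediction and partner inpainting.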

Temporal Denoising Schedule

Visualizing how tokens are denoised across time steps for different sampling strategies

Joint Future Prediction

All agents' future motion is jointly generated from a single distribution, ensuring coordinated predictions.

P(At:t+L, Bt:t+L | A0:t−1, B0:t−1)

Joint noise sampling preserves spatial and temporal correlations and naturally extends to n ≥ 2 agents. Each token is denoised progressively, with causal dependencies on earlier tokens.
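As an illustration, the sketch below realizes such a schedule with a causal, pyramid-style assignment of noise levels to future tokens, denoising both agents jointly while conditioning on clean history. The dfot interface and linear noise schedule are carried over from the training sketch above, and the DDIM-style update is a simplification; none of this is claimed to match the authors' exact sampler.

# Sketch of joint future prediction with a causal per-token denoising schedule
# (a common Diffusion Forcing choice). Reuses the hypothetical `dfot` denoiser
# and linear schedule from the training sketch; the DDIM-style update is an
# illustrative simplification.
import torch

@torch.no_grad()
def sample_joint_future(dfot, history, horizon, K=1000):
    """history: clean tokens for all agents, shape (bsz, n_agents, T_hist, Dm)."""
    bsz, n_agents, T_hist, Dm = history.shape
    device = history.device
    frames = torch.arange(horizon, device=device)

    def noise_level(step):
        # Causal pyramid: future frame f lags frame f-1 by one denoising step,
        # so earlier future tokens are always cleaner than later ones.
        return torch.clamp(K - 1 - step + frames, min=0, max=K - 1)   # (horizon,)

    def alpha(tau):
        return (1.0 - tau.float() / K)[..., None]                     # (horizon, 1)

    # Joint sampling: both agents' future tokens start from noise and are
    # denoised together, so cross-agent correlations are preserved.
    future = torch.randn(bsz, n_agents, horizon, Dm, device=device)

    for step in range(K + horizon):
        tau, tau_next = noise_level(step), noise_level(step + 1)
        a, a_next = alpha(tau), alpha(tau_next)

        # Per-token noise-level map: history tokens are kept clean (level 0).
        tau_full = torch.cat([tau.new_zeros(T_hist), tau]).view(1, 1, -1)
        tau_full = tau_full.expand(bsz, n_agents, -1)

        tokens = torch.cat([history, future], dim=2)     # (bsz, n_agents, T_hist+horizon, Dm)
        eps_hat = dfot(tokens, tau_full)[:, :, T_hist:]  # predicted noise on future tokens

        # DDIM-style deterministic step from level tau to tau_next.
        x0_hat = (future - (1.0 - a).sqrt() * eps_hat) / a.sqrt()
        future = a_next.sqrt() * x0_hat + (1.0 - a_next).sqrt() * eps_hat

    return future

Other sampling strategies, such as partner inpainting, follow by changing which tokens are held at noise level 0 during sampling.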

Animation: per-token denoising schedule (denoising steps over time; legend: Agent A (clean), Agent B (clean), Noisy).