Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive
and operate within the physical world. However, existing benchmarks predominantly rely on passive
evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive
spatial understanding. We introduce SpatialWorld, a unified benchmark for evaluating the
interactive spatial understanding of multimodal agents in complex real-world tasks.
Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol,
SpatialWorld features 760 human-annotated tasks across diverse domains (e.g.,
household routines, travel, social collaboration). Agents must solve tasks under vision-only partial
observability, actively gathering egocentric visual evidence and expressing decisions via a unified,
text-based action interface native to MLLMs. For reliable evaluation, each task includes a
human-validated initial state, a reference trajectory, and a terminal-state verifier.
Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest
model, GPT-5, achieves an average task success rate (TSR) of only
17.4%, while the leading open-source model, Qwen-3.5, reaches
14.1%. Further analysis exposes a clear mismatch between task success and execution
efficiency, alongside substantial performance variation across domains. These findings point to
bottlenecks in active exploration and long-horizon planning and position SpatialWorld as a rigorous
testbed for future spatial agents.
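
As an illustration of the evaluation setup described above, the following is a minimal sketch of how a task record and the task success rate (TSR) might be represented. The names `Task`, `task_success_rate`, and `per_domain_tsr` are hypothetical and not taken from the benchmark's code, and the simulator state is assumed to be a plain dictionary. The sketch also assumes TSR is the percentage of tasks whose terminal state passes the verifier; the benchmark may aggregate differently (e.g., averaging per-domain rates).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a simulator state can be summarized as a plain dictionary.
State = Dict[str, object]


@dataclass
class Task:
    """Hypothetical task record mirroring the three components named in the text."""
    domain: str
    initial_state: State                   # human-validated initial state
    reference_trajectory: List[str]        # reference sequence of text actions
    verifier: Callable[[State], bool]      # terminal-state verifier


def task_success_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks whose terminal state passed the verifier."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


def per_domain_tsr(tasks: List[Task], outcomes: List[bool]) -> Dict[str, float]:
    """Group verifier outcomes by task domain and score each group separately."""
    grouped: Dict[str, List[bool]] = {}
    for task, ok in zip(tasks, outcomes):
        grouped.setdefault(task.domain, []).append(ok)
    return {domain: task_success_rate(oks) for domain, oks in grouped.items()}
```

Reporting `per_domain_tsr` alongside the overall rate is one way to surface the domain-specific variation mentioned above; execution efficiency would need an additional measure (for example, comparing the number of agent actions with the length of `reference_trajectory`), which this sketch does not attempt.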