SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

¹Tsinghua University, ²Chongqing University, ³Peking University, ⁴ZenoMind AI,
⁵Xi'an Jiaotong University, ⁶Beijing Institute of Technology, ⁷Southeast University,
⁸Shanghai Jiao Tong University, ⁹Joy Future Academy, ¹⁰The University of Hong Kong

*Equal contribution  ·  Corresponding author

Abstract

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks.

Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier.

Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.

760 Human-Annotated Tasks
8 Simulation Backends
6 Scenario Categories
15 Evaluated MLLM Agents

Benchmark Overview

SpatialWorld is a scalable, general-purpose evaluation framework for multimodal agents, supporting end-to-end task solving and structured plan generation. It unifies diverse 3D backends under a standardized observation-action interface, enabling rigorous assessment of interactive spatial reasoning via reproducible benchmarks and automated efficiency metrics.

The benchmark wraps eight heterogeneous simulation backends behind a shared closed-loop protocol: agents receive only a natural-language instruction and egocentric RGB observations, express decisions through a unified text-based action interface, and are evaluated with human-validated terminal-state verifiers.
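The closed-loop protocol described above can be sketched roughly as follows. This is a minimal illustration only, not SpatialWorld's actual API: the `Backend` interface, the `CorridorBackend` toy environment, the action strings (`MoveAhead`, `MoveBack`, `Done`), and `run_episode` are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass
class Observation:
    """One egocentric RGB frame; `rgb` stands in for real image bytes."""
    rgb: bytes


class Backend(Protocol):
    """Hypothetical simulator-agnostic interface (an assumption, not the real API)."""
    def reset(self) -> Observation: ...
    def step(self, action: str) -> Observation: ...
    def verify(self) -> bool: ...


class CorridorBackend:
    """Toy stand-in for a simulator: reach cell `goal` in a 1-D corridor."""

    def __init__(self, goal: int = 3) -> None:
        self.goal = goal
        self.pos = 0

    def _render(self) -> Observation:
        # A real backend would return a rendered camera image here.
        return Observation(rgb=f"frame@{self.pos}".encode())

    def reset(self) -> Observation:
        self.pos = 0
        return self._render()

    def step(self, action: str) -> Observation:
        if action == "MoveAhead":
            self.pos += 1
        elif action == "MoveBack":
            self.pos = max(0, self.pos - 1)
        # Unknown actions leave the state unchanged.
        return self._render()

    def verify(self) -> bool:
        # Terminal-state check: only the final state matters, not the path.
        return self.pos == self.goal


# A policy maps (instruction, frames seen so far) to one text action.
Policy = Callable[[str, list[Observation]], str]


def run_episode(env: Backend, instruction: str, policy: Policy,
                max_steps: int = 20) -> tuple[bool, int]:
    """Run one closed-loop episode and return (success, steps taken)."""
    history = [env.reset()]
    steps = 0
    while steps < max_steps:
        action = policy(instruction, history)
        if action == "Done":
            break
        history.append(env.step(action))
        steps += 1
    return env.verify(), steps


def scripted_policy(instruction: str, history: list[Observation]) -> str:
    """Placeholder for an MLLM agent: walk forward three times, then stop."""
    return "MoveAhead" if len(history) <= 3 else "Done"


result = run_episode(CorridorBackend(goal=3), "Walk to the end of the corridor.",
                     scripted_policy)
print(result)  # (True, 3)
```

The point of the sketch is the separation of concerns: the agent sees only frames and the instruction, acts only through text, and success is decided by the backend's verifier rather than by the agent's own report.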

Task construction follows a unified pipeline across all environments: collect environments, write instructions, define success conditions, and validate trajectories through automated execution checks and human review.
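As an illustration of the automated execution check in this pipeline, the sketch below replays a task's reference trajectory from its initial state and confirms that the success condition holds at the end. The `Task` record, the dictionary-based world state, and the toy `apply_action` transition function are assumptions made for the example; a real backend would execute the actions in its simulator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    """Hypothetical task record: instruction, initial state, reference, verifier."""
    instruction: str
    initial_state: dict
    reference_trajectory: list[str]
    verifier: Callable[[dict], bool]


def apply_action(state: dict, action: str) -> dict:
    """Toy transition function standing in for a simulator step."""
    new_state = dict(state)
    verb, _, target = action.partition(" ")
    if verb == "Open":
        new_state[f"{target}.open"] = True
    elif verb == "PickUp":
        new_state["holding"] = target
    elif verb == "PutIn" and new_state.get("holding"):
        new_state[f"{new_state['holding']}.in"] = target
        new_state["holding"] = None
    return new_state


def reference_passes(task: Task) -> bool:
    """Replay the reference trajectory and check the terminal state."""
    state = dict(task.initial_state)
    for action in task.reference_trajectory:
        state = apply_action(state, action)
    return task.verifier(state)


task = Task(
    instruction="Put the apple in the fridge.",
    initial_state={"holding": None, "Fridge.open": False},
    reference_trajectory=["Open Fridge", "PickUp Apple", "PutIn Fridge"],
    verifier=lambda s: s.get("Apple.in") == "Fridge",
)
print(reference_passes(task))  # True
```

A task whose reference trajectory fails this replay would be sent back for revision before human review, under the assumptions of this sketch.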

Eight Simulation Backends

🏠 AI2-THOR

Near-photorealistic indoor scenes with rich object affordances for household manipulation. 311 tasks

🏗️ ProcTHOR

Procedurally generated indoor layouts testing generalization across diverse room configurations. 127 tasks

🛋️ VirtualHome

Daily activity scripts in home environments with multi-step routines. 38 tasks

🚗 CARLA

Urban traffic simulation for outdoor navigation and travel-oriented tasks. 80 tasks

🌆 EmbodiedCity

Large-scale city navigation with realistic pedestrian and vehicle dynamics. 53 tasks

👥 Multi-AI2THOR

Multi-agent social collaboration in shared indoor environments. 29 tasks

👥 Multi-ProcTHOR

Coordinated multi-agent tasks in procedurally generated scenes. 17 tasks

🎮 3D Games

Lightweight digital environments (Block 3D, Maze 3D, Snake, Rubik's Cube) for abstract spatial reasoning. 105 tasks

Complexity Levels

🧭 Navigation

Explore the 3D environment and reach a target location or object without manipulating environment state.

🤲 Interaction

Perform object-level state changes (pick, place, open, toggle) without extensive spatial exploration.

🔀 Hybrid

Combine long-horizon navigation with multi-step manipulation, demanding both exploration and fine-grained interaction.

Task Examples

Representative tasks from SpatialWorld. Agents receive only a natural-language instruction and egocentric RGB observations.

Experimental Results

Main Results — Task Success Rate (TSR %)

Performance across physical scenario categories and digital 3D games. Bold = best, underline = second-best per column. Physical Overall is the weighted average of Daily, Work, Entertain., Travel, and Social categories.

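| Model | Daily | Work | Entertain. (Physical) | Travel | Social | Overall (Physical) | Entertain. (Digital) |
|---|---|---|---|---|---|---|---|
| (A) Open-Source Models | | | | | | | |
| Qwen2.5-VL-72B | 3.7 | 8.5 | 2.9 | 0.8 | 2.2 | 3.4 | 7.6 |
| Qwen3-VL-30B-A3B | 6.3 | 5.1 | 4.4 | 1.5 | 4.3 | 4.9 | 7.9 |
| Qwen3-VL-235B-Instruct | 6.9 | 8.5 | 7.4 | 4.5 | 10.9 | 6.9 | 5.0 |
| Qwen3-VL-235B-Thinking | 5.7 | 8.5 | 7.4 | 3.8 | 10.9 | 6.1 | 28.3 |
| Qwen-3.5-397B-A17B | 13.1 | 16.9 | 13.2 | 4.5 | 19.6 | 12.2 | 26.0 |
| GLM-4.5V | 3.7 | 3.4 | 4.4 | 1.5 | 13.0 | 4.0 | 14.5 |
| GLM-4.6V | 2.9 | 5.1 | 4.4 | 1.5 | 0.0 | 2.7 | 8.1 |
| Kimi-VL-A3B | 1.1 | 3.4 | 0.0 | 0.0 | 0.0 | 0.9 | 3.3 |
| Kimi-K2.5 | 11.1 | 8.5 | 4.4 | 3.8 | 17.4 | 9.2 | 31.0 |
| (B) Closed-Source Models | | | | | | | |
| Gemini-2.5-Pro | 7.4 | 11.9 | 1.5 | 3.8 | 10.9 | 6.7 | 32.6 |
| Gemini-3-Flash | 8.0 | 10.2 | 4.4 | 6.1 | 4.3 | 7.2 | 38.1 |
| Gemini-3.1-Pro | 11.4 | 10.2 | 5.9 | 4.5 | 8.7 | 9.2 | 39.0 |
| GPT-5 | 14.9 | 16.9 | 10.3 | 6.8 | 34.8 | 14.4 | 36.4 |
| GPT-5.4 | 8.0 | 5.1 | 5.9 | 3.8 | 6.5 | 6.6 | 11.9 |
| Doubao-2.0-Lite | 5.7 | 6.8 | 5.9 | 3.0 | 13.0 | 5.8 | 24.8 |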
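
Scenario Distribution

| Environment | Daily | Work | Entertain. | Travel | Social | Total |
|---|---|---|---|---|---|---|
| AI2-THOR | 219 | 41 | 40 | 11 | 0 | 311 |
| ProcTHOR | 92 | 10 | 23 | 2 | 0 | 127 |
| VirtualHome | 27 | 8 | 3 | 0 | 0 | 38 |
| CARLA | 0 | 0 | 0 | 80 | 0 | 80 |
| EmbodiedCity | 12 | 0 | 2 | 39 | 0 | 53 |
| Multi-AI2THOR | 0 | 0 | 0 | 0 | 29 | 29 |
| Multi-ProcTHOR | 0 | 0 | 0 | 0 | 17 | 17 |
| 3D Games | 0 | 0 | 105 | 0 | 0 | 105 |
| Total | 350 | 59 | 173 | 132 | 46 | 760 |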
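
The headline 17.4% average TSR for GPT-5 can be recovered from the per-category numbers with a task-count-weighted average. The sketch below assumes that the weights are the per-category task counts from the Scenario Distribution table, and that the physical Entertain. category holds the 173 Entertain. tasks minus the 105 digital 3D Games tasks; both assumptions are inferred from the tables rather than stated explicitly. The variable and function names are illustrative.

```python
# Physical task counts per category, taken from the Scenario Distribution table.
# Assumption: physical Entertain. = 173 total Entertain. tasks - 105 3D Games tasks.
PHYSICAL_COUNTS = {
    "Daily": 350,
    "Work": 59,
    "Entertain.": 173 - 105,
    "Travel": 132,
    "Social": 46,
}
DIGITAL_COUNT = 105  # 3D Games tasks


def weighted_tsr(tsr_by_category: dict, counts: dict) -> float:
    """Average per-category TSR values, weighting each by its task count."""
    total = sum(counts.values())
    return sum(tsr_by_category[c] * n for c, n in counts.items()) / total


# GPT-5 row of the main results table (TSR in percent).
gpt5_physical = {"Daily": 14.9, "Work": 16.9, "Entertain.": 10.3,
                 "Travel": 6.8, "Social": 34.8}
gpt5_digital = 36.4

physical = weighted_tsr(gpt5_physical, PHYSICAL_COUNTS)
n_physical = sum(PHYSICAL_COUNTS.values())  # 655 physical tasks
overall = (physical * n_physical + gpt5_digital * DIGITAL_COUNT) / (n_physical + DIGITAL_COUNT)

print(round(physical, 1), round(overall, 1))  # 14.4 17.4
```

Applying the same calculation to the Qwen-3.5-397B-A17B row gives 12.2% Physical Overall and 14.1% across all 760 tasks, matching the figures quoted in the abstract and the findings.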
Step Efficiency (SE) Table

SE = reference step count / actual step count on successful trajectories. Higher is more efficient. Physical Overall is the weighted mean over successful valid physical trajectories. "—" indicates no successful trajectory in that category.

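| Model | Daily | Work | Entertain. (Physical) | Travel | Social | Overall (Physical) | Entertain. (Digital) |
|---|---|---|---|---|---|---|---|
| (A) Open-Source Models | | | | | | | |
| Qwen2.5-VL-72B | 0.545 | 0.510 | 0.458 | 0.889 | 0.143 | 0.526 | 0.688 |
| Qwen3-VL-30B-A3B | 0.702 | 0.667 | 0.500 | 0.875 | 0.174 | 0.658 | 0.765 |
| Qwen3-VL-235B-Instruct | 0.708 | 0.574 | 0.529 | 0.449 | 0.243 | 0.587 | 0.397 |
| Qwen3-VL-235B-Thinking | 0.536 | 0.453 | 0.424 | 0.524 | 0.218 | 0.471 | 0.747 |
| Qwen-3.5-397B-A17B | 0.552 | 0.477 | 0.453 | 0.633 | 0.290 | 0.508 | 0.737 |
| GLM-4.5V | 0.663 | 0.583 | 0.482 | 0.450 | 0.270 | 0.529 | 0.809 |
| GLM-4.6V | 0.705 | 0.381 | 0.444 | 0.417 | — | 0.576 | 0.920 |
| Kimi-VL-A3B | 0.636 | 0.333 | — | — | — | 0.535 | 0.948 |
| Kimi-K2.5 | 0.519 | 0.556 | 0.517 | 0.553 | 0.226 | 0.486 | 0.626 |
| (B) Closed-Source Models | | | | | | | |
| Gemini-2.5-Pro | 0.615 | 0.567 | 0.667 | 0.483 | 0.399 | 0.569 | 0.518 |
| Gemini-3-Flash | 0.575 | 0.390 | 0.504 | 0.612 | 0.183 | 0.536 | 0.657 |
| Gemini-3.1-Pro | 0.708 | 0.544 | 0.466 | 0.732 | 0.281 | 0.649 | 0.717 |
| GPT-5 | 0.597 | 0.540 | 0.387 | 0.544 | 0.248 | 0.511 | 0.583 |
| GPT-5.4 | 0.617 | 0.667 | 0.427 | 0.513 | 0.305 | 0.569 | 0.720 |
| Doubao-2.0-Lite | 0.776 | 0.708 | 0.604 | 0.708 | 0.522 | 0.704 | 0.599 |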
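
A minimal sketch of the SE metric as defined above, assuming it is computed as the mean of per-trajectory ratios over successful episodes (the text does not say whether ratios are averaged per trajectory or computed from summed step counts). The `Episode` record and `step_efficiency` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Episode:
    """Hypothetical per-task result record."""
    success: bool
    reference_steps: int  # length of the human-validated reference trajectory
    actual_steps: int     # steps the agent actually took


def step_efficiency(episodes: list[Episode]) -> Optional[float]:
    """Mean of reference/actual step ratios over successful episodes.

    Returns None when there is no successful episode, which corresponds to
    the "—" entries in the table.
    """
    ratios = [
        e.reference_steps / e.actual_steps
        for e in episodes
        if e.success and e.actual_steps > 0
    ]
    return sum(ratios) / len(ratios) if ratios else None


episodes = [
    Episode(success=True, reference_steps=6, actual_steps=8),    # ratio 0.75
    Episode(success=True, reference_steps=5, actual_steps=10),   # ratio 0.50
    Episode(success=False, reference_steps=7, actual_steps=30),  # ignored
]
print(step_efficiency(episodes))  # 0.625
```

Because failed episodes are excluded, SE measures how directly a model solves the tasks it does solve, and says nothing about how many it solves.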

Indoor vs. Outdoor Performance — Top-5 Models

[Figure: indoor vs. outdoor radar chart]

Multi-Agent Social Collaboration Profile

[Figure: multi-agent social collaboration profile]

Key Findings

Far from reliable 3D task solving. GPT-5 achieves only 14.4% Physical Overall TSR; Qwen-3.5-397B-A17B reaches 12.2%. Even the strongest models struggle across all scenario categories.
Success ≠ efficiency. Models with comparable TSR can differ substantially in step efficiency. Kimi-K2.5 and GPT-5.4 have similar Physical Overall TSR (9.2% vs. 6.6%), yet GPT-5.4 achieves higher SE (0.569 vs. 0.486), indicating heavier trial-and-error for Kimi-K2.5.
Domain-specific strengths. GPT-5 and Qwen-3.5-397B-A17B tie in Work & Study (16.9%); GPT-5 leads Travel (6.8%); Gemini-3.1-Pro achieves the highest scores on digital 3D games (39.0% TSR).
Vision-only closed-loop evaluation. Unlike static VQA or simulator-specific pipelines, SpatialWorld requires agents to actively explore under partial observability using only egocentric RGB and a text-based action interface.
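
The gap between success and efficiency can be checked directly against the reported numbers. The short script below copies the Physical Overall TSR and SE values from the two results tables and looks up the top model on each metric; the dictionary layout and variable names are illustrative only.

```python
# (Physical Overall TSR %, Physical Overall SE), copied from the results tables.
RESULTS = {
    "Qwen2.5-VL-72B": (3.4, 0.526),
    "Qwen3-VL-30B-A3B": (4.9, 0.658),
    "Qwen3-VL-235B-Instruct": (6.9, 0.587),
    "Qwen3-VL-235B-Thinking": (6.1, 0.471),
    "Qwen-3.5-397B-A17B": (12.2, 0.508),
    "GLM-4.5V": (4.0, 0.529),
    "GLM-4.6V": (2.7, 0.576),
    "Kimi-VL-A3B": (0.9, 0.535),
    "Kimi-K2.5": (9.2, 0.486),
    "Gemini-2.5-Pro": (6.7, 0.569),
    "Gemini-3-Flash": (7.2, 0.536),
    "Gemini-3.1-Pro": (9.2, 0.649),
    "GPT-5": (14.4, 0.511),
    "GPT-5.4": (6.6, 0.569),
    "Doubao-2.0-Lite": (5.8, 0.704),
}

best_tsr = max(RESULTS, key=lambda m: RESULTS[m][0])
best_se = max(RESULTS, key=lambda m: RESULTS[m][1])
print(best_tsr, RESULTS[best_tsr])  # GPT-5 (14.4, 0.511)
print(best_se, RESULTS[best_se])    # Doubao-2.0-Lite (5.8, 0.704)

# Models with identical TSR can still differ in efficiency.
tied = sorted(m for m, (tsr, _) in RESULTS.items() if tsr == 9.2)
print([(m, RESULTS[m][1]) for m in tied])
# [('Gemini-3.1-Pro', 0.649), ('Kimi-K2.5', 0.486)]
```

Since SE is averaged only over each model's successful trajectories, the SE values are computed on different task subsets per model and are best read as a complement to TSR rather than as a standalone ranking.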

Citation

@misc{gao2026spatialworldbenchmarkinginteractivespatial,
  title={SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks},
  author={Hongcheng Gao and Hailong Qu and Jingyi Tang and Jiahao Wang and Zihao Huang and Hengkang Qiao and Shihong Huang and Junming Yang and Yi Li and Hongyixuan Yuan and Wenjie Li and Bohan Zeng and Wenbo Li and Bo Wang and Jianhui Liu and Olive Huang and Haoyang Huang and Wentao Zhang and Guoqing Huang and Nan Duan and Yinpeng Dong},
  year={2026},
  eprint={2606.09669},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2606.09669}
}