SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks

Hongcheng Gao*†¹, Hailong Qu*², Jingyi Tang³, Jiahao Wang⁴, Zihao Huang⁵, Hengkang Qiao²,
Shihong Huang³, Junming Yang⁶, Yi Li¹, Hongyixuan Yuan²,
Wenjie Li⁷, Bohan Zeng³, Wenbo Li⁸, Bo Wang⁵, Jianhui Liu⁹, Olive Huang³,
Haoyang Huang⁸, Wentao Zhang³, Guoqing Huang², Nan Duan⁸, Yinpeng Dong†¹

¹Tsinghua University, ²Chongqing University, ³Peking University,
⁴Xi'an Jiaotong University, ⁵Beijing Institute of Technology, ⁶Southeast University,
⁷Shanghai Jiao Tong University, ⁸Joy Future Academy, ⁹The University of Hong Kong

*Equal contribution · †Corresponding author

Abstract

Spatial reasoning is a foundational capability for multimodal large language models (MLLMs) to perceive and operate within the physical world. However, existing benchmarks predominantly rely on passive evaluation (e.g., static VQA) or simulator-specific pipelines, failing to assess general interactive spatial understanding. We introduce SpatialWorld, a unified benchmark designed specifically for evaluating the interactive spatial understanding of multimodal agents in complex real-world tasks.

Integrating eight heterogeneous simulation backends under a shared, simulator-agnostic protocol, SpatialWorld features 760 human-annotated tasks across diverse domains (e.g., household routines, travel, social collaboration). Agents must solve tasks under vision-only partial observability, actively gathering egocentric visual evidence and expressing decisions via a unified, text-based action interface native to MLLMs. For reliable evaluation, each task includes a human-validated initial state, a reference trajectory, and a terminal-state verifier.

Evaluating 15 advanced agents reveals that robust spatial task solving remains challenging: the strongest model, GPT-5, achieves an average task success rate (TSR) of only 17.4%, while the leading open-source model, Qwen-3.5, reaches 14.1%. Further analysis exposes a clear mismatch between task success and execution efficiency, alongside substantial domain-specific performance variations. These bottlenecks in active exploration and long-horizon planning position SpatialWorld as a rigorous testbed for future spatial agents.

760 human-annotated tasks · 8 simulation backends · 5 scenario categories · 15 evaluated MLLM agents

Benchmark Overview

SpatialWorld is a scalable, general-purpose evaluation framework for multimodal agents, supporting end-to-end task solving and structured plan generation. It unifies diverse 3D backends under a standardized observation-action interface, enabling rigorous assessment of interactive spatial reasoning via reproducible benchmarks and automated efficiency metrics.

The benchmark wraps eight heterogeneous simulation backends behind a shared closed-loop protocol: agents receive only a natural-language instruction and egocentric RGB observations, express decisions through a unified text-based action interface, and are evaluated with human-validated terminal-state verifiers.
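The closed-loop protocol described above can be sketched as a short episode loop. This is a minimal illustration under stated assumptions, not the benchmark's actual API: the `Environment` interface, the `done` stop action, the `max_steps` budget, and the toy corridor are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


class Environment(Protocol):
    """Hypothetical simulator-agnostic wrapper (assumed, not the real interface)."""

    def observe(self) -> bytes: ...              # egocentric RGB frame
    def execute(self, action: str) -> None: ...  # apply one text action
    def state(self) -> dict: ...                 # privileged state, verifier only


@dataclass
class EpisodeResult:
    success: bool
    steps: int


def run_episode(
    env: Environment,
    policy: Callable[[str, bytes, List[str]], str],
    verifier: Callable[[dict], bool],
    instruction: str,
    max_steps: int = 50,
) -> EpisodeResult:
    """Observe, act in text, repeat; judge only the terminal state."""
    history: List[str] = []
    for _ in range(max_steps):
        frame = env.observe()                    # the agent sees pixels only
        action = policy(instruction, frame, history)
        if action == "done":                     # assumed stop action
            break
        env.execute(action)
        history.append(action)
    return EpisodeResult(success=verifier(env.state()), steps=len(history))


class ToyCorridor:
    """Stand-in environment: a corridor with positions 0..length."""

    def __init__(self, length: int = 3) -> None:
        self.pos, self.length = 0, length

    def observe(self) -> bytes:
        return bytes([self.pos])                 # placeholder for an RGB frame

    def execute(self, action: str) -> None:
        if action == "move_forward":
            self.pos = min(self.pos + 1, self.length)

    def state(self) -> dict:
        return {"pos": self.pos}


def toy_policy(instruction: str, frame: bytes, history: List[str]) -> str:
    return "done" if frame[0] >= 3 else "move_forward"


result = run_episode(
    ToyCorridor(), toy_policy, lambda s: s["pos"] == 3,
    "Walk to the end of the corridor.",
)
print(result)  # EpisodeResult(success=True, steps=3)
```

The policy never reads `env.state()`; only the verifier does, which mirrors the split between vision-only observation and terminal-state evaluation.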

Task construction follows a unified pipeline across all environments: collect environments, write instructions, define success conditions, and validate trajectories through automated execution checks and human review.
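A task produced by this pipeline might be represented as a small record plus a terminal-state check. The sketch below is illustrative only: the field names, the state layout, the object names, and the reference trajectory are assumptions made here. Only the verifier labels `object_in_receptacle` and `distance_to_waypoint` come from the task examples on this page.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, object]  # hypothetical terminal-state snapshot


def object_in_receptacle(obj: str, receptacle: str) -> Callable[[State], bool]:
    """Succeeds when `obj` is listed inside `receptacle` in the final state."""
    def check(state: State) -> bool:
        return obj in state.get("contains", {}).get(receptacle, [])
    return check


def distance_to_waypoint(target: Tuple[float, float],
                         threshold_m: float) -> Callable[[State], bool]:
    """Succeeds when the agent ends within `threshold_m` of `target`."""
    def check(state: State) -> bool:
        return math.dist(state["agent_xy"], target) <= threshold_m
    return check


@dataclass
class Task:
    instruction: str
    initial_state: State
    reference_trajectory: List[str]   # human-validated action sequence
    verifier: Callable[[State], bool]


task = Task(
    instruction="I found the lettuce was rotten; please help me throw it in the trash.",
    initial_state={"contains": {"GarbageCan": [], "CounterTop": ["Lettuce"]}},
    reference_trajectory=["pick Lettuce", "place GarbageCan"],  # illustrative
    verifier=object_in_receptacle("Lettuce", "GarbageCan"),
)

final_state = {"contains": {"GarbageCan": ["Lettuce"], "CounterTop": []}}
print(task.verifier(task.initial_state), task.verifier(final_state))  # False True
```

Checking the verifier against the initial state is a cheap sanity test during task construction: a task whose success condition already holds at the start is not a meaningful task.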

Figure: SpatialWorld framework overview. Diverse 3D backends are unified under a standardized observation-action interface for interactive spatial reasoning evaluation.

Figure: SpatialWorld data construction pipeline, covering environment collection, instruction writing, success-condition definition, automated execution validation, and human cross-validation.

Figure: Unified text-based action space. Unified observation and action interfaces with a structured action space and cross-simulator action-to-code mapping.
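One way to picture a cross-simulator action-to-code mapping is a parser for the agent's text output followed by a per-backend lookup. The sketch below is an assumption-laden illustration: the verb set, the backend names `indoor_sim` and `city_sim`, and every command string are hypothetical and do not reproduce any simulator's real API.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Action:
    verb: str
    target: Optional[str] = None


# Hypothetical unified verbs: basic locomotion plus the object-level
# operations named on this page (pick, place, open, toggle).
VERBS = {"move_forward", "turn_left", "turn_right", "pick", "place", "open", "toggle"}

# Hypothetical per-backend command names; real simulators differ.
BACKEND_COMMANDS: Dict[str, Dict[str, str]] = {
    "indoor_sim": {
        "move_forward": "step_ahead",
        "turn_left": "rotate_left",
        "turn_right": "rotate_right",
        "pick": "pickup_object",
        "place": "put_object",
        "open": "open_object",
        "toggle": "toggle_object",
    },
    "city_sim": {
        "move_forward": "walk",
        "turn_left": "yaw_left",
        "turn_right": "yaw_right",
    },
}


def parse_action(text: str) -> Action:
    """Parse '<verb> [target]' from the model's text output."""
    parts = text.strip().split(maxsplit=1)
    if not parts or parts[0].lower() not in VERBS:
        raise ValueError(f"unknown action: {text!r}")
    target = parts[1].strip() if len(parts) > 1 else None
    return Action(parts[0].lower(), target)


def to_backend_call(action: Action, backend: str) -> Dict[str, str]:
    """Translate a unified action into one backend-specific command."""
    commands = BACKEND_COMMANDS[backend]
    if action.verb not in commands:
        raise ValueError(f"{backend} does not support {action.verb!r}")
    call = {"command": commands[action.verb]}
    if action.target is not None:
        call["target"] = action.target
    return call


print(to_backend_call(parse_action("pick Lettuce"), "indoor_sim"))
# {'command': 'pickup_object', 'target': 'Lettuce'}
```

Rejecting unparseable or unsupported actions explicitly, rather than silently ignoring them, keeps invalid model outputs visible in the step count.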

Eight Simulation Backends

🏠 AI2-THOR

Near-photorealistic indoor scenes with rich object affordances for household manipulation. 311 tasks

🏗️ ProcTHOR

Procedurally generated indoor layouts testing generalization across diverse room configurations. 127 tasks

🛋️ VirtualHome

Daily activity scripts in home environments with multi-step routines. 38 tasks

🚗 CARLA

Urban traffic simulation for outdoor navigation and travel-oriented tasks. 80 tasks

🌆 EmbodiedCity

Large-scale city navigation with realistic pedestrian and vehicle dynamics. 53 tasks

👥 Multi-AI2THOR

Multi-agent social collaboration in shared indoor environments. 29 tasks

👥 Multi-ProcTHOR

Coordinated multi-agent tasks in procedurally generated scenes. 17 tasks

🎮 3D Games

Lightweight digital environments (Block 3D, Maze 3D, Snake, Rubik's Cube) for abstract spatial reasoning. 105 tasks

Complexity Levels

🧭 Navigation

Explore the 3D environment and reach a target location or object without manipulating environment state.

🤲 Interaction

Perform object-level state changes (pick, place, open, toggle) without extensive spatial exploration.

🔀 Hybrid

Combine long-horizon navigation with multi-step manipulation, demanding both exploration and fine-grained interaction.

Task Examples

Representative tasks from SpatialWorld. Agents receive only a natural-language instruction and egocentric RGB observations.

Categories: Daily · Travel · Work & Study · Social · Digital Games

AI2-THOR · Daily

"I found the lettuce was rotten; please help me throw it in the trash."

Interaction object_in_receptacle

ProcTHOR · Daily

"Please put the dirty plate into the dishwasher and close it."

Hybrid multi-step manipulation

VirtualHome · Daily

"I need to tidy up the kitchen. Please open the refrigerator door and put the salmon inside, but do not close the refrigerator door."

Hybrid object_state

CARLA · Travel

"Walk to the position marked by the red line in the screenshot. You can turn and move in any direction."

Navigation distance_to_waypoint

EmbodiedCity · Travel

"Navigate to the bus stop at the end of the street and wait there."

Navigation urban navigation

AI2-THOR · Work & Study

"Please organize the documents on the desk and place the laptop in the drawer."

Hybrid office scene

ProcTHOR · Work & Study

"Find the whiteboard marker and place it on the conference table."

Hybrid search & place

Multi-AI2THOR · Social Collaboration

"Work with your partner to set the dining table: you handle the plates while they handle the utensils."

Hybrid multi-agent

Multi-ProcTHOR · Social Collaboration

"Coordinate with the other agent to move the heavy box to the storage room together."

Hybrid coordination

Maze 3D

3D Games · Digital

"Navigate through the 3D maze from the green start point to the red exit."

Navigation abstract spatial

Rubik's Cube

3D Games · Digital

"Rotate the faces to solve the Rubik's cube and match all sides."

Interaction geometric reasoning

Snake 3D

3D Games · Digital

"Control the snake to eat the food without hitting the walls or itself."

Hybrid planning

Experimental Results

Main Results — Task Success Rate (TSR %)

Task success rate across physical scenario categories and digital 3D games. Bold = best, underline = second-best per column. Physical Overall is the average over the Daily, Work, Entertain., Travel, and Social categories, weighted by the number of tasks in each.

Model	Daily (Phys.)	Work (Phys.)	Entertain. (Phys.)	Travel (Phys.)	Social (Phys.)	Overall (Phys.)	Entertain. (Digital)
(A) Open-Source Models
Qwen2.5-VL-72B	3.7	8.5	2.9	0.8	2.2	3.4	7.6
Qwen3-VL-30B-A3B	6.3	5.1	4.4	1.5	4.3	4.9	7.9
Qwen3-VL-235B-Instruct	6.9	8.5	7.4	4.5	10.9	6.9	5.0
Qwen3-VL-235B-Thinking	5.7	8.5	7.4	3.8	10.9	6.1	28.3
Qwen-3.5-397B-A17B	13.1	16.9	13.2	4.5	19.6	12.2	26.0
GLM-4.5V	3.7	3.4	4.4	1.5	13.0	4.0	14.5
GLM-4.6V	2.9	5.1	4.4	1.5	0.0	2.7	8.1
Kimi-VL-A3B	1.1	3.4	0.0	0.0	0.0	0.9	3.3
Kimi-K2.5	11.1	8.5	4.4	3.8	17.4	9.2	31.0
(B) Closed-Source Models
Gemini-2.5-Pro	7.4	11.9	1.5	3.8	10.9	6.7	32.6
Gemini-3-Flash	8.0	10.2	4.4	6.1	4.3	7.2	38.1
Gemini-3.1-Pro	11.4	10.2	5.9	4.5	8.7	9.2	39.0
GPT-5	14.9	16.9	10.3	6.8	34.8	14.4	36.4
GPT-5.4	8.0	5.1	5.9	3.8	6.5	6.6	11.9
Doubao-2.0-Lite	5.7	6.8	5.9	3.0	13.0	5.8	24.8

Scenario Distribution

Environment	Daily	Work	Entertain.	Travel	Social	Total
AI2-THOR	219	41	40	11	0	311
ProcTHOR	92	10	23	2	0	127
VirtualHome	27	8	3	0	0	38
CARLA	0	0	0	80	0	80
EmbodiedCity	12	0	2	39	0	53
Multi-AI2THOR	0	0	0	0	29	29
Multi-ProcTHOR	0	0	0	0	17	17
3D Games	0	0	105	0	0	105
Total	350	59	173	132	46	760
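The task counts above can be combined with the per-category TSR values to check the aggregate figures. The sketch below assumes the weights are the task counts and that every task is counted; it uses the rounded category TSRs from the Main Results table, so the last digit can shift for some models. Under those assumptions it reproduces both the Physical Overall column and the abstract's average TSR for GPT-5 (17.4%) and Qwen-3.5-397B-A17B (14.1%).

```python
from typing import Dict

# Task counts per category from the Scenario Distribution table.
# Physical Entertain. excludes the 105 3D Games tasks: 173 - 105 = 68.
PHYSICAL_COUNTS: Dict[str, int] = {
    "Daily": 350, "Work": 59, "Entertain.": 68, "Travel": 132, "Social": 46,
}
DIGITAL_COUNT = 105


def weighted_tsr(tsr: Dict[str, float], counts: Dict[str, int]) -> float:
    """Average of per-category TSR (%) weighted by task count."""
    total = sum(counts.values())
    return sum(tsr[cat] * n for cat, n in counts.items()) / total


def overall_tsr(physical: Dict[str, float], digital: float) -> float:
    """TSR (%) over all 760 tasks, physical and digital combined."""
    n_phys = sum(PHYSICAL_COUNTS.values())  # 655
    solved = weighted_tsr(physical, PHYSICAL_COUNTS) * n_phys + digital * DIGITAL_COUNT
    return solved / (n_phys + DIGITAL_COUNT)


gpt5 = {"Daily": 14.9, "Work": 16.9, "Entertain.": 10.3, "Travel": 6.8, "Social": 34.8}
qwen35 = {"Daily": 13.1, "Work": 16.9, "Entertain.": 13.2, "Travel": 4.5, "Social": 19.6}

print(round(weighted_tsr(gpt5, PHYSICAL_COUNTS), 1), round(overall_tsr(gpt5, 36.4), 1))
# 14.4 17.4
print(round(weighted_tsr(qwen35, PHYSICAL_COUNTS), 1), round(overall_tsr(qwen35, 26.0), 1))
# 12.2 14.1
```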

Step Efficiency (SE) Table

SE = reference step count / actual step count on successful trajectories. Higher is more efficient. Physical Overall is the weighted mean over successful valid physical trajectories; "—" indicates no successful trajectory in that category.

Model	Daily (Phys.)	Work (Phys.)	Entertain. (Phys.)	Travel (Phys.)	Social (Phys.)	Overall (Phys.)	Entertain. (Digital)
(A) Open-Source Models
Qwen2.5-VL-72B	0.692	0.675	0.576	0.889	0.089	0.659	0.757
Qwen3-VL-30B-A3B	0.912	1.000	0.889	0.875	0.107	0.866	0.665
Qwen3-VL-235B-Instruct	0.871	0.864	0.863	0.482	0.145	0.737	0.889
Qwen3-VL-235B-Thinking	0.732	0.683	0.681	0.574	0.137	0.625	0.666
Qwen-3.5-397B-A17B	0.689	0.605	0.594	0.690	0.172	0.609	0.694
GLM-4.5V	0.841	1.000	0.707	0.550	0.164	0.659	0.634
GLM-4.6V	1.000	0.484	0.833	0.583	—	0.840	0.711
Kimi-VL-A3B	0.886	0.500	—	—	—	0.758	0.633
Kimi-K2.5	0.624	0.846	0.686	0.653	0.141	0.584	0.531
(B) Closed-Source Models
Gemini-2.5-Pro	0.761	0.796	1.000	0.550	0.236	0.688	0.647
Gemini-3-Flash	0.719	0.573	0.526	0.674	0.108	0.654	0.640
Gemini-3.1-Pro	0.814	0.665	0.541	0.788	0.175	0.736	0.717
GPT-5	0.707	0.664	0.547	0.615	0.146	0.587	0.536
GPT-5.4	0.796	1.000	0.726	0.647	0.200	0.745	0.702
Doubao-2.0-Lite	0.954	1.000	0.875	0.854	0.326	0.841	0.598
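The SE definition can be illustrated with a small helper. This is a sketch under two assumptions that the page does not confirm: SE is averaged per trajectory (rather than computed as a ratio of summed step counts), and it is not capped at 1.0. The episode tuples are made-up numbers for illustration, not benchmark data.

```python
from typing import Iterable, Optional, Tuple

# One episode: (succeeded, reference step count, actual step count).
Episode = Tuple[bool, int, int]


def step_efficiency(reference_steps: int, actual_steps: int) -> float:
    """SE for one successful trajectory: reference steps / actual steps."""
    if reference_steps <= 0 or actual_steps <= 0:
        raise ValueError("step counts must be positive")
    return reference_steps / actual_steps


def mean_step_efficiency(episodes: Iterable[Episode]) -> Optional[float]:
    """Mean SE over successful episodes; None when nothing succeeded.

    None corresponds to the dash entries in the table. Failed episodes
    are excluded, so a long failed rollout does not lower the score.
    """
    ratios = [step_efficiency(ref, act) for ok, ref, act in episodes if ok]
    return sum(ratios) / len(ratios) if ratios else None


episodes = [(True, 6, 8), (True, 5, 5), (False, 4, 30)]  # illustrative only
print(mean_step_efficiency(episodes))           # 0.875
print(mean_step_efficiency([(False, 4, 30)]))   # None
```

Because only successful trajectories enter the average, a model that solves very few tasks can still post a high SE, which is one reason to read SE alongside TSR rather than on its own.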

Indoor vs. Outdoor Performance — Top-5 Models

Multi-Agent Social Collaboration Profile

Key Findings

Far from reliable 3D task solving. GPT-5 achieves only 14.4% Physical Overall TSR; Qwen-3.5-397B-A17B reaches 12.2%. Even the strongest models struggle across all scenario categories.

Success ≠ efficiency. Models with comparable TSR can differ substantially in step efficiency. Kimi-K2.5 and GPT-5.4 fall in a similar Physical Overall TSR range (9.2% vs. 6.6%), yet GPT-5.4 achieves a higher SE (0.745 vs. 0.584), which suggests that Kimi-K2.5 relies more heavily on trial and error in the episodes it solves.

Domain-specific strengths. GPT-5 and Qwen-3.5-397B-A17B tie in Work & Study (16.9%); GPT-5 leads Travel (6.8%); Gemini-3.1-Pro achieves the highest scores on digital 3D games (39.0% TSR).

Vision-only closed-loop evaluation. Unlike static VQA or simulator-specific pipelines, SpatialWorld requires agents to actively explore under partial observability using only egocentric RGB and a text-based action interface.

Citation

@misc{gao2026spatialworldbenchmarkinginteractivespatial,
  title={SpatialWorld: Benchmarking Interactive Spatial Reasoning of Multimodal Agents in Real-World Tasks},
  author={Hongcheng Gao and Hailong Qu and Jingyi Tang and Jiahao Wang and Zihao Huang and Hengkang Qiao and Shihong Huang and Junming Yang and Yi Li and Hongyixuan Yuan and Wenjie Li and Bohan Zeng and Wenbo Li and Bo Wang and Jianhui Liu and Olive Huang and Haoyang Huang and Wentao Zhang and Guoqing Huang and Nan Duan and Yinpeng Dong},
  year={2026},
  eprint={2606.09669},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2606.09669}
}