Enhancing AI Agent Spatial Reasoning for Real-World Applications - fltech - Technology Blog of Fujitsu Research

A technology blog where Fujitsu researchers talk about a variety of topics

Enhancing AI Agent Spatial Reasoning for Real-World Applications

Introduction

Hello, we are Fan Yang from Fujitsu Research and Quanting Xie from Carnegie Mellon University.

Our institutions have a longstanding partnership focused on pioneering AI technologies to tackle real-world challenges. In this post, we show how CMU’s Embodied-RAG, a framework that advances spatial reasoning for embodied AI agents, combines with our YOWO (You Only Walk Once) system for collaborative indoor mapping. Together, they provide spatial intelligence that can be evaluated in our FieldWorkArena (FWA), a benchmark for AI agents that aim to improve efficiency, safety, and decision-making in complex indoor environments such as factories, warehouses, and retail stores. We describe how the integration addresses critical challenges in factories and warehouses through enhanced spatial-semantic reasoning.

Embodied-RAG: Hierarchical Spatial Memory

Embodied-RAG*1 bridges the gap between traditional Retrieval-Augmented Generation (RAG) and robotics by introducing a non-parametric memory system that autonomously constructs hierarchical knowledge for navigation and language tasks. Unlike conventional RAG, which struggles with multimodal, spatially correlated data, Embodied-RAG organizes experiences into a semantic forest: a layered structure that stores language descriptions at varying resolutions. This gives agents that assist field workers faster and more robust spatial retrieval.

Hierarchical structure of Embodied-RAG’s semantic forest, enabling multi-resolution query handling

Compared with existing work, Embodied-RAG builds its memory bottom-up, using a multimodal representation and integrating structure into embodied experiences efficiently. The system first represents embodied experiences as a multimodal topological graph in which each node contains a robot pose, the robot’s observation (an image), and a timestamp. These topological nodes are then clustered hierarchically by spatial proximity to form the semantic forest. The graph-building process is 9.76x faster than Light-RAG on a dataset of the same size, and the memory can be extended in real time. This two-stage memory system creates an efficient, large-scale, globally aware, interpretable, and multimodal memory representation for embodied agents to retrieve from.

Comparison of memory building
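To make the two-stage idea concrete, here is a minimal sketch of bottom-up memory building. It is not the Embodied-RAG implementation: the `Node` class, the `build_forest` function, the `max_distance` threshold, and the string-concatenation "summary" are all assumptions made for illustration. A real system would store images at the leaves and would summarize child captions with a language model.

```python
from dataclasses import dataclass, field
from math import dist
from typing import List, Tuple


@dataclass
class Node:
    """Leaf = one observation; parent = a coarser description of its children."""
    position: Tuple[float, float]   # (x, y) in the map frame
    caption: str                    # language description at this resolution
    timestamp: float = 0.0
    children: List["Node"] = field(default_factory=list)


def count_leaves(node: Node) -> int:
    """Number of raw observations stored under this node."""
    if not node.children:
        return 1
    return sum(count_leaves(child) for child in node.children)


def merge(a: Node, b: Node) -> Node:
    """Create a parent located at the leaf-weighted centroid of a and b."""
    na, nb = count_leaves(a), count_leaves(b)
    x = (a.position[0] * na + b.position[0] * nb) / (na + nb)
    y = (a.position[1] * na + b.position[1] * nb) / (na + nb)
    # Placeholder summary (assumption): a real system would ask an LLM here.
    caption = f"area containing: {a.caption}; {b.caption}"
    return Node((x, y), caption, max(a.timestamp, b.timestamp), [a, b])


def build_forest(leaves: List[Node], max_distance: float) -> List[Node]:
    """Greedy agglomerative clustering by spatial proximity.

    Repeatedly merges the two closest roots until no pair lies within
    max_distance. The remaining roots form the forest.
    """
    roots = list(leaves)
    while len(roots) > 1:
        d, i, j = min(
            (dist(roots[i].position, roots[j].position), i, j)
            for i in range(len(roots))
            for j in range(i + 1, len(roots))
        )
        if d > max_distance:
            break
        parent = merge(roots[i], roots[j])
        roots = [r for k, r in enumerate(roots) if k not in (i, j)] + [parent]
    return roots


# Four observations in two spatially separate groups.
leaves = [
    Node((0.0, 0.0), "shelf with boxes", 1.0),
    Node((1.0, 0.0), "forklift parked by shelf", 2.0),
    Node((10.0, 0.0), "loading dock door", 3.0),
    Node((11.0, 0.0), "pallets near dock", 4.0),
]
forest = build_forest(leaves, max_distance=3.0)
for root in forest:
    print(root.position, "|", root.caption)
```

The pairwise search is quadratic per merge and is only meant to show the structure. Because parents are created from existing roots, new observations can be added as extra leaves and merged later without rebuilding the lower levels.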

YOWO: Efficient Indoor Mapping and Sensor Registration

YOWO ("You Only Walk Once", https://ieeexplore.ieee.org/document/10663468) provides an elegant solution for efficiently mapping indoor environments and registering multiple sensors within a unified coordinate system. The process requires just a single walkthrough by a mobile (or embodied) agent equipped with an RGB-D camera, while ceiling-mounted cameras observe the agent's movement (see Fig. 3). Other indoor IoT sensors can then be registered through the already registered CCTV cameras, which are used to infer their spatial positions.

YOWO’s joint mapping and multi-camera registration process, enabling unified spatial coordination (©2025 IEEE)
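As a rough illustration of what registering a camera into a unified coordinate system means, the sketch below aligns the agent's track as seen by one ceiling camera with the same track in the map frame. This is not YOWO's algorithm. It assumes the camera detections have already been projected onto a metric ground plane and are time-synchronized with the agent's trajectory, and it uses a standard closed-form 2D least-squares rigid alignment. The function names `estimate_rigid_2d` and `to_map` are hypothetical.

```python
from math import atan2, cos, sin
from typing import List, Tuple

Point = Tuple[float, float]


def estimate_rigid_2d(src: List[Point], dst: List[Point]) -> Tuple[float, Point]:
    """Least-squares rotation angle and translation mapping src onto dst.

    src: agent positions in the camera's ground-plane frame (assumed metric).
    dst: the same positions in the map frame, paired by index.
    """
    n = len(src)
    if n < 2 or n != len(dst):
        raise ValueError("need at least two paired points")

    # Centroids of both point sets.
    sx = sum(p[0] for p in src) / n
    sy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n

    # Accumulate cross and dot products of the centered points.
    cross = dot = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ax, ay = px - sx, py - sy
        bx, by = qx - dx, qy - dy
        cross += ax * by - ay * bx
        dot += ax * bx + ay * by

    theta = atan2(cross, dot)
    # Translation carries the rotated source centroid onto the target centroid.
    tx = dx - (cos(theta) * sx - sin(theta) * sy)
    ty = dy - (sin(theta) * sx + cos(theta) * sy)
    return theta, (tx, ty)


def to_map(p: Point, theta: float, t: Point) -> Point:
    """Transform a point from the camera frame into the map frame."""
    x = cos(theta) * p[0] - sin(theta) * p[1] + t[0]
    y = sin(theta) * p[0] + cos(theta) * p[1] + t[1]
    return (x, y)


# Synthetic check: build a map-frame track from a known transform, then recover it.
true_theta, true_t = 0.5, (2.0, 3.0)
cam_track = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 2.0)]
map_track = [to_map(p, true_theta, true_t) for p in cam_track]

theta_est, t_est = estimate_rigid_2d(cam_track, map_track)
print(theta_est, t_est)
```

Once a camera's transform is known, anything it detects (a worker, a pallet, another sensor) can be expressed in map coordinates with `to_map`, which is the property the rest of the pipeline relies on.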

Compared to related work, YOWO handles spatial ambiguities robustly during indoor scene mapping and CCTV camera registration, because it jointly maps the indoor scene and registers the cameras to the scene layout.

Comparison of indoor scene mapping and camera registration (©2025 IEEE)

Synergistic Integration for FieldWorkArena

Our FieldWorkArena*2 is a benchmark suite designed to evaluate AI agents in real-world scenarios, such as factories and logistics centers. Combining YOWO’s precise indoor spatial registration with Embodied-RAG’s hierarchical memory system opens up new applications for FieldWorkArena.

Case Studies

We present two use cases that combine the strengths of Embodied-RAG and YOWO for monitoring field work in warehouses.

Warehouse Safety Compliance

Challenge: Ensuring worker adherence to safety protocols in a dynamic environment

Implementation:

  1. YOWO mapped the facility and registered 24 cameras within hours
  2. Embodied-RAG organized spatial data with safety-related semantic tags
  3. The integrated system responded to both explicit queries ("Find areas where helmets are missing") and implicit queries ("Identify a quiet area for equipment maintenance")
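The query step can be pictured as a top-down walk through the hierarchy, from coarse area descriptions to specific observations. The sketch below is a toy stand-in, not the system's retrieval method. `MemoryNode`, `score`, and `retrieve` are hypothetical names, the captions are invented, and relevance is plain word overlap, where a real system would use a language model or embedding similarity (which is what would make implicit queries workable).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryNode:
    caption: str
    children: List["MemoryNode"] = field(default_factory=list)


def score(query: str, caption: str) -> int:
    """Toy relevance: number of words shared by query and caption (assumption)."""
    return len(set(query.lower().split()) & set(caption.lower().split()))


def retrieve(roots: List[MemoryNode], query: str) -> List[MemoryNode]:
    """Pick the best-matching node at each level and descend into its children.

    Returns the path from a root (coarse) to a leaf (fine).
    """
    path: List[MemoryNode] = []
    level = roots
    while level:
        best = max(level, key=lambda node: score(query, node.caption))
        path.append(best)
        level = best.children
    return path


# Invented captions for illustration only.
forest = [
    MemoryNode(
        "loading dock with worker activity and forklift traffic",
        [
            MemoryNode("worker without helmet near forklift"),
            MemoryNode("pallets stacked by the dock door"),
        ],
    ),
    MemoryNode(
        "quiet storage room with shelves",
        [MemoryNode("empty aisle between shelves")],
    ),
]

path = retrieve(forest, "worker without helmet")
for depth, node in enumerate(path):
    print("  " * depth + node.caption)
```

Returning the whole path rather than only the leaf keeps the coarser context (which area the observation belongs to) available for answering questions at different resolutions.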

Smart Warehouse Optimization

Challenge: A 10,000 m² warehouse requiring automated safety checks and equipment tracking

Solution:

  1. YOWO completed mapping and sensor registration in hours instead of days
  2. Embodied-RAG integration enabled complex spatial reasoning:
    • Real-time tracking of assets and personnel
    • Identification of workflow bottlenecks through spatial pattern analysis
    • Predictive maintenance scheduling based on equipment location and usage patterns

Future Directions: CMU x Fujitsu Synergy

The Fujitsu-CMU partnership is developing an integrated spatial intelligence platform that combines the strengths of both technologies:

  1. Temporal-Spatial Reasoning: Extending Embodied-RAG's semantic forest to incorporate YOWO's temporal data, enabling predictive modeling of dynamic environments and anticipatory safety responses
  2. Multi-Agent Coordination: Using YOWO's unified spatial registration to enhance Embodied-RAG's reasoning across multiple autonomous agents, creating collaborative agent teams with shared spatial awareness
  3. Cross-Modal Semantic Grounding: Developing a joint embedding space where Embodied-RAG's language representations align with YOWO's spatial coordinates, enabling natural-language control of physical systems
  4. Federated Spatial Learning: Creating distributed Embodied-RAG instances that share hierarchical knowledge across multiple YOWO-mapped facilities, establishing enterprise-wide spatial intelligence networks
  5. Enhanced Autonomous Reporting and Compliance: Combining the strengths of Embodied-RAG’s retrieval capabilities with YOWO’s real-time spatial data, the partnership aims to automate incident logs and compliance checks. This will facilitate proactive safety management and operational efficiency, allowing for real-time monitoring and reporting of safety and compliance issues in industrial settings.

Conclusion

The fusion of Embodied-RAG’s hierarchical reasoning and YOWO’s spatial precision positions Fujitsu at the forefront of large-scale indoor spatial reasoning. By embedding these technologies into FieldWorkArena, we can deliver scalable, safety-critical solutions for smart factories, warehouses, and beyond.