APPLIED SCIENCES-BASEL, vol. 16, no. 3, 2026 (SCI-Expanded, Scopus)
Autonomous navigation in unknown environments demands policies that jointly perceive semantic context and geometric safety. Existing Transformer-enabled deep reinforcement learning (DRL) frameworks, such as the Goal-guided Transformer Soft Actor-Critic (GoT-SAC), rely on temporal stacking of multiple RGB frames, which encodes short-term motion cues but lacks explicit spatial understanding. This study introduces a geometry-aware RGB-D early fusion modality that replaces temporal redundancy with cross-modal alignment between appearance and depth. Within the GoT-SAC framework, we integrate a pixel-aligned RGB-D input into the Transformer encoder, enabling the attention mechanism to capture semantic textures and obstacle geometry simultaneously. A systematic ablation study was conducted across five modality variants (4RGB, RGB-D, G-D, 4G-D, and 4RGB-D) and three fusion strategies (early, parallel, and late) under identical hyperparameter settings in a controlled simulation environment. The proposed RGB-D early fusion achieved a 40.0% success rate and an average reward of +94.1, surpassing the canonical 4RGB baseline (28.0% success, +35.2 reward), while a tuned configuration further improved performance to 54.0% success and +146.8 reward. These results establish early pixel-level multimodal fusion (RGB-D) as a principled and efficient successor to temporal stacking, yielding greater stability, better sample efficiency, and geometry-aware decision-making. This work provides the first controlled evidence that spatially aligned multimodal fusion within Transformer-based DRL significantly enhances mapless navigation performance, and it offers a reproducible foundation for sim-to-real transfer in autonomous mobile robots.
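To make the "early pixel-level fusion" idea concrete, the sketch below concatenates a 3-channel RGB image with its pixel-aligned 1-channel depth map into a single 4-channel tensor before any feature extraction, then feeds ViT-style patch tokens to a small Transformer encoder. This is a minimal illustration under assumed settings, not the authors' implementation: the class name RGBDPatchEncoder, image size 96x96, patch size 16, d_model=128, and layer counts are all illustrative placeholders.

```python
# Minimal sketch of early pixel-level RGB-D fusion feeding a Transformer encoder.
# Assumptions (not from the paper): ViT-style patch embedding, 96x96 inputs,
# patch size 16, d_model=128; RGBDPatchEncoder is a hypothetical name.
import torch
import torch.nn as nn


class RGBDPatchEncoder(nn.Module):
    def __init__(self, in_channels=4, d_model=128, patch_size=16,
                 img_size=96, n_heads=4, n_layers=2):
        super().__init__()
        # Early fusion: RGB (3 ch) and depth (1 ch) are concatenated before any
        # feature extraction, so every attention layer sees appearance and
        # geometry at the same spatial locations.
        self.patch_embed = nn.Conv2d(in_channels, d_model,
                                     kernel_size=patch_size, stride=patch_size)
        n_patches = (img_size // patch_size) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, rgb, depth):
        # rgb: (B, 3, H, W) in [0, 1]; depth: (B, 1, H, W), e.g. normalized range.
        x = torch.cat([rgb, depth], dim=1)       # (B, 4, H, W) pixel-aligned fusion
        x = self.patch_embed(x)                  # (B, d_model, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)         # (B, n_patches, d_model)
        return self.encoder(x + self.pos_embed)  # (B, n_patches, d_model)


if __name__ == "__main__":
    enc = RGBDPatchEncoder()
    rgb = torch.rand(2, 3, 96, 96)
    depth = torch.rand(2, 1, 96, 96)
    print(enc(rgb, depth).shape)  # torch.Size([2, 36, 128])
```

By contrast, the canonical 4RGB baseline described in the abstract would stack four RGB frames into a 12-channel input to encode short-term motion; the early-fusion variant replaces those temporal channels with a single spatially aligned depth channel, which is where the geometry-awareness comes from.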