Aerial Vision-and-Language Navigation (VLN) is a novel task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. It remains challenging because of the complex spatial relationships in outdoor aerial scenes. In this paper, we propose an end-to-end zero-shot framework for aerial VLN tasks, in which a large language model (LLM) serves as the agent for action prediction. Specifically, we develop a novel Semantic-Topo-Metric Representation (STMR) to enhance the spatial reasoning ability of LLMs. This is achieved by extracting instruction-related semantic masks of landmarks and projecting them into a top-down map that contains the location information of surrounding landmarks. This map is then transformed into a matrix representation with distance metrics, which serves as the text prompt from which the LLM predicts actions according to the instruction. Experiments conducted in real and simulated environments demonstrate the effectiveness and robustness of our method, which achieves absolute improvements of 15.9% and 12.5% in Oracle Success Rate (OSR) on the AerialVLN-S dataset.
In this paper, we introduce a novel zero-shot framework that leverages large language models (LLMs) for action prediction in aerial VLN tasks.
Our method consists of three modules: sub-goal extraction, Semantic-Topo-Metric Representation (STMR), and an LLM planner. First, we prompt an LLM to break down long instructions and extract specific landmarks, which are then classified by perception models to generate the corresponding semantic masks. Next, we project these masks into a top-down map using the UAV's pose and depth information. The map is further compressed into a matrix representation. Finally, we prompt the LLM with well-designed instructions to enable effective reasoning.
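The projection and compression steps described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a forward-facing pinhole camera in level flight, depth given as forward distance in metres, majority voting per grid cell, and a simple text layout for the prompt. The function names (`project_to_topdown`, `compress_to_matrix`, `matrix_to_prompt`), the parameters `fx`, `cx`, `cell_m` and `grid`, and the class names are all hypothetical.

```python
import math
from collections import Counter


def project_to_topdown(depth, mask, fx, cx, cell_m=5.0, grid=5):
    """Vote labelled pixels into a UAV-centred grid (row 0 = farthest ahead)."""
    half = grid // 2
    votes = [[Counter() for _ in range(grid)] for _ in range(grid)]
    for depth_row, mask_row in zip(depth, mask):
        for u, (d, cls) in enumerate(zip(depth_row, mask_row)):
            if cls == 0 or d <= 0:  # skip background and invalid depth
                continue
            lateral = (u - cx) * d / fx  # pinhole model: metres to the right
            row = half - math.floor(d / cell_m + 0.5)
            col = half + math.floor(lateral / cell_m + 0.5)
            if 0 <= row < grid and 0 <= col < grid:
                votes[row][col][cls] += 1
    return votes


def compress_to_matrix(votes):
    """Keep the most frequent class id per cell; 0 marks an empty cell."""
    return [[c.most_common(1)[0][0] if c else 0 for c in row] for row in votes]


def matrix_to_prompt(matrix, names, cell_m=5.0):
    """Serialise the matrix as text, marking the UAV at the centre cell."""
    half = len(matrix) // 2
    lines = [f"Each cell is {cell_m} m wide; the UAV is at the centre, facing up."]
    for r, row in enumerate(matrix):
        cells = ["UAV" if (r, c) == (half, half) else names.get(cls, ".")
                 for c, cls in enumerate(row)]
        lines.append(" ".join(cells))
    return "\n".join(lines)


# Toy 1x3 image: background, a building straight ahead, a road to the right.
depth = [[10.0, 10.0, 10.0]]
mask = [[0, 1, 2]]
matrix = compress_to_matrix(project_to_topdown(depth, mask, fx=2.0, cx=1.0))
print(matrix_to_prompt(matrix, {1: "building", 2: "road"}))
```

In this sketch the cell size sets the distance metric: a landmark two rows above the UAV is roughly two cell widths ahead, which gives a text-only planner a coarse sense of direction and range without access to the raw image.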
Our method aligns visual and textual landmarks and interprets the commands in the instruction, and the UAV ultimately reaches the destination.
@misc{gao2024stmraerialvln,
title={Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning},
author={Yunpeng Gao and Zhigang Wang and Linglin Jing and Dong Wang and Xuelong Li and Bin Zhao},
year={2024},
eprint={2410.08500},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2410.08500},
}