STMR VLN

Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning

Northwestern Polytechnical University, Shanghai AI Laboratory, Institute of Artificial Intelligence, China Telecom Corp Ltd
^*Indicates Equal Contribution
^†Indicates Corresponding Author

Abstract

Aerial Vision-and-Language Navigation (VLN) is a novel task that enables Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments using natural language instructions and visual cues. It remains challenging because of the complex spatial relationships in outdoor aerial scenes. In this paper, we propose an end-to-end zero-shot framework for aerial VLN tasks, in which a large language model (LLM) serves as the agent for action prediction. Specifically, we develop a novel Semantic-Topo-Metric Representation (STMR) to enhance the spatial reasoning ability of LLMs. This is achieved by extracting instruction-related semantic masks of landmarks and projecting them into a top-down map that contains the location information of surrounding landmarks. This map is then transformed into a matrix representation with distance metrics, which is given to the LLM as a text prompt for action prediction according to the instruction. Experiments conducted in real and simulation environments demonstrate the effectiveness and robustness of our method, achieving 15.9% and 12.5% improvements (absolute) in Oracle Success Rate (OSR) on the AerialVLN-S dataset.

Method

In this paper, we introduce a novel zero-shot framework that leverages large language models (LLMs) for action prediction in aerial VLN tasks.

Our method consists of three modules: sub-goal extraction, Semantic-Topo-Metric Representation, and an LLM planner. First, we prompt an LLM to break down long instructions and extract specific landmarks, which are then recognized by perception models to generate the corresponding semantic masks. Next, we project these masks into a top-down map using the UAV's pose and depth information. The map is further compressed into a matrix representation. Finally, we prompt the LLM with well-designed instructions to enable effective reasoning.
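The projection step can be illustrated with a minimal sketch. It assumes a pinhole camera model, a per-pixel depth image, and a known camera-to-world pose (rotation `R`, translation `t`); none of these details are specified on this page, and the function names, the sparse dictionary map, and the `cell_size` parameter are hypothetical choices made only for illustration.

```python
import math

def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) with metric depth
    into a camera-frame point (x right, y down, z forward)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)

def camera_to_world(p_cam, R, t):
    """Rigid transform p_world = R * p_cam + t (R is a 3x3 nested list)."""
    return tuple(
        sum(R[i][j] * p_cam[j] for j in range(3)) + t[i] for i in range(3)
    )

def project_mask_to_topdown(mask, depth, label, intrinsics, R, t,
                            cell_size, topdown):
    """Write `label` into a sparse top-down map for every masked pixel.

    `topdown` is a dict keyed by integer (ix, iy) world cells; this
    storage layout is an assumption, not the authors' data structure.
    """
    fx, fy, cx, cy = intrinsics
    for v, row in enumerate(mask):
        for u, on in enumerate(row):
            d = depth[v][u]
            if not on or d <= 0:  # skip unmasked pixels and invalid depth
                continue
            p_cam = backproject(u, v, d, fx, fy, cx, cy)
            xw, yw, _ = camera_to_world(p_cam, R, t)
            # Drop the height coordinate and bin the point into a cell.
            cell = (math.floor(xw / cell_size), math.floor(yw / cell_size))
            topdown[cell] = label
    return topdown

# Toy example: camera axes aligned with world axes, 2x2 image.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
topdown = project_mask_to_topdown(
    mask=[[0, 1], [0, 0]],
    depth=[[2.0, 2.0], [2.0, 2.0]],
    label=3,
    intrinsics=(1.0, 1.0, 0.0, 0.0),
    R=identity,
    t=(0.0, 0.0, 0.0),
    cell_size=1.0,
    topdown={},
)
```

Calling the function once per step with the same `topdown` dictionary accumulates landmarks over time, which mirrors the gradual projection described in the text.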

First, we extract instruction-related landmarks and obtain the corresponding semantic masks using perception models. The semantic masks obtained at each step are then progressively projected into a top-down map using depth and pose transformation. To integrate rich visual information and topology into the text prompt while keeping it simple, we divide the top-down map, centered at the UAV's current position, into grid cells and replace each cell with a semantic number. This process transforms the top-down map into a matrix representation that serves as a spatial prompt for the LLM.
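As a rough illustration of the matrix representation, the sketch below crops a UAV-centered window from a sparse top-down map and serializes it as text. The window size, the legend of semantic numbers, the use of 0 for unobserved cells, and the prompt wording are all assumptions made for this example; the exact prompt format is not given on this page.

```python
def topdown_to_matrix(topdown, uav_cell, size):
    """Crop a size x size window centered on the UAV cell.

    `topdown` maps integer (ix, iy) cells to semantic numbers;
    cells that were never observed are filled with 0.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the UAV sits in the center cell")
    half = size // 2
    cx, cy = uav_cell
    return [
        [topdown.get((cx + dx, cy + dy), 0) for dx in range(-half, half + 1)]
        for dy in range(-half, half + 1)
    ]

def matrix_to_prompt(matrix, legend, cell_size_m):
    """Serialize the matrix, its legend, and the metric scale as text."""
    lines = [
        f"Each cell is {cell_size_m} m x {cell_size_m} m; "
        "the UAV is at the center cell."
    ]
    lines.append(
        "Legend: " + ", ".join(f"{k}={v}" for k, v in sorted(legend.items()))
    )
    lines.extend(" ".join(str(value) for value in row) for row in matrix)
    return "\n".join(lines)

# Toy example: one landmark cell (semantic number 3) east of the UAV.
matrix = topdown_to_matrix({(2, 0): 3}, uav_cell=(1, 0), size=3)
prompt = matrix_to_prompt(matrix, {0: "unknown", 3: "building"}, cell_size_m=5)
print(prompt)
```

Stating the cell size in the prompt is one simple way to carry distance information alongside the topology, since the LLM can then convert cell offsets into approximate meters.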

BibTeX

@misc{gao2024stmraerialvln,
  title={Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning},
  author={Yunpeng Gao and Zhigang Wang and Linglin Jing and Dong Wang and Xuelong Li and Bin Zhao},
  year={2024},
  eprint={2410.08500},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2410.08500},
}

Demo Video

Qualitative result

Qualitative results of our method in real-world scenarios.
Our method aligns visual and textual landmarks and understands the commands, and the UAV successfully reaches the destination.
