MMSP 2025 Embodied AI Challenge:
Long-Horizon Vision-Language Navigation Challenge

September 21 to September 23, 2025
Beijing, China

The 1st Long-Horizon Vision-Language Navigation Challenge, based on the insights presented in LH-VLN—hosted as the “Embodied AI Challenge” track of the IEEE 27th International Workshop on Multimedia Signal Processing (MMSP 2025)—focuses on complicated long-horizon VLN tasks. Our LHPR-VLN benchmark defines a complex task that comprises multiple single-stage subtasks. For an LHPR-VLN task, the basic format is “Find something somewhere, and take it to something somewhere, then …”. Each complex task involves locating an object at a specified initial location and transporting it to a designated target location, potentially encompassing two to four sequential navigation subtasks. The embodied agent needs to complete these single-stage navigation tasks in order to ultimately fulfill the instruction. These tasks emphasize long-term planning and decision consistency across consecutive subtasks. The goal is to push agents beyond simple, short-term navigation by requiring them to deeply comprehend complex task instructions, maintain continuous navigation, and handle sequential subtasks seamlessly across a dynamic environment.

Figure 1: Environment where agents execute navigation tasks.

Challenge


Video 1: Agent executing the LH-VLN task.

Benchmark: LHPR-VLN.
The tasks within this benchmark all consist of multiple single-stage subtasks. Throughout navigation, the agent acquires observational data from three perspectives (+60°, 0°, −60°) and is permitted to execute fundamental actions: turn left, move forward, turn right, and stop. When the agent selects the “stop” action, the subtask is deemed complete, and task success is evaluated based on the agent’s final positional state relative to the target.
For each single-stage navigation task, the agent must approach within a 1-meter geodesic distance of the target object, ensuring the object is positioned within a 60-degree horizontal field of view to maintain task fidelity.
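The success criterion above can be sketched as a small check. This is a hypothetical illustration, not the official evaluator: it uses 2D Euclidean distance as a stand-in for the geodesic distance (which in the real benchmark is computed over the navigable mesh), and assumes the agent's heading and positions are given in a shared planar frame.

```python
import math

def subtask_success(agent_pos, agent_heading_deg, target_pos,
                    max_dist=1.0, fov_deg=60.0):
    """Hypothetical check of the single-stage success criterion:
    the agent must stop within 1 m of the target (Euclidean distance
    used here as a stand-in for the geodesic distance) with the
    target inside a 60-degree horizontal field of view."""
    dx = target_pos[0] - agent_pos[0]
    dy = target_pos[1] - agent_pos[1]
    # distance condition: within 1 meter of the target object
    if math.hypot(dx, dy) > max_dist:
        return False
    # field-of-view condition: target within +/- 30 degrees of heading
    bearing = math.degrees(math.atan2(dy, dx))
    diff = (bearing - agent_heading_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= fov_deg / 2.0
```

For example, an agent stopped 0.5 m away and facing the target succeeds, while the same agent facing away from it fails the field-of-view condition.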

Schedule details


Submission evaluation

The competition evaluation consists of two stages. In the first stage, we will release the training data along with some test data. Participants will train their models on the training data and self-assess the results, which will serve as the basis for the first-stage ranking. Participants advancing to the second stage are required to publicly release all model code and weights and to submit a corresponding technical report as part of their entry. The technical report will account for 30% of the final evaluation for awards.

In the second stage, participants selected in the first stage will be required to submit a container that meets the stated requirements. The competition organizers will test it on an undisclosed test set, and the final ranking will be based on those results.

Detailed requirements are as follows:

Timeline

Registration

We have opened the registration website, where you can register and submit Docker information for us to evaluate your proposal. The website will also update participants’ ranking scores in real time. We look forward to your participation.

Challenge Guide

Experiment Result Score

The final score is calculated as: Score = 0.4 * TAR + 0.2 * ISR + 0.2 * CSR + 0.2 * CGT

For the detailed meaning and calculation methods of each metric, please refer to this link.
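The weighted sum above can be expressed as a one-line helper. This is a sketch of the stated formula only; the precise definitions of TAR, ISR, CSR, and CGT follow the linked metric documentation, and the assumption here is that each metric lies in [0, 1] so the final score does as well.

```python
def challenge_score(tar, isr, csr, cgt):
    """Final score as stated in the challenge guide:
    Score = 0.4*TAR + 0.2*ISR + 0.2*CSR + 0.2*CGT.
    Each metric is assumed to be normalized to [0, 1]."""
    return 0.4 * tar + 0.2 * isr + 0.2 * csr + 0.2 * cgt
```

Note that the weights sum to 1.0, so a perfect score on all four metrics yields a final score of 1.0, with TAR weighted twice as heavily as the other three metrics.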

First Stage Submission

1. Dataset

The dataset is split by scene ID. For downloading and using the dataset, please refer to the LH-VLN repository.

2. Submission

Please submit the log file generated by running your model on the dataset. You can submit it here.

Second Stage Submission

Coming soon.

Other

More information about the challenge can be found in the LH-VLN repository. We have also opened an Issues platform where you can report any problems or questions you encounter during the competition.

Organizers

Yang Liu
Associate professor at SYSU
Liang Lin
Professor at SYSU
Guanbin Li
Professor at SYSU
Xinshuai Song
MSc Student at SYSU
Yexin Zhang
MSc Student at SYSU
Weixing Chen
PhD Student at SYSU
Kaixuan Jiang
MSc Student at SYSU