MMSP 2025 Embodied AI Challenge:
Long-Horizon Vision-Language Navigation Challenge
September 21 to September 23, 2025
Beijing, China
The 1st Long-Horizon Vision-Language Navigation Challenge, built on the insights presented in LH-VLN and hosted as the “Embodied AI Challenge” track of the IEEE 27th International Workshop on Multimedia Signal Processing (MMSP 2025), focuses on complicated long-horizon VLN tasks. Our LHPR-VLN benchmark defines a complex task composed of multiple single-stage subtasks. The basic format of an LHPR-VLN task is “Find something somewhere, and take it to something somewhere, then ...”. Each complex task involves locating an object at a specified initial location and transporting it to a designated target location, and may comprise two to four sequential navigation subtasks. The embodied agent must complete these single-stage navigation tasks in sequence to fulfill the overall instruction. These tasks emphasize long-term planning and decision consistency across consecutive subtasks. The goal is to push agents beyond simple, short-term navigation by requiring them to deeply comprehend complex task instructions, maintain continuous navigation, and handle sequential subtasks seamlessly in a dynamic environment.
Figure 1: Environment where agents execute navigation tasks.
Video 1: Agent executing the LH-VLN task.
Benchmark: LHPR-VLN.
The tasks within this benchmark all consist of multiple single-stage subtasks. Throughout navigation, the agent acquires observations from three perspectives (+60°, 0°, −60°) and is permitted to execute fundamental actions: turn left, move forward, turn right, and stop. When the agent selects the “stop” action, the current subtask is deemed complete, and success is evaluated based on the agent's final position relative to the target.
For each single-stage navigation task, the agent must approach within a 1-meter geodesic distance of the target object, ensuring the object is positioned within a 60-degree horizontal field of view to maintain task fidelity.
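For intuition, here is a minimal Python sketch of this success criterion. It is illustrative only: the function name and inputs are ours, Euclidean distance stands in for the benchmark's geodesic distance, and the heading is assumed to be an angle in radians in the ground plane; this is not the official evaluation code.

```python
import math

# Discrete action space described above.
ACTIONS = ("turn_left", "move_forward", "turn_right", "stop")

def subtask_success(agent_xy, agent_heading_rad, target_xy,
                    max_dist=1.0, half_fov_rad=math.radians(30.0)):
    """Illustrative success check for one single-stage subtask (a sketch,
    not the official evaluation code)."""
    dx = target_xy[0] - agent_xy[0]
    dy = target_xy[1] - agent_xy[1]

    # Distance condition: within 1 m of the target object
    # (Euclidean stand-in for the geodesic distance used by the benchmark).
    if math.hypot(dx, dy) > max_dist:
        return False

    # Field-of-view condition: the target must lie within a 60-degree
    # horizontal FOV, i.e. at most 30 degrees to either side of the heading.
    bearing = math.atan2(dy, dx)
    angle_off = abs((bearing - agent_heading_rad + math.pi) % (2 * math.pi) - math.pi)
    return angle_off <= half_fov_rad
```

For example, `subtask_success((0.0, 0.0), 0.0, (0.5, 0.2))` returns True: the target is about 0.54 m away and roughly 22° off the agent's heading.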
Submission Evaluation
The competition evaluation consists of two stages.
In the first stage, we will release the training data along with some test data. Participants will train their models on the training data and self-assess their results; these self-reported results will serve as the basis for the first-stage ranking. Participants advancing to the second stage are required to publicly release all model code and weights and to submit a corresponding technical report as part of their entry. The technical report accounts for 30% of the final evaluation for awards.
In the second stage, participants selected in the first stage will be required to submit a container that meets the requirements below. The organizers will test it on an undisclosed test set, and the final ranking will be based on those results.
Detailed requirements are as follows:
- The model parameter size must not exceed 8B, and the model must be capable of inference on an NVIDIA RTX 3090 or an equivalent machine (see the self-check sketch after this list). Based on this, the organizers will conduct testing and provide results within 3 days of submission.
- The Docker container (including model weights) must not exceed 20 GB and must pass validation with the script provided by the organizers to be considered a valid entry.
- The use of closed-source large model APIs is not allowed.
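For a quick self-check against the 8B parameter limit before packaging your container, a sketch along these lines may help; it assumes a PyTorch model, and the helper name and commented usage are placeholders rather than part of the official tooling.

```python
import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Total number of parameters in a PyTorch model."""
    return sum(p.numel() for p in model.parameters())

# Hypothetical usage (replace `build_my_agent` with your own model constructor):
# agent = build_my_agent()
# n = count_parameters(agent)
# assert n <= 8e9, f"{n / 1e9:.2f}B parameters exceeds the 8B limit"
```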
Timeline
- Registration start time: 23:59:59 UTC+8 on June 12, 2025
- Registration close time: 23:59:59 UTC+8 on July 15, 2025
- Stage 1 submission deadline: 23:59:59 UTC+8 on July 31, 2025
- Stage 1 result announcement: 23:59:59 UTC+8 on August 3, 2025
- Stage 2 submission deadline: 23:59:59 UTC+8 on August 10, 2025
- Final result announcement: 23:59:59 UTC+8 on August 15, 2025
Registration
We have opened the registration website, where you can register and submit your Docker information for us to evaluate your entry. The website also updates participants' ranking scores in real time. We look forward to your participation.
Experiment Result Score
The final score is calculated as: Score = 0.4 * TAR + 0.2 * ISR + 0.2 * CSR + 0.2 * CGT
For the detailed meaning and calculation methods of each metric, please refer to this link.
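As a minimal illustration of how the weighting is applied (the metric values below are placeholders, not real results; the actual metric definitions are given in the linked documentation):

```python
# Weights from the scoring formula above.
WEIGHTS = {"TAR": 0.4, "ISR": 0.2, "CSR": 0.2, "CGT": 0.2}

def final_score(metrics: dict) -> float:
    """Weighted final score; `metrics` maps each metric name to its value."""
    return sum(weight * metrics[name] for name, weight in WEIGHTS.items())

# Placeholder values, not real results:
print(final_score({"TAR": 0.30, "ISR": 0.55, "CSR": 0.45, "CGT": 0.40}))  # ≈ 0.40
```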
First Stage Submission
1. Dataset
The dataset is divided based on scene IDs (a split sketch follows the list below):
- Scene IDs below 700 are for the training set.
- Scene IDs from 700 to 800 are for the validation set.
- Scene IDs above 800 are for the test set.
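A minimal sketch of how such a split could be realized, assuming each episode record exposes its numeric scene ID under a `scene_id` key (the field name, the episode format, and the handling of the 700/800 boundaries are assumptions; consult the LH-VLN repository for the authoritative data layout):

```python
def split_by_scene_id(episodes):
    """Partition episode records into train/val/test by numeric scene ID."""
    splits = {"train": [], "val": [], "test": []}
    for ep in episodes:
        sid = int(ep["scene_id"])  # assumed field name
        if sid < 700:
            splits["train"].append(ep)
        elif sid <= 800:           # 700-800 treated as validation here
            splits["val"].append(ep)
        else:
            splits["test"].append(ep)
    return splits
```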
For downloading and using the dataset, please refer to the LH-VLN repository.
2. Submission
Please submit the log file generated by running your model on the dataset. You can submit it here.
Second Stage Submission
coming soon...
Other
More information about the challenge can be found in the LH-VLN repository.
We have opened an Issue platform where you can submit any problems or questions you encounter during the competition.