Robotics is advancing rapidly, but most robots still face a fundamental limitation: deciding precisely what action to take and where to carry it out.
Microsoft, together with a consortium of academic researchers, has introduced a new benchmark, GroundedPlanBench, which aims to address this challenge and bring robot intelligence closer to efficient, context-aware decision-making.
In conventional robotic systems, decision-making is split into two stages. First, a vision-language model generates a plan in natural language. Then, a separate system translates that plan into physical actions. This fragmented approach causes frequent errors: because the plan and the execution are disconnected, mistakes in one stage carry over to the next.
Typical errors include confusion about which object to manipulate or the invention of unnecessary steps. For example, a robot asked to discard paper cups may fail to identify which cup to pick up, or may perform actions that were never requested. These failures are aggravated in cluttered environments, where objects are similar or numerous.
GroundedPlanBench: A New Standard for Improving Decision-Making
To address this challenge, Microsoft and its partners developed GroundedPlanBench, a benchmark that evaluates whether AI models can plan tasks while accurately identifying where each action should be performed.
Unlike traditional evaluations that rely on text alone, this benchmark links each action to a specific location in an image. Actions such as grabbing, placing, opening, or closing are tied to particular objects or positions, forcing the AI to ground its decisions in the real physical environment.
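To make the idea concrete, a spatially grounded plan step can be thought of as an action paired with an object and an image location. The field names below are assumptions for illustration only; the benchmark's actual schema is not described in this article.

```python
from dataclasses import dataclass

@dataclass
class GroundedStep:
    """One step of a grounded plan: an action tied to a place in the image.
    This schema is a hypothetical sketch, not the benchmark's real format."""
    action: str                     # e.g. "pick", "place", "open", "close"
    target_object: str              # the object the action applies to
    location: tuple[float, float]   # (x, y) point in the image, normalized to [0, 1]

# Example: discarding a paper cup, as in the scenario described above.
plan = [
    GroundedStep("pick", "paper cup", (0.62, 0.41)),
    GroundedStep("place", "trash bin", (0.15, 0.78)),
]
```

Linking every step to coordinates like this is what lets an evaluator check not only that the model chose the right action, but that it pointed at the right place.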
The benchmark includes more than a thousand tasks based on real robot interactions. Some instructions are direct, such as placing a spoon on a plate; others are open-ended, such as tidying a table. This variety is crucial, because robots often fail when instructions are underspecified.
In one experiment, a robot had to place four napkins on a sofa. Because the instruction lacked specificity, the system repeated the action on the same napkin, even with seemingly more precise descriptions such as "upper left napkin". This shows that ambiguous language remains an obstacle to the reliable execution of complex tasks.
Learning based on real tasks
To improve decision-making capabilities, the team developed a training method called Video-to-Spatially Grounded Planning (V2GP). This system analyzes videos of robots performing tasks, detects interactions with objects, identifies those objects, and tracks their locations, generating structured plans that link each action to a specific point.
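The pipeline just described can be sketched as a simple transformation from detected interaction events to grounded plan steps. Everything below is a minimal, self-contained illustration: the event format and field names are invented for this sketch, and the real method uses learned detectors and trackers over actual video.

```python
def video_to_grounded_plan(interaction_events):
    """Convert detected object interactions into a grounded plan:
    one step per interaction, each tied to a tracked image location.
    Illustrative sketch only; not the published V2GP implementation."""
    plan = []
    for event in interaction_events:
        plan.append({
            "action": event["kind"],         # what the robot did, e.g. "grasp"
            "object": event["object"],       # recognized object label
            "point": event["last_seen_at"],  # tracked (x, y) location
        })
    return plan

# Synthetic events standing in for a real detection/tracking stage.
events = [
    {"kind": "grasp", "object": "spoon", "last_seen_at": (0.40, 0.55)},
    {"kind": "release", "object": "plate", "last_seen_at": (0.70, 0.50)},
]
grounded = video_to_grounded_plan(events)
```

The key design choice this mirrors is that grounding happens during plan construction, not afterward: each step is born already attached to a location, rather than being localized by a second system.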
Using this approach, researchers generated more than 40,000 "grounded" plans, ranging from simple actions to complex sequences of up to 26 steps. The models trained with this method demonstrated a better ability to choose appropriate actions and associate them with the correct objects, as well as reduce repetitive errors such as acting multiple times on the same element.
A Paradigm Shift for Robotics
Despite these advances, challenges persist, especially in long-horizon tasks and with indirect instructions. The researchers note that models must be able to reason over extended sequences and maintain coherence across multiple steps. When the new approach was compared with traditional systems, the latter tended to assign multiple actions to the same object or location, especially when instructions were ambiguous.
Integrating planning and localization into a single process reduces these mismatches and allows for more precise decisions. The Microsoft team suggests that future research could combine this method with predictive models capable of anticipating the consequences of each action, helping robots avoid errors in real time.
The study's conclusions point to a clear direction for the future of robotics: systems that jointly reason about action and location are more likely to operate successfully in real environments. This work represents a key step toward robots that can decide and act reliably in everyday tasks, bringing them closer to truly applied artificial intelligence.