Here we show exemplar sequences of the GPT-4o agent running in real time, compared with an RL agent, to demonstrate the limitations of current agents.

Embodied Visual Tracking

The agent is required to dynamically track the target person in its view.

RL Agent

Tracking demos in the SuburbNeighborhood environment.

GPT-4o Agent with Real-time Mode

The environment and target keep moving while GPT-4o is generating a response. The agent easily loses the target from view due to the response latency.

GPT_tracking_5.mp4
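
To make the latency issue concrete, below is a minimal sketch (not our actual implementation) of a real-time interaction loop: the simulator keeps stepping at a fixed rate while the GPT-4o request runs in a background thread, so each returned action is applied to an already-stale frame. The environment class, the query function, and the action names are hypothetical placeholders.

```python
import threading
import time

class TrackingEnv:
    """Hypothetical stand-in for the simulator; it advances one frame per step."""
    def step(self, action):
        return {"frame": time.time(), "last_action": action}

def query_gpt4o(obs):
    """Placeholder for the GPT-4o call; real responses often take 1-3 seconds."""
    time.sleep(1.5)
    return "move_forward"

env = TrackingEnv()
obs = env.step("no_op")
action = "no_op"
result = {}

def ask(o):
    result["action"] = query_gpt4o(o)

worker = threading.Thread(target=ask, args=(obs,))
worker.start()

# The simulator keeps running at ~20 Hz; each GPT-4o answer arrives many frames
# late, so it is applied to a view that is already stale and the target can
# leave the frame in the meantime.
for _ in range(100):
    if not worker.is_alive():
        action = result["action"]          # decided on a frame ~30 steps old
        worker = threading.Thread(target=ask, args=(obs,))
        worker.start()
    obs = env.step(action)                 # the world does not wait for the model
    time.sleep(0.05)
```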

GPT-4o Agent with Stepwise Pause Mode

The environment and target are paused while waiting for the GPT-4o response. The agent can keep tracking the target for a while; however, such a setting is unrealistic in the real world.

GPT_tracking_stepwise.mp4
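
For contrast, here is a sketch of the stepwise pause mode under the same hypothetical interfaces as above: the simulator is frozen until the response arrives, so latency never costs the agent any frames, which is exactly what makes this setting unrealistically easy.

```python
import time

class TrackingEnv:
    """Hypothetical stand-in for the simulator, as in the previous sketch."""
    def step(self, action):
        return {"frame": time.time(), "last_action": action}

def query_gpt4o(obs):
    time.sleep(1.5)             # latency is hidden: the world is frozen while we wait
    return "move_forward"

env = TrackingEnv()
obs = env.step("no_op")
for _ in range(20):
    action = query_gpt4o(obs)   # environment and target are paused here
    obs = env.step(action)      # the target only moves once per decision
```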

Visual Navigation

The agent is required to navigate to a set of wooden boxes in the environment, as illustrated in the following picture.

Target object

RL Agent

We trained an end-to-end visual navigation model that enables the agent to adapt to environmental structures by jumping, crouching, walking, and running, and to quickly reach the target location.
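
For reference, a policy of this kind is typically a small convolutional network mapping the first-person frame to a discrete action. The sketch below is illustrative only; it does not reproduce our exact architecture, training algorithm, or action set.

```python
import torch
import torch.nn as nn

# Illustrative action set; the actual set used in training may differ.
ACTIONS = ["walk", "run", "jump", "crouch", "turn_left", "turn_right"]

class NavPolicy(nn.Module):
    """A minimal CNN policy: first-person RGB frame -> logits over actions."""
    def __init__(self, n_actions=len(ACTIONS)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.LazyLinear(n_actions)     # logits over discrete actions

    def forward(self, frame):                    # frame: (B, 3, 84, 84) in [0, 1]
        return self.head(self.encoder(frame))

policy = NavPolicy()
logits = policy(torch.rand(1, 3, 84, 84))
action = ACTIONS[logits.argmax(dim=-1).item()]   # greedy action at test time
```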

A success case by the RL-based agent (1)

A success case by the RL-based agent (2)

GPT-4o Agent

GPT-4o receives the first-person image and the target's relative coordinates as its observation. The agent can perform turning, jumping, and moving actions to navigate to the destination over obstacles.
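
The sketch below shows one way such an observation could be packaged into a GPT-4o request with the OpenAI Python SDK. The prompt wording, action vocabulary, and parsing are simplified placeholders, not the exact setup used in our experiments.

```python
import base64
from openai import OpenAI

# Illustrative action vocabulary; the real interface may expose more actions.
ACTIONS = ["turn_left", "turn_right", "jump", "move_forward", "move_backward"]

client = OpenAI()   # requires OPENAI_API_KEY in the environment

def choose_action(frame_png: bytes, rel_xy: tuple) -> str:
    """Send the first-person frame and relative target coordinates, get one action."""
    b64 = base64.b64encode(frame_png).decode()
    prompt = (
        f"You control a navigation agent. The target is at relative "
        f"coordinates {rel_xy} (x: right, y: forward, in meters). "
        f"Reply with exactly one of: {', '.join(ACTIONS)}."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    reply = resp.choices[0].message.content.strip().lower()
    # Fall back to a default action if the model replies with free-form text.
    return reply if reply in ACTIONS else "move_forward"
```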

A success case by GPT-4o. The left image is the one sent to GPT-4o for inference; the right is the image captured by the agent. However, the policy is inefficient and often takes additional or wrong actions.