How generative foundation models are driving autonomous embodied AI: Wayve steers the right route
Summary:
Wayve has launched GAIA-3, a generative foundation model for stress testing autonomous driving models. Aniruddha Kembhavi, Director of Science Strategy at Wayve, explains how this could advance adjacent research into warehouse robotics, household humanoids, manufacturing and more.
DARPA held its Urban Challenge in 2007, during which six teams managed to drive autonomous vehicles through a mock city without crashing. Yet nearly two decades later, autonomous vehicles still struggle with novel failure modes. Wayve has launched GAIA-3, a new generative foundation model to simulate and mitigate these problems in mapless driving.
These models are used to simulate cars with different sensor configurations, environmental conditions, and various what-if scenarios. This can help safely validate and verify the resilience of embodied AI models in the lab before they are deployed in real cars.
This kind of physical AI stress testing leverages latent diffusion models rather than the Large Language Models (LLMs) that are all the rage these days. Rather than training primarily on text like LLMs, these models consume vast quantities of validated, aligned video, lidar, and physics data to improve realism. Beyond scenario variation, a thorny issue is troubleshooting problems arising from subtle visual effects caused by shadows, object reflectance, and transparency.
The new model uses 15 billion parameters to support new levels of realism and control. That is twice the size of the previous version, and it was trained on ten times as much data spanning continents, vehicle types, environments, and driving conditions. The extra scale improves the video tokenizer used to represent objects and lighting effects under these diverse conditions, enabling more consistent and repeatable evaluation across safety-critical scenarios through better representation of visual details, lighting, textures, and road signs.
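A video tokenizer of this kind maps raw frames to a grid of discrete codes that the generative model then predicts. Wayve has not published GAIA-3's tokenizer design, so the following is only a minimal vector-quantization sketch with invented sizes (a 512-entry codebook, 8x8 patches), not the actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a codebook of 512 embeddings (dim 16) and a fixed
# projection from flattened 8x8 pixel patches into that embedding space.
codebook = rng.normal(size=(512, 16))
proj = rng.normal(size=(64, 16))

def tokenize_frame(frame, patch=8):
    """Split a (H, W) frame into patches, project each into the codebook's
    embedding space, and snap it to its nearest codebook entry (a token id)."""
    h, w = frame.shape
    patches = (frame.reshape(h // patch, patch, w // patch, patch)
                    .transpose(0, 2, 1, 3)
                    .reshape(-1, patch * patch))   # (num_patches, 64)
    latents = patches @ proj                       # (num_patches, 16)
    # Nearest-neighbour lookup: one discrete token per patch.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

frame = rng.normal(size=(64, 64))   # one synthetic grayscale frame
tokens = tokenize_frame(frame)
print(tokens.shape)                 # (64,): an 8x8 grid of token ids
```

The point of the discretization is that a richer codebook, trained on more diverse footage, can devote codes to subtle visual phenomena such as lighting and reflectance, which is where scaling the training data pays off.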
Aniruddha Kembhavi, Director of Science Strategy at Wayve, says these kinds of advances could have value beyond better cars:
Evaluating Embodied AI models is very expensive since testing in the real world is slow, expensive and error-prone. GAIA-3 shows that we can build world models to evaluate, validate, and stress-test embodied AI systems. This has implications far beyond autonomous driving and into domains such as warehouse robotics, household humanoids, manufacturing and more.
Solving for embodied intelligence
The Wayve team came from Microsoft, DeepMind, the Allen Institute for AI and other leading academic labs, where they spent years building large-scale perception and learning systems. Their fundamental insight was that more resilient autonomous systems required solving an embodied intelligence problem, not hand-engineering ever more complex rules, which was the dominant approach in 2017 when they started. Kembhavi explains:
End-to-end learning, supported by strong world models, offered a clearer path to scalability than traditional stacks. It is this belief that has ultimately led us to GAIA-3.
Complementary world models
At a high level, Wayve has been focusing on three complementary areas: world foundation models for autonomous driving, linguistic representation, and simulation. Each of these uses different techniques to advance different aspects of the process. The autonomous driver guides the car and adapts to unexpected changes. LINGO, the linguistic engine, explains why the autonomous driver behaved as it did and helps translate a written scenario or edge case into the appropriate representation for the simulator or driver models. GAIA, the simulation engine, helps train a more resilient autonomous driver.
All three of these world models use neural networks under the hood. In the case of GAIA, the neural network can produce a simulation by training on a large corpus of data and modeling the appearance, semantics, and motion of objects and agents in the world.
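At its core, a world model of this kind learns a transition function: given an encoded scene and an agent's actions, predict what the scene looks like next. The toy sketch below illustrates that rollout loop; the linear maps `A` and `B` are stand-ins for what in GAIA would be large neural networks, and all dimensions are invented:

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-ins for learned components: in a real world model these are
# large neural nets; simple linear maps show the rollout mechanics.
A = np.eye(8) * 0.95                      # latent scene dynamics
B = rng.normal(scale=0.1, size=(8, 2))    # effect of [steer, accel] actions

def rollout(z0, actions):
    """Autoregressively predict future scene latents from an initial
    encoded scene z0 and a planned action sequence."""
    states = [z0]
    for a in actions:
        states.append(A @ states[-1] + B @ a)
    return np.stack(states)

z0 = rng.normal(size=8)                   # encoded current scene
plan = [np.array([0.1, 1.0])] * 5         # steer slightly while accelerating
traj = rollout(z0, plan)
print(traj.shape)                         # (6, 8): initial latent + 5 steps
```

Swapping in a different action plan replays the same starting scene under a different what-if, which is exactly what makes such models useful for stress testing a driving policy.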
This approach overcomes the limitations of procedural simulators and 3D reconstruction simulators. Procedural simulators have been the traditional standard for autonomy testing, offering precise control but lacking realism. 3D reconstruction simulators can achieve greater realism but struggle with occlusions and the vagaries of other cars and pedestrians. The result is the ability to safely replace more real-world test coverage with virtual coverage. Kembhavi says:
Most generative simulators today aim for visual realism but aren’t trustworthy for measuring safety-critical behavior. Traditional evaluation still depends on limited real-world logs or hand-crafted scenarios. GAIA-3 moves beyond this by generating dynamic scenarios with controllable agents, giving us a way to measure how an autonomy system actually interacts with the world. This enables more scalable and reliable validation than has been possible before.
Kembhavi says that the improvements to GAIA provide a much wider and more controllable testing space than real-world driving alone:
It lets us explore rare and risky situations early, understand failure modes more deeply, and iterate faster because evaluation becomes repeatable. This has directly improved the robustness and generalization of our driving models. Building GAIA-3 pushed us to unify perception, prediction, and scene understanding around a single world representation. That alignment clarified how our driving policy should reason about complex environments and created a feedback loop where improvements in one system directly informed the other.
Scaling challenges
Building a bigger and better model required solving three complementary scaling challenges: data volume and diversity, model parameters, and compute. Wayve worked closely with Microsoft to fine-tune the compute infrastructure to support this process. A lot of effort also went into maintaining data quality at scale. Kembhavi explains:
Every time one scales up data, it is imperative to maintain high quality of data, and this always consumes a large fraction of time. Scaling up model parameters and compute isn’t trivial either, since it requires world-class GPU infrastructure to train models in a stable way, training recipes that scale, and experienced teams that can closely monitor training progress.
An intriguing aspect of Wayve's approach is the ability to transfer learning across different vehicles and sensor configurations. This makes it possible to accurately recreate consistent scenarios across diverse cars, improving transferability across a fleet. For now, this means different cars, but down the road, it should enable these models to be adapted to completely different embodiments, such as warehouse robots, humanoids, or industrial systems. Across these different use cases, Kembhavi recommends:
Start with real sensor data, focus on evaluation as much as synthesis, and design for controllability and diversity rather than just visual detail. Treat the world model and the embodied system as a single ecosystem, and invest early in tools for scenario generation and diagnostics.
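One simple way to picture the cross-vehicle transfer described above is conditioning: a single shared scenario representation combined with a per-vehicle sensor embedding, so the same event can be replayed on any rig. The sketch below is purely illustrative; the rig names, embedding sizes, and conditioning scheme are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical per-vehicle sensor-rig embeddings; names are invented.
rigs = {
    "hatchback_6cam": rng.normal(size=4),
    "van_4cam_lidar": rng.normal(size=4),
}
W_scene = rng.normal(size=(16, 8))   # shared scenario pathway
W_rig = rng.normal(size=(16, 4))     # sensor-conditioning pathway

def render(scenario_latent, rig_name):
    """Condition one shared scenario representation on a per-vehicle
    sensor embedding, so the same event replays consistently on any rig."""
    return np.tanh(W_scene @ scenario_latent + W_rig @ rigs[rig_name])

scenario = rng.normal(size=8)        # e.g. "pedestrian steps out at dusk"
out_a = render(scenario, "hatchback_6cam")
out_b = render(scenario, "van_4cam_lidar")
# Same scenario, two renderings: identical shapes, rig-specific content.
```

Because the scenario pathway is shared, extending coverage to a new embodiment would, in this framing, mean learning a new rig embedding rather than rebuilding the whole model.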
My take
I think some of the larger AI vendors have done the industry a bit of a disservice by arguing that a single large foundation model might lead to advanced general intelligence. Wayve seems to have found that it can be easier to break the problem into multiple complementary world foundation models that work together.
Each one of these models requires different kinds of data, processes, algorithms, and compute infrastructure. It also means that down the road, it will be easier to take advantage of new techniques. For example, a better world simulator will make it easier to provide more accurate simulated experiences for models based on Bayesian approaches like active inference, as well as for newer reinforcement learning approaches as they mature.
Also, cars have relatively simple actuators for accelerating, braking, and turning. Humanoids, warehouse robots, autonomous labs, and construction equipment will require extending the autonomous brains to support more complex interactions with materials and objects. These will likely benefit from better ways to refine world foundation models for communicating with humans, controlling embodied systems, and representing different facets of the world.