NVIDIA Unveils Cosmos 3, an AI Designed to Understand the Real World

Artificial intelligence is now capable of generating text, creating images, producing videos, and even writing code. However, one limitation remains: understanding the physical world. A model can describe a car, recognize a pedestrian, or identify an obstacle, but understanding how objects interact in space and anticipating their movements remains a major challenge. This is precisely the problem NVIDIA is seeking to solve with Cosmos 3.

Unveiled at GTC Taipei 2026 alongside Isaac GR00T, NVIDIA's platform for humanoid robots, Cosmos 3 marks a new milestone in the development of what NVIDIA calls “physical AI.” Unlike traditional generative models, it is not limited to understanding digital content. Its goal is to help robots, autonomous vehicles, and intelligent systems better interpret, anticipate, and interact with the real world.

For NVIDIA, this capability could significantly accelerate the development of robotics, autonomous vehicles, and future physical agents powered by artificial intelligence.

For several years now, advances in AI have been based primarily on the understanding of language, images, and digital data. However, machines continue to face challenges when it comes to interacting with real-world environments.

A robot that needs to grasp an object, avoid an obstacle, or navigate a complex environment must understand much more than just the appearance of a scene. It must be able to anticipate the consequences of its actions, evaluate possible movements, and reason about physical interactions.

This issue is becoming particularly important as global investment in robotics is projected to exceed $260 billion by 2030.¹ Manufacturers are now seeking models capable of bridging the gap between digital perception and physical understanding.

It was against this backdrop that Cosmos 3 was designed.

NVIDIA is introducing Cosmos 3 as the first fully open “omnimodel” dedicated to physical AI. The system was developed to serve as the foundation for a new generation of intelligent machines capable of interacting with their environment.

The company already offers two versions of the model. The Super version is designed for applications requiring high physical accuracy, particularly in industrial robotics and autonomous driving. A Nano version is also available for applications requiring faster response times and lower computing costs.

NVIDIA also announced the upcoming release of an Edge version designed to run directly on local devices. This approach addresses a major challenge facing the industry: enabling autonomous systems to make decisions without always relying on a cloud connection.

This strategy shows that NVIDIA is not only seeking to develop a high-performance model, but also to build a true physical AI ecosystem capable of adapting to different levels of infrastructure.

One of the most impressive aspects of Cosmos 3 is the data used to train it.

According to NVIDIA, the model was trained on nearly 20 trillion tokens.² This dataset includes:

  • nearly one billion images;
  • approximately 400 million real and synthetic videos;
  • ambient audio data;
  • text content;
  • traces of actions carried out by humans and robots.

This diversity allows the model to learn not only to recognize objects or situations, but also to understand the actions associated with these environments.

Unlike a traditional video generator, which focuses primarily on the visual appearance of a scene, Cosmos 3 seeks to model what is actually happening in the physical world.

According to Ming-Yu Liu, vice president of Cosmos Lab at NVIDIA, the goal is to learn the movements, interactions, and behaviors that characterize real-world environments.²

The true innovation of Cosmos 3 lies in its ability to incorporate the concept of action.

To a human, watching someone open a door, move an object, or climb a staircase seems natural. To a machine, these actions represent a complex combination of movements, physical constraints, and sequential decisions.

Cosmos 3 specifically seeks to capture this aspect.

The model can generate extremely detailed action data, including:

  • movement paths;
  • the positions of robotic effectors;
  • joint angles;
  • mechanical arm movements;
  • the steps required to complete a task.

This information is essential for training robots to interact effectively with their environment.
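
To make the list above concrete, here is a minimal Python sketch of how such action data could be laid out. It is not the Cosmos 3 output format, which the source does not describe: the names `ActionStep`, `ActionTrace`, `forward_kinematics`, and `build_trace` are hypothetical, and the robot is assumed to be a simple planar arm with revolute joints. The sketch stores joint angles per time step and derives the effector position from them with standard forward kinematics.

```python
from dataclasses import dataclass, field
import math


@dataclass
class ActionStep:
    """One time step of a hypothetical robot action trace."""
    t: float                           # time in seconds
    joint_angles: list[float]          # radians, one value per joint
    effector_xy: tuple[float, float]   # end-effector position in metres


@dataclass
class ActionTrace:
    """A task label plus the ordered steps that carry it out."""
    task: str
    steps: list[ActionStep] = field(default_factory=list)


def forward_kinematics(angles, link_lengths):
    """End-effector (x, y) of a planar arm; each angle is relative to the previous link."""
    x = y = total = 0.0
    for angle, length in zip(angles, link_lengths):
        total += angle
        x += length * math.cos(total)
        y += length * math.sin(total)
    return x, y


def build_trace(task, angle_path, link_lengths, dt=0.1):
    """Turn a sequence of joint-angle vectors into a trace with effector positions."""
    trace = ActionTrace(task=task)
    for i, angles in enumerate(angle_path):
        position = forward_kinematics(angles, link_lengths)
        trace.steps.append(ActionStep(i * dt, list(angles), position))
    return trace


# Example: a two-link arm (1 m per link) raising its first joint in three steps.
trace = build_trace(
    "raise arm",
    angle_path=[(0.0, 0.0), (math.pi / 4, 0.0), (math.pi / 2, 0.0)],
    link_lengths=[1.0, 1.0],
)
```

In this sketch the movement path is the sequence of `effector_xy` values, and the steps required to complete the task are the ordered entries of `trace.steps`.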

This approach gradually brings artificial intelligence systems closer to human physical reasoning, a capability considered essential for the emergence of truly autonomous agents.

One of the most promising use cases involves generating rare or dangerous scenarios.

In the real world, it is often difficult, costly, or risky to replicate certain situations needed to train autonomous systems. Vehicle collisions, industrial accidents, and mechanical failures are rare events, but they are essential for developing robust systems.

Cosmos 3 allows users to virtually generate these types of scenarios to enrich training data.

This approach offers several advantages:

  • lower physical testing costs;
  • safer experiments;
  • shorter development cycles;
  • a wider variety of simulated scenarios.

NVIDIA even claims that certain training phases that might previously have taken several months can now be completed in just a few days.²
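
One common way to use generated scenarios is to oversample them when assembling a training set, so that rare events appear more often than they do in recorded data. The Python sketch below illustrates that idea only; it is not NVIDIA's pipeline, and the function `mix_scenarios`, its parameters, and the scenario labels are hypothetical.

```python
import random


def mix_scenarios(real, synthetic_rare, rare_fraction, size, seed=0):
    """Sample a training set in which `rare_fraction` of items are synthetic rare events.

    Sampling is done with replacement, so a small pool of rare scenarios
    can still fill its share of the set.
    """
    if not 0.0 <= rare_fraction <= 1.0:
        raise ValueError("rare_fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_rare = round(size * rare_fraction)
    batch = rng.choices(synthetic_rare, k=n_rare) + rng.choices(real, k=size - n_rare)
    rng.shuffle(batch)
    return batch


# Hypothetical scenario labels standing in for real clips and generated ones.
real_clips = ["highway_cruise", "urban_stop", "parking"]
generated_clips = ["sim_collision", "sim_brake_failure"]

training_set = mix_scenarios(real_clips, generated_clips, rare_fraction=0.2, size=100)
```

Here one item in five comes from the generated pool, whatever the actual frequency of those events on the road. The right fraction is an empirical choice and would need to be validated against real-world tests.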

Like the Nemotron family, Cosmos 3 adopts an open strategy. NVIDIA wants to enable developers, researchers, and industry professionals to adapt the model to their own needs.

This openness contrasts with the trend observed among several major players in the sector, who favor more closed models.

The goal is to foster the emergence of an ecosystem capable of accelerating innovation in robotics, autonomous mobility, and smart systems.

Among the first partners announced are Agile Robots, Black Forest Labs, and Runway, demonstrating that NVIDIA is seeking to build a broad network around this new platform.²

Cosmos 3 illustrates a profound evolution in artificial intelligence. After learning to understand language, images, and digital data, models are now seeking to develop a more nuanced understanding of the physical laws that govern the real world.

This development could have major implications for robotics, autonomous mobility, industry, and future agent-based AI systems.

The challenge is no longer simply to create models capable of answering questions or generating content. It is now about building systems capable of interacting with their environment in a reliable, predictable, and autonomous manner.

With Cosmos 3, NVIDIA is not merely seeking to improve artificial intelligence. The company is attempting to bring machines closer to understanding the physical world—a challenge that remains one of the greatest frontiers in AI today.

Technology Framework

How does Cosmos 3 work?

Cosmos 3 is based on a multimodal artificial intelligence architecture designed to understand physical environments and model interactions between objects, humans, and machines. Developed by NVIDIA, this model belongs to a new category of AI called “physical AI,” whose goal is no longer limited to processing text, images, or videos, but also includes understanding actions that take place in the real world.

Unlike traditional generative models, which focus primarily on digital content, Cosmos 3 seeks to represent the physical laws, movements, and behaviors observed in real-world environments. The system analyzes various types of multimodal data, including images, videos, text, sounds, and records of human or robotic actions.

Based on this information, the model learns to identify not only what is present in a scene, but also what is happening there, what movements are being made, what interactions are taking place, and what consequences may result from certain actions. This capability allows it to generate realistic physical simulations and produce data that can be used to train robots, autonomous vehicles, or other intelligent systems.
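
The idea of anticipating the consequences of an action before taking it can be illustrated with a toy example. In the Python sketch below, a hand-written physics rule (constant deceleration) stands in for the learned model, which is an assumption made purely for illustration; the functions `simulate_stop_distance` and `choose_braking` are hypothetical and unrelated to any Cosmos 3 API. Each candidate action is simulated forward, and the gentlest one whose predicted outcome is safe is selected.

```python
def simulate_stop_distance(speed, decel, dt=0.01):
    """Predict the distance (m) travelled before stopping, by explicit time stepping."""
    if decel <= 0.0:
        raise ValueError("decel must be positive")
    distance = 0.0
    while speed > 0.0:
        distance += speed * dt   # advance position with the current speed
        speed -= decel * dt      # then apply the braking action
    return distance


def choose_braking(speed, obstacle_distance, candidates, margin=1.0):
    """Return the gentlest deceleration predicted to stop `margin` metres short, else None."""
    for decel in sorted(candidates):
        if simulate_stop_distance(speed, decel) + margin <= obstacle_distance:
            return decel
    return None


# A vehicle at 10 m/s with an obstacle 12 m ahead, choosing among 2, 5, and 8 m/s².
best = choose_braking(speed=10.0, obstacle_distance=12.0, candidates=[2.0, 5.0, 8.0])
```

With these numbers, braking at 2 m/s² is predicted to need about 25 m and is rejected, while 5 m/s² stops in about 10 m, so `best` is `5.0`. A physical AI model would replace the hand-written rule with predictions learned from data, but the evaluate-then-act loop is the same in spirit.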

Key Features of Cosmos 3

  • Advanced physical understanding: analysis of interactions between objects, humans, and machines
  • Multimodal model: simultaneous processing of text, images, videos, audio, and actions
  • Simulation generation: creating realistic physical environments for AI training
  • Motion modeling: understanding trajectories, displacements, and dynamic behaviors
  • Actionable data generation: producing actionable information for robotics and automation
  • Open architecture: flexibility and customization for specific industrial applications
  • Optimization for physical AI: accelerated development of autonomous robots and smart vehicles

Technical constraints and limitations

  • Significant computing power requirements for training and inference
  • Dependence on the quality and diversity of the physical data used
  • Difficulty in perfectly replicating certain complex real-world situations
  • The need for validation in real-world physical environments following simulation
  • Risks associated with bias in training data
  • Current limitations in understanding highly unpredictable or unprecedented situations

The development of models capable of understanding the physical world is a key step in the evolution of artificial intelligence, particularly for robotics, autonomous vehicles, and simulated environments. On a related topic, check out our article “DINOv3 by Meta: Self-Supervision for Precise Visual Analysis”, which examines how advances in computer vision enable AI systems to better interpret their environment and interact with complex real-world situations.

1. MarketsandMarkets. (2025). Global Robotics Market Forecast.
https://www.marketsandmarkets.com

2. NVIDIA. (2026). Cosmos 3 Technical Presentation, GTC Taipei 2026.
https://www.nvidia.com
