動區 BlockTempo · 2026-04-28 16:49:33

NVIDIA launches new open-source multimodal large model "Nemotron 3 Nano Omni"! Capable of processing video, audio, images, and text, specialized for Agent applications.

Original title: NVIDIA 推出全新開源多模態大模型「Nemotron 3 Nano Omni」!影音圖文通吃,專攻 Agent 應用
NVIDIA is making another bold move. Today (the 28th), it announced its brand-new open-source multimodal large model, "Nemotron 3 Nano Omni." The model addresses a long-standing pain point: traditional AI stacks rely on a chain of fragmented models, whereas Nano Omni processes video, audio, images, and text efficiently within a single model. NVIDIA also announced that the release is fully open: beyond publishing the weights on Hugging Face, it is making the training datasets and training recipes public, mounting a comprehensive push into the underlying infrastructure market for agentic AI.

(Previous coverage: Flash: NVIDIA hits a new all-time high during trading, "surpassing $212.6," with a market cap of $5.17 trillion, reclaiming the top spot globally.)

(Background: Jensen Huang sends an all-hands letter embracing OpenAI Codex: over 10,000 NVIDIA employees are already using it, with GPT-5.5 running on GB200.)

The development of AI agents is undergoing a major architectural overhaul, and the driving force behind this shift is the computing powerhouse, NVIDIA. On the 28th, NVIDIA officially unveiled the latest member of the Nemotron 3 family, "Nemotron 3 Nano Omni." True to the name "Omni" (all-encompassing), it is an efficient, open, and powerful model that unifies the processing of video, audio, images, and text, built specifically for the next generation of agentic AI.

Previously, a company that wanted an AI agent able to read documents, listen to speech, and watch video typically had to stitch together a "fragmented model chain" of independent vision, audio, and text models. That approach carries high coordination complexity and expensive inference costs, and, more critically, cross-modal context is easily lost, or hallucinated, as it passes between models.
The birth of Nemotron 3 Nano Omni is meant to converge these complex pipelines into a single, efficient, open model. Acting as the multimodal perception sub-agent within a larger system, it lets an AI process multimodal inputs seamlessly in a single "perception-action loop," improving convergence and cutting enterprise costs.

On the hardware and architecture side, NVIDIA has played to its strengths:

- Hybrid MoE architecture: the model has 30 billion (30B) total parameters but, thanks to its Mixture-of-Experts (MoE) design, activates only about 3 billion (3B) parameters during inference, balancing top-tier performance with computational efficiency. The backbone combines the strengths of Mamba (long sequences and memory efficiency) and Transformer layers (precise reasoning).
- Benchmark leadership: across benchmarks such as MMLongbench-Doc and WorldSense, Nano Omni posts industry-leading results. Compared with other open multimodal models at the same interactivity threshold, its video-inference system capacity is up to 9.2× higher, and its multi-document reasoning capacity up to 7.4× higher.
- Born for Blackwell: the model supports NVIDIA's latest Blackwell GPUs and NVFP4 quantization, along with an ultra-long context window of up to 262K tokens, tailored for enterprise-grade long-video processing and complex document reasoning.

What excites the developer community most is NVIDIA's "open by design" stance. Unlike many "pseudo-open-source" releases that publish only the weights, NVIDIA has made public the model weights of Nemotron 3 Nano Omni, the training datasets (including synthetic data generated with NeMo Data Designer), and high-value fine-tuning recipes (SFT, reinforcement learning (RL), LoRA, GRPO, and more).
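The efficiency claim above (30B total parameters, only ~3B active) comes from sparse expert routing: for each token, a small router picks a few experts and only those run. The following is a minimal toy sketch of top-k MoE routing in NumPy, with sizes and names chosen for illustration; it is not NVIDIA's implementation.

```python
import numpy as np

def moe_layer(x, experts, gate_w, top_k=2):
    """Route one token to its top_k experts; only those experts execute.

    x:        (d,) token activation
    experts:  list of (w, b) pairs, one tiny MLP per expert
    gate_w:   (d, n_experts) router weights
    """
    logits = x @ gate_w                          # router score per expert
    top = np.argsort(logits)[-top_k:]            # indices of the top_k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                     # softmax over selected experts only
    out = np.zeros_like(x)
    for w_i, idx in zip(weights, top):
        w, b = experts[idx]
        out += w_i * np.tanh(x @ w + b)          # weighted sum of expert outputs
    return out, top

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [(rng.normal(size=(d, d)) * 0.1, np.zeros(d)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
x = rng.normal(size=d)

y, used = moe_layer(x, experts, gate_w, top_k=2)
print(f"experts used: {sorted(used.tolist())} of {n_experts}")
```

Here only 2 of 4 experts execute for the token; scaled up, the same principle is how a 30B-parameter model can run with roughly 3B parameters active per token.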
The model is currently available for download on Hugging Face and has been launched simultaneously as an NVIDIA NIM microservice. NVIDIA stressed in its announcement that this breakthrough is not about chasing benchmark scores, but is a substantive upgrade aimed at real agent workloads. Going forward, whether in finance, medical analysis, or media and entertainment, developers can pair Nemotron 3 Nano Omni with larger models (such as Nemotron 3 Ultra) to build powerful, modular, highly perceptive enterprise AI agent systems.
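NIM microservices expose an OpenAI-compatible chat API, so a multimodal request to a deployed instance would be an ordinary JSON chat-completions body. The sketch below only builds such a body; the model ID and the multimodal content-part names are placeholder assumptions of mine, not names confirmed by the article, so check NVIDIA's model card before using them.

```python
import json

# Hedged sketch: an OpenAI-style chat-completions request body for a
# NIM-served model. "nvidia/nemotron-3-nano-omni" and "video_url" are
# illustrative placeholders, not confirmed identifiers.
payload = {
    "model": "nvidia/nemotron-3-nano-omni",   # placeholder model ID
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What happens in this clip?"},
                {"type": "video_url",         # assumed multimodal part name
                 "video_url": {"url": "file:///tmp/clip.mp4"}},
            ],
        }
    ],
    "max_tokens": 256,
}

body = json.dumps(payload)                    # what would be POSTed to /v1/chat/completions
print(body[:72] + "...")
```

An agent system would POST this body to the NIM endpoint and read the assistant message from the response, exactly as with any OpenAI-compatible server.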