Powered by NVIDIA
Nemotron 3 Super
- Text Generation
Nemotron 3 Super is NVIDIA’s open-weight, 120B-parameter hybrid Mamba-Transformer Mixture-of-Experts language model optimized for high-throughput agentic reasoning workloads. It is notable for combining LatentMoE experts, long-context support, and efficient NVFP4 training to deliver competitive accuracy with substantially higher inference efficiency than comparable open models.
About the model
What is Nemotron 3 Super?
Nemotron 3 Super is a 120B-parameter (12B active) open Mixture-of-Experts hybrid Mamba-Attention large language model from NVIDIA, designed for efficient, high-quality agentic reasoning. It is primarily used for building autonomous AI agents that perform multi-step reasoning, tool use, and long-running workflows in domains like software engineering, data analysis, and complex enterprise automation. It is also used as a foundation text model for high-throughput, long-context applications such as large document understanding and large-scale code generation on NVIDIA GPU infrastructure. It is part of the Nemotron 3 family of open models, alongside smaller Nano and larger Ultra variants that share common training data, recipes, and architecture principles.
Model capabilities
5 Core Capabilities
-
Agentic Reasoning
Supports agent-style workflows, enabling planning, tool use, and multi-step decision-making for complex autonomous and semi-autonomous AI agents.
-
Advanced Conversation
Acts as a large language model optimized for natural, multi-turn dialogue with strong instruction following and contextual understanding.
-
Long-Context Handling
Processes and reasons over very long text contexts, supporting extended documents, workflows, and multi-document inputs in a single session.
-
Efficient Inference
Hybrid Mamba-Transformer Mixture-of-Experts architecture with multi-token prediction enables high-throughput, low-latency text generation at scale.
-
Multilingual Text
Handles multiple languages, enabling understanding and generation across diverse linguistic inputs for global applications and datasets.
Use cases
6 Most Valuable Use Cases
- Enterprise AI Agents
- Complex Workflow Orchestration
- Long-Context Document Analysis
- Multistep Tool-Using Agents
- Reasoning-Heavy Chatbots
- Code Reasoning Assistance
Transparent pricing
Cost Comparison
LLM API delivers the lowest cost and latency for Nemotron-class models versus major providers.
| Provider | Region | Latency | Throughput | Uptime | Input ($/1M) | Output ($/1M) | Context |
|---|---|---|---|---|---|---|---|
| LLM API BEST | Global | 80ms | 120 tps | 99.99% | $0.20 | $0.20 | 256K |
| NVIDIA | US West | ~160ms | ~60 tps | 99.9% | ~$0.60 | ~$0.60 | 128K |
| AWS Bedrock | US East | ~180ms | ~45 tps | 99.9% | ~$0.70 | ~$0.70 | 64K |
| Azure AI | EU West | ~190ms | ~40 tps | 99.9% | ~$0.75 | ~$0.75 | 128K |
| Google Cloud | Global | ~170ms | ~50 tps | 99.9% | ~$0.80 | ~$0.80 | 128K |
Performance benchmarks
Technical Specifications
| Metric | Nemotron 3 Super (NVIDIA) | GPT-4 Turbo (OpenAI) | Claude 3 Sonnet (Anthropic) |
|---|---|---|---|
| Avg Latency | ~350ms | ~400ms | ~450ms |
| Context Window | ~128K | 128K | 200K |
| Input Price ($/1M) | ~$0.60 | ~$0.50 | ~$0.60 |
| Output Price ($/1M) | ~$2.40 | ~$1.50 | ~$1.80 |
| Max Output Tokens | ~4K | 4K | 4K |
| Throughput | ~60 tps | ~50 tps | ~40 tps |
| Uptime | ~99.9% | ~99.9% | ~99.9% |
30-day usage via LLM API
- 3.8B
- Prompt tokens processed (30 days)
- 2.4B
- Completion tokens generated (30 days)
- 11.5M
- API requests served (30 days)
- 99.8%
- Avg uptime (last 30 days)
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
-
Intelligent Model Routing
Automatically route each request to the best-performing model across providers based on latency, cost, and quality—without changing your integration.
One endpoint, optimal model -
Cost-Aware Execution
Control spend with transparent per-token accounting, guardrails, and smart selection of cheaper equivalent models while preserving quality for critical workloads.
Optimize every token -
Automatic Fallbacks
Keep production flows resilient with built-in provider failover and model-level retries, so transient outages never break your user experience.
Resilient by default -
Deep Observability
Inspect latency, errors, tokens, and prompts across providers in one place, enabling fast debugging, regression detection, and performance tuning.
See every request -
Task-Level Orchestration
Express complex AI workflows as high-level tasks—grounding, tools, classification, generation—while LLM.API handles prompt shaping, execution, and model differences.
Ship workflows, not glue -
High-Throughput Batch
Process millions of requests efficiently with batch APIs that parallelize across providers, maximize throughput, and minimize cost for bulk inference jobs.
Scale jobs to millions
Decision guide
When to Use — When NOT to Use
Use it if...
- You need an open, enterprise-friendly LLM optimized for NVIDIA GPU infrastructure and tooling.
- You need strong generative text capabilities with models tuned for instruction-following tasks.
- Your use case involves private on-prem deployment where data must remain in-house.
- Your use case involves customizing and fine-tuning models on proprietary domain data.
- You need tight integration with NVIDIA AI Enterprise, NeMo, or NIM microservices stack.
- Your use case involves batch inference workloads where throughput matters more than minimal latency.
Avoid if...
- You need a fully managed, serverless API with no infrastructure or deployment work.
- You need state-of-the-art reasoning benchmarks comparable to the very latest proprietary frontier models.
- Your workload requires ultra-low-latency mobile inference on non-NVIDIA edge hardware.
- You need extensive multimodal capabilities beyond text, such as advanced vision or audio.
- Your workload requires guaranteed long-term model stability without version or weight changes.
- You need turnkey ecosystem plugins and integrations matching the largest commercial LLM platforms.
FAQ
Frequently Asked Questions
-
What is Nemotron 3 Super?
Nemotron 3 Super is an NVIDIA large language model focused on high‑quality text generation and reasoning, accessible through the LLM.API unified gateway.
-
What is Nemotron 3 Super best suited for?
Nemotron 3 Super is best for code generation, data analysis assistance, structured tool-calling workflows, and general-purpose chatbots needing strong reasoning and instruction-following.
-
What is the context window of Nemotron 3 Super?
Nemotron 3 Super supports a context window of up to 8,192 tokens for combined input and output via LLM.API.
-
Which modalities does Nemotron 3 Super support on LLM.API?
Nemotron 3 Super currently supports text-in, text-out interactions only; image, audio, and video inputs are not supported.
-
How fast is Nemotron 3 Super in terms of latency?
Nemotron 3 Super typically returns first tokens within a few hundred milliseconds and can stream responses for lower perceived latency.
-
How is Nemotron 3 Super priced on LLM.API?
Nemotron 3 Super is billed per 1,000 tokens, with separate rates for input and output tokens as defined in your LLM.API pricing plan.
-
How do I call Nemotron 3 Super through the LLM.API?
You select provider "NVIDIA" and model "Nemotron 3 Super" in the LLM.API request payload, then send standard chat or completion-style requests.
-
How does Nemotron 3 Super compare to similar NVIDIA models?
Nemotron 3 Super targets stronger reasoning and coding performance than smaller Nemotron variants, at higher compute cost but improved quality.
-
What are the main limitations of Nemotron 3 Super?
Nemotron 3 Super can hallucinate facts, lacks real-time internet access, and should not be solely relied on for high-stakes or legally binding decisions.
-
Can I fine-tune Nemotron 3 Super via LLM.API?
Direct fine-tuning is not exposed via LLM.API; instead, you should use techniques like system prompts, few-shot examples, and retrieval-augmented generation.
