Nemotron 3 Super

Text Generation

Nemotron 3 Super is NVIDIA’s open-weight, 120B-parameter hybrid Mamba-Transformer Mixture-of-Experts language model optimized for high-throughput agentic reasoning workloads. It is notable for combining LatentMoE experts, long-context support, and efficient NVFP4 training to deliver competitive accuracy with substantially higher inference efficiency than comparable open models.

Start Using API

API Performance

Latency: ~0.8s time to first token
Context: ~16K token context
Input: Free per 1M tokens
Output: Free per 1M tokens
Uptime: 99% 99%

About the model

What is Nemotron 3 Super?

Nemotron 3 Super is a 120B-parameter (12B active) open Mixture-of-Experts hybrid Mamba-Attention large language model from NVIDIA, designed for efficient, high-quality agentic reasoning. It is primarily used for building autonomous AI agents that perform multi-step reasoning, tool use, and long-running workflows in domains like software engineering, data analysis, and complex enterprise automation. It is also used as a foundation text model for high-throughput, long-context applications such as large document understanding and large-scale code generation on NVIDIA GPU infrastructure. It is part of the Nemotron 3 family of open models, alongside smaller Nano and larger Ultra variants that share common training data, recipes, and architecture principles.

Input / Output

Input

Text prompts (natural language or structured text)

Output

Generated text completions and chat-style responses
Generated source code and code completions

Model capabilities

5 Core Capabilities

Agentic Reasoning

Supports agent-style workflows, enabling planning, tool use, and multi-step decision-making for complex autonomous and semi-autonomous AI agents.
Advanced Conversation

Acts as a large language model optimized for natural, multi-turn dialogue with strong instruction following and contextual understanding.
Long-Context Handling

Processes and reasons over very long text contexts, supporting extended documents, workflows, and multi-document inputs in a single session.
Efficient Inference

Hybrid Mamba-Transformer Mixture-of-Experts architecture with multi-token prediction enables high-throughput, low-latency text generation at scale.
Multilingual Text

Handles multiple languages, enabling understanding and generation across diverse linguistic inputs for global applications and datasets.

Use cases

6 Most Valuable Use Cases

Enterprise AI Agents
Complex Workflow Orchestration
Long-Context Document Analysis
Multistep Tool-Using Agents
Reasoning-Heavy Chatbots
Code Reasoning Assistance

Transparent pricing

Cost Comparison

LLM API delivers the lowest cost and latency for Nemotron-class models versus major providers.

Provider	Region	Latency	Throughput	Uptime	Input ($/1M)	Output ($/1M)	Context
LLM API BEST	Global	80ms	120 tps	99.99%	$0.20	$0.20	256K
NVIDIA	US West	~160ms	~60 tps	99.9%	~$0.60	~$0.60	128K
AWS Bedrock	US East	~180ms	~45 tps	99.9%	~$0.70	~$0.70	64K
Azure AI	EU West	~190ms	~40 tps	99.9%	~$0.75	~$0.75	128K
Google Cloud	Global	~170ms	~50 tps	99.9%	~$0.80	~$0.80	128K

Performance benchmarks

Technical Specifications

Metric	Nemotron 3 Super (NVIDIA)	GPT-4 Turbo (OpenAI)	Claude 3 Sonnet (Anthropic)
Avg Latency	~350ms	~400ms	~450ms
Context Window	~128K	128K	200K
Input Price ($/1M)	~$0.60	~$0.50	~$0.60
Output Price ($/1M)	~$2.40	~$1.50	~$1.80
Max Output Tokens	~4K	4K	4K
Throughput	~60 tps	~50 tps	~40 tps
Uptime	~99.9%	~99.9%	~99.9%

30-day usage via LLM API

3.8B: Prompt tokens processed (30 days)
2.4B: Completion tokens generated (30 days)
11.5M: API requests served (30 days)
99.8%: Avg uptime (last 30 days)

Start Using API

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Intelligent Model Routing

Automatically route each request to the best-performing model across providers based on latency, cost, and quality—without changing your integration.
One endpoint, optimal model
Cost-Aware Execution

Control spend with transparent per-token accounting, guardrails, and smart selection of cheaper equivalent models while preserving quality for critical workloads.
Optimize every token
Automatic Fallbacks

Keep production flows resilient with built-in provider failover and model-level retries, so transient outages never break your user experience.
Resilient by default
Deep Observability

Inspect latency, errors, tokens, and prompts across providers in one place, enabling fast debugging, regression detection, and performance tuning.
See every request
Task-Level Orchestration

Express complex AI workflows as high-level tasks—grounding, tools, classification, generation—while LLM.API handles prompt shaping, execution, and model differences.
Ship workflows, not glue
High-Throughput Batch

Process millions of requests efficiently with batch APIs that parallelize across providers, maximize throughput, and minimize cost for bulk inference jobs.
Scale jobs to millions

Decision guide

When to Use — When NOT to Use

Use it if...

You need an open, enterprise-friendly LLM optimized for NVIDIA GPU infrastructure and tooling.
You need strong generative text capabilities with models tuned for instruction-following tasks.
Your use case involves private on-prem deployment where data must remain in-house.
Your use case involves customizing and fine-tuning models on proprietary domain data.
You need tight integration with NVIDIA AI Enterprise, NeMo, or NIM microservices stack.
Your use case involves batch inference workloads where throughput matters more than minimal latency.

Avoid if...

You need a fully managed, serverless API with no infrastructure or deployment work.
You need state-of-the-art reasoning benchmarks comparable to the very latest proprietary frontier models.
Your workload requires ultra-low-latency mobile inference on non-NVIDIA edge hardware.
You need extensive multimodal capabilities beyond text, such as advanced vision or audio.
Your workload requires guaranteed long-term model stability without version or weight changes.
You need turnkey ecosystem plugins and integrations matching the largest commercial LLM platforms.

FAQ

Frequently Asked Questions

What is Nemotron 3 Super?

Nemotron 3 Super is an NVIDIA large language model focused on high‑quality text generation and reasoning, accessible through the LLM.API unified gateway.
What is Nemotron 3 Super best suited for?

Nemotron 3 Super is best for code generation, data analysis assistance, structured tool-calling workflows, and general-purpose chatbots needing strong reasoning and instruction-following.
What is the context window of Nemotron 3 Super?

Nemotron 3 Super supports a context window of up to 8,192 tokens for combined input and output via LLM.API.
Which modalities does Nemotron 3 Super support on LLM.API?

Nemotron 3 Super currently supports text-in, text-out interactions only; image, audio, and video inputs are not supported.
How fast is Nemotron 3 Super in terms of latency?

Nemotron 3 Super typically returns first tokens within a few hundred milliseconds and can stream responses for lower perceived latency.
How is Nemotron 3 Super priced on LLM.API?

Nemotron 3 Super is billed per 1,000 tokens, with separate rates for input and output tokens as defined in your LLM.API pricing plan.
How do I call Nemotron 3 Super through the LLM.API?

You select provider "NVIDIA" and model "Nemotron 3 Super" in the LLM.API request payload, then send standard chat or completion-style requests.
How does Nemotron 3 Super compare to similar NVIDIA models?

Nemotron 3 Super targets stronger reasoning and coding performance than smaller Nemotron variants, at higher compute cost but improved quality.
What are the main limitations of Nemotron 3 Super?

Nemotron 3 Super can hallucinate facts, lacks real-time internet access, and should not be solely relied on for high-stakes or legally binding decisions.
Can I fine-tune Nemotron 3 Super via LLM.API?

Direct fine-tuning is not exposed via LLM.API; instead, you should use techniques like system prompts, few-shot examples, and retrieval-augmented generation.

Start in 2 lines of code

Get My API Key

Nemotron 3 Super

What is Nemotron 3 Super?

5 Core Capabilities

Agentic Reasoning

Advanced Conversation

Long-Context Handling

Efficient Inference

Multilingual Text

6 Most Valuable Use Cases

Cost Comparison

Technical Specifications

Why Build on LLM.API?

Intelligent Model Routing

Cost-Aware Execution

Automatic Fallbacks

Deep Observability

Task-Level Orchestration

High-Throughput Batch

When to Use — When NOT to Use

Use it if...

Avoid if...

Start in 2 lines of code