Powered by NVIDIA

Nemotron 3 Super

  • Text Generation

Nemotron 3 Super is NVIDIA’s open-weight, 120B-parameter hybrid Mamba-Transformer Mixture-of-Experts language model optimized for high-throughput agentic reasoning workloads. It is notable for combining LatentMoE experts, long-context support, and efficient NVFP4 training to deliver competitive accuracy with substantially higher inference efficiency than comparable open models.

Start Using API

What is Nemotron 3 Super?

Nemotron 3 Super is a 120B-parameter (12B active) open Mixture-of-Experts hybrid Mamba-Attention large language model from NVIDIA, designed for efficient, high-quality agentic reasoning. It is primarily used for building autonomous AI agents that perform multi-step reasoning, tool use, and long-running workflows in domains like software engineering, data analysis, and complex enterprise automation. It is also used as a foundation text model for high-throughput, long-context applications such as large document understanding and large-scale code generation on NVIDIA GPU infrastructure. It is part of the Nemotron 3 family of open models, alongside smaller Nano and larger Ultra variants that share common training data, recipes, and architecture principles.

5 Core Capabilities

  • Agentic Reasoning

    Supports agent-style workflows, enabling planning, tool use, and multi-step decision-making for complex autonomous and semi-autonomous AI agents.

  • Advanced Conversation

    Acts as a large language model optimized for natural, multi-turn dialogue with strong instruction following and contextual understanding.

  • Long-Context Handling

    Processes and reasons over very long text contexts, supporting extended documents, workflows, and multi-document inputs in a single session.

  • Efficient Inference

    Hybrid Mamba-Transformer Mixture-of-Experts architecture with multi-token prediction enables high-throughput, low-latency text generation at scale.

  • Multilingual Text

    Handles multiple languages, enabling understanding and generation across diverse linguistic inputs for global applications and datasets.

6 Most Valuable Use Cases

  • Enterprise AI Agents
  • Complex Workflow Orchestration
  • Long-Context Document Analysis
  • Multistep Tool-Using Agents
  • Reasoning-Heavy Chatbots
  • Code Reasoning Assistance

Cost Comparison

LLM API delivers the lowest cost and latency for Nemotron-class models versus major providers.

Provider Region Latency Throughput Uptime Input ($/1M) Output ($/1M) Context
LLM API BEST Global 80ms 120 tps 99.99% $0.20 $0.20 256K
NVIDIA US West ~160ms ~60 tps 99.9% ~$0.60 ~$0.60 128K
AWS Bedrock US East ~180ms ~45 tps 99.9% ~$0.70 ~$0.70 64K
Azure AI EU West ~190ms ~40 tps 99.9% ~$0.75 ~$0.75 128K
Google Cloud Global ~170ms ~50 tps 99.9% ~$0.80 ~$0.80 128K

Technical Specifications

Metric Nemotron 3 Super (NVIDIA) GPT-4 Turbo (OpenAI) Claude 3 Sonnet (Anthropic)
Avg Latency ~350ms ~400ms ~450ms
Context Window ~128K 128K 200K
Input Price ($/1M) ~$0.60 ~$0.50 ~$0.60
Output Price ($/1M) ~$2.40 ~$1.50 ~$1.80
Max Output Tokens ~4K 4K 4K
Throughput ~60 tps ~50 tps ~40 tps
Uptime ~99.9% ~99.9% ~99.9%

30-day usage via LLM API

3.8B
Prompt tokens processed (30 days)
2.4B
Completion tokens generated (30 days)
11.5M
API requests served (30 days)
99.8%
Avg uptime (last 30 days)
Start Using API

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

  • Intelligent Model Routing

    Automatically route each request to the best-performing model across providers based on latency, cost, and quality—without changing your integration.

    One endpoint, optimal model
  • Cost-Aware Execution

    Control spend with transparent per-token accounting, guardrails, and smart selection of cheaper equivalent models while preserving quality for critical workloads.

    Optimize every token
  • Automatic Fallbacks

    Keep production flows resilient with built-in provider failover and model-level retries, so transient outages never break your user experience.

    Resilient by default
  • Deep Observability

    Inspect latency, errors, tokens, and prompts across providers in one place, enabling fast debugging, regression detection, and performance tuning.

    See every request
  • Task-Level Orchestration

    Express complex AI workflows as high-level tasks—grounding, tools, classification, generation—while LLM.API handles prompt shaping, execution, and model differences.

    Ship workflows, not glue
  • High-Throughput Batch

    Process millions of requests efficiently with batch APIs that parallelize across providers, maximize throughput, and minimize cost for bulk inference jobs.

    Scale jobs to millions

When to Use — When NOT to Use

Use it if...

  • You need an open, enterprise-friendly LLM optimized for NVIDIA GPU infrastructure and tooling.
  • You need strong generative text capabilities with models tuned for instruction-following tasks.
  • Your use case involves private on-prem deployment where data must remain in-house.
  • Your use case involves customizing and fine-tuning models on proprietary domain data.
  • You need tight integration with NVIDIA AI Enterprise, NeMo, or NIM microservices stack.
  • Your use case involves batch inference workloads where throughput matters more than minimal latency.

Avoid if...

  • You need a fully managed, serverless API with no infrastructure or deployment work.
  • You need state-of-the-art reasoning benchmarks comparable to the very latest proprietary frontier models.
  • Your workload requires ultra-low-latency mobile inference on non-NVIDIA edge hardware.
  • You need extensive multimodal capabilities beyond text, such as advanced vision or audio.
  • Your workload requires guaranteed long-term model stability without version or weight changes.
  • You need turnkey ecosystem plugins and integrations matching the largest commercial LLM platforms.

Frequently Asked Questions

  • What is Nemotron 3 Super?

    Nemotron 3 Super is an NVIDIA large language model focused on high‑quality text generation and reasoning, accessible through the LLM.API unified gateway.

  • What is Nemotron 3 Super best suited for?

    Nemotron 3 Super is best for code generation, data analysis assistance, structured tool-calling workflows, and general-purpose chatbots needing strong reasoning and instruction-following.

  • What is the context window of Nemotron 3 Super?

    Nemotron 3 Super supports a context window of up to 8,192 tokens for combined input and output via LLM.API.

  • Which modalities does Nemotron 3 Super support on LLM.API?

    Nemotron 3 Super currently supports text-in, text-out interactions only; image, audio, and video inputs are not supported.

  • How fast is Nemotron 3 Super in terms of latency?

    Nemotron 3 Super typically returns first tokens within a few hundred milliseconds and can stream responses for lower perceived latency.

  • How is Nemotron 3 Super priced on LLM.API?

    Nemotron 3 Super is billed per 1,000 tokens, with separate rates for input and output tokens as defined in your LLM.API pricing plan.

  • How do I call Nemotron 3 Super through the LLM.API?

    You select provider "NVIDIA" and model "Nemotron 3 Super" in the LLM.API request payload, then send standard chat or completion-style requests.

  • How does Nemotron 3 Super compare to similar NVIDIA models?

    Nemotron 3 Super targets stronger reasoning and coding performance than smaller Nemotron variants, at higher compute cost but improved quality.

  • What are the main limitations of Nemotron 3 Super?

    Nemotron 3 Super can hallucinate facts, lacks real-time internet access, and should not be solely relied on for high-stakes or legally binding decisions.

  • Can I fine-tune Nemotron 3 Super via LLM.API?

    Direct fine-tuning is not exposed via LLM.API; instead, you should use techniques like system prompts, few-shot examples, and retrieval-augmented generation.

Start in 2 lines of code

Get My API Key