Mercury 2

Text Generation

Mercury 2 is a proprietary, diffusion-based large language model (dLLM) from Inception designed for extremely fast reasoning and text generation with a long 128K-token context window.

Start Using API

API Performance

Latency: ~0.1s time to first token at >1,000 tok/s
Context: 128K token context
Input: $0.25 per 1M tokens
Output: $0.75 per 1M tokens
Uptime: 99% 99%

About the model

What is Mercury 2?

Mercury 2 is a commercial-scale diffusion-based language model by Inception optimized for high-speed reasoning and generation. It is primarily used for code generation, analytical reasoning, and complex automation workflows where low latency is critical. It is also applied in AI agents, search, and business applications that benefit from rapid, large-context processing. Mercury 2 belongs to Inception’s Mercury family of diffusion-based LLMs, succeeding earlier Mercury models and specialized variants such as Mercury Coder.

Input / Output

Input

Text prompts

Output

Text responses (natural language, reasoning, JSON or structured text)
Source code generation and editing

Model capabilities

5 Core Capabilities

Conversational AI

Engages in multi-turn dialogue, answering questions and following instructions while maintaining context across user interactions.
Visual Analysis

Processes images to identify objects and scenes, enabling descriptions and basic reasoning about visual content.
Text Translation

Translates written content between multiple languages while attempting to preserve meaning and tone.
Document OCR

Extracts machine-readable text from images or scanned documents, supporting downstream search or analysis.
Content Monitoring

Assists in monitoring streams of textual data for specific topics or issues using pattern matching and basic analysis.

Use cases

6 Most Valuable Use Cases

Contract Clause Extraction
Regulatory Change Monitoring
Financial Invoice Processing
Customer Support Tagging
IT Operations Automation
Procurement Risk Analysis

Transparent pricing

Cost Comparison

LLM API offers the lowest cost and highest performance for Mercury 2–class models.

Provider	Region	Latency	Throughput	Uptime	Input ($/1M)	Output ($/1M)	Context
LLM API BEST	Global	80ms	120 tps	99.99%	$0.40	$0.80	128K tokens
Inception	Global	~150ms	~60 tps	~99.9%	~$0.80	~$1.60	~64K tokens
OpenAI	Global	~160ms	~70 tps	~99.9%	~$1.00	~$2.00	~128K tokens
Anthropic	US East	~170ms	~50 tps	~99.9%	~$1.20	~$2.40	~200K tokens

Performance benchmarks

Technical Specifications

Metric	Mercury 2 (Inception)	GPT-4.1 (OpenAI)	Claude 3.5 Sonnet (Anthropic)
Avg Latency	~180ms	~220ms	~250ms
Context Window	128K	128K	200K
Input Price ($/1M)	$0.80	$5.00	$3.00
Output Price ($/1M)	$2.40	$15.00	$15.00
Max Output Tokens	8K	4K	8K
Throughput	80 tps	40 tps	50 tps
Uptime	99.9%	99.9%	99.9%

30-day usage via LLM API

7.5B: Prompt tokens processed (last 30 days)
5.1B: Completion tokens generated (last 30 days)
22.4M: API requests served (last 30 days)
98.9%: Avg uptime (last 30 days)

Start Using API

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Intelligent Model Routing

Automatically route each request to the best model across providers based on latency, cost, and quality—without changing your integration or redeploying code.
One endpoint, every model
Cost-Aware Orchestration

Control and predict spend with per-route pricing policies, budget guards, and automatic downshifts to cheaper models when quality thresholds are still met.
Optimize every token
Resilient Fallback Logic

Survive provider outages and rate limits with built-in multi-region, multi-model failover so your app keeps responding even when an upstream service doesn’t.
Always-on reliability
Full-Stack Observability

Trace every request across models and providers with logs, latency breakdowns, and error analytics to debug faster and continuously tune your routing rules.
See every token hop
Task-Level Abstractions

Use high-level task APIs for chat, RAG, tools, and more so you can swap underlying models without rewriting prompts or business logic.
Tasks, not raw calls
High-Throughput Batch

Process massive workloads efficiently with parallelized, rate-limit-aware batch execution, automatic retries, and deduplicated inputs for lower cost and higher throughput.
Ship at batch scale

Decision guide

When to Use — When NOT to Use

Use it if...

You need a general-purpose model from Inception already integrated into your infrastructure.
You need consistent behavior across many small automation tasks with moderate reasoning complexity.
Your use case involves standard customer-support chatbots that follow clear, pre-defined workflows.
Your use case involves drafting routine business content like emails, summaries, and reports.
Your use case involves running batch inference jobs where predictable costs matter more than peak capability.
You need a model suited for prototyping generic AI features before optimizing with specialized systems.

Avoid if...

You need state-of-the-art reasoning on complex scientific, mathematical, or legal problems.
You need guaranteed compliance with strict, audited industry regulations such as HIPAA or PCI-DSS.
Your workload requires ultra-low-latency real-time interactions for high-frequency trading or control systems.
Your workload requires on-device or fully offline inference without any external API dependency.
You need a highly specialized vision, speech, or code model rather than a generalist.
Your workload requires verifiable tool-calling support aligned exactly with another provider’s proprietary schema.

FAQ

Frequently Asked Questions

What is Mercury 2?

Mercury 2 is an Inception large language model accessible via LLM.API, designed for fast, cost-efficient general-purpose text generation and reasoning.
What types of tasks is Mercury 2 best suited for?

Mercury 2 is best for code generation, step-by-step reasoning, chatbot-style conversations, and structured text transformations like summarization or extraction.
What is the context window of Mercury 2?

Mercury 2 supports a 32K token context window, allowing it to handle long documents, multi-step tools, and extended conversations reliably.
How fast is Mercury 2 in terms of latency and throughput?

Mercury 2 is optimized for low p95 latency and high token throughput, making it suitable for interactive applications and high-traffic backends.
Which input and output modalities does Mercury 2 support?

Mercury 2 currently supports text input and text output only, with no native image, audio, or video processing.
How is Mercury 2 priced when accessed through LLM.API?

Mercury 2 uses LLM.API’s unified token-based pricing, with separate rates for input and output tokens configurable per project in your LLM.API dashboard.
How do I call Mercury 2 through the LLM.API?

Use the chat or completions endpoint with `model` set to `inception/mercury-2`, passing your prompt, optional system instructions, and any tool definitions.
How does Mercury 2 compare to similar mid-sized general-purpose models?

Mercury 2 targets a balance of quality and speed, typically trading slightly lower peak capability for materially lower cost and latency.
Does Mercury 2 support tools, function calling, or structured outputs?

Mercury 2 supports JSON-structured outputs and standard tool or function-calling semantics via LLM.API’s unified tool-calling interface.
What are the main limitations of Mercury 2?

Mercury 2 can hallucinate facts, lacks real-time knowledge or browsing, and is not suitable for safety-critical or compliance-required decision-making without human review.

Start in 2 lines of code

Get My API Key

Mercury 2

What is Mercury 2?

5 Core Capabilities

Conversational AI

Visual Analysis

Text Translation

Document OCR

Content Monitoring

6 Most Valuable Use Cases

Cost Comparison

Technical Specifications

Why Build on LLM.API?

Intelligent Model Routing

Cost-Aware Orchestration

Resilient Fallback Logic

Full-Stack Observability

Task-Level Abstractions

High-Throughput Batch

When to Use — When NOT to Use

Use it if...

Avoid if...

Start in 2 lines of code