Powered by Inception
Mercury 2
- Text Generation
Mercury 2 is a proprietary, diffusion-based large language model (dLLM) from Inception designed for extremely fast reasoning and text generation with a long 128K-token context window.
About the model
What is Mercury 2?
Mercury 2 is a commercial-scale diffusion-based language model by Inception optimized for high-speed reasoning and generation. It is primarily used for code generation, analytical reasoning, and complex automation workflows where low latency is critical. It is also applied in AI agents, search, and business applications that benefit from rapid, large-context processing. Mercury 2 belongs to Inception’s Mercury family of diffusion-based LLMs, succeeding earlier Mercury models and specialized variants such as Mercury Coder.
Model capabilities
5 Core Capabilities
-
Conversational AI
Engages in multi-turn dialogue, answering questions and following instructions while maintaining context across user interactions.
-
Visual Analysis
Processes images to identify objects and scenes, enabling descriptions and basic reasoning about visual content.
-
Text Translation
Translates written content between multiple languages while attempting to preserve meaning and tone.
-
Document OCR
Extracts machine-readable text from images or scanned documents, supporting downstream search or analysis.
-
Content Monitoring
Assists in monitoring streams of textual data for specific topics or issues using pattern matching and basic analysis.
Use cases
6 Most Valuable Use Cases
- Contract Clause Extraction
- Regulatory Change Monitoring
- Financial Invoice Processing
- Customer Support Tagging
- IT Operations Automation
- Procurement Risk Analysis
Transparent pricing
Cost Comparison
LLM API offers the lowest cost and highest performance for Mercury 2–class models.
| Provider | Region | Latency | Throughput | Uptime | Input ($/1M) | Output ($/1M) | Context |
|---|---|---|---|---|---|---|---|
| LLM API BEST | Global | 80ms | 120 tps | 99.99% | $0.40 | $0.80 | 128K tokens |
| Inception | Global | ~150ms | ~60 tps | ~99.9% | ~$0.80 | ~$1.60 | ~64K tokens |
| OpenAI | Global | ~160ms | ~70 tps | ~99.9% | ~$1.00 | ~$2.00 | ~128K tokens |
| Anthropic | US East | ~170ms | ~50 tps | ~99.9% | ~$1.20 | ~$2.40 | ~200K tokens |
Performance benchmarks
Technical Specifications
| Metric | Mercury 2 (Inception) | GPT-4.1 (OpenAI) | Claude 3.5 Sonnet (Anthropic) |
|---|---|---|---|
| Avg Latency | ~180ms | ~220ms | ~250ms |
| Context Window | 128K | 128K | 200K |
| Input Price ($/1M) | $0.80 | $5.00 | $3.00 |
| Output Price ($/1M) | $2.40 | $15.00 | $15.00 |
| Max Output Tokens | 8K | 4K | 8K |
| Throughput | 80 tps | 40 tps | 50 tps |
| Uptime | 99.9% | 99.9% | 99.9% |
30-day usage via LLM API
- 7.5B
- Prompt tokens processed (last 30 days)
- 5.1B
- Completion tokens generated (last 30 days)
- 22.4M
- API requests served (last 30 days)
- 98.9%
- Avg uptime (last 30 days)
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
-
Intelligent Model Routing
Automatically route each request to the best model across providers based on latency, cost, and quality—without changing your integration or redeploying code.
One endpoint, every model -
Cost-Aware Orchestration
Control and predict spend with per-route pricing policies, budget guards, and automatic downshifts to cheaper models when quality thresholds are still met.
Optimize every token -
Resilient Fallback Logic
Survive provider outages and rate limits with built-in multi-region, multi-model failover so your app keeps responding even when an upstream service doesn’t.
Always-on reliability -
Full-Stack Observability
Trace every request across models and providers with logs, latency breakdowns, and error analytics to debug faster and continuously tune your routing rules.
See every token hop -
Task-Level Abstractions
Use high-level task APIs for chat, RAG, tools, and more so you can swap underlying models without rewriting prompts or business logic.
Tasks, not raw calls -
High-Throughput Batch
Process massive workloads efficiently with parallelized, rate-limit-aware batch execution, automatic retries, and deduplicated inputs for lower cost and higher throughput.
Ship at batch scale
Decision guide
When to Use — When NOT to Use
Use it if...
- You need a general-purpose model from Inception already integrated into your infrastructure.
- You need consistent behavior across many small automation tasks with moderate reasoning complexity.
- Your use case involves standard customer-support chatbots that follow clear, pre-defined workflows.
- Your use case involves drafting routine business content like emails, summaries, and reports.
- Your use case involves running batch inference jobs where predictable costs matter more than peak capability.
- You need a model suited for prototyping generic AI features before optimizing with specialized systems.
Avoid if...
- You need state-of-the-art reasoning on complex scientific, mathematical, or legal problems.
- You need guaranteed compliance with strict, audited industry regulations such as HIPAA or PCI-DSS.
- Your workload requires ultra-low-latency real-time interactions for high-frequency trading or control systems.
- Your workload requires on-device or fully offline inference without any external API dependency.
- You need a highly specialized vision, speech, or code model rather than a generalist.
- Your workload requires verifiable tool-calling support aligned exactly with another provider’s proprietary schema.
FAQ
Frequently Asked Questions
-
What is Mercury 2?
Mercury 2 is an Inception large language model accessible via LLM.API, designed for fast, cost-efficient general-purpose text generation and reasoning.
-
What types of tasks is Mercury 2 best suited for?
Mercury 2 is best for code generation, step-by-step reasoning, chatbot-style conversations, and structured text transformations like summarization or extraction.
-
What is the context window of Mercury 2?
Mercury 2 supports a 32K token context window, allowing it to handle long documents, multi-step tools, and extended conversations reliably.
-
How fast is Mercury 2 in terms of latency and throughput?
Mercury 2 is optimized for low p95 latency and high token throughput, making it suitable for interactive applications and high-traffic backends.
-
Which input and output modalities does Mercury 2 support?
Mercury 2 currently supports text input and text output only, with no native image, audio, or video processing.
-
How is Mercury 2 priced when accessed through LLM.API?
Mercury 2 uses LLM.API’s unified token-based pricing, with separate rates for input and output tokens configurable per project in your LLM.API dashboard.
-
How do I call Mercury 2 through the LLM.API?
Use the chat or completions endpoint with `model` set to `inception/mercury-2`, passing your prompt, optional system instructions, and any tool definitions.
-
How does Mercury 2 compare to similar mid-sized general-purpose models?
Mercury 2 targets a balance of quality and speed, typically trading slightly lower peak capability for materially lower cost and latency.
-
Does Mercury 2 support tools, function calling, or structured outputs?
Mercury 2 supports JSON-structured outputs and standard tool or function-calling semantics via LLM.API’s unified tool-calling interface.
-
What are the main limitations of Mercury 2?
Mercury 2 can hallucinate facts, lacks real-time knowledge or browsing, and is not suitable for safety-critical or compliance-required decision-making without human review.
