Powered by Arcee AI
Trinity Mini
Trinity Mini is a 26B-parameter (3B active) sparse mixture-of-experts language model from Arcee AI, optimized for efficient long-context reasoning with low per-token cost. It is an open-weight model designed for enterprise and enthusiast use across tools, agents, and high-throughput applications.
About the model
What is Trinity Mini?
Trinity Mini is a 26B-parameter sparse MoE language model from Arcee AI with about 3B parameters active per token for efficient inference. It is primarily used for reasoning-intensive text generation, such as analytical chat, planning, and complex problem solving, while maintaining strong performance on long-context workloads up to around 131k tokens. It is also applied in function calling and multi-step agent workflows where structured tool use and low latency are important. Trinity Mini is the medium-sized model in Arcee AI’s Trinity open-weight family, sitting between Trinity Nano and larger Trinity variants.
Model capabilities
5 Core Capabilities
-
Conversational Chat
Handles general dialogue and instruction-following tasks as a text-only large language model for interactive chat-based applications.
-
Long-Context Reasoning
Performs efficient reasoning and generation over long contexts around 128k–131k tokens using a sparse mixture-of-experts architecture.
-
Function Calling
Supports structured tool and function calling, enabling multi-step agent workflows and schema-based integrations with external systems.
-
Structured Output
Generates well-structured, machine-readable text such as JSON or classified labels suitable for automation, evaluation, and downstream processing.
-
Multilingual Text
Processes and generates text in multiple languages, enabling cross-lingual chat, drafting, and localization workflows from a single model.
Use cases
6 Most Valuable Use Cases
- Enterprise Chatbots
- Invoice / Document Parsing
- Legal Case Research
- Regulation Change Monitoring
- Customer Support Triage
- Agentic Tool Orchestration
Transparent pricing
Cost Comparison
LLM API offers the lowest cost and fastest access for Trinity Mini-class models.
| Provider | Region | Latency | Throughput | Uptime | Input ($/1M) | Output ($/1M) | Context |
|---|---|---|---|---|---|---|---|
| LLM API BEST | Global | 80ms | ~220 tps | 99.99% | $0.12 | $0.12 | ~128K tokens |
| Arcee AI | US East | ~160ms | ~120 tps | 99.9% | ~$0.25 | ~$0.25 | ~64K tokens |
| AWS Bedrock (Trinity Mini-equivalent) | US West | ~190ms | ~150 tps | 99.9% | ~$0.30 | ~$0.30 | ~128K tokens |
| Azure OpenAI (Trinity Mini-equivalent) | EU West | ~220ms | ~100 tps | 99.95% | ~$0.35 | ~$0.35 | ~128K tokens |
| Vertex AI (Trinity Mini-equivalent) | Global | ~210ms | ~130 tps | 99.9% | ~$0.32 | ~$0.32 | ~64K tokens |
Performance benchmarks
Technical Specifications
| Metric | Trinity Mini (Arcee AI) | GPT-4o Mini (OpenAI) | Gemini 1.5 Flash (Google) |
|---|---|---|---|
| Avg Latency | ~180ms | ~200ms | ~220ms |
| Context Window | 128K | 128K | 1M |
| Input Price ($/1M tokens) | ~$0.10 | ~$0.15 | ~$0.15 |
| Output Price ($/1M tokens) | ~$0.15 | ~$0.60 | ~$0.60 |
| Max Output Tokens | 4K | 16K | 8K |
| Throughput | ~80 tps | ~60 tps | ~70 tps |
| Uptime | ~99.9% | ~99.9% | ~99.9% |
30-day usage via LLM API
- 320M
- Prompt tokens processed (last 30 days)
- 3.8M
- Completion tokens generated
- 410K
- API requests served
- 99.7%
- Avg uptime
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
-
Unified AI Routing
Automatically send each request to the optimal model across providers based on latency, quality, and cost. One endpoint, dynamic policies, no code rewrites.
One endpoint, any model -
Cost-Aware Orchestration
Control spend with fine-grained rate limits, model tiering, and smart downgrades. Keep performance high while staying within strict budget and quota constraints.
Predictable, optimized spend -
Resilient Fallback Flows
Design multi-step failover chains across providers so requests keep succeeding through outages, rate limits, or timeouts—without changing your application code.
Never fail on one model -
End-to-End Observability
Inspect tokens, latencies, errors, and provider usage in one place. Quickly debug incidents, tune routing rules, and prove reliability to stakeholders.
One pane of glass -
Task-Level Abstractions
Call high-level tasks like chat, tools, or rerank without vendor-specific boilerplate. Swap models freely while keeping a single, stable application contract.
Code to tasks, not vendors -
High-Throughput Batch Jobs
Process millions of inferences via optimized batching with concurrency control, retries, and partial failure handling built in—no custom job infrastructure required.
Scale inference, not ops
Decision guide
When to Use — When NOT to Use
Use it if...
- You need a compact language model suitable for on-device or edge deployments.
- You need cost-efficient inference for high-volume simple chatbots or assistants.
- Your use case involves lightweight text classification, tagging, or intent detection pipelines.
- Your use case involves fine-tuning a small model with domain-specific datasets.
- You need fast inference for autocomplete, query rewriting, or basic summarization tasks.
Avoid if...
- You need state-of-the-art reasoning performance on complex, multi-step analytical tasks.
- Your workload requires handling very long context windows with high factual reliability.
- You need advanced multimodal capabilities like image understanding or video reasoning.
- You need best-in-class coding assistance across many languages and large codebases.
- Your workload requires strong safety guardrails and enterprise-grade compliance guarantees out-of-the-box.
FAQ
Frequently Asked Questions
-
What is Trinity Mini?
Trinity Mini is a 26B-parameter sparse mixture-of-experts language model by Arcee AI with about 3B active parameters optimized for efficient reasoning over long contexts.
-
What is the context window of Trinity Mini?
Trinity Mini supports a context window of approximately 131K tokens, enabling long documents, multi-step workflows, and extended multi-turn conversations.
-
What does Trinity Mini cost to use on LLM.API?
On LLM.API, Trinity Mini typically follows Arcee AI’s pricing of about $0.04–$0.045 per million input tokens and $0.15 per million output tokens, plus gateway overhead.
-
What is Trinity Mini best suited for?
Trinity Mini is best for long-context reasoning, structured outputs, tool or function calling, and cost-efficient general-purpose chat and automation agents.
-
Which modalities does Trinity Mini support?
Trinity Mini is a text-only model that accepts text prompts and returns text completions; it does not natively process images, audio, or video.
-
How fast is Trinity Mini in terms of latency and throughput?
Thanks to its sparse MoE design, Trinity Mini usually delivers fast token throughput comparable to small dense models while handling significantly longer contexts.
-
How do I call Trinity Mini through LLM.API?
Set the model identifier to the Trinity Mini slug provided by LLM.API in your completion or chat endpoint call, passing prompts and parameters as usual.
-
How does Trinity Mini compare to larger Trinity models?
Compared with Trinity Large variants, Trinity Mini is cheaper and lighter with slightly lower peak reasoning quality but similar long-context capabilities for many workloads.
-
What are the main limitations of Trinity Mini?
Trinity Mini can still hallucinate, lacks up-to-the-minute world knowledge, is not fine-tuned for code to the level of specialist coder models, and is text-only.
-
Does Trinity Mini support function calling and tool use via LLM.API?
Yes, when used through LLM.API, Trinity Mini can be driven with JSON schemas or tool definitions to perform function calling and multi-step tool-using workflows.
