Trinity Mini

Text Embeddings

Trinity Mini is a 26B-parameter (3B active) sparse mixture-of-experts language model from Arcee AI, optimized for efficient long-context reasoning with low per-token cost. It is an open-weight model designed for enterprise and enthusiast use across tools, agents, and high-throughput applications.

Start Using API

API Performance

Latency: ~0.9s avg response
Context: 131K token context
Input: $0.045 per 1M tokens
Output: $0.150 per 1M tokens
Uptime: 99% 99%

About the model

What is Trinity Mini?

Trinity Mini is a 26B-parameter sparse MoE language model from Arcee AI with about 3B parameters active per token for efficient inference. It is primarily used for reasoning-intensive text generation, such as analytical chat, planning, and complex problem solving, while maintaining strong performance on long-context workloads up to around 131k tokens. It is also applied in function calling and multi-step agent workflows where structured tool use and low latency are important. Trinity Mini is the medium-sized model in Arcee AI’s Trinity open-weight family, sitting between Trinity Nano and larger Trinity variants.

Input / Output

Input

Text prompts

Output

Text responses

Model capabilities

5 Core Capabilities

Conversational Chat

Handles general dialogue and instruction-following tasks as a text-only large language model for interactive chat-based applications.
Long-Context Reasoning

Performs efficient reasoning and generation over long contexts around 128k–131k tokens using a sparse mixture-of-experts architecture.
Function Calling

Supports structured tool and function calling, enabling multi-step agent workflows and schema-based integrations with external systems.
Structured Output

Generates well-structured, machine-readable text such as JSON or classified labels suitable for automation, evaluation, and downstream processing.
Multilingual Text

Processes and generates text in multiple languages, enabling cross-lingual chat, drafting, and localization workflows from a single model.

Use cases

6 Most Valuable Use Cases

Enterprise Chatbots
Invoice / Document Parsing
Legal Case Research
Regulation Change Monitoring
Customer Support Triage
Agentic Tool Orchestration

Transparent pricing

Cost Comparison

LLM API offers the lowest cost and fastest access for Trinity Mini-class models.

Provider	Region	Latency	Throughput	Uptime	Input ($/1M)	Output ($/1M)	Context
LLM API BEST	Global	80ms	~220 tps	99.99%	$0.12	$0.12	~128K tokens
Arcee AI	US East	~160ms	~120 tps	99.9%	~$0.25	~$0.25	~64K tokens
AWS Bedrock (Trinity Mini-equivalent)	US West	~190ms	~150 tps	99.9%	~$0.30	~$0.30	~128K tokens
Azure OpenAI (Trinity Mini-equivalent)	EU West	~220ms	~100 tps	99.95%	~$0.35	~$0.35	~128K tokens
Vertex AI (Trinity Mini-equivalent)	Global	~210ms	~130 tps	99.9%	~$0.32	~$0.32	~64K tokens

Performance benchmarks

Technical Specifications

Metric	Trinity Mini (Arcee AI)	GPT-4o Mini (OpenAI)	Gemini 1.5 Flash (Google)
Avg Latency	~180ms	~200ms	~220ms
Context Window	128K	128K	1M
Input Price ($/1M tokens)	~$0.10	~$0.15	~$0.15
Output Price ($/1M tokens)	~$0.15	~$0.60	~$0.60
Max Output Tokens	4K	16K	8K
Throughput	~80 tps	~60 tps	~70 tps
Uptime	~99.9%	~99.9%	~99.9%

30-day usage via LLM API

320M: Prompt tokens processed (last 30 days)
3.8M: Completion tokens generated
410K: API requests served
99.7%: Avg uptime

Start Using API

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Unified AI Routing

Automatically send each request to the optimal model across providers based on latency, quality, and cost. One endpoint, dynamic policies, no code rewrites.
One endpoint, any model
Cost-Aware Orchestration

Control spend with fine-grained rate limits, model tiering, and smart downgrades. Keep performance high while staying within strict budget and quota constraints.
Predictable, optimized spend
Resilient Fallback Flows

Design multi-step failover chains across providers so requests keep succeeding through outages, rate limits, or timeouts—without changing your application code.
Never fail on one model
End-to-End Observability

Inspect tokens, latencies, errors, and provider usage in one place. Quickly debug incidents, tune routing rules, and prove reliability to stakeholders.
One pane of glass
Task-Level Abstractions

Call high-level tasks like chat, tools, or rerank without vendor-specific boilerplate. Swap models freely while keeping a single, stable application contract.
Code to tasks, not vendors
High-Throughput Batch Jobs

Process millions of inferences via optimized batching with concurrency control, retries, and partial failure handling built in—no custom job infrastructure required.
Scale inference, not ops

Decision guide

When to Use — When NOT to Use

Use it if...

You need a compact language model suitable for on-device or edge deployments.
You need cost-efficient inference for high-volume simple chatbots or assistants.
Your use case involves lightweight text classification, tagging, or intent detection pipelines.
Your use case involves fine-tuning a small model with domain-specific datasets.
You need fast inference for autocomplete, query rewriting, or basic summarization tasks.

Avoid if...

You need state-of-the-art reasoning performance on complex, multi-step analytical tasks.
Your workload requires handling very long context windows with high factual reliability.
You need advanced multimodal capabilities like image understanding or video reasoning.
You need best-in-class coding assistance across many languages and large codebases.
Your workload requires strong safety guardrails and enterprise-grade compliance guarantees out-of-the-box.

FAQ

Frequently Asked Questions

What is Trinity Mini?

Trinity Mini is a 26B-parameter sparse mixture-of-experts language model by Arcee AI with about 3B active parameters optimized for efficient reasoning over long contexts.
What is the context window of Trinity Mini?

Trinity Mini supports a context window of approximately 131K tokens, enabling long documents, multi-step workflows, and extended multi-turn conversations.
What does Trinity Mini cost to use on LLM.API?

On LLM.API, Trinity Mini typically follows Arcee AI’s pricing of about $0.04–$0.045 per million input tokens and $0.15 per million output tokens, plus gateway overhead.
What is Trinity Mini best suited for?

Trinity Mini is best for long-context reasoning, structured outputs, tool or function calling, and cost-efficient general-purpose chat and automation agents.
Which modalities does Trinity Mini support?

Trinity Mini is a text-only model that accepts text prompts and returns text completions; it does not natively process images, audio, or video.
How fast is Trinity Mini in terms of latency and throughput?

Thanks to its sparse MoE design, Trinity Mini usually delivers fast token throughput comparable to small dense models while handling significantly longer contexts.
How do I call Trinity Mini through LLM.API?

Set the model identifier to the Trinity Mini slug provided by LLM.API in your completion or chat endpoint call, passing prompts and parameters as usual.
How does Trinity Mini compare to larger Trinity models?

Compared with Trinity Large variants, Trinity Mini is cheaper and lighter with slightly lower peak reasoning quality but similar long-context capabilities for many workloads.
What are the main limitations of Trinity Mini?

Trinity Mini can still hallucinate, lacks up-to-the-minute world knowledge, is not fine-tuned for code to the level of specialist coder models, and is text-only.
Does Trinity Mini support function calling and tool use via LLM.API?

Yes, when used through LLM.API, Trinity Mini can be driven with JSON schemas or tool definitions to perform function calling and multi-step tool-using workflows.

Start in 2 lines of code

Get My API Key

Trinity Mini

What is Trinity Mini?

5 Core Capabilities

Conversational Chat

Long-Context Reasoning

Function Calling

Structured Output

Multilingual Text

6 Most Valuable Use Cases

Cost Comparison

Technical Specifications

Why Build on LLM.API?

Unified AI Routing

Cost-Aware Orchestration

Resilient Fallback Flows

End-to-End Observability

Task-Level Abstractions

High-Throughput Batch Jobs

When to Use — When NOT to Use

Use it if...

Avoid if...

Start in 2 lines of code