Powered by Arcee AI

Trinity Mini

  • Text Embeddings

Trinity Mini is a 26B-parameter (3B active) sparse mixture-of-experts language model from Arcee AI, optimized for efficient long-context reasoning with low per-token cost. It is an open-weight model designed for enterprise and enthusiast use across tools, agents, and high-throughput applications.

Start Using API

What is Trinity Mini?

Trinity Mini is a 26B-parameter sparse MoE language model from Arcee AI with about 3B parameters active per token for efficient inference. It is primarily used for reasoning-intensive text generation, such as analytical chat, planning, and complex problem solving, while maintaining strong performance on long-context workloads up to around 131k tokens. It is also applied in function calling and multi-step agent workflows where structured tool use and low latency are important. Trinity Mini is the medium-sized model in Arcee AI’s Trinity open-weight family, sitting between Trinity Nano and larger Trinity variants.

5 Core Capabilities

  • Conversational Chat

    Handles general dialogue and instruction-following tasks as a text-only large language model for interactive chat-based applications.

  • Long-Context Reasoning

    Performs efficient reasoning and generation over long contexts around 128k–131k tokens using a sparse mixture-of-experts architecture.

  • Function Calling

    Supports structured tool and function calling, enabling multi-step agent workflows and schema-based integrations with external systems.

  • Structured Output

    Generates well-structured, machine-readable text such as JSON or classified labels suitable for automation, evaluation, and downstream processing.

  • Multilingual Text

    Processes and generates text in multiple languages, enabling cross-lingual chat, drafting, and localization workflows from a single model.

6 Most Valuable Use Cases

  • Enterprise Chatbots
  • Invoice / Document Parsing
  • Legal Case Research
  • Regulation Change Monitoring
  • Customer Support Triage
  • Agentic Tool Orchestration

Cost Comparison

LLM API offers the lowest cost and fastest access for Trinity Mini-class models.

Provider Region Latency Throughput Uptime Input ($/1M) Output ($/1M) Context
LLM API BEST Global 80ms ~220 tps 99.99% $0.12 $0.12 ~128K tokens
Arcee AI US East ~160ms ~120 tps 99.9% ~$0.25 ~$0.25 ~64K tokens
AWS Bedrock (Trinity Mini-equivalent) US West ~190ms ~150 tps 99.9% ~$0.30 ~$0.30 ~128K tokens
Azure OpenAI (Trinity Mini-equivalent) EU West ~220ms ~100 tps 99.95% ~$0.35 ~$0.35 ~128K tokens
Vertex AI (Trinity Mini-equivalent) Global ~210ms ~130 tps 99.9% ~$0.32 ~$0.32 ~64K tokens

Technical Specifications

Metric Trinity Mini (Arcee AI) GPT-4o Mini (OpenAI) Gemini 1.5 Flash (Google)
Avg Latency ~180ms ~200ms ~220ms
Context Window 128K 128K 1M
Input Price ($/1M tokens) ~$0.10 ~$0.15 ~$0.15
Output Price ($/1M tokens) ~$0.15 ~$0.60 ~$0.60
Max Output Tokens 4K 16K 8K
Throughput ~80 tps ~60 tps ~70 tps
Uptime ~99.9% ~99.9% ~99.9%

30-day usage via LLM API

320M
Prompt tokens processed (last 30 days)
3.8M
Completion tokens generated
410K
API requests served
99.7%
Avg uptime
Start Using API

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

  • Unified AI Routing

    Automatically send each request to the optimal model across providers based on latency, quality, and cost. One endpoint, dynamic policies, no code rewrites.

    One endpoint, any model
  • Cost-Aware Orchestration

    Control spend with fine-grained rate limits, model tiering, and smart downgrades. Keep performance high while staying within strict budget and quota constraints.

    Predictable, optimized spend
  • Resilient Fallback Flows

    Design multi-step failover chains across providers so requests keep succeeding through outages, rate limits, or timeouts—without changing your application code.

    Never fail on one model
  • End-to-End Observability

    Inspect tokens, latencies, errors, and provider usage in one place. Quickly debug incidents, tune routing rules, and prove reliability to stakeholders.

    One pane of glass
  • Task-Level Abstractions

    Call high-level tasks like chat, tools, or rerank without vendor-specific boilerplate. Swap models freely while keeping a single, stable application contract.

    Code to tasks, not vendors
  • High-Throughput Batch Jobs

    Process millions of inferences via optimized batching with concurrency control, retries, and partial failure handling built in—no custom job infrastructure required.

    Scale inference, not ops

When to Use — When NOT to Use

Use it if...

  • You need a compact language model suitable for on-device or edge deployments.
  • You need cost-efficient inference for high-volume simple chatbots or assistants.
  • Your use case involves lightweight text classification, tagging, or intent detection pipelines.
  • Your use case involves fine-tuning a small model with domain-specific datasets.
  • You need fast inference for autocomplete, query rewriting, or basic summarization tasks.

Avoid if...

  • You need state-of-the-art reasoning performance on complex, multi-step analytical tasks.
  • Your workload requires handling very long context windows with high factual reliability.
  • You need advanced multimodal capabilities like image understanding or video reasoning.
  • You need best-in-class coding assistance across many languages and large codebases.
  • Your workload requires strong safety guardrails and enterprise-grade compliance guarantees out-of-the-box.

Frequently Asked Questions

  • What is Trinity Mini?

    Trinity Mini is a 26B-parameter sparse mixture-of-experts language model by Arcee AI with about 3B active parameters optimized for efficient reasoning over long contexts.

  • What is the context window of Trinity Mini?

    Trinity Mini supports a context window of approximately 131K tokens, enabling long documents, multi-step workflows, and extended multi-turn conversations.

  • What does Trinity Mini cost to use on LLM.API?

    On LLM.API, Trinity Mini typically follows Arcee AI’s pricing of about $0.04–$0.045 per million input tokens and $0.15 per million output tokens, plus gateway overhead.

  • What is Trinity Mini best suited for?

    Trinity Mini is best for long-context reasoning, structured outputs, tool or function calling, and cost-efficient general-purpose chat and automation agents.

  • Which modalities does Trinity Mini support?

    Trinity Mini is a text-only model that accepts text prompts and returns text completions; it does not natively process images, audio, or video.

  • How fast is Trinity Mini in terms of latency and throughput?

    Thanks to its sparse MoE design, Trinity Mini usually delivers fast token throughput comparable to small dense models while handling significantly longer contexts.

  • How do I call Trinity Mini through LLM.API?

    Set the model identifier to the Trinity Mini slug provided by LLM.API in your completion or chat endpoint call, passing prompts and parameters as usual.

  • How does Trinity Mini compare to larger Trinity models?

    Compared with Trinity Large variants, Trinity Mini is cheaper and lighter with slightly lower peak reasoning quality but similar long-context capabilities for many workloads.

  • What are the main limitations of Trinity Mini?

    Trinity Mini can still hallucinate, lacks up-to-the-minute world knowledge, is not fine-tuned for code to the level of specialist coder models, and is text-only.

  • Does Trinity Mini support function calling and tool use via LLM.API?

    Yes, when used through LLM.API, Trinity Mini can be driven with JSON schemas or tool definitions to perform function calling and multi-step tool-using workflows.

Start in 2 lines of code

Get My API Key