Trinity Large Thinking

Text Generation

Trinity Large Thinking is Arcee AI’s open-weight, 398–400B-parameter sparse Mixture-of-Experts model focused on advanced reasoning and long-horizon agentic tasks. It is notable for activating only about 13B parameters per token while emitting explicit chain-of-thought traces for tool-using agents.

Start Using API

API Performance

Latency: ~1.5s avg response
Context: ~32K token context
Input: Free per 1M tokens
Output: Free per 1M tokens
Uptime: 99% 99%

About the model

What is Trinity Large Thinking?

Trinity Large Thinking is an open-source, reasoning-focused sparse Mixture-of-Experts language model from Arcee AI with roughly 398–400 billion total parameters and about 13 billion active per token. It is primarily used for complex, long-horizon AI agents, multi-step tool calling, and transparent chain-of-thought reasoning in production workflows. It is also applied to long-context tasks such as RAG, structured outputs, and coding assistance where its large context window and explicit thinking blocks are valuable. The model belongs to Arcee AI’s Trinity Large family and is built on top of Trinity Large Base as its reasoning-specialized variant.

Input / Output

Input

Text prompts (no images or documents; text-only model)

Output

Free-form text responses (reasoning traces and answers)
Code snippets within text responses

Model capabilities

5 Core Capabilities

Advanced Reasoning

Sparse Mixture-of-Experts model specialized for complex, long-horizon reasoning, planning, and multi-step decision-making across diverse knowledge domains.
Agentic Workflows

Optimized for autonomous agents, multi-turn tool use, and orchestrating extended reasoning traces for production-grade AI agent pipelines.
Long-Context Handling

Supports very long context windows (around 256K tokens), enabling analysis of large documents, multi-step traces, and extensive conversations.
Code Generation

Delivers strong performance on coding benchmarks like LiveCodeBench, suitable for complex, multi-step programming and software engineering tasks.
Multilingual Text

Handles multilingual text inputs and outputs, making it suitable for global applications requiring reasoning and generation across several languages.

Use cases

6 Most Valuable Use Cases

Agentic workflow planning
Complex code generation
Long document reasoning
Tool-using AI agents
Enterprise customization
Benchmarking and evaluation

Transparent pricing

Cost Comparison

LLM API offers the lowest cost and highest performance for Trinity Large–class reasoning models.

Provider	Region	Latency	Throughput	Uptime	Input ($/1M)	Output ($/1M)	Context
LLM API BEST	Global	~140ms	~80 tps	~99.99%	~$0.70	~$0.70	~256K
Arcee AI	Global	~220ms	~40 tps	~99.9%	~$2.50	~$2.50	~128K
OpenAI (o3-mini equivalent)	Global	~260ms	~35 tps	~99.9%	~$1.10	~$4.40	~200K
Anthropic (Claude 3.7 Sonnet thinking)	Global	~280ms	~30 tps	~99.9%	~$3.00	~$15.00	~200K
Google (Gemini 2.0 Pro thinking)	Global	~240ms	~45 tps	~99.9%	~$1.50	~$6.00	~200K

Performance benchmarks

Technical Specifications

Metric	Trinity Large Thinking	OpenAI o3	Anthropic Claude Sonnet 4
Avg Latency	~1.8s	~2.5s	~2.2s
Context Window	262K	200K	200K
Input Price ($/1M)	$0.22	$2.00	$3.00
Output Price ($/1M)	$0.80	$8.00	$15.00
Max Output Tokens	80K	32K	8K
Throughput	~40 tps	~20 tps	~18 tps
Uptime	99.9%	99.9%	99.9%

30-day usage via LLM API

12.5B: Prompt tokens processed (last 30 days)
145M: Completion tokens generated (30 days)
3.1M: API requests served (30 days)
99.8%: Avg uptime over last 30 days

Start Using API

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Intelligent Model Routing

Automatically route each request to the optimal model across providers based on latency, capability, and constraints—no client changes or custom logic required.
One endpoint, every model.
Cost-Aware Orchestration

Enforce per-project and per-request cost controls while dynamically selecting cheaper equivalents, so you stay within budget without manually tuning every call.
Optimize spend by default.
Resilient Fallback Flows

Define automatic cross-provider fallbacks when a model is slow, degraded, or unavailable, keeping production traffic flowing without brittle retries or custom failover code.
No single point of failure.
Full-Stack Observability

Get unified traces, metrics, and logs for every AI call—across vendors—so you can debug prompts, compare models, and track performance from one dashboard.
See every token, everywhere.
Task-Level Abstractions

Describe tasks—chat, tools, RAG, classification—once and let LLM.API handle providers, schemas, and quirks so you ship features instead of glue code.
Code to tasks, not models.
High-Throughput Batch Jobs

Run massive batch inference across providers with automatic parallelization, rate-limit handling, and retries, turning slow offline jobs into predictable pipelines.
Batch at production scale.

Decision guide

When to Use — When NOT to Use

Use it if...

You need strong multi-step reasoning for agentic workflows, planning, and complex decision pipelines.
You need long-context processing, like 200K+ token transcripts, logs, or technical documents.
Your use case involves building autonomous coding or debugging agents with reliable tool use.
You need transparent chain-of-thought style traces for auditable, stepwise enterprise reasoning.
Your use case involves self-hosting or fine-tuning an Apache-2.0-licensed open-weights reasoning model.
You need high reasoning benchmarks at lower cost than frontier closed-source reasoning models.
Your use case involves experimental orchestration of multiple tools and APIs across long multi-turn sessions.

Avoid if...

You need ultra-low-latency real-time chat or streaming responses with minimal reasoning overhead.
Your workload requires strict output brevity without verbose reasoning traces or intermediate thoughts.
You need lightweight on-device inference on CPUs or small GPUs with tight memory limits.
Your workload requires multimodal vision, image understanding, or audio capabilities not provided by Trinity.
You need a simple, fast, general-purpose assistant where deep reasoning is unnecessary overkill.
Your workload requires hardened, compliance-certified enterprise hosting where Arcee’s stack is unavailable.
You need predictable, very low token usage budgets instead of long, exploratory reasoning outputs.

FAQ

Frequently Asked Questions

What is Trinity Large Thinking?

Trinity Large Thinking is a large language model by Arcee AI optimized for complex reasoning, analysis, and multi-step problem solving.
What is Trinity Large Thinking best suited for?

It is best for long-form reasoning tasks like code review, technical design discussions, research synthesis, and stepwise planning where depth of analysis matters.
What is the context window of Trinity Large Thinking?

Trinity Large Thinking supports a large context window suitable for multi-document prompts, but exact token limits depend on the LLM.API deployment configuration.
How fast is Trinity Large Thinking in terms of latency?

Latency varies by LLM.API region and load, but Trinity Large Thinking is tuned for interactive use rather than ultra-low-latency streaming workloads.
What modalities does Trinity Large Thinking support?

Trinity Large Thinking is a text-only model focused on natural language input and output, without native image or audio processing.
How do I access Trinity Large Thinking through LLM.API?

You can call Trinity Large Thinking via LLM.API by specifying the provider as Arcee AI and the model name as Trinity Large Thinking in requests.
How is Trinity Large Thinking priced on LLM.API?

Pricing for Trinity Large Thinking is usage-based on input and output tokens, with exact rates listed on the LLM.API pricing page for Arcee AI models.
How does Trinity Large Thinking compare to smaller Trinity models?

Compared to smaller Trinity variants, Trinity Large Thinking offers stronger reasoning and accuracy at higher compute cost and slightly increased latency.
What are the main limitations of Trinity Large Thinking?

It can still hallucinate facts, struggle with very recent information, and is not suitable for real-time control or safety-critical decision-making without oversight.
Can I use Trinity Large Thinking for tool calling or function calling via LLM.API?

Yes, as long as your LLM.API integration defines tools or functions, Trinity Large Thinking can follow structured tool-calling schemas in prompts.

Start in 2 lines of code

Get My API Key

Trinity Large Thinking

What is Trinity Large Thinking?

5 Core Capabilities

Advanced Reasoning

Agentic Workflows

Long-Context Handling

Code Generation

Multilingual Text

6 Most Valuable Use Cases

Cost Comparison

Technical Specifications

Why Build on LLM.API?

Intelligent Model Routing

Cost-Aware Orchestration

Resilient Fallback Flows

Full-Stack Observability

Task-Level Abstractions

High-Throughput Batch Jobs

When to Use — When NOT to Use

Use it if...

Avoid if...

Start in 2 lines of code