Rerank 4 Pro

Rerank 4 Pro is Cohere’s fourth-generation, pro-tier reranking model designed for high-accuracy semantic relevance ranking over long, complex, multilingual documents. It is optimized to sort candidate results so that the most relevant items appear first in search, retrieval-augmented generation (RAG), and agent workflows.

API Performance

Latency: ~0.25s (average rerank latency)
Context: ~8K documents max per request
Input: ~$0.10 per 1K items scored
Output: $0.00 (no separate output cost)
Uptime: 99%
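
As a rough illustration of the per-item pricing listed above, the following sketch estimates spend from query volume. The ~$0.10 per 1K items rate is taken from the figures above; the function name and the example traffic numbers are hypothetical, and the sketch assumes each (query, document) pair counts as one scored item.

```python
# Approximate rate from the figures above: ~$0.10 per 1,000 items scored.
PRICE_PER_1K_ITEMS_USD = 0.10


def estimate_rerank_cost(queries: int, docs_per_query: int,
                         price_per_1k: float = PRICE_PER_1K_ITEMS_USD) -> float:
    """Estimate rerank spend in USD, assuming one scored item per (query, document) pair."""
    items_scored = queries * docs_per_query
    return items_scored / 1000 * price_per_1k


# Hypothetical workload: 20,000 queries, each reranking 50 candidates.
cost = estimate_rerank_cost(queries=20_000, docs_per_query=50)
print(f"Estimated cost: ${cost:.2f}")  # prints: Estimated cost: $100.00
```

Actual billing units and rates may differ; the pricing dashboard is the authoritative source.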

About the model

What is Rerank 4 Pro?

Rerank 4 Pro is a multilingual reranking model from Cohere that scores and orders documents by their semantic relevance to a query. It is primarily used to boost search quality in enterprise and semantic search systems by reranking initial keyword or vector search results, and to improve RAG pipelines by filtering large context windows down to the most relevant passages. It is also applied in AI agents and other retrieval-heavy applications to reduce token usage and latency while preserving answer quality. Rerank 4 Pro belongs to Cohere’s Rerank model family as the fourth-generation, higher-precision tier following earlier Rerank v3 and v3.5 models.

Input / Output

Input

Text query string
List of text documents to rerank

Output

Relevance scores and rankings for each input document
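
The input/output contract above can be illustrated with a small offline stand-in. This sketch assumes a result record with `index` and `relevance_score` fields (hypothetical names) and uses a toy word-overlap scorer in place of the model purely so the example runs without an API call; it does not reflect how Rerank 4 Pro computes scores.

```python
from dataclasses import dataclass


@dataclass
class RerankResult:
    index: int              # position of the document in the input list
    relevance_score: float  # higher means more relevant to the query


def toy_score(query: str, document: str) -> float:
    """Stand-in scorer: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(document.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & doc_words) / len(query_words)


def rerank(query: str, documents: list[str]) -> list[RerankResult]:
    """Score every document against the query and return results, best first."""
    results = [RerankResult(i, toy_score(query, doc)) for i, doc in enumerate(documents)]
    return sorted(results, key=lambda r: r.relevance_score, reverse=True)


docs = [
    "Shipping times for international orders",
    "How to reset your account password",
    "Password requirements and account security",
]
for result in rerank("reset account password", docs):
    print(result.index, round(result.relevance_score, 2), docs[result.index])
```

Each result keeps the original index, so the caller can map scores back to its own document records after sorting.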

Model capabilities

5 Core Capabilities

Relevance Ranking

Ranks documents or passages by semantic relevance to a query, improving retrieval quality for search, question answering, and recommendation systems.

Contextual Understanding

Understands nuanced query intent and document context, enabling more accurate ranking beyond simple keyword overlap or lexical similarity.

Multi-Document Comparison

Evaluates and orders many candidate texts simultaneously, selecting the most useful items from large retrieval or candidate pools.

Cross-Lingual Re-Ranking

Can re-rank texts across multiple languages when paired with multilingual retrieval, improving relevance in international and localized search experiences.

Noisy Text Handling

Maintains robust ranking performance on noisy, user-generated, or partially structured text, such as logs, chats, and informal content.

Use cases

6 Most Valuable Use Cases

Enterprise Semantic Search
RAG Result Filtering
Legal Case Retrieval
Support Ticket Triage
E-commerce Product Search
Code Snippet Retrieval

Transparent pricing

Cost Comparison

In the comparison below, LLM API lists the lowest input price, the lowest latency, and the highest throughput for reranking workloads.

Provider	Region	Latency	Throughput	Uptime	Input ($/1M)	Output ($/1M)	Context
LLM API	Global	80ms	1,200 rps	99.995%	$0.05	$0.00	64K
Cohere	Global	~160ms	~400 rps	99.9%	~$0.25	$0.00	16K
Azure AI	US East	~140ms	~500 rps	99.9%	~$0.28	$0.00	~32K
AWS Bedrock	US West	~170ms	~350 rps	99.9%	~$0.30	$0.00	~16K

Performance benchmarks

Technical Specifications

Metric	Rerank 4 Pro (Cohere)	Cohere Rerank 3	OpenAI text-embedding-3-large (as reranker)
Model Type	Cross-encoder reranker	Cross-encoder reranker	Bi-encoder embedding used for rerank
Max Documents per Query	~512	~256	~1024
Avg Latency (50 docs, 1 query)	~180ms	~220ms	~250ms
Max Input Tokens per Doc	~512	~384	~1024
Price per 1K Doc-Query Pairs	~$0.20	~$0.16	~$0.14
Throughput	~200 qps	~160 qps	~220 qps
Supported Languages	100+	50+	90+
Service Uptime	99.9%	99.9%	99.9%
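
Because the table above lists an approximate per-query document limit, larger candidate pools would need to be split across several calls. The sketch below shows one way to do that. It assumes scores from separate calls for the same query are comparable, which the source does not confirm; `score_batch` is a hypothetical callable standing in for a rerank call, and the default of 512 is the approximate figure from the table.

```python
from typing import Callable, Sequence

MAX_DOCS_PER_CALL = 512  # approximate limit from the table above


def rerank_in_batches(
    query: str,
    documents: Sequence[str],
    score_batch: Callable[[str, Sequence[str]], list[float]],
    max_docs: int = MAX_DOCS_PER_CALL,
) -> list[tuple[int, float]]:
    """Score documents in chunks of at most max_docs and merge into one ranking.

    Returns (original_index, score) pairs sorted by score, best first.
    """
    scored: list[tuple[int, float]] = []
    for start in range(0, len(documents), max_docs):
        chunk = documents[start:start + max_docs]
        scores = score_batch(query, chunk)
        scored.extend((start + offset, s) for offset, s in enumerate(scores))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical stand-in for a rerank call: longer documents get higher scores.
def fake_score_batch(query: str, docs: Sequence[str]) -> list[float]:
    return [float(len(d)) for d in docs]


ranking = rerank_in_batches("q", ["a", "bbb", "bb", "bbbb", "bbbbb"],
                            fake_score_batch, max_docs=2)
print(ranking[:3])
```

If scores turn out not to be comparable across calls, a second pass that reranks only the top results of each batch together is a common alternative.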

30-day usage via LLM API

3.8B: Documents reranked last 30 days
24.5M: Rerank API requests
210K: Active developer accounts
99.96%: Avg API uptime

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Unified AI Routing

Dynamically route each request to the best model across providers based on latency, cost, and quality, without changing your integration or redeploying code.
One endpoint, every model

Predictable AI Costs

Enforce per-project and per-tenant budgets, pick cheapest compatible models, and get normalized usage metrics so you can forecast, optimize, and control AI spend programmatically.
Cost controls by design

Resilient Fallback Logic

Define automatic failover chains so if a provider, region, or model degrades, traffic instantly shifts to healthy alternatives, with no manual intervention and no downtime.
Never ship 500s again

Full-Stack Observability

Trace every request across models and providers with structured logs, metrics, and events, making it easy to debug prompts, detect regressions, and tune performance.
See every token

Task-Level Abstractions

Call high-level tasks like chat, tools, scoring, or extraction instead of wiring raw model APIs, keeping your app logic stable as models and providers change.
Code to tasks, not models

High-Throughput Batch Runs

Process millions of requests efficiently with server-side batching, concurrency control, and retries, turning large backfills and evaluations into a single API call.
Scale jobs, not ops
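
The failover behavior described under Resilient Fallback Logic can be pictured as an ordered chain of providers tried in turn. The sketch below is a generic client-side illustration of that idea, not LLM.API's implementation; the provider names and handler functions are hypothetical.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain raised an error."""


def call_with_fallback(
    chain: Sequence[tuple[str, Callable[[str], str]]],
    request: str,
) -> tuple[str, str]:
    """Try each (name, handler) in order; return (name, response) from the first success."""
    errors: list[str] = []
    for name, handler in chain:
        try:
            return name, handler(request)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical providers: the primary is degraded, the secondary is healthy.
def primary(request: str) -> str:
    raise TimeoutError("primary timed out")


def secondary(request: str) -> str:
    return f"handled {request!r}"


used, response = call_with_fallback(
    [("primary", primary), ("secondary", secondary)], "rerank job"
)
print(used, response)
```

Collecting the per-provider errors before raising keeps the failure reason visible when the whole chain is exhausted.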

Decision guide

When to Use — When NOT to Use

Use it if...

You need to rerank small to medium candidate lists for search relevance optimization.
You need high-quality semantic reranking for question-answering over retrieved document passages.
You need to improve recommendation ordering using a powerful cross-encoder style reranker.
Your use case involves evaluating and ranking short texts like titles, snippets, or messages.
You need a strong reranker to sit on top of an existing vector search.
Your use case involves A/B testing different ranking signals with a reliable baseline reranker.
You need robust multilingual reranking performance using a commercially supported, production-ready model.

Avoid if...

You need a general-purpose generative model for text creation, coding, or dialogue.
You need to process extremely long documents end-to-end rather than rank short candidates.
Your workload requires on-device or fully offline inference without external API calls.
You need millisecond-level latency reranking over massive candidate sets in hard real-time systems.
Your workload requires image, audio, or multimodal understanding rather than pure text reranking.
You need complex multi-step reasoning or tool use instead of simple relevance scoring.
Your workload requires open-source weights deployed in your own fully controlled environment.

FAQ

Frequently Asked Questions

What is Rerank 4 Pro?

Rerank 4 Pro is Cohere’s production-grade reranking model that scores and reorders candidate documents or passages by relevance to a given query.

What is Rerank 4 Pro best used for?

Rerank 4 Pro is best for improving search, retrieval-augmented generation, recommendation systems, and any workflow needing high-precision ranking of a small candidate set.

How do I call Rerank 4 Pro through LLM.API?

You call Rerank 4 Pro via LLM.API’s rerank endpoint by specifying provider "cohere" and model "rerank-4-pro" in your request parameters.

What input and output modalities does Rerank 4 Pro support?

Rerank 4 Pro accepts a text query plus a list of text documents and returns numeric relevance scores, without generating free-form text.

What is the typical context limit for Rerank 4 Pro requests?

Rerank 4 Pro accepts a relatively short passage per document (about 512 input tokens, per the Technical Specifications table) but can score many candidate documents in a single rerank call.

How fast is Rerank 4 Pro in real applications?

Rerank 4 Pro is optimized for low-latency scoring and can typically rerank dozens of documents within tens to hundreds of milliseconds server-side.

How is pricing for Rerank 4 Pro handled on LLM.API?

Rerank 4 Pro pricing on LLM.API is usually per document scored, with exact rates shown in the LLM.API pricing dashboard and documentation.

How does Rerank 4 Pro compare to using a general-purpose LLM for reranking?

Rerank 4 Pro is typically cheaper and faster than prompting a general-purpose LLM, while providing more consistent relevance scores for ranking tasks.

What are the main limitations of Rerank 4 Pro?

Rerank 4 Pro cannot generate text, depends on the quality of your candidate set, and may miss nuanced domain-specific relevance unless it has been fine-tuned on domain data.

Can I use Rerank 4 Pro together with other models on LLM.API?

Yes, you can pair Rerank 4 Pro with embedding or chat models by first retrieving candidates, reranking them, then feeding top results into a generator.
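
The FAQ answer on calling the model names two request parameters: provider "cohere" and model "rerank-4-pro". The sketch below assembles such a request body and then picks the top passages from a response, as in the retrieve, rerank, generate flow described above. Everything beyond those two parameters, including the field names `query`, `documents`, `top_n`, `results`, `index`, and `relevance_score`, is an assumed shape for illustration only; the LLM.API documentation is the reference for the actual schema and endpoint, so no network call is made here.

```python
import json


def build_rerank_request(query: str, documents: list[str], top_n: int) -> dict:
    """Build a request body. Only provider and model come from the FAQ; other fields are assumed."""
    return {
        "provider": "cohere",
        "model": "rerank-4-pro",
        "query": query,
        "documents": documents,
        "top_n": top_n,
    }


def top_documents(response: dict, documents: list[str]) -> list[str]:
    """Map an assumed response shape back to document text, best first."""
    ranked = sorted(response["results"], key=lambda r: r["relevance_score"], reverse=True)
    return [documents[r["index"]] for r in ranked]


docs = ["Refund policy", "Shipping times", "Returning a damaged item"]
body = build_rerank_request("how do I return an item", docs, top_n=2)
print(json.dumps(body, indent=2))

# Hand-written example response in the assumed shape (not real model output).
sample_response = {"results": [{"index": 0, "relevance_score": 0.41},
                               {"index": 2, "relevance_score": 0.87}]}
context = top_documents(sample_response, docs)
print(context)  # passages that would be placed in a generator prompt
```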
