Powered by Cohere
Rerank 4 Pro
- Reranking
Rerank 4 Pro is Cohere’s fourth-generation, pro-tier reranking model designed for high-accuracy semantic relevance ranking over long, complex, multilingual documents. It is optimized to sort candidate results so that the most relevant items appear first in search, retrieval-augmented generation (RAG), and agent workflows.
About the model
What is Rerank 4 Pro?
Rerank 4 Pro is a multilingual reranking model from Cohere that scores and orders documents by their semantic relevance to a query. It is primarily used to boost search quality in enterprise and semantic search systems by reranking the results of an initial keyword or vector search, and to improve RAG pipelines by narrowing a large set of retrieved passages down to the most relevant ones before they are placed in the context window. It is also applied in AI agents and other retrieval-heavy applications to reduce token usage and latency while preserving answer quality. Rerank 4 Pro belongs to Cohere’s Rerank model family as the fourth-generation, higher-precision tier, following the earlier Rerank v3 and v3.5 models.
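
The rerank-then-filter step described above can be sketched in a few lines of Python. This is a minimal illustration, not the Cohere or LLM.API client: `rerank_and_filter`, the `Scorer` type, and `overlap_scorer` are hypothetical names, and the word-overlap scorer is only a toy stand-in for a real reranking model so that the example runs without network access.

```python
from typing import Callable, List, Tuple

# Hypothetical interface: a scorer takes (query, documents) and returns one
# relevance score per document. In practice this would wrap a rerank API call.
Scorer = Callable[[str, List[str]], List[float]]


def rerank_and_filter(
    query: str, candidates: List[str], scorer: Scorer, top_n: int = 3
) -> List[Tuple[str, float]]:
    """Score every candidate, sort by descending score, and keep the top_n."""
    scores = scorer(query, candidates)
    if len(scores) != len(candidates):
        raise ValueError("scorer must return one score per candidate")
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_n]


def overlap_scorer(query: str, documents: List[str]) -> List[float]:
    """Toy stand-in for a reranker: fraction of query words found in each document."""
    query_words = set(query.lower().split())
    return [
        len(query_words & set(doc.lower().split())) / max(len(query_words), 1)
        for doc in documents
    ]


if __name__ == "__main__":
    passages = [
        "reset your password in settings",
        "billing and invoices",
        "how to reset a forgotten password",
    ]
    for text, score in rerank_and_filter("reset password", passages, overlap_scorer, top_n=2):
        print(f"{score:.2f}  {text}")
```

Only the passages that survive the `top_n` cut would be passed on to the generator, which is where the reduction in token usage comes from.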
Model capabilities
5 Core Capabilities
- Relevance Ranking
  Ranks documents or passages by semantic relevance to a query, improving retrieval quality for search, question answering, and recommendation systems.
- Contextual Understanding
  Understands nuanced query intent and document context, enabling more accurate ranking beyond simple keyword overlap or lexical similarity.
- Multi-Document Comparison
  Evaluates and orders many candidate texts simultaneously, selecting the most useful items from large retrieval or candidate pools.
- Cross-Lingual Re-Ranking
  Can re-rank texts across multiple languages when paired with multilingual retrieval, improving relevance in international and localized search experiences.
- Noisy Text Handling
  Maintains robust ranking performance on noisy, user-generated, or partially structured text, such as logs, chats, and informal content.
Use cases
6 Most Valuable Use Cases
- Enterprise Semantic Search
- RAG Result Filtering
- Legal Case Retrieval
- Support Ticket Triage
- E-commerce Product Search
- Code Snippet Retrieval
Transparent pricing
Cost Comparison
In the comparison below, LLM API lists the lowest input price ($0.05 per 1M), the lowest latency (80ms), and the highest throughput (1,200 rps) for reranking workloads.
| Provider | Region | Latency | Throughput | Uptime | Input ($/1M) | Output ($/1M) | Context |
|---|---|---|---|---|---|---|---|
| LLM API (best) | Global | 80ms | 1,200 rps | 99.995% | $0.05 | $0.00 | 64K |
| Cohere | Global | ~160ms | ~400 rps | 99.9% | ~$0.25 | $0.00 | 16K |
| Azure AI | US East | ~140ms | ~500 rps | 99.9% | ~$0.28 | $0.00 | ~32K |
| AWS Bedrock | US West | ~170ms | ~350 rps | 99.9% | ~$0.30 | $0.00 | ~16K |
Performance benchmarks
Technical Specifications
| Metric | Rerank 4 Pro (Cohere) | Cohere Rerank 3 | OpenAI text-embedding-3-large (as reranker) |
|---|---|---|---|
| Model Type | Cross-encoder reranker | Cross-encoder reranker | Bi-encoder embedding used for rerank |
| Max Documents per Query | ~512 | ~256 | ~1024 |
| Avg Latency (50 docs, 1 query) | ~180ms | ~220ms | ~250ms |
| Max Input Tokens per Doc | ~512 | ~384 | ~1024 |
| Price per 1K Doc-Query Pairs | ~$0.20 | ~$0.16 | ~$0.14 |
| Throughput | ~200 qps | ~160 qps | ~220 qps |
| Supported Languages | ~100+ | ~50+ | ~90+ |
| Service Uptime | 99.9% | 99.9% | 99.9% |
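
To make the "Price per 1K Doc-Query Pairs" row concrete, the sketch below estimates spend from query volume. It assumes that one scored document for one query counts as one pair and that billing is linear with no minimums or volume discounts; both are assumptions, and `rerank_cost_usd` is a hypothetical helper. The prices are the approximate figures from the table above.

```python
def rerank_cost_usd(queries: int, docs_per_query: int, price_per_1k_pairs: float) -> float:
    """Estimate cost as (queries x docs per query) / 1000 x price per 1K pairs."""
    if queries < 0 or docs_per_query < 0 or price_per_1k_pairs < 0:
        raise ValueError("inputs must be non-negative")
    pairs = queries * docs_per_query
    return pairs / 1000 * price_per_1k_pairs


if __name__ == "__main__":
    # Approximate prices per 1K doc-query pairs, copied from the table above.
    prices = {
        "Rerank 4 Pro (Cohere)": 0.20,
        "Cohere Rerank 3": 0.16,
        "OpenAI text-embedding-3-large (as reranker)": 0.14,
    }
    for model, price in prices.items():
        cost = rerank_cost_usd(queries=1_000, docs_per_query=50, price_per_1k_pairs=price)
        print(f"{model}: ${cost:.2f}")
```

For example, 1,000 queries with 50 candidates each is 50,000 pairs, which comes to about $10 at ~$0.20 per 1K pairs.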
30-day usage via LLM API
- 3.8B documents reranked in the last 30 days
- 24.5M Rerank API requests
- 210K active developer accounts
- 99.96% average API uptime
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
- Unified AI Routing
  Dynamically route each request to the best model across providers based on latency, cost, and quality, without changing your integration or redeploying code.
  One endpoint, every model
- Predictable AI Costs
  Enforce per-project and per-tenant budgets, pick cheapest compatible models, and get normalized usage metrics so you can forecast, optimize, and control AI spend programmatically.
  Cost controls by design
- Resilient Fallback Logic
  Define automatic failover chains so if a provider, region, or model degrades, traffic instantly shifts to healthy alternatives, with no manual intervention and no downtime.
  Never ship 500s again
- Full-Stack Observability
  Trace every request across models and providers with structured logs, metrics, and events, making it easy to debug prompts, detect regressions, and tune performance.
  See every token
- Task-Level Abstractions
  Call high-level tasks like chat, tools, scoring, or extraction instead of wiring raw model APIs, keeping your app logic stable as models and providers change.
  Code to tasks, not models
- High-Throughput Batch Runs
  Process millions of requests efficiently with server-side batching, concurrency control, and retries, turning large backfills and evaluations into a single API call.
  Scale jobs, not ops
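
The failover-chain idea behind Resilient Fallback Logic can be illustrated with a small client-side sketch. This is not how LLM.API implements or configures failover (the page describes it as a platform feature); `rerank_with_fallback`, `RerankFn`, and `AllProvidersFailed` are hypothetical names, and each provider is represented by a plain callable so the example stays self-contained.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical interface: each provider is a callable that returns one score per document.
RerankFn = Callable[[str, List[str]], List[float]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain raised an exception."""


def rerank_with_fallback(
    query: str, documents: List[str], chain: Iterable[Tuple[str, RerankFn]]
) -> Tuple[str, List[float]]:
    """Try each (name, callable) in order and return the first successful result."""
    errors = []
    for name, call in chain:
        try:
            return name, call(query, documents)
        except Exception as exc:  # broad on purpose: any failure moves to the next provider
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors) or "empty fallback chain")
```

The function returns the name of the provider that answered alongside the scores, so callers can log which link in the chain served each request.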
Decision guide
When to Use and When NOT to Use
Use it if...
- You need to rerank small to medium candidate lists for search relevance optimization.
- You need high-quality semantic reranking for question-answering over retrieved document passages.
- You need to improve recommendation ordering using a powerful cross-encoder-style reranker.
- Your use case involves evaluating and ranking short texts like titles, snippets, or messages.
- You need a strong reranker to sit on top of an existing vector search.
- Your use case involves A/B testing different ranking signals with a reliable baseline reranker.
- You need robust multilingual reranking performance using a commercially supported, production-ready model.
Avoid if...
- You need a general-purpose generative model for text creation, coding, or dialogue.
- You need to process extremely long documents end-to-end rather than rank short candidates.
- Your workload requires on-device or fully offline inference without external API calls.
- You need millisecond-level latency reranking over massive candidate sets in hard real-time systems.
- Your workload requires image, audio, or multimodal understanding rather than pure text reranking.
- You need complex multi-step reasoning or tool use instead of simple relevance scoring.
- Your workload requires open-source weights deployed in your own fully controlled environment.
FAQ
Frequently Asked Questions
- What is Rerank 4 Pro?
  Rerank 4 Pro is Cohere’s production-grade reranking model that scores and reorders candidate documents or passages by relevance to a given query.
- What is Rerank 4 Pro best used for?
  Rerank 4 Pro is best for improving search, retrieval-augmented generation, recommendation systems, and any workflow that needs high-precision ranking of a small candidate set.
- How do I call Rerank 4 Pro through LLM.API?
  You call Rerank 4 Pro through LLM.API’s rerank endpoint by specifying the provider "cohere" and the model "rerank-4-pro" in your request parameters.
- What input and output modalities does Rerank 4 Pro support?
  Rerank 4 Pro accepts a text query plus a list of text documents and returns numeric relevance scores. It does not generate free-form text.
- What is the typical context limit for Rerank 4 Pro requests?
  Each document is limited to a relatively short text snippet (roughly 512 input tokens per document in the specification table above), but a single rerank call can handle many candidate documents (roughly 512 per query in the same table).
- How fast is Rerank 4 Pro in real applications?
  Rerank 4 Pro is optimized for low-latency scoring and can typically rerank dozens of documents within tens to hundreds of milliseconds server-side. The specification table above lists about 180ms for 50 documents.
- How is pricing for Rerank 4 Pro handled on LLM.API?
  Rerank 4 Pro pricing on LLM.API is usually per document scored, with exact rates shown in the LLM.API pricing dashboard and documentation.
- How does Rerank 4 Pro compare to using a general-purpose LLM for reranking?
  Rerank 4 Pro is typically cheaper and faster than prompting a general-purpose LLM, while providing more consistent relevance scores for ranking tasks.
- What are the main limitations of Rerank 4 Pro?
  Rerank 4 Pro cannot generate text, depends on the quality of your candidate set, and may miss nuanced domain-specific relevance without fine-tuning on domain data.
- Can I use Rerank 4 Pro together with other models on LLM.API?
  Yes. You can pair Rerank 4 Pro with embedding or chat models: retrieve candidates first, rerank them, then feed the top results into a generator.
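
As a rough illustration of the request described in the FAQ, the sketch below builds a JSON body that selects provider "cohere" and model "rerank-4-pro". Only those two values come from this page. The field names `query`, `documents`, and `top_n`, and the helper `build_rerank_request`, are assumptions made for the example, and no endpoint URL or authentication scheme is shown because the page does not specify them. Check the LLM.API documentation for the actual request schema.

```python
import json
from typing import Any, Dict, List, Optional


def build_rerank_request(
    query: str, documents: List[str], top_n: Optional[int] = None
) -> Dict[str, Any]:
    """Assemble a rerank request body (field names are assumed, not confirmed)."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if not documents:
        raise ValueError("documents must contain at least one candidate")
    payload: Dict[str, Any] = {
        "provider": "cohere",     # value taken from the FAQ
        "model": "rerank-4-pro",  # value taken from the FAQ
        "query": query,           # assumed field name
        "documents": documents,   # assumed field name
    }
    if top_n is not None:
        # Assumed field name; never request more results than candidates supplied.
        payload["top_n"] = min(top_n, len(documents))
    return payload


if __name__ == "__main__":
    body = build_rerank_request(
        "how do I rotate an API key?",
        ["Rotating keys from the dashboard", "Invoice history", "Key rotation via CLI"],
        top_n=2,
    )
    print(json.dumps(body, indent=2))
```

In the retrieve, rerank, generate flow from the last FAQ answer, this body would be sent after the initial retrieval step, and the top-scoring documents in the response would become the context for the generator.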
