Powered by Cohere
Rerank 4 Fast
- Reranking
Rerank 4 Fast is Cohere’s fourth-generation multilingual reranking model, optimized for low-latency, high-throughput retrieval, with a context window of roughly 32K tokens. It is designed to quickly reorder candidate documents by semantic relevance to a query in production search and RAG pipelines.
About the model
What is Rerank 4 Fast?
Rerank 4 Fast is a Cohere reranking model that scores and reorders candidate documents by their relevance to a user query, with support for long contexts and multilingual data. It is mainly used to improve result ordering in retrieval-augmented generation (RAG), enterprise search, and agent workflows where many documents must be ranked quickly. In high-traffic, latency-sensitive applications it trades some precision for speed relative to heavier rerankers. It belongs to Cohere’s Rerank 4 family alongside Rerank 4 Pro; the family succeeds the earlier Rerank 3.x models.
Model capabilities
5 Core Capabilities
- Document Reranking
  Reorders candidate documents or passages by relevance to a query, improving information retrieval quality over initial search results.
- Semantic Matching
  Assesses semantic similarity between queries and texts to surface contextually related results beyond simple keyword overlap.
- Search Optimization
  Enhances search pipelines by providing relevance scores that can be integrated into ranking, filtering, or hybrid retrieval systems.
- Multilingual Queries
  Handles queries and documents across multiple languages for ranking tasks, supporting diverse international search and retrieval scenarios.
- Result Personalization
  Enables customized ranking strategies by incorporating additional metadata or signals alongside textual relevance scores when reranking results.
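The capabilities above mention combining relevance scores with other signals. A minimal sketch of one way to do that, assuming the reranker returns one score in [0, 1] per document. The `Candidate` type, the `blend_scores` helper, the `recency` signal, and the 0.8 weight are illustrative assumptions, not part of any Cohere or LLM.API interface:

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    relevance: float  # score from the reranker, assumed to lie in [0, 1]
    recency: float    # hypothetical metadata signal, also scaled to [0, 1]


def blend_scores(candidates, relevance_weight=0.8):
    """Sort candidates by a weighted mix of relevance and a metadata signal."""
    if not 0.0 <= relevance_weight <= 1.0:
        raise ValueError("relevance_weight must be between 0 and 1")

    def combined(c):
        return relevance_weight * c.relevance + (1.0 - relevance_weight) * c.recency

    return sorted(candidates, key=combined, reverse=True)


candidates = [
    Candidate("a", relevance=0.90, recency=0.10),
    Candidate("b", relevance=0.85, recency=0.90),
    Candidate("c", relevance=0.40, recency=1.00),
]
ranked_ids = [c.doc_id for c in blend_scores(candidates)]
```

With these made-up numbers, document "b" overtakes "a" because its slightly lower relevance is offset by a much higher recency value. Setting `relevance_weight=1.0` falls back to pure reranker order.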
Use cases
6 Most Valuable Use Cases
- Search Results Reranking
- E-commerce Product Ranking
- Legal Case Retrieval
- News Feed Prioritization
- Support Ticket Triage
- Code Snippet Ranking
Transparent pricing
Cost Comparison
Roughly 60% to 68% lower input price and 45% to 57% lower latency than the other providers listed below
| Provider | Region | Latency | Throughput | Uptime | Input ($/1M tokens) | Output ($/1M tokens) | Context |
|---|---|---|---|---|---|---|---|
| LLM.API | Global | ~120ms | ~120 qps | 99.99% | ~$0.30 | $0.00 | 200K tokens |
| Cohere | Global | ~220ms | ~60 qps | 99.9% | ~$0.75 | $0.00 | 128K tokens |
| Azure AI | US East | ~260ms | ~80 qps | 99.9% | ~$0.90 | $0.00 | 128K tokens |
| AWS Bedrock | US West | ~280ms | ~70 qps | 99.9% | ~$0.95 | $0.00 | 128K tokens |
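To make the per-token prices in the table concrete, here is a rough cost estimate for a hypothetical workload. The workload numbers are invented for illustration, and the billing model (each document billed together with a copy of the query) is an assumption, so check the provider's pricing page before relying on it:

```python
def rerank_cost_usd(num_queries, docs_per_query, avg_tokens_per_doc,
                    query_tokens, price_per_million_usd):
    """Estimate input-token cost.

    Assumption: every document is billed together with the query tokens.
    """
    tokens_per_query = docs_per_query * (avg_tokens_per_doc + query_tokens)
    total_tokens = num_queries * tokens_per_query
    return total_tokens / 1_000_000 * price_per_million_usd


# Hypothetical monthly workload: 1M queries, 50 candidates each.
monthly = dict(num_queries=1_000_000, docs_per_query=50,
               avg_tokens_per_doc=200, query_tokens=20)

llm_api_cost = rerank_cost_usd(price_per_million_usd=0.30, **monthly)
cohere_cost = rerank_cost_usd(price_per_million_usd=0.75, **monthly)
savings = 1 - llm_api_cost / cohere_cost
```

Under these assumptions the workload uses 11 billion input tokens, which comes to about 3,300 USD at ~$0.30 per 1M tokens versus about 8,250 USD at ~$0.75, a 60% difference.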
Performance benchmarks
Technical Specifications
| Metric | Rerank 4 Fast (Cohere) | text-embedding-3-large (OpenAI) | nomic-embed-text v1 (Nomic) |
|---|---|---|---|
| Task Type | Reranking | Embedding | Embedding |
| Dimensions | N/A (outputs relevance scores) | 3072 | 768 |
| Max Input Tokens | ~32K | 8K | 8K |
| Avg Latency | ~120ms | ~180ms | ~200ms |
| Price per 1M Tokens | ~$0.15 | $0.13 | ~$0.10 |
| Throughput | ~150 QPS | ~200 QPS | ~120 QPS |
| Uptime | 99.9% | 99.9% | 99.5% |
30-day usage via LLM.API
- 1.1B documents reranked in last 30 days
- 27M API requests served in last 30 days
- 3.6K active teams using Rerank 4 Fast
- 99.95% avg uptime over last 30 days
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
- Unified AI Routing
  Dynamically route each request across providers and models based on latency, cost, or quality. One endpoint lets you A/B test, roll out, and swap models safely.
  One endpoint, any model
- Cost-Aware Orchestration
  Automatically pick the most cost-effective model per task while enforcing budgets and quotas. Reduce spend without rewriting app logic or touching provider settings.
  Optimize spend by default
- Resilient Fallback Flows
  Define provider-agnostic fallback chains that retry, downgrade, or reroute on failures and timeouts. Keep production workloads up even when individual APIs break.
  Designed for failure
- Full-Stack Observability
  Get unified traces, logs, metrics, and payload insights across every provider. Debug latency spikes, failures, and regressions from a single, queryable view.
  See every token
- Task-Level Abstractions
  Describe intent as tasks (chat, generate, extract, score) instead of raw prompts. LLM.API normalizes parameters so you can swap models without rewriting integrations.
  Code to tasks, not models
- High-Throughput Batch
  Submit massive batch jobs with automatic chunking, concurrency control, retries, and result aggregation. Maximize throughput while staying within rate limits and SLAs.
  Millions of calls, one job
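The fallback behavior described above (retry, then reroute to the next provider) can be sketched in a few lines. This is a generic illustration of the pattern, not LLM.API's implementation. The `call_with_fallback` helper and the two stub providers are hypothetical stand-ins for real client calls:

```python
import time


def call_with_fallback(providers, request, max_attempts=2, backoff_seconds=0.0):
    """Try each (name, callable) provider in order, retrying before moving on."""
    errors = []
    for name, call in providers:
        for attempt in range(1, max_attempts + 1):
            try:
                return name, call(request)
            except Exception as exc:  # a real client would catch narrower error types
                errors.append((name, attempt, repr(exc)))
                if attempt < max_attempts and backoff_seconds:
                    time.sleep(backoff_seconds * attempt)
    raise RuntimeError(f"all providers failed: {errors}")


def flaky_primary(request):
    """Stub provider that always times out."""
    raise TimeoutError("primary timed out")


def stable_secondary(request):
    """Stub provider that returns the documents in sorted order."""
    return {"ranked": sorted(request["documents"])}


used, result = call_with_fallback(
    [("primary", flaky_primary), ("secondary", stable_secondary)],
    {"query": "q", "documents": ["b", "a"]},
)
```

Here the primary stub fails twice, so the request is served by the secondary stub. If every provider fails, the helper raises a single error that lists each failed attempt.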
Decision guide
When to Use — When NOT to Use
Use it if...
- You need to rerank search results from a traditional or vector search backend.
- You need fast, low-cost relevance scoring for many query-document pairs at scale.
- Your use case involves improving recommendation ordering based on textual similarity and relevance.
- Your use case involves ranking retrieved passages before passing a few into a generator.
- You need to prioritize customer support articles matching a user query or ticket.
- Your use case involves ranking product listings by semantic match to user queries.
- You need a lightweight reranker to boost quality of existing keyword search.
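One of the cases above is ranking retrieved passages before passing a few into a generator. A minimal sketch of that selection step, assuming the reranker has already returned one score per passage in the same order as the input. The `select_passages` helper and its `min_score` cutoff are illustrative assumptions:

```python
def select_passages(passages, scores, top_n=3, min_score=0.0):
    """Keep the top_n highest-scoring passages at or above min_score, best first."""
    if len(passages) != len(scores):
        raise ValueError("passages and scores must have the same length")
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in ranked if score >= min_score][:top_n]


passages = ["p0", "p1", "p2", "p3"]
scores = [0.20, 0.90, 0.05, 0.60]  # made-up relevance scores
context = select_passages(passages, scores, top_n=2, min_score=0.1)
```

The score cutoff keeps weak matches out of the prompt even when fewer than `top_n` passages clear it, which is usually preferable to padding the generator's context with irrelevant text.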
Avoid if...
- You need a generative model to write, summarize, or translate text content.
- Your workload requires understanding images, audio, or other non-text modalities.
- You need complex multi-step reasoning, planning, or tool-use instead of simple relevance ranking.
- Your workload requires processing extremely long documents beyond typical reranker context limits.
- You need low-latency, on-device inference rather than calling a hosted reranking API.
- Your workload requires training or fine-tuning custom models, not using fixed rerank endpoints.
- You need personalized ranking heavily dependent on user profiles rather than text similarity.
FAQ
Frequently Asked Questions
- What is Rerank 4 Fast?
  Rerank 4 Fast is a Cohere model that scores and reorders candidate documents or passages for a query to improve retrieval relevance.
- What is Rerank 4 Fast best used for?
  It is best for fast, low-cost reranking in search, RAG pipelines, recommendation systems, and retrieval-based question answering.
- How is Rerank 4 Fast priced when called through LLM.API?
  LLM.API usage is typically billed per input token or per item scored; check the LLM.API pricing page for exact current rates.
- What context window does Rerank 4 Fast support?
  Rerank 4 Fast supports a context window of roughly 32K tokens; inputs that exceed the limit need to be truncated or split into chunks before scoring.
- How fast is Rerank 4 Fast in terms of latency?
  It is optimized for low-latency inference, so it typically returns relevance scores quickly even when ranking many candidates.
- What modalities does Rerank 4 Fast support?
  Rerank 4 Fast operates on text-only inputs, taking a text query and a list of text documents to score.
- How do I call Rerank 4 Fast via the LLM.API gateway?
  You select the Cohere provider and specify the Rerank 4 Fast model name in your LLM.API rerank request payload.
- How does Rerank 4 Fast compare to other Cohere reranking models?
  Compared to larger or more accurate variants, Rerank 4 Fast generally trades a bit of quality for better speed and lower cost.
- Can I use Rerank 4 Fast for general text generation?
  No, it is a ranking model that scores documents against a query and does not generate free-form text.
- What limitations should I be aware of when using Rerank 4 Fast?
  Its relevance depends on the quality of candidate documents and it may underperform on highly domain-specific or out-of-distribution content.
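The FAQ describes a rerank request as a provider, a model name, a text query, and a list of text documents. A minimal sketch of assembling such a payload follows. The field names (`provider`, `model`, `query`, `documents`, `top_n`) and the `"rerank-4-fast"` identifier are placeholders assumed for illustration; the actual LLM.API schema and model name are not given here and should be taken from its documentation:

```python
import json


def build_rerank_request(query, documents, top_n=None,
                         provider="cohere", model="rerank-4-fast"):
    """Assemble a JSON-serializable rerank payload (hypothetical schema)."""
    if not query:
        raise ValueError("query must be a non-empty string")
    if not documents:
        raise ValueError("documents must contain at least one item")
    payload = {
        "provider": provider,  # assumed field name
        "model": model,        # placeholder model identifier
        "query": query,
        "documents": list(documents),
    }
    if top_n is not None:
        # Never ask for more results than there are candidates.
        payload["top_n"] = min(top_n, len(documents))
    return payload


payload = build_rerank_request(
    "how do I reset my password?",
    ["Resetting your password", "Billing overview", "Password policy"],
    top_n=2,
)
body = json.dumps(payload)  # string ready to send as an HTTP request body
```

The sketch stops short of sending the request, since the endpoint URL and authentication scheme are also not specified on this page.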
