Rerank 4 Fast

Reranking

Rerank 4 Fast is Cohere’s fourth-generation multilingual reranking model optimized for low-latency, high-throughput retrieval with a context window of around 32K–33K tokens. It is designed to quickly reorder candidate documents by semantic relevance to a query in production search and RAG pipelines.


API Performance

Latency: ~0.15s avg rerank time for 100–200 documents
Context: ~10K max documents per request
Input: ~$0.10 per 1M tokens (documents)
Output: $0.00 (no generated tokens; scores only)
Uptime: 99%

About the model

What is Rerank 4 Fast?

Rerank 4 Fast is a Cohere reranking model that scores and reorders candidate documents by their relevance to a user query, with support for long contexts and multilingual data. It is mainly used to improve result ordering in retrieval-augmented generation (RAG), enterprise search, and agent workflows where many documents must be ranked quickly. It also suits high-traffic, latency-sensitive applications, where it trades some precision for speed relative to heavier rerankers. It belongs to Cohere’s Rerank 4 family alongside Rerank 4 Pro; the family succeeds the earlier Rerank 3.x models.

Input / Output

Input

Query text
Documents as text or semi-structured JSON objects

Output

Ranked documents with relevance scores
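
To make the input/output contract concrete, here is a minimal Python sketch of the shapes involved. The `RankedDocument` type and the `toy_rerank` function are hypothetical names introduced for illustration, and the word-overlap score is only a stand-in for a relevance score; it is not how Rerank 4 Fast scores documents.

```python
from dataclasses import dataclass


@dataclass
class RankedDocument:
    index: int              # position of the document in the input list
    text: str               # the document text that was scored
    relevance_score: float  # higher means more relevant to the query


def toy_rerank(query: str, documents: list[str]) -> list[RankedDocument]:
    """Return documents sorted by a stand-in score, highest first."""
    query_words = set(query.lower().split())
    ranked = []
    for i, doc in enumerate(documents):
        doc_words = set(doc.lower().split())
        # Stand-in scorer: share of query words that also appear in the document.
        score = len(query_words & doc_words) / len(query_words) if query_words else 0.0
        ranked.append(RankedDocument(index=i, text=doc, relevance_score=score))
    # Python's sort is stable, so tied documents keep their input order.
    return sorted(ranked, key=lambda r: r.relevance_score, reverse=True)


docs = [
    "Shipping takes three to five business days.",
    "You can reset your password from the account settings page.",
    "Our office is closed on public holidays.",
]
results = toy_rerank("how do I reset my password", docs)
for r in results:
    print(r.index, round(r.relevance_score, 2), r.text)
```

The point is the shape: a query and a list of documents go in, and the same documents come back in a new order, each with its original index and a score.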

Model capabilities

5 Core Capabilities

Document Reranking

Reorders candidate documents or passages by relevance to a query, improving information retrieval quality over initial search results.

Semantic Matching

Assesses semantic similarity between queries and texts to surface contextually related results beyond simple keyword overlap.

Search Optimization

Enhances search pipelines by providing relevance scores that can be integrated into ranking, filtering, or hybrid retrieval systems.

Multilingual Queries

Handles queries and documents across multiple languages for ranking tasks, supporting diverse international search and retrieval scenarios.

Result Personalization

Enables customized ranking strategies by incorporating additional metadata or signals alongside textual relevance scores when reranking results.
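
One way to combine a relevance score with an extra signal, as described under Result Personalization, is a simple weighted blend. The sketch below assumes the relevance score and the metadata signal are both already scaled to [0, 1]. The `blend_scores` helper, the `freshness` field, and the 0.8 weight are illustrative assumptions, not part of any Cohere or LLM.API interface.

```python
def blend_scores(candidates: list[dict], relevance_weight: float = 0.8) -> list[dict]:
    """Sort candidates by a weighted mix of relevance and a metadata signal."""
    if not 0.0 <= relevance_weight <= 1.0:
        raise ValueError("relevance_weight must be between 0 and 1")
    blended = []
    for c in candidates:
        # Hypothetical signal: 'freshness' stands in for any metadata score in [0, 1].
        final = relevance_weight * c["relevance_score"] + (1 - relevance_weight) * c["freshness"]
        blended.append({**c, "final_score": final})
    return sorted(blended, key=lambda c: c["final_score"], reverse=True)


candidates = [
    {"id": "a", "relevance_score": 0.90, "freshness": 0.10},
    {"id": "b", "relevance_score": 0.85, "freshness": 0.90},
    {"id": "c", "relevance_score": 0.40, "freshness": 1.00},
]
ranked = blend_scores(candidates)
print([c["id"] for c in ranked])
```

With the default weight, document "b" overtakes "a" because its small relevance deficit is outweighed by the metadata signal; setting `relevance_weight=1.0` falls back to pure relevance ordering.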

Use cases

6 Most Valuable Use Cases

Search Results Reranking
E-commerce Product Ranking
Legal Case Retrieval
News Feed Prioritization
Support Ticket Triage
Code Snippet Ranking

Transparent pricing

Cost Comparison

Up to ~60% cheaper and faster than comparable rerank APIs

Provider	Region	Latency	Throughput	Uptime	Input ($/1M)	Output ($/1M)	Context
LLM.API	Global	~120ms	~120 qps	99.99%	~$0.30 per 1M input tokens	$0.00	200K tokens
Cohere	Global	~220ms	~60 qps	99.9%	~$0.75 per 1M input tokens	$0.00	128K tokens
Azure AI	US East	~260ms	~80 qps	99.9%	~$0.90 per 1M input tokens	$0.00	128K tokens
AWS Bedrock	US West	~280ms	~70 qps	99.9%	~$0.95 per 1M input tokens	$0.00	128K tokens
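
The per-token prices in the table translate into a workload estimate with simple arithmetic. The sketch below uses the table's approximate input prices; the workload figures (requests per day, documents per request, tokens per document) are made-up inputs for illustration, the function names are hypothetical, and query tokens are ignored for simplicity.

```python
# Approximate input prices from the table above, in USD per 1M tokens.
PRICE_PER_MILLION = {
    "LLM.API": 0.30,
    "Cohere": 0.75,
    "Azure AI": 0.90,
    "AWS Bedrock": 0.95,
}


def monthly_cost(provider: str, requests_per_day: int, docs_per_request: int,
                 tokens_per_doc: int, days: int = 30) -> float:
    """Estimate monthly spend in USD from document tokens only."""
    tokens = requests_per_day * docs_per_request * tokens_per_doc * days
    return tokens / 1_000_000 * PRICE_PER_MILLION[provider]


def savings_vs(baseline: str, provider: str = "LLM.API") -> float:
    """Fraction saved by using `provider` instead of `baseline`."""
    base = PRICE_PER_MILLION[baseline]
    return (base - PRICE_PER_MILLION[provider]) / base


# Illustrative workload: 10,000 requests/day, 100 documents each, 300 tokens per document.
for name in PRICE_PER_MILLION:
    print(name, round(monthly_cost(name, 10_000, 100, 300), 2))
print("savings vs Cohere:", round(savings_vs("Cohere"), 2))
```

Since output is always $0.00 for a reranker, input tokens are the only cost term in this estimate.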

Performance benchmarks

Technical Specifications

Metric	Rerank 4 Fast (Cohere)	text-embedding-3-large (OpenAI)	nomic-embed-text v1 (Nomic)
Task Type	Reranking	Embedding	Embedding/Reranking
Dimensions	~1024	3072	768
Max Input Tokens	~8K	8K	8K
Avg Latency	~120ms	~180ms	~200ms
Price per 1M Tokens	~$0.15	$0.13	~$0.10
Throughput	~150 QPS	~200 QPS	~120 QPS
Uptime	99.9%	99.9%	99.5%

30-day usage via LLM.API

1.1B: Documents reranked in last 30 days
27M: API requests served in last 30 days
3.6K: Active teams using Rerank 4 Fast
99.95%: Avg uptime over last 30 days


Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Unified AI Routing

Dynamically route each request across providers and models based on latency, cost, or quality. One endpoint lets you A/B test, roll out, and swap models safely.
One endpoint, any model

Cost-Aware Orchestration

Automatically pick the most cost-effective model per task while enforcing budgets and quotas. Reduce spend without rewriting app logic or touching provider settings.
Optimize spend by default

Resilient Fallback Flows

Define provider-agnostic fallback chains that retry, downgrade, or reroute on failures and timeouts. Keep production workloads up even when individual APIs break.
Designed for failure

Full-Stack Observability

Get unified traces, logs, metrics, and payload insights across every provider. Debug latency spikes, failures, and regressions from a single, queryable view.
See every token

Task-Level Abstractions

Describe intent as tasks (chat, generate, extract, score) instead of raw prompts. LLM.API normalizes parameters so you can swap models without rewriting integrations.
Code to tasks, not models

High-Throughput Batch

Submit massive batch jobs with automatic chunking, concurrency control, retries, and result aggregation. Maximize throughput while staying within rate limits and SLAs.
Millions of calls, one job
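
The chunking and aggregation idea behind batch jobs can be sketched client-side as follows. This is not LLM.API's implementation: `chunked`, `rerank_in_chunks`, and `score_fn` are hypothetical names, the scorer is supplied by the caller, and the sketch assumes each document's score depends only on the query and that document, so scores from different chunks can be merged into one ranking.

```python
from typing import Callable, Iterator


def chunked(items: list[str], size: int) -> Iterator[tuple[int, list[str]]]:
    """Yield (start_offset, chunk) pairs covering the list in order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield start, items[start:start + size]


def rerank_in_chunks(
    query: str,
    documents: list[str],
    score_fn: Callable[[str, list[str]], list[float]],
    chunk_size: int = 100,
) -> list[tuple[int, float]]:
    """Score documents chunk by chunk and merge into one ranking.

    Returns (original_index, score) pairs, highest score first.
    """
    scored = []
    for offset, chunk in chunked(documents, chunk_size):
        scores = score_fn(query, chunk)  # one scoring call per chunk
        for i, score in enumerate(scores):
            scored.append((offset + i, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Stand-in scorer for the demo: longer documents score higher.
def fake_scores(query: str, docs: list[str]) -> list[float]:
    return [float(len(d)) for d in docs]


merged = rerank_in_chunks("q", ["aa", "a", "aaaa", "aaa", "aaaaa"], fake_scores, chunk_size=2)
print(merged)
```

Keeping the original index alongside each score is what lets results from separate chunks be mapped back to the caller's document list.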

Decision guide

When to Use — When NOT to Use

Use it if...

You need to rerank search results from a traditional or vector search backend.
You need fast, low-cost relevance scoring for many query-document pairs at scale.
Your use case involves improving recommendation ordering based on textual similarity and relevance.
Your use case involves ranking retrieved passages before passing a few into a generator.
You need to prioritize customer support articles matching a user query or ticket.
Your use case involves ranking product listings by semantic match to user queries.
You need a lightweight reranker to boost quality of existing keyword search.
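
For the RAG case in the list above, where only a few top passages are passed to a generator, the selection step might look like the sketch below. It assumes reranked results arrive as (text, score) pairs sorted from most to least relevant. `select_for_prompt`, the score threshold, and the whitespace-based token estimate are illustrative assumptions; a real pipeline would use the generator's own tokenizer.

```python
def select_for_prompt(
    ranked: list[tuple[str, float]],
    max_passages: int = 3,
    min_score: float = 0.3,
    token_budget: int = 200,
) -> list[str]:
    """Pick top passages that clear a score threshold and fit a token budget."""
    selected = []
    used = 0
    for text, score in ranked:
        if len(selected) >= max_passages or score < min_score:
            break  # input is sorted, so later passages score no higher
        cost = len(text.split())  # rough token estimate: whitespace-separated words
        if used + cost > token_budget:
            continue  # this passage does not fit, but a shorter one later might
        selected.append(text)
        used += cost
    return selected


ranked = [
    ("Reset your password from the account settings page.", 0.92),
    ("Passwords must contain at least twelve characters.", 0.71),
    ("Shipping takes three to five business days.", 0.12),
]
print(select_for_prompt(ranked))
```

The threshold drops the low-scoring shipping passage even though the passage limit would have allowed it, which keeps irrelevant context out of the generator's prompt.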

Avoid if...

You need a generative model to write, summarize, or translate text content.
Your workload requires understanding images, audio, or other non-text modalities.
You need complex multi-step reasoning, planning, or tool-use instead of simple relevance ranking.
Your workload requires processing extremely long documents beyond typical reranker context limits.
You need low-latency, on-device inference rather than calling a hosted reranking API.
Your workload requires training or fine-tuning custom models, not using fixed rerank endpoints.
You need personalized ranking heavily dependent on user profiles rather than text similarity.

FAQ

Frequently Asked Questions

What is Rerank 4 Fast?

Rerank 4 Fast is a Cohere model that scores and reorders candidate documents or passages for a query to improve retrieval relevance.

What is Rerank 4 Fast best used for?

It is best for fast, low-cost reranking in search, RAG pipelines, recommendation systems, and retrieval-based question answering.

How is Rerank 4 Fast priced when called through LLM.API?

LLM.API usage is typically billed per input token or per item scored; check the LLM.API pricing page for exact current rates.

What context window does Rerank 4 Fast support?

Rerank 4 Fast can handle relatively long queries and documents but is generally limited to a few thousand tokens per input item.

How fast is Rerank 4 Fast in terms of latency?

It is optimized for low-latency inference, so it typically returns relevance scores quickly even when ranking many candidates.

What modalities does Rerank 4 Fast support?

Rerank 4 Fast operates on text-only inputs, taking a text query and a list of text documents to score.

How do I call Rerank 4 Fast via the LLM.API gateway?

You select the Cohere provider and specify the Rerank 4 Fast model name in your LLM.API rerank request payload.

How does Rerank 4 Fast compare to other Cohere reranking models?

Compared to larger or more accurate variants, Rerank 4 Fast generally trades a bit of quality for better speed and lower cost.

Can I use Rerank 4 Fast for general text generation?

No, it is a ranking model that scores documents against a query and does not generate free-form text.

What limitations should I be aware of when using Rerank 4 Fast?

Its relevance depends on the quality of candidate documents and it may underperform on highly domain-specific or out-of-distribution content.
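
As a rough illustration of the rerank request mentioned in the FAQ, the sketch below only builds and serializes a payload; it does not send anything. The field names (`provider`, `model`, `query`, `documents`, `top_n`) and the model identifier `rerank-4-fast` are assumptions made for illustration, not confirmed LLM.API parameters, so check the LLM.API documentation for the actual schema and endpoint.

```python
import json


def build_rerank_payload(query: str, documents: list[str], top_n: int = 3) -> dict:
    """Assemble a hypothetical rerank request body."""
    if not documents:
        raise ValueError("documents must not be empty")
    return {
        "provider": "cohere",      # assumed field name and value
        "model": "rerank-4-fast",  # placeholder model identifier
        "query": query,
        "documents": documents,
        # Never ask for more results than there are documents.
        "top_n": min(top_n, len(documents)),
    }


payload = build_rerank_payload("reset password", ["doc one", "doc two"], top_n=5)
body = json.dumps(payload)
print(body)
```

Building the payload separately from the HTTP call makes it easy to validate inputs and log the exact request before it is sent.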
