Powered by Cohere

Rerank 4 Fast

  • Reranking

Rerank 4 Fast is Cohere’s fourth-generation multilingual reranking model, optimized for low-latency, high-throughput retrieval, with a context window of roughly 32K tokens. It is designed to quickly reorder candidate documents by semantic relevance to a query in production search and RAG pipelines.

Start Using API

What is Rerank 4 Fast?

Rerank 4 Fast is a Cohere reranking model that scores and reorders candidate documents by their relevance to a user query, with support for long contexts and multilingual data. It is mainly used to improve result ordering in retrieval-augmented generation (RAG), enterprise search, and agent workflows where many documents must be ranked quickly. It also suits high-traffic, latency-sensitive applications, where it trades a bit of precision for speed relative to heavier rerankers. It belongs to Cohere’s Rerank 4 family alongside Rerank 4 Pro, and the family succeeds the earlier Rerank 3.x models.

5 Core Capabilities

  • Document Reranking

    Reorders candidate documents or passages by relevance to a query, improving information retrieval quality over initial search results.

  • Semantic Matching

    Assesses semantic similarity between queries and texts to surface contextually related results beyond simple keyword overlap.

  • Search Optimization

    Enhances search pipelines by providing relevance scores that can be integrated into ranking, filtering, or hybrid retrieval systems.

  • Multilingual Queries

    Handles queries and documents across multiple languages for ranking tasks, supporting diverse international search and retrieval scenarios.

  • Result Personalization

    Enables customized ranking strategies by incorporating additional metadata or signals alongside textual relevance scores when reranking results.
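One way to picture the personalization capability above is as a post-processing step that mixes the reranker's relevance score with a separate metadata signal. The sketch below is illustrative only: `ScoredDoc`, `blend_scores`, the `boost` field, and the 0.8 weight are hypothetical names and values, and it assumes both scores have already been scaled to the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass
class ScoredDoc:
    doc_id: str
    relevance: float  # score from the reranker, assumed to lie in [0, 1]
    boost: float      # hypothetical metadata signal in [0, 1], e.g. freshness


def blend_scores(docs: list[ScoredDoc], weight: float = 0.8) -> list[ScoredDoc]:
    """Order docs by a weighted mix of reranker relevance and a metadata boost."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be between 0 and 1")

    def combined(doc: ScoredDoc) -> float:
        # weight=1.0 reproduces the pure relevance ordering.
        return weight * doc.relevance + (1.0 - weight) * doc.boost

    return sorted(docs, key=combined, reverse=True)


docs = [
    ScoredDoc("a", relevance=0.90, boost=0.10),
    ScoredDoc("b", relevance=0.85, boost=0.90),
    ScoredDoc("c", relevance=0.40, boost=1.00),
]
ranked = blend_scores(docs, weight=0.8)
print([d.doc_id for d in ranked])
```

Keeping the blend outside the model call means the weight can be tuned or A/B tested without changing how documents are scored.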

6 Most Valuable Use Cases

  • Search Results Reranking
  • E-commerce Product Ranking
  • Legal Case Retrieval
  • News Feed Prioritization
  • Support Ticket Triage
  • Code Snippet Ranking

Cost Comparison

About 60% cheaper and 45% lower latency than the direct Cohere listing, based on the figures below

Provider        Region   Latency  Throughput  Uptime  Input ($/1M)  Output ($/1M)  Context
LLM API (BEST)  Global   ~120ms   ~120 qps    99.99%  ~$0.30        $0.00          200K tokens
Cohere          Global   ~220ms   ~60 qps     99.9%   ~$0.75        $0.00          128K tokens
Azure AI        US East  ~260ms   ~80 qps     99.9%   ~$0.90        $0.00          128K tokens
AWS Bedrock     US West  ~280ms   ~70 qps     99.9%   ~$0.95        $0.00          128K tokens
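The relative savings implied by the comparison table can be worked out directly from its listed figures. The short sketch below copies the approximate latency and input-price values from the table and computes how much lower the LLM API row is than each other row; the helper names `percent_lower` and `compare` are hypothetical.

```python
# Figures copied from the comparison table above (approximate, as listed).
providers = {
    "LLM API": {"latency_ms": 120, "input_usd_per_1m": 0.30},
    "Cohere": {"latency_ms": 220, "input_usd_per_1m": 0.75},
    "Azure AI": {"latency_ms": 260, "input_usd_per_1m": 0.90},
    "AWS Bedrock": {"latency_ms": 280, "input_usd_per_1m": 0.95},
}


def percent_lower(baseline: float, value: float) -> float:
    """Return how much lower `value` is than `baseline`, as a percentage."""
    return (baseline - value) / baseline * 100.0


def compare(reference: str = "LLM API") -> dict[str, tuple[float, float]]:
    """Map each other provider to (price saving %, latency reduction %)."""
    ref = providers[reference]
    result = {}
    for name, row in providers.items():
        if name == reference:
            continue
        price = percent_lower(row["input_usd_per_1m"], ref["input_usd_per_1m"])
        latency = percent_lower(row["latency_ms"], ref["latency_ms"])
        result[name] = (round(price, 1), round(latency, 1))
    return result


savings = compare()
for name, (price, latency) in savings.items():
    print(f"{name}: {price}% cheaper, {latency}% lower latency")
```

Because the table's own values are approximate, the percentages are estimates rather than guaranteed savings.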

Technical Specifications

Metric               Rerank 4 Fast (Cohere)  text-embedding-3-large (OpenAI)  nomic-embed-text v1 (Nomic)
Task Type            Reranking               Embedding                        Embedding/Reranking
Dimensions           ~1024                   3072                             768
Max Input Tokens     ~8K                     8K                               8K
Avg Latency          ~120ms                  ~180ms                           ~200ms
Price per 1M Tokens  ~$0.15                  $0.13                            ~$0.10
Throughput           ~150 QPS                ~200 QPS                         ~120 QPS
Uptime               99.9%                   99.9%                            99.5%

30-day usage via LLM API

  • 1.1B documents reranked in last 30 days
  • 27M API requests served in last 30 days
  • 3.6K active teams using Rerank 4 Fast
  • 99.95% average uptime over last 30 days

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

  • Unified AI Routing

    Dynamically route each request across providers and models based on latency, cost, or quality. One endpoint lets you A/B test, roll out, and swap models safely.

    One endpoint, any model
  • Cost-Aware Orchestration

    Automatically pick the most cost-effective model per task while enforcing budgets and quotas. Reduce spend without rewriting app logic or touching provider settings.

    Optimize spend by default
  • Resilient Fallback Flows

    Define provider-agnostic fallback chains that retry, downgrade, or reroute on failures and timeouts. Keep production workloads up even when individual APIs break.

    Designed for failure
  • Full-Stack Observability

    Get unified traces, logs, metrics, and payload insights across every provider. Debug latency spikes, failures, and regressions from a single, queryable view.

    See every token
  • Task-Level Abstractions

    Describe intent as tasks—chat, generate, extract, score—instead of raw prompts. LLM.API normalizes parameters so you can swap models without rewriting integrations.

    Code to tasks, not models
  • High-Throughput Batch

    Submit massive batch jobs with automatic chunking, concurrency control, retries, and result aggregation. Maximize throughput while staying within rate limits and SLAs.

    Millions of calls, one job
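The fallback behavior described above (retry, then reroute to the next provider on failures and timeouts) can be sketched generically. This is not LLM.API's implementation or configuration format. `call_with_fallback`, the `Provider` type, and the two stand-in providers are hypothetical and make no network calls.

```python
import time
from typing import Callable, Sequence

# A provider is any callable that takes a request dict and returns a response dict.
Provider = Callable[[dict], dict]


def call_with_fallback(
    request: dict,
    chain: Sequence[tuple[str, Provider]],
    retries_per_provider: int = 2,
    backoff_seconds: float = 0.0,
) -> tuple[str, dict]:
    """Try each provider in order; retry on failure, then move to the next one."""
    errors: list[str] = []
    for name, provider in chain:
        for attempt in range(1, retries_per_provider + 1):
            try:
                return name, provider(request)
            except Exception as exc:  # a real client would catch narrower errors
                errors.append(f"{name} attempt {attempt}: {exc}")
                if backoff_seconds:
                    time.sleep(backoff_seconds * attempt)
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in providers for illustration; neither talks to a real API.
def flaky_primary(request: dict) -> dict:
    raise TimeoutError("simulated timeout")


def stable_secondary(request: dict) -> dict:
    return {"ok": True, "echo": request["query"]}


used, response = call_with_fallback(
    {"query": "reset password"},
    chain=[("primary", flaky_primary), ("secondary", stable_secondary)],
)
print(used, response)
```

Returning the name of the provider that answered makes it easy to log how often traffic falls through to a backup.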

When to Use — When NOT to Use

Use it if...

  • You need to rerank search results from a traditional or vector search backend.
  • You need fast, low-cost relevance scoring for many query-document pairs at scale.
  • Your use case involves improving recommendation ordering based on textual similarity and relevance.
  • Your use case involves ranking retrieved passages before passing a few into a generator.
  • You need to prioritize customer support articles matching a user query or ticket.
  • Your use case involves ranking product listings by semantic match to user queries.
  • You need a lightweight reranker to boost quality of existing keyword search.
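Several of the cases above share one pattern: score a set of retrieved candidates against the query, then keep only the best few, for example before passing them to a generator. The sketch below shows that selection step with the scorer injected as a function. `select_context` and `overlap_scorer` are hypothetical names, and the word-overlap scorer is only a toy stand-in for a real reranker call, not a semantic model.

```python
from typing import Callable

# Scorer signature: (query, documents) -> one relevance score per document.
Scorer = Callable[[str, list[str]], list[float]]


def select_context(
    query: str,
    candidates: list[str],
    scorer: Scorer,
    top_n: int = 3,
    min_score: float = 0.0,
) -> list[tuple[str, float]]:
    """Score candidates, drop those below min_score, and keep the top_n best."""
    scores = scorer(query, candidates)
    if len(scores) != len(candidates):
        raise ValueError("scorer must return one score per candidate")
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, score) for doc, score in ranked if score >= min_score][:top_n]


def overlap_scorer(query: str, documents: list[str]) -> list[float]:
    """Toy stand-in: fraction of query words that appear in each document."""
    query_words = set(query.lower().split())
    denominator = max(len(query_words), 1)  # avoid dividing by zero on empty queries
    return [
        len(query_words & set(doc.lower().split())) / denominator
        for doc in documents
    ]


candidates = [
    "Shipping takes three to five business days",
    "To reset your password open account settings",
    "Password reset links expire after one hour",
]
context = select_context("how to reset password", candidates, overlap_scorer, top_n=2)
for doc, score in context:
    print(f"{score:.2f}  {doc}")
```

Swapping `overlap_scorer` for a function that calls a hosted reranker leaves the selection logic unchanged.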

Avoid if...

  • You need a generative model to write, summarize, or translate text content.
  • Your workload requires understanding images, audio, or other non-text modalities.
  • You need complex multi-step reasoning, planning, or tool-use instead of simple relevance ranking.
  • Your workload requires processing extremely long documents beyond typical reranker context limits.
  • You need low-latency, on-device inference rather than calling a hosted reranking API.
  • Your workload requires training or fine-tuning custom models, not using fixed rerank endpoints.
  • You need personalized ranking heavily dependent on user profiles rather than text similarity.

Frequently Asked Questions

  • What is Rerank 4 Fast?

    Rerank 4 Fast is a Cohere model that scores and reorders candidate documents or passages for a query to improve retrieval relevance.

  • What is Rerank 4 Fast best used for?

    It is best for fast, low-cost reranking in search, RAG pipelines, recommendation systems, and retrieval-based question answering.

  • How is Rerank 4 Fast priced when called through LLM.API?

    LLM.API usage is typically billed per input token or per item scored; check the LLM.API pricing page for exact current rates.

  • What context window does Rerank 4 Fast support?

    Rerank 4 Fast can handle relatively long queries and documents but is generally limited to a few thousand tokens per input item.

  • How fast is Rerank 4 Fast in terms of latency?

    It is optimized for low-latency inference, so it typically returns relevance scores quickly even when ranking many candidates.

  • What modalities does Rerank 4 Fast support?

    Rerank 4 Fast operates on text-only inputs, taking a text query and a list of text documents to score.

  • How do I call Rerank 4 Fast via the LLM.API gateway?

    You select the Cohere provider and specify the Rerank 4 Fast model name in your LLM.API rerank request payload.

  • How does Rerank 4 Fast compare to other Cohere reranking models?

    Compared to larger or more accurate variants, Rerank 4 Fast generally trades a bit of quality for better speed and lower cost.

  • Can I use Rerank 4 Fast for general text generation?

    No, it is a ranking model that scores documents against a query and does not generate free-form text.

  • What limitations should I be aware of when using Rerank 4 Fast?

    Its relevance depends on the quality of candidate documents and it may underperform on highly domain-specific or out-of-distribution content.
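The FAQ above says a request names the Cohere provider and the Rerank 4 Fast model in a rerank payload. The sketch below only assembles and validates such a payload. Every field name (`provider`, `model`, `query`, `documents`, `top_n`) and the model identifier string are assumptions made for illustration, so check the LLM.API documentation for the actual request schema before relying on any of them.

```python
import json
from typing import Optional


def build_rerank_request(
    query: str,
    documents: list[str],
    top_n: Optional[int] = None,
    provider: str = "cohere",      # hypothetical provider selector
    model: str = "rerank-4-fast",  # placeholder id, not a confirmed model name
) -> str:
    """Serialize a rerank request body. All field names here are assumptions."""
    if not query.strip():
        raise ValueError("query must not be empty")
    if not documents:
        raise ValueError("documents must contain at least one item")
    if top_n is not None and not 1 <= top_n <= len(documents):
        raise ValueError("top_n must be between 1 and len(documents)")

    payload = {
        "provider": provider,
        "model": model,
        "query": query,
        "documents": documents,
    }
    if top_n is not None:
        payload["top_n"] = top_n
    return json.dumps(payload, ensure_ascii=False)


body = build_rerank_request(
    "refund policy for damaged items",
    ["Returns are accepted within 30 days.", "Our offices are closed on Sundays."],
    top_n=1,
)
print(body)
```

Validating the inputs before sending avoids spending a request on a payload that cannot be scored.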
