bge-m3

Text Embeddings

bge-m3 is a multilingual text embedding model from BAAI that produces dense, sparse, and ColBERT-style multi-vector embeddings in a single pass, enabling powerful hybrid retrieval. It is optimized for long-context, multi-language semantic search and retrieval applications.

API Performance

Latency: ~0.35s avg embedding time per 1K tokens on A100
Context: 8K tokens
Input: Free per 1M tokens
Output: Free per 1M tokens
Uptime: 99%

About the model

What is bge-m3?

bge-m3 is a multi-functional, multilingual, multi-granularity text embedding model developed by BAAI that outputs dense, sparse, and ColBERT-style multi-vector embeddings from a single forward pass. It is primarily used for information retrieval, semantic search, and retrieval-augmented generation, where one model can power dense, sparse (lexical), and hybrid search pipelines. It supports over 100 languages and long documents (up to about 8K tokens) for use cases such as cross-lingual search, question answering over large corpora, and document similarity. It belongs to BAAI's BGE (BAAI General Embedding) family and extends earlier BGE models with unified training for dense, sparse, and multi-vector retrieval.

Input / Output

Input

Text (list of strings, up to model context limit)

Output

Embedding vectors (floating-point arrays representing each input text)

Model capabilities

5 Core Capabilities

Dense Retrieval

Generates high-quality dense text embeddings for semantic similarity search, ranking, and retrieval across many tasks and domains.
Sparse Retrieval

Produces sparse lexical-token representations enabling BM25‑like keyword matching, hybrid search, and improved recall in information retrieval.
Multi-Vector Embeddings

Outputs ColBERT-style multi-vector embeddings for fine-grained late interaction retrieval, improving accuracy on complex search queries.
Multilingual Support

Supports over one hundred languages in a shared embedding space, enabling cross-lingual search, retrieval, and comparison of text.
Long-Context Encoding

Encodes long texts, from short sentences to multi-thousand-token documents, into unified embeddings suitable for RAG pipelines.
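
The three retrieval modes above each score a query against a document in a different way. The following is a minimal pure-Python sketch of those scoring rules, assuming the model's outputs are already available as plain lists and dicts. The function names and toy inputs are illustrative and are not part of any bge-m3 or LLM.API interface.

```python
import math


def dense_score(q, d):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


def sparse_score(q_weights, d_weights):
    """Lexical match: sum of weight products over tokens both texts share."""
    return sum(w * d_weights[tok] for tok, w in q_weights.items() if tok in d_weights)


def maxsim_score(q_vecs, d_vecs):
    """ColBERT-style late interaction.

    For each query token vector, keep its best cosine match among the
    document token vectors, then average over the query tokens.
    (Summing instead of averaging gives the same ranking for a fixed query.)
    """
    if not q_vecs or not d_vecs:
        return 0.0
    best = [max(dense_score(qv, dv) for dv in d_vecs) for qv in q_vecs]
    return sum(best) / len(q_vecs)


# Toy inputs (made up for illustration).
dense = dense_score([1.0, 0.0], [1.0, 0.0])
sparse = sparse_score({"embed": 0.5, "model": 0.2}, {"embed": 0.4, "search": 0.9})
multivec = maxsim_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
```

Dense scoring compares one vector per text, sparse scoring only rewards shared tokens, and late interaction compares token by token, which is why it tends to cost more to store and compute.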

Use cases

6 Most Valuable Use Cases

Hybrid Text Retrieval
Multilingual Semantic Search
RAG Knowledge Bases
Cross-Lingual Document Matching
Domain Chatbots Retrieval
Long-Context Text Indexing
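
Hybrid text retrieval, the first use case in the list above, usually means combining the dense, sparse, and multi-vector scores of each candidate into one ranking score. A minimal sketch of a weighted-sum fusion follows. The weights, field names, and candidate scores are hypothetical and would need tuning on your own data. The sketch also assumes the three scores are already on comparable scales.

```python
def hybrid_score(dense, sparse, multivec, weights=(0.4, 0.2, 0.4)):
    """Weighted average of the three per-candidate scores.

    The default weights are placeholders, not recommended values.
    """
    w_dense, w_sparse, w_multi = weights
    total = w_dense + w_sparse + w_multi
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w_dense * dense + w_sparse * sparse + w_multi * multivec) / total


def rank(candidates, weights=(0.4, 0.2, 0.4)):
    """Sort candidate dicts by fused score, best first."""
    return sorted(
        candidates,
        key=lambda c: hybrid_score(c["dense"], c["sparse"], c["multivec"], weights),
        reverse=True,
    )


# Made-up scores for two candidate documents.
candidates = [
    {"id": "a", "dense": 0.80, "sparse": 0.10, "multivec": 0.70},
    {"id": "b", "dense": 0.60, "sparse": 0.90, "multivec": 0.65},
]
ranked_ids = [c["id"] for c in rank(candidates)]
```

With these toy numbers, document "b" ranks first under the mixed weights because of its strong keyword match, while a dense-only weighting such as (1, 0, 0) would put "a" first.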

Transparent pricing

Cost Comparison

According to the comparison below, LLM API lists the lowest input price, the lowest latency, and the highest throughput for bge-m3 among the providers shown.

Provider	Region	Latency (ms)	Throughput (K tokens/s)	Uptime	Input ($/1M tokens)	Output ($/1M tokens)	Context (tokens)
LLM API	Global	80	120	99.99%	0.02	0.00	8K
BAAI (Official API)	Global	~180	~40	~99.5%	~0.10	0.00	8K
Fireworks AI	US East	~150	~60	~99.9%	~0.06	0.00	~16K
Together AI	US West	~160	~55	~99.9%	~0.07	0.00	~8K
Replicate	Global	~220	~30	~99.0%	~0.12	0.00	~4K
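
To turn the per-million-token prices in the table into a budget estimate, multiply the token volume by the listed input price. The sketch below does that arithmetic. The prices are copied from the table (several are marked approximate there), and the 500M-token monthly volume is a made-up example.

```python
# Input prices in USD per 1M tokens, copied from the comparison table.
# Values marked "~" in the table are approximate.
INPUT_PRICE_PER_MILLION = {
    "LLM API": 0.02,
    "BAAI (Official API)": 0.10,
    "Fireworks AI": 0.06,
    "Together AI": 0.07,
    "Replicate": 0.12,
}


def embedding_cost_usd(tokens, price_per_million_usd):
    """Cost of embedding `tokens` input tokens at a per-1M-token price."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return tokens / 1_000_000 * price_per_million_usd


monthly_tokens = 500_000_000  # hypothetical workload
estimates = {
    provider: round(embedding_cost_usd(monthly_tokens, price), 2)
    for provider, price in INPUT_PRICE_PER_MILLION.items()
}
```

For this example volume, the estimate is about $10 at $0.02 per 1M tokens and about $50 at $0.10 per 1M tokens. Output is listed at $0.00 for every provider, so only input tokens contribute.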

Performance benchmarks

Technical Specifications

Metric	bge-m3 (BAAI)	text-embedding-3-large (OpenAI)	e5-mistral-7b-instruct (intfloat / Hugging Face)
Dimensions	1024	3072	4096
Max Input Tokens	~8K	8K	~4K
Price per 1M Tokens	~$0.05	$0.13	~$0.20
Throughput	~1,200 tps	~1,000 tps	~600 tps
Avg Latency	~120ms	~150ms	~220ms
Uptime	~99.5%	~99.9%	~99.0%

30-day usage via LLM API

3.4B: Prompt tokens processed (30 days)
27M: Embedding vectors generated (30 days)
740K: API requests served (30 days)
99.9%: Avg uptime (last 30 days)

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Intelligent AI Routing

Automatically route each request to the optimal model or provider based on latency, cost, and quality—without changing your integration or client code.
One endpoint, any model
Cost-Aware Controls

Enforce per-key, per-project, and per-model budgets while auto-selecting cheaper equivalents so you never blow past spend limits in production.
Predictable spend at scale
Resilient Fallback Logic

Define multi-provider failover policies so traffic transparently reroutes on timeouts, errors, or quota issues—no manual retries or brittle custom logic.
Stay online, even upstream
End-to-End Observability

Get centralized logs, traces, and metrics across every provider, model, and project to debug prompts, track latency, and optimize performance in one place.
See every token flow
Task-Level Abstractions

Call high-level tasks like chat, tools, RAG, and scoring instead of provider-specific APIs, so you can swap models without rewriting business logic.
Code to tasks, not vendors
High-Throughput Batch

Run large-scale batch inference jobs across providers with automatic chunking, retries, and concurrency control to maximize throughput and minimize unit cost.
Bulk inference made easy
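
The failover behaviour described under Resilient Fallback Logic can be pictured as trying providers in a fixed order until one succeeds. The sketch below illustrates only that idea on the client side. It is not LLM.API's policy configuration, which this page does not show, and the provider callables are hypothetical stand-ins.

```python
def embed_with_fallback(texts, providers):
    """Try each (name, embed_fn) pair in order.

    Returns (provider_name, vectors) from the first provider that succeeds.
    Raises RuntimeError listing every failure if none succeed.
    """
    errors = []
    for name, embed_fn in providers:
        try:
            return name, embed_fn(texts)
        except Exception as exc:  # broad on purpose: this is only a sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stand-ins for real provider clients.
def flaky_provider(texts):
    raise TimeoutError("upstream timed out")


def healthy_provider(texts):
    return [[0.0, 0.0, 0.0] for _ in texts]


used, vectors = embed_with_fallback(
    ["hello", "bonjour"],
    [("primary", flaky_provider), ("secondary", healthy_provider)],
)
```

Here the primary provider times out, so the call is served by the secondary one. A production policy would normally also bound retries and distinguish retryable errors from permanent ones.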

Decision guide

When to Use — When NOT to Use

Use it if...

You need a strong general-purpose embedding model that supports multiple languages efficiently.
You need dense, sparse, and multi-vector retrieval covered by a single unified embedding model.
Your use case involves semantic search or dense retrieval over large multilingual document corpora.
Your use case involves building rerankers or hybrid search pipelines using high-quality embeddings.
You need compact embeddings that balance retrieval quality and storage or memory constraints.
Your use case involves retrieval-augmented generation requiring robust cross-lingual semantic similarity.
You need an open-source embedding model that can run on-premise or offline.

Avoid if...

You need generative capabilities like text completion or chat, not just embeddings.
Your workload requires extremely long-context understanding beyond what typical embedding models handle.
You need embeddings optimized specifically for images, audio, or multimodal content.
Your workload requires fine-grained token-level reasoning instead of sentence or document embeddings.
You need a fully managed, production-ready API service with enterprise SLAs from the provider.
Your workload requires models trained on very domain-specific data like medical records.
You need embeddings explicitly aligned for safety-sensitive applications with strong guardrails.
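
For inputs longer than the context window, a common workaround is to split the text into overlapping chunks and embed each chunk separately. The sketch below assumes the text has already been tokenized into a list. The 8192-token limit is one reading of the "8K" figure on this page, and the overlap size is an arbitrary illustrative choice.

```python
def chunk_tokens(tokens, max_len=8192, overlap=256):
    """Split a token list into windows of at most `max_len` tokens.

    Consecutive windows share `overlap` tokens so that content near a
    boundary appears in both neighbouring chunks.
    """
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Small numbers so the windows are easy to inspect.
windows = chunk_tokens(list(range(10)), max_len=4, overlap=1)
```

Each chunk then gets its own embedding, and retrieval results can be mapped back to the parent document. Token counts should come from the model's own tokenizer rather than from whitespace splitting.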

FAQ

Frequently Asked Questions

What is bge-m3?

bge-m3 is a BAAI embedding model that supports multi-lingual, multi-function, and multi-granularity text and retrieval tasks.
What is bge-m3 best suited for?

bge-m3 is best suited for semantic search, dense retrieval, reranking, and building multilingual retrieval-augmented generation systems.
What context window does bge-m3 support?

bge-m3 supports inputs of up to about 8K tokens per text. Longer inputs must be truncated or split before embedding.
How fast is bge-m3 in terms of latency?

bge-m3 is relatively lightweight and can generate embeddings with low latency on modern GPUs for typical retrieval workloads.
What modalities does bge-m3 support?

bge-m3 is a text-only embedding model and does not support image, audio, or video inputs.
How is bge-m3 priced when accessed through LLM.API?

LLM.API usage-based pricing for bge-m3 is per-token for input text embeddings and is configured by the LLM.API platform, not BAAI.
How do I access bge-m3 via LLM.API?

You call the LLM.API embeddings endpoint specifying the provider as BAAI and the model name as bge-m3 in your request parameters.
How does bge-m3 compare to other embedding models?

bge-m3 offers strong multilingual retrieval quality and flexible embedding functions compared to many English-only or single-task embedding models.
What are the main limitations of bge-m3?

bge-m3 cannot generate or chat, only embeds text, and its performance may degrade on extremely long documents or unsupported languages.
Can bge-m3 handle both query and document embeddings?

Yes. The same encoder embeds both queries and documents, and unlike some earlier BGE models, bge-m3 does not require an instruction prefix on queries.

Start in 2 lines of code
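
A minimal sketch of an embeddings request, built with the Python standard library. The endpoint URL, the JSON field names, and the bearer-token header are all assumptions made for illustration, so check them against the LLM.API documentation before use. The only details taken from this page are the provider (BAAI) and the model name (bge-m3).

```python
import json
import urllib.request

# Hypothetical placeholder URL; replace with the real embeddings endpoint.
API_URL = "https://api.example.com/v1/embeddings"


def build_embedding_request(api_key, texts):
    """Build (but do not send) a POST request for bge-m3 embeddings.

    The payload field names ("provider", "model", "input") are assumed.
    """
    payload = {"provider": "BAAI", "model": "bge-m3", "input": list(texts)}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_embedding_request("YOUR_API_KEY", ["hello world", "bonjour le monde"])

# Sending is left commented out because the URL above is a placeholder:
# with urllib.request.urlopen(req) as resp:
#     body = json.load(resp)
```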
