Powered by BAAI
bge-m3
bge-m3 is a multilingual text embedding model from BAAI that produces dense, sparse, and ColBERT-style multi-vector embeddings in a single pass, enabling powerful hybrid retrieval. It is optimized for long-context, multi-language semantic search and retrieval applications.
About the model
What is bge-m3?
bge-m3 is a multi-functionality, multilingual, multi-granularity text embedding model developed by BAAI that outputs dense, sparse, and ColBERT-style embeddings simultaneously. It is primarily used for information retrieval, semantic search, and retrieval-augmented generation, where a single model can power dense, sparse (lexical), and hybrid search pipelines. It supports over 100 languages and long documents (up to 8,192 tokens) for use cases like cross-lingual search, question answering over large corpora, and document similarity. It belongs to BAAI’s BGE (BAAI General Embedding) family, extending earlier BGE embedding models with unified training for dense, sparse, and multi-vector retrieval.
Model capabilities
5 Core Capabilities
-
Dense Retrieval
Generates high-quality dense text embeddings for semantic similarity search, ranking, and retrieval across many tasks and domains.
-
Sparse Retrieval
Produces sparse lexical-token representations enabling BM25‑like keyword matching, hybrid search, and improved recall in information retrieval.
-
Multi-Vector Embeddings
Outputs ColBERT-style multi-vector embeddings for fine-grained late interaction retrieval, improving accuracy on complex search queries.
-
Multilingual Support
Supports over one hundred languages in a shared embedding space, enabling cross-lingual search, retrieval, and comparison of text.
-
Long-Context Encoding
Encodes long texts, from short sentences to multi-thousand-token documents, into unified embeddings suitable for RAG pipelines.
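The three output types correspond to three different relevance scores, which a hybrid pipeline can combine. The sketch below is a minimal pure-Python illustration on toy data, not BAAI's implementation: the function names, the toy inputs, and the fusion weights are all assumptions chosen for readability, and cosine similarity stands in for whatever normalization a real deployment applies.

```python
import math


def dense_score(q, d):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0


def sparse_score(q_weights, d_weights):
    """Lexical match: sum of weight products over tokens present in both texts."""
    return sum(w * d_weights[tok] for tok, w in q_weights.items() if tok in d_weights)


def multi_vector_score(q_vecs, d_vecs):
    """Late interaction: best document match per query vector, averaged over the query."""
    if not q_vecs or not d_vecs:
        return 0.0
    return sum(max(dense_score(q, d) for d in d_vecs) for q in q_vecs) / len(q_vecs)


def hybrid_score(dense, sparse, multi, weights=(0.4, 0.2, 0.4)):
    """Weighted sum of the three scores; the default weights are illustrative only."""
    w_dense, w_sparse, w_multi = weights
    return w_dense * dense + w_sparse * sparse + w_multi * multi


# Toy inputs standing in for real model outputs (hypothetical values).
dense = dense_score([0.6, 0.8], [0.8, 0.6])
sparse = sparse_score({"hybrid": 0.5, "search": 0.3}, {"search": 0.4, "index": 0.2})
multi = multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.6, 0.8]])
print(hybrid_score(dense, sparse, multi))
```

In practice the fusion weights are tuned per corpus, and the sparse score is often rescaled first because it is not bounded the way cosine similarity is.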
Use cases
6 Most Valuable Use Cases
- Hybrid Text Retrieval
- Multilingual Semantic Search
- RAG Knowledge Bases
- Cross-Lingual Document Matching
- Domain Chatbot Retrieval
- Long-Context Text Indexing
Transparent pricing
Cost Comparison
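Among the providers listed below, LLM.API shows the lowest input price, the lowest latency, and the highest throughput for bge-m3-class embeddings. Figures marked with ~ are approximate.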
| Provider | Region | Latency | Throughput | Uptime | Input ($/1M tokens) | Output ($/1M tokens) | Context (tokens) |
|---|---|---|---|---|---|---|---|
| LLM.API (best) | Global | 80ms | 120 kTkn/s | 99.99% | $0.02 | $0.00 | 8K |
| BAAI (Official API) | Global | ~180ms | ~40 kTkn/s | ~99.5% | ~$0.10 | $0.00 | 8K |
| Fireworks AI | US East | ~150ms | ~60 kTkn/s | ~99.9% | ~$0.06 | $0.00 | ~16K |
| Together AI | US West | ~160ms | ~55 kTkn/s | ~99.9% | ~$0.07 | $0.00 | ~8K |
| Replicate | Global | ~220ms | ~30 kTkn/s | ~99.0% | ~$0.12 | $0.00 | ~4K |
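Because embeddings carry no output-token charge in this table, the bill reduces to input tokens times the input price. The following sketch works that arithmetic through for each listed provider. The function name is an assumption, the prices are copied from the table (most of them approximate), and the monthly volume is an assumed workload set equal to the 3.4B tokens in this page's 30-day usage figures.

```python
def embedding_cost_usd(total_tokens: int, input_price_per_million: float) -> float:
    """Cost in USD for embedding `total_tokens` input tokens."""
    return total_tokens / 1_000_000 * input_price_per_million


# Input prices from the table above, in USD per 1M tokens (most are approximate).
INPUT_PRICES = {
    "LLM.API": 0.02,
    "BAAI (Official API)": 0.10,
    "Fireworks AI": 0.06,
    "Together AI": 0.07,
    "Replicate": 0.12,
}

monthly_tokens = 3_400_000_000  # assumed workload, not a measurement of yours

for provider, price in INPUT_PRICES.items():
    cost = embedding_cost_usd(monthly_tokens, price)
    print(f"{provider}: ${cost:,.2f} per month")
```

At that volume the gap between the cheapest and most expensive rows is a few hundred dollars a month, so latency and context length may matter more than price for smaller workloads.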
Performance benchmarks
Technical Specifications
| Metric | bge-m3 (BAAI) | text-embedding-3-large (OpenAI) | e5-mistral-7b-instruct (intfloat / Microsoft) |
|---|---|---|---|
| Dimensions | 1024 | 3072 | 4096 |
| Max Input Tokens | ~8K | 8K | ~4K |
| Price per 1M Tokens | ~$0.05 | $0.13 | ~$0.20 |
| Throughput | ~1,200 tps | ~1,000 tps | ~600 tps |
| Avg Latency | ~120ms | ~150ms | ~220ms |
| Uptime | ~99.5% | ~99.9% | ~99.0% |
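The Dimensions row translates directly into index storage. As a rough sketch, assuming one dense vector per document stored as uncompressed float32 (4 bytes per value) and ignoring any index overhead, the raw footprint can be estimated as follows. The function name and the one-million-vector corpus are illustrative assumptions, and sparse or multi-vector outputs would add to this.

```python
def index_size_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for dense vectors, excluding index structures and metadata."""
    return num_vectors * dimensions * bytes_per_value


# Dimensions taken from the table above.
DIMENSIONS = {
    "bge-m3": 1024,
    "text-embedding-3-large": 3072,
    "e5-mistral-7b-instruct": 4096,
}

num_vectors = 1_000_000  # assumed corpus size

for model, dims in DIMENSIONS.items():
    gib = index_size_bytes(num_vectors, dims) / 2**30
    print(f"{model}: {gib:.2f} GiB")
```

Storage scales linearly with dimension, so a 1024-dimensional vector needs one third of the space of a 3072-dimensional one and one quarter of a 4096-dimensional one.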
30-day usage via LLM.API
- 3.4B prompt tokens processed (30 days)
- 27M embedding vectors generated (30 days)
- 740K API requests served (30 days)
- 99.9% avg uptime (last 30 days)
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
-
Intelligent AI Routing
Automatically route each request to the optimal model or provider based on latency, cost, and quality, without changing your integration or client code.
One endpoint, any model
-
Cost-Aware Controls
Enforce per-key, per-project, and per-model budgets while auto-selecting cheaper equivalents so you never blow past spend limits in production.
Predictable spend at scale
-
Resilient Fallback Logic
Define multi-provider failover policies so traffic transparently reroutes on timeouts, errors, or quota issues, with no manual retries or brittle custom logic.
Stay online, even upstream
-
End-to-End Observability
Get centralized logs, traces, and metrics across every provider, model, and project to debug prompts, track latency, and optimize performance in one place.
See every token flow
-
Task-Level Abstractions
Call high-level tasks like chat, tools, RAG, and scoring instead of provider-specific APIs, so you can swap models without rewriting business logic.
Code to tasks, not vendors
-
High-Throughput Batch
Run large-scale batch inference jobs across providers with automatic chunking, retries, and concurrency control to maximize throughput and minimize unit cost.
Bulk inference made easy
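To make the fallback idea concrete, here is a minimal client-side sketch of ordered failover: try each provider in turn and move on when one raises. It illustrates the concept only and is not how LLM.API implements routing. The function name, the provider labels, and the two stub embedders are hypothetical stand-ins for real provider calls.

```python
from typing import Callable, List, Sequence, Tuple

Embedder = Callable[[List[str]], List[List[float]]]


def embed_with_fallback(
    texts: List[str], providers: Sequence[Tuple[str, Embedder]]
) -> Tuple[str, List[List[float]]]:
    """Return (provider_name, vectors) from the first provider that succeeds."""
    errors = []
    for name, embed in providers:
        try:
            return name, embed(texts)
        except Exception as exc:  # timeouts, quota errors, and so on
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Hypothetical stubs: the first simulates an outage, the second succeeds.
def flaky_provider(texts: List[str]) -> List[List[float]]:
    raise TimeoutError("upstream timed out")


def backup_provider(texts: List[str]) -> List[List[float]]:
    return [[0.0, 1.0] for _ in texts]


used, vectors = embed_with_fallback(
    ["hello", "world"],
    [("primary", flaky_provider), ("backup", backup_provider)],
)
print(used, len(vectors))
```

One caveat worth keeping in mind for embeddings: vectors from different models are not comparable, so a failover policy should reroute to another host of the same model rather than to a different embedding model.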
Decision guide
When to Use — When NOT to Use
Use it if...
- You need a strong general-purpose embedding model that supports multiple languages efficiently.
- You need dense, sparse, and multi-vector retrieval covered by a single unified embedding model.
- Your use case involves semantic search or dense retrieval over large multilingual document corpora.
- Your use case involves building rerankers or hybrid search pipelines using high-quality embeddings.
- You need compact embeddings that balance retrieval quality and storage or memory constraints.
- Your use case involves retrieval-augmented generation requiring robust cross-lingual semantic similarity.
- You need an open-source embedding model that can run on-premise or offline.
Avoid if...
- You need generative capabilities like text completion or chat, not just embeddings.
- Your workload requires encoding inputs longer than roughly 8K tokens in one pass, without chunking.
- You need embeddings optimized specifically for images, audio, or multimodal content.
- Your workload requires fine-grained token-level reasoning instead of sentence or document embeddings.
- You need a fully managed, production-ready API service with enterprise SLAs from the provider.
- Your workload requires models trained on very domain-specific data like medical records.
- You need embeddings explicitly aligned for safety-sensitive applications with strong guardrails.
FAQ
Frequently Asked Questions
-
What is bge-m3?
bge-m3 is a BAAI text embedding model built for multilingual, multi-function (dense, sparse, and multi-vector), and multi-granularity (short sentences to long documents) retrieval.
-
What is bge-m3 best suited for?
bge-m3 is best suited for semantic search, dense retrieval, reranking, and building multilingual retrieval-augmented generation systems.
-
What context window does bge-m3 support?
bge-m3 accepts inputs of up to 8,192 tokens per text when generating embeddings.
-
How fast is bge-m3 in terms of latency?
bge-m3 is relatively lightweight and can generate embeddings with low latency on modern GPUs for typical retrieval workloads.
-
What modalities does bge-m3 support?
bge-m3 is a text-only embedding model and does not support image, audio, or video inputs.
-
How is bge-m3 priced when accessed through LLM.API?
LLM.API bills bge-m3 by usage, per input token, with no output-token charge for embeddings. Prices are set by the LLM.API platform, not BAAI.
-
How do I access bge-m3 via LLM.API?
You call the LLM.API embeddings endpoint specifying the provider as BAAI and the model name as bge-m3 in your request parameters.
-
How does bge-m3 compare to other embedding models?
bge-m3 offers strong multilingual retrieval quality and flexible embedding functions compared to many English-only or single-task embedding models.
-
What are the main limitations of bge-m3?
bge-m3 only produces embeddings; it cannot generate text or chat. Retrieval quality may degrade on documents longer than its roughly 8K-token input limit and on languages it does not support.
-
Can bge-m3 handle both query and document embeddings?
Yes. The same model encodes both queries and documents, and unlike some earlier BGE models, bge-m3 does not require an instruction prefix on queries.
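The access answer above says a request names BAAI as the provider and bge-m3 as the model. As a hedged sketch of what that could look like, the code below builds (but does not send) such a request with Python's standard library. The URL is a non-resolving placeholder, and the JSON field names `provider`, `model`, and `input` plus the bearer-token header are assumptions that this page does not confirm; check the LLM.API documentation for the real request shape.

```python
import json
import urllib.request

# Hypothetical placeholder; replace with the real endpoint from the LLM.API docs.
EMBEDDINGS_URL = "https://api.example.invalid/v1/embeddings"


def build_embedding_request(texts, api_key, url=EMBEDDINGS_URL):
    """Build a POST request for embeddings without sending it."""
    payload = {
        "provider": "BAAI",   # assumed field name
        "model": "bge-m3",    # assumed field name
        "input": list(texts),  # assumed field name
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


request = build_embedding_request(["What is hybrid retrieval?"], api_key="YOUR_API_KEY")
body = json.loads(request.data)
print(request.get_method(), request.full_url)
print(body["provider"], body["model"], len(body["input"]))
```

Sending the request would be a call to `urllib.request.urlopen(request)` once the real endpoint and key are in place.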
