Powered by BAAI

bge-m3

  • Text Embeddings

bge-m3 is a multilingual text embedding model from BAAI that produces dense, sparse, and ColBERT-style multi-vector embeddings in a single pass, enabling powerful hybrid retrieval. It is optimized for long-context, multi-language semantic search and retrieval applications.

What is bge-m3?

bge-m3 is a multi-functionality, multilingual, multi-granularity text embedding model developed by BAAI that outputs dense, sparse, and ColBERT-style embeddings simultaneously. It is primarily used for information retrieval, semantic search, and retrieval-augmented generation, where a single model can power dense, sparse (lexical), and hybrid search pipelines. It supports over 100 languages and long documents (up to 8,192 tokens) for use cases like cross-lingual search, question answering over large corpora, and document similarity. It belongs to BAAI's BGE (BAAI General Embedding) family, extending earlier BGE embedding models with unified training for dense, sparse, and multi-vector retrieval.

5 Core Capabilities

  • Dense Retrieval

    Generates high-quality dense text embeddings for semantic similarity search, ranking, and retrieval across many tasks and domains.

  • Sparse Retrieval

    Produces sparse lexical-token representations enabling BM25‑like keyword matching, hybrid search, and improved recall in information retrieval.

  • Multi-Vector Embeddings

    Outputs ColBERT-style multi-vector embeddings for fine-grained late interaction retrieval, improving accuracy on complex search queries.

  • Multilingual Support

    Supports over one hundred languages in a shared embedding space, enabling cross-lingual search, retrieval, and comparison of text.

  • Long-Context Encoding

    Encodes long texts, from short sentences to multi-thousand-token documents, into unified embeddings suitable for RAG pipelines.
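
The three retrieval modes above can be combined into one relevance score. The following is a minimal sketch, assuming the three kinds of embeddings have already been computed: dense similarity is a cosine, sparse similarity sums the weight products of tokens shared by query and document, and the multi-vector score uses ColBERT-style late interaction (best document token per query token, averaged). The function names, the toy vectors, and the 0.4/0.2/0.4 weights are illustrative assumptions, not values taken from BAAI or LLM API.

```python
from math import sqrt


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dense_score(q_vec, d_vec):
    """Cosine similarity between one query vector and one document vector."""
    return dot(normalize(q_vec), normalize(d_vec))


def sparse_score(q_weights, d_weights):
    """Lexical match: sum of weight products over tokens found in both texts."""
    return sum(w * d_weights[tok] for tok, w in q_weights.items() if tok in d_weights)


def multi_vector_score(q_vecs, d_vecs):
    """Late interaction: best-matching document token for each query token,
    averaged over the query tokens."""
    q_norm = [normalize(v) for v in q_vecs]
    d_norm = [normalize(v) for v in d_vecs]
    return sum(max(dot(q, d) for d in d_norm) for q in q_norm) / len(q_norm)


def hybrid_score(dense, sparse, multi, weights=(0.4, 0.2, 0.4)):
    """Weighted sum of the three scores; the weights are hypothetical."""
    w_dense, w_sparse, w_multi = weights
    return w_dense * dense + w_sparse * sparse + w_multi * multi


# Toy embeddings (2 dimensions instead of 1024, made-up token weights).
dense = dense_score([1.0, 0.0], [1.0, 0.0])
sparse = sparse_score({"bge": 0.5, "m3": 0.5}, {"bge": 0.4, "model": 0.3})
multi = multi_vector_score([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]])
score = hybrid_score(dense, sparse, multi)
print(f"dense={dense:.2f} sparse={sparse:.2f} multi={multi:.2f} hybrid={score:.2f}")
```

In practice the weights would be tuned on a validation set, and the cheaper dense or sparse score is often used to shortlist candidates before the multi-vector score is applied.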

6 Most Valuable Use Cases

  • Hybrid Text Retrieval
  • Multilingual Semantic Search
  • RAG Knowledge Bases
  • Cross-Lingual Document Matching
  • Domain Chatbots Retrieval
  • Long-Context Text Indexing

Cost Comparison

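Among the providers compared below, LLM API lists the lowest input price ($0.02 per 1M tokens), the lowest latency (80ms), and the highest throughput (120 kTkn/s) for bge-m3. Values marked "~" are approximate.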

Provider            | Region  | Latency | Throughput  | Uptime | Input ($/1M tokens) | Output ($/1M tokens) | Context
--------------------|---------|---------|-------------|--------|---------------------|----------------------|-----------
LLM API             | Global  | 80ms    | 120 kTkn/s  | 99.99% | $0.02               | $0.00                | 8K tokens
BAAI (Official API) | Global  | ~180ms  | ~40 kTkn/s  | ~99.5% | ~$0.10              | $0.00                | 8K tokens
Fireworks AI        | US East | ~150ms  | ~60 kTkn/s  | ~99.9% | ~$0.06              | $0.00                | ~16K tokens
Together AI         | US West | ~160ms  | ~55 kTkn/s  | ~99.9% | ~$0.07              | $0.00                | ~8K tokens
Replicate           | Global  | ~220ms  | ~30 kTkn/s  | ~99.0% | ~$0.12              | $0.00                | ~4K tokens
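
To turn the per-token prices above into a budget estimate, multiply the token volume by the input price per million tokens. The sketch below does that for each listed provider. The prices are copied from the comparison, including the approximate ones; the 100M-token monthly workload and the function names are hypothetical.

```python
# Input prices in USD per 1M tokens, as listed in the comparison above.
# Entries marked "~" in the comparison are approximate.
INPUT_PRICES = {
    "LLM API": 0.02,
    "BAAI (Official API)": 0.10,
    "Fireworks AI": 0.06,
    "Together AI": 0.07,
    "Replicate": 0.12,
}


def embedding_cost(tokens, price_per_million_usd):
    """Return the USD cost of embedding `tokens` input tokens."""
    return tokens / 1_000_000 * price_per_million_usd


def monthly_costs(tokens_per_month):
    """Cost per provider, rounded to cents."""
    return {
        provider: round(embedding_cost(tokens_per_month, price), 2)
        for provider, price in INPUT_PRICES.items()
    }


# Hypothetical workload: 100 million input tokens per month.
costs = monthly_costs(100_000_000)
for provider, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{provider}: ${cost:.2f}")
```

Output tokens are priced at $0.00 for every provider in the comparison, so only input tokens contribute to the estimate.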

Technical Specifications

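Metric              | bge-m3 (BAAI) | text-embedding-3-large (OpenAI) | e5-mistral-7b-instruct (intfloat / Hugging Face)
--------------------|---------------|---------------------------------|-------------------------------------------------
Dimensions          | 1024          | 3072                            | 4096
Max Input Tokens    | ~8K           | 8K                              | ~4K
Price per 1M Tokens | ~$0.05        | $0.13                           | ~$0.20
Throughput          | ~1,200 tps    | ~1,000 tps                      | ~600 tps
Avg Latency         | ~120ms        | ~150ms                          | ~220ms
Uptime              | ~99.5%        | ~99.9%                          | ~99.0%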
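
The Dimensions row translates directly into storage: each stored vector takes dimensions times bytes per value. The sketch below estimates raw index size for the three models, assuming 32-bit floats (4 bytes per value) and ignoring any overhead from the vector database or index structure. The 1M-vector corpus size is a hypothetical example.

```python
# Embedding dimensions as listed in the specification table above.
DIMENSIONS = {
    "bge-m3": 1024,
    "text-embedding-3-large": 3072,
    "e5-mistral-7b-instruct": 4096,
}


def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for dense vectors; assumes float32 unless told otherwise."""
    return num_vectors * dimensions * bytes_per_value


# Hypothetical corpus of one million embedded chunks.
NUM_VECTORS = 1_000_000
for model, dims in DIMENSIONS.items():
    gib = index_size_bytes(NUM_VECTORS, dims) / 2**30
    print(f"{model}: {gib:.2f} GiB per {NUM_VECTORS:,} vectors")
```

Under these assumptions a bge-m3 dense index needs one third of the space of a 3072-dimensional index and one quarter of a 4096-dimensional one. Storing the sparse or multi-vector outputs as well would add to this figure.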

30-day usage via LLM API

  • 3.4B prompt tokens processed
  • 27M embedding vectors generated
  • 740K API requests served
  • 99.9% average uptime

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

  • Intelligent AI Routing

    Automatically route each request to the optimal model or provider based on latency, cost, and quality—without changing your integration or client code.

    One endpoint, any model
  • Cost-Aware Controls

    Enforce per-key, per-project, and per-model budgets while auto-selecting cheaper equivalents so you never blow past spend limits in production.

    Predictable spend at scale
  • Resilient Fallback Logic

    Define multi-provider failover policies so traffic transparently reroutes on timeouts, errors, or quota issues—no manual retries or brittle custom logic.

    Stay online, even upstream
  • End-to-End Observability

    Get centralized logs, traces, and metrics across every provider, model, and project to debug prompts, track latency, and optimize performance in one place.

    See every token flow
  • Task-Level Abstractions

    Call high-level tasks like chat, tools, RAG, and scoring instead of provider-specific APIs, so you can swap models without rewriting business logic.

    Code to tasks, not vendors
  • High-Throughput Batch

    Run large-scale batch inference jobs across providers with automatic chunking, retries, and concurrency control to maximize throughput and minimize unit cost.

    Bulk inference made easy
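
To make the fallback idea above concrete, here is a purely illustrative client-side sketch of ordered failover: each provider is tried in turn and the first successful response wins. It is not LLM.API's implementation. The `ProviderError` class, the `embed_with_fallback` function, and the two stub providers are all hypothetical names invented for this example.

```python
class ProviderError(Exception):
    """Raised by a provider call on a timeout, an error, or a quota issue."""


def embed_with_fallback(texts, providers):
    """Try each (name, call) pair in order and return the first success.

    `providers` is an ordered list of (name, callable) pairs; each callable
    takes a list of texts and returns one vector per text.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(texts)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real upstream calls.
def failing_primary(texts):
    raise ProviderError("timeout")


def working_backup(texts):
    return [[0.0, 0.0, 0.0] for _ in texts]


used, vectors = embed_with_fallback(
    ["hello", "world"],
    [("primary", failing_primary), ("backup", working_backup)],
)
print(f"served by {used}, {len(vectors)} vectors")
```

A routing layer that handles this on the server side removes the need for this kind of loop in client code, which is the point the feature description makes.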

When to Use — When NOT to Use

Use it if...

  • You need a strong general-purpose embedding model that supports multiple languages efficiently.
  • You need dense, sparse, and multi-vector retrieval covered by a single unified embedding model.
  • Your use case involves semantic search or dense retrieval over large multilingual document corpora.
  • Your use case involves building rerankers or hybrid search pipelines using high-quality embeddings.
  • You need compact embeddings that balance retrieval quality and storage or memory constraints.
  • Your use case involves retrieval-augmented generation requiring robust cross-lingual semantic similarity.
  • You need an open-source embedding model that can run on-premise or offline.

Avoid if...

  • You need generative capabilities like text completion or chat, not just embeddings.
  • Your workload requires encoding inputs longer than the model's roughly 8K-token limit without splitting them into chunks.
  • You need embeddings optimized specifically for images, audio, or multimodal content.
  • Your workload requires fine-grained token-level reasoning instead of sentence or document embeddings.
  • You need a fully managed, production-ready API service with enterprise SLAs from the provider.
  • Your workload requires models trained on very domain-specific data like medical records.
  • You need embeddings explicitly aligned for safety-sensitive applications with strong guardrails.

Frequently Asked Questions

  • What is bge-m3?

    bge-m3 is a multilingual text embedding model from BAAI that produces dense, sparse, and multi-vector embeddings for inputs ranging from short sentences to long documents.

  • What is bge-m3 best suited for?

    bge-m3 is best suited for semantic search, dense retrieval, reranking, and building multilingual retrieval-augmented generation systems.

  • What context window does bge-m3 support?

    bge-m3 accepts inputs of up to 8,192 tokens (about 8K) per text when generating embeddings.

  • How fast is bge-m3 in terms of latency?

    bge-m3 is relatively lightweight and can generate embeddings with low latency on modern GPUs for typical retrieval workloads.

  • What modalities does bge-m3 support?

    bge-m3 is a text-only embedding model and does not support image, audio, or video inputs.

  • How is bge-m3 priced when accessed through LLM.API?

    LLM.API usage-based pricing for bge-m3 is per-token for input text embeddings and is configured by the LLM.API platform, not BAAI.

  • How do I access bge-m3 via LLM.API?

    You call the LLM.API embeddings endpoint specifying the provider as BAAI and the model name as bge-m3 in your request parameters.

  • How does bge-m3 compare to other embedding models?

    bge-m3 offers strong multilingual retrieval quality and flexible embedding functions compared to many English-only or single-task embedding models.

  • What are the main limitations of bge-m3?

    bge-m3 cannot generate or chat, only embeds text, and its performance may degrade on extremely long documents or unsupported languages.

  • Can bge-m3 handle both query and document embeddings?

    Yes. bge-m3 embeds both queries and documents with the same model and, unlike earlier BGE models, does not require an instruction prefix on queries.
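
The FAQ answer on accessing bge-m3 via LLM.API describes a request that names the provider (BAAI) and the model (bge-m3). The sketch below shows what building such a request could look like with Python's standard library. The endpoint URL, the JSON field names (`provider`, `model`, `input`), and the bearer-token header are assumptions made for illustration; check the LLM.API documentation for the actual request schema.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the URL from the LLM.API documentation.
API_URL = "https://api.example.com/v1/embeddings"


def build_embedding_request(texts, api_key, provider="BAAI", model="bge-m3"):
    """Build (but do not send) a POST request for text embeddings.

    The field names in the JSON body are assumed, not confirmed.
    """
    body = json.dumps({"provider": provider, "model": model, "input": texts})
    return urllib.request.Request(
        API_URL,
        data=body.encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_embedding_request(["What is bge-m3?"], api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# Sending the request needs a real endpoint and key:
# with urllib.request.urlopen(request) as response:
#     result = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect and test before any tokens are billed.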
