Powered by Sentence Transformers

multi-qa-mpnet-base-dot-v1

  • Text Embedding

multi-qa-mpnet-base-dot-v1 is a Sentence Transformers model that encodes sentences and paragraphs into 768-dimensional embeddings optimized for semantic search using dot-product similarity. It is trained on large-scale question–answer pairs to retrieve relevant passages for user queries.


What is multi-qa-mpnet-base-dot-v1?

multi-qa-mpnet-base-dot-v1 is a sentence-transformers text-embedding model based on mpnet-base that maps natural-language inputs to 768-dimensional dense vectors. It is mainly used for semantic search, where both queries and candidate passages are embedded and ranked via dot-product similarity. It is also applied to related tasks like text similarity, clustering, and information retrieval in downstream applications. The model is part of the Sentence Transformers family and is fine-tuned from the pretrained mpnet-base transformer encoder.
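The retrieval flow described above (embed the query and each passage, then rank passages by dot product) can be sketched in plain Python. The short vectors are hypothetical stand-ins for the 768-dimensional embeddings the model returns. In practice they would come from the model itself, for example via `SentenceTransformer.encode` in the sentence-transformers library.

```python
import heapq
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def top_k_by_dot(query: Sequence[float],
                 passages: Sequence[Sequence[float]],
                 k: int = 3) -> list[tuple[int, float]]:
    """Return (passage_index, score) pairs with the k highest dot-product scores."""
    scored = ((i, dot(query, p)) for i, p in enumerate(passages))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical 4-dimensional toy embeddings; the real model outputs 768 dimensions.
query_vec = [0.9, 0.1, 0.0, 0.3]
passage_vecs = [
    [0.8, 0.2, 0.1, 0.4],  # passage 0: closely related to the query
    [0.0, 0.9, 0.7, 0.0],  # passage 1: unrelated
    [0.5, 0.0, 0.0, 0.2],  # passage 2: partially related
]

print(top_k_by_dot(query_vec, passage_vecs, k=2))
```

`heapq.nlargest` avoids sorting the whole candidate list. At production scale, a vector database or approximate nearest-neighbour index would usually replace this loop.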

5 Core Capabilities

  • Semantic Text Search

    Enables high-quality semantic search by encoding queries and documents into a shared vector space for similarity-based retrieval.

  • Question Answer Retrieval

    Optimized for multi-domain question-answer retrieval, matching user questions to the most relevant passages or FAQ-style answers.

  • Sentence Embedding Generation

    Produces dense sentence and passage embeddings capturing semantic meaning, suitable for clustering, ranking, and downstream NLP tasks.

  • Question-to-Question Matching

    Matches paraphrased or duplicate questions by mapping them into a shared embedding space. Matching works best in English; the model is not trained for multilingual or cross-lingual retrieval.

  • Text Similarity Scoring

    Scores relevance with the dot product between query and document embeddings, the similarity function the model was trained for, in information retrieval and recommendation pipelines.
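Because the model is tuned for dot-product scoring, swapping in cosine similarity is not a neutral change: cosine divides out vector length, so it can reorder candidates. The toy vectors below are hypothetical and chosen only to make the two rankings differ.

```python
import math
from typing import Callable, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two vectors after scaling each to unit length."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot(a, b) / (norm_a * norm_b)


def best_match(query: Sequence[float],
               docs: dict[str, Sequence[float]],
               score: Callable[[Sequence[float], Sequence[float]], float]) -> str:
    """Name of the document with the highest score under the given function."""
    return max(docs, key=lambda name: score(query, docs[name]))


query = [1.0, 0.0]
docs = {
    "short_exact": [0.6, 0.0],   # same direction as the query, small magnitude
    "long_related": [2.0, 1.0],  # slightly different direction, large magnitude
}

print("dot:", best_match(query, docs, dot))        # long_related (2.0 vs 0.6)
print("cosine:", best_match(query, docs, cosine))  # short_exact (1.0 vs ~0.894)
```

Neither ranking is wrong in general; the point is to score with the function the model was trained for, which for this `-dot-` variant is the dot product.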

6 Most Valuable Use Cases

  • Semantic Document Search
  • Question Answer Retrieval
  • FAQ Matching System
  • Duplicate Issue Detection
  • Recommendation via Similarity
  • Embedding-Based Retrieval
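For duplicate issue detection, one common pattern is to embed every ticket and flag pairs whose dot-product score exceeds a threshold. The ticket vectors and the threshold below are hypothetical placeholders: raw dot-product scores from this model are not bounded to [0, 1], so a real cutoff has to be tuned on labeled pairs.

```python
from itertools import combinations
from typing import Sequence


def find_duplicate_pairs(embeddings: dict[str, Sequence[float]],
                         threshold: float) -> list[tuple[str, str, float]]:
    """Return (id_a, id_b, score) for every pair at or above threshold, best first."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        score = sum(x * y for x, y in zip(vec_a, vec_b))
        if score >= threshold:
            pairs.append((id_a, id_b, score))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical 3-d ticket embeddings and a placeholder threshold.
tickets = {
    "BUG-1": [0.9, 0.1, 0.2],
    "BUG-2": [0.8, 0.2, 0.3],
    "BUG-3": [0.0, 1.0, 0.0],
}
print(find_duplicate_pairs(tickets, threshold=0.5))
```

Comparing every pair is quadratic in the number of tickets; for large collections, querying a vector index for each new ticket's nearest neighbours scales better.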

Cost Comparison

At the listed input price of $0.02 per 1M tokens, LLM.API embeddings for multi-qa-mpnet-base-dot-v1 cost roughly 60% to 75% less than the other providers in the table below.

Provider | Region | Latency | Throughput | Uptime | Input ($/1M tokens) | Output ($/1M tokens) | Context
LLM.API | Global | ~120 ms | ~9,000 tokens/s | 99.99% | $0.02 | $0.00 | 512 tokens
Sentence Transformers (Hosted API) | Global | ~250 ms | ~3,000 tokens/s | ~99.9% | ~$0.05 | $0.00 | 512 tokens
Hugging Face Inference API | EU West | ~280 ms | ~2,500 tokens/s | ~99.9% | ~$0.06 | $0.00 | 512 tokens
Azure AI (Custom Container) | US East | ~220 ms | ~4,000 tokens/s | 99.9% | ~$0.07 | $0.00 | 512 tokens
Replicate | Global | ~300 ms | ~2,000 tokens/s | ~99.5% | ~$0.08 | $0.00 | 512 tokens
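The per-token prices in the table translate into workload costs with simple arithmetic. This sketch uses the listed input prices (treating the `~` approximations as exact) and a hypothetical monthly volume of 500M tokens. It also shows how a percent-savings figure is derived from two prices.

```python
from decimal import Decimal, ROUND_HALF_UP

# Input prices in USD per 1M tokens, taken from the comparison table
# (the "~" approximations are treated as exact here).
PRICE_PER_1M = {
    "LLM.API": Decimal("0.02"),
    "Sentence Transformers (Hosted API)": Decimal("0.05"),
    "Hugging Face Inference API": Decimal("0.06"),
    "Azure AI (Custom Container)": Decimal("0.07"),
    "Replicate": Decimal("0.08"),
}


def monthly_cost(tokens: int, price_per_1m: Decimal) -> Decimal:
    """Cost in USD of embedding `tokens` input tokens at a per-1M-token price."""
    return Decimal(tokens) / Decimal(1_000_000) * price_per_1m


def percent_savings(ours: Decimal, theirs: Decimal) -> Decimal:
    """How much cheaper `ours` is than `theirs`, as a percentage rounded to 0.1."""
    return ((theirs - ours) / theirs * 100).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


tokens = 500_000_000  # hypothetical monthly volume
baseline = PRICE_PER_1M["LLM.API"]
print(f"LLM.API: ${monthly_cost(tokens, baseline):.2f} per month")
for provider, price in PRICE_PER_1M.items():
    if provider == "LLM.API":
        continue
    print(f"{provider}: ${monthly_cost(tokens, price):.2f} per month; "
          f"LLM.API is {percent_savings(baseline, price)}% cheaper")
```

At 500M tokens a month, these prices give $10.00 on LLM.API versus $25.00 to $40.00 for the other listed providers, or savings of 60.0% to 75.0%.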

Technical Specifications

Metric | multi-qa-mpnet-base-dot-v1 | all-mpnet-base-v2 | msmarco-distilbert-base-v3
Model type | Text embedding (bi-encoder) | General-purpose text embedding | MS MARCO passage-ranking bi-encoder
Dimensions | 768 | 768 | 768
Max input tokens | ~512 | ~384 | ~512
Price per 1M tokens | ~$0.10 | ~$0.10 | ~$0.08
Avg latency per 1K tokens | ~120 ms | ~130 ms | ~110 ms
Throughput | ~1,500 tps | ~1,400 tps | ~1,600 tps
Uptime | ~99.9% | ~99.9% | ~99.9%
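The 768-dimension figure directly determines index size. A back-of-the-envelope sketch, assuming embeddings are stored uncompressed as 32-bit floats (4 bytes per value) and ignoring any index or metadata overhead:

```python
DIMENSIONS = 768        # output size of multi-qa-mpnet-base-dot-v1
BYTES_PER_FLOAT32 = 4   # assumption: uncompressed float32 storage


def raw_index_bytes(num_vectors: int,
                    dims: int = DIMENSIONS,
                    bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw storage for `num_vectors` embeddings, excluding index overhead."""
    return num_vectors * dims * bytes_per_value


per_vector = raw_index_bytes(1)            # 768 * 4 = 3,072 bytes
one_million = raw_index_bytes(1_000_000)   # 3,072,000,000 bytes
print(f"{per_vector} bytes per vector, {one_million / 1e9:.2f} GB per 1M vectors")
```

Storing values as 16-bit floats (`bytes_per_value=2`) would halve these figures, at some cost in precision.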

30-day usage via LLM.API

  • Embedding tokens processed: 620M
  • API requests: 7.8M
  • Active developer accounts: 41.5K
  • Average API uptime: 99.95%

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

  • Unified AI Routing

    Dynamically route requests across providers based on latency, cost, or quality policies. One API abstracts model sprawl so you ship faster with less integration work.

    One endpoint, every model.
  • Cost-Aware Orchestration

    Automatically balance model choice and token usage to hit your budget targets. Set cost policies once and let LLM.API optimize every call for price-performance.

    Lower spend, same output.
  • Resilient Fallbacks

    Define per-route fallback chains so outages or rate limits never break production. LLM.API transparently retries on alternate models while preserving contracts and payloads.

    Stay online, automatically.
  • End-to-End Observability

    Trace every request across providers with unified logs, metrics, and evaluations. Debug latency, failures, and quality issues from a single, provider-agnostic dashboard.

    See every token hop.
  • Task-Level Abstractions

    Declare tasks like chat, embeddings, tools, or rerank once—LLM.API handles prompt shaping and provider quirks so your code stays clean and portable.

    Code to tasks, not vendors.
  • High-Throughput Batch

    Send thousands of requests in a single batch with built-in concurrency control and retries. Maximize throughput while keeping provider limits and costs in check.

    Scale workloads, not code.
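The fallback-chain idea can be illustrated client-side. This is not LLM.API's implementation or API: the provider functions below are hypothetical placeholders. The sketch tries each provider in order, retries a fixed number of times, and moves on to the next provider only after those attempts fail.

```python
from typing import Callable, Sequence

# Hypothetical signature: a provider takes texts and returns one vector per text.
EmbedFn = Callable[[list[str]], list[list[float]]]


def embed_with_fallback(texts: list[str],
                        providers: Sequence[tuple[str, EmbedFn]],
                        attempts_per_provider: int = 2) -> tuple[str, list[list[float]]]:
    """Return (provider_name, embeddings) from the first provider that succeeds."""
    errors: list[str] = []
    for name, embed in providers:
        for attempt in range(1, attempts_per_provider + 1):
            try:
                return name, embed(texts)
            except Exception as exc:  # a real client would catch narrower errors
                errors.append(f"{name} attempt {attempt}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def flaky_primary(texts: list[str]) -> list[list[float]]:
    # Hypothetical provider that always fails, e.g. because it is rate limited.
    raise TimeoutError("rate limited")


def backup(texts: list[str]) -> list[list[float]]:
    # Hypothetical provider returning dummy 2-d vectors (text length, 0.0).
    return [[float(len(t)), 0.0] for t in texts]


name, vectors = embed_with_fallback(["hello", "hi"],
                                    [("primary", flaky_primary), ("backup", backup)])
print(name, vectors)
```

Catching the broad `Exception` keeps the sketch short. A production client would retry only transient failures such as timeouts or rate limits and would add backoff between attempts.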

When to Use — When NOT to Use

Use it if...

  • You need high-quality semantic search over short to medium-length English text passages.
  • You need to power FAQ-style question–answer retrieval using dense vector similarity search.
  • Your use case involves clustering or deduplicating similar sentences, queries, or support tickets.
  • You need 768-dimensional question and passage embeddings that drop into any vector database supporting inner-product (dot-product) search.
  • Your use case involves reranking BM25 search results with lightweight semantic similarity scoring.
  • You need an open-source, locally deployable embedding model without external API dependencies.
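Reranking BM25 output usually means a retrieve-then-rerank pipeline: a lexical engine returns a small candidate set, and the dense scores reorder it. In this sketch, the BM25 candidate list and all embeddings are hypothetical inputs; only the reranking step is shown.

```python
from typing import Optional, Sequence


def rerank(query_vec: Sequence[float],
           bm25_ids: Sequence[str],
           doc_vecs: dict[str, Sequence[float]],
           top_n: Optional[int] = None) -> list[tuple[str, float]]:
    """Reorder BM25 candidate ids by dot product with the query embedding."""
    scored = [
        (doc_id, sum(q * d for q, d in zip(query_vec, doc_vecs[doc_id])))
        for doc_id in bm25_ids
    ]
    # list.sort is stable (also with reverse=True), so ties keep their BM25 order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored if top_n is None else scored[:top_n]


# Hypothetical BM25 top-3 (best lexical match first) and toy 2-d embeddings.
bm25_top = ["doc_a", "doc_b", "doc_c"]
doc_vecs = {
    "doc_a": [0.1, 0.9],
    "doc_b": [0.8, 0.3],
    "doc_c": [0.6, 0.1],
}
query_vec = [1.0, 0.2]

print(rerank(query_vec, bm25_top, doc_vecs))
```

Only the candidates BM25 returns are embedded and scored, which keeps the dense step cheap. The tradeoff is that a relevant document BM25 missed cannot be recovered at this stage.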

Avoid if...

  • You need to process very long documents end-to-end, beyond a few paragraphs per embedding.
  • Your workload requires generating text, summarization, translation, or other generative capabilities.
  • You need domain-specific embeddings for code, biology, or finance with strong specialized performance.
  • Your workload requires state-of-the-art cross-lingual performance across many low-resource languages.
  • You need real-time personalization or training on-the-fly, beyond fixed pretrained embeddings.
  • Your workload requires detailed reasoning, logical inference, or multi-step planning over retrieved content.

Frequently Asked Questions

  • What is multi-qa-mpnet-base-dot-v1?

    multi-qa-mpnet-base-dot-v1 is a Sentence Transformers model for generating dense embeddings tailored to multi-domain question answering and semantic search.

  • What is multi-qa-mpnet-base-dot-v1 best used for?

    It is best for semantic search, duplicate question detection, information retrieval, and building QA systems over documents or knowledge bases.

  • How much does it cost to use multi-qa-mpnet-base-dot-v1 via LLM.API?

    Pricing is usage-based and billed per 1M input tokens (embeddings produce no output tokens); see the multi-qa-mpnet-base-dot-v1 pricing section in your LLM.API dashboard for current rates.

  • What context window or maximum input length does multi-qa-mpnet-base-dot-v1 support?

    multi-qa-mpnet-base-dot-v1 truncates input at 512 word pieces and was trained on inputs of up to about 250 word pieces, so it works best on passages of a few hundred tokens; longer texts should be chunked before embedding.

  • How fast is multi-qa-mpnet-base-dot-v1 on LLM.API?

    On LLM.API, a request typically returns embeddings in anywhere from tens of milliseconds to a few hundred milliseconds, depending on batch size and load.

  • What modalities does multi-qa-mpnet-base-dot-v1 support?

    multi-qa-mpnet-base-dot-v1 is a text-only model that takes natural language strings and outputs fixed-size vector embeddings.

  • How do I call multi-qa-mpnet-base-dot-v1 through LLM.API?

    Use the LLM.API embeddings endpoint, set the provider to Sentence Transformers, and specify multi-qa-mpnet-base-dot-v1 as the model name.

  • How does multi-qa-mpnet-base-dot-v1 compare to other Sentence Transformers QA models?

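    It generally delivers stronger retrieval quality than the smaller multi-qa-MiniLM-L6-cos-v1 and multi-qa-distilbert-cos-v1 models, at the cost of slower encoding and, compared with MiniLM's 384 dimensions, larger 768-dimensional embeddings.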

  • Can I use multi-qa-mpnet-base-dot-v1 for multilingual queries?

    It works best for English; performance on other languages may be weaker and is not specifically optimized for broad multilingual coverage.

  • What are the main limitations of multi-qa-mpnet-base-dot-v1?

    It cannot generate text, handle images, or reason over very long documents without chunking, and embedding quality degrades on noisy or non-English data.
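Chunking long text before embedding can be as simple as a sliding window with overlap. This sketch splits on whitespace as a rough proxy for tokens. The model actually counts subword tokens, so the default 200-word window and 50-word overlap are hypothetical values meant to stay comfortably under its limit. A real pipeline would measure chunk length with the model's tokenizer.

```python
def chunk_words(text: str, window: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most `window` whitespace-separated words."""
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    words = text.split()
    step = window - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, window=4, overlap=1))
```

Each chunk is then embedded separately, and a document can be scored by, for example, its best-matching chunk. The overlap reduces the chance that an answer is split across a chunk boundary.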
