Powered by Sentence Transformers
all-mpnet-base-v2
- Text Embeddings
all-mpnet-base-v2 is a widely used English sentence-embedding model from Sentence Transformers that maps text to 768-dimensional vectors for semantic similarity tasks. It is built on Microsoft’s MPNet architecture and fine-tuned on over a billion sentence pairs for strong general-purpose performance.
About the model
What is all-mpnet-base-v2?
all-mpnet-base-v2 is an English sentence-transformer model that encodes sentences and short paragraphs into 768-dimensional dense vector embeddings. It is mainly used for semantic search and retrieval in applications such as RAG pipelines, documentation search, and information retrieval systems, and is also commonly applied to clustering, deduplication, and semantic similarity scoring across large text collections. The model is part of the Sentence Transformers family and was fine-tuned from the microsoft/mpnet-base checkpoint with a contrastive objective on large-scale sentence-pair data.
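As a rough illustration of what "comparing embeddings" means in practice, the sketch below assumes the `sentence-transformers` Python package and loads the model by its Hugging Face ID `sentence-transformers/all-mpnet-base-v2`. The helper names `embed` and `cosine_similarity` are invented for this example; the similarity function is plain Python and works on any pair of equal-length vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for this sketch: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def embed(texts):
    """Encode texts with all-mpnet-base-v2 (hypothetical helper).

    Assumes `pip install sentence-transformers` and that the model weights
    can be downloaded. The import is local so the rest of this file runs
    even when the package is not installed.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
    return model.encode(texts)  # NumPy array, one 768-dimensional row per text
```

With the package installed, `vectors = embed(["How do I reset my password?", "I forgot my login details."])` followed by `cosine_similarity(vectors[0].tolist(), vectors[1].tolist())` returns a score between -1 and 1, where higher means more similar in meaning.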
Model capabilities
5 Core Capabilities
- Sentence Embeddings: Generates dense vector embeddings for sentences and short texts, capturing semantic meaning for downstream similarity and retrieval tasks.
- Semantic Search: Enables semantic search by embedding queries and documents into a shared space, supporting meaning-based retrieval beyond exact keyword matching.
- Text Clustering: Supports clustering of documents or sentences by embedding them into vectors, enabling grouping of semantically similar texts at scale.
- Text Classification: Provides embeddings usable as features for training classifiers on downstream text classification tasks.
- Duplicate Detection: Identifies near-duplicate or paraphrased sentences by comparing embedding similarity, useful for deduplication and plagiarism-style detection.
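Semantic search and duplicate detection both reduce to ranking or thresholding cosine similarity between embeddings. The following is a minimal sketch under stated assumptions: the function names `top_k` and `near_duplicates` are hypothetical, the 0.9 threshold is an illustrative guess rather than a recommended setting, and the toy 3-dimensional vectors stand in for real 768-dimensional embeddings.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k=3):
    """Return (doc_index, score) pairs for the k most similar documents."""
    q = normalize(query_vec)
    scored = [(i, dot(q, normalize(d))) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def near_duplicates(doc_vecs, threshold=0.9):
    """Return index pairs (i, j), i < j, whose cosine similarity meets the threshold."""
    unit = [normalize(d) for d in doc_vecs]
    pairs = []
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            if dot(unit[i], unit[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy vectors standing in for real embeddings.
docs = [
    [0.9, 0.1, 0.0],    # doc 0
    [0.88, 0.12, 0.0],  # doc 1, nearly the same direction as doc 0
    [0.0, 0.2, 0.95],   # doc 2, unrelated
]
query = [1.0, 0.0, 0.0]

print(top_k(query, docs, k=2))  # docs 0 and 1 rank first
print(near_duplicates(docs))    # only the pair (0, 1) passes the threshold
```

The pairwise loop in `near_duplicates` is quadratic in the number of documents, so for large collections an approximate nearest-neighbor index would likely be a better fit than this brute-force version.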
Use cases
6 Most Valuable Use Cases
- Semantic Text Search
- Duplicate Question Detection
- Legal Case Similarity Search
- Case Law Monitoring
- Product Recommendation Matching
- Embedding-Based NLP
Transparent pricing
Cost Comparison
Based on the input prices listed below, LLM.API embeddings are roughly 50% to 70% cheaper than the other all-mpnet-base-v2 offerings shown.
| Provider | Region | Latency | Throughput | Uptime | Input ($/1M) | Output ($/1M) | Context |
|---|---|---|---|---|---|---|---|
| LLM.API (best) | Global | 80ms | 120k tokens/s | 99.99% | $0.02 | $0.00 | 8192 tokens |
| Sentence Transformers (Hosted) | Global | ~220ms | ~300 tps | ~99.5% | ~$0.05 | $0.00 | ~4096 tokens |
| Hugging Face Inference API | EU West | ~250ms | ~250 tps | 99.9% | ~$0.06 | $0.00 | ~4096 tokens |
| Azure AI (MPNet-like Embeddings) | US East | ~200ms | ~400 tps | 99.9% | ~$0.04 | $0.00 | 4096 tokens |
| Replicate | US West | ~260ms | ~200 tps | ~99.0% | ~$0.07 | $0.00 | ~4096 tokens |
Performance benchmarks
Technical Specifications
| Metric | all-mpnet-base-v2 (Sentence Transformers) | bert-base-nli-mean-tokens (Sentence Transformers) | paraphrase-MiniLM-L6-v2 (Sentence Transformers) |
|---|---|---|---|
| Dimensions | 768 | 768 | 384 |
| Max Input Tokens | 384 tokens | ~128 tokens | ~256 tokens |
| Price per 1M Tokens | ~$0.10 (self-hosted infra only) | ~$0.09 (self-hosted infra only) | ~$0.07 (self-hosted infra only) |
| Avg Latency (per 128‑token input on GPU) | ~6ms | ~8ms | ~4ms |
| Throughput (embeddings/s on single GPU) | ~4,000/s | ~3,000/s | ~6,000/s |
| Model Size | ~420MB | ~420MB | ~90MB |
| Training Domain | General English, 1B+ sentence pairs from mixed sources | General English NLI | General English paraphrase mining |
| Uptime (self-hosted, well-managed) | ~99.5% | ~99.5% | ~99.5% |
30-day usage via LLM.API
- 3.8B text pairs embedded in the last 30 days
- 21M API requests served in the last 30 days
- 410K developers using this model monthly
- 99.9% average API uptime over the last 30 days
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
- Intelligent Model Routing (One endpoint, any model): Automatically route each request to the best model across providers based on task, latency, and reliability, with no client changes required as your stack evolves.
- Cost-Aware Orchestration (Cut cost, keep quality): Optimize for price and performance with per-request cost controls, dynamic model selection, and transparent usage insights that keep your AI bill predictable.
- Automatic Provider Fallback (Resiliency by default): Survive provider outages and rate limits with built-in failover logic that retries on alternate models, preserving SLAs without custom recovery code.
- Full-Stack Observability (See every token): Track latency, errors, tokens, and provider performance across every request with unified logs, traces, and metrics wired for your existing monitoring stack.
- Task-Level Abstractions (Code to tasks, not vendors): Call high-level tasks (chat, tools, RAG, vision) instead of provider-specific APIs, so you can swap models without rewriting business logic or prompt glue.
- High-Throughput Batch Jobs (Ship bulk, stay fast): Run massive batch workloads through a single endpoint with concurrency controls, retries, and progress tracking designed for production-scale pipelines.
Decision guide
When to Use — When NOT to Use
Use it if...
- You need robust general-purpose sentence embeddings for semantic similarity and clustering tasks.
- You need to power semantic search over short to medium-length English texts efficiently.
- Your use case involves intent classification or FAQ matching using dense vector similarity.
- You need a well-known, widely-benchmarked baseline model for sentence-level embedding experiments.
- Your use case involves building recommendation systems based on textual description similarity.
- You need to deduplicate or cluster large corpora of short documents by semantic closeness.
- Your use case involves zero-shot text matching by comparing query and label descriptions directly.
Avoid if...
- You need to process very long documents end-to-end, far beyond typical sentence lengths.
- Your workload requires state-of-the-art multilingual performance across many non-English languages.
- You need embeddings specifically optimized for code, images, audio, or multimodal inputs.
- Your workload requires continuously updated embeddings reflecting very recent domain-specific knowledge.
- You need task-specific fine-tuning with integrated training pipelines rather than an off-the-shelf encoder.
- Your workload requires strict on-device inference with extremely constrained memory and compute resources.
- You need strong domain adaptation out-of-the-box for highly specialized technical or legal text.
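When longer documents cannot be avoided, one common workaround is to split the text into overlapping chunks, embed each chunk separately, and either index the chunks individually or average them into one document vector. The sketch below is only an illustration of that idea: `chunk_words` and `mean_pool` are hypothetical helper names, whitespace-separated words are used as a crude proxy for model tokens, and the 200/50 window sizes are illustrative values, not tuned settings.

```python
def chunk_words(text, max_words=200, overlap=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def mean_pool(vectors):
    """Average equal-length vectors into one document-level vector."""
    if not vectors:
        raise ValueError("no vectors to pool")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
```

Averaging chunk vectors loses detail, so indexing chunks individually and returning the best-matching chunk is often the more useful choice for retrieval.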
FAQ
Frequently Asked Questions
- What is all-mpnet-base-v2?
  all-mpnet-base-v2 is a Sentence Transformers text-embedding model based on MPNet, optimized for high-quality general-purpose sentence and short-paragraph similarity.
- What is all-mpnet-base-v2 best used for?
  It is best for semantic search, clustering, deduplication, recommendation, and textual similarity tasks where short-to-medium English sentences or paragraphs are compared.
- What modalities does all-mpnet-base-v2 support?
  all-mpnet-base-v2 is text-only and generates fixed-size vector embeddings from input text; it does not process images, audio, or other modalities.
- What is the embedding dimensionality and context window of all-mpnet-base-v2?
  The model outputs 768-dimensional embeddings. It is intended for sentences and short paragraphs: by default, input longer than 384 word pieces is truncated.
- How fast is all-mpnet-base-v2 when called through LLM.API?
  Latency depends on input size and region, but LLM.API routes to optimized Sentence Transformers runtimes for low-latency, high-throughput embedding generation.
- How is pricing for all-mpnet-base-v2 handled on LLM.API?
  Usage is billed according to LLM.API’s standard embedding pricing for this provider, usually per-token or per-character, as shown in your LLM.API dashboard.
- How do I access all-mpnet-base-v2 via the LLM.API?
  Call the LLM.API embeddings endpoint with provider set to Sentence Transformers and model set to all-mpnet-base-v2, passing your texts in the request body.
- How does all-mpnet-base-v2 compare to larger Sentence Transformers models?
  As a base-size encoder (~420MB), it needs less compute and has lower latency than larger embedding models while still performing strongly on many tasks. Smaller models such as paraphrase-MiniLM-L6-v2 (~90MB) are faster still, at the cost of lower-dimensional embeddings.
- Does all-mpnet-base-v2 support multilingual text?
  It mainly targets English and may work on related languages, but performance is not guaranteed or optimized for fully multilingual use cases.
- What are the main limitations of all-mpnet-base-v2?
  It cannot generate or edit text, struggles with very long documents, and performance may degrade on domain-specific or non-English data without adaptation.
