all-mpnet-base-v2

Text Embeddings

all-mpnet-base-v2 is a widely used English sentence-embedding model from Sentence Transformers that maps text to 768-dimensional vectors for semantic similarity tasks. It is built on Microsoft’s MPNet architecture and fine-tuned on over a billion sentence pairs for strong general-purpose performance.

Start Using API

API Performance

Latency: ~50ms avg embedding time for 1K tokens on GPU
Context: 384 input tokens by default (longer inputs are truncated)
Input: Free per 1M tokens (open-source model)
Output: Free per 1M embedding vectors
Uptime: 99%

About the model

What is all-mpnet-base-v2?

all-mpnet-base-v2 is an English sentence-transformer model that encodes sentences and short paragraphs into 768-dimensional dense vector embeddings. It is mainly used for semantic search and retrieval in applications such as RAG pipelines, documentation search, and information retrieval systems, and it is also commonly applied to clustering, deduplication, and semantic similarity scoring across large text collections. The model is part of the Sentence Transformers family and is fine-tuned from the microsoft/mpnet-base checkpoint with a contrastive objective on over a billion sentence pairs.
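
Semantic similarity between two embeddings is usually measured with cosine similarity. The sketch below is a minimal, dependency-free illustration; the 4-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, not model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (assumed values) standing in for embeddings of three texts.
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.8, 0.2, 0.1, 0.3]
invoice = [0.0, 0.9, 0.7, 0.1]

# Related texts should score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```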

Input / Output

Input

Text sentences or short documents

Output

Sentence or document embeddings (numerical vectors)
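
For readers who want to see the input/output contract outside the hosted API, here is a hedged sketch using the open-source sentence-transformers package. It assumes the package is installed and that the weights can be downloaded; `embed_locally` and `describe` are illustrative helper names, not part of any library.

```python
def embed_locally(texts, model_name="sentence-transformers/all-mpnet-base-v2"):
    """Encode texts with the open-source model (needs `pip install sentence-transformers`)."""
    # Imported lazily so this file can be loaded without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # encode() returns one row per input text; normalize_embeddings=True scales
    # each row to unit length, so a dot product equals cosine similarity.
    return model.encode(list(texts), normalize_embeddings=True)

def describe(embeddings):
    """Return (number of vectors, dimensions per vector) for a 2-D sequence."""
    rows = list(embeddings)
    return len(rows), (len(rows[0]) if rows else 0)
```

Calling `describe(embed_locally(["How do I reset my password?"]))` should report one vector with 768 dimensions for this model.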

Model capabilities

5 Core Capabilities

Sentence Embeddings

Generates dense vector embeddings for sentences and short texts, capturing semantic meaning for downstream similarity and retrieval tasks.

Semantic Search

Enables semantic search by embedding queries and documents into a shared space, supporting meaning-based retrieval beyond exact keyword matching.

Text Clustering

Supports clustering of documents or sentences by embedding them into vectors, enabling grouping of semantically similar texts at scale.

Text Classification

Provides embeddings usable as features for training classifiers on downstream text classification tasks.

Duplicate Detection

Identifies near-duplicate or paraphrased sentences by comparing embedding similarity, useful for deduplication and plagiarism-style checks.
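
Two of the capabilities above, semantic search and duplicate detection, reduce to the same operation: comparing normalized vectors. The following is a minimal brute-force sketch under stated assumptions: the 3-dimensional vectors are invented for illustration, and the 0.95 duplicate threshold is an arbitrary starting point that would need tuning on real data.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k=2):
    """Return the k best (doc_index, cosine_score) pairs, best first."""
    q = normalize(query_vec)
    scored = [(i, dot(q, normalize(d))) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

def near_duplicates(doc_vecs, threshold=0.95):
    """Return index pairs whose cosine similarity meets the threshold."""
    unit = [normalize(d) for d in doc_vecs]
    return [
        (i, j)
        for i in range(len(unit))
        for j in range(i + 1, len(unit))
        if dot(unit[i], unit[j]) >= threshold
    ]

# Toy vectors (assumed values): documents 0 and 1 are near-duplicates.
docs = [[0.9, 0.1, 0.0], [0.88, 0.12, 0.01], [0.0, 0.2, 0.9]]
query = [1.0, 0.0, 0.1]

print([i for i, _ in top_k(query, docs, k=2)])
print(near_duplicates(docs))
```

The pairwise loop is quadratic in the number of documents; large collections would normally use an approximate nearest-neighbor index instead.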

Use cases

6 Most Valuable Use Cases

Semantic Text Search
Duplicate Question Detection
Legal Case Similarity Search
Case Law Monitoring
Product Recommendation Matching
Embedding-Based NLP

Transparent pricing

Cost Comparison

Based on the listed input prices, LLM API embeddings are roughly 50% to 71% cheaper than comparable all-mpnet-base-v2 offerings.

Provider	Region	Latency	Throughput	Uptime	Input ($/1M)	Output ($/1M)	Context
LLM API	Global	80ms	120k tokens/s	99.99%	$0.02	$0.00	8192 tokens
Sentence Transformers (Hosted)	Global	~220ms	~300 tps	~99.5%	~$0.05	$0.00	~4096 tokens
Hugging Face Inference API	EU West	~250ms	~250 tps	99.9%	~$0.06	$0.00	~4096 tokens
Azure AI (MPNet-like Embeddings)	US East	~200ms	~400 tps	99.9%	~$0.04	$0.00	4096 tokens
Replicate	US West	~260ms	~200 tps	~99.0%	~$0.07	$0.00	~4096 tokens
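
The relative savings can be derived directly from the input prices in the table. This short sketch does the arithmetic; the prices are copied from the rows above, and most of them are marked approximate there, so the percentages are approximate too.

```python
# Input prices in $ per 1M tokens, taken from the comparison table.
LLM_API_PRICE = 0.02
OTHER_PRICES = {
    "Sentence Transformers (Hosted)": 0.05,
    "Hugging Face Inference API": 0.06,
    "Azure AI (MPNet-like Embeddings)": 0.04,
    "Replicate": 0.07,
}

def savings_percent(ours, theirs):
    """Percentage by which `ours` is cheaper than `theirs`."""
    return (theirs - ours) / theirs * 100

savings = {name: savings_percent(LLM_API_PRICE, price)
           for name, price in OTHER_PRICES.items()}

for name, pct in savings.items():
    print(f"{name}: {pct:.1f}% cheaper")
```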

Performance benchmarks

Technical Specifications

Metric	all-mpnet-base-v2 (Sentence Transformers)	bert-base-nli-mean-tokens (Sentence Transformers)	paraphrase-MiniLM-L6-v2 (Sentence Transformers)
Dimensions	768	768	384
Max Input Tokens	384 tokens (default)	~128 tokens	~256 tokens
Price per 1M Tokens	~$0.10 (self-hosted infra only)	~$0.09 (self-hosted infra only)	~$0.07 (self-hosted infra only)
Avg Latency (per 128‑token input on GPU)	~6ms	~8ms	~4ms
Throughput (embeddings/s on single GPU)	~4,000/s	~3,000/s	~6,000/s
Model Size	~420MB	~420MB	~90MB
Training Domain	General English, 1B+ sentence pairs (contrastive)	General English NLI	General English paraphrase mining
Uptime (self-hosted, well-managed)	~99.5%	~99.5%	~99.5%
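
The Dimensions row also determines how much storage a vector index needs. As a rough sketch, assuming vectors are stored as uncompressed 32-bit floats (an assumption; quantized indexes are smaller):

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw size of an uncompressed vector index (float32 by default)."""
    return num_vectors * dims * bytes_per_value

one_million = 1_000_000
size_768 = index_size_bytes(one_million, 768)   # 768-dimensional models
size_384 = index_size_bytes(one_million, 384)   # 384-dimensional models

print(f"768 dims: {size_768 / 1e9:.2f} GB per 1M vectors")
print(f"384 dims: {size_384 / 1e9:.2f} GB per 1M vectors")
```

Halving the dimensionality halves the raw index size, which is part of the trade-off between the 768- and 384-dimensional models in the table.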

30-day usage via LLM API

3.8B: Text pairs embedded in last 30 days
21M: API requests served in last 30 days
410K: Developers using this model monthly
99.9%: Avg API uptime over last 30 days

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Intelligent Model Routing

Automatically route each request to the best model across providers based on task, latency, and reliability, with no client changes required as your stack evolves.
One endpoint, any model

Cost-Aware Orchestration

Optimize for price and performance with per-request cost controls, dynamic model selection, and transparent usage insights that keep your AI bill predictable.
Cut cost, keep quality

Automatic Provider Fallback

Survive provider outages and rate limits with built-in failover logic that retries on alternate models, preserving SLAs without custom recovery code.
Resiliency by default

Full-Stack Observability

Track latency, errors, tokens, and provider performance across every request with unified logs, traces, and metrics wired for your existing monitoring stack.
See every token

Task-Level Abstractions

Call high-level tasks (chat, tools, RAG, vision) instead of provider-specific APIs, so you can swap models without rewriting business logic or prompt glue.
Code to tasks, not vendors

High-Throughput Batch Jobs

Run massive batch workloads through a single endpoint with concurrency controls, retries, and progress tracking designed for production-scale pipelines.
Ship bulk, stay fast
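
To make the provider-fallback idea above concrete, here is a generic sketch of retry-then-failover logic. It is not a description of how LLM.API implements it; `ProviderError`, `embed_with_fallback`, and the two stub providers are hypothetical names used only for illustration.

```python
import time

class ProviderError(Exception):
    """Raised by a provider call that failed (outage, rate limit, ...)."""

def embed_with_fallback(texts, providers, retries_per_provider=2, backoff_seconds=0.0):
    """Try each (name, call) provider in order; return (name, vectors) on first success."""
    errors = []
    for name, call in providers:
        for attempt in range(retries_per_provider):
            try:
                return name, call(texts)
            except ProviderError as exc:
                errors.append(f"{name} attempt {attempt + 1}: {exc}")
                if backoff_seconds:
                    # Exponential backoff before the next attempt.
                    time.sleep(backoff_seconds * (2 ** attempt))
    raise ProviderError("all providers failed: " + "; ".join(errors))

# Stub providers standing in for real backends.
def flaky_provider(texts):
    raise ProviderError("rate limited")

def stable_provider(texts):
    return [[0.0, 1.0] for _ in texts]

name, vectors = embed_with_fallback(
    ["hello"], [("primary", flaky_provider), ("backup", stable_provider)]
)
print(name, vectors)
```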

Decision guide

When to Use — When NOT to Use

Use it if...

You need robust general-purpose sentence embeddings for semantic similarity and clustering tasks.
You need to power semantic search over short to medium-length English texts efficiently.
Your use case involves intent classification or FAQ matching using dense vector similarity.
You need a well-known, widely benchmarked baseline model for sentence-level embedding experiments.
Your use case involves building recommendation systems based on textual description similarity.
You need to deduplicate or cluster large corpora of short documents by semantic closeness.
Your use case involves zero-shot text matching by comparing query and label descriptions directly.
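
The zero-shot matching and intent-classification cases in the list above follow one pattern: embed the input, embed a short description of each label, and pick the closest label. The sketch below takes the encoder as a parameter; `toy_encode` is a deliberately crude word-count stand-in so the example runs without a model, and real use would pass a function that returns model embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(text, label_descriptions, encode):
    """Return (best_label, scores) by comparing text to each label description."""
    text_vec = encode(text)
    scores = {
        label: cosine(text_vec, encode(description))
        for label, description in label_descriptions.items()
    }
    return max(scores, key=scores.get), scores

# Toy encoder (assumption): counts occurrences of a tiny fixed vocabulary.
VOCAB = ["refund", "money", "password", "login", "reset"]

def toy_encode(text):
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

labels = {
    "billing": "refund money",
    "account_access": "password login reset",
}
best, scores = classify("please reset my password", labels, toy_encode)
print(best)
```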

Avoid if...

You need to process very long documents end-to-end, far beyond typical sentence lengths.
Your workload requires state-of-the-art multilingual performance across many non-English languages.
You need embeddings specifically optimized for code, images, audio, or multimodal inputs.
Your workload requires continuously updated embeddings reflecting very recent domain-specific knowledge.
You need task-specific fine-tuning with integrated training pipelines rather than an off-the-shelf encoder.
Your workload requires strict on-device inference with extremely constrained memory and compute resources.
You need strong domain adaptation out-of-the-box for highly specialized technical or legal text.
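
If long documents cannot be avoided, a common workaround is to split the text into overlapping windows, embed each window, and average the results. The sketch below shows only the windowing and pooling steps; it treats list items as tokens, and `max_len` must be set to the sequence limit of your deployment. Mean pooling is one simple choice among several and can blur documents that cover many topics.

```python
def chunk_tokens(tokens, max_len, overlap=0):
    """Split a token list into windows of at most max_len with the given overlap."""
    if max_len <= 0 or not 0 <= overlap < max_len:
        raise ValueError("need max_len > 0 and 0 <= overlap < max_len")
    step = max_len - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break  # the last window already reaches the end
    return chunks

def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors into one vector."""
    if not vectors:
        raise ValueError("need at least one vector")
    dims = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]

tokens = list(range(10))  # stand-in for a tokenized document
windows = chunk_tokens(tokens, max_len=4, overlap=1)
print(windows)
```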

FAQ

Frequently Asked Questions

What is all-mpnet-base-v2?

all-mpnet-base-v2 is a Sentence Transformers text-embedding model based on MPNet, optimized for high-quality general-purpose sentence and document similarity.

What is all-mpnet-base-v2 best used for?

It is best for semantic search, clustering, deduplication, recommendation, and textual similarity tasks where short-to-medium English sentences or paragraphs are compared.

What modalities does all-mpnet-base-v2 support?

all-mpnet-base-v2 is text-only and generates fixed-size vector embeddings from input text; it does not process images, audio, or other modalities.

What is the embedding dimensionality and context window of all-mpnet-base-v2?

The model outputs 768-dimensional embeddings. It is intended for short to moderate-length texts; in Sentence Transformers, input longer than 384 word pieces is truncated by default.

How fast is all-mpnet-base-v2 when called through LLM.API?

Latency depends on input size and region, but LLM.API routes to optimized Sentence Transformers runtimes for low-latency, high-throughput embedding generation.

How is pricing for all-mpnet-base-v2 handled on LLM.API?

Usage is billed according to LLM.API’s standard embedding pricing for this provider, usually per-token or per-character, as shown in your LLM.API dashboard.

How do I access all-mpnet-base-v2 via the LLM.API?

Call the LLM.API embeddings endpoint with provider set to Sentence Transformers and model set to all-mpnet-base-v2, passing your texts in the request body.

How does all-mpnet-base-v2 compare to larger Sentence Transformers models?

It is smaller and faster than larger Sentence Transformers models, offering strong performance on many tasks with less compute and lower latency.

Does all-mpnet-base-v2 support multilingual text?

It mainly targets English and may work on related languages, but performance is not guaranteed or optimized for fully multilingual use cases.

What are the main limitations of all-mpnet-base-v2?

It cannot generate or edit text, struggles with very long documents, and performance may degrade on domain-specific or non-English data without adaptation.
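
The access answer above can be sketched as a request builder. Everything specific here is an assumption for illustration: the URL is a placeholder, and the JSON field names (`provider`, `model`, `input`) and the bearer-token header are guesses at a typical embeddings API shape, so check the LLM.API documentation for the actual schema.

```python
import json

# Placeholder endpoint (hypothetical); replace with the documented URL.
API_URL = "https://api.example.com/v1/embeddings"

def build_embedding_request(texts, api_key):
    """Return (headers, json_body) for an embeddings call; does not send anything."""
    if not texts:
        raise ValueError("texts must not be empty")
    headers = {
        "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        "Content-Type": "application/json",
    }
    body = {
        "provider": "sentence-transformers",   # assumed identifier string
        "model": "all-mpnet-base-v2",
        "input": list(texts),                   # assumed field name
    }
    return headers, json.dumps(body)

headers, body = build_embedding_request(["How do I reset my password?"], "YOUR_API_KEY")
print(body)
```

The returned headers and body can be sent with any HTTP client as a POST request to the endpoint.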
