Embed V1 4B

Text Embeddings

Embed V1 4B is Perplexity’s 4-billion-parameter text embedding model optimized for high-quality, web‑scale dense retrieval, supporting long 32K-token inputs and efficient INT8/binary representations.


API Performance

Latency: ~0.4s avg embedding time per 1K tokens
Context: 32K token context
Input: Free per 1M tokens (MIT-licensed open-source model)
Output: $0.10 per 1M tokens via Perplexity API or proxies
Uptime: 99%
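As a rough planning aid, the latency and price figures above can be turned into a corpus-level estimate. This is a minimal sketch that assumes the ~0.4s per 1K tokens latency scales linearly, requests run one at a time, and the $0.10 per 1M tokens rate applies to every token embedded. Actual throughput and billing may differ.

```python
# Back-of-the-envelope estimate from the API Performance figures.
# Assumptions: linear latency scaling, sequential requests, and the
# $0.10 per 1M tokens rate applied to all embedded tokens.
LATENCY_S_PER_1K_TOKENS = 0.4
PRICE_USD_PER_1M_TOKENS = 0.10


def estimate(total_tokens: int) -> tuple[float, float]:
    """Return (sequential embedding time in seconds, cost in USD)."""
    seconds = total_tokens / 1_000 * LATENCY_S_PER_1K_TOKENS
    cost = total_tokens / 1_000_000 * PRICE_USD_PER_1M_TOKENS
    return seconds, cost


# Example: 50,000 documents averaging 2,000 tokens each (100M tokens).
seconds, cost = estimate(50_000 * 2_000)
print(f"~{seconds / 3600:.1f} h sequential, ~${cost:.2f}")  # ~11.1 h sequential, ~$10.00
```

Running batches concurrently shortens wall-clock time but does not change the token-based cost.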

About the model

What is Embed V1 4B?

Embed V1 4B (pplx-embed-v1-4B) is Perplexity's 4B-parameter text embedding model, designed for state-of-the-art, real-world web-scale retrieval. It is primarily used for dense text retrieval and semantic search over large corpora, which benefits applications such as RAG systems, question answering, and document ranking. The model also serves general-purpose feature extraction and sentence-similarity use cases, aided by long-context (32K) support and compact INT8/binary embeddings that reduce storage and retrieval costs. It belongs to the pplx-embed-v1 family of diffusion-pretrained dense embedding models, offered alongside a smaller 0.6B version and the related contextual variant pplx-embed-context-v1.
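To illustrate why INT8 and binary embeddings cut storage, here is a minimal sketch. It assumes symmetric per-vector INT8 scaling and sign-based binarization; the exact quantization scheme used by pplx-embed-v1 is not described on this page, and the 1,024-dimension size in the example is purely illustrative.

```python
# Sketch of INT8 and binary quantization for dense embeddings.
# Assumed schemes: symmetric per-vector INT8 scaling and sign binarization.

def quantize_int8(vec: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one scale factor per vector."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # avoid a zero scale for all-zero vectors
    return [round(x / scale) for x in vec], scale


def quantize_binary(vec: list[float]) -> bytes:
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    bits = bytearray((len(vec) + 7) // 8)
    for i, x in enumerate(vec):
        if x > 0:
            bits[i // 8] |= 1 << (7 - i % 8)
    return bytes(bits)


def hamming(a: bytes, b: bytes) -> int:
    """Count differing bits; a smaller distance means more similar binary embeddings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def storage_bytes(dim: int) -> dict[str, int]:
    """Bytes per vector for float32, INT8 (plus a float32 scale), and binary."""
    return {"float32": 4 * dim, "int8": dim + 4, "binary": (dim + 7) // 8}


print(storage_bytes(1024))  # {'float32': 4096, 'int8': 1028, 'binary': 128}
```

Binary vectors are compared with Hamming distance, which is why they are often used for a fast first-pass search followed by rescoring with higher-precision vectors.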

Input / Output

Input

Text (for embedding)

Output

Vector embeddings

Model capabilities

5 Core Capabilities

Text Embedding

Generates dense vector representations of text inputs, enabling efficient similarity search, retrieval, and downstream semantic applications.
Semantic Search

Supports semantic retrieval by embedding queries and documents into a shared vector space for relevance ranking beyond keyword matching.
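Ranking by relevance in a shared vector space usually means scoring each document by cosine similarity to the query embedding. The sketch below uses tiny hand-made vectors as stand-ins for real Embed V1 4B outputs; the document IDs and values are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def search(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Return the k documents most similar to the query, best first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional vectors standing in for real embeddings.
docs = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "return-label": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]  # e.g. the embedding of "how do I get my money back?"
print([doc_id for doc_id, _ in search(query, docs)])  # ['refund-policy', 'return-label']
```

At web scale the linear scan would be replaced by an approximate nearest-neighbor index, but the scoring idea is the same.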
Multilingual Support

Embeds text from multiple languages into a unified vector space, enabling cross-lingual search and comparison tasks.
Document Clustering

Facilitates grouping related documents or passages using vector similarity, aiding topic discovery and organization of large text corpora.
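A simple way to group related texts by vector similarity is greedy "leader" clustering: each item joins the first cluster whose leader is similar enough, otherwise it starts a new cluster. This technique is chosen here for brevity, not because the page prescribes it; k-means or HDBSCAN are common alternatives. The 0.9 threshold and toy vectors are assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def leader_cluster(vecs: dict[str, list[float]], threshold: float = 0.9) -> list[list[str]]:
    """Greedy single-pass clustering; results depend on input order."""
    clusters: list[tuple[list[float], list[str]]] = []  # (leader vector, member ids)
    for item_id, vec in vecs.items():
        for leader, members in clusters:
            if cosine(vec, leader) >= threshold:
                members.append(item_id)
                break
        else:  # no existing cluster was similar enough
            clusters.append((vec, [item_id]))
    return [members for _, members in clusters]


# Toy 2-dimensional vectors standing in for embedded titles.
titles = {"a": [0.9, 0.1], "b": [0.85, 0.15], "c": [0.1, 0.9], "d": [0.05, 0.95]}
print(leader_cluster(titles))  # [['a', 'b'], ['c', 'd']]
```

Raising the threshold toward 1.0 turns the same routine into a near-duplicate detector.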
Recommendation Engine

Enables content and item recommendations by comparing embedded user preferences with candidate items in high-dimensional vector space.
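One common pattern for embedding-based recommendations is to average the embeddings of items a user liked into a profile vector and rank candidates by similarity to it. This minimal sketch assumes that mean-of-likes profile; the item names and vectors are hypothetical.

```python
import math


def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length so dot products equal cosine similarity."""
    n = math.hypot(*v)
    return [x / n for x in v]


def user_profile(liked: list[list[float]]) -> list[float]:
    """Average the unit-length embeddings of items the user engaged with."""
    normed = [normalize(v) for v in liked]
    mean = [sum(col) / len(normed) for col in zip(*normed)]
    return normalize(mean)


def recommend(liked: list[list[float]], candidates: dict[str, list[float]], k: int = 1) -> list[str]:
    """Return the k candidate ids most similar to the user's profile vector."""
    profile = user_profile(liked)

    def score(item_id: str) -> float:
        return sum(a * b for a, b in zip(profile, normalize(candidates[item_id])))

    return sorted(candidates, key=score, reverse=True)[:k]


# Toy 3-dimensional vectors standing in for real embeddings.
liked = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
candidates = {"gpu-pricing": [0.8, 0.2, 0.1], "gardening-tips": [0.0, 0.1, 0.9]}
print(recommend(liked, candidates))  # ['gpu-pricing']
```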

Use cases

6 Most Valuable Use Cases

Web-Scale Retrieval
Dense Text Search
RAG Document Indexing
Multilingual Similarity Search
Tool and API Retrieval
Monitoring Knowledge Bases

Transparent pricing

Cost Comparison

Among the providers compared below, LLM API offers the lowest cost per token and the highest embedding throughput for Embed V1 4B-class models.

Provider	Region	Latency	Throughput	Uptime	Input ($/1M)	Output ($/1M)	Context
LLM API	Global	80ms	120K tokens/s	99.99%	$0.03	$0.00	200K tokens
Perplexity	Global	~140ms	~60K tokens/s	~99.9%	~$0.05	$0.00	~100K tokens
OpenAI	Global	~120ms	~80K tokens/s	99.9%	~$0.10	$0.00	128K tokens
Azure AI	US East	~150ms	~70K tokens/s	99.9%	~$0.11	$0.00	~100K tokens
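To see what the input prices in the table mean for a real workload, you can multiply them by your monthly token volume. This sketch copies the input prices from the table above (several are marked approximate) and uses an illustrative 500M tokens per month; output is priced at $0.00 for every provider, so it is ignored.

```python
# Input prices in USD per 1M tokens, copied from the comparison table ("~" values are approximate).
INPUT_PRICE_PER_1M = {
    "LLM API": 0.03,
    "Perplexity": 0.05,
    "OpenAI": 0.10,
    "Azure AI": 0.11,
}


def monthly_cost(tokens_per_month: int) -> dict[str, float]:
    """Estimated monthly input cost per provider, rounded to cents."""
    return {
        provider: round(tokens_per_month / 1_000_000 * price, 2)
        for provider, price in INPUT_PRICE_PER_1M.items()
    }


costs = monthly_cost(500_000_000)  # illustrative volume: 500M tokens per month
print(costs)  # {'LLM API': 15.0, 'Perplexity': 25.0, 'OpenAI': 50.0, 'Azure AI': 55.0}
```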

Performance benchmarks

Technical Specifications

Metric	Embed V1 4B (Perplexity)	text-embedding-3-large (OpenAI)	Voyage-large-2 (Voyage AI)
Dimensions	4096 (est.)	3072	3072 (est.)
Max Input Tokens	32K	8K (est.)	16K (est.)
Price per 1M Tokens	$0.10 (est.)	$0.13 (est.)	$0.12 (est.)
Avg Latency	~120ms (est.)	~180ms (est.)	~220ms (est.)
Throughput	800 tps (est.)	600 tps (est.)	500 tps (est.)
Uptime	99.9% (est.)	99.9% (est.)	99.9% (est.)

30-day usage via LLM API

620M: Embedding tokens processed (30 days)
7.8M: API requests served (30 days)
210K: Unique developer accounts (30 days)
99.95%: Avg API uptime (30 days)


Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Unified AI Routing

Automatically route each request to the best model across providers based on latency, price, or quality—without changing your integration or redeploying code.
One endpoint, any model
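The routing idea can be illustrated with a small policy that picks the healthy provider minimizing a chosen objective. This is a hypothetical sketch, not LLM.API's actual routing logic; the `Provider` class and `route` function are illustrative names, and the figures come from the cost-comparison table on this page.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    latency_ms: float
    price_per_1m: float
    healthy: bool = True


# Figures from the cost-comparison table; "~" values there are approximate.
PROVIDERS = [
    Provider("LLM API", 80, 0.03),
    Provider("Perplexity", 140, 0.05),
    Provider("OpenAI", 120, 0.10),
    Provider("Azure AI", 150, 0.11),
]

OBJECTIVES = {
    "latency": lambda p: p.latency_ms,
    "price": lambda p: p.price_per_1m,
}


def route(objective: str, providers: list[Provider]) -> Provider:
    """Return the healthy provider that minimizes the chosen objective."""
    healthy = [p for p in providers if p.healthy]
    if not healthy:
        raise RuntimeError("no healthy provider available")
    return min(healthy, key=OBJECTIVES[objective])


print(route("latency", PROVIDERS).name)  # LLM API
PROVIDERS[0].healthy = False  # simulate an outage on the fastest route
print(route("latency", PROVIDERS).name)  # OpenAI
print(route("price", PROVIDERS).name)  # Perplexity
```

Because the policy is data-driven, changing the objective changes the selected provider without touching calling code.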
Cost-Aware Orchestration

Control spend with per-route budgets, tiered model selection, and real-time cost tracking so you can ship advanced AI features without surprise bills.
Lower cost, same quality
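A per-route budget can be as simple as tracking spend and refusing requests that would exceed a cap. This is a hypothetical sketch of the concept, not LLM.API's budgeting API; `RouteBudget` and `BudgetExceeded` are illustrative names, and it uses floats for brevity where production code would likely count integer micro-dollars.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push a route over its spend limit."""


class RouteBudget:
    """Hypothetical per-route spend cap based on a per-1M-token price."""

    def __init__(self, limit_usd: float, price_per_1m: float):
        self.limit_usd = limit_usd
        self.price_per_1m = price_per_1m
        self.spent_usd = 0.0

    def charge(self, tokens: int) -> float:
        """Record the cost of a request, or raise if it would exceed the limit."""
        cost = tokens / 1_000_000 * self.price_per_1m
        if self.spent_usd + cost > self.limit_usd:
            raise BudgetExceeded(f"request would exceed ${self.limit_usd:.2f} budget")
        self.spent_usd += cost
        return cost


budget = RouteBudget(limit_usd=1.00, price_per_1m=0.10)
budget.charge(6_000_000)  # $0.60 spent so far
try:
    budget.charge(5_000_000)  # would bring the total to $1.10
except BudgetExceeded as err:
    print(err)  # request would exceed $1.00 budget
```

A rejected request could instead be downgraded to a cheaper model, which is the "tiered model selection" idea described above.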
Automatic Fallbacks

Define fallback chains so requests transparently fail over to alternative models or providers, preserving uptime and UX even during outages or rate-limit spikes.
No single point of failure
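A fallback chain tries providers in order and returns the first successful result. The sketch below shows the pattern with stand-in functions; `embed_with_fallback`, `ProviderError`, and the provider names are hypothetical and do not represent LLM.API's configuration format.

```python
from typing import Callable

EmbedFn = Callable[[list[str]], list[list[float]]]


class ProviderError(Exception):
    """A failure from one provider, such as an outage or a rate limit."""


def embed_with_fallback(texts: list[str], chain: list[tuple[str, EmbedFn]]) -> tuple[str, list[list[float]]]:
    """Try each (name, embed_fn) in order; return the first provider that succeeds."""
    errors = []
    for name, embed in chain:
        try:
            return name, embed(texts)
        except ProviderError as err:
            errors.append(f"{name}: {err}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


def rate_limited(texts: list[str]) -> list[list[float]]:
    raise ProviderError("429 rate limited")


def backup(texts: list[str]) -> list[list[float]]:
    return [[float(len(t))] for t in texts]  # stand-in for a real embedding call


used, vectors = embed_with_fallback(["hello"], [("primary", rate_limited), ("backup", backup)])
print(used, vectors)  # backup [[5.0]]
```

One caveat: vectors from different embedding models live in different spaces, so a fallback for embeddings should target the same model on another provider rather than a different model.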
End-to-End Observability

Get full visibility into every call with traces, metrics, and structured logs across all providers, making debugging and performance tuning straightforward.
See every token, everywhere
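Per-call observability usually starts with a structured record of provider, input size, latency, and outcome for every request. This sketch uses a decorator that appends one JSON-serializable record per call to an in-memory list; the field names and the `traced` helper are assumptions, not LLM.API's trace schema.

```python
import json
import time
from functools import wraps

TRACE_LOG: list[dict] = []  # in-memory sink; a real setup would ship records to a log pipeline


def traced(provider: str):
    """Decorator that records one structured trace per embedding call."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(texts: list[str]):
            start = time.perf_counter()
            status = "error"  # overwritten only if the call succeeds
            try:
                result = fn(texts)
                status = "ok"
                return result
            finally:
                record = {
                    "provider": provider,
                    "function": fn.__name__,
                    "inputs": len(texts),
                    "latency_ms": round((time.perf_counter() - start) * 1000, 3),
                    "status": status,
                }
                TRACE_LOG.append(record)
                print(json.dumps(record))
        return wrapper
    return decorator


@traced("perplexity")
def fake_embed(texts: list[str]) -> list[list[float]]:
    """Stand-in for a real embedding call."""
    return [[0.0] * 4 for _ in texts]


fake_embed(["first document", "second document"])
```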
Task-Level Abstractions

Describe intent—chat, extraction, classification, tools—while LLM.API picks and configures the right models, so your code stays clean and future-proof.
Code to tasks, not models
High-Throughput Batch

Submit large batches with built-in concurrency control, retries, and aggregation to process millions of tasks efficiently across providers with a single API.
Massive scale, simple API
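The batch pattern described above combines fixed-size batching, bounded concurrency, and retries with backoff. This is a minimal sketch using Python's `ThreadPoolExecutor`; `TransientError`, `embed_all`, and the batch and retry settings are hypothetical, and `embed_fn` stands in for whatever client call actually sends a batch.

```python
import time
from concurrent.futures import ThreadPoolExecutor


class TransientError(Exception):
    """Errors worth retrying, such as timeouts or rate limits (hypothetical)."""


def batched(items: list[str], size: int) -> list[list[str]]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def with_retries(embed_fn, batch: list[str], attempts: int = 3, backoff_s: float = 0.05):
    """Call embed_fn, retrying transient failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return embed_fn(batch)
        except TransientError:
            if attempt == attempts:
                raise
            time.sleep(backoff_s * 2 ** (attempt - 1))


def embed_all(texts: list[str], embed_fn, batch_size: int = 64, max_workers: int = 4) -> list[list[float]]:
    """Embed texts in concurrent batches; output order matches input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda b: with_retries(embed_fn, b), batched(texts, batch_size))
        return [vec for batch_vecs in results for vec in batch_vecs]
```

`ThreadPoolExecutor.map` yields results in submission order, so the flattened output lines up with the input texts even when batches finish out of order.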

Decision guide

When to Use — When NOT to Use

Use it if...

You need inexpensive, general-purpose text embeddings for semantic search across large corpora.
You need to build retrieval-augmented generation pipelines with a strong open-source embedding model.
Your use case involves clustering or deduplicating many short texts, titles, or snippets.
Your use case involves intent or topic matching between queries and knowledge base articles.
You need multilingual embeddings for cross-language search and similarity without heavy licensing constraints.
Your use case involves reranking search results using vector similarity from a compact model.

Avoid if...

You need a proprietary, fully managed embedding API with strict enterprise uptime SLAs.
Your workload requires state-of-the-art performance on niche domains like code or biology.
You need maximum-quality, very high-dimensional embeddings regardless of compute and memory cost.
Your workload requires on-device embeddings within extremely tight latency and memory budgets.
You need a model explicitly optimized and benchmarked for very long document embeddings.
You need unified vendor support, billing, and monitoring tightly integrated with a single cloud platform.

FAQ

Frequently Asked Questions

What is Embed V1 4B?

Embed V1 4B is a Perplexity embedding model accessible through LLM.API, designed to generate vector representations of text for search, retrieval, and similarity.
What is Embed V1 4B best suited for?

Embed V1 4B is best for semantic search, retrieval-augmented generation, clustering, deduplication, and recommendation systems where dense text embeddings are required.
How is Embed V1 4B priced when used via LLM.API?

Embed V1 4B pricing on LLM.API is usage-based per input token or character, with exact rates defined in your LLM.API pricing plan.
What context window does Embed V1 4B support?

Embed V1 4B accepts inputs of up to 32K tokens, which covers typical search and retrieval use cases; longer documents need to be split into chunks before embedding.
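Documents longer than the context window are commonly split into overlapping chunks that are embedded separately. This is a minimal sketch that approximates token counts from whitespace-separated words using an assumed 1.3 tokens per word; the model's real tokenizer will count differently, so leave headroom or measure with the actual tokenizer.

```python
def chunk_words(text: str, max_tokens: int = 32_000, overlap_words: int = 200,
                tokens_per_word: float = 1.3) -> list[str]:
    """Split text into overlapping word windows that should fit within max_tokens.

    Token counts are estimated from whitespace-separated words using
    tokens_per_word, a rough heuristic, not the model's tokenizer.
    """
    words = text.split()
    if not words:
        return []
    window = int(max_tokens / tokens_per_word)  # words per chunk
    if overlap_words >= window:
        raise ValueError("overlap_words must be smaller than the chunk window")
    step = window - overlap_words
    # Stop once the previous window already reaches the end of the text.
    last_start = max(len(words) - overlap_words, 1)
    return [" ".join(words[i:i + window]) for i in range(0, last_start, step)]
```

The overlap keeps sentences that straddle a chunk boundary retrievable from at least one chunk.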
How fast is Embed V1 4B in terms of latency?

Embed V1 4B is optimized for low-latency embedding generation, typically suitable for real-time or near real-time search and retrieval workloads.
What modalities does Embed V1 4B support?

Embed V1 4B is a text embedding model and supports only text inputs, not images, audio, or video.
How do I call Embed V1 4B through LLM.API?

You call Embed V1 4B via LLM.API by selecting the Perplexity provider and specifying the Embed V1 4B model name in your embedding requests.
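The shape of such a request might look like the sketch below. The endpoint URL, the `provider`/`model`/`input` payload fields, the Bearer-token header, and the `LLM_API_KEY` variable are all assumptions modeled on common OpenAI-style embedding APIs, not documented LLM.API details; check your LLM.API docs for the exact endpoint and model identifier. The code only builds the request with the standard library and does not send it.

```python
import json
import os
import urllib.request

# Hypothetical placeholders: not documented LLM.API values.
API_URL = "https://llm-api.example.com/v1/embeddings"
MODEL = "pplx-embed-v1-4B"


def build_request(texts: list[str], api_key: str) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for a batch of texts."""
    payload = {"provider": "perplexity", "model": MODEL, "input": texts}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        method="POST",
    )


req = build_request(["What is dense retrieval?"], os.environ.get("LLM_API_KEY", "sk-placeholder"))
# To send it against a real endpoint:
# with urllib.request.urlopen(req) as resp:
#     data = json.load(resp)
```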
How does Embed V1 4B compare to larger Perplexity or other providers' embedding models?

Embed V1 4B typically offers a balance of quality and cost, trading some accuracy compared to larger models for better speed and lower pricing.
Does Embed V1 4B support multilingual embeddings?

Embed V1 4B can handle multiple languages to some extent, but its strongest performance is usually in English-centric or high-resource language datasets.
What limitations should I be aware of when using Embed V1 4B?

Embed V1 4B may underperform on highly specialized domains, extremely long documents, or tasks requiring fine-grained reasoning beyond semantic similarity.

Start in 2 lines of code

