Powered by Perplexity

Embed V1 4B

  • Text Embeddings

Embed V1 4B is Perplexity’s 4-billion-parameter text embedding model optimized for high-quality, web‑scale dense retrieval, supporting long 32K-token inputs and efficient INT8/binary representations.

What is Embed V1 4B?

Embed V1 4B is a 4B-parameter Perplexity text embedding model (pplx-embed-v1-4B) designed for state-of-the-art, real-world web-scale retrieval tasks. It is primarily used for dense text retrieval and semantic search over large corpora, benefiting applications like RAG systems, question answering, and document ranking. The model also serves general-purpose feature extraction and sentence similarity use cases, aided by long-context (32K) support and compact INT8/binary embeddings that reduce storage and retrieval costs. It is part of the pplx-embed-v1 family of diffusion-pretrained dense embedding models, offered alongside a smaller 0.6B version and a related contextual variant, pplx-embed-context-v1.
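
To make the storage claim concrete, here is a minimal sketch of what INT8 and binary representations mean for vector size. The quantization functions are a generic illustration (symmetric per-vector scaling and sign binarization), not the scheme the model itself uses, and the dimension of 1024 is a hypothetical value chosen only for the arithmetic.

```python
import math

def quantize_int8(vec):
    """Scale a float vector into the int8 range [-127, 127] (symmetric, per vector)."""
    peak = max(abs(x) for x in vec) or 1.0  # avoid dividing by zero for an all-zero vector
    return [round(x / peak * 127) for x in vec]

def binarize(vec):
    """Keep only the sign of each component: 1 for positive, 0 otherwise."""
    return [1 if x > 0 else 0 for x in vec]

def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def bytes_per_vector(dim, kind):
    """Storage for one vector: float32 = 32 bits/dim, int8 = 8 bits/dim, binary = 1 bit/dim."""
    bits = {"float32": 32, "int8": 8, "binary": 1}[kind]
    return math.ceil(dim * bits / 8)

dim = 1024  # hypothetical dimension, for illustration only
for kind in ("float32", "int8", "binary"):
    print(kind, bytes_per_vector(dim, kind), "bytes per vector")
```

Binary vectors are usually compared with Hamming distance (the `hamming` helper above), which is cheap to compute, at the price of some retrieval accuracy relative to full-precision vectors.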

5 Core Capabilities

  • Text Embedding

    Generates dense vector representations of text inputs, enabling efficient similarity search, retrieval, and downstream semantic applications.

  • Semantic Search

    Supports semantic retrieval by embedding queries and documents into a shared vector space for relevance ranking beyond keyword matching.

  • Multilingual Support

    Embeds text from multiple languages into a unified vector space, enabling cross-lingual search and comparison tasks.

  • Document Clustering

    Facilitates grouping related documents or passages using vector similarity, aiding topic discovery and organization of large text corpora.

  • Recommendation Engine

    Enables content and item recommendations by comparing embedded user preferences with candidate items in high-dimensional vector space.
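
The semantic search capability above comes down to comparing a query vector against document vectors in the same space. A minimal sketch, assuming cosine similarity as the relevance score and using tiny hand-written vectors in place of real model output (the document ids and values are made up for illustration):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors standing in for real embeddings.
docs = {
    "pricing": [0.9, 0.1, 0.0],
    "uptime": [0.1, 0.9, 0.1],
    "routing": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]
print(rank(query, docs, top_k=2))
```

A linear scan like this is fine for small collections; at web scale the same scoring is normally delegated to an approximate nearest-neighbor index.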

6 Most Valuable Use Cases

  • Web-Scale Retrieval
  • Dense Text Search
  • RAG Document Indexing
  • Multilingual Similarity Search
  • Tool and API Retrieval
  • Monitoring Knowledge Bases

Cost Comparison

Among the providers listed below, LLM API has the lowest input price per 1M tokens and the highest embedding throughput.

Provider       | Region  | Latency | Throughput    | Uptime | Input ($/1M) | Output ($/1M) | Context
LLM API (BEST) | Global  | 80ms    | 120K tokens/s | 99.99% | $0.03        | $0.00         | 200K tokens
Perplexity     | Global  | ~140ms  | ~60K tokens/s | ~99.9% | ~$0.05       | $0.00         | ~100K tokens
OpenAI         | Global  | ~120ms  | ~80K tokens/s | 99.9%  | ~$0.10       | $0.00         | 128K tokens
Azure AI       | US East | ~150ms  | ~70K tokens/s | 99.9%  | ~$0.11       | $0.00         | ~100K tokens
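
To see what the per-token prices mean for an indexing job, the input rates from the table can be applied to a token count. This is a small worked calculation; the corpus size of 500M tokens is a hypothetical figure, and the approximate (~) rates are used at face value.

```python
# Input prices in USD per 1M tokens, copied from the table above (some are approximate).
INPUT_PRICE_PER_1M = {
    "LLM API": 0.03,
    "Perplexity": 0.05,
    "OpenAI": 0.10,
    "Azure AI": 0.11,
}

def embedding_cost(tokens, price_per_1m):
    """Cost in USD to embed `tokens` input tokens at a given price per 1M tokens."""
    return tokens / 1_000_000 * price_per_1m

corpus_tokens = 500_000_000  # hypothetical corpus size
for provider, price in INPUT_PRICE_PER_1M.items():
    print(f"{provider}: ${embedding_cost(corpus_tokens, price):.2f}")
```

Output tokens are priced at $0.00 in every row, so input volume is the only cost driver for embedding workloads in this comparison.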

Technical Specifications

Metric              | Embed V1 4B (Perplexity) | text-embedding-3-large (OpenAI) | Voyage-large-2 (Voyage AI)
Dimensions          | 4096 (est.)              | 3072                            | 3072 (est.)
Max Input Tokens    | 8K (est.)                | 8K (est.)                       | 16K (est.)
Price per 1M Tokens | $0.10 (est.)             | $0.13 (est.)                    | $0.12 (est.)
Avg Latency         | ~120ms (est.)            | ~180ms (est.)                   | ~220ms (est.)
Throughput          | 800 tps (est.)           | 600 tps (est.)                  | 500 tps (est.)
Uptime              | 99.9% (est.)             | 99.9% (est.)                    | 99.9% (est.)

30-day usage via LLM API

  • 620M embedding tokens processed (30 days)
  • 7.8M API requests served (30 days)
  • 210K unique developer accounts (30 days)
  • 99.95% average API uptime (30 days)

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

  • Unified AI Routing

    Automatically route each request to the best model across providers based on latency, price, or quality—without changing your integration or redeploying code.

    One endpoint, any model
  • Cost-Aware Orchestration

    Control spend with per-route budgets, tiered model selection, and real-time cost tracking so you can ship advanced AI features without surprise bills.

    Lower cost, same quality
  • Automatic Fallbacks

    Define fallback chains so requests transparently fail over to alternative models or providers, preserving uptime and UX even during outages or rate-limit spikes.

    No single point of failure
  • End-to-End Observability

    Get full visibility into every call with traces, metrics, and structured logs across all providers, making debugging and performance tuning straightforward.

    See every token, everywhere
  • Task-Level Abstractions

    Describe intent—chat, extraction, classification, tools—while LLM.API picks and configures the right models, so your code stays clean and future-proof.

    Code to tasks, not models
  • High-Throughput Batch

    Submit large batches with built-in concurrency control, retries, and aggregation to process millions of tasks efficiently across providers with a single API.

    Massive scale, simple API
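
The fallback-chain idea described above can be sketched on the client side as follows. This is only an illustration of the pattern, not how LLM.API implements it; the provider names and the two stub functions are hypothetical stand-ins for real embedding calls.

```python
def embed_with_fallback(text, providers):
    """Try each (name, embed_fn) pair in order and return the first successful result."""
    errors = []
    for name, embed_fn in providers:
        try:
            return name, embed_fn(text)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))

def flaky_primary(text):
    """Stub that simulates an outage or a rate-limit spike."""
    raise TimeoutError("rate limited")

def stub_secondary(text):
    """Stub that returns a fake 2-dimensional vector."""
    return [float(len(text)), 0.0]

chain = [("primary", flaky_primary), ("secondary", stub_secondary)]
name, vector = embed_with_fallback("hello", chain)
print(name, vector)
```

One caveat for embeddings specifically: vectors from different models live in different spaces, so a fallback should target the same model on another provider rather than a different model.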

When to Use — When NOT to Use

Use it if...

  • You need inexpensive, general-purpose text embeddings for semantic search across large corpora.
  • You need to build retrieval-augmented generation pipelines with a strong open-source embedding model.
  • Your use case involves clustering or deduplicating many short texts, titles, or snippets.
  • Your use case involves intent or topic matching between queries and knowledge base articles.
  • You need multilingual embeddings for cross-language search and similarity without heavy licensing constraints.
  • Your use case involves reranking search results using vector similarity from a compact model.
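
As an illustration of the deduplication use case in the list above, here is a minimal greedy sketch: a text is kept only if its vector is not too similar to any vector already kept. The 0.95 threshold and the toy 2-dimensional vectors are assumptions for the example, not recommended settings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def deduplicate(items, threshold=0.95):
    """Greedy near-duplicate removal over (text, vector) pairs, preserving input order."""
    kept = []
    for text, vec in items:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((text, vec))
    return [text for text, _ in kept]

# Toy vectors standing in for real embeddings of short titles.
titles = [
    ("Reset your password", [1.0, 0.0]),
    ("How to reset a password", [0.99, 0.05]),
    ("Update billing details", [0.0, 1.0]),
]
print(deduplicate(titles))
```

The pairwise comparison is quadratic in the number of kept items, so larger collections usually pair the same threshold test with a nearest-neighbor index.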

Avoid if...

  • You need a proprietary, fully managed embedding API with strict enterprise uptime SLAs.
  • Your workload requires state-of-the-art performance on niche domains like code or biology.
  • You need maximum-quality, very high-dimensional embeddings regardless of compute and memory cost.
  • Your workload requires on-device embeddings within extremely tight latency and memory budgets.
  • You need a model explicitly optimized and benchmarked for very long document embeddings.
  • You need unified vendor support, billing, and monitoring tightly integrated with a single cloud platform.

Frequently Asked Questions

  • What is Embed V1 4B?

    Embed V1 4B is a Perplexity embedding model accessible through LLM.API, designed to generate vector representations of text for search, retrieval, and similarity.

  • What is Embed V1 4B best suited for?

    Embed V1 4B is best for semantic search, retrieval-augmented generation, clustering, deduplication, and recommendation systems where dense text embeddings are required.

  • How is Embed V1 4B priced when used via LLM.API?

    Embed V1 4B pricing on LLM.API is usage-based per input token or character, with exact rates defined in your LLM.API pricing plan.

  • What context window does Embed V1 4B support?

    Embed V1 4B supports inputs of up to 32K tokens, which covers typical search and retrieval workloads; longer documents need to be split into chunks before embedding.

  • How fast is Embed V1 4B in terms of latency?

    Embed V1 4B is optimized for low-latency embedding generation, typically suitable for real-time or near real-time search and retrieval workloads.

  • What modalities does Embed V1 4B support?

    Embed V1 4B is a text embedding model and supports only text inputs, not images, audio, or video.

  • How do I call Embed V1 4B through LLM.API?

    You call Embed V1 4B via LLM.API by selecting the Perplexity provider and specifying the Embed V1 4B model name in your embedding requests.

  • How does Embed V1 4B compare to larger Perplexity or other providers' embedding models?

    Embed V1 4B is the larger of the two pplx-embed-v1 models (the other is a 0.6B version). Against larger embedding models from other providers, it typically trades some accuracy for better speed and lower pricing.

  • Does Embed V1 4B support multilingual embeddings?

    Embed V1 4B can handle multiple languages to some extent, but its strongest performance is usually in English-centric or high-resource language datasets.

  • What limitations should I be aware of when using Embed V1 4B?

    Embed V1 4B may underperform on highly specialized domains, documents longer than its 32K-token input limit, or tasks requiring fine-grained reasoning beyond semantic similarity.
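
For the question above about calling the model, a request might look roughly like the sketch below. The endpoint URL, the bearer-token header, and the OpenAI-style request body are all assumptions made for illustration; only the model id pplx-embed-v1-4B comes from this page. Check your LLM.API account for the actual endpoint and request schema.

```python
import json
import urllib.request

# Hypothetical values: replace with the endpoint and key from your own account.
API_URL = "https://api.example.com/v1/embeddings"
API_KEY = "YOUR_API_KEY"

def build_embedding_request(texts, model="pplx-embed-v1-4B"):
    """Build (but do not send) a POST request that asks for embeddings of `texts`."""
    payload = {"model": model, "input": texts}  # assumed OpenAI-style body
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_embedding_request(["What is dense retrieval?"])
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request), then parse the JSON response body.
```

The request is built separately from sending so the payload can be inspected or logged before any network call is made.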
