Powered by NVIDIA

Llama Nemotron Embed VL 1B V2 (free)

  • Embeddings

Llama Nemotron Embed VL 1B V2 (free) is NVIDIA’s 1B-parameter multimodal embedding model optimized for question-answering retrieval over text and visual document data. It produces dense vector embeddings from text, images, or combined image–text inputs for high-quality semantic search and RAG systems.

What is Llama Nemotron Embed VL 1B V2 (free)?

Llama Nemotron Embed VL 1B V2 (free) is a vision-language embedding model from NVIDIA designed for multimodal question-answering retrieval over text and document images. Its main use is embedding large document corpora (including pages with text, tables, charts, and infographics) into dense vectors for semantic retrieval, enterprise search, and knowledge indexing. It also powers RAG pipelines that retrieve relevant visual or textual context for a text query, and it accepts text, image, and combined text+image inputs with a large context window. It belongs to NVIDIA’s Nemotron RAG collection and Llama Nemotron embedding family, and is offered as a free variant via providers like OpenRouter and Remova.

5 Core Capabilities

  • Multimodal Embeddings

    Generates dense vector embeddings from text, images, or combined image-text document pages for retrieval over multimodal corpora.

  • Text Document Retrieval

    Embeds textual queries and passages so semantically related documents can be efficiently retrieved using vector similarity search.

  • Visual Document Retrieval

    Encodes page images containing text, tables, charts, and infographics to enable semantic search over scanned or PDF documents.

  • Question Answer Retrieval

    Optimized to embed user questions and relevant pages so answer-containing documents are ranked highly in retrieval pipelines.

  • Multilingual Support

    Provides multilingual text embeddings, enabling cross-language retrieval where queries and documents may be written in different languages.
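
The retrieval capabilities above all reduce to comparing a query embedding against stored document embeddings. The following is a minimal sketch of that ranking step using cosine similarity. The toy 3-dimensional vectors and the names `cosine_similarity` and `rank_documents` are illustrative assumptions, not part of any NVIDIA or LLM.API library; real embeddings from this model would be much longer.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy vectors standing in for real page embeddings.
docs = {
    "invoice_page": [0.9, 0.1, 0.0],
    "chart_page": [0.1, 0.9, 0.1],
    "contract_page": [0.7, 0.2, 0.6],
}
query = [1.0, 0.0, 0.0]
results = rank_documents(query, docs, top_k=2)
print(results)
```

In production a vector database would typically replace the linear scan, but the scoring idea stays the same.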

6 Most Valuable Use Cases

  • Multimodal QA Retrieval
  • Visual Document Search
  • Legal Case Retrieval
  • Regulation Change Monitoring
  • E-commerce Catalog Search
  • RAG System Embeddings

Cost Comparison

Among the providers listed below, LLM.API ties NVIDIA for the lowest price ($0.00 per 1M tokens) and lists the lowest latency and highest throughput for Llama Nemotron-class vision-language embeddings.

| Provider       | Region  | Latency | Throughput | Uptime | Input ($/1M tokens) | Output ($/1M tokens) | Context      |
|----------------|---------|---------|------------|--------|---------------------|----------------------|--------------|
| LLM.API (best) | Global  | 50ms    | 120 img/s  | 99.99% | $0.00               | $0.00                | 4096 tokens  |
| NVIDIA         | US West | ~120ms  | ~40 img/s  | ~99.9% | $0.00               | $0.00                | ~4096 tokens |
| AWS Bedrock    | US East | ~160ms  | ~30 img/s  | 99.9%  | ~$0.60              | ~$0.60               | ~4096 tokens |
| Azure AI       | EU West | ~170ms  | ~25 img/s  | 99.9%  | ~$0.70              | ~$0.70               | ~4096 tokens |
| Replicate      | Global  | ~200ms  | ~20 img/s  | ~99.5% | ~$1.20              | ~$1.20               | ~4096 tokens |
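
To make the per-token prices concrete, here is a small worked calculation. It assumes a hypothetical workload of 500M input tokens per month (a figure chosen only for illustration) and uses the approximate input prices listed in the table; output tokens are ignored because an embedding call returns vectors rather than generated text.

```python
# Approximate input prices in USD per 1M tokens, taken from the table above.
INPUT_PRICE_PER_1M = {
    "LLM.API": 0.00,
    "NVIDIA": 0.00,
    "AWS Bedrock": 0.60,
    "Azure AI": 0.70,
    "Replicate": 1.20,
}

def monthly_cost(tokens, price_per_million):
    """Cost in USD for embedding `tokens` input tokens at the given price."""
    return tokens / 1_000_000 * price_per_million

MONTHLY_TOKENS = 500_000_000  # hypothetical workload, not a measured figure

costs = {
    provider: round(monthly_cost(MONTHLY_TOKENS, price), 2)
    for provider, price in INPUT_PRICE_PER_1M.items()
}
for provider, cost in costs.items():
    print(f"{provider}: ${cost:,.2f} per month")
```

At this assumed volume the paid providers land between roughly $300 and $600 per month, while the two free listings stay at $0.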

Technical Specifications

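| Metric              | Llama Nemotron Embed VL 1B V2 (free) | OpenAI text-embedding-3-small | Cohere Embed v3 English |
|---------------------|--------------------------------------|-------------------------------|-------------------------|
| Dimensions          | 1024                                 | 1536                          | 1024                    |
| Max Input Tokens    | ~8K                                  | 8192                          | ~8K                     |
| Price per 1M Tokens | $0.00                                | $0.02                         | $0.10                   |
| Throughput          | ~5K tok/s                            | ~10K tok/s                    | ~7K tok/s               |
| Avg Latency         | ~120ms                               | ~100ms                        | ~140ms                  |
| Uptime              | ~99.5%                               | ~99.9%                        | ~99.9%                  |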
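
The Dimensions row has a direct effect on index size. The sketch below estimates raw storage for a vector index, assuming vectors are stored as 32-bit floats (4 bytes per value) with no compression or index overhead; the 1M-vector corpus size is a hypothetical example.

```python
BYTES_PER_FLOAT32 = 4  # assumption: uncompressed float32 storage

def index_size_bytes(num_vectors, dimensions):
    """Raw bytes needed to store num_vectors embeddings of the given width."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32

NUM_VECTORS = 1_000_000  # hypothetical corpus size

size_1024 = index_size_bytes(NUM_VECTORS, 1024)  # 4,096,000,000 bytes
size_1536 = index_size_bytes(NUM_VECTORS, 1536)  # 6,144,000,000 bytes

print(f"1024 dims: {size_1024 / 1e9:.3f} GB")
print(f"1536 dims: {size_1536 / 1e9:.3f} GB")
```

Under these assumptions, a 1024-dimensional index needs about two thirds of the storage of a 1536-dimensional one for the same number of vectors.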

30-day usage via LLM.API

  • 3.4B prompt tokens processed (30 days)
  • 9.1M API requests served (30 days)
  • 310K unique developers using this model (30 days)
  • 99.8% average uptime over last 30 days
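
A few derived figures help put these usage numbers in context. The arithmetic below uses only the four values above and assumes a 30-day window of 720 hours; the averages are simple ratios and say nothing about how usage is actually distributed.

```python
PROMPT_TOKENS = 3_400_000_000   # 3.4B prompt tokens
API_REQUESTS = 9_100_000        # 9.1M requests
UNIQUE_DEVELOPERS = 310_000     # 310K developers
UPTIME = 0.998                  # 99.8% average uptime
HOURS_IN_WINDOW = 30 * 24       # 720 hours in 30 days

tokens_per_request = round(PROMPT_TOKENS / API_REQUESTS)               # 374
requests_per_developer = round(API_REQUESTS / UNIQUE_DEVELOPERS, 1)    # 29.4
downtime_hours = round((1 - UPTIME) * HOURS_IN_WINDOW, 2)              # 1.44

print(f"Average tokens per request: {tokens_per_request}")
print(f"Average requests per developer: {requests_per_developer}")
print(f"Implied downtime over 30 days: {downtime_hours} hours")
```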

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

  • Unified AI Routing

    Dynamically route each request to the best model across providers based on latency, cost, and quality, with no client changes required.

    One endpoint, every model

  • Cost-Aware Orchestration

    Optimize spend by mixing premium and budget models behind a single API, with pricing controls and per-route policies baked into your architecture.

    Cut costs, keep quality

  • Resilient Fallback Logic

    Automatic failover to backup models and regions when a provider degrades, keeping your AI features reliable without extra retry logic in your code.

    Stay online under failure

  • Full-Stack Observability

    Trace every call across providers with logs, metrics, and structured events so you can debug latency, failures, and quality from one place.

    See every token hop

  • Task-Level Abstractions

    Describe what you want (chat, tools, RAG, workflows) once, and let LLM.API map tasks to the right models and parameters automatically.

    Think tasks, not models

  • High-Throughput Batch

    Submit massive batches across providers with built-in queuing, parallelization, and retry semantics, instead of building and tuning your own job runner.

    Millions of calls, one API
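
For readers unfamiliar with the failover pattern described under Resilient Fallback Logic, the sketch below shows the general idea of trying providers in order. It is a generic illustration only: the function names and stub providers are hypothetical, and it does not represent how LLM.API implements failover internally.

```python
def call_with_fallback(providers, payload):
    """Try each (name, callable) pair in order; return the first success."""
    failures = []
    for name, embed in providers:
        try:
            return name, embed(payload)
        except Exception as exc:  # a real client would catch narrower error types
            failures.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(failures))

# Hypothetical stubs standing in for real provider clients.
def primary_region(payload):
    raise ConnectionError("simulated outage")

def backup_region(payload):
    return [0.0] * 4  # placeholder embedding

used, vector = call_with_fallback(
    [("primary", primary_region), ("backup", backup_region)],
    {"input": "hello"},
)
print(used, vector)
```

A gateway that handles this server-side removes the need for such a loop in application code.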

When to Use — When NOT to Use

Use it if...

  • You need a free multimodal embedding model for both images and text.
  • You need to build image-text retrieval or visual search with minimal infrastructure cost.
  • Your use case involves clustering or deduplicating large mixed text–image datasets efficiently.
  • Your use case involves lightweight multimodal similarity search where 1B-parameter quality is sufficient.
  • You need compact vision-language embeddings to power recommendation or content discovery features.
  • Your use case involves encoding images and captions to train downstream retrieval models.
  • You need to prototype multimodal search quickly using an off-the-shelf NVIDIA embedding model.
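
As an illustration of the deduplication use case listed above, the sketch below greedily drops items whose embedding is nearly identical to one already kept. The 0.95 cosine threshold, the toy 2-dimensional vectors, and the function names are assumptions made for the example; a suitable threshold would need tuning on real data.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def deduplicate(items, threshold=0.95):
    """Keep an item only if it is below `threshold` similarity to every kept item."""
    kept = []
    for item_id, vec in items:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((item_id, vec))
    return [item_id for item_id, _ in kept]

# Hypothetical embeddings: an image, a near-copy of it, and an unrelated caption.
items = [
    ("img_1", [1.0, 0.0]),
    ("img_1_copy", [0.99, 0.05]),
    ("caption_7", [0.0, 1.0]),
]
unique_ids = deduplicate(items)
print(unique_ids)
```

This pairwise approach is quadratic in the number of kept items, so large datasets would normally use an approximate nearest-neighbor index instead.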

Avoid if...

  • You need a generative model that produces text, code, or images from prompts.
  • Your workload requires state-of-the-art semantic understanding on very long multimodal documents.
  • You need highly precise domain-specialized embeddings for legal, medical, or scientific tasks.
  • Your workload requires complex reasoning or tool use rather than simple similarity embeddings.
  • You need to run entirely in CPU-only environments without access to NVIDIA GPUs.
  • Your workload requires strict, battle-tested production SLAs and enterprise hosting out of the box.
  • You need multilingual embeddings with strong performance across many low-resource languages.

Frequently Asked Questions

  • What is Llama Nemotron Embed VL 1B V2 (free)?

    Llama Nemotron Embed VL 1B V2 (free) is an NVIDIA vision-language embedding model that generates joint vector representations for text and images.

  • What is Llama Nemotron Embed VL 1B V2 (free) best suited for?

    It is best for semantic search, multimodal retrieval, clustering, and recommendation systems that require aligned embeddings of text and visual content.

  • How much does it cost to use Llama Nemotron Embed VL 1B V2 (free) on LLM.API?

    The Llama Nemotron Embed VL 1B V2 (free) tier is available at zero API usage cost on LLM.API, subject to platform-wide rate limits.

  • What modalities does Llama Nemotron Embed VL 1B V2 (free) support?

    It supports multimodal input, allowing you to encode text-only, image-only, or combined image-plus-text inputs into a single embedding space.

  • What is the context window of Llama Nemotron Embed VL 1B V2 (free) for text inputs?

    Llama Nemotron Embed VL 1B V2 (free) supports text inputs up to 8,192 tokens per request via LLM.API.

  • How fast is Llama Nemotron Embed VL 1B V2 (free) in terms of latency?

    As a compact 1B-parameter model, it is optimized for low-latency embedding generation, typically returning results in tens of milliseconds per request.

  • How do I call Llama Nemotron Embed VL 1B V2 (free) through the LLM.API gateway?

    Specify the model name "nvidia/llama-nemotron-embed-vl-1b-v2-free" in your LLM.API request along with your text and image payloads.

  • How does Llama Nemotron Embed VL 1B V2 (free) compare to larger multimodal embedding models?

    Compared to larger multimodal embedders, it generally offers lower latency and cost with slightly lower embedding quality on complex, fine-grained tasks.

  • Can I use Llama Nemotron Embed VL 1B V2 (free) for general text generation?

    No, it is an embedding model designed solely to produce vector representations, not to generate or continue natural language text.

  • What limitations should I be aware of when using Llama Nemotron Embed VL 1B V2 (free)?

    It may struggle with very long documents, highly specialized domains, or detailed image reasoning compared to larger, domain-tuned multimodal models.
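
As a rough illustration of the FAQ answer on calling the model, the sketch below builds (but does not send) an HTTP request for text inputs. Only the model name comes from this page. The base URL is a placeholder, and the `/embeddings` path, the `model`/`input` JSON fields, and Bearer-token authentication are assumptions modeled on common OpenAI-style embedding APIs; check the LLM.API documentation for the actual request format, including how image payloads are passed.

```python
import json
import urllib.request

# Placeholder host: replace with the real gateway URL from the provider's docs.
BASE_URL = "https://llm-api.example.invalid/v1"
MODEL = "nvidia/llama-nemotron-embed-vl-1b-v2-free"  # model name quoted in the FAQ

def build_embedding_request(texts, api_key):
    """Build a POST request for text embeddings (assumed OpenAI-style schema)."""
    body = json.dumps({"model": MODEL, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/embeddings",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

request = build_embedding_request(
    ["What does the Q3 revenue chart show?"],
    api_key="YOUR_API_KEY",
)
# Sending is left out on purpose; with a real URL and key it would be:
#   with urllib.request.urlopen(request) as response:
#       result = json.load(response)
print(request.full_url)
```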
