Llama Nemotron Embed VL 1B V2 (free)

Multimodal Embeddings

Llama Nemotron Embed VL 1B V2 (free) is NVIDIA’s 1B-parameter multimodal embedding model optimized for question-answering retrieval over text and visual document data. It produces dense vector embeddings from text, images, or combined image–text inputs for high-quality semantic search and RAG systems.

API Performance

Latency: ~0.6s avg embedding latency
Context: ~8K token context
Input: Free per 1M tokens
Output: Free per 1M embedding vectors
Uptime: 99%

About the model

What is Llama Nemotron Embed VL 1B V2 (free)?

Llama Nemotron Embed VL 1B V2 (free) is a vision-language embedding model from NVIDIA designed for multimodal question-answering retrieval over text and document images. Its main use is embedding large document corpora, including pages with text, tables, charts, and infographics, into dense vectors for semantic retrieval, enterprise search, and knowledge indexing. It also powers RAG pipelines that retrieve relevant visual or textual context for a text query. It accepts text, image, and combined text+image inputs, with a context window of roughly 8K tokens. It belongs to NVIDIA’s Nemotron RAG collection and the Llama Nemotron embedding family, and is offered as a free variant through providers such as OpenRouter and Remova.

Input / Output

Input

Text prompts (queries and document text for embedding)
Images (document pages, screenshots, infographics, tables, charts)
Combined image–text document inputs

Output

Fixed-size embedding vectors for retrieval and similarity search
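
The three input kinds listed above can be sketched as small Python helpers. This is a minimal illustration only: the item shapes (`type`, `text`, `image_url`) and the helper names are assumptions made for this example, not a documented request schema, so check the provider's API reference before relying on them.

```python
import base64


def text_input(text: str) -> dict:
    """Wrap plain text (a query or a passage) as one input item."""
    return {"type": "text", "text": text}


def image_input(image_bytes: bytes, mime: str = "image/png") -> dict:
    """Wrap raw image bytes as a base64 data URL (hypothetical item shape)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image", "image_url": f"data:{mime};base64,{encoded}"}


def page_input(image_bytes: bytes, text: str, mime: str = "image/png") -> dict:
    """Combine a page image with its extracted text into a single item."""
    item = image_input(image_bytes, mime)
    item["type"] = "image_text"
    item["text"] = text
    return item


# Placeholder bytes stand in for a real PNG file read from disk.
fake_png = b"\x89PNG\r\n\x1a\n"
items = [
    text_input("What was Q3 revenue?"),
    image_input(fake_png),
    page_input(fake_png, "Quarterly report, page 4"),
]
for item in items:
    print(item["type"], sorted(item))
```

Encoding images as data URLs keeps the whole request in one JSON body, at the cost of roughly a third more bytes than the raw file.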

Model capabilities

5 Core Capabilities

Multimodal Embeddings

Generates dense vector embeddings from text, images, or combined image-text document pages for retrieval over multimodal corpora.

Text Document Retrieval

Embeds textual queries and passages so semantically related documents can be efficiently retrieved using vector similarity search.

Visual Document Retrieval

Encodes page images containing text, tables, charts, and infographics to enable semantic search over scanned or PDF documents.

Question Answer Retrieval

Optimized to embed user questions and relevant pages so answer-containing documents are ranked highly in retrieval pipelines.

Multilingual Support

Provides multilingual text embeddings, enabling cross-language retrieval where queries and documents may be written in different languages.
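
The vector similarity search that these capabilities rely on can be sketched in a few lines. The example below is a minimal illustration: the four-dimensional vectors and document names are made up, standing in for the much longer vectors the model would return, and ranking uses cosine similarity, which is a common choice rather than anything this page specifies.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """Rank (doc_id, vector) pairs by similarity to the query vector."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: text passages and page images share one vector space.
index = [
    ("invoice_page_3.png", [0.9, 0.1, 0.0, 0.1]),
    ("policy_intro.txt", [0.1, 0.8, 0.3, 0.0]),
    ("revenue_chart.png", [0.7, 0.0, 0.6, 0.2]),
]
query = [0.8, 0.0, 0.5, 0.1]  # stand-in for an embedded user question

results = top_k(query, index, k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

A linear scan like this is fine for a few thousand vectors; larger corpora usually move to an approximate nearest-neighbor index.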

Use cases

6 Most Valuable Use Cases

Multimodal QA Retrieval
Visual Document Search
Legal Case Retrieval
Regulation Change Monitoring
E-commerce Catalog Search
RAG System Embeddings

Transparent pricing

Cost Comparison

In the comparison below, LLM API lists the lowest latency and highest throughput for Llama Nemotron–class vision-language embeddings, and matches NVIDIA at $0.00 per 1M tokens.

Provider	Region	Latency	Throughput	Uptime	Input ($/1M tokens)	Output ($/1M tokens)	Context (tokens)
LLM API	Global	50ms	120 img/s	99.99%	$0.00	$0.00	4096
NVIDIA	US West	~120ms	~40 img/s	~99.9%	$0.00	$0.00	~4096
AWS Bedrock	US East	~160ms	~30 img/s	99.9%	~$0.60	~$0.60	~4096
Azure AI	EU West	~170ms	~25 img/s	99.9%	~$0.70	~$0.70	~4096
Replicate	Global	~200ms	~20 img/s	~99.5%	~$1.20	~$1.20	~4096
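
To make the per-million prices concrete, the short sketch below converts the listed input prices into a monthly bill. The prices are copied from the table above (several are approximate), while the 250M-token monthly volume is a hypothetical workload chosen only for illustration.

```python
# Listed input prices in USD per 1M tokens, copied from the table above.
# Values marked "~" in the table are approximate.
PRICE_PER_MILLION = {
    "LLM API": 0.00,
    "NVIDIA": 0.00,
    "AWS Bedrock": 0.60,
    "Azure AI": 0.70,
    "Replicate": 1.20,
}

MONTHLY_TOKENS = 250_000_000  # hypothetical workload, not a measured figure


def monthly_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD for `tokens` input tokens at a per-1M-token price."""
    return tokens / 1_000_000 * price_per_million


for provider, price in PRICE_PER_MILLION.items():
    print(f"{provider:12s} ${monthly_cost(MONTHLY_TOKENS, price):8.2f}")
```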

Performance benchmarks

Technical Specifications

Metric	Llama Nemotron Embed VL 1B V2 (free)	OpenAI text-embedding-3-small	Cohere Embed v3 English
Dimensions	1024	1536	1024
Max Input Tokens	~8K	8192	~8K
Price per 1M Tokens	$0.00	$0.02	$0.10
Throughput	~5K tok/s	~10K tok/s	~7K tok/s
Avg Latency	~120ms	~100ms	~140ms
Uptime	~99.5%	~99.9%	~99.9%
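
The Dimensions row also determines how much storage a vector index needs. As a rough sketch, assuming each value is stored as a 4-byte float32 and ignoring any index overhead (both assumptions, since this page does not state a storage format), the raw footprint for one million vectors works out as follows.

```python
BYTES_PER_FLOAT32 = 4  # assumed storage type; quantized indexes use less

# Embedding dimensions from the table above.
DIMENSIONS = {
    "Llama Nemotron Embed VL 1B V2 (free)": 1024,
    "OpenAI text-embedding-3-small": 1536,
    "Cohere Embed v3 English": 1024,
}

NUM_VECTORS = 1_000_000  # hypothetical corpus size


def index_bytes(dimensions: int, num_vectors: int,
                bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Raw bytes needed to store `num_vectors` dense vectors."""
    return dimensions * num_vectors * bytes_per_value


for model, dims in DIMENSIONS.items():
    gib = index_bytes(dims, NUM_VECTORS) / 2**30
    print(f"{model}: {gib:.2f} GiB")
```

Under these assumptions a 1024-dimensional index needs two thirds of the space of a 1536-dimensional one.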

30-day usage via LLM API

3.4B: Prompt tokens processed (30 days)
9.1M: API requests served (30 days)
310K: Unique developers using this model (30 days)
99.8%: Average uptime over last 30 days
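
A few derived figures make these totals easier to interpret. The sketch below only does arithmetic on the numbers reported above, assuming the 30-day window is exactly 720 hours. The results are averages and say nothing about peak load.

```python
# Figures reported above for the last 30 days.
PROMPT_TOKENS = 3_400_000_000
REQUESTS = 9_100_000
DEVELOPERS = 310_000
UPTIME = 0.998

WINDOW_HOURS = 30 * 24  # assumes a 30-day window of exactly 720 hours

tokens_per_request = PROMPT_TOKENS / REQUESTS
requests_per_developer = REQUESTS / DEVELOPERS
requests_per_second = REQUESTS / (WINDOW_HOURS * 3600)
downtime_hours = (1 - UPTIME) * WINDOW_HOURS

print(f"Average tokens per request:     {tokens_per_request:.0f}")
print(f"Average requests per developer: {requests_per_developer:.1f}")
print(f"Average requests per second:    {requests_per_second:.1f}")
print(f"Implied downtime:               {downtime_hours:.2f} hours")
```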

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Unified AI Routing

Dynamically route each request to the best model across providers based on latency, cost, and quality, with no client changes required.
One endpoint, every model

Cost-Aware Orchestration

Optimize spend by mixing premium and budget models behind a single API, with pricing controls and per-route policies baked into your architecture.
Cut costs, keep quality

Resilient Fallback Logic

Automatic failover to backup models and regions when a provider degrades, keeping your AI features reliable without extra retry logic in your code.
Stay online under failure

Full-Stack Observability

Trace every call across providers with logs, metrics, and structured events so you can debug latency, failures, and quality from one place.
See every token hop

Task-Level Abstractions

Describe what you want (chat, tools, RAG, workflows) once, and let LLM.API map tasks to the right models and parameters automatically.
Think tasks, not models

High-Throughput Batch

Submit massive batches across providers with built-in queuing, parallelization, and retry semantics, instead of building and tuning your own job runner.
Millions of calls, one API
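
To show the kind of logic that cost-aware routing and automatic failover take off the client, here is a minimal sketch of what you would otherwise write yourself. It is not LLM.API's implementation: the `Route` type, the provider names, and the stub `embed` callables are all hypothetical, and the ordering rule (cheapest first, then fastest, within a latency budget) is just one plausible policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Route:
    name: str
    price_per_million: float  # USD per 1M input tokens
    latency_ms: float
    embed: Callable[[str], List[float]]  # hypothetical provider call


class AllRoutesFailed(RuntimeError):
    """Raised when no eligible route returns an embedding."""


def order_routes(routes: Sequence[Route], max_latency_ms: float) -> List[Route]:
    """Keep routes within the latency budget, cheapest first, then fastest."""
    eligible = [r for r in routes if r.latency_ms <= max_latency_ms]
    return sorted(eligible, key=lambda r: (r.price_per_million, r.latency_ms))


def embed_with_fallback(text: str, routes: Sequence[Route],
                        max_latency_ms: float = 250.0) -> Tuple[str, List[float]]:
    """Try each eligible route in order and return the first success."""
    errors = []
    for route in order_routes(routes, max_latency_ms):
        try:
            return route.name, route.embed(text)
        except Exception as exc:  # a real client would catch narrower errors
            errors.append(f"{route.name}: {exc}")
    raise AllRoutesFailed("; ".join(errors) or "no eligible routes")


def broken(_text: str) -> List[float]:
    raise ConnectionError("provider degraded")


def healthy(_text: str) -> List[float]:
    return [0.1, 0.2, 0.3]


routes = [
    Route("provider_a", 0.00, 50, broken),
    Route("provider_b", 0.00, 120, healthy),
    Route("provider_c", 0.60, 160, healthy),
]
used, vector = embed_with_fallback("hello", routes)
print(used, vector)
```

In this run the cheapest and fastest route fails, so the call falls through to the next free route before any paid one is tried.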

Decision guide

When to Use — When NOT to Use

Use it if...

You need a free multimodal embedding model for both images and text.
You need to build image-text retrieval or visual search with minimal infrastructure cost.
Your use case involves clustering or deduplicating large mixed text–image datasets efficiently.
Your use case involves lightweight multimodal similarity search where 1B-parameter quality is sufficient.
You need compact vision-language embeddings to power recommendation or content discovery features.
Your use case involves encoding images and captions to train downstream retrieval models.
You need to prototype multimodal search quickly using an off-the-shelf NVIDIA embedding model.
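
The deduplication use case in the list above can be sketched with a similarity threshold. This is a minimal greedy approach under stated assumptions: the three-dimensional vectors and file names are made up, and the 0.95 cosine threshold is an arbitrary starting point that would need tuning on real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate(items, threshold=0.95):
    """Keep an item only if it is not too similar to any item already kept."""
    kept = []
    for item_id, vec in items:
        if all(cosine(vec, kept_vec) < threshold for _, kept_vec in kept):
            kept.append((item_id, vec))
    return [item_id for item_id, _ in kept]


# Toy embeddings for a mixed text-image dataset.
items = [
    ("photo_1.jpg", [1.0, 0.0, 0.0]),
    ("photo_1_resized.jpg", [0.99, 0.05, 0.0]),
    ("caption_a.txt", [0.0, 1.0, 0.0]),
    ("photo_2.jpg", [0.0, 0.1, 1.0]),
]
unique_ids = deduplicate(items)
print(unique_ids)
```

The pairwise comparison is quadratic in the number of kept items, so for large datasets it is usually paired with a nearest-neighbor index.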

Avoid if...

You need a generative model that produces text, code, or images from prompts.
Your workload requires state-of-the-art semantic understanding on very long multimodal documents.
You need highly precise domain-specialized embeddings for legal, medical, or scientific tasks.
Your workload requires complex reasoning or tool use rather than simple similarity embeddings.
You need to run entirely in CPU-constrained environments without access to NVIDIA GPUs.
Your workload requires strict, battle-tested production SLAs and enterprise hosting out of the box.
You need multilingual embeddings with strong performance across many low-resource languages.

FAQ

Frequently Asked Questions

What is Llama Nemotron Embed VL 1B V2 (free)?

Llama Nemotron Embed VL 1B V2 (free) is an NVIDIA vision-language embedding model that generates joint vector representations for text and images.

What is Llama Nemotron Embed VL 1B V2 (free) best suited for?

It is best for semantic search, multimodal retrieval, clustering, and recommendation systems that require aligned embeddings of text and visual content.

How much does it cost to use Llama Nemotron Embed VL 1B V2 (free) on LLM.API?

The Llama Nemotron Embed VL 1B V2 (free) tier is available at zero API usage cost on LLM.API, subject to platform-wide rate limits.

What modalities does Llama Nemotron Embed VL 1B V2 (free) support?

It supports multimodal input, allowing you to encode text-only, image-only, or combined image-plus-text inputs into a single embedding space.

What is the context window of Llama Nemotron Embed VL 1B V2 (free) for text inputs?

Llama Nemotron Embed VL 1B V2 (free) supports text inputs up to 8,192 tokens per request via LLM.API.

How fast is Llama Nemotron Embed VL 1B V2 (free) in terms of latency?

As a compact 1B-parameter model, it is optimized for low-latency embedding generation, typically returning results in tens of milliseconds per request.

How do I call Llama Nemotron Embed VL 1B V2 (free) through the LLM.API gateway?

Specify the model name "nvidia/llama-nemotron-embed-vl-1b-v2-free" in your LLM.API request along with your text and image payloads.

How does Llama Nemotron Embed VL 1B V2 (free) compare to larger multimodal embedding models?

Compared to larger multimodal embedders, it generally offers lower latency and cost with slightly lower embedding quality on complex, fine-grained tasks.

Can I use Llama Nemotron Embed VL 1B V2 (free) for general text generation?

No, it is an embedding model designed solely to produce vector representations, not to generate or continue natural language text.

What limitations should I be aware of when using Llama Nemotron Embed VL 1B V2 (free)?

It may struggle with very long documents, highly specialized domains, or detailed image reasoning compared to larger, domain-tuned multimodal models.
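
Because text inputs are capped at 8,192 tokens per request, longer documents need to be split before embedding. The sketch below is one simple way to do that. It counts whitespace-separated words as a crude stand-in for tokens, which is an assumption: a real tokenizer usually produces more tokens than words, so a budget well below 8,192 is safer in practice.

```python
def chunk_words(text: str, max_tokens: int = 8192, overlap: int = 0) -> list:
    """Split text into chunks of at most `max_tokens` words.

    Consecutive chunks share `overlap` words so that sentences cut at a
    boundary still appear whole in one of the chunks.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be in [0, max_tokens)")

    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


# Tiny demonstration with a 4-word budget and a 1-word overlap.
sample = " ".join(f"w{i}" for i in range(10))
sample_chunks = chunk_words(sample, max_tokens=4, overlap=1)
for chunk in sample_chunks:
    print(chunk)
```

Each chunk is then embedded separately, and retrieval returns the chunk rather than the whole document.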

Start in 2 lines of code
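
A hedged sketch of what a request could look like, using only the Python standard library. The model name comes from the FAQ above. Everything else is an assumption made for illustration: the endpoint URL is a placeholder rather than a real address, and the request body follows the widely used `model` plus `input` embeddings convention, which this page does not confirm.

```python
import json
import urllib.request

API_URL = "https://api.example.com/v1/embeddings"  # placeholder, not a real endpoint
MODEL = "nvidia/llama-nemotron-embed-vl-1b-v2-free"  # model name from the FAQ


def build_request(api_key: str, inputs: list) -> urllib.request.Request:
    """Build (but do not send) a POST request for a batch of inputs."""
    body = json.dumps({"model": MODEL, "input": inputs}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_request("YOUR_API_KEY", ["What was Q3 revenue?"])
print(request.get_method(), request.full_url)

# Sending is left commented out because the URL above is a placeholder.
# The response shape shown here is likewise an assumption:
# with urllib.request.urlopen(request) as response:
#     vectors = [row["embedding"] for row in json.load(response)["data"]]
```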
