Powered by Google
Gemini Embedding 2
Gemini Embedding 2 is Google's natively multimodal embedding model that maps text, images, video, audio, and documents into a single semantic vector space. It is notable for unifying many media types in one model to power cross-modal search and retrieval.
About the model
What is Gemini Embedding 2?
Gemini Embedding 2 is a proprietary multimodal embedding model from Google that produces numerical vector representations for text, images, audio, video, and documents in a unified space. Its main use cases include powering retrieval-augmented generation, semantic search, recommendation, and classification across mixed media, and enabling cross-modal applications like using a text query to retrieve relevant images or video clips. It is part of Google’s Gemini Embedding family and succeeds earlier text-focused Gemini embedding models.
Model capabilities
5 Core Capabilities
- Text Embeddings
  Generates dense vector representations of text inputs optimized for semantic similarity, clustering, search, and retrieval-augmented applications.
- Multilingual Support
  Creates embeddings for text in many languages, enabling cross-lingual search, retrieval, and clustering across multilingual content.
- Long Context Handling
  Encodes documents of up to 8,192 tokens into a single embedding, supporting document search, recommendation, and large-scale corpus analysis.
- Code Representation
  Produces embeddings for source code snippets, supporting code search, code recommendation, and semantic matching across programming languages.
- Text-Image Alignment
  Places text and images in a joint embedding space, enabling multimodal retrieval such as image search from natural-language queries.
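Because text and images land in the same vector space, cross-modal retrieval reduces to a nearest-neighbor lookup. The sketch below illustrates that idea with cosine similarity. The three-dimensional vectors, the file names, and the query are made-up stand-ins for real model output, not values produced by Gemini Embedding 2.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, items, top_k=3):
    """Return the top_k (item_id, score) pairs, best match first."""
    scored = [(item_id, cosine_similarity(query_vec, vec))
              for item_id, vec in items.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for image embeddings (assumption: real
# embeddings are far larger, e.g. 3072 dimensions).
image_vectors = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Pretend embedding of the text query "sunny coastline".
text_query_vector = [0.8, 0.2, 0.1]

results = rank_by_similarity(text_query_vector, image_vectors, top_k=2)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

At production scale the linear scan would normally be replaced by an approximate nearest-neighbor index, but the scoring function stays the same.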
Use cases
6 Most Valuable Use Cases
- Semantic Text Search
- Document Clustering
- Legal Case Retrieval
- Regulation Change Monitoring
- Product Recommendation Ranking
- Multilingual Text Similarity
Transparent pricing
Cost Comparison
In the comparison below, LLM.API lists the lowest input price and the lowest latency among the providers shown.
| Provider | Region | Latency | Throughput | Uptime | Input ($/1M) | Output ($/1M) | Context |
|---|---|---|---|---|---|---|---|
| LLM.API (best) | Global | 80ms | 120K tokens/s | 99.99% | $0.05 | $0.00 | 512K tokens |
| | Global | ~150ms | ~60K tokens/s | 99.9% | ~$0.13 | $0.00 | ~307K tokens |
| Vertex AI (Google Cloud) | US East | ~160ms | ~55K tokens/s | 99.9% | ~$0.15 | $0.00 | ~307K tokens |
| AWS Bedrock (equivalent embedding model) | US East | ~180ms | ~50K tokens/s | 99.9% | ~$0.12 | $0.00 | ~100K tokens |
| Azure AI (equivalent embedding model) | EU West | ~190ms | ~45K tokens/s | 99.9% | ~$0.11 | $0.00 | ~100K tokens |
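To make the per-1M-token prices concrete, here is a minimal sketch that turns them into a monthly bill. The prices are copied from the table above (most are approximate there, and the row without a provider name is left out). The 2-billion-token workload is a hypothetical figure chosen only for illustration.

```python
from decimal import Decimal

# Input prices in USD per 1M tokens, copied from the comparison table.
PRICE_PER_MILLION = {
    "LLM.API": Decimal("0.05"),
    "Vertex AI (Google Cloud)": Decimal("0.15"),
    "AWS Bedrock": Decimal("0.12"),
    "Azure AI": Decimal("0.11"),
}

def embedding_cost(tokens, price_per_million):
    """Cost in USD for embedding `tokens` input tokens."""
    return Decimal(tokens) / Decimal(1_000_000) * price_per_million

# Hypothetical monthly volume: 2 billion input tokens.
monthly_tokens = 2_000_000_000

costs = {
    provider: embedding_cost(monthly_tokens, price)
    for provider, price in PRICE_PER_MILLION.items()
}
cheapest = min(costs, key=costs.get)

for provider, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{provider}: ${cost:,.2f}")
```

Embedding calls have no output tokens, so only the input column contributes to the total.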
Performance benchmarks
Technical Specifications
| Metric | Gemini Embedding 2 | OpenAI text-embedding-3-large | Cohere Embed v3 English | AWS Titan Text Embeddings V2 |
|---|---|---|---|---|
| Embedding Dimensions | 3072 | 3072 | 1024 | 1024 |
| Max Input Tokens | 8,192 | — | — | 8,000 |
| Price per 1M Tokens (Input) | $0.02 | $0.13 | $0.10 | $0.12 |
| Price per 1M Tokens (Output) | — | $0.13 | — | — |
| Modalities Supported | Text, Image, Video, Audio, Documents | Text | Text | Text |
| Throughput | — | — | — | — |
| Avg Latency | — | — | — | — |
| Service Uptime (SLA) | — | — | — | — |
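The dimension row has a direct storage consequence: a 3072-dimensional vector takes three times the space of a 1024-dimensional one. The sketch below estimates raw index size, assuming uncompressed 32-bit floats and ignoring any index overhead. The one-million-vector corpus is a hypothetical figure.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for dense vectors (default: 4-byte float32 values)."""
    return num_vectors * dimensions * bytes_per_value

def to_gib(num_bytes):
    """Convert bytes to gibibytes."""
    return num_bytes / (1024 ** 3)

# Hypothetical corpus size.
corpus_vectors = 1_000_000

size_3072 = index_size_bytes(corpus_vectors, 3072)
size_1024 = index_size_bytes(corpus_vectors, 1024)

print(f"3072 dims: {to_gib(size_3072):.2f} GiB")
print(f"1024 dims: {to_gib(size_1024):.2f} GiB")
```

Quantization or a lower output dimension, where a provider offers one, would shrink these numbers; neither is modeled here.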
30-day usage via LLM.API
- 3.8B text chunks embedded (30 days)
- 520M API requests (30 days)
- 45K active developer accounts (30 days)
- 99.97% average API uptime (30 days)
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
- Unified AI Routing
  Automatically route each request to the optimal model across providers based on latency, capability, and policies, with no client changes, just better defaults.
  One endpoint, every model
- Cost-Aware Controls
  Define per-project or per-tenant budgets, choose cost ceilings, and let LLM.API pick the cheapest model that still meets your quality and latency targets.
  Lower spend, same output
- Resilient Fallback Logic
  Eliminate single-vendor outages with built-in failover across providers, automatic retries, and policy-based degradation that keeps your product responsive.
  Never ship 500s again
- End-to-End Observability
  Get unified logs, traces, and metrics for every provider (latency, errors, token usage, and prompts), all correlated to requests and tenants in one place.
  See every token flow
- Task-Level Orchestration
  Describe tasks, constraints, and tools once; LLM.API handles model selection, tool calling, and execution flow so you focus on product logic, not glue code.
  Ship workflows, not wiring
- High-Throughput Batch APIs
  Process millions of inferences efficiently with bulk submission, concurrency control, and automatic chunking tuned for each provider’s limits and quotas.
  Scale from 10 to millions
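The cost-aware and fallback behaviors described above can be pictured as one small loop: keep the providers that meet a latency target, try them cheapest first, and move on when one fails. The sketch below is an illustration of that idea only, not LLM.API's implementation. The `Provider` class, the stub embed functions, and every number in it are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

@dataclass
class Provider:
    name: str
    price_per_million: float  # USD per 1M input tokens
    latency_ms: float         # typical latency
    embed: Callable[[Sequence[str]], List[List[float]]]

class AllProvidersFailed(RuntimeError):
    """Raised when no eligible provider returned embeddings."""

def candidates(providers, max_latency_ms):
    """Providers that meet the latency target, cheapest first."""
    eligible = [p for p in providers if p.latency_ms <= max_latency_ms]
    return sorted(eligible, key=lambda p: p.price_per_million)

def embed_with_fallback(providers, texts, max_latency_ms, retries_per_provider=1):
    """Try each eligible provider in cost order, retrying before failing over."""
    errors = []
    for provider in candidates(providers, max_latency_ms):
        for attempt in range(1 + retries_per_provider):
            try:
                return provider.name, provider.embed(texts)
            except Exception as exc:  # broad on purpose for this sketch
                errors.append(f"{provider.name} attempt {attempt + 1}: {exc}")
    raise AllProvidersFailed("; ".join(errors) or "no provider met the latency target")

# Stub backends standing in for real API clients.
def failing_embed(texts):
    raise ConnectionError("simulated outage")

def stub_embed(texts):
    return [[float(len(t)), 0.0] for t in texts]

providers = [
    Provider("backup", 0.15, 160, stub_embed),
    Provider("cheap", 0.05, 80, failing_embed),
    Provider("slow", 0.01, 900, stub_embed),
]

used, vectors = embed_with_fallback(providers, ["hello"], max_latency_ms=200)
print(used, vectors)
```

In this run "slow" is excluded by the latency target, "cheap" fails twice, and the call succeeds on "backup". A real router would also need timeouts, backoff, and a check that all providers return vectors from the same model, since embeddings from different models are not comparable.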
Decision guide
When to Use — When NOT to Use
Use it if...
- You need general-purpose text embeddings for semantic search, clustering, or retrieval applications.
- You need multilingual embeddings that handle many languages consistently within a single vector space.
- Your use case involves building recommendation systems based on textual similarity or user profiles.
- You need embeddings optimized for low latency and reasonable cost on Google Cloud.
- Your use case involves hybrid search, combining Gemini Embedding 2 with keyword or metadata filters.
- You need embeddings well-integrated with other Google Vertex AI or Gemini-based workflows.
- Your use case involves encoding short queries and longer documents into the same embedding space.
Avoid if...
- You need to embed a modality or file type outside the text, image, audio, video, and document inputs the model accepts.
- Your workload requires full generative capabilities like conversation, code synthesis, or content creation.
- You need ultra-long-context document understanding beyond the maximum token limits of embeddings.
- Your workload requires highly domain-specific vectors trained on proprietary data you fully control.
- You need on-premise deployment without relying on Google-managed cloud infrastructure or APIs.
- Your workload requires strict vendor neutrality, avoiding lock-in to any specific cloud provider.
- You need binary or sparse representations instead of dense floating-point embeddings for storage efficiency.
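The hybrid search case in the list above can be sketched as a metadata filter followed by a blended score. This is one common pattern, shown under assumptions: the `alpha` weight, the simple term-overlap keyword score, the document records, and the two-dimensional vectors are all hypothetical and stand in for real embeddings and a real keyword engine.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def keyword_score(query, text):
    """Fraction of query terms that appear in the text (case-insensitive)."""
    terms = set(query.lower().split())
    if not terms:
        return 0.0
    words = set(text.lower().split())
    return len(terms & words) / len(terms)

def hybrid_search(query, query_vec, docs, required_meta, alpha=0.7, top_k=3):
    """Filter by metadata, then rank by a weighted vector + keyword score."""
    results = []
    for doc in docs:
        # Skip documents that do not match every required metadata field.
        if any(doc["meta"].get(k) != v for k, v in required_meta.items()):
            continue
        score = (alpha * cosine(query_vec, doc["vector"])
                 + (1 - alpha) * keyword_score(query, doc["text"]))
        results.append((doc["id"], score))
    results.sort(key=lambda r: r[1], reverse=True)
    return results[:top_k]

docs = [
    {"id": "a", "text": "refund policy for online orders",
     "meta": {"lang": "en"}, "vector": [0.9, 0.1]},
    {"id": "b", "text": "shipping times and carriers",
     "meta": {"lang": "en"}, "vector": [0.2, 0.9]},
    {"id": "c", "text": "refund policy",
     "meta": {"lang": "de"}, "vector": [0.9, 0.1]},
]

# [1.0, 0.0] is a pretend embedding of the query text.
hits = hybrid_search("refund policy", [1.0, 0.0], docs, {"lang": "en"})
print(hits)
```

Document "c" is removed by the language filter before scoring, which is usually cheaper than scoring everything and filtering afterwards.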
FAQ
Frequently Asked Questions
- What is Gemini Embedding 2?
  Gemini Embedding 2 is Google’s natively multimodal embedding model, designed to generate dense vector representations of text, images, video, audio, and documents for search, retrieval, and semantic similarity.
- What input modalities does Gemini Embedding 2 support?
  Gemini Embedding 2 accepts text (including source code), images, video, audio, and documents, and maps them into a single vector space.
- How do I access Gemini Embedding 2 through LLM.API?
  You call the unified LLM.API embeddings endpoint with the provider set to Google and the model set to Gemini Embedding 2.
- What is the context window of Gemini Embedding 2?
  Gemini Embedding 2 supports input sequences up to 8,192 tokens, after which inputs must be truncated or chunked.
- How fast is Gemini Embedding 2 for generating embeddings via LLM.API?
  Embedding requests typically return in tens of milliseconds to low hundreds of milliseconds per batch, depending on batch size and network latency.
- How is pricing for Gemini Embedding 2 handled on LLM.API?
  LLM.API charges for Gemini Embedding 2 by input tokens, with rates quoted per 1M tokens as in the tables above; the exact rate is shown in your LLM.API pricing and usage dashboard.
- How does Gemini Embedding 2 compare to other embedding models on LLM.API?
  Gemini Embedding 2 offers strong multilingual and code understanding, often outperforming older open-source embedding models in retrieval and semantic similarity benchmarks.
- What are the main limitations of Gemini Embedding 2?
  Gemini Embedding 2 cannot generate text, has a fixed maximum context length, and may encode biases present in its training data.
- Can I use Gemini Embedding 2 for multilingual applications?
  Yes, Gemini Embedding 2 supports many languages and produces a shared embedding space suitable for cross-lingual retrieval and semantic search.
- Does Gemini Embedding 2 support batching through LLM.API?
  Yes, you can send an array of input texts in a single embeddings request to Gemini Embedding 2 to reduce per-item latency and cost.
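The chunking and batching answers above combine naturally: split long inputs to fit the 8,192-token limit, then group the chunks into array requests. The sketch below shows one way to do that without making any network call. Two things in it are assumptions: whitespace-separated words stand in for tokens (a real tokenizer will count differently, so leave headroom), and the payload field names and the `gemini-embedding-2` model string are hypothetical, not the documented LLM.API schema.

```python
MAX_TOKENS = 8192  # input limit stated in the FAQ above

def chunk_text(text, max_tokens=MAX_TOKENS):
    """Split text into pieces of at most max_tokens whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def build_request(inputs):
    """Assemble one embeddings request body (hypothetical field names)."""
    return {
        "provider": "google",           # assumed field and value
        "model": "gemini-embedding-2",  # assumed model identifier
        "input": list(inputs),          # array of texts, one vector each
    }

# A synthetic 20,000-word document to exercise the chunker.
document = "word " * 20000

chunks = chunk_text(document)
payloads = [build_request(batch) for batch in batched(chunks, batch_size=2)]

print(len(chunks), "chunks in", len(payloads), "requests")
```

Each chunk yields its own vector, so downstream code needs to keep a mapping from chunk back to source document.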
