GTE-Large

Text Embedding

GTE-Large is a general-purpose English text embedding model from Thenlper based on the General Text Embeddings (GTE) architecture. It produces 1,024-dimensional sentence embeddings optimized for semantic similarity and retrieval tasks.

Start Using API

API Performance

Latency: ~0.25s avg encoding time per 1K tokens on GPU
Context: 512 tokens
Input: Free per 1M tokens
Output: $0.00 per 1M tokens
Uptime: 99%

About the model

What is GTE-Large?

GTE-Large is a BERT-based General Text Embeddings model released by Thenlper that generates 1,024-dimensional sentence and document embeddings for English text. It is mainly used for information retrieval and semantic search, where dense vector representations are required to match queries with relevant passages or documents. It is also applied to tasks such as semantic textual similarity, clustering, reranking, and various downstream applications evaluated on the MTEB benchmark. GTE-Large belongs to the GTE family of models introduced in the paper “Towards General Text Embeddings with Multi-stage Contrastive Learning,” alongside smaller variants like GTE-Base and GTE-Small.

Input / Output

Input

Text (sentences, paragraphs, short documents for embedding)

Output

Vector embeddings (1024-dimensional numeric representations)

Model capabilities

5 Core Capabilities

Text Embedding

Encodes English sentences, paragraphs, and moderate-length documents into dense 1024-dimensional vectors for downstream semantic tasks.
Semantic Similarity

Generates embeddings enabling accurate semantic textual similarity comparisons between sentence or document pairs using vector distance metrics.
Information Retrieval

Produces high-quality embeddings optimized for retrieval pipelines, improving search ranking and relevance over traditional lexical approaches.
Reranking Support

Provides rich semantic embeddings that can rerank candidate search or recommendation results for better ordering and relevance.
Clustering Usage

Offers consistent vector representations suitable for clustering texts into semantically coherent groups in analytics or discovery workflows.
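
The similarity, retrieval, and reranking capabilities above all reduce to comparing embedding vectors. The following is a minimal sketch of cosine-similarity ranking in plain Python. The function names are illustrative, and the toy 3-dimensional vectors are stand-ins for real 1024-dimensional GTE-Large embeddings, not model output.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs, most similar first."""
    scored = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors (assumption): real embeddings would come from the model.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
ranking = rank_by_similarity(query, docs, top_k=2)
print(ranking)  # document 1 ranks first, then document 2
```

The same ranking function can serve as a simple reranker: embed the query and each candidate, then reorder the candidates by score.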

Use cases

6 Most Valuable Use Cases

Semantic Search
Information Retrieval
Semantic Reranking
Text Clustering
Text Similarity Scoring
General Embedding Tasks

Transparent pricing

Cost Comparison

In the comparison below, LLM.API lists the lowest price and latency and the highest throughput among providers of GTE-Large-class embedding models.

Provider	Region	Latency	Throughput	Uptime	Input ($/1M tokens)	Output ($/1M tokens)	Context
LLM.API	Global	80ms	~15K tokens/s	99.99%	$0.03	$0.03	512 tokens
Thenlper (Direct)	Global	~220ms	~5K tokens/s	~99.5%	~$0.10	~$0.10	512 tokens
OpenAI (text-embedding-3-large equivalent)	Global	~200ms	~10K tokens/s	99.9%	$0.13	$0.13	~8K tokens
AWS Bedrock (similar embedding model)	US East	~250ms	~8K tokens/s	99.9%	~$0.20	~$0.20	~8K tokens

Performance benchmarks

Technical Specifications

Metric	GTE-Large (Thenlper)	text-embedding-3-large (OpenAI)	bge-large-en-v1.5 (BAAI)
Dimensions	1024	3072	1024
Max Input Tokens	512	8K	512
Price per 1M Tokens	~$0.02	$0.13	~$0.01
Avg Latency	~120ms	~200ms	~150ms
Throughput	~1,500 tps	~800 tps	~1,200 tps
Uptime	~99.5%	99.9%	~99.0%
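
The Dimensions row also affects storage cost, which the table does not show. As a rough worked example, the sketch below estimates the raw size of a vector index. It assumes each value is stored as a 4-byte float32 and ignores any index overhead; the function name and the one-million-vector corpus are illustrative.

```python
def index_size_bytes(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> int:
    """Raw storage for dense vectors, assuming float32 (4 bytes) per value."""
    return num_vectors * dimensions * bytes_per_value


GIB = 1024 ** 3
corpus_size = 1_000_000  # hypothetical corpus of one million embedded chunks

size_1024 = index_size_bytes(corpus_size, 1024)  # GTE-Large, bge-large-en-v1.5
size_3072 = index_size_bytes(corpus_size, 3072)  # text-embedding-3-large

print(f"1024-d: {size_1024 / GIB:.2f} GiB")  # 3.81 GiB
print(f"3072-d: {size_3072 / GIB:.2f} GiB")  # 11.44 GiB
```

Under these assumptions a 1024-dimensional index needs one third of the raw storage of a 3072-dimensional one for the same number of vectors.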

30-day usage via LLM.API

1.9B: Prompt tokens processed (30 days)
11.4M: Embedding API requests (30 days)
420K: Unique applications using GTE-Large (30 days)
99.95%: Average API uptime (30 days)

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Unified AI Routing

Automatically route each request to the optimal provider and model based on latency, cost, or quality—without changing your integration or redeploying code.
One endpoint, every model
Cost-Aware Orchestration

Control spend with smart tiering, per-route budgets, and provider mix policies that automatically balance price versus performance across all your AI workloads.
Cut cost, keep quality
Resilient Fallbacks

Define multi-provider fallback chains so requests transparently fail over on errors, rate limits, or outages—keeping your AI features reliable in production.
Stay online, automatically
End-to-End Observability

Get unified logs, metrics, and traces across providers with request-level insights into tokens, latency, errors, and model behavior in one place.
See every token
Task-Level Abstractions

Call high-level tasks—chat, generation, embeddings, tools—instead of vendor-specific APIs, so you can swap models or providers without rewriting business logic.
Code to tasks, not vendors
High-Throughput Batch

Run large-scale batch jobs with automatic chunking, retries, and concurrency control to fully utilize provider limits while keeping throughput predictable.
Scale jobs, not code
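
To make the fallback idea above concrete, here is a minimal client-side sketch of a provider fallback chain. It illustrates the general pattern only and is not how LLM.API implements routing; the provider names, the `embed_with_fallback` helper, and the stub embedders are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# An embedder takes a batch of texts and returns one vector per text.
Embedder = Callable[[Sequence[str]], List[List[float]]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has raised an error."""


def embed_with_fallback(
    texts: Sequence[str],
    providers: Sequence[Tuple[str, Embedder]],
) -> Tuple[str, List[List[float]]]:
    """Try each provider in order; return the first successful result."""
    errors = []
    for name, embed in providers:
        try:
            return name, embed(texts)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stub providers (assumption): stand-ins for real network calls.
def flaky_primary(texts: Sequence[str]) -> List[List[float]]:
    raise TimeoutError("simulated outage")


def stable_secondary(texts: Sequence[str]) -> List[List[float]]:
    return [[float(len(t)), 0.0] for t in texts]


used, vectors = embed_with_fallback(
    ["hello", "hi"],
    [("primary", flaky_primary), ("secondary", stable_secondary)],
)
print(used, vectors)  # secondary [[5.0, 0.0], [2.0, 0.0]]
```

One caveat with any fallback chain for embeddings: vectors from different models are not comparable, so every provider in the chain should serve the same model.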

Decision guide

When to Use — When NOT to Use

Use it if...

You need a general-purpose text embedding model for semantic search or retrieval.
Your content is in English, the only language GTE-Large is trained for.
You need relatively lightweight embeddings that are cheaper than very large transformers.
Your use case involves clustering or deduplicating large text corpora by semantic similarity.
Your use case involves reranking search results using cosine similarity between query and documents.
You need an open-source, locally deployable embedding model without proprietary dependencies.
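
For the deduplication case in the list above, one simple approach is a greedy pass that drops any text whose embedding is too close to one already kept. The sketch below assumes embeddings are already computed; the 0.95 threshold and the toy 2-dimensional vectors are illustrative choices, not recommended values.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_near_duplicates(
    vectors: Sequence[Sequence[float]], threshold: float = 0.95
) -> List[int]:
    """Return indices of vectors to keep, scanning in order.

    A vector is kept only if its similarity to every previously kept
    vector is below the threshold. This is O(n^2) in the worst case.
    """
    kept: List[int] = []
    for i, vec in enumerate(vectors):
        if all(cosine(vec, vectors[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy vectors (assumption): items 0 and 1 point in almost the same direction.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept = drop_near_duplicates(vectors, threshold=0.95)
print(kept)  # [0, 2]
```

For large corpora the quadratic scan becomes expensive, and an approximate nearest-neighbor index is the usual replacement for the inner loop.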

Avoid if...

You need state-of-the-art retrieval quality matching the newest large proprietary embedding models.
Your workload requires embeddings for very long documents far beyond typical context limits.
You need cross-modal embeddings that jointly handle text and images in one space.
You need domain-specialized embeddings for code, biology, or legal texts out-of-the-box.
Your workload requires strict latency guarantees on edge devices with extremely limited compute.
You need embeddings that are continuously updated and versioned as a managed cloud service.

FAQ

Frequently Asked Questions

What is GTE-Large?

GTE-Large is a sentence embedding model by Thenlper optimized for high-quality text similarity, retrieval, and semantic search tasks.
What is GTE-Large best suited for?

GTE-Large is best for generating dense vector embeddings for search, clustering, recommendation, and RAG retrieval over large text corpora.
What modalities does GTE-Large support?

GTE-Large is a text-only model that accepts natural language input and outputs fixed-size vector embeddings.
How do I access GTE-Large through LLM.API?

You call the LLM.API embeddings endpoint with the GTE-Large model name, passing your text inputs and receiving embedding vectors in the response.
How does GTE-Large compare to similar embedding models?

GTE-Large typically offers strong semantic retrieval quality comparable to other large general-purpose embedding models, with competitive performance on common benchmark datasets.
What is the context window of GTE-Large?

The open-source GTE-Large checkpoint truncates input at 512 tokens, so it works best on short to medium-length texts; longer documents should be chunked before embedding.
How fast is GTE-Large and what latency should I expect via LLM.API?

Latency depends on LLM.API infrastructure and batch size, but GTE-Large is designed for practical real-time or near-real-time embedding workloads.
What does GTE-Large cost to use on LLM.API?

Pricing for GTE-Large is determined by LLM.API and is typically based on the number of tokens or characters embedded per request.
Does GTE-Large support batch embedding through LLM.API?

Yes, you can send multiple input texts in a single embeddings request to LLM.API to get batched GTE-Large embeddings.
What are the main limitations of GTE-Large?

GTE-Large cannot generate or understand images, may underperform on highly domain-specific jargon, and does not perform generative text completion.
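
The answers above on chunking long documents and on batch requests can be combined into one preprocessing step. The sketch below splits text into overlapping word windows and groups the chunks into batches. It uses whitespace-separated words as a rough proxy for tokens, which is an assumption: real token counts depend on the model's tokenizer, and the window, overlap, and batch sizes here are illustrative.

```python
from typing import Iterator, List


def chunk_words(text: str, max_words: int = 200, overlap: int = 20) -> List[str]:
    """Split text into windows of at most max_words words, overlapping by `overlap`."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


# A synthetic 450-word document (assumption) to show the windowing.
document = " ".join(f"w{i}" for i in range(450))
chunks = chunk_words(document, max_words=200, overlap=20)
batches = list(batched(chunks, batch_size=2))
print(len(chunks), [len(b) for b in batches])  # 3 [2, 1]
```

Each batch would then be sent as the input list of a single embeddings request.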

Start in 2 lines of code
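
As a rough illustration of what an embeddings call might look like, the sketch below builds a request and parses a response using only the Python standard library. Everything specific here is an assumption: the URL is a placeholder, the model identifier `gte-large` is a guess, and the request and response shapes follow the widely used OpenAI-style embeddings schema, which this page does not confirm. Check the LLM.API documentation for the actual endpoint, model name, and schema.

```python
import json
import urllib.request
from typing import Dict, List

# Hypothetical placeholders: replace with the values from the provider's docs.
API_URL = "https://api.example.com/v1/embeddings"
MODEL_NAME = "gte-large"


def build_request(texts: List[str], api_key: str) -> urllib.request.Request:
    """Build a POST request carrying the texts to embed (assumed schema)."""
    body = json.dumps({"model": MODEL_NAME, "input": texts}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(API_URL, data=body, headers=headers, method="POST")


def parse_embeddings(response_json: Dict) -> List[List[float]]:
    """Extract vectors in input order from an assumed OpenAI-style response."""
    items = sorted(response_json["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def embed(texts: List[str], api_key: str) -> List[List[float]]:
    """Send the request and return one vector per input text."""
    with urllib.request.urlopen(build_request(texts, api_key), timeout=30) as resp:
        return parse_embeddings(json.load(resp))


# Offline check with a hand-written response, so no network call is made here.
sample = {
    "data": [
        {"index": 1, "embedding": [0.3, 0.4]},
        {"index": 0, "embedding": [0.1, 0.2]},
    ]
}
print(parse_embeddings(sample))  # [[0.1, 0.2], [0.3, 0.4]]
```

With a real key and endpoint, `embed(["first text", "second text"], api_key)` would return one vector per input.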

Get My API Key
