Gemini 3.1 Flash Lite Preview

Instruction Following

Gemini 3.1 Flash Lite Preview is a lightweight, cost-efficient Google Gemini 3.1 series model optimized for high-throughput applications with long context and adjustable thinking levels.

Start Using API

API Performance

Latency: ~0.6s time to first token
Context: ~128K token context
Input: ~$0.03 per 1M tokens
Output: ~$0.06 per 1M tokens
Uptime: 99% 99%

About the model

What is Gemini 3.1 Flash Lite Preview?

Gemini 3.1 Flash Lite Preview is a preview version of Google’s Gemini 3.1 Flash-Lite large language model, designed to offer fast, inexpensive inference while supporting long-context and multimodal tasks. It is mainly used for large-scale, latency-sensitive workloads such as chatbots, agents, and real-time assistants that need to serve many requests at low cost. It is also used for applications like document and data processing, prompt-based research assistants, and other production AI services that benefit from its long context window and configurable “thinking” budget. It belongs to the Gemini 3.x Flash/Flash-Lite family and succeeds earlier preview models like Gemini 2.5 Flash Lite Preview.

Input / Output

Input

Text prompts (natural language, code, or structured text)
Documents via text content (e.g. extracted from PDFs, HTML, or other files)

Output

Natural-language responses and other free-form text
Code snippets and programming-related output
Structured JSON and tabular-style text suitable for charts or data pipelines

Model capabilities

5 Core Capabilities

Fast Text Chat

Handles general-purpose conversational queries and instruction-following with low latency, optimized for high-throughput interactive applications.
Multimodal Input

Accepts text, image, audio, video, and PDF inputs while producing text outputs, enabling unified reasoning across diverse content types.
Code Execution

Supports executing code via tools, enabling programmatic problem solving, validation of answers, and workflow automation within applications.
Data Extraction

Performs large-scale text extraction, summarization, and classification tasks efficiently, suitable for background processing and document workflows.
Text Translation

Provides fast, cost-efficient translation between multiple languages, designed for high-frequency, production-grade localization and communication workloads.

Use cases

6 Most Valuable Use Cases

High‑volume Translation
Content Moderation Pipelines
Large‑scale Data Extraction
Bulk Text Classification
Automated UI Generation
Always‑on AI Agents

Transparent pricing

Cost Comparison

Save up to ~70% vs major Gemini-compatible providers with consistently lower latency and higher throughput.

Provider	Region	Latency	Throughput	Uptime	Input ($/1M)	Output ($/1M)	Context
LLM API BEST	Global	120ms	120 tps	99.99%	$0.02	$0.04	1M tokens
Google	Global	~220ms	~40 tps	99.9%	~$0.06	~$0.12	~1M tokens
Vertex AI (Google Cloud)	US East	~260ms	~35 tps	99.9%	~$0.065	~$0.13	~1M tokens
Third-Party Aggregator A	Global	~250ms	~30 tps	99.9%	~$0.07	~$0.14	~512K tokens
Third-Party Aggregator B	EU West	~280ms	~25 tps	99.5%	~$0.075	~$0.15	~512K tokens

Performance benchmarks

Technical Specifications

Metric	Gemini 3.1 Flash Lite Preview	GPT-4.1 mini (OpenAI)	Claude 3 Haiku (Anthropic)
Avg Latency	~120ms	~150ms	~180ms
Context Window	128K	128K	200K
Input Price ($/1M)	$0.05	$0.15	$0.25
Output Price ($/1M)	$0.15	$0.60	$0.80
Max Output Tokens	4K	4K	4K
Throughput	~120 tps	~100 tps	~80 tps
Uptime	99.9%	99.9%	99.9%

30-day usage via LLM API

11.8B: Prompt tokens processed (30 days)
7.4B: Completion tokens generated (30 days)
19.6M: API requests served (30 days)
99.8%: Average uptime over last 30 days

Start Using API

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Unified AI Routing

Dynamically route each request to the optimal model across providers based on latency, cost, and quality—without changing your integration or redeploying code.
One endpoint, every model
Cost-Aware Orchestration

Automatically pick the most cost-effective model for each task, enforce budgets, and compare spend across providers from a single, unified billing layer.
Reduce AI spend fast
Resilient Fallbacks

Define per-request failover chains so outages or rate limits seamlessly roll to backup models, keeping your production workloads stable and always-on.
No single point of failure
Deep Observability

Get end-to-end traces, latency and error metrics, and payload-level logs for every provider in one place—plus hooks for alerts and custom dashboards.
See every token, everywhere
Task-Level Abstractions

Declare the job—chat, generation, tools, retrieval, structured outputs—and let LLM.API normalize APIs, schemas, and options across providers for you.
Think tasks, not vendors
High-Throughput Batch

Submit massive workloads as batches with automatic parallelization, retries, and provider-optimized chunking to drive down cost and maximize throughput.
Process millions, reliably

Decision guide

When to Use — When NOT to Use

Use it if...

You need a very low-cost model for high-volume, latency-tolerant requests.
Your use case involves simple chatbots, FAQs, or support flows with short prompts.
You need to generate or rewrite short texts like snippets, titles, and descriptions.
Your use case involves lightweight classification, tagging, or routing over many small inputs.
You need fast experimentation with prompt ideas before migrating to a larger Gemini model.
Your use case involves mobile or edge-style workloads where efficiency and speed dominate quality.

Avoid if...

You need the strongest reasoning quality Gemini offers across complex, multi-step problems.
Your workload requires high-fidelity coding assistance, debugging, or multi-file codebase understanding.
You need reliable performance on long-context tasks like large document synthesis or review.
Your workload requires state-of-the-art performance on nuanced safety-sensitive or regulated decisions.
You need top-tier multimodal understanding, complex image analysis, or precise visual reasoning.
Your workload requires highly consistent, premium-quality output for customer-facing production experiences.

FAQ

Frequently Asked Questions

What is Gemini 3.1 Flash Lite Preview?

Gemini 3.1 Flash Lite Preview is a lightweight, preview-version Gemini model from Google optimized for fast, low-cost generation via the LLM.API gateway.
What is Gemini 3.1 Flash Lite Preview best suited for?

It is best for high-volume, latency-sensitive tasks like chatbots, simple agents, and lightweight content generation where cost efficiency matters more than peak quality.
What context window does Gemini 3.1 Flash Lite Preview support on LLM.API?

Gemini 3.1 Flash Lite Preview supports up to 128K tokens of context via LLM.API, enabling long conversations and documents.
How fast is Gemini 3.1 Flash Lite Preview in terms of latency?

It is tuned for low latency, generally returning first tokens quickly and handling streaming responses efficiently for interactive applications.
Which input and output modalities does Gemini 3.1 Flash Lite Preview support?

Through LLM.API it supports text input and text output, with multimodal features depending on the specific LLM.API integration configuration.
How is Gemini 3.1 Flash Lite Preview priced on LLM.API?

Pricing is usage-based per input and output token, with rates set by LLM.API and typically lower than larger, higher-quality Gemini variants.
How do I call Gemini 3.1 Flash Lite Preview via LLM.API?

You select the model name "google/gemini-3.1-flash-lite-preview" in your LLM.API request and pass messages using the standard chat completions schema.
How does Gemini 3.1 Flash Lite Preview compare to Gemini 3.1 Flash?

Flash Lite is generally cheaper and faster but slightly lower in quality and capability than the full Gemini 3.1 Flash model.
What are the main limitations of Gemini 3.1 Flash Lite Preview?

It may underperform larger models on complex reasoning, nuanced coding tasks, and highly specialized domains, and is provided as a preview with evolving behavior.
Can I use Gemini 3.1 Flash Lite Preview for code generation?

Yes, it can generate and edit code, but for complex or critical programming tasks a more capable Gemini or other advanced model is recommended.

Start in 2 lines of code

Get My API Key

Gemini 3.1 Flash Lite Preview

What is Gemini 3.1 Flash Lite Preview?

5 Core Capabilities

Fast Text Chat

Multimodal Input

Code Execution

Data Extraction

Text Translation

6 Most Valuable Use Cases

Cost Comparison

Technical Specifications

Why Build on LLM.API?

Unified AI Routing

Cost-Aware Orchestration

Resilient Fallbacks

Deep Observability

Task-Level Abstractions

High-Throughput Batch

When to Use — When NOT to Use

Use it if...

Avoid if...

Start in 2 lines of code