Chirp 3

Speech-to-Text

Chirp 3 is Google's latest-generation multilingual speech and audio model, available through Google Cloud for high-accuracy transcription and natural-sounding text-to-speech. It is notable for its improved accuracy, speed, and support for advanced features like diarization, automatic language detection, and custom voices.

API Performance

Latency: ~0.7s time to first token
Context: ~32K token context
Input: ~$0.20 per 1M tokens
Output: ~$0.60 per 1M tokens
Uptime: 99%

About the model

What is Chirp 3?

Chirp 3 is a multilingual Automatic Speech Recognition and audio generation model from Google that powers Speech-to-Text and Text-to-Speech capabilities in Google Cloud. It is used for accurate real-time and batch audio transcription across many languages, including support for speaker diarization and language-agnostic transcription. It is also used to generate high-fidelity synthetic speech, including instant custom voice models built from high-quality recordings. Chirp 3 succeeds earlier Chirp models as part of Google’s Chirp family of speech and audio foundation models.

Input / Output

Input

Audio for speech recognition (files or streams)
Text for speech synthesis (text-to-speech)

Output

Transcribed text from audio (speech-to-text)
Generated audio speech from text (text-to-speech)

Model capabilities

5 Core Capabilities

Conversational AI

Engages in natural, multi-turn voice conversations, understanding user intent and context to provide relevant, coherent spoken responses.

Audio Transcription

Converts spoken language in audio input into accurate text, supporting real-time or near real-time voice transcription scenarios.

Speech Translation

Translates spoken language from one language to another, enabling cross-lingual voice conversations and real-time interpretation use cases.

Voice Monitoring

Processes and monitors audio streams for commands or triggers, enabling responsive voice-driven applications and interactive systems.

Audio-Linked Imagery

Can be integrated with image-capable systems to associate spoken descriptions with visual content, supporting multimodal user experiences.
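
Real-time transcription, as described under Audio Transcription, generally means sending audio in small pieces rather than as one file. The sketch below shows only the client-side chunking step for raw 16-bit mono PCM. The 16 kHz sample rate, the 100 ms chunk size, and the `pcm_chunks` name are assumptions for illustration; how the chunks are then sent depends on the streaming interface in use, which this page does not specify.

```python
from typing import Iterator


def pcm_chunks(
    pcm: bytes,
    sample_rate_hz: int = 16_000,   # assumed sample rate
    chunk_ms: int = 100,            # assumed chunk duration
    bytes_per_sample: int = 2,      # 16-bit mono PCM
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration slices of a raw PCM buffer."""
    chunk_size = sample_rate_hz * chunk_ms // 1000 * bytes_per_sample
    if chunk_size <= 0:
        raise ValueError("chunk size must be positive")
    for start in range(0, len(pcm), chunk_size):
        # The final slice may be shorter than chunk_size.
        yield pcm[start:start + chunk_size]


# One second of silence at 16 kHz, 16-bit mono.
one_second = bytes(16_000 * 2)
chunks = list(pcm_chunks(one_second))
print(len(chunks), len(chunks[0]))  # prints "10 3200"
```

Smaller chunks lower the delay before the first partial transcript can arrive, at the cost of more requests or frames per second of audio.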

Use cases

6 Most Valuable Use Cases

Real-time transcription
Call center analytics
Meeting note generation
Customer support voicebots
Audiobook narration
Custom brand voices

Transparent pricing

Cost Comparison

Among the providers compared below, LLM API lists the lowest per-minute price, the lowest latency, and the highest throughput for Chirp 3-class speech models.

Provider	Region	Latency	Throughput	Uptime	Input ($/min)	Output ($/min)	Context (max audio)
LLM API	Global	80ms	220 min/s	99.99%	$0.008	$0.008	~480 min
Google	Global	~150ms	~150 min/s	99.9%	~$0.012	~$0.012	~300 min
Azure	Global	~170ms	~120 min/s	99.9%	~$0.013	~$0.013	~240 min
Amazon Web Services	Global	~190ms	~100 min/s	99.9%	~$0.014	~$0.014	~240 min
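
To make the per-minute figures in the table concrete, the sketch below estimates a monthly bill for each listed provider. The rates are copied from the table above and are approximate. The 10,000-minute workload is an assumed figure chosen only for illustration.

```python
# Approximate per-minute rates copied from the comparison table (USD).
RATES_PER_MIN = {
    "LLM API": 0.008,
    "Google": 0.012,
    "Azure": 0.013,
    "Amazon Web Services": 0.014,
}


def monthly_cost(minutes: float, rate_per_min: float) -> float:
    """Return the estimated cost in USD, rounded to cents."""
    return round(minutes * rate_per_min, 2)


# Hypothetical workload: 10,000 minutes of audio per month.
MINUTES = 10_000
costs = {name: monthly_cost(MINUTES, rate) for name, rate in RATES_PER_MIN.items()}

for name, cost in costs.items():
    print(f"{name}: ${cost:.2f}")
```

At this assumed volume the listed rates work out to $80.00, $120.00, $130.00, and $140.00 respectively.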

Performance benchmarks

Technical Specifications

Metric	Chirp 3 (Google)	Whisper v3 (OpenAI)	NeMo ASR Large (NVIDIA)
Avg Latency	~250ms	~300ms	~350ms
Languages Supported	~100+	~100+	~30+
Price per Minute	~$0.006	~$0.006	~$0.005
Max Duration	~2 hours	~2 hours	~3 hours
Accuracy (WER)	~6%	~5%	~7%
Uptime	99.9%	99.9%	99.9%
Real-time Throughput	~60x RT	~50x RT	~40x RT
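
The Accuracy (WER) row reports word error rate: the number of word-level substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. A minimal sketch of that calculation follows, using a standard edit-distance table. The sample sentences are made up for illustration, and real evaluations usually normalize casing and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(round(wer, 3))  # prints 0.167
```

On this scale, the ~6% figure in the table corresponds to roughly six word errors per hundred reference words.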

30-day usage via LLM API

3.6B: Prompt tokens processed (last 30 days)
11.4M: Completion tokens generated (last 30 days)
2.1M: API requests served (last 30 days)
99.8%: Avg uptime (last 30 days)

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Unified AI Routing

Automatically route each request to the optimal model across providers based on latency, cost, and quality. One endpoint, dynamic policies, no SDK sprawl.
One endpoint, any model

Cost-Aware Orchestration

Control spend with price-aware routing, per-project limits, and transparent metering across vendors. Swap models without rewiring billing or touching client code.
Cut cost, keep quality

Resilient Fallback Flows

Design multi-provider fallback trees that auto-retry on failures, timeouts, or quota limits. Keep production workloads online even when a vendor has issues.
Never ship single-vendor SPOF

Deep LLM Observability

Get unified traces, logs, and metrics for every request across providers. Inspect prompts, latencies, and errors in one place to debug faster and tune confidently.
Single pane of AI truth

Task-Level Abstractions

Describe tasks like “chat”, “embed”, or “moderate” instead of binding to model names. LLM.API maps tasks to the best capabilities behind a stable interface.
Code to tasks, not models

High-Throughput Batch APIs

Ship bulk workloads with streaming-safe, rate-aware batching. Push thousands of prompts per job while LLM.API handles chunking, retries, and provider limits.
Batch at production scale
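
The behaviour described under Resilient Fallback Flows can be pictured as an ordered list of providers tried in turn, with a bounded number of retries each. The sketch below is not LLM.API's implementation or client interface. `transcribe_with_fallback`, `ProviderError`, and the two stand-in provider functions are hypothetical names used only to illustrate the pattern.

```python
from typing import Callable, Sequence, Tuple


class ProviderError(Exception):
    """Hypothetical error standing in for a failure, timeout, or quota limit."""


Provider = Callable[[bytes], str]


def transcribe_with_fallback(
    audio: bytes,
    providers: Sequence[Tuple[str, Provider]],
    retries_per_provider: int = 1,
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, transcript)."""
    errors = []
    for name, call in providers:
        for attempt in range(1 + retries_per_provider):
            try:
                return name, call(audio)
            except ProviderError as exc:
                errors.append(f"{name} (attempt {attempt + 1}): {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stand-in providers for the demonstration; no network calls are made.
def flaky_primary(audio: bytes) -> str:
    raise ProviderError("quota exceeded")


def stable_secondary(audio: bytes) -> str:
    return f"transcript of {len(audio)} bytes"


used, text = transcribe_with_fallback(
    bytes(16),
    [("primary", flaky_primary), ("secondary", stable_secondary)],
)
print(used, "->", text)  # prints "secondary -> transcript of 16 bytes"
```

Keeping the provider list as data rather than nested try/except blocks makes the fallback order easy to change without touching the calling code.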

Decision guide

When to Use — When NOT to Use

Use it if...

You need high-accuracy, large-scale transcription of calls, meetings, videos, or user-generated audio.
Your use case involves many languages and accents, or audio whose language is not known in advance.
You need tight integration with Google ecosystems, tooling, or existing Google Cloud infrastructure.
Your use case involves multi-speaker recordings where speaker diarization matters.
You need real-time or near real-time transcripts for voice-driven applications.
Your use case involves call center analytics, meeting note generation, or customer support voicebots.

Avoid if...

You need text-only reasoning, planning, or tool use; Chirp 3 is a speech model, not a general-purpose LLM.
Your workload requires generating or understanding images.
Your audio is extremely noisy or in rare languages, where recognition quality may degrade.
Your workload depends on accurate handling of domain-specific jargon, as in law, medicine, or quantitative finance.
You need fine-grained control over safety, customization, or model behavior beyond standard configuration options.
Your workload requires strict reproducibility and deterministic outputs for compliance-critical pipelines.

FAQ

Frequently Asked Questions

What is Chirp 3?

Chirp 3 is a Google speech model focused on automatic speech recognition with strong multilingual performance and robustness to noisy, real-world audio.

What is Chirp 3 best suited for?

Chirp 3 is best for high-accuracy, large-scale transcription of calls, meetings, videos, and user-generated audio across many languages and accents.

What modalities does Chirp 3 support through LLM.API?

Through LLM.API, Chirp 3 supports audio input and text output for speech-to-text workloads, without image or text-generation capabilities.

How is Chirp 3 priced on LLM.API?

Chirp 3 is typically billed per processed audio minute or second via LLM.API; check your LLM.API pricing page for exact current rates.

What is the maximum audio or context length Chirp 3 can handle?

Chirp 3 supports long-form audio transcription, but the maximum duration and effective context depend on LLM.API limits and on whether streaming or batch mode is configured.

How fast is Chirp 3 in terms of latency?

Chirp 3 generally operates near real time for short clips, with latency mainly determined by audio length, LLM.API region, and network conditions.

How do I call Chirp 3 via the LLM.API?

You select the Google Chirp 3 model in your LLM.API request, provide audio bytes or a URL, and receive transcribed text in the response.

How does Chirp 3 compare to general LLMs for transcription tasks?

Compared with general text LLMs, Chirp 3 is specialized: it is usually cheaper and more accurate for speech recognition, but it cannot perform text-only reasoning.

Does Chirp 3 support streaming transcription on LLM.API?

If enabled by LLM.API, Chirp 3 can consume audio chunks incrementally and return partial transcripts for low-latency streaming experiences.

What are the main limitations of Chirp 3?

Through LLM.API, Chirp 3 is limited to speech recognition. It may struggle with extremely noisy audio, rare languages, and domain-specific jargon, and it does not generate or understand images.
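
The FAQ answer on calling Chirp 3 describes three steps: select the model, supply audio bytes or a URL, and read the transcript from the response. The sketch below only assembles such a request body and makes no network call. This page does not document LLM.API's actual endpoint, field names, or model identifier, so `build_transcription_request`, the `model`, `audio_url`, and `audio_base64` fields, and the `google/chirp-3` string are all assumptions to be replaced with the values from the LLM.API reference.

```python
import base64
import json
from typing import Optional


def build_transcription_request(
    model: str,
    audio_bytes: Optional[bytes] = None,
    audio_url: Optional[str] = None,
) -> str:
    """Return a JSON request body with the model and exactly one audio source.

    All field names here are hypothetical placeholders.
    """
    if (audio_bytes is None) == (audio_url is None):
        raise ValueError("provide exactly one of audio_bytes or audio_url")

    body = {"model": model}
    if audio_url is not None:
        body["audio_url"] = audio_url
    else:
        # Raw bytes are base64-encoded so they can travel inside JSON.
        body["audio_base64"] = base64.b64encode(audio_bytes).decode("ascii")
    return json.dumps(body)


# "google/chirp-3" is an assumed identifier, not a documented one.
payload = build_transcription_request(
    "google/chirp-3", audio_url="https://example.com/call.wav"
)
print(payload)
```

Building the body in one function keeps the bytes-versus-URL choice in a single place, so the rest of the client does not need to know which form was used.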
