Powered by Google

Chirp 3

  • Speech-to-Text

Chirp 3 is Google's latest-generation multilingual speech and audio model, available through Google Cloud for high-accuracy transcription and natural-sounding text-to-speech. It is notable for its improved accuracy, speed, and support for advanced features like diarization, automatic language detection, and custom voices.

What is Chirp 3?

Chirp 3 is a multilingual Automatic Speech Recognition and audio generation model from Google that powers Speech-to-Text and Text-to-Speech capabilities in Google Cloud. It is used for accurate real-time and batch audio transcription across many languages, including support for speaker diarization and language-agnostic transcription. It is also used to generate high-fidelity synthetic speech, including instant custom voice models built from high-quality recordings. Chirp 3 succeeds earlier Chirp models as part of Google’s Chirp family of speech and audio foundation models.

5 Core Capabilities

  • Conversational AI

    Supplies the speech recognition layer for natural, multi-turn voice conversations; a paired dialogue system handles user intent, context, and the spoken response.

  • Audio Transcription

    Converts spoken language in audio input into accurate text, supporting real-time or near real-time voice transcription scenarios.

  • Speech Translation

    Translates spoken language from one language to another, enabling cross-lingual voice conversations and real-time interpretation use cases.

  • Voice Monitoring

    Processes and monitors audio streams for commands or triggers, enabling responsive voice-driven applications and interactive systems.

  • Audio-Linked Imagery

    Can be integrated with image-capable systems to associate spoken descriptions with visual content, supporting multimodal user experiences.

6 Most Valuable Use Cases

  • Real-time transcription
  • Call center analytics
  • Meeting note generation
  • Customer support voicebots
  • Audiobook narration
  • Custom brand voices

Cost Comparison

In the comparison below, LLM.API lists the lowest per-minute price, the lowest latency, and the highest throughput for Chirp 3-class speech models. Values marked with ~ are approximate.

Provider             Region  Latency  Throughput  Uptime  Input ($/min)  Output ($/min)  Context (audio)
LLM.API (best)       Global  80ms     220 min/s   99.99%  $0.008         $0.008          ~480 min
Google               Global  ~150ms   ~150 min/s  99.9%   ~$0.012        ~$0.012         ~300 min
Azure                Global  ~170ms   ~120 min/s  99.9%   ~$0.013        ~$0.013         ~240 min
Amazon Web Services  Global  ~190ms   ~100 min/s  99.9%   ~$0.014        ~$0.014         ~240 min
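
To see what the per-minute rates in the table mean for a real workload, here is a minimal sketch that multiplies audio minutes by each provider's rate. The rates are copied from the table (approximate where the table marks them with ~); the function names and the 10,000-minute workload are illustrative assumptions, not part of any provider's tooling.

```python
# Per-minute rates in USD, copied from the comparison table.
# Values the table marks with "~" are approximate.
RATES_PER_MIN = {
    "LLM.API": 0.008,
    "Google": 0.012,
    "Azure": 0.013,
    "Amazon Web Services": 0.014,
}


def estimate_cost(minutes: float, rate_per_min: float) -> float:
    """Return the estimated cost in USD, rounded to cents."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return round(minutes * rate_per_min, 2)


def compare_providers(minutes: float) -> dict[str, float]:
    """Estimate the cost of the same workload at every listed rate."""
    return {name: estimate_cost(minutes, rate) for name, rate in RATES_PER_MIN.items()}


# Hypothetical workload: 10,000 minutes of audio per month.
for name, cost in compare_providers(10_000).items():
    print(f"{name}: ${cost:,.2f}")
```

Actual bills depend on each provider's current pricing and rounding rules, so treat the result as a rough estimate.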

Technical Specifications

Metric                 Chirp 3 (Google)  Whisper v3 (OpenAI)  NeMo ASR Large (NVIDIA)
Avg Latency            ~250ms            ~300ms               ~350ms
Languages Supported    ~100+             ~100+                ~30+
Price per Minute       ~$0.006           ~$0.006              ~$0.005
Max Duration           ~2 hours          ~2 hours             ~3 hours
Word Error Rate (WER)  ~6%               ~5%                  ~7%
Uptime                 99.9%             99.9%                99.9%
Real-time Throughput   ~60x RT           ~50x RT              ~40x RT
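
The WER row reports word error rate, where lower is better. As a rough illustration of how such a figure is usually computed, the sketch below uses the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The lowercasing and whitespace tokenization are simplifying assumptions; the table does not say how its figures were measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("the quick brown fox jumps", "the quick brown box jumps"))  # 0.2
```

Production evaluations typically also normalize punctuation and numerals before scoring, which can change the result noticeably.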

30-day usage via LLM.API

  • 3.6B prompt tokens processed
  • 11.4M completion tokens generated
  • 2.1M API requests served
  • 99.8% average uptime

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

  • Unified AI Routing

    Automatically route each request to the optimal model across providers based on latency, cost, and quality. One endpoint, dynamic policies, no SDK sprawl.

    One endpoint, any model
  • Cost-Aware Orchestration

    Control spend with price-aware routing, per-project limits, and transparent metering across vendors. Swap models without rewiring billing or touching client code.

    Cut cost, keep quality
  • Resilient Fallback Flows

    Design multi-provider fallback trees that auto-retry on failures, timeouts, or quota limits. Keep production workloads online even when a vendor has issues.

    Never ship a single-vendor SPOF
  • Deep LLM Observability

    Get unified traces, logs, and metrics for every request across providers. Inspect prompts, latencies, and errors in one place to debug faster and tune confidently.

    Single pane of AI truth
  • Task-Level Abstractions

    Describe tasks like “chat”, “embed”, or “moderate” instead of binding to model names. LLM.API maps tasks to the best capabilities behind a stable interface.

    Code to tasks, not models
  • High-Throughput Batch APIs

    Ship bulk workloads with streaming-safe, rate-aware batching. Push thousands of prompts per job while LLM.API handles chunking, retries, and provider limits.

    Batch at production scale
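
The fallback behaviour described above (retry on failure, then move to the next provider) can be sketched in a few lines. This is not LLM.API's implementation or client interface: the provider callables, the `transcribe_with_fallback` name, and the retry and backoff parameters are all hypothetical, chosen only to show the control flow.

```python
import time
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every provider in the chain has exhausted its retries."""


def transcribe_with_fallback(
    audio: bytes,
    providers: Sequence[tuple[str, Callable[[bytes], str]]],
    retries_per_provider: int = 2,
    backoff_seconds: float = 0.0,
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, transcript)."""
    errors = []
    for name, call in providers:
        for attempt in range(1, retries_per_provider + 1):
            try:
                return name, call(audio)
            except Exception as exc:  # sketch only: real code should catch narrower errors
                errors.append(f"{name} attempt {attempt}: {exc}")
                if backoff_seconds and attempt < retries_per_provider:
                    time.sleep(backoff_seconds * attempt)  # linear backoff
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real provider clients.
def flaky_primary(audio: bytes) -> str:
    raise TimeoutError("primary timed out")


def steady_backup(audio: bytes) -> str:
    return "hello world"


provider, text = transcribe_with_fallback(
    b"\x00\x01", [("primary", flaky_primary), ("backup", steady_backup)]
)
print(provider, text)  # backup hello world
```

Keeping the provider list as plain data makes the order easy to change per project, which is the same idea as a configurable fallback tree.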

When to Use — When NOT to Use

Use it if...

  • You need high-accuracy transcription of calls, meetings, videos, or user-generated audio across many languages and accents.
  • Your use case involves real-time or batch speech-to-text, such as live captions, call center analytics, or meeting notes.
  • You need tight integration with Google ecosystems, tooling, or existing Google Cloud infrastructure.
  • Your use case involves multi-speaker or mixed-language audio where speaker diarization or automatic language detection matters.
  • You need a specialized speech model that is usually cheaper and more accurate for transcription than a general LLM.
  • Your use case involves prototyping voice features before committing to a more specialized pipeline.

Avoid if...

  • You need text-only reasoning, planning, or tool use; Chirp 3 is a speech model, not a general LLM.
  • Your workload requires image generation or image understanding.
  • You need guaranteed accuracy on rare languages or heavy domain-specific jargon, such as law, medicine, or quantitative finance, without validating on your own audio.
  • Your workload consists mostly of extremely noisy audio, where recognition quality may degrade.
  • You need fine-grained control over safety, customization, or model behavior beyond standard configuration options.
  • Your workload requires strict reproducibility and deterministic outputs for compliance-critical pipelines.

Frequently Asked Questions

  • What is Chirp 3?

    Chirp 3 is a Google speech model focused on automatic speech recognition with strong multilingual performance and robustness to noisy, real‑world audio.

  • What is Chirp 3 best suited for?

    Chirp 3 is best for high‑accuracy, large‑scale transcription of calls, meetings, videos, and user‑generated audio across many languages and accents.

  • What modalities does Chirp 3 support through LLM.API?

    Through LLM.API, Chirp 3 supports audio input and text output for speech‑to‑text workloads, without image or text‑generation capabilities.

  • How is Chirp 3 priced on LLM.API?

    Chirp 3 is typically billed per processed audio minute or second via LLM.API; check your LLM.API pricing page for exact current rates.

  • What is the maximum audio or context length Chirp 3 can handle?

    Chirp 3 supports long‑form audio transcription, but maximum duration and effective context depend on LLM.API limits and configuration for streaming or batch mode.

  • How fast is Chirp 3 in terms of latency?

    Chirp 3 generally operates near real time for short clips, with latency mainly determined by audio length and LLM.API region and network conditions.

  • How do I call Chirp 3 via the LLM.API?

    You select the Google Chirp 3 model in your LLM.API request, provide audio bytes or a URL, and receive transcribed text in the response.

  • How does Chirp 3 compare to general LLMs for transcription tasks?

    Compared with general text LLMs, Chirp 3 is specialized: it is usually cheaper and more accurate for speech recognition, but it cannot perform text-only reasoning.

  • Does Chirp 3 support streaming transcription on LLM.API?

    If enabled by LLM.API, Chirp 3 can consume audio chunks incrementally and return partial transcripts for low‑latency streaming experiences.

  • What are the main limitations of Chirp 3?

    Chirp 3 is limited to speech recognition. It may struggle with extremely noisy audio, rare languages, and domain-specific jargon, and it does not generate or understand images.
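
For the streaming answer above, the client-side half of "consume audio chunks incrementally" is simply slicing raw audio into short, fixed-duration pieces. The sketch below shows that step only. The audio format (16 kHz, 16-bit mono PCM), the 100 ms chunk size, and the `chunk_pcm` name are assumptions for illustration; how chunks are actually sent depends on the LLM.API streaming interface, which this page does not describe.

```python
from typing import Iterator


def chunk_pcm(
    audio: bytes,
    sample_rate_hz: int = 16_000,  # assumed sample rate
    bytes_per_sample: int = 2,     # assumed 16-bit mono PCM
    chunk_ms: int = 100,           # assumed chunk duration
) -> Iterator[bytes]:
    """Yield fixed-duration slices of raw PCM audio, aligned to whole samples."""
    samples_per_chunk = sample_rate_hz * chunk_ms // 1000
    bytes_per_chunk = samples_per_chunk * bytes_per_sample
    if bytes_per_chunk <= 0:
        raise ValueError("chunk size must be at least one sample")
    for start in range(0, len(audio), bytes_per_chunk):
        yield audio[start:start + bytes_per_chunk]


# One second of silence at the assumed format is 32,000 bytes.
one_second = bytes(32_000)
chunks = list(chunk_pcm(one_second))
print(len(chunks), len(chunks[0]))  # 10 3200
```

Smaller chunks lower the delay before the first partial transcript but increase the number of requests or messages sent.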
