What is Whisper 1?

Whisper 1 is OpenAI’s automatic speech recognition (ASR) model for transcribing and translating audio into text.

What modalities does Whisper 1 support via LLM.API?

Whisper 1 supports audio input and returns text output for transcription and translation tasks through LLM.API.

What is Whisper 1 best suited for?

Whisper 1 is best for accurate speech-to-text transcription, multilingual audio transcription, and speech translation to English.

How is Whisper 1 priced when used through LLM.API?

Whisper 1 is typically billed per minute of processed audio; consult LLM.API’s pricing page for exact current rates.
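
As a rough illustration of per-minute billing, the sketch below estimates the cost of one file from its duration. The default rate of $0.006 per minute is the figure listed on this page; whether partial minutes are rounded up is an assumption, exposed here as a flag, and the pricing page remains the authority.

```python
import math


def estimate_cost(duration_seconds: float,
                  rate_per_minute: float = 0.006,
                  round_up_to_minute: bool = False) -> float:
    """Estimate the transcription cost in USD for one audio file.

    round_up_to_minute models a provider that bills whole minutes;
    that billing rule is an assumption, not a documented behavior.
    """
    if duration_seconds < 0:
        raise ValueError("duration_seconds must be non-negative")
    minutes = duration_seconds / 60.0
    if round_up_to_minute:
        minutes = math.ceil(minutes)
    return round(minutes * rate_per_minute, 6)
```

For example, `estimate_cost(600)` prices a ten-minute recording at the listed rate, and passing `round_up_to_minute=True` shows how whole-minute billing would change the total for short clips.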

What is the maximum audio length or context Whisper 1 can handle per request?

Whisper 1 generally supports long-form audio, but maximum duration may be capped by LLM.API request size and timeout limits.
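
When a recording exceeds whatever duration cap applies, one common workaround is to split it into overlapping chunks and transcribe each chunk separately. The sketch below only plans the chunk boundaries; the cap and the overlap are caller-supplied values, not limits documented on this page.

```python
def plan_chunks(total_seconds: float,
                max_chunk_seconds: float,
                overlap_seconds: float = 0.0) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole recording.

    Consecutive chunks share overlap_seconds of audio so that words cut
    at a boundary appear in full in at least one chunk.
    """
    if max_chunk_seconds <= 0:
        raise ValueError("max_chunk_seconds must be positive")
    if not 0 <= overlap_seconds < max_chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, max_chunk_seconds)")

    chunks = []
    start = 0.0
    step = max_chunk_seconds - overlap_seconds
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break
        start += step
    return chunks
```

The overlapping text still has to be deduplicated when the chunk transcripts are merged; that step is outside this sketch.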

How fast is Whisper 1 in terms of latency?

Whisper 1 usually processes audio close to or faster than real time, but actual latency depends on audio length and LLM.API infrastructure.
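
"Faster than real time" can be made concrete with the real-time factor (RTF): processing time divided by audio duration, where a value below 1.0 means the audio was processed faster than it plays. The helper below is a minimal sketch; the timings passed to it would come from your own measurements.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (RTF)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    if processing_seconds < 0:
        raise ValueError("processing_seconds must be non-negative")
    return processing_seconds / audio_seconds
```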

How do I call Whisper 1 through LLM.API?

You select the Whisper 1 model identifier in your LLM.API request and send audio data in the supported format and encoding.
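
The sketch below shows one way such a request might be assembled with the Python standard library. It assumes LLM.API accepts an OpenAI-style multipart upload; the base URL, the `/audio/transcriptions` path, the `whisper-1` identifier, and the form field names are all placeholders that should be checked against the LLM.API documentation.

```python
import urllib.request
import uuid

# Hypothetical placeholder; replace with the base URL from the LLM.API docs.
BASE_URL = "https://api.llmapi.example/v1"


def build_transcription_request(audio_bytes: bytes,
                                filename: str,
                                api_key: str,
                                model: str = "whisper-1",
                                base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data transcription request."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"--{boundary}--\r\n".encode()
    body = model_part + file_header + audio_bytes + b"\r\n" + closing

    return urllib.request.Request(
        f"{base_url}/audio/transcriptions",  # assumed route
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

Sending the request would then be a call to `urllib.request.urlopen(request)`; the shape of the response is not specified on this page, so parse it according to the provider's reference.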

How does Whisper 1 compare to larger text LLMs for transcription tasks?

Whisper 1 is generally more accurate, robust, and cost-efficient for transcription than using general-purpose text-only LLMs with external audio preprocessing.

Does Whisper 1 support multiple languages?

Yes, Whisper 1 supports many languages for transcription and can translate non-English speech into English text.

What formats and sample rates are supported for Whisper 1 audio input?

Whisper 1 typically supports common formats like MP3, MP4, WAV, and FLAC with standard speech sample rates such as 16 kHz.
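
A quick local check before upload can catch unsupported files early. The sketch below uses the formats named in this answer as an assumed allow-list, reads the sample rate of a WAV file with Python's `wave` module, and includes a helper that writes a silent 16 kHz mono WAV for testing.

```python
import io
import wave
from pathlib import Path

# Assumed allow-list, taken from the formats named in the answer above.
ASSUMED_EXTENSIONS = {".mp3", ".mp4", ".wav", ".flac"}


def has_supported_extension(filename: str) -> bool:
    """Check the file extension against the assumed allow-list."""
    return Path(filename).suffix.lower() in ASSUMED_EXTENSIONS


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Return the sample rate in Hz stored in a WAV file's header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def make_silent_wav(seconds: float, sample_rate: int = 16_000) -> bytes:
    """Create a mono, 16-bit PCM WAV containing only silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(seconds * sample_rate))
    return buffer.getvalue()
```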

Can Whisper 1 perform real-time streaming transcription via LLM.API?

Real-time streaming support depends on LLM.API features; if streaming endpoints are provided, they can expose Whisper 1 for low-latency use.

What are some limitations of Whisper 1?

Whisper 1 may struggle with heavy background noise, strong accents, overlapping speakers, domain-specific jargon, and very low-quality recordings.

Whisper 1

Speech-to-Text

Whisper 1 is OpenAI’s hosted automatic speech recognition model based on the open-source Whisper family, designed for high-quality transcription and translation of audio. It is notable for robust multilingual speech-to-text performance and language identification across diverse audio conditions.

API Performance

Latency: ~1.5 s average transcription time for short audio
Context: 60 min maximum audio duration per request
Input: $0.006 per audio minute
Output: $0.006 per audio minute
Uptime: 99%

About the model

What is Whisper 1?

Whisper 1 is an OpenAI speech recognition model served via API for converting spoken audio into text. It is mainly used for automatic transcription of recordings such as meetings, podcasts, or voice notes, and for generating captions or searchable text from spoken content. It is also widely used to translate non‑English speech into English transcripts and to detect the spoken language in audio. Whisper 1 belongs to the Whisper model family and is based on the large-v2 variant of OpenAI’s open-source Whisper models.

Input / Output

Input

Audio files (various formats such as MP3, MP4, MPEG, MPGA, M4A, WAV, WEBM)

Output

Transcribed text

Model capabilities

5 Core Capabilities

Speech Recognition

Converts spoken audio into accurate text transcriptions across many languages, handling varied accents, recording conditions, and speaking styles.

Multilingual Transcription

Transcribes speech in multiple supported languages, preserving original language content while coping with diverse pronunciations and vocabularies.

Speech Translation

Translates non-English speech into written English text, enabling cross-lingual understanding and communication.

Audio OCR

Extracts spoken content from audio or video files as text, much as OCR extracts text from images.

Audio Captioning

Produces text for the spoken segments in audio, supporting captioning and subtitling workflows for media content.
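
For captioning and subtitling workflows, transcribed segments are often written out in the SubRip (SRT) format. The sketch below assumes each segment is a `(start_seconds, end_seconds, text)` tuple; that shape is hypothetical, and the actual structure of a timestamped response should be taken from the API reference.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Render (start, end, text) tuples as numbered SRT cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(cues)
```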

Use cases

6 Most Valuable Use Cases

Multilingual Speech Transcription
Meeting and Lecture Captions
Call Center Conversation Logging
Podcast and Video Subtitles
Voice-Controlled App Interfaces
Audio Data Preprocessing

Transparent pricing

Cost Comparison

Among the providers listed below, LLM.API has the lowest listed per-minute price and latency for Whisper-class transcription.

Provider	Region	Latency	Throughput (audio min/s)	Uptime	Input ($/min)	Output ($/min)	Max audio per request
LLM.API	Global	~180 ms	~120	99.99%	~$0.003	$0.00	~4 hours
OpenAI	Global	~250 ms	~60	99.9%	$0.006	$0.006	30 min
Azure OpenAI	US East	~450 ms	~45	99.9%	~$0.0065	~$0.0065	30 min
Google Cloud Speech-to-Text	Global	~500 ms	~40	99.9%	~$0.009	~$0.009	30 min
Amazon Transcribe	US East	~550 ms	~35	99.9%	~$0.008	~$0.008	30 min

Performance benchmarks

Technical Specifications

Metric	Whisper 1 (OpenAI)	Google Speech-to-Text v2	Amazon Transcribe
Avg Latency	~300ms	~350ms	~400ms
Languages Supported	~99	~73	~79
Price per Minute	$0.006	$0.012	$0.015
Max Duration per Request	60 min	480 min	240 min
Accuracy (WER)	~7%	~8%	~9%
Uptime	99.9%	99.9%	99.9%
Streaming Support	No	Yes	Yes
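
The accuracy row reports word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below computes it with a standard edit-distance recurrence; the lowercase, whitespace-split normalization is a simplifying assumption, and published benchmarks typically apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (substitutions + deletions + insertions) / reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.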

30-day usage via LLM.API

310M: Audio minutes transcribed (30 days)
22.5M: API requests processed (30 days)
2.1M: Unique apps and services using Whisper 1
99.9%: Average API uptime over last 30 days

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Unified AI Routing

Automatically direct each request to the optimal model across providers using latency, cost, and quality signals, so you ship faster without hardcoding vendor logic.
One endpoint, smart routing.

Cost-Aware Execution

Control spend with per-route budgets, price-aware model selection, and real-time usage insights, so you can scale traffic without surprise bills or manual tuning.
Optimize cost, not code.

Resilient Fallbacks

Define automatic failover chains across models and providers, so outages, rate limits, or degraded quality don’t take your features offline.
Stay online under stress.
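
To illustrate the idea of a failover chain, the sketch below tries a list of transcription callables in order and returns the first successful result. It is a client-side illustration only, not LLM.API's routing implementation; the provider names and callables are hypothetical stand-ins.

```python
from typing import Callable, Sequence


def transcribe_with_fallback(
    audio: bytes,
    providers: Sequence[tuple[str, Callable[[bytes], str]]],
) -> tuple[str, str]:
    """Try each (name, transcribe) pair in order; return (name, transcript).

    Raises RuntimeError listing every failure if all providers fail.
    """
    errors = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the transcript makes it easy to log which link in the chain actually served the request.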

End-to-End Observability

Inspect every request with traces, metrics, and structured logs across providers, making it easy to debug prompts, compare models, and tune performance in production.
See every token flow.

Task-Level Abstractions

Describe tasks like chat, RAG, or extraction once, then swap models or providers without rewriting business logic, keeping your app code clean and future-proof.
Code to tasks, not models.

High-Throughput Batch

Process large workloads with parallelized, provider-agnostic batching and automatic retries, reducing latency and unit cost for bulk jobs and backfills.
Batch at production scale.
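
Parallel batching with automatic retries can be approximated on the client side with a thread pool. The sketch below is a minimal illustration under stated assumptions: `transcribe` is any caller-supplied function that takes a file path and returns text, and the retry count and linear backoff are arbitrary defaults rather than LLM.API settings.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def transcribe_batch(paths: list[str],
                     transcribe: Callable[[str], str],
                     max_workers: int = 4,
                     max_attempts: int = 3,
                     backoff_seconds: float = 0.5) -> dict[str, Optional[str]]:
    """Transcribe many files in parallel; map each path to text or None."""

    def attempt_with_retries(path: str) -> Optional[str]:
        for attempt in range(1, max_attempts + 1):
            try:
                return transcribe(path)
            except Exception:
                if attempt == max_attempts:
                    return None  # give up on this file, keep the batch going
                time.sleep(backoff_seconds * attempt)  # linear backoff
        return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(attempt_with_retries, paths))
    return dict(zip(paths, results))
```

Files that still fail after the final attempt map to `None`, so a follow-up job can reprocess only those paths.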

Decision guide

When to Use — When NOT to Use

Use it if...

You need accurate speech-to-text transcription for single-speaker English audio recordings.
You need robust transcription for noisy environments, accents, or imperfect consumer-grade microphones.
Your use case involves transcribing podcasts, interviews, or lectures for searchable text archives.
Your use case involves automatically generating subtitles or captions for pre-recorded videos.
You need to convert voice notes or meetings into text for downstream NLP processing.
You need a general-purpose ASR model without training or fine-tuning your own system.
Your use case involves batch-processing many audio files asynchronously without strict real-time constraints.

Avoid if...

You need real-time, low-latency streaming transcription for live calls or broadcasts.
Your workload requires on-device or fully offline speech recognition without cloud dependencies.
You need highly domain-specific ASR tuned to medical, legal, or niche technical jargon.
You need end-to-end spoken language understanding and dialog, not just transcription output.
Your workload requires strict, verifiable data residency on self-hosted infrastructure only.
You need detailed speaker diarization across many speakers.
Your workload requires direct speech-to-speech translation instead of speech-to-text capabilities.
