Powered by OpenAI
Whisper 1
- Speech-to-Text
Whisper 1 is OpenAI’s hosted automatic speech recognition model based on the open-source Whisper family, designed for high-quality transcription and translation of audio. It is notable for robust multilingual speech-to-text performance and language identification across diverse audio conditions.
About the model
What is Whisper 1?
Whisper 1 is an OpenAI speech recognition model served via API for converting spoken audio into text. It is mainly used for automatic transcription of recordings such as meetings, podcasts, or voice notes, and for generating captions or searchable text from spoken content. It is also widely used to translate non‑English speech into English transcripts and to detect the spoken language in audio. Whisper 1 belongs to the Whisper model family and is based on the large-v2 variant of OpenAI’s open-source Whisper models.
Model capabilities
5 Core Capabilities
-
Speech Recognition
Converts spoken audio into accurate text transcriptions across many languages, handling varied accents, recording conditions, and speaking styles.
-
Multilingual Transcription
Transcribes speech in multiple supported languages, preserving original language content while coping with diverse pronunciations and vocabularies.
-
Speech Translation
Translates non-English speech in audio into written English text, enabling cross-lingual understanding and communication.
-
Audio OCR
Extracts spoken content from audio or video files as text, the audio counterpart of OCR for documents.
-
Audio Captioning
Produces timestamped text for spoken segments in audio, supporting captioning and subtitling workflows for media content.
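
For captioning workflows, timestamped segments usually have to be rendered in a subtitle format such as SRT. The sketch below assumes the transcription response exposes segments with `start` and `end` (in seconds) and `text` fields, as OpenAI's `verbose_json` response format does. The helper names `format_timestamp` and `segments_to_srt` and the sample segments are illustrative, not part of any API.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render segments (dicts with start, end, text) as an SRT document."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)


# Illustrative segments; real ones would come from the transcription response.
sample_segments = [
    {"start": 0.0, "end": 2.5, "text": " Hello."},
    {"start": 2.5, "end": 4.0, "text": " World."},
]
print(segments_to_srt(sample_segments))
```

OpenAI's endpoint can also return `srt` or `vtt` directly through its `response_format` parameter. A client-side converter like this is mainly useful when segments need to be merged or edited first.
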
Use cases
6 Most Valuable Use Cases
- Multilingual Speech Transcription
- Meeting and Lecture Captions
- Call Center Conversation Logging
- Podcast and Video Subtitles
- Voice-Controlled App Interfaces
- Audio Data Preprocessing
Transparent pricing
Cost Comparison
In the comparison below, LLM API lists the lowest Whisper-class transcription cost and latency among the providers shown.
| Provider | Region | Latency | Throughput | Uptime | Input price | Output price | Max audio per request |
|---|---|---|---|---|---|---|---|
| LLM API (best) | Global | ~180ms | ~120 audio min/s | 99.99% | ~$0.003/min | $0.00 | ~4 hours |
| OpenAI | Global | ~250ms | ~60 audio min/s | 99.9% | $0.006/min | $0.006/min | 30 min |
| Azure OpenAI | US East | ~450ms | ~45 audio min/s | 99.9% | ~$0.0065/min | ~$0.0065/min | 30 min |
| Google Cloud Speech-to-Text | Global | ~500ms | ~40 audio min/s | 99.9% | ~$0.009/min | ~$0.009/min | 30 min |
| Amazon Transcribe | US East | ~550ms | ~35 audio min/s | 99.9% | ~$0.008/min | ~$0.008/min | 30 min |
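
Because transcription is billed per minute of audio, estimating spend is a single multiplication. The sketch below uses the approximate input rates from the table above and ignores the output column. The `monthly_cost` helper and the 1,000-hour workload are illustrative assumptions.

```python
# Approximate input prices in USD per audio minute, copied from the table above.
PRICE_PER_MINUTE = {
    "LLM API": 0.003,
    "OpenAI": 0.006,
    "Azure OpenAI": 0.0065,
    "Google Cloud Speech-to-Text": 0.009,
    "Amazon Transcribe": 0.008,
}


def monthly_cost(audio_hours: float, provider: str) -> float:
    """Estimated input cost in USD for the given hours of audio."""
    return round(audio_hours * 60 * PRICE_PER_MINUTE[provider], 2)


# Hypothetical workload: 1,000 hours of audio per month.
for name in PRICE_PER_MINUTE:
    print(f"{name}: ${monthly_cost(1_000, name):,.2f}")
```
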
Performance benchmarks
Technical Specifications
| Metric | Whisper 1 (OpenAI) | Google Speech-to-Text v2 | Amazon Transcribe |
|---|---|---|---|
| Avg Latency | ~300ms | ~350ms | ~400ms |
| Languages Supported | ~99 | ~73 | ~79 |
| Price per Minute | $0.006 | $0.012 | $0.015 |
| Max Duration per Request | 60 min | 480 min | 240 min |
| Accuracy (WER) | ~7% | ~8% | ~9% |
| Uptime | 99.9% | 99.9% | 99.9% |
| Streaming Support | No | Yes | Yes |
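
The accuracy row reports word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the model's output into the reference transcript, divided by the number of reference words. The sketch below shows one common way to compute it with a word-level edit distance. It lowercases and splits on whitespace only, which is an assumption; published benchmarks typically apply heavier text normalization, so results will not match the figures above exactly.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Word-level Levenshtein distance, keeping only two rows of the table.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(min(
                previous[j] + 1,         # deletion (reference word missing)
                current[j - 1] + 1,      # insertion (extra hypothesis word)
                previous[j - 1] + cost,  # substitution or exact match
            ))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```
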
30-day usage via LLM API
- 310M audio minutes transcribed (30 days)
- 22.5M API requests processed (30 days)
- 2.1M unique apps and services using Whisper 1
- 99.9% average API uptime over last 30 days
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
-
Unified AI Routing
Automatically direct each request to the optimal model across providers using latency, cost, and quality signals, so you ship faster without hardcoding vendor logic.
One endpoint, smart routing.
-
Cost-Aware Execution
Control spend with per-route budgets, price-aware model selection, and real-time usage insights, so you can scale traffic without surprise bills or manual tuning.
Optimize cost, not code.
-
Resilient Fallbacks
Define automatic failover chains across models and providers, so outages, rate limits, or degraded quality don’t take your features offline.
Stay online under stress.
-
End-to-End Observability
Inspect every request with traces, metrics, and structured logs across providers, making it easy to debug prompts, compare models, and tune performance in production.
See every token flow.
-
Task-Level Abstractions
Describe tasks like chat, RAG, or extraction once, then swap models or providers without rewriting business logic, keeping your app code clean and future-proof.
Code to tasks, not models.
-
High-Throughput Batch
Process large workloads with parallelized, provider-agnostic batching and automatic retries, reducing latency and unit cost for bulk jobs and backfills.
Batch at production scale.
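
To make the idea of a failover chain concrete, here is a minimal client-side sketch: try each backend in order and return the first successful result. It is a generic illustration of the pattern, not LLM.API's implementation or configuration format. The function `transcribe_with_fallback` and the stand-in backends are hypothetical.

```python
from typing import Callable, Sequence

Transcriber = Callable[[bytes], str]


def transcribe_with_fallback(
    audio: bytes,
    chain: Sequence[tuple[str, Transcriber]],
) -> tuple[str, str]:
    """Try each (name, transcriber) in order; return (name, text) from the first success."""
    errors = []
    for name, transcriber in chain:
        try:
            return name, transcriber(audio)
        except Exception as exc:  # a real client would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stand-in backends for demonstration only.
def flaky_primary(audio: bytes) -> str:
    raise TimeoutError("rate limited")


def stable_secondary(audio: bytes) -> str:
    return f"transcribed {len(audio)} bytes"


print(transcribe_with_fallback(b"\x00" * 16, [
    ("primary", flaky_primary),
    ("secondary", stable_secondary),
]))
```
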
Decision guide
When to Use — When NOT to Use
Use it if...
- You need accurate speech-to-text transcription for single-speaker English audio recordings.
- You need robust transcription for noisy environments, accents, or imperfect consumer-grade microphones.
- Your use case involves transcribing podcasts, interviews, or lectures for searchable text archives.
- Your use case involves automatically generating subtitles or captions for pre-recorded videos.
- You need to convert voice notes or meetings into text for downstream NLP processing.
- You need a general-purpose ASR model without training or fine-tuning your own system.
- Your use case involves batch-processing many audio files asynchronously without strict real-time constraints.
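
For the batch-processing case, a small thread pool is often enough, since each transcription call spends most of its time waiting on the network. The sketch below assumes you supply your own `transcribe_file` function wrapping the API call. `transcribe_batch` and `fake_transcribe` are illustrative names, and the fake backend exists only so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable


def transcribe_batch(
    paths: list[str],
    transcribe_file: Callable[[str], str],
    max_workers: int = 4,
) -> tuple[dict[str, str], dict[str, str]]:
    """Transcribe files concurrently; return (transcripts, errors) keyed by path."""
    results: dict[str, str] = {}
    errors: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(transcribe_file, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going so one bad file does not stop the batch
                errors[path] = str(exc)
    return results, errors


# Offline stand-in for a real API call.
def fake_transcribe(path: str) -> str:
    if not path.endswith((".mp3", ".wav")):
        raise ValueError(f"unsupported file: {path}")
    return f"transcript of {path}"


done, failed = transcribe_batch(["a.mp3", "b.wav", "notes.txt"], fake_transcribe)
print(sorted(done), failed)
```
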
Avoid if...
- You need real-time, low-latency streaming transcription for live calls or broadcasts.
- Your workload requires on-device or fully offline speech recognition without cloud dependencies.
- You need highly domain-specific ASR tuned to medical, legal, or niche technical jargon.
- You need end-to-end spoken language understanding and dialog, not just transcription output.
- Your workload requires strict, verifiable data residency on self-hosted infrastructure only.
- You need fine-grained word-level timestamps and detailed diarization across many speakers.
- Your workload requires direct speech-to-speech translation instead of speech-to-text capabilities.
FAQ
Frequently Asked Questions
-
What is Whisper 1?
Whisper 1 is OpenAI’s automatic speech recognition (ASR) model for transcribing and translating audio into text.
-
What modalities does Whisper 1 support via LLM.API?
Whisper 1 supports audio input and returns text output for transcription and translation tasks through LLM.API.
-
What is Whisper 1 best suited for?
Whisper 1 is best for accurate speech-to-text transcription, multilingual audio transcription, and speech translation to English.
-
How is Whisper 1 priced when used through LLM.API?
Whisper 1 is typically billed per minute of processed audio; consult LLM.API’s pricing page for exact current rates.
-
What is the maximum audio length or context Whisper 1 can handle per request?
Whisper 1 generally supports long-form audio, but maximum duration may be capped by LLM.API request size and timeout limits.
-
How fast is Whisper 1 in terms of latency?
Whisper 1 usually processes audio close to or faster than real time, but actual latency depends on audio length and LLM.API infrastructure.
-
How do I call Whisper 1 through LLM.API?
You select the Whisper 1 model identifier in your LLM.API request and send audio data in the supported format and encoding.
-
How does Whisper 1 compare to larger text LLMs for transcription tasks?
Whisper 1 is generally more accurate, robust, and cost-efficient for transcription than using general-purpose text-only LLMs with external audio preprocessing.
-
Does Whisper 1 support multiple languages?
Yes, Whisper 1 supports many languages for transcription and can translate non-English speech into English text.
-
What formats and sample rates are supported for Whisper 1 audio input?
Whisper 1 typically supports common formats like MP3, MP4, WAV, and FLAC with standard speech sample rates such as 16 kHz.
-
Can Whisper 1 perform real-time streaming transcription via LLM.API?
Real-time streaming support depends on LLM.API features; if streaming endpoints are provided, they can expose Whisper 1 for low-latency use.
-
What are some limitations of Whisper 1?
Whisper 1 may struggle with heavy background noise, strong accents, overlapping speakers, domain-specific jargon, and very low-quality recordings.
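
As a rough illustration of the calling and format questions above, the sketch below checks a file's extension and sends it for transcription or translation. It rests on several assumptions: that LLM.API exposes an OpenAI-compatible audio endpoint, that the model identifier is `whisper-1` (OpenAI's own identifier), and that the placeholder base URL and the `LLM_API_KEY` / `LLM_API_BASE_URL` environment variable names are replaced with the values from the provider's documentation. The client calls themselves are those of the `openai` Python package.

```python
import os
from pathlib import Path

# Formats named in the FAQ above; the full list accepted by a given endpoint may differ.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".wav", ".flac"}

# Hypothetical placeholder: replace with the base URL from your provider's docs.
BASE_URL = os.environ.get("LLM_API_BASE_URL", "https://api.example.invalid/v1")


def check_audio_path(path: str) -> Path:
    """Reject files whose extension is not in the supported set."""
    audio_path = Path(path)
    if audio_path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported audio format: {audio_path.suffix or '(none)'}")
    return audio_path


def transcribe(path: str, translate_to_english: bool = False) -> str:
    """Send one file to the whisper-1 model and return plain text."""
    from openai import OpenAI  # third-party: pip install openai

    audio_path = check_audio_path(path)
    # LLM_API_KEY is an assumed variable name, not a documented one.
    client = OpenAI(api_key=os.environ["LLM_API_KEY"], base_url=BASE_URL)
    endpoint = client.audio.translations if translate_to_english else client.audio.transcriptions
    with audio_path.open("rb") as audio_file:
        return endpoint.create(model="whisper-1", file=audio_file, response_format="text")
```

Calling `transcribe("meeting.mp3")` would return the transcript in the spoken language, while `transcribe("meeting.mp3", translate_to_english=True)` would request an English translation instead.
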
