Powered by OpenAI

Whisper 1

  • Speech-to-Text

Whisper 1 is OpenAI’s hosted automatic speech recognition model based on the open-source Whisper family, designed for high-quality transcription and translation of audio. It is notable for robust multilingual speech-to-text performance and language identification across diverse audio conditions.

What is Whisper 1?

Whisper 1 is an OpenAI speech recognition model served via API for converting spoken audio into text. It is mainly used for automatic transcription of recordings such as meetings, podcasts, or voice notes, and for generating captions or searchable text from spoken content. It is also widely used to translate non‑English speech into English transcripts and to detect the spoken language in audio. Whisper 1 belongs to the Whisper model family and is based on the large-v2 variant of OpenAI’s open-source Whisper models.

5 Core Capabilities

  • Speech Recognition

    Converts spoken audio into accurate text transcriptions across many languages, handling varied accents, recording conditions, and speaking styles.

  • Multilingual Transcription

    Transcribes speech in multiple supported languages, preserving original language content while coping with diverse pronunciations and vocabularies.

  • Speech Translation

    Translates non-English speech in audio into written English text, enabling cross-lingual understanding and communication.

  • Audio OCR

    Extracts spoken content from audio or video files as text, doing for speech roughly what OCR does for printed pages.

  • Audio Captioning

    Transcribes spoken segments into text that can be timed against the audio, supporting captioning and subtitling workflows for media content.
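For captioning workflows, timed transcript segments are usually converted into a subtitle format such as SRT. The sketch below assumes each segment is a dict with `start` and `end` in seconds plus a `text` string. That shape is a hypothetical stand-in, so check what your endpoint actually returns before relying on it.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp: HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build an SRT document from timed segments (assumed shape, see above)."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        # Each cue: running number, time range, then the caption text.
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Made-up segments for illustration only.
example_segments = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the show."},
    {"start": 2.5, "end": 5.04, "text": " Today we talk about captions."},
]
srt_text = segments_to_srt(example_segments)
print(srt_text)
```

The resulting string can be saved with an `.srt` extension and loaded by most video players and editing tools.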

6 Most Valuable Use Cases

  • Multilingual Speech Transcription
  • Meeting and Lecture Captions
  • Call Center Conversation Logging
  • Podcast and Video Subtitles
  • Voice-Controlled App Interfaces
  • Audio Data Preprocessing

Cost Comparison

According to the figures below, LLM.API offers the lowest Whisper‑class transcription cost and latency among the providers listed.

Provider                     Region   Latency  Throughput (audio min/s)  Uptime  Input ($/min)  Output ($/min)  Max audio per request
LLM.API (best)               Global   ~180ms   ~120                      99.99%  ~0.003         0.00            ~4 hours
OpenAI                       Global   ~250ms   ~60                       99.9%   0.006          0.006           30 min
Azure OpenAI                 US East  ~450ms   ~45                       99.9%   ~0.0065        ~0.0065         30 min
Google Cloud Speech-to-Text  Global   ~500ms   ~40                       99.9%   ~0.009         ~0.009          30 min
Amazon Transcribe            US East  ~550ms   ~35                       99.9%   ~0.008         ~0.008          30 min
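
The per-minute rates in the table turn into a monthly bill by simple multiplication. A minimal sketch, using the listed input rates as illustrative values rather than quoted prices; the 50,000-minute workload is a hypothetical figure chosen for the example.

```python
from decimal import Decimal

# Per-minute input rates copied from the table above (illustrative, not quotes).
RATES_PER_MIN = {
    "LLM.API": Decimal("0.003"),
    "OpenAI": Decimal("0.006"),
    "Azure OpenAI": Decimal("0.0065"),
    "Google Cloud Speech-to-Text": Decimal("0.009"),
    "Amazon Transcribe": Decimal("0.008"),
}


def monthly_cost(audio_minutes: int, rate_per_min: Decimal) -> Decimal:
    """Cost in dollars, rounded to cents: minutes multiplied by the rate."""
    return (Decimal(audio_minutes) * rate_per_min).quantize(Decimal("0.01"))


workload_minutes = 50_000  # hypothetical monthly volume
costs = {name: monthly_cost(workload_minutes, rate) for name, rate in RATES_PER_MIN.items()}

# Print providers from cheapest to most expensive for this workload.
for provider, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{provider}: ${cost}")
```

`Decimal` is used instead of floats so that fractional-cent rates do not accumulate rounding error.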

Technical Specifications

Metric                    Whisper 1 (OpenAI)  Google Speech-to-Text v2  Amazon Transcribe
Avg Latency               ~300ms              ~350ms                    ~400ms
Languages Supported       ~99                 ~73                       ~79
Price per Minute          $0.006              $0.012                    $0.015
Max Duration per Request  60 min              480 min                   240 min
Accuracy (WER)            ~7%                 ~8%                       ~9%
Uptime                    99.9%               99.9%                     99.9%
Streaming Support         No                  Yes                       Yes
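
The accuracy row reports word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. A minimal sketch of that calculation follows. Plain whitespace tokenization is an assumption made for brevity; real evaluations usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in the output
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.3f}")
```

Because insertions count as errors, WER can exceed 100% when the output is much longer than the reference.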

30-day usage via LLM.API

  • 310M audio minutes transcribed
  • 22.5M API requests processed
  • 2.1M unique apps and services using Whisper 1
  • 99.9% average API uptime

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

  • Unified AI Routing

    Automatically direct each request to the optimal model across providers using latency, cost, and quality signals, so you ship faster without hardcoding vendor logic.

    One endpoint, smart routing.

  • Cost-Aware Execution

    Control spend with per-route budgets, price-aware model selection, and real-time usage insights, so you can scale traffic without surprise bills or manual tuning.

    Optimize cost, not code.

  • Resilient Fallbacks

    Define automatic failover chains across models and providers, so outages, rate limits, or degraded quality don’t take your features offline.

    Stay online under stress.

  • End-to-End Observability

    Inspect every request with traces, metrics, and structured logs across providers, making it easy to debug prompts, compare models, and tune performance in production.

    See every token flow.

  • Task-Level Abstractions

    Describe tasks like chat, RAG, or extraction once, then swap models or providers without rewriting business logic, keeping your app code clean and future-proof.

    Code to tasks, not models.

  • High-Throughput Batch

    Process large workloads with parallelized, provider-agnostic batching and automatic retries, reducing latency and unit cost for bulk jobs and backfills.

    Batch at production scale.
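
A failover chain of the kind described under Resilient Fallbacks can be pictured as an ordered list of providers tried in turn. Everything named in this sketch (`transcribe_with_fallback`, `ProviderError`, and the two stand-in providers) is hypothetical and only illustrates the pattern; it is not LLM.API's actual interface.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error for an outage, rate limit, or degraded provider."""


def transcribe_with_fallback(
    audio: bytes,
    providers: Sequence[tuple[str, Callable[[bytes], str]]],
) -> tuple[str, str]:
    """Try each provider in order and return (provider_name, transcript)."""
    failures = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except ProviderError as exc:
            # Record the failure and move on to the next provider in the chain.
            failures.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(failures))


# Stand-in providers for illustration only; real ones would call an API.
def primary(audio: bytes) -> str:
    raise ProviderError("rate limited")


def secondary(audio: bytes) -> str:
    return f"transcript of {len(audio)} bytes"


used, text = transcribe_with_fallback(
    b"\x00" * 16, [("primary", primary), ("secondary", secondary)]
)
print(used, "->", text)
```

Only `ProviderError` triggers a fallback here, so programming errors still surface immediately instead of being masked by the chain.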

When to Use — When NOT to Use

Use it if...

  • You need accurate speech-to-text transcription for single-speaker English audio recordings.
  • You need robust transcription for noisy environments, accents, or imperfect consumer-grade microphones.
  • Your use case involves transcribing podcasts, interviews, or lectures for searchable text archives.
  • Your use case involves automatically generating subtitles or captions for pre-recorded videos.
  • You need to convert voice notes or meetings into text for downstream NLP processing.
  • You need a general-purpose ASR model without training or fine-tuning your own system.
  • Your use case involves batch-processing many audio files asynchronously without strict real-time constraints.

Avoid if...

  • You need real-time, low-latency streaming transcription for live calls or broadcasts.
  • Your workload requires on-device or fully offline speech recognition without cloud dependencies.
  • You need highly domain-specific ASR tuned to medical, legal, or niche technical jargon.
  • You need end-to-end spoken language understanding and dialog, not just transcription output.
  • Your workload requires strict, verifiable data residency on self-hosted infrastructure only.
  • You need fine-grained word-level timestamps and detailed diarization across many speakers.
  • Your workload requires direct speech-to-speech translation instead of speech-to-text capabilities.

Frequently Asked Questions

  • What is Whisper 1?

    Whisper 1 is OpenAI’s automatic speech recognition (ASR) model for transcribing and translating audio into text.

  • What modalities does Whisper 1 support via LLM.API?

    Whisper 1 supports audio input and returns text output for transcription and translation tasks through LLM.API.

  • What is Whisper 1 best suited for?

    Whisper 1 is best for accurate speech-to-text transcription, multilingual audio transcription, and speech translation to English.

  • How is Whisper 1 priced when used through LLM.API?

    Whisper 1 is typically billed per minute of processed audio; consult LLM.API’s pricing page for exact current rates.

  • What is the maximum audio length or context Whisper 1 can handle per request?

    Whisper 1 can handle long-form audio, but OpenAI’s own endpoint caps uploads at 25 MB per file, and LLM.API may apply its own request size and timeout limits. Longer recordings are typically split into chunks before upload.

  • How fast is Whisper 1 in terms of latency?

    Whisper 1 usually processes audio close to or faster than real time, but actual latency depends on audio length and LLM.API infrastructure.

  • How do I call Whisper 1 through LLM.API?

    You select the Whisper 1 model identifier in your LLM.API request and send audio data in the supported format and encoding.

  • How does Whisper 1 compare to larger text LLMs for transcription tasks?

    Whisper 1 is generally more accurate, robust, and cost-efficient for transcription than using general-purpose text-only LLMs with external audio preprocessing.

  • Does Whisper 1 support multiple languages?

    Yes, Whisper 1 supports many languages for transcription and can translate non-English speech into English text.

  • What formats and sample rates are supported for Whisper 1 audio input?

    Whisper 1 typically supports common formats like MP3, MP4, WAV, and FLAC with standard speech sample rates such as 16 kHz.

  • Can Whisper 1 perform real-time streaming transcription via LLM.API?

    Real-time streaming support depends on LLM.API features; if streaming endpoints are provided, they can expose Whisper 1 for low-latency use.

  • What are some limitations of Whisper 1?

    Whisper 1 may struggle with heavy background noise, strong accents, overlapping speakers, domain-specific jargon, and very low-quality recordings.
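
As a rough illustration of the answer to "How do I call Whisper 1 through LLM.API?", the sketch below builds a multipart upload using only the Python standard library. The base URL `https://api.example.com/v1`, the `/audio/transcriptions` path, the `whisper-1` identifier, the `model` and `file` field names, and the `text` response field are all assumptions modeled on OpenAI-style transcription endpoints. Check LLM.API's documentation for the real values.

```python
import json
import os
import urllib.request
import uuid


def build_transcription_request(
    audio: bytes,
    filename: str,
    api_key: str,
    base_url: str = "https://api.example.com/v1",  # placeholder, not a real endpoint
    model: str = "whisper-1",                      # assumed model identifier
) -> urllib.request.Request:
    """Assemble a multipart/form-data POST carrying the model name and audio."""
    boundary = uuid.uuid4().hex
    model_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode()
    closing = f"--{boundary}--\r\n".encode()
    body = model_part + file_header + audio + b"\r\n" + closing
    return urllib.request.Request(
        url=f"{base_url}/audio/transcriptions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
        method="POST",
    )


def transcribe(path: str, api_key: str) -> str:
    """Upload one audio file and return the transcript (assumed 'text' field)."""
    with open(path, "rb") as audio_file:
        request = build_transcription_request(
            audio_file.read(), os.path.basename(path), api_key
        )
    with urllib.request.urlopen(request, timeout=120) as response:
        return json.load(response)["text"]
```

Calling `transcribe("meeting.mp3", api_key)` would perform the upload. Keeping request construction separate from the network call makes the payload easy to inspect without sending anything.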
