Powered by Qwen
Qwen3 ASR Flash
- Text Generation
Qwen3 ASR Flash is Qwen’s high-accuracy, multilingual automatic speech recognition (ASR) service optimized for real-time transcription of short audio. It is built on the Qwen3-Omni foundation model and trained on tens of millions of hours of multimodal speech data for robust performance across noisy and varied environments.
About the model
What is Qwen3 ASR Flash?
Qwen3 ASR Flash is an automatic speech recognition model and cloud service from Qwen (Alibaba) designed for fast, accurate transcription of short audio segments. It is mainly used to convert speech to text in real time for applications such as live captioning, meeting or call transcription, and voice-driven interfaces. It is also used as a backend ASR component in broader multimodal and translation pipelines, including tools that extend it to long-form audio transcription. The model is part of the Qwen3-ASR family and is built on the Qwen3-Omni multimodal model within the broader Qwen3 model ecosystem.
Model capabilities
5 Core Capabilities
- Streaming ASR: Performs low-latency automatic speech recognition, transcribing spoken audio to text in real time for interactive applications.
- Offline Transcription: Converts prerecorded audio files into accurate text transcripts, supporting efficient processing of long-form speech content.
- Multilingual Speech: Recognizes and transcribes speech across multiple languages, enabling global voice-powered applications and multilingual audio processing.
- Command Interfaces: Enables voice-driven control and command interfaces by reliably turning spoken instructions into structured text for downstream handling.
- Audio Event Capture: Handles diverse acoustic conditions and speaking styles to robustly capture and transcribe speech in real-world noisy environments.
Use cases
6 Most Valuable Use Cases
- Real-time Speech Transcription
- Voice Command Interfaces
- Call Center Call Transcripts
- Meeting and Lecture Notes
- Multilingual Audio Captioning
- Streaming ASR for Apps
Transparent pricing
Cost Comparison
Among the providers compared below, LLM.API lists the lowest per-minute price for Qwen3 ASR-class models, along with the lowest latency and the highest throughput.
| Provider | Region | Latency | Throughput | Uptime | Input price | Output price | Max audio length |
|---|---|---|---|---|---|---|---|
| LLM.API (best) | Global | 120ms | 120 audio min/s | 99.99% | $0.004/min | $0.00/min | ~4 hr audio |
| Qwen | Global | ~180ms | ~80 audio min/s | ~99.9% | ~$0.006/min | $0.00/min | ~3 hr audio |
| Alibaba Cloud | APAC | ~220ms | ~60 audio min/s | ~99.9% | ~$0.007/min | $0.00/min | ~2 hr audio |
| Replicate | Global | ~250ms | ~40 audio min/s | ~99.5% | ~$0.010/min | $0.00/min | ~2 hr audio |
| Fireworks AI | US East | ~200ms | ~70 audio min/s | ~99.9% | ~$0.008/min | $0.00/min | ~3 hr audio |
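To make the per-minute rates easier to compare, the sketch below converts them into a cost for a given volume of audio. The prices are copied from the table above (most are marked approximate there); the function names are illustrative and not part of any provider's SDK.

```python
# Input prices in USD per audio minute, copied from the comparison table.
# Values prefixed with "~" in the table are approximate.
PRICE_PER_MIN = {
    "LLM.API": 0.004,
    "Qwen": 0.006,
    "Alibaba Cloud": 0.007,
    "Replicate": 0.010,
    "Fireworks AI": 0.008,
}


def transcription_cost(audio_hours: float, price_per_min: float) -> float:
    """Return the USD cost of transcribing `audio_hours` hours of audio."""
    return audio_hours * 60 * price_per_min


def cost_by_provider(audio_hours: float) -> dict:
    """Map each provider to its cost for the given volume, rounded to cents."""
    return {
        name: round(transcription_cost(audio_hours, price), 2)
        for name, price in PRICE_PER_MIN.items()
    }


if __name__ == "__main__":
    # Example: 1,000 hours of audio, cheapest provider first.
    for provider, cost in sorted(cost_by_provider(1000).items(), key=lambda kv: kv[1]):
        print(f"{provider:<14} ${cost:,.2f}")
```

At 1,000 hours of audio, the listed rates work out to roughly $240 on LLM.API versus roughly $600 at the highest listed rate.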
Performance benchmarks
Technical Specifications
| Metric | Qwen3 ASR Flash | Whisper Large v3 (OpenAI API) | Deepgram Nova-2 |
|---|---|---|---|
| Avg Latency (Streaming) | Not stated | Real-time or better on GPU (varies by provider) | Not stated |
| Languages Supported | Multilingual (exact count not stated) | ≈99 languages | Multilingual (English + others; exact count not stated) |
| Price per Minute (Hosted API) | ~$0.0019/min | $0.006/min | $0.0043/min (pre-recorded baseline) |
| Max Audio Duration per Request | Not stated | ~25 MB per request via OpenAI Whisper-1; v3 limits vary by host | Multi-hour files typical; hard limit not stated |
| Accuracy (WER, clean English) | Reported as state-of-the-art vs Whisper v3 (exact WER not stated) | ≈2.7% WER on clean audio | Higher accuracy than Nova and Whisper v2; comparison with Whisper v3 not stated |
| Model Type / Architecture | All‑in‑one ASR, non‑autoregressive alignment; Qwen3‑based | Encoder–decoder Transformer ASR | End‑to‑end neural ASR (Deepgram Nova family) |
| Deployment / Availability | Cloud API via Alibaba/Qwen; open weights for some variants | Open‑source weights + multiple hosted APIs | Proprietary hosted API (Deepgram cloud) |
| Licensing | Apache‑2.0 for open‑weight variants; commercial terms for cloud | MIT license (open weights); commercial API terms | Commercial, closed‑source |
30-day usage via LLM API
- 2.8B audio seconds transcribed
- 9.4M API requests served
- 210K unique developer accounts
- 99.8% average API uptime
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
- Unified AI Routing: Dynamically route each request to the best model across providers based on latency, cost, and quality, with no client changes and no redeploys. One endpoint, any model.
- Cost-Aware Controls: Set hard budgets, price caps, and model tiers so teams can experiment freely while finance stays in control of spend across every AI provider. Predictable AI spend.
- Automatic Fallbacks: Define provider and model failover rules so traffic transparently shifts on errors or outages, keeping your AI features online without manual intervention. Resilience by default.
- Full-Stack Observability: Trace every request, token, error, and latency across providers with unified logs, metrics, and alerts so you can debug, tune, and prove ROI in one place. See every token.
- Task-Level Orchestration: Express higher-level tasks (chat, tools, RAG, evaluation) through a single abstraction that hides provider quirks, simplifying complex AI workflows into clean, testable units. One API for tasks.
- High-Throughput Batch: Submit massive batches of generations or evaluations with built-in chunking, retries, and concurrency control to maximize throughput without exceeding rate limits. Scale jobs, not code.
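The failover idea behind automatic fallbacks can be illustrated with a small client-side sketch. This is not LLM.API's implementation, which runs inside the gateway; the function name and the shape of the provider list are assumptions made for illustration only.

```python
from typing import Callable, List, Sequence, Tuple

# A provider is a (name, transcribe_function) pair. The transcribe function
# takes raw audio bytes and returns a transcript, or raises on failure.
Provider = Tuple[str, Callable[[bytes], str]]


def transcribe_with_fallback(providers: Sequence[Provider], audio: bytes) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, transcript) from the first success."""
    errors: List[str] = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except Exception as exc:  # record the failure and move to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


if __name__ == "__main__":
    def primary(audio: bytes) -> str:
        raise ConnectionError("simulated outage")

    def secondary(audio: bytes) -> str:
        return "hello world"

    print(transcribe_with_fallback([("primary", primary), ("secondary", secondary)], b"\x00"))
```

With a gateway that handles failover server-side, the client would send a single request and this loop would not be needed; the sketch only shows the ordering logic that such rules express.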
Decision guide
When to Use — When NOT to Use
Use it if...
- You need fast, low-latency speech-to-text transcription for short utterances or commands.
- Your use case involves real-time transcription of calls, meetings, or live streams.
- You need an ASR model optimized for common Mandarin and English speech scenarios.
- Your use case involves processing large volumes of audio where throughput matters more than perfection.
- You need lightweight ASR for interactive voice features in apps, bots, or games.
- Your use case involves quick voice notes or memos that don’t require full semantic accuracy.
Avoid if...
- You need state-of-the-art accuracy on noisy, highly accented, or domain-specialized audio.
- Your workload requires robust transcription across many low-resource or uncommon languages.
- You need detailed diarization, punctuation, formatting, and rich metadata beyond basic transcripts.
- Your workload requires complex spoken language understanding or reasoning beyond simple transcription.
- You need precise offline transcription for legal, medical, or compliance-critical recordings.
- Your workload requires handling very long multi-hour recordings without segmenting the audio first.
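For recordings that do need segmenting first, a minimal sketch of computing chunk boundaries is shown below. The 60-second chunk length and 1-second overlap are assumed defaults chosen for illustration, not limits documented for Qwen3 ASR Flash; the function name is hypothetical.

```python
from typing import List, Tuple


def chunk_spans(
    total_seconds: float,
    chunk_seconds: float = 60.0,   # assumed chunk length, not a documented limit
    overlap_seconds: float = 1.0,  # small overlap so words at a boundary are not cut
) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds that cover the whole recording."""
    if chunk_seconds <= 0:
        raise ValueError("chunk_seconds must be positive")
    if not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, chunk_seconds)")

    spans: List[Tuple[float, float]] = []
    step = chunk_seconds - overlap_seconds
    start = 0.0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break  # the last chunk reached the end of the recording
        start += step
    return spans


if __name__ == "__main__":
    # A 150-second recording split into 60-second chunks with 10 seconds of overlap.
    for start, end in chunk_spans(150, chunk_seconds=60, overlap_seconds=10):
        print(f"{start:6.1f}s -> {end:6.1f}s")
```

Each span can then be cut from the source file and transcribed as its own request, which keeps per-request latency bounded and lets a failed chunk be retried without resubmitting the whole recording.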
FAQ
Frequently Asked Questions
- What is Qwen3 ASR Flash?
  Qwen3 ASR Flash is a fast automatic speech recognition model by Qwen optimized for low-latency transcription via API.
- What modalities does Qwen3 ASR Flash support?
  Qwen3 ASR Flash accepts audio as input and outputs text transcripts.
- How does Qwen3 ASR Flash compare to other Qwen ASR or general-purpose models?
  Qwen3 ASR Flash prioritizes speed and low cost over maximum accuracy or advanced language understanding found in larger general-purpose Qwen models.
- What is the context window or maximum audio length Qwen3 ASR Flash can handle?
  Qwen3 ASR Flash supports long-form audio segments, but you should chunk very long recordings client-side to manage latency and partial failures.
- Is Qwen3 ASR Flash suitable for real-time or streaming transcription?
  Yes, Qwen3 ASR Flash is designed for low-latency use cases like real-time or near real-time transcription where speed is critical.
- What are the main limitations of Qwen3 ASR Flash?
  Qwen3 ASR Flash may struggle with heavy background noise, very low-resource languages, domain-specific jargon, or tasks requiring deep semantic understanding beyond transcription.
- How is Qwen3 ASR Flash priced when accessed through LLM.API?
  LLM.API exposes Qwen3 ASR Flash with usage-based pricing per audio duration; check the LLM.API pricing page for the latest exact rates.
- How fast is Qwen3 ASR Flash on LLM.API?
  Qwen3 ASR Flash is tuned for high throughput and low latency, typically returning transcripts much faster than the input audio duration.
- How do I call Qwen3 ASR Flash through the LLM.API gateway?
  You specify the provider as Qwen and the model name as Qwen3 ASR Flash in your LLM.API request, sending audio content in the supported format.
- Does Qwen3 ASR Flash support multiple languages?
  Qwen3 ASR Flash supports multilingual transcription, but accuracy varies by language and is generally best for its highest-resource languages.
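As a rough illustration of the gateway call described in the FAQ, the sketch below builds a JSON request body that names the provider and model and carries the audio as base64. Every field name, the `qwen3-asr-flash` identifier, and the payload layout are hypothetical; the page does not document the actual request schema, so check the LLM.API reference before relying on any of them.

```python
import base64
import json
from typing import Optional


def build_transcription_request(
    audio_bytes: bytes,
    audio_format: str = "wav",
    language: Optional[str] = None,
) -> str:
    """Build a JSON request body for a transcription call (hypothetical schema)."""
    payload = {
        "provider": "qwen",          # hypothetical field and value
        "model": "qwen3-asr-flash",  # hypothetical model identifier
        "audio": {
            "format": audio_format,
            # Binary audio is base64-encoded so it can travel inside JSON.
            "data": base64.b64encode(audio_bytes).decode("ascii"),
        },
    }
    if language is not None:
        payload["language"] = language  # optional language hint, also hypothetical
    return json.dumps(payload)


if __name__ == "__main__":
    body = build_transcription_request(b"RIFF....WAVE", language="en")
    print(body[:120])
```

The body is only constructed here, not sent, because the endpoint URL and authentication scheme are not given on this page.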
