Qwen3 ASR Flash

Text Generation

Qwen3 ASR Flash is Qwen’s high-accuracy, multilingual automatic speech recognition (ASR) service optimized for real-time transcription of short audio. It is built on the Qwen3-Omni foundation model and trained on tens of millions of hours of multimodal speech data for robust performance across noisy and varied environments.

API Performance

Latency: ~3s time to first transcription chunk over API
Input: $0.00009 per second of audio
Output: Included (transcription is included in the input price)
Uptime: 99%
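To make the per-second rate easier to reason about, the sketch below converts it into a cost for a clip of a given length. It assumes the $0.00009 per second figure listed above and simple linear billing with no minimum charge or rounding, which this page does not confirm. The name `transcription_cost` is illustrative only.

```python
from decimal import Decimal

# Per-second input price quoted above; transcription output is included.
PRICE_PER_SECOND_USD = Decimal("0.00009")


def transcription_cost(audio_seconds: int) -> Decimal:
    """Estimate the cost in USD for a clip, assuming linear per-second billing."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return PRICE_PER_SECOND_USD * audio_seconds


print(transcription_cost(60))    # cost of one minute of audio
print(transcription_cost(3600))  # cost of one hour of audio
```

At this rate, one minute of audio comes to about $0.0054 and one hour to about $0.324.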

About the model

What is Qwen3 ASR Flash?

Qwen3 ASR Flash is an automatic speech recognition model and cloud service from Qwen (Alibaba) designed for fast, accurate transcription of short audio segments. It is mainly used to convert speech to text in real time for applications such as live captioning, meeting or call transcription, and voice-driven interfaces. It is also used as a backend ASR component in broader multimodal and translation pipelines, including tools that extend it to long-form audio transcription. The model is part of the Qwen3-ASR family and is built on the Qwen3-Omni multimodal model within the broader Qwen3 model ecosystem.

Input / Output

Input

Audio (streaming or file-based speech input for ASR)

Output

Transcribed text from speech (speech-to-text)

Model capabilities

5 Core Capabilities

Streaming ASR

Performs low-latency automatic speech recognition, transcribing spoken audio to text in real time for interactive applications.

Offline Transcription

Converts prerecorded audio files into accurate text transcripts, supporting efficient processing of long-form speech content.

Multilingual Speech

Recognizes and transcribes speech across multiple languages, enabling global voice-powered applications and multilingual audio processing.

Command Interfaces

Enables voice-driven control and command interfaces by reliably turning spoken instructions into structured text for downstream handling.

Audio Event Capture

Handles diverse acoustic conditions and speaking styles to robustly capture and transcribe speech in real-world noisy environments.

Use cases

6 Most Valuable Use Cases

Real-time Speech Transcription
Voice Command Interfaces
Call Center Call Transcripts
Meeting and Lecture Notes
Multilingual Audio Captioning
Streaming ASR for Apps

Transparent pricing

Cost Comparison

LLM API offers the lowest per-minute pricing and the best overall performance for Qwen3 ASR-class models among the providers compared below.

Provider	Region	Latency	Throughput	Uptime	Input ($/min)	Output ($/min)	Context
LLM API (best)	Global	120ms	120 audio min/s	99.99%	$0.004/min	$0.00/min	~4 hr audio
Qwen	Global	~180ms	~80 audio min/s	~99.9%	~$0.006/min	$0.00/min	~3 hr audio
Alibaba Cloud	APAC	~220ms	~60 audio min/s	~99.9%	~$0.007/min	$0.00/min	~2 hr audio
Replicate	Global	~250ms	~40 audio min/s	~99.5%	~$0.010/min	$0.00/min	~2 hr audio
Fireworks AI	US East	~200ms	~70 audio min/s	~99.9%	~$0.008/min	$0.00/min	~3 hr audio

Performance benchmarks

Technical Specifications

Metric	Qwen3 ASR Flash	Whisper Large v3 (OpenAI API)	Deepgram Nova-2
Avg Latency (Streaming)	—	Real‑time or better on GPU (varies by provider)	—
Languages Supported	Multilingual (exact count —)	≈99 languages	Multilingual (English + others; exact count —)
Price per Minute (Hosted API)	~$0.0019/min	$0.006/min	$0.0043/min (pre‑recorded baseline)
Max Audio Duration per Request	—	~25 MB per request via OpenAI Whisper-1; v3 limits vary by host	Typical API up to multi‑hour files; hard limit —
Accuracy (WER, clean English)	State‑of‑the‑art vs Whisper v3 (exact WER —)	≈2.7% WER on clean audio	Higher accuracy than Nova and Whisper v2; vs Whisper v3 —
Model Type / Architecture	All‑in‑one ASR, non‑autoregressive alignment; Qwen3‑based	Encoder–decoder Transformer ASR	End‑to‑end neural ASR (Deepgram Nova family)
Deployment / Availability	Cloud API via Alibaba/Qwen; open weights for some variants	Open‑source weights + multiple hosted APIs	Proprietary hosted API (Deepgram cloud)
Licensing	Apache‑2.0 for open‑weight variants; commercial terms for cloud	MIT license (open weights); commercial API terms	Commercial, closed‑source
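
The accuracy row above is expressed as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. A minimal sketch of that calculation follows. It only splits on whitespace, whereas real benchmark scoring usually normalizes case and punctuation first. The function name is illustrative and is not tied to any vendor's evaluation tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

A WER of about 2.7% corresponds to roughly 27 word errors per 1,000 reference words.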

30-day usage via LLM API

2.8B: Audio seconds transcribed
9.4M: API requests served
210K: Unique developer accounts
99.8%: Avg API uptime

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Unified AI Routing

Dynamically route each request to the best model across providers based on latency, cost, and quality, with no client changes and no redeploys, just smarter defaults.
One endpoint, any model

Cost-Aware Controls

Set hard budgets, price caps, and model tiers so teams can experiment freely while finance stays in control of spend across every AI provider.
Predictable AI spend

Automatic Fallbacks

Define provider and model failover rules so traffic transparently shifts on errors or outages, keeping your AI features online without manual intervention.
Resilience by default

Full-Stack Observability

Trace every request, token, error, and latency across providers with unified logs, metrics, and alerts so you can debug, tune, and prove ROI in one place.
See every token

Task-Level Orchestration

Express higher-level tasks (chat, tools, RAG, evaluation) through a single abstraction that hides provider quirks, simplifying complex AI workflows into clean, testable units.
One API for tasks

High-Throughput Batch

Submit massive batches of generations or evaluations with built-in chunking, retries, and concurrency control to saturate throughput limits without blowing up rate caps.
Scale jobs, not code
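
The fallback and batch behavior described above happens inside the gateway, and this page does not document how it is implemented. As a rough client-side illustration of the same two ideas, the sketch below tries providers in order for each clip and caps the number of clips in flight with a semaphore. Every name here (`transcribe_with_fallback`, `run_batch`, the `flaky` and `stable` stand-in providers) is hypothetical, and retries with backoff are left out for brevity.

```python
import asyncio


async def transcribe_with_fallback(clip, providers):
    """Try each (name, coroutine function) pair in order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, await call(clip)
        except Exception as exc:  # broad catch keeps the sketch short
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


async def run_batch(clips, providers, max_concurrency=4):
    """Process clips concurrently, never exceeding max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(clip):
        async with semaphore:
            return await transcribe_with_fallback(clip, providers)

    # gather() returns results in the same order as the input clips.
    return await asyncio.gather(*(run_one(clip) for clip in clips))


# Stand-in providers: the first always fails, the second always succeeds.
async def flaky(clip):
    raise TimeoutError("upstream timed out")


async def stable(clip):
    return f"transcript of {clip}"


results = asyncio.run(
    run_batch(["a.wav", "b.wav"], [("primary", flaky), ("backup", stable)])
)
print(results)
```

In this toy run every clip ends up served by the `backup` provider, because `primary` always raises.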

Decision guide

When to Use — When NOT to Use

Use it if...

You need fast, low-latency speech-to-text transcription for short utterances or commands.
Your use case involves real-time transcription of calls, meetings, or live streams.
You need an ASR model optimized for common Mandarin and English speech scenarios.
Your use case involves processing large volumes of audio where throughput matters more than perfection.
You need lightweight ASR for interactive voice features in apps, bots, or games.
Your use case involves quick voice notes or memos that don’t require full semantic accuracy.

Avoid if...

You need state-of-the-art accuracy on noisy, highly accented, or domain-specialized audio.
Your workload requires robust transcription across many low-resource or uncommon languages.
You need detailed diarization, punctuation, formatting, and rich metadata beyond basic transcripts.
Your workload requires complex spoken language understanding or reasoning beyond simple transcription.
You need precise offline transcription for legal, medical, or compliance-critical recordings.
Your workload requires handling very long multi-hour recordings without segmenting the audio first.
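
If you do need to transcribe a long recording, the last point implies segmenting it before upload. The sketch below computes chunk boundaries with a small overlap so that words spanning a cut are not lost. The 60-second chunk length and 2-second overlap are arbitrary assumptions and not documented limits of the service. Cutting the audio itself and merging the overlapping transcripts are left to your own tooling.

```python
def chunk_bounds(total_seconds, chunk_seconds=60.0, overlap_seconds=2.0):
    """Return (start, end) pairs in seconds that cover the whole recording."""
    if total_seconds < 0:
        raise ValueError("total_seconds must be non-negative")
    if chunk_seconds <= 0 or not 0 <= overlap_seconds < chunk_seconds:
        raise ValueError("need chunk_seconds > 0 and 0 <= overlap_seconds < chunk_seconds")

    bounds = []
    start = 0.0
    step = chunk_seconds - overlap_seconds  # how far each new chunk advances
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        bounds.append((start, end))
        if end >= total_seconds:
            break  # the final chunk reached the end of the recording
        start += step
    return bounds


# A 150-second recording split into 60-second chunks with 2 seconds of overlap.
for start, end in chunk_bounds(150):
    print(start, end)
```

Each chunk after the first starts 2 seconds before the previous one ended, so the duplicated words in the overlap need to be removed when the transcripts are stitched together.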

FAQ

Frequently Asked Questions

What is Qwen3 ASR Flash?

Qwen3 ASR Flash is a fast automatic speech recognition model by Qwen optimized for low-latency transcription via API.

What modalities does Qwen3 ASR Flash support?

Qwen3 ASR Flash accepts audio as input and outputs text transcripts.

How does Qwen3 ASR Flash compare to other Qwen ASR or general-purpose models?

Qwen3 ASR Flash prioritizes speed and low cost over maximum accuracy or advanced language understanding found in larger general-purpose Qwen models.

What is the context window or maximum audio length Qwen3 ASR Flash can handle?

Qwen3 ASR Flash supports long-form audio segments, but you should chunk very long recordings client-side to manage latency and partial failures.

Is Qwen3 ASR Flash suitable for real-time or streaming transcription?

Yes, Qwen3 ASR Flash is designed for low-latency use cases like real-time or near real-time transcription where speed is critical.

What are the main limitations of Qwen3 ASR Flash?

Qwen3 ASR Flash may struggle with heavy background noise, very low-resource languages, domain-specific jargon, or tasks requiring deep semantic understanding beyond transcription.

How is Qwen3 ASR Flash priced when accessed through LLM.API?

LLM.API exposes Qwen3 ASR Flash with usage-based pricing per audio duration; check the LLM.API pricing page for the latest exact rates.

How fast is Qwen3 ASR Flash on LLM.API?

Qwen3 ASR Flash is tuned for high throughput and low latency, typically returning transcripts much faster than the input audio duration.

How do I call Qwen3 ASR Flash through the LLM.API gateway?

You specify the provider as Qwen and the model name as Qwen3 ASR Flash in your LLM.API request, sending audio content in the supported format.

Does Qwen3 ASR Flash support multiple languages?

Qwen3 ASR Flash supports multilingual transcription, but accuracy varies by language and is generally best for its highest-resource languages.
