Powered by Qwen

Qwen3 ASR Flash

Qwen3 ASR Flash is Qwen’s high-accuracy, multilingual automatic speech recognition (ASR) service optimized for real-time transcription of short audio. It is built on the Qwen3-Omni foundation model and trained on tens of millions of hours of multimodal speech data for robust performance across noisy and varied environments.

What is Qwen3 ASR Flash?

Qwen3 ASR Flash is an automatic speech recognition model and cloud service from Qwen (Alibaba) designed for fast, accurate transcription of short audio segments. It is mainly used to convert speech to text in real time for applications such as live captioning, meeting or call transcription, and voice-driven interfaces. It is also used as a backend ASR component in broader multimodal and translation pipelines, including tools that extend it to long-form audio transcription. The model is part of the Qwen3-ASR family and is built on the Qwen3-Omni multimodal model within the broader Qwen3 model ecosystem.

5 Core Capabilities

  • Streaming ASR

    Performs low-latency automatic speech recognition, transcribing spoken audio to text in real time for interactive applications.

  • Offline Transcription

    Converts prerecorded audio files into text transcripts. Long-form recordings should be split into segments client-side before they are submitted.

  • Multilingual Speech

    Recognizes and transcribes speech across multiple languages, enabling global voice-powered applications and multilingual audio processing.

  • Command Interfaces

    Enables voice-driven control and command interfaces by reliably turning spoken instructions into structured text for downstream handling.

  • Noise-Robust Recognition

    Handles diverse acoustic conditions and speaking styles to robustly capture and transcribe speech in real-world noisy environments.
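
Streaming clients typically send audio as small fixed-duration frames rather than as one upload. The sketch below shows that framing step only. The 16 kHz, 16-bit mono PCM format and the 100 ms frame size are assumptions chosen for illustration, not documented Qwen3 ASR Flash requirements, and `pcm_frames` is a hypothetical helper name.

```python
from typing import Iterator


def pcm_frames(
    pcm: bytes,
    sample_rate: int = 16000,   # assumed: 16 kHz audio
    sample_width: int = 2,      # assumed: 16-bit samples (2 bytes)
    channels: int = 1,          # assumed: mono
    frame_ms: int = 100,        # assumed: 100 ms per frame
) -> Iterator[bytes]:
    """Yield consecutive fixed-duration frames from raw PCM bytes.

    The last frame may be shorter than the others.
    """
    samples_per_frame = sample_rate * frame_ms // 1000
    frame_bytes = samples_per_frame * sample_width * channels
    if frame_bytes <= 0:
        raise ValueError("frame size must be positive")
    for offset in range(0, len(pcm), frame_bytes):
        yield pcm[offset:offset + frame_bytes]


if __name__ == "__main__":
    one_second_of_silence = bytes(32000)  # 16000 samples * 2 bytes
    frames = list(pcm_frames(one_second_of_silence))
    print(len(frames), "frames of", len(frames[0]), "bytes")
```

Each frame would then be written to whatever streaming transport the provider exposes; that transport is outside the scope of this sketch.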

6 Most Valuable Use Cases

  • Real-time Speech Transcription
  • Voice Command Interfaces
  • Call Center Call Transcripts
  • Meeting and Lecture Notes
  • Multilingual Audio Captioning
  • Streaming ASR for Apps

Cost Comparison

Among the providers listed below, LLM API has the lowest listed price per audio minute ($0.004/min), along with the lowest latency, highest throughput, and highest uptime figures.

Provider | Region | Latency | Throughput | Uptime | Input ($/min) | Output ($/min) | Context
LLM API (BEST) | Global | 120ms | 120 audio min/s | 99.99% | $0.004/min | $0.00/min | ~4 hr audio
Qwen | Global | ~180ms | ~80 audio min/s | ~99.9% | ~$0.006/min | $0.00/min | ~3 hr audio
Alibaba Cloud | APAC | ~220ms | ~60 audio min/s | ~99.9% | ~$0.007/min | $0.00/min | ~2 hr audio
Replicate | Global | ~250ms | ~40 audio min/s | ~99.5% | ~$0.010/min | $0.00/min | ~2 hr audio
Fireworks AI | US East | ~200ms | ~70 audio min/s | ~99.9% | ~$0.008/min | $0.00/min | ~3 hr audio
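
To turn the per-minute prices above into a budget figure, multiply audio hours by 60 and by the input price (output is listed as free). The following sketch does that arithmetic using the table's listed prices. Several of them are approximate in the table, so the results are estimates, and the function names are illustrative only.

```python
# Input prices in USD per audio minute, copied from the comparison table.
# Values marked "~" in the table are approximate.
PRICE_PER_MIN = {
    "LLM API": 0.004,
    "Qwen": 0.006,
    "Alibaba Cloud": 0.007,
    "Replicate": 0.010,
    "Fireworks AI": 0.008,
}


def transcription_cost(audio_hours: float, price_per_min: float) -> float:
    """Return the estimated USD cost of transcribing `audio_hours` of audio."""
    return round(audio_hours * 60 * price_per_min, 2)


def cost_by_provider(audio_hours: float) -> dict[str, float]:
    """Return the estimated cost for every provider in the table."""
    return {
        name: transcription_cost(audio_hours, price)
        for name, price in PRICE_PER_MIN.items()
    }


if __name__ == "__main__":
    for name, cost in cost_by_provider(500).items():
        print(f"{name}: ${cost:.2f} for 500 hours of audio")
```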

Technical Specifications

Metric | Qwen3 ASR Flash | Whisper Large v3 (OpenAI API) | Deepgram Nova-2
Avg Latency (Streaming) | — | Real‑time or better on GPU (varies by provider) | —
Languages Supported | Multilingual (exact count —) | ≈99 languages | Multilingual (English + others; exact count —)
Price per Minute (Hosted API) | ~$0.0019/min | $0.006/min | $0.0043/min (pre‑recorded baseline)
Max Audio Duration per Request | — | ~25 MB per request via OpenAI Whisper-1; v3 limits vary by host | Typical API up to multi‑hour files; hard limit —
Accuracy (WER, clean English) | State‑of‑the‑art vs Whisper v3 (exact WER —) | ≈2.7% WER on clean audio | Higher accuracy than Nova and Whisper v2; vs Whisper v3 —
Model Type / Architecture | All‑in‑one ASR, non‑autoregressive alignment; Qwen3‑based | Encoder–decoder Transformer ASR | End‑to‑end neural ASR (Deepgram Nova family)
Deployment / Availability | Cloud API via Alibaba/Qwen; open weights for some variants | Open‑source weights + multiple hosted APIs | Proprietary hosted API (Deepgram cloud)
Licensing | Apache‑2.0 for open‑weight variants; commercial terms for cloud | MIT license (open weights); commercial API terms | Commercial, closed‑source
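
The accuracy row is expressed as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. A minimal sketch of that standard calculation follows, which can be useful for checking any of these models on your own audio. The lowercasing and whitespace tokenization are simplifying assumptions; published benchmarks usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER as word-level edit distance divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
    print(f"WER: {wer:.3f}")
```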

30-day usage via LLM API

2.8B audio seconds transcribed
9.4M API requests served
210K unique developer accounts
99.8% average API uptime

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

  • Unified AI Routing

    Dynamically route each request to the best model across providers based on latency, cost, and quality—no client changes, no redeploys, just smarter defaults.

    One endpoint, any model
  • Cost-Aware Controls

    Set hard budgets, price caps, and model tiers so teams can experiment freely while finance stays in control of spend across every AI provider.

    Predictable AI spend
  • Automatic Fallbacks

    Define provider and model failover rules so traffic transparently shifts on errors or outages—keeping your AI features online without manual intervention.

    Resilience by default
  • Full-Stack Observability

    Trace every request, token, error, and latency across providers with unified logs, metrics, and alerts so you can debug, tune, and prove ROI in one place.

    See every token
  • Task-Level Orchestration

    Express higher-level tasks—chat, tools, RAG, evaluation—through a single abstraction that hides provider quirks, simplifying complex AI workflows into clean, testable units.

    One API for tasks
  • High-Throughput Batch

    Submit massive batches of generations or evaluations with built-in chunking, retries, and concurrency control to saturate throughput limits without blowing up rate caps.

    Scale jobs, not code
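
To make the automatic-fallback idea concrete, here is a minimal sketch of ordered failover: try each backend in turn and return the first successful transcript. This only illustrates the concept and is not how LLM.API implements it. `transcribe_with_fallback`, `AllProvidersFailed`, and the callables passed in are hypothetical names.

```python
from typing import Callable, Sequence


class AllProvidersFailed(Exception):
    """Raised when every configured backend raised an error."""


def transcribe_with_fallback(
    audio: bytes,
    providers: Sequence[tuple[str, Callable[[bytes], str]]],
) -> tuple[str, str]:
    """Try each (name, transcribe_fn) pair in order.

    Returns (provider_name, transcript) from the first call that succeeds.
    """
    errors = []
    for name, transcribe in providers:
        try:
            return name, transcribe(audio)
        except Exception as exc:  # any failure moves on to the next backend
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


if __name__ == "__main__":
    def primary(audio: bytes) -> str:
        raise TimeoutError("simulated outage")

    def backup(audio: bytes) -> str:
        return "hello world"

    print(transcribe_with_fallback(b"...", [("primary", primary), ("backup", backup)]))
```

With a gateway that handles failover for you, this logic lives server-side and the client makes a single call.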

When to Use — When NOT to Use

Use it if...

  • You need fast, low-latency speech-to-text transcription for short utterances or commands.
  • Your use case involves real-time transcription of calls, meetings, or live streams.
  • You need an ASR model optimized for common Mandarin and English speech scenarios.
  • Your use case involves processing large volumes of audio where throughput matters more than perfection.
  • You need lightweight ASR for interactive voice features in apps, bots, or games.
  • Your use case involves quick voice notes or memos that don’t require full semantic accuracy.

Avoid if...

  • You need state-of-the-art accuracy on noisy, highly accented, or domain-specialized audio.
  • Your workload requires robust transcription across many low-resource or uncommon languages.
  • You need detailed diarization, punctuation, formatting, and rich metadata beyond basic transcripts.
  • Your workload requires complex spoken language understanding or reasoning beyond simple transcription.
  • You need precise offline transcription for legal, medical, or compliance-critical recordings.
  • Your workload requires handling very long multi-hour recordings without segmenting the audio first.
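
For recordings that do need segmenting first, one common approach is to cut the audio into fixed-length windows with a small overlap, so that words at a boundary appear in both neighbouring segments. The sketch below computes only the window boundaries. The 180-second window and 2-second overlap are illustrative assumptions, not documented limits of Qwen3 ASR Flash, and `segment_bounds` is a hypothetical helper.

```python
def segment_bounds(
    total_seconds: float,
    max_segment: float = 180.0,  # assumed window length, not a documented limit
    overlap: float = 2.0,        # assumed overlap between neighbouring windows
) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole recording."""
    if not 0 <= overlap < max_segment:
        raise ValueError("overlap must be >= 0 and smaller than max_segment")
    total = float(total_seconds)
    bounds: list[tuple[float, float]] = []
    start = 0.0
    step = max_segment - overlap
    while start < total:
        end = min(start + max_segment, total)
        bounds.append((start, end))
        if end >= total:
            break  # the final window reached the end of the recording
        start += step
    return bounds


if __name__ == "__main__":
    for start, end in segment_bounds(400):
        print(f"{start:.0f}s to {end:.0f}s")
```

Each window can then be cut from the source file and transcribed separately. Merging the overlapping text at the seams is left to the caller.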

Frequently Asked Questions

  • What is Qwen3 ASR Flash?

    Qwen3 ASR Flash is a fast automatic speech recognition model by Qwen optimized for low-latency transcription via API.

  • What modalities does Qwen3 ASR Flash support?

    Qwen3 ASR Flash accepts audio as input and outputs text transcripts.

  • How does Qwen3 ASR Flash compare to other Qwen ASR or general-purpose models?

    Qwen3 ASR Flash prioritizes speed and low cost over maximum accuracy or advanced language understanding found in larger general-purpose Qwen models.

  • What is the context window or maximum audio length Qwen3 ASR Flash can handle?

    Qwen3 ASR Flash supports long-form audio segments, but you should chunk very long recordings client-side to manage latency and partial failures.

  • Is Qwen3 ASR Flash suitable for real-time or streaming transcription?

    Yes, Qwen3 ASR Flash is designed for low-latency use cases like real-time or near real-time transcription where speed is critical.

  • What are the main limitations of Qwen3 ASR Flash?

    Qwen3 ASR Flash may struggle with heavy background noise, very low-resource languages, domain-specific jargon, or tasks requiring deep semantic understanding beyond transcription.

  • How is Qwen3 ASR Flash priced when accessed through LLM.API?

    LLM.API exposes Qwen3 ASR Flash with usage-based pricing per audio duration; check the LLM.API pricing page for the latest exact rates.

  • How fast is Qwen3 ASR Flash on LLM.API?

    Qwen3 ASR Flash is tuned for high throughput and low latency, typically returning transcripts much faster than the input audio duration.

  • How do I call Qwen3 ASR Flash through the LLM.API gateway?

    You specify the provider as Qwen and the model name as Qwen3 ASR Flash in your LLM.API request, sending audio content in the supported format.

  • Does Qwen3 ASR Flash support multiple languages?

    Qwen3 ASR Flash supports multilingual transcription, but accuracy varies by language and is generally best for its highest-resource languages.
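
The answer above on calling the model through the gateway (provider set to Qwen, model set to Qwen3 ASR Flash, audio sent in a supported format) can be sketched as a request body. Every field name and identifier here is an assumption made for illustration, including `"provider"`, `"model"`, `"audio"`, the value `"qwen3-asr-flash"`, and base64 encoding of the audio. The real schema and endpoint should be taken from the LLM.API documentation.

```python
import base64
import json


def build_transcription_request(audio: bytes, audio_format: str = "wav") -> str:
    """Build a JSON request body for a transcription call.

    All keys and identifier values are hypothetical placeholders.
    """
    payload = {
        "provider": "qwen",          # assumed provider identifier
        "model": "qwen3-asr-flash",  # assumed model identifier
        "audio": {
            "format": audio_format,  # assumed field for the container format
            "data": base64.b64encode(audio).decode("ascii"),
        },
    }
    return json.dumps(payload)


if __name__ == "__main__":
    body = build_transcription_request(b"RIFF....WAVE")
    print(body[:80], "...")
```

The resulting string would be sent as the body of an authenticated HTTP POST. The URL and headers are omitted because the source does not state them.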
