Voxtral Mini Transcribe

Speech-to-Text

Voxtral Mini Transcribe is a speech-to-text model from Mistral focused on lightweight, efficient audio transcription. It is designed to provide accurate transcriptions while being small and fast enough for resource-constrained environments.

API Performance

Latency: ~0.8s time to first token
Context: ~128K tokens
Input: ~$0.15 per 1M tokens
Output: ~$0.15 per 1M tokens
Uptime: 99%

About the model

What is Voxtral Mini Transcribe?

Voxtral Mini Transcribe is a compact automatic speech recognition (ASR) model by Mistral for converting spoken audio into text. It is mainly used for real-time or near real-time transcription of voice recordings, calls, and meetings. It is also suitable for integrating speech input into applications where low latency and low computational overhead are important. It belongs to Mistral’s Voxtral family of ASR models, which are optimized for practical deployment and efficiency.

Input / Output

Input

Audio (speech, voice recordings) via transcription API

Output

Transcribed text from audio input

Model capabilities

5 Core Capabilities

Speech Transcription

Converts spoken audio into accurate text, supporting various speakers and recording conditions for transcription and note-taking use cases.

Real-Time Transcription

Processes streaming audio input to produce near real-time text transcripts suitable for live captions and interactive applications.

Multilingual Transcription

Transcribes speech in multiple supported languages, enabling cross-lingual audio processing and global applications that require language-aware transcription.

Dialogue-Oriented Output

Produces structured, readable transcripts suitable for conversational contexts, meetings, and interviews, preserving speaker turns when available.

Audio-to-Text Alignment

Generates text closely aligned with input audio segments, facilitating downstream search, navigation, and timestamp-based audio indexing.

Use cases

6 Most Valuable Use Cases

Meeting transcription
Customer call analysis
Legal deposition transcripts
Live webinar captioning
Voice note processing
Podcast batch transcription

Transparent pricing

Cost Comparison

In the comparison below, LLM API lists the lowest per-minute price and the highest stated uptime for Voxtral-class transcription.

Provider	Region	Latency	Throughput	Uptime	Input ($/min)	Output ($/min)	Context
LLM API	Global	~180ms	~120 audio min/min	99.99%	$0.004/min	$0.000/min	~120 min audio
Mistral	EU West	~220ms	~80 audio min/min	~99.9%	~$0.006/min	$0.000/min	~60 min audio
OpenAI	Global	~250ms	~90 audio min/min	~99.9%	~$0.006/min	$0.000/min	~60 min audio
Azure AI	Global	~260ms	~70 audio min/min	~99.9%	~$0.007/min	$0.000/min	~60 min audio
Google Cloud	Global	~240ms	~75 audio min/min	~99.9%	~$0.007/min	$0.000/min	~60 min audio

Performance benchmarks

Technical Specifications

Metric	Voxtral Mini Transcribe	OpenAI Whisper v3 Small	Google Speech-to-Text v2
Avg Latency	~350ms	~400ms	~450ms
Languages Supported	~100	~100	~70
Price per Minute	$0.006	$0.006	$0.009
Max Duration	2h	12h	3h
Word Error Rate (WER, lower is better)	~7%	~6%	~8%
Uptime	99.9%	99.9%	99.9%
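The WER row above is a word error rate: the number of word-level substitutions, deletions, and insertions needed to turn the model's transcript into a reference transcript, divided by the number of words in the reference. The sketch below shows one common way to compute it with a word-level edit distance. The function name and the lowercase, whitespace-split normalization are assumptions for illustration; the page does not say how its WER figures were measured.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference word count."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick brown box"))  # prints 0.25
```

Under this definition, a WER of about 7% corresponds to roughly 7 word errors per 100 reference words.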

30-day usage via LLM API

620M: Audio seconds transcribed (last 30 days)
9.4M: Transcription API requests (last 30 days)
4.7M: Unique speakers detected (last 30 days)
99.8%: Avg API uptime (last 30 days)

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Intelligent Model Routing

Dynamically route each request to the best model across providers based on latency, cost, and capability, with no client changes required.
One endpoint, many models

Cost-Aware Orchestration

Automatically balance quality and spend with fine-grained controls, price-aware routing, and per-project limits so teams ship fast without surprise cloud bills.
Control spend, not speed

Resilient Fallback Flows

Define provider and model fallback chains that trigger on errors, timeouts, or degraded quality so your AI features stay online even when vendors don’t.
Fail soft, stay online

Full-Stack Observability

Trace every call across providers with unified logs, metrics, and structured traces so you can debug latency spikes and failures in minutes, not days.
See every token hop

Task-Level Abstractions

Describe work as tasks (chat, tools, RAG, agents) and let LLM.API pick the right models, prompts, and configs so you focus on product, not plumbing.
Program tasks, not models

High-Throughput Batch Jobs

Run large-scale generations, evaluations, and data labeling as batched jobs with built-in retries, concurrency controls, and cost tracking from a single API.
Scale to millions of calls
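
The fallback chains described under Resilient Fallback Flows can be pictured with a small client-side sketch: try each provider in order and move to the next one when a call times out or the connection fails. This is only an illustration of the idea, not LLM.API's configuration format or implementation. The names `Provider`, `AllProvidersFailed`, and `transcribe_with_fallback` are hypothetical, and the two providers are stubs that make no network calls.

```python
from typing import Callable, List, Sequence, Tuple

# (provider name, function that takes audio bytes and returns a transcript)
Provider = Tuple[str, Callable[[bytes], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def transcribe_with_fallback(audio: bytes, chain: Sequence[Provider]) -> Tuple[str, str]:
    """Try each provider in order; return (provider name, transcript)."""
    failures: List[str] = []
    for name, transcribe in chain:
        try:
            return name, transcribe(audio)
        except (TimeoutError, ConnectionError) as exc:
            # Recoverable failure: record it and try the next provider.
            failures.append(f"{name}: {exc!r}")
    raise AllProvidersFailed("; ".join(failures) or "empty fallback chain")


# Stub providers standing in for real transcription calls.
def flaky_primary(audio: bytes) -> str:
    raise TimeoutError("primary timed out")


def stub_backup(audio: bytes) -> str:
    return "hello world"


chain = [("primary", flaky_primary), ("backup", stub_backup)]
print(transcribe_with_fallback(b"fake-audio", chain))  # prints ('backup', 'hello world')
```

Only timeouts and connection errors trigger the fallback here; a "degraded quality" trigger would need a quality signal that this sketch does not model.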

Decision guide

When to Use — When NOT to Use

Use it if...

You need fast, lightweight speech-to-text transcription for short audio clips or calls.
Your use case involves batch transcribing many short recordings with tight cost constraints.
You need a compact transcription model suitable for on-device or edge deployments.
Your use case involves generating transcripts primarily for search, indexing, or logging.
You need a simple ASR component to feed downstream NLP or analytics pipelines.
Your use case involves prototyping speech features without requiring a large general LLM.

Avoid if...

You need complex language understanding, summarization, or reasoning beyond basic transcription output.
Your workload requires state-of-the-art accuracy on noisy, multilingual, or domain-specific audio.
You need robust diarization, speaker attribution, or advanced audio segmentation features.
Your workload requires rich text generation, chat, or code understanding capabilities.
You need guaranteed, enterprise-grade SLAs and compliance for sensitive regulated speech data.
Your workload requires real-time, ultra-low-latency streaming transcription at massive global scale.

FAQ

Frequently Asked Questions

What is Voxtral Mini Transcribe?

Voxtral Mini Transcribe is a speech-to-text model by Mistral optimized for fast, low-cost audio transcription via the LLM.API gateway.

What modalities does Voxtral Mini Transcribe support?

Voxtral Mini Transcribe supports audio input and returns transcribed text output; it does not process images, video, or arbitrary text prompts directly.

How do I access Voxtral Mini Transcribe through LLM.API?

You call the unified LLM.API endpoint with the model name 'mistral:voxtral-mini-transcribe' and provide your audio data and parameters in the request body.
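
As a rough sketch of what such a request could look like, the snippet below builds (but does not send) an HTTP request carrying the model name and the audio data. Only the model name comes from the answer above. The endpoint URL is a placeholder, and the JSON field names, base64 audio encoding, and bearer-token header are assumptions; check the LLM.API documentation for the actual request format.

```python
import base64
import json
import urllib.request
from typing import Optional

MODEL_NAME = "mistral:voxtral-mini-transcribe"  # model name given in the FAQ
# Placeholder URL: NOT the real LLM.API endpoint.
ENDPOINT = "https://api.example.com/v1/transcriptions"


def build_transcription_request(
    audio: bytes, api_key: str, language: Optional[str] = None
) -> urllib.request.Request:
    """Build a POST request; field names and encoding are assumed."""
    body = {
        "model": MODEL_NAME,
        # Assumption: audio is sent as base64 text inside a JSON body.
        "audio": base64.b64encode(audio).decode("ascii"),
    }
    if language is not None:
        body["language"] = language  # hypothetical optional parameter
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_transcription_request(b"\x00\x01", "YOUR_API_KEY", language="en")
print(request.get_method(), request.full_url)
```

Sending the request would be a separate step (for example with `urllib.request.urlopen(request)`) once the real endpoint and body format are known.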
What is Voxtral Mini Transcribe best suited for?

Voxtral Mini Transcribe is best for real-time or batch transcription of spoken content such as meetings, calls, podcasts, and voice notes.

What is the typical latency of Voxtral Mini Transcribe on LLM.API?

Typical end-to-end latency is a few seconds for short audio clips, depending on audio length, network conditions, and your region.

What context window or duration limits apply to Voxtral Mini Transcribe?

Voxtral Mini Transcribe is limited by maximum audio duration per request, so long recordings should be chunked into smaller segments client-side.
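
One simple way to plan that client-side chunking is to compute segment boundaries up front and then cut the audio at those offsets with whatever audio tool you already use. The helper below is a minimal sketch: `plan_chunks` and its parameters are hypothetical names, and the optional overlap between neighbouring segments (to avoid cutting a word in half) is an assumption rather than something the service requires.

```python
from typing import List, Tuple


def plan_chunks(
    total_seconds: float, max_chunk_seconds: float, overlap_seconds: float = 0.0
) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds covering the whole recording."""
    if max_chunk_seconds <= 0:
        raise ValueError("max_chunk_seconds must be positive")
    if not 0 <= overlap_seconds < max_chunk_seconds:
        raise ValueError("overlap_seconds must be >= 0 and shorter than a chunk")

    chunks: List[Tuple[float, float]] = []
    step = max_chunk_seconds - overlap_seconds  # how far each chunk advances
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break  # the last chunk reached the end of the recording
        start += step
    return chunks


# A 150-second recording split into chunks of at most 60 seconds.
print(plan_chunks(150, 60))  # prints [(0.0, 60.0), (60.0, 120.0), (120.0, 150)]
```

With an overlap, the transcripts of neighbouring chunks share some words at the boundary, so they need de-duplication when stitched back together.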
How is pricing for Voxtral Mini Transcribe handled on LLM.API?

Voxtral Mini Transcribe is billed per unit of processed audio, with exact per-minute or per-second rates defined in the LLM.API pricing page.
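
For a back-of-the-envelope estimate, per-minute billing reduces to audio duration times the rate. The sketch below uses $0.006 per minute, the figure listed for Voxtral Mini Transcribe in the Technical Specifications table, purely as an illustrative rate. The function name and the optional round-up-to-whole-minutes behaviour are assumptions; the actual rate and rounding rules are whatever the pricing page defines.

```python
import math


def estimate_cost(
    audio_seconds: float, rate_per_minute: float, round_up_to_minute: bool = False
) -> float:
    """Estimate the transcription cost in dollars for one recording."""
    minutes = audio_seconds / 60
    if round_up_to_minute:
        # Assumed billing variant: partial minutes are charged as full minutes.
        minutes = math.ceil(minutes)
    return minutes * rate_per_minute


# One hour of audio at an illustrative $0.006 per minute.
print(f"${estimate_cost(3600, 0.006):.2f}")  # prints $0.36
```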
How does Voxtral Mini Transcribe compare to larger transcription models?

Compared to larger models, Voxtral Mini Transcribe generally offers lower cost and latency at the expense of slightly lower accuracy on challenging audio.

Does Voxtral Mini Transcribe support streaming transcription on LLM.API?

If enabled by LLM.API, you can stream audio chunks to Voxtral Mini Transcribe and receive partial transcripts incrementally.

What limitations should I be aware of when using Voxtral Mini Transcribe?

Voxtral Mini Transcribe may struggle with heavy background noise, strong accents, overlapping speakers, low-bitrate audio, or domain-specific jargon without adaptation.

