Powered by Mistral
Voxtral Mini Transcribe
- Speech-to-Text
Voxtral Mini Transcribe is a speech-to-text model from Mistral focused on lightweight, efficient audio transcription. It is designed to provide accurate transcriptions while being small and fast enough for resource-constrained environments.
About the model
What is Voxtral Mini Transcribe?
Voxtral Mini Transcribe is a compact automatic speech recognition (ASR) model by Mistral for converting spoken audio into text. It is mainly used for real-time or near real-time transcription of voice recordings, calls, and meetings. It is also suitable for integrating speech input into applications where low latency and low computational overhead are important. It belongs to Mistral’s Voxtral family of ASR models, which are optimized for practical deployment and efficiency.
Model capabilities
5 Core Capabilities
- Speech Transcription
  Converts spoken audio into accurate text, supporting various speakers and recording conditions for transcription and note-taking use cases.
- Real-Time Transcription
  Processes streaming audio input to produce near real-time text transcripts suitable for live captions and interactive applications.
- Multilingual Transcription
  Transcribes speech in multiple supported languages, enabling cross-lingual audio processing and global applications that require language-aware transcription.
- Dialogue-Oriented Output
  Produces structured, readable transcripts suitable for conversational contexts, meetings, and interviews, preserving speaker turns when available.
- Audio-to-Text Alignment
  Generates text closely aligned with input audio segments, facilitating downstream search, navigation, and timestamp-based audio indexing.
Use cases
6 Most Valuable Use Cases
- Meeting transcription
- Customer call analysis
- Legal deposition transcripts
- Live webinar captioning
- Voice note processing
- Podcast batch transcription
Transparent pricing
Cost Comparison
In the comparison below, LLM.API lists the lowest per-minute price and the highest uptime for Voxtral-class transcription.
| Provider | Region | Latency | Throughput | Uptime | Input ($/min) | Output ($/min) | Context |
|---|---|---|---|---|---|---|---|
| LLM.API (best) | Global | ~180ms | ~120 audio min/min | 99.99% | $0.004/min | $0.000/min | ~120 min audio |
| Mistral | EU West | ~220ms | ~80 audio min/min | ~99.9% | ~$0.006/min | $0.000/min | ~60 min audio |
| OpenAI | Global | ~250ms | ~90 audio min/min | ~99.9% | ~$0.006/min | $0.000/min | ~60 min audio |
| Azure AI | Global | ~260ms | ~70 audio min/min | ~99.9% | ~$0.007/min | $0.000/min | ~60 min audio |
| Google Cloud | Global | ~240ms | ~75 audio min/min | ~99.9% | ~$0.007/min | $0.000/min | ~60 min audio |
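To make the per-minute rates concrete, here is a minimal sketch that estimates input cost for a given volume of audio. The prices are copied from the comparison table (those marked with `~` are approximate), and the function name `estimated_cost` is only illustrative. Output is listed as $0.000/min for every provider, so only input minutes are counted.

```python
# Per-minute input prices in USD, copied from the comparison table.
# Values shown with "~" in the table are approximate.
PRICE_PER_MIN = {
    "LLM.API": 0.004,
    "Mistral": 0.006,
    "OpenAI": 0.006,
    "Azure AI": 0.007,
    "Google Cloud": 0.007,
}


def estimated_cost(audio_minutes: float, provider: str) -> float:
    """Return the estimated input cost in USD, rounded to cents."""
    if audio_minutes < 0:
        raise ValueError("audio_minutes must be non-negative")
    return round(audio_minutes * PRICE_PER_MIN[provider], 2)


if __name__ == "__main__":
    # Example: 10,000 minutes of audio per month on each provider.
    for provider in PRICE_PER_MIN:
        print(f"{provider}: ${estimated_cost(10_000, provider):.2f}")
```

Actual bills depend on the current pricing page and on how each provider rounds audio duration, which this sketch does not model.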
Performance benchmarks
Technical Specifications
| Metric | Voxtral Mini Transcribe | OpenAI Whisper v3 Small | Google Speech-to-Text v2 |
|---|---|---|---|
| Avg Latency | ~350ms | ~400ms | ~450ms |
| Languages Supported | ~100 | ~100 | ~70 |
| Price per Minute | $0.006 | $0.006 | $0.009 |
| Max Duration | 2h | 12h | 3h |
| Word error rate (WER, lower is better) | ~7% | ~6% | ~8% |
| Uptime | 99.9% | 99.9% | 99.9% |
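The table reports accuracy as word error rate (WER) but does not say how it was measured. As a reference point, the sketch below computes WER by the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The lowercasing and whitespace tokenization are simplifying assumptions; real evaluations usually apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split is enough normalization here.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and hyp[:j]; we keep only one previous row to save memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, a transcript that drops one word from a four-word reference has a WER of 0.25, so a WER of about 7% corresponds to roughly 7 word errors per 100 reference words.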
30-day usage via LLM.API
- 620M audio seconds transcribed (last 30 days)
- 9.4M transcription API requests (last 30 days)
- 4.7M unique speakers detected (last 30 days)
- 99.8% avg API uptime (last 30 days)
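A quick back-of-the-envelope calculation shows what these figures imply per request. The sketch below uses only the rounded numbers from the list, so the results are approximate; the variable names are illustrative.

```python
# Rounded 30-day figures from the usage list.
audio_seconds = 620_000_000
api_requests = 9_400_000

# Average audio length handled per request, in seconds.
avg_seconds_per_request = audio_seconds / api_requests

# Total audio volume expressed in hours.
total_audio_hours = audio_seconds / 3600

if __name__ == "__main__":
    print(f"Average audio per request: {avg_seconds_per_request:.1f} s")
    print(f"Total audio transcribed: {total_audio_hours:,.0f} h")
```

The average works out to roughly one minute of audio per request, which is consistent with a workload dominated by short clips and calls.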
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
- Intelligent Model Routing
  Dynamically route each request to the best model across providers based on latency, cost, and capability, with no client changes required.
  One endpoint, many models
- Cost-Aware Orchestration
  Automatically balance quality and spend with fine-grained controls, price-aware routing, and per-project limits so teams ship fast without surprise cloud bills.
  Control spend, not speed
- Resilient Fallback Flows
  Define provider and model fallback chains that trigger on errors, timeouts, or degraded quality so your AI features stay online even when vendors don’t.
  Fail soft, stay online
- Full-Stack Observability
  Trace every call across providers with unified logs, metrics, and structured traces so you can debug latency spikes and failures in minutes, not days.
  See every token hop
- Task-Level Abstractions
  Describe work as tasks (chat, tools, RAG, agents) and let LLM.API pick the right models, prompts, and configs so you focus on product, not plumbing.
  Program tasks, not models
- High-Throughput Batch Jobs
  Run large-scale generations, evaluations, and data labeling as batched jobs with built-in retries, concurrency controls, and cost tracking from a single API.
  Scale to millions of calls
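The fallback idea can be illustrated with a small client-side sketch. This is not LLM.API's implementation or configuration format, which the page does not describe; it only shows the general pattern of trying providers in order and moving on when one raises an error or times out. The provider functions here are hypothetical stubs.

```python
from typing import Callable, Sequence


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has failed."""


def transcribe_with_fallback(
    audio: bytes,
    chain: Sequence[tuple[str, Callable[[bytes], str]]],
) -> tuple[str, str]:
    """Try each (name, transcribe) pair in order; return the first success."""
    errors = []
    for name, transcribe in chain:
        try:
            return name, transcribe(audio)
        except Exception as exc:  # errors and timeouts both trigger fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stand-ins for real provider calls.
def flaky_primary(audio: bytes) -> str:
    raise TimeoutError("primary timed out")


def stable_backup(audio: bytes) -> str:
    return f"<transcript of {len(audio)} bytes>"


if __name__ == "__main__":
    chain = [("primary", flaky_primary), ("backup", stable_backup)]
    print(transcribe_with_fallback(b"abc", chain))
```

A gateway that handles this server-side removes the need for such logic in each client, which is the benefit the list item describes.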
Decision guide
When to Use — When NOT to Use
Use it if...
- You need fast, lightweight speech-to-text transcription for short audio clips or calls.
- Your use case involves batch transcribing many short recordings with tight cost constraints.
- You need a compact transcription model suitable for on-device or edge deployments.
- Your use case involves generating transcripts primarily for search, indexing, or logging.
- You need a simple ASR component to feed downstream NLP or analytics pipelines.
- Your use case involves prototyping speech features without requiring a large general LLM.
Avoid if...
- You need complex language understanding, summarization, or reasoning beyond basic transcription output.
- Your workload requires state-of-the-art accuracy on noisy, multilingual, or domain-specific audio.
- You need robust diarization, speaker attribution, or advanced audio segmentation features.
- Your workload requires rich text generation, chat, or code understanding capabilities.
- You need guaranteed, enterprise-grade SLAs and compliance for sensitive regulated speech data.
- Your workload requires real-time, ultra-low-latency streaming transcription at massive global scale.
FAQ
Frequently Asked Questions
- What is Voxtral Mini Transcribe?
  Voxtral Mini Transcribe is a speech-to-text model by Mistral optimized for fast, low-cost audio transcription via the LLM.API gateway.
- What modalities does Voxtral Mini Transcribe support?
  Voxtral Mini Transcribe supports audio input and returns transcribed text output; it does not process images, video, or arbitrary text prompts directly.
- How do I access Voxtral Mini Transcribe through LLM.API?
  You call the unified LLM.API endpoint with the model name 'mistral:voxtral-mini-transcribe' and provide your audio data and parameters in the request body.
- What is Voxtral Mini Transcribe best suited for?
  Voxtral Mini Transcribe is best for real-time or batch transcription of spoken content such as meetings, calls, podcasts, and voice notes.
- What is the typical latency of Voxtral Mini Transcribe on LLM.API?
  Typical end-to-end latency is a few seconds for short audio clips, depending on audio length, network conditions, and your region.
- What context window or duration limits apply to Voxtral Mini Transcribe?
  Voxtral Mini Transcribe is limited by maximum audio duration per request, so long recordings should be chunked into smaller segments client-side.
- How is pricing for Voxtral Mini Transcribe handled on LLM.API?
  Voxtral Mini Transcribe is billed per unit of processed audio, with exact per-minute or per-second rates defined in the LLM.API pricing page.
- How does Voxtral Mini Transcribe compare to larger transcription models?
  Compared to larger models, Voxtral Mini Transcribe generally offers lower cost and latency at the expense of slightly lower accuracy on challenging audio.
- Does Voxtral Mini Transcribe support streaming transcription on LLM.API?
  If enabled by LLM.API, you can stream audio chunks to Voxtral Mini Transcribe and receive partial transcripts incrementally.
- What limitations should I be aware of when using Voxtral Mini Transcribe?
  Voxtral Mini Transcribe may struggle with heavy background noise, strong accents, overlapping speakers, low-bitrate audio, or domain-specific jargon without adaptation.
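Two of the answers in this FAQ, chunking long recordings client-side and passing the model name in the request body, can be sketched together. The model name comes from the FAQ; everything else is an assumption for illustration: the body field names (`audio`, `start_s`, `end_s`), the base64 encoding, and the 60-second chunk length are hypothetical, and no endpoint URL is shown because the page does not give one. Check the LLM.API documentation for the real request schema and duration limit.

```python
import base64

MODEL_NAME = "mistral:voxtral-mini-transcribe"  # model name given in the FAQ


def plan_chunks(total_s: float, max_chunk_s: float, overlap_s: float = 0.0):
    """Return (start, end) second offsets covering a recording of total_s."""
    if max_chunk_s <= 0 or not 0 <= overlap_s < max_chunk_s:
        raise ValueError("need max_chunk_s > 0 and 0 <= overlap_s < max_chunk_s")
    chunks = []
    start = 0.0
    while start < total_s:
        end = min(start + max_chunk_s, total_s)
        chunks.append((start, end))
        if end >= total_s:
            break
        # A small overlap reduces the chance of cutting a word in half.
        start = end - overlap_s
    return chunks


def build_request_body(audio: bytes, start_s: float, end_s: float) -> dict:
    """Build a JSON-serializable body; field names are hypothetical."""
    return {
        "model": MODEL_NAME,
        "audio": base64.b64encode(audio).decode("ascii"),
        "start_s": start_s,
        "end_s": end_s,
    }


if __name__ == "__main__":
    # Plan a 150-second recording as chunks of at most 60 seconds.
    for start, end in plan_chunks(150, 60):
        body = build_request_body(b"<audio bytes for this chunk>", start, end)
        print(body["model"], start, end)
```

The sketch only plans offsets and builds request bodies. Slicing the actual audio at those offsets requires an audio library, and the resulting transcripts need to be stitched back together in order, taking care not to duplicate words in any overlapping region.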
