Powered by Mistral
Voxtral Mini TTS
- Text-to-Speech
Voxtral Mini TTS is Mistral’s 4B-parameter text-to-speech model that generates expressive, low-latency speech and supports multilingual, zero-shot voice cloning. It is available via the Mistral API and as open weights for self-hosting.
About the model
What is Voxtral Mini TTS?
Voxtral Mini TTS is a 4B-parameter text-to-speech model from Mistral that converts text into natural, expressive speech with multilingual support and voice cloning from very short audio samples. It is mainly used to build voice agents and assistants that respond in real time with low-latency audio, and to generate high-quality synthetic voices for applications like content narration, product voices, and accessibility tools. It also serves use cases that require cloning or reusing consistent speaker identities across many utterances, such as branded voice experiences and character dialogue. The model is part of Mistral’s Voxtral audio family, alongside Voxtral Mini and Voxtral Small transcription and audio-understanding models.
Model capabilities
5 Core Capabilities
-
Text-to-Speech
Generates natural-sounding speech audio from written text, suitable for dialogue, narration, and interface responses.
-
Conversational Output
Produces clear, responsive speech for interactive assistants, suited to voicing the output of conversational AI systems.
-
Multilingual Speech
Supports speech generation in multiple languages, allowing applications to vocalize content for diverse linguistic audiences and use cases.
-
Screen Reader Compatibility
Can power screen readers or accessibility tools by converting on-screen text into intelligible, continuous spoken audio output.
-
Media Content Voice
Provides synthesized voices for videos, podcasts, or interactive media, enabling scalable voiceover creation without human recording sessions.
Use cases
6 Most Valuable Use Cases
- Voice App Prototyping
- Customer Support Prompts
- Accessibility Voice Output
- Interactive Voice Demos
- Spoken Content Previews
- Educational Voice Feedback
Transparent pricing
Cost Comparison
Up to ~70% cheaper than comparable TTS APIs, with lower latency
| Provider | Region | Latency | Throughput | Uptime | Input price | Output price | Max audio length |
|---|---|---|---|---|---|---|---|
| LLM.API | Global | 80ms | 120 req/s | 99.99% | $0.004/min | $0.004/min | ~15 min audio |
| Mistral | EU West | ~140ms | ~45 req/s | ~99.9% | ~$0.010/min | ~$0.010/min | ~10 min audio |
| OpenAI | Global | ~150ms | ~60 req/s | 99.9% | ~$0.015/min | ~$0.015/min | ~15 min audio |
| Azure AI Speech | Global | ~180ms | ~80 req/s | 99.9% | ~$0.016/min | ~$0.016/min | ~10 min audio |
| Google Cloud Text-to-Speech | Global | ~170ms | ~70 req/s | 99.9% | ~$0.014/min | ~$0.014/min | ~10 min audio |
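The headline savings figure can be checked against the table with a little arithmetic. The sketch below copies the approximate per-minute prices from the rows above and computes how much cheaper the LLM.API row is than each other provider. The function names are illustrative only, and the prices are the page's approximations, not quoted rates.

```python
# Approximate per-minute prices copied from the cost comparison table.
PRICE_PER_MIN = {
    "LLM.API": 0.004,
    "Mistral": 0.010,
    "OpenAI": 0.015,
    "Azure AI Speech": 0.016,
    "Google Cloud Text-to-Speech": 0.014,
}


def savings_vs(baseline: str, other: str) -> float:
    """Return the percentage by which `baseline` is cheaper than `other`."""
    base, ref = PRICE_PER_MIN[baseline], PRICE_PER_MIN[other]
    return (1 - base / ref) * 100


def monthly_cost(provider: str, minutes: float) -> float:
    """Return the cost in dollars of synthesizing `minutes` of audio."""
    return PRICE_PER_MIN[provider] * minutes


if __name__ == "__main__":
    for name in PRICE_PER_MIN:
        if name != "LLM.API":
            pct = savings_vs("LLM.API", name)
            print(f"{name}: LLM.API is about {pct:.0f}% cheaper")
```

With these figures the difference ranges from about 60% (against Mistral direct) to about 75% (against Azure AI Speech), which brackets the "up to ~70%" headline.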
Performance benchmarks
Technical Specifications
| Metric | Voxtral Mini TTS | OpenAI gpt-4o-mini TTS | Google Chirp TTS (small) |
|---|---|---|---|
| Avg Latency | ~180ms | ~200ms | ~220ms |
| Languages Supported | ~25 | ~30 | ~20 |
| Price per 1M chars | ~$0.70 | ~$1.00 | ~$0.80 |
| Max Input Length | ~4K chars | ~8K chars | ~5K chars |
| Sample Rate | 24 kHz | 24 kHz | 22.05 kHz |
| Voices / Styles | ~20 | ~30 | ~15 |
| Uptime | 99.9% | 99.9% | 99.5% |
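Because the table lists a maximum input of roughly 4K characters, longer texts need to be split before synthesis. The following is a minimal sketch of one way to do that, packing whole sentences into chunks under the limit and estimating cost from the listed price per 1M characters. The limit and price are the table's approximate values, and the helper names are hypothetical; nothing here reflects how LLM.API or Mistral handle long inputs internally.

```python
import re

MAX_CHARS = 4000          # "~4K chars" from the table; an approximate limit
PRICE_PER_M_CHARS = 0.70  # "~$0.70" per 1M characters from the table


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself over the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def estimate_cost(text: str) -> float:
    """Estimate the synthesis cost in dollars at the listed per-character rate."""
    return len(text) / 1_000_000 * PRICE_PER_M_CHARS
```

Splitting at sentence boundaries keeps each request self-contained, which may also help with the quality degradation on very long inputs that the FAQ mentions.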
30-day usage via LLM.API
- 620M characters synthesized in the last 30 days
- 3.4M TTS API requests served
- 210K unique developer projects using Voxtral Mini TTS
- 99.96% average API uptime
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
-
Unified AI Routing
Automatically route each request to the best-fit model across providers based on cost, latency, or quality, without changing your code or client integration.
One endpoint, every model.
-
Cost-Aware Orchestration
Define cost ceilings and model preferences, then let LLM.API optimize per-call spend so you can scale usage without surprise bills or manual tuning.
More usage, less spend.
-
Automatic Fallbacks
When a provider times out, errors, or rate-limits, LLM.API seamlessly retries on backup models so your production flows stay reliable and resilient.
No single point of failure.
-
Deep Observability
Get unified logs, metrics, traces, and payload samples across all models and providers, making debugging, performance tuning, and governance radically simpler.
See every token, everywhere.
-
Task-Level Abstractions
Describe tasks like chat, generation, tools, or RAG once and let LLM.API translate them into provider-specific calls, so you avoid brittle model-specific code.
Code to tasks, not models.
-
High-Throughput Batch
Send massive batches of prompts through a single API call, with automatic chunking, retries, and concurrency controls to maximize throughput across providers.
Process thousands in one go.
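To make the routing and fallback ideas above concrete, here is a minimal client-side sketch of the general pattern: filter providers by a latency ceiling, order them by cost, and try each in turn until one succeeds. It is not LLM.API's implementation. The `Provider` class, its fields, and the `synthesize` callables are all hypothetical stand-ins for real provider clients.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Provider:
    name: str
    cost_per_min: float                 # dollars per minute of audio
    latency_ms: float                   # typical latency in milliseconds
    synthesize: Callable[[str], bytes]  # hypothetical client callable


def rank(providers: list[Provider], max_latency_ms: float) -> list[Provider]:
    """Keep providers under the latency ceiling, cheapest first."""
    eligible = [p for p in providers if p.latency_ms <= max_latency_ms]
    return sorted(eligible, key=lambda p: p.cost_per_min)


def synthesize_with_fallback(
    text: str, providers: list[Provider], max_latency_ms: float = 200.0
) -> tuple[str, bytes]:
    """Try each eligible provider in cost order; return the first success."""
    errors = []
    for provider in rank(providers, max_latency_ms):
        try:
            return provider.name, provider.synthesize(text)
        except Exception as exc:  # e.g. timeouts, rate limits, server errors
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

A managed router would typically add retry budgets, backoff, and health tracking on top of this; the sketch only shows the ordering and fallback logic.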
Decision guide
When to Use — When NOT to Use
Use it if...
- You need lightweight text-to-speech for applications where a compact model is sufficient.
- You need TTS integrated into an existing Mistral-based stack for simpler deployment.
- You are prototyping speech features and do not need enterprise-grade voice quality.
- You are cost-sensitive and a smaller speech model is sufficient for your workload.
- You need basic voice output for chatbots, assistants, or simple narration tasks.
Avoid if...
- You need state-of-the-art naturalness and expressiveness on par with premium commercial TTS.
- Your workload requires highly controllable prosody, emotions, and detailed voice style parameters.
- You need robust multilingual coverage and accents beyond what Mistral explicitly supports.
- Your workload requires ultra-high-fidelity audio for production media, film, or advertising.
- You need mature, battle-tested TTS with extensive tooling, ecosystem, and vendor guarantees.
FAQ
Frequently Asked Questions
-
What is Voxtral Mini TTS?
Voxtral Mini TTS is a Mistral text-to-speech model focused on fast, lightweight voice synthesis for applications that need low-latency audio generation.
-
What is Voxtral Mini TTS best suited for?
It is best for real-time or near real-time speech generation in interactive apps, voice assistants, and low-resource environments.
-
How is Voxtral Mini TTS priced when used through LLM.API?
Pricing is usage-based. The tables on this page give approximate rates per minute of audio and per 1M characters; the exact rates are defined in the LLM.API model pricing table.
-
What context window or input length limits does Voxtral Mini TTS have?
The model accepts short to moderate text prompts suitable for speech synthesis. The specifications table on this page lists a maximum input of roughly 4K characters, with exact limits determined by LLM.API configuration.
-
How fast is Voxtral Mini TTS in terms of latency?
Voxtral Mini TTS is optimized for low latency. The specifications table on this page lists an average of roughly 180 ms, which is fast enough for responsive interactive applications.
-
What modalities does Voxtral Mini TTS support?
It supports text-to-speech only, taking text input and returning synthesized audio output.
-
How do I access Voxtral Mini TTS through LLM.API?
Call the LLM.API generation endpoint with the Voxtral Mini TTS model identifier, passing text input and any audio configuration parameters supported by the API.
-
How does Voxtral Mini TTS compare to larger TTS models?
Compared to larger TTS models, it trades some maximum quality and configurability for lower cost, faster inference, and smaller resource requirements.
-
What limitations should I be aware of when using Voxtral Mini TTS?
Limitations can include less natural prosody on complex texts, language coverage constraints, and quality degradation on very long inputs.
-
Does Voxtral Mini TTS support streaming audio output via LLM.API?
Streaming availability depends on LLM.API’s implementation; check the streaming or response_mode options for this specific model.
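As a rough illustration of the access pattern described in the FAQ, the sketch below builds an HTTP request for a text-to-speech call using only the Python standard library. Every specific in it is an assumption made for illustration: the endpoint URL, the model identifier, and the payload field names are placeholders, not values taken from LLM.API documentation. Check the official API reference for the real ones.

```python
import json
import urllib.request

# Hypothetical placeholders: none of these values come from LLM.API docs.
API_URL = "https://api.example.com/v1/audio/speech"
MODEL_ID = "voxtral-mini-tts"


def build_tts_request(
    text: str, api_key: str, voice: str = "default", audio_format: str = "mp3"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for speech synthesis."""
    payload = {
        "model": MODEL_ID,       # assumed field name and model identifier
        "input": text,           # assumed field name for the text to speak
        "voice": voice,          # assumed audio configuration parameter
        "format": audio_format,  # assumed audio configuration parameter
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Once the placeholders are replaced with real values, the request could be sent with `urllib.request.urlopen(request)` and the response body written to an audio file.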
