Powered by Mistral

Voxtral Mini TTS

  • Text-to-Speech

Voxtral Mini TTS is Mistral’s 4B-parameter text-to-speech model that generates expressive, low-latency speech and supports multilingual, zero-shot voice cloning. It is available via the Mistral API and as open weights for self-hosting.

Start Using API

What is Voxtral Mini TTS?

Voxtral Mini TTS is a 4B-parameter text-to-speech model from Mistral that converts text into natural, expressive speech with multilingual support and voice cloning from very short audio samples. It is mainly used to build voice agents and assistants that respond in real time with low-latency audio, and to generate high-quality synthetic voices for applications like content narration, product voices, and accessibility tools. It also serves use cases that require cloning or reusing consistent speaker identities across many utterances, such as branded voice experiences and character dialogue. The model is part of Mistral’s Voxtral audio family, alongside Voxtral Mini and Voxtral Small transcription and audio-understanding models.

5 Core Capabilities

  • Text-to-Speech

    Generates natural-sounding speech from written text, suitable for dialogue, narration, and interface responses.

  • Conversational Output

    Produces speech suited to interactive assistants, so a conversational AI system’s text responses can be delivered as clear, responsive spoken dialogue.

  • Multilingual Speech

    Supports speech generation in multiple languages, allowing applications to vocalize content for diverse linguistic audiences and use cases.

  • Screen Reader Compatibility

    Can power screen readers or accessibility tools by converting on-screen text into intelligible, continuous spoken audio output.

  • Media Content Voice

    Provides synthesized voices for videos, podcasts, or interactive media, enabling scalable voiceover creation without human recording sessions.

6 Most Valuable Use Cases

  • Voice App Prototyping
  • Customer Support Prompts
  • Accessibility Voice Output
  • Interactive Voice Demos
  • Spoken Content Previews
  • Educational Voice Feedback

Cost Comparison

Up to ~70% cheaper than comparable TTS APIs, with lower latency

Provider                     Region   Latency  Throughput  Uptime   Input price  Output price  Max audio length
LLM.API (best)               Global   80ms     120 req/s   99.99%   $0.004/min   $0.004/min    ~15 min
Mistral                      EU West  ~140ms   ~45 req/s   ~99.9%   ~$0.010/min  ~$0.010/min   ~10 min
OpenAI                       Global   ~150ms   ~60 req/s   99.9%    ~$0.015/min  ~$0.015/min   ~15 min
Azure AI Speech              Global   ~180ms   ~80 req/s   99.9%    ~$0.016/min  ~$0.016/min   ~10 min
Google Cloud Text-to-Speech  Global   ~170ms   ~70 req/s   99.9%    ~$0.014/min  ~$0.014/min   ~10 min
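
The headline saving can be checked against the per-minute rates listed in the table. The following is a minimal sketch of that arithmetic; the prices are copied from the table as listed, and the function name `savings_vs` is chosen here for illustration only.

```python
# Per-minute prices (USD) copied from the comparison table above.
PRICES_PER_MIN = {
    "LLM.API": 0.004,
    "Mistral": 0.010,
    "OpenAI": 0.015,
    "Azure AI Speech": 0.016,
    "Google Cloud Text-to-Speech": 0.014,
}


def savings_vs(baseline: str, other: str) -> float:
    """Return the fractional saving of `baseline` relative to `other`."""
    return 1.0 - PRICES_PER_MIN[baseline] / PRICES_PER_MIN[other]


# Print the saving of the LLM.API row against every other row.
for provider in PRICES_PER_MIN:
    if provider != "LLM.API":
        print(f"{provider}: {savings_vs('LLM.API', provider):.0%} cheaper")
```

With the listed rates, the saving works out to 60% against Mistral, about 73% against OpenAI, 75% against Azure AI Speech, and about 71% against Google Cloud Text-to-Speech.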

Technical Specifications

Metric               Voxtral Mini TTS  OpenAI gpt-4o-mini TTS  Google Chirp TTS (small)
Avg Latency          ~180ms            ~200ms                  ~220ms
Languages Supported  ~25               ~30                     ~20
Price per 1M chars   ~$0.70            ~$1.00                  ~$0.80
Max Input Length     ~4K chars         ~8K chars               ~5K chars
Sample Rate          24 kHz            24 kHz                  22.05 kHz
Voices / Styles      ~20               ~30                     ~15
Uptime               99.9%             99.9%                   99.5%
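
Because the table lists a maximum input length of roughly 4K characters, longer texts would need to be split before synthesis. The sketch below shows one way to do that on the client side, preferring sentence boundaries. It treats the approximate 4K figure as a hard cap, which is an assumption; `chunk_text` and `MAX_CHARS` are illustrative names, not part of any API.

```python
import re

# Assumption: the "~4K chars" figure from the table, treated as a hard cap.
MAX_CHARS = 4000


def chunk_text(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Hard-split any single sentence that is itself longer than the cap.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The sentence does not fit; close the current chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized as a separate request and the resulting audio segments concatenated in order.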

30-day usage via LLM.API

  • 620M characters synthesized in the last 30 days
  • 3.4M TTS API requests served
  • 210K unique developer projects using Voxtral Mini TTS
  • 99.96% average API uptime

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

  • Unified AI Routing

    Automatically route each request to the best-fit model across providers based on cost, latency, or quality—without changing your code or client integration.

    One endpoint, every model.
  • Cost-Aware Orchestration

    Define cost ceilings and model preferences, then let LLM.API optimize per-call spend so you can scale usage without surprise bills or manual tuning.

    More usage, less spend.
  • Automatic Fallbacks

    When a provider times out, errors, or rate-limits, LLM.API seamlessly retries on backup models so your production flows stay reliable and resilient.

    No single point of failure.
  • Deep Observability

    Get unified logs, metrics, traces, and payload samples across all models and providers, making debugging, performance tuning, and governance radically simpler.

    See every token, everywhere.
  • Task-Level Abstractions

    Describe tasks like chat, generation, tools, or RAG once and let LLM.API translate them into provider-specific calls, so you avoid brittle model-specific code.

    Code to tasks, not models.
  • High-Throughput Batch

    Send massive batches of prompts through a single API call, with automatic chunking, retries, and concurrency controls to maximize throughput across providers.

    Process thousands in one go.
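
The fallback behavior described above can be illustrated with a small client-side sketch. This is not how LLM.API implements it internally; it only shows the general pattern of trying providers in order and moving on when one fails. `ProviderError`, `synthesize_with_fallback`, and the two stand-in providers are hypothetical names used for illustration.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error raised by a provider on timeout, error, or rate limit."""


def synthesize_with_fallback(
    text: str,
    providers: Sequence[tuple[str, Callable[[str], bytes]]],
) -> tuple[str, bytes]:
    """Try each provider in order; return (name, audio) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(text)
        except ProviderError as exc:
            # Record the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stand-in providers for illustration; real ones would make network calls.
def flaky_primary(text: str) -> bytes:
    raise ProviderError("rate limited")


def backup(text: str) -> bytes:
    return b"fake-audio:" + text.encode("utf-8")


name, audio = synthesize_with_fallback(
    "Hello", [("primary", flaky_primary), ("backup", backup)]
)
print(name)  # prints "backup"
```

Ordering the provider list by cost or latency would give a simple form of the cost-aware routing described in the same list.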

When to Use — When NOT to Use

Use it if...

  • You need lightweight text-to-speech for applications where a compact model is sufficient.
  • You need TTS integrated into an existing Mistral-based stack for simpler deployment.
  • Your use case involves prototyping speech features without requiring enterprise-grade voice quality.
  • Your use case involves cost-sensitive scenarios where smaller speech models are advantageous.
  • You need basic voice output for chatbots, assistants, or simple narration tasks.

Avoid if...

  • You need state-of-the-art naturalness and expressiveness on par with premium commercial TTS.
  • Your workload requires highly controllable prosody, emotions, and detailed voice style parameters.
  • You need robust multilingual coverage and accents beyond what Mistral explicitly supports.
  • Your workload requires ultra-high-fidelity audio for production media, film, or advertising.
  • You need mature, battle-tested TTS with extensive tooling, ecosystem, and vendor guarantees.

Frequently Asked Questions

  • What is Voxtral Mini TTS?

    Voxtral Mini TTS is a Mistral text-to-speech model focused on fast, lightweight voice synthesis for applications that need low-latency audio generation.

  • What is Voxtral Mini TTS best suited for?

    It is best for real-time or near real-time speech generation in interactive apps, voice assistants, and low-resource environments.

  • How is Voxtral Mini TTS priced when used through LLM.API?

    Pricing is usage-based per generated character or token, with exact rates defined in the LLM.API model pricing table.

  • What context window or input length limits does Voxtral Mini TTS have?

    The model accepts short to moderate text prompts suitable for speech synthesis. The Technical Specifications table lists a maximum input length of roughly 4K characters, with exact limits determined by LLM.API configuration.

  • How fast is Voxtral Mini TTS in terms of latency?

    Voxtral Mini TTS is optimized for low latency, typically returning audio quickly enough for responsive user experiences in interactive applications.

  • What modalities does Voxtral Mini TTS support?

    It supports text-to-speech only, taking text input and returning synthesized audio output.

  • How do I access Voxtral Mini TTS through LLM.API?

    Call the LLM.API generation endpoint with the Voxtral Mini TTS model identifier, passing text input and any audio configuration parameters supported by the API.

  • How does Voxtral Mini TTS compare to larger TTS models?

    Compared to larger TTS models, it trades some maximum quality and configurability for lower cost, faster inference, and smaller resource requirements.

  • What limitations should I be aware of when using Voxtral Mini TTS?

    Limitations can include less natural prosody on complex texts, language coverage constraints, and quality degradation on very long inputs.

  • Does Voxtral Mini TTS support streaming audio output via LLM.API?

    Streaming availability depends on LLM.API’s implementation; check the streaming or response_mode options for this specific model.
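
As a rough illustration of the access pattern described in the FAQ, the sketch below builds a JSON POST request using only the Python standard library. The endpoint URL, the model identifier, and the payload field names are all placeholders assumed for this example; the actual values come from the LLM.API documentation.

```python
import json
import urllib.request

# Assumptions for illustration only: the URL, model ID, and field names are placeholders.
API_URL = "https://api.example.com/v1/audio/speech"
MODEL_ID = "voxtral-mini-tts"


def build_tts_request(
    text: str, api_key: str, voice: str = "default"
) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request for speech synthesis."""
    payload = {"model": MODEL_ID, "input": text, "voice": voice}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def synthesize(text: str, api_key: str) -> bytes:
    """Send the request and return the raw response body (assumed to be audio bytes)."""
    with urllib.request.urlopen(build_tts_request(text, api_key), timeout=30) as resp:
        return resp.read()
```

Calling `synthesize("Hello there", api_key)` would return the response bytes, which could be written to a file once the real endpoint and response format are confirmed.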

Start in 2 lines of code

Get My API Key