MiMo-V2-Omni

Vision-Language

MiMo-V2-Omni is Xiaomi’s omni‑modal foundation model that natively handles text, images, audio, and video while supporting agent-style tool use and interface control. It is positioned as the full-modality MiMo V2 variant for complex real‑world multimodal interaction and execution.

Start Using API

API Performance

Latency: ~2.5s time to first token
Context: 262K token context
Input: ~$0.40 per 1M tokens
Output: ~$1.20 per 1M tokens
Uptime: 99% 99%

About the model

What is MiMo-V2-Omni?

MiMo-V2-Omni is an omni‑modal AI foundation model from Xiaomi that unifies text, vision, and audio processing in a single architecture. It is mainly used for multimodal assistants that must understand and respond to combinations of text, images, audio, or video. It is also used as an agent backbone for tool invocation, function execution, and GUI operation in automated workflows. It belongs to Xiaomi’s MiMo V2 family alongside MiMo-V2-Pro and related MiMo V2 series models.

Input / Output

Input

Text prompts and documents (multimodal context)
Images for visual understanding
Audio inputs for speech and sound understanding
Video inputs for multimodal perception

Output

Structured or free-form text responses
Program code generation and editing

Model capabilities

5 Core Capabilities

Multimodal Perception

Processes and understands text, images, audio, and video inputs for unified multimodal tasks and content comprehension across modalities.
Conversational Chat

Supports general-purpose dialogue, task-oriented assistance, and reasoning-based responses within Xiaomi’s MiMo ecosystem and compatible agent frameworks.
Image and Video

Interprets images and video frames, recognizing objects, scenes, and temporal context for perception-driven downstream applications and agents.
Audio Transcription

Understands spoken content within multimodal inputs, enabling recognition and interpretation of embedded audio segments in complex tasks.
Cross-Lingual Understanding

Handles multilingual text inputs and responses, enabling cross-language comprehension and interaction in Xiaomi’s global user scenarios.

Use cases

6 Most Valuable Use Cases

Mixed Media Chatbots
Image & Video Analysis
Audio Transcription Support
Long-Context Document QA
Customer Support Automation
Multimodal Agent Workflows

Transparent pricing

Cost Comparison

LLM API offers the lowest cost and latency for MiMo-V2-Omni–class multimodal models.

Provider	Region	Latency	Throughput	Uptime	Input ($/1M)	Output ($/1M)	Context
LLM API BEST	Global	120ms	120 img/min	99.99%	$0.40/1K images	$0.40/1K images	12MP images, ~4K text tokens
Xiaomi	Asia Pacific	~220ms	~70 img/min	~99.9%	~$0.60/1K images	~$0.60/1K images	~12MP images, ~4K text tokens
OpenAI	Global	~180ms	~90 img/min	99.9%	~$1.50/1K images	~$1.50/1K images	~16MP images, ~8K text tokens
Google Cloud	US East	~200ms	~80 img/min	99.9%	~$1.20/1K images	~$1.20/1K images	~12MP images, ~8K text tokens
Azure AI	EU West	~190ms	~85 img/min	99.9%	~$1.30/1K images	~$1.30/1K images	~12MP images, ~8K text tokens

Performance benchmarks

Technical Specifications

Metric	MiMo-V2-Omni (Xiaomi)	GPT-4o (OpenAI)	Gemini 1.5 Pro (Google)
Avg Latency	~180ms	~200ms	~220ms
Context Window	128K	128K	2M
Input Price ($/1M tokens)	~$0.75	~$5.00	~$3.50
Output Price ($/1M tokens)	~$2.50	~$15.00	~$10.50
Max Output Tokens	8K	4K	8K
Throughput	~60 tps	~50 tps	~45 tps
Uptime	99.9%	99.9%	99.9%

30-day usage via LLM API

7.8B: Prompt tokens processed (30 days)
4.5B: Completion tokens generated (30 days)
22.4M: API requests served (30 days)
99.8%: Average API uptime (30 days)

Start Using API

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Intelligent Model Routing

Dynamically route each request to the best model by cost, latency, or quality—no code changes when providers, versions, or limits shift.
One endpoint, every model.
Cost-Aware Execution

Enforce per-project budgets, pick cheaper equivalents automatically, and get real-time usage insights so you never lose control of AI spend at scale.
Optimize tokens, not features.
Resilient Fallback Flows

Define provider and model fallbacks once; LLM.API transparently retries and reroutes around outages and rate limits to keep production workloads online.
Fail soft, never offline.
Deep LLM Observability

Trace every request across providers with logs, metrics, and structured spans so you can debug prompt issues and performance regressions in minutes.
See every token hop.
Task-Level Orchestration

Define tasks—retrieval, generation, tools, agents—at a higher level than raw prompts, then swap underlying models or providers without touching application code.
Think tasks, not prompts.
High-Throughput Batch API

Submit massive batches of jobs with built-in concurrency control, retries, and cost tracking to unlock efficient large-scale workflows and backfills.
Scale jobs by the million.

Decision guide

When to Use — When NOT to Use

Use it if...

You need tight integration with Xiaomi devices, sensors, and the MIUI ecosystem.
You need an on-device assistant optimized for Xiaomi smartphones and smart home products.
Your use case involves Chinese-language interaction and services targeting Xiaomi’s primary markets.
You need a multimodal model aligned with Xiaomi’s hardware capabilities for photos and video.
Your use case involves value-added AI features inside Xiaomi apps or system utilities.
You need OEM-level support and contracts directly tied to Xiaomi’s product roadmap.

Avoid if...

You need a widely documented, cloud-agnostic model with mature third-party developer ecosystem.
Your workload requires strict US or EU compliance certifications and established regulatory track record.
You need proven performance benchmarks across many public leaderboards and independent evaluations.
Your workload requires vendor-neutral deployment across heterogeneous hardware beyond Xiaomi infrastructure.
You need long-term stability guarantees independent of a single smartphone manufacturer’s strategy.
Your workload requires extensive community tooling, plugins, and open-source integrations already available.

FAQ

Frequently Asked Questions

What is MiMo-V2-Omni?

MiMo-V2-Omni is a Xiaomi multimodal model available through LLM.API, designed to handle both text and image inputs for general-purpose assistant tasks.
What is MiMo-V2-Omni best suited for?

MiMo-V2-Omni is best for everyday assistant use, multimodal chat, code help, and lightweight vision-language tasks rather than highly specialized domains.
How is MiMo-V2-Omni priced on LLM.API?

MiMo-V2-Omni uses LLM.API’s unified pay-per-token pricing; check the MiMo-V2-Omni entry in the pricing table for current input and output rates.
What context window does MiMo-V2-Omni support?

MiMo-V2-Omni supports a 16K token context window through LLM.API, suitable for moderately long conversations and documents.
How fast is MiMo-V2-Omni in terms of latency?

MiMo-V2-Omni is optimized for low to medium latency, typically returning first tokens within a couple of seconds depending on load and request size.
Which modalities does MiMo-V2-Omni support?

MiMo-V2-Omni supports text input and output plus image input, enabling vision-language use cases like image description and analysis.
How do I call MiMo-V2-Omni via the LLM.API?

Use the standard LLM.API chat or completions endpoint, setting the model parameter to "xiaomi/mimo-v2-omni" and providing your API key.
How does MiMo-V2-Omni compare to similar multimodal models?

Compared with similar multimodal models, MiMo-V2-Omni targets a balance of cost and performance, favoring affordability over cutting-edge reasoning strength.
What are the main limitations of MiMo-V2-Omni?

MiMo-V2-Omni may struggle with very long documents, highly specialized professional domains, precise numerical reasoning, and real-time information beyond its training data.
Can MiMo-V2-Omni be fine-tuned through LLM.API?

MiMo-V2-Omni is generally offered as a hosted foundation model on LLM.API, and direct fine-tuning support depends on the platform’s current feature set.

Start in 2 lines of code

Get My API Key

MiMo-V2-Omni

What is MiMo-V2-Omni?

5 Core Capabilities

Multimodal Perception

Conversational Chat

Image and Video

Audio Transcription

Cross-Lingual Understanding

6 Most Valuable Use Cases

Cost Comparison

Technical Specifications

Why Build on LLM.API?

Intelligent Model Routing

Cost-Aware Execution

Resilient Fallback Flows

Deep LLM Observability

Task-Level Orchestration

High-Throughput Batch API

When to Use — When NOT to Use

Use it if...

Avoid if...

Start in 2 lines of code