Step 3.7 Flash

Instruction Following

Step 3.7 Flash is StepFun’s latest high-efficiency multimodal Mixture-of-Experts vision-language model, optimized for enterprise-scale agentic, coding, and long-context reasoning workloads.

Start Using API

API Performance

Latency: ~0.5s time to first token
Context: ~32K token context
Input: ~$0.15 per 1M tokens
Output: ~$0.60 per 1M tokens
Uptime: 99% 99%

About the model

What is Step 3.7 Flash?

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts vision-language model from StepFun that combines a large language backbone with a vision encoder for native image and video understanding. It is primarily used for high-throughput agentic workflows such as tool-calling, multi-step reasoning, and structured automation across text, image, and video inputs. It is also applied to coding, math, and long-context productivity tasks like parsing large documents or running concurrent coding agents with a 256K-token context window. The model extends and builds on the Step 3.5 Flash language architecture within the broader Step 3.x Flash family.

Input / Output

Input

Text prompts
Images (RGB screenshots, photos, UI, documents, charts)
Video frames or clips (for video-to-text understanding)

Output

Structured or free-form text responses
Source code generation and editing

Model capabilities

5 Core Capabilities

Multimodal Reasoning

Processes combined text and image inputs using a vision-language Mixture-of-Experts architecture for complex multimodal understanding and analysis.
Conversational AI

Acts as a high-efficiency assistant for dialogue, instruction following, long-context conversations, and enterprise-focused agent workflows.
Code and Math

Supports coding-related assistance, multi-step reasoning, and mathematical problem solving within large-context, tool-using agent scenarios.
Multilingual Support

Handles prompts and content in multiple languages, enabling global applications and cross-lingual understanding in text and images.
Document OCR

Interprets text within screenshots, documents, and UI images as part of its native image understanding and agentic tool-use workflows.

Use cases

6 Most Valuable Use Cases

Real-time Chatbots
Invoice Data Extraction
Legal Case Retrieval
Regulatory Case Monitoring
E-commerce Support Assistant
Code Generation Helper

Transparent pricing

Cost Comparison

LLM API offers the lowest cost and highest performance for Step 3.7 Flash–class models.

Provider	Region	Latency	Throughput	Uptime	Input ($/1M)	Output ($/1M)	Context
LLM API BEST	Global	120ms	80 tps	99.99%	$0.08	$0.24	256K
StepFun	Global	~250ms	~40 tps	~99.9%	~$0.12	~$0.36	~128K
OpenAI (GPT-4.1 mini equivalent)	Global	~220ms	~50 tps	~99.9%	~$0.15	~$0.45	~128K
Anthropic (Claude 3.7 Haiku equivalent)	US/EU	~230ms	~45 tps	~99.9%	~$0.14	~$0.42	~200K
Google Cloud (Gemini Flash equivalent)	US/EU/Asia	~240ms	~40 tps	~99.9%	~$0.13	~$0.39	~128K

Performance benchmarks

Technical Specifications

Metric	Step 3.7 Flash	DeepSeek V4 Flash	Gemini 2.5 Flash
Model Type	Multimodal MoE VLM	Multimodal LLM	Multimodal LLM
Total Parameters	198B	—	—
Active Parameters / Token	~11B	—	—
Context Window	256K	—	1M
Modalities	Text, Image, Video	Text, Image	Text, Image, Audio, Video
Input Price ($/1M tokens)	$0.071	—	$0.10
Output Price ($/1M tokens)	$1.15	—	$0.40
Max Output Tokens	—	—	8192

30-day usage via LLM API

2.3B: Prompt tokens processed (last 30 days)
1.1B: Completion tokens generated (last 30 days)
7.8M: API requests served (last 30 days)
99.8%: Avg uptime across all regions

Start Using API

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Unified AI Routing

Automatically route each request to the optimal model across providers based on latency, quality, or custom rules—no client changes required as your stack evolves.
One endpoint, every model
Cost-Aware Orchestration

Control spend by mixing premium and budget models behind one API, with routing policies that cap cost per request and optimize for price-performance.
Lower cost, same output
Resilient Fallbacks

Eliminate single-provider outages with automatic failover to backup models, preserving SLAs and uptime without adding error-handling complexity to your application code.
Stay online, automatically
Full-Stack Observability

Get unified logs, metrics, traces, and model-level analytics so you can debug latency spikes, track usage, and tune routing—all from a single dashboard.
See every token
Task-Level Abstractions

Call high-level tasks like chat, generation, or extraction instead of provider-specific APIs, so you can swap models without rewriting business logic.
Code to tasks, not models
High-Throughput Batch

Run large-scale batch jobs across models with automatic chunking, retry, and rate-limit handling, achieving maximum throughput without custom queue infrastructure.
Thousands of calls, one job

Decision guide

When to Use — When NOT to Use

Use it if...

You need a fast, low-cost model for simple question answering or retrieval.
You need to serve high-volume API traffic where throughput and latency dominate accuracy.
Your use case involves lightweight classification, tagging, or routing over many short texts.
Your use case involves simple data extraction from semi-structured content like forms or receipts.
You need a compact model for rapid experimentation, A/B tests, or fallback logic.
Your use case involves template-based content generation where creativity and nuance are limited.

Avoid if...

You need state-of-the-art reasoning for complex multi-step problems or intricate planning tasks.
Your workload requires handling very long contexts with high faithfulness to source documents.
You need expert-level coding assistance, complex refactoring, or multi-file software design support.
You need highly creative writing, nuanced style control, or domain-specialist technical drafting.
Your workload requires robust multilingual performance across low-resource languages or tricky scripts.
You need strict reliability for safety-critical decisions, legal analysis, or medical advice.

FAQ

Frequently Asked Questions

What is Step 3.7 Flash?

Step 3.7 Flash is a StepFun large language model optimized for fast, low-cost text generation through the LLM.API unified gateway.
What is Step 3.7 Flash best suited for?

Step 3.7 Flash is best for high-volume, latency-sensitive tasks like chatbots, routing, drafting, and lightweight reasoning where speed and cost matter most.
What is the context window of Step 3.7 Flash?

Step 3.7 Flash supports context windows up to 16K tokens, suitable for long conversations or moderately sized documents.
How fast is Step 3.7 Flash in terms of latency?

Step 3.7 Flash is designed for low-latency responses, typically returning first tokens quickly enough for real-time interactive applications.
What modalities does Step 3.7 Flash support?

Step 3.7 Flash currently supports text-in, text-out interactions and does not natively process images, audio, or video.
How do I call Step 3.7 Flash via LLM.API?

Use the LLM.API chat or completions endpoint and set the model parameter to "stepfun/step-3.7-flash" with your LLM.API key.
How is pricing for Step 3.7 Flash handled on LLM.API?

Pricing for Step 3.7 Flash is metered per input and output token by LLM.API, with rates listed in your LLM.API dashboard and pricing page.
How does Step 3.7 Flash compare to more capable StepFun models?

Compared to larger StepFun models, Step 3.7 Flash is cheaper and faster but offers weaker reasoning, coding, and complex instruction-following.
Can I use Step 3.7 Flash for code generation?

Step 3.7 Flash can generate and edit code for straightforward tasks, but complex, critical coding workloads should use a more capable model.
What are the main limitations of Step 3.7 Flash?

Step 3.7 Flash may hallucinate facts, struggle with intricate multi-step reasoning, and is not suitable for safety-critical or compliance-sensitive decisions.

Start in 2 lines of code

Get My API Key

Step 3.7 Flash

What is Step 3.7 Flash?

5 Core Capabilities

Multimodal Reasoning

Conversational AI

Code and Math

Multilingual Support

Document OCR

6 Most Valuable Use Cases

Cost Comparison

Technical Specifications

Why Build on LLM.API?

Unified AI Routing

Cost-Aware Orchestration

Resilient Fallbacks

Full-Stack Observability

Task-Level Abstractions

High-Throughput Batch

When to Use — When NOT to Use

Use it if...

Avoid if...

Start in 2 lines of code