Powered by StepFun
Step 3.7 Flash
- Instruction Following
Step 3.7 Flash is StepFun’s latest high-efficiency multimodal Mixture-of-Experts vision-language model, optimized for enterprise-scale agentic, coding, and long-context reasoning workloads.
About the model
What is Step 3.7 Flash?
Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts vision-language model from StepFun that combines a large language backbone with a vision encoder for native image and video understanding. It is primarily used for high-throughput agentic workflows such as tool-calling, multi-step reasoning, and structured automation across text, image, and video inputs. It is also applied to coding, math, and long-context productivity tasks like parsing large documents or running concurrent coding agents with a 256K-token context window. The model extends and builds on the Step 3.5 Flash language architecture within the broader Step 3.x Flash family.
Model capabilities
5 Core Capabilities
-
Multimodal Reasoning
Processes combined text and image inputs using a vision-language Mixture-of-Experts architecture for complex multimodal understanding and analysis.
-
Conversational AI
Acts as a high-efficiency assistant for dialogue, instruction following, long-context conversations, and enterprise-focused agent workflows.
-
Code and Math
Supports coding-related assistance, multi-step reasoning, and mathematical problem solving within large-context, tool-using agent scenarios.
-
Multilingual Support
Handles prompts and content in multiple languages, enabling global applications and cross-lingual understanding in text and images.
-
Document OCR
Interprets text within screenshots, documents, and UI images as part of its native image understanding and agentic tool-use workflows.
Use cases
6 Most Valuable Use Cases
- Real-time Chatbots
- Invoice Data Extraction
- Legal Case Retrieval
- Regulatory Case Monitoring
- E-commerce Support Assistant
- Code Generation Helper
Transparent pricing
Cost Comparison
LLM API offers the lowest cost and highest performance for Step 3.7 Flash–class models.
| Provider | Region | Latency | Throughput | Uptime | Input ($/1M) | Output ($/1M) | Context |
|---|---|---|---|---|---|---|---|
| LLM API BEST | Global | 120ms | 80 tps | 99.99% | $0.08 | $0.24 | 256K |
| StepFun | Global | ~250ms | ~40 tps | ~99.9% | ~$0.12 | ~$0.36 | ~128K |
| OpenAI (GPT-4.1 mini equivalent) | Global | ~220ms | ~50 tps | ~99.9% | ~$0.15 | ~$0.45 | ~128K |
| Anthropic (Claude 3.7 Haiku equivalent) | US/EU | ~230ms | ~45 tps | ~99.9% | ~$0.14 | ~$0.42 | ~200K |
| Google Cloud (Gemini Flash equivalent) | US/EU/Asia | ~240ms | ~40 tps | ~99.9% | ~$0.13 | ~$0.39 | ~128K |
Performance benchmarks
Technical Specifications
| Metric | Step 3.7 Flash | DeepSeek V4 Flash | Gemini 2.5 Flash |
|---|---|---|---|
| Model Type | Multimodal MoE VLM | Multimodal LLM | Multimodal LLM |
| Total Parameters | 198B | — | — |
| Active Parameters / Token | ~11B | — | — |
| Context Window | 256K | — | 1M |
| Modalities | Text, Image, Video | Text, Image | Text, Image, Audio, Video |
| Input Price ($/1M tokens) | $0.071 | — | $0.10 |
| Output Price ($/1M tokens) | $1.15 | — | $0.40 |
| Max Output Tokens | — | — | 8192 |
30-day usage via LLM API
- 2.3B
- Prompt tokens processed (last 30 days)
- 1.1B
- Completion tokens generated (last 30 days)
- 7.8M
- API requests served (last 30 days)
- 99.8%
- Avg uptime across all regions
Architecture & Integration
Why Build on LLM.API?
One unified API. Every major model. Built-in reliability, cost control, and observability.
-
Unified AI Routing
Automatically route each request to the optimal model across providers based on latency, quality, or custom rules—no client changes required as your stack evolves.
One endpoint, every model -
Cost-Aware Orchestration
Control spend by mixing premium and budget models behind one API, with routing policies that cap cost per request and optimize for price-performance.
Lower cost, same output -
Resilient Fallbacks
Eliminate single-provider outages with automatic failover to backup models, preserving SLAs and uptime without adding error-handling complexity to your application code.
Stay online, automatically -
Full-Stack Observability
Get unified logs, metrics, traces, and model-level analytics so you can debug latency spikes, track usage, and tune routing—all from a single dashboard.
See every token -
Task-Level Abstractions
Call high-level tasks like chat, generation, or extraction instead of provider-specific APIs, so you can swap models without rewriting business logic.
Code to tasks, not models -
High-Throughput Batch
Run large-scale batch jobs across models with automatic chunking, retry, and rate-limit handling, achieving maximum throughput without custom queue infrastructure.
Thousands of calls, one job
Decision guide
When to Use — When NOT to Use
Use it if...
- You need a fast, low-cost model for simple question answering or retrieval.
- You need to serve high-volume API traffic where throughput and latency dominate accuracy.
- Your use case involves lightweight classification, tagging, or routing over many short texts.
- Your use case involves simple data extraction from semi-structured content like forms or receipts.
- You need a compact model for rapid experimentation, A/B tests, or fallback logic.
- Your use case involves template-based content generation where creativity and nuance are limited.
Avoid if...
- You need state-of-the-art reasoning for complex multi-step problems or intricate planning tasks.
- Your workload requires handling very long contexts with high faithfulness to source documents.
- You need expert-level coding assistance, complex refactoring, or multi-file software design support.
- You need highly creative writing, nuanced style control, or domain-specialist technical drafting.
- Your workload requires robust multilingual performance across low-resource languages or tricky scripts.
- You need strict reliability for safety-critical decisions, legal analysis, or medical advice.
FAQ
Frequently Asked Questions
-
What is Step 3.7 Flash?
Step 3.7 Flash is a StepFun large language model optimized for fast, low-cost text generation through the LLM.API unified gateway.
-
What is Step 3.7 Flash best suited for?
Step 3.7 Flash is best for high-volume, latency-sensitive tasks like chatbots, routing, drafting, and lightweight reasoning where speed and cost matter most.
-
What is the context window of Step 3.7 Flash?
Step 3.7 Flash supports context windows up to 16K tokens, suitable for long conversations or moderately sized documents.
-
How fast is Step 3.7 Flash in terms of latency?
Step 3.7 Flash is designed for low-latency responses, typically returning first tokens quickly enough for real-time interactive applications.
-
What modalities does Step 3.7 Flash support?
Step 3.7 Flash currently supports text-in, text-out interactions and does not natively process images, audio, or video.
-
How do I call Step 3.7 Flash via LLM.API?
Use the LLM.API chat or completions endpoint and set the model parameter to "stepfun/step-3.7-flash" with your LLM.API key.
-
How is pricing for Step 3.7 Flash handled on LLM.API?
Pricing for Step 3.7 Flash is metered per input and output token by LLM.API, with rates listed in your LLM.API dashboard and pricing page.
-
How does Step 3.7 Flash compare to more capable StepFun models?
Compared to larger StepFun models, Step 3.7 Flash is cheaper and faster but offers weaker reasoning, coding, and complex instruction-following.
-
Can I use Step 3.7 Flash for code generation?
Step 3.7 Flash can generate and edit code for straightforward tasks, but complex, critical coding workloads should use a more capable model.
-
What are the main limitations of Step 3.7 Flash?
Step 3.7 Flash may hallucinate facts, struggle with intricate multi-step reasoning, and is not suitable for safety-critical or compliance-sensitive decisions.
