Step 3.7 Flash

Multimodal
Vision-Language
Code Generation
Agentic Workflows
Long Context
Fast Inference
+1 category

Step 3.7 Flash is StepFun’s latest high-efficiency multimodal Mixture-of-Experts vision-language model, optimized for enterprise-scale agentic, coding, and long-context reasoning workloads.

Start Using API

API Performance

Context: 256000 tokens
Input: $0.20 per 1M tokens
Output: $1.15 per 1M tokens
Uptime: 99% 99%

About the model

What is Step 3.7 Flash?

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts vision-language model from StepFun that combines a large language backbone with a vision encoder for native image and video understanding. It is primarily used for high-throughput agentic workflows such as tool-calling, multi-step reasoning, and structured automation across text, image, and video inputs. It is also applied to coding, math, and long-context productivity tasks like parsing large documents or running concurrent coding agents with a 256K-token context window. The model extends and builds on the Step 3.5 Flash language architecture within the broader Step 3.x Flash family.

Input / Output

Input

Text prompts
Images (RGB screenshots, photos, UI, documents, charts)
Video frames or clips (for video-to-text understanding)

Output

Structured or free-form text responses
Source code generation and editing

Model capabilities

5 Core Capabilities

Multimodal Understanding

Processes text, images, and video frames together, enabling native image and video understanding for complex perception and reasoning tasks.
Conversational Reasoning

Supports fast, multi-step reasoning in dialogue, with selectable reasoning depth to balance speed, cost, and quality of answers.
Agentic Workflows

Designed for agent-style applications, coordinating perception, search, and multi-step actions across tools, terminals, browsers, and services.
Code Generation

Generates and edits code, supports frontend generation from mockups, screenshot-based debugging, and high-throughput concurrent coding agents.
Long-Context Processing

Handles up to 256k tokens, enabling single-pass analysis of large documents, multi-source search traces, and extensive conversational histories.

Use cases

6 Most Valuable Use Cases

Multimodal UI Agents
Screenshot Debugging
Frontend From Mockups
Document Understanding
Tool-Calling Orchestration
Code Generation Agents

Transparent pricing

Cost Comparison

LLM API offers Step 3.7 Flash access at the same base token prices as direct StepFun, while aggregators and cloud endpoints may add their own margins or be free-tier only.

Provider	Region	Input ($/1M)	Output ($/1M)	Context
LLM API BEST	Global	$0.20 per 1M tokens	$1.15 per 1M tokens	256K
StepFun	Global	$0.20 per 1M tokens	$1.15 per 1M tokens	256K
OpenRouter	Global	$0.20 per 1M tokens	$1.15 per 1M tokens	256K
NVIDIA NIM	Global

Performance benchmarks

Technical Specifications

Metric	Step 3.7 Flash	DeepSeek V4 Flash	Gemini 2.5 Flash
Model Type	Multimodal MoE VLM	Multimodal LLM	Multimodal LLM
Total Parameters	198B	—	—
Active Parameters / Token	~11B	—	—
Context Window	256K	—	1M
Modalities	Text, Image, Video	Text, Image	Text, Image, Audio, Video
Input Price ($/1M tokens)	$0.071	—	$0.10
Output Price ($/1M tokens)	$1.15	—	$0.40
Max Output Tokens	—	—	8192

30-day usage via LLM API

2.3B: Prompt tokens processed (last 30 days)
1.1B: Completion tokens generated (last 30 days)
7.8M: API requests served (last 30 days)
99.8%: Avg uptime across all regions

Start Using API

Architecture & Integration

Why Build on LLM.API?

One unified API. Every major model. Built-in reliability, cost control, and observability.

Unified AI Routing

Automatically route each request to the optimal model across providers based on latency, quality, or custom rules—no client changes required as your stack evolves.
One endpoint, every model
Cost-Aware Orchestration

Control spend by mixing premium and budget models behind one API, with routing policies that cap cost per request and optimize for price-performance.
Lower cost, same output
Resilient Fallbacks

Eliminate single-provider outages with automatic failover to backup models, preserving SLAs and uptime without adding error-handling complexity to your application code.
Stay online, automatically
Full-Stack Observability

Get unified logs, metrics, traces, and model-level analytics so you can debug latency spikes, track usage, and tune routing—all from a single dashboard.
See every token
Task-Level Abstractions

Call high-level tasks like chat, generation, or extraction instead of provider-specific APIs, so you can swap models without rewriting business logic.
Code to tasks, not models
High-Throughput Batch

Run large-scale batch jobs across models with automatic chunking, retry, and rate-limit handling, achieving maximum throughput without custom queue infrastructure.
Thousands of calls, one job

Decision guide

When to Use — When NOT to Use

Use it if...

You need a fast, low-cost model for simple question answering or retrieval.
You need to serve high-volume API traffic where throughput and latency dominate accuracy.
Your use case involves lightweight classification, tagging, or routing over many short texts.
Your use case involves simple data extraction from semi-structured content like forms or receipts.
You need a compact model for rapid experimentation, A/B tests, or fallback logic.
Your use case involves template-based content generation where creativity and nuance are limited.

Avoid if...

You need state-of-the-art reasoning for complex multi-step problems or intricate planning tasks.
Your workload requires handling very long contexts with high faithfulness to source documents.
You need expert-level coding assistance, complex refactoring, or multi-file software design support.
You need highly creative writing, nuanced style control, or domain-specialist technical drafting.
Your workload requires robust multilingual performance across low-resource languages or tricky scripts.
You need strict reliability for safety-critical decisions, legal analysis, or medical advice.

FAQ

Frequently Asked Questions

What is Step 3.7 Flash?

Step 3.7 Flash is a StepFun large language model optimized for fast, low-cost text generation through the LLM.API unified gateway.
What is Step 3.7 Flash best suited for?

Step 3.7 Flash is best for high-volume, latency-sensitive tasks like chatbots, routing, drafting, and lightweight reasoning where speed and cost matter most.
What is the context window of Step 3.7 Flash?

Step 3.7 Flash supports context windows up to 16K tokens, suitable for long conversations or moderately sized documents.
How fast is Step 3.7 Flash in terms of latency?

Step 3.7 Flash is designed for low-latency responses, typically returning first tokens quickly enough for real-time interactive applications.
What modalities does Step 3.7 Flash support?

Step 3.7 Flash currently supports text-in, text-out interactions and does not natively process images, audio, or video.
How do I call Step 3.7 Flash via LLM.API?

Use the LLM.API chat or completions endpoint and set the model parameter to "stepfun/step-3.7-flash" with your LLM.API key.
How is pricing for Step 3.7 Flash handled on LLM.API?

Pricing for Step 3.7 Flash is metered per input and output token by LLM.API, with rates listed in your LLM.API dashboard and pricing page.
How does Step 3.7 Flash compare to more capable StepFun models?

Compared to larger StepFun models, Step 3.7 Flash is cheaper and faster but offers weaker reasoning, coding, and complex instruction-following.
Can I use Step 3.7 Flash for code generation?

Step 3.7 Flash can generate and edit code for straightforward tasks, but complex, critical coding workloads should use a more capable model.
What are the main limitations of Step 3.7 Flash?

Step 3.7 Flash may hallucinate facts, struggle with intricate multi-step reasoning, and is not suitable for safety-critical or compliance-sensitive decisions.

EXPLORE MORE

Related Resources

Start in 2 lines of code

Get My API Key

Step 3.7 Flash

What is Step 3.7 Flash?

5 Core Capabilities

Multimodal Understanding

Conversational Reasoning

Agentic Workflows

Code Generation

Long-Context Processing

6 Most Valuable Use Cases

Cost Comparison

Technical Specifications

Why Build on LLM.API?

Unified AI Routing

Cost-Aware Orchestration

Resilient Fallbacks

Full-Stack Observability

Task-Level Abstractions

High-Throughput Batch

When to Use — When NOT to Use

Use it if...

Avoid if...

Related Resources

Gemini 3.5 Flash

Grok Build 0.1

Qwen3.7 Max

Claude Opus 4.8

Start in 2 lines of code