In the 1M-token context setting, V4 Flash achieves only 10% of the single-token FLOPs and 7% of the KV cache size compared with DeepSeek-V3.2 — a dramatic efficiency jump that makes serving very long contexts actually economical.
// Quick start: call DeepSeek V4 Flash through the OpenAI-compatible endpoint (Node.js).
const { OpenAI } = require('openai');

const api = new OpenAI({
  baseURL: 'https://api.ai.cc/v1',
  apiKey: '', // your api.ai.cc API key
});

const main = async () => {
  const result = await api.chat.completions.create({
    model: 'deepseek/deepseek-v4-flash',
    messages: [
      {
        role: 'system',
        content: 'You are an AI assistant who knows everything.',
      },
      {
        role: 'user',
        content: 'Tell me, why is the sky blue?',
      },
    ],
  });

  // The reply text lives on the first choice of the completion.
  const message = result.choices[0].message.content;
  console.log(`Assistant: ${message}`);
};

main();

# Quick start: the same call from Python via the OpenAI SDK.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.ai.cc/v1",
    api_key="",  # your api.ai.cc API key
)

response = client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",
    messages=[
        {
            "role": "system",
            "content": "You are an AI assistant who knows everything.",
        },
        {
            "role": "user",
            "content": "Tell me, why is the sky blue?",
        },
    ],
)

# The reply text lives on the first choice of the completion.
message = response.choices[0].message.content
print(f"Assistant: {message}")


DeepSeek V4 Flash

A 284B-parameter Mixture-of-Experts model engineered for fast, affordable inference without sacrificing reasoning depth. Thirteen billion parameters active per forward pass. One million tokens of context.

Preview · April 24, 2026 · Open Weights · MoE Architecture · 1M Context
284B     Total Parameters (MoE architecture)
13B      Active per Pass (per forward pass)
1M       Context Window (tokens)
84 t/s   Output Speed (vs 52 t/s median)
1.00s    TTFT (vs 2.03s median)
47       Intelligence Index (open-weight average: 28)
// 01 — OVERVIEW

What Is DeepSeek V4 Flash?

DeepSeek V4 Flash is the efficiency-first member of DeepSeek's fourth-generation model family. It sits alongside V4 Pro as a complementary option — where Pro optimizes for maximum intelligence, Flash optimizes for throughput, latency, and cost per token without falling dramatically short on quality.

The model uses a sparse Mixture-of-Experts design: while it carries 284 billion parameters in total, only 13 billion are active during any single inference call. That translates directly into lower compute and lower cost while keeping outputs sharper than a dense 13B model would achieve on its own.
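The figures above are what a sparse router produces in practice: every token is scored against all experts, but only a handful actually run. Below is a minimal sketch of top-k expert routing; the expert count, k, and the mapping to parameter counts are illustrative assumptions, not published V4 Flash hyperparameters.

# Illustrative top-k expert routing: every token scores all experts, but only
# the k best-scoring experts execute, so the parameters touched per token are a
# small fraction of the total. N_EXPERTS and TOP_K are assumptions, not the
# model's published configuration.
import numpy as np

N_EXPERTS = 256   # assumed routed experts per MoE layer
TOP_K = 8         # assumed experts activated per token

rng = np.random.default_rng(0)
router_logits = rng.normal(size=N_EXPERTS)    # router score for each expert
active = np.argsort(router_logits)[-TOP_K:]   # indices of the experts that run

print(f"experts used for this token: {sorted(active.tolist())}")
print(f"fraction of routed-expert parameters touched: {TOP_K / N_EXPERTS:.1%}")
# The same principle is how 284B total parameters can serve with only ~13B
# active per forward pass (shared layers plus the selected experts).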

API Pricing (per 1M tokens)

Input (cache miss)    $0.18
Input (cache hit)     $0.04
Output                $0.36
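Those three rates are all you need to budget a workload. A quick sketch, assuming a cache-hit ratio you would measure for your own traffic:

# Estimate per-request cost from the listed V4 Flash rates (USD per 1M tokens).
PRICE_IN_MISS = 0.18   # input, cache miss
PRICE_IN_HIT = 0.04    # input, cache hit
PRICE_OUT = 0.36       # output

def request_cost(input_tokens: int, output_tokens: int, cache_hit_ratio: float = 0.0) -> float:
    """Estimated cost in USD; cache_hit_ratio is the share of input tokens
    served from the prompt cache (workload-dependent, assumed here)."""
    hit = input_tokens * cache_hit_ratio
    miss = input_tokens - hit
    return (miss * PRICE_IN_MISS + hit * PRICE_IN_HIT + output_tokens * PRICE_OUT) / 1_000_000

# Example: a RAG call with 120K input tokens (80% cached) and 2K output tokens.
print(f"${request_cost(120_000, 2_000, cache_hit_ratio=0.8):.4f}")  # ≈ $0.0089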
// 02 — ARCHITECTURE

Architecture & Key Innovations

Several architectural decisions separate V4 Flash from earlier DeepSeek releases and from the broader open-source field.

Compressed Sparse Attention (CSA)
Compresses KV caches along the sequence dimension (rate 4 in Flash), then applies DeepSeek Sparse Attention. A lightning indexer picks the top 512 most relevant compressed KV entries per query, plus a 128-token sliding window so local context is never missed.
Heavily Compressed Attention (HCA)
Applies a much more aggressive compression rate of 128, then performs dense attention over that compressed representation — giving the model a cheap global view of distant tokens in every layer. CSA and HCA layers are interleaved throughout; the sketch after this list illustrates the resulting per-query attention budget.
Manifold-Constrained Hyper-Connections
Strengthens conventional residual connections to enhance stability of signal propagation across layers while preserving model expressivity — a key factor in maintaining quality at high compression ratios.
MoE Routing + Muon Optimizer
The first 3 MoE layers use hash routing; the remaining layers use learned DeepSeekMoE routing. Multi-Token Prediction is enabled at depth 1. Training uses the Muon optimizer alongside FP4/FP8 mixed precision to keep training cost low.
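To see why the interleaved CSA/HCA stack is cheap, count the KV entries a single query actually attends to at full context. The sketch below uses only the compression rates quoted above; it counts attended entries, not exact FLOPs, which also depend on head and dimension sizes.

# Rough per-query attention budget at 1M-token context, using the rates quoted
# above: CSA compresses 4x then keeps the top-512 indexed entries plus a
# 128-token local window; HCA compresses 128x and attends densely.
SEQ_LEN = 1_000_000

dense_entries = SEQ_LEN                      # full dense attention baseline
csa_entries = min(512, SEQ_LEN // 4) + 128   # indexer top-512 + sliding window
hca_entries = SEQ_LEN // 128                 # dense over heavily compressed KV

for name, entries in [("dense", dense_entries), ("CSA", csa_entries), ("HCA", hca_entries)]:
    print(f"{name:>5}: {entries:>9,} attended KV entries ({entries / dense_entries:.4%} of dense)")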
Training Data

Pre-trained on more than 32 trillion diverse, high-quality tokens. Post-training used a two-stage pipeline: independent cultivation of domain-specific experts via SFT and RL with GRPO, followed by unified model consolidation via on-policy distillation.

// 03 — REASONING MODES

Reasoning Modes

V4 Flash supports three configurable reasoning effort modes, giving direct control over the latency/quality trade-off without switching models entirely; a request-level sketch follows the list below.

Non-Thinking
No reasoning chain generated. Fastest latency, lowest token count. Best for simple queries, chat, and RAG retrieval steps.
Thinking
Internal chain-of-thought before answering. Standard mode for coding, structured reasoning, and multi-step agentic tasks.
Think Max
Extended reasoning budget. Approaches V4 Pro quality on complex math, STEM, and formal proofs. Recommended context: 384K+ tokens.
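Through the OpenAI-compatible endpoint used in the quick-start examples, mode selection is typically a per-request switch rather than a different model ID. The sketch below forwards a provider-specific field via extra_body; the field name reasoning_effort and its values are assumptions here, so confirm the exact parameter in the provider documentation.

# Sketch: requesting a specific reasoning mode per call. The "reasoning_effort"
# field and its values are assumed; check the provider docs for the real switch.
from openai import OpenAI

client = OpenAI(base_url="https://api.ai.cc/v1", api_key="")  # your api.ai.cc API key

response = client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    # extra_body passes provider-specific fields through untouched; in this
    # hypothetical mapping "high" would correspond to Think Max, "none" to Non-Thinking.
    extra_body={"reasoning_effort": "high"},
)
print(response.choices[0].message.content)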
// 04 — BENCHMARKS

Benchmark Performance

On the Artificial Analysis Intelligence Index v4.0 (covering GDPval-AA, GPQA Diamond, HLE, IFBench, SciCode, Terminal-Bench, and others), V4 Flash in reasoning mode scores 47 versus an open-weight median of 28.

BENCHMARK                       SCORE      STATUS
Intelligence Index (AA v4.0)    47 / 100   +19 vs median
Putnam-200 Pass@8               81.0       Top Tier
HMMT 2026 Feb                   95.2       Leader
IMOAnswerBench                  89.8       Leader
Output Speed                    84 t/s     1.6× median
TTFT                            1.00s      2× faster
// 05 — USE CASES

Use Cases

V4 Flash is positioned as the cost-effective default for most serving scenarios — the model you reach for first unless maximum frontier intelligence is explicitly required.

  • Coding Assist: Long-context repo understanding, diff review, and autocomplete at high throughput. The 1M-token context absorbs entire medium codebases in a single call.
  • RAG Pipelines: High-volume retrieval synthesis where cache hits reduce input costs to fractions of a cent. Ideal for document-heavy Q&A production workloads.
  • Agentic: Multi-step tool-calling loops. Performs on par with V4 Pro on simple agent tasks, at 3–4× lower cost per token.
  • Document Processing: The 1M-token context absorbs entire contracts, codebases, or report archives in a single call — no chunking required (see the sketch after this list).
  • Math / STEM: Think Max mode produces frontier-level formal reasoning at a fraction of Pro pricing. Scores 95.2 on HMMT 2026 Feb.
  • Chat & Support: Sub-second TTFT and 84 t/s throughput keep conversational latency imperceptible in real-time applications.
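The document-processing case above is mostly a matter of not chunking. A minimal sketch, assuming a local file named contract.txt and a rough 4-characters-per-token estimate for the pre-flight length check:

# Sketch: summarize a long document in one call, no chunking. The file name and
# the ~4 chars/token heuristic are assumptions for illustration.
from openai import OpenAI

MAX_CONTEXT_TOKENS = 1_000_000

client = OpenAI(base_url="https://api.ai.cc/v1", api_key="")  # your api.ai.cc API key

with open("contract.txt", encoding="utf-8") as f:
    document = f.read()

approx_tokens = len(document) // 4  # crude estimate; use a real tokenizer for precision
if approx_tokens > MAX_CONTEXT_TOKENS:
    raise ValueError(f"~{approx_tokens:,} tokens exceeds the 1M-token context window")

response = client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "Summarize the key obligations and deadlines."},
        {"role": "user", "content": document},
    ],
)
print(response.choices[0].message.content)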
// 06 — COMPARISONS

How It Compares

vs. DeepSeek V4 Pro
Pro carries 1.6T total / 49B active params. Flash is roughly 3–4× cheaper and faster, with reasoning that closely approaches Pro quality. Simple agent tasks: parity. Knowledge-intensive chains: Pro leads.

vs. DeepSeek V3.2
Flash uses 10% of V3.2's FLOPs and 7% of its KV cache at 1M-token context — a generational efficiency leap — while introducing hybrid attention and configurable reasoning modes V3.2 lacked.

vs. GPT-5.4 Nano
V4 Flash is currently the cheapest among small capable models, undercutting GPT-5.4 Nano on price while offering open weights and 1M-token context that most nano-class models do not provide.

AI Playground

Test any API model in the sandbox environment before you integrate. More than 300 models are available to plug into your app.
Try For Free
