DeepSeek is the Chinese AI lab whose open-weight models keep rewriting the price-to-performance rules for the whole industry. Its current generation, DeepSeek V4, landed on April 24, 2026 as a mixture-of-experts (MoE) family released under the permissive MIT license — meaning you can use it through DeepSeek's cheap API or download the weights and run them yourself, with a frontier-class 1M-token context window either way.
This guide covers the V4 lineup and how it is built, the three ways to run it, what it costs, how to call the API, and where DeepSeek is the right tool versus the closed frontier labs.
The DeepSeek V4 lineup
V4 ships in two sizes, both mixture-of-experts models that activate only a fraction of their parameters per token:
- V4 Pro — 1.6 trillion total parameters with about 49 billion active per token. This is the flagship, for maximum reasoning and coding capability.
- V4 Flash — 284 billion total with roughly 13 billion active. Built for high-volume, latency-sensitive pipelines, with far higher concurrency limits and about 3× lower output cost than Pro.
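To make "activate only a fraction of their parameters" concrete, the sketch below first computes the active share from the parameter counts stated above. It then shows a toy top-k gate, the generic mixture-of-experts routing idea. The router is an illustration only: DeepSeek's actual expert count, k, and gating details are not given here, and the example logits are made up.

```python
import math

# Parameter counts as stated above: (total, active per token).
V4_MODELS = {
    "V4 Pro": (1.6e12, 49e9),
    "V4 Flash": (284e9, 13e9),
}


def active_fraction(total: float, active: float) -> float:
    """Share of the model's parameters that actually run for each token."""
    return active / total


def top_k_route(logits: list[float], k: int) -> list[tuple[int, float]]:
    """Toy top-k gating for one token (generic MoE idea, not DeepSeek's router).

    Picks the k experts with the highest router scores and renormalizes
    their scores with a softmax over just those k. Only the chosen
    experts' weights would be used to process this token.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract max for numerical stability
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


for name, (total, active) in V4_MODELS.items():
    print(f"{name}: {active_fraction(total, active):.1%} of weights active per token")

# Hypothetical router scores for 4 experts; only 2 get activated.
print(top_k_route([0.1, 2.0, -1.0, 1.5], k=2))
```

By this arithmetic, Pro runs about 3.1% of its weights per token and Flash about 4.6%, which is why a 1.6T-parameter model can be served at a cost closer to that of a much smaller dense model.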
Both share a 1M-token context window and up to 384K tokens of output, and both support a thinking mode (visible reasoning steps) and a faster non-thinking mode, with thinking on by default. The efficiency comes from architecture: V4 uses DeepSeek Sparse Attention, a hybrid scheme that compresses the key-value cache to roughly 2% of a standard attention model's, which is what makes a million-token context economical to serve.
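To see why a roughly 2% KV cache matters at this length, here is back-of-envelope arithmetic for a standard dense-attention cache. The layer count, KV-head count, head dimension, and 16-bit values are hypothetical round numbers chosen for illustration, not V4's real configuration. The 2% factor is simply applied to the result; this does not model how DeepSeek Sparse Attention achieves it.

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Standard attention KV cache size for one sequence.

    Stores one key and one value vector (the factor of 2) per token,
    per layer, per KV head, each of head_dim values.
    """
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical dense-attention config, NOT DeepSeek V4's actual dimensions.
baseline = kv_cache_bytes(tokens=1_000_000, layers=60, kv_heads=8, head_dim=128)
compressed = baseline * 0.02  # the "roughly 2%" figure from the text

print(f"baseline:   {baseline / 1e9:.1f} GB")
print(f"compressed: {compressed / 1e9:.1f} GB")
```

Under these assumed dimensions, one 1M-token sequence needs about 245.8 GB of cache uncompressed versus about 4.9 GB at 2%, and that cost is paid per concurrent request.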
Three ways to run it
DeepSeek is unusually flexible about how you reach it:
- Chat, for free. The quickest start is chat.deepseek.com or the mobile apps: no key, no setup.
- The API, for builders. Covered below.
- Self-hosting. Because the weights are open under MIT and published on Hugging Face, you can download V4 and run it on your own hardware, which is the route to take when data cannot leave your environment. (See our guide to running open-source LLMs locally for the mechanics.)
What it costs
Price is DeepSeek's headline. Through the API, V4 Flash runs about $0.14 per million input tokens (cache miss) and $0.28 per million output, while V4 Pro is roughly $0.44 / $0.87 — and a cache hit on repeated context drops input to a fraction of a cent per million. That is one to two orders of magnitude cheaper than the closed Western flagships for comparable context sizes, which is exactly why DeepSeek shows up so often in cost-sensitive production stacks.
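A quick way to budget a workload is to multiply token counts by the per-million rates. The sketch below hard-codes the approximate cache-miss prices quoted above, which may change, and ignores cache hits because no exact cache-hit rate is given here. The 10M-input / 2M-output workload is hypothetical.

```python
# Approximate USD per million tokens (cache-miss input), as quoted above.
PRICES = {
    "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"input": 0.44, "output": 0.87},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost, assuming every input token is a cache miss."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Hypothetical workload: 10M input tokens, 2M output tokens.
for model in PRICES:
    print(f"{model}: ${estimate_cost(model, 10_000_000, 2_000_000):.2f}")
```

That workload comes to about $1.96 on Flash and $6.14 on Pro. The output-price ratio (0.87 / 0.28, roughly 3.1) lines up with the "about 3× lower output cost" noted for Flash.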
Using the API
The DeepSeek API is a drop-in for the OpenAI SDK: keep your client, point base_url at https://api.deepseek.com, set your DeepSeek key, and pass deepseek-v4-pro or deepseek-v4-flash as the model. There is also an Anthropic-compatible endpoint at https://api.deepseek.com/anthropic.
```python
from openai import OpenAI

# Same OpenAI client; only the key and base_url change.
client = OpenAI(
    api_key="YOUR_DEEPSEEK_KEY",  # replace with your DeepSeek API key
    base_url="https://api.deepseek.com",
)

resp = client.chat.completions.create(
    model="deepseek-v4-pro",  # or "deepseek-v4-flash"
    messages=[{"role": "user", "content": "Explain MoE routing simply."}],
)
print(resp.choices[0].message.content)
```
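Under the hood, the OpenAI SDK sends that call as an HTTP POST to `{base_url}/chat/completions`. To show what goes over the wire without any third-party dependency, here is a sketch using only Python's standard library that builds, but does not send, the equivalent request. The `DEEPSEEK_API_KEY` environment variable name is our own convention, not something DeepSeek prescribes, and the payload carries only the two fields the SDK example uses.

```python
import json
import os
import urllib.request
from typing import Optional


def build_chat_request(prompt: str, model: str = "deepseek-v4-flash",
                       api_key: Optional[str] = None) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    # DEEPSEEK_API_KEY is a hypothetical env var name used for this sketch.
    key = api_key or os.environ.get("DEEPSEEK_API_KEY")
    if not key:
        raise ValueError("pass api_key or set DEEPSEEK_API_KEY")
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        "https://api.deepseek.com/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {key}"},
        method="POST",
    )


# To actually send it (requires network access and a valid key):
# with urllib.request.urlopen(build_chat_request("Explain MoE routing simply.")) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

In practice the SDK is the better choice since it adds retries and typed responses; the raw request is mainly useful for debugging or for environments where you cannot install packages.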
One migration note: the older model names deepseek-chat and deepseek-reasoner are being deprecated on July 24, 2026. Keep your base URL the same and just switch the model string to deepseek-v4-flash (which covers both the old non-thinking and thinking behaviors) or deepseek-v4-pro.
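If model names are scattered through a codebase or config, a small lookup keeps the switch in one place. This is a minimal sketch of the mapping described in the note, with a hypothetical helper name. One caveat: thinking mode is on by default for V4, so code that relied on `deepseek-chat`'s non-thinking behavior may need to turn thinking off per request; the parameter for that isn't covered here, so check DeepSeek's API docs.

```python
# Deprecated names (retiring July 24, 2026) -> V4 replacement.
# deepseek-v4-flash covers both the old non-thinking and thinking behaviors.
DEPRECATED_MODELS = {
    "deepseek-chat": "deepseek-v4-flash",      # previously non-thinking
    "deepseek-reasoner": "deepseek-v4-flash",  # previously thinking
}


def migrate_model_name(model: str) -> str:
    """Return the V4 model string to use; non-deprecated names pass through."""
    return DEPRECATED_MODELS.get(model, model)


print(migrate_model_name("deepseek-reasoner"))
print(migrate_model_name("deepseek-v4-pro"))
```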
How good is it, and when to choose it
On coding, V4 Pro's strongest configuration scores around 80.6% on SWE-bench Verified, level with Gemini 3.1 Pro and a genuine frontier result, though still behind the very top closed models like Claude Opus 4.8 (≈88.6%) and Claude Fable 5 (≈95%). The honest read: DeepSeek is not quite the single most capable model on the hardest benchmarks, but it is remarkably close for a tiny fraction of the cost, and its weights are open, making it one of the leading open-source LLMs you can run yourself.

Choose it when price-per-token or open weights matter most: high-volume extraction, self-hosted deployments, or anything where you need full control of the model. For the absolute frontier of reasoning or for enterprise governance, weigh it against Claude and Gemini.

Note the data-residency question, too: the hosted API runs on DeepSeek's infrastructure, so for sensitive data the open weights and self-hosting are the safer path.