DeepSeek V4 multimodal: what creators need to know
DeepSeek V4 multimodal — 1M-token context, V4-Flash and V4-Pro pricing, the CSA + HCA architecture, and what it means for creators inside OmniArt's stack.

DeepSeek V4 went live on April 24, 2026 with two tiers, a 1-million-token context, and a 384K maximum output length. It's not a video model and it isn't trying to replace one. What V4 actually changes is the layer above the visual stack — the brief, the storyboard, the brand bible, the long-context retrieval that turns "make a campaign" into "make a campaign that respects every shoot we did this year." This piece covers what DeepSeek V4 is, what's in it for creators using OmniArt, and where it fits next to the rest of the model roster.
What DeepSeek V4 is
DeepSeek V4 is a long-context reasoning and tool-use model with two production tiers — V4-Flash and V4-Pro — both available via an OpenAI-compatible API at api.deepseek.com. The 1M-token context plus structured tool calls is the headline; the architecture underneath uses compressed sparse attention (CSA) plus heavy compressed attention (HCA), which is what keeps cost from scaling linearly with context length.
| Tier | Total params | Active params | Pre-training tokens | Input price (cache miss) | Output price |
|---|---|---|---|---|---|
| V4-Flash | 284B | 13B | 32T | ¥1 / 1M tokens | ¥2 / 1M tokens (~$0.28) |
| V4-Pro | 1.6T | 49B | 33T | ¥12 / 1M tokens | ¥24 / 1M tokens (~$3.48) |
Both tiers cap output at 384K tokens. Both tiers serve "thinking" and "non-thinking" modes from the same model — V4 unifies what V3 and R1 used to handle separately.
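Assuming the OpenAI-compatible surface behaves like a standard chat-completions API, a first call looks like the sketch below. The model identifier is a placeholder, not a published tier name, and the thinking/non-thinking toggle is left to DeepSeek's docs:

```python
# Minimal sketch of a V4 call through the OpenAI-compatible endpoint.
# "deepseek-v4-flash" is a placeholder -- check DeepSeek's docs for the
# published tier names and for how thinking mode is toggled.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="YOUR_DEEPSEEK_API_KEY",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",  # placeholder tier name
    messages=[
        {"role": "system", "content": "You are a campaign-brief assistant."},
        {"role": "user", "content": "Draft three taglines for a spring sneaker launch."},
    ],
)
print(response.choices[0].message.content)
```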
The architecture in one paragraph
The interesting bit is CSA + HCA. Compressed sparse attention narrows each layer's attention to a small set of high-information tokens; heavy compressed attention then adds a dense compression pass on top. The combination is what makes the 1M context affordable rather than a benchmark trophy. DeepSeek trained and serves V4 on Huawei Ascend-class infrastructure rather than a CUDA-only stack, with Cambricon's vLLM adaptation handling inference optimization.
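To make the sparse-attention half of that concrete, here's a toy top-k attention pass in NumPy. This illustrates the general idea, not DeepSeek's implementation; CSA's actual selection and compression machinery isn't public:

```python
# Toy sketch of sparse attention: each query attends to only its top_k
# highest-scoring keys instead of the full context.
import numpy as np

def sparse_attention(q, k, v, top_k=8):
    # Dense scores are materialized here only for clarity; a real sparse
    # kernel selects tokens without computing the full matrix.
    scores = q @ k.T / np.sqrt(q.shape[-1])                # (n_q, n_kv)
    cutoff = np.partition(scores, -top_k, axis=-1)[:, [-top_k]]
    masked = np.where(scores >= cutoff, scores, -np.inf)   # keep top_k per query
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 64))        # 4 queries
k = rng.normal(size=(1024, 64))     # 1,024 cached tokens
v = rng.normal(size=(1024, 64))
out = sparse_attention(q, k, v)     # each query reads 8 of 1,024 tokens
```

The cost intuition carries to 1M tokens: per-query reads scale with top_k, not with context length.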
Benchmarks worth quoting
| Benchmark | Result |
|---|---|
| Arena.ai open-source code arena | V4-Pro #3 |
| Arena.ai overall | V4-Pro #14 |
| Vals AI Vibe Code Benchmark | V4 #1 among open-weight models |
| Vibe Code vs V3.2 | ~10× performance jump |
| Closed-model competitive set | Beats Gemini 3.1 Pro in select scenarios |
DeepSeek's own messaging is honest about the gap: V4 "still trails the very top closed systems by roughly three to six months in complex knowledge and reasoning ability." For most creator workflows that gap doesn't bind — but it's worth knowing it exists.
What changed between V3, R1, and V4
V3 was a strong text and code model. R1 was a chain-of-thought reasoning model. V4 unifies both modes under one model with selectable thinking and non-thinking inference paths. Context expanded from 128K (V3) to 1M (V4). Tool use and long-context retrieval are now first-class instead of patched on.
| Capability | V3 | R1 | V4 |
|---|---|---|---|
| Context | 128K | 128K | 1M |
| Reasoning mode | No | Yes (default) | Toggleable |
| Tool use | Limited | Limited | First-class |
| Multimodal | No | No | Roadmap (in progress) |
What multimodal means here — and what it doesn't (yet)
DeepSeek's V4 launch deliberately undersold the multimodal piece. The release described the multimodal feature matrix as "continuing to evolve" — there are no published image, video, or audio entry points at the API level today. That's not a knock; it's a roadmap signal. The current value of V4 for creators sits in long-context text and tool-driven workflows that wrap the visual stack, not inside it.
When the multimodal entry points land, they'll fold into the OmniArt model picker the same way GPT Image 2 and the rest did. Until then, treat V4 as the brain that drives the brief.
What creators actually do with V4 today
Three patterns earn their keep on OmniArt right now.
1. Brand bibles as 1M-token context
The 1M context comfortably holds a full brand book, every published campaign, the tone-of-voice guide, the character sheet, the do-not-say list, and the last twelve months of post copy. Pin all of it as system context, then ask V4 to draft a launch brief. The output respects the entire document set without an embeddings round-trip.
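A minimal sketch of the pattern, assuming the OpenAI-compatible endpoint; the file names and model id are illustrative, not real OmniArt or DeepSeek identifiers:

```python
# Pin a full brand corpus as system context -- viable because the corpus
# fits inside the 1M-token window. File names and model id are placeholders.
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

corpus = "\n\n---\n\n".join(
    Path(name).read_text()
    for name in ["brand_book.md", "tone_of_voice.md",
                 "do_not_say.md", "campaigns_2025.md"]
)

brief = client.chat.completions.create(
    model="deepseek-v4-pro",  # placeholder tier name
    messages=[
        {"role": "system",
         "content": f"Ground every answer in this brand corpus:\n\n{corpus}"},
        {"role": "user", "content": "Draft the launch brief for the summer campaign."},
    ],
)
print(brief.choices[0].message.content)
```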
2. Long-form structured generation
Output is capped at 384K tokens. That's enough to draft an entire narrative bible, a six-episode storyboard with shot lists, or a 50-page localization spec in a single pass. At ~$0.28 per 1M output tokens, V4-Flash is the cheapest reliable way to draft structured content at these lengths.
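A quick back-of-envelope on a maximum-length pass, using the prices from the tier table earlier:

```python
# Cost of a full 384K-token output on V4-Flash at the quoted rate.
output_tokens = 384_000
usd_per_million = 0.28                                    # ~USD for ¥2 / 1M output tokens
print(f"${output_tokens / 1e6 * usd_per_million:.3f}")    # ~$0.108 per full-length pass
```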
3. Tool-first agents that drive the visual stack
V4's tool-call discipline is the part that matters when you wire it to image and video generators. Hand it the OmniArt API surface, give it a brief, and it will propose the model, the prompt, and the references shot by shot. That's the pattern OmniArt is building integration around.
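Here's a sketch of that wiring. The tool name, its parameter schema, and the idea of a local OmniArt wrapper are all hypothetical; substitute the real API surface when you build this:

```python
# Expose a hypothetical image-generation tool to V4 and let it plan the shot.
import json
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

tools = [{
    "type": "function",
    "function": {
        "name": "generate_image",  # hypothetical wrapper around OmniArt
        "description": "Render a still with a chosen OmniArt image model.",
        "parameters": {
            "type": "object",
            "properties": {
                "model": {"type": "string",
                          "description": "e.g. a Nano Banana Pro or GPT Image 2 id"},
                "prompt": {"type": "string"},
                "reference_urls": {"type": "array", "items": {"type": "string"}},
            },
            "required": ["model", "prompt"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-v4-pro",  # placeholder tier name
    messages=[{"role": "user",
               "content": "Plan the hero shot for the spring campaign."}],
    tools=tools,
)

for call in resp.choices[0].message.tool_calls or []:
    args = json.loads(call.function.arguments)
    print(call.function.name, args)  # hand these to the real OmniArt API
```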
Picking between V4-Flash and V4-Pro
The price ratio is roughly 12× — Flash for high-volume ideation, Pro for the sessions where depth matters more than token cost.
| Job | Pick |
|---|---|
| Brainstorming, drafting, headline iteration | V4-Flash |
| Brand-bible reasoning, narrative construction | V4-Pro |
| Long-context retrieval over campaign history | V4-Pro |
| Tool-driven agent loops that drive image/video | V4-Pro for planning, V4-Flash for execution |
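In code, that last row is a two-stage loop: plan once on Pro, fan execution out to Flash. A sketch with placeholder tier names:

```python
# Plan on V4-Pro, execute each step on V4-Flash (~12x cheaper per token).
PLAN_MODEL = "deepseek-v4-pro"    # placeholder tier names
EXEC_MODEL = "deepseek-v4-flash"

def run_job(client, brief: str) -> list[str]:
    plan = client.chat.completions.create(
        model=PLAN_MODEL,
        messages=[{"role": "user",
                   "content": f"Break this brief into numbered steps:\n{brief}"}],
    ).choices[0].message.content

    drafts = []
    for step in filter(str.strip, plan.splitlines()):
        drafts.append(client.chat.completions.create(
            model=EXEC_MODEL,
            messages=[{"role": "user", "content": f"Execute this step: {step}"}],
        ).choices[0].message.content)
    return drafts
```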
How V4 fits next to the rest of the OmniArt stack
V4 isn't a replacement for the image and video models in OmniArt. It's the planning layer above them. The pattern that's emerging:
| Layer | Job | Model |
|---|---|---|
| Plan | Brief, storyboard, shot list, brand reasoning | DeepSeek V4-Pro |
| Image | Stills, reference frames, layout | Nano Banana Pro, GPT Image 2, Seedream 5.0 Lite |
| Video | Animated shots, multi-shot sequences | PixVerse V6 / BACH, Sora 2, Veo 3, Seedance 2.0, HappyHorse 1.0 |
| Iterate | Restyle, extend, modify | Grok Imagine, Runway Gen-4.5 |
Note
The multimodal entry points for V4 are on DeepSeek's published roadmap but not in the OmniArt model picker yet. We'll publish a follow-up the day they land — credits, recommended prompts, and where they sit in the stack.
What to watch next
Three signals worth tracking in the next two months.
- Multimodal API entry points. When DeepSeek publishes them, the model picker conversation reopens.
- Distilled V4 variants. Earlier reporting flagged a V4 Lite and an even smaller distilled variant. Both could change the cost surface for high-volume tool-call agents.
- Hardware story. The Huawei Ascend-class inference path matters for regions where CUDA-only models are harder to deploy.
Getting started on OmniArt
DeepSeek V4 isn't yet a one-click model in the OmniArt picker — its current home is the API. If you want to use it as the planning layer above OmniArt today, drive it through the OpenAI-compatible endpoint at api.deepseek.com and point its tool-call surface at the OmniArt API for image and video generation.
For background reading on the visual side of the stack, the GPT Image 2 vs Nano Banana 2 comparison covers the flagship image picker decision, and the best image-to-video shortlist covers the video-side options V4 will eventually drive.