Veo 3.1 prompt and cinematic guide

How to prompt Veo 3.1 for broadcast-grade results: native 4K, spatial audio, and strong image adherence — with a reusable five-part formula and before/after reads inside OmniArt.

OmniArt Team
Veo 3.1 is OmniArt's broadcast-grade video model — the one you reach for when the output has to hold up on a large screen. It ships native 4K, spatial audio that co-generates with the video frames, and unusually strong start-frame adherence when you supply a reference image. But none of that matters if the prompt is vague. This guide gives you a reusable formula for directing Veo 3.1 the way it wants to be directed, along with before/after reads, a cinematic vocabulary table, and guidance on choosing the right tier (standard, fast, or lite) for the job.

The five-part Veo 3.1 prompt formula

Veo 3.1 responds well to structured prompts that answer five questions in order. Leave one out and the model fills the gap — usually in the most generic way possible.

  1. Subject and action — who or what, doing what, where. "A filmmaker reviewing footage alone in a dark edit suite."
  2. Camera (movement, lens, framing) — shot size, focal length, the move and its speed. "Slow push-in, 50mm, medium close-up, camera locked then drifting forward."
  3. Lighting and mood — source, direction, quality, palette. "Single monitor glow as key light, deep shadows, cool blue, high contrast."
  4. Audio and ambience — what the space sounds like, any specific sounds, music direction or no music. "Quiet electrical hum, occasional keyboard click, no music."
  5. Technical output — resolution (4K or not), duration intent, any style reference. "4K, 8 seconds, photorealistic."

A fully worked example

Prompt:

"A filmmaker reviewing footage alone in a dark edit suite. Slow push-in, 50mm, medium close-up, camera locked then drifting forward. Single monitor glow as key light, deep shadows, cool blue, high contrast. Quiet electrical hum, occasional keyboard click, no music. 4K, 8 seconds, photorealistic."

This prompt takes under thirty seconds to write. It specifies the shot the way a director of photography would describe it to a gaffer, and Veo 3.1 has little room to guess incorrectly.
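
If you build prompts programmatically, the five-part formula can be sketched as a small helper. This is a minimal illustration, not an OmniArt API: the `ShotPrompt` class and its field names are hypothetical, and the only behaviour assumed is that the five parts are joined into one sentence each, in formula order.

```python
from dataclasses import dataclass

# Hypothetical helper: field names mirror the five-part formula,
# they are not part of any OmniArt or Veo interface.
FIELD_ORDER = ("subject_action", "camera", "lighting", "audio", "technical")


@dataclass
class ShotPrompt:
    subject_action: str
    camera: str
    lighting: str
    audio: str
    technical: str

    def render(self) -> str:
        """Join the five parts in formula order, one sentence per part."""
        parts = [getattr(self, name) for name in FIELD_ORDER]
        missing = [n for n, p in zip(FIELD_ORDER, parts) if not p.strip()]
        if missing:
            # Refuse to render rather than let the model fill the gap.
            raise ValueError(f"missing prompt parts: {', '.join(missing)}")
        return " ".join(p.strip().rstrip(".") + "." for p in parts)


edit_suite = ShotPrompt(
    subject_action="A filmmaker reviewing footage alone in a dark edit suite",
    camera="Slow push-in, 50mm, medium close-up, camera locked then drifting forward",
    lighting="Single monitor glow as key light, deep shadows, cool blue, high contrast",
    audio="Quiet electrical hum, occasional keyboard click, no music",
    technical="4K, 8 seconds, photorealistic",
)
print(edit_suite.render())
```

Raising on an empty part is a design choice that matches the point above: a missing part is a decision handed to the model, so the sketch makes the omission visible instead.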

Tip

Put audio direction in every prompt, not just the ones where sound matters. Veo 3.1 generates spatial audio alongside the video frames, so leaving audio unspecified doesn't silence the output; it just hands control to the model. Write "no music" if you want a clean room tone you can score separately.

Cinematic vocabulary cheat table

These terms translate directly into Veo 3.1 generations. Copy the phrases you need into your prompts.

Camera moves

| Move | Prompt phrase |
| --- | --- |
| Slow approach | "slow dolly-in", "gentle push-in" |
| Retreat | "slow pull-back", "dolly-out to reveal" |
| Track alongside | "smooth tracking shot from the left", "lateral dolly" |
| Rise and reveal | "slow crane up to reveal the skyline" |
| Handheld tension | "subtle handheld shake, reactive framing" |
| Locked, stable | "tripod-locked", "static wide" |
| Arc around subject | "slow arc around the subject" |

Shot sizes and angles

| Intent | Prompt phrase |
| --- | --- |
| Scale and context | "wide 18mm, deep focus, full environment" |
| Subject in space | "medium shot, eye level" |
| Intimacy | "medium close-up, 50mm" |
| Intensity | "tight close-up, 85mm, shallow focus" |
| Power and menace | "low angle looking up" |
| Vulnerability | "high angle looking down" |

Lighting

| Look | Prompt phrase |
| --- | --- |
| Natural warmth | "golden-hour side light, warm highlights, cool shadows" |
| Moody contrast | "chiaroscuro, single hard source from camera right" |
| Urban atmosphere | "neon spill, magenta and cyan, reflections in wet pavement" |
| Clean interview | "soft diffused key, slightly warm, low contrast" |
| Night presence | "practical light only: a single lamp, deep background falloff" |
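
If you reuse these phrases often, the cheat tables can be kept as lookup dictionaries. The sketch below is purely illustrative: the dictionary names and the `vocabulary_lines` function are hypothetical, and it carries only a subset of the phrases listed above, keyed by the intent in the left-hand column.

```python
# Hypothetical lookup tables built from a subset of the cheat tables above.
CAMERA_MOVES = {
    "slow approach": "slow dolly-in",
    "retreat": "slow pull-back",
    "track alongside": "lateral dolly",
    "locked, stable": "tripod-locked",
    "arc around subject": "slow arc around the subject",
}
SHOT_SIZES = {
    "scale and context": "wide 18mm, deep focus, full environment",
    "subject in space": "medium shot, eye level",
    "intimacy": "medium close-up, 50mm",
    "intensity": "tight close-up, 85mm, shallow focus",
}
LIGHTING = {
    "natural warmth": "golden-hour side light, warm highlights, cool shadows",
    "moody contrast": "chiaroscuro, single hard source from camera right",
    "clean interview": "soft diffused key, slightly warm, low contrast",
}


def _sentence(text: str) -> str:
    """Capitalise the first letter and close with a full stop."""
    return text[0].upper() + text[1:] + "."


def vocabulary_lines(shot: str, move: str, look: str) -> str:
    """Return a camera sentence and a lighting sentence for the given intents."""
    camera = f"{SHOT_SIZES[shot]}, {CAMERA_MOVES[move]}"
    return f"{_sentence(camera)} {_sentence(LIGHTING[look])}"


print(vocabulary_lines("intensity", "slow approach", "moody contrast"))
```

An unknown intent raises `KeyError`, which is deliberate in this sketch: it is better to notice a missing phrase than to send a prompt with a silent gap.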

Before and after reads

A: camera direction — the biggest single lever

The most impactful change you can make to a Veo 3.1 prompt is adding a camera move and focal length. Compare:

Without: "A street musician playing violin in the rain."

With: "Medium close-up of a street musician playing violin in the rain. Slow dolly-in, 85mm, shallow depth of field — background traffic dissolving into blur. Practical street-lamp from above, rim-lighting the bow. Light rain sound, distant traffic, no music."

The second version doesn't use the word "cinematic" once. It specifies what makes the shot cinematic — and the model renders the intent rather than picking one of ten generic interpretations.

B: image-to-video start-frame adherence

Veo 3.1 has notably strong image adherence when you supply a reference image as the start frame. The model holds the composition, colour grade, and key character details from the first frame and uses them as a constraint throughout the generation.

Practical use: take a still from a commercial shoot, a product render, or a character concept, supply it as the start frame in OmniArt's image-to-video workflow, then write a prompt that describes the motion from that starting point.

Prompt after supplying a product-shot start frame:

"The perfume bottle sits on a white marble surface. Slow arc from left to right, the bottle staying centred. Late-afternoon light from a high window sweeps across the glass, catching the facets. 4K, 6 seconds, no music."

The model inherits the lighting, product positioning, and surface texture from your reference and applies the described motion to them, rather than regenerating the scene from scratch.

Note

Image adherence is strongest when your start-frame image is close to the aspect ratio and resolution you are generating at. A square image supplied to a 16:9 generation will be cropped or pillar-boxed, which can shift the composition the model inherits.
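
The aspect-ratio check in the note above is simple arithmetic, and it can be run before uploading a start frame. The following is a minimal sketch: the function names and the 2% tolerance are assumptions chosen for illustration, not values documented for Veo 3.1 or OmniArt.

```python
def aspect_mismatch(frame_w: int, frame_h: int, target_w: int, target_h: int) -> float:
    """Relative difference between the start-frame and target aspect ratios."""
    frame_ratio = frame_w / frame_h
    target_ratio = target_w / target_h
    return abs(frame_ratio - target_ratio) / target_ratio


def needs_reframing(frame_w: int, frame_h: int, target_w: int, target_h: int,
                    tolerance: float = 0.02) -> bool:
    """True when the start frame would likely be cropped or pillar-boxed.

    The default tolerance is an illustrative assumption.
    """
    return aspect_mismatch(frame_w, frame_h, target_w, target_h) > tolerance


# A square still against a 16:9 4K target: the ratios differ by 7/16.
print(aspect_mismatch(1024, 1024, 3840, 2160))
# A 1080p still against the same target shares the 16:9 ratio.
print(needs_reframing(1920, 1080, 3840, 2160))
```

If the check flags a mismatch, cropping the still to the target ratio yourself keeps the composition decision in your hands rather than leaving it to an automatic crop.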

C: spatial audio from a single prompt line

Veo 3.1's spatial audio doesn't require a separate pass: one descriptive audio line in the prompt is enough to produce a layered, positionally aware soundscape.

Prompt fragment:

"...Audio: close-mic'd rain on corrugated iron overhead, a distant market crowd, occasional motorbike passing right to left, no music."

What the model produces: the rain is present and directional, audible above the scene. The market crowd occupies the mid-distance. The motorbike sweeps through the stereo field as described. The directionality comes from Veo 3.1's native audio architecture, not post-processing. Naming layers and their spatial relationships (close, distant, passing right to left) gives the model what it needs to render positionally.
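
An audio line like the one above can be assembled from named layers. This sketch assumes nothing about the model beyond what the guide describes: `audio_line` is a hypothetical helper, and the "Audio:" prefix simply mirrors the prompt fragment shown earlier.

```python
from typing import Optional, Sequence


def audio_line(layers: Sequence[str], music: Optional[str] = None) -> str:
    """Build one audio sentence from spatially described layers.

    Each layer should carry its own position cue (close, distant, passing
    right to left). When no music direction is given, the line closes with
    an explicit "no music" so the choice is not left to the model.
    """
    if not layers:
        raise ValueError("name at least one audio layer")
    closing = music if music else "no music"
    return "Audio: " + ", ".join(list(layers) + [closing]) + "."


print(audio_line([
    "close-mic'd rain on corrugated iron overhead",
    "a distant market crowd",
    "occasional motorbike passing right to left",
]))
```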

Choosing between standard, fast, and lite

Veo 3.1 ships three tiers on OmniArt. The right choice depends on the job, not a default habit.

| Tier | When to use it | Credit cost |
| --- | --- | --- |
| veo-3.1-standard | Final output, broadcast delivery, client review, any 4K use case | Highest per second |
| veo-3.1-fast | Iteration and prompt refinement at reasonable quality | Mid-range |
| veo-3.1-lite | Quick concept tests, thumbnail checks, storyboard motion passes | Lowest per second |

When 4K is worth the extra credits: large-screen deliverables, product hero shots, anything that will be exported at full resolution, or work where the model's detail rendering in backgrounds and textures matters to the brief. 4K is only available on veo-3.1-standard.

When 4K is wasted: social crops at 1080p or smaller, motion drafts you're going to regenerate anyway, anything you're exploring rather than delivering. Use veo-3.1-lite for that work — iterate cheaply, then switch to standard for the final pass.

Warning

Running 4K on an exploratory prompt that you'll regenerate several times multiplies your credit spend fast. Settle the prompt on fast or lite first, then commit the final version to standard at 4K.
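
The tier guidance can be written down as a small decision function. The tier names come from the table above; the function itself and its parameter names are hypothetical, and the rules are only one reading of the guidance in this section.

```python
def choose_tier(final_delivery: bool, needs_4k: bool, quick_concept: bool = False) -> str:
    """Pick a Veo 3.1 tier name following the guidance in this section."""
    if needs_4k or final_delivery:
        # 4K is only available on standard, and final output belongs there.
        return "veo-3.1-standard"
    if quick_concept:
        # Concept tests, thumbnail checks, storyboard motion passes.
        return "veo-3.1-lite"
    # Default for iteration and prompt refinement.
    return "veo-3.1-fast"


print(choose_tier(final_delivery=False, needs_4k=False))
print(choose_tier(final_delivery=True, needs_4k=True))
```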

Common prompt mistakes

Over-stuffing the subject line. "A middle-aged woman with curly red hair wearing a vintage coat standing by a canal in Amsterdam holding a bouquet of tulips looking wistful" front-loads so many details that the model has to choose which ones to actually render. Break the character description into what is essential for this shot and let the rest go.

Conflicting camera directions. "Slow push-in with a wide pull-back" is physically impossible — the model will pick one and ignore the other. Write one motivated move per prompt. If you need a shot that starts wide and closes in, that is a push-in, full stop.

Forgetting audio entirely. Veo 3.1 will generate audio whether or not you direct it. An undirected audio generation is not silence; it's the model's best guess, which may not match your intent. Always close the prompt with one audio line, even if it's just "no music, ambient room tone only."

Writing "cinematic" as a style word. The word "cinematic" asks the model to make a decision you should be making. Replace it with the specific visual properties you actually want: lens, light, motion, palette.
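
The four mistakes above lend themselves to a quick pre-flight check. The following is a rough sketch rather than a complete validator: `lint_prompt` is a hypothetical helper, the keyword lists are deliberately small, and the 20-word limit on the first sentence is an arbitrary threshold chosen for illustration.

```python
import re

# Small, illustrative keyword lists; extend them for real use.
AUDIO_PATTERN = re.compile(r"\b(audio|sounds?|music|ambience|room tone|hum)\b")
INWARD_MOVES = ("push-in", "dolly-in")
OUTWARD_MOVES = ("pull-back", "dolly-out")
MAX_SUBJECT_WORDS = 20  # assumed threshold, not a documented limit


def lint_prompt(prompt: str) -> list:
    """Return a list of warnings for the common mistakes described above."""
    text = prompt.lower()
    warnings = []
    first_sentence = prompt.split(".")[0]
    if len(first_sentence.split()) > MAX_SUBJECT_WORDS:
        warnings.append("subject line is over-stuffed; keep only what this shot needs")
    if any(m in text for m in INWARD_MOVES) and any(m in text for m in OUTWARD_MOVES):
        warnings.append("conflicting camera moves; write one move per prompt")
    if not AUDIO_PATTERN.search(text):
        warnings.append("no audio direction; add at least 'no music'")
    if re.search(r"\bcinematic\b", text):
        warnings.append("'cinematic' used as a style word; name lens, light, motion, palette")
    return warnings


for warning in lint_prompt(
    "Cinematic shot of a violinist in the rain. Slow push-in with a wide pull-back."
):
    print(warning)
```

A check like this only catches surface patterns. It cannot tell whether the camera move is motivated or whether the audio line fits the scene, so treat an empty result as a starting point, not an approval.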

Getting started on OmniArt

Veo 3.1 — standard, fast, and lite — is available in the OmniArt video workspace alongside every other model in the library. The fastest way to build fluency is to take one existing idea, write it as the five-part formula above, and generate on veo-3.1-fast first to refine the prompt before committing to standard.

For the broader cinematic vocabulary and how the same prompt patterns apply across OmniArt's full video model lineup, see the cinematic AI video prompt guide. When you're ready to go deeper on Veo 3.1's audio generation specifically, the Veo 3.1 spatial audio best practices guide covers layered soundscapes, positional audio cues, and music direction in detail. For a head-to-head look at how Veo 3.1 stacks up against other top-tier models, see Veo 3.1 vs Sora 2.
