From product still to moving ad: Grok Imagine 1.5 image-to-video best practices

Grok Imagine 1.5's strongest mode is turning a clean product still into a moving ad. Source-image rules, a five-part prompt formula, a 480p-to-720p workflow, and four worked examples inside OmniArt.

OmniArt Team
Grok Imagine 1.5's image-to-video mode has one job it does exceptionally well: take a clean product still and turn it into a moving ad clip without rebuilding the product from a text description. The Aurora engine anchors the subject's position, lighting, and camera trajectory from your source image, so the sneaker stays the right shade of white and the watch dial stays legible — which text-to-video simply cannot guarantee for a product you actually sell.

This guide covers the three craft pillars that determine whether a Grok Imagine 1.5 i2v clip is usable on the first attempt: source image quality, prompt construction, and the 480p-to-720p resolution workflow. Four worked examples — a sneaker, a watch, a handbag, and a beauty product — show each pillar applied end to end.

For the broader e-commerce ad workflow covering model selection, platform formats, and audio, see Turn product photos into video ads with OmniArt. This article stays narrowly focused on getting the best results from Grok Imagine 1.5 specifically.

What Grok Imagine 1.5 brings to image-to-video

| Spec | Value |
| --- | --- |
| Resolution | Up to 720p |
| Frame rate | 24 fps |
| Duration | 1–15 seconds |
| Native audio | Yes, generated in the same inference pass |
| Image base | FLUX.1 (Black Forest Labs) |
| Arena ranking | Ranked #1 on the Image-to-Video Arena (+52 Elo over 1.0) |

The FLUX.1 foundation is the reason natural-language prompting works here. You describe the shot the way you'd brief a camera operator rather than keyword-stacking your way through an OpenCLIP vocabulary. The Aurora engine then uses the source image as the dominant spatial reference, keeping the subject's silhouette, color, and relative position stable while the camera and light move around it.

OmniArt surfaces Grok Imagine inside the video workspace alongside every other model, so no separate xAI subscription is needed. The credit rate is 10 credits per second at 480p and 15 credits per second at 720p — meaning a 5-second 480p draft costs 50 credits and the same 5-second 720p final costs 75.
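The credit arithmetic above can be sketched in a few lines. This is only an illustration: `clip_cost` and `CREDITS_PER_SECOND` are hypothetical names, not part of any OmniArt API, and the rates and the 1 to 15 second range are simply the figures quoted in this guide.

```python
# Credit rates per second of video, as quoted in this guide.
CREDITS_PER_SECOND = {"480p": 10, "720p": 15}


def clip_cost(seconds: int, resolution: str) -> int:
    """Return the credit cost of one clip (hypothetical helper)."""
    if not 1 <= seconds <= 15:
        raise ValueError("clip duration must be between 1 and 15 seconds")
    return seconds * CREDITS_PER_SECOND[resolution]


draft = clip_cost(5, "480p")
final = clip_cost(5, "720p")
print(draft, final)  # 50 75
```

The same helper works for budgeting longer clips, for example `clip_cost(10, "720p")` for a 10-second final.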

Pillar 1: Source image quality

The Aurora engine anchors composition from the source frame. Strong inputs produce anchored motion; weak inputs introduce drift — the model re-interpolates what it can't read clearly, and accuracy suffers.

The source-image checklist

| Do | Don't |
| --- | --- |
| Use a clean, uncluttered background (white, light grey, or lifestyle context with breathing room) | Use backgrounds so busy that the product disappears into them |
| Shoot or crop so the product fills 50–70% of the frame | Use heavily cropped or edge-clipped product shots |
| Use high contrast between subject and background | Use a product shot whose color matches the background |
| Keep text, logos, and labels in focus and readable | Use images with heavy JPEG compression artifacts |
| Work from your highest-resolution source (at minimum 1024 × 1024) | Use a thumbnail-resolution or downsized web image |
| Use a single hero subject per frame | Use a grouped flat-lay with five products |
| Make sure the product's defining detail (sole, dial, clasp, cap) is clearly visible | Use an angled shot that hides the product's key feature |

Warning

Compression artifacts and visual ambiguity in the source carry into the motion. The model can't recover sharpness that isn't there — it will interpolate and invent, which produces label blur and shape drift. Always start from the cleanest file you have.
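The measurable items on the checklist (minimum resolution, frame fill, edge clipping) can be pre-checked before upload. The sketch below is a rough illustration under stated assumptions: `SourceImage` and `check_source` are hypothetical names, the product bounding box is something you supply yourself, and "fills 50–70% of the frame" is interpreted here as bounding-box area divided by frame area, which is one reading of the guideline rather than a documented rule.

```python
from dataclasses import dataclass
from typing import List, Tuple

MIN_SIDE = 1024            # minimum source size from the checklist
FILL_RANGE = (0.50, 0.70)  # product should fill 50-70% of the frame


@dataclass
class SourceImage:
    width: int
    height: int
    # Product bounding box in pixels: (left, top, right, bottom).
    subject_box: Tuple[int, int, int, int]


def check_source(img: SourceImage) -> List[str]:
    """Return a list of checklist problems; an empty list means none found."""
    problems = []
    if min(img.width, img.height) < MIN_SIDE:
        problems.append(
            f"resolution {img.width}x{img.height} is below {MIN_SIDE}x{MIN_SIDE}"
        )
    left, top, right, bottom = img.subject_box
    if left <= 0 or top <= 0 or right >= img.width or bottom >= img.height:
        problems.append("product touches the frame edge (edge-clipped)")
    # Assumption: "fill" means bounding-box area over frame area.
    fill = (right - left) * (bottom - top) / (img.width * img.height)
    if not FILL_RANGE[0] <= fill <= FILL_RANGE[1]:
        problems.append(f"product fills {fill:.0%} of the frame; aim for 50-70%")
    return problems


good = SourceImage(2048, 2048, (300, 300, 1800, 1800))
weak = SourceImage(800, 800, (0, 100, 400, 400))
print(check_source(good))
for problem in check_source(weak):
    print(problem)
```

Checks like contrast, label sharpness, and compression artifacts still need a human eye; this only covers the numeric thresholds.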

Why this matters more for image-to-video than for text-to-video

With text-to-video you describe a product and the model invents one that fits your words. With image-to-video the model is committed to respecting your actual product — but only to the degree it can read it from the source frame. A low-resolution or visually ambiguous photo is the single most common reason Grok Imagine 1.5 i2v outputs disappoint.

Pillar 2: The five-part prompt formula

Grok Imagine 1.5 uses FLUX.1 as its image foundation, which rewards natural-language descriptions over keyword strings. The five parts below map to what Aurora's motion engine can directly act on.

The formula

[Action] — [Lighting] — [Pace] — [Background] — [Mood/reference]

Each part in detail:

  1. Action: the camera or subject movement. Be specific: "slow dolly-in from waist height", "orbital pan around the left side", "gentle vertical float, 3 cm up and back down". Vague terms like "dynamic" give the model too much latitude and produce inconsistent results.

  2. Lighting: describe light direction, quality, and source. "Rim-lit from behind with a warm tungsten key at camera-left" beats "dramatic lighting". Specific color temperatures ("3200K", "5600K daylight") or named light qualities ("softbox fill", "hard shadow at 45 degrees") anchor the look.

  3. Pace: the speed and rhythm of the motion. "2-second slow push, no acceleration", "0.5× playback feel", "unhurried, editorial". Without an explicit pace the model defaults to moderate motion, which is too fast for hero product work.

  4. Background: whether it should hold still, subtly shift, or contribute to the scene. "White cyclorama, no background motion", "Blurred bokeh marble surface, subtle light shift", "Studio void, no environment detail". Leaving this out often produces unwanted background drift.

  5. Mood and camera reference: a single phrase that calibrates the overall register. Equipment references are more reliable than adjectives: "shot on Fujifilm X-T4" beats "cinematic"; "luxury print ad feel" beats "high-end"; a specific month and time of day ("January morning, 9 AM studio") beats "golden hour".

Tip

Specific color words beat vague ones. "Ivory white" beats "light", "deep indigo" beats "dark blue", "champagne gold" beats "golden". The FLUX.1 base is trained on image descriptions that use precise color names, and the motion preserves whatever color reading it makes from the first frame.
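One way to keep all five parts present and in order is to assemble the prompt from named fields. The sketch below is illustrative only: `ShotPrompt` is a hypothetical helper, not something Grok Imagine or OmniArt provides, and the sample phrases are taken from the part descriptions above.

```python
from dataclasses import dataclass


@dataclass
class ShotPrompt:
    """The five parts of the formula, one short sentence each."""
    action: str
    lighting: str
    pace: str
    background: str
    mood: str

    def render(self) -> str:
        parts = [self.action, self.lighting, self.pace, self.background, self.mood]
        # Refuse to build a prompt with a missing part, since omissions
        # (pace, background) are a common cause of unwanted motion.
        if any(not part.strip() for part in parts):
            raise ValueError("all five parts are required")
        # One sentence per part, joined in formula order.
        return " ".join(part.strip().rstrip(".") + "." for part in parts)


prompt = ShotPrompt(
    action="Slow dolly-in from waist height",
    lighting="Rim-lit from behind with a warm tungsten key at camera-left",
    pace="2-second slow push, no acceleration",
    background="White cyclorama, no background motion",
    mood="Luxury print ad feel",
).render()
print(prompt)
```

Keeping the parts as separate fields also makes iteration easier: you can change only the pace or only the lighting between drafts and see which change moved the result.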

What to omit

Do not include product or fashion brand names, people's faces, or references to real places; the camera-equipment references recommended in part 5 are the one exception. Do not keyword-stack synonyms ("luxury high-end premium"): FLUX.1 natural-language prompting gains nothing from it, and the extra words add noise. One clear sentence per part is better than three fragmented adjectives.
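A quick lint pass can catch the vague terms and synonym stacks this guide warns against before you spend credits. This is a minimal sketch, not a real validator: `lint_prompt` is a hypothetical helper, and both word lists are assumptions drawn only from the examples named in this article, so extend them to suit your own prompts.

```python
import re
from typing import List

# Vague terms this guide names as weaker than a specific alternative.
VAGUE_TERMS = ("dynamic", "dramatic lighting", "cinematic", "high-end", "golden hour")
# Register words that add noise when stacked back to back (illustrative set).
STACKABLE = {"luxury", "high-end", "premium"}


def lint_prompt(prompt: str) -> List[str]:
    """Return human-readable warnings for a draft prompt."""
    text = prompt.lower()
    warnings = [
        f'vague term: "{term}"'
        for term in VAGUE_TERMS
        if re.search(rf"\b{re.escape(term)}\b", text)
    ]
    # Flag two register words in a row, e.g. "luxury high-end premium".
    words = re.findall(r"[a-z0-9-]+", text)
    for first, second in zip(words, words[1:]):
        if first in STACKABLE and second in STACKABLE:
            warnings.append("stacked synonyms: keep one register word")
            break
    return warnings


for warning in lint_prompt("Luxury high-end premium bag, dynamic motion"):
    print(warning)
```

A single register word on its own, as in "luxury print ad feel", passes cleanly; only the back-to-back stack is flagged.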

Pillar 3: The 480p-to-720p resolution workflow

The credit cost difference between 480p and 720p is 5 credits per second — modest for a single clip, but meaningful when you're iterating on prompt and motion before committing.

| Step | Resolution | Purpose | Cost (5s clip) |
| --- | --- | --- | --- |
| 1. Prompt ideation | 480p | Test the camera move and subject stability | 50 credits |
| 2. Motion refinement | 480p | Dial in pace, background, and lighting prompt | 50 credits per iteration |
| 3. Final output | 720p | Clean social or pitch-deck master | 75 credits |

Three 480p iterations plus one 720p final totals 225 credits — the same as three 720p renders. The key discipline is not moving to 720p until the 480p draft has the motion and composition you want. The Aurora engine scales the same clip, so a passing 480p result becomes a passing 720p output reliably.
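The budget comparison can be checked with a short calculation. The function names below are hypothetical, and the rates are the 10 and 15 credits per second quoted earlier in this guide; the sketch compares the staged workflow with producing the same four renders entirely at 720p.

```python
DRAFT_RATE = 10  # credits per second at 480p, as quoted in this guide
FINAL_RATE = 15  # credits per second at 720p, as quoted in this guide


def staged_cost(drafts: int, seconds: int = 5) -> int:
    """Cost of N 480p drafts followed by a single 720p final."""
    return drafts * seconds * DRAFT_RATE + seconds * FINAL_RATE


def all_720p_cost(renders: int, seconds: int = 5) -> int:
    """Cost of doing every render, drafts included, at 720p."""
    return renders * seconds * FINAL_RATE


staged = staged_cost(drafts=3)     # 3 x 50 + 75
direct = all_720p_cost(renders=4)  # 4 x 75
print(staged, direct, direct - staged)  # 225 300 75
```

Under these assumptions, the staged route gets four renders for the price of three 720p ones, and the saving grows by 25 credits with every extra draft on a 5-second clip.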

Note

Native audio is generated in the same inference pass regardless of resolution. The ambient sound and any mechanical audio Grok Imagine 1.5 produces at 480p will be identical in character to what the 720p final produces — so you can evaluate the audio during the 480p iteration stage as well.

Four worked examples

Example 1: Sneaker hero push

Product: White low-top sneaker, quarter-profile shot on a white tabletop, clean reflections.

Source image setup: Shot from slightly above at a 45-degree angle, sole visible, lace knots sharp, tongue label readable. Exported at 2048 × 2048, no compression.

Prompt:

"Slow dolly-in from mid-distance to close-up on the toe box, stopping when the sole fills one-third of frame. Hard shadow from overhead natural light raking left to right. Unhurried, 0.3× pace feel. White infinity backdrop, no movement. Shot on Leica SL2, luxury footwear editorial register."

What the motion adds: The gradual push-in reveals the material texture of the toe box and the sole edge in sequence — information a flat still can't communicate. The natural-light shadow raking across the side panel shows the surface quality without a voiceover.

Audio: Grok generates a faint ambient room tone and a subtle material sound as the sole comes into frame — remove or layer under music as needed.


Example 2: Watch reveal orbit

Product: Stainless steel dress watch, flat lay on grey textured paper, face-up with strap unfastened.

Source image setup: Face fills 60% of the frame, indices legible, crown detail visible at right. Shot at 2000 × 2000, even diffuse light.

Prompt:

"Slow orbital pan starting at the 9 o'clock position, travelling clockwise around the watch face, completing 180 degrees over 8 seconds. Softbox fill from above, hard specular rim from camera-right at 4500K. No pace acceleration. Pale grey textured-paper surface, stationary background. Studio watchmaker editorial style."

What the motion adds: The orbit catches the metallic glint of the case edge and the hands from multiple angles in a single pass — a product detail that typically requires four separate stills to communicate. The 180-degree arc keeps the dial legible throughout.

Audio: The Aurora engine generates a faint mechanical ambience — thin, precise, appropriate for watchmaking context. Useful as a bed under a voiceover.


Example 3: Handbag float and settle

Product: Tan structured leather handbag, standing upright against a warm cream background, hardware visible.

Source image setup: Front face centred in frame, top handle loops visible, zipper pull sharp. Shot at 1800 × 1800.

Prompt:

"Bag floats 6 cm upward from the surface, holds for 2 seconds at peak, then settles softly back down. Light barely moves. Warm 3200K ambient fill from above-left, subtle leather highlight from below-right. Deliberate, considered pace. Cream infinity backdrop, no environment motion. Luxury fashion catalogue register, shot on Hasselblad medium format."

What the motion adds: The float-and-settle creates a sense of weight and material substance — the bag behaves like a physical object rather than a cutout. The hold at peak gives the viewer time to read the hardware and stitching detail.

Audio: Room tone is minimal; the settle back down produces a faint surface contact sound that reinforces the physicality.


Example 4: Beauty product rotation with condensation

Product: Matte-finish serum bottle, upright, silver dropper cap, white label.

Source image setup: Bottle fills 55% of frame, label text sharp, cap detail visible, clean white background. Shot at 1920 × 1920.

Prompt:

"Slow counter-clockwise rotation, full 360 degrees over 10 seconds. Fine moisture condensation forms on the glass surface as the rotation begins and disperses by the halfway point. Soft cool daylight from above at 6000K, rim light from behind. Steady, unhurried pace. White studio background, no drift. Skincare campaign aesthetic, shot on Phase One IQ4."

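What the motion adds: The condensation effect communicates efficacy and freshness, two ideas that are conceptually expensive to convey in a still. The full rotation shows the bottle profile and the dropper mechanism from every angle. Because the model sees only the front of the bottle, any back-label copy it renders is invented, so check it before publishing.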

Warning

Condensation and particle effects are emergent in Grok Imagine 1.5 — the model interprets the instruction rather than rendering it procedurally. On some generations the effect is dense; on others it is subtle. Generate two to three 480p drafts and keep the result where the effect reads without obscuring the label.

Common failure modes and fixes

| Problem | Likely cause | Fix |
| --- | --- | --- |
| Label text blurs or warps during motion | Source image is compressed or label is small in frame | Start from a higher-resolution source; crop tighter so the label fills more of the frame |
| Subject drifts from its starting position | Background too visually similar to the product | Reshoot on a higher-contrast background, or describe the background colour explicitly in the prompt |
| Camera move is too fast | Pace left unspecified | Add an explicit pace descriptor: "unhurried", "0.3× feel", or a second-count |
| Background generates unwanted motion | Background description omitted | Add "stationary background, no background motion" explicitly |
| Colour shifts mid-clip | Source image has inconsistent white balance | Correct the source image white balance before upload |
| Native audio sounds mismatched | Mood reference is vague | Add a more specific register ("silent studio", "minimal room tone") if you don't want a generated soundscape |

When to choose Grok Imagine 1.5 vs other models

Grok Imagine 1.5 is the right tool when you have a clean source still and want consistent subject anchoring at a credit-efficient rate. It is not the right tool for every video brief.

| Need | Better fit |
| --- | --- |
| Character consistency across multi-shot scenes | Seedance 2.0 |
| Frame-level camera parameterisation | V6 |
| Broadcast 4K output | Veo 3 |
| Heavy motion energy, lifestyle UGC feel | PixVerse models |
| Longest clip runtime (up to 60s) | Sora 2 |

For the general model-selection framework across the full i2v landscape, the product photos to video ads guide covers picks by goal and budget.

Getting started on OmniArt

Open the OmniArt video workspace, select Grok Imagine as the model, and upload a product still that passes the source-image checklist above. Write a five-part prompt — action, lighting, pace, background, mood — and generate a 5-second draft at 480p. If the motion and subject anchoring hold, move to 720p for the final.

The whole loop — draft, refine, master — runs inside one workspace with the same credit balance you use across every other OmniArt model. No separate xAI account, no file export to a different tool, no starting over from text when you already have the product shot you want.
