From product still to moving ad: Grok Imagine 1.5 image-to-video best practices
Grok Imagine 1.5's strongest mode is turning a clean product still into a moving ad. Source-image rules, a five-part prompt formula, a 480p-to-720p workflow, and four worked examples inside OmniArt.

Grok Imagine 1.5's image-to-video mode has one job it does exceptionally well: take a clean product still and turn it into a moving ad clip without rebuilding the product from a text description. The Aurora engine anchors the subject's position, lighting, and camera trajectory from your source image, so the sneaker stays the right shade of white and the watch dial stays legible — which text-to-video simply cannot guarantee for a product you actually sell.
This guide covers the three craft pillars that determine whether a Grok Imagine 1.5 i2v clip is usable on the first attempt: source image quality, prompt construction, and the 480p-to-720p resolution workflow. Four worked examples — a sneaker, a watch, a handbag, and a beauty product — show each pillar applied end to end.
For the broader e-commerce ad workflow covering model selection, platform formats, and audio, see Turn product photos into video ads with OmniArt. This article stays narrowly focused on getting the best results from Grok Imagine 1.5 specifically.
What Grok Imagine 1.5 brings to image-to-video
| Spec | Value |
|---|---|
| Resolution | Up to 720p |
| Frame rate | 24 fps |
| Duration | 1–15 seconds |
| Native audio | Yes — generated in the same inference pass |
| Image base | FLUX.1 (Black Forest Labs) |
| Arena ranking | Ranked #1 on the Image-to-Video Arena (+52 Elo over 1.0) |
The FLUX.1 foundation is the reason natural-language prompting works here. You describe the shot the way you'd brief a camera operator rather than keyword-stacking your way through an OpenCLIP vocabulary. The Aurora engine then uses the source image as the dominant spatial reference, keeping the subject's silhouette, color, and relative position stable while the camera and light move around it.
OmniArt surfaces Grok Imagine inside the video workspace alongside every other model, so no separate xAI subscription is needed. The credit rate is 10 credits per second at 480p and 15 credits per second at 720p — meaning a 5-second 480p draft costs 50 credits and the same 5-second 720p final costs 75.
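The credit arithmetic above can be sketched in a few lines. This is a minimal illustration that assumes billing is linear per whole second at the rates quoted in this article; `clip_cost` is a hypothetical helper, not an OmniArt API.

```python
# Credit rates quoted in this article (credits per second of output video).
CREDITS_PER_SECOND = {"480p": 10, "720p": 15}

# Clip duration limits from the spec table (seconds).
MIN_SECONDS, MAX_SECONDS = 1, 15


def clip_cost(seconds: int, resolution: str) -> int:
    """Return the credit cost of one clip (hypothetical helper, assumes linear billing)."""
    if resolution not in CREDITS_PER_SECOND:
        raise ValueError(f"unknown resolution: {resolution!r}")
    if not MIN_SECONDS <= seconds <= MAX_SECONDS:
        raise ValueError("duration must be between 1 and 15 seconds")
    return seconds * CREDITS_PER_SECOND[resolution]


print(clip_cost(5, "480p"))  # 50
print(clip_cost(5, "720p"))  # 75
```

Under the same assumption, a 10-second 720p clip would come to 150 credits.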
Pillar 1: Source image quality
The Aurora engine anchors composition from the source frame. Strong inputs produce anchored motion; weak inputs introduce drift — the model re-interpolates what it can't read clearly, and accuracy suffers.
The source-image checklist
| Do | Don't |
|---|---|
| Use a clean, uncluttered background (white, light grey, or lifestyle context with breathing room) | Use backgrounds so busy that the product disappears into them |
| Shoot or crop so the product fills 50–70% of the frame | Use heavily cropped or edge-clipped product shots |
| Use high contrast between subject and background | Use a product shot whose color matches the background |
| Keep text, logos, and labels in focus and readable | Use images with heavy JPEG compression artifacts |
| Work from your highest-resolution source (at minimum 1024 × 1024) | Use a thumbnail-resolution or downsized web image |
| Use a single hero subject per frame | Use a grouped flat-lay with five products |
| Make sure the product's defining detail (sole, dial, clasp, cap) is clearly visible | Use an angled shot that hides the product's key feature |
Warning
Why this matters more for Grok than for text-to-video
With text-to-video you describe a product and the model invents one that fits your words. With image-to-video the model is committed to respecting your actual product — but only to the degree it can read it from the source frame. A low-resolution or visually ambiguous photo is the single most common reason Grok Imagine 1.5 i2v outputs disappoint.
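The measurable rows of the source-image checklist can be turned into a quick pre-upload check. The sketch below is only an illustration: `SourceImage` and `checklist_issues` are hypothetical names, the values are entered by hand rather than measured from the file, and "at minimum 1024 × 1024" is read as both dimensions being at least 1024 pixels. Subjective rows such as contrast and label legibility still need a human eye.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SourceImage:
    """Facts about a product still; every field is supplied by you, not measured here."""
    width_px: int
    height_px: int
    subject_frame_fraction: float  # share of the frame the product fills, 0.0 to 1.0
    subject_count: int
    subject_clipped_by_edge: bool


def checklist_issues(img: SourceImage) -> List[str]:
    """Return the measurable checklist rows this image fails (empty list means pass)."""
    issues = []
    if min(img.width_px, img.height_px) < 1024:
        issues.append("resolution below 1024 x 1024")
    if not 0.50 <= img.subject_frame_fraction <= 0.70:
        issues.append("product should fill 50-70% of the frame")
    if img.subject_count != 1:
        issues.append("use a single hero subject per frame")
    if img.subject_clipped_by_edge:
        issues.append("product is clipped by the frame edge")
    return issues


thumbnail = SourceImage(640, 640, 0.9, 5, True)
for issue in checklist_issues(thumbnail):
    print(issue)
```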
Pillar 2: The five-part prompt formula
Grok Imagine 1.5 uses FLUX.1 as its image foundation, which rewards natural-language descriptions over keyword strings. The five parts below map to what Aurora's motion engine can directly act on.
The formula
[Action] — [Lighting] — [Pace] — [Background] — [Mood/reference]
Each part in detail:
- Action: the camera or subject movement. Be specific: "slow dolly-in from waist height", "orbital pan around the left side", "gentle vertical float, 3 cm up and back down". Vague terms like "dynamic" give the model too much latitude and produce inconsistent results.
- Lighting: describe light direction, quality, and source. "Rim-lit from behind with a warm tungsten key at camera-left" beats "dramatic lighting". Specific color temperatures ("3200K", "5600K daylight") or named light qualities ("softbox fill", "hard shadow at 45 degrees") anchor the look.
- Pace: the speed and rhythm of the motion. "2-second slow push, no acceleration", "0.5× playback feel", "unhurried, editorial". Without an explicit pace the model defaults to moderate motion, which is too fast for hero product work.
- Background: whether it should hold still, subtly shift, or contribute to the scene. "White cyclorama, no background motion", "Blurred bokeh marble surface, subtle light shift", "Studio void, no environment detail". Leaving this out often produces unwanted background drift.
- Mood and camera reference: a single phrase that calibrates the overall register. Equipment references are more reliable than adjectives: "shot on Fujifilm XT4" beats "cinematic"; "luxury print ad feel" beats "high-end"; a specific month + time of day ("January morning, 9 AM studio") beats "golden hour".
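One way to keep all five parts present and in order is to assemble the prompt from named fields. The sketch below is illustrative only: `ProductShotPrompt` is a hypothetical helper, and the one-sentence-per-part rendering is an assumption about a sensible layout, not a format the model requires.

```python
from dataclasses import dataclass


@dataclass
class ProductShotPrompt:
    """The five parts of the formula, in the order the guide lists them."""
    action: str
    lighting: str
    pace: str
    background: str
    mood: str

    def render(self) -> str:
        parts = [self.action, self.lighting, self.pace, self.background, self.mood]
        if any(not p.strip() for p in parts):
            raise ValueError("all five parts are required")
        # One sentence per part: trim whitespace and trailing periods, then re-punctuate.
        return " ".join(p.strip().rstrip(".") + "." for p in parts)


prompt = ProductShotPrompt(
    action="Slow dolly-in from waist height",
    lighting="Rim-lit from behind with a warm tungsten key at camera-left",
    pace="2-second slow push, no acceleration",
    background="White cyclorama, no background motion",
    mood="Luxury print ad feel",
)
print(prompt.render())
```

Raising an error on an empty part is a deliberate choice here, since an omitted pace or background is a common cause of unwanted motion.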
Tip
What to omit
Apart from the camera-equipment reference in part five, do not include brand names, people's faces, or references to real places. Do not keyword-stack synonyms ("luxury high-end premium"): FLUX.1 natural-language prompting gains nothing from it, and the extra words add noise. One clear sentence per part is better than three fragmented adjectives.
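Part of this advice can be checked mechanically before you spend credits. The sketch below flags the vague terms this guide names and a simple case of synonym stacking. `lint_prompt` and the synonym group are hypothetical, and the check cannot detect brand names, faces, or real places.

```python
import re
from typing import List

# Terms this guide calls out as weaker than a specific alternative.
VAGUE_TERMS = ("dynamic", "dramatic lighting", "cinematic", "high-end", "golden hour")

# A hypothetical synonym group; extend it for your own product category.
LUXURY_SYNONYMS = {"luxury", "high-end", "premium", "upscale"}


def lint_prompt(prompt: str) -> List[str]:
    """Return warnings for vague terms and stacked synonyms (empty list means clean)."""
    text = prompt.lower()
    warnings = [
        f"vague term: {term!r}"
        for term in VAGUE_TERMS
        if re.search(rf"\b{re.escape(term)}\b", text)
    ]
    # Treat hyphenated words such as "high-end" as single tokens.
    words = set(re.findall(r"[a-z]+(?:-[a-z]+)*", text))
    stacked = sorted(words & LUXURY_SYNONYMS)
    if len(stacked) > 1:
        warnings.append("stacked synonyms: " + ", ".join(stacked))
    return warnings


for warning in lint_prompt("Dynamic move around a luxury high-end premium handbag"):
    print(warning)
```

A single use of a word like "luxury" passes, which matches the guide's own "luxury print ad feel" example; only two or more synonyms from the same group are flagged.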
Pillar 3: The 480p-to-720p resolution workflow
The credit cost difference between 480p and 720p is 5 credits per second — modest for a single clip, but meaningful when you're iterating on prompt and motion before committing.
Recommended workflow
| Step | Resolution | Purpose | Cost (5s clip) |
|---|---|---|---|
| 1. Prompt ideation | 480p | Test the camera move and subject stability | 50 credits |
| 2. Motion refinement | 480p | Dial in pace, background, and lighting prompt | 50 credits per iteration |
| 3. Final output | 720p | Clean social or pitch-deck master | 75 credits |
Three 480p iterations plus one 720p final total 225 credits, the same as three 720p renders of the same length. The key discipline is to hold off on 720p until the 480p draft has the motion and composition you want. The Aurora engine scales the same clip, so a 480p draft that passes reliably yields a 720p output that passes.
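The budgeting logic can be sketched as follows, again assuming linear per-second billing at the rates quoted in this article. Both function names are hypothetical.

```python
# Credit rates quoted in this article (credits per second).
DRAFT_RATE = 10  # 480p
FINAL_RATE = 15  # 720p


def workflow_cost(draft_count: int, seconds: int) -> int:
    """Credits for `draft_count` 480p drafts plus one 720p final of equal length."""
    return draft_count * seconds * DRAFT_RATE + seconds * FINAL_RATE


def max_drafts(budget: int, seconds: int) -> int:
    """How many 480p drafts fit in `budget` once the 720p final is reserved."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    remaining = budget - seconds * FINAL_RATE
    if remaining < 0:
        raise ValueError("budget does not cover the 720p final")
    return remaining // (seconds * DRAFT_RATE)


print(workflow_cost(3, 5))  # 225
print(max_drafts(225, 5))   # 3
```

Reserving the final render first means a tight budget reduces the number of drafts and never leaves you unable to produce the master.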
Four worked examples
Example 1: Sneaker hero push
Product: White low-top sneaker, quarter-profile shot on a white tabletop, clean reflections.
Source image setup: Shot from slightly above at a 45-degree angle, sole visible, lace knots sharp, tongue label readable. Exported at 2048 × 2048, no compression.
Prompt:
"Slow dolly-in from mid-distance to close-up on the toe box, stopping when the sole fills one-third of frame. Hard shadow from overhead natural light raking left to right. Unhurried, 0.3× pace feel. White infinity backdrop, no movement. Shot on Leica SL2, luxury footwear editorial register."
What the motion adds: The gradual push-in reveals the material texture of the toe box and the sole edge in sequence — information a flat still can't communicate. The natural-light shadow raking across the side panel shows the surface quality without a voiceover.
Audio: Grok generates a faint ambient room tone and a subtle material sound as the sole comes into frame — remove or layer under music as needed.
Example 2: Watch reveal orbit
Product: Stainless steel dress watch, flat lay on pale grey linen, face-up with strap unfastened.
Source image setup: Face fills 60% of the frame, indices legible, crown detail visible at right. Shot at 2000 × 2000, even diffuse light.
Prompt:
"Slow orbital pan starting at the 9 o'clock position, travelling clockwise around the watch face, completing 180 degrees over 8 seconds. Softbox fill from above, hard specular rim from camera-right at 4500K. No pace acceleration. Pale grey linen surface, stationary background. Studio watchmaker editorial style."
What the motion adds: The orbit catches the metallic glint of the case edge and the hands from multiple angles in a single pass — a product detail that typically requires four separate stills to communicate. The 180-degree arc keeps the dial legible throughout.
Audio: The Aurora engine generates a faint mechanical ambience — thin, precise, appropriate for watchmaking context. Useful as a bed under a voiceover.
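A degrees-and-seconds instruction like the one in this prompt implies a pace you can sanity-check before generating. The sketch below assumes a constant-speed orbit and uses the 24 fps figure from the spec table. `orbit_pace` is a hypothetical helper, and the model is not guaranteed to follow the stated sweep exactly.

```python
from typing import Tuple

FPS = 24  # frame rate from the spec table


def orbit_pace(sweep_degrees: float, seconds: float) -> Tuple[float, float]:
    """Return (degrees per second, degrees per frame) for a constant-speed orbit."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    per_second = sweep_degrees / seconds
    return per_second, per_second / FPS


print(orbit_pace(180, 8))  # (22.5, 0.9375)
```

At under one degree per frame, the requested orbit is slow, in keeping with the unhurried pace recommended for hero product work.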
Example 3: Handbag float and settle
Product: Tan structured leather handbag, standing upright against a warm cream background, hardware visible.
Source image setup: Front face centred in frame, top handle loops visible, zipper pull sharp. Shot at 1800 × 1800.
Prompt:
"Bag floats 6 cm upward from the surface, holds for 2 seconds at peak, then settles softly back down. Light barely moves. Warm 3200K ambient fill from above-left, subtle leather highlight from below-right. Deliberate, considered pace. Cream infinity backdrop, no environment motion. Luxury fashion catalogue register, shot on Hasselblad medium format."
What the motion adds: The float-and-settle creates a sense of weight and material substance — the bag behaves like a physical object rather than a cutout. The hold at peak gives the viewer time to read the hardware and stitching detail.
Audio: Room tone is minimal; the settle back down produces a faint surface contact sound that reinforces the physicality.
Example 4: Beauty product rotation with condensation
Product: Matte-finish serum bottle, upright, silver dropper cap, white label.
Source image setup: Bottle fills 55% of frame, label text sharp, cap detail visible, clean white background. Shot at 1920 × 1920.
Prompt:
"Slow counter-clockwise rotation, full 360 degrees over 10 seconds. Fine moisture condensation forms on the glass surface as the rotation begins and disperses by the halfway point. Soft cool daylight from above at 6000K, rim light from behind. Steady, unhurried pace. White studio background, no drift. Skincare campaign aesthetic, shot on Phase One IQ4."
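What the motion adds: The condensation effect communicates efficacy and freshness, two ideas that are conceptually expensive to convey in a still. The full rotation shows the bottle and the dropper mechanism from every angle. Because the source still shows only the front, the model has to infer the back of the bottle, so check any back-label copy in the output for accuracy.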
Common failure modes and fixes
| Problem | Likely cause | Fix |
|---|---|---|
| Label text blurs or warps during motion | Source image is compressed or label is small in frame | Start from a higher-resolution source; crop tighter so the label fills more of the frame |
| Subject drifts from its starting position | Background too visually similar to the product | Reshoot on a higher-contrast background, or describe the background colour explicitly in the prompt |
| Camera move is too fast | Pace left unspecified | Add an explicit pace descriptor: "unhurried", "0.3× feel", or a second-count |
| Background generates unwanted motion | Background description omitted | Add "stationary background, no background motion" explicitly |
| Colour shifts mid-clip | Source image has inconsistent white balance | Correct the source image white balance before upload |
| Native audio sounds mismatched | Mood reference is vague | Add a more specific register ("silent studio", "minimal room tone") if you don't want a generated soundscape |
When to choose Grok Imagine 1.5 vs other models
Grok Imagine 1.5 is the right tool when you have a clean source still and want consistent subject anchoring at a credit-efficient rate. It is not the right tool for every video brief.
| Need | Better fit |
|---|---|
| Character consistency across multi-shot scenes | Seedance 2.0 |
| Frame-level camera parameterisation | V6 |
| Broadcast 4K output | Veo 3 |
| Heavy motion energy, lifestyle UGC feel | PixVerse models |
| Longest clip runtime (up to 60s) | Sora 2 |
For the general model-selection framework across the full i2v landscape, the product photos to video ads guide covers picks by goal and budget.
Getting started on OmniArt
Open the OmniArt video workspace, select Grok Imagine as the model, and upload a product still that passes the source-image checklist above. Write a five-part prompt — action, lighting, pace, background, mood — and generate a 5-second draft at 480p. If the motion and subject anchoring hold, move to 720p for the final.
The whole loop — draft, refine, master — runs inside one workspace with the same credit balance you use across every other OmniArt model. No separate xAI account, no file export to a different tool, no starting over from text when you already have the product shot you want.