
8 Grok Imagine prompts that actually work

Eight copy-ready Grok Imagine 1.5 prompts across image and video — built on the FLUX.1 natural-language style with the Subject + Action + Camera + Style + Audio structure. What each prompt does and why it lands, inside OmniArt.

OmniArt Team

Grok Imagine 1.5 upgraded the image base to FLUX.1 from Black Forest Labs, and that change has a concrete implication for how you write prompts: the model responds to natural-language description the way a photographer reads a brief, not the way older models parsed keyword lists. The eight prompts below are copy-ready — drop them into OmniArt's Grok Imagine workspace, adjust the specifics, and generate. Each card includes the exact prompt text, what it produces, and one craft note on why the structure lands.

For general prompt theory across all OmniArt models, see how to write better prompts. For the deeper treatment of Grok Imagine's six generation modes and cost math, see the Grok Imagine creator's guide. This article is specifically about Grok Imagine 1.5 — the FLUX.1 release — and the prompt craft it rewards.

What Grok Imagine 1.5 changed about prompting

The FLUX.1 base model is trained differently from earlier text-to-image architectures. It parses connected prose well and tends to under-respond to pure keyword stacks. Five habits move quality up most reliably:

  • Natural language over keyword stacks. Full sentences outperform comma-separated adjectives. "A street at blue hour, lit by the hum of a convenience store sign" beats "street, night, neon, cinematic, 4K."
  • Specific references over vague adjectives. "Shot on a Fujifilm X-T4, 23mm f/2" tells the model more than "high quality photo." Named equipment and film stocks work as strong style anchors because each one points to a recognizable, consistent look.
  • Exact color words over "colorful." "Electric blue and hot pink" produces a deliberate palette. "Colorful" produces averaged noise.
  • Exact time over "golden hour." "Late October, 5:45 pm, sun 6° above the horizon" tells the model the precise angle and warmth of the light. "Golden hour" is ambiguous across seasons and latitudes.
  • Video structure: Subject + Action + Camera + Style + Audio. Front-load the core subject and action in the first 20–30 words. A single style focus beats a blend. Iterate progressively — change one variable per generation until the result locks, then push further.
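
The five-part video structure can be sketched as a small prompt builder. This is only an illustration: `VideoPrompt`, `render`, and `core_word_count` are hypothetical names, not part of OmniArt or Grok Imagine, and the example values are adapted from the record-shop close-up in this article. It places subject and action first and reports how many words they use, so you can check them against the 20–30 word guideline.

```python
from dataclasses import dataclass


@dataclass
class VideoPrompt:
    """Hypothetical container for Subject + Action + Camera + Style + Audio."""
    subject: str
    action: str
    camera: str
    style: str
    audio: str

    def render(self) -> str:
        # Subject and action are joined and placed first so they open the prompt.
        parts = [f"{self.subject} {self.action}", self.camera, self.style, self.audio]
        return " ".join(p.strip().rstrip(".") + "." for p in parts if p.strip())

    def core_word_count(self) -> int:
        # Words spent on subject + action, the part to front-load.
        return len(f"{self.subject} {self.action}".split())


prompt = VideoPrompt(
    subject="A young woman with short silver hair and a worn leather jacket",
    action="looks directly into camera inside a neon-lit record shop at 3 am",
    camera="Medium close-up, camera holds completely still",
    style="Quiet, a little melancholic, lit by one pink and one cyan neon source",
    audio="Ambient audio: low vinyl static",
)

text = prompt.render()
print(text)
print(prompt.core_word_count(), "words of subject + action")
```

Keeping each part in its own field also makes the "change one variable per generation" habit easier to follow: edit a single field, render again, and compare.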

For a full breakdown of the cinematic vocabulary that transfers to video, the cinematic AI video prompt guide covers lens choice, motivated camera moves, and lighting language in depth.


The 8 prompts

1. Cinematic product shot (image)

35mm product photography, shot on Fujifilm X-T4. A matte black mechanical wristwatch resting on a slab of raw concrete, 
late October afternoon light coming in low from camera left at roughly 20°, casting a long shadow across the concrete 
face. Shallow depth of field, background falling completely soft. Color palette: warm amber highlights, cool blue-grey 
shadow fill. No props, no reflections except the concrete surface itself.

What it produces: a clean, art-directed still that reads as professional product photography rather than AI output.

Why it lands: the Fujifilm X-T4 reference grounds the color science and sensor rendering in a specific real-world look. The angle of light is specified numerically, which keeps the model from defaulting to diffuse overhead lighting. Limiting the palette to two colors (warm amber highlights, cool blue-grey shadow) keeps the model from introducing a third competing hue.


2. Character close-up with audio (video)

Medium close-up of a young woman with short silver hair and a worn leather jacket, inside a neon-lit record shop at 
3 am. She looks directly into camera and says: "Every city has one song. I'm still looking for mine." Natural lip 
sync. Camera holds completely still. Light source: one pink neon tube overhead, one cyan neon sign spilling from 
camera right. Atmosphere: quiet, a little melancholic, not cinematic drama. Ambient audio: low vinyl static underneath 
the dialogue. 8 seconds.

What it produces: a character moment with native Grok Imagine 1.5 audio — the model generates dialogue, lip sync, and ambient sound in a single inference pass.

Why it lands: the dialogue line is short enough to lip-sync cleanly within 8 seconds. Two separate, named neon light sources (pink overhead, cyan from right) give the model a clear light map and prevent generic "neon city" averaging. "Not cinematic drama" is a negative constraint that guides mood more precisely than a positive adjective would.

Tip

Keep spoken dialogue to one or two short sentences in clips under 10 seconds. Longer lines crowd the available duration, and the model may rush the delivery or cut the audio early.
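
A rough pre-flight check for dialogue length might look like the sketch below. The function name `dialogue_fits` is hypothetical, and the default speaking rate of 2.5 words per second is an assumption (roughly conversational pace), not a figure published for Grok Imagine. Adjust it to match the delivery you observe.

```python
import re


def dialogue_fits(line: str, clip_seconds: float,
                  words_per_second: float = 2.5,  # ASSUMPTION: conversational pace
                  max_sentences: int = 2) -> bool:
    """Return True if the line is short enough to speak within the clip."""
    # Split on sentence-ending punctuation and ignore empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", line) if s.strip()]
    words = len(line.split())
    spoken_seconds = words / words_per_second
    return len(sentences) <= max_sentences and spoken_seconds <= clip_seconds


line = "Every city has one song. I'm still looking for mine."
print(dialogue_fits(line, clip_seconds=8))
```

The line from prompt 2 is ten words in two sentences, so under the assumed rate it takes about four seconds and leaves room for the ambient audio around it.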


3. Atmospheric environment — ambience clip (video)

Wide establishing shot of a fog-filled pine forest in southern Norway, early November, 7 am. No people, no animals. 
Soft diffused dawn light filtering through the canopy, pale grey-white, casting almost no shadow. Slow imperceptible 
push forward, as if the camera is drifting on breath. Audio: deep forest ambience — distant water, occasional bird, 
near-silence underneath. No music. 12 seconds.

What it produces: a mood-setting ambient clip ideal as background footage, transition material, or opening scene.

Why it lands: "early November, 7 am" is more specific than "foggy morning." The push is described as "imperceptible" and "drifting on breath," which communicates pace more precisely than "slow push in." Asking for no music keeps the audio from defaulting to underscore, so the model generates field-recording-style ambience instead.


4. Fast-paced social vertical — product reveal (video)

9:16 vertical. A pair of electric blue running shoes drops into frame from the top, landing on a wet reflective black 
studio floor. High-speed impact, tiny water spray, shoes bounce once and settle. Immediate cut to product floating 
at centre frame, slow rotation 360°. Fast rhythm: first motion 0–2s, rotation 2–8s. Hard direct light from above, 
electric blue accent light from below floor (subtle). No dialogue. Audio: sharp impact sound on drop, then a clean 
single synthesizer tone during rotation. 8 seconds.

What it produces: a punchy 9:16 social clip built for TikTok, Reels, or Shorts — fast-cut product reveal with native audio.

Why it lands: specifying 9:16 up front sets the aspect ratio before anything else in the prompt. The timeline is written out explicitly ("0–2s / 2–8s"), which helps the model pace the two beats correctly rather than blending them into one motion. Naming the specific audio events (impact sound, synthesizer tone) produces more intentional sound design than "add sound effects."
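
Before writing a timeline into a prompt, it can help to confirm that the beats are contiguous and add up to the clip length. The sketch below is one way to do that. `check_beats` and `timeline_phrase` are hypothetical helper names, and the 15-second ceiling is the clip maximum this article cites for Grok Imagine 1.5.

```python
MAX_CLIP_SECONDS = 15  # clip ceiling cited in this article


def check_beats(beats, duration):
    """Return a list of problems for (label, start, end) beats; empty means OK."""
    problems = []
    if duration > MAX_CLIP_SECONDS:
        problems.append(f"duration {duration}s exceeds {MAX_CLIP_SECONDS}s")
    cursor = 0
    for label, start, end in beats:
        if start != cursor:
            problems.append(f"'{label}' starts at {start}s, expected {cursor}s")
        if end <= start:
            problems.append(f"'{label}' has no length")
        cursor = end
    if cursor != duration:
        problems.append(f"beats end at {cursor}s, clip is {duration}s")
    return problems


def timeline_phrase(beats):
    """Format beats in the style prompt 4 uses, with an en dash between times."""
    return ", ".join(f"{label} {start}\u2013{end}s" for label, start, end in beats)


beats = [("first motion", 0, 2), ("rotation", 2, 8)]
print(check_beats(beats, duration=8))
print(timeline_phrase(beats))
```

An empty problem list means the beats tile the clip with no gaps or overlaps, which matches how prompt 4 splits its 8 seconds into a drop and a rotation.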

Warning

Grok Imagine 1.5 clips run up to 15 seconds. For social content keep clips at 8–10 seconds maximum — the model's motion is cleanest in that range, and social platform attention windows are short. At 720p, an 8-second clip costs 120 credits on OmniArt.


5. Stylized illustration (image)

Risograph print illustration of a small coastal Japanese fishing village at dusk, mid-December. Two ink colors only: 
deep indigo and warm persimmon orange. Flat graphic shapes, no gradients. Fishing boats pulled up on shore, a single 
wooden dock, lantern light in two window rectangles. Composition: low horizon line, large sky area, boats and dock in 
lower third. The print has slight ink misregistration — indigo shifted 2px left from the orange layer. Texture: 
visible paper grain throughout.

What it produces: a graphic, limited-color illustration that reads as a real print process rather than generic digital art.

Why it lands: naming the print technique (Risograph) and its specific constraints (two ink colors, flat shapes, no gradients, ink misregistration) gives the model a complete technical brief. "Ink misregistration" is the kind of physical-process detail that anchors the output in a real-world aesthetic — it's the FLUX.1-equivalent of naming a film stock. Without it, the model tends to add gradients or blend colors.


6. Dynamic camera move — drone pull-back (video)

Aerial drone footage. Extreme close-up on the face of a compass resting on a weathered wooden ship's deck, late 
afternoon November light, warm golden horizontal rays from camera left. Slow pull-back revealing the full deck, 
then the ship's hull, then open grey Atlantic ocean horizon. Pull-back runs the full 15 seconds — begin on compass, 
end with ocean filling 80% of the frame. Camera elevation stays constant, no tilt. Real drone color science: flat 
LOG-style color, slight lens vignette. Audio: wind increasing in volume as ocean fills frame.

What it produces: a sustained 15-second reveal shot — the model's maximum clip length — built around a single motivated camera move.

Why it lands: this prompt uses the full 15-second duration for one continuous motion, which is the most reliable way to get a clean result at that length. The pull-back is constrained to constant elevation (no tilt), which keeps the model from improvising a second camera axis and producing choppy motion. "LOG-style color, slight lens vignette" encodes a real-camera look without requiring specific equipment names.


7. Stylized fashion — film stock portrait (image)

Expired Kodak Portra 400 film scan. Portrait of a woman in her mid-thirties, strong afternoon window light from 
camera right, half of her face in deep shadow. She is wearing a deep forest green linen blazer, no visible jewellery. 
Expression is neutral, looking slightly off-camera left. Grain heavy and warm, slight halation around the window 
highlight, greens shifted slightly toward yellow-olive. Tight crop: from collarbone to just above top of head. 
Aspect ratio 4:5.

What it produces: a film-photography portrait with accurate vintage color rendering — authentic grain, halation, and expired-stock color shifts.

Why it lands: "expired Kodak Portra 400" is a strong single-phrase style reference that carries a complete set of tonal expectations. Specifying the color shift ("greens shifted slightly toward yellow-olive") steers the output away from generic vintage grain and toward the particular color drift associated with expired film. A tight crop and a specific aspect ratio (4:5) produce a portrait that reads as a real print.


8. Immersive environment — rainfall (video)

Ground-level POV inside a glass bus shelter, heavy urban rain, Tokyo residential street, late June 22:00. Camera 
holds completely still. Rain streaks down the glass panels in foreground, streetlights smear into vertical bokeh 
streaks behind the wet glass. A cyclist passes in the distance — silhouette only, visible for about 2 seconds in 
mid-clip. No camera movement. Audio: heavy rain on glass, distant car tyre hiss, one distant motorbike engine 
fading right-to-left. No music. 10 seconds.

What it produces: an immersive, single-POV environmental clip — strong as an establishing shot or as a standalone mood piece.

Why it lands: "late June 22:00" specifies the exact season, temperature feel (humid summer rain), and darkness level. The cyclist passing is planted as a specific event at a specific moment ("about 2 seconds in mid-clip"), which gives the model a narrative anchor without asking for complex character action. The audio is given in three separate layers (rain on glass, tyre hiss, motorbike), which tends to produce more considered sound design than a single "ambient city rain" instruction.


Running these on OmniArt

All eight prompts run on Grok Imagine 1.5 inside OmniArt's creation workspace — no separate xAI subscription required. The image prompts (1, 5, 7) go into the image workspace; the video prompts (2, 3, 4, 6, 8) go into the video workspace under Grok Imagine.

A few practical notes for OmniArt runs:

  • Start at 480p for iteration. At 480p, video costs 10 credits per second. Once the structure is right, bump to 720p (15 credits per second) for the final take.
  • Use Extend Mode to lengthen. The atmosphere clip (prompt 3) and the drone pull-back (prompt 6) can be extended up to 15 additional seconds using Grok Imagine's Extend Mode — the same model, billed only for the appended portion.
  • Use Modify Mode for targeted corrections. If the lighting in a result is almost right but one element is off, Modify Mode lets you describe the change in text without regenerating the full clip. Keep source clips at 480p before passing to Modify — the mode caps input at 854×480.
  • Character consistency across shots: if you're generating multiple shots of the same character (prompt 2 style), use Reference Mode with a headshot as @Image1 and restate the character description in each new prompt. Grok Imagine 1.5's Reference Mode is the most direct path to consistency without relying on a fine-tuned model.
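
The credit math behind the "iterate at 480p, finish at 720p" advice can be worked through in a few lines. The per-second rates and the 15-second limits come from this article. The function names are hypothetical, and billing Extend Mode at the same per-second rate as a normal clip is an assumption, since the article only says the appended portion is what gets billed.

```python
CREDITS_PER_SECOND = {"480p": 10, "720p": 15}  # rates quoted in this article
MAX_CLIP_SECONDS = 15
MAX_EXTEND_SECONDS = 15


def clip_cost(seconds, resolution):
    """Credits for one generated clip."""
    if not 0 < seconds <= MAX_CLIP_SECONDS:
        raise ValueError(f"clip length must be 1-{MAX_CLIP_SECONDS} seconds")
    return seconds * CREDITS_PER_SECOND[resolution]


def extend_cost(extra_seconds, resolution):
    """Credits for the appended portion only.

    ASSUMPTION: Extend Mode uses the same per-second rate as a normal clip.
    """
    if not 0 < extra_seconds <= MAX_EXTEND_SECONDS:
        raise ValueError(f"extension must be 1-{MAX_EXTEND_SECONDS} seconds")
    return extra_seconds * CREDITS_PER_SECOND[resolution]


def session_cost(seconds, drafts):
    """Several 480p drafts followed by one 720p final take."""
    return drafts * clip_cost(seconds, "480p") + clip_cost(seconds, "720p")


print(clip_cost(8, "720p"))       # 8 s x 15 credits/s
print(session_cost(8, drafts=3))  # three 480p drafts plus the 720p final
```

For an 8-second clip, three 480p drafts plus one 720p final come to 360 credits, while four takes at 720p would cost 480 under the same rates.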

For a full breakdown of all six Grok Imagine generation modes, cost scenarios, and when to switch to a different model, see the complete Grok Imagine guide. For the broader cinematography vocabulary that transfers to any video prompt, the cinematic AI video prompt guide is worth bookmarking alongside this one.
