
Gemini Omni Flash any-to-any input: what it really does

Omni-modal is Gemini Omni Flash's signature pitch, but the shipping API is narrower than the marketing. Here's what any-to-any input actually changes about the brief.

OmniArt Team, Jul 1, 2026

The word doing the heaviest lifting in Gemini Omni Flash's launch was "Omni" — the promise of a single model you can feed text, images, audio, and video all at once, in one prompt. It's a genuinely different pitch from the single-input video models that came before it, and it's the reason the model earns its name. But the version that shipped in the developer API is narrower than the keynote framing, and the gap matters if you're planning real work around it.

This piece separates what any-to-any actually buys you today from what's still aspirational — and then gets to the more useful point, which is how multimodal input changes the way you write a brief at all.

What "any-to-any" actually means

Most video models accept one kind of steering. You write text, or you supply a single reference image, and the model works from that. Any-to-any input means one prompt grammar accepts several modalities together and returns a coherent result that respects all of them: a reference frame for the look, a short clip for the motion, and a written direction for everything else — combined, not chosen between.

The shift is from describing a shot in words to composing it from assets. That's the real capability, and it's why "omni-modal" isn't pure marketing. The question is how much of it is live.

The pitch versus the shipping API

Here's the honest matrix for the current preview, straight from the API's own documentation:

Input	Status	Notes
Text prompt	Supported	The backbone of every generation
Image reference	Supported	Text-to-video, image-to-video, and subject reference
Video reference	Supported, with a caveat	References over 3 seconds aren't fully processed
Audio reference	Not supported	You cannot upload a sound or voice for the model to match
Multiple video references	Not supported	One reference clip per generation
Non-English prompts	Untested	English is the only fully supported language

Warning

The audio gap is the one most likely to trip up a plan. Omni Flash generates an audio track by default, but "any-to-any" does not include handing it a music bed, a voiceover, or an ambient recording to sync against. Audio is an output you steer with words, not an input you supply.

So the accurate read: any-to-any today means text, image, and video in, and video (with generated audio) out. The audio-in half of the omni-modal promise is deliberately withheld, consistent with the in-video speech editing and avatar features Google held back at launch for safety reasons. It's a real capability change over single-input models; it's just not yet the full any-to-any picture the name implies.
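To make the matrix concrete, here is a minimal pre-flight check written as a sketch. The class and function names are hypothetical and are not part of any published SDK; only the limits (no audio references, one video reference, 3 seconds fully processed) come from the table above, and treating the text prompt as required is an assumption based on it being described as the backbone of every generation.

```python
from dataclasses import dataclass, field

# Limits taken from the support matrix above. All names are hypothetical.
MAX_VIDEO_REFERENCES = 1
MAX_FULLY_PROCESSED_SECONDS = 3.0


@dataclass
class InputSet:
    prompt: str
    image_refs: list[str] = field(default_factory=list)
    video_ref_seconds: list[float] = field(default_factory=list)  # duration of each clip
    audio_refs: list[str] = field(default_factory=list)


def check_inputs(inputs: InputSet) -> list[str]:
    """Return a list of problems; an empty list means the set fits the preview."""
    problems = []
    if not inputs.prompt.strip():
        # Assumption: a text prompt is always expected.
        problems.append("error: a text prompt is required")
    if inputs.audio_refs:
        problems.append("error: audio references are not supported")
    if len(inputs.video_ref_seconds) > MAX_VIDEO_REFERENCES:
        problems.append("error: only one video reference per generation")
    for seconds in inputs.video_ref_seconds:
        if seconds > MAX_FULLY_PROCESSED_SECONDS:
            problems.append(
                f"warning: {seconds:g}s video reference exceeds 3s "
                "and may not be fully processed"
            )
    return problems
```

Running a planned asset set through a check like this before generation surfaces the audio gap early, rather than at upload time.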

What multimodal input changes about the brief

Once you're composing from assets instead of describing in prose, the brief itself changes shape. Three inputs do different jobs, and the skill is assigning each one to what it's best at:

The image reference carries the look — the subject, the palette, the framing you already like.
The video reference carries the motion — a camera move or an action you want echoed.
The text carries intent and everything the assets don't already show — mood, changes, the thing that isn't in either reference.

The practical effect is that you stop trying to translate a picture into adjectives. Instead of writing "a warm, shallow-depth close-up with a slow push-in," you supply the frame that already looks like that and the clip that already moves like that, and spend your words on what's new. For anyone who has fought to describe a specific aesthetic in text, that's the workflow unlock.
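The division of labor described above can be sketched as a small data structure. This is an illustration only: the `Brief` class, its field names, the request keys, and the file names are all hypothetical, chosen to mirror the three roles rather than any real request schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Brief:
    intent: str                        # text: mood, changes, anything the assets don't show
    look_image: Optional[str] = None   # image reference: subject, palette, framing
    motion_clip: Optional[str] = None  # video reference: camera move or action to echo

    def to_request(self) -> dict:
        """Return a request-shaped dict. Every key name here is hypothetical."""
        request = {"prompt": self.intent}
        if self.look_image:
            request["image_reference"] = self.look_image
        if self.motion_clip:
            request["video_reference"] = self.motion_clip
        return request


# The look and the motion live in the assets, so the text only says what is new.
brief = Brief(
    intent="Dusk light; she turns toward the window as rain begins.",
    look_image="closeup_warm.png",
    motion_clip="slow_push_in.mp4",
)
```

Note that the prompt never mentions "warm," "shallow depth," or "push-in": those are carried by the two reference files.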

The four task modes, and how they combine

The API exposes four task types, and they map cleanly onto the compose-from-assets idea:

text_to_video — pure description, no assets. The fallback when you're starting from nothing.
image_to_video — animate a still. The most common entry point: a strong image becomes the first frame of motion.
reference_to_video — carry a subject or style from a reference into a new generation.
edit — the conversational, stateful mode that revises the prior clip while preserving what you didn't change.

The intended flow chains them: generate or animate a base with one of the first three, then move into edit and refine conversationally. That's the same shape as Google's own pairing of Nano Banana 2 Lite with Omni Flash (edit a still, then animate it), extended across turns.
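As a rough sketch of that chaining, the function below plans a session as one base generation followed by any number of edit turns. Only the four task type strings come from the list above; `plan_session` and the step dictionary shape are hypothetical, and no API call is made.

```python
# Task type strings as listed above; everything else is a hypothetical helper.
BASE_TASKS = ("text_to_video", "image_to_video", "reference_to_video")


def plan_session(base_task: str, base_prompt: str, revisions: list[str]) -> list[dict]:
    """Plan one base generation followed by zero or more edit turns."""
    if base_task not in BASE_TASKS:
        raise ValueError(f"base task must be one of {BASE_TASKS}, got {base_task!r}")
    steps = [{"task": base_task, "prompt": base_prompt}]
    # Each revision becomes its own edit turn against the prior clip.
    steps += [{"task": "edit", "prompt": note} for note in revisions]
    return steps
```

For example, `plan_session("image_to_video", "Animate the still with a slow push-in.", ["Make the rain heavier."])` yields an `image_to_video` step followed by a single `edit` step.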

The audio nuance, spelled out

Because audio can't be supplied, sound design becomes a writing task. The model produces dialogue, effects, and ambience based on what your prompt describes — "gentle rain on a window, no music" or "a single soft click, then room tone." You get meaningful control, but it's descriptive control, and it means two things for planning:

If your project needs the generated video to match an existing track — a licensed song, a brand sting, a recorded VO — that sync happens in a separate audio step, not inside Omni Flash.
If you just need fitting, original sound, describing it well in the prompt gets you there without an upload.
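Since sound is steered with words, one way to keep it from being forgotten is to append it to the prompt deliberately. The helper below is a sketch under assumptions: the function name and the "Audio:" phrasing are hypothetical conventions, not a documented prompt format, and the model may respond to other wordings equally well.

```python
def describe_sound(visual_prompt: str, sounds: list[str], allow_music: bool = False) -> str:
    """Append a plain-language audio description to a visual prompt."""
    parts = list(sounds)
    if not allow_music:
        parts.append("no music")
    if not parts:
        return visual_prompt
    # Strip trailing periods and spaces so the two sentences join cleanly.
    return f"{visual_prompt.rstrip('. ')}. Audio: {', '.join(parts)}."
```

Called as `describe_sound("Close-up of a window at dusk.", ["gentle rain on a window"])`, it returns `"Close-up of a window at dusk. Audio: gentle rain on a window, no music."`.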

Where OmniArt lands today

You don't have to wait for Omni Flash to try the compose-from-assets workflow: it already runs on the models live in OmniArt's video workspace, and in one respect they go further.

Seedance 2.0, available on OmniArt now, was built around exactly this idea: it accepts up to nine images, three video clips, and — notably — three audio files in a single prompt, each bound to a role with @image1 / @video1 / @audio1 syntax. That includes the audio-reference input Omni Flash withholds. If your brief depends on feeding the model a specific sound to work with, that path exists today.
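The role-binding idea can be sketched as follows. The per-type limits (nine images, three videos, three audio files) and the `@image1` / `@video1` / `@audio1` tag style come from the description above; the helper functions, the 1-based numbering in upload order, and the file names are assumptions for illustration, not Seedance's actual interface.

```python
import re

# Per-prompt limits as described above for Seedance 2.0.
LIMITS = {"image": 9, "video": 3, "audio": 3}


def bind_references(images=(), videos=(), audios=()) -> dict:
    """Map @image1-style tags to files, numbering each type from 1 (assumed)."""
    tags = {}
    for kind, files in (("image", images), ("video", videos), ("audio", audios)):
        if len(files) > LIMITS[kind]:
            raise ValueError(f"at most {LIMITS[kind]} {kind} references, got {len(files)}")
        for index, path in enumerate(files, start=1):
            tags[f"@{kind}{index}"] = path
    return tags


def unbound_tags(prompt: str, tags: dict) -> list:
    """Return tags used in the prompt that have no file bound to them."""
    used = re.findall(r"@(?:image|video|audio)\d+", prompt)
    return sorted(set(used) - set(tags))
```

Checking a prompt with `unbound_tags` before submitting catches a brief that names `@audio2` when only one audio file was attached.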

And the direction of travel is clear across the field: Seedance 2.5, announced in June, pushes the same reference architecture to as many as 50 multimodal inputs at once. Any-to-any input isn't a single-model story — it's where directed AI video is heading. Omni Flash named the idea; the workspace already lets you practice it.

Open the video workspace on OmniArt, assemble your reference set, and let the assets carry the look and motion while your words carry the intent. That's the any-to-any brief, available now.
