HappyHorse 1.0: prompt guide and six AI video use cases

A guide to HappyHorse 1.0: a unified text-image-video-audio Transformer with native audio, 8-step inference, and lip-sync in six languages. Six use cases.

OmniArt Team, May 4, 2026

HappyHorse 1.0 is a Transformer with roughly 15B parameters that denoises text, image, video, and audio tokens in a single sequence. In practice, it produces a 1080p clip with native joint audio in about 38 seconds on an H100, 3–6× faster than its peers while holding perceived quality. One set of weights handles lip-sync in six languages. This article covers prompt patterns that exploit the architecture and six use cases the model actually serves.

What HappyHorse 1.0 is

The model has 40 layers in a sandwich layout: 4 entry/exit layers per modality around 32 shared layers in the middle. Per-head sigmoid gating keeps multimodal processing stable. There is no separate audio module: audio tokens sit in the same sequence as video tokens.
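
One way to read the sandwich layout is 4 modality-specific entry layers, then 32 shared layers, then 4 modality-specific exit layers, which adds up to the stated 40. The sketch below only enumerates that layout under this assumed reading; the constant and function names are illustrative, and it is not the model's actual code.

```python
from collections import Counter

# Assumed reading of the layout: 4 entry + 32 shared + 4 exit = 40 layers.
ENTRY_LAYERS = 4
SHARED_LAYERS = 32
EXIT_LAYERS = 4


def build_layer_plan():
    """Return one (index, role) pair per layer, in forward order."""
    roles = (
        ["entry (per modality)"] * ENTRY_LAYERS
        + ["shared"] * SHARED_LAYERS
        + ["exit (per modality)"] * EXIT_LAYERS
    )
    return list(enumerate(roles))


plan = build_layer_plan()
role_counts = Counter(role for _, role in plan)
print(len(plan), dict(role_counts))
```

Under this reading, only the outer 8 layers are specialized per modality, and the 32 shared layers process the mixed token sequence.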

Spec	Value
Parameters	~15B
Resolution	up to 1080p
Duration	3–15 s (default 5 s)
Aspect ratio	16:9, 9:16, 1:1, 4:3, 3:4
Inference	~38 s for 1080p on an H100
Steps	8 (DMD-2, no CFG)
Native audio	Dialogue, Foley, ambient
Lip-sync	EN, Mandarin, JA, KO, DE, FR
Input	Text, image
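
The limits in the table can be checked before a request is sent. The following is a minimal sketch, assuming a hypothetical `ClipRequest` shape; the field names are invented for illustration and do not correspond to any documented OmniArt or Dashscope payload.

```python
from dataclasses import dataclass

# Limits copied from the spec table above.
SUPPORTED_RATIOS = ("16:9", "9:16", "1:1", "4:3", "3:4")
MIN_SECONDS, MAX_SECONDS, DEFAULT_SECONDS = 3, 15, 5
MAX_HEIGHT = 1080


@dataclass(frozen=True)
class ClipRequest:
    """Hypothetical request shape; not an official API payload."""
    prompt: str
    aspect_ratio: str = "16:9"
    seconds: int = DEFAULT_SECONDS
    height: int = MAX_HEIGHT


def validate(request):
    """Return a list of problems; an empty list means the request fits the spec."""
    problems = []
    if not request.prompt.strip():
        problems.append("prompt is empty")
    if request.aspect_ratio not in SUPPORTED_RATIOS:
        problems.append(f"unsupported aspect ratio: {request.aspect_ratio}")
    if not MIN_SECONDS <= request.seconds <= MAX_SECONDS:
        problems.append(
            f"duration must be {MIN_SECONDS}-{MAX_SECONDS} s, got {request.seconds}"
        )
    if request.height > MAX_HEIGHT:
        problems.append(f"height above {MAX_HEIGHT}p: {request.height}")
    return problems


print(validate(ClipRequest("a horse at dawn", aspect_ratio="21:9", seconds=20)))
```

Returning a list rather than raising on the first problem lets a caller show every issue with a brief at once.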

Why the unified architecture matters

Most video models render the video, synthesize an audio track, and then try to sync the two. HappyHorse generates both in a single denoising pass, so dialogue matches mouth movement, Foley matches contact, and ambient sound stays consistent across the clip.

The 8-step DMD-2 variant runs without CFG. It trades a small amount of headroom for a 3–6× speedup: the difference between 3 drafts per hour and 12 drafts per hour.
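
The arithmetic behind these figures can be made explicit. The sketch below takes the ~38 s render time and the 3–6× range from the article and derives the peer render time they imply; it also checks that 12 versus 3 drafts per hour is a 4× ratio, inside that range. Reading the drafts-per-hour figures as whole iteration cycles (prompt edits and review included) rather than raw renders is an assumption, since 3600 / 38 alone would allow roughly 94 renders per hour.

```python
OWN_RENDER_SECONDS = 38   # ~38 s for 1080p on an H100, from the spec table
SPEEDUP_RANGE = (3, 6)    # the article's 3-6x claim


def implied_peer_seconds(own_seconds, speedup_range):
    """Peer render time implied by a speedup range, as (low, high) seconds."""
    low, high = speedup_range
    return own_seconds * low, own_seconds * high


def throughput_ratio(fast_per_hour, slow_per_hour):
    """How many times more drafts per hour the faster loop produces."""
    return fast_per_hour / slow_per_hour


peer_low, peer_high = implied_peer_seconds(OWN_RENDER_SECONDS, SPEEDUP_RANGE)
ratio = throughput_ratio(12, 3)
print(peer_low, peer_high, ratio)
```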

Prompt framework

Audio-first

Without audio direction	With audio direction
A Bangkok street vendor fries noodles.	…, with oil sizzling, a spatula scraping metal, plates, a distant motorbike, and customers speaking Thai at mid-distance.

Specific camera language

Slow push-in, tracking shot, low-angle, macro close-up, 360 orbit, aerial/drone, whip pan.

Three audio layers

Foreground (main dialogue/SFX), mid-ground (footsteps, light impacts), background (crowd, rain, wind).

Style anchors

Use 2–3 style tokens, for example anamorphic photorealism, anime cel-shading, retro VHS, or commercial studio.
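
The four framework elements can be assembled mechanically. The helper below is a minimal sketch, not an official tool: `build_prompt` and its parameter names are hypothetical, and the ordering (subject, camera, three audio layers, style tokens, aspect ratio) is one reasonable arrangement rather than a required one.

```python
def build_prompt(subject_action, camera, foreground, midground, background,
                 style_tokens, aspect_ratio):
    """Join the framework elements into one comma-separated prompt string."""
    if not 2 <= len(style_tokens) <= 3:
        raise ValueError("use 2-3 style tokens")
    parts = [subject_action, camera, foreground, midground, background,
             *style_tokens, aspect_ratio]
    # Drop empty pieces so an omitted audio layer does not leave a stray comma.
    cleaned = [part.strip() for part in parts if part and part.strip()]
    return ", ".join(cleaned) + "."


prompt = build_prompt(
    subject_action="Street vendor in Bangkok stir-frying noodles on a flat-top griddle",
    camera="macro close-up",
    foreground="oil sizzling loudly",
    midground="spatula scraping metal",
    background="distant motorbike and light rain",
    style_tokens=["photorealism", "warm tungsten lighting"],
    aspect_ratio="9:16",
)
print(prompt)
```

Keeping each audio layer as its own argument makes it obvious when a brief has no mid-ground or background sound at all.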

Seven tips

Put the subject and action in the first 15 words.
Describe the audio; put dialogue in quotation marks.
Prefer specific camera terms over generic verbs.
Reference a film, palette, or tradition.
Add physical detail: rain on glass, silk in the wind.
Keep it under ~100 words.
Test at a lower resolution before 1080p.
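
Some of these tips can be checked automatically before a render is spent. The linter below is a rough sketch covering three of them (word count, a specific camera term, quoted dialogue); the term list is taken from the camera vocabulary earlier in this guide, while the function name and the simple "says" heuristic are assumptions made for illustration.

```python
import re

MAX_WORDS = 100
CAMERA_TERMS = ("push-in", "tracking", "low-angle", "close-up",
                "orbit", "aerial", "drone", "whip pan")
QUOTE_CHARS = "\"'“”‘’"


def lint_prompt(prompt):
    """Return a list of warnings; an empty list means no issue was detected."""
    warnings = []
    if len(prompt.split()) > MAX_WORDS:
        warnings.append(f"longer than {MAX_WORDS} words")
    lowered = prompt.lower()
    if not any(term in lowered for term in CAMERA_TERMS):
        warnings.append("no specific camera term")
    # Heuristic: a speech verb with no quote character suggests unquoted dialogue.
    speaks = re.search(r"\b(says?|said)\b", lowered)
    if speaks and not any(ch in prompt for ch in QUOTE_CHARS):
        warnings.append("dialogue is not in quotation marks")
    return warnings


print(lint_prompt("A man says hello in a park"))
```

The checks are deliberately conservative: they flag likely omissions and leave judgment calls, such as whether the subject leads the prompt, to the writer.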

Six tested use cases

1. Social ASMR

"Thai street food vendor flipping pad see ew on a flat-top griddle, close-up of wok with garlic and chilis, oil sizzles loud, spatula scrapes metal, neon signage above, warm tungsten lighting, handheld camera with subtle shake, light rain on plastic awning in the background, customer chatter in Thai mid-distance. 9:16."

2. Cinematic marketing

"Luxury chronograph watch on a polished volcanic stone, slow-motion water droplets bead and roll across the dial, slow 360-degree orbit camera, soft mechanical click as the crown is pressed, deep ambient hum, studio lighting on a black background, anamorphic flare from upper left, 16:9."

3. Multilingual campaign

"A barista in a specialty coffee shop slides a flat white across a wooden counter and says, in casual Mandarin, '今天的豆子很特别，慢慢喝。' Espresso machine hisses, cup slides on wood, indie film aesthetic, soft window light from behind, shallow depth of field, 16:9."

4. B-roll / previz

"Wide shot of a figure in a red parka approaching a glowing Antarctic research station at twilight, slow forward tracking, the camera then pulls back into a wide aerial, howling wind continuous, boots crunching frozen snow, faint radio crackle from inside the station, atmospheric ambient pad, cool blue palette, 16:9."

5. E-commerce image-to-video

"White running shoes on a charcoal pedestal, slow 360-degree orbit revealing tread, mesh, and neon accents, fine dust particles drift through a key light beam, soft whoosh as the shoe rotates, faint rubber creak, soft landing thud at the end of the rotation, soft studio lighting, 1:1."

6. Research stress test

"Three-piece jazz ensemble in a dim club: drums brushed lightly, walking double bass, saxophone solo. The audience taps a glass on the table in rhythm. Smoke drifts through a single overhead spotlight, vintage 16mm film grain, warm amber tungsten, slow lateral tracking from drums to saxophonist, 16:9."

Quick comparison

vs.	HappyHorse	Other model
Seedance 2.0	8 steps, joint audio, lip-sync in 6 languages	Multi-reference (up to 12 assets), 2K, native multi-shot
Kling 3.0	Open path, fast inference	4K, established lip-sync
Veo 3	Unified architecture, 3–6× faster	Spatial audio, native 4K
Wan 2.2	Single-pass joint audio	Open-source today; HappyHorse weights not yet public

Real limitations

Weights and code are not yet published; use the model through OmniArt or the Dashscope API.
Clips are capped at 15 s with no native multi-shot timeline; chain another model's Extend feature for longer output.
Only text and image references are supported; for video or audio references, use Seedance 2.0.
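
When a target runtime exceeds the 15 s cap, it has to be broken into clips that each fit the 3–15 s window before any chaining. The planner below is a minimal sketch of that split, assuming whole-second durations; `plan_segments` is a hypothetical helper and does not call any generation or Extend API.

```python
import math

MIN_SECONDS, MAX_SECONDS = 3, 15  # per-clip duration window from the spec table


def plan_segments(total_seconds):
    """Split a whole-second runtime into near-equal clips of 3-15 s each."""
    if total_seconds < MIN_SECONDS:
        raise ValueError(f"runtime must be at least {MIN_SECONDS} s")
    count = math.ceil(total_seconds / MAX_SECONDS)
    base, extra = divmod(total_seconds, count)
    # Spread the remainder one second at a time so no clip ends up too short.
    return [base + 1] * extra + [base] * (count - extra)


print(plan_segments(32))  # → [11, 11, 10]
```

Splitting evenly, rather than filling 15 s clips and leaving a short tail, avoids a final segment that falls under the 3 s minimum.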

Notes

The DMD-2 variant without CFG is the production default; use the base model only when you need maximum quality and can afford a longer iteration loop.

Get started on OmniArt

On OmniArt, HappyHorse sits alongside Seedance, Kling, Veo 3, Sora 2, and V6 under one account and one balance. Start with the social ASMR brief, then try the e-commerce image-to-video one.

Choosing between HappyHorse and Seedance: see the comparison. For long-form narrative: BACH.
