API for GenAI

The AI Music Video Prompt: How Songbrain Powers GenAI Video Tools

May 14, 2026 · 8 min read

Sora 2 renders cinematic 4K. Runway Gen-4 is up to 20 seconds per shot. Veo 3 nails physics. Kling 2 shoots clean dolly-ins. The video generation models are no longer the bottleneck. The prompt is.

And when the goal is a music video, the prompt has a problem no general-purpose video tool can solve on its own: it needs to know the song. Sec-precise. Where the drop is. Which moments are clip-worthy. What the dominant mood is at second 22 vs. second 88. Who's actually going to watch it.

That's the layer Songbrain owns. Our long-term goal is to be the best music-intelligence API for genAI tooling — specifically the one video-gen tools call before they render a frame. This guide walks through what we hand back, and what a Sora/Runway/Veo/Kling-class model does with it.

What a video-gen AI gets from one API call

best_momentsSec-precise start/end of each hook, with confidence score and moment type (intro_grab, pre_drop, chorus, drop, outro_peak).

energy_curveTime-series energy 0–1 across the full song, plus the exact moments where it inflects.

mood_signatureDominant moods per section (aggressive, euphoric, nocturnal, melancholic, hype, tense). The emotional script for the visual.

subgenre + style_descriptorsVisual aesthetic genome — e.g. drift phonk → chrome / sodium-vapor / JDM cars / dust particles / 90s VHS overlay.

target_audienceDemographics, regions, platforms and feed contexts derived from mood + subgenre + live trend data.

lyric_hooksThe strongest repeating lyric lines, with timestamps. The lines the visual should literalize or contradict for maximum tension.

trend_matchWhat's viral in this subgenre this week. Visual style cues the AI should lean into for feed-fit.

From data to scene script

On its own, that data dump is just a JSON blob. The magic happens in the next step: Songbrain compiles it into a scene-by-scene video prompt that speaks the language of modern video-gen models. Sec-precise. Camera-aware. Aesthetic-locked. Audience-tuned.

Below is a full example. The song: a fictional drift phonk track called “Night Lap,” 2:46, Virality Score 88. This is the exact prompt-spec a video-gen tool would receive when calling Songbrain's API and asking “turn this into a music video.”

Example: “Night Lap” — Songbrain video prompt spec

GET /api/v1/genai/video-spec/{job_id}

{

"song_id": "sb_a3d9_night_lap",

"duration_sec": 166,

"subgenre": "Drift Phonk",

"virality_score": 88,

"primary_moods": ["aggressive", "nocturnal", "tense", "hype"],

"global_aesthetic": "sodium-vapor orange + neon cyan, crushed blacks, JDM coupe culture, 21:9 letterbox, 90s VHS chroma bleed, dust particles in cone-of-light, light leaks on every drop",

"target_audience": {

"demo": "male-leaning, 16–26",

"regions": ["Eastern Europe", "LATAM", "SEA", "US car-culture pockets"],

"platforms": ["TikTok", "YouTube Shorts", "Instagram Reels"],

"feed_contexts": ["drift edits", "anime AMV", "gym hype", "JDM car compilations"],

"trend_tags": ["#driftphonk", "#nightlap", "#jdm", "#phonkedit"]

"scenes": [

{

"t": "0:00 – 0:08",

"energy": 0.35,

"mood": "tense",

"prompt": "Slow dolly-in toward a black-tinted Nissan S15 parked in a wet underground garage. Sodium-vapor orange overhead, neon cyan reflections on the hood. Engine off. Steam rising from the asphalt. Cinematic 21:9. 35mm anamorphic lens, shallow depth of field. No subject yet — just the car."

{

"t": "0:08 – 0:18",

"energy": 0.55,

"mood": "tense rising",

"prompt": "Quick cuts (0.8s avg): male hands on steering wheel (rings, scarred knuckles), key turning in ignition, hand on gear shift, lighter clicking. Bleach-bypass color grade, crushed blacks. Each cut on the cowbell hit. No face visible yet — withhold the subject."

{

"t": "0:18 – 0:22",

"energy": 0.85,

"label": "PRE-DROP BUILD",

"mood": "tension max",

"prompt": "Engine starts. Camera pushes in 24fps slow on the driver's eyes only — male, late 20s, sharp jaw, dead-serious into the lens. Eyes lit from below by the dashboard. Dust particles drifting through the cone of light. Hold the stare. The world goes quiet for half a beat."

{

"t": "0:22",

"energy": 1.0,

"label": "★ DROP — visual pop",

"mood": "aggressive release",

"prompt": "HARD CUT TO: tires breaking traction in slow-mo 240fps. Smoke geyser from the rear wheels. Reverse-shot from a low front-bumper rig as the car launches forward. RGB chromatic aberration on each rim spoke. Single hard lens flare across the windshield. Strobe-frame flash of the headlight cutting on. Letterbox cracks open one frame wider on this beat. This is the most clip-worthy frame in the whole video — designed for the TikTok freeze-frame."

{

"t": "0:22 – 0:42",

"energy": 0.92,

"label": "★ BEST MOMENT — primary chorus",

"mood": "aggressive + euphoric",

"prompt": "Drift sequence on a multi-level neon parking deck. Whip-pan follow shot, then mounted side-cam through the door window — driver's profile lit by the strip lights flickering past. Tire-smoke catches the cyan and goes magenta where it hits the orange. Cut on every 4th beat: wide drift / driver close-up / wheel slip / sparks. Letterbox stays. Color grade: peak saturation, no blooms — keep it punchy, not dreamy. This is the segment 90% of the clips will sample from."

{

"t": "0:42 – 1:08",

"energy": 0.45,

"mood": "nocturnal, hollow",

"prompt": "Cut to static wide on a rain-slick bridge overpass. Single sodium light. Car parked. Driver stands at the rail, smoking, back to camera, looking down at the city. No movement for 6 full seconds — let the breakdown breathe. Then slow handheld push-in on his shoulders. VHS tracking artifacts at the edges of the frame. Mood: he's alone with whatever this song is about."

{

"t": "1:08 – 1:24",

"energy": 0.78,

"mood": "building anger",

"prompt": "Back in the driver seat. Quick intercuts: rearview mirror (red and blue flashing far behind, just a hint), foot pressing the pedal harder, shifter slamming into gear, dashboard tach climbing to redline. Cut tempo accelerates from 1.2s shots down to 0.4s shots over the 16-second window."

{

"t": "1:24",

"energy": 1.0,

"label": "★ DROP 2 — chase release",

"mood": "high-octane",

"prompt": "Same visual pop as scene 4 — but now the car is moving. Top-down drone tracking shot of the S15 weaving between cones / barriers / sleeping traffic. Cut to: side-mounted shot, full slide, smoke wall obscures everything for half a second, smoke clears on the next beat, car still in frame."

{

"t": "1:24 – 1:58",

"energy": 0.95,

"mood": "aggressive + euphoric (callback)",

"prompt": "Variation on scene 5's chorus visual but at street-level: long empty boulevard, palm-tree silhouettes against orange sky, drift-line cutting across the centerline. Reuse the side-cam-through-window angle on the driver — same shot, different city. The callback is the point: the audience should recognize the visual rhyme between the two choruses."

{

"t": "1:58 – 2:46",

"energy": 0.5,

"label": "★ OUTRO PEAK — emotional close",

"mood": "spent, melancholic, defiant",

"prompt": "Sunrise pulling the orange out of the sky. Car parked on an empty cliff road overlooking the city. Engine off, ticking from heat. Driver's silhouette against the dawn — same eyes from scene 3 now reflective, not aggressive. Last 6 seconds: slow dolly out, the car gets smaller, the city wakes up, the music fades on a held cowbell. Hard cut to black."

}

"editor_notes": {

"never_show": ["the antagonist", "lyrics on screen", "stock dashcam stock-footage"],

"keep_letterbox_throughout": true,

"clip_export_targets": [

"0:18 – 0:42 (TikTok primary, 24s)",

"1:24 – 1:39 (Reels secondary, 15s)",

"2:32 – 2:46 (Shorts outro tease, 14s)"

]

}

Why this matters for video-gen tools

A video-gen model on its own has no reason to cut on beat 22. It doesn't know the chorus comes back at 1:24. It doesn't know that drift phonk audiences expect the visual rhyme between chorus one and chorus two, or that the breakdown demands stillness. It guesses. The result — even from Sora 2 or Veo 3 — is technically clean but emotionally arbitrary.

Songbrain's job is to remove the guessing. Every video-gen company building a “music video” product is going to need this layer. The choice is: build the music intelligence stack themselves (a multi-year, 10-model pipeline), or call our API.

What else changes when you add Songbrain

Clip export targets are pre-computed. The same render is auto-trimmed into TikTok, Reels and Shorts cuts using Songbrain's Best Moments timestamps. No human picks the clip.
Audience-aware visual choices. A K-pop video gets a different aesthetic library than a deathcore one — not because the model picked it, but because Songbrain's subgenre + audience layer told it to.
Trend-aware aesthetics. If drift phonk is currently leaning into VHS chroma bleed and away from glossy CGI this week, the prompt reflects that automatically.
Caption + hashtag handoff. The same API call also returns the social-media kit: caption, hashtags, hook line. The video and the post share one source of truth.

Who this is for

Three kinds of customer, all building the same gap:

Video-gen platforms adding a “music video” mode (Sora, Runway, Veo, Kling-class tools wanting better audio-aware output).
AI music video startups building consumer products on top of Sora/Runway/Veo APIs.
Label / DSP / promo tools wanting auto-generated promo videos per release without hiring an editor for every track.

We're onboarding integration partners now. The API returns the JSON above — plus the original full analysis (Virality Score, Best Moments, Song DNA, lyrics, genre, reel reverse-engineering data) — in under 60 seconds per song.