A proposal for semantic binding between narration and visuals
AI Video Factory Enhancement Proposal
The current pipeline treats audio and video as separate concerns: there is no connection between what is being said and what is being shown at each moment.
When the narration says specific things, the visuals should match:
| Audio Says... | Video Should Show... |
|---|---|
| "When you receive a 1099 form..." | Close-up of 1099 tax form |
| "...the revenue flows directly to you" | Animated money flow to athlete |
| "But here's what most athletes miss..." | Athlete looking confused at paperwork |
Currently, we can't guarantee this alignment.
Single YAML document combining narration + visuals per scene
SSML-like inline visual cues within the narration text
Two-column format like traditional film scripts
Enhanced LLM analysis with word-level timestamps
    scenes:
      - narration: "When you receive a 1099 form from a brand deal..."
        visual: "Close-up of 1099 tax form with NIL branding"
      - narration: "...the revenue flows directly to you as the athlete."
        visual: "Animated money flow: brand logo → athlete"
      - narration: "But here's what most athletes miss..."
        visual: "Athlete looking confused at paperwork"
Sora generates clips only in 4-, 8-, or 12-second lengths. This creates timing challenges:
| Narration | Duration | Sora Options | Mismatch |
|---|---|---|---|
| "When you receive a 1099..." | ~2.5s | 4s | +1.5s extra |
| "...revenue flows to you." | ~2.8s | 4s | +1.2s extra |
| Longer explanation passage | ~9s | 8s or 12s | -1s or +3s |
You can't get a 2.5-second clip from Sora. We need a strategy to handle this.
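The mismatch column in the table above is simple arithmetic: clip length minus narration length. A minimal sketch of that calculation follows; the function name `clip_options` and the constant `SORA_CLIP_LENGTHS` are illustrative assumptions, not part of the existing pipeline.

```python
# Clip lengths taken from this proposal; adjust if the provider's options change.
SORA_CLIP_LENGTHS = (4, 8, 12)  # seconds


def clip_options(narration_s: float) -> list[tuple[int, float]]:
    """Return (clip_length, mismatch) for the nearest shorter and longer clips.

    A positive mismatch means extra video after the narration ends;
    a negative mismatch means the narration would be cut off.
    """
    shorter = [c for c in SORA_CLIP_LENGTHS if c < narration_s]
    longer = [c for c in SORA_CLIP_LENGTHS if c >= narration_s]
    # Keep only the closest candidate on each side of the narration length.
    candidates = shorter[-1:] + longer[:1]
    return [(c, round(c - narration_s, 2)) for c in candidates]


if __name__ == "__main__":
    for seconds in (2.5, 2.8, 9.0):
        print(seconds, clip_options(seconds))
```

A short beat has only one realistic option (the 4-second clip), while a 9-second passage falls between two clip lengths and forces a choice between truncation and padding.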
Group multiple narration beats into Sora-compatible durations:
    scenes:
      - duration: 8  # One 8-second Sora clip
        visual: "1099 form on desk, camera pulls back to reveal athlete"
        narration:
          - "When you receive a 1099 form..."    # ~2.5s
          - "...revenue flows directly to you."  # ~2.8s
        # Total: ~5.3s narration + 2.7s visual breathing room
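One way to do this grouping is a greedy pass over the beats. The sketch below is an assumption about how it could work, not the pipeline's actual planner: `group_beats`, `Scene`, and the `max_clip` budget are hypothetical names, and beat durations are assumed to be already estimated.

```python
from dataclasses import dataclass, field

SORA_CLIP_LENGTHS = (4, 8, 12)  # seconds, as stated in this proposal


@dataclass
class Scene:
    beats: list = field(default_factory=list)
    narration_s: float = 0.0
    duration: int = 0  # snapped clip length in seconds


def _close(scene: Scene) -> Scene:
    """Snap a scene to the smallest clip length that covers its narration."""
    for length in SORA_CLIP_LENGTHS:
        if scene.narration_s <= length:
            scene.duration = length
            return scene
    raise ValueError(f"{scene.narration_s:.1f}s of narration exceeds the longest clip")


def group_beats(beats, max_clip: int = 8) -> list:
    """Pack consecutive (text, seconds) beats into scenes of at most max_clip seconds."""
    scenes, current = [], Scene()
    for text, seconds in beats:
        # Start a new scene when the next beat would overflow the budget.
        if current.beats and current.narration_s + seconds > max_clip:
            scenes.append(_close(current))
            current = Scene()
        current.beats.append(text)
        current.narration_s += seconds
    if current.beats:
        scenes.append(_close(current))
    return scenes
```

With the two beats from the example above (about 2.5s and 2.8s), this yields a single 8-second scene with roughly 2.7 seconds of breathing room. A beat longer than 12 seconds raises an error and would need to be split upstream.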
Frame-exact sync, where every word has a matching visual cue, requires precise clip lengths that Sora can't provide.
Instead, the visual theme matches the narration topic, and the extra seconds become natural "breathing room."
This is how professional B-roll actually works.
Real videos rarely have frame-exact sync. A few extra seconds of "1099 form on screen" while discussing 1099s is perfectly natural.
    scenes:
      - target_duration: 8
        visual_theme: "Tax documentation and money flow"
        visual_prompt: "Professional desk with 1099 form, soft lighting, subtle camera drift. Form shows NIL income details."
        beats:
          - text: "When you receive a 1099 form from a brand deal..."
            visual_emphasis: "1099 form visible"
          - text: "...the revenue flows directly to you."
            visual_emphasis: "pull back to show bigger picture"
`beats`: multiple narration segments per scene
`target_duration`: snaps to 4/8/12 seconds automatically
`visual_prompt`: one cohesive prompt for Sora
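A minimal sketch of how the scene structure above might be modeled once the YAML is parsed into plain dictionaries. The class names `Beat` and `Scene`, the helper `scene_from_dict`, and the snap-up rule (round to the smallest allowed length that covers the request, capped at 12) are assumptions for illustration; the field names come from the example.

```python
from dataclasses import dataclass, field

ALLOWED_DURATIONS = (4, 8, 12)  # seconds, as stated in this proposal


@dataclass
class Beat:
    text: str
    visual_emphasis: str = ""


@dataclass
class Scene:
    target_duration: float
    visual_theme: str
    visual_prompt: str
    beats: list[Beat] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Assumed rule: snap up to the smallest allowed length, capped at the longest.
        self.target_duration = next(
            (d for d in ALLOWED_DURATIONS if self.target_duration <= d),
            ALLOWED_DURATIONS[-1],
        )


def scene_from_dict(raw: dict) -> Scene:
    """Build a Scene from one parsed entry of the `scenes` list."""
    beats = [Beat(**b) for b in raw.get("beats", [])]
    return Scene(
        target_duration=raw["target_duration"],
        visual_theme=raw["visual_theme"],
        visual_prompt=raw["visual_prompt"],
        beats=beats,
    )
```

Snapping in `__post_init__` means a user can enter any target (say, 6 seconds) and the stored value is always one Sora can actually generate.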
Define unified storyboard format with beats, themes, and duration targets
Update VideoJob Prisma schema to support new storyboard structure
Modify LLM scene planner to work with grouped beats and duration snapping
Build storyboard editor for intuitive input and editing
Combine narration + visuals in one document
Bundle beats into 4/8/12s Sora clips
Visual themes match narration topics
Explicit is better than implicit.
Let users define the connection between audio and video.