Audio-Video Synchronization

A proposal for semantic binding between narration and visuals

AI Video Factory Enhancement Proposal

The Problem

Current pipeline treats audio and video as separate concerns:

Talk Track → Audio
Visual Style → Scene Plan → Videos

Missing: Semantic Binding

No connection between what's being said and what's being shown at each moment.

Real-World Example

When the narration says specific things, the visuals should match:

Audio Says...                              Video Should Show...
"When you receive a 1099 form..."          Close-up of 1099 tax form
"...the revenue flows directly to you"     Animated money flow to athlete
"But here's what most athletes miss..."    Athlete looking confused at paperwork

Currently, we can't guarantee this alignment.

Proposed Solutions

📋

Unified Storyboard

Single YAML document combining narration + visuals per scene

🏷️

Tagged Script

SSML-like inline visual cues within the narration text

🎬

Screenplay Format

Two-column format like traditional film scripts

🤖

Smart Extraction

Enhanced LLM analysis with word-level timestamps

Recommended: Unified Storyboard

scenes:
  - narration: "When you receive a 1099 form from a brand deal..."
    visual: "Close-up of 1099 tax form with NIL branding"

  - narration: "...the revenue flows directly to you as the athlete."
    visual: "Animated money flow: brand logo → athlete"

  - narration: "But here's what most athletes miss..."
    visual: "Athlete looking confused at paperwork"

✓ Advantages

  • Crystal clear alignment
  • Each scene self-contained
  • Easy to review & edit
  • Matches pro workflow
  • LLM can help generate

✗ Trade-offs

  • More upfront planning
  • Changes input format
  • Learning curve for users

The Sora Constraint

Sora generates clips of only 4, 8, or 12 seconds, which creates timing challenges:

Narration                       Duration   Sora Options   Mismatch
"When you receive a 1099..."    ~2.5s      4s             +1.5s extra
"...revenue flows to you."      ~2.8s      4s             +1.2s extra
Longer explanation passage      ~9s        8s or 12s      -1s or +3s

Key Insight

You can't get a 2.5-second clip from Sora. We need a strategy to handle this.
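
The mismatch arithmetic in the table can be sketched in a few lines of Python. This is only an illustration: the function name `clip_options` is hypothetical, and the clip lengths are taken from the constraint stated in this proposal rather than from any Sora API.

```python
# Clip lengths as stated in this proposal (assumption: 4, 8, or 12 seconds only).
SORA_CLIP_SECONDS = (4, 8, 12)


def clip_options(narration_seconds: float) -> list[tuple[int, float]]:
    """Return the nearest clip lengths and their mismatch against the narration.

    A positive mismatch is spare footage; a negative mismatch means the clip
    is shorter than the narration.
    """
    shorter = [c for c in SORA_CLIP_SECONDS if c < narration_seconds]
    longer = [c for c in SORA_CLIP_SECONDS if c >= narration_seconds]
    # Keep the closest clip below (if any) and the closest clip at or above (if any).
    candidates = shorter[-1:] + longer[:1]
    return [(c, round(c - narration_seconds, 1)) for c in candidates]


for seconds in (2.5, 2.8, 9):
    print(seconds, clip_options(seconds))
```

For the ~9s passage this yields both the 8s option (1s short) and the 12s option (3s spare), which is the choice a grouping strategy has to resolve.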

Solution: Scene Grouping

Group multiple narration beats into Sora-compatible durations:

scenes:
  - duration: 8  # One 8-second Sora clip
    visual: "1099 form on desk, camera pulls back to reveal athlete"
    narration:
      - "When you receive a 1099 form..."      # ~2.5s
      - "...revenue flows directly to you."    # ~2.8s
      # Total: ~5.3s narration + 2.7s visual breathing room

Calculate narration duration → Snap to 4/8/12s → Generate cohesive visual
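
A minimal sketch of the first two steps, assuming beat durations have already been measured (for example from generated audio). The names `snap_to_clip` and `group_beats`, the greedy packing strategy, and the default 8-second cap are all assumptions made for illustration, not part of the existing pipeline.

```python
# Clip lengths as stated in this proposal (assumption: 4, 8, or 12 seconds only).
SORA_CLIP_SECONDS = (4, 8, 12)


def snap_to_clip(total_seconds: float) -> int:
    """Round a narration duration up to the smallest clip that can hold it."""
    for clip in SORA_CLIP_SECONDS:
        if clip >= total_seconds:
            return clip
    raise ValueError(f"{total_seconds}s does not fit in a single clip")


def group_beats(beat_seconds: list[float], max_clip: int = 8) -> list[list[float]]:
    """Greedily pack consecutive beats into groups of at most max_clip seconds."""
    groups: list[list[float]] = []
    current: list[float] = []
    for seconds in beat_seconds:
        if seconds > max(SORA_CLIP_SECONDS):
            raise ValueError(f"a {seconds}s beat is longer than any clip")
        # Start a new group when the next beat would overflow the cap.
        if current and sum(current) + seconds > max_clip:
            groups.append(current)
            current = []
        current.append(seconds)
    if current:
        groups.append(current)
    return groups


for group in group_beats([2.5, 2.8, 3.0]):
    clip = snap_to_clip(sum(group))
    print(group, "->", clip, "s clip,", round(clip - sum(group), 1), "s breathing room")
```

Rounding up rather than to the nearest clip length is a deliberate choice in this sketch: spare seconds become breathing room, whereas a clip shorter than its narration would cut the visual off mid-sentence.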

Thematic vs. Literal Sync

❌ Literal (Impossible)

Frame-exact sync where every word has a matching visual cue.

Requires precise clip lengths Sora can't provide.

✓ Thematic (Recommended)

Visual theme matches narration topic. Extra seconds become natural "breathing room."

This is how professional B-roll actually works.

Professional Reality

Real videos rarely have frame-exact sync. A few extra seconds of "1099 form on screen" while discussing 1099s is perfectly natural.

Final Proposed Schema

scenes:
  - target_duration: 8
    visual_theme: "Tax documentation and money flow"
    visual_prompt: "Professional desk with 1099 form, soft lighting,
                     subtle camera drift. Form shows NIL income details."
    beats:
      - text: "When you receive a 1099 form from a brand deal..."
        visual_emphasis: "1099 form visible"
      - text: "...the revenue flows directly to you."
        visual_emphasis: "pull back to show bigger picture"

🎯

Grouped Beats

Multiple narration segments per scene

⏱️

Target Duration

Snaps to 4/8/12 automatically

🎨

Visual Theme

Cohesive prompt for Sora
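
One possible in-memory representation of this schema, sketched with Python dataclasses. The field names follow the YAML above; the class names, the `problems` check, and the `scene_from_dict` helper are hypothetical and only show how a parsed storyboard might be validated before generation.

```python
from dataclasses import dataclass, field

# Clip lengths as stated in this proposal (assumption: 4, 8, or 12 seconds only).
ALLOWED_DURATIONS = (4, 8, 12)


@dataclass
class Beat:
    text: str
    visual_emphasis: str = ""


@dataclass
class Scene:
    target_duration: int
    visual_theme: str
    visual_prompt: str
    beats: list[Beat] = field(default_factory=list)

    def problems(self) -> list[str]:
        """Return human-readable validation issues; an empty list means valid."""
        issues = []
        if self.target_duration not in ALLOWED_DURATIONS:
            issues.append(f"target_duration must be one of {ALLOWED_DURATIONS}")
        if not self.visual_prompt.strip():
            issues.append("visual_prompt is empty")
        if not self.beats:
            issues.append("scene has no beats")
        return issues


def scene_from_dict(raw: dict) -> Scene:
    """Build a Scene from one already-parsed entry of the `scenes` list."""
    return Scene(
        target_duration=raw["target_duration"],
        visual_theme=raw["visual_theme"],
        visual_prompt=raw["visual_prompt"],
        beats=[Beat(**beat) for beat in raw.get("beats", [])],
    )


scene = scene_from_dict({
    "target_duration": 8,
    "visual_theme": "Tax documentation and money flow",
    "visual_prompt": "Professional desk with 1099 form, soft lighting.",
    "beats": [{"text": "When you receive a 1099 form from a brand deal...",
               "visual_emphasis": "1099 form visible"}],
})
print(scene.problems())
```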

Implementation Steps

1. Schema Design

Define unified storyboard format with beats, themes, and duration targets

2. Model Update

Update VideoJob Prisma schema to support new storyboard structure

3. Scene Planner

Modify LLM scene planner to work with grouped beats and duration snapping

4. UI Update

Build storyboard editor for intuitive input and editing

User Brief → LLM Storyboard → Review/Edit → Generate Video

Summary

Unified Storyboard

Combine narration + visuals in one document

Scene Grouping

Bundle beats into 4/8/12s Sora clips

Thematic Sync

Visual themes match narration topics

Explicit is better than implicit.

Let users define the connection between audio and video.
