Audience
Multimodal model teams, agent teams, safety reviewers
A real-time avatar has to preserve voice, face, expression, timing, and conversation state while meeting latency targets, all at once.

Core idea
The hard part is not making an avatar move. The hard part is keeping every modality in distribution while it responds live.
Watch on YouTube · 7:49
This is why the constellation framing matters beyond a demo: a live agent needs runtime checks, not just a trained prior.
The videos are raw build context. These notes translate them into the shortest useful frame for creators, companies, and AI lab readers.
Identity drift can occur in face, voice, affect, or timing.
Latency pressure should not remove verification.
A meeting bot needs explicit consent and scope controls.
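The runtime-check idea above can be made concrete. A minimal sketch, assuming face and voice embeddings are compared against enrolled reference centroids before each frame is emitted; the function names, the embedding shapes, and the 0.75 threshold are all illustrative, not from the notes:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_frame(face_emb, voice_emb, enrolled, threshold=0.75):
    """Flag identity drift before a frame is emitted.

    `enrolled` maps modality name -> reference centroid embedding.
    The check runs on every frame: latency pressure may shrink the
    embedding model, but it never skips verification.
    """
    drifted = []
    for name, emb in (("face", face_emb), ("voice", voice_emb)):
        if cosine(emb, enrolled[name]) < threshold:
            drifted.append(name)
    return drifted  # empty list means the frame passes
```

The same loop extends to affect and timing by adding more (name, embedding) pairs; the point is that each modality gets an explicit in-distribution check rather than relying on the trained prior.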
Related notes stay inside the same problem area first, then move to the next useful context.

Watch + read / 14:09
ClipCannon breaks video into transcripts, frames, scenes, emotion, speaker, prosody, highlights, storyboards, and provenance.
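One way to picture that decomposition is as a typed index over the source video. This is a sketch of a plausible shape, not ClipCannon's actual schema; every field name here is an assumption:

```python
from dataclasses import dataclass, field

@dataclass
class Scene:
    start: float          # seconds into the source video
    end: float
    transcript: str       # words spoken inside this scene
    speaker: str          # diarized speaker label
    emotion: str          # coarse affect label for the scene
    keyframes: list[str] = field(default_factory=list)  # frame paths

@dataclass
class ClipIndex:
    source: str            # provenance: original video identifier
    scenes: list[Scene]    # ordered scene breakdown
    highlights: list[int]  # indices into `scenes` marked as highlights
```

Storyboards, prosody, and captions would hang off the same scene records; keeping everything keyed to scene time ranges is what lets downstream tools stay aligned.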

Watch + read / 13:09
The editor only works because the system already knows scenes, transcript timing, narrative flow, captions, crops, and render constraints.
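Knowing the render constraints up front means the editor can reject a bad cut list before rendering. A minimal sketch, assuming cuts are (start, end) pairs in seconds; the constraint values and function name are placeholders, not taken from the notes:

```python
def validate_cuts(cuts, min_len=1.0, max_total=60.0):
    """Reject an edit list that breaks simple render constraints.

    `min_len` guards against cuts too short to read a caption over;
    `max_total` caps the target render duration.
    """
    total = 0.0
    for start, end in cuts:
        length = end - start
        if length < min_len:
            return False
        total += length
    return total <= max_total
```

Real constraints would also cover crop aspect ratios and caption timing, but the shape is the same: the editor validates against what the system already knows instead of discovering failures at render time.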

Watch + read / 7:57
The measured voice result comes from reference selection, full ICL, best-of-N scoring, centroid enrollment, and quality gates.
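The centroid-enrollment and best-of-N steps compose naturally. A hedged sketch, assuming speaker embeddings and a scalar quality score per candidate take; the threshold and function names are illustrative:

```python
import numpy as np

def enroll_centroid(ref_embs):
    # Centroid enrollment: average the reference speaker embeddings.
    return np.mean(ref_embs, axis=0)

def best_of_n(candidates, centroid, quality_floor=0.6):
    """Pick the candidate take closest to the enrolled voice.

    `candidates` is a list of (embedding, quality_score) pairs.
    The quality gate drops takes below `quality_floor` before
    any similarity scoring happens.
    """
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    gated = [(emb, q) for emb, q in candidates if q >= quality_floor]
    if not gated:
        return None  # no take survives the gate; regenerate instead
    return max(gated, key=lambda c: cos(c[0], centroid))
```

Returning None when nothing passes the gate matters: a measured result comes from refusing to ship the least-bad take, not from always picking one.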
Send the audience, data type, target task, proof bar, and sharing limits.