Audience
Speech researchers, evals leads, safety reviewers
The measured voice result comes from reference selection, full ICL, best-of-N scoring, centroid enrollment, and quality gates.
Media · PAPER + alldata.md · 7:57

Core idea
The model is not the whole story. Pipeline engineering can move identity agreement when the scorer, reference set, and selection loop are aligned.
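The selection loop named above can be sketched minimally. This is an illustrative sketch, not the project's code: the thresholds, embedding dimension, and function names are assumptions, and a real pipeline would embed audio with a speaker encoder (the notes use a WavLM-family scorer) rather than take embeddings directly.

```python
import numpy as np

def normalize(v):
    """L2-normalize an embedding so dot products are cosine similarities."""
    return v / np.linalg.norm(v)

def enroll_centroid(ref_embs):
    """Centroid enrollment: mean of the normalized reference embeddings,
    re-normalized so scores stay on the cosine scale."""
    embs = np.stack([normalize(e) for e in ref_embs])
    return normalize(embs.mean(axis=0))

def best_of_n(cand_embs, centroid):
    """Best-of-N: cosine-score each candidate take against the enrolled
    centroid and return the index and score of the best take."""
    scores = [float(np.dot(normalize(e), centroid)) for e in cand_embs]
    i = int(np.argmax(scores))
    return i, scores[i]
```

The point of the centroid is robustness: averaging several reference takes before scoring makes best-of-N less sensitive to any single noisy reference.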
The scope is narrow and useful: one speaker, English, within-WavLM-family scoring. The website should say that clearly instead of overselling it.
The videos are raw build context. These notes translate them into the shortest useful frame for creators, companies, and AI lab readers.
Report the encoder-matched scope every time.
Best-of-N only matters if the scorer matches the target.
Naturalness and identity are separate checks.
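Treating naturalness and identity as separate checks implies a gate that a take must clear on both axes independently, so a highly similar but robotic take fails, as does a fluent but off-target one. A minimal sketch; the threshold values and function name are illustrative assumptions, not from the source:

```python
def passes_gates(identity_score, naturalness_score,
                 id_threshold=0.80, nat_threshold=3.5):
    """Quality gating: identity (cosine similarity to the enrolled
    centroid) and naturalness (e.g. a MOS-style score) are checked
    independently; a take must clear BOTH thresholds to pass.
    Thresholds are illustrative placeholders."""
    return identity_score >= id_threshold and naturalness_score >= nat_threshold
```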
Related notes stay inside the same problem area first, then move to the next useful context.

Watch + read / 14:09
ClipCannon breaks video into transcripts, frames, scenes, emotion, speaker, prosody, highlights, storyboards, and provenance.

Watch + read / 13:09
The editor only works because the system already knows scenes, transcript timing, narrative flow, captions, crops, and render constraints.

Watch + read / 7:49
A real-time avatar has to preserve voice, face, expression, timing, and conversation state while meeting latency targets, all at once.
Send the audience, data type, target task, proof bar, and sharing limits.