Five Megabytes Of Shakespeare Became A Training System

The Shakespeare case shows how a small corpus can become supervised fine-tuning (SFT) examples, direct preference optimization (DPO) pairs, graph edges, style centroids, and verification checks.

Media / PAPER + video / 11:49

Audience

Post-training researchers, evals leads, language model teams

Core idea

A small text corpus can be transformed into multiple supervision surfaces when the system extracts structure instead of just counting tokens.
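To make the core idea concrete, here is a minimal sketch of turning one corpus into two supervision surfaces at once. It is not the pipeline shown in the video. The names `SFTExample`, `DPOPair`, `split_passages`, `build_sft`, and `build_dpo` are hypothetical, as is the choice to split on blank lines and to use a passage from elsewhere in the corpus as the rejected continuation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTExample:
    prompt: str
    completion: str


@dataclass(frozen=True)
class DPOPair:
    prompt: str
    chosen: str
    rejected: str


def split_passages(corpus: str) -> list[str]:
    """Split on blank lines and drop empty passages (an assumed segmentation)."""
    return [p.strip() for p in corpus.split("\n\n") if p.strip()]


def build_sft(passages: list[str]) -> list[SFTExample]:
    """Each passage becomes a prompt for the passage that follows it."""
    return [SFTExample(a, b) for a, b in zip(passages, passages[1:])]


def build_dpo(passages: list[str], offset: int = 2) -> list[DPOPair]:
    """Chosen is the true continuation; rejected is a passage taken from
    elsewhere in the same corpus (a deliberately simple negative)."""
    n = len(passages)
    pairs = []
    for i in range(n - 1):
        j = (i + 1 + offset) % n
        if j in (i, i + 1):
            continue  # never reject the prompt itself or the true continuation
        pairs.append(DPOPair(passages[i], passages[i + 1], passages[j]))
    return pairs
```

Both builders read the same passage list, so any structure extracted once (segmentation here, speakers or scenes in a richer version) is reused by every supervision surface rather than recomputed for each one.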

Founder source

Style LoRA


This is the most intuitive demo of meaning compression: the same source yields more structured training material and more checks.

The videos are raw build context. These notes translate them into the shortest useful frame for creators, companies, and AI lab readers.

SFT, DPO, and verification can all come from one corpus pipeline.

High style fidelity needs negative examples, not just imitation: a preference pair shows the model what off-style output looks like.

The audit has to catch memorized artifacts like headers.
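A header audit of the kind described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the audit used in the video: the function name `find_header_artifacts`, the `known_headers` argument, and the regular expression for act and scene headings are all hypothetical choices.

```python
import re

# Assumed pattern: upper-case structural headings such as "ACT III" or "SCENE II".
STRUCTURAL_HEADER = re.compile(r"^\s*(ACT|SCENE)\s+[IVXLC]+\b")


def find_header_artifacts(generated: str, known_headers: set[str]) -> list[str]:
    """Return lines of generated text that look like memorized headers.

    A line is flagged if it equals a known corpus header (case-insensitive)
    or matches the structural heading pattern above.
    """
    normalized = {h.strip().lower() for h in known_headers}
    flagged = []
    for line in generated.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.lower() in normalized or STRUCTURAL_HEADER.match(stripped):
            flagged.append(stripped)
    return flagged
```

An empty result means no header-like lines were found. It does not show the output is free of other memorized text, so a check like this belongs alongside the other verification checks rather than in place of them.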

