Fully operational. An identity-locked talking head that reproduces the person exactly: voice, micro-expressions, and all.
One 16-minute interview becomes a model that regenerates the subject at indistinguishable grade: prosody, lip-sync, 196+ captured micro-expressions, emotional-range variants, breathing, laughter, even beard geometry. The approach is model-agnostic: it wraps any video generator you can point at it. Synthesia (valued at $4B) and HeyGen ($500M) sit inside a 12–18 month absorption window; the $42.29B avatar-video market opens structurally for products that ship per-frame cosine verification as a first-class property (sketched below), and this is that product.
MEASURED · Fully operational · demoed live · model-agnostic
For enterprise executive-comms teams, multilingual localisation, media + entertainment (actor-likeness IP licensing), and sports (archive monetisation).
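"Per-frame cosine verification" has a precise reading: embed every rendered frame with a face-identity model and score it against the subject's locked reference embedding. A minimal sketch, assuming a caller-supplied `embed_fn` (e.g. an ArcFace-style embedder) and an illustrative 0.92 threshold; neither name nor number is taken from the product itself:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two identity embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_frames(frames, reference_embedding, embed_fn, threshold=0.92):
    """Score every rendered frame against the locked identity embedding.

    Returns (all_pass, scores) so a pipeline can reject or re-render any
    frame that drifts off-identity.
    """
    scores = [cosine(embed_fn(frame), reference_embedding) for frame in frames]
    return all(s >= threshold for s in scores), scores
```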
PROOF STACK
Every claim tagged.
SOURCE FOOTAGE
16 minutes, single subject, consented
LABELLED TRAINING CLIPS FROM THAT 16 MIN
12,000+ (ClipCannon 23-stage DAG)
CAPTURED MICRO-EXPRESSIONS
196+ (each smile, laugh, inhale, pause — variant-aware, not repeated)
MODALITIES THE CONSTELLATION CONSTRAINS SIMULTANEOUSLY
Seven, including phonemes + visemes captured from every speech clip so new-utterance generation lip-syncs against the correct mouth shapes (see the sketch after this list)
RENDERING PIPELINE
Rewritten FFmpeg kernel for CUDA 13.2 / Blackwell sm_120a on RTX 5090, optimised for identity-locked talking-head output
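The phoneme-to-viseme constraint can be pictured with a toy mapping. A minimal sketch, assuming timed phonemes already extracted by forced alignment; the mapping table and class names are illustrative, not the product's actual inventory:

```python
# Illustrative phoneme -> viseme classes (toy subset, not the product's inventory).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "iy": "spread", "uw": "rounded",
}

def visemes_from_phonemes(timed_phonemes):
    """Convert (phoneme, start_s, end_s) tuples into viseme keyframes,
    so the renderer can be held to the correct mouth shape at every
    timestamp of a newly generated utterance."""
    return [
        (PHONEME_TO_VISEME.get(ph, "neutral"), start, end)
        for ph, start, end in timed_phonemes
    ]
```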
MARKETS UNLOCKED
What opens downstream.
$10–40B
Verification-grade synthetic media / anti-deepfake
$5–20B ARR
Enterprise brand-voice / executive-comms
$5–15B/yr top-50
Media / entertainment / sports archive monetisation
FREQUENTLY ASKED
Questions from every buyer.
Is it actually working today?
It's working. A 16-minute interview of a single subject has been turned into a model that regenerates the person with prosody, 196+ captured micro-expressions, lip-sync against captured phonemes and visemes, and full emotional-range variants. In one live use case the clone was loaded into Zoom and attended a class on the subject's behalf; the other participants never detected the swap.
Is it locked to one video model?
No. The approach is model-agnostic. EchoMimicV3 is the reference backbone we demo on, but the seven-modality constellation + LoRA + runtime guard can wrap any diffusion-transformer-style video generator you point it at.
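A minimal sketch of what "wraps any video generator" could mean at the code seam, assuming a generic backbone interface; `Backbone`, `generate`, and the guard hook are illustrative names, not the product's API:

```python
from typing import Any, Callable, Protocol

class Backbone(Protocol):
    """Any diffusion-transformer-style video generator behind one seam."""
    def generate(self, audio: Any, conditioning: dict) -> list: ...

def render_identity_locked(backbone: Backbone, audio: Any, constellation: dict,
                           lora_weights: Any, guard: Callable[[Any], bool]) -> list:
    """Condition an arbitrary backbone on the constellation + LoRA weights,
    then pass every generated frame through the runtime guard."""
    conditioning = {"constellation": constellation, "lora": lora_weights}
    frames = backbone.generate(audio, conditioning)
    return [f for f in frames if guard(f)]  # guard drops off-identity frames
```

Swapping backbones then means implementing `generate` for the new model; the constellation, LoRA, and guard stay untouched.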
What exactly does the constellation capture?
Prosody, pitch, acoustics. Laughter, breathing, inhales, exhales, lip smacks, tongue movement. 196+ micro-expressions, so each smile, laugh, and pause has its own variants instead of a single repeated expression. Beard geometry, which is typically a failure mode for talking-head avatars. Emotional range: high-tone, low-tone, and the behavioural variants inside each range.
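One way to picture the constellation as data; the field names below are illustrative groupings of the modalities listed above, not the product's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class IdentityConstellation:
    """Illustrative container for the captured modalities."""
    prosody: dict = field(default_factory=dict)          # pitch contours, acoustics
    nonverbal_audio: list = field(default_factory=list)  # laughter, breaths, lip smacks
    micro_expressions: dict = field(default_factory=dict)  # 196+ variant-aware entries
    visemes: list = field(default_factory=list)          # timed mouth shapes per phoneme
    beard_geometry: dict = field(default_factory=dict)   # facial-hair shape and texture
    emotional_range: dict = field(default_factory=dict)  # high/low-tone behavioural variants
```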
Can it run entirely inside our environment?
Yes. Reference footage, the computed constellation, and the LoRA weights never leave the customer environment. Suitable for enterprise brand-voice, media licensing, and sovereign programs.
What hardware does it need?
LoRA training and video rendering run end-to-end on a single RTX 5090 today, using a rewritten FFmpeg kernel targeting CUDA 13.2 / Blackwell sm_120a. Real-time streaming at photoreal grade is under active optimisation via a Rust compile of the rendering kernel; pre-Rust throughput is already fast enough for production video pipelines.
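The render hand-off can be pictured as a standard hardware-accelerated FFmpeg encode. A minimal sketch using stock FFmpeg/NVENC options; the flags are standard and the file names are placeholders, and the product's rewritten kernel (which this sketch does not model) replaces the encode stage:

```python
import subprocess

def encode_frames(frames_pattern: str, audio_path: str, out_path: str) -> None:
    """Mux rendered frames + generated audio into a delivery file,
    encoding H.264 on the GPU via NVENC."""
    subprocess.run([
        "ffmpeg", "-y",
        "-framerate", "25", "-i", frames_pattern,  # e.g. frames/%06d.png
        "-i", audio_path,
        "-c:v", "h264_nvenc", "-preset", "p5",     # GPU H.264 encode
        "-c:a", "aac",
        out_path,
    ], check=True)
```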
What about consent and misuse?
The subject provides informed written consent prior to cloning and retains deletion rights that do not sunset at publication. Release posture is protocol-level: the technical pipeline is disclosed while subject-specific artefacts are withheld, balancing reproducibility against identity-impersonation risk.