Chris Royse's paper is the reason the conversation is serious. This page is the reason the conversation is commercial. Every number below carries a measurement date, a scope disclosure, and a reproducibility manifest — the discipline a C-suite buyer needs to sign, and the discipline a frontier lab needs to defend the budget. The Shakespeare run is the public anchor: 5.4 MB of plain text, 120 MB of labeled training corpus, 85 minutes on a single GPU, zero human labelers, zero synthetic tokens.
Commercial read: beyond the published frontier on a consumer workstation — and the pattern ports into any frontier-lab TTS surface. The measured number is what lets a lab CTO defend the line item; the reproducibility manifest is what lets the board approve it.
Commercial read: production-grade pipelines, not research prototypes. The constellation guard running against every generation is the audit chain the regulator will ask for — shipped as a first-class property, not a bolt-on consultancy.
Commercial read: the code is the proof. No lab building this internally ships faster than 24–36 months at $200–400M — and that assumes they have a spec, which they do not. Either buy the substrate or license it; ignoring it does not reset the clock.
What a diligence reader wants to see on this page: what the system actually measures, at what scale, with what reproducibility, and whether it runs without a cluster. The table below is that answer — one public Shakespeare corpus, one consumer GPU, every number independently verifiable.
The diligence question this table answers: does the substrate recover a pattern a domain expert would recognise, without being told what to look for. The six most similar work-pairs by E1-centroid cosine similarity are listed below. Every work in them is an English-king history play — separated from the comedies, tragedies, and sonnets with zero supervision. No labels. No taxonomy hints. Pure geometry.
| Work A | Work B | Cosine |
|---|---|---|
| 3 Henry VI | 2 Henry VI | 0.9885 |
| 1 Henry IV | Henry V | 0.9867 |
| 1 Henry IV | 2 Henry IV | 0.9863 |
| Richard II | 2 Henry VI | 0.9857 |
| 2 Henry IV | 2 Henry VI | 0.9852 |
| Richard II | 2 Henry IV | 0.9842 |
That's what's possible when you stop labeling and start measuring.
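The measurement behind the table can be sketched in a few lines, assuming mean-pooled chunk embeddings per work. The work names and synthetic vectors below are illustrative stand-ins, not the published E1 data:

```python
import itertools
import numpy as np

def centroid(vectors):
    """Mean-pool a work's chunk embeddings into one centroid vector."""
    return np.mean(vectors, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical corpus: each work is a stack of chunk embeddings (rows).
# A shared offset stands in for shared style; the offsets are synthetic.
rng = np.random.default_rng(0)
works = {
    "2 Henry VI": rng.normal(size=(50, 768)) + 1.0,
    "3 Henry VI": rng.normal(size=(50, 768)) + 1.0,
    "Sonnets":    rng.normal(size=(50, 768)) - 1.0,
}
cents = {name: centroid(v) for name, v in works.items()}

# Rank all work-pairs by centroid cosine, most similar first.
pairs = sorted(
    ((a, b, cosine(cents[a], cents[b]))
     for a, b in itertools.combinations(cents, 2)),
    key=lambda t: -t[2],
)
for a, b, c in pairs:
    print(f"{a} | {b} | {c:.4f}")
```

With real embeddings the same ranking loop surfaces the history-play cluster; nothing in it knows genre labels exist.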
The diligence reader wants a sanity check: does the multi-embedder retrieval path return the obvious answer on queries where the answer is not in dispute. Eight canonical quotations, four retrieval strategies, listed below with the rank returned.
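The check itself is mechanical: for each strategy, find the 1-based rank of the undisputed answer in the returned list. The strategy names and document ids below are hypothetical placeholders, not the actual eight-quotation harness:

```python
def hit_rank(ranked_ids, expected_id):
    """Return the 1-based rank of the expected document, or None if missed."""
    try:
        return ranked_ids.index(expected_id) + 1
    except ValueError:
        return None

# Hypothetical results for one canonical quotation:
# strategy name -> ordered list of retrieved work ids.
results = {
    "dense":  ["hamlet", "macbeth", "othello"],
    "sparse": ["macbeth", "hamlet"],
    "hybrid": ["hamlet"],
    "rerank": ["lear", "othello", "hamlet"],
}
expected = "hamlet"
for strategy, ranked in results.items():
    print(strategy, hit_rank(ranked, expected))
```

A rank of 1 across all strategies is the pass condition; anything else is a flag for the diligence call.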
The diligence reader wants to see prompt injection fail on video, not in prose. Six demos follow, each reproducible on the listed hardware. All run against Gemma 4 (8B, gemma4:e4b) with constellation guard scoring against all 44 Shakespeare constellations in real time. Model: Apache 2.0 licensed. Pipeline: RTX 5090 + CUDA 13.2. Every verdict shown on-screen.
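A reduced sketch of what a per-generation guard verdict could look like: score the generation's embedding against every constellation centroid and flag it when the best match is not the expected constellation by a margin. The function name, margin rule, and toy two-dimensional vectors are illustrative assumptions, not the shipped guard:

```python
import numpy as np

def guard_verdict(gen_vec, constellations, expected, margin=0.05):
    """Score one generation against all constellation centroids.

    Passes only when the expected constellation wins by at least
    `margin` over the runner-up (all names here are hypothetical).
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    scores = {name: cos(gen_vec, c) for name, c in constellations.items()}
    best = max(scores, key=scores.get)
    runner_up = sorted(scores.values())[-2]
    ok = best == expected and scores[best] - runner_up >= margin
    return ok, best, scores

# Toy setup: two constellations on orthogonal axes.
constellations = {
    "sonnets":   np.array([1.0, 0.0]),
    "histories": np.array([0.0, 1.0]),
}
on_topic = np.array([0.9, 0.1])   # stays in the expected constellation
injected = np.array([0.1, 0.9])   # drifts toward another constellation
print(guard_verdict(on_topic, constellations, "sonnets"))
print(guard_verdict(injected, constellations, "sonnets"))
```

The on-screen verdict in the demos corresponds to the boolean here, computed against all 44 constellations rather than two.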
The diligence reader wants the roster: which frozen models, which dimensions, what coverage of the representation space. The list below is exhaustive — one row per embedder, with the public model name and dimensionality, so the selection can be audited against the published frontier.
The diligence reader wants build-vs-buy math. The two boxes below carry the relevant inputs: languages, crate and file counts, MCP tool surface, RocksDB column families, analysis DAG depth, and licence. The architecture is the proof — and the reason the twenty-four-month internal rebuild clock does not reset by ignoring it.
“Derived Data Abundance”
The diligence reader wants the timing argument, not the roadmap. Seven independent inputs that make the substrate-buy decision urgent this cycle — each with a citable source a board member can verify without a call with us.
The diligence reader wants the addressable-pool math in one view: which markets Teleox.ai enters, their published size trajectory, and the role the substrate plays in each. The row ordering maps to where the evidence tier is most commercially legible today.
| Market | Size | Teleox's Role |
|---|---|---|
| AI Training Data | $3.2B → $16.3B by 2033 | 100×+ better labeled data |
| Data Labeling | $4.89B → $17.1B by 2030 | Automated vs. manual |
| AI Safety & Alignment | $2.24B → $18.15B by 2030 | Deterministic outputs + prompt-injection elimination |
| Agentic AI | $7.8B → $52B by 2030 | Security guardrails for autonomous agents |
| Enterprise AI Deployments | $600B+ addressable by 2030 | Unlock with deterministic LoRA three-layer enforcement |
“Because we spent 2 years shipping working code instead of raising money. You're one of the first to see this.”