Calvin AI — Proposal B: Calvion (Custom Build)
Prepared for: Calvin
Prepared by: Thomas Cobley
Date: 2 July 2026
Option: B of 2 (see also Proposal A: Launch on Delphi.ai)
1. What Calvion is
Calvion is a bespoke, fully-owned AI version of you — same idea as Delphi, but built on infrastructure you control, using best-in-class components chosen per capability. It does the three things you asked for:
- Chat — ChatGPT-style text, answering in your voice from your transcripts.
- Voice — real-time calls in your cloned voice.
- Video talking-head — you on camera, delivering answers in real time. This is the key difference: Calvion can ship video now, whereas Delphi's video is currently unavailable.
Think of it as the model Hormozi actually uses — his ACQ AI is a custom, owned clone, not an off-the-shelf platform. Calvion is that, for you.
2. How it works (the stack)
Your transcripts (Fireflies/Zoom/…) ──▶ Ingest & normalise ──▶ Chunk + embed ──▶ Vector DB (your cloud)
User question ──▶ Retrieve relevant context from the Vector DB ──▶ LLM (Claude) answers as "you", streaming
The streamed answer is then delivered through one of three surfaces:
- Text chat (web + embed)
- Voice (cloned): ElevenLabs/Cartesia
- Video talking-head: Tavus / HeyGen avatar
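To make the flow above concrete, here is a minimal sketch of the chunk, embed, and retrieve steps. It is illustrative only: the word-count `embed` function is a toy stand-in for a real embedding model, the in-memory `retrieve` stands in for a vector DB query, and the prompt wording and all function names are assumptions, not the final build.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split a transcript into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question (stand-in for a vector DB query)."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the LLM; the wording is illustrative only."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer in the speaker's voice using only this context:\n{joined}\n\nQuestion: {question}"
```

In the real stack the same three steps would run against the hosted vector database and a proper embedding model; the point here is only the shape of the pipeline, where the question is matched against stored chunks and the best matches are handed to the LLM as context.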
Recommended components (all swappable — no lock-in):
| Layer | Recommendation | Why |
|---|---|---|
| Video avatar | Tavus (real-time Conversational Video Interface) | Most conversation-native, 1080p real-time, bring-your-own LLM so our RAG stays ours. HeyGen LiveAvatar is the alternative (broader API, 720p live cap). |
| Voice clone | ElevenLabs (Pro clone) or Cartesia (lowest latency) | ElevenLabs = best quality; Cartesia Sonic = ~40ms if latency is critical. |
| Brain (LLM) | Claude (Sonnet 4.6) | Best quality/cost for RAG answering (~$0.02/answer); Haiku for cheaper volume. |
| Knowledge / RAG | Supabase (pgvector) | Transcripts + vectors + metadata in one DB you own — cheapest at this scale and the strongest data-ownership story for sensitive meeting data. |
| App & chat UI | Next.js + Vercel AI SDK on Vercel | Streaming chat, embeddable widget, fast to build and deploy. |
| Transcript ingestion | Fireflies (GraphQL API), Zoom (VTT), Granola (REST) | Automated pull; Otter is manual-export only. |
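As an illustration of the "pull + normalise" step for Zoom, the sketch below parses a VTT transcript into uniform segments. It assumes cues use `HH:MM:SS.mmm --> HH:MM:SS.mmm` timestamps and that speaker turns are written as `Name: text`; real exports can vary, and the `Segment` shape and function names are hypothetical.

```python
import re
from dataclasses import dataclass

# Matches a cue timing line such as "00:00:01.000 --> 00:00:04.000".
CUE_TIME = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3}) --> (\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)


@dataclass
class Segment:
    start: float  # seconds from the start of the recording
    end: float
    speaker: str
    text: str


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_vtt(vtt: str) -> list[Segment]:
    """Turn VTT cues into normalised segments, one per cue."""
    segments = []
    lines = vtt.splitlines()
    i = 0
    while i < len(lines):
        match = CUE_TIME.search(lines[i])
        if not match:
            i += 1  # skip the WEBVTT header, cue numbers, and blank lines
            continue
        parts = match.groups()
        start, end = _seconds(*parts[:4]), _seconds(*parts[4:])
        i += 1
        body = []
        while i < len(lines) and lines[i].strip():  # cue text runs until a blank line
            body.append(lines[i].strip())
            i += 1
        text = " ".join(body)
        # Assumption: a "Name: " prefix marks the speaker; otherwise label it unknown.
        speaker, sep, rest = text.partition(": ")
        if sep:
            segments.append(Segment(start, end, speaker, rest))
        else:
            segments.append(Segment(start, end, "unknown", text))
    return segments
```

Normalising every source into one segment shape like this is what lets Fireflies, Zoom, and Granola transcripts share a single chunking and embedding path.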
3. What you get that Delphi can't give you
- The video talking-head — live and yours. The #1 feature, working now.
- Your data stays on your infrastructure. Transcripts never leave a database you own. Best answer for confidential meetings — no third-party platform terms, no SOC 2 gap, full export/portability.
- Full control of persona, prompts, model choice, branding, UX, and every deployment surface (site, app, Slack, WhatsApp, internal tools).
- Own unit economics. As usage scales, you optimise costs directly instead of paying platform markup.
4. What you take on vs Delphi
- ~2–3 months to a solid v1 (vs weeks on Delphi).
- Higher upfront cost (build fee) and ongoing maintenance (I handle this on retainer).
- Multiple vendors to manage — I own the integration; you just see one product.
- Real-time video is the dominant running cost — see §6. We design to keep it in check (text-first, video only when asked).
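The "text-first, video only when asked" rule can be sketched as a small routing function. This is a hedged illustration: the monthly video-minute budget and the fallback to voice once it is used up are hypothetical guards added here, not a committed design.

```python
from enum import Enum
from typing import Optional


class Modality(Enum):
    TEXT = "text"
    VOICE = "voice"
    VIDEO = "video"


def choose_modality(
    requested: Optional[str],
    video_minutes_used: float,
    video_minutes_budget: float,
) -> Modality:
    """Default to text; upgrade only on an explicit request from the user."""
    if requested == "video":
        # Hypothetical guard: fall back to voice once the monthly video budget is spent.
        if video_minutes_used < video_minutes_budget:
            return Modality.VIDEO
        return Modality.VOICE
    if requested == "voice":
        return Modality.VOICE
    # No request, or an unrecognised one, stays on the cheapest surface.
    return Modality.TEXT
```

For example, `choose_modality(None, 0, 100)` stays on text, while `choose_modality("video", 100, 100)` falls back to voice because the assumed budget is exhausted.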
5. Build plan & timeline
| Phase | Work | Duration |
|---|---|---|
| 0. Discovery | Intake form, gather transcripts, sample content, define persona & use cases | ~1 week |
| 1. Ingestion | Pull + normalise transcripts (Fireflies/Zoom/Granola), clean, structure | ~1–2 weeks |
| 2. RAG + Chat | Embed corpus, retrieval, Claude answering as "you", streaming chat UI + embed widget | ~2 weeks |
| 3. Voice | Clone voice (ElevenLabs Pro / Cartesia), wire real-time voice mode | ~1 week |
| 4. Video | Create avatar (Tavus/HeyGen — footage session), real-time talking-head, turn-taking, latency tuning | ~2–3 weeks |
| 5. Launch | Auth, deploy, embed, QA, handover + training | ~1–2 weeks |
| Total to v1 | | ~8–11 weeks |
A chat + voice POC is achievable in ~2–3 weeks if you want to see it working before committing to the video phase.
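As a quick check on the timeline, the sketch below sums the per-phase ranges from the table, assuming the phases run strictly back to back. That gives 8 to 11 weeks; any overlap between phases would shorten it.

```python
# (low, high) duration in weeks for each phase, copied from the build plan table.
PHASES = {
    "0. Discovery": (1, 1),
    "1. Ingestion": (1, 2),
    "2. RAG + Chat": (2, 2),
    "3. Voice": (1, 1),
    "4. Video": (2, 3),
    "5. Launch": (1, 2),
}


def total_weeks(phases: dict) -> tuple:
    """Sum the low and high ends separately, assuming no overlap between phases."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


print(total_weeks(PHASES))  # (8, 11)
```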
6. Investment
One-time build fee (paid to me)
| Scope | Recommended fixed price |
|---|---|
| Full v1 (chat + voice + video, embed, deploy) | £28,000–£42,000 |
| Phase 1 — Chat + voice only (video deferred) | £15,000–£22,000 |
| Add video layer later | £10,000–£18,000 |
Plus a one-time avatar/voice production session (footage + 30 min–3 hr of clean audio) and any per-replica training fees (~$40–65 on Tavus).
Monthly running costs (pass-through — paid to vendors)
| Usage level | Estimate | Driver |
|---|---|---|
| Low (single clone, few hundred short chats/mo, modest video) | ~$130–370/mo | Avatar minutes |
| Moderate (thousands of chats, ~1,000–2,000 video min/mo) | ~$900–5,500/mo | Real-time video is 70–90% of this |
Unit economics reality: real-time avatar streaming runs roughly $0.30–$3.00 per minute depending on vendor/tier. This is the number that dominates cost at scale, so Calvion is designed to be text-first, with the video talking-head only when the user explicitly asks for it, plus caching of common answers. We pin the contracted per-minute rate before launch.
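To show why the per-minute video rate dominates, here is a rough cost sketch. The inputs in the usage note (1,500 video minutes at $0.60/min and 5,000 answers) are hypothetical; only the ~$0.02/answer LLM figure and the $0.30–$3.00 rate range come from this proposal. Voice, hosting, and database costs are left out, and amounts are in cents to keep the arithmetic exact.

```python
def monthly_cost_usd(
    video_minutes: int,
    video_rate_cents: int,
    answers: int,
    answer_cost_cents: int = 2,  # ~$0.02 per LLM answer, per the stack table
) -> dict:
    """Estimate monthly video and LLM spend in USD, plus video's share of the total."""
    video = video_minutes * video_rate_cents / 100
    llm = answers * answer_cost_cents / 100
    total = video + llm
    share = video / total if total else 0.0
    return {"video": video, "llm": llm, "total": total, "video_share": share}


estimate = monthly_cost_usd(video_minutes=1500, video_rate_cents=60, answers=5000)
print(estimate)  # video 900.0, llm 100.0, total 1000.0, video_share 0.9
```

With these assumed inputs video is 90% of the bill, which is why the contracted per-minute rate is the figure worth pinning before launch.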
Ongoing management (paid to me — optional)
| Item | Recommended |
|---|---|
| Maintenance, new transcripts, model/vendor upkeep, monitoring | £1,000–£2,500/mo |
All-in to launch full v1: roughly £28k–£42k one-time + ~$130–370/mo to run at low usage (+ optional retainer). More expensive and slower than Delphi — you're paying for video-now, ownership, and control.
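For budgeting, the figures above can be rolled into a first-year view. The sketch below is an assumption-laden illustration: it counts 12 months of running costs at the low-usage estimate, keeps pounds and dollars separate rather than guessing an exchange rate, and treats the retainer as optional.

```python
def first_year_totals(
    build_gbp: tuple,
    running_usd_month: tuple,
    retainer_gbp_month: tuple = (0, 0),
    months: int = 12,
) -> dict:
    """Return (low, high) first-year totals, kept separate per currency."""
    gbp = tuple(b + r * months for b, r in zip(build_gbp, retainer_gbp_month))
    usd = tuple(r * months for r in running_usd_month)
    return {"gbp": gbp, "usd": usd}


# Full v1 at low usage, without the retainer.
print(first_year_totals((28_000, 42_000), (130, 370)))
# {'gbp': (28000, 42000), 'usd': (1560, 4440)}

# Same, with the optional retainer of £1,000–£2,500/mo.
print(first_year_totals((28_000, 42_000), (130, 370), (1_000, 2_500)))
# {'gbp': (40000, 72000), 'usd': (1560, 4440)}
```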
7. Delphi vs Calvion at a glance
| | Delphi (Proposal A) | Calvion (Proposal B) |
|---|---|---|
| Chat | ✅ | ✅ |
| Real-time voice | ✅ | ✅ |
| Video talking-head | ⚠️ Unavailable now | ✅ Live now |
| Time to launch | Weeks | ~2–3 months |
| Upfront cost | Low (£2.5–4k + $299/mo) | High (£28–42k) |
| Data ownership | On Delphi's platform | Your infrastructure |
| Customisation / control | Limited | Full |
| Maintenance | Delphi handles it | Retainer |
| Monetisation | Immortal tier only | Build whatever you want |
| Best when | Speed & low effort | Video, ownership, control |
8. Recommendation
If the video talking-head and data ownership matter, Calvion is the right build — it delivers both, and it's the model Hormozi's own AI actually follows. If speed and low cost win, start on Delphi (Proposal A).
Suggested path: run the intake form first (next step regardless of route), then decide. A strong middle option is to build Calvion's chat + voice POC (~2–3 weeks, ~£15–22k), prove it on your transcripts, and add the video layer once you've seen it working.
Stack, pricing, and effort estimates were verified against vendor pricing/docs (Tavus, HeyGen, ElevenLabs, Cartesia, Supabase, Anthropic) as of early 2026. Real-time avatar per-minute rates change often and are partly sales-gated, so they will be confirmed before any fixed quote.
Ready to go with Calvion?
Choose this route and we'll take you straight to the intake — everything we need to start building your Calvion.