Calvin AI — Proposal B: Calvion (Custom Build)

Prepared for: Calvin
Prepared by: Thomas Cobley
Date: 2 July 2026
Option: B of 2 (see also Proposal A: Launch on Delphi.ai)

1. What Calvion is

Calvion is a bespoke, fully-owned AI version of you — same idea as Delphi, but built on infrastructure you control, using best-in-class components chosen per capability. It does the three things you asked for:

Chat — ChatGPT-style text, answering in your voice from your transcripts.
Voice — real-time calls in your cloned voice.
Video talking-head — you on camera, delivering answers in real time. This is the key difference: Calvion can ship video now, whereas Delphi's video is currently unavailable.

Think of it as the model Hormozi actually uses — his ACQ AI is a custom, owned clone, not an off-the-shelf platform. Calvion is that, for you.

2. How it works (the stack)

  Your transcripts ──▶ Ingest & normalise ──▶ Chunk + embed ──▶ Vector DB (your cloud)
  (Fireflies/Zoom/…)                                                │
                                                                    ▼
  User question ───────────────────────────────────▶ Retrieve relevant context
                                                                    │
                                                                    ▼
                                        LLM (Claude) answers as "you", streaming
                                                       │
                        ┌──────────────────────────────┼──────────────────────────────┐
                        ▼                               ▼                               ▼
                    Text chat                     Voice (cloned)               Video talking-head
                  (web + embed)                 ElevenLabs/Cartesia            Tavus / HeyGen avatar
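
The retrieval half of the diagram above can be sketched in a few lines. This is a minimal illustration, not the production build: the `embed` function is a toy word-count stand-in for a real embedding model, the in-memory `INDEX` list stands in for the vector DB, and all function names and sample transcripts are hypothetical.

```python
import math
import re
from collections import Counter

def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split a transcript into fixed-size word windows (no overlap, for brevity)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real build would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text handed to the LLM: retrieved excerpts first, then the question."""
    joined = "\n---\n".join(context)
    return (
        "Answer in the speaker's voice, using only the excerpts below.\n"
        f"Excerpts:\n{joined}\n\n"
        f"Question: {question}"
    )

# Hypothetical sample transcripts, one short chunk each.
TRANSCRIPTS = [
    "Pricing should follow the value you deliver, so price your offer on outcomes.",
    "Hiring is slow work; interview for attitude first.",
    "Cold outreach works when the message is short.",
]
INDEX = [(c, embed(c)) for t in TRANSCRIPTS for c in chunk(t)]
```

Calling `retrieve("How should I price my offer?", INDEX, k=1)` picks the pricing excerpt, and `build_prompt` wraps it with the question. Only that assembled prompt, not the whole corpus, is what the LLM step would receive.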

Recommended components (all swappable — no lock-in):

Layer	Recommendation	Why
Video avatar	Tavus (real-time Conversational Video Interface)	Most conversation-native, 1080p real-time, bring-your-own LLM so our RAG stays ours. HeyGen LiveAvatar is the alternative (broader API, 720p live cap).
Voice clone	ElevenLabs (Pro clone) or Cartesia (lowest latency)	ElevenLabs = best quality; Cartesia Sonic = lowest latency (~40 ms), for when responsiveness is critical.
Brain (LLM)	Claude (Sonnet 4.6)	Best quality/cost for RAG answering (~$0.02/answer); Haiku for cheaper volume.
Knowledge / RAG	Supabase (pgvector)	Transcripts + vectors + metadata in one DB you own — cheapest at this scale and the strongest data-ownership story for sensitive meeting data.
App & chat UI	Next.js + Vercel AI SDK on Vercel	Streaming chat, embeddable widget, fast to build and deploy.
Transcript ingestion	Fireflies (GraphQL API), Zoom (VTT), Granola (REST)	Automated pull; Otter is manual-export only.
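
As an illustration of the "ingest & normalise" step for Zoom, the sketch below turns a WebVTT transcript into structured utterances. It assumes each cue's text carries a `Speaker: text` prefix; that prefix convention, the `Utterance` type, and the sample transcript are assumptions for illustration and should be checked against real exports during discovery.

```python
import re
from dataclasses import dataclass

# WebVTT timestamps are hh:mm:ss.ttt, with the hours part optional.
TIMESTAMP = r"(?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
CUE_TIMING = re.compile(rf"({TIMESTAMP})\s+-->\s+({TIMESTAMP})")

@dataclass
class Utterance:
    start: str
    end: str
    speaker: str
    text: str

def parse_vtt(vtt: str) -> list[Utterance]:
    """Parse WebVTT cues into utterances, skipping the header and any non-cue blocks."""
    utterances = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", vtt.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        timing_idx = next((i for i, ln in enumerate(lines) if CUE_TIMING.search(ln)), None)
        if timing_idx is None:
            continue  # "WEBVTT" header or a note block
        match = CUE_TIMING.search(lines[timing_idx])
        payload = " ".join(lines[timing_idx + 1:])
        # Assumed convention: "Speaker Name: what they said".
        speaker, sep, text = payload.partition(": ")
        if not sep:
            speaker, text = "unknown", payload
        utterances.append(Utterance(match.group(1), match.group(2), speaker, text))
    return utterances

# Hypothetical sample in the assumed shape.
SAMPLE_VTT = """WEBVTT

1
00:00:01.000 --> 00:00:04.000
Calvin: Price on outcomes, not hours.

2
00:00:04.500 --> 00:00:06.000
Guest: That makes sense.
"""

UTTERANCES = parse_vtt(SAMPLE_VTT)
```

Keeping speaker and timing as separate fields lets later stages filter to your own utterances before chunking and embedding.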

3. What you get that Delphi can't give you

The video talking-head — live and yours. The #1 feature, working now.
Your data stays on your infrastructure. The transcript corpus lives in a database you own, and only the excerpts retrieved for a given question are sent to the LLM. Best answer for confidential meetings: no third-party platform terms, no SOC 2 gap, full export/portability.
Full control of persona, prompts, model choice, branding, UX, and every deployment surface (site, app, Slack, WhatsApp, internal tools).
Own unit economics. As usage scales, you optimise costs directly instead of paying platform markup.

4. What you take on vs Delphi

~2–3 months to a solid v1 (vs weeks on Delphi).
Higher upfront cost (build fee) and ongoing maintenance (I handle this on retainer).
Multiple vendors to manage — I own the integration; you just see one product.
Real-time video is the dominant running cost — see §6. We design to keep it in check (text-first, video only when asked).
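
The "text-first, video only when asked" rule can be expressed as a small routing function. This is a sketch under stated assumptions: the `Mode` names, the `choose_mode` signature, and the monthly video-minute budget with fallback to voice are hypothetical additions for illustration, not a committed design.

```python
from enum import Enum
from typing import Optional

class Mode(Enum):
    TEXT = "text"
    VOICE = "voice"
    VIDEO = "video"

def choose_mode(requested: Optional[str],
                video_minutes_used: float,
                video_budget_minutes: float) -> Mode:
    """Pick the response mode for a session.

    Text is the default. Voice or video is used only on explicit request.
    Assumption: once the monthly video budget is spent, video requests
    fall back to voice instead of failing.
    """
    if requested == "video":
        if video_minutes_used < video_budget_minutes:
            return Mode.VIDEO
        return Mode.VOICE
    if requested == "voice":
        return Mode.VOICE
    return Mode.TEXT
```

For example, `choose_mode(None, 0, 100)` yields `Mode.TEXT`, while `choose_mode("video", 100, 100)` yields `Mode.VOICE` because the assumed budget is exhausted.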

5. Build plan & timeline

Phase	Work	Duration
0. Discovery	Intake form, gather transcripts, sample content, define persona & use cases	~1 week
1. Ingestion	Pull + normalise transcripts (Fireflies/Zoom/Granola), clean, structure	~1–2 weeks
2. RAG + Chat	Embed corpus, retrieval, Claude answering as "you", streaming chat UI + embed widget	~2 weeks
3. Voice	Clone voice (ElevenLabs Pro / Cartesia), wire real-time voice mode	~1 week
4. Video	Create avatar (Tavus/HeyGen — footage session), real-time talking-head, turn-taking, latency tuning	~2–3 weeks
5. Launch	Auth, deploy, embed, QA, handover + training	~1–2 weeks
	Total to v1	~8–11 weeks
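
As a quick check on the total, the phase ranges can be summed directly. The sketch assumes the phases run strictly back to back with no overlap; running any phases in parallel would shorten the lower bound.

```python
# (min_weeks, max_weeks) per phase, taken from the build plan table.
PHASES = {
    "0. Discovery": (1, 1),
    "1. Ingestion": (1, 2),
    "2. RAG + Chat": (2, 2),
    "3. Voice": (1, 1),
    "4. Video": (2, 3),
    "5. Launch": (1, 2),
}

def total_weeks(phases: dict) -> tuple:
    """Sum the low and high ends of each phase, assuming sequential delivery."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high

TOTAL = total_weeks(PHASES)  # (8, 11)
```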

A chat + voice POC is achievable in ~2–3 weeks if you want to see it working before committing to the video phase.

6. Investment

One-time build fee (paid to me)

Scope	Recommended fixed price
Full v1 (chat + voice + video, embed, deploy)	£28,000–£42,000
Chat + voice only (video deferred)	£15,000–£22,000
Add video layer later	£10,000–£18,000

Plus a one-time avatar/voice production session (footage + 30 min–3 hr of clean audio) and any per-replica training fees (~$40–65 on Tavus).

Monthly running costs (pass-through — paid to vendors)

Usage level	Estimate	Driver
Low (single clone, few hundred short chats/mo, modest video)	~$130–370/mo	Avatar minutes
Moderate (thousands of chats, ~1,000–2,000 video min/mo)	~$900–5,500/mo	Real-time video is 70–90% of this

Unit economics reality: real-time avatar streaming runs roughly $0.30–$3.00 per minute depending on vendor/tier. This is the number that dominates cost at scale, so Calvion is designed to be text-first, with the video talking-head only when the user explicitly asks for it, plus caching of common answers. We pin the contracted per-minute rate before launch.
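
To make the per-minute figure concrete, the sketch below computes raw monthly bounds for video alone from the quoted $0.30–$3.00 range. The function name is hypothetical, rates are held in cents to avoid floating-point drift, and the bounds pair the cheapest rate with the fewest minutes and the dearest rate with the most minutes.

```python
# $0.30–$3.00 per streamed minute, expressed in cents.
VIDEO_RATE_CENTS_PER_MIN = (30, 300)

def video_cost_bounds(minutes_low: int, minutes_high: int) -> tuple:
    """Return (lowest, highest) monthly video cost in dollars for a range of minutes."""
    low_rate, high_rate = VIDEO_RATE_CENTS_PER_MIN
    return minutes_low * low_rate / 100, minutes_high * high_rate / 100

# Moderate tier from the running-costs table: ~1,000–2,000 video minutes per month.
MODERATE_VIDEO = video_cost_bounds(1_000, 2_000)  # (300.0, 6000.0)
```

The moderate-tier estimate of ~$900–5,500/mo sits inside these raw bounds, which presumably reflects a narrower contracted rate than the extremes. That spread is why the per-minute rate is pinned before launch and why cutting video minutes is the main cost lever.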

Ongoing management (paid to me — optional)

Item	Recommended
Maintenance, new transcripts, model/vendor upkeep, monitoring	£1,000–£2,500/mo

All-in to launch full v1: roughly £28k–£42k one-time + ~$130–370/mo to run at low usage (+ optional retainer). More expensive and slower than Delphi — you're paying for video-now, ownership, and control.

7. Delphi vs Calvion at a glance

	Delphi (Proposal A)	Calvion (Proposal B)
Chat	✅	✅
Real-time voice	✅	✅
Video talking-head	⚠️ Unavailable now	✅ Live now
Time to launch	Weeks	~2–3 months
Upfront cost	Low (£2.5–4k + $299/mo)	High (£28–42k)
Data ownership	On Delphi's platform	Your infrastructure
Customisation / control	Limited	Full
Maintenance	Delphi handles it	Retainer
Monetisation	Immortal tier only	Build whatever you want
Best when	Speed & low effort	Video, ownership, control

8. Recommendation

If the video talking-head and data ownership matter, Calvion is the right build — it delivers both, and it's the model Hormozi's own AI actually follows. If speed and low cost win, start on Delphi (Proposal A).

Suggested path: run the intake form first (next step regardless of route), then decide. A strong middle option is to build Calvion's chat + voice POC (~2–3 weeks, ~£15–22k), prove it on your transcripts, and add the video layer once you've seen it working.

Stack, pricing, and effort estimates were verified against vendor pricing/docs (Tavus, HeyGen, ElevenLabs, Cartesia, Supabase, Anthropic) as of early 2026. Real-time avatar per-minute rates change often and are partly sales-gated, so I will confirm them before issuing any fixed quote.

Ready to go with Calvion?

Choose this route and we'll take you straight to the intake — everything we need to start building your Calvion.
