Calvin AI — Proposal B: Calvion (Custom Build)

Prepared for: Calvin
Prepared by: Thomas Cobley
Date: 2 July 2026
Option: B of 2 (see also Proposal A: Launch on Delphi.ai)


1. What Calvion is

Calvion is a bespoke, fully-owned AI version of you — same idea as Delphi, but built on infrastructure you control, using best-in-class components chosen per capability. It does the three things you asked for:

  1. Chat: ChatGPT-style text, answering in your voice from your transcripts.
  2. Voice: real-time calls in your cloned voice.
  3. Video talking-head: you on camera, delivering answers in real time. This is the key difference: Calvion can ship video now, whereas Delphi's video is currently unavailable.

Think of it as the model Hormozi actually uses — his ACQ AI is a custom, owned clone, not an off-the-shelf platform. Calvion is that, for you.


2. How it works (the stack)

  Your transcripts ──▶ Ingest & normalise ──▶ Chunk + embed ──▶ Vector DB (your cloud)
  (Fireflies/Zoom/…)                                                │
                                                                    ▼
  User question ───────────────────────────────────▶ Retrieve relevant context
                                                                    │
                                                                    ▼
                                        LLM (Claude) answers as "you", streaming
                                                       │
                        ┌──────────────────────────────┼──────────────────────────────┐
                        ▼                               ▼                               ▼
                    Text chat                     Voice (cloned)               Video talking-head
                  (web + embed)                 ElevenLabs/Cartesia            Tavus / HeyGen avatar
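
The retrieval path in the diagram (chunk, embed, store, retrieve, then prompt the LLM) can be sketched in a few lines. This is a toy illustration rather than the production design: `embed` is a word-count stand-in for a real embedding model, the in-memory list stands in for the vector database, and every function name here (`chunk`, `embed`, `retrieve`, `build_prompt`) is hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Split a transcript into fixed-size word windows (toy chunker)."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts. A real build would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context: list[str]) -> str:
    """Assemble the text that would be sent to the LLM."""
    joined = "\n---\n".join(context)
    return f"Answer in the speaker's voice using only this context:\n{joined}\n\nQuestion: {question}"


if __name__ == "__main__":
    # Made-up sample lines standing in for real transcripts.
    transcripts = [
        "Pricing should follow the value you create, so raise prices when results improve.",
        "Hiring is about attitude first; skills can be trained later.",
    ]
    index = [(c, embed(c)) for t in transcripts for c in chunk(t)]
    question = "How should I set my pricing?"
    print(build_prompt(question, retrieve(question, index, k=1)))
```

In the real stack the index would live in the Supabase (pgvector) database and the prompt would go to Claude; only the shape of the flow carries over from this sketch.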

Recommended components (all swappable — no lock-in):

Layer | Recommendation | Why
Video avatar | Tavus (real-time Conversational Video Interface) | Most conversation-native, 1080p real-time, bring-your-own LLM so our RAG stays ours. HeyGen LiveAvatar is the alternative (broader API, 720p live cap).
Voice clone | ElevenLabs (Pro clone) or Cartesia (lowest latency) | ElevenLabs = best quality; Cartesia Sonic = ~40ms if latency is critical.
Brain (LLM) | Claude (Sonnet 4.6) | Best quality/cost for RAG answering (~$0.02/answer); Haiku for cheaper volume.
Knowledge / RAG | Supabase (pgvector) | Transcripts + vectors + metadata in one DB you own; cheapest at this scale and the strongest data-ownership story for sensitive meeting data.
App & chat UI | Next.js + Vercel AI SDK on Vercel | Streaming chat, embeddable widget, fast to build and deploy.
Transcript ingestion | Fireflies (GraphQL API), Zoom (VTT), Granola (REST) | Automated pull; Otter is manual-export only.
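
Normalising transcripts means turning each source's export into one common shape. As an illustration for the Zoom (VTT) path, here is a minimal sketch that parses WebVTT cue blocks into speaker/text segments. It assumes `HH:MM:SS.mmm` timestamps and a "Speaker: text" layout inside each cue; the `Segment` type and `parse_vtt` name are hypothetical, and a real export may need extra handling.

```python
import re
from dataclasses import dataclass

# Matches a cue timing line such as "00:00:01.000 --> 00:00:04.500".
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+-->\s+(\d{2}):(\d{2}):(\d{2})\.(\d{3})"
)


@dataclass
class Segment:
    start: float  # seconds from the start of the recording
    end: float
    speaker: str
    text: str


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_vtt(vtt: str) -> list[Segment]:
    """Parse WebVTT text into segments, skipping the header and any non-cue blocks."""
    segments = []
    for block in re.split(r"\n\s*\n", vtt.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        timing_idx = next((i for i, ln in enumerate(lines) if TIMING.search(ln)), None)
        if timing_idx is None:
            continue  # "WEBVTT" header or a note block
        g = TIMING.search(lines[timing_idx]).groups()
        body = " ".join(lines[timing_idx + 1:])
        # Assumption: the speaker name precedes the first ": " in the cue text.
        speaker, sep, text = body.partition(": ")
        if not sep:
            speaker, text = "Unknown", body
        segments.append(Segment(_seconds(*g[:4]), _seconds(*g[4:]), speaker, text))
    return segments


if __name__ == "__main__":
    sample = "\n".join([
        "WEBVTT",
        "",
        "1",
        "00:00:01.000 --> 00:00:04.500",
        "Calvin: Start with the customer problem.",
    ])
    for seg in parse_vtt(sample):
        print(seg)
```

Fireflies and Granola would each need their own small adapter producing the same `Segment` shape, so the chunking and embedding steps never need to know where a transcript came from.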

3. What you get that Delphi can't give you

  • The video talking-head — live and yours. The #1 feature, working now.
  • Your data stays on your infrastructure. Transcripts never leave a database you own. Best answer for confidential meetings — no third-party platform terms, no SOC 2 gap, full export/portability.
  • Full control of persona, prompts, model choice, branding, UX, and every deployment surface (site, app, Slack, WhatsApp, internal tools).
  • Own unit economics. As usage scales, you optimise costs directly instead of paying platform markup.

4. What you take on vs Delphi

  • ~2–3 months to a solid v1 (vs weeks on Delphi).
  • Higher upfront cost (build fee) and ongoing maintenance (I handle this on retainer).
  • Multiple vendors to manage — I own the integration; you just see one product.
  • Real-time video is the dominant running cost — see §6. We design to keep it in check (text-first, video only when asked).
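
The text-first design in the last bullet can be sketched as a small routing rule: answer in text by default, and start a voice or video session only when the user explicitly asks. The trigger phrases and names below are hypothetical placeholders; a real build would more likely rely on an explicit button in the UI (the `requested` argument) than on phrase matching.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    TEXT = "text"
    VOICE = "voice"
    VIDEO = "video"


# Hypothetical phrases that count as an explicit request.
VIDEO_TRIGGERS = ("video call", "on camera", "show me on video")
VOICE_TRIGGERS = ("voice call", "call me", "talk to me")


def choose_mode(message: str, requested: Optional[Mode] = None) -> Mode:
    """Pick the cheapest mode that satisfies the user's explicit request."""
    if requested is not None:
        return requested  # an explicit UI choice always wins
    lowered = message.lower()
    if any(phrase in lowered for phrase in VIDEO_TRIGGERS):
        return Mode.VIDEO
    if any(phrase in lowered for phrase in VOICE_TRIGGERS):
        return Mode.VOICE
    return Mode.TEXT  # default: text-first


if __name__ == "__main__":
    print(choose_mode("What's your view on pricing?"))
    print(choose_mode("Can we do a video call?"))
```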

5. Build plan & timeline

Phase | Work | Duration
0. Discovery | Intake form, gather transcripts, sample content, define persona & use cases | ~1 week
1. Ingestion | Pull + normalise transcripts (Fireflies/Zoom/Granola), clean, structure | ~1–2 weeks
2. RAG + Chat | Embed corpus, retrieval, Claude answering as "you", streaming chat UI + embed widget | ~2 weeks
3. Voice | Clone voice (ElevenLabs Pro / Cartesia), wire real-time voice mode | ~1 week
4. Video | Create avatar (Tavus/HeyGen, with a footage session), real-time talking-head, turn-taking, latency tuning | ~2–3 weeks
5. Launch | Auth, deploy, embed, QA, handover + training | ~1–2 weeks
Total to v1 | | ~8–11 weeks
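
As a quick check on the total, the phase ranges in the table can be summed directly. This assumes the phases run strictly one after another with no overlap; the `PHASES` mapping is just an illustrative encoding of the table.

```python
# (min_weeks, max_weeks) per phase, as listed in the build plan.
PHASES = {
    "Discovery": (1, 1),
    "Ingestion": (1, 2),
    "RAG + Chat": (2, 2),
    "Voice": (1, 1),
    "Video": (2, 3),
    "Launch": (1, 2),
}


def total_weeks(phases: dict) -> tuple:
    """Sum the low and high ends of each phase's duration range."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


if __name__ == "__main__":
    print(total_weeks(PHASES))  # (8, 11)
```

A strictly sequential run therefore lands at 8 to 11 weeks.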

A chat + voice POC is achievable in ~2–3 weeks if you want to see it working before committing to the video phase.


6. Investment

One-time build fee (paid to me)

Scope | Recommended fixed price
Full v1 (chat + voice + video, embed, deploy) | £28,000–£42,000
Phase 1: chat + voice only (video deferred) | £15,000–£22,000
Add video layer later | £10,000–£18,000

Plus a one-time avatar/voice production session (footage + 30 min–3 hr of clean audio) and any per-replica training fees (~$40–65 on Tavus).

Monthly running costs (pass-through — paid to vendors)

Usage level | Estimate | Driver
Low (single clone, few hundred short chats/mo, modest video) | ~$130–370/mo | Avatar minutes
Moderate (thousands of chats, ~1,000–2,000 video min/mo) | ~$900–5,500/mo | Real-time video is 70–90% of this

Unit economics reality: real-time avatar streaming runs roughly $0.30–$3.00 per minute depending on vendor/tier. This is the number that dominates cost at scale, so Calvion is designed to be text-first, with the video talking-head only when the user explicitly asks for it, plus caching of common answers. We pin the contracted per-minute rate before launch.
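
To see why video dominates, the running cost can be modelled as video minutes times the per-minute rate, plus chat answers times the per-answer LLM cost. The sketch below uses the ~$0.02/answer figure and a per-minute rate inside the $0.30–$3.00 range quoted above. The usage numbers (3,000 chats, 1,500 video minutes, $1.00/min) are illustrative assumptions, not a quote, and the model deliberately ignores voice, hosting, and database costs.

```python
def monthly_cost(
    chats: int,
    video_minutes: int,
    video_rate: float,
    llm_cost_per_answer: float = 0.02,
) -> dict:
    """Rough monthly pass-through cost in USD: LLM answers plus avatar streaming only."""
    llm = chats * llm_cost_per_answer
    video = video_minutes * video_rate
    total = llm + video
    return {
        "llm": round(llm, 2),
        "video": round(video, 2),
        "total": round(total, 2),
    }


if __name__ == "__main__":
    # Hypothetical moderate month: 3,000 chat answers and 1,500 video minutes at $1.00/min.
    estimate = monthly_cost(chats=3000, video_minutes=1500, video_rate=1.00)
    print(estimate)
```

Under these assumptions the LLM line is tens of dollars while video is well over a thousand, which is why the per-minute avatar rate is the figure to pin down first.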

Ongoing management (paid to me — optional)

Item | Recommended
Maintenance, new transcripts, model/vendor upkeep, monitoring | £1,000–£2,500/mo

All-in to launch full v1: roughly £28k–£42k one-time + ~$130–370/mo to run at low usage (+ optional retainer). More expensive and slower than Delphi — you're paying for video-now, ownership, and control.


7. Delphi vs Calvion at a glance

Criterion | Delphi (Proposal A) | Calvion (Proposal B)
Chat | Yes | Yes
Real-time voice | Yes | Yes
Video talking-head | ⚠️ Unavailable now | Live now
Time to launch | Weeks | ~2–3 months
Upfront cost | Low (£2.5–4k + $299/mo) | High (£28–42k)
Data ownership | On Delphi's platform | Your infrastructure
Customisation / control | Limited | Full
Maintenance | Delphi handles it | Retainer
Monetisation | Immortal tier only | Build whatever you want
Best when | Speed & low effort | Video, ownership, control

8. Recommendation

If the video talking-head and data ownership matter, Calvion is the right build — it delivers both, and it's the model Hormozi's own AI actually follows. If speed and low cost win, start on Delphi (Proposal A).

Suggested path: run the intake form first (next step regardless of route), then decide. A strong middle option is to build Calvion's chat + voice POC (~2–3 weeks, ~£15–22k), prove it on your transcripts, and add the video layer once you've seen it working.


Stack, pricing, and effort estimates verified against vendor pricing/docs (Tavus, HeyGen, ElevenLabs, Cartesia, Supabase, Anthropic) as of early 2026. Real-time avatar per-minute rates change often and are partly sales-gated, so they will be confirmed before any fixed quote.
