ai-podcast/docs/plans/2026-02-05-real-callers-design.md

Real Callers + AI Follow-Up Design

Overview

Add real phone callers to the AI Radio Show via Twilio, alongside existing AI callers. Real callers dial a phone number, wait in a hold queue, and get taken on air by the host. Three-way conversations between host, real caller, and AI caller are supported. AI follow-up callers automatically reference what real callers said.

Requirements

  • Real callers connect via Twilio phone number
  • Full-duplex audio: host and caller can talk simultaneously and talk over each other
  • Each real caller gets their own dedicated audio channel for recording
  • Three-way calls: host + real caller + AI caller all live at once
  • AI caller can respond manually (host-triggered) or automatically (listens and decides when to jump in)
  • AI follow-up callers reference real caller conversations via show history
  • Auto follow-up mode: system picks an AI caller and connects them after a real call
  • Simple hold queue — callers wait with hold music, host sees list and picks who goes on air
  • Twilio webhooks exposed via Cloudflare tunnel

Architecture

Audio Routing (Loopback Channels)

Ch 1:  Host mic (existing)
Ch 2:  AI callers / TTS (existing)
Ch 3+: Real callers (dynamically assigned per call)
Ch N-1: Music (existing)
Ch N:   SFX (existing)
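
The dynamic "Ch 3+" assignment above could be handled by a small pool that hands out channel numbers per call. This is a minimal sketch, not the project's implementation: the ChannelPool name, the first/last range, and keying by Twilio call SID are all assumptions made for illustration.

```python
class ChannelPool:
    """Hands out Loopback channel numbers for real callers (Ch 3+)."""

    def __init__(self, first: int = 3, last: int = 6) -> None:
        # Assumption: channels first..last are reserved for real callers,
        # with music and SFX on the channels after `last`.
        self._free = list(range(first, last + 1))
        self._in_use: dict[str, int] = {}

    def allocate(self, call_sid: str) -> int:
        """Return the channel for this call, assigning the lowest free one."""
        if call_sid in self._in_use:
            return self._in_use[call_sid]
        if not self._free:
            raise RuntimeError("no free Loopback channel for real caller")
        channel = self._free.pop(0)
        self._in_use[call_sid] = channel
        return channel

    def release(self, call_sid: str) -> None:
        """Return the call's channel to the pool; unknown SIDs are ignored."""
        channel = self._in_use.pop(call_sid, None)
        if channel is not None:
            self._free.append(channel)
            self._free.sort()
```

Keeping the free list sorted means a released channel is reused before higher-numbered ones, which keeps recordings on the lowest channels when only one real caller is on air at a time.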

Call Flow — Real Caller

Caller dials Twilio number
  → Twilio POST /api/twilio/voice
  → TwiML response: greeting + enqueue with hold music
  → Caller waits in hold queue
  → Host sees caller in dashboard queue panel
  → Host clicks "Take Call"
  → POST /api/queue/take/{call_sid}
  → Twilio opens WebSocket to /api/twilio/stream
  → Bidirectional audio:
      Caller audio → decode mulaw → dedicated Loopback channel
      Host audio + AI TTS → encode mulaw → Twilio → caller hears both
  → Real-time Whisper transcription of caller audio
  → Host hangs up → call summarized → stored in show history
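
The first and last Twilio-facing steps of this flow are TwiML documents. The sketch below builds them with the standard library rather than the twilio package so it stays self-contained. Say, Enqueue (with waitUrl), Connect, and Stream are standard TwiML verbs; the function names, the queue name, and the greeting text are assumptions, and the design does not say whether Connect/Stream is how the media stream is actually started.

```python
import xml.etree.ElementTree as ET

_XML_DECL = '<?xml version="1.0" encoding="UTF-8"?>'


def build_incoming_call_twiml(queue_name: str, hold_music_url: str, greeting: str) -> str:
    """TwiML for POST /api/twilio/voice: greet the caller, then enqueue them."""
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say")
    say.text = greeting
    # waitUrl points Twilio at the endpoint that serves hold music TwiML.
    enqueue = ET.SubElement(response, "Enqueue", {"waitUrl": hold_music_url})
    enqueue.text = queue_name
    return _XML_DECL + ET.tostring(response, encoding="unicode")


def build_stream_twiml(websocket_url: str) -> str:
    """TwiML that connects a call to a bidirectional Media Streams WebSocket."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    ET.SubElement(connect, "Stream", {"url": websocket_url})
    return _XML_DECL + ET.tostring(response, encoding="unicode")
```

For example, build_incoming_call_twiml("on_air_queue", "/api/twilio/hold-music", "Thanks for calling.") would be the body returned to Twilio when a caller dials in, with "on_air_queue" standing in for whatever queue name the show uses.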

Three-Way Call Flow

Host mic ──────→ Ch 1 (recording)
               → Twilio outbound (real caller hears you)
               → Whisper transcription (AI gets your words)

Real caller ──→ Ch 3+ (recording, dedicated channel)
               → Whisper transcription (AI gets their words)
               → Host headphones

AI TTS ───────→ Ch 2 (recording)
               → Twilio outbound (real caller hears AI)
               → Host headphones (already works)

Conversation history becomes three-party, with each message labeled by role: host, real_caller:{name}, or ai_caller:{name}.
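
Because the real caller hears both the host and the AI, those two sources have to be summed into one outbound stream before encoding. A minimal sketch of that mix, assuming both inputs are already mono 16-bit PCM at the same sample rate (resampling and gain control are left out, and the function name is hypothetical):

```python
from array import array


def mix_pcm16(host: bytes, ai_tts: bytes) -> bytes:
    """Sum two mono 16-bit PCM buffers sample by sample, clipping on overflow.

    Buffers use the machine's native byte order; the shorter one is
    treated as padded with silence.
    """
    a = array("h")
    a.frombytes(host)
    b = array("h")
    b.frombytes(ai_tts)
    length = max(len(a), len(b))
    mixed = array("h", bytes(2 * length))  # `length` samples of silence
    for i in range(length):
        total = (a[i] if i < len(a) else 0) + (b[i] if i < len(b) else 0)
        # Clip to the int16 range instead of letting the sum wrap around.
        mixed[i] = max(-32768, min(32767, total))
    return mixed.tobytes()
```

Hard clipping is the simplest choice here; if the host and AI often speak at once, attenuating each source before summing may sound better.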

AI Auto-Respond Mode

When toggled on, after each real caller transcription chunk:

  1. Lightweight LLM call ("should I respond?" — use fast model like Haiku)
  2. If YES → full response generated → TTS → plays on AI channel + streams to Twilio
  3. Cooldown (~10s) prevents rapid-fire
  4. Host can override with mute button
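
The four steps above amount to a gate in front of the full response pipeline. The sketch below is one way to structure it, not the project's code: AutoRespondGate is a hypothetical name, and the "should I respond?" LLM call is injected as a plain callable so the cooldown and mute logic can be exercised without a model.

```python
import time
from typing import Callable


class AutoRespondGate:
    """Decides whether the AI caller may jump in after a transcription chunk."""

    def __init__(
        self,
        should_respond: Callable[[str], bool],  # stand-in for the fast LLM check
        cooldown_s: float = 10.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._should_respond = should_respond
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last_response_at: float | None = None
        self.muted = False  # host override

    def check(self, transcript_chunk: str) -> bool:
        """Return True if a full response should be generated now."""
        if self.muted:
            return False
        now = self._clock()
        in_cooldown = (
            self._last_response_at is not None
            and now - self._last_response_at < self._cooldown_s
        )
        if in_cooldown:
            return False
        # Only pay for the LLM check once mute and cooldown have passed.
        if not self._should_respond(transcript_chunk):
            return False
        self._last_response_at = now
        return True
```

Checking mute and cooldown before calling the model avoids spending an LLM call on chunks that could never trigger a response.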

AI Follow-Up System

After a real caller hangs up:

  1. Full transcript (host + real caller + any AI) summarized by LLM
  2. Summary stored in session.call_history
  3. Next AI caller's system prompt includes show history:
    EARLIER IN THE SHOW:
    - Dave (real caller) called about his wife leaving after 12 years.
      He got emotional about his kids.
    - Jasmine called about her boss hitting on her at work.
    You can reference these if it feels natural. Don't force it.
    

Host-triggered (default): Click any AI caller as normal. They already have show context.

Auto mode: After the real caller hangs up, the system waits ~5-10s, picks a fitting AI caller via a short LLM call, biases that caller's background generation toward the topic, and auto-connects them.

Backend Changes

New Module: backend/services/twilio_service.py

Manages Twilio integration:

  • WebSocket handler for Media Streams (decode/encode mulaw 8kHz ↔ PCM)
  • Call queue state (waiting callers, SIDs, timestamps, assigned channels)
  • Channel pool management (allocate/release Loopback channels for real callers)
  • Outbound audio mixing (host + AI TTS → mulaw → Twilio)
  • Methods: take_call(), hangup_real_caller(), get_queue(), send_audio_to_caller()
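
Twilio Media Streams carries audio as base64-encoded 8 kHz mu-law inside JSON "media" messages, so the WebSocket handler needs a mu-law codec on both directions. The sketch below is a pure-Python G.711 mu-law codec plus message helpers; the function names are assumptions, and a production handler would likely want a faster table-driven or native implementation.

```python
import base64
import json
import struct

_BIAS = 0x84   # G.711 mu-law bias
_CLIP = 32635  # largest magnitude that can be encoded


def ulaw_to_linear(byte: int) -> int:
    """Decode one mu-law byte to a signed 16-bit PCM sample."""
    byte = ~byte & 0xFF
    magnitude = (((byte & 0x0F) << 3) + _BIAS) << ((byte & 0x70) >> 4)
    return _BIAS - magnitude if byte & 0x80 else magnitude - _BIAS


def linear_to_ulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample to a mu-law byte."""
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), _CLIP) + _BIAS
    exponent, mask = 7, 0x4000
    while exponent > 0 and not magnitude & mask:
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def decode_media_message(raw: str) -> bytes:
    """Turn an inbound Media Streams message into little-endian PCM16 bytes.

    Non-media events (connected, start, stop, ...) yield empty bytes.
    """
    message = json.loads(raw)
    if message.get("event") != "media":
        return b""
    ulaw = base64.b64decode(message["media"]["payload"])
    return struct.pack(f"<{len(ulaw)}h", *(ulaw_to_linear(b) for b in ulaw))


def encode_media_message(stream_sid: str, pcm16: bytes) -> str:
    """Wrap little-endian PCM16 bytes as an outbound Media Streams message."""
    samples = struct.unpack(f"<{len(pcm16) // 2}h", pcm16)
    ulaw = bytes(linear_to_ulaw(s) for s in samples)
    payload = base64.b64encode(ulaw).decode("ascii")
    return json.dumps(
        {"event": "media", "streamSid": stream_sid, "media": {"payload": payload}}
    )
```

The decoded PCM is still at 8 kHz; resampling to the Loopback device's rate is a separate step not shown here.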

New Endpoints

# Twilio webhooks
POST /api/twilio/voice            # Incoming call → TwiML (greet + enqueue)
POST /api/twilio/hold-music       # Hold music TwiML for waiting callers
WS   /api/twilio/stream           # Media Streams WebSocket (bidirectional audio)

# Host controls
GET  /api/queue                    # List waiting callers (number, wait time)
POST /api/queue/take/{call_sid}    # Dequeue caller → start media stream
POST /api/queue/drop/{call_sid}    # Drop caller from queue

# AI follow-up
POST /api/followup/generate        # Summarize last real call, trigger AI follow-up
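
The three host-control endpoints map onto a small in-memory queue. The sketch below shows only that state, independent of any web framework; CallQueue, QueuedCaller, and the shape of the listing dicts are assumptions for illustration rather than the planned API schema.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class QueuedCaller:
    call_sid: str
    phone: str
    enqueued_at: float


class CallQueue:
    """In-memory hold queue backing the /api/queue endpoints."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._waiting: dict[str, QueuedCaller] = {}  # insertion order = arrival order

    def add(self, call_sid: str, phone: str) -> None:
        """Called from the voice webhook when a caller is enqueued."""
        self._waiting[call_sid] = QueuedCaller(call_sid, phone, self._clock())

    def list_waiting(self) -> list[dict]:
        """GET /api/queue: waiting callers, oldest first, with wait time."""
        now = self._clock()
        return [
            {"call_sid": c.call_sid, "phone": c.phone,
             "wait_seconds": int(now - c.enqueued_at)}
            for c in self._waiting.values()
        ]

    def take(self, call_sid: str) -> QueuedCaller:
        """POST /api/queue/take/{call_sid}: remove and return the caller.

        Raises KeyError if the caller already left the queue.
        """
        return self._waiting.pop(call_sid)

    def drop(self, call_sid: str) -> bool:
        """POST /api/queue/drop/{call_sid}: True if a caller was removed."""
        return self._waiting.pop(call_sid, None) is not None
```

Because callers can hang up while the host is deciding, take() raising KeyError gives the endpoint a clear case to turn into a 404.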

Session Model Changes

class CallRecord:
    caller_type: str          # "ai" or "real"
    caller_name: str          # "Tony" or "Caller #3"
    summary: str              # LLM-generated summary after hangup
    transcript: list[dict]    # Full conversation [{role, content}]

class Session:
    # Existing fields...
    call_history: list[CallRecord]     # All calls this episode
    active_real_caller: dict | None    # {call_sid, phone, channel, name}
    active_ai_caller: str | None       # Caller key
    ai_respond_mode: str               # "manual" or "auto"
    auto_followup: bool                # Auto-generate AI follow-up after real calls

Three-party conversation history uses roles: host, real_caller:{name}, ai_caller:{name}.

AI Caller Prompt Changes

get_caller_prompt() extended to include:

  • Show history from session.call_history
  • Current real caller context (if three-way call active)
  • Instructions for referencing real callers naturally
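
The show-history portion of the prompt could be assembled as sketched below. The CallRecord here only mirrors the summary-related fields from Session Model Changes, build_show_history_block is a hypothetical helper name, and the sketch assumes each summary is phrased as a predicate ("called about ...") so that it reads naturally after the caller's name.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    caller_type: str  # "ai" or "real"
    caller_name: str
    summary: str      # assumed to read like "called about ..."


def build_show_history_block(call_history: list[CallRecord]) -> str:
    """Render earlier calls as the EARLIER IN THE SHOW prompt section."""
    if not call_history:
        return ""  # first call of the episode: nothing to reference
    lines = ["EARLIER IN THE SHOW:"]
    for record in call_history:
        name = record.caller_name
        if record.caller_type == "real":
            name += " (real caller)"
        lines.append(f"- {name} {record.summary}")
    lines.append("You can reference these if it feels natural. Don't force it.")
    return "\n".join(lines)
```

Returning an empty string for an empty history lets get_caller_prompt() append the block unconditionally.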

Frontend Changes

New: Call Queue Panel

Between callers section and chat. Shows waiting real callers with phone number and wait time. "Take Call" and "Drop" buttons per caller. Polls /api/queue every few seconds.

Modified: Active Call Indicator

Shows real caller and AI caller simultaneously when both active:

  • Real caller: name, channel number, call duration, hang up button
  • AI caller: name, Manual/Auto toggle, "Let [name] respond" button (manual mode)
  • Auto Follow-Up checkbox

Modified: Chat Log

Three-party with visual distinction:

  • Host messages: existing style
  • Real caller: labeled "Dave (caller)", distinct color
  • AI caller: labeled "Tony (AI)", distinct color
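
The labels above can be derived directly from the role strings used in the conversation history. A small sketch of that mapping, shown in Python for consistency with the backend even though the real rendering would live in the frontend; display_label, the "Host" label, and the fallback names are assumptions.

```python
def display_label(role: str) -> str:
    """Map a history role like "real_caller:Dave" to a chat log label."""
    kind, _, name = role.partition(":")
    if kind == "host":
        return "Host"
    if kind == "real_caller":
        return f"{name or 'Caller'} (caller)"
    if kind == "ai_caller":
        return f"{name or 'AI caller'} (AI)"
    raise ValueError(f"unknown role: {role!r}")
```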

Modified: Caller Grid

When real caller is active, clicking an AI caller adds them as third party instead of starting fresh call. Indicator shows which AI callers have been on the show this session.

Dependencies

  • twilio Python package (for TwiML generation, REST API)
  • Twilio account with phone number (~$1.15/mo + per-minute)
  • Cloudflare tunnel for exposing webhook endpoints
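  • audioop or equivalent for mulaw encode/decode (audioop is in the stdlib through Python 3.12; it was deprecated in 3.11 and removed in 3.13, so newer interpreters need a replacement)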

Configuration

New env vars in .env:

TWILIO_ACCOUNT_SID=...
TWILIO_AUTH_TOKEN=...
TWILIO_PHONE_NUMBER=+1...
TWILIO_WEBHOOK_BASE_URL=https://your-tunnel.cloudflare.com
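
Failing fast on missing settings keeps a misconfigured tunnel or token from surfacing mid-show. A minimal loader sketch for the four variables above; TwilioConfig, load_twilio_config, and the stream_url helper are hypothetical names, and the helper assumes the base URL uses https.

```python
import os
from dataclasses import dataclass
from typing import Mapping

_REQUIRED = (
    "TWILIO_ACCOUNT_SID",
    "TWILIO_AUTH_TOKEN",
    "TWILIO_PHONE_NUMBER",
    "TWILIO_WEBHOOK_BASE_URL",
)


@dataclass(frozen=True)
class TwilioConfig:
    account_sid: str
    auth_token: str
    phone_number: str
    webhook_base_url: str

    def stream_url(self) -> str:
        """WebSocket URL for the Media Streams endpoint (assumes https base)."""
        host = self.webhook_base_url.removeprefix("https://")
        return f"wss://{host}/api/twilio/stream"


def load_twilio_config(env: Mapping[str, str] = os.environ) -> TwilioConfig:
    """Read Twilio settings, raising if any required variable is unset or empty."""
    missing = [key for key in _REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError("missing env vars: " + ", ".join(missing))
    return TwilioConfig(
        account_sid=env["TWILIO_ACCOUNT_SID"],
        auth_token=env["TWILIO_AUTH_TOKEN"],
        phone_number=env["TWILIO_PHONE_NUMBER"],
        # Strip a trailing slash so endpoint paths can be appended directly.
        webhook_base_url=env["TWILIO_WEBHOOK_BASE_URL"].rstrip("/"),
    )
```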