ai-podcast/docs/architecture.md
2026-02-07 00:36:17 -07:00

Luke at the Roost — Architecture

System Overview

┌─────────────────────────────────────────────────────────────────────────┐
│                        BROWSER (Control Panel)                          │
│                                                                         │
│  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ ┌───────────────┐  │
│  │ Caller   │ │  Chat    │ │  Music/  │ │Settings│ │  Server Log   │  │
│  │ Buttons  │ │  Window  │ │  Ads/SFX │ │ Modal  │ │  (live tail)  │  │
│  │ (0-9)    │ │          │ │          │ │        │ │               │  │
│  └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬────┘ └───────┬───────┘  │
│       │            │            │            │              │           │
│  ┌────┴────────────┴────────────┴────────────┴──────────────┴───────┐  │
│  │                    frontend/js/app.js                             │  │
│  │  Polling: queue (3s), logs (1s). Chat updates: real-time         │  │
│  │  Push-to-talk: record/stop → transcribe → chat → TTS → play     │  │
│  └──────────────────────────┬───────────────────────────────────────┘  │
└─────────────────────────────┼───────────────────────────────────────────┘
                              │ REST API + WebSocket
                              ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                     FastAPI Backend (main.py)                            │
│                     uvicorn :8000                                        │
└─────────────────────────────────────────────────────────────────────────┘

Caller Generation Pipeline

Session Reset / First Access to Caller Slot
    │
    ▼
_randomize_callers()
    │  Assigns unique names (from 24M/24F pool) and voices (5M/5F) to 10 slots
    │
    ▼
generate_caller_background(base)
    │
    ├─ Demographics: age (from range), job (gendered pool), location
    │                                                        │
    │                              ┌─────────────────────────┘
    │                              ▼
    │                     pick_location()
    │                     80% LOCATIONS_LOCAL (weighted: Animas, Lordsburg)
    │                     20% LOCATIONS_OUT_OF_STATE
    │                              │
    │                              ▼
    │                     _get_town_from_location()
    │                     └─ TOWN_KNOWLEDGE[town]
    │                        32 towns with real facts
    │                        "Only reference real places..."
    │
    ├─ 70% → PROBLEMS (100+ templates)
    │        Fill {affair_person}, {fantasy_subject}, etc. from PROBLEM_FILLS
    │
    ├─ 30% → TOPIC_CALLIN (61 entries)
    │        Prestige TV, science, poker, photography, physics, US news
    │
    ├─ 2x random INTERESTS (86 entries: TV shows, science, tech, poker, etc.)
    │
    └─ 2x random QUIRKS (conversational style traits)
    │
    ▼
Result: "43, works IT for the city in Lordsburg. Just finished Severance
        season 2... Follows JWST discoveries... Deflects with humor...
        ABOUT WHERE THEY LIVE (Lordsburg): Small town on I-10, about 2,500
        people... Only reference real places..."
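
The two weighted choices in the pipeline above (80/20 for location, 70/30 for problem versus topic) can be sketched as follows. This is a minimal illustration, not the code in main.py: the pool contents, the `LOCAL_WEIGHTS` values, and the `pick_reason` helper are hypothetical stand-ins, and only the percentages and the pool names come from the diagram.

```python
import random

# Hypothetical stand-ins for the real pools; the actual entries differ.
LOCATIONS_LOCAL = ["in Animas", "in Lordsburg", "in Deming"]
LOCAL_WEIGHTS = [3, 3, 1]  # assumed weighting toward Animas and Lordsburg
LOCATIONS_OUT_OF_STATE = ["in Tucson, Arizona", "in El Paso, Texas"]
PROBLEMS = ["caught their {affair_person} texting an ex"]
PROBLEM_FILLS = {"affair_person": ["spouse", "business partner"]}
TOPIC_CALLIN = ["wants to talk about the latest JWST images"]


def pick_location(rng):
    """80% local (weighted), 20% out of state."""
    if rng.random() < 0.8:
        return rng.choices(LOCATIONS_LOCAL, weights=LOCAL_WEIGHTS, k=1)[0]
    return rng.choice(LOCATIONS_OUT_OF_STATE)


def pick_reason(rng):
    """70% a filled-in problem template, 30% a topic call-in."""
    if rng.random() < 0.7:
        template = rng.choice(PROBLEMS)
        # Choose one value per placeholder; unused keys are ignored by format().
        fills = {key: rng.choice(options) for key, options in PROBLEM_FILLS.items()}
        return template.format(**fills)
    return rng.choice(TOPIC_CALLIN)


rng = random.Random(7)  # seeded only so the sketch is repeatable
print(pick_location(rng), "|", pick_reason(rng))
```

Passing the random generator in explicitly keeps the sketch testable; the real functions may simply use the module-level `random`.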

News Enrichment (at pickup time)

POST /api/call/{key}
    │
    ▼
enrich_caller_background(background)     ← 5s timeout, fails silently
    │
    ├─ _extract_search_query(background)
    │   ├─ Check _TOPIC_SEARCH_MAP (50+ keyword→query mappings)
    │   │   "severance" → "Severance TV show"
    │   │   "quantum"   → "quantum physics research"
    │   │   "poker"     → "poker tournament"
    │   │
    │   └─ Fallback: extract keywords from problem sentence
    │
    ▼
SearXNG (localhost:8888)
    │  /search?q=...&format=json&categories=news
    │
    ▼
LLM summarizes headline+snippet → natural one-liner
    │  "Recently read about how Severance ties up the Lumon mystery"
    │
    ▼
Appended to background: "..., and it's been on their mind."
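
The keyword lookup and the "5s timeout, fails silently" behavior above could look roughly like this. It is a sketch under stated assumptions: the three map entries are the ones shown in the diagram, while `_STOPWORDS`, the first-sentence fallback heuristic, and the `fetch_news` callable (standing in for the SearXNG search plus LLM summary) are hypothetical.

```python
import asyncio

# Three of the mappings shown above; the real _TOPIC_SEARCH_MAP has 50+.
_TOPIC_SEARCH_MAP = {
    "severance": "Severance TV show",
    "quantum": "quantum physics research",
    "poker": "poker tournament",
}
# Assumed stopword list for the fallback path.
_STOPWORDS = {"the", "a", "an", "and", "their", "about", "with", "for", "just", "that"}


def extract_search_query(background, max_words=4):
    """Return a mapped query if a known keyword appears, else a keyword fallback."""
    text = background.lower()
    for keyword, query in _TOPIC_SEARCH_MAP.items():
        if keyword in text:
            return query
    # Fallback (assumed heuristic): first few non-stopword words of the first sentence.
    first_sentence = text.split(".")[0]
    words = [w.strip(",;:!?") for w in first_sentence.split()]
    keywords = [w for w in words if w and w not in _STOPWORDS]
    return " ".join(keywords[:max_words])


async def enrich_with_timeout(background, fetch_news, timeout=5.0):
    """Return background plus a news line, or background unchanged on any failure."""
    try:
        query = extract_search_query(background)
        line = await asyncio.wait_for(fetch_news(query), timeout)
    except Exception:
        # Timeout, network error, bad JSON: the caller simply gets no news context.
        return background
    if not line:
        return background
    return f"{background} {line}, and it's been on their mind."
```

Catching every exception is deliberate here: enrichment is optional, so a slow or broken search should never delay picking up the call.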

AI Caller Conversation Flow

    Host speaks (push-to-talk or type)
        │
        ▼
POST /api/record/start → record from input device
POST /api/record/stop  → transcribe (Whisper @ 16kHz)
        │
        ▼
POST /api/chat { text }
        │
        ├─ session.add_message("user", text)
        │
        ├─ Build system prompt: get_caller_prompt()
        │   ├─ Caller identity + background + town knowledge
        │   ├─ Show history (summaries of previous callers)
        │   ├─ Conversation summary (last 6 messages)
        │   └─ HOW TO TALK rules (varied length, no rehashing, etc.)
        │
        ├─ Last 10 messages → _normalize_messages_for_llm()
        │
        ▼
LLMService.generate(messages, system_prompt)
        │
        ├─ OpenRouter: primary model (15s timeout)
        ├─ Fallback 1: gemini-flash-1.5 (10s)
        ├─ Fallback 2: gpt-4o-mini (10s)
        ├─ Fallback 3: llama-3.1-8b (10s)
        └─ Last resort: "Sorry, I totally blanked out..."
        │
        ▼
clean_for_tts()              → strip (actions), *gestures*, fix phonetics
ensure_complete_thought()    → trim to last complete sentence
        │
        ▼
Response returned to frontend
        │
        ▼
POST /api/tts { text, voice_id }
        │
        ▼
generate_speech(text, voice_id)
        │
        ├─ Inworld (default cloud)     ─┐
        ├─ ElevenLabs (cloud)           │
        ├─ F5-TTS (local, cloned)       ├─→ PCM audio bytes (24kHz)
        ├─ Kokoro MLX (local, fast)     │
        ├─ ChatTTS / StyleTTS2 / etc.  ─┘
        │
        ▼
AudioService.play_caller_audio(bytes, 24000)
        │
        └─→ Output Device Channel 1 (caller TTS)
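
The fallback chain inside LLMService.generate() can be sketched as a loop over (model, timeout) pairs. This is an illustration of the pattern rather than the service's actual code: `call_model` is a hypothetical coroutine standing in for the OpenRouter request, and `"primary-model"` is a placeholder because the primary model name is not given above.

```python
import asyncio

LAST_RESORT = "Sorry, I totally blanked out..."

# Timeouts follow the diagram; "primary-model" is a placeholder name.
DEFAULT_CHAIN = [
    ("primary-model", 15),
    ("gemini-flash-1.5", 10),
    ("gpt-4o-mini", 10),
    ("llama-3.1-8b", 10),
]


async def generate_with_fallback(call_model, messages, system_prompt, chain=DEFAULT_CHAIN):
    """Try each (model, timeout) in order; return the first non-empty reply."""
    for model, timeout in chain:
        try:
            reply = await asyncio.wait_for(
                call_model(model, messages, system_prompt), timeout
            )
        except Exception:
            continue  # timeout or provider error: move on to the next model
        if reply and reply.strip():
            return reply
    # Every model failed, so return a canned in-character line instead of raising.
    return LAST_RESORT
```

Returning a canned line instead of raising means the caller always says something, which matters more on a live show than surfacing the error.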

Real Caller (Phone) Flow

Caller dials 208-439-LUKE
        │
        ▼
SignalWire routes to webhook
        │
        ▼
POST /api/signalwire/voice
        │
        ├─ If OFF AIR → play message + hangup
        │
        └─ If ON AIR → return LaML (SignalWire's XML markup):
           <Stream url="wss://.../api/signalwire/stream" codec="L16@16000h">
        │
        ▼
WebSocket /api/signalwire/stream connects
        │
        ├─ "start" event → add to queue, play ring SFX
        │                   broadcast_event("caller_queued")
        │
        │   [Caller waits in queue until host takes them]
        │
        ├─ Host clicks "Take Call" in UI
        │   POST /api/queue/take/{caller_id}
        │   └─ CallerService.take_call() → allocate channel
        │   └─ Start host mic streaming → _host_audio_sender()
        │
        ├─ "media" events (continuous) ← caller's voice
        │   │
        │   ├─ route_real_caller_audio(pcm) → Ch 9 (host monitoring)
        │   │
        │   └─ Buffer 3s chunks → transcribe (Whisper)
        │       │
        │       └─ broadcast_chat() → appears in chat window
        │
        │   Host mic audio → _host_audio_sync_callback()
        │   │
        │   └─ _host_audio_sender() → CallerService.send_audio_to_caller()
        │       └─ base64 encode → WebSocket → SignalWire → caller's phone
        │
        │   If AI caller also active (auto-respond mode):
        │   │
        │   └─ _debounced_auto_respond() (4s silence)
        │       └─ LLM → TTS → play on Ch 1 + stream to real caller
        │
        ├─ Host hangs up
        │   POST /api/hangup/real
        │   └─ _signalwire_end_call(call_sid) → end phone call
        │   └─ _summarize_real_call() → LLM summary → call_history
        │   └─ Optional: _auto_followup() → pick AI caller to continue
        │
        └─ "stop" event or disconnect → cleanup
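
The 4-second silence window used by _debounced_auto_respond() is a standard debounce: each new transcribed chunk cancels the pending timer and starts a fresh one. The class below is a hypothetical sketch of that idea with asyncio; the name `DebouncedResponder` and the `respond` callback are assumptions, not the names used in main.py.

```python
import asyncio


class DebouncedResponder:
    """Run `respond(text)` only after `delay` seconds with no new speech."""

    def __init__(self, respond, delay=4.0):
        self._respond = respond
        self._delay = delay
        self._task = None

    def on_caller_speech(self, text):
        """Call on every transcribed chunk; restarts the silence timer."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
        loop = asyncio.get_running_loop()
        self._task = loop.create_task(self._wait_then_respond(text))

    async def _wait_then_respond(self, text):
        await asyncio.sleep(self._delay)  # cancelled here if speech resumes
        await self._respond(text)
```

Note that in this sketch a new chunk arriving while `respond` is already running cancels that response too. Whether the real implementation does the same is not stated above.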

Audio Routing (Multi-Channel Output)

All audio goes to one physical output device (Loopback or an audio interface).
Each content type plays on its own channel so it can be mixed separately in a DAW or OBS.

┌─────────────────────────────────────────────────────────────┐
│                   Output Device (e.g. Loopback 16ch)        │
│                                                             │
│   Ch 1  ◄── Caller TTS (AI voices)          play_caller_audio()
│   Ch 2  ◄── Music (loops)                   play_music()
│   Ch 3  ◄── Sound Effects (one-shots)       play_sfx()
│   Ch 9  ◄── Live Caller Audio (monitoring)  route_real_caller_audio()
│   Ch 11 ◄── Ads (one-shots, no loop)        play_ad()
│                                                             │
│   All channels configurable via Settings panel              │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                   Input Device (mic/interface)               │
│                                                             │
│   Ch N  ──► Host mic recording (push-to-talk)               │
│         ──► Host mic streaming (to real callers via WS)     │
└─────────────────────────────────────────────────────────────┘
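
Putting one mono stream on one channel of a multi-channel device comes down to interleaving: each output frame holds one sample per channel, and only the target channel's position is filled. The sketch below shows this with the standard library only. The real audio.py uses sounddevice and may handle channel mapping differently, and `place_on_channel` is a hypothetical helper name.

```python
from array import array


def place_on_channel(mono, channel, total_channels):
    """Interleave mono int16 samples into a multi-channel frame buffer.

    `channel` is 1-based, as in the diagram above; all other channels stay silent.
    """
    if not 1 <= channel <= total_channels:
        raise ValueError(f"channel {channel} outside 1..{total_channels}")
    # One zeroed sample per channel for every input sample.
    out = array("h", [0]) * (len(mono) * total_channels)
    for i, sample in enumerate(mono):
        out[i * total_channels + (channel - 1)] = sample
    return out


# Two samples of caller TTS placed on Ch 1 of a 4-channel device.
frames = place_on_channel([1000, -1000], channel=1, total_channels=4)
print(frames.tolist())
```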

External Services

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  SignalWire   │     │  OpenRouter   │     │   SearXNG    │
│              │     │              │     │  (local)     │
│  Phone calls │     │  LLM API     │     │  News search │
│  REST + WS   │     │  Claude,GPT  │     │  :8888       │
│  Bidirectional│     │  Gemini,Llama│     │              │
│  audio stream│     │  Fallback    │     │              │
└──────────────┘     └──────────────┘     └──────────────┘

┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Inworld    │     │  ElevenLabs  │     │  Local TTS   │
│              │     │              │     │              │
│  TTS (cloud) │     │  TTS (cloud) │     │  Kokoro MLX  │
│  Default     │     │  Premium     │     │  F5-TTS      │
│  provider    │     │              │     │  ChatTTS     │
│              │     │              │     │  + others    │
└──────────────┘     └──────────────┘     └──────────────┘

┌──────────────┐
│  Castopod    │
│              │
│  Podcast     │
│  publishing  │
│  (NAS)       │
└──────────────┘

Session Lifecycle

New Session (reset)
    │
    ├─ Randomize all 10 caller names + voices
    ├─ Clear conversation, call history, research
    ├─ New session ID
    │
    ▼
Show goes ON AIR (toggle)
    │
    ├─ SignalWire starts accepting calls
    │
    ▼
Caller interactions (loop)
    │
    ├─ Pick AI caller (click button 0-9)
    │   ├─ Generate background (if first time this session)
    │   ├─ Enrich with news (SearXNG → LLM summary)
    │   ├─ Conversation loop (chat/respond/auto-respond)
    │   └─ Hangup → summarize → add to call_history
    │
    ├─ Take real caller from queue
    │   ├─ Route audio both directions
    │   ├─ Transcribe caller speech in real-time
    │   ├─ Optional: AI caller auto-responds to real caller
    │   └─ Hangup → summarize → add to call_history
    │
    ├─ Play music / ads / SFX between calls
    │
    └─ Each new caller sees show_history (summaries of all previous calls)
        "EARLIER IN THE SHOW: Tony talked about... Carmen discussed..."
    │
    ▼
Show goes OFF AIR
    │
    └─ Incoming calls get off-air message + hangup
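
The show_history text that each new caller sees could be assembled from call_history along these lines. This is a sketch only: the record fields `name` and `summary` and the one-line-per-caller layout are assumptions, and only the "EARLIER IN THE SHOW:" header comes from the lifecycle above.

```python
def build_show_history(call_history):
    """Format earlier call summaries for the next caller's system prompt."""
    if not call_history:
        return ""  # first caller of the session: nothing to reference
    lines = [f"- {call['name']}: {call['summary']}" for call in call_history]
    return "EARLIER IN THE SHOW:\n" + "\n".join(lines)


history = [
    {"name": "Tony", "summary": "talked about a dispute with his landlord"},
    {"name": "Carmen", "summary": "discussed a poker tournament"},
]
print(build_show_history(history))
```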

Key Design Patterns

| Pattern | Where | Why |
|---|---|---|
| Epoch-based staleness | _session_epoch in main.py | Prevents stale LLM/TTS responses from playing after hangup |
| Fallback chain | LLMService | Guarantees a response even if the primary model times out |
| Debounced auto-respond | _debounced_auto_respond() | Waits 4s for the real caller to stop talking before the AI jumps in |
| Silent failure | News enrichment | If the search or LLM call fails, the caller simply has no news context |
| Threading for audio | play_caller_audio() | Audio playback must not block the async event loop |
| Ring buffer | route_real_caller_audio() | Absorbs jitter in the real caller's audio stream |
| Lock contention guard | _ai_response_lock | Only one AI response generates at a time |
| Town knowledge injection | TOWN_KNOWLEDGE dict | Prevents the LLM from inventing fake local businesses |
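
Epoch-based staleness works by recording a counter before slow work starts and comparing it afterwards: if a hangup bumped the counter in the meantime, the result is dropped. The sketch below shows the idea in isolation. The `Session` class and the `respond` function are hypothetical, and the real code keeps the counter in `_session_epoch`.

```python
import asyncio


class Session:
    def __init__(self):
        self.epoch = 0

    def hangup(self):
        self.epoch += 1  # invalidates every response started before this point


async def respond(session, generate, play):
    """Generate a reply and play it only if no hangup happened meanwhile."""
    started = session.epoch
    reply = await generate()  # slow LLM/TTS work happens here
    if session.epoch != started:
        return False  # caller hung up while we were generating; drop the reply
    play(reply)
    return True
```

Compared with cancelling in-flight tasks, this check is cheap and cannot miss a response that has already finished generating but has not yet played.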

File Map

ai-podcast/
├── backend/
│   ├── main.py              ← FastAPI app, all endpoints, caller generation, session
│   ├── config.py            ← Settings (env vars, paths)
│   └── services/
│       ├── audio.py         ← Multi-channel audio I/O (sounddevice)
│       ├── caller_service.py← Phone queue, WebSocket registry, audio routing
│       ├── llm.py           ← OpenRouter/Ollama with fallback chain
│       ├── news.py          ← SearXNG search + caching
│       ├── tts.py           ← 8 TTS providers (cloud + local)
│       └── transcription.py ← Whisper speech-to-text
├── frontend/
│   ├── index.html           ← Control panel layout
│   ├── js/app.js            ← UI logic, polling, event handlers
│   └── css/style.css        ← Dark theme styling
├── sounds/                  ← SFX files (ring, hangup, busy, etc.)
├── music/                   ← Background music tracks
├── ads/                     ← Ad audio files
├── website/                 ← Landing page (lukeattheroost.com)
├── publish_episode.py       ← Castopod episode publisher
└── run.sh                   ← Server launcher with restart support