# Luke at the Roost — Architecture
## System Overview
```
┌─────────────────────────────────────────────────────────────────────────┐
│ BROWSER (Control Panel) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ ┌───────────────┐ │
│ │ Caller │ │ Chat │ │ Music/ │ │Settings│ │ Server Log │ │
│ │ Buttons │ │ Window │ │ Ads/SFX │ │ Modal │ │ (live tail) │ │
│ │ (0-9) │ │ │ │ │ │ │ │ │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬────┘ └───────┬───────┘ │
│ │ │ │ │ │ │
│ ┌────┴────────────┴────────────┴────────────┴──────────────┴───────┐ │
│ │ frontend/js/app.js │ │
│ │ Polling: queue (3s), chat updates (real-time), logs (1s) │ │
│ │ Push-to-talk: record/stop → transcribe → chat → TTS → play │ │
│ └──────────────────────────┬───────────────────────────────────────┘ │
└─────────────────────────────┼───────────────────────────────────────────┘
│ REST API + WebSocket
┌─────────────────────────────────────────────────────────────────────────┐
│ FastAPI Backend (main.py) │
│ uvicorn :8000 │
└─────────────────────────────────────────────────────────────────────────┘
```
---
## Caller Generation Pipeline
```
Session Reset / First Access to Caller Slot
_randomize_callers()
│ Assigns unique names (from 24M/24F pool) and voices (Inworld: 14M/11F, ElevenLabs: 14M/8F) to 10 slots
generate_caller_background(base)
├─ Demographics: age (from range), job (gendered pool), location
│ │
│ ┌─────────────────────────┘
│ ▼
│ pick_location()
│ 80% LOCATIONS_LOCAL (weighted: Animas, Lordsburg)
│ 20% LOCATIONS_OUT_OF_STATE
│ │
│ ▼
│ _get_town_from_location()
│ └─ TOWN_KNOWLEDGE[town]
│ 32 towns with real facts
│ "Only reference real places..."
├─ 70% → PROBLEMS (100+ templates)
│ Fill {affair_person}, {fantasy_subject}, etc. from PROBLEM_FILLS
├─ 30% → TOPIC_CALLIN (61 entries)
│ Prestige TV, science, poker, photography, physics, US news
├─ 2x random INTERESTS (86 entries: TV shows, science, tech, poker, etc.)
└─ 2x random QUIRKS (conversational style traits)
Result: "43, works IT for the city in Lordsburg. Just finished Severance
season 2... Follows JWST discoveries... Deflects with humor...
ABOUT WHERE THEY LIVE (Lordsburg): Small town on I-10, about 2,500
people... Only reference real places..."
```
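The 80/20 location split above can be sketched as follows. The pools here are illustrative stand-ins for the real `LOCATIONS_LOCAL` / `LOCATIONS_OUT_OF_STATE` lists in `main.py`, and the per-town weighting toward Animas and Lordsburg is omitted for brevity:

```python
import random

# Illustrative stand-ins; the real pools live in backend/main.py.
LOCATIONS_LOCAL = ["Animas", "Lordsburg", "Silver City"]
LOCATIONS_OUT_OF_STATE = ["Tucson, AZ", "El Paso, TX"]

def pick_location(rng: random.Random = random) -> str:
    """80% chance of a local town, 20% out of state, as in the pipeline above."""
    if rng.random() < 0.8:
        return rng.choice(LOCATIONS_LOCAL)
    return rng.choice(LOCATIONS_OUT_OF_STATE)
```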
### News Enrichment (at pickup time)
```
POST /api/call/{key}
enrich_caller_background(background) ← 5s timeout, fails silently
├─ _extract_search_query(background)
│ ├─ Check _TOPIC_SEARCH_MAP (50+ keyword→query mappings)
│ │ "severance" → "Severance TV show"
│ │ "quantum" → "quantum physics research"
│ │ "poker" → "poker tournament"
│ │
│ └─ Fallback: extract keywords from problem sentence
SearXNG (localhost:8888)
│ /search?q=...&format=json&categories=news
LLM summarizes headline+snippet → natural one-liner
│ "Recently read about how Severance ties up the Lumon mystery"
Appended to background: "..., and it's been on their mind."
```
---
## AI Caller Conversation Flow
```
Host speaks (push-to-talk or type)
POST /api/record/start → record from input device
POST /api/record/stop → transcribe (Whisper @ 16kHz)
POST /api/chat { text }
├─ session.add_message("user", text)
├─ Build system prompt: get_caller_prompt()
│ ├─ Caller identity + background + town knowledge
│ ├─ Show history (summaries of previous callers)
│ ├─ Conversation summary (last 6 messages)
│ └─ HOW TO TALK rules (varied length, no rehashing, etc.)
├─ Last 10 messages → _normalize_messages_for_llm()
LLMService.generate(messages, system_prompt)
├─ OpenRouter: primary model (15s timeout)
├─ Fallback 1: gemini-flash-1.5 (10s)
├─ Fallback 2: gpt-4o-mini (10s)
├─ Fallback 3: llama-3.1-8b (10s)
└─ Last resort: "Sorry, I totally blanked out..."
clean_for_tts() → strip (actions), *gestures*, fix phonetics
ensure_complete_thought() → trim to last complete sentence
Response returned to frontend
POST /api/tts { text, voice_id }
generate_speech(text, voice_id)
├─ Inworld (default cloud) ─┐
├─ ElevenLabs (cloud) │
├─ F5-TTS (local, cloned) ├─→ PCM audio bytes (24kHz)
├─ Kokoro MLX (local, fast) │
├─ ChatTTS / StyleTTS2 / etc. ─┘
AudioService.play_caller_audio(bytes, 24000)
└─→ Output Device Channel 1 (caller TTS)
```
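The two post-processing helpers can be sketched with simple regexes; these are illustrative implementations of the behavior described above, not the exact code in `main.py`:

```python
import re

def clean_for_tts(text: str) -> str:
    """Strip stage directions like (laughs) and *gestures* before synthesis."""
    text = re.sub(r"\([^)]*\)", "", text)   # (parenthetical actions)
    text = re.sub(r"\*[^*]*\*", "", text)   # *asterisk gestures*
    return re.sub(r"\s{2,}", " ", text).strip()

def ensure_complete_thought(text: str) -> str:
    """Trim a possibly cut-off LLM reply back to its last complete sentence."""
    last = max(text.rfind("."), text.rfind("!"), text.rfind("?"))
    return text[: last + 1] if last != -1 else text
```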
---
## Real Caller (Phone) Flow
```
Caller dials 208-439-LUKE
SignalWire routes to webhook
POST /api/signalwire/voice
├─ If OFF AIR → play message + hangup
└─ If ON AIR → return BXML:
<Stream url="wss://.../api/signalwire/stream" codec="L16@16000h">
WebSocket /api/signalwire/stream connects
├─ "start" event → add to queue, play ring SFX
│ broadcast_event("caller_queued")
│ [Caller waits in queue until host takes them]
├─ Host clicks "Take Call" in UI
│ POST /api/queue/take/{caller_id}
│ └─ CallerService.take_call() → allocate channel
│ └─ Start host mic streaming → _host_audio_sender()
├─ "media" events (continuous) ← caller's voice
│ │
│ ├─ route_real_caller_audio(pcm) → Ch 9 (host monitoring)
│ │
│ └─ Buffer 3s chunks → transcribe (Whisper)
│ │
│ └─ broadcast_chat() → appears in chat window
│ Host mic audio → _host_audio_sync_callback()
│ │
│ └─ _host_audio_sender() → CallerService.send_audio_to_caller()
│ └─ base64 encode → WebSocket → SignalWire → caller's phone
│ If AI caller also active (auto-respond mode):
│ │
│ └─ _debounced_auto_respond() (4s silence)
│ └─ LLM → TTS → play on Ch 1 + stream to real caller
├─ Host hangs up
│ POST /api/hangup/real
│ └─ _signalwire_end_call(call_sid) → end phone call
│ └─ _summarize_real_call() → LLM summary → call_history
│ └─ Optional: _auto_followup() → pick AI caller to continue
└─ "stop" event or disconnect → cleanup
```
---
## Audio Routing (Multi-Channel Output)
```
All audio goes to ONE physical output device (Loopback/interface)
Each content type on a separate channel for mixing in DAW/OBS
┌─────────────────────────────────────────────────────────────┐
│ Output Device (e.g. Loopback 16ch) │
│ │
│ Ch 1 ◄── Caller TTS (AI voices) play_caller_audio()
│ Ch 2 ◄── Music (loops) play_music()
│ Ch 3 ◄── Sound Effects (one-shots) play_sfx()
│ Ch 9 ◄── Live Caller Audio (monitoring) route_real_caller_audio()
│ Ch 11 ◄── Ads (one-shots, no loop) play_ad()
│ │
│ All channels configurable via Settings panel │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ Input Device (mic/interface) │
│ │
│ Ch N ──► Host mic recording (push-to-talk) │
│ ──► Host mic streaming (to real callers via WS) │
└─────────────────────────────────────────────────────────────┘
```
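Placing mono PCM onto one channel of the shared output device can be sketched with NumPy, which `sounddevice` consumes natively; `route_to_channel` is a hypothetical helper, not the actual `audio.py` API:

```python
import numpy as np

def route_to_channel(mono: np.ndarray, channel: int,
                     total_channels: int = 16) -> np.ndarray:
    """Expand mono samples into a (frames, channels) buffer with audio on a
    single channel, zeros elsewhere. Channel numbers are 1-based to match
    the diagram (Ch 1 = caller TTS, Ch 2 = music, ...)."""
    out = np.zeros((len(mono), total_channels), dtype=mono.dtype)
    out[:, channel - 1] = mono
    return out
```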
---
## External Services
```
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ SignalWire │ │ OpenRouter │ │ SearXNG │
│ │ │ │ │ (local) │
│ Phone calls │ │ LLM API │ │ News search │
│ REST + WS │ │ Claude,GPT │ │ :8888 │
│ Bidirectional│ │ Gemini,Llama│ │ │
│ audio stream│ │ Fallback │ │ │
└──────────────┘ └──────────────┘ └──────────────┘
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ Inworld │ │ ElevenLabs │ │ Local TTS │
│ │ │ │ │ │
│ TTS (cloud) │ │ TTS (cloud) │ │ Kokoro MLX │
│ Default │ │ Premium │ │ F5-TTS │
│ provider │ │ │ │ ChatTTS │
│ │ │ │ │ + others │
└──────────────┘ └──────────────┘ └──────────────┘
┌──────────────┐
│ Castopod │
│ │
│ Podcast │
│ publishing │
│ (NAS) │
└──────────────┘
```
---
## Session Lifecycle
```
New Session (reset)
├─ Randomize all 10 caller names + voices
├─ Clear conversation, call history, research
├─ New session ID
Show goes ON AIR (toggle)
├─ SignalWire starts accepting calls
Caller interactions (loop)
├─ Pick AI caller (click button 0-9)
│ ├─ Generate background (if first time this session)
│ ├─ Enrich with news (SearXNG → LLM summary)
│ ├─ Conversation loop (chat/respond/auto-respond)
│ └─ Hangup → summarize → add to call_history
├─ Take real caller from queue
│ ├─ Route audio both directions
│ ├─ Transcribe caller speech in real-time
│ ├─ Optional: AI caller auto-responds to real caller
│ └─ Hangup → summarize → add to call_history
├─ Play music / ads / SFX between calls
└─ Each new caller sees show_history (summaries of all previous calls)
"EARLIER IN THE SHOW: Tony talked about... Carmen discussed..."
Show goes OFF AIR
└─ Incoming calls get off-air message + hangup
```
---
## Key Design Patterns
| Pattern | Where | Why |
|---------|-------|-----|
| **Epoch-based staleness** | `_session_epoch` in main.py | Prevents stale LLM/TTS responses from playing after hangup |
| **Fallback chain** | LLMService | Guarantees a response even if primary model times out |
| **Debounced auto-respond** | `_debounced_auto_respond()` | Waits 4s for real caller to stop talking before AI jumps in |
| **Silent failure** | News enrichment | If search/LLM fails, caller just doesn't have news context |
| **Threading for audio** | `play_caller_audio()` | Audio playback can't block the async event loop |
| **Ring buffer** | `route_real_caller_audio()` | Absorbs jitter in real caller audio stream |
| **Lock contention guard** | `_ai_response_lock` | Only one AI response generates at a time |
| **Town knowledge injection** | `TOWN_KNOWLEDGE` dict | Prevents LLM from inventing fake local businesses |
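
The epoch-based staleness pattern from the table reduces to a counter that in-flight tasks snapshot before slow work; this is a minimal sketch, not the actual `_session_epoch` code:

```python
class SessionEpoch:
    """Monotonic counter bumped on every hangup/reset; in-flight LLM/TTS
    work snapshots the epoch first and discards its result if it changed."""

    def __init__(self) -> None:
        self._epoch = 0

    def current(self) -> int:
        return self._epoch

    def bump(self) -> None:
        self._epoch += 1

    def is_stale(self, epoch: int) -> bool:
        return epoch != self._epoch
```

A task captures `e = session.current()` before calling the LLM, then drops its audio if `session.is_stale(e)` is true when the response arrives.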
---
## File Map
```
ai-podcast/
├── backend/
│ ├── main.py ← FastAPI app, all endpoints, caller generation, session
│ ├── config.py ← Settings (env vars, paths)
│ └── services/
│ ├── audio.py ← Multi-channel audio I/O (sounddevice)
│ ├── caller_service.py ← Phone queue, WebSocket registry, audio routing
│ ├── llm.py ← OpenRouter/Ollama with fallback chain
│ ├── news.py ← SearXNG search + caching
│ ├── tts.py ← 8 TTS providers (cloud + local)
│ └── transcription.py ← Whisper speech-to-text
├── frontend/
│ ├── index.html ← Control panel layout
│ ├── js/app.js ← UI logic, polling, event handlers
│ └── css/style.css ← Dark theme styling
├── sounds/ ← SFX files (ring, hangup, busy, etc.)
├── music/ ← Background music tracks
├── ads/ ← Ad audio files
├── website/ ← Landing page (lukeattheroost.com)
├── publish_episode.py ← Castopod episode publisher
└── run.sh ← Server launcher with restart support
```