Internal development documentation and pipeline specifications
Goal: Real-time feeling, high visual fidelity. We mix pre-rendered “hero” clips with live lipsynced talk loops.
IDLE → SUMMON → TALK → OUTRO (+ GLITCH when needed)
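For reference, the state flow above can be written as a small transition table. This is a minimal sketch with assumed transition rules, not the production state machine:

```python
# Minimal state-machine sketch for the Oracle display loop.
# State names mirror the flow above; the allowed transitions are assumptions.
ALLOWED_TRANSITIONS = {
    "IDLE":   {"SUMMON"},
    "SUMMON": {"TALK", "GLITCH"},
    "TALK":   {"TALK", "OUTRO", "GLITCH"},   # TALK may chain multiple clips
    "GLITCH": {"TALK", "OUTRO"},             # recover or wind down after a spike
    "OUTRO":  {"IDLE"},
}

def next_state(current: str, requested: str) -> str:
    """Return the requested state if the transition is legal, else stay put."""
    return requested if requested in ALLOWED_TRANSITIONS[current] else current
```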
| Type | Count | Lengths (s) | Notes |
|---|---|---|---|
| Idle loops | 2–3 | 20–30 | Micro-motion only |
| Talk bases | 6–10 | 4/6/8/12 | Neutral face motion, bookend pose |
| Summons | 2–3 | 2–3 | Particles → head |
| Outros | 2–3 | 1–2 | Head → particles |
| Glitch masks | 2–3 | 1 (loop) | For >2 s latency spikes |
| Accent bursts | 2–3 | 0.5–1 | Persona color/shape flourishes |
| FAQ hero clips | 30–50 | variable | Full animation + baked audio |
Bookend pose: Every clip starts/ends on the same still frame (12–16 frames) for clean cuts.
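If useful at ingest time, the bookend rule can be spot-checked by comparing a clip's first and last frames. A minimal sketch using OpenCV; the tolerance value is an assumption:

```python
import cv2
import numpy as np

def is_bookended(path: str, tolerance: float = 2.0) -> bool:
    """Rough check that a clip starts and ends on (nearly) the same still frame."""
    cap = cv2.VideoCapture(path)
    ok, first = cap.read()
    if not ok:
        return False
    cap.set(cv2.CAP_PROP_POS_FRAMES, cap.get(cv2.CAP_PROP_FRAME_COUNT) - 1)
    ok, last = cap.read()
    cap.release()
    if not ok:
        return False
    # Mean absolute pixel difference; small values mean the bookend poses match.
    return float(np.mean(cv2.absdiff(first, last))) < tolerance
```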
q = transcribe(audio)
vec = embed(q)
hit = faiss.search(vec, top_k=1)
if hit.score > 0.83 and cooldown_ok(hit.id):
    play_clip(hit.clip_path)           # Prebaked FAQ
else:
    answer = llm(q, persona)
    wav = tts(answer)
    base = choose_loop(len(wav))       # 4/6/8/12 s
    talk = wav2lip(base, wav, roi="mouth")
    play_clip(talk)
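The helpers referenced above (`cooldown_ok`, `choose_loop`) are left undefined in the pseudocode. One possible shape for them, assuming durations in seconds and a default 30 s cooldown (both assumptions), is:

```python
import time

_last_played: dict[str, float] = {}          # clip id -> last playback timestamp
BASE_LOOPS = {4: "talk_04s", 6: "talk_06s", 8: "talk_08s", 12: "talk_12s"}

def cooldown_ok(clip_id: str, cooldown_s: float = 30.0) -> bool:
    """True if the clip has not been played within its cooldown window."""
    return (time.time() - _last_played.get(clip_id, 0.0)) >= cooldown_s

def mark_played(clip_id: str) -> None:
    """Record playback time so the cooldown check has something to compare against."""
    _last_played[clip_id] = time.time()

def choose_loop(audio_seconds: float) -> str:
    """Pick the shortest talk base that still covers the TTS audio length."""
    for length in sorted(BASE_LOOPS):
        if audio_seconds <= length:
            return BASE_LOOPS[length]
    return BASE_LOOPS[12]                    # longest loop; caller may chain clips
```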
Asset Generation Priority:
Technical Stack Integration:
GPU Requirements:
Fallback Strategies:
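As one concrete fallback, the glitch masks from the clip table can cover live-generation stalls: if a response is not ready within the 2 s threshold, loop a glitch mask until it is. A rough sketch; the function arguments are assumed:

```python
import threading

GLITCH_THRESHOLD_S = 2.0    # from the clip table: glitch masks cover >2 s spikes

def play_with_glitch_fallback(generate_response, play_clip, glitch_clip):
    """Run generation in the background; mask the wait with a glitch loop if slow."""
    result = {}
    worker = threading.Thread(target=lambda: result.update(clip=generate_response()))
    worker.start()
    worker.join(timeout=GLITCH_THRESHOLD_S)
    while worker.is_alive():                 # generation is running long
        play_clip(glitch_clip)               # 1 s looping glitch mask
        worker.join(timeout=1.0)
    play_clip(result["clip"])
```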
This pipeline connects to the main Pepper’s Ghost installation described in the implementation strategy and technical specifications. The conversational system operates independently of the display technology choice, allowing flexibility in deployment scenarios.
Schema definition for the Oracle clip library system. This schema validates all clip metadata entries to ensure consistency across the content pipeline.
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "OracleClip",
  "type": "object",
  "properties": {
    "id": { "type": "string" },
    "persona": { "type": "string" },
    "type": { "type": "string", "enum": ["idle", "summon", "talk_base", "outro", "glitch", "accent", "faq"] },
    "duration": { "type": "number" },
    "emotion": { "type": "string" },
    "path": { "type": "string" },
    "bookend": { "type": "boolean", "default": true },
    "cooldown": { "type": "number", "description": "seconds before reuse allowed" },
    "tags": { "type": "array", "items": { "type": "string" } }
  },
  "required": ["id", "persona", "type", "duration", "path"]
}
This schema defines the structure for all video clips in the Oracle system:
The schema file is also available at /spec/clip_schema.json for direct integration with validation tools.
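As an illustration, a clip entry can be checked against the schema with the `jsonschema` package; the sample metadata below is invented for the example:

```python
import json
from jsonschema import validate

with open("/spec/clip_schema.json") as f:
    schema = json.load(f)

sample_clip = {
    "id": "oracle_faq_012",
    "persona": "base",
    "type": "faq",
    "duration": 14.5,
    "path": "clips/base/faq/oracle_faq_012.mp4",
    "cooldown": 120,
    "tags": ["faq", "project-history"],
}

validate(instance=sample_clip, schema=schema)   # raises ValidationError on bad entries
```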
Particle Consolidation Demo — Demonstration of particles consolidating to form a persona, then dispersing back to neutral/idle particle animation. This represents the SUMMON and OUTRO states in the pipeline.
This demo illustrates the core visual transformation of the Oracle Entity system:
Project Introduction by Base Persona — Demonstration of the Oracle Entity introducing the Echoes of Indiana project, showing how the persona would welcome and orient visitors to the experience.
Based on recent technical discussions, here are key updates to consider for the implementation approach:
States:
IDLE → SUMMON → TALK → OUTRO (+ GLITCH when needed)
TALK Options:
Particle Animation Pipeline:
Add:
Downplay:
Reallocate:
Months 1-2: Foundation
Each Oracle persona has a unique particle signature:
“Why Particles?”
Oracle entities manifest as living constellations of light particles rather than photorealistic faces. This aesthetic choice:
Response Times:
Consider updating visitor-facing descriptions to:
“The Entity responds through directional speakers, their particle form pulsing and flowing with the rhythm of speech, occasionally coalescing into clearer features during profound moments”
Seeking Technical Collaborator:
Audio Input → Speech-to-Text → FAQ Matching/LLM → Response → Video Selection → Particle Modulation
Key Insight: Pre-rendered FAQ responses eliminate 80% of processing time, enabling instant playback of high-quality interactions. Only truly novel questions require live generation.
The main goal: Shift the narrative from “complex technical challenge” to “artistic choice that happens to be technically smart.” The particle approach isn’t a compromise—it’s a more magical solution than photorealism would be.
The research confirms that sub-1-second latency is achievable using modern streaming technologies and optimized pipelines. The most promising approach combines HeyGen’s WebRTC-based streaming avatar API (150-250ms latency) with ElevenLabs streaming TTS (150-300ms) and efficient particle masking systems, achieving total glass-to-glass latency of 300-500ms on a single RTX 4090 workstation.
Primary recommendation: HeyGen Streaming Avatar API
HeyGen emerges as the optimal commercial solution for immediate deployment, offering 150-250ms glass-to-glass latency through WebRTC streaming with LiveKit integration. The API provides built-in audio-to-viseme mapping, alpha channel support for holographic displays, and production-ready TypeScript/JavaScript SDKs. At $0.20-$0.30 per minute, it balances cost with performance for rapid prototyping.
Secondary option: NVIDIA Audio2Face 2.0
For maximum performance and local control, NVIDIA’s Audio2Face achieves the lowest theoretical latency at ~50ms with RTX optimization. This on-premises solution requires Omniverse ecosystem setup but provides industry-leading facial blendshape generation and full data control. The RTX 4090’s architecture is specifically optimized for this workload.
Open-source alternative: GeneFace++
GeneFace++ offers real-time NeRF-based 3D talking face generation with no per-minute costs. While requiring significant development investment and ML expertise, it provides complete customization and can achieve real-time performance on RTX 4090 hardware.
TouchDesigner leads for real-time particle effects
TouchDesigner proves optimal for particle-based avatar emergence effects, capable of handling 200,000+ particles at 60fps on RTX 4090. The platform uses texture-based particle systems where RGB textures encode particle positions, updated via feedback loops.
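The texture-feedback idea can be illustrated outside TouchDesigner with a small NumPy sketch: particle positions live in the RGB channels of a float texture, and each frame reads the previous texture, nudges it toward a target shape, and writes it back. All parameter values here are assumptions:

```python
import numpy as np

W = H = 512                                   # one particle per texel -> 262,144 particles
positions = np.random.rand(H, W, 3).astype(np.float32)   # RGB = normalized x, y, z

def feedback_step(tex: np.ndarray, target: np.ndarray, pull: float = 0.02) -> np.ndarray:
    """One feedback-loop update: drift toward a target shape plus a little noise."""
    noise = (np.random.rand(H, W, 3).astype(np.float32) - 0.5) * 0.002
    return tex + (target - tex) * pull + noise

# Consolidation toward a stand-in "head" target, as in the SUMMON state.
head_target = np.full((H, W, 3), 0.5, dtype=np.float32)
for _ in range(120):                          # ~2 s at 60 fps
    positions = feedback_step(positions, head_target)
```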
Unity VFX Graph for maximum particle count
Unity’s Visual Effect Graph supports 1 million+ particles with GPU simulation on RTX 4090, making it suitable for complex emergence effects. The platform offers mesh-based particle emission for face formation and WebRTC integration for real-time input.
WebSocket/WebRTC migration strategy
Transitioning from REST to WebSocket reduces latency by 10x, from ~500ms to ~50ms. The recommended implementation uses python-socketio or aiortc for WebRTC, with connection management including exponential backoff retry logic and 15-second heartbeat intervals. ElevenLabs’ WebSocket API with optimized chunk scheduling ([120, 160, 250, 290]) achieves 150-300ms time-to-first-byte.
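A sketch of the streaming-TTS leg using the `websockets` package; the endpoint path and message fields follow ElevenLabs' documented stream-input format, but the voice ID, model choice, and playback callback are placeholders to be verified against the current API:

```python
import asyncio, base64, json, os
import websockets   # assumed dependency: pip install websockets

VOICE_ID = "YOUR_VOICE_ID"   # placeholder
URI = f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream-input?model_id=eleven_turbo_v2"

async def stream_tts(text: str, on_audio_chunk) -> None:
    """Stream TTS audio chunks as they arrive instead of waiting for the full file."""
    async with websockets.connect(URI) as ws:
        # Opening message: API key plus the chunk schedule mentioned above.
        await ws.send(json.dumps({
            "text": " ",
            "generation_config": {"chunk_length_schedule": [120, 160, 250, 290]},
            "xi_api_key": os.environ["ELEVENLABS_API_KEY"],
        }))
        await ws.send(json.dumps({"text": text, "try_trigger_generation": True}))
        await ws.send(json.dumps({"text": ""}))            # end-of-input marker
        async for message in ws:
            data = json.loads(message)
            if data.get("audio"):
                on_audio_chunk(base64.b64decode(data["audio"]))
            if data.get("isFinal"):
                break

# asyncio.run(stream_tts("The oracle stirs...", play_audio_chunk))
```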
Microservices architecture pattern
The optimal architecture separates speech processing, avatar animation, and delivery into distinct services communicating via Redis message queues. This enables horizontal scaling, with load balancing across multiple avatar instances. The pipeline maintains 150-300ms adaptive buffering with jitter compensation based on network conditions.
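As a sketch of one service boundary, the speech service could push transcripts onto a Redis stream that the avatar service consumes; the stream and field names are invented for illustration:

```python
import redis   # assumed dependency: pip install redis

r = redis.Redis(host="localhost", port=6379)

# Speech-processing service: publish each transcript for downstream services.
def publish_transcript(session_id: str, text: str) -> None:
    r.xadd("oracle:transcripts", {"session": session_id, "text": text})

# Avatar-animation service: block until work arrives, then hand it to the animator.
def consume_transcripts(handle) -> None:
    last_id = "$"                              # only messages newer than startup
    while True:
        for _stream, entries in r.xread({"oracle:transcripts": last_id}, block=5000) or []:
            for entry_id, fields in entries:
                last_id = entry_id
                handle(fields[b"session"].decode(), fields[b"text"].decode())
```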
Operational costs scale with usage
Basic deployments start at $0.13/minute using cloud resources, scaling to $0.52/minute for full production configurations. Monthly costs range from $500 for prototypes to $15,000+ for production systems. The break-even point for self-hosted versus API-based solutions typically occurs at 6-12 months with $3,000-$5,000 monthly usage.
Recommended deployment approach
Start with HeyGen’s API for rapid prototyping at $99-$330/month, enabling immediate testing without infrastructure investment. Simultaneously develop the particle masking system using TouchDesigner on local RTX 4090 hardware. Once monthly costs exceed $3,000, transition to a hybrid approach with on-premise Audio2Face for avatar generation while maintaining cloud APIs for overflow capacity.
Phase 1 (Week 1-2): Deploy HeyGen streaming avatar with basic WebSocket integration, achieving <500ms latency baseline. Implement TouchDesigner particle system prototype with alpha channel output.
Phase 2 (Week 3-4): Integrate ElevenLabs streaming TTS with alignment data. Develop adaptive buffering system targeting 150-300ms. Configure holographic display with appropriate codec support.
Phase 3 (Week 5-6): Implement particle-to-face morphing transitions. Optimize pipeline for consistent <1 second glass-to-glass latency. Add monitoring and failover systems.
Phase 4 (Week 7-8): Performance tune for production deployment. Implement caching strategies for common responses. Document API usage patterns for cost optimization.
This architecture achieves the required <1 second latency while providing flexibility to scale from prototype to production deployment on a single RTX 4090 workstation.