Internal development documentation and pipeline specifications
Goal: Real-time feeling, high visual fidelity. We mix pre-rendered “hero” clips with live lipsynced talk loops.
IDLE → SUMMON → TALK → OUTRO (+ GLITCH when needed)
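For reference, the state flow above can be written as a small transition table. This is a minimal sketch with assumed transition rules, not the production state machine:

```python
# Minimal state-machine sketch for the Oracle display loop.
# State names mirror the flow above; the allowed transitions are assumptions.
ALLOWED_TRANSITIONS = {
    "IDLE":   {"SUMMON"},
    "SUMMON": {"TALK", "GLITCH"},
    "TALK":   {"TALK", "OUTRO", "GLITCH"},   # TALK may chain multiple clips
    "GLITCH": {"TALK", "OUTRO"},             # recover or wind down after a spike
    "OUTRO":  {"IDLE"},
}

def next_state(current: str, requested: str) -> str:
    """Return the requested state if the transition is legal, else stay put."""
    return requested if requested in ALLOWED_TRANSITIONS[current] else current
```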
| Type | Count | Lengths (s) | Notes |
|---|---|---|---|
| Idle loops | 2–3 | 20–30 | Micro-motion only |
| Talk bases | 6–10 | 4/6/8/12 | Neutral face motion, bookend pose |
| Summons | 2–3 | 2–3 | Particles → head |
| Outros | 2–3 | 1–2 | Head → particles |
| Glitch masks | 2–3 | 1 (loop) | For >2 s latency spikes |
| Accent bursts | 2–3 | 0.5–1 | Persona color/shape flourishes |
| FAQ hero clips | 30–50 | variable | Full animation + baked audio |
Bookend pose: Every clip starts/ends on the same still frame (12–16 frames) for clean cuts.
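If useful at ingest time, the bookend rule can be spot-checked by comparing a clip's first and last frames. A minimal sketch using OpenCV; the tolerance value is an assumption:

```python
import cv2
import numpy as np

def is_bookended(path: str, tolerance: float = 2.0) -> bool:
    """Rough check that a clip starts and ends on (nearly) the same still frame."""
    cap = cv2.VideoCapture(path)
    ok, first = cap.read()
    if not ok:
        return False
    cap.set(cv2.CAP_PROP_POS_FRAMES, cap.get(cv2.CAP_PROP_FRAME_COUNT) - 1)
    ok, last = cap.read()
    cap.release()
    if not ok:
        return False
    # Mean absolute pixel difference; small values mean the bookend poses match.
    return float(np.mean(cv2.absdiff(first, last))) < tolerance
```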
q = transcribe(audio)
vec = embed(q)
hit = faiss.search(vec, top_k=1)
if hit.score > 0.83 and cooldown_ok(hit.id):
    play_clip(hit.clip_path)           # Prebaked FAQ
else:
    answer = llm(q, persona)
    wav = tts(answer)
    base = choose_loop(len(wav))       # 4/6/8/12 s
    talk = wav2lip(base, wav, roi="mouth")
    play_clip(talk)
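The helpers referenced above (`cooldown_ok`, `choose_loop`) are left undefined in the pseudocode. One possible shape for them, assuming durations in seconds and a default 30 s cooldown (both assumptions), is:

```python
import time

_last_played: dict[str, float] = {}          # clip id -> last playback timestamp
BASE_LOOPS = {4: "talk_04s", 6: "talk_06s", 8: "talk_08s", 12: "talk_12s"}

def cooldown_ok(clip_id: str, cooldown_s: float = 30.0) -> bool:
    """True if the clip has not been played within its cooldown window."""
    return (time.time() - _last_played.get(clip_id, 0.0)) >= cooldown_s

def mark_played(clip_id: str) -> None:
    """Record playback time so the cooldown check has something to compare against."""
    _last_played[clip_id] = time.time()

def choose_loop(audio_seconds: float) -> str:
    """Pick the shortest talk base that still covers the TTS audio length."""
    for length in sorted(BASE_LOOPS):
        if audio_seconds <= length:
            return BASE_LOOPS[length]
    return BASE_LOOPS[12]                    # longest loop; caller may chain clips
```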
Asset Generation Priority:
Technical Stack Integration:
GPU Requirements:
Fallback Strategies:
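As one concrete fallback, the glitch masks from the clip table can cover live-generation stalls: if a response is not ready within the 2 s threshold, loop a glitch mask until it is. A rough sketch; the function arguments are assumed:

```python
import threading

GLITCH_THRESHOLD_S = 2.0    # from the clip table: glitch masks cover >2 s spikes

def play_with_glitch_fallback(generate_response, play_clip, glitch_clip):
    """Run generation in the background; mask the wait with a glitch loop if slow."""
    result = {}
    worker = threading.Thread(target=lambda: result.update(clip=generate_response()))
    worker.start()
    worker.join(timeout=GLITCH_THRESHOLD_S)
    while worker.is_alive():                 # generation is running long
        play_clip(glitch_clip)               # 1 s looping glitch mask
        worker.join(timeout=1.0)
    play_clip(result["clip"])
```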
This pipeline connects to the main Pepper’s Ghost installation described in the implementation strategy and technical specifications. The conversational system operates independently of the display technology choice, allowing flexibility in deployment scenarios.
Schema definition for the Oracle clip library system. This schema validates all clip metadata entries to ensure consistency across the content pipeline.
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "OracleClip",
  "type": "object",
  "properties": {
    "id": { "type": "string" },
    "persona": { "type": "string" },
    "type": { "type": "string", "enum": ["idle", "summon", "talk_base", "outro", "glitch", "accent", "faq"] },
    "duration": { "type": "number" },
    "emotion": { "type": "string" },
    "path": { "type": "string" },
    "bookend": { "type": "boolean", "default": true },
    "cooldown": { "type": "number", "description": "seconds before reuse allowed" },
    "tags": { "type": "array", "items": { "type": "string" } }
  },
  "required": ["id", "persona", "type", "duration", "path"]
}
This schema defines the structure for all video clips in the Oracle system:
The schema file is also available at /spec/clip_schema.json for direct integration with validation tools.
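As an illustration, a clip entry can be checked against the schema with the `jsonschema` package; the sample metadata below is invented for the example:

```python
import json
from jsonschema import validate

with open("/spec/clip_schema.json") as f:
    schema = json.load(f)

sample_clip = {
    "id": "oracle_faq_012",
    "persona": "base",
    "type": "faq",
    "duration": 14.5,
    "path": "clips/base/faq/oracle_faq_012.mp4",
    "cooldown": 120,
    "tags": ["faq", "project-history"],
}

validate(instance=sample_clip, schema=schema)   # raises ValidationError on bad entries
```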
Particle Consolidation Demo — Demonstration of particles consolidating to form a persona, then dispersing back to neutral/idle particle animation. This represents the SUMMON and OUTRO states in the pipeline.
This demo illustrates the core visual transformation of the Oracle Entity system:
Project Introduction by Base Persona — Demonstration of the Oracle Entity introducing the Echoes of Indiana project, showing how the persona would welcome and orient visitors to the experience.
Based on recent technical discussions, here are key updates to consider for the implementation approach:
States:
IDLE → SUMMON → TALK → OUTRO (+ GLITCH when needed)
TALK Options:
Particle Animation Pipeline:
Add:
Downplay:
Reallocate:
Months 1-2: Foundation
Each Oracle persona has a unique particle signature:
“Why Particles?”
Oracle entities manifest as living constellations of light particles rather than photorealistic faces. This aesthetic choice:
Response Times:
Consider updating visitor-facing descriptions to:
“The Entity responds through directional speakers, their particle form pulsing and flowing with the rhythm of speech, occasionally coalescing into clearer features during profound moments”
Seeking Technical Collaborator:
Audio Input → Speech-to-Text → FAQ Matching/LLM → Response → Video Selection → Particle Modulation
Key Insight: Pre-rendered FAQ responses eliminate 80% of processing time, enabling instant playback of high-quality interactions. Only truly novel questions require live generation.
The main goal: Shift the narrative from “complex technical challenge” to “artistic choice that happens to be technically smart.” The particle approach isn’t a compromise—it’s a more magical solution than photorealism would be.
The research confirms that sub-1-second latency is achievable using modern streaming technologies and optimized pipelines. The most promising approach combines HeyGen’s WebRTC-based streaming avatar API (150-250ms latency) with ElevenLabs streaming TTS (150-300ms) and efficient particle masking systems, achieving total glass-to-glass latency of 300-500ms on a single RTX 4090 workstation.
Primary recommendation: HeyGen Streaming Avatar API
HeyGen emerges as the optimal commercial solution for immediate deployment, offering 150-250ms glass-to-glass latency through WebRTC streaming with LiveKit integration. The API provides built-in audio-to-viseme mapping, alpha channel support for holographic displays, and production-ready TypeScript/JavaScript SDKs. At $0.20-$0.30 per minute, it balances cost with performance for rapid prototyping.
Secondary option: NVIDIA Audio2Face 2.0
For maximum performance and local control, NVIDIA’s Audio2Face achieves the lowest theoretical latency at ~50ms with RTX optimization. This on-premises solution requires Omniverse ecosystem setup but provides industry-leading facial blendshape generation and full data control. The RTX 4090’s architecture is specifically optimized for this workload.
Open-source alternative: GeneFace++
GeneFace++ offers real-time NeRF-based 3D talking face generation with no per-minute costs. While requiring significant development investment and ML expertise, it provides complete customization and can achieve real-time performance on RTX 4090 hardware.
TouchDesigner leads for real-time particle effects
TouchDesigner proves optimal for particle-based avatar emergence effects, capable of handling 200,000+ particles at 60fps on RTX 4090. The platform uses texture-based particle systems where RGB textures encode particle positions, updated via feedback loops.
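The texture-feedback idea can be illustrated outside TouchDesigner with a small NumPy sketch: particle positions live in the RGB channels of a float texture, and each frame reads the previous texture, nudges it toward a target shape, and writes it back. All parameter values here are assumptions:

```python
import numpy as np

W = H = 512                                   # one particle per texel -> 262,144 particles
positions = np.random.rand(H, W, 3).astype(np.float32)   # RGB = normalized x, y, z

def feedback_step(tex: np.ndarray, target: np.ndarray, pull: float = 0.02) -> np.ndarray:
    """One feedback-loop update: drift toward a target shape plus a little noise."""
    noise = (np.random.rand(H, W, 3).astype(np.float32) - 0.5) * 0.002
    return tex + (target - tex) * pull + noise

# Consolidation toward a stand-in "head" target, as in the SUMMON state.
head_target = np.full((H, W, 3), 0.5, dtype=np.float32)
for _ in range(120):                          # ~2 s at 60 fps
    positions = feedback_step(positions, head_target)
```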
Unity VFX Graph for maximum particle count
Unity’s Visual Effect Graph supports 1 million+ particles with GPU simulation on RTX 4090, making it suitable for complex emergence effects. The platform offers mesh-based particle emission for face formation and WebRTC integration for real-time input.
WebSocket/WebRTC migration strategy
Transitioning from REST to WebSocket reduces latency by 10x, from ~500ms to ~50ms. The recommended implementation uses python-socketio or aiortc for WebRTC, with connection management including exponential backoff retry logic and 15-second heartbeat intervals. ElevenLabs’ WebSocket API with optimized chunk scheduling ([120, 160, 250, 290]) achieves 150-300ms time-to-first-byte.
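A sketch of the streaming-TTS leg using the `websockets` package; the endpoint path and message fields follow ElevenLabs' documented stream-input format, but the voice ID, model choice, and playback callback are placeholders to be verified against the current API:

```python
import asyncio, base64, json, os
import websockets   # assumed dependency: pip install websockets

VOICE_ID = "YOUR_VOICE_ID"   # placeholder
URI = f"wss://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream-input?model_id=eleven_turbo_v2"

async def stream_tts(text: str, on_audio_chunk) -> None:
    """Stream TTS audio chunks as they arrive instead of waiting for the full file."""
    async with websockets.connect(URI) as ws:
        # Opening message: API key plus the chunk schedule mentioned above.
        await ws.send(json.dumps({
            "text": " ",
            "generation_config": {"chunk_length_schedule": [120, 160, 250, 290]},
            "xi_api_key": os.environ["ELEVENLABS_API_KEY"],
        }))
        await ws.send(json.dumps({"text": text, "try_trigger_generation": True}))
        await ws.send(json.dumps({"text": ""}))            # end-of-input marker
        async for message in ws:
            data = json.loads(message)
            if data.get("audio"):
                on_audio_chunk(base64.b64decode(data["audio"]))
            if data.get("isFinal"):
                break

# asyncio.run(stream_tts("The oracle stirs...", play_audio_chunk))
```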
Microservices architecture pattern
The optimal architecture separates speech processing, avatar animation, and delivery into distinct services communicating via Redis message queues. This enables horizontal scaling, with load balancing across multiple avatar instances. The pipeline maintains 150-300ms adaptive buffering with jitter compensation based on network conditions.
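As a sketch of one service boundary, the speech service could push transcripts onto a Redis stream that the avatar service consumes; the stream and field names are invented for illustration:

```python
import redis   # assumed dependency: pip install redis

r = redis.Redis(host="localhost", port=6379)

# Speech-processing service: publish each transcript for downstream services.
def publish_transcript(session_id: str, text: str) -> None:
    r.xadd("oracle:transcripts", {"session": session_id, "text": text})

# Avatar-animation service: block until work arrives, then hand it to the animator.
def consume_transcripts(handle) -> None:
    last_id = "$"                              # only messages newer than startup
    while True:
        for _stream, entries in r.xread({"oracle:transcripts": last_id}, block=5000) or []:
            for entry_id, fields in entries:
                last_id = entry_id
                handle(fields[b"session"].decode(), fields[b"text"].decode())
```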
Operational costs scale with usage
Basic deployments start at $0.13/minute using cloud resources, scaling to $0.52/minute for full production configurations. Monthly costs range from $500 for prototypes to $15,000+ for production systems. The break-even point for self-hosted versus API-based solutions typically occurs at 6-12 months with $3,000-$5,000 monthly usage.
Recommended deployment approach
Start with HeyGen’s API for rapid prototyping at $99-$330/month, enabling immediate testing without infrastructure investment. Simultaneously develop the particle masking system using TouchDesigner on local RTX 4090 hardware. Once monthly costs exceed $3,000, transition to a hybrid approach with on-premise Audio2Face for avatar generation while maintaining cloud APIs for overflow capacity.
Phase 1 (Week 1-2): Deploy HeyGen streaming avatar with basic WebSocket integration, achieving <500ms latency baseline. Implement TouchDesigner particle system prototype with alpha channel output.
Phase 2 (Week 3-4): Integrate ElevenLabs streaming TTS with alignment data. Develop adaptive buffering system targeting 150-300ms. Configure holographic display with appropriate codec support.
Phase 3 (Week 5-6): Implement particle-to-face morphing transitions. Optimize pipeline for consistent <1 second glass-to-glass latency. Add monitoring and failover systems.
Phase 4 (Week 7-8): Performance tune for production deployment. Implement caching strategies for common responses. Document API usage patterns for cost optimization.
This architecture achieves the required <1 second latency while providing flexibility to scale from prototype to production deployment on a single RTX 4090 workstation.