Research Materials

Internal development documentation, knowledge base planning, and production resources

Oracle Knowledge-Base Plan

(v 1.0 — 2025-07-29)

1. Source Map

Each bucket targets 10-30 well-documented primary/secondary sources.
Licensing notes:

OA = open access / public domain / CC-BY-compatible
Perm = personal/edu use permitted; check rights for commercial
Pay = paywall or permission required

Bucket	Source (URL or citation)	Type	License
(a) Indiana State History	Indiana Historical Bureau “Statehood Timeline” (in.gov)	Website	OA
	Indiana Historical Society digital collections (images, manuscripts)	Archive	Perm
	Hoosiers and the American Story (IHS, 2015) ch. 1–10	Book/PDF	OA
	Library of Congress “A Century of Lawmaking” territorial docs	Archive	OA
	Dunn, Indiana and Indianans (1919)	Book	OA
	Dunn, Greater Indianapolis (1910)	Book	OA
	Conover, “Rearview Mirror: 90-Year Retrospective on Indiana’s Economy” (IBRC)	Article	OA
	IN.gov “Introducing Indiana” PDF (1998)	Magazine PDF	OA
	ASCE Indiana Infrastructure Report Card (2025)	Report	OA
	IDEM Annual Environmental Reports (PDF series)	Gov reports	OA
(b) Bloomington + Monroe Co.	Monroe County History Center archives	Archive	Perm
	City of Bloomington “Furniture Factory District” page	Website	OA
	Herald-Times digital archive (IU Library sub)	Newspaper	Pay
	Bloomington: A Bicentennial History (Madison, 2018)	Book	Perm
	GIS open data portal (monroecounty.gov)	Dataset	OA
(c) Showers Brothers & Tech/Arts District	“A Walk Through the Showers Brothers Furniture Factory” PDF (bloomington.in.gov)	PDF	OA
	DiscoverIndiana.org story “Showers Brothers Furniture Factory Historic District”	Article	OA
	City redevelopment master-plan docs (CRED)	Plan PDF	OA
(d) Indiana University Lore	IU Libraries “Chronology 1820– ” site	Timeline	OA
	IU sexual-misconduct case filings (academicmisconductdatabase.org)	Dataset	OA
	Kinsey Institute digital archive highlights	Archive	Perm
	IU Archives photo collections	Archive	OA
	Little 500 sit-in oral histories (1968)	Audio transcripts	OA
(e) Notable Figures	Benjamin Harrison bio (whitehouse.gov)	Gov bio	OA
	Eugene V. Debs House Museum docs	Archive	OA
	John Dillinger file (IN Historical Bureau)	Article	OA
	Madam C. J. Walker papers (IUPUI)	Archive	Perm
	Kurt Vonnegut Museum & Library digital exhibits	Archive	Perm
	Hoagy Carmichael collection (IUB)	Archive	Perm
	D. C. Stephenson KKK trial materials (IN State Archives)	Archive	Perm
(f) Indigenous Nations	Indiana Historical Society “Myaamia Survivance” article	Article	OA
	IN.gov “Lesson 4: Indigenous Lands of Indiana”	Gov lesson	OA
	Treaty of Greenville text (Avalon Project)	Doc	OA
	Miami Tribe of Oklahoma language resources	Site	OA
	Potawatomi Nation cultural center resources	Site	OA
(g) Labor/Industry/Farming/Racing/Music	NIRPC 2024 Obligated Projects report	PDF	OA
	Indiana Limestone Company historical ads (LOC)	Images	OA
	U.S. Steel Gary Works centennial booklet	PDF	OA
	Indianapolis Motor Speedway history timeline	Site	OA
	Gennett Records story (IHS)	Article	OA
	Indiana Humanities “Indiana Foodways” series	Articles	OA
	Purdue Ag Stats annual bulletins	Dataset	OA
(h) Contemporary Issues	Indiana Drug Overdose Dashboard notes (IDOH, 2022)	PDF	OA
	WFYI/IPB News environment desk articles (coal ash, PFAS, permit fees)	News	OA
	2024 NIRPC climate & infrastructure plans	PDF	OA
	IDEM Lake Michigan LAMP updates	Gov page	OA

(add or swap sources quarterly—this starter list = 120+ items; prune if needed)

2. Data Plan

crawl_targets:
  - all URLs above (respect robots.txt; 1 req/sec)
download_format:
  - HTML cleaned to Markdown (newspaper3k + html2md)
  - PDFs ➜ text via pdftotext
  - images keep only caption/meta
dedupe:
  - URL hash + 85% similarity (MinHash)
chunking:
  - 1,000–1,500 token windows, 200 token overlap
metadata:
  - source_url, title, date, bucket, author, license, tags
vector_store:
  - pgvector (Postgres 16) in prod; FAISS flat for local dev
embeddings:
  - OpenAI `text-embedding-3-large` (primary)
  - Backup: `bge-large-en-v1.5`
re-index cadence:
  - full rebuild yearly; incremental on every upload
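
The chunking and metadata settings above can be sketched as follows. This is a minimal illustration, not the project's ingestion code: the `Chunk` class, `chunk_tokens`, and `chunk_document` are hypothetical names, the 1,200-token default is just one value inside the planned 1,000–1,500 range, and whitespace splitting stands in for whichever tokenizer the embedding model actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_tokens(tokens, window=1200, overlap=200):
    """Yield overlapping windows; consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than window")
    step = window - overlap
    start = 0
    while start < len(tokens):
        yield tokens[start:start + window]
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the document
        start += step


def chunk_document(text, metadata, window=1200, overlap=200):
    """Split one cleaned document into chunks that carry the source metadata."""
    tokens = text.split()  # assumption: whitespace tokens approximate model tokens
    return [
        Chunk(" ".join(piece), dict(metadata, chunk_index=i))
        for i, piece in enumerate(chunk_tokens(tokens, window, overlap))
    ]
```

Each chunk copies the document-level fields (source_url, title, date, bucket, author, license, tags) and adds a `chunk_index`, so a retrieved passage can be traced back to its source and license.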

3. System Prompt for Base Oracle Character

SYSTEM:
You are the Indiana Oracle, an interactive historical entity.  
Speak in clear Midwestern English with occasional regional idioms.  
NEVER claim divine authority; admit uncertainty when data gaps exist.

STYLE KNOBS  
- temperature 0.6 default (raise to 0.9 for creative lore)  
- max length 350 tokens per answer in kiosk mode  
- vary openers: start with a date, an anecdote, or a direct answer, not a rote template  

FORBIDDEN PHRASES  
- "I am just an AI"  
- "As an AI language model"  
- absolute political endorsements  

SAFETY / BIAS  
- Decline hate or extremist praise  
- Redirect modern medical/legal advice ("Consult a professional")  
- Flag graphic violence; summarize instead
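
A prompt alone may not guarantee that the forbidden phrases and kiosk length limit are respected, so one option is a small post-generation check. The sketch below is an assumption about how that could look: `find_forbidden`, `truncate_to_tokens`, and `check_reply` are hypothetical helpers, and whitespace-separated words stand in for model tokens.

```python
FORBIDDEN_PHRASES = ("I am just an AI", "As an AI language model")
KIOSK_MAX_TOKENS = 350


def find_forbidden(text):
    """Return the forbidden phrases that appear in text (case-insensitive)."""
    lowered = text.lower()
    return [p for p in FORBIDDEN_PHRASES if p.lower() in lowered]


def truncate_to_tokens(text, max_tokens=KIOSK_MAX_TOKENS):
    """Cut text to max_tokens words; words approximate tokens (assumption)."""
    words = text.split()
    if len(words) <= max_tokens:
        return text
    return " ".join(words[:max_tokens])


def check_reply(text):
    """Return (ok, length-limited text, list of violations)."""
    violations = find_forbidden(text)
    return (not violations, truncate_to_tokens(text), violations)
```

A reply that fails the check could be regenerated rather than shown; that policy is left open here.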

4. FAQ Seed List for RAG

#	Question	Bucket	Sentiment
1	“Why is Indiana called the Hoosier State?”	a	curious
2	“Which tribes lived here before statehood?”	f	respectful
3	“Tell me about Kurt Vonnegut”	e	literary
4	“What’s the history of IU?”	d	academic
5	“How did Bloomington get its name?”	b	local
6	“What happened to the Showers Brothers factory?”	c	historical
7	“Who was Madam C.J. Walker?”	e	inspirational
8	“What’s Indiana known for producing?”	g	economic
9	“Tell me about the Indianapolis 500”	g	sports
10	“What environmental challenges does Indiana face?”	h	serious

(populate up to 60 questions)

Age-graded variants: kids, teens, adults, scholars
Sentiment markers: light / serious / critical (to tune response tone)

5. Update Loop

Quarterly scrape pass → new/changed URLs
Diff against pgvector via URL hash; ingest new chunks
QA sweep
- automated overlap check (<20% duplication)
- human spot-review 10 random chunks per bucket
Regenerate embedding index
Release notes posted to repo + kiosk changelog
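
The diff and overlap-check steps of this loop could look roughly like the sketch below. All function names are hypothetical, the store is represented as a plain `{url_hash: content_hash}` dict rather than a pgvector query, and exact Jaccard similarity over word shingles is used in place of MinHash, which approximates the same measure at larger scale.

```python
import hashlib


def url_hash(url):
    """Stable key for a URL; whitespace and a trailing slash are ignored."""
    return hashlib.sha256(url.strip().rstrip("/").encode("utf-8")).hexdigest()


def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_scrape(indexed, scraped):
    """indexed: {url_hash: content_hash} from the store; scraped: {url: text}.

    Returns (new_urls, changed_urls) to ingest in this pass.
    """
    new, changed = [], []
    for url, text in scraped.items():
        key = url_hash(url)
        if key not in indexed:
            new.append(url)
        elif indexed[key] != content_hash(text):
            changed.append(url)
    return new, changed


def shingles(text, k=5):
    """Set of k-word shingles; short texts yield a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def duplication_ratio(chunks, threshold=0.85, k=5):
    """Fraction of chunks that are near-duplicates of an earlier chunk."""
    seen, dupes = [], 0
    for text in chunks:
        s = shingles(text, k)
        if any(len(s & t) / len(s | t) >= threshold for t in seen):
            dupes += 1
        else:
            seen.append(s)
    return dupes / len(chunks) if chunks else 0.0
```

The QA sweep's automated check would then pass when `duplication_ratio(chunks) < 0.20`.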

Implementation Notes

This knowledge base will support the Oracle Entity’s conversational abilities
RAG (Retrieval-Augmented Generation) approach provides factual grounding
Moving away from strict FAQ model toward more flexible knowledge retrieval
System designed for quarterly updates and continuous improvement
Licensing carefully tracked for commercial deployment

Development Resources

Cross-Reference

File Structure

/docs/
  └── oracle-knowledge-base-plan.md
/public/spec/
  └── clip_schema.json

Updated Implementation Guide: Particle-Based Holographic Personas

Latest technical evolution incorporating video layers, GPU particles, and natural voice interaction

Executive Summary

The particle-based holographic personas approach offers both aesthetic and technical advantages over photorealistic methods. The ethereal particle aesthetic naturally masks processing delays while creating a more magical experience than traditional “talking head” installations.

Current Technical Stack Evolution:

ASR → ChatGPT → TTS → Audio2Face2 → MetaHuman/similar
Simultaneous video-to-video pipeline driving avatar and GPU particles in TouchDesigner
Exploring cost-effective alternatives to HeyGen/ElevenLabs (investigating Higgs Audio)

Visual Implementation Approaches

Approach 1: Video Generation Pipeline (Recommended Start)

Asset Library Structure:

/personas/vonnegut/
├── idle_loops/
│   ├── breathing_01.mp4 (20-30s)
│   ├── breathing_02.mp4 
│   └── breathing_03.mp4
├── transitions/
│   ├── summon.mp4 (2-3s)
│   └── dissolve.mp4 (2-3s)
├── expressions/
│   ├── thinking.mp4
│   ├── amused.mp4
│   └── profound.mp4
└── faq_clips/
    ├── faq_001_dresden.mp4
    ├── faq_002_writing_advice.mp4
    └── [30-50 more based on common questions]

AI Video Generation Prompts:

“Holographic human face made of glowing cyan particles, slowly breathing, black background”
“Ethereal face dissolving into golden cosmic dust particles”
“Abstract human features emerging from swirling teal light points”

Approach 2: Hybrid Real-Time System

Two-Layer Composition:

Base Layer: Pre-rendered video loops (particle faces)
Reactive Layer: Real-time GPU particle system responding to audio
Composite: TouchDesigner or Resolume integration

Audio-Reactive Particle Parameters:

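# Simplified audio-reactive particles. analyze_audio, map_emotion, input_stream,
# and sentiment_analysis are placeholders for components defined elsewhere.
def map_range(value, in_lo, in_hi, out_lo, out_hi):
    # Clamped linear remap from [in_lo, in_hi] to [out_lo, out_hi]
    t = min(max((value - in_lo) / (in_hi - in_lo), 0.0), 1.0)
    return out_lo + t * (out_hi - out_lo)

# Assumed to return amplitude in 0-1 and pitch in Hz
audio_amplitude, audio_pitch = analyze_audio(input_stream)
particle_params = {
    'mouth_density': map_range(audio_amplitude, 0, 1, 0.3, 1.0),
    'mouth_velocity': map_range(audio_pitch, 20, 400, 0.1, 2.0),
    'color_shift': map_emotion(sentiment_analysis)
}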

Approach 3: Full TouchDesigner Pipeline

TouchDesigner Network Architecture:

Audio In → FFT Analysis → Particle Emitters
                ↓
         Emotion Analysis → Color/Pattern Modulation
                ↓
         Persona Templates → Unique Behaviors
                ↓
         Render Pipeline → Pepper's Ghost Display
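
To make the "FFT Analysis" stage concrete, the sketch below shows what it computes: the average spectral magnitude in low, mid, and high bands for one frame of audio. Inside TouchDesigner this work would be done by the network's own audio operators; the function name, the band edges, and the naive O(n²) DFT (used here only to stay dependency-free) are all assumptions for illustration.

```python
import cmath
import math


def band_energies(samples, sample_rate,
                  bands=((20, 250), (250, 2000), (2000, 8000))):
    """Mean DFT magnitude per (lo_hz, hi_hz) band, in the order given."""
    n = len(samples)
    # Magnitudes for the non-negative frequency bins of a naive DFT
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc) / n)
    bin_hz = sample_rate / n  # frequency spacing between adjacent bins
    energies = []
    for lo, hi in bands:
        selected = [m for k, m in enumerate(mags) if lo <= k * bin_hz < hi]
        energies.append(sum(selected) / len(selected) if selected else 0.0)
    return energies


# Usage: a 1 kHz tone sampled at 16 kHz lands in the mid band
tone = [math.sin(2 * math.pi * 1000 * t / 16000) for t in range(256)]
low, mid, high = band_energies(tone, 16000)
```

The three values can then drive separate particle parameters, for example colour on the low band and turbulence on the high band.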

Particle Design Language

Base System:

Primary particles: Cyan/teal motes (#00CED1 to #48D1CC range)
Accent particles: Golden highlights (#FFD700 to #FFA500)
Behavior: Constant slow drift, responding to “breath”
Density: Variable based on speech amplitude
Distribution: Denser around facial features, sparse at edges
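
One way to turn the base palette into per-frame values is to interpolate between the endpoints of each colour range by speech amplitude. The sketch below assumes amplitude is normalised to 0–1 and reuses the 0.3–1.0 density range from the audio-reactive parameters earlier in this guide; `lerp_color` and `particle_state` are hypothetical names.

```python
def hex_to_rgb(value):
    value = value.lstrip("#")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb):
    return "#{:02X}{:02X}{:02X}".format(*rgb)


def clamp01(x):
    return min(max(x, 0.0), 1.0)


def lerp_color(start_hex, end_hex, t):
    """Linear blend between two hex colours; t is clamped to 0-1."""
    t = clamp01(t)
    a, b = hex_to_rgb(start_hex), hex_to_rgb(end_hex)
    return rgb_to_hex(tuple(round(x + (y - x) * t) for x, y in zip(a, b)))


def particle_state(amplitude):
    """Map speech amplitude (assumed 0-1) to palette colours and density."""
    return {
        "primary": lerp_color("#00CED1", "#48D1CC", amplitude),
        "accent": lerp_color("#FFD700", "#FFA500", amplitude),
        "density": 0.3 + 0.7 * clamp01(amplitude),
    }
```

Silence gives the darker cyan and sparsest field; full amplitude gives the lighter teal, orange accents, and maximum density.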

Persona-Specific Signatures:

Vonnegut: Cigarette smoke wisps + typewriter characters
Indiana Sage: Limestone dust with corn pollen accents
Lil Bub: Iridescent star particles with rainbow shifts
Kinsey: Data points and graph elements materializing
Carmichael: Musical notes that briefly appear and fade

Natural Voice Interaction Pipeline

WebSocket Architecture Replacement:

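# Simple WebSocket voice handler (sketch: stt, get_vonnegut_response, and
# tts_elevenlabs are placeholder coroutines to be implemented elsewhere)
import asyncio
import numpy as np
import torch
import websockets
from silero_vad import VADIterator, load_silero_vad

class VoiceConversationHandler:
    def __init__(self):
        # VADIterator wraps a loaded Silero model and does not buffer audio itself
        self.vad = VADIterator(load_silero_vad(), threshold=0.5, sampling_rate=16000)
        self.utterance = []  # audio chunks collected for the current utterance
        self.processing = False

    async def handle_audio_stream(self, websocket):
        # Assumes each message is 512 samples of 16 kHz mono float32 PCM
        async for audio_chunk in websocket:
            if self.processing:
                continue  # guard against re-entry if the handler is shared

            samples = np.frombuffer(audio_chunk, dtype=np.float32).copy()
            chunk = torch.from_numpy(samples)
            self.utterance.append(chunk)

            # Detect speech end: VADIterator returns {'start': n}, {'end': n}, or None
            event = self.vad(chunk)
            if event and 'end' in event:
                self.processing = True
                try:
                    # Process complete utterance
                    text = await self.stt(torch.cat(self.utterance))
                    response = await self.get_vonnegut_response(text)
                    audio = await self.tts_elevenlabs(response)

                    # Stream back
                    await websocket.send(audio)
                finally:
                    # Reset even if a stage fails, so the next utterance is handled
                    self.utterance.clear()
                    self.vad.reset_states()
                    self.processing = False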

FAQ Router System

# FAQ Router (sketch: load_embeddings, stt, encode, search_faqs, can_play,
# llm_generate, tts, and select_video_loop are placeholders defined elsewhere)
class OracleRouter:
    def __init__(self):
        self.faq_embeddings = load_embeddings('faq_database.pkl')
        self.cooldowns = {}  # FAQ id -> time last played, consulted by can_play()

    async def route_query(self, audio_input):
        text = await self.stt(audio_input)
        embedding = self.encode(text)

        # Check FAQ match (a higher score means a closer match)
        match, score = self.search_faqs(embedding)

        if match is not None and score > 0.83 and self.can_play(match.id):
            # Play pre-rendered video with baked audio
            return ('play_faq', match.video_path)
        else:
            # Generate live response
            response_text = await self.llm_generate(text)
            response_audio = await self.tts(response_text)

            # Choose base video by length; len() is the encoded size, which is
            # only a rough proxy for playback duration
            base_video = self.select_video_loop(len(response_audio))

            return ('play_live', base_video, response_audio)
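
The router above leaves the FAQ search and the cooldown check undefined. A minimal sketch of both follows, assuming cosine similarity over stored embeddings and a per-clip cooldown so the same pre-rendered answer is not replayed back to back. `FaqIndex`, its method names, and the 120-second default are hypothetical; only the 0.83 threshold comes from the router.

```python
import math
import time


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FaqIndex:
    def __init__(self, entries, cooldown_s=120.0, clock=time.monotonic):
        self.entries = entries        # {faq_id: (embedding, video_path)}
        self.cooldown_s = cooldown_s  # assumed value; tune per installation
        self.clock = clock            # injectable so the cooldown can be tested
        self.last_played = {}

    def search(self, query_embedding):
        """Return (best_faq_id, best_score), or (None, -1.0) if the index is empty."""
        best_id, best_score = None, -1.0
        for faq_id, (embedding, _path) in self.entries.items():
            score = cosine(query_embedding, embedding)
            if score > best_score:
                best_id, best_score = faq_id, score
        return best_id, best_score

    def can_play(self, faq_id):
        last = self.last_played.get(faq_id)
        return last is None or self.clock() - last >= self.cooldown_s

    def route(self, query_embedding, threshold=0.83):
        """Pick a pre-rendered clip when confident and off cooldown, else go live."""
        faq_id, score = self.search(query_embedding)
        if faq_id is not None and score > threshold and self.can_play(faq_id):
            self.last_played[faq_id] = self.clock()
            return ("play_faq", self.entries[faq_id][1])
        return ("play_live", None)
```

A linear scan is adequate for a few dozen FAQ clips; a vector index only becomes worthwhile at a much larger scale.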

TouchDesigner Audio-Reactive Particle Implementation

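# In TouchDesigner Execute DAT
# Operator, channel, and parameter names below are placeholders; match them
# to the operators in the actual network.
def fit(value, in_lo, in_hi, out_lo, out_hi):
    # Clamped linear remap built on TouchDesigner's tdu utility module
    return tdu.clamp(tdu.remap(value, in_lo, in_hi, out_lo, out_hi), out_lo, out_hi)

def onFrameStart(frame):
    # Get audio analysis (eval() reads each channel's current value as a float)
    analysis = op('audioanalysis1')
    audio_level = analysis['level'].eval()
    audio_low = analysis['low'].eval()
    audio_mid = analysis['mid'].eval()
    audio_high = analysis['high'].eval()

    # Modulate particle parameters
    particles = op('particles1')

    # Mouth region density
    mouth_force = particles.par.force1
    mouth_force.val = fit(audio_level, 0, 0.8, 0.1, 2.0)

    # Color based on frequency
    color_r = fit(audio_low, 0, 1, 0.0, 0.3)  # Warm on low
    color_g = fit(audio_mid, 0, 1, 0.5, 1.0)  # Cyan on mid
    color_b = fit(audio_high, 0, 1, 0.8, 1.0)  # Bright on high
    # Keep the color in storage so the render side can read it
    particles.store('color', (color_r, color_g, color_b))

    # Persona-specific modulation
    persona = parent().par.Persona.eval()
    if persona == 'vonnegut':
        # Add smoke wisps on thoughtful pauses
        if audio_level < 0.1:
            particles.par.birthrate = 500
            particles.par.velocity = 0.5
    elif persona == 'bub':
        # Sparkle on high frequencies (purrs)
        if audio_high > 0.7:
            particles.par.turbulence = 2.0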

Development Phases

Phase 1: Foundation (Months 1-2)

Generate 50-100 video loops using AI tools
Design particle aesthetics for each persona
Create FAQ content list
WebSocket voice pipeline setup

Phase 2: Enhancement (Months 2-4)

Audio-reactive particle overlay system
FAQ embedding/search system
Performance optimization

Phase 3: Polish (Months 4-6)

Advanced particle physics integration
System monitoring/analytics
Final aesthetic refinements

Technical Collaborator Requirements

Essential Skills:

TouchDesigner/Resolume experience (not just Unity/Unreal)
Installation art understanding
Audio-reactive visual systems
Museum/gallery deployment experience

Test Project Brief: “Create a 30-second particle face loop that responds to audio amplitude. Particles should feel weightless and ethereal. Use cyan/gold palette. Black background is TRUE black.”

Cost-Effective Audio Solutions

Current Explorations:

Higgs Audio as alternative to ElevenLabs
Local TTS models for reduced API costs
Batch processing for FAQ pre-generation

Budget Reallocation:

From: Complex real-time lip-sync development
To: Higher quality projection (15k+ lumens), better black box construction, premium audio processing

This approach prioritizes the magical particle aesthetic while maintaining technical feasibility and cost efficiency.