Complete · 2025
Real-Time Speech Analysis
Live audio transcription with faster-whisper, pyannote speaker diarization, multi-layer noise filtering, and WebSocket streaming achieving 90%+ accuracy.
- Word accuracy: 85–95%
- Hallucinations filtered: 90%+
- End-to-end latency: 2–5 s
- Concurrent sessions per CPU core: 1–5
Architecture
- Browser captures audio via Web Audio API, streams over WebSocket to FastAPI ingest service
- Jitter buffer → VAD → normalization → sliding window manager (60s windows, 15s overlap)
- Whisper ASR engine → speaker diarization → broadcast service pushes updates via WebSocket
- Three diarization modes: heuristic (fast), embeddings-based (voice characteristics), pyannote (neural)
- Four-layer noise filtering: pre-transcription VAD, Whisper param tuning, post-transcription hallucination detection, segment deduplication
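The sliding-window stage above can be sketched as follows. This is a minimal illustration, not the project's actual code: the class name is hypothetical, and it assumes 16 kHz mono float32 PCM arriving in arbitrary chunks, with the 60 s window / 15 s overlap figures taken from the architecture description.

```python
import numpy as np

SAMPLE_RATE = 16_000          # assumed 16 kHz mono input
WINDOW_S, OVERLAP_S = 60, 15  # 60 s windows, 15 s overlap (from the design above)
WINDOW = WINDOW_S * SAMPLE_RATE
STEP = (WINDOW_S - OVERLAP_S) * SAMPLE_RATE  # advance 45 s per window

class SlidingWindowManager:
    """Buffers incoming PCM and yields overlapping windows for the ASR engine."""

    def __init__(self):
        self.buffer = np.zeros(0, dtype=np.float32)
        self.offset = 0  # absolute sample index of buffer[0]

    def push(self, chunk: np.ndarray):
        """Append a chunk; yield (start_time_s, samples) for each full window."""
        self.buffer = np.concatenate([self.buffer, chunk])
        while len(self.buffer) >= WINDOW:
            yield self.offset / SAMPLE_RATE, self.buffer[:WINDOW]
            # Drop only the step, so the trailing 15 s overlap stays buffered
            self.buffer = self.buffer[STEP:]
            self.offset += STEP
```

Keeping the overlap in the buffer is exactly the memory tradeoff noted under Key Decisions: each window re-processes 15 s of audio the previous window already saw.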
Key Decisions
Four-layer noise filtering pipeline
Why: Whisper hallucinates aggressively on silence and background noise — single-layer filtering is insufficient
Tradeoff: Added latency from multi-pass processing, but accuracy gains justify it
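The post-transcription hallucination layer might look like the sketch below. All thresholds and the junk-phrase list are illustrative, not the project's actual values; the confidence signals (`avg_logprob`, `no_speech_prob`) are per-segment fields that faster-whisper does expose.

```python
import re

# Phrases Whisper commonly emits on silence (illustrative, not exhaustive)
KNOWN_HALLUCINATIONS = {"thanks for watching", "thank you.", "subscribe"}

def is_hallucination(text: str, avg_logprob: float, no_speech_prob: float) -> bool:
    """Post-transcription layer: flag segments that look like Whisper noise."""
    t = text.strip().lower()
    if not t:
        return True
    if t in KNOWN_HALLUCINATIONS:
        return True
    # Low confidence on a window the model itself thinks is non-speech
    if no_speech_prob > 0.6 and avg_logprob < -1.0:
        return True
    # Pathological repetition, e.g. "okay okay okay okay okay"
    words = re.findall(r"\w+", t)
    if len(words) >= 4 and len(set(words)) / len(words) < 0.3:
        return True
    return False
```

Each check is cheap string or float work, so this layer adds essentially no latency; the multi-pass cost noted above comes mainly from the VAD and deduplication stages.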
Sliding windows (60s/15s overlap) over fixed chunks
Why: Whisper needs context for accurate transcription, especially domain-specific terms
Tradeoff: Higher memory usage from overlapping audio buffers
CPU-only design (no GPU assumed)
Why: Target deployment environments don't guarantee GPU access
Tradeoff: Limited to 1–5 concurrent sessions per core
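A CPU-only faster-whisper setup along these lines is shown below. The model size, thread count, and file name are assumptions for illustration; `int8` quantization is what makes CPU inference practical, and it is why throughput tops out at a handful of sessions per core.

```python
from faster_whisper import WhisperModel

# CPU-only configuration (model size and thread counts are illustrative).
# int8 quantization cuts memory and speeds up CPU inference at a small
# accuracy cost.
model = WhisperModel(
    "small",
    device="cpu",
    compute_type="int8",
    cpu_threads=4,   # threads used by a single transcription call
    num_workers=1,   # concurrent transcription workers
)

segments, info = model.transcribe(
    "window.wav",                      # hypothetical 60 s window file
    vad_filter=True,                   # pre-transcription VAD layer
    condition_on_previous_text=False,  # curbs runaway repetition/hallucination
    beam_size=5,
)
```

This is a configuration sketch: loading the model downloads weights on first run, so it is not meant to execute in isolation.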
Technologies
Python · faster-whisper · pyannote · WebSocket · FastAPI
What I Learned
- Whisper hallucinates aggressively on silence and background noise, so the four-layer filtering pipeline was essential, not optional.
- Sliding windows with overlap give much better results than fixed non-overlapping chunks because Whisper needs context.
- Using the last 3 transcript segments as context prompts for subsequent windows significantly improved accuracy on domain-specific terms.
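The context-prompt idea in the last point can be sketched as a small helper. The class name and character cap are hypothetical; `initial_prompt` is a real faster-whisper `transcribe` parameter that conditions the decoder on prior text.

```python
from collections import deque

class ContextPrompter:
    """Keeps the last N accepted transcript segments and joins them into
    a prompt for the next window (N=3, as described above)."""

    def __init__(self, n: int = 3, max_chars: int = 200):
        self.recent = deque(maxlen=n)     # oldest segment falls off automatically
        self.max_chars = max_chars        # prompts are token-limited; cap defensively

    def add(self, segment_text: str) -> None:
        self.recent.append(segment_text.strip())

    def prompt(self) -> str:
        return " ".join(self.recent)[-self.max_chars:]

# Usage with faster-whisper (hypothetical `model` and `window`):
# prompter = ContextPrompter()
# segments, _ = model.transcribe(window, initial_prompt=prompter.prompt())
# for seg in segments:
#     prompter.add(seg.text)
```

Feeding only accepted (post-filter) segments back in matters: conditioning on a hallucinated segment would propagate the noise into the next window.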