Complete · 2025

Real-Time Speech Analysis

Live audio transcription with faster-whisper, pyannote speaker diarization, multi-layer noise filtering, and WebSocket streaming, achieving 85–95% word accuracy.

85–95%
Word Accuracy
90%+
Hallucinations Filtered
2–5s
Latency
1–5
Sessions/Core

Architecture

  • Browser captures audio via Web Audio API, streams over WebSocket to FastAPI ingest service
  • Jitter buffer → VAD → normalization → sliding window manager (60s windows, 15s overlap)
  • Whisper ASR engine → speaker diarization → broadcast service pushes updates via WebSocket
  • Three diarization modes: heuristic (fast), embeddings-based (voice characteristics), pyannote (neural)
  • Four-layer noise filtering: pre-transcription VAD, Whisper param tuning, post-transcription hallucination detection, segment deduplication
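The first filtering layer, pre-transcription VAD, gates audio frames before they ever reach the ASR engine. The project's actual VAD implementation is not shown here; the following is a minimal energy-based sketch, with an illustrative threshold, of how a frame gate at that pipeline stage could work.

```python
import math
import struct

def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian mono PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Gate a frame before it is handed to the window manager.

    The threshold is illustrative; a production VAD would adapt it
    to the noise floor or use a trained model instead of raw energy.
    """
    return frame_rms(frame) >= threshold

# Silence (all-zero samples) is dropped; a loud frame passes through.
silence = struct.pack("<160h", *([0] * 160))
loud = struct.pack("<160h", *([8000] * 160))
```

Frames rejected here never consume Whisper compute, which also starves the model of the silent input it is most prone to hallucinate on.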

Key Decisions

Four-layer noise filtering pipeline

Why: Whisper hallucinates aggressively on silence and background noise — single-layer filtering is insufficient

Tradeoff: Added latency from multi-pass processing, but the accuracy gains justify it
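The two post-transcription layers can be illustrated together. The sketch below is hypothetical (the project's actual phrase list and matching rules aren't shown): it drops segments matching known Whisper silence-hallucinations, then deduplicates consecutive identical segments produced by overlapping windows.

```python
# Phrases Whisper commonly emits on silence or noise (illustrative list;
# these are well-known artifacts of its YouTube-heavy training data).
HALLUCINATION_PHRASES = {
    "thanks for watching",
    "thank you for watching",
    "subscribe to my channel",
}

def filter_segments(segments: list[str]) -> list[str]:
    """Apply filtering layers 3 and 4 to a list of transcript segments."""
    kept: list[str] = []
    for text in segments:
        norm = text.strip().lower().rstrip(".!")
        if norm in HALLUCINATION_PHRASES:
            continue  # layer 3: hallucination detection
        if kept and kept[-1].strip().lower() == text.strip().lower():
            continue  # layer 4: dedup of repeats from window overlap
        kept.append(text)
    return kept

cleaned = filter_segments(
    ["Hello there.", "Hello there.", "Thanks for watching.", "Next point."]
)
# → ["Hello there.", "Next point."]
```

A single layer catches neither problem reliably: phrase matching misses overlap repeats, and dedup misses hallucinations that appear only once.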

Sliding windows (60s/15s overlap) over fixed chunks

Why: Whisper needs context for accurate transcription, especially domain-specific terms

Tradeoff: Higher memory usage from overlapping audio buffers
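The window geometry itself is simple: with 60 s windows and 15 s of overlap, each window starts 45 s after the previous one. A minimal sketch of computing those bounds (the real window manager presumably also handles the live, unbounded stream case):

```python
def sliding_windows(total_s: float, window_s: float = 60.0,
                    overlap_s: float = 15.0):
    """Yield (start, end) bounds in seconds; consecutive windows share
    `overlap_s` seconds so Whisper keeps context across boundaries."""
    step = window_s - overlap_s  # 45 s hop between window starts
    start = 0.0
    while True:
        end = min(start + window_s, total_s)
        yield (start, end)
        if end >= total_s:
            break
        start += step

windows = list(sliding_windows(150.0))
# → [(0.0, 60.0), (45.0, 105.0), (90.0, 150.0)]
```

The overlap is what forces the layer-4 deduplication above: the same utterance near a boundary is transcribed twice, once in each window.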

CPU-only design (no GPU assumed)

Why: Target deployment environments don't guarantee GPU access

Tradeoff: Limited to 1–5 concurrent sessions per core

Technologies

Python · faster-whisper · pyannote · WebSocket · FastAPI

What I Learned

  • Whisper hallucinates aggressively on silence and background noise — the four-layer filtering pipeline was essential, not optional.
  • Sliding windows with overlap give much better results than fixed non-overlapping chunks because Whisper needs context.
  • Using the last 3 transcript segments as context prompts for subsequent windows significantly improved accuracy on domain-specific terms.
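The context-prompting trick in the last bullet can be sketched in a few lines. The helper name is hypothetical, but faster-whisper's `transcribe` does accept an `initial_prompt` string that biases decoding toward recently seen vocabulary:

```python
def build_context_prompt(transcript: list[str], n: int = 3) -> str:
    """Join the last `n` finalized transcript segments into a prompt
    string that seeds transcription of the next sliding window."""
    return " ".join(transcript[-n:])

prompt = build_context_prompt(
    ["Welcome.", "Today we cover pyannote.", "It does diarization.", "Let's begin."]
)
# → "Today we cover pyannote. It does diarization. Let's begin."

# With faster-whisper this would be passed along the lines of:
#   model.transcribe(next_window_audio, initial_prompt=prompt)
```

Seeding each window with the preceding segments is what carried domain-specific terms (like library names) across window boundaries.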