Complete · 2025

Real-Time Speech Analysis

Live audio transcription with faster-whisper, pyannote speaker diarization, multi-layer noise filtering, and WebSocket streaming, achieving 85–95% word accuracy.

85–95%
Word Accuracy
90%+
Hallucinations Filtered
2–5s
Latency
1–5
Sessions/Core

Architecture

  • Browser captures audio via Web Audio API, streams over WebSocket to FastAPI ingest service
  • Jitter buffer → VAD → normalization → sliding window manager (60s windows, 15s overlap)
  • Whisper ASR engine → speaker diarization → broadcast service pushes updates via WebSocket
  • Three diarization modes: heuristic (fast), embeddings-based (voice characteristics), pyannote (neural)
  • Four-layer noise filtering: pre-transcription VAD, Whisper param tuning, post-transcription hallucination detection, segment deduplication
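The first filtering layer, pre-transcription VAD, gates audio frames before they ever reach the ASR engine. The project's actual VAD implementation is not shown here; the following is a minimal energy-based sketch, with an illustrative threshold, of how a frame gate at that pipeline stage could work.

```python
import math
import struct

def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of a 16-bit little-endian mono PCM frame."""
    samples = struct.unpack(f"<{len(frame) // 2}h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Gate a frame before it is handed to the window manager.

    The threshold is illustrative; a production VAD would adapt it
    to the noise floor or use a trained model instead of raw energy.
    """
    return frame_rms(frame) >= threshold

# Silence (all-zero samples) is dropped; a loud frame passes through.
silence = struct.pack("<160h", *([0] * 160))
loud = struct.pack("<160h", *([8000] * 160))
```

Frames rejected here never consume Whisper compute, which also starves the model of the silent input it is most prone to hallucinate on.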

Key Decisions

Four-layer noise filtering pipeline

Why: Whisper hallucinates aggressively on silence and background noise — single-layer filtering is insufficient

Tradeoff: Added latency from multi-pass processing, but the accuracy gains justify it
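The two post-transcription layers can be illustrated together. The sketch below is hypothetical (the project's actual phrase list and matching rules aren't shown): it drops segments matching known Whisper silence-hallucinations, then deduplicates consecutive identical segments produced by overlapping windows.

```python
# Phrases Whisper commonly emits on silence or noise (illustrative list;
# these are well-known artifacts of its YouTube-heavy training data).
HALLUCINATION_PHRASES = {
    "thanks for watching",
    "thank you for watching",
    "subscribe to my channel",
}

def filter_segments(segments: list[str]) -> list[str]:
    """Apply filtering layers 3 and 4 to a list of transcript segments."""
    kept: list[str] = []
    for text in segments:
        norm = text.strip().lower().rstrip(".!")
        if norm in HALLUCINATION_PHRASES:
            continue  # layer 3: hallucination detection
        if kept and kept[-1].strip().lower() == text.strip().lower():
            continue  # layer 4: dedup of repeats from window overlap
        kept.append(text)
    return kept

cleaned = filter_segments(
    ["Hello there.", "Hello there.", "Thanks for watching.", "Next point."]
)
# → ["Hello there.", "Next point."]
```

A single layer catches neither problem reliably: phrase matching misses overlap repeats, and dedup misses hallucinations that appear only once.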

Sliding windows (60s/15s overlap) over fixed chunks

Why: Whisper needs context for accurate transcription, especially domain-specific terms

Tradeoff: Higher memory usage from overlapping audio buffers
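The window geometry itself is simple: with 60 s windows and 15 s of overlap, each window starts 45 s after the previous one. A minimal sketch of computing those bounds (the real window manager presumably also handles the live, unbounded stream case):

```python
def sliding_windows(total_s: float, window_s: float = 60.0,
                    overlap_s: float = 15.0):
    """Yield (start, end) bounds in seconds; consecutive windows share
    `overlap_s` seconds so Whisper keeps context across boundaries."""
    step = window_s - overlap_s  # 45 s hop between window starts
    start = 0.0
    while True:
        end = min(start + window_s, total_s)
        yield (start, end)
        if end >= total_s:
            break
        start += step

windows = list(sliding_windows(150.0))
# → [(0.0, 60.0), (45.0, 105.0), (90.0, 150.0)]
```

The overlap is what forces the layer-4 deduplication above: the same utterance near a boundary is transcribed twice, once in each window.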

CPU-only design (no GPU assumed)

Why: Target deployment environments don't guarantee GPU access

Tradeoff: Limited to 1–5 concurrent sessions per core

Technologies

Python · faster-whisper · pyannote · WebSocket · FastAPI

What I Learned

  • Whisper hallucinates aggressively on silence and background noise — the four-layer filtering pipeline was essential, not optional.
  • Sliding windows with overlap give much better results than fixed non-overlapping chunks because Whisper needs context.
  • Using the last 3 transcript segments as context prompts for subsequent windows significantly improved accuracy on domain-specific terms.
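The context-prompting trick in the last bullet can be sketched in a few lines. The helper name is hypothetical, but faster-whisper's `transcribe` does accept an `initial_prompt` string that biases decoding toward recently seen vocabulary:

```python
def build_context_prompt(transcript: list[str], n: int = 3) -> str:
    """Join the last `n` finalized transcript segments into a prompt
    string that seeds transcription of the next sliding window."""
    return " ".join(transcript[-n:])

prompt = build_context_prompt(
    ["Welcome.", "Today we cover pyannote.", "It does diarization.", "Let's begin."]
)
# → "Today we cover pyannote. It does diarization. Let's begin."

# With faster-whisper this would be passed along the lines of:
#   model.transcribe(next_window_audio, initial_prompt=prompt)
```

Seeding each window with the preceding segments is what carried domain-specific terms (like library names) across window boundaries.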