Test Setup: 6 scenarios, 5 languages (EN, JA, KO, VI, ZH), 3 models
Hardware: NVIDIA GeForce RTX 4090
Date: December 2025

TL;DR

We evaluated three speaker diarization models across six scenarios:
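Model              | Description                                            | Avg DER | Avg RTF
-------------------|--------------------------------------------------------|---------|--------
NeMo Neural (MSDD) | Multi-Scale Diarization Decoder with neural refinement | 0.081   | 0.020
NeMo Clustering    | Clustering-only approach without MSDD                  | 0.103   | 0.010
Pyannote 3.1       | End-to-end diarization pipeline                        | 0.181   | 0.027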
Key Findings:
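  • NeMo Neural delivers the best accuracy (average DER 0.081) while running about 50x faster than real time (RTF 0.020)
  • Japanese benefits from longer context: DER drops from 0.157 on 10-minute audio to 0.067 on 30-minute audio
  • Multilingual audio without Japanese performs very well (NeMo Neural DER: 0.050)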

1. Introduction

We needed to choose a diarization model for production. Our evaluation covers 6 scenarios representing real-world conditions:
  • Different audio lengths (10 minutes to 1 hour)
  • Varying speaker counts (4 to 14 speakers)
  • Different overlap levels (15% to 40%)
  • Multilingual audio mixing

2. Models Under Test

NeMo Neural (MSDD)

  • TitaNet-large for 192-dimensional speaker embeddings
  • Processes audio at 5 temporal scales (1.0s-3.0s windows)
  • MSDD neural network refines initial clustering results
  • Average RTF: ~0.015-0.032
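As a rough illustration of the multi-scale idea, the sketch below slices a recording into overlapping windows at five lengths. The specific scale values (3.0, 2.5, 2.0, 1.5, 1.0 s), the 50% shift between windows, and the function name `multiscale_windows` are assumptions made for this example; the benchmark states only that five scales between 1.0 s and 3.0 s were used, and this is not NeMo's API.

```python
def multiscale_windows(duration_s, scales_s=(3.0, 2.5, 2.0, 1.5, 1.0), shift_ratio=0.5):
    """Return {window_length: [(start, end), ...]} for each temporal scale.

    scales_s and shift_ratio are illustrative assumptions, not values
    confirmed by the benchmark configuration.
    """
    segments = {}
    for win in scales_s:
        shift = win * shift_ratio  # hop between consecutive windows
        spans = []
        start = 0.0
        while start < duration_s:
            end = min(start + win, duration_s)  # clip the last window to the audio end
            spans.append((round(start, 3), round(end, 3)))
            if end >= duration_s:
                break
            start += shift
        segments[win] = spans
    return segments


# Example: a 6-second clip yields fewer long windows and more short ones.
windows = multiscale_windows(6.0)
for win, spans in windows.items():
    print(f"{win:.1f}s scale: {len(spans)} windows")
```

In a multi-scale system, an embedding would be extracted for every window at every scale; longer windows give more stable speaker embeddings, while shorter ones give finer time resolution.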

NeMo Clustering (Pure)

  • Same embedding model (TitaNet-large)
  • Uses only spectral clustering without MSDD refinement
  • Significantly faster due to skipping neural refinement
  • Average RTF: ~0.014-0.028

Pyannote 3.1

  • End-to-end pipeline with VAD, segmentation, and clustering
  • Uses pyannote/segmentation-3.0 and wespeaker models
  • Average RTF: ~0.018-0.043

3. Evaluation Setup

3.1 Test Scenarios

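Scenario              | Duration | Speakers | Overlap | Purpose
----------------------|----------|----------|---------|--------------------
Long Audio            | 10min    | 4-5      | 15%     | Standard production
Very Long             | 30min    | 10-12    | 15%     | Stress test
1-Hour Audio          | 60min    | 12-14    | 15%     | Extreme duration
High Overlap          | 15min    | 8-10     | 40%     | Worst-case overlap
Multilingual (5-lang) | 15min    | 8        | 20%     | EN+JA+KO+VI+ZH
Multilingual (4-lang) | 15min    | 8        | 20%     | EN+KO+VI+ZH (no JA)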

3.2 Metrics

Accuracy Metrics:
  • DER Full (collar=0.0s): Strictest metric, no boundary tolerance
  • DER Fair (collar=0.25s): Primary metric with 250ms tolerance
  • DER Forgiving (collar=0.25s, overlap ignored): Most lenient
DER Components:
  • Miss Rate: Speech missed by the system
  • False Alarm Rate: Non-speech marked as speech
  • Confusion Rate: Speech assigned to wrong speaker
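The three components combine into DER by the standard definition: the sum of missed speech, false alarm, and speaker confusion time, divided by the total reference speech time. The sketch below shows that arithmetic; the function name and the durations in the example are hypothetical, not measurements from this benchmark.

```python
def diarization_error_rate(missed_s, false_alarm_s, confusion_s, reference_speech_s):
    """Compute DER and its component rates from error durations in seconds."""
    if reference_speech_s <= 0:
        raise ValueError("reference_speech_s must be positive")
    return {
        "miss_rate": missed_s / reference_speech_s,
        "false_alarm_rate": false_alarm_s / reference_speech_s,
        "confusion_rate": confusion_s / reference_speech_s,
        "der": (missed_s + false_alarm_s + confusion_s) / reference_speech_s,
    }


# Hypothetical example: 600 s of reference speech with 12 s missed,
# 9 s of false alarm, and 15 s attributed to the wrong speaker.
result = diarization_error_rate(12.0, 9.0, 15.0, 600.0)
print(f"DER = {result['der']:.3f}")
```

Because false alarms are counted in the numerator but not the denominator, DER can exceed 1.0 on very noisy output.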

4. Overall Performance

4.1 Accuracy Comparison

[Figure: Overall DER comparison across all scenarios]

Key Observations:
  • NeMo Neural's average DER is ~55% lower than Pyannote's (0.081 vs 0.181)
  • NeMo Clustering comes reasonably close to Neural, with ~27% higher average DER (0.103 vs 0.081)
  • Pyannote has a 3.4x higher confusion rate
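The percentage comparisons above are relative differences in average DER. The short sketch below reproduces them from the averages in the TL;DR table; the helper name `relative_change` is ours, and which model serves as the baseline changes the percentage, so the baseline is stated explicitly in each call.

```python
# Average DER per model, copied from the TL;DR table.
AVG_DER = {
    "NeMo Neural (MSDD)": 0.081,
    "NeMo Clustering": 0.103,
    "Pyannote 3.1": 0.181,
}


def relative_change(value, baseline):
    """Relative difference of value with respect to baseline (negative = lower)."""
    return (value - baseline) / baseline


# NeMo Neural measured against Pyannote as the baseline.
neural_vs_pyannote = relative_change(AVG_DER["NeMo Neural (MSDD)"], AVG_DER["Pyannote 3.1"])
# NeMo Clustering measured against NeMo Neural as the baseline.
clustering_vs_neural = relative_change(AVG_DER["NeMo Clustering"], AVG_DER["NeMo Neural (MSDD)"])

print(f"Neural vs Pyannote:   {neural_vs_pyannote:+.1%}")    # -55.2%
print(f"Clustering vs Neural: {clustering_vs_neural:+.1%}")  # +27.2%
```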

4.2 Speed Comparison

[Figure: Processing speed comparison (RTF - lower is faster)]

  • NeMo Clustering is the fastest (RTF 0.010)
  • NeMo Neural follows at RTF 0.020, about 50x faster than real time
  • All three models run far faster than real time (average RTF of 0.027 or lower)
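RTF (real-time factor) is processing time divided by audio duration, so its inverse is the speedup over real time. A minimal sketch of that arithmetic, using the average RTFs from the TL;DR table; the function names are our own, and actual timings will vary with hardware and audio content.

```python
def real_time_factor(processing_s, audio_s):
    """RTF = processing time / audio duration (lower is faster)."""
    return processing_s / audio_s


def speedup(rtf):
    """How many times faster than real time a given RTF is."""
    return 1.0 / rtf


def processing_time(audio_s, rtf):
    """Expected processing time in seconds for audio of the given length."""
    return audio_s * rtf


# Average RTFs copied from the TL;DR table.
AVG_RTF = {"NeMo Neural (MSDD)": 0.020, "NeMo Clustering": 0.010, "Pyannote 3.1": 0.027}

one_hour_s = 60 * 60
for model, rtf in AVG_RTF.items():
    print(f"{model}: {speedup(rtf):.0f}x real time, "
          f"{processing_time(one_hour_s, rtf):.1f} s per hour of audio")
```

At these averages, an hour of audio takes roughly half a minute to a minute and a half to process on the test GPU.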

4.3 Accuracy vs Speed Trade-off

[Figure: Accuracy vs Speed trade-off visualization]

Major Finding: NeMo Neural achieves the best accuracy (average DER 0.081) at a fast RTF of 0.020, making it the strongest choice for most use cases.

5. Results by Scenario

5.1 Long Audio (10 minutes)

NeMo Neural DER by Language:
  • EN: 0.019 (Excellent)
  • JA: 0.157 (8.3x harder than English)
  • KO: 0.046
  • VI: 0.037
  • ZH: 0.053
  • Average: 0.062
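The average and the "x harder than English" figures follow directly from the per-language DERs. The sketch below recomputes them; "harder" is interpreted here as the ratio of a language's DER to the English DER, which matches the 8.3x figure quoted for Japanese.

```python
from statistics import mean

# NeMo Neural DER on the 10-minute scenario, copied from the list above.
DER_10MIN = {"EN": 0.019, "JA": 0.157, "KO": 0.046, "VI": 0.037, "ZH": 0.053}

average_der = mean(DER_10MIN.values())

# Difficulty of each language expressed as a multiple of the English DER.
ratio_to_english = {lang: der / DER_10MIN["EN"] for lang, der in DER_10MIN.items()}

print(f"Average DER: {average_der:.3f}")
for lang, ratio in sorted(ratio_to_english.items(), key=lambda item: item[1]):
    print(f"{lang}: {ratio:.1f}x English")
```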

5.2 Very Long Audio (30 minutes)

Critical Discovery - Japanese Benefits from Longer Context:
  • 10min audio: DER 0.157 (8.3x harder than English)
  • 30min audio: DER 0.067 (2.9x harder than English)
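Our working explanation is that extended duration gives the models more acoustic context per speaker, which appears to matter most for a pitch-accent language; this remains a hypothesis rather than a confirmed cause.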

5.3 High Overlap (40%)

  • NeMo Neural and Clustering perform virtually identically (DER: 0.114 vs 0.115)
  • Pyannote struggles more (DER: 0.202, ~77% higher than NeMo Neural)
  • Japanese remains the hardest language (DER: 0.232)

6. Language-Specific Analysis

[Figure: Overall language difficulty ranking]

Key Observations:
  • Japanese is consistently the hardest language (5.0x the English DER on average)
  • English is the easiest (DER: 0.037)
  • Vietnamese is a close second (only 1.1x the English DER)

Why Japanese is Difficult

[Figure: Japanese performance across different audio lengths]

Hypotheses:
  1. Pitch-accent language: pitch carries lexical meaning, which may confuse speaker embeddings
  2. Narrow phonetic inventory: roughly 100 distinct morae, versus thousands of distinct syllables in English
  3. Shorter syllable durations: less temporal context per speaking turn

7. Neural vs Clustering

[Figure: Neural vs Clustering performance comparison]

Key Findings:
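  • Clustering's average DER is about 2 percentage points higher (0.103 vs 0.081, roughly 27% in relative terms)
  • Clustering is 2x faster in processing (RTF 0.010 vs 0.020)
  • The trade-off is modest: half the processing time for about 2 points of DER
Recommendation:
  • Use NeMo Neural for best accuracy
  • Use NeMo Clustering for maximum speed (2x faster, ~27% higher DER)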

8. Multilingual Performance

8.1 The Japanese Effect

[Figure: Multilingual performance with and without Japanese]

Key Insight: Japanese is the primary factor making multilingual diarization difficult.
Configuration             | NeMo Neural DER
--------------------------|----------------
With Japanese (5-lang)    | 0.142
Without Japanese (4-lang) | 0.050

8.2 Error Analysis

[Figure: Error breakdown with vs without Japanese]

Why 4-Language Multilingual Works Well:
  1. More acoustic diversity helps VAD detect speech boundaries
  2. Language changes provide natural segment boundaries
  3. EN, KO, VI, ZH have compatible acoustic features
  4. Japanese’s pitch-accent features cause cross-language speaker confusion

9. Conclusion

Key Takeaways

NeMo Neural is the clear winner:
  • Best accuracy: DER 0.081 average
  • Fast processing: RTF 0.020 (50x faster than real-time)
  • Excellent multilingual without Japanese: DER 0.050
Critical Findings:
  1. Japanese benefits dramatically from longer context (DER 0.157 at 10min vs 0.067 at 30min)
  2. Multilingual with Japanese is challenging (DER 0.142) but manageable
  3. MSDD neural refinement provides a modest benefit over clustering alone (average DER 0.081 vs 0.103, about 21% lower)
  4. All models are fast and production-ready

Recommendations

Use Case                   | Model           | Reason
---------------------------|-----------------|-------------------
Best accuracy              | NeMo Neural     | DER 0.081
Maximum speed              | NeMo Clustering | 2x faster
Long audio (30min-1h)      | NeMo Neural     | Handles complexity
Multilingual (no Japanese) | NeMo Neural     | DER 0.050
Japanese (30min+)          | NeMo Neural     | Context helps
Default Choice: NeMo Neural - best accuracy with fast processing.