Author
Taishin Maeda - Waseda University
Abstract
This paper evaluates and compares two state-of-the-art open-source speaker diarization frameworks: Pyannote.audio and Nvidia Nemo. The evaluation focuses on Diarization Error Rate (DER), execution time, and GPU resource usage across different audio scenarios. Additionally, a post-processing approach using OpenAI’s GPT-4-Turbo is explored to improve diarization accuracy.
Key Results:
- Nvidia Nemo achieves a DER roughly 9 percentage points lower for 2-speaker scenarios
- Pyannote.audio performs better for multi-speaker (9+) scenarios
- GPT-4-Turbo post-processing shows potential but requires audio context integration
- Real-time speaker diarization web application demonstrated
1. Introduction
What is Speaker Diarization?
Speaker diarization is the process of segmenting and labeling audio by speaker, answering the question “who spoke when?” in a given recording. It is a crucial tool for conversation analysis, often coupled with Automatic Speech Recognition (ASR).
The speaker diarization system consists of three stages (see the sketch after this list):
- Voice Activity Detection (VAD) - Timestamps where speech occurs
- Audio Embeddings Model - Extract embeddings from timestamped segments
- Clustering - Group embeddings to estimate speaker count
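To make the pipeline concrete, here is a minimal sketch of how the three stages fit together. Only the clustering stage is instantiated, using scikit-learn's AgglomerativeClustering as a stand-in (not necessarily what either framework uses internally); the VAD and embedding outputs are assumed to be given.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_speakers(speech_segments, embeddings, distance_threshold=1.0):
    # speech_segments: [(start, end), ...] produced by VAD
    # embeddings: one vector per segment from the embedding model
    clustering = AgglomerativeClustering(
        n_clusters=None, distance_threshold=distance_threshold
    )
    labels = clustering.fit_predict(np.asarray(embeddings))
    return [(seg, f"SPEAKER_{label:02d}") for seg, label in zip(speech_segments, labels)]

# Toy example: two similar embeddings and one distant one -> two speakers
segments = [(0.0, 2.1), (2.5, 4.0), (4.2, 6.0)]
vectors = [np.ones(16), np.ones(16) * 0.9, -np.ones(16)]
print(cluster_speakers(segments, vectors))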
Pyannote.audio
Pyannote.audio is an open-source Python toolkit for speaker diarization and speaker embedding based on PyTorch.
Nvidia Nemo
Nvidia Nemo uses a different approach with multi-scale segmentation and a Neural Diarizer (MSDD model) for handling overlapping speech.
Multi-scale Segmentation
Nemo addresses the trade-off between speaker identification quality and temporal granularity (illustrated in the sketch after this list):
- Longer segments → Better speaker representations, lower temporal resolution
- Shorter segments → Lower quality representations, higher temporal resolution
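A rough illustration of what multi-scale segmentation looks like in practice: the same speech region is covered by sliding windows at several (window, shift) scales, and the neural diarizer later fuses the per-scale results. The scale values below are illustrative; the actual defaults live in NeMo's inference config.

def multiscale_segments(speech_start, speech_end,
                        scales=((1.5, 0.75), (1.0, 0.5), (0.5, 0.25))):
    # For each (window, shift) scale, slide a window over the speech region
    all_scales = []
    for window, shift in scales:
        t, segments = speech_start, []
        while t + window <= speech_end:
            segments.append((round(t, 2), round(t + window, 2)))
            t += shift
        all_scales.append(segments)
    return all_scales

# Longer windows -> fewer, coarser segments; shorter windows -> finer resolution
for scale in multiscale_segments(0.0, 5.0):
    print(len(scale), scale[:3])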
Comparing the Frameworks
| Component | Pyannote.audio | Nvidia Nemo |
|---|---|---|
| VAD | PyanNet (SincNet-based) | Multilingual MarbleNet |
| Speaker Embedding | ECAPA-TDNN | TitaNet Large |
| Clustering | Hidden Markov Model | Multi-scale Clustering (MSDD) |
2. Evaluation Method
Diarization Error Rate (DER)
The standard metric for speaker diarization, introduced by NIST in 2000:
DER = (False Alarm + Missed Detection + Confusion) / Total speech duration
Where:
- False Alarm: Speech detected but no speaker present
- Missed Detection: No speech detected but speaker present
- Confusion: Speech assigned to wrong speaker
The goal is to minimize DER toward 0, indicating no errors.
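The paper does not state which scorer was used; one common option is pyannote.metrics, which computes DER directly from reference and hypothesis annotations. A small sketch with made-up segments:

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Ground-truth annotation: two speakers over 20 seconds
reference = Annotation()
reference[Segment(0.0, 10.0)] = "alice"
reference[Segment(10.0, 20.0)] = "bob"

# System output with a 2-second boundary error
hypothesis = Annotation()
hypothesis[Segment(0.0, 12.0)] = "SPEAKER_00"
hypothesis[Segment(12.0, 20.0)] = "SPEAKER_01"

metric = DiarizationErrorRate()
print(f"DER = {metric(reference, hypothesis):.3f}")  # 2s confusion / 20s speech = 0.100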
Rich Transcription Time Marked (RTTM) is the standard format for speaker diarization output:
SPEAKER obama_zach(5min).wav 1 66.32 0.27 <NA> <NA> SPEAKER_01 <NA> <NA>
Key fields: segment start time (66.32), duration (0.27), speaker label (SPEAKER_01)
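For scoring, each SPEAKER line can be parsed back into those fields. A hypothetical helper, assuming the standard ten-column whitespace-separated RTTM layout (start time, duration, and speaker label are columns 4, 5, and 8):

def parse_rttm_line(line):
    fields = line.split()
    if not fields or fields[0] != "SPEAKER":
        return None
    # start time, duration, speaker label
    return float(fields[3]), float(fields[4]), fields[7]

line = "SPEAKER obama_zach(5min).wav 1 66.32 0.27 <NA> <NA> SPEAKER_01 <NA> <NA>"
print(parse_rttm_line(line))  # (66.32, 0.27, 'SPEAKER_01')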
3. Experimental Setup
Datasets
- 5-minute audio - Two speakers (Obama-Zach interview), manually annotated using Audacity
- 9-minute audio - Nine speakers from VoxConverse dataset with professional ground truth
Hardware
- GPU: Nvidia GeForce RTX 3090
- Time measured using Python's time module (see the measurement sketch below)
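A sketch of how one run might be measured. The diarization() call refers to the helper defined in the Pyannote.audio code below; the GPU-memory readout via torch.cuda is an assumption, since the paper does not say how GPU resource usage was tracked.

import time
import torch

torch.cuda.reset_peak_memory_stats()            # start tracking peak GPU memory
start = time.time()
result = diarization("obama_zach(5min).wav")    # one diarization run
elapsed = time.time() - start
peak_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"execution time: {elapsed:.1f}s, peak GPU memory: {peak_gb:.2f} GB")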
Pyannote.audio Code
from pyannote.audio import Pipeline
import torch

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your_auth_token"
)

# Run the pipeline on GPU when one is available
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))

def diarization(audio_path):
    # Returns one RTTM-formatted line per speaker turn
    result = pipeline(audio_path)
    rttm = "SPEAKER {file} 1 {start:.2f} {duration:.2f} <NA> <NA> {speaker} <NA> <NA>"
    return [
        rttm.format(file=audio_path, start=turn.start,
                    duration=turn.duration, speaker=speaker)
        for turn, _, speaker in result.itertracks(yield_label=True)
    ]
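The "pre-identified speakers" rows in the results below correspond to runs where the speaker count is supplied in advance; with pyannote this is typically done via the num_speakers argument (min_speakers / max_speakers are available when only a range is known). The output file name below is an example.

# Pre-identified speakers: tell the pipeline how many speakers to expect
result = pipeline("obama_zach(5min).wav", num_speakers=2)

# Save the hypothesis in RTTM format for DER scoring
with open("hypothesis.rttm", "w") as f:
    result.write_rttm(f)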
Nvidia Nemo Code
from omegaconf import OmegaConf
from nemo.collections.asr.models import NeuralDiarizer

# Inference config tuned for telephonic (two-speaker) audio
config = OmegaConf.load('diar_infer_telephonic.yaml')
config.diarizer.msdd_model.model_path = 'diar_msdd_telephonic'
config.diarizer.msdd_model.parameters.sigmoid_threshold = [0.7, 1.0]

msdd_model = NeuralDiarizer(cfg=config)
diarization_result = msdd_model.diarize()  # RTTM output is written under config.diarizer.out_dir
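The config also needs an input manifest describing the audio to diarize, plus an output directory. A minimal sketch, assuming the standard NeMo diarization manifest fields; file paths are placeholders.

import json

# One JSON line per audio file; "label" and "text" are fixed placeholders for inference
entry = {
    "audio_filepath": "obama_zach(5min).wav",
    "offset": 0,
    "duration": None,          # None -> use the whole file
    "label": "infer",
    "text": "-",
    "num_speakers": 2,         # set to None for runs without a pre-identified count
    "rttm_filepath": None,
    "uem_filepath": None,
}
with open("input_manifest.json", "w") as f:
    f.write(json.dumps(entry) + "\n")

config.diarizer.manifest_filepath = "input_manifest.json"
config.diarizer.out_dir = "./nemo_output"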
4. Results and Discussion
DER Results - Two Speakers (5 min)
| Framework | DER |
|---|---|
| Pyannote.audio | 0.252 |
| Pyannote.audio (pre-identified speakers) | 0.214 |
| Nvidia Nemo | 0.161 |
| Nvidia Nemo (pre-identified speakers) | 0.161 |
Nvidia Nemo achieves a DER roughly 9 percentage points lower than Pyannote.audio for the two-speaker scenario (0.161 vs 0.252).
DER Results - Nine Speakers (9 min)
| Framework | DER |
|---|---|
| Pyannote.audio | 0.083 |
| Pyannote.audio (pre-identified speakers) | 0.098 |
| Nvidia Nemo (pre-identified speakers) | 0.097 |
For the multi-speaker scenario, Pyannote.audio achieves a DER roughly 1.4 percentage points lower than Nvidia Nemo (0.083 vs 0.097).
GPT-4 Post-Processing Results
| Framework | GPT-4-Turbo DER | GPT-3.5 DER |
|---|---|---|
| Pyannote (5min, 2 speakers) | 0.427 | 0.494 |
| Nemo (5min, 2 speakers) | 0.179 | 0.544 |
| Pyannote (9min, 9 speakers) | 0.103 | 0.214 |
| Nemo (9min, 9 speakers) | 0.128 | 0.179 |
GPT-4 post-processing shows higher DER because it lacks direct audio access. Providing speaker timing and audio context could improve results.
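The paper does not include the post-processing setup; the sketch below shows one way the diarization output and transcript might be sent to GPT-4-Turbo for text-only correction using the OpenAI Python SDK. The prompt wording and function name are assumptions, and, as noted above, the model sees no audio.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def refine_diarization(rttm_lines, transcript):
    # The model only sees text: RTTM timings/labels plus the transcript
    prompt = (
        "Below is a speaker diarization result in RTTM format and the matching "
        "transcript. Fix speaker labels that are inconsistent with the dialogue "
        "flow and return corrected RTTM lines only.\n\n"
        "RTTM:\n" + "\n".join(rttm_lines) + "\n\nTranscript:\n" + transcript
    )
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content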
Execution Time Results
| Framework | 5-min Audio | 9-min Audio |
|---|---|---|
| Pyannote.audio | 31.3s | 44.5s |
| Pyannote (pre-identified) | 29.8s | 41.5s |
| Nvidia Nemo | 63.9s | - |
| Nemo (pre-identified) | 49.9s | 108.2s |
Nvidia Nemo takes approximately double the execution time compared to Pyannote.audio.
5. Real-time Application
A real-time speaker diarization web application was developed using:
- WebSockets for streaming audio
- FastAPI for the backend
- Pyannote.audio for diarization
Key Implementation Details
The application uses 3-second audio chunks instead of 30-second chunks for better real-time performance:
import torch
from pyannote.audio import Inference, Model, Pipeline

class PyannoteService:
    def __init__(self):
        # Diarization pipeline (segmentation + clustering), moved to the GPU;
        # config is the application's settings object (holds the Hugging Face token)
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=config.hugging_face.token,
        )
        self.pipeline.to(torch.device("cuda"))
        # Standalone embedding model for chunk-level speaker embeddings
        self.embedding_model = Model.from_pretrained(
            "pyannote/embedding",
            use_auth_token=config.hugging_face.token
        )
        # One embedding per whole chunk rather than a sliding window
        self.embedding_inference = Inference(
            self.embedding_model, window="whole"
        )
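A minimal sketch of the streaming side of the application under the 3-second-chunk scheme, using FastAPI WebSockets. The endpoint path, audio format (raw float32 PCM at 16 kHz), and the process_chunk method on PyannoteService are assumptions, not the actual application code.

import numpy as np
from fastapi import FastAPI, WebSocket

app = FastAPI()
pyannote_service = PyannoteService()
SAMPLE_RATE = 16_000
CHUNK_SAMPLES = SAMPLE_RATE * 3     # 3-second chunks

@app.websocket("/diarize")
async def diarize_stream(websocket: WebSocket):
    await websocket.accept()
    buffer = np.zeros(0, dtype=np.float32)
    while True:
        # The client streams raw float32 PCM; accumulate until a full chunk is ready
        data = await websocket.receive_bytes()
        buffer = np.concatenate([buffer, np.frombuffer(data, dtype=np.float32)])
        while len(buffer) >= CHUNK_SAMPLES:
            chunk, buffer = buffer[:CHUNK_SAMPLES], buffer[CHUNK_SAMPLES:]
            result = pyannote_service.process_chunk(chunk, SAMPLE_RATE)  # hypothetical method
            await websocket.send_json(result)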
Results Comparison
The modified chunk logic significantly reduced timing errors and provided smoother speaker transitions.
6. Conclusion
Key Findings
- Nvidia Nemo excels in shorter audio with fewer speakers (DER: 0.161 vs 0.252)
- Pyannote.audio performs better with more speakers and when speaker count is pre-identified
- GPT-4 post-processing shows potential but needs audio context integration
- Execution time: Pyannote.audio is approximately 2x faster
- Real-time application: Modified chunk logic improves accuracy
Future Work
- Adjust Nvidia Nemo models for non-telephonic scenarios
- Integrate audio context into GPT post-processing
- Fine-tune speaker identification thresholds for real-time applications
- Explore domain-specific LLMs trained for diarization tasks
7. References
- NIST Rich Transcription Evaluation (2022)
- Nvidia NeMo Documentation - Speaker Diarization
- Pyannote.audio GitHub Repository
- OpenAI GPT-4 Turbo Documentation
- VoxConverse Speaker Diarization Dataset
Acknowledgments
- Akinori Nakajima - Representative Director of VoicePing Corporation
- Melnikov Ivan - AI Developer of VoicePing Corporation