
Author

Taishin Maeda - Waseda University

Abstract

This paper evaluates and compares two state-of-the-art open-source speaker diarization frameworks: Pyannote.audio and Nvidia Nemo. The evaluation focuses on Diarization Error Rate (DER), execution time, and GPU resource usage across different audio scenarios. Additionally, a post-processing approach using OpenAI’s GPT-4-Turbo is explored to improve diarization accuracy. Key Results:
  • Nvidia Nemo achieves a DER roughly 9 percentage points lower for 2-speaker scenarios
  • Pyannote.audio performs better for multi-speaker (9+) scenarios
  • GPT-4-Turbo post-processing shows potential but requires audio context integration
  • Real-time speaker diarization web application demonstrated

1. Introduction

What is Speaker Diarization?

Speaker Diarization is the process of segmenting and labeling audio by speaker, answering the question “who spoke when?” in a given recording. It is a crucial conversation-analysis tool when coupled with Automatic Speech Recognition (ASR).
Speaker Diarization Pipeline
The speaker diarization system consists of three stages (a minimal sketch follows the list):
  1. Voice Activity Detection (VAD) - detects the timestamps where speech occurs
  2. Audio Embeddings Model - extracts speaker embeddings from the timestamped segments
  3. Clustering - groups the embeddings to estimate the speaker count and assign labels
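The clustering stage can be illustrated with a small, generic sketch: given per-segment embeddings from the second stage, group them into speakers. scikit-learn's agglomerative clustering is used here purely for illustration; Pyannote.audio and Nvidia Nemo each ship their own clustering, as compared below:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def cluster_speakers(segments, embeddings, num_speakers=None):
    """Group per-segment embeddings into speaker labels.
    segments:     list of (start, end) tuples from the VAD stage
    embeddings:   (n_segments, dim) array from the embedding stage
    num_speakers: pass the known count, or None to estimate it"""
    clustering = AgglomerativeClustering(
        n_clusters=num_speakers,
        distance_threshold=None if num_speakers else 1.0,
        metric="cosine",
        linkage="average",
    )
    labels = clustering.fit_predict(np.asarray(embeddings))
    return [(seg, f"SPEAKER_{label:02d}") for seg, label in zip(segments, labels)]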

Pyannote.audio

Pyannote.audio is an open-source Python toolkit for speaker diarization and speaker embedding based on PyTorch.

Nvidia Nemo

Nvidia Nemo uses a different approach with multi-scale segmentation and a Neural Diarizer (MSDD model) for handling overlapping speech.
Nvidia Nemo Speaker Diarization Pipeline

Multi-scale Segmentation

Nemo addresses the trade-off between speaker identification quality and temporal granularity:
  • Longer segments → Better speaker representations, lower temporal resolution
  • Shorter segments → Lower quality representations, higher temporal resolution
Multi-scale Segmentation
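Rather than picking a single window length, NeMo runs several scales in parallel and fuses them with per-scale weights. A sketch of the relevant fields in the inference config (the values shown are illustrative defaults from NeMo's example YAML, not tuned for this evaluation):

from omegaconf import OmegaConf

config = OmegaConf.load('diar_infer_telephonic.yaml')
params = config.diarizer.speaker_embeddings.parameters
params.window_length_in_sec = [1.5, 1.25, 1.0, 0.75, 0.5]     # segment length per scale (s), longest to shortest
params.shift_length_in_sec = [0.75, 0.625, 0.5, 0.375, 0.25]  # hop size per scale (s)
params.multiscale_weights = [1, 1, 1, 1, 1]                   # fusion weight per scale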

Comparing the Frameworks

Component         | Pyannote.audio          | Nvidia Nemo
VAD               | PyanNet (SincNet-based) | Multilingual MarbleNet
Speaker Embedding | ECAPA-TDNN              | TitaNet Large
Clustering        | Hidden Markov Model     | Multi-scale Clustering (MSDD)

2. Evaluation Method

Diarization Error Rate (DER)

The standard metric for speaker diarization, introduced by NIST in 2000:
DER = (False Alarm + Missed Detection + Confusion) / Total reference speech duration
Where:
  • False Alarm: Speech detected but no speaker present
  • Missed Detection: No speech detected but speaker present
  • Confusion: Speech assigned to wrong speaker
The goal is to minimize DER toward 0, indicating no errors.
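As a concrete example, the metric is a simple ratio over durations. In practice the pyannote.metrics package computes it directly from reference and hypothesis annotations; this helper is just for illustration:

def der(false_alarm, missed_detection, confusion, total_speech):
    """Diarization Error Rate; all arguments are durations in seconds."""
    return (false_alarm + missed_detection + confusion) / total_speech

# Example: 5 s false alarm, 10 s missed speech, 12 s speaker confusion
# over 300 s of reference speech -> DER = 0.09
print(der(5.0, 10.0, 12.0, 300.0))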

RTTM File Format

Rich Transcription Time Marked (RTTM) is the standard format for speaker diarization output:
SPEAKER obama_zach(5min).wav 1 66.32 0.27 <NA> <NA> SPEAKER_01 <NA> <NA>
Key fields: segment start time (66.32), duration (0.27), speaker label (SPEAKER_01)
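A minimal sketch of parsing such a file into (start, end, speaker) tuples, using the field positions shown above (the function name is illustrative):

def load_rttm(path):
    """Read SPEAKER lines from an RTTM file into (start, end, speaker) tuples."""
    segments = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            start, duration = float(fields[3]), float(fields[4])
            segments.append((start, start + duration, fields[7]))
    return segments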

3. Experimental Setup

Datasets

  1. 5-minute audio - Two speakers (Obama-Zach interview), manually annotated using Audacity
  2. 9-minute audio - Nine speakers from VoxConverse dataset with professional ground truth

Hardware

  • GPU: Nvidia GeForce RTX 3090
  • Time measured using Python’s time module
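One way the timings reported below could be collected, wrapping a diarization call with Python's time module (a sketch, not the paper's exact harness):

import time
from functools import wraps

def timed(fn):
    """Report the wall-clock execution time of a diarization call."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = fn(*args, **kwargs)
        print(f"{fn.__name__} took {time.time() - start:.1f}s")
        return result
    return wrapper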

Pyannote.audio Code

from pyannote.audio import Pipeline
import torch

# Load the pretrained diarization pipeline (requires a Hugging Face access token)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="your_auth_token"
)
if torch.cuda.is_available():
    pipeline.to(torch.device("cuda"))

def diarization(audio_path):
    """Run the pipeline and return one RTTM-formatted line per speaker turn."""
    result = pipeline(audio_path)
    rttm = "SPEAKER {file} 1 {start:.2f} {duration:.2f} <NA> <NA> {speaker} <NA> <NA>"
    return [
        rttm.format(file=audio_path, start=turn.start,
                    duration=turn.duration, speaker=speaker)
        for turn, _, speaker in result.itertracks(yield_label=True)
    ]

Nvidia Nemo Code

from nemo.collections.asr.models import NeuralDiarizer
from omegaconf import OmegaConf

# Inference config shipped with NeMo; manifest and output paths are example values
config = OmegaConf.load('diar_infer_telephonic.yaml')
config.diarizer.manifest_filepath = 'input_manifest.json'   # points at the audio file(s)
config.diarizer.out_dir = './diar_output'                   # RTTM predictions are written here
config.diarizer.msdd_model.model_path = 'diar_msdd_telephonic'
config.diarizer.msdd_model.parameters.sigmoid_threshold = [0.7, 1.0]

msdd_model = NeuralDiarizer(cfg=config)
diarization_result = msdd_model.diarize()

4. Results and Discussion

DER Results - Two Speakers (5 min)

Framework                                | DER
Pyannote.audio                           | 0.252
Pyannote.audio (pre-identified speakers) | 0.214
Nvidia Nemo                              | 0.161
Nvidia Nemo (pre-identified speakers)    | 0.161
Nvidia Nemo's DER is approximately 9 percentage points lower than Pyannote.audio's for two-speaker scenarios.

DER Results - Nine Speakers (9 min)

Framework                                | DER
Pyannote.audio                           | 0.083
Pyannote.audio (pre-identified speakers) | 0.098
Nvidia Nemo (pre-identified speakers)    | 0.097
For the multi-speaker scenario, Pyannote.audio's DER is approximately 1.4 percentage points lower than Nvidia Nemo's.

GPT-4 Post-Processing Results

Framework                    | GPT-4-Turbo DER | GPT-3.5 DER
Pyannote (5 min, 2 speakers) | 0.427           | 0.494
Nemo (5 min, 2 speakers)     | 0.179           | 0.544
Pyannote (9 min, 9 speakers) | 0.103           | 0.214
Nemo (9 min, 9 speakers)     | 0.128           | 0.179
GPT-4 post-processing shows higher DER because it lacks direct audio access. Providing speaker timing and audio context could improve results.
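The paper does not spell out the post-processing code here, but the general idea can be sketched as sending the diarization output to the model and asking for corrected labels. The prompt wording, function name, and flow below are illustrative assumptions, not the paper's implementation:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def relabel_with_gpt(rttm_lines, model="gpt-4-turbo"):
    """Ask the model to review RTTM output and return corrected lines (illustrative prompt)."""
    prompt = (
        "Below is speaker diarization output in RTTM format. "
        "Relabel or merge segments that appear to belong to the same speaker "
        "and return only the corrected RTTM lines.\n\n" + "\n".join(rttm_lines)
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.splitlines()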

Execution Time Performance

Framework                 | 5-min Audio | 9-min Audio
Pyannote.audio            | 31.3 s      | 44.5 s
Pyannote (pre-identified) | 29.8 s      | 41.5 s
Nvidia Nemo               | 63.9 s      | -
Nemo (pre-identified)     | 49.9 s      | 108.2 s
Nvidia Nemo takes approximately double the execution time compared to Pyannote.audio.
GPU Usage for Pyannote.audio
GPU Usage for Nvidia Nemo

5. Real-time Application

A real-time speaker diarization web application was developed using:
  • WebSockets for streaming audio
  • FastAPI for the backend
  • Pyannote.audio for diarization

Key Implementation Details

The application uses 3-second audio chunks instead of 30-second chunks for better real-time performance:
import torch
from pyannote.audio import Inference, Model, Pipeline

# config is the application's settings object holding the Hugging Face token
class PyannoteService:
    def __init__(self):
        # Diarization pipeline applied to each incoming audio chunk
        self.pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1",
            use_auth_token=config.hugging_face.token,
        )
        self.pipeline.to(torch.device("cuda"))

        # Embedding model used to match speakers across chunks
        self.embedding_model = Model.from_pretrained(
            "pyannote/embedding",
            use_auth_token=config.hugging_face.token
        )
        self.embedding_inference = Inference(
            self.embedding_model, window="whole"
        )
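A simplified sketch of the streaming endpoint is shown below, assuming the client sends raw 16 kHz 16-bit PCM and the server accumulates 3-second chunks before calling the service above. The endpoint path, sample rate, and per-connection model loading are illustrative assumptions, not details from the paper:

import numpy as np
import torch
from fastapi import FastAPI, WebSocket

app = FastAPI()
SAMPLE_RATE = 16_000   # assumed client sample rate
CHUNK_SECONDS = 3      # chunk length used by the application

@app.websocket("/ws/diarize")
async def diarize_stream(websocket: WebSocket):
    await websocket.accept()
    service = PyannoteService()   # loaded once per connection for simplicity
    buffer = np.empty(0, dtype=np.float32)
    while True:
        data = await websocket.receive_bytes()   # 16-bit PCM frames from the client
        pcm = np.frombuffer(data, dtype=np.int16).astype(np.float32) / 32768.0
        buffer = np.concatenate([buffer, pcm])
        if len(buffer) >= SAMPLE_RATE * CHUNK_SECONDS:
            n = SAMPLE_RATE * CHUNK_SECONDS
            chunk, buffer = buffer[:n], buffer[n:]
            waveform = torch.from_numpy(chunk).unsqueeze(0)   # (channel, time)
            result = service.pipeline({"waveform": waveform, "sample_rate": SAMPLE_RATE})
            await websocket.send_json([
                {"start": turn.start, "end": turn.end, "speaker": speaker}
                for turn, _, speaker in result.itertracks(yield_label=True)
            ])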

Results Comparison

Before Chunk Logic Modification
After Chunk Logic Modification
The modified chunk logic significantly reduced timing errors and provided smoother speaker transitions.

6. Conclusion

Key Findings

  1. Nvidia Nemo excels in shorter audio with fewer speakers (DER: 0.161 vs 0.252)
  2. Pyannote.audio performs better with more speakers and when speaker count is pre-identified
  3. GPT-4 post-processing shows potential but needs audio context integration
  4. Execution time: Pyannote.audio is approximately 2x faster
  5. Real-time application: Modified chunk logic improves accuracy

Future Work

  1. Adjust Nvidia Nemo models for non-telephonic scenarios
  2. Integrate audio context into GPT post-processing
  3. Fine-tune speaker identification thresholds for real-time applications
  4. Explore domain-specific LLMs trained for diarization tasks

7. References

  1. NIST Rich Transcription Evaluation (2022)
  2. Nvidia NeMo Documentation - Speaker Diarization
  3. Pyannote.audio GitHub Repository
  4. OpenAI GPT-4 Turbo Documentation
  5. VoxConverse Speaker Diarization Dataset

Acknowledgments

  • Akinori Nakajima - Representative Director of VoicePing Corporation
  • Melnikov Ivan - AI Developer of VoicePing Corporation