Author: Linchuan Du
Affiliation: Department of Mathematics, The University of British Columbia
Date: August 2023

Abstract

Automatic Speech Recognition (ASR), also known as Speech-to-Text (STT), uses deep learning to transcribe speech audio into text. Within deep learning, Large Language Models (LLMs) process words and phrases in ways loosely analogous to the human brain and can understand and generate text. Such models typically contain millions to billions of parameters and are pre-trained on diverse datasets; an ASR model converts audio inputs into the required input format through feature extraction and tokenization. To customize an ASR model with the desired performance, fine-tuning procedures for Whisper, an ASR model developed by OpenAI, were first tested on Google Colaboratory. Larger models were then deployed in GPU-equipped Windows environments to speed up training and to work around GPU availability limits on Colab and macOS. Audio data were screened for reliability based on information such as audio quality and transcript accuracy, and models were then improved and optimized through data preprocessing and hyperparameter tuning. When regular fine-tuning could not resolve GPU memory issues, Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) was used to freeze most parameters and reduce memory allocation without sacrificing much performance. Results were visualized alongside loss curves to verify the fit and optimization of the fine-tuning processes. The possibility of multi-speaker support in Whisper was explored using neural speaker diarization: integration with Pyannote was implemented through a pipeline, and WhisperX, a related project that adds word-level timestamps and Voice Activity Detection (VAD), was tested on long-form transcription with batching as well as diarization. Besides Whisper, other ASR models were installed and compared against the Whisper baseline, including Massively Multilingual Speech (MMS) by Meta AI Research, PaddleSpeech by PaddlePaddle, SpeechBrain, and ESPnet; Chinese datasets were used to compare these models on the CER metric. In addition, Custom Speech in Azure AI, which supports real-time STT, was introduced for performance comparison (mainly on Mandarin Chinese), informing the choice between trained Azure models and loadable models such as Whisper for deployment.

Overview

1. Preparing Environment

a. Google Colaboratory

Google Colaboratory is a hosted Jupyter Notebook service that provides limited free GPU and TPU computing resources. In Colab, notebooks (.ipynb files) are used to edit and execute Python scripts. Log in to Colab with a Google account, share notebooks with others via the “Share” button at the top right of the page, and optionally authorize Colab with a GitHub account. How to set up environments on Colab:
  1. Select Runtime → Change runtime type and enable a GPU
  2. Use pip or other package installers to install necessary dependencies
!pip install packageName
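For the fine-tuning workflow in this report, a typical first cell would install the core libraries used in later sections; the exact package list below is an assumption and can be adjusted:
!pip install transformers datasets evaluate jiwer accelerate soundfile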

b. Anaconda

Besides Colab, environments can also be prepared on a local PC. Anaconda is a well-known distribution platform for the data science field, covering data analysis and building machine learning models in Python. It includes Conda, an environment and package manager that helps manage open-source Python packages and libraries. How to set up environments with Anaconda:
  1. Install Anaconda from the official download page (Free Download | Anaconda) and add it to the PATH environment variable
  2. Open Command Prompt, which starts in the base environment, e.g. (Windows):
(base) C:\Users\username>
  3. Create a new Conda environment with a chosen name:
conda create --name myenv
  4. Activate a specific Conda environment whenever it is needed, and return to the base environment with deactivate:
conda activate myenv
conda deactivate
  5. Install dependencies through PyPI or the Conda package manager:
pip install packageName>=0.0.1
conda install packageName

c. Visual Studio Code

Visual Studio Code, or VS Code, is a powerful source-code editor for Windows, macOS and Linux that supports many programming languages. It handles multiple tasks, including debugging, executing in integrated terminals, extending functionality through extensions, and version control with embedded Git. How to set up environments in VS Code:
  1. Open the folder(s) on the left side under EXPLORER and create files inside the folder
  2. At the bottom right, select the environment needed. Execute Python scripts either in the interactive window (with the IPython kernel installed) or by running Python files from the terminal:
python xxx.py
  3. An alternative is to use the ipynb extension (Jupyter Notebook)
  4. The Git icon on the left panel is where source control is managed
VS Code may need to be reloaded after packages in the environment are updated

d. CUDA GPU

Compute Unified Device Architecture (CUDA) is a parallel computing platform and Application Programming Interface (API) developed by NVIDIA. It allows developers to use NVIDIA Graphics Processing Units (GPUs) for multiple computing tasks. How to use CUDA GPU:
  1. Install the CUDA Toolkit, which includes necessary libraries, tools, and drivers for developing and running CUDA applications
  2. Check relevant information in Command Prompt with the command:
nvidia-smi
PyTorch Installation
After setting up the CUDA Toolkit, install a GPU-compatible PyTorch build from the PyTorch website.
When a previous PyTorch version is needed, check the matching commands on the Previous PyTorch Versions page to avoid compatibility issues.
Version check can be performed directly through Python:
import torch

print(f' CUDA availability on PyTorch is {torch.cuda.is_available()}')
print(f' Current PyTorch version is {torch.__version__}')
print(f' Current CUDA version is {torch.version.cuda}')
print(f' cuDNN version is {torch.backends.cudnn.version()}')
print(f' The number of available GPU devices is {torch.cuda.device_count()}')

# Use CUDA on the device
device = torch.device("cuda")
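A slightly more defensive variant falls back to the CPU when CUDA is unavailable and moves tensors (or models) onto the selected device; a minimal sketch:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.randn(2, 3).to(device)   # any tensor or nn.Module can be moved with .to(device)
print(x.device)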

2. Audio Data Source

a. Hugging Face

Hugging Face is a company and an open-source platform dedicated to Natural Language Processing (NLP) and Artificial Intelligence.
Hugging Face Token Settings
Create a Hugging Face account to utilize published models or upload customized models. Personal READ and WRITE tokens can be created on https://huggingface.co/settings/tokens. Common ASR LLMs and their relevant information:
Model | # Params / Size | Languages | Task | Structure
openai/whisper-large-v2 | 1550M | Most languages | Multitask | Transformer encoder-decoder (regularized)
openai/whisper-large | 1550M | Most languages | Multitask | Transformer encoder-decoder
openai/whisper-medium | 769M | Most languages | Multitask | Transformer encoder-decoder
openai/whisper-small | 244M | Most languages | Multitask | Transformer encoder-decoder
guillaumekln/faster-whisper-large-v2 | / | Most languages | Multitask | CTranslate2
facebook/wav2vec2-large-960h-lv60-self | / | English | Transcription | Wav2Vec2 CTC decoder
facebook/wav2vec2-base-960h | 94.4M | English | Transcription | Wav2Vec2 CTC decoder
facebook/mms-1b-all | 965M | Most languages | Multitask | Wav2Vec2 CTC decoder
Common audio datasets:
Dataset | # hours / Size | Languages
mozilla-foundation/common_voice_13_0 | 17,689 validated hrs | 108 languages
google/fleurs | ~12 hrs per language | 102 languages
LIUM/tedlium | 118 to 452 hrs across 3 releases | English
librispeech_asr | ~1,000 hrs | English
speechcolab/gigaspeech | 10,000 hrs | English
PolyAI/minds14 | 8.17k rows | 14 languages
Note that PolyAI/minds14 is primarily an intent-detection dataset and is not ideal for ASR.
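As a minimal sketch of pulling one of these datasets with an access token (the google/fleurs configuration name "hi_in" below is an assumption; gated datasets such as Common Voice additionally require accepting their terms on the Hub):
from huggingface_hub import login
from datasets import load_dataset

login(token="hf_...")  # READ token from https://huggingface.co/settings/tokens
fleurs = load_dataset("google/fleurs", "hi_in", split="test", streaming=True)
print(next(iter(fleurs)))  # inspect one example without downloading the full dataset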

b. Open SLR

Open SLR is another useful website that hosts speech and language resources as compressed archives. The available audio datasets and their brief summaries are listed under the Resources tab. Chinese audio datasets suitable for ASR:
Dataset | # hours (size) | # speakers | Transcript accuracy
Aishell-1 (SLR33) | 178 hrs | 400 | 95+%
Free ST (SLR38) | 100+ hrs | 855 | /
aidatatang_200zh (SLR62) | 200 hrs | 600 | 98+%
MAGICDATA (SLR68) | 755 hrs | 1080 | 98+%
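Open SLR archives are plain HTTP downloads; a minimal sketch for fetching and unpacking AISHELL-1 (SLR33), assuming the standard resource path on openslr.org:
import tarfile
import urllib.request

url = "https://www.openslr.org/resources/33/data_aishell.tgz"  # assumed SLR33 archive path; the file is large (~15 GB)
urllib.request.urlretrieve(url, "data_aishell.tgz")
with tarfile.open("data_aishell.tgz", "r:gz") as tar:
    tar.extractall("aishell")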

3. Whisper Model Fine-tuning

Whisper is an ASR (Automatic Speech Recognition) system released by OpenAI in September 2022. It was trained on 680,000 hours of multilingual and multitask supervised data, enabling transcription and translation across many languages. The architecture is an encoder-decoder Transformer: audio is chunked into 30-second segments, converted into a log-Mel spectrogram (which maps frequencies onto the Mel scale), and then passed into the encoder.

a. Fine-tuning on Colab

Step 1: Log in with a Hugging Face token to enable dataset downloads
from huggingface_hub import notebook_login
notebook_login()
Step 2: Load desired dataset(s) through load_dataset in datasets
Sometimes permissions for access to certain datasets are needed on Hugging Face
Step 3: Preprocess datasets to feed data into Whisper:
  • Manipulate columns: e.g. remove_columns, cast_column
  • Normalize transcript, e.g. upper/lowercase, punctuations, special tokens
  • Change sampling rate to 16k using Audio in Datasets library
  • Load pre-trained feature extractor and tokenizer from transformers library
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor

feature_extractor = WhisperFeatureExtractor.from_pretrained("model_id")
tokenizer = WhisperTokenizer.from_pretrained("model_id")
processor = WhisperProcessor.from_pretrained("model_id")
AutoProcessor can detect the processor type automatically
The tokenizer (and processor) usually take the target language and task:
language="lang", task="transcribe"  # or "translate"
Step 4: Define a sequence-to-sequence data collator with label padding
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels: they have different lengths and need different padding methods
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # pad the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 so these positions are ignored when computing the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a BOS token was prepended during tokenization, cut it here; it is appended again during generation
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels

        return batch
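The collator defined above is then instantiated with the processor and passed to the trainer in Step 9; a one-line sketch:
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)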
Step 5: Import evaluation metrics (WER)
import evaluate
metric = evaluate.load("wer")
For English and most European languages, WER (Word Error Rate) is the common evaluation metric for transcription accuracy.
WER = (Substitutions + Deletions + Insertions) / Total Words in Reference
For example, if the reference “the cat sat down” is transcribed as “the bat sat”, one substitution and one deletion over four reference words give a WER of 50%.
Step 6: Design metrics computation
def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids

    label_ids[label_ids == -100] = tokenizer.pad_token_id

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)

    wer = 100 * metric.compute(predictions=pred_str, references=label_str)

    return {"wer": wer}
Step 7: Load the model for conditional generation and configure it
from transformers import WhisperForConditionalGeneration
model = WhisperForConditionalGeneration.from_pretrained("model_id")

model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
Step 8: Define hyperparameters in Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="kawadlc/whisperv1",          # own repo name
    per_device_train_batch_size=16,          # batch size per GPU for train
    gradient_accumulation_steps=1,           # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,                      # important param to handle overfitting and underfitting issue
    weight_decay=1e-2,                       # mechanism of regularization
    warmup_steps=200,                        # enhance early performances
    max_steps=3000,                          # total optimization step
    gradient_checkpointing=True,             # saving memory
    evaluation_strategy="steps",             # evaluation strategy, others: "epoch"
    fp16=True,                               # half-precision floating point format
    per_device_eval_batch_size=8,            # batch size per GPU for evaluation
    predict_with_generate=True,              # do generation
    generation_max_length=200,               # max num of tokens for autoregressive generation
    eval_steps=500,                          # num of steps per evaluation
    report_to=["tensorboard"],               # save training logs to tensorboard
    load_best_model_at_end=True,             # best model at the end of output
    metric_for_best_model="wer",             # metric of the best at the end of output
    greater_is_better=False,                 # WER lower for better
    push_to_hub=False,                       # push to hub, optional
)
Step 9: Start training with trainer.train()
from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)

trainer.train()
Handling CUDA Out of Memory (OOM) Errors:
  1. First priority: Reduce the batch size, trading longer training time for lower memory usage; combine with gradient accumulation to keep the effective batch size unchanged (see the sketch after this list).
  2. Gradient checkpointing: Trades a small increase in computation time for a significant reduction in memory usage.
  3. Mixed precision training: Reduces the memory footprint significantly while maintaining training stability.
  4. Clear GPU cache:
import gc
import torch

gc.collect()
torch.cuda.empty_cache()
If all of these methods fail, switching to a smaller model size is the last resort; lower model complexity saves GPU memory.
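As a hedged sketch of points 1–3 combined (the values are illustrative only): halving the per-device batch size while doubling gradient accumulation keeps the effective batch size at 16, and checkpointing plus fp16 further reduce memory:
training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-oom-safe",        # hypothetical output directory
    per_device_train_batch_size=8,        # halved from 16
    gradient_accumulation_steps=2,        # doubled to preserve the effective batch size of 16
    gradient_checkpointing=True,          # trade compute for memory
    fp16=True,                            # mixed precision
)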

b. Data Preprocessing

Hugging Face Dataset

Load the dataset using the load_dataset function:
common_voice = DatasetDict()

common_voice["train"] = load_dataset("common_voice", "ja", split="train+validated", use_auth_token=True)
common_voice["validation"] = load_dataset("common_voice", "ja", split="validation", use_auth_token=True)
common_voice["test"] = load_dataset("common_voice", "ja", split="test", use_auth_token=True)

# Create DatasetDict, choose the sample size for training and evaluation
common_voice = DatasetDict({
    "train": common_voice['train'].select(range(3500)),
    "validation": common_voice['validation'].select(range(500)),
    "test": common_voice['test'].select(range(100)),
})

# Remove columns that are not needed for the training
common_voice = common_voice.remove_columns(["age", "client_id", "down_votes", "gender", "path", "up_votes"])
Use “streaming=True” when disk space is limited or when downloading the whole dataset is unnecessary.
Change sampling rate to 16k Hz (required by Whisper architecture):
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

Transcript Cleaning

import re
import string

# lowercase texts and strip punctuation except apostrophes
text = [s.lower() for s in text]
punctuation_without_apostrophe = string.punctuation.replace("'", "")
translator = str.maketrans('', '', punctuation_without_apostrophe)
text = [s.translate(translator) for s in text]
# remove special tokens
def remove_tags(text):
    clean = re.compile('<.*?>')
    return re.sub(clean, '', text)
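These cleaning steps can also be applied to every transcript in a Hugging Face dataset with map; a minimal sketch assuming a Common Voice-style "sentence" column:
def clean_transcript(batch):
    text = batch["sentence"].lower()
    text = text.translate(translator)      # strip punctuation except apostrophes
    batch["sentence"] = remove_tags(text)  # drop special tokens such as <unk>
    return batch

common_voice = common_voice.map(clean_transcript)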

c. Fine-tuned Results

Abbreviations:
  • lr = learning rate, wd = weight decay, ws = warmup steps
  • ms = max steps, #e = number of epochs
  • es = evaluation strategy, ml = max length
  • tbz = train batch size, ebz = eval batch size
  • #ts = train sample size, #es = eval sample size
Dataset / Size / Split | Model / Lang / Task | Hyperparameters | Result
common_voice_11_0, #ts=100, #es=100, train/test | Whisper small, Hindi, Transcribe | lr=1e-5, wd=0, ws=5, ms=40, es=steps, ml=225, tbz=4, ebz=8 | WER: 67.442%
common_voice_11_0, #ts=500, #es=500, train+validation/test | Whisper small, Hindi, Transcribe | lr=1e-5, wd=0, ws=0, ms=60, es=steps, ml=50, tbz=16, ebz=8 | WER: 62.207%
common_voice, #ts=3500, #es=500, train+validated/validation | Whisper small, Japanese, Transcribe | lr=1e-6, wd=0, ws=50, ms=3500, es=steps, ml=200, tbz=16, ebz=8 | WER: 2.4%
librispeech_asr, #ts=750, #es=250, train.100/validation | Whisper medium, English, Transcribe | lr=1e-5, wd=0.01, ws=10, ms=750, es=steps, ml=80, tbz=1, ebz=1 | WER: 13.095%
As Japanese is character-based, a more suitable evaluation metric is Character Error Rate (CER).
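CER is available in the same evaluate library and is computed like WER but at the character level; a minimal sketch:
import evaluate

cer_metric = evaluate.load("cer")
cer = 100 * cer_metric.compute(predictions=pred_str, references=label_str)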

d. PEFT with LoRA

Parameter-Efficient Fine-Tuning (PEFT) approaches fine-tune only a small number of model parameters while freezing most parameters of the pre-trained LLM, greatly decreasing computational and storage costs. LoRA (Low-Rank Adaptation) represents the weight updates of the pre-trained model as products of low-rank matrices, significantly reducing the number of parameters that need to be fine-tuned.
model = WhisperForConditionalGeneration.from_pretrained(
    'openai/whisper-large-v2',
    load_in_8bit=True,
    device_map="auto"
)

from peft import LoraConfig, get_peft_model, prepare_model_for_int8_training

model = prepare_model_for_int8_training(model)

def make_inputs_require_grad(module, input, output):
    output.requires_grad_(True)

model.model.encoder.conv1.register_forward_hook(make_inputs_require_grad)

config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none"
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
PEFT Training Arguments:
training_args = Seq2SeqTrainingArguments(
    output_dir="jackdu/whisper-peft",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=0,
    num_train_epochs=3,
    evaluation_strategy="steps",
    fp16=True,
    per_device_eval_batch_size=8,
    generation_max_length=150,
    logging_steps=100,
    remove_unused_columns=False,  # required for PEFT
    label_names=["labels"],       # required for PEFT
)
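After training, the LoRA adapter can be reloaded on top of the frozen base model for inference; a minimal sketch, assuming the adapter was saved or pushed to the jackdu/whisper-peft repository used above:
from peft import PeftConfig, PeftModel
from transformers import WhisperForConditionalGeneration

peft_config = PeftConfig.from_pretrained("jackdu/whisper-peft")
base_model = WhisperForConditionalGeneration.from_pretrained(
    peft_config.base_model_name_or_path, load_in_8bit=True, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, "jackdu/whisper-peft")
model.eval()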
PEFT Results:
Dataset / Size / Split | Model / Lang / Task | Hyperparameters | Result
common_voice_13_0, #ts=1000, #es=100, train+validation/test | Whisper medium, Japanese, Transcribe | lr=1e-3, wd=0, ws=50, #e=3, es=steps, ml=128, tbz=8, ebz=8 | WER: 73%, NormWER: 70.186%
common_voice_13_0, #ts=100, #es=30, train+validation/test | Whisper large-v2, Vietnamese, Transcribe | lr=1e-4, wd=0.01, ws=0, #e=3, es=steps, ml=150, tbz=8, ebz=8 | WER: 26.577%, NormWER: 22.523%

e. Loss Curves Visualization
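The loss values to plot can be pulled from the trainer's log history after training; a minimal sketch, assuming the Seq2SeqTrainer instance from Step 9 is named trainer:
history = trainer.state.log_history
training_epoch = [e["epoch"] for e in history if "loss" in e]
training_loss = [e["loss"] for e in history if "loss" in e]
evaluation_epoch = [e["epoch"] for e in history if "eval_loss" in e]
evaluation_loss = [e["eval_loss"] for e in history if "eval_loss" in e]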

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))
plt.plot(training_epoch, training_loss, label="Training Loss")
plt.plot(evaluation_epoch, evaluation_loss, label="Evaluation Loss")
plt.xlabel("Training Epochs")
plt.ylabel("Loss")
plt.title("Loss Curves for Whisper Fine-Tuning")
plt.legend()
plt.grid(True)
plt.show()
Key patterns to identify:
  • Overfitting: Low training loss but high validation loss
  • Underfitting: High training and validation loss
  • Smoothness: Smooth curves indicate well-behaved training
  • Loss Plateau: Model struggles to learn further from available data

f. Baseline Results

Dataset / Split / Size | Model / Task | Result
distil-whisper/tedlium-long-form, test | Whisper medium baseline, en→en | WER: 28.418%
distil-whisper/tedlium-long-form, validation | Whisper large-v2 baseline, en→en | WER: 26.671%
librispeech_asr clean, test | Whisper large-v2 baseline, en→en | WER: 4.746%
Aishell S0770, test, #353 | Whisper large-v2 baseline, zh-CN→zh-CN | CER: 8.595%
Aishell S0768, test, #367 | Whisper large-v2 baseline, zh-CN→zh-CN | CER: 12.379%
MagicData 38_5837, test, #585 | Whisper large-v2 baseline, zh-CN→zh-CN | CER: 21.750%

4. Speaker Diarization

Speaker Diarization involves segmenting speech audio into distinct segments corresponding to different speakers. The goal is to identify and differentiate individual speakers in an audio stream.

a. Pyannote.audio

Pyannote-audio is an open-source toolkit for speaker diarization, voice activity detection, and speech turn segmentation. How to use Pyannote.audio with Whisper:
pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip
import glob
import os

import torch
from huggingface_hub import login
from transformers import pipeline
from pyannote.audio import Pipeline, Audio

login(read_token)  # Hugging Face READ token
device = "cuda:0" if torch.cuda.is_available() else "cpu"

sd_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token=True)
wav_files = glob.glob(os.path.join(audio_dirpath, '*.wav'))

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v2",
    chunk_length_s=30,
    device=device,
)
results = []

for audio_file in wav_files:
    diarization = sd_pipeline(audio_file, min_speakers=min_speakers, max_speakers=max_speakers)
    audio = Audio(sample_rate=16000, mono='random')
    for segment, _, speaker in diarization.itertracks(yield_label=True):
        waveform, sample_rate = audio.crop(audio_file, segment)
        text = pipe(
            {"raw": waveform.squeeze().numpy(), "sampling_rate": sample_rate},
            batch_size=8,
            generate_kwargs={"language": "<|zh|>", "task": "transcribe"}
        )["text"]
        results.append({
            'start': segment.start,
            'stop': segment.end,
            'speaker': speaker,
            'text': text
        })

b. WhisperX

WhisperX integrates Whisper, a phoneme-based model (Wav2Vec2) and Pyannote.audio. It claims up to 70x real-time transcription speed with Whisper large-v2, and adds word-level timestamps and speaker diarization with Voice Activity Detection (VAD).
conda install pytorch==2.0.0 torchaudio==2.0.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install git+https://github.com/m-bain/whisperx.git
import whisperx

diarize_model = whisperx.DiarizationPipeline(
    model_name="pyannote/speaker-diarization",
    use_auth_token='hf_token',
    device=device
)

model = whisperx.load_model(
    whisper_arch=model,
    device=device,
    compute_type=compute_type,
    language=language_abbr
)

audio = whisperx.load_audio(matching_file_path)
diarize_segments = diarize_model(matching_file_path, min_speakers=6, max_speakers=6)
result = model.transcribe(audio, batch_size=batch_size)
result = whisperx.assign_word_speakers(diarize_segments, result)
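For word-level timestamps, WhisperX normally runs a phoneme-based alignment step between model.transcribe(...) and whisperx.assign_word_speakers(...); a minimal sketch of that step, with the exact signatures treated as assumptions based on the WhisperX README:
# alignment would run after transcription and before speaker assignment
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)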
Advantages:
  • WhisperX: Multi-speaker scenario, VAD, Extra Phoneme model, Easier for local audios
  • Whisper Pipeline: More languages, Flexible chunk length (≤30s), Easier for HF datasets
WhisperX Results:
Dataset | Model / Task / Compute Type | Result
TED-LIUM release 1 (SLR7), test | WhisperX medium, en→en, int8 | WER: 37.041%
TED-LIUM release 1 (SLR7), test | WhisperX large-v2, en→en, int8 | WER: 36.917%
distil-whisper/tedlium-long-form, validation | WhisperX large-v2, en→en, int8, batch_size=1 | WER: 24.651%
distil-whisper/tedlium-long-form, validation | WhisperX medium, en→en, int8, batch_size=1 | WER: 24.353%
AISHELL-4, selected audio file | WhisperX, manual check | CER: 15.6%~24.658%

5. Other Models

a. Meta MMS

The Massively Multilingual Speech (MMS) project by Meta expands speech technology from around 100 languages to more than 1,100 languages.
import torch
from transformers import Wav2Vec2ForCTC, AutoProcessor

model_id = "facebook/mms-1b-all"
target_lang = "cmn-script_simplified"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
processor = AutoProcessor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

processor.tokenizer.set_target_lang(target_lang)
model.load_adapter(target_lang)
model = model.to(device)
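Inference then follows the standard Wav2Vec2 CTC pattern: feed a 16 kHz waveform through the processor, take the argmax over the logits, and decode. A minimal sketch, assuming audio_array holds a 16 kHz mono NumPy array:
inputs = processor(audio_array, sampling_rate=16000, return_tensors="pt").to(device)

with torch.no_grad():
    logits = model(**inputs).logits

ids = torch.argmax(logits, dim=-1)[0]
transcription = processor.decode(ids)
print(transcription)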

b. PaddleSpeech

PaddleSpeech is a Chinese open-source toolkit on the PaddlePaddle platform. Available architectures include DeepSpeech2, Conformer, and U2 (Unified Streaming and Non-streaming). See the feature list for details.
pip install pytest-runner
pip install paddlespeech
import paddle
from paddlespeech.cli.asr.infer import ASRExecutor

asr = ASRExecutor()
transcript = []

for audio_file in wav_files:  # list of 16 kHz WAV file paths
    result = asr(
        model='conformer_wenetspeech',
        lang='zh',
        sample_rate=16000,
        audio_file=audio_file,
        device=paddle.get_device()
    )
    transcript.append(result)
ASR training tutorial on Linux: asr1

c. SpeechBrain

SpeechBrain is an open-source conversational AI toolkit developed by the University of Montreal.
pip install speechbrain
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-transformer-aishell",
    savedir="pretrained_models/asr-transformer-aishell",
    run_opts={"device": "cuda"}
)
result = asr_model.transcribe_file(audio_file)

d. ESPnet

ESPnet is an end-to-end speech processing toolkit covering speech recognition, text-to-speech, speech translation, and speaker diarization.
pip install espnet_model_zoo
from espnet2.bin.asr_inference import Speech2Text

speech2text = Speech2Text.from_pretrained(
    model_id,
    maxlenratio=0.0,
    minlenratio=0.0,
    beam_size=20,
    ctc_weight=0.3,
    lm_weight=0.5,
    penalty=0.0,
    nbest=1
)
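The returned Speech2Text object is called directly on a waveform and yields n-best hypotheses; a minimal sketch, assuming a 16 kHz mono file readable by soundfile:
import soundfile as sf

speech, sample_rate = sf.read(audio_file)
nbests = speech2text(speech)
text, tokens, token_ids, hypothesis = nbests[0]  # best hypothesis
print(text)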

e. Baseline Results Comparison

English:
Dataset | Model / Method | WER
librispeech_asr clean | Meta MMS mms-1b-all | 4.331%
common_voice_13_0 #1000 | Meta MMS mms-1b-all | 23.963%
Chinese:
Dataset | Model / Method | CER
Aishell S0770 #353 | PaddleSpeech default (conformer_u2pp_online_wenetspeech) | 4.062%
Aishell S0768 #367 | SpeechBrain wav2vec2-transformer-aishell | 8.436%
Aishell S0768 #367 | Meta MMS mms-1b-all | 34.241%
MagicData 4 speakers #2372 | PaddleSpeech conformer-wenetspeech | 9.79%
MagicData 4 speakers #2372 | SpeechBrain wav2vec2-ctc-aishell | 15.911%
MagicData 4 speakers #2372 | Whisper large-v2 baseline | 24.747%
Key Finding: For Chinese inference, PaddleSpeech had better performance compared to Whisper, while Meta MMS Chinese transcription results were worse than Whisper.

6. Azure Speech Studio

Azure AI Speech Services is a collection of cloud-based speech-related services offered by Microsoft Azure. Custom Speech Projects in Speech Studio can be created in different languages.

a. Upload Datasets

Three methods for uploading training and testing datasets:
  1. Speech Studio (direct upload)
  2. REST API
  3. CLI usage
Azure Blob Storage:
pip install azure-storage-blob

from azure.storage.blob import BlobServiceClient, BlobClient, ContainerClient

def upload_zip_to_azure_blob(account_name, account_key, container_name, local_zip_path, zip_blob_name):
    connection_string = f"DefaultEndpointsProtocol=https;AccountName={account_name};AccountKey={account_key};EndpointSuffix=core.windows.net"
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)

    container_client = blob_service_client.get_container_client(container_name)
    if not container_client.exists():
        container_client.create_container()

    zip_blob_client = container_client.get_blob_client(zip_blob_name)
    with open(local_zip_path, "rb") as zip_file:
        zip_blob_client.upload_blob(zip_file)
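A hedged usage example of the function above (account name, key, and container are placeholders):
upload_zip_to_azure_blob(
    account_name="mystorageaccount",
    account_key="<account-key>",
    container_name="speech-datasets",
    local_zip_path="train_audio.zip",
    zip_blob_name="train_audio.zip",
)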
Audio format requirements:
  • Format: WAV
  • Sampling rate: 8k Hz or 16k Hz
  • Channels: Single channel (mono)
  • Archive: ZIP format, under 2 GB and containing at most 10,000 files
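Audio that does not meet these requirements can be converted beforehand; a minimal sketch using pydub (assumes ffmpeg and pydub are installed):
from pydub import AudioSegment

seg = AudioSegment.from_file("input.mp3")
seg = seg.set_frame_rate(16000).set_channels(1)   # 16 kHz, mono
seg.export("output.wav", format="wav")            # WAV as required by Custom Speech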

b. Train and Deploy Models

pip install azure-cognitiveservices-speech

import os

from azure.cognitiveservices.speech import SpeechConfig, SpeechRecognizer, AudioConfig

# configure the Speech resource key/region; for a deployed Custom Speech model, also set speech_config.endpoint_id
speech_config = SpeechConfig(subscription=speech_key, region=service_region)

predictions = []

for root, _, files in os.walk(wav_base_path):
    for file_name in files:
        if file_name.endswith(".wav"):
            audio_file_path = os.path.join(root, file_name)
            audio_config = AudioConfig(filename=audio_file_path)
            speech_recognizer = SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
            result = speech_recognizer.recognize_once()
            predictions.append(result.text)

c. Azure Results

Test Dataset | Training Datasets | Error Rate (Custom / Baseline)
MagicData 9452 (11:27:39) | Aishell, 12+ hrs | 4.69% / 4.24%
MagicData 9452 (11:27:39) | Aishell + Minds14, 32+ hrs : 1+ hr | 4.67% / 4.23%
MagicData + Aishell + CV13 8721 (11:45:52) | Aishell + CV13, 8+ hrs : 7+ hrs | 2.51% / 3.70%
MagicData + Aishell + CV13 8721 (11:45:52) | Aishell + CV13 + Fleurs, 8+ hrs : 7+ hrs : 9+ hrs | 2.48% / 3.70%
The best Azure model was trained with AISHELL-1, mozilla-foundation/common_voice_13_0 and google/fleurs, resulting in 2.48% error rate.

7. Prospect

Key findings and future directions:
  1. Data sources: Chinese sources with high transcript quality are much less available than English sources.
  2. Hardware limitations: Multi-GPU training or more advanced GPUs (NVIDIA 40 series) could help achieve better results with larger models.
  3. LoRA configurations: Effects of different LoRA parameters on PEFT model performance could be explored further.
  4. Speaker Diarization: While Pyannote.audio with Whisper integration shows potential, current diarizing ability in multi-speaker meeting scenarios is still not sufficient.
  5. Azure Speech Services: Maintain good audio quality and word-level transcript accuracy; filtering out low-quality training audio files can further improve model performance.

8. References

  1. Anaconda, Inc. (2017). Command reference - conda documentation. conda.io/projects/conda/en/latest/commands
  2. OpenAI (2022, September 21). Introducing Whisper. openai.com/research/whisper
  3. Radford, A., Kim, J.W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2022). Robust Speech Recognition via Large-Scale Weak Supervision.
  4. Gandhi, S. (2022, November 3). Fine-Tune Whisper for Multilingual ASR with Transformers. huggingface.co/blog/fine-tune-whisper
  5. The Linux Foundation (2023). Previous PyTorch Versions. pytorch.org/get-started/previous-versions
  6. Hugging Face, Inc. (2023). Hugging Face Documentations. huggingface.co/docs
  7. Srivastav, V. (2023). fast-whisper-finetuning. github.com/Vaibhavs10/fast-whisper-finetuning
  8. Mangrulkar, S., & Paul, S. (2023). Parameter-Efficient Fine-Tuning Using PEFT. huggingface.co/blog/peft
  9. Bredin, H., et al. (2020). pyannote.audio: neural building blocks for speaker diarization. ICASSP 2020.
  10. Bain, M., Huh, J., Han, T., & Zisserman, A. (2023). WhisperX: Time-Accurate Speech Transcription of Long-Form Audio. INTERSPEECH 2023.
  11. Meta AI (2023, May 22). Introducing speech-to-text, text-to-speech, and more for 1,100+ languages. ai.meta.com/blog/multilingual-model-speech-recognition
  12. Pratap, V., et al. (2023). Scaling Speech Technology to 1,000+ Languages. arXiv.
  13. Zhang, H. L. (2022). PaddleSpeech: An Easy-to-Use All-in-One Speech Toolkit. NAACL 2022.
  14. Ravanelli, M., et al. (2021). SpeechBrain: A General-Purpose Speech Toolkit.
  15. Gao, D., et al. (2022). EURO: ESPnet Unsupervised ASR Open-source Toolkit. arXiv:2211.17196.
  16. ESPnet (2021). espnet_model_zoo. github.com/espnet/espnet_model_zoo
  17. Microsoft (2023). Custom Speech overview - Azure AI Services. learn.microsoft.com/en-us/azure/ai-services/speech-service/custom-speech-overview
  18. Microsoft (2023). Speech service documentation. learn.microsoft.com/en-us/azure/ai-services/speech-service/