Automatic Speech Recognition (ASR), also known as Speech-to-Text (STT), uses deep learning to transcribe speech audio into text. In the field of deep learning and artificial intelligence, Large Language Models (LLMs) mimic how the human brain processes words and phrases, and can understand and generate text. LLMs usually contain millions of parameters and are pre-trained on various kinds of datasets. Specifically, an ASR LLM converts audio inputs into the required input format through feature extraction and tokenization.

To customize an ASR LLM for the desired performance, fine-tuning procedures for Whisper, an ASR LLM developed by OpenAI, were first tested on Google Colaboratory. Larger models were then deployed in GPU-equipped Windows environments to speed up training and work around GPU availability and usage limits on Colab and macOS. Audio data were checked for reliability based on information such as audio quality and transcript accuracy. Models were then improved and optimized through data preprocessing and hyperparameter tuning. When GPU memory issues could not be resolved by regular fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) with Low-Rank Adaptation (LoRA) was used to freeze most parameters and save memory without sacrificing much performance. Results were visualized along with loss curves to verify the fit and optimization of the fine-tuning process.

The possibility of multi-speaker support in Whisper was explored using neural speaker diarization. Integration with Pyannote was implemented through its pipeline, and WhisperX, a project built on similar ideas with the extra features of word-level timestamps and Voice Activity Detection (VAD), was also evaluated. WhisperX was tested on long-form transcription with batching as well as diarization.

Besides Whisper, other models with ASR functionality were installed and compared against the Whisper baseline, including Massively Multilingual Speech (MMS) by Meta AI Research, PaddleSpeech by PaddlePaddle, SpeechBrain, and ESPnet. Chinese datasets were used to compare these models on the CER metric. In addition, Custom Speech in Azure AI, which supports real-time STT, was introduced for performance comparison (mainly Mandarin Chinese), so that a choice can be made between trained Azure models and loadable models like Whisper for deployment.
Google Colaboratory (Colab) is a hosted Jupyter Notebook service that provides limited free GPU and TPU computing resources. In Colab, the ipynb format is used to edit and execute Python scripts. Log in to Google Colab with a Google account, share written scripts with others via the "Share" button at the top right of the page, and optionally authorize Colab with a GitHub account.

How to set up environments on Colab:
Select Runtime → Change runtime type to enable a GPU
Use pip or other package installers to install necessary dependencies
Besides Colab, environments can also be prepared on a local PC. Anaconda is a well-known distribution platform for data science, including data analysis and building machine learning models in Python. It includes Conda, an environment and package manager that helps manage open-source Python packages and libraries.

How to set up environments with Anaconda:
Visual Studio Code (VS Code) is a powerful source-code editor for Windows, macOS, and Linux that supports editing in many programming languages. It supports multiple tasks, including debugging, execution in integrated terminals, extending functionality with extensions, and version control through embedded Git.

How to set up environments in VS Code:
Open the folder(s) on the left side under EXPLORER and create files inside the folder
On the bottom right, select the required environment. Execute Python scripts either in the interactive window at the top right (with an IPython kernel installed) or by running Python files from the terminal:
python xxx.py
An alternative way is to use the ipynb extension (Jupyter Notebook)
Source control is handled through the Git icon on the left panel
VS Code needs to be reloaded if packages in the environment are updated
Compute Unified Device Architecture (CUDA) is a parallel computing platform and Application Programming Interface (API) developed by NVIDIA. It allows developers to use NVIDIA Graphics Processing Units (GPUs) for general-purpose computing tasks.

How to use a CUDA GPU:
Install the CUDA Toolkit, which includes necessary libraries, tools, and drivers for developing and running CUDA applications
Check GPU and driver information in Command Prompt with the command:
nvidia-smi
After setting up the CUDA Toolkit, download a GPU-compatible PyTorch build from the PyTorch website.
If a previous PyTorch version is needed, check the commands on the Previous PyTorch Versions page to avoid compatibility issues.
Version check can be performed directly through Python:
import torch

print(f'CUDA availability on PyTorch is {torch.cuda.is_available()}')
print(f'Current PyTorch version is {torch.__version__}')
print(f'Current CUDA version is {torch.version.cuda}')
print(f'cuDNN version is {torch.backends.cudnn.version()}')
print(f'The number of available GPU devices is {torch.cuda.device_count()}')

# Use CUDA on the device
device = torch.device("cuda")
Hugging Face is a company and an open-source platform dedicated to Natural Language Processing (NLP) and Artificial Intelligence.
Create a Hugging Face account to use published models or upload customized models. Personal READ and WRITE tokens can be created at https://huggingface.co/settings/tokens.

Common ASR LLMs and their relevant information:
| Model | Checkpoint / # Params | Languages | Task | Structure |
| --- | --- | --- | --- | --- |
| OpenAI Whisper | large-v2 (1550M) | Most languages | Multitask | Transformer encoder-decoder (regularized) |
| OpenAI Whisper | large (1550M) | Most languages | Multitask | Transformer encoder-decoder |
| OpenAI Whisper | medium (769M) | Most languages | Multitask | Transformer encoder-decoder |
| OpenAI Whisper | small (244M) | Most languages | Multitask | Transformer encoder-decoder |
| guillaumekln faster-whisper | large-v2 | Most languages | Multitask | CTranslate2 |
| facebook wav2vec2 | large-960h-lv60-self | English | Transcription | Wav2Vec2 CTC decoder |
| facebook wav2vec2 | base-960h (94.4M) | English | Transcription | Wav2Vec2 CTC decoder |
| facebook mms | 1b-all (965M) | Most languages | Multitask | Wav2Vec2 CTC decoder |
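As a quick sanity check, any of the Hugging Face checkpoints above can be loaded for zero-shot inference with the transformers pipeline. A minimal sketch; the model id and audio file name below are placeholders:

from transformers import pipeline

# Load one of the checkpoints listed above (placeholder model id)
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Transcribe a local audio file (placeholder file name)
print(asr("sample.wav")["text"])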
Common audio datasets:
| Dataset | # Hours / Size | Languages |
| --- | --- | --- |
| mozilla-foundation common_voice_13_0 | 17,689 validated hrs | 108 languages |
| google fleurs | ~12 hrs per language | 102 languages |
| LIUM tedlium | 118 to 452 hrs across 3 releases | English |
| librispeech_asr | ~1,000 hrs | English |
| speechcolab gigaspeech | 10,000 hrs | English |
| PolyAI minds14 | 8.17k rows | 14 languages |

Note: PolyAI/minds14 is primarily for the intent-detection task and is not ideal for ASR purposes.
Open SLR is another useful website that hosts speech and language resources as compressed files. Various audio datasets, along with brief summaries, can be found in its Resources tab.

Chinese audio datasets for ASR purposes:
Whisper is an ASR (Automatic Speech Recognition) system released by OpenAI in September 2022. It was trained on 680,000 hours of multilingual and multitask supervised data, enabling transcription and translation across many languages. The architecture is an encoder-decoder Transformer. Input audio is chunked into 30-second segments and converted into a log-Mel spectrogram, which maps frequencies onto the Mel scale, before being passed to the encoder.

Resources:
Step 1: Login through Hugging Face token to enable datasets download
from huggingface_hub import notebook_login

notebook_login()
Step 2: Load the desired dataset(s) with load_dataset from the datasets library (a sketch follows this step)
Sometimes permissions for access to certain datasets are needed on Hugging Face
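A minimal sketch for this step, assuming Common Voice 13.0 Mandarin ("zh-CN") as the target dataset (access must first be granted on Hugging Face); adjust the dataset id, configuration, and splits as needed:

from datasets import DatasetDict, load_dataset

common_voice = DatasetDict()
# Combine train and validation splits for training, keep test for evaluation
common_voice["train"] = load_dataset("mozilla-foundation/common_voice_13_0", "zh-CN", split="train+validation")
common_voice["test"] = load_dataset("mozilla-foundation/common_voice_13_0", "zh-CN", split="test")
print(common_voice)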
Step 3: Preprocess the datasets to feed into Whisper (a combined preprocessing sketch follows this step):
Manipulate columns: e.g. remove_columns, cast_column
Normalize transcripts: e.g. upper/lowercase, punctuation, special tokens
Change the sampling rate to 16 kHz using Audio from the Datasets library
Load the pre-trained feature extractor and tokenizer from the transformers library
from transformers import WhisperFeatureExtractor, WhisperTokenizer, WhisperProcessor

feature_extractor = WhisperFeatureExtractor.from_pretrained("model_id")
tokenizer = WhisperTokenizer.from_pretrained("model_id")
processor = WhisperProcessor.from_pretrained("model_id")
AutoProcessor detects processor type automatically
In the tokenizer, the target language and task are usually specified:
language="lang", task="transcribe"  # or "translate"
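A combined preprocessing sketch covering the bullets above, assuming the Common Voice column names audio and sentence and the common_voice, feature_extractor, and tokenizer objects from the previous steps:

from datasets import Audio

# Resample the audio column to 16 kHz
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))

def prepare_dataset(batch):
    audio = batch["audio"]
    # Compute log-Mel input features from the raw waveform
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # Encode the (normalized) transcript into label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice["train"].column_names)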
Step 4: Define a data collator for sequence-to-sequence training with label padding, as sketched below
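A sketch of the collator following the standard Hugging Face Whisper fine-tuning recipe; it assumes the processor object loaded earlier:

import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # Pad the log-Mel input features into a uniform batch
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # Pad the label ids and mask padding with -100 so it is ignored by the loss
        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # Drop the BOS token if it was added during tokenization; generation re-adds it
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)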
Step 7: Load the model for conditional generation and configure it
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("model_id")
model.config.forced_decoder_ids = None
model.config.suppress_tokens = []
Step 8: Define hyperparameters in Seq2SeqTrainingArguments
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="kawadlc/whisperv1",    # own repo name
    per_device_train_batch_size=16,    # batch size per GPU for training
    gradient_accumulation_steps=1,     # increase by 2x for every 2x decrease in batch size
    learning_rate=1e-5,                # important param for handling overfitting and underfitting
    weight_decay=1e-2,                 # regularization mechanism
    warmup_steps=200,                  # improves early training stability
    max_steps=3000,                    # total optimization steps
    gradient_checkpointing=True,       # saves memory
    evaluation_strategy="steps",       # evaluation strategy; alternative: "epoch"
    fp16=True,                         # half-precision floating point format
    per_device_eval_batch_size=8,      # batch size per GPU for evaluation
    predict_with_generate=True,        # generate predictions during evaluation
    generation_max_length=200,         # max number of tokens for autoregressive generation
    eval_steps=500,                    # number of steps between evaluations
    report_to=["tensorboard"],         # save training logs to TensorBoard
    load_best_model_at_end=True,       # load the best model at the end of training
    metric_for_best_model="wer",       # metric used to select the best model
    greater_is_better=False,           # lower WER is better
    push_to_hub=False,                 # push to the Hub, optional
)
import string

# Lowercase texts and strip punctuation except apostrophes
text = [s.lower() for s in text]
punctuation_without_apostrophe = string.punctuation.replace("'", "")
translator = str.maketrans('', '', punctuation_without_apostrophe)
text = [s.translate(translator) for s in text]
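The normalization snippet above is typically applied inside a compute_metrics function when scoring English WER. A sketch of the metric and trainer setup, assuming the evaluate library and the tokenizer, processor, model, training_args, data_collator, and common_voice objects from the previous steps:

import evaluate
from transformers import Seq2SeqTrainer

wer_metric = evaluate.load("wer")

def compute_metrics(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    # Restore the padding token before decoding
    label_ids[label_ids == -100] = tokenizer.pad_token_id
    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    # Optional: apply the lowercasing/punctuation normalization shown above to both lists
    wer = 100 * wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

trainer = Seq2SeqTrainer(
    args=training_args,
    model=model,
    train_dataset=common_voice["train"],
    eval_dataset=common_voice["test"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=processor.feature_extractor,
)
trainer.train()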
Parameter-Efficient Fine-Tuning (PEFT) approaches fine-tune only a small number of model parameters while freezing most parameters of the pre-trained LLM, greatly decreasing computational and storage costs. LoRA (Low-Rank Adaptation) decomposes the weight updates of pre-trained models into low-rank matrices, significantly reducing the number of parameters that need to be fine-tuned.
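A minimal LoRA configuration sketch with the peft library, applied to the Whisper model loaded earlier; the rank, alpha, and target modules below are illustrative values that need tuning:

from peft import LoraConfig, get_peft_model

# Hypothetical LoRA settings; r, lora_alpha and target_modules require experimentation
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # attention projection layers
    lora_dropout=0.05,
    bias="none",
)

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # reports the small trainable fraction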
Speaker Diarization involves segmenting speech audio into distinct segments corresponding to different speakers. The goal is to identify and differentiate individual speakers in an audio stream.
Pyannote.audio is an open-source toolkit for speaker diarization, voice activity detection, and speech turn segmentation.

How to use Pyannote.audio with Whisper:
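One simple way to combine the two, sketched below: run Pyannote diarization and Whisper transcription separately, then assign each Whisper segment the speaker whose turn overlaps its midpoint. The pipeline id, access token, and file name are placeholders, and the matching rule is deliberately naive:

import whisper
from pyannote.audio import Pipeline

# Diarization pipeline from Hugging Face (requires an access token); model id may differ by version
diarization_pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization", use_auth_token="hf_xxx")
diarization = diarization_pipeline("meeting.wav")  # placeholder audio file

# Whisper transcription with segment-level timestamps
asr_model = whisper.load_model("small")
result = asr_model.transcribe("meeting.wav")

# Assign each Whisper segment the speaker whose turn contains its midpoint
for segment in result["segments"]:
    midpoint = (segment["start"] + segment["end"]) / 2
    speaker = "unknown"
    for turn, _, label in diarization.itertracks(yield_label=True):
        if turn.start <= midpoint <= turn.end:
            speaker = label
            break
    print(f'[{speaker}] {segment["text"].strip()}')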
WhisperX integrates Whisper, a phoneme-based model (Wav2Vec2), and Pyannote.audio. It claims up to 70x real-time transcription speed with Whisper large-v2, adding word-level timestamps and speaker diarization with a VAD feature.
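A usage sketch following the pattern documented in the WhisperX README (the API has changed between releases, so function names and arguments may need adjusting); the audio file and token are placeholders:

import whisperx

device = "cuda"
audio = whisperx.load_audio("meeting.wav")  # placeholder file

# 1. Batched transcription with the faster-whisper backend
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Word-level alignment with a phoneme (Wav2Vec2) model
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Diarization and speaker assignment via Pyannote
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_xxx", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)
print(result["segments"])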
PaddleSpeech is a Chinese open-source toolkit on the PaddlePaddle platform. Available architectures include DeepSpeech2, Conformer, and U2 (Unified Streaming and Non-streaming). See the feature list for details.
pip install pytest-runner
pip install paddlespeech
import paddle
from paddlespeech.cli.asr.infer import ASRExecutor

asr = ASRExecutor()
transcript = []
for audio_file in wav_files:
    result = asr(
        model='conformer_wenetspeech',
        lang='zh',
        sample_rate=16000,
        audio_file=audio_file,
        device=paddle.get_device()
    )
    transcript.append(result)
Key finding: for Chinese inference, PaddleSpeech performed better than Whisper, while Meta MMS Chinese transcription results were worse than Whisper's.
Azure AI Speech Services is a collection of cloud-based speech-related services offered by Microsoft Azure. Custom Speech Projects in Speech Studio can be created in different languages.
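For a quick comparison against loadable models, short audio files can be recognized with the Azure Speech SDK for Python. A minimal sketch; the key, region, endpoint id, and file name are placeholders, and endpoint_id points the request at a deployed Custom Speech model:

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
speech_config.speech_recognition_language = "zh-CN"
# Use a deployed Custom Speech model instead of the base model
speech_config.endpoint_id = "YOUR_CUSTOM_SPEECH_ENDPOINT_ID"

audio_config = speechsdk.audio.AudioConfig(filename="sample.wav")
recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)

result = recognizer.recognize_once()
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)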
Data sources: Chinese sources with high transcript quality are much less available than English sources.
Hardware limitations: Multi-GPU training or more advanced GPUs (NVIDIA 40 series) could help achieve better results with larger models.
LoRA configurations: Effects of different LoRA parameters on PEFT model performance could be explored further.
Speaker Diarization: while the Pyannote.audio integration with Whisper shows potential, its current diarization accuracy in multi-speaker meeting scenarios is still insufficient.
Azure Speech Services: maintain high audio quality and word-level transcript accuracy. Filtering out low-quality training audio files can improve model performance.