
Author

Boxuan Lyu - Tokyo Institute of Technology

Abstract

This research presents the development of a fast and natural Text-to-Speech (TTS) system for Mandarin Chinese using the Bert-VITS2 framework. The system is specifically tailored for meeting scenarios, generating clear, expressive, and context-appropriate speech. Key Results:
  • Achieved WER of 0.27 (lowest among compared models)
  • Achieved MOS of 2.90 for speech naturalness
  • Successfully synthesized speech up to 22 seconds
  • Trained on AISHELL-3 dataset (85 hours, 218 speakers)

1. Introduction

What is Text-to-Speech?

Text-to-Speech (TTS) technology converts written text into natural-sounding speech. Modern TTS systems leverage deep learning to generate increasingly natural and expressive speech, with applications in:
  • Intelligent assistants
  • Accessible reading solutions
  • Navigation systems
  • Automated customer service

Why Mandarin?

Mandarin Chinese is the most widely spoken native language in the world, with over a billion speakers. However, it presents unique challenges for TTS due to its tonal nature and complex linguistic structure.

What is Bert-VITS2?

Bert-VITS2 combines pre-trained language models with advanced voice synthesis:
  • BERT integration: Deep understanding of semantic and contextual nuances
  • GAN-style training: Produces highly realistic speech through adversarial learning
  • Based on VITS2: State-of-the-art voice synthesis architecture

2. Methodology

2.1 Dataset Selection

AISHELL-3 was selected for this study:
  • 85 hours of audio
  • 218 speakers
  • ~30 minutes of audio per speaker on average
  • High transcription quality
Initial experiments with Alimeeting (118.75 hours) resulted in blank audio generation due to poor transcription quality and low per-speaker duration.
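Because per-speaker duration was the deciding factor here, a quick corpus audit is useful before training. Below is a minimal sketch (not the project's actual preprocessing code) for an AISHELL-3-style layout of wav/<speaker>/<utterance>.wav; the paths and the 10-minute threshold are illustrative assumptions.

from collections import defaultdict
from pathlib import Path

import soundfile as sf  # pip install soundfile

CORPUS_ROOT = Path("data/AISHELL-3/train/wav")  # assumed location of the corpus
MIN_MINUTES_PER_SPEAKER = 10.0                  # illustrative threshold, not the paper's

def per_speaker_minutes(root: Path) -> dict:
    """Sum wav durations (in minutes) under each speaker directory."""
    totals = defaultdict(float)
    for wav in root.rglob("*.wav"):
        info = sf.info(str(wav))
        totals[wav.parent.name] += info.frames / info.samplerate / 60.0
    return dict(totals)

minutes = per_speaker_minutes(CORPUS_ROOT)
kept = {spk: m for spk, m in minutes.items() if m >= MIN_MINUTES_PER_SPEAKER}
print(f"kept {len(kept)}/{len(minutes)} speakers with >= {MIN_MINUTES_PER_SPEAKER} min of audio")
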
Figure: Data preprocessing WebUI interface.

2.2 Model Architecture

The Bert-VITS2 framework consists of four main components:
Component            Function
TextEncoder          Processes input text with pre-trained BERT for semantic understanding
DurationPredictor    Estimates phoneme durations with stochastic variations
Flow                 Models pitch and energy using normalizing flows
Decoder              Synthesizes the final speech waveform
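To make the data flow concrete, here is a toy sketch of how the four components connect at inference time. The module bodies are single linear layers standing in for the real Bert-VITS2 networks, and the feature dimensions (1024-dimensional BERT features, 192-dimensional hidden states, 80 mel bins) are assumptions for illustration.

import torch
import torch.nn as nn

class TinyBertVits2(nn.Module):
    """Stand-in modules that only illustrate the component wiring, not the real model."""
    def __init__(self, bert_dim=1024, hidden=192, n_mels=80):
        super().__init__()
        self.text_encoder = nn.Linear(bert_dim, hidden)  # TextEncoder: BERT features -> hidden states
        self.duration_predictor = nn.Linear(hidden, 1)   # DurationPredictor: hidden -> log-duration
        self.flow = nn.Linear(hidden, hidden)            # Flow: placeholder for the normalizing flow
        self.decoder = nn.Linear(hidden, n_mels)         # Decoder: placeholder for waveform synthesis

    def forward(self, bert_features):                    # (batch=1, phonemes, bert_dim)
        h = self.text_encoder(bert_features)
        log_dur = self.duration_predictor(h).squeeze(-1)
        dur = torch.clamp(torch.exp(log_dur).round().long(), min=1)
        # Length regulation: repeat each phoneme state by its predicted duration.
        frames = torch.repeat_interleave(h[0], dur[0], dim=0).unsqueeze(0)
        z = self.flow(frames)
        return self.decoder(z), dur

bert_feats = torch.randn(1, 12, 1024)                    # 12 phonemes with assumed BERT features
acoustic, durations = TinyBertVits2()(bert_feats)
print(acoustic.shape, durations.shape)                   # (1, total_frames, 80), (1, 12)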

2.3 Training Process

Loss Functions

  • Reconstruction Loss: Matches generated speech to ground truth
  • Duration Loss: Minimizes phoneme duration prediction error
  • Adversarial Loss: Encourages realistic speech generation
  • Feature Matching Loss: Aligns intermediate features
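A hedged sketch of how these four terms might be combined into a single generator objective is shown below. The loss weights and the LSGAN-style adversarial form are common choices used here for illustration, not necessarily the exact configuration of this project.

import torch
import torch.nn.functional as F

def generator_loss(mel_fake, mel_real, logdur_pred, logdur_true,
                   disc_scores_fake, feats_fake, feats_real,
                   w_mel=45.0, w_dur=1.0, w_fm=2.0):                 # illustrative weights
    recon = F.l1_loss(mel_fake, mel_real)                            # Reconstruction loss
    dur = F.mse_loss(logdur_pred, logdur_true)                       # Duration loss (log domain)
    adv = sum(torch.mean((1 - s) ** 2) for s in disc_scores_fake)    # LSGAN-style adversarial loss
    fm = sum(F.l1_loss(f, r.detach())                                # Feature-matching loss
             for f, r in zip(feats_fake, feats_real))
    return w_mel * recon + w_dur * dur + adv + w_fm * fm

# Toy call with random tensors just to show the expected shapes and types.
loss = generator_loss(torch.randn(2, 80, 100), torch.randn(2, 80, 100),
                      torch.randn(2, 40), torch.randn(2, 40),
                      [torch.randn(2, 1)], [torch.randn(2, 64)], [torch.randn(2, 64)])
print(loss.item())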

Mode Collapse Mitigation

  • Gradient Penalty for discriminator stability
  • Spectral Normalization in generator and discriminator
  • Progressive Training with increasing complexity
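The first two regularizers can be sketched as follows; the layer sizes, penalty weight, and where spectral normalization is applied are illustrative assumptions rather than the exact recipe used in training.

import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

disc = nn.Sequential(                                # spectral normalization on discriminator layers
    spectral_norm(nn.Linear(80, 256)),
    nn.LeakyReLU(0.2),
    spectral_norm(nn.Linear(256, 1)),
)

def gradient_penalty(disc, real, fake, weight=10.0):
    """WGAN-GP-style penalty on samples interpolated between real and fake."""
    alpha = torch.rand(real.size(0), 1, device=real.device)
    mixed = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    grads, = torch.autograd.grad(disc(mixed).sum(), mixed, create_graph=True)
    return weight * ((grads.norm(2, dim=1) - 1) ** 2).mean()

gp = gradient_penalty(disc, torch.randn(4, 80), torch.randn(4, 80))
print(gp.item())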

Hyperparameters

{
  "train": {
    "batch_size": 20,
    "learning_rate": 0.00001,
    "epochs": 100,
    "bf16_run": true
  },
  "data": {
    "sampling_rate": 44100,
    "n_speakers": 174
  }
}
Training was conducted on a single RTX 4090 GPU with bfloat16 precision.
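As a rough illustration of the bfloat16 setup ("bf16_run": true), a single training step could look like the sketch below. The model and loss are placeholders; only the learning rate and batch size come from the config above.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"    # a single RTX 4090 in this work
model = torch.nn.Linear(80, 80).to(device)                  # placeholder for the Bert-VITS2 generator
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # learning_rate from the config

def train_step(batch):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type=device, dtype=torch.bfloat16):  # bf16_run: true
        loss = model(batch).pow(2).mean()                   # stand-in for the combined TTS loss
    loss.backward()                                         # bf16 needs no GradScaler, unlike fp16
    optimizer.step()
    return loss.item()

print(train_step(torch.randn(20, 80, device=device)))       # batch_size 20 from the config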

3. Results and Discussion

Training Dynamics

Initial training runs showed mode collapse (the model generated blank speech). After applying the mitigation strategies described in Section 2.3:
  • Discriminator loss stabilized
  • Generator loss showed clear downward trend
  • WER dropped from ~0.5 to ~0.2 during training
Figure: Training loss curves.

Figure: WER improvement during training.

Comparison with Other Models

Model                         WER    MOS
Ours (Bert-VITS2)             0.27   2.90
myshell-ai/MeloTTS-Chinese    5.62   3.04
fish-speech (GPT) w/o ref     0.49   3.57
Our model achieved the lowest WER, indicating accurate speech generation. However, MOS (naturalness) has room for improvement compared to fish-speech, which has significantly more parameters.
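For context, the WER numbers above can be reproduced with a simple pipeline: transcribe the synthesized audio with an ASR system and score the transcripts against the input text. The sketch below uses jiwer with character-level tokenization (common for Mandarin); the choice of ASR model and tokenization are assumptions, not necessarily what was done in this evaluation.

import jiwer  # pip install jiwer

def to_chars(text: str) -> str:
    # Mandarin has no whitespace word boundaries, so treat each character as a token.
    return " ".join(text.replace(" ", ""))

def corpus_wer(pairs):
    """pairs: list of (input_text, asr_transcript_of_the_generated_audio)."""
    refs = [to_chars(ref) for ref, _ in pairs]
    hyps = [to_chars(hyp) for _, hyp in pairs]
    return jiwer.wer(refs, hyps)

print(corpus_wer([("今天的会议到此结束", "今天的会议到此结束")]))  # illustrative pair -> 0.0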

Generation Examples

Successfully synthesized examples include:
  • Short phrases (2-10 seconds)
  • Long-form speech (22 seconds), longer than the utterances seen in training

Limitations

Code-switching: The model cannot handle text with mixed languages (e.g., Chinese with English terms like “Speech processing”).

4. Conclusions and Future Work

Achievements

  1. Successfully fine-tuned Bert-VITS2 for Mandarin TTS
  2. Achieved lowest WER among compared models
  3. Developed an effective methodology for mitigating GAN training challenges such as mode collapse
  4. Generated clear, recognizable speech across various durations

Future Directions

  1. Train for more steps to improve MOS scores
  2. Address code-switching limitations
  3. Expand to additional speakers and domains

5. References

  1. Ren, Y., et al. (2019). “FastSpeech: Fast, Robust and Controllable Text to Speech.” NeurIPS.
  2. Wang, Y., et al. (2017). “Tacotron: Towards End-to-End Speech Synthesis.” Interspeech.
  3. Kim, J., et al. (2021). “Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech.” ICML.
  4. Kong, J., et al. (2023). “VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design.” Interspeech.
  5. Shi, Y., et al. (2020). “AISHELL-3: A Multi-Speaker Mandarin TTS Corpus and the Baselines.” arXiv preprint.
  6. Saeki, T., et al. (2022). “UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022.” Interspeech.
