Author

Chen Yufeng - Waseda University

Abstract

We address the challenge of accurately assessing machine translation quality, and of raising that quality toward the level of human translation. Our approach evaluates five benchmark translation models using three diverse evaluation metrics, and then applies insights from prior research to improve the models' accuracy.

Table of Contents

  1. Introduction
  2. Dataset
  3. How To Evaluate Machine Translation Accuracy
    • 3.1. BLEU Score
    • 3.2. BLEURT Score
    • 3.3. COMET Score
  4. Five Basic Machine Translation Models And Their Accuracies
    • 4.1. Azure Baseline Model
    • 4.2. Azure Custom Model
    • 4.3. DeepL Model
    • 4.4. Google Translator
    • 4.5. GPT-4 Model
    • 4.6. Comparison And Conclusion
  5. Improve Machine Translation Accuracy
    • 5.1. In-Context Learning for GPT-4
    • 5.2. Hybrid Model
    • 5.3. GPT-4 as a Data Cleaning Tool
  6. Conclusion
  7. References

1. Introduction

With the advancement of AI technology, particularly since the release of ChatGPT by OpenAI, people place increasing trust in the AI industry. As a pivotal component of natural language processing, machine translation has grown ever more significant. This paper evaluates five fundamental translation models using diverse evaluation metrics, and explores methods to improve their accuracy as far as possible.

2. Dataset

The research is centered around the Opus100 (ZH-EN) dataset available on Hugging Face. This dataset comprises one million Chinese-to-English translation instances spanning various domains, rendering Opus100 a fitting choice for training translation models.
(Figure: the Opus100 dataset on Hugging Face)
It is important to acknowledge that the dataset contains translation inaccuracies. While these inaccuracies may reduce training accuracy, the noise they introduce also helps guard against overfitting.
Furthermore, before being imported into the Azure AI platform, Opus100 requires a preprocessing step to remove anomalous symbols from each sentence.
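As a minimal sketch of this preprocessing, assuming the anomalies are stray control characters, repeated punctuation, and irregular whitespace, the cleanup could look like the following (the clean_sentence helper and the patterns it strips are illustrative assumptions, not the exact rules used):

import re

def clean_sentence(sentence):
    # Remove control characters and other non-printable symbols (assumed rule)
    sentence = re.sub(r'[\x00-\x1f\x7f]', '', sentence)
    # Collapse runs of repeated punctuation such as "!!!" or ",," (assumed rule)
    sentence = re.sub(r'([!?,.])\1+', r'\1', sentence)
    # Normalize whitespace
    return re.sub(r'\s+', ' ', sentence).strip()

cleaned_pairs = [
    (clean_sentence(zh), clean_sentence(en))
    for zh, en in zip(source_sentences, reference_translations)
]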

3. How To Evaluate Machine Translation Accuracy

When faced with a multitude of translation models, selecting the most suitable one for a specific purpose becomes challenging. There exist two fundamental approaches for assessing translation models:
  1. Traditional method: BLEU score
  2. Neural metrics: BLEURT score and COMET score

3.1 BLEU Score

BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another (Papineni et al., 2002).
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

nltk.download('punkt')  # tokenizer data, needed once

bleu_scores = []
for reference, pre in zip(reference_translations, prediction):
    reference_tokens = nltk.word_tokenize(reference.lower())
    pre_tokens = nltk.word_tokenize(pre.lower())

    # Skip empty pairs, which would make sentence-level BLEU undefined
    if not reference_tokens or not pre_tokens:
        continue

    # Smoothing is required at sentence level, where zero n-gram matches are common
    bleu_score = sentence_bleu(
        [reference_tokens], pre_tokens,
        smoothing_function=SmoothingFunction().method2)
    bleu_scores.append(bleu_score)

average_bleu_score = sum(bleu_scores) / len(bleu_scores)
print("Average BLEU score:", average_bleu_score)

Seven Smoothing Functions

NLTK's SmoothingFunction implements the seven smoothing methods of Chen and Cherry (2014):

| Function | Description |
| --- | --- |
| method1 | Additive smoothing: adds a small epsilon to n-gram precisions with zero counts |
| method2 | Adds 1 to both the numerator and denominator of each n-gram precision (Lin & Och, 2004) |
| method3 | NIST geometric sequence smoothing: replaces each successive zero precision with a geometrically decreasing value (1/2, 1/4, ...) |
| method4 | Like method3, but gives shorter translations proportionally smaller smoothed counts |
| method5 | Averages the matched counts of the (n-1)-, n-, and (n+1)-gram precisions |
| method6 | Interpolates the maximum-likelihood precision with a prior estimated from lower-order n-grams (Gao & He, 2013) |
| method7 | Combines methods 4 and 5 |
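To see how the choice of smoothing method changes a sentence-level score, the same hypothesis can be scored under each method; the toy sentence pair below is purely illustrative:

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ['the', 'cat', 'sat', 'on', 'the', 'mat']
hypothesis = ['the', 'cat', 'is', 'on', 'the', 'mat']

smoother = SmoothingFunction()
for name in ['method1', 'method2', 'method3', 'method4',
             'method5', 'method6', 'method7']:
    # Look up each smoothing method by name and score the same pair
    score = sentence_bleu([reference], hypothesis,
                          smoothing_function=getattr(smoother, name))
    print(name, round(score, 4))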
BLEU has limitations: it captures word order only locally through n-gram overlap, and it does not measure fluency, idiomatic usage, grammaticality, or overall coherence.

3.2 BLEURT Score

BLEURT is an evaluation metric for natural language generation. It takes a pair of sentences as input (a reference and a candidate) and returns a score indicating how fluent the candidate is and how well it preserves the meaning of the reference (Sellam et al., 2020).
from bleurt import score

# Path to a downloaded checkpoint, e.g. BLEURT-20 from the official BLEURT repository
checkpoint = "/path/to/BLEURT-20"

scorer = score.BleurtScorer(checkpoint)
scores = scorer.score(references=reference_translations, candidates=prediction)

# BLEURT returns one score per sentence pair; average them for a corpus-level score
average_score = sum(scores) / len(scores)
print("Average BLEURT score:", average_score)
Make sure to install TensorFlow beforehand to use BLEURT.

3.3 COMET Score

COMET is a neural framework for training multilingual machine translation evaluation models, designed to predict human judgments of translation quality.
from comet import download_model, load_from_checkpoint

# Download the WMT22 COMET model trained on direct assessments (DA)
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

# COMET scores triples of source sentence, machine translation, and reference
data = []
for src, pre, reference in zip(source_sentences, preds, reference_translations):
    data.append({
        "src": src,
        "mt": pre,
        "ref": reference
    })

model_output = model.predict(data, batch_size=8, gpus=0)
print(model_output.scores)        # per-sentence scores
print(model_output.system_score)  # corpus-level average

4. Five Basic Machine Translation Models And Their Accuracies

4.1 Azure Baseline Model

import requests, uuid, json

# Credentials and region from the Azure Translator resource
endpoint = ""          # e.g. "https://api.cognitive.microsofttranslator.com"
subscription_key = ""
location = ""

path = '/translate'
constructed_url = endpoint + path

params = {
    'api-version': '3.0',
    'from': 'zh',
    'to': 'en'
}

headers = {
    'Ocp-Apim-Subscription-Key': subscription_key,
    'Ocp-Apim-Subscription-Region': location,
    'Content-type': 'application/json',
    'X-ClientTraceId': str(uuid.uuid4())
}

# The API accepts a JSON array of {'text': ...} objects
body = [{'text': i} for i in source_sentences]

request = requests.post(constructed_url, params=params, headers=headers, json=body)
response = request.json()

# Each element of the response carries a 'translations' list
translations = [item['translations'][0]['text'] for item in response]

4.2 Azure Custom Model

The Azure custom model is an enhanced version of the Azure baseline model, obtained by further training it on additional datasets. The custom model's BLEU score reported on the Azure platform is 39.45.
Before the custom model can be invoked through the API, it must be published on the Azure platform.
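Calling the published model reuses the request from Section 4.1, with one extra parameter: the category ID that Azure assigns to the custom model. A minimal sketch (the category value is a hypothetical placeholder):

params = {
    'api-version': '3.0',
    'from': 'zh',
    'to': 'en',
    # Category ID of the published custom model (hypothetical placeholder)
    'category': 'your-category-id'
}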

4.3 DeepL Model

DeepL Translator is a commercial neural machine translation service, reported to use convolutional neural networks and an English pivot for some language pairs.
import deepl

API_KEY = 'your-api-key'
source_lang = 'ZH'
target_lang = 'EN-US'  # DeepL requires a regional variant for English targets

translator = deepl.Translator(API_KEY)
results = translator.translate_text(source_sentences,
                                    source_lang=source_lang,
                                    target_lang=target_lang)

# translate_text returns one TextResult per input sentence
translations = [result.text for result in results]

4.4 Google Translator

import requests

def translate_texts(texts, target_language):
    api_key = 'your-api-key'
    url = 'https://translation.googleapis.com/language/translate/v2'
    translations = []

    for text in texts:
        params = {
            'key': api_key,
            'q': text,
            'target': target_language
        }
        response = requests.get(url, params=params)
        if response.status_code == 200:
            data = response.json()
            translated_text = data['data']['translations'][0]['translatedText']
            translations.append(translated_text)
        else:
            # Append a placeholder so outputs stay aligned with inputs on failure
            translations.append(None)

    return translations

4.5 GPT-4 Model

import openai

def translate_text(text_list):
    openai.api_key = 'your-api-key'
    translations = []

    for text in text_list:
        messages = [
            {"role": "system", "content": "You are a translation assistant from Chinese to English. Some rules to remember:\n\n- Do not add extra blank lines.\n- It is important to maintain the accuracy of the contents, but we don't want the output to read like it's been translated. So instead of translating word by word, prioritize naturalness and ease of communication."},
            {"role": "user", "content": text}
        ]

        response = openai.ChatCompletion.create(
            model='gpt-4',
            messages=messages,
            max_tokens=100,
            temperature=0.7,
            request_timeout=30  # client-side timeout; the legacy SDK uses request_timeout, not timeout
        )

        choices = response['choices']
        if len(choices) > 0:
            translation = choices[0]['message']['content']
            translations.append(translation)

    return translations
GPT-4 tends to exhibit slower response times than the other four models, and a high volume of tokens can run into rate limits or overloaded-server errors.
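A simple way to make such calls more robust is to retry with exponential backoff when the API is overloaded or rate-limited. This is a minimal sketch of that pattern, not part of the original setup; the retry count and delays are arbitrary choices:

import time
import openai

def create_with_retry(messages, max_retries=3):
    # Retry with exponential backoff on transient API failures
    for attempt in range(max_retries):
        try:
            return openai.ChatCompletion.create(
                model='gpt-4',
                messages=messages,
                max_tokens=100,
                temperature=0.7,
                request_timeout=30
            )
        except (openai.error.RateLimitError, openai.error.Timeout,
                openai.error.APIError):
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ...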

4.6 Comparison And Conclusion

Based on the evaluation results:
(Figures: BLEU, BLEURT, and COMET score comparisons for the five models)
Key Findings:
  1. Azure Custom Model emerges as the top performer
  2. DeepL follows closely in second place
  3. Azure Baseline Model claims the third spot
  4. Google Translator and GPT-4 share similar standings
Among models used off the shelf, i.e., when users lack the data or resources for custom training, DeepL is currently the most effective model for translating Chinese to English.

5. Improve Machine Translation Accuracy

Three distinct approaches for improving translation accuracy:

5.1 In-Context Learning for GPT-4

Large language models can improve performance through in-context learning, in which task-specific examples are provided directly in the prompt (Brown et al., 2020). Result: in-context learning raised GPT-4's BLEURT score from 0.6486 to 0.6755, demonstrating the effectiveness of the approach.
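As a minimal sketch of what this looks like in practice, a few reference translation pairs can be prepended to the chat as worked examples before the sentence to translate; the example pairs below are illustrative, not the ones used in the experiment:

few_shot_examples = [
    ("他们正在开会。", "They are in a meeting."),
    ("请把门关上。", "Please close the door."),
]

def build_messages(text):
    messages = [{"role": "system",
                 "content": "You are a translation assistant from Chinese to English."}]
    # Each example pair becomes a user/assistant turn demonstrating the task
    for zh, en in few_shot_examples:
        messages.append({"role": "user", "content": zh})
        messages.append({"role": "assistant", "content": en})
    # The sentence to translate comes last
    messages.append({"role": "user", "content": text})
    return messages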

5.2 Hybrid Model

The hybrid threshold model sets a quality threshold on an evaluation score; sentences whose first-pass translations fall below the threshold are retranslated by a second model.
import requests, uuid, json
import openai
from comet import download_model, load_from_checkpoint

# Azure_translation and gpt_translation wrap the code from Sections 4.1 and 4.5
def translate_with_fallback(text):
    # First pass: translate everything with the Azure model
    translation_from_Azure = Azure_translation(text)

    # Score every sentence with COMET in one batched call
    model_path = download_model("Unbabel/wmt22-comet-da")
    model = load_from_checkpoint(model_path)
    data = [{
        "src": source_sentences[i],
        "mt": translation_from_Azure[i],
        "ref": reference_translations[i]
    } for i in range(len(translation_from_Azure))]
    res = model.predict(data, batch_size=8, gpus=0)

    # Sentences whose COMET score falls below the threshold get retranslated
    indices_to_correct = [i for i, s in enumerate(res.scores) if s < 0.81]
    sentences_to_correct = [source_sentences[i] for i in indices_to_correct]
    corrected_sentences = gpt_translation(sentences_to_correct)

    # Merge the two passes while preserving sentence order
    refined_translation = []
    indices_set = set(indices_to_correct)
    corrected_index = 0
    for i in range(len(translation_from_Azure)):
        if i in indices_set:
            refined_translation.append(corrected_sentences[corrected_index])
            corrected_index += 1
        else:
            refined_translation.append(translation_from_Azure[i])

    return refined_translation

Conclusions of Hybrid Model

  1. The threshold works best when defined on the COMET score
  2. The best-performing combinations are Azure Custom + DeepL and DeepL + GPT-4
  3. Nearly all hybrid models surpass their individual component models
  4. A higher threshold does not necessarily yield higher scores

5.3 GPT-4 as a Data Cleaning Tool

GPT-4 can be used to preprocess datasets and correct inaccurate translations:
import openai
import json

# Pair each source sentence with its (possibly noisy) reference translation.
# Note: keying on the Chinese sentence collapses duplicate sources into one entry.
pair = {}
for zh, en in zip(source_sentences, reference_translations):
    pair[zh] = en

def correct_translations(pair):
    openai.api_key = 'your-api-key'
    translations = []

    for zh, en in pair.items():
        messages = [
            {"role": "system", "content": "You are a Chinese to English translation corrector. You need to modify the incorrect English translations below and correct it by given Chinese sentences, please remember not to use English abbreviations and not add extra blank lines. Fix weird punctuation. And the result should be English sentences only"},
            # ensure_ascii=False keeps the Chinese characters readable in the prompt
            {"role": "user", "content": json.dumps({"zh": zh, "en": en}, ensure_ascii=False)}
        ]

        response = openai.ChatCompletion.create(
            model='gpt-4',
            messages=messages,
            max_tokens=100,
            temperature=0.7,
            request_timeout=30  # client-side timeout in the legacy SDK
        )

        choices = response['choices']
        if len(choices) > 0:
            model_response = choices[0]['message']['content']
            translations.append(model_response)

    return translations
Leveraging GPT-4 to clean both the source and target text proves viable: the Azure baseline model's scores on the refined dataset can match DeepL's performance on the original, noisier dataset.

6. Conclusion

This paper investigated machine translation accuracy and methods for enhancement through three evaluation metrics and five benchmark models. Key Conclusions:
  1. DeepL is the most proficient Chinese to English translator
  2. Azure Baseline Model can achieve higher performance with substantial data and adequate training
  3. Hybrid models combining different translation engines improve accuracy
  4. GPT-4 data cleaning improves dataset quality, leading to better model performance
This study acknowledges limitations: manual inspection revealed cases where high metric scores did not correspond to genuinely good translations, and some of the accuracy-enhancement methods actually decreased scores.

7. References

  1. Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL).
  2. Sellam, T., Das, D., & Parikh, A. P. (2020). BLEURT: Learning robust metrics for text generation. In Proceedings of ACL.
  3. Brown, T. B., et al. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS).
  4. Bashir, D. (2023). In-Context Learning, In Context. The Gradient.
  5. Hendy, A., et al. (2023). How good are GPT models at machine translation? A comprehensive evaluation. arXiv preprint.
  6. Rei, R., et al. (2022). COMET-22: Unbabel-IST 2022 submission for the metrics shared task. In Proceedings of the Seventh Conference on Machine Translation (WMT).