Author
Shuang LIANG - The University of Tokyo
Abstract
Large Language Models (LLMs) have shown outstanding performance in natural language tasks. This article explores fine-tuning Llama 3.1 for Chinese-to-English machine translation while addressing the challenge of hallucination through training and decoding strategies.
Key Results:
- Fine-tuned model achieved BLEU 40.8 vs baseline 19.6 on document-level data
- COMET 0.891 vs baseline 0.820
- Successfully mitigated hallucination in long-context translations
- Maintained sentence-level quality while improving document-level performance
1. Background
Large Language Models
LLMs like Llama have revolutionized NLP, showing remarkable capabilities in understanding and generating human-like text. They can be fine-tuned for specific tasks, making them ideal for enhancing machine translation.
Parameter-Efficient Fine-Tuning (LoRA)
LoRA (Low-Rank Adaptation) enables fine-tuning without updating all model parameters:
- Freezes pre-trained model parameters
- Inserts trainable low-rank matrices
- Significantly reduces training cost and time
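The idea above can be sketched numerically. Below is a minimal, illustrative implementation of the LoRA update (not Unsloth's internals): the frozen weight `W` is augmented by a scaled low-rank product `B @ A`, and only `A` and `B` would be trained. The shapes and values here are made up for illustration; note that with `lora_alpha = r` (as in the configuration used later, 16/16), the scaling factor `alpha / r` equals 1.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, A and B are trainable."""
    base = matmul(W, x)                      # frozen pre-trained path
    update = matmul(B, matmul(A, x))         # low-rank trainable path
    scale = alpha / r
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]

# Tiny example: d_out = d_in = 2, rank r = 1 (toy values).
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity for clarity)
A = [[1.0, 1.0]]               # trainable (r x d_in)
B = [[0.5], [0.5]]             # trainable (d_out x r)
x = [[2.0], [3.0]]             # column-vector input

y = lora_forward(x, W, A, B, alpha=1, r=1)   # -> [[4.5], [5.5]]
```

Because only `A` (r × d_in) and `B` (d_out × r) receive gradients, the number of trainable parameters scales with the rank `r` rather than with the full weight matrix.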
Neural Machine Translation and Hallucination
Hallucination in NMT refers to output that is unfaithful to the source, fabricated, or nonsensical:
| Type | Description |
|---|---|
| Intrinsic Hallucinations | Output contains incorrect information compared to source |
| Extrinsic Hallucinations | Model generates additional unrelated content |
| Perturbation Hallucinations | Drastically different output for perturbed vs unperturbed input |
| Natural Hallucinations | Connected to noise in the training dataset |
Decoding Strategies
| Method | Description |
|---|---|
| Greedy Search | Picks the highest-probability token at each step |
| Beam Search | Tracks the N most probable partial sequences at each step |
| Temperature Sampling | Rescales the probability distribution to be sharper or flatter |
| Top-p Sampling | Samples from the smallest token set whose cumulative probability exceeds p |
| Top-k Sampling | Samples from the k most likely tokens |
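The strategies in the table can be sketched on a toy distribution. The following is a minimal, self-contained illustration (not a production decoder): logits are made up, and the functions show how temperature reshapes the distribution and how top-k and top-p restrict the candidate set before sampling.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Toy vocabulary of 4 tokens with made-up logits.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits, temperature=1.0)
greedy = max(range(len(probs)), key=lambda i: probs[i])   # greedy search picks token 0
```

With a low temperature (e.g. 0.1) the distribution collapses toward the greedy choice; with a high temperature it flattens, which increases diversity but also the risk of hallucinated tokens.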
2. Experiments
Datasets
| Dataset | Documents | Sentences | Words (src/tgt) |
|---|---|---|---|
| NewsCommentary-v18.1 | 11,147 | 443,677 | 16.4M/9.7M |
| Ted Talks | 22 | 1,949 | 51K/32K |
Evaluation Metrics
- BLEU (Bilingual Evaluation Understudy): compares candidate n-grams against reference translations
- COMET (Crosslingual Optimized Metric for Evaluation of Translation): a neural metric with state-of-the-art correlation to human judgment
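To make the BLEU definition concrete, here is a simplified single-reference, unsmoothed sentence-level BLEU in plain Python. Real evaluations use a standard implementation such as sacreBLEU at corpus level; this sketch only illustrates the clipped n-gram precisions and the brevity penalty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty (single reference, no smoothing)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())  # clip by reference counts
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0  # no overlap at this order -> score 0 without smoothing
        log_precisions.append(math.log(clipped / total))
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on the mat".split()
score = bleu(hyp, ref)   # identical sentences give BLEU = 1.0
```

COMET, by contrast, is not an n-gram overlap metric: it scores translations with a trained neural model that also sees the source sentence, which is why it correlates better with human judgment.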
Environment
- Model: Llama 3.1 8B Instruct
- GPU: NVIDIA A100 (80GB)
- Framework: Unsloth for accelerated training
Fine-tuning Configuration
```python
# Unsloth's from_pretrained returns a (model, tokenizer) pair.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj",
                    "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)
```
3. Results
| Training Samples | BLEU | COMET |
|---|---|---|
| 10 | 35.8 | 0.885 |
| 100 | 36.9 | 0.889 |
| 1,000 | 39.7 | 0.890 |
| 10,000 | 40.8 | 0.891 |
| Baseline | 19.6 | 0.820 |
Fine-tuning improved BLEU by over 100% compared to baseline on document-level translations.
Figure: training performance (BLEU and COMET scores vs. number of training samples).
Final Mixed Training Results
Using 30:1 sentence-to-document ratio:
| Evaluation Level | Fine-tuned BLEU | Fine-tuned COMET | Baseline BLEU | Baseline COMET |
|---|---|---|---|---|
| Document-level | 37.7 | 0.890 | 19.6 | 0.820 |
| Sentence-level | 30.7 | 0.862 | 30.9 | 0.864 |
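The 30:1 mix could be assembled as follows. This is an assumed construction: the source states the sentence-to-document ratio but not the exact sampling procedure, so the function names and the shuffling step here are illustrative.

```python
import random

def mix_training_data(sentence_pairs, document_pairs, ratio=30, seed=0):
    """Build a mixed corpus with roughly `ratio` sentence-level examples
    per document-level example (assumed construction)."""
    rng = random.Random(seed)
    n_docs = len(document_pairs)
    # Sample up to ratio * n_docs sentence-level pairs.
    sents = rng.sample(sentence_pairs, min(len(sentence_pairs), ratio * n_docs))
    mixed = sents + list(document_pairs)
    rng.shuffle(mixed)   # interleave the two granularities
    return mixed

# Toy source/target pairs standing in for real training data.
sent_data = [("句子%d" % i, "sentence %d" % i) for i in range(300)]
doc_data = [("文档%d" % i, "document %d" % i) for i in range(10)]
mixed = mix_training_data(sent_data, doc_data)   # 300 + 10 = 310 examples
```

Keeping sentence-level examples dominant in the mix is what preserves sentence-level quality while the document-level examples teach long-context behavior.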
Hallucination Analysis
Types observed:
- Premature stopping: the model generates the EOS token before completing the translation
- Redundant content: document-level models generate lengthy explanations beyond the translation itself
Mitigation strategies:
- EOS token probability thresholding
- Mixed document/sentence-level training
- Careful dataset preparation
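The EOS-thresholding idea can be sketched as a decoding-time filter. This is an assumed formulation, not the paper's exact rule: block EOS until a minimum output length is reached, and afterwards only allow it when its probability clearly dominates.

```python
def suppress_premature_eos(probs, eos_id, generated_len, min_len, eos_threshold=0.5):
    """Mask the EOS token until `min_len` tokens are generated, and afterwards
    only allow it when its probability exceeds `eos_threshold` (assumed rule)."""
    p = list(probs)
    if generated_len < min_len or p[eos_id] < eos_threshold:
        p[eos_id] = 0.0                    # mask EOS out
        z = sum(p)
        p = [x / z for x in p]             # renormalize the remaining tokens
    return p

# Toy distribution over 3 tokens; token 2 is EOS with probability 0.4.
probs = [0.35, 0.25, 0.4]
filtered = suppress_premature_eos(probs, eos_id=2, generated_len=10,
                                  min_len=5, eos_threshold=0.5)
# EOS (0.4 < 0.5) is masked, so the model must keep translating.
```

In practice the threshold would be tuned against the source length, since an overly strict rule can push the model into the opposite failure mode of redundant content.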
Document-level fine-tuned models tend to generate lengthy outputs with implicit prior knowledge, sometimes producing factual but off-topic content.
4. Conclusion
With proper dataset preparation and fine-tuning techniques, it is possible to:
- Significantly improve translation quality (2x BLEU improvement)
- Mitigate hallucination issues
- Maintain sentence-level quality while enhancing document-level performance
- Produce more reliable and coherent translations
5. Future Work
- Prepare datasets covering various input scenarios (language styles, cultural backgrounds, dialogue topics)
- Balance content types in training data to avoid bias
- Address named entity errors through post-generation methods
- Explore additional hallucination mitigation techniques
References
- Kocmi, T., et al. (2022). “Findings of the 2022 Conference on Machine Translation (WMT22).”
- Hu, E., et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models.”
- Meta AI. (2024). “Llama 3.1 Model Documentation.”
- Ji, Z., et al. (2023). “Survey of Hallucination in Natural Language Generation.”