Author
Shuang LIANG - The University of Tokyo
Abstract
Large Language Models (LLMs) have shown outstanding performance in natural language tasks. This article explores fine-tuning Llama 3.1 for Chinese-to-English machine translation while addressing the challenge of hallucination through training and decoding strategies.
Key Results:
- Fine-tuned model achieved BLEU 40.8 vs baseline 19.6 on document-level data
- COMET 0.891 vs baseline 0.820
- Successfully mitigated hallucination in long-context translations
- Maintained sentence-level quality while improving document-level performance
1. Background
Large Language Models
LLMs like Llama have revolutionized NLP, showing remarkable capabilities in understanding and generating human-like text. They can be fine-tuned for specific tasks, making them ideal for enhancing machine translation.
Parameter-Efficient Fine-Tuning (LoRA)
LoRA (Low-Rank Adaptation) enables fine-tuning without updating all model parameters:
- Freezes pre-trained model parameters
- Inserts trainable low-rank matrices
- Significantly reduces training cost and time
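The idea above can be sketched numerically. Below is a minimal, illustrative implementation of the LoRA update (not Unsloth's internals): the frozen weight `W` is augmented by a scaled low-rank product `B @ A`, and only `A` and `B` would be trained. The shapes and values here are made up for illustration; note that with `lora_alpha = r` (as in the configuration used later, 16/16), the scaling factor `alpha / r` equals 1.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(x, W, A, B, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, A and B are trainable."""
    base = matmul(W, x)                      # frozen pre-trained path
    update = matmul(B, matmul(A, x))         # low-rank trainable path
    scale = alpha / r
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]

# Tiny example: d_out = d_in = 2, rank r = 1 (toy values).
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen weight (identity for clarity)
A = [[1.0, 1.0]]               # trainable (r x d_in)
B = [[0.5], [0.5]]             # trainable (d_out x r)
x = [[2.0], [3.0]]             # column-vector input

y = lora_forward(x, W, A, B, alpha=1, r=1)   # -> [[4.5], [5.5]]
```

Because only `A` (r × d_in) and `B` (d_out × r) receive gradients, the number of trainable parameters scales with the rank `r` rather than with the full weight matrix.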
Neural Machine Translation and Hallucination
Hallucination in NMT refers to output that is unfaithful to the source, fabricated, or nonsensical:
| Type | Description |
|---|---|
| Intrinsic Hallucinations | Output contains incorrect information compared to source |
| Extrinsic Hallucinations | Model generates additional unrelated content |
| Perturbation Hallucinations | Drastically different output for perturbed vs unperturbed input |
| Natural Hallucinations | Connected to noise in the training dataset |
Decoding Strategies
| Method | Description |
|---|---|
| Greedy Search | Picks the highest-probability token at each step |
| Beam Search | Tracks the N most probable partial sequences at each step |
| Temperature Sampling | Rescales the probability distribution to be sharper or flatter |
| Top-p Sampling | Samples from the smallest token set whose cumulative probability exceeds p |
| Top-k Sampling | Samples from the k most likely tokens |
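The strategies in the table can be sketched on a toy distribution. The following is a minimal, self-contained illustration (not a production decoder): logits are made up, and the functions show how temperature reshapes the distribution and how top-k and top-p restrict the candidate set before sampling.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

# Toy vocabulary of 4 tokens with made-up logits.
logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits, temperature=1.0)
greedy = max(range(len(probs)), key=lambda i: probs[i])   # greedy search picks token 0
```

With a low temperature (e.g. 0.1) the distribution collapses toward the greedy choice; with a high temperature it flattens, which increases diversity but also the risk of hallucinated tokens.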
2. Experiments
Datasets
| Dataset | Documents | Sentences | Words (src/tgt) |
|---|---|---|---|
| NewsCommentary-v18.1 | 11,147 | 443,677 | 16.4M/9.7M |
| Ted Talks | 22 | 1,949 | 51K/32K |
Evaluation Metrics
- BLEU (Bilingual Evaluation Understudy): compares candidate n-grams against reference translations
- COMET (Crosslingual Optimized Metric for Evaluation of Translation): a neural metric with state-of-the-art correlation to human judgment
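To make the BLEU definition concrete, here is a simplified single-reference, unsmoothed sentence-level BLEU in plain Python. Real evaluations use a standard implementation such as sacreBLEU at corpus level; this sketch only illustrates the clipped n-gram precisions and the brevity penalty.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    times a brevity penalty (single reference, no smoothing)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())  # clip by reference counts
        total = max(sum(cand.values()), 1)
        if clipped == 0:
            return 0.0  # no overlap at this order -> score 0 without smoothing
        log_precisions.append(math.log(clipped / total))
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))  # brevity penalty
    return bp * math.exp(sum(log_precisions) / max_n)

ref = "the cat sat on the mat".split()
hyp = "the cat sat on the mat".split()
score = bleu(hyp, ref)   # identical sentences give BLEU = 1.0
```

COMET, by contrast, is not an n-gram overlap metric: it scores translations with a trained neural model that also sees the source sentence, which is why it correlates better with human judgment.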
Environment
- Model: Llama 3.1 8B Instruct
- GPU: NVIDIA A100 (80GB)
- Framework: Unsloth for accelerated training
Fine-tuning Configuration
```python
# Unsloth's from_pretrained returns a (model, tokenizer) pair.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj",
                    "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)
```
3. Results
| Training Samples | BLEU | COMET |
|---|---|---|
| 10 | 35.8 | 0.885 |
| 100 | 36.9 | 0.889 |
| 1,000 | 39.7 | 0.890 |
| 10,000 | 40.8 | 0.891 |
| Baseline | 19.6 | 0.820 |
Fine-tuning improved BLEU by over 100% compared to baseline on document-level translations.
Figure: training performance (BLEU and COMET scores vs. number of training samples).
Final Mixed Training Results
Using 30:1 sentence-to-document ratio:
| Evaluation Level | Fine-tuned BLEU | Fine-tuned COMET | Baseline BLEU | Baseline COMET |
|---|---|---|---|---|
| Document-level | 37.7 | 0.890 | 19.6 | 0.820 |
| Sentence-level | 30.7 | 0.862 | 30.9 | 0.864 |
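The 30:1 mix could be assembled as follows. This is an assumed construction: the source states the sentence-to-document ratio but not the exact sampling procedure, so the function names and the shuffling step here are illustrative.

```python
import random

def mix_training_data(sentence_pairs, document_pairs, ratio=30, seed=0):
    """Build a mixed corpus with roughly `ratio` sentence-level examples
    per document-level example (assumed construction)."""
    rng = random.Random(seed)
    n_docs = len(document_pairs)
    # Sample up to ratio * n_docs sentence-level pairs.
    sents = rng.sample(sentence_pairs, min(len(sentence_pairs), ratio * n_docs))
    mixed = sents + list(document_pairs)
    rng.shuffle(mixed)   # interleave the two granularities
    return mixed

# Toy source/target pairs standing in for real training data.
sent_data = [("句子%d" % i, "sentence %d" % i) for i in range(300)]
doc_data = [("文档%d" % i, "document %d" % i) for i in range(10)]
mixed = mix_training_data(sent_data, doc_data)   # 300 + 10 = 310 examples
```

Keeping sentence-level examples dominant in the mix is what preserves sentence-level quality while the document-level examples teach long-context behavior.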
Hallucination Analysis
Types observed:
- Premature stopping: the model generates the EOS token before completing the translation
- Redundant content: document-level models generate lengthy explanations beyond the translation itself
Mitigation strategies:
- EOS token probability thresholding
- Mixed document/sentence-level training
- Careful dataset preparation
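The EOS-thresholding idea can be sketched as a decoding-time filter. This is an assumed formulation, not the paper's exact rule: block EOS until a minimum output length is reached, and afterwards only allow it when its probability clearly dominates.

```python
def suppress_premature_eos(probs, eos_id, generated_len, min_len, eos_threshold=0.5):
    """Mask the EOS token until `min_len` tokens are generated, and afterwards
    only allow it when its probability exceeds `eos_threshold` (assumed rule)."""
    p = list(probs)
    if generated_len < min_len or p[eos_id] < eos_threshold:
        p[eos_id] = 0.0                    # mask EOS out
        z = sum(p)
        p = [x / z for x in p]             # renormalize the remaining tokens
    return p

# Toy distribution over 3 tokens; token 2 is EOS with probability 0.4.
probs = [0.35, 0.25, 0.4]
filtered = suppress_premature_eos(probs, eos_id=2, generated_len=10,
                                  min_len=5, eos_threshold=0.5)
# EOS (0.4 < 0.5) is masked, so the model must keep translating.
```

In practice the threshold would be tuned against the source length, since an overly strict rule can push the model into the opposite failure mode of redundant content.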
Document-level fine-tuned models tend to generate lengthy outputs with implicit prior knowledge, sometimes producing factual but off-topic content.
4. Conclusion
With proper dataset preparation and fine-tuning techniques, it is possible to:
- Significantly improve translation quality (2x BLEU improvement)
- Mitigate hallucination issues
- Maintain sentence-level quality while enhancing document-level performance
- Produce more reliable and coherent translations
5. Future Work
- Prepare datasets covering various input scenarios (language styles, cultural backgrounds, dialogue topics)
- Balance content types in training data to avoid bias
- Address named entity errors through post-generation methods
- Explore additional hallucination mitigation techniques
References
- Kocmi, T., et al. (2022). “Findings of the 2022 Conference on Machine Translation (WMT22).”
- Hu, E., et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models.”
- Meta AI. (2024). “Llama 3.1 Model Documentation.”
- Ji, Z., et al. (2023). “Survey of Hallucination in Natural Language Generation.”