
Author

Shuang LIANG - The University of Tokyo

Abstract

Large Language Models (LLMs) have shown outstanding performance on natural language tasks. This article explores fine-tuning Llama 3.1 for Chinese-to-English machine translation while addressing the challenge of hallucination through training and decoding strategies. Key results:
  • Fine-tuned model achieved BLEU 40.8 vs baseline 19.6 on document-level data
  • COMET 0.891 vs baseline 0.820
  • Successfully mitigated hallucination in long-context translations
  • Maintained sentence-level quality while improving document-level performance

1. Background

Large Language Models

LLMs like Llama have revolutionized NLP, showing remarkable capabilities in understanding and generating human-like text. They can be fine-tuned for specific tasks, making them ideal for enhancing machine translation.

Parameter-Efficient Fine-Tuning (LoRA)

LoRA (Low-Rank Adaptation) enables fine-tuning without updating all model parameters:
  • Freezes pre-trained model parameters
  • Inserts trainable low-rank matrices
  • Significantly reduces training cost and time
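The update can be sketched as follows (a minimal standard-library illustration of the idea, not the actual LoRA implementation): the frozen weight W is augmented with a trainable rank-r product B·A, scaled by alpha/r, and B is initialized to zero so training starts from the pre-trained behavior.

```python
import random

def matmul(X, Y):
    """Naive matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Frozen W plus the scaled low-rank update (alpha / r) * B @ A."""
    BA = matmul(B, A)
    s = alpha / r
    return [[w + s * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

d, r = 4, 2                                                          # full dim vs. LoRA rank
W = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]   # frozen weight
A = [[random.gauss(0, 0.02) for _ in range(d)] for _ in range(r)]    # trainable, r x d
B = [[0.0] * r for _ in range(d)]                                    # trainable, zero-init

# With B initialized to zero, the effective weight equals W exactly,
# so fine-tuning begins from the pre-trained model's behavior.
assert lora_effective_weight(W, A, B, alpha=16, r=r) == W
```

Only A and B (2·d·r values) are trained instead of the full d×d matrix, which is where the cost savings come from.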

Neural Machine Translation and Hallucination

Hallucination in NMT refers to unfaithful, fabricated, or nonsensical content:
Type                        | Description
Intrinsic Hallucinations    | Output contains incorrect information compared to the source
Extrinsic Hallucinations    | Model generates additional content unrelated to the source
Perturbation Hallucinations | Drastically different output for a perturbed vs. unperturbed input
Natural Hallucinations      | Connected to noise in the training dataset

Decoding Strategies

Method               | Description
Greedy Search        | Chooses the highest-probability token at each step
Beam Search          | Keeps the N most probable partial sequences
Temperature Sampling | Adjusts the sharpness of the probability distribution
Top-p Sampling       | Samples from the smallest token set whose cumulative probability exceeds a threshold
Top-k Sampling       | Samples from the k most likely tokens
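The strategies above can be made concrete with a small sketch (toy probabilities and standard-library Python only; a real decoder applies these filters to the model's softmax output at every generation step):

```python
def greedy(probs):
    """Pick the single most probable token."""
    return max(probs, key=probs.get)

def top_k_filter(probs, k):
    """Keep only the k most likely tokens, renormalized."""
    kept = dict(sorted(probs.items(), key=lambda kv: -kv[1])[:k])
    z = sum(kept.values())
    return {tok: p / z for tok, p in kept.items()}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    kept, total = {}, 0.0
    for tok, pr in sorted(probs.items(), key=lambda kv: -kv[1]):
        kept[tok] = pr
        total += pr
        if total >= p:
            break
    z = sum(kept.values())
    return {tok: pr / z for tok, pr in kept.items()}

def temperature(probs, t):
    """Sharpen (t < 1) or flatten (t > 1) the distribution;
    equivalent to dividing the underlying logits by t."""
    scaled = {tok: pr ** (1.0 / t) for tok, pr in probs.items()}
    z = sum(scaled.values())
    return {tok: pr / z for tok, pr in scaled.items()}

probs = {"the": 0.5, "a": 0.3, "cat": 0.15, "<eos>": 0.05}
assert greedy(probs) == "the"
assert set(top_k_filter(probs, 2)) == {"the", "a"}
assert set(top_p_filter(probs, 0.9)) == {"the", "a", "cat"}
```

Sampling then draws from the filtered, renormalized distribution instead of always taking the argmax.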

2. Experiments

Datasets

Dataset              | Documents | Sentences | Words (src/tgt)
NewsCommentary-v18.1 | 1,147     | 443,677   | 16.4M/9.7M
Ted Talks            | 22        | 1,949     | 51K/32K

Evaluation Metrics

  • BLEU (Bilingual Evaluation Understudy): compares n-gram overlap between the output and reference translations
  • COMET: a neural evaluation framework with state-of-the-art correlation to human judgment
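To illustrate what BLEU measures, here is a deliberately simplified single-reference sketch (real evaluations should use an established implementation such as sacrebleu, which also handles corpus-level aggregation and tokenization):

```python
import math
from collections import Counter

def modified_precision(hyp, ref, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return overlap / max(sum(hyp_ngrams.values()), 1)

def bleu(hyp, ref, max_n=4):
    """Geometric mean of 1..max_n precisions, times a brevity penalty."""
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty discourages overly short translations.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_avg)

ref = "the model translates the document".split()
assert abs(bleu(ref, ref) - 1.0) < 1e-12   # a perfect match scores 1.0
assert bleu("completely unrelated words here".split(), ref) == 0.0
```

COMET, by contrast, scores candidate/reference/source triples with a trained neural model, so it cannot be reduced to a counting formula like this.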

Environment

  • Model: Llama 3.1 8B Instruct
  • GPU: NVIDIA A100 (80GB)
  • Framework: Unsloth for accelerated training

Fine-tuning Configuration

from unsloth import FastLanguageModel

# Example values; the article does not state them explicitly.
max_seq_length = 4096   # long enough for document-level inputs
dtype = None            # let Unsloth choose bfloat16/float16 automatically
load_in_4bit = True     # 4-bit quantization to fit the 8B model in memory

# from_pretrained returns both the model and its tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank
    target_modules=["q_proj", "k_proj", "v_proj",
                    "o_proj", "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",  # Unsloth's memory-efficient variant
)

3. Results

In-Distribution Performance (Document-Level)

Training Samples | BLEU | COMET
10               | 35.8 | 0.885
100              | 36.9 | 0.889
1,000            | 39.7 | 0.890
10,000           | 40.8 | 0.891
Baseline         | 19.6 | 0.820

Fine-tuning improved BLEU by over 100% compared to the baseline on document-level translations.
[Figure: Training performance, BLEU and COMET scores vs. number of training samples]

Final Mixed Training Results

Using a 30:1 sentence-to-document ratio:

Evaluation Level | Fine-tuned BLEU | Fine-tuned COMET | Baseline BLEU | Baseline COMET
Document-level   | 37.7            | 0.890            | 19.6          | 0.820
Sentence-level   | 30.7            | 0.862            | 30.9          | 0.864
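The 30:1 mixing described above can be sketched as a simple interleaving (the sample lists here are hypothetical placeholders; the article does not describe its actual data pipeline):

```python
def mix_datasets(sentence_samples, document_samples, ratio=30):
    """Interleave roughly `ratio` sentence-level samples per document-level one."""
    mixed = []
    doc_iter = iter(document_samples)
    for i, sample in enumerate(sentence_samples, start=1):
        mixed.append(sample)
        if i % ratio == 0:                  # after every `ratio` sentences,
            doc = next(doc_iter, None)      # slot in one document sample
            if doc is not None:
                mixed.append(doc)
    return mixed

sentences = [f"sent-{i}" for i in range(60)]
documents = ["doc-0", "doc-1"]
mixed = mix_datasets(sentences, documents, ratio=30)
assert len(mixed) == 62
assert mixed.count("doc-0") == 1 and mixed.count("doc-1") == 1
```

Keeping the sentence-level majority is what preserves sentence-level quality while the document samples teach long-context behavior.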

Hallucination Analysis

Types observed:
  1. Premature stopping: the model emits an EOS token before the translation is complete
  2. Redundant content: document-level models generate lengthy explanations beyond the translation itself
Mitigation strategies:
  • EOS token probability thresholding
  • Mixed document/sentence-level training
  • Careful dataset preparation
Document-level fine-tuned models tend to generate lengthy outputs with implicit prior knowledge, sometimes producing factual but off-topic content.
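The EOS-probability thresholding above can be sketched as a decoding-time filter (a minimal illustration over a toy distribution; in practice this runs over the model's softmax output at each step, and the threshold value here is an assumed example):

```python
def pick_token(probs, eos="<eos>", eos_threshold=0.9):
    """Greedy pick, but refuse to stop early: only emit EOS when the model
    is sufficiently confident, otherwise take the best non-EOS token."""
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])
    best_tok, best_p = ranked[0]
    if best_tok == eos and best_p < eos_threshold:
        return next(tok for tok, _ in ranked if tok != eos)
    return best_tok

# EOS is the argmax but below threshold, so translation continues.
assert pick_token({"<eos>": 0.45, "sentence": 0.35, ".": 0.20}) == "sentence"
# A confident EOS passes through and ends generation.
assert pick_token({"<eos>": 0.95, "x": 0.05}) == "<eos>"
```

This directly targets the premature-stopping failure mode, while the mixed-training and dataset-preparation strategies address the redundant-content one.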

4. Conclusion

With proper dataset preparation and fine-tuning techniques, it is possible to:
  1. Significantly improve translation quality (a more than 2x BLEU improvement on document-level data)
  2. Mitigate hallucination issues
  3. Maintain sentence-level quality while enhancing document-level performance
  4. Produce more reliable and coherent translations

5. Future Work

  1. Prepare datasets covering various input scenarios (language styles, cultural backgrounds, dialogue topics)
  2. Balance content types in training data to avoid bias
  3. Address named entity errors through post-generation methods
  4. Explore additional hallucination mitigation techniques

References

  1. Kocmi, T., et al. (2022). “Findings of the 2022 conference on machine translation (WMT22).”
  2. Hu, E., et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models.”
  3. Meta AI. (2024). “Llama 3.1 Model Documentation.”
  4. Ji, Z., et al. (2023). “Survey of Hallucination in Natural Language Generation.”