Author

Kai-Teh Tzeng - Lehigh University

Abstract

This study explores using Retrieval-Augmented Fine-Tuning (RAFT) to enhance bidirectional English-Chinese translation with Llama 3.1-8B. RAFT combines retrieval mechanisms with fine-tuning to provide contextual examples during training.

Key findings:
  • Benchmark fine-tuning achieved the best overall results
  • RAFT showed modest improvements on specific metrics
  • Random-based RAFT sometimes outperformed similarity-based RAFT
  • Translation quality depends heavily on training data relevance

1. Introduction

Background

Large Language Models excel at language tasks but can benefit from domain-specific optimization. This research explores whether RAFT—a technique that augments training with retrieved examples—can improve translation quality.

Research Questions

  1. Can RAFT improve translation compared to standard fine-tuning?
  2. Does similarity-based retrieval outperform random retrieval?
  3. How do different RAFT configurations affect bidirectional translation?

2. Methodology

RAFT Overview

RAFT (Retrieval-Augmented Fine-Tuning) enhances the training process by:
  1. Retrieving relevant examples from a corpus for each training sample
  2. Augmenting the training context with retrieved examples
  3. Fine-tuning the model with this enriched context (see the sketch below)
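To make the augmentation step concrete, here is a minimal sketch of how one training sample could be wrapped with retrieved examples. The prompt template, the field names, and the placeholder retrieve() helper are illustrative assumptions, not the study's exact pipeline.

```python
# Minimal sketch of RAFT-style sample augmentation (steps 1-2 above).
# Prompt template, field names, and the retrieve() placeholder are
# illustrative assumptions, not the study's exact pipeline.

def retrieve(source: str, corpus: list[dict], k: int = 3) -> list[dict]:
    """Return k reference pairs for a source sentence. A real run would
    rank by embedding similarity or sample randomly (see Section 2)."""
    return corpus[:k]  # placeholder: take the first k pairs

def build_raft_sample(pair: dict, corpus: list[dict], k: int = 3) -> dict:
    """Wrap one parallel pair with retrieved pairs as in-context examples."""
    demos = retrieve(pair["en"], corpus, k)
    context = "\n\n".join(f"English: {d['en']}\nChinese: {d['zh']}" for d in demos)
    prompt = (
        "Translate the English sentence into Chinese.\n\n"
        f"Reference examples:\n\n{context}\n\n"
        f"English: {pair['en']}\nChinese:"
    )
    return {"prompt": prompt, "completion": pair["zh"]}

corpus = [{"en": "Hello.", "zh": "你好。"}, {"en": "Thank you.", "zh": "谢谢。"}]
sample = build_raft_sample({"en": "Good morning.", "zh": "早上好。"}, corpus, k=2)
```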
[Figure: RAFT methodology diagram]

Experimental Setup

| Component   | Configuration                 |
|-------------|-------------------------------|
| Base Model  | Llama 3.1-8B Instruct         |
| Fine-tuning | LoRA (r=16, alpha=16)         |
| Dataset     | News Commentary v18.1 (zh-en) |
| GPU         | NVIDIA A100 80GB              |
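As a point of reference, a minimal sketch of a matching LoRA setup with Hugging Face PEFT is shown below. Only r=16 and alpha=16 come from the table; the target modules, dropout, and other details are assumptions.

```python
# Sketch of a LoRA setup matching the table above. Only r=16 and
# lora_alpha=16 come from the experimental setup; the target modules,
# dropout, and other details are illustrative assumptions.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

lora_config = LoraConfig(
    r=16,                                  # rank from the experimental setup
    lora_alpha=16,                         # alpha from the experimental setup
    target_modules=["q_proj", "v_proj"],   # assumed attention projections
    lora_dropout=0.05,                     # assumed
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights should train
```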

Dataset Preparation

The News Commentary dataset contains parallel English-Chinese sentence pairs:
  • Training: 10,000 sentence pairs
  • Evaluation: TED Talks corpus
  • Preprocessed for quality and length consistency (see the sketch below)
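The section does not spell out the preprocessing rules, so the sketch below illustrates one plausible length- and quality-filtering pass over a parallel corpus; the thresholds and the tab-separated file format are assumptions.

```python
# Sketch of parallel-corpus filtering for length and quality consistency.
# Thresholds and the tab-separated input format are assumptions; the
# section does not specify the exact preprocessing rules.

def load_pairs(path: str) -> list[tuple[str, str]]:
    """Read tab-separated English/Chinese sentence pairs."""
    pairs = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                pairs.append((parts[0], parts[1]))
    return pairs

def keep(en: str, zh: str, max_chars: int = 200, max_ratio: float = 3.0) -> bool:
    """Drop empty, overlong, or badly length-mismatched pairs."""
    if not en.strip() or not zh.strip():
        return False
    if len(en) > max_chars or len(zh) > max_chars:
        return False
    ratio = len(en) / max(len(zh), 1)
    return 1 / max_ratio <= ratio <= max_ratio

pairs = [p for p in load_pairs("news-commentary-v18.1.en-zh.tsv") if keep(*p)]
pairs = pairs[:10_000]  # training split size reported above
```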

RAFT Configurations

| Configuration   | Description                                      |
|-----------------|--------------------------------------------------|
| Benchmark       | Standard fine-tuning without retrieval           |
| Similarity RAFT | Retrieve top-k similar examples using embeddings |
| Random RAFT     | Randomly sample k examples from corpus           |
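A minimal sketch contrasting the two retrieval strategies follows. The sentence-transformers library and the all-MiniLM-L6-v2 embedding model are assumptions; the study only specifies top-k embedding similarity versus random sampling.

```python
# Sketch of the two retrieval strategies compared in the study.
# sentence-transformers and the all-MiniLM-L6-v2 model are assumptions;
# the study only specifies "top-k similar examples using embeddings".
import random
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
corpus = [
    "Global growth slowed last year.",
    "Central banks raised interest rates.",
    "Trade volumes continued to recover.",
]  # placeholder source sentences
corpus_emb = encoder.encode(corpus, normalize_embeddings=True)

def similarity_retrieve(query: str, k: int = 3) -> list[int]:
    """Top-k corpus indices by cosine similarity (Similarity RAFT)."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = corpus_emb @ q  # cosine similarity on unit-normalized vectors
    return np.argsort(-scores)[:k].tolist()

def random_retrieve(k: int = 3) -> list[int]:
    """Uniformly sample k corpus indices, ignoring the query (Random RAFT)."""
    return random.sample(range(len(corpus)), k)
```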

3. Results

English-to-Chinese Translation

| Method                    | BLEU | COMET |
|---------------------------|------|-------|
| Baseline (No Fine-tuning) | 15.2 | 0.785 |
| Benchmark Fine-tuning     | 28.4 | 0.856 |
| Similarity RAFT (k=3)     | 27.1 | 0.849 |
| Random RAFT (k=3)         | 27.8 | 0.852 |

Chinese-to-English Translation

| Method                    | BLEU | COMET |
|---------------------------|------|-------|
| Baseline (No Fine-tuning) | 18.7 | 0.812 |
| Benchmark Fine-tuning     | 31.2 | 0.871 |
| Similarity RAFT (k=3)     | 30.5 | 0.865 |
| Random RAFT (k=3)         | 30.9 | 0.868 |

Benchmark fine-tuning consistently outperformed RAFT configurations in this experiment. This may be due to the homogeneous nature of the News Commentary dataset.
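For context on the metrics, here is a minimal sketch of how BLEU and COMET scores of this kind are typically computed with sacrebleu and Unbabel's COMET package; the wmt22-comet-da checkpoint is an assumption, as the section does not name the COMET model used.

```python
# Sketch of scoring translations with BLEU (sacrebleu) and COMET.
# The wmt22-comet-da checkpoint is an assumption; the section does not
# state which COMET model produced the reported scores.
import sacrebleu
from comet import download_model, load_from_checkpoint

sources    = ["The economy grew last year."]  # placeholder evaluation data
hypotheses = ["去年经济增长了。"]               # system outputs
references = ["去年经济实现了增长。"]            # gold translations

# tokenize="zh" applies sacrebleu's Chinese tokenizer to the target side
bleu = sacrebleu.corpus_bleu(hypotheses, [references], tokenize="zh")
print(f"BLEU: {bleu.score:.1f}")

comet_path = download_model("Unbabel/wmt22-comet-da")
comet_model = load_from_checkpoint(comet_path)
data = [{"src": s, "mt": h, "ref": r}
        for s, h, r in zip(sources, hypotheses, references)]
print("COMET:", comet_model.predict(data, batch_size=8, gpus=0).system_score)
```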
[Figure: training performance comparison]

[Figure: BLEU and COMET score comparison]

Analysis

Why RAFT didn’t outperform the benchmark:
  1. Dataset Homogeneity: News Commentary has a consistent style, so retrieved examples add little new information
  2. Retrieval Quality: similarity metrics may not capture translation-relevant features
  3. Context Length: additional examples lengthen the context, potentially diluting focus on the sentence being translated

4. Conclusion

While RAFT shows promise, our experiments suggest that for translation tasks on homogeneous datasets, standard fine-tuning remains competitive. Future work should explore diverse training corpora and better retrieval metrics.

References

  1. Zhang, T., et al. (2024). “RAFT: Adapting Language Model to Domain Specific RAG.”
  2. Lewis, P., et al. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.”
  3. Hu, E., et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models.”