Benchmarking HeArBERT: Comparative Analysis with Traditional and Extended Language Models

10 Sept 2024

Authors:

(1) Aviad Rom, The Data Science Institute, Reichman University, Herzliya, Israel;

(2) Kfir Bar, The Data Science Institute, Reichman University, Herzliya, Israel.

Abstract and Introduction

Related Work

Methodology

Experimental Settings

Results

Conclusion and Limitations

Bibliographical References

4. Experimental Settings

Machine Translation. Our machine-translation architecture is based on a simple encoder-decoder framework, which we initialize with the weights of the BERT model in focus.[5] For example, to translate from Hebrew to Arabic, we might initialize the encoder with the weights of mBERT and the decoder with the weights of CAMeLBERT. To fine-tune the model on machine translation, we use the new “Kol Zchut” (in English, “All Rights”) Hebrew-Arabic parallel corpus,[6] which contains over 4,000 parallel articles in the civil-legal domain, corresponding to 140,000 sentence pairs containing 2.13M Arabic and 1.8M Hebrew words. To the best of our knowledge, ours is the first work to report machine-translation results on this resource; therefore, no established baseline or benchmark exists for comparison. As the corpus is provided without an official train/test split, we apply a random split, allocating 80% of the data for training and the remaining 20% for testing, using the train_test_split function of scikit-learn with a random seed of 42. We evaluate our HeArBERT-based translation system against equivalent systems initialized with other models. The standard BLEU metric (Papineni et al., 2002) is used to compare each system’s generated translations with the single reference translation provided in the corpus. Each system is fine-tuned for ten epochs, and we report the best performance observed across all epochs.
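As a rough sketch of how such a setup could be wired together, the snippet below warm-starts an encoder-decoder model from two pre-trained checkpoints, applies the 80/20 split with the fixed seed mentioned above, and indicates where BLEU scoring would occur. It assumes Hugging Face Transformers, scikit-learn, and sacrebleu; the checkpoint names follow the footnotes, while the corpus loading, training loop, and generation step are placeholders rather than the authors' actual pipeline.

```python
# Rough sketch of the translation setup described above (not the authors' code).
from transformers import EncoderDecoderModel
from sklearn.model_selection import train_test_split

# Warm-start the seq2seq model from two pre-trained BERT checkpoints,
# e.g. mBERT as the Hebrew-side encoder and CAMeLBERT as the Arabic-side decoder.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-multilingual-cased",              # encoder
    "CAMeL-Lab/bert-base-arabic-camelbert-mix",  # decoder
)

# Placeholder for the Kol Zchut sentence pairs; in practice these would be the
# ~140,000 Hebrew-Arabic pairs loaded from the corpus release.
sentence_pairs = [("he_1", "ar_1"), ("he_2", "ar_2"), ("he_3", "ar_3"),
                  ("he_4", "ar_4"), ("he_5", "ar_5")]

# Random 80/20 split with the fixed seed reported in the text.
train_pairs, test_pairs = train_test_split(
    sentence_pairs, test_size=0.2, random_state=42
)

# Fine-tuning for ten epochs and decoding are omitted here. Evaluation would
# compare generated translations against the single corpus reference, e.g.:
#   import sacrebleu
#   bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score
```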

Baseline Language Models. To contrast HeArBERT with an equivalent model trained on texts in both Arabic and Hebrew scripts, we pre-train another model and tokenizer following the same procedure, but without the transliteration preprocessing for the Arabic data. This model is denoted with a subscript "NT" (no transliteration): HeArBERTNT.

We compare our model with a number of existing models. The first, mBERT, was originally pre-trained on a wide range of languages, including both Arabic and Hebrew. We also include two monolingual Arabic language models, CAMeLBERT[7] and GigaBERT[8], with the specific Hugging Face versions given in the footnotes. In some experiments, we adopt a technique inspired by Rom and Bar (2021): we expand the vocabulary of an existing Arabic language model’s tokenizer by appending a Hebrew-transliterated version of each Arabic token and associating it with the original token’s identifier, as sketched below. We denote such extended models by adding “ET” (extended tokenizer) to the model name.
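One way to picture the extended-tokenizer (ET) idea is to alias each Hebrew-transliterated surface form to the identifier of its original Arabic token, so that Hebrew-script input reuses the existing Arabic embeddings instead of adding new ones. The sketch below is illustrative only: the partial character table and the token_to_id helper are assumptions, not the transliteration mapping or implementation used in the paper.

```python
# Illustrative sketch of an "extended tokenizer" (ET): every Arabic vocabulary
# item also gets a Hebrew-script alias that resolves to the same token id.
from transformers import AutoTokenizer

# Partial, illustrative Arabic-to-Hebrew character table (not the paper's mapping).
ARABIC_TO_HEBREW = {"ا": "א", "ب": "ב", "ت": "ת", "د": "ד", "ر": "ר",
                    "س": "ס", "ل": "ל", "م": "מ", "ن": "נ", "ق": "ק"}

def transliterate(token: str) -> str:
    """Map an Arabic-script token to Hebrew script, character by character."""
    return "".join(ARABIC_TO_HEBREW.get(ch, ch) for ch in token)

tokenizer = AutoTokenizer.from_pretrained("CAMeL-Lab/bert-base-arabic-camelbert-mix")
vocab = tokenizer.get_vocab()  # token string -> token id

# Alias table: Hebrew-transliterated surface form -> original Arabic token id.
aliases = {}
for tok, idx in vocab.items():
    heb = transliterate(tok)
    if heb != tok and heb not in vocab:
        aliases[heb] = idx

def token_to_id(token: str) -> int:
    """Resolve a token, letting Hebrew aliases reuse the Arabic token ids."""
    return vocab.get(token, aliases.get(token, tokenizer.unk_token_id))
```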

All these models share the same architecture size as our proposed model.

This paper is available on arXiv under a CC BY 4.0 DEED license.


[5] We use HuggingFace’s EncoderDecoderModel.

[6] https://releases.iahlt.org/

[7] CAMeL-Lab/bert-base-arabic-camelbert-mix

[8] lanwuwei/GigaBERT-v4-Arabic-and-English