Reflections on HeArBERT: Transliteration Benefits, Research Horizons, and Limitations

10 Sept 2024

Authors:

(1) Aviad Rom, The Data Science Institute, Reichman University, Herzliya, Israel;

(2) Kfir Bar, The Data Science Institute, Reichman University, Herzliya, Israel.

Abstract and Introduction

Related Work

Methodology

Experimental Settings

Results

Conclusion and Limitations

Bibliographical References

6. Conclusion

Arabic and Hebrew, both Semitic languages, share structural similarities and a stock of cognates. To help a bilingual language model recognize these cognates, we introduced a novel language model tailored for Arabic and Hebrew, in which Arabic text is transliterated into the Hebrew script prior to both pre-training and fine-tuning. To assess the impact of transliteration, we compared our model against another language model trained on the identical dataset but without the transliteration preprocessing step. Fine-tuning our model for the machine translation task yielded promising outcomes, suggesting that the transliteration step offers tangible benefits to the translation process.

Compared with the translation combinations involving other language models, we see comparable results; this is encouraging given that the dataset we used for pre-training our model is approximately 60% smaller than theirs.

As a future avenue of research, we intend to train the model on an expanded dataset and to explore scaling up its architecture. In this study, our emphasis was on a transformer encoder; we are keen to investigate the effects of applying transliteration within a decoder architecture once such a model becomes available for Hebrew.

Limitations

The transliteration algorithm from Arabic to Hebrew is based on a simple deterministic lookup table. However, the transliteration is not always that straightforward, and this simple algorithm generates some odd renderings that we would like to fix. For example, our algorithm does not place a final-form letter at the end of a transliterated Arabic word in Hebrew. Another challenge with transliteration into Hebrew is that, for some words, a Hebrew writer may choose to omit long-vowel characters and readers will still be able to understand the word; this phenomenon is referred to as "Ktiv Hasser" (defective spelling) in Hebrew. Yet there exists a preference for certain word representations over others. This inconsistency makes it more challenging to align transliterated Arabic words with their cognates as those cognates are actually written. Our transliteration algorithm always renders an Arabic word letter by letter, which may differ from how the word is typically written in Hebrew.
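To make the lookup-table approach and the final-form issue concrete, below is a minimal sketch of a character-level transliterator in Python. The mapping, the example word, and the function names (`transliterate_word`, `with_final_form`) are illustrative assumptions rather than the exact table or code used for HeArBERT, and the final-form step shows one possible fix for the rendering issue described above.

```python
# Minimal sketch of a deterministic Arabic-to-Hebrew transliterator.
# The mapping below is illustrative and incomplete; it is NOT the exact
# table used for HeArBERT.

AR_TO_HE = {
    "ا": "א",  # alif -> aleph
    "ب": "ב",  # ba   -> bet
    "ج": "ג",  # jim  -> gimel
    "د": "ד",  # dal  -> dalet
    "ه": "ה",  # ha   -> he
    "و": "ו",  # waw  -> vav
    "ز": "ז",  # zay  -> zayin
    "ح": "ח",  # hha  -> het
    "ط": "ט",  # tta  -> tet
    "ي": "י",  # ya   -> yod
    "ك": "כ",  # kaf  -> kaf
    "ل": "ל",  # lam  -> lamed
    "م": "מ",  # mim  -> mem
    "ن": "נ",  # nun  -> nun
    "س": "ס",  # sin  -> samekh
    "ع": "ע",  # ayn  -> ayin
    "ف": "פ",  # fa   -> pe
    "ص": "צ",  # sad  -> tsadi
    "ق": "ק",  # qaf  -> qof
    "ر": "ר",  # ra   -> resh
    "ش": "ש",  # shin -> shin
    "ت": "ת",  # ta   -> tav
}

# Hebrew letters that take a different (final) form at the end of a word.
FINAL_FORMS = {"כ": "ך", "מ": "ם", "נ": "ן", "פ": "ף", "צ": "ץ"}


def transliterate_word(word: str) -> str:
    """Map each Arabic character through the lookup table;
    characters without a mapping are kept as-is."""
    return "".join(AR_TO_HE.get(ch, ch) for ch in word)


def with_final_form(hebrew_word: str) -> str:
    """One possible fix for the limitation described above: swap the
    last letter for its final form when one exists."""
    if hebrew_word and hebrew_word[-1] in FINAL_FORMS:
        return hebrew_word[:-1] + FINAL_FORMS[hebrew_word[-1]]
    return hebrew_word


if __name__ == "__main__":
    # Arabic "salam" (peace), a cognate of Hebrew "shalom".
    word = "سلام"
    raw = transliterate_word(word)     # -> "סלאמ" (non-final mem)
    print(raw, with_final_form(raw))   # -> "סלאמ סלאם"
```

For the Arabic word سلام (salām), this sketch produces סלאמ, or סלאם once the final-form fix is applied, while the Hebrew cognate is conventionally written שלום; the letter-by-letter rendering of the long vowel is exactly the kind of mismatch with conventional Hebrew spelling discussed above.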

Another limitation is the relatively small size of the dataset we used for pre-training the language model, compared to other existing language models of the same architecture size.

This paper is available on arxiv under CC BY 4.0 DEED license.