Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space

10 Sept 2024

Authors:

(1) Aviad Rom, The Data Science Institute, Reichman University, Herzliya, Israel;

(2) Kfir Bar, The Data Science Institute, Reichman University, Herzliya, Israel.

Abstract

We train a bilingual Arabic-Hebrew language model using a version of the Arabic texts transliterated into the Hebrew script, so that both languages are represented in the same script. Given the morphological and structural similarities between Arabic and Hebrew, and the large number of cognates they share, we assess the performance of a language model that employs a unified script for both languages on machine translation, a task that requires cross-lingual knowledge. The results are promising: our model outperforms a contrasting model that keeps the Arabic texts in the Arabic script, demonstrating the efficacy of the transliteration step. Despite being trained on a dataset approximately 60% smaller than those of other existing language models, our model appears to deliver comparable performance in machine translation in both translation directions.

Keywords: bilingual language model, transliteration, Arabic, Hebrew

1. Introduction

Pre-trained language models have become essential for state-of-the-art performance in monolingual and multilingual natural language processing (NLP) tasks. They are pre-trained once and can then be fine-tuned for various downstream NLP tasks. It has been shown that language models generalize better on multilingual tasks when the target languages share structural similarity, possibly due to script similarity (K et al., 2020). Typically, language models are trained on sequences of tokens that often correspond to words and subword components.

Arabic and Hebrew are two Semitic languages that share similar morphological structures in the composition of their words, but use distinct scripts for their written forms. The Hebrew script primarily serves Hebrew, but is also employed in various other languages used by the Jewish population. These languages include Yiddish (or “Judeo-German”), Ladino (or “Judeo-Spanish”), and Judeo-Arabic, which comprises a cluster of Arabic dialects spoken and written by Jewish communities residing in Arab nations. To some extent, Judeo-Arabic can be perceived as an Arabic variant written in the Hebrew script. Most of the vocabulary in Judeo-Arabic consists of Arabic words that have been transliterated into the Hebrew script.

Words in two languages that share similar meanings, spellings, and pronunciations are known as cognates. Arabic and Hebrew cognates share similar meanings and spellings despite being written in different scripts; their pronunciations are not necessarily the same. Numerous lexicons have been created to record these cognates. One of those lexicons[1] lists a total of 915 Hebrew-Arabic spelling equivalents, of which 435 have been identified as authentic cognates, signifying that they possess identical meanings. Analyzing a parallel Hebrew-Arabic corpus named Kol Zchut[2] using this lexicon, we found instances of those cognates in about 50% of the sentences.
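One way such a coverage figure could be estimated is sketched below. This is only an illustration under stated assumptions: the function name `cognate_sentence_ratio`, the toy lexicon, and the toy sentence pairs are all hypothetical, and the whitespace tokenization with exact matching is a simplification rather than the procedure used in the paper.

```python
def cognate_sentence_ratio(sentence_pairs, cognate_pairs):
    """Fraction of (Hebrew, Arabic) sentence pairs that contain at least
    one lexicon cognate on both sides.

    Assumption: tokens are split on whitespace and matched exactly,
    with no morphological analysis.
    """
    if not sentence_pairs:
        return 0.0
    hits = 0
    for he_sentence, ar_sentence in sentence_pairs:
        he_tokens = set(he_sentence.split())
        ar_tokens = set(ar_sentence.split())
        # A pair counts once if any cognate appears in both sentences.
        if any(he in he_tokens and ar in ar_tokens for he, ar in cognate_pairs):
            hits += 1
    return hits / len(sentence_pairs)


# Hypothetical toy lexicon of (Hebrew, Arabic) cognates.
toy_cognates = {("בית", "بيت"), ("שלום", "سلام")}

# Hypothetical toy parallel corpus.
toy_corpus = [
    ("זה בית גדול", "هذا بيت كبير"),
    ("אני אוהב קפה", "أنا أحب القهوة"),
]

ratio = cognate_sentence_ratio(toy_corpus, toy_cognates)
print(ratio)  # 0.5 for this toy data
```

Exact token matching undercounts in practice, because both languages attach prefixes such as the definite article and prepositions directly to the word; a more faithful count would segment or lemmatize the tokens first.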

The purpose of this work is to take advantage of the potentially high frequency of cognates in Arabic and Hebrew in building a bilingual language model using only one script. Subsequently, the model will be fine-tuned on NLP tasks, such as machine translation, which can benefit from the innate bilingual proficiency to achieve better results. To ensure that cognates are mapped onto a consistent character space, the model uses Arabic texts that are transliterated into the Hebrew script, which mimics the writing system used in Judeo-Arabic. We call this new language model HeArBERT.[3]
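The idea of mapping Arabic text onto the Hebrew character space can be sketched as a letter-by-letter substitution. The table below is an illustrative assumption based on common Semitic letter correspondences; it covers only part of the Arabic alphabet and is not the mapping used to train HeArBERT. The names `ARABIC_TO_HEBREW`, `FINAL_FORMS`, and `transliterate` are hypothetical.

```python
# Illustrative, partial Arabic-to-Hebrew letter table.
# Assumption: NOT the paper's mapping; letters without a direct Hebrew
# counterpart (e.g. those usually written with a diacritic mark in
# Judeo-Arabic) are omitted and pass through unchanged.
ARABIC_TO_HEBREW = {
    "ا": "א", "ب": "ב", "ت": "ת", "ج": "ג", "ح": "ח", "د": "ד",
    "ر": "ר", "ز": "ז", "س": "ס", "ش": "ש", "ص": "צ", "ط": "ט",
    "ع": "ע", "ف": "פ", "ق": "ק", "ك": "כ", "ل": "ל", "م": "מ",
    "ن": "נ", "ه": "ה", "و": "ו", "ي": "י",
}

# Hebrew letters that take a distinct form at the end of a word.
FINAL_FORMS = {"כ": "ך", "מ": "ם", "נ": "ן", "פ": "ף", "צ": "ץ"}


def transliterate_word(word):
    """Map each Arabic letter to Hebrew, then fix the word-final form."""
    letters = [ARABIC_TO_HEBREW.get(ch, ch) for ch in word]
    if letters and letters[-1] in FINAL_FORMS:
        letters[-1] = FINAL_FORMS[letters[-1]]
    return "".join(letters)


def transliterate(text):
    """Transliterate whitespace-separated words; unknown characters are kept."""
    return " ".join(transliterate_word(w) for w in text.split())


print(transliterate("بيت"))   # בית
print(transliterate("سلام"))  # סלאם
```

In this toy example the Arabic word for "house" comes out as the same character string as its Hebrew cognate, which is the kind of token alignment the single-script setup aims for. The second word does not match the usual Hebrew spelling of its cognate, which shows that transliteration alone does not make every cognate pair identical.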

We test our new model on machine translation, a downstream task that requires knowledge of two languages, and report promising results. In summary, the primary contributions of our work are: (1) we release a new bilingual Arabic-Hebrew language model; and (2) we show that pre-training a bilingual language model on transliterated texts, as a way of aligning tokens onto the same character space, is beneficial for machine translation.

This paper is available on arXiv under the CC BY 4.0 DEED license.

[1] https://seveleu.com/pages/semiic-syntax-morpho/comparative-sem

[2] https://releases.iahlt.org/

[3] https://huggingface.co/aviadrom/HeArBERT
