Neo-Aramaic Web Corpora: Christian Urmi and Turoyo

This resource makes two Neo-Aramaic corpora newly available: Christian Urmi (~600,000 tokens) and Ṭuroyo (~600,000 tokens). In both corpora, approximately 75% of the tokens are morphologically annotated.

Christian Urmi corpus Ṭuroyo corpus

Funding

Russian Foundation for Basic Research, Project 17-04-00472 (2017-2019); project leader: A. K. Lyavdansky.

Annotation and Search

The annotation for both corpora was created with the help of the morphological analyzer UniParser, designed by T. Arkhangelskiy. The automatic analysis includes lemmatisation and morphological tagging of tokens. In both corpora, each token (word form) has been provided with its lemma and translation (for C. Urmi, into Russian; for Ṭuroyo, into English and German). In the Ṭuroyo corpus, English lemmata for nouns, adjectives, adverbs and particles have been translated from the German lemmata in H. Ritter’s dictionary (1979). The verbal lemmata were translated from the German of H. Ritter’s grammatical sketch of Ṭuroyo (1990), with some corrections. Translations for lexemes absent from both of these sources were obtained by elicitation from informants and by translating Swedish glosses from the lexicon of J. Beṯ-Şawoce (2012). Russian lemmata for the corpus of C. Urmi were taken from the sketch of the dictionary of C. Urmi created by E. Gavrilova and A. Lyavdansky, and were translated from the glossaries of Khan (2016), Friedrich (1960), and Polotsky (1967).

The universal search system, developed by T. Arkhangelskiy, has been adapted for the corpora of C. Urmi and Ṭuroyo. Searches by lexeme, word form, translation and grammatical tags (see tagsets for each corpus) are all available. Several parameters can be combined to form an advanced search query. It is also possible to search for several elements at a specified distance from one another, and to restrict the search to a subcorpus. For more information, use the ❔ button on the search pages of both corpora.

The texts are not available in their entirety for reasons of copyright. The maximum context length is therefore 7 sentences.

The Corpus of Christian Urmi

Structure

This corpus of Christian Urmi Neo-Aramaic comprises 46 printed editions of Neo-Aramaic texts in a variety of the Latin script (the Assyrian New Alphabet), which were issued during the 1930s in the Soviet Union. For the history of the Assyrian New Alphabet project and the details of its orthography, see A. Lyavdansky “Neo-Aramaic Texts in the New Alphabet Published in the Soviet Union 1929-1938” (forthcoming). When selecting texts for the corpus, preference was given to literary texts printed according to the rules of the stabilized orthography adopted in 1933. Most of the selected texts are translations of Russian and other literature (fiction) and popular science texts. Some original literary compositions in Christian Urmi have also been included.

Some newspaper and oral texts have also been digitized for inclusion within this corpus, but they have not yet been included within the annotated corpus because they are transcribed according to other systems of orthography.

The complete list of the texts included within this annotated corpus is available via the ‘Select subcorpus’ button. A bibliography of edited texts in the Assyrian New Alphabet is also available.

Special characters

If the "standard" input method is selected in the settings (which it is by default), the following combinations of characters will be automatically replaced in search terms:

b1 = в
c1 = ç
s1 = ş
t1 = ţ
z1 = ƶ
i1 = ь
e1 = ə
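
The substitution listed above can be illustrated with a short Python sketch. It is not the code used by the corpus search system: the function name expand_urmi_query is hypothetical, and applying the replacements one after another is an assumption that is safe here only because no replacement character can itself start a new combination.

```python
# Mapping copied from the list above (input combination -> target character).
URMI_REPLACEMENTS = {
    "b1": "в",
    "c1": "ç",
    "s1": "ş",
    "t1": "ţ",
    "z1": "ƶ",
    "i1": "ь",
    "e1": "ə",
}


def expand_urmi_query(text: str) -> str:
    """Replace every listed ASCII combination with its special character.

    Hypothetical helper for illustration; the real search interface may
    handle case, ordering, or other details differently.
    """
    for combination, character in URMI_REPLACEMENTS.items():
        text = text.replace(combination, character)
    return text
```

Under these assumptions, a search term typed as s1e1 would be passed on as the two characters ş and ə, while text containing none of the combinations is left unchanged.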

Corpus Creators

E. Gavrilova (RSUH), J. Zarezaeva (RSUH), C. Benyaminova (RSUH) and A. Lyavdansky (IOCS, HSE University) were responsible for the digitization and processing of the constituent texts. Various technical issues were solved with the help of E. Barsky. J. Kipriyanovich and M. Kalinin participated in the final processing of texts.

T. Arkhangelskiy configured the morphological analyser. E. Gavrilova, A. Lyavdansky and T. Arkhangelskiy processed the analyser’s dictionary. T. Arkhangelskiy and A. Lyavdansky created paradigms for the analyser.

Acknowledgments

We express our gratitude to:

N. Kuzin (Freie Universität, Berlin) for his consultation on various issues related to the functioning of the morphological analyser and to the configuration of the corpus in general;
E. Cohen (Tel Aviv University) and N. Wildner, who provided us with some digitized texts;
V. Golinets (Jüdische Hochschule, Heidelberg) for his help in providing us with the German editions of the texts in the Assyrian New Alphabet;
T. Arkhangelskiy for his help at all stages of the project and for putting the corpus on the website.

Contacts

A. Lyavdansky and E. Gavrilova provide technical support for these corpora. Please send any comments and suggestions to alyavdansky@hse.ru.

Future Development of the Corpus

We plan to raise the share of automatically annotated tokens to 85-90% of the corpus in the near future, while also correcting and improving the quality of the existing annotations. In addition to the Russian lemmata, we will add their English equivalents, together with metadata for all the constituent texts. The corpus will also be expanded and diversified. Our team plans to digitize and add texts of various genres and to create sub-corpora: poetry, journalism, non-fiction, popular science, and metalinguistic literature (e.g. grammars and textbooks of the Assyrian language). Finally, we plan to process the oral component of the corpus of Christian Urmi, for which a new system of orthography will be created.

Ṭuroyo Corpus

Corpus composition

The Ṭuroyo corpus primarily includes oral texts recorded by various scholars (E. Prym, A. Socin, H. Ritter, O. Jastrow, S. Talay) and by native speakers themselves (J. Beṯ-Şawoce), from the end of the 19th century to the present. Two genres are represented in the corpus: folk tales and oral history. The corpus contains both single-speaker and multi-speaker texts. The latter are represented by interviews recorded by J. Beṯ-Şawoce.

The full list of texts currently in the corpus is available in the tab ‘Select subcorpus’. Our team is still working on the metadata for the texts.

The texts in the Ṭuroyo corpus vary considerably in orthography because they come from speakers of different dialects and from different periods. The texts have been normalized to some degree, but the work on the orthography of the corpus is far from complete. As a compromise, we record all or most orthographic variants of each lemma or word form, largely as they appear in the sources. Thus, a search for any one variant automatically returns all of its variants.
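
The idea of returning all variants for any one of them can be sketched in Python as follows. This is only an illustration of the principle, not the corpus implementation: the function names and the placeholder strings form_a, form_b and form_c are hypothetical and stand in for real orthographic variants.

```python
def build_variant_index(groups):
    """Map every variant to the full set of variants in its group.

    `groups` is an iterable of lists, each list holding the spellings
    recorded for one lemma or word form (hypothetical input format).
    """
    index = {}
    for group in groups:
        variants = frozenset(group)
        for variant in variants:
            index[variant] = variants
    return index


def expand_variants(index, form):
    """Return all known variants of `form`, or just `form` if it is unknown."""
    return sorted(index.get(form, {form}))


# Placeholder data: two spellings of one item and a single spelling of another.
index = build_variant_index([["form_a", "form_b"], ["form_c"]])
matches = expand_variants(index, "form_b")
```

In this sketch a query for either spelling of the first item is expanded to both spellings before the texts are searched, so the user does not need to know which spelling a given source uses.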

If the "standard" input method is selected in the settings (which it is by default), the following combinations of characters will be automatically replaced in search terms:

' = ʕ
" = ʔ
d_ = ḏ
d_/ = ḏ̣
h/ = ḥ
s1 = š
s/ = ṣ
t_ = ṯ
t/ = ṭ
e1 = ə
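
One detail of the list above is that d_/ begins with d_, so a naive replacement could turn d_/ into ḏ followed by a stray slash. The Python sketch below shows one way to avoid this by matching the longest combination first. It is an assumption about how such input could be handled, not a description of the corpus software, and the name expand_turoyo_query is hypothetical.

```python
import re

# Mapping copied from the list above (input combination -> target character).
TUROYO_REPLACEMENTS = {
    "'": "ʕ",
    '"': "ʔ",
    "d_": "ḏ",
    "d_/": "ḏ̣",
    "h/": "ḥ",
    "s1": "š",
    "s/": "ṣ",
    "t_": "ṯ",
    "t/": "ṭ",
    "e1": "ə",
}

# Sort alternatives longest first so that "d_/" is tried before "d_".
_COMBINATIONS = sorted(TUROYO_REPLACEMENTS, key=len, reverse=True)
_PATTERN = re.compile("|".join(re.escape(c) for c in _COMBINATIONS))


def expand_turoyo_query(text: str) -> str:
    """Replace each listed combination in a single left-to-right pass.

    Hypothetical helper for illustration only.
    """
    return _PATTERN.sub(lambda match: TUROYO_REPLACEMENTS[match.group(0)], text)
```

Because the pattern is applied in a single pass, a term typed as t/uroyo becomes ṭuroyo, and d_/ is converted as a whole rather than as d_ plus a leftover slash.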

Creators of the corpus

The corpus was created by a team of Russian scholars. The digitization and normalization of the texts were carried out at different periods by Y. Furman and S. Loesov (NRU HSE Moscow), N. Kuzin (FU Berlin), M. Kalinin (SCMIPS, Moscow), S. Koval (RSUH, Moscow), and independent researchers E. Barsky and Y. Kirpianovich.

The morphological analyzer and the website were developed by T. Arkhangelskiy. The nominal dictionary of Ṭuroyo necessary for the morphological analysis was checked and corrected by N. Kuzin and S. Koval. The paradigms and the general model of Ṭuroyo morphology were created by T. Arkhangelskiy, Y. Furman, and N. Kuzin.

Acknowledgments

The creators of the corpus express their gratitude to J. Beṯ-Şawoce, who provided scans and .txt files for most of his publications. The corpus could not have come into being without the constant technical support and help of T. Arkhangelskiy.

Contact

The corpus is maintained and further developed by N. Kuzin and Y. Furman. Please send us your suggestions and let us know about mistakes in our corpus by using one of these addresses:

yfurman at hse dot ru
nikitakuzin at zedat dot fu hyphen berlin dot de

Roadmap

Plans for the further development of the corpus:

Improving the quality of POS-tagging
Adding texts of different genres
Correcting existing texts and improving the orthography
Creating the metadata for the texts