Enhancing language technology for Anatolian languages

The Anatolian language branch occupies a unique position in the development of Indo-European languages. It is the earliest offshoot of the Indo-European family, diverging around 4000 years ago. Despite their significance for understanding language evolution, Anatolian languages remain vastly understudied compared to other major divisions of this language family. Nine Anatolian languages are attested in inscriptions dating from as early as the 18th century BCE. Only four, Hittite, Luwian, Lycian and Lydian, are sufficiently preserved for grammars to have been produced.

Textual scarcity and limited technological support have led to a lack of natural language processing (NLP) tools with Anatolian coverage. Most resources only contain images of fragments of cuneiform tablets or hieroglyphic inscriptions, sometimes enriched with transliteration, annotation, or translation into English, German or French.

This collaborative project brings together scholars from archaeology, linguistics, computer science, and digital humanities to convert available data into structured, machine-readable formats. The core of the research is a joint model for Anatolian and related languages of the same historical period. This will enable researchers to reconstruct missing or uncertain tablet fragments and improve annotation and translation.

We will also conduct a comparative linguistic analysis across the language family. This includes studying syntactic and semantic structures as well as domain specificity including numismatics, laws, sepulchral inscriptions, protocols and letters.

Outcomes will facilitate processing and digitisation of manuscripts and tablets at scale and advance NLP for crucial yet underrepresented languages. New datasets and models will be created, filling a major gap in NLP for ancient languages.

MDAP's expertise in NLP, data collection and preparation, model and library development, and linguistic analysis will allow us to develop sustainable, reusable open-source resources. We will work with the team to evaluate the ability of OCR models and LLMs to process the corresponding writing systems. MDAP's experience working on the undeciphered Linear A script will prove invaluable for tackling the complex writing systems of the Anatolian languages.

Who's involved

Chief Investigator

Dr Ekaterina Vylomova, School of Computing and Information Systems, University of Melbourne

Co-investigators

Christopher Guest, Graduate Researcher, School of Computing and Information Systems, University of Melbourne

Dr Jey Han Lau, School of Computing and Information Systems, University of Melbourne

Professor Trevor Cohn, School of Computing and Information Systems, University of Melbourne

MDAP research collaborators

Kabir Manandhar Shrestha, Dr Robert Turnbull