Computational models of language change

Agent-based modelling

In my PhD project (2019-2023), I use agent-based models to investigate how individual behaviour adds up to collective behaviour, and how individuals can be stimulated to make choices that lead to desirable collective behaviour. As a case study of collective behaviour, I study language change, a domain for which abundant data is available at both the individual and the collective level. Speakers of a language want to minimize effort (self-interest), while still getting their message across (collectively desirable behaviour). I develop agent-based models of language change that rely on general principles, so that they can be applied to other collective systems, such as organizational, societal or traffic systems. My research is funded by the Flanders AI Programme, challenge Multi-agent Collaborative AI.
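The tension between effort minimization and communicative success can be illustrated with a toy simulation. The sketch below is purely illustrative and is not the model used in the project: the function `simulate`, all of its parameters (`effort_bias`, `p_understood_short`, `step`) and the update rule are assumptions chosen for brevity. Each agent holds a probability of using a short, low-effort form; the short form is understood only some of the time, while the long form is always understood.

```python
import random


def simulate(n_agents=20, n_rounds=5000, effort_bias=0.02,
             p_understood_short=0.8, step=0.05, seed=0):
    """Toy model (hypothetical): return the mean probability of using the short form."""
    rng = random.Random(seed)
    # Every agent starts out using the short form half of the time.
    p_short = [0.5] * n_agents

    def nudge(agent, delta):
        # Keep each probability inside [0, 1].
        p_short[agent] = min(1.0, max(0.0, p_short[agent] + delta))

    for _ in range(n_rounds):
        speaker, hearer = rng.sample(range(n_agents), 2)
        use_short = rng.random() < p_short[speaker]
        if use_short:
            if rng.random() < p_understood_short:
                # Success: both parties become more inclined to use the short form.
                nudge(speaker, step)
                nudge(hearer, step)
            else:
                # Failure: the speaker backs off from the short form.
                nudge(speaker, -step)
        else:
            # The long form always succeeds, but self-interest
            # nudges the speaker towards less effort next time.
            nudge(speaker, effort_bias)
    return sum(p_short) / n_agents


if __name__ == "__main__":
    for p in (0.0, 0.5, 0.8, 1.0):
        print(p, round(simulate(p_understood_short=p), 3))
```

Running the script prints the mean preference for the short form at several levels of intelligibility. In this toy setting, the short form spreads through the population only when it is understood often enough; otherwise failed communication holds it back.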

In a master’s course project (2016-2017), I developed a more language-specific agent-based model of historical change in the genitive in Germanic languages. Data from the Icelandic saga corpus was used to initialize agents with an ancestor of the current Scandinavian languages. In some experiments, language contact between Scandinavian and Middle German agents was simulated.

  • Paper “Modelling change in use of the genitive: an agent-based approach” (unpublished) [pdf]
  • Source code [link]

Reconstructing language phylogeny

In my MSc thesis, I applied the machine learning paradigm, successful in many computing tasks, to reconstruct language ancestry. I proposed the task of word prediction: a machine learning model trained on pairs of words in two languages learns the sound correspondences between those languages and should be able to predict unseen words. I used two machine learning models, a recurrent neural network (RNN) encoder-decoder and a structured perceptron, to perform this task. I showed that word prediction yields results for multiple tasks in historical linguistics, such as phylogenetic tree reconstruction, identification of sound correspondences and cognate detection. The thesis was supervised by Gerhard Jäger (SfS, University of Tübingen) and Jelle Zuidema (ILLC, University of Amsterdam).

In my BSc thesis [pdf], I used Bayesian inference to create a kinship tree of Dutch dialects, based on data from the Reeks Nederlandse Dialectatlassen. The words were aligned and converted to phonetic features so that they could be processed by a Bayesian inference algorithm. The resulting tree was compared to an existing Dutch dialect map. The thesis was supervised by Alexis Dimitriadis and Martin Everaert from UiL OTS.

Other projects: Linguistic data collection and processing

Crowdsourcing, that is, asking lay people to perform a task, can be a powerful tool to collect data on language in use. At the Dutch Language Institute (INT), I developed the platform Taalradar (language radar), which asks speakers about their attitudes towards neologisms and charts Dutch language variation.

I developed deep neural models for linguistic enrichment (POS tagging and lemmatization) of historical text. At the INT, I co-supervised an MSc thesis by Silke Creten, who evaluated different taggers (deep learning and traditional) on Corpus Gysseling and Corpus Van Reenen-Mulder.

I developed the web interface for DiaMaNT, a diachronic semantic lexicon of the Dutch language. This lexicon connects several historical Dutch lexica in a Linked Open Data structure, which makes it possible to search for historical synonyms of a concept across all time periods.

I was involved in the software development of CLARIAH Chaining Search, a Python library and Jupyter notebook that facilitates combined search in lexica, corpora and treebanks, as well as the combination and processing of the results.

I contributed to TREC OpenSearch, a platform developed at, among others, the University of Amsterdam to evaluate academic search engine algorithms with real users.