Computational models of language change

MSc thesis: Reconstructing language ancestry by performing word prediction with neural networks

In recent years, computational methods have led to new discoveries in historical linguistics. In my thesis, I applied machine learning, successful in many computing tasks, to this field. I proposed the task of word prediction: a machine learning model trained on pairs of words in two languages learns the sound correspondences between those languages and should be able to predict unseen words.
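The idea of word prediction can be illustrated with a minimal sketch. It is not the thesis implementation: instead of a neural model it simply counts correspondences, it assumes the word pairs are already aligned sound by sound, and it lets spelling stand in for phonemes. The function names and the toy Dutch-German pairs are hypothetical.

```python
from collections import Counter, defaultdict


def learn_correspondences(pairs):
    """Count how often each source sound corresponds to each target sound."""
    counts = defaultdict(Counter)
    for source, target in pairs:
        # Assumption: words are pre-aligned, one sound per position.
        if len(source) != len(target):
            raise ValueError(f"unaligned pair: {source!r}, {target!r}")
        for s, t in zip(source, target):
            counts[s][t] += 1
    return counts


def predict(word, counts):
    """Map each sound to its most frequent counterpart; copy unseen sounds."""
    return "".join(
        counts[s].most_common(1)[0][0] if s in counts else s
        for s in word
    )


# Toy training data (hypothetical, simplified spellings).
training_pairs = [("dat", "das"), ("wat", "was")]
correspondences = learn_correspondences(training_pairs)

# The learned t -> s correspondence carries over to a word not seen in training.
prediction = predict("nat", correspondences)
```

A neural model replaces the counting step with learned, context-sensitive mappings, but the training signal is the same: pairs of words in two languages.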

I used two models, a recurrent neural network (RNN) encoder-decoder and a structured perceptron, to perform this task. I showed that word prediction yields results for several tasks in historical linguistics: phylogenetic tree reconstruction, identification of sound correspondences and cognate detection.
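One way predictions could feed into these downstream tasks is through a distance between the predicted and the attested word. The sketch below assumes a length-normalized edit distance; this is an illustrative choice, not necessarily the measure used in the thesis, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of sounds."""
    previous = list(range(len(b) + 1))
    for i, sound_a in enumerate(a, start=1):
        current = [i]
        for j, sound_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                         # deletion
                current[j - 1] + 1,                      # insertion
                previous[j - 1] + (sound_a != sound_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


def prediction_distance(predicted, attested):
    """Edit distance scaled to [0, 1] by the length of the longer word."""
    longest = max(len(predicted), len(attested))
    return edit_distance(predicted, attested) / longest if longest else 0.0


def mean_distance(pairs):
    """Average prediction distance over (predicted, attested) word pairs."""
    return sum(prediction_distance(p, a) for p, a in pairs) / len(pairs)
```

Under these assumptions, a low distance for a single word pair is evidence that the two words are cognate, and the mean distance per language pair could fill a distance matrix for a clustering method such as UPGMA or neighbor joining.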

On top of this, I showed that the task of word prediction can be extended to phylogenetic word prediction, in which information is shared between language pairs, based on the assumed structure of the ancestry tree. This task could be used for protoform reconstruction and could in the future lead to the direct reconstruction of the optimal tree at prediction time.

The thesis was supervised by Gerhard Jäger (SfS, University of Tübingen) and Jelle Zuidema (ILLC, University of Amsterdam).

Agent-based modelling of change in use of the genitive

I developed an agent-based model of historical change in the use of the genitive in Germanic languages. Data from the Icelandic saga corpus was used to initialize agents with an Old Norse language model, representing an ancestor of the current Scandinavian languages. In some experiments, language contact between Scandinavian and Middle German agents was simulated. Simulations tend to converge to situations observed in current Germanic languages.
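The general mechanism of such a simulation can be sketched as follows. This is a deliberately simplified toy, not the model from the paper: each agent is reduced to a single probability of using the genitive, and the update rule, the learning rate and the starting values are all assumptions made for illustration.

```python
import random


def simulate(initial_probs, steps, learning_rate=0.1, seed=0):
    """Run pairwise interactions; return each agent's genitive probability."""
    rng = random.Random(seed)
    probs = list(initial_probs)
    for _ in range(steps):
        # Pick two distinct agents: one speaks, one listens.
        speaker, listener = rng.sample(range(len(probs)), 2)
        # The speaker uses the genitive with its current probability.
        used_genitive = rng.random() < probs[speaker]
        target = 1.0 if used_genitive else 0.0
        # The listener shifts its own usage slightly toward what it heard.
        probs[listener] += learning_rate * (target - probs[listener])
    return probs


# Hypothetical contact scenario: two groups with different starting usage.
group_a = [0.9] * 5  # frequent genitive use
group_b = [0.3] * 5  # infrequent genitive use
final_probs = simulate(group_a + group_b, steps=1000, seed=1)
```

In a toy like this, contact between the groups pulls the agents toward shared usage over time; a realistic model would initialize the agents from corpus data, as described above.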

This was a course project, carried out together with other students, for the MA course Variation and Change in German/Variation and Change in Scandinavian Languages at the University of Amsterdam, taught by prof. dr. Arjen Versloot.

  • Paper “Modelling change in use of the genitive: an agent-based approach” (unpublished) [pdf]
  • Source code [link]

BSc thesis: Determining Dutch dialect phylogeny using Bayesian inference

In my BSc thesis, I explored a topic in dialectometry: using Bayesian inference to create a kinship tree of Dutch dialects. I used data from the Reeks Nederlandse Dialectatlassen. The words were aligned and converted to phonetic features so that they could be processed by a Bayesian inference algorithm. The resulting tree was compared to an existing Dutch dialect map. The thesis was supervised by Alexis Dimitriadis and Martin Everaert from UiL OTS.

DiaMaNT: diachronic semantic lexicon for Dutch

I develop the web interface for DiaMaNT, a diachronic semantic lexicon of the Dutch language. This lexicon connects several historical Dutch lexica in a Linked Open Data structure, which makes it possible to search for historical synonyms of a concept across all time periods.

Collection and processing of language data


Crowdsourcing, asking lay people to perform a task, can be a powerful tool to collect data on language in use. The Dutch Language Institute takes part in the EU COST action enetCollect, on crowdsourcing and language learning. In the context of this project, we launched the platform Taalradar (language radar) to ask speakers about their attitude towards neologisms and to chart Dutch language variation. In addition to capturing language use, we are currently using crowdsourcing to filter inappropriate language from web corpora. Our presentations and publications give an overview of our findings.

Deep learning for historical linguistic enrichment

The Dutch Language Institute has a collection of Dutch historical texts from several periods, from the Middle Ages to the present. To unlock the potential of historical texts for linguistic research, they are usually manually annotated with linguistic information, such as part-of-speech and lemma. By automating this enrichment, much larger amounts of text can be made accessible for linguistic research. Linguistic enrichment is more challenging for historical text than for modern text, due to, among other things, spelling variation and contraction of words. I worked on deep learning models for linguistic enrichment (POS tagging and lemmatization) of historical texts. I co-supervised an MSc thesis by Silke Creten, who evaluated different taggers (deep learning and traditional) on Corpus Gysseling and Corpus Van Reenen-Mulder.

CLARIAH Chaining Search

I was involved in the development of CLARIAH Chaining Search, a Python library and Jupyter notebook that facilitates combined search in lexica, corpora and treebanks, and combination and processing of results.