Initiative for Collecting Web Pages for Under-Resourced Languages

On this website you can contribute to corpus collection for under-resourced languages simply by entering a URL. The languages were chosen because each has more than one million speakers, yet so far fewer than one million sentences in the Leipzig Corpora Collection. The URLs or domains you provide will be crawled and checked for text in the respective language. After processing, you will be shown statistics for the URLs you provided. The resulting corpora will be freely available.

Flyer with more information about this project


Dirk Goldhahn, Maciej Sumalvico and Uwe Quasthoff: Corpus collection for under-resourced languages with more than one million speakers. In: Workshop on Collaboration and Computing for Under-Resourced Languages (CCURL), LREC, Portorož, 2016 (CCURL Proceedings)

Dirk Goldhahn, Thomas Eckart and Uwe Quasthoff: A Portal for Corpus Collection for Under-Resourced Languages. In: Workshop of the African Association for Lexicography (AFRILEX), CLASA 2017, Grahamstown, 2017 (Conference Booklet)

Erik Körner, Felix Helfer, Christopher Schröder, Thomas Eckart and Dirk Goldhahn: Crawling Under-Resourced Languages – A Portal for Community-Contributed Corpus Collection. In: Proceedings of the Workshop on Dataset Creation for Lower-Resourced Languages (DCLRL) within the 13th Language Resources and Evaluation Conference (LREC), pp. 36–43, European Language Resources Association, Marseille, 2022 (Paper)