Collecting Web Pages for Under-Resourced Languages

On this website you can contribute to corpus collection for under-resourced languages simply by entering a URL. The languages were chosen because each has more than one million speakers, yet the Leipzig Corpora Collection so far contains fewer than one million sentences for it. The URLs or domains you provide will be crawled and checked for text in the respective language. After processing, you will be shown statistics for the URLs you provided. The resulting corpora will be freely available.

Flyer with more information about this project

Thank you for your support!

