Crawling Under-Resourced Languages

Collecting Web Pages for Under-Resourced Languages

On this website you can contribute to corpus collection for under-resourced languages simply by entering a URL. The languages are chosen because they have more than one million speakers, yet the Leipzig Corpora Collection so far contains fewer than one million sentences for them. The URLs or domains you provide will be crawled and checked for text in the selected language. After processing, you will be shown statistics for the URLs you provided. The resulting corpora will be freely available.
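The selection rule described above can be illustrated with a minimal sketch. It only mirrors the two thresholds stated in the text; the class name, field names, placeholder language names, and all counts are hypothetical and are not taken from the project or from the Leipzig Corpora Collection.

```python
from dataclasses import dataclass

# Thresholds as stated in the text above.
SPEAKER_THRESHOLD = 1_000_000
SENTENCE_THRESHOLD = 1_000_000


@dataclass
class LanguageStats:
    """Hypothetical record for one language (illustration only)."""
    name: str
    speakers: int
    sentences_in_collection: int


def is_under_resourced(lang: LanguageStats) -> bool:
    """True if the language has more than one million speakers
    but fewer than one million sentences collected so far."""
    return (
        lang.speakers > SPEAKER_THRESHOLD
        and lang.sentences_in_collection < SENTENCE_THRESHOLD
    )


# Made-up placeholder entries, not real statistics.
candidates = [
    LanguageStats("language_a", 5_000_000, 120_000),
    LanguageStats("language_b", 800_000, 50_000),
    LanguageStats("language_c", 40_000_000, 3_500_000),
]

selected = [lang.name for lang in candidates if is_under_resourced(lang)]
print(selected)
```

In this sketch only the first placeholder entry satisfies both conditions: the second has too few speakers and the third already has more than one million sentences.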

Flyer with more information about this project

Thank you for your support!

Step 1: Select the language

Step 2: Insert the URLs / Upload a URL list

Step 3: Additional User Information

Your URLs have been submitted and will be processed soon. To check the current processing status, visit the Job List: /submissions
You can enter your email address below to be notified automatically when processing has finished.

Remember Email address

Receive an email after your URLs have been successfully processed.