Collecting Web Pages for
On this website you can contribute to corpus collection for under-resourced languages simply by entering a URL. The languages were chosen because each has more than one million speakers but so far fewer than one million sentences in the Leipzig Corpora Collection. The URLs or domains you provide will be crawled and checked for text in the selected language. After processing, you will be shown statistics for the URLs you provided. The resulting corpora will be freely available.
Thank you for your support!
Step 1: Select the language
Step 2: Insert the URLs / Upload a URL list
Step 3: Additional User Information
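If you plan to upload a URL list in Step 2, it may help to clean the list first. The following is only a minimal sketch: it assumes a plain-text list with one URL per line, which this page does not specify, and the function name clean_url_list is hypothetical and not part of this website.

```python
from urllib.parse import urlsplit


def clean_url_list(lines):
    """Return unique, well-formed http(s) URLs in first-seen order.

    Assumption: the input has one URL per line (not confirmed by the site).
    """
    seen = set()
    cleaned = []
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines
        try:
            parts = urlsplit(url)
        except ValueError:
            continue  # skip strings that cannot be parsed as a URL
        # Keep only absolute http/https URLs that name a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


if __name__ == "__main__":
    raw = [
        "https://example.org/news",
        "  https://example.org/news  ",  # duplicate with stray whitespace
        "example.org",                   # no scheme, dropped
        "http://example.com/a",
    ]
    for url in clean_url_list(raw):
        print(url)
```

The sketch only removes blank lines, exact duplicates, and entries without an http or https scheme; it does not check whether a page is reachable or written in the selected language, since the crawler reviews that after submission.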
Your URLs have been submitted and will be processed soon. To check the current processing status, visit:
Job List: /submissions
You can enter your email address below to receive an automatic notification when processing has finished.
Receive an email after your URLs have been successfully processed.