How to Obtain Licenses and the Treebank Data Sets

Each treebank comes with its own restrictions and licensing requirements. For this reason, there are three different models for obtaining the treebanks. Please read this page all the way through; otherwise you may get only a subset of the data. 1) The Arabic Treebank is distributed by the LDC. 2) The French and Hungarian treebanks require licenses. 3) The Basque, German, Hebrew, Korean, Polish, and Swedish treebanks are freely available for the shared task (some under Creative Commons or GPL licenses).

1. Arabic Treebank

The Arabic Treebank is distributed by the LDC. To obtain it, please download the following license, fill it out, sign it, and fax it to LDC, attention: Ilya Ahtaridis, fax number +1 215-573-2175. Make sure that you include your email address! Alternatively, you can email a signed, scanned copy of the license to (with the subject line “[SPMRL 2014 Shared task] Arabic data set”). Please note that the unlabeled Arabic data set is not yet available, so please do not contact the LDC yet. You will receive an email through the mailing list when it is ready (expected next week, as of May 26, 2014).

Arabic: arabic.pdf alternate download

2. Licensed Treebanks

For the French and Hungarian treebanks, fill out the following forms, sign them, scan them, and send them to We will then send you login information for the wiki page from which you can download the train and development sets. Do not send these licenses to the LDC!

French: french.pdf alternate download

Hungarian: hungarian.pdf alternate download


3. Treebanks Available without Specific License for Academic Use

The following treebanks are freely available for the shared task: the German, Hebrew, Korean, and Swedish treebanks.

The Basque Treebank is licensed under the Creative Commons license:

The Polish Treebank is licensed under GPL v3:

All these treebanks will be distributed with the licensed treebanks. In order to obtain them, fill out the following form, stating that you will use the data set only for the shared task, scan it, and send it to Do not send this form to the LDC!

general form: generalform2.pdf alternate download

How to Obtain the Unlabeled Data Sets

Most of the unlabeled data sets we use are subject to the same license as their treebank counterparts. The exceptions are French, Hebrew, German, and Polish, which are covered by a Creative Commons license (cc-by-nc-sa), and Basque, which has a specific research-only license. Once the shared task is completed, all freely licensed data will be made available.

1. Licensed Unlabeled Data

For the duration of the shared task, all data except Arabic will be made available through the restricted-access download page. The unlabeled Arabic data will be made available via the account provided by the LDC.

Arabic unlabeled data: subject to the same license as the Arabic treebank data set.

Basque unlabeled data: Basque license alternate download

2. Free-for-Research Licensed Unlabeled Data

Hungarian, Korean and Swedish: (see General form above)

3. Openly Licensed Unlabeled Data

Note that, with the exception of French (based on the Est Republicain corpus, governed by the cc-by-nc-sa license) and Hebrew (cc-by-sa 4.0), the raw text of the wikidumps (German, Polish) is subject to the cc-by-sa license. The status of the added annotations will be clarified shortly.

French: cc-by-nc-sa 2.0

German, Polish: cc-by-sa 3.0

Hebrew: cc-by-sa 4.0

For further reference, these details are included in the general form cited above.

In Case of Problems

In case of problems, contact If you cannot scan the licenses, we can provide a fax number.

how_to_obtain_licenses_for_the_shared_task_data.txt · Last modified: 2014/05/26 04:39 by seddah