If you don't have access to a scanner, you can also use the “signature” option commonly found in recent PDF viewers (e.g., in OS X's Preview.app).
We received some questions about the submission procedure that we'd like to clarify. Even though we didn't make it mandatory on our websites (the wiki and spmrl.org's shared task webpage), we would ideally prefer that teams submit results for all languages. However, we realize that the time frame is very tight, that teams may not have the resources to tackle all languages, and that certain approaches may make more sense for some languages than for others. So we will accept submissions for subsets of languages. Please note that papers describing full language set submissions will be allowed more space and may be given priority for oral presentations.
We provide constituency and dependency treebanks, aligned at the token and part-of-speech levels (the tokens are identical across formats; the bracketed trees are all in Penn Treebank style), for 9 languages:
Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish.
For all languages, we provide a large unlabeled data set with automatic morpho-syntactic annotation.
Language | Source | Type | Size (tokens) | Morph. | Parsed | Status
---|---|---|---|---|---|---
Arabic | Gigaword | news | 100M | yes (SPMRL) | not yet | upcoming
Basque | web | balanced | 120M | yes (SPMRL) | yes | ready
French | Est Republicain | newswire | 120M | yes (SPMRL) | not yet | upcoming
German | Wikipedia | wiki (edited) | 205M | yes (SPMRL) | yes | ready
Hebrew | Wikipedia | wiki (edited) | 160M | yes (SPMRL) | yes | ready
Hungarian | news domain | newswire | 100M | yes (SPMRL) | yes | ready
Korean | news domain | newswire | 80M | yes (SPMRL) | not yet | upcoming
Polish | Wikipedia | wiki (edited) | 100M | yes (SPMRL) | yes | ready
Swedish | PAROLE | newswire | 24M | yes (SPMRL) | yes | ready
- Each language root directory (e.g. GERMAN_SPMRL) contains two directories, pred and gold, where pred is a shortcut for "treebank with predicted morphology" (POS tags and morphological features, plus lemmas if predicted lemmas were made available) and gold means manually validated morphological data.
- Each of those directories contains two directories, ptb and conll, where ptb means labelled bracketed data (i.e. Penn Treebank style) and conll means the CoNLL-X dependency format. In addition, some treebanks come with a native TIGER2 XML description (with crossing branches); in that case a third directory named xml is present at the same hierarchical level as the others.
- Each of these directories is then divided into dev, train and train5k directories. For small treebanks, whose training set contains only 5,000 sentences, the train directory is omitted. (The test set is of course not included.)
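For illustration, here is a minimal sketch of how the layout above translates into paths (the root argument and the helper name are our own, hypothetical choices):

```python
import os

def spmrl_dir(root, language, morph="pred", fmt="conll", split="dev"):
    """Build the directory for one treebank slice, following the layout
    described above, e.g. <root>/GERMAN_SPMRL/pred/conll/dev/.
    morph: "pred" or "gold"; fmt: "ptb", "conll" or (where present) "xml";
    split: "dev", "train" or "train5k"."""
    return os.path.join(root, language.upper() + "_SPMRL", morph, fmt, split)

# e.g. spmrl_dir("/data", "german") -> "/data/GERMAN_SPMRL/pred/conll/dev"
```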
Each final directory (i.e., dev, train, ...) contains at least the following 4 files:
- *.{gold,pred}.{conll,ptb} : the treebank data (with either gold or predicted morphology)
- *.{gold,pred}.{conll,ptb}.tobeparsed.raw : the leaf nodes of the constituency treebank (one sentence per line) or the CoNLL file's word form column.
- *.{gold,pred}.{conll,ptb}.tobeparsed.tagged : files provided with either gold or predicted POS and morphological features, in the following format for constituency files: Word POS Features
e.g. (Hungarian, tab-separated):
mind C lem=mind|SubPOS=c|Form=c|Coord=w
pedig C lem=pedig|SubPOS=c|Form=s|Coord=w
a T lem=a|SubPOS=f
and, for CoNLL files, the first 6 columns of the relevant CoNLL file (ID, FORM, LEMMA, CPOS, FPOS, FEAT):
25 mind mind C C SubPOS=c|Form=c|Coord=w
26 pedig pedig C C SubPOS=c|Form=s|Coord=w
27 a a T T SubPOS=f
- *.{gold,pred}.{conll,ptb}.tobeparsed.tagged.lattices : the previous files extended with lattice token ids, as in:
(ptb)
24 25 mind C lem=mind|SubPOS=c|Form=c|Coord=w 25
25 26 pedig C lem=pedig|SubPOS=c|Form=s|Coord=w 26
26 27 a T lem=a|SubPOS=f 27
(conll)
24 25 mind mind C C SubPOS=c|Form=c|Coord=w 25
25 26 pedig pedig C C SubPOS=c|Form=s|Coord=w 26
26 27 a a T T SubPOS=f 27
where the first two numerical ids denote start_token and end_token, while the last numerical id denotes the source token (see the description in the Shared Task Description).
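For convenience, here is a minimal sketch of a reader for the ptb-flavoured *.tobeparsed.tagged.lattices files (assuming tab-separated columns in the order shown above and blank lines between sentences; the helper and type names are ours):

```python
from collections import namedtuple

LatticeEntry = namedtuple("LatticeEntry", "start end form pos feats source")

def read_ptb_lattices(path):
    """Yield one sentence at a time as a list of LatticeEntry tuples.
    Columns: start_token, end_token, word form, POS, features, source token
    (as in the Hungarian example above)."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                  # blank line = sentence boundary
                if sentence:
                    yield sentence
                sentence = []
                continue
            start, end, form, pos, feats, source = line.split("\t")
            sentence.append(LatticeEntry(int(start), int(end),
                                         form, pos, feats, int(source)))
    if sentence:                          # last sentence, if no trailing blank
        yield sentence
```

The CoNLL flavour is the same apart from two extra columns (lemma and FPOS) between the form and the features.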
This concerns the use of, e.g., morphological analyzers, dictionaries, and unlabeled data.
Regarding the resources, we do provide additional, large, (base-)parsed data sets for all languages.
In addition to what we provide, you can use any lexicon you want, assuming it is made available to the participants. The same holds for additional unlabeled data, word clusters, and externally acquired word signatures. Please contact us if you are not sure about what to use.
Yes, of course. All provided predicted data are supplied as a baseline (even a strong one in some cases), so you can use any alternative tool you might prefer.
In other words, are Task 2 and Task 3 mutually exclusive?
Short story: Tasks 2 and 3 are the same, except that for 6 languages the gold tokenization equals the predicted one. So they are not exclusive, they are complementary (the evaluation procedure is not the same).
Details:
Actually, some languages (Hebrew and Arabic) do not come with gold tokenization + predicted morphology, so they are only available with predicted tokenization + predicted morphology or gold tokenization + gold morphology (so no Task 2 for those two). For the shared task in 2013, we received comments on the difficulty of dealing with the tokenization of Semitic languages. Of course, we are aware of that point; it was even one of our main concerns about this shared task: the entry cost for “newcomers” could be quite high.
Nevertheless, we talked about this at length and decided that having the possibility to compare gold token + pred morph vs. pred token + pred morph, and to investigate the bottlenecks more closely, is more important than trying to avoid at all costs any potential, though very unlikely, misplaced curiosity about the test data.
So this is why teams that would prefer to submit results on Arabic and Hebrew with gold tokens and predicted morphology are now allowed to do so, with the restriction that they must also submit results on the pred token + pred morph data. The idea here is to see how those models compare in those two scenarios.
For Hebrew, the gold token + pred morph data set is described in the README.spmrl file (using Morfette (Chrupała et al., 2008) trained on FORM \t CPOS+FEAT; no lemma available).
For Arabic, we trained Morfette directly on a subset of the feature field (the atbpos feature, namely the original Buckwalter treebank tagset), then converted it to the CATiB tagset (CPOS).
For the FPOS field (the BIES tagset, also used as the main tagset for the ptb files; see the README files), we trained a separate model (the conversion gave only 96.20% POS accuracy, this one 96.60%) and merged back the two predicted tagsets.
Model performance on the dev set (no lexicon used, 3.66% OOVs):

Tagset | All | Seen | Unseen
---|---|---|---
atbpos | 90.37 | 91.14 | 70.14
BIES | 96.60 | 97.25 | 79.58
CATiB | 97.94 | 98.33 | 87.90
For those working on constituency, the stanpos attribute was also mapped back from the atbpos field.
The case of French is a bit different, as that treebank includes MWE annotations at the syntactic level (a special category between POS and non-terminals). Because the shared task goal is to evaluate MRL parsing in a realistic scenario, you'll need to parse the French data without MWE annotations. In this case, the tokenization will be evaluated as part of the syntactic annotation (i.e., a special label matching for dependencies); see the sketch below.
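To make the label-matching idea concrete, here is an illustrative sketch (the MWE label name dep_cpd and the exact scoring are hypothetical simplifications, not the official metric):

```python
def mwe_arc_accuracy(gold, pred, mwe_label="dep_cpd"):
    """Illustrative only: fraction of gold MWE-internal arcs (identified by
    a hypothetical dependency label) whose head and label are both recovered
    in the predicted parse. gold/pred: one (head, label) pair per token."""
    mwe_positions = [i for i, (_, lab) in enumerate(gold) if lab == mwe_label]
    if not mwe_positions:
        return 1.0
    correct = sum(gold[i] == pred[i] for i in mwe_positions)
    return correct / len(mwe_positions)
```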
TedEval accepts projective trees only, so we will re-projectivize the dependency structures for TedEval.
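For reference, a minimal sketch of one standard lifting procedure (in the spirit of Nivre and Nilsson's (2005) pseudo-projective transformation; the function names and the shortest-arc-first order are our own illustration, not necessarily the exact procedure we will run):

```python
def is_nonprojective(h, d, heads):
    """True if some token strictly between h and d is not dominated by h.
    heads maps 1-based token ids to head ids; 0 is the artificial root."""
    lo, hi = sorted((h, d))
    for t in range(lo + 1, hi):
        a = t
        while a not in (0, h):            # climb until root or h is reached
            a = heads[a]
        if a != h:
            return True
    return False

def projectivize(heads):
    """Repeatedly lift the shortest non-projective arc to its grandparent
    until the tree is projective; returns a new head map."""
    heads = dict(heads)
    while True:
        bad = [(abs(h - d), h, d) for d, h in heads.items()
               if h != 0 and is_nonprojective(h, d, heads)]
        if not bad:
            return heads
        _, h, d = min(bad)                # shortest offending arc
        heads[d] = heads[h]               # reattach d one level up

# e.g. projectivize({1: 2, 2: 0, 3: 5, 4: 2, 5: 2}) lifts the arc 5 -> 3
# (token 4 intervenes but is not dominated by 5), giving head 2 for token 3.
```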
It's because TedEval requires all functional labels to be the same between the constituency and dependency structures (to perform cross-framework evaluation), but this is only the case for Swedish and Hebrew (we didn't have the resources to do that level of normalization for all languages). So, in order to provide a fair and meaningful comparison with TedEval, we decided to target the lowest common denominator. For individual treebanks, labeled TedEval will still be used where it is not redundant.
Summary table:

Languages | Labeled dependency | Constituency (Evalb F-score) | Cross-framework
---|---|---|---
raw Arabic / Hebrew | labeled TedEval | labeled TedEval | unlabeled TedEval
French | eval07 + MWE accuracy | evalb + MWE acc. | unlabeled TedEval
all other languages | eval07 | evalb | unlabeled TedEval
This will depend on the total number of submitted system results. For now, let us limit the number of entries to 5; we will adjust that number around the end of July.
Well, nice questions. We wish we knew the answers. :)
More seriously, it seems that there are many available options these days. We do not force anyone to choose a given architecture (joint model, pipeline, ...). Please note that this list is neither exhaustive nor an endorsement from our side.
For constituency parsing, even though lexicalized models provide state-of-the-art performance on English, the picture is less clear for small treebanks of morphologically rich languages. So, focusing on PCFG-LA based parsers, we can point to 3 options, each with its own strong points.
For dependency parsing, we are not aware of an available, working lattice-input parser. There were some efforts by Benoit Favre to add lattice-input parsing to Malt (webpage), but according to the author it does not work that well. Regarding morphological support, there are of course plenty of choices (Malt, MST, Mate, ISBN, ...); we won't list them all.
For more powerful models capable of handling trees with crossing branches (the native format for German, Swedish and Polish), the choice is currently more limited.
Please do not use ready-made models for any of those parsers; all parsing models must be trained on the shared task training sets.