If you don't have access to a scanner, you can also use the “signature” option commonly found in recent PDF viewers (e.g. OS X's Preview.app).
We received some questions about the submission procedure, which we'd like to clarify. Even though we didn't make it mandatory on our websites (the wiki and spmrl.org's shared task webpage), we would ideally prefer that teams submit results for all languages. However, we realize that the time frame is very tight and that teams may not have the resources to tackle all languages, or that certain approaches may make more sense for some languages than for others. So we will accept submissions for subsets of languages. Please note that papers describing full-language-set submissions will be allowed more space and may be given priority for oral presentation.
- Each language root directory (e.g. GERMAN_SPMRL) contains two directories, pred and gold, where pred is shorthand for "treebank with predicted morphology" (POS tags and morphological features, plus lemmas where predicted lemmas were made available) and gold means manually validated morphological data.
- Each of those directories contains two directories, ptb and conll, where ptb means labelled bracketed data (i.e. Penn Treebank style) and conll means the CoNLL-X dependency format. In addition, some treebanks come with a native TIGER 2 XML description (with crossing branches); in that case a third directory named xml is present at the same hierarchical level as the others.
- Each of these directories is then divided into dev, train and train5k directories. For small treebanks, whose train set contains only 5000 sentences, the train directory is omitted. (test is of course not included.)
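Putting this together, the layout for one language looks roughly like this (a sketch based on the description above; a treebank with a native TIGER 2 XML description would have an extra xml directory alongside ptb and conll):

```
GERMAN_SPMRL/
  gold/
    ptb/    -> dev/  train/  train5k/
    conll/  -> dev/  train/  train5k/
  pred/
    ptb/    -> dev/  train/  train5k/
    conll/  -> dev/  train/  train5k/
```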
Each final directory (i.e. dev, test, ...) contains at least the following 4 files:
- *.{gold,pred}.{conll,ptb} : the treebank data (with either gold or predicted morphology)
- *.{gold,pred}.{conll,ptb}.tobeparsed.raw : the leaf nodes of the constituency treebank (one sentence per line), or the FORM column of the CoNLL file.
- *.{gold,pred}.{conll,ptb}.tobeparsed.tagged : files providing either gold or predicted POS tags and morphological features, in the following format for constituency files: Word POS Features
e.g. (Hungarian, tab-separated):
mind C lem=mind|SubPOS=c|Form=c|Coord=w
pedig C lem=pedig|SubPOS=c|Form=s|Coord=w
a T lem=a|SubPOS=f
and the first 6 columns of the relevant CoNLL file (ID, FORM, LEMMA, CPOS, FPOS, FEAT):
25 mind mind C C SubPOS=c|Form=c|Coord=w
26 pedig pedig C C SubPOS=c|Form=s|Coord=w
27 a a T T SubPOS=f
- *.{gold,pred}.{conll,ptb}.tobeparsed.tagged.lattices : the previous files extended with lattice token ids, as in
(ptb)
24 25 mind C lem=mind|SubPOS=c|Form=c|Coord=w 25
25 26 pedig C lem=pedig|SubPOS=c|Form=s|Coord=w 26
26 27 a T lem=a|SubPOS=f 27
(conll)
24 25 mind mind C C SubPOS=c|Form=c|Coord=w 25
25 26 pedig pedig C C SubPOS=c|Form=s|Coord=w 26
26 27 a a T T SubPOS=f 27
where the first two numerical ids are the start_token and end_token, and the last numerical id is the source token (see the description in the Shared Task Description).
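For illustration, here is a minimal reading sketch for the ptb variant of the lattices file, assuming tab-separated fields and blank lines between sentences (the field names and the helper itself are ours, not part of the release):

```python
from collections import namedtuple

# One lattice edge: start/end token ids, word form, POS, features,
# and the id of the source token it was derived from.
LatticeEdge = namedtuple("LatticeEdge", "start end form pos feats source")

def read_lattices(path):
    """Yield one sentence at a time as a list of LatticeEdge entries."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                    # blank line = sentence boundary
                if sentence:
                    yield sentence
                sentence = []
                continue
            start, end, form, pos, feats, source = line.split("\t")
            sentence.append(
                LatticeEdge(int(start), int(end), form, pos, feats, int(source)))
    if sentence:                            # last sentence, if no trailing blank
        yield sentence
```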
The Korean FEATS column is not in the y=x|y=x|y=x format (with no tabs or spaces). It uses + instead of | and does not say which feature each value belongs to.
(Actually, the FEATS column contains “_”; only the CPOS and FPOS are provided.)
For Korean, this is a deliberate choice. Korean tokens are made of morphemes, each of which has its own set of features, so when you see
id form lemmas cpos fpos
2 고향은 고향+은 n+j ncn+jxt
it means that the first morpheme has the tag ncn and the second jxt, where
ncn = Non-predicative common noun
and
jxt = Topical auxiliary
(see table 2 in KOREAN_SPMRL/doc/description.orig.pdf)
So it could be transformed into something like FEAT1=..|FEAT2=.. + FEAT3=..|FEAT4=..
but as there's no consensus about how to represent complex morphological features at the morpheme level in the CoNLL format, we chose to leave that to the participants. Moreover, we're really not sure what to do if there's a clash of features, say when, in the case of a derivational morpheme, a feature differs between the two parts.
If time allows, we'll try to provide CoNLL-compatible features, but we're really not sure we'll have time before it's too late to be useful.
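To make the transformation suggested above concrete, here is a hypothetical sketch (the FPOS_FEATURES mapping and all names are ours; the real tag inventory is in table 2 of KOREAN_SPMRL/doc/description.orig.pdf):

```python
# Hypothetical mapping from morpheme-level tags to named features;
# only the two tags from the example above are shown.
FPOS_FEATURES = {
    "ncn": "Type=NonPredicativeCommonNoun",
    "jxt": "Type=TopicalAuxiliary",
}

def morpheme_feats(lemma_field, fpos_field):
    """Turn '고향+은' / 'ncn+jxt' into one feature string per morpheme."""
    morphemes = lemma_field.split("+")
    tags = fpos_field.split("+")
    assert len(morphemes) == len(tags), "morpheme/tag count mismatch"
    return [FPOS_FEATURES.get(tag, "Type=" + tag) for tag in tags]

# morpheme_feats("고향+은", "ncn+jxt")
# -> ['Type=NonPredicativeCommonNoun', 'Type=TopicalAuxiliary']
```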
Update: CoNLL-type features have been included in the data set!
This concerns the use of, e.g., morphological analysers, dictionaries, and unlabelled data.
Regarding the resources, you can use any lexicon you want, provided it is made available to the participants. Unlabeled data may not be used for self-training, co-training, etc. But if you use unlabeled data to generate word clusters or externally acquired word signatures, we request that either the unlabeled data or the resulting resources (clusters, signature lists) be made available to the participants. Please contact us if you're not sure about what you may use.
Yes, of course. All provided predicted data are supplied as a baseline (even a strong one in some cases), so you can use any alternative tool you prefer.
In other words, are Task 2 and Task 3 mutually exclusive?
Short story: Task 2 and Task 3 are the same, except that for 6 languages the gold tokenization equals the predicted one. So they are not exclusive, they are complementary (the evaluation procedure is not the same).
Details:
Actually, some languages (Hebrew and Arabic) do not come with gold tokenization + predicted morphology; they're only available with predicted tokenization + predicted morphology, or gold tokenization + gold morphology (so no Task 2 for those two). In addition, providing gold tokenization + predicted morphology would ruin Task 3, as the crucial missing token information would become available. We received comments on the difficulty of dealing with the tokenization of Semitic languages. Of course, we are aware of that point; it was even one of our main concerns about this shared task: the entry cost for “newcomers” could be quite high.
Nevertheless, we talked about this at length and decided that having the possibility to compare gold token + pred morph vs. pred token + pred morph, and to investigate the bottlenecks more closely, is more important than trying to avoid at all costs any potential, though very unlikely, misplaced curiosity about the test data.
So this is why teams that would prefer to submit results on Arabic and Hebrew with gold tokens and predicted morphology are now allowed to do so, with the restriction that they must also submit results on the pred token + pred morph data. The idea here is to see how those models compare in those two scenarios.
For Hebrew, the gold token + pred morph data set is described in the README.spmrl file (using Morfette (Chrupała et al., 2008) trained on FORM \t CPOS+FEAT; no lemma was available).
For Arabic, we trained Morfette directly on a subset of the feature field (the atbpos feature, namely the original treebank's Buckwalter tagset) and then converted it to the CATIB tagset (CPOS).
For the FPOS field (the BIES tagset, also used as the main tagset for the ptb files; see the README files), we trained a different model (the conversion gave only 96.20% POS accuracy, while this one gave 96.60%)
and merged the two predicted tagsets back together.
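The merging step is essentially a column-wise combination; here is a rough sketch under our own assumptions (the two predictions come as flat, token-aligned tag lists, and CPOS/FPOS are columns 4 and 5 of the CoNLL file; the function name is ours):

```python
def merge_tagsets(conll_lines, catib_tags, bies_tags):
    """Write CATIB predictions into CPOS (column 4) and BIES
    predictions into FPOS (column 5) of the CoNLL token lines."""
    tag_pairs = zip(catib_tags, bies_tags)  # aligned with token lines
    merged = []
    for line in conll_lines:
        if not line.strip():                # keep sentence boundaries as-is
            merged.append(line)
            continue
        cols = line.split("\t")
        cols[3], cols[4] = next(tag_pairs)  # 0-indexed CPOS, FPOS
        merged.append("\t".join(cols))
    return merged
```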
Model performance on the dev set (no lexicon used, 3.66% OOV):

| tagset | all | seen | unseen |
|---|---|---|---|
| atbpos | 90.37 | 91.14 | 70.14 |
| bies | 96.60 | 97.25 | 79.58 |
| catib | 97.94 | 98.33 | 87.90 |
For those working on constituency, the stanpos attribute was also mapped back from the atbpos field.
The case of French is a bit different, as that treebank includes MWE annotations at the syntactic level (a special category between POS and non-terminal). Because the shared task's goal is to evaluate MRL parsing in a realistic scenario, you'll need to parse the French data without MWE annotations. In this case, the tokenization will be evaluated as part of the syntactic annotation (i.e. a special label matching for dependencies).
TedEval accepts projective trees only. We will re-projectivize the dependency structures for TedEval.
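As a reminder, a tree is projective iff no two arcs cross when drawn above the sentence (counting the arc from the artificial root). Here is a quick check, as a sketch of the property rather than of the re-projectivization tool itself:

```python
def is_projective(heads):
    """heads[i] is the head of token i+1 (CoNLL-style, 0 = artificial root).
    Returns True iff no two arcs cross."""
    arcs = [(min(h, d + 1), max(h, d + 1)) for d, h in enumerate(heads)]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            if l1 < l2 < r1 < r2:   # partial overlap = crossing arcs
                return False
    return True

# is_projective([2, 0, 2])  -> True
# is_projective([3, 0, 2])  -> False (arc 1->3 crosses the root arc 0->2)
```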
It's because TedEval requires all functional labels to be the same between the constituency and dependency structures (to perform cross-framework evaluation), but this is only the case for Swedish and Hebrew (we didn't have the resources to do that level of normalization for all languages). So, in order to provide a fair and meaningful comparison with TedEval, we decided to target the lowest common denominator. In any case, for individual treebanks, TedEval's labeled evaluation will be used where it's not redundant.
Table synthesis:

| | Labeled Dependency | Evalb's F-score | Cross-framework |
|---|---|---|---|
| raw Arabic / Hebrew | labeled TedEval | labeled TedEval | unlabeled TedEval |
| French | eval07 + MWE accuracy | evalb + MWE acc. | unlabeled TedEval |
| the rest | eval07 | evalb | unlabeled TedEval |
This will depend on the total number of submitted system results. Let's say we limit the number of entries to 5 for now; we'll adjust that number around the end of July.
Well, nice questions. We wish we knew the answers :)
More seriously, it seems that there are many available options these days. We do not force anyone to choose a given architecture (joint model, pipeline, ...). Please note that this list is neither exhaustive nor an endorsement from our side.
For constituency parsing, even though lexicalized models provide state-of-the-art performance on English, it's less clear that they do from the viewpoint of a small, morphologically rich language treebank. So, focusing on PCFG-LA based parsers, we can point to 3 options, each with its own strong points.
For dependency parsing, we're not aware of any available working lattice-input parser. There were some efforts by Benoit Favre on adding lattice-input parsing to Malt (webpage), but according to the author, it does not work that well. Regarding morphological support, there are of course plenty of choices (Malt, MST, Mate, ISBN, ...); we won't list them all.
For more powerful models capable of handling trees with crossing branches (the native format for German, Swedish and Polish), the choice is currently more limited.
Please do not use ready-to-use models for any of those parsers; all parsing models must be trained on the shared task training sets.