We used the same protocol as in CoNLL 2007 (Nivre et al., 2007) and SPMRL 2013 (Seddah et al., 2013), in two scenarios (with 4 settings).
- Note that the predicted data were provided as a baseline; participants were free to use their own. The Hebrew and Arabic predicted train sets were not produced via cross-fold jackknifing, so participants were encouraged to run it themselves, along the lines of the sketch below (only one team used its own predicted morphology though: IMS_WROCLAW_SZEGED_CIS, for all languages).
- Please check first the pred/full results (most teams) and the gold/5k results (ICT+LORIA).
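For reference, here is a minimal sketch of 5-fold jackknifing to produce predicted morphology for a training set; `train_tagger` and `predict` are hypothetical placeholders for whichever tagging pipeline a participant uses:

```python
# Minimal sketch of k-fold jackknifing: each fold of the training data is
# re-annotated by a model trained on the other folds, so the "predicted"
# training annotation is never produced by a model that saw the sentence.
# `train_tagger` and `predict` are hypothetical placeholders for the
# participant's own morphological tagging pipeline.

def jackknife(sentences, train_tagger, predict, k=5):
    folds = [sentences[i::k] for i in range(k)]
    predicted = []
    for i, held_out in enumerate(folds):
        train_part = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_tagger(train_part)            # train on the other k-1 folds
        predicted.extend(predict(model, held_out))  # tag the held-out fold
    return predicted
```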
2014 submissions + Baseline Malt (all sent., punct. eval)
2014 submissions + IMS_SZEGED_CIS + Basque Team + Baseline Malt + MaltOptimizer
For ease of analysis, we also provide results with the entries from teams that participated in both shared task editions (using more or less the same system, but with a semi-supervised component, mostly word clusters acquired from the unlabeled data we provided).
The Malt baselines and MaltOptimizer results were provided by Miguel Ballesteros as part of his 2013 entries (thanks Miguel!).
(Note that only one team, LORIA, submitted results for Arabic. This is due to the late availability of the Arabic unlabeled data, which prevented the other teams from accurately training their models on this language's data set.)
Average score ranking
Soft average score ranking
Special kudos to the ICT team, which provided results only for the gold/5k track on 8 languages, and to the LORIA team, which provided results for all languages on all tracks. Congratulations to IMS_WROCLAW_SZEGED_CIS and the Basque Team for their state-of-the-art results!
(coming soon)
Note that we used a modified version of Evalb (Black et al., 1991) (download) so that:
- (i) it can process the particular format of the SPMRL data;
- (ii) unparsed sentences actually penalize the global score (before, an unparsed sentence was simply ignored and the number of unparsed sentences was just incremented accordingly); see the sketch below.
Protocol: all tokens are evaluated (including punctuation); top labels (TOP, S1, ROOT, VROOT) are deleted.
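To illustrate point (ii), here is a minimal sketch of how a labeled bracket F1 behaves when an unparsed sentence contributes its gold brackets but no matches (this is not the actual Evalb code; brackets are assumed to be (label, start, end) triples with the top labels already removed):

```python
# Sketch of corpus-level labeled bracket F1 in which an unparsed sentence
# (empty test bracket set) still adds its gold brackets to the recall
# denominator, i.e. it penalizes the global score instead of being skipped.
# Not the actual Evalb code.
from collections import Counter

def corpus_f1(gold_brackets, test_brackets):
    matched = gold_total = test_total = 0
    for gold, test in zip(gold_brackets, test_brackets):
        g, t = Counter(gold), Counter(test)   # `test` is empty if unparsed
        matched += sum((g & t).values())      # brackets found in both trees
        gold_total += sum(g.values())
        test_total += sum(t.values())
    precision = matched / test_total if test_total else 0.0
    recall = matched / gold_total if gold_total else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```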
2014 submissions + Baseline Berkeley Parser (generic lexicon, 1 grammar) (all sent., punct. eval)
2014 submissions + 2013 IMS entry + Baseline Berkeley Parser (generic lexicon, 1 grammar) (all sent., punct. eval)
Although not part of our initial evaluation protocol, the huge differences in our treebanks' node-per-terminal ratios make any cross-language comparison difficult, to say the least. We therefore also provide Leaf Ancestor (Sampson & Babarczy, 2003) results, which for some configurations tell a different story.
We used Joachim Wagner's Leaf Ancestor implementation (download)
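For readers unfamiliar with the metric, the core idea can be sketched as follows (an illustration only, not Wagner's implementation, which also handles Sampson's bracket boundary markers among other details):

```python
# Rough sketch of the Leaf Ancestor idea: for every terminal, compare the
# gold and test lineages (the label path from the token up to the root)
# with an edit-distance-based similarity, then average over all terminals.

def edit_distance(a, b):
    # plain Levenshtein distance over label sequences
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def lineage_score(gold_lineage, test_lineage):
    d = edit_distance(gold_lineage, test_lineage)
    return 1.0 - d / (len(gold_lineage) + len(test_lineage))

def leaf_ancestor(gold_lineages, test_lineages):
    # one lineage per terminal, in the same token order on both sides
    scores = [lineage_score(g, t) for g, t in zip(gold_lineages, test_lineages)]
    return sum(scores) / len(scores)
```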
2014 submissions + Baseline Berkeley Parser (generic lexicon, 1 grammar) (all sent., punct. eval)
2014 submissions + Baseline Berkeley Parser (generic lexicon, 1 grammar) (all sent., punct. eval)
As with any other metric, these results are better interpreted with the following data in mind:
http://pauillac.inria.fr/~seddah/stats_treebanks_ptb.csv
Non-terminal nodes per terminal ratio (split x languages)
| split | ARABIC | BASQUE | FRENCH | GERMAN | HEBREW | HUNGAR. | KOREAN | POLISH | SWEDISH |
|---|---|---|---|---|---|---|---|---|---|
| train | 5.04 | 1.21 | 2.90 | 1.66 | | 1.68 | 1.64 | 1.05 | |
| train5k | 6.26 | 1.19 | 2.92 | 1.65 | 2.33 | 1.74 | 1.47 | 1.05 | 1.70 |
| test | 5.05 | 1.20 | 2.93 | 1.66 | 2.12 | 1.60 | 1.63 | 1.04 | 1.74 |
| dev | 5.11 | 1.33 | 2.99 | 1.57 | 2.11 | 2.08 | 1.58 | 1.05 | 2.07 |
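For the record, these ratios can be recomputed from any bracketed treebank along the following lines (a small sketch assuming one PTB-style tree per line; whether preterminals are counted as non-terminals is a convention choice, so the exact figures may not match the table above):

```python
# Small sketch: non-terminal nodes per terminal, from a file of PTB-style
# bracketed trees (one tree per line). Here a non-terminal is any opening
# bracket that is not a preterminal directly dominating a token; counting
# conventions differ, so the figures may not match the table exactly.
import re
import sys

def node_per_terminal_ratio(path):
    nonterminals = terminals = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            # terminals are the non-bracket strings immediately followed by ")"
            tokens = re.findall(r"[^\s()]+(?=\))", line)
            opens = line.count("(")
            terminals += len(tokens)
            nonterminals += opens - len(tokens)   # drop the preterminal layer
    return nonterminals / terminals if terminals else 0.0

if __name__ == "__main__":
    print(node_per_terminal_ratio(sys.argv[1]))
```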
The same disclaimer as for the dependency track applies.
Average score ranking
Soft average score ranking
Congratulations to both teams: they were brave enough to submit to the constituency track!
—
COMING SOON !!
The Arabic and Hebrew data sets were provided with generated lattices (disambiguated and non-disambiguated for Hebrew, disambiguated only for Arabic; the non-disambiguated Arabic lattices do exist though and should be made available at some point).
Results in the predicted-token scenarios are evaluated using TedEval 2.2 (Tsarfaty et al., 2011, 2012) in two modes, labeled and unlabeled:
Well, according to Reut, it's an "easy" metric, so to speak. First the parsed trees are converted to a generic representation and reprojectivized if needed (the internal format is like a flattened PTB tree); the gold trees are converted the same way. Then, for each tree pair, a minimum tree edit distance is computed in terms of node insertions and deletions, and lexical item insertions and deletions. This cost is then normalized, which gives an accuracy score. It's a bit similar to Leaf Ancestor, but extended to a whole tree instead of a set of lineages (Leaf Ancestor gives the mean of the edit distances between a gold lineage and a test lineage, a lineage being the path from a token to the root node).
Have a look at section 3 of Reut's ACL 2012 paper (http://www.tsarfaty.com/pdfs/acl12.pdf). Interestingly, if you want to evaluate on gold tokenization (to get a sense of what a TedEval accuracy score means), a Parseval score (gold token) on Hebrew would give 88.75%, while a TED score on the same set would give 93.39 (labeled) and 94.35 (unlabeled). The difference between labeled and unlabeled is simply that all non-terminals in the generic trees are replaced by a dummy symbol.
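To make the distance-to-accuracy conversion concrete, here is a toy sketch built on the third-party `zss` package (an implementation of the Zhang-Shasha tree edit distance). This is not TedEval itself: the real tool works over segmentation lattices, reprojectivizes, and uses its own cost model, and the normalizer below is only an assumption:

```python
# Toy sketch of a TED-style accuracy: compute a tree edit distance between a
# gold and a parsed tree, then normalize it into a score in [0, 1].
# Requires the third-party `zss` package (pip install zss). NOT TedEval:
# the denominator used for normalization is an assumption for illustration.
from zss import Node, simple_distance

def indel_cost(a, b):
    # forbid cheap relabelling: substituting a label costs as much as
    # deleting one node and inserting another, so only indels matter
    return 0 if a == b else 2

def tree_size(n):
    return 1 + sum(tree_size(c) for c in n.children)

def ted_accuracy(gold, test):
    dist = simple_distance(gold, test, label_dist=indel_cost)
    return 1.0 - dist / (tree_size(gold) + tree_size(test))

# tiny example: gold "(S (NP a) (VP b))" vs. a flatter parse "(S a b)"
gold = Node("S", [Node("NP", [Node("a")]), Node("VP", [Node("b")])])
test = Node("S", [Node("a"), Node("b")])
print(ted_accuracy(gold, test))
```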
Once again, like last year, Murphy's law found its most extreme and ironic implementation. After the unplanned reboot of the clusters and two disk crashes requiring a full rebuild of a RAID 5 array on our backup server, the network card and/or the third RAID disk now seems to be dying, resulting in extremely slow I/O. Of course, this is mid-August in France, so in our cloud absolutely no one can hear you scream. We're working on it though.