IMPORTANT (September 26, 2013, 01:26): Due to a wrong initial setting (entirely our fault), the baseline constituency Lorg scores for all settings have been removed from the official results (and should be forgotten altogether).
We'll update the webpage once we're finished with the overview paper. (September 19, 2013, 15:15): baseline scores for the constituency track have been recalculated; some files were missing, so we've done it again. Great.
We used the same protocol as in CoNLL 2007 (Nivre et al., 2007) in four scenarios:
Note that the predicted data were provided as a baseline; participants were free to use their own. The French, Hebrew and Arabic predicted train sets were not produced by cross-fold jackknifing, so participants were encouraged to do it themselves (only a few used their own predicted morphology, though: Alpage-IGM and Alpage-Dyalog for French, Cadim for Arabic and IMS-SZEGED-CIS for all languages).
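For readers unfamiliar with jackknifing, here is a minimal sketch of the idea: every training sentence gets tagged by a model that never saw it. The toy most-frequent-tag tagger and the function names `train_tagger` and `jackknife` are placeholders of ours, not the tools used to produce the shared task data.

```python
from collections import Counter, defaultdict

def train_tagger(sentences):
    """Toy stand-in for a real tagger: most frequent tag per word form."""
    counts = defaultdict(Counter)
    for sent in sentences:
        for word, tag in sent:
            counts[word][tag] += 1
    # Unknown words receive the overall most frequent tag.
    fallback = Counter(t for s in sentences for _, t in s).most_common(1)[0][0]
    table = {w: c.most_common(1)[0][0] for w, c in counts.items()}
    return lambda words: [table.get(w, fallback) for w in words]

def jackknife(sentences, folds=10):
    """Tag each sentence with a model trained on the other folds only."""
    predicted = [None] * len(sentences)
    for k in range(folds):
        held_out = set(range(k, len(sentences), folds))
        training = [s for i, s in enumerate(sentences) if i not in held_out]
        tagger = train_tagger(training)
        for i in sorted(held_out):
            predicted[i] = tagger([w for w, _ in sentences[i]])
    return predicted

sentences = [
    [("the", "D"), ("cat", "N")],
    [("the", "D"), ("dog", "N")],
    [("a", "D"), ("cat", "N")],
    [("dogs", "N"), ("bark", "V")],
]
predicted = jackknife(sentences, folds=2)
```

The predicted tags of the training set then contain the same kind of errors a parser will meet at test time (here, for instance, the unseen word "a" is mistagged), which is the point of the exercise.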
Because parsing long Arabic sentences took an excessive amount of time, we applied a length cut-off of 70 to our baseline parsers. The “all sentences” results are nevertheless the official results. The ≤70 ones are given for comparison's sake.
Note that we used a modified version of Evalb (Black et al., 1991) (download) so that:
  * (i) it can process the particular format of the SPMRL data;
  * (ii) unparsed sentences actually penalize the global score (previously, an unparsed sentence was simply ignored and only the count of unparsed sentences was incremented accordingly).
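To make point (ii) concrete, here is a minimal sketch of one plausible way of penalizing unparsed sentences: their gold brackets still count in the recall denominator. It assumes per-sentence bracket counts are already available; the function `corpus_f1` and its input format are hypothetical and are not taken from the modified Evalb.

```python
def corpus_f1(sentences, penalize_unparsed=True):
    """Micro-averaged bracket F-score (in %) over per-sentence counts.

    Each sentence is a (gold, test, matched) triple of bracket counts;
    test is None when the parser produced no tree for that sentence.
    """
    gold_total = test_total = matched_total = 0
    for gold, test, matched in sentences:
        if test is None:
            if penalize_unparsed:
                gold_total += gold  # gold brackets count, nothing is matched
            continue
        gold_total += gold
        test_total += test
        matched_total += matched
    if gold_total == 0 or test_total == 0 or matched_total == 0:
        return 0.0
    precision = matched_total / test_total
    recall = matched_total / gold_total
    return 100 * 2 * precision * recall / (precision + recall)

# Two sentences, the second one unparsed.
counts = [(10, 10, 9), (8, None, 0)]
ignored = corpus_f1(counts, penalize_unparsed=False)
penalized = corpus_f1(counts, penalize_unparsed=True)
```

With the unparsed sentence ignored, the score only reflects the first sentence; with the penalty, recall drops because the 8 gold brackets of the second sentence are all missed.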
Protocol: all tokens are evaluated (including punctuation); top labels (TOP, S1, ROOT, VROOT) are deleted.
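As an illustration of this protocol, the sketch below reads one bracketed tree, drops the enclosing top-level node, and lists all tokens with punctuation kept. The helper names (`parse`, `strip_top`, `tokens_of`) are ours, and treating an empty outer label like a top label is an assumption; the actual evaluation was done with the modified Evalb, not with this code.

```python
import re

# Top labels named in the protocol; the empty label is our own addition.
TOP_LABELS = {"TOP", "S1", "ROOT", "VROOT", ""}

def parse(bracketed):
    """Parse one bracketed tree into (label, children) tuples; tokens are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()

def strip_top(tree):
    """Drop enclosing top-level nodes so scoring starts at the first real label."""
    label, children = tree
    while label in TOP_LABELS and len(children) == 1 and isinstance(children[0], tuple):
        label, children = children[0]
    return (label, children)

def tokens_of(tree):
    """Return every token, punctuation included, in surface order."""
    out = []
    for child in tree[1]:
        out.extend(tokens_of(child) if isinstance(child, tuple) else [child])
    return out

tree = strip_top(parse("(TOP (S (NP (NN cats)) (VP (VB sleep)) (PUNCT .)))"))
evaluated_tokens = tokens_of(tree)
```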
Note: for Arabic, one of our baseline parsers had a bug that added a spurious enclosing parenthesis, which caused Evalb to penalize its scores. Of course, this is less visible in the LeafAncestor scores.
Although not included in our initial evaluation protocol, the huge differences in our treebanks' node-per-terminal ratios render any cross-language comparison difficult, to say the least. We therefore provide Leaf Ancestor (Sampson & Babarczy, 2003) results, which for some configurations tell a different story.
We used Joachim Wagner's Leaf Ancestor implementation (download)
As with any other metric, those results are easier to interpret with the following data in mind:
Non-terminal nodes per terminal ratio (split x languages)
Congratulations to the IMS-SZEGED-CIS team: they were the only ones brave enough to submit on the constituency track!
Arabic and Hebrew data sets were provided with generated lattices (disambiguated and non-disambiguated for Hebrew, disambiguated only for Arabic; the data exist, though, and should be made available at some point).
Results on the predicted tokens scenarios are evaluated using TedEval 2.2 (Tsarfaty et al., 2011, 2012) in two modes:
Note that the IMS-SZEGED-CIS ptb scores are lower than expected in labeled mode. This is because the trees were not annotated with functions.
Even though the cross framework results are still pending, those results also include the results from IMS-SZEGED-CIS on the constituency track.
French is a different case: as in many languages, multi-word expressions (MWEs) are prevalent in French. The French Treebank (Abeillé et al., 2003) has the particularity of having been built with MWEs from the start. As this shared task focused on real-world evaluation (parsing raw text as much as possible) and because MWEs are annotated at the morpho-syntactic level in French, we decided to provide French raw text but to evaluate parsing jointly with MWE recognition.
How to interpret those lines?
For example, the line
F_mwe: 97.40 R_mwe: 97.96 P_mwe: 97.68 F_cmp: 97.89 R_cmp: 98.54 P_cmp: 98.22 F_mwe+P: 97.40 R_mwe+P: 97.96 P_mwe+P: 97.68 file: XX.parsed
is shorthand for the following information:
Total nb of sentences : 2541 file: XX.parsed
| | correct/gold (%) | correct/sys (%) | harmonic mean (%) | counts |
| Full MWEs | 97.40 | 97.96 | 97.68 | gold = 4043, sys = 4020, correct = 3938 |
| Full MWEs with correct POS | 97.40 | 97.96 | 97.68 | gold = 4043, sys = 4020, correct = 3938 |
| Components | 97.89 | 98.54 | 98.22 | gold = 6697, sys = 6653, correct = 6556 |
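The three percentages in each row can be reproduced from the counts given next to them: they equal correct/gold, correct/sys, and the harmonic mean of the two, in that order (that is, recall, precision and F-score under the usual definitions). A minimal sketch, where the function name `prf` is ours and not part of the evaluator:

```python
def prf(gold, system, correct):
    """Return (correct/gold, correct/system, harmonic mean) as percentages."""
    recall = correct / gold
    precision = correct / system
    f_score = 2 * precision * recall / (precision + recall)
    return tuple(round(100 * x, 2) for x in (recall, precision, f_score))

full_mwe = prf(gold=4043, system=4020, correct=3938)
components = prf(gold=6697, system=6653, correct=6556)
```

Here `full_mwe` comes out as (97.4, 97.96, 97.68) and `components` as (97.89, 98.54, 98.22), matching the figures above.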
If the *_mwe+P scores are null (0.0), it simply means that the mwehead feature was not provided in the test file (it is not mandatory, but it is useful for checking whether the full MWE prediction was indeed accurate). Check the FAQ for details on how those scores are calculated.
Grab the latest version of the evaluator here: http://pauillac.inria.fr/~seddah/do_eval_dep_mwe.pl
We actually do have them (only two data points are still missing, we're recalculating them)
The metrics you're seeing are:
  * Acc. (x100) → TedEval accuracy
  * Ex. gold (%) → exact match with respect to the gold (ptb for the const files and conll for the dep files)
  * Ex. gen (%) → exact match to the generalized gold (that is, the generic tree that is the intersection of the two other golds)
  * Norm. → normalisation factor
Given the time constraints, we only compared the IMS rest (ptb vs conll) and we gave a baseline (
Well, according to Reut, it's an “easy” metric, so to speak. First, the parsed trees are converted to a generic representation, reprojectivized if needed (the internal format is like a flattened ptb tree), and so is the gold. Then, for each tree, a minimum tree edit distance is calculated in terms of both node insertion and deletion and lexical item insertion and deletion. This cost is then normalized, which gives an accuracy score. It's a bit similar to LeafAncestor but extended to a tree instead of a set of lineages (LeafAncestor gives the mean of all edit distances between a gold lineage and a test lineage, a lineage being the path from a token to the root node).
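To give a feel for the lineage idea, here is a deliberately simplified sketch: it extracts each token's path of labels, compares gold and test paths with a unit-cost edit distance, and averages the per-token similarities. It is not Joachim Wagner's implementation nor the exact metric of Sampson & Babarczy (no constituent boundary markers, no special substitution costs, and normalizing by the longer path is our own choice), and it assumes well-formed trees over the same tokenization.

```python
import re

def lineages(bracketed):
    """For each token, list the labels on the path from the root down to it."""
    stack, paths, expect_label = [], [], False
    for tok in re.findall(r"\(|\)|[^\s()]+", bracketed):
        if tok == "(":
            if expect_label:
                stack.append("")  # previous bracket had no label
            expect_label = True
        elif tok == ")":
            stack.pop()
            expect_label = False
        elif expect_label:
            stack.append(tok)
            expect_label = False
        else:
            paths.append(list(stack))  # tok is a word
    return paths

def edit_distance(a, b):
    """Unit-cost Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def lineage_score(gold, test):
    """Mean per-token similarity between gold and test lineages, in [0, 1]."""
    gold_paths, test_paths = lineages(gold), lineages(test)
    if len(gold_paths) != len(test_paths):
        raise ValueError("gold and test must share the same tokenization")
    sims = [1 - edit_distance(g, t) / max(len(g), len(t), 1)
            for g, t in zip(gold_paths, test_paths)]
    return sum(sims) / len(sims)

gold = "(S (NP (DT the) (NN cat)) (VP (VB sleeps)))"
test = "(S (DT the) (NP (NN cat)) (VP (VB sleeps)))"
score = lineage_score(gold, test)
```

In this example only the lineage of "the" differs (S DT instead of S NP DT), so two of the three tokens score 1 and the third scores 2/3.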
Have a look at section 3 of Reut's ACL 2012 paper (http://www.tsarfaty.com/pdfs/acl12.pdf). Interestingly, if you want to evaluate on gold tokenization (to get a sense of what a TedEval accuracy score means), a Parseval score (gold_token) on Hebrew would give 88.75%, while a TED score on the same set would be 93.39 (labeled) and 94.35 (unlabeled). The difference between labeled and unlabeled is simply that all non-terminals in the generic trees are replaced by a dummy symbol.
Murphy's law at its extreme implementation. The last issue was the cluster monitoring dying because some magic numbers were exhausted by the shell, so it killed all evaluations (and most of them took more than 12 hours because of (a) a race condition somewhere and (b) the server room being too hot, so the server slowed the CPU frequency down to 1 GHz (instead of way, way more), and the RAM bandwidth dropped as well).
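The unlabeled mode can be pictured with a one-line transformation on a bracketed tree. This is only a sketch on ptb-style brackets, not TedEval's internal generic format; the function name `unlabel`, the dummy symbol "X", and the choice to relabel POS tags as well are assumptions of ours.

```python
import re

def unlabel(bracketed, dummy="X"):
    """Replace every bracket label with the same dummy symbol, keeping tokens."""
    return re.sub(r"\(\s*[^\s()]+", "(" + dummy, bracketed)

unlabeled = unlabel("(S (NP (NN cats)) (VP (VB sleep)))")
```

Once both gold and parsed trees go through the same relabeling, only the bracketing structure can differ, which is why unlabeled scores are never lower than labeled ones on the same trees.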