Gold Tokens Evaluation

IMPORTANT (September 26, 2013, 01:26): Due to a wrong initial setting (entirely our fault), the baseline constituency Lorg scores for all settings have been taken off the official results (and forgotten altogether).
We'll update the webpage once we're finished with the overview paper.
(September 19, 2013, 15:15): Baseline scores for the constituency track have been recalculated: some files were missing, so we ran the evaluation again. Great.

Dependency Parsing Track

We used the same protocol as in CoNLL 2007 (Nivre et al., 2007), in 4 scenarios:

  • Full train set size ⇒ with gold or predicted morphology (POS tag and features)
  • 5k sentences train set size ⇒ with gold or predicted morphology (POS tag and features).

Note that the predicted data were provided as a baseline; participants were free to use their own. The French, Hebrew and Arabic predicted train sets were not produced with cross-fold jackknifing, so participants were encouraged to do it themselves (only a few did use their own predicted morphology, though: Alpage-IGM and Alpage-Dyalog for French, Cadim for Arabic and IMS-SZEGED-CIS for all languages).

(all sentences, including punct,

All languages ranking

  • 2nd Alpage-Dyalog
  • 3rd MaltOptimizer

Breakdown per language

(coming soon)

Constituency track

Gold Tokens

Because parsing long Arabic sentences took an excessive amount of time, we applied a length cut-off of 70 to our baseline parsers. The “all sentences” results are nevertheless the official results. The <=70 ones are given for comparison's sake.

Parseval's Evaluation

Note that we used a modified version of Evalb (Black et al., 1991):

* (i) it can process the particular format of the SPMRL data,
* (ii) unparsed sentences actually penalize the global score (previously, an unparsed sentence was simply ignored and only the count of unparsed sentences was incremented).

Protocol: all tokens are evaluated (including punctuation); top labels (TOP, S1, ROOT, VROOT) are deleted.
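The protocol above can be illustrated with a minimal sketch of labeled bracket scoring. This is not the modified Evalb itself: the bracket representation `(label, start, end)`, the function name `bracket_prf`, and the choice to count the gold brackets of an unparsed sentence with zero matches are assumptions made for illustration.

```python
from collections import Counter

# Top labels deleted before scoring, as listed in the protocol.
TOP_LABELS = {"TOP", "S1", "ROOT", "VROOT"}

def bracket_prf(sentences):
    """Micro-averaged labeled precision, recall and F-score.

    `sentences` is an iterable of (gold, test) pairs, where each side is a
    list of (label, start, end) brackets. `test` is None for an unparsed
    sentence (hypothetical convention): its gold brackets still count, so
    recall drops instead of the sentence being ignored.
    """
    matched = gold_total = test_total = 0
    for gold, test in sentences:
        g = Counter(b for b in gold if b[0] not in TOP_LABELS)
        t = Counter(b for b in (test or []) if b[0] not in TOP_LABELS)
        matched += sum((g & t).values())  # multiset intersection
        gold_total += sum(g.values())
        test_total += sum(t.values())
    precision = matched / test_total if test_total else 0.0
    recall = matched / gold_total if gold_total else 0.0
    total = precision + recall
    fscore = 2 * precision * recall / total if total else 0.0
    return precision, recall, fscore

example = [
    # Perfectly parsed sentence; the TOP bracket is ignored.
    ([("S", 0, 2), ("NP", 0, 1)], [("TOP", 0, 2), ("S", 0, 2), ("NP", 0, 1)]),
    # Unparsed sentence: two gold brackets, nothing matched.
    ([("S", 0, 3), ("VP", 1, 3)], None),
]
precision, recall, fscore = bracket_prf(example)
```

In this toy example precision stays at 1.0 while recall falls to 0.5, which is the kind of penalty item (ii) describes.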

csv all csv <=70

Note: for Arabic, one of our baseline parsers had a bug that added a spurious enclosing parenthesis, which caused evalb to penalize its scores. Of course, this is less visible in the LeafAncestor scores.

LeafAncestor's Evaluation

Leaf Ancestor was not part of our initial evaluation protocol, but the huge differences in our treebanks' node-per-terminal ratios make any cross-language comparison difficult, to say the least. We therefore provide Leaf Ancestor (Sampson & Babarczy, 2003) results, which for some configurations tell a different story.
We used Joachim Wagner's Leaf Ancestor implementation (download)

csv all csv <=70

As with any other metric, those results are easier to interpret with the following data in mind:

Non-terminal nodes per terminal ratio (split x languages)

train 5.04 1.21 2.90 1.66 1.68 1.64 1.05
train5k 6.26 1.19 2.92 1.65 2.33 1.74 1.47 1.05 1.70
test 5.05 1.20 2.93 1.66 2.12 1.60 1.63 1.04 1.74
dev 5.11 1.33 2.99 1.57 2.11 2.08 1.58 1.05 2.07
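For readers who want to compute a comparable ratio on their own data, here is a minimal sketch for a single PTB-style bracketed tree. How the figures above were actually counted is not stated on this page; the sketch assumes that only phrasal nodes (nodes dominating at least one other node) count as non-terminals and that POS preterminals are excluded. The function name is hypothetical.

```python
import re

def nonterminal_terminal_ratio(tree: str) -> float:
    """Ratio of phrasal nodes to terminals in one bracketed tree.

    Assumption: a preterminal such as (NN cat) is not counted as a
    non-terminal; only nodes that dominate another node are.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", tree)
    stack = []      # one flag per open node: has it seen a child node yet?
    phrasal = 0
    terminals = 0
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            if stack:
                stack[-1] = True          # the parent dominates a node
            stack.append(False)
            # Skip the node label, if the node has one.
            if i + 1 < len(tokens) and tokens[i + 1] not in {"(", ")"}:
                i += 1
        elif tok == ")":
            if stack.pop():
                phrasal += 1
        else:
            terminals += 1
        i += 1
    return phrasal / terminals

ratio = nonterminal_terminal_ratio("(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))")
# S, NP and VP are phrasal; "the", "cat", "sleeps" are terminals -> 1.0
```

A treebank-level figure would sum both counts over all trees before dividing, rather than averaging per-tree ratios.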

All languages ranking


Congratulations to the IMS-SZEGED-CIS team: they were the only ones brave enough to submit on the constituency track!


Arabic and Hebrew

The Arabic and Hebrew data sets were provided with generated lattices (disambiguated and non-disambiguated for Hebrew, disambiguated only for Arabic; the data do exist, though, and should be made available at some point).

Results on the predicted tokens scenarios are evaluated using Tedeval 2.2 (Tsarfaty et al., 2011, 2012) in two modes:

  • A fully labeled mode (where edges, either from const. trees or dependencies, are decorated with their original labels). This mode allows for a full comparison between dependency parses produced on gold tokens and on tokens predicted from the raw source text.
  • An unlabeled mode, which allows for cross-framework comparison (between const. and dep. parsers). In order to perform a fully labeled evaluation of a const. tree, each edge needs to bear a function label.

Note that the IMS-SZEGED-CIS ptb scores are lower than expected in labeled mode. This is because the trees were not annotated with functions.

Even though the cross framework results are still pending, those results also include the results from IMS-SZEGED-CIS on the constituency track.


French is a different case: as in many languages, multi-word expressions (MWEs) are prevalent in French. The French Treebank (Abeillé et al., 2003) has the particularity of having been built with MWEs from the start. As this shared task focused on real-world evaluation (parsing raw text as much as possible) and because MWEs are annotated at the morpho-syntactic level in French, we decided to provide French raw text but to evaluate parsing jointly with MWE recognition.

How to interpret those lines?

For example the line

F_mwe: 97.40 R_mwe: 97.96 P_mwe: 97.68 F_cmp: 97.89 R_cmp: 98.54 P_cmp: 98.22 F_mwe+P: 97.40 R_mwe+P: 97.96 P_mwe+P: 97.68 file: XX.parsed

is shorthand for the following information:

Total nb of sentences : 2541 file: XX.parsed

                             Recall  Precision  Fscore
Full MWEs                     97.40      97.96   97.68  (gold = 4043, sys = 4020, correct = 3938)
Full MWEs with correct POS    97.40      97.96   97.68  (gold = 4043, sys = 4020, correct = 3938)
Components                    97.89      98.54   98.22  (gold = 6697, sys = 6653, correct = 6556)

If the *_mwe+P scores are null (0.0), it simply means that the mwehead feature was not provided in the test file (it is not mandatory, but it is useful to see whether the full MWE prediction was indeed accurate). Check the FAQ for details on how those scores are calculated.
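The percentages in the table can be recomputed from the counts given in parentheses. The sketch below is not the evaluator's code; the function name `prf_from_counts` is hypothetical and the formulas are the standard definitions of recall, precision and F-score, which happen to reproduce the Recall, Precision and Fscore columns exactly.

```python
def prf_from_counts(gold: int, system: int, correct: int):
    """Return (recall, precision, fscore) as percentages."""
    recall = 100.0 * correct / gold if gold else 0.0
    precision = 100.0 * correct / system if system else 0.0
    total = recall + precision
    fscore = 2 * recall * precision / total if total else 0.0
    return recall, precision, fscore

full_mwes = prf_from_counts(gold=4043, system=4020, correct=3938)
components = prf_from_counts(gold=6697, system=6653, correct=6556)

print(" ".join(f"{v:.2f}" for v in full_mwes))   # 97.40 97.96 97.68
print(" ".join(f"{v:.2f}" for v in components))  # 97.89 98.54 98.22
```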

grab the latest version of the evaluator here


We actually do have them (only two data points are still missing, we're recalculating them)

The evaluation protocol is the following:

  • train5k files
  • Gold morphology (and predicted, but here gold matters the most as it evens out the differences in predicted morphology accuracy across languages)
  • Evaluation on a subset of the test file: the first 5000 tokens, respecting sentence boundaries (which gives, for example, 5007 tokens for French and 4983 for Arabic, as in the conll2007 test files)
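The subset selection in the last item can be sketched as follows. The exact rule used for the shared task is not spelled out here; since the examples land both above (5007) and below (4983) the target, this sketch assumes the cut is made at the sentence boundary closest to 5000 tokens. The function name and the list-of-token-lists input format are hypothetical.

```python
def first_tokens_subset(sentences, target=5000):
    """Return the prefix of `sentences` whose token count is closest to `target`.

    `sentences` is a list of token lists. Assumption: ties and the exact
    cut-off rule of the shared task may differ from this sketch.
    """
    total = 0
    best_k, best_diff = 0, target   # empty prefix is `target` tokens away
    for k, sentence in enumerate(sentences, start=1):
        total += len(sentence)
        diff = abs(total - target)
        if diff < best_diff:
            best_k, best_diff = k, diff
        if total >= target:
            break                   # later boundaries can only be further away
    return sentences[:best_k]

toy = [["a"] * 3, ["b"] * 4, ["c"] * 5]        # boundaries at 3, 7 and 12 tokens
under = first_tokens_subset(toy, target=8)     # closest boundary is 7
over = first_tokens_subset(toy, target=11)     # closest boundary is 12
```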

The metrics you're seeing are:

  • Acc. (x100) → tedeval accuracy
  • Ex. gold (%) → exact match with respect to the gold (ptb for the const. file and conll for the dep. files)
  • Ex. gen (%) → exact match to the generalized gold (that is, the generic tree being the intersection of the two other golds)
  • Norm. → normalisation factor



Given the time constraints, we only compared the IMS rest (ptb vs conll) and we gave a baseline (


Why are Tedeval scores higher than eval07's or evalb's?

Well, according to Reut, it's an “easy” metric, so to speak. First, the parsed trees are converted to a generic representation and reprojectivized if needed (the internal format is like a flattened ptb tree); so is the gold. Then, for each tree, a minimum tree edit distance is calculated in terms of both node insertions and deletions and lexical item insertions and deletions. This cost is then normalized, which gives an accuracy score. It's a bit similar to LeafAncestor, but extended to a whole tree instead of a set of lineages (LeafAncestor gives the mean of all edit distances between a gold lineage and a test lineage, a lineage being the path from a token to the root node).

Have a look at section 3 of Reut's ACL 2012 paper. Interestingly, if you evaluate on gold tokenization (to get a feel for what a tedeval accuracy score means), a parseval score (gold_token) on Hebrew would give 88.75% while a ted score on the same set would be 93.39 (labeled) and 94.35 (unlabeled). The difference between labeled and unlabeled is simply that all non-terminals in the generic trees are replaced by a dummy symbol.
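To make the lineage comparison mentioned in this answer concrete, here is a simplified sketch of a LeafAncestor-style score. It is neither Joachim Wagner's implementation nor TedEval: the costs (1 for insertion and deletion, 2 for substitution), the normalisation by the summed lineage lengths, and the absence of boundary markers are all assumptions, and the function names are hypothetical.

```python
def edit_distance(a, b, sub_cost=2):
    """Edit distance between two label sequences (insert/delete cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,                               # delete x
                cur[j - 1] + 1,                            # insert y
                prev[j - 1] + (0 if x == y else sub_cost), # match / substitute
            ))
        prev = cur
    return prev[-1]

def lineage_similarity(gold, test):
    """1.0 for identical lineages, 0.0 for lineages sharing nothing."""
    return 1 - edit_distance(gold, test) / (len(gold) + len(test))

def leaf_ancestor_score(gold_lineages, test_lineages):
    """Mean lineage similarity over the tokens of one sentence."""
    sims = [lineage_similarity(g, t) for g, t in zip(gold_lineages, test_lineages)]
    return sum(sims) / len(sims)

# One lineage per token, listed from the token's parent up to the root.
gold = [["NP", "S"], ["VP", "S"]]
test = [["NP", "S"], ["NP", "S"]]
score = leaf_ancestor_score(gold, test)   # (1.0 + 0.5) / 2 = 0.75
```

A single wrong label costs only part of one token's score here, which is one intuition for why lineage-based and edit-distance-based scores tend to be higher than bracket-based ones.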

What are the parsers used as baseline for the const. track?

  • BASELINE_BKY_TAGGED: Latest version of the Berkeley parser with POS tags supplied
  • BASELINE_BKY_RAW: Latest version of the Berkeley parser with raw tokens only
  • BASELINE_LAST_TAGGED: Latest experimental version of the Lorg parser, used with POS tags supplied for unknown words and with a sentence length cut-off of 70 (no analysis given beyond 70).
  • BASELINE_TAGGED: Older version of Lorg with POS tags forced

Why does it take so long to get the cross framework results?

Murphy's law at its most extreme. The last issue was the cluster monitoring dying because some magic numbers were exhausted by the shell, so it killed all evaluations. Most of them had taken more than 12 hours because of (a) a race condition somewhere and (b) the server room being too hot, so the server throttled the CPU frequency to 1 GHz (instead of way, way more), and the RAM bandwidth dropped accordingly.

official_results_pages_news.txt · Last modified: 2015/04/24 22:48 by dseddah