


Gold Tokens Evaluation

Dependency Parsing Track

We used the same protocol as in CoNLL 2007 (Nivre et al., 2007) and SPMRL 2013 (Seddah et al., 2013), in two scenarios (four settings in total):

  • Full train set size ⇒ with gold or predicted morphology (POS tag and features)
  • 5k sentences train set size ⇒ with gold or predicted morphology (POS tag and features).

- Note that the predicted data were provided as a baseline; participants were free to use their own. The Hebrew and Arabic predicted train sets were not produced with cross-fold jackknifing, so participants were encouraged to do it themselves (only one team, IMS_WROCLAW_SZEGED_CIS, used its own predicted morphology, for all languages).
- Please check the pred/full results first (most teams), then the gold/5k results (ICT+LORIA).

2014 submissions +Baseline Malt (all sent., punct. eval)

Web

csv

2014 Submissions + IMS_SZEGED_CIS+ Basque Team + Baseline Malt+ Maltoptimizer
For ease of analysis, we also provide results that include the entries from teams that participated in both editions of the shared task (using more or less the same system, but with a semi-supervised component, mostly word clusters acquired from the unlabeled data we provided).

Web

Csv

The Malt Baselines and Maltoptimizer results were provided by Miguel Ballesteros as part of his 2013 entries (thanks Miguel!)

All languages ranking

(Note that only one team, LORIA, submitted results for Arabic. This is due to the late availability of the Arabic unlabeled data, which prevented the other teams from training accurate models on this language's data set.)

Average score ranking

  • 1st Loria (9 languages)
  • 2nd IMS_WROCLAW_SZEGED_CIS (8 languages)
  • 3rd BASQUE TEAM (5 languages)

Soft average score ranking

  • 1st IMS_WROCLAW_SZEGED_CIS
  • 2nd BASQUE TEAM
  • 3rd Loria

Special kudos to the ICT team, who provided results only for the gold/5k track on 8 languages, and to the LORIA team, who provided results for all languages on all tracks. Congratulations to IMS_WROCLAW_SZEGED_CIS and the Basque Team for their state-of-the-art results!

Breakdown per language

(coming soon)


Constituency track

Gold Tokens

Parseval's Evaluation

Note that we used a modified version of Evalb (Black et al., 1991) (download), so that:
* (i) it can process the particular format of the SPMRL data,
* (ii) unparsed sentences actually penalize the global score (previously, an unparsed sentence was simply ignored and the count of unparsed sentences was incremented accordingly).
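
To make point (ii) concrete, here is a minimal sketch of one plausible way an unparsed sentence can weigh on a corpus-level bracketing F-score: its gold brackets are still counted, with nothing matched. The function name `f_score`, the `penalize_unparsed` flag and the tuple layout are assumptions made for this illustration; this is not the code of the modified Evalb.

```python
def f_score(sentences, penalize_unparsed=True):
    """Corpus-level bracketing F-score.

    sentences: list of (matched, gold_brackets, test_brackets) tuples,
    where test_brackets is None when the parser produced no tree.
    """
    matched = gold = test = 0
    for m, g, t in sentences:
        if t is None:
            if penalize_unparsed:
                gold += g  # gold brackets still count, nothing is matched
            continue       # old behaviour: the sentence is simply skipped
        matched += m
        gold += g
        test += t
    precision = matched / test if test else 0.0
    recall = matched / gold if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# One parsed sentence (8 of 10 brackets matched) and one unparsed sentence.
data = [(8, 10, 10), (0, 10, None)]
print(f_score(data, penalize_unparsed=False))  # unparsed sentence ignored
print(f_score(data, penalize_unparsed=True))   # recall drops, so does F
```

In this toy case the score falls from about 0.80 to about 0.53 once the unparsed sentence is counted.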

Protocol: all tokens are evaluated (including punctuation); top labels (TOP, S1, ROOT, VROOT) are deleted.
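
As an illustration of the "top labels are deleted" part of the protocol, the sketch below strips a top node from a PTB-style bracketed tree before scoring. It is only a sketch under assumptions: the helper names (`parse`, `strip_top`, `to_string`) are made up for this example, Evalb handles label deletion through its own parameter file, and this version only handles a top node with a single phrasal child.

```python
import re

TOP_LABELS = {"TOP", "S1", "ROOT", "VROOT"}  # labels listed in the protocol


def parse(text):
    """Parse a bracketed tree into (label, children) tuples; leaves are str."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


def strip_top(tree):
    """Drop top-level wrapper nodes that have a single phrasal child."""
    label, children = tree
    while label in TOP_LABELS and len(children) == 1 and isinstance(children[0], tuple):
        label, children = children[0]
    return (label, children)


def to_string(tree):
    """Serialize a (label, children) tree back to bracketed form."""
    if isinstance(tree, str):
        return tree
    label, children = tree
    return "(" + label + " " + " ".join(to_string(c) for c in children) + ")"


tree = parse("(ROOT (S (NP (DT the) (NN cat)) (VP (VB sleeps)) (. .)))")
print(to_string(strip_top(tree)))
```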

2014 submissions +Baseline Berkeley Parser (generic lexicon, 1 grammar) (all sent., punct. eval)

web csv

2014 submissions +2013 IMS entry +Baseline Berkeley Parser (generic lexicon, 1 grammar) (all sent., punct. eval)

web csv

LeafAncestor's Evaluation

Although not included in our initial evaluation protocol, the huge differences in our treebanks' node-per-terminal ratios make any cross-language comparison difficult, to say the least. We therefore provide Leaf Ancestor (Sampson & Babarczy, 2003) results, which for some configurations tell a different story.
We used Joachim Wagner's Leaf Ancestor implementation (download).

2014 submissions +Baseline Berkeley Parser (generic lexicon, 1 grammar) (all sent., punct. eval)

web csv

2014 submissions +Baseline Berkeley Parser (generic lexicon, 1 grammar) (all sent., punct. eval)

web csv

As with any other metric, these results are easier to interpret with the following data in mind:

http://pauillac.inria.fr/~seddah/stats_treebanks_ptb.csv

Non-terminal nodes per terminal ratio (split x languages)

ARABIC BASQUE FRENCH GERMAN HEBREW HUNGAR. KOREAN POLISH SWEDISH
train 5.04 1.21 2.90 1.66 1.68 1.64 1.05
train5k 6.26 1.19 2.92 1.65 2.33 1.74 1.47 1.05 1.70
test 5.05 1.20 2.93 1.66 2.12 1.60 1.63 1.04 1.74
dev 5.11 1.33 2.99 1.57 2.11 2.08 1.58 1.05 2.07
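
For readers who want to compute a comparable ratio on their own data, here is a minimal sketch that counts labeled nodes and terminals directly in PTB-style bracketed strings. Whether the figures above count preterminal (POS) nodes as non-terminals is not stated on this page, so the `count_preterminals` switch is an assumption, as is the function name.

```python
import re


def nt_per_terminal(trees, count_preterminals=True):
    """Ratio of non-terminal nodes to terminals over bracketed trees."""
    nonterminals = terminals = 0
    brackets = ("(", ")")
    for text in trees:
        tokens = re.findall(r"\(|\)|[^\s()]+", text)
        for i, tok in enumerate(tokens):
            if tok == "(":
                # A preterminal has the shape "( TAG word )".
                is_preterminal = (
                    i + 3 < len(tokens)
                    and tokens[i + 1] not in brackets
                    and tokens[i + 2] not in brackets
                    and tokens[i + 3] == ")"
                )
                if count_preterminals or not is_preterminal:
                    nonterminals += 1
            elif tok != ")" and i > 0 and tokens[i - 1] != "(":
                # Not a bracket and not a node label: a terminal.
                terminals += 1
    return nonterminals / terminals if terminals else 0.0


sample = ["(S (NP (DT the) (NN cat)) (VP (VB sleeps)))"]
print(nt_per_terminal(sample))                            # POS nodes counted
print(nt_per_terminal(sample, count_preterminals=False))  # phrasal nodes only
```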

All languages ranking

The same disclaimer as for the dependency track applies.

Average score ranking

  • 1st Alpage-lexparser (9 languages)
  • 2nd IMS_WROCLAW_SZEGED_CIS (8 languages)

Soft average score ranking

  • 1st IMS_WROCLAW_SZEGED_CIS
  • 2nd Alpage-lexparser

Congratulations to both teams: they were brave enough to submit on the constituency track!

PREDICTED TOKENS EVALUATION

COMING SOON !!

Arabic and Hebrew

The Arabic and Hebrew data sets were provided with generated lattices (disambiguated and non-disambiguated for Hebrew, disambiguated only for Arabic; the data exist though, and should be made available at some point).

Results on the predicted tokens scenarios are evaluated using Tedeval 2.2 (Tsarfaty et al., 2011, 2012) in two modes:

  • A fully labeled mode (where edges, either from const. trees or dependencies, are decorated with their original labels). This mode allows for a full comparison between dependency parses produced on gold tokens and on predicted tokens from the raw source text.
  • An unlabeled mode, which allows for cross-framework comparison (between const. and dep. parsers). In order to perform a fully labeled evaluation of a const. tree, each edge needs to bear a function label.

Faq

Why are Tedeval scores usually higher than eval07's or evalb's?

Well, according to Reut, it's an “easy” metric, so to speak. First, the parsed trees are converted to a generic representation and reprojectivized if needed (the internal format is like a flattened PTB tree); the same is done for the gold trees. Then, for each tree, a minimum tree edit distance is calculated in terms of both node insertions and deletions and lexical item insertions and deletions. This cost is then normalized, which gives an accuracy score. It's a bit similar to LeafAncestor, but extended to a whole tree instead of a set of lineages (LeafAncestor gives the mean of all edit distances between a gold lineage and a test lineage, a lineage being the path from a token to the root node).
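
To give an idea of the lineage-based side of this comparison, here is a simplified sketch of a LeafAncestor-style score: one lineage per token, an edit distance between gold and test lineages, and the mean over tokens. It is not Joachim Wagner's implementation and not Tedeval: the real LeafAncestor metric also inserts constituent boundary markers, and the similarity used here (1 - distance / (len(gold) + len(test))), the inclusion of the POS tag in the lineage and all function names are assumptions of this sketch. Trees are given as nested (label, children) tuples with the same tokens in both trees.

```python
def lineages(tree, path=()):
    """Yield, for each token, the labels from its parent up to the root."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield (label,) + path
        else:
            yield from lineages(child, (label,) + path)


def edit_distance(a, b):
    """Plain Levenshtein distance between two label sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def leaf_ancestor_score(gold, test):
    """Mean lineage similarity over the tokens of one sentence."""
    sims = [
        1 - edit_distance(g, t) / (len(g) + len(t))
        for g, t in zip(lineages(gold), lineages(test))
    ]
    return sum(sims) / len(sims)


gold = ("S", [("NP", [("DT", ["the"]), ("NN", ["cat"])]), ("VP", [("VB", ["sleeps"])])])
test = ("S", [("DT", ["the"]), ("NN", ["cat"]), ("VP", [("VB", ["sleeps"])])])  # NP missing
print(leaf_ancestor_score(gold, test))
```

Here the two tokens that lost their NP each get a similarity of 0.8 and the third gets 1.0, so the sentence score is their mean.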

Have a look at section 3 of Reut's ACL 2012 paper (http://www.tsarfaty.com/pdfs/acl12.pdf). Interestingly, if you want to evaluate on gold tokenization (to get an idea of what a Tedeval accuracy score means), a Parseval score (gold_token) on Hebrew would give 88.75%, while a TED score on the same set would be 93.39 (labeled) and 94.35 (unlabeled). The difference between labeled and unlabeled is simply that all non-terminals in the generic trees are replaced by a dummy symbol.
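
The labeled/unlabeled difference can be pictured with a very small sketch. The function name `unlabel`, the dummy symbol "X" and the nested (label, children) tuple representation are assumptions for illustration only; Tedeval works on its own generic tree format, and whether POS tags are also replaced there is not stated on this page.

```python
def unlabel(tree, dummy="X"):
    """Replace every non-terminal label with a dummy symbol; keep tokens."""
    if isinstance(tree, str):
        return tree  # lexical items are left untouched
    _label, children = tree
    return (dummy, [unlabel(child, dummy) for child in children])


labeled = ("S", [("NP", [("NN", ["cat"])]), ("VP", [("VB", ["sleeps"])])])
print(unlabel(labeled))
```

After this step, two trees can only differ in their shape and tokens, not in their labels.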

What are the parsers used as baseline for the const. track?

  • BASELINE_BKY_TAGGED: Latest version of the Berkeley parser with POS tags supplied
  • BASELINE_BKY_RAW: Latest version of the Berkeley parser with raw tokens only

Why does it take so long to get the cross-framework results?

Once again, like last year, Murphy's law in its most ironic implementation. After the unplanned reboot of the clusters and two disk crashes requiring a full rebuild of a RAID 5 disk array on our backup server, the network card and/or the third RAID disk seems to be dying, resulting in extreme slowness of any I/O. Of course, this is mid-August, in France, so in our cloud absolutely no one can hear you scream. We're working on it though.

official_results_page.1408325952.txt.gz · Last modified: 2014/08/18 03:39 by seddah