official_results_page

Differences

This shows you the differences between two versions of the page.

official_results_page [2014/08/18 03:33]
seddah [All languages ranking]
official_results_page [2014/08/18 03:45]
seddah
Line 13: Line 13:
  
  
-[[http://pauillac.inria.fr/~seddah/2014official_conll-all.spmrl_results.html|Web ]]+[[http://pauillac.inria.fr/~seddah/2014official_leaf-all.spmrl_results.html|Web ]]
  
-[[http://pauillac.inria.fr/~seddah/2014official_conll-all.csv| csv]]+[[http://pauillac.inria.fr/~seddah/2014official_leaf-all.csv| csv]]
  
  
Line 132: Line 132:
 ---
 ====== PREDICTED TOKENS EVALUATION ======
 +** COMING SOON !!**\\
  
 ===== Arabic and Hebrew =====
Line 142: Line 143:
  
  
-  * [[http://pauillac.inria.fr/~seddah/predicted_token.tedeval.labeled.csv| Tedeval Labeled evaluation (csv)]]
-  * [[http://pauillac.inria.fr/~seddah/tedeval_unlabeled.pred_tagged+pred_token.csv| Tedeval Unlabeled evaluation (csv)]]
- 
-Note that the IMS-SZEGED-CIS ptb scores are lower than expected in labeled mode. This is because the trees were not annotated with functions.
- 
-Even though the cross-framework results are still pending, these results also include those of IMS-SZEGED-CIS on the constituency track.
- 
-===== French ===== 
- 
-French is a different case: as in many languages, multi-word expressions (MWEs) are prevalent in French. The French Treebank (Abeillé et al., 2003) thus has the particularity of having been built with MWEs from the start. As this shared task focused on real-world evaluation (parsing raw text as much as possible) and because MWEs are annotated at the morpho-syntactic level in French, we decided to provide French raw text but to evaluate parsing jointly with MWE evaluation.
- 
-  * [[http://pauillac.inria.fr/~seddah/official-mwe.spmrl_results.html| mwe scores on French (web)]]
-  * [[http://pauillac.inria.fr/~seddah/mwe_french_eval_conll.csv| mwe scores on French (csv)]]
- 
-How to interpret those lines?
- 
-For example, the line
- 
-**F_mwe: 97.40 R_mwe: 97.96 P_mwe: 97.68 F_cmp: 97.89 R_cmp: 98.54 P_cmp: 98.22 F_mwe+P: 97.40 R_mwe+P: 97.96 P_mwe+P: 97.68 file: XX.parsed**
- 
-is a shortcut for the following information:
- 
-Total number of sentences: 2541, file: XX.parsed
- 
-^                            ^ Recall ^ Precision ^ Fscore ^                                           |
-| Full MWEs                  | 97.40  | 97.96     | 97.68  | (gold = 4043, sys = 4020, correct = 3938) |
-| Full MWEs with correct POS | 97.40  | 97.96     | 97.68  | (gold = 4043, sys = 4020, correct = 3938) |
-| Components                 | 97.89  | 98.54     | 98.22  | (gold = 6697, sys = 6653, correct = 6556) |
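To make the table above concrete, here is a minimal Python sketch of the arithmetic that turns the (gold, sys, correct) counts into recall, precision and F-score. It is an illustration only; the official scorer remains do_eval_dep_mwe.pl.

<code python>
# Recall, precision and F-score from raw counts (illustration only).
def prf(gold, sys_count, correct):
    recall = correct / gold
    precision = correct / sys_count
    fscore = 2 * precision * recall / (precision + recall)
    return recall, precision, fscore

# "Full MWEs" row: gold = 4043, sys = 4020, correct = 3938
r, p, f = prf(4043, 4020, 3938)
print("R = %.2f  P = %.2f  F = %.2f" % (100 * r, 100 * p, 100 * f))
# -> R = 97.40  P = 97.96  F = 97.68
</code>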
- 
- 
- 
- 
- 
-If the *_mwe+P score is null (0.0), it simply means that the mwehead feature was not provided in the test file (it is not mandatory, but it is useful to check whether the full MWE prediction was indeed accurate).
-Check the FAQ for details on how those scores are calculated.
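As a side note, a quick way to check whether a system file provides that feature is sketched below. It assumes a CoNLL-X-style layout (tab-separated columns, FEATS as the 6th column, "|"-separated attribute=value pairs); that layout is an assumption here, not something taken from the evaluator.

<code python>
# Hedged sketch: does any token of a CoNLL-style file carry a mwehead feature?
# Assumes FEATS is column 6 and features are "|"-separated attr=value pairs.
def has_mwehead(path):
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) > 5 and "mwehead=" in cols[5]:
                return True
    return False

# print(has_mwehead("XX.parsed"))  # False -> the *_mwe+P scores will be 0.0
</code>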
- 
-Grab the latest version of the evaluator here:
-http://pauillac.inria.fr/~seddah/do_eval_dep_mwe.pl
- 
- 
- 
- 
-====== CROSS FRAMEWORKS EVALUATION ====== 
-We actually do have them (only two data points are still missing; we're recalculating them).
- 
-  * The evaluation protocol is the following:\\
-  * train5k files
-  * Gold morphology (and pred, but here gold matters the most, as it alleviates the differences in predicted-morphology accuracy across the various languages)
-  * and evaluated on a subset of the test file: the first 5000 tokens with respect to sentence boundaries (which gives, for example, 5007 tokens for French and 4983 for Arabic, as in the conll2007 test files); see the sketch after this list
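Here is a minimal sketch of that truncation, assuming blank-line-separated CoNLL sentences with one token per line; the file name is hypothetical and this is not the official tooling.

<code python>
# "First 5000 tokens with respect to sentence boundaries": whole sentences are
# kept until the 5,000-token budget is reached, so the final count can slightly
# exceed 5,000 (e.g. 5007 for French).
def truncate_conll(path, budget=5000):
    kept, n_tokens, sentence = [], 0, []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                sentence.append(line)      # one token per line
            else:                          # blank line = end of sentence
                if sentence:
                    kept.append(sentence)
                    n_tokens += len(sentence)
                    sentence = []
                if n_tokens >= budget:     # budget reached: stop here
                    break
    if sentence and n_tokens < budget:     # file without a trailing blank line
        kept.append(sentence)
        n_tokens += len(sentence)
    return kept, n_tokens

# Hypothetical usage:
# sentences, count = truncate_conll("test.French.pred.conll")
# print(count)   # e.g. 5007
</code>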
-  
-The metrics you're seeing are:
- 
-Acc. (x100)  -> tedeval accuracy
-Ex. gold (%) -> exact match wrt the gold (ptb for the const file and conll for the dep files)
-Ex. gen (%)  -> exact match wrt the generalized gold (that is, the generic tree being the intersection of the two other golds)
-Norm.        -> Normalisation factor
- 
- 
-[[http://pauillac.inria.fr/~seddah/official_cross_tedeval_unlabled-70-5ktok.spmrl_results.html|web]]
- 
-[[http://pauillac.inria.fr/~seddah/official_cross_tedeval_unlabled-70-5ktok.csv|csv]]
- 
-Given the time constraints, we only compared the IMS results (ptb vs conll) and we gave a baseline (
- 
- 
----- 
 ====== Faq ======
  
-=== Why are Tedeval scores higher than eval07's or evalb's? ===+=== Why are Tedeval scores usually higher than eval07's or evalb's? ===
  
 Well, according to Reut, it's an "easy" metric, so to speak.
Line 232: Line 167:
  * BASELINE_BKY_TAGGED: Last version of the Berkeley parser with POS tags supplied
  * BASELINE_BKY_RAW: Last version of the Berkeley parser with raw tokens only
-  * BASELINE_LAST_TAGGED: Last experimental version of the Lorg parser, used with POS tags supplied for unknown words, with a sentence-length cut-off of 70 (no analysis given beyond 70).
-  * BASELINE_TAGGED: Older version of Lorg with POS tags forced
  
 === Why does it take so long to get the cross framework results? ===
-Murphy's law at its extreme implementation. The last issue was the cluster monitoring dying because some magic numbers were exhausted by the shell, so it killed all evaluations (and most of them took more than 12 hours, because of a) a race condition somewhere, and b) the server room being too hot, so the server lowered the CPU frequency to 1 GHz (instead of way, way more) and the RAM bandwidth dropped as well).+Once again, like last year, Murphy's law at its most ironic implementation. After the unplanned reboot of the clusters and two crashed disks requiring a full rebuild of a RAID 5 disk array on our backup server, the network card and/or the 3rd RAID disk seems to be dying, resulting in extreme slowness of any I/O. Of course, this is mid-August, in France, so in our cloud absolutely no one can hear you scream. We're working on it, though.
official_results_page.txt · Last modified: 2014/08/24 02:58 by seddah