Shared Task Description

(Please note that this description may be extended in the coming days.)

For the shared task, we have nine different treebanks: Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish, and Swedish.

Although not always represented in treebanks, most languages have phenomena where space-delimited tokens do not correspond to the words that act as the active units of syntax. This covers multi-word expressions such as "in spite of" in English, but also phenomena in Arabic, where conjunctions are attached to the following word.

In order to provide the means to produce and evaluate more realistic parsing models, we provide, whenever possible, data following different tokenization schemes: one with gold word segmentation, and one with unsegmented text as it would appear in newspapers, etc.

Parsing Scenarios


We have two syntactic frameworks:

  1. constituent structure
  2. dependency structure

Constituent structures are available in two formats: an extended PTB bracketed style (i.e. Penn Treebank style with morphological features expressed at the POS or non-terminal levels, see below) and, where available, the Tiger 2 format. The latter can represent trees with crossing branches, allowing the use of parsing models that are more powerful (in terms of expressivity) than pure PCFG-based parsers.

Dependency structures are available in the CoNLL'07 format.

All treebank instances (dep. and const.) are aligned at the token level and share the same POS tagset and morphological features.

Participants can choose either one of those frameworks, or both, or one by conversion from the other.
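
For reference, here is a minimal Python sketch of a reader for the CoNLL'07 dependency format (our own illustration, not part of the shared task tools); it assumes the standard ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and blank lines between sentences.

  # Illustrative CoNLL'07 reader sketch; not part of the official tools.
  # Assumes ten tab-separated columns and blank lines separating sentences.
  FIELDS = ["id", "form", "lemma", "cpostag", "postag",
            "feats", "head", "deprel", "phead", "pdeprel"]

  def read_conll(path):
      """Yield one sentence at a time, as a list of token dicts."""
      sentence = []
      with open(path, encoding="utf-8") as f:
          for line in f:
              line = line.rstrip("\n")
              if not line:                  # a blank line ends the current sentence
                  if sentence:
                      yield sentence
                      sentence = []
                  continue
              sentence.append(dict(zip(FIELDS, line.split("\t"))))
      if sentence:                          # in case the file does not end with a blank line
          yield sentence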

Input scenarios


The types of scenarios that we assume in terms of input are as follows:

  1. Gold tags: gold word segmentation and gold tags are given
  2. Predicted tags, gold segmentation: gold word segmentation is given (for the languages where it matters, otherwise standard words), POS tags and morph. features are automatically predicted (available for Basque, German, Hungarian, Korean, Polish, Swedish)
  3. Fully Predicted: sentences are tokenized; raw words are given (available for French, Arabic and Hebrew)

The evaluation task focuses on the last two scenarios but participants are strongly encouraged to provide “gold mode” parsing results so that a performance ceiling can be determined for each system/framework.

Training set size scenario


Our data set contains treebanks of different sizes (from 6k to 50k sentences). In order to allow a fair comparison between treebank/parsing model pairs, we also provide training sets with a common size of 5,000 sentences. Participants should thus also provide results from parsing models trained on the small data sets.

To sum up, we have different scenarios with regard to the size of the training sets:

  • the full training set
  • a 5,000-sentence training set

The full test set will be used for both scenarios; we will later sample a common subset with similar properties in terms of sentence length and number of tokens.

Input Formats


The input format is a variant of the CoNLL format for dependencies. This is necessary to represent word segmentation issues and to easily allow the inclusion of morphological features and alternative analyses. We mark the beginning and the end of each word; words do not have to correspond to what we call tokens, which can consist of more than one word.

  • In Hebrew, where several (syntactically important) morphemes create a word, we will have something like the following: hebrewIn.pdf
  • Note that if one wants to deliver a lattice in which segmentation is ambiguous, they can do so by adding lines for alternative spans or alternative tags of spans. These lines need not be sorted. See the (real-world) example segmentation lattice here: multi.pdf
  • or the German morphology lattice file (predicted from the SMOR analyser):

  0 1 Der PRELS gender=fem|case=dat|number=sg| 1
  0 1 Der PRELS gender=masc|case=nom|number=sg| 1
  0 1 Der PDS gender=fem|case=dat|number=sg| 1

The format of Form/Lemma/CPos/FPos/Feats is exactly the same as in the CoNLL format, with vertical bars separating morphological features and = separating feature names from values. The only value in addition to the CoNLL ones is the original token ID in the last column.
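
To make the column layout concrete, here is a small Python sketch (our own illustration, not part of the shared task tools) that parses one line of the six-column German example above; the exact set of middle columns may differ per treebank, but the first two fields are the lattice span and the last field is the original token ID.

  # Illustrative sketch, assuming the layout of the German lattice example above:
  # <start node> <end node> <form> <tag> <feats> <original token id>
  def parse_lattice_line(line):
      start, end, form, tag, feats, token_id = line.split()
      # morphological features are '|'-separated, with '=' between name and value
      morph = dict(f.split("=", 1) for f in feats.split("|") if f)
      return {
          "start": int(start),       # lattice node where this analysis starts
          "end": int(end),           # lattice node where this analysis ends
          "form": form,
          "tag": tag,
          "morph": morph,
          "token_id": int(token_id)  # index of the original (space-delimited) token
      }

For the first line of the example above, this yields the form "Der" tagged PRELS with features {'gender': 'fem', 'case': 'dat', 'number': 'sg'}, linked to original token 1.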

Output Formats


In order to evaluate all scenarios, we consider the terminals present in both trees, and we need to keep track of how the given parse relates to the original word tokens (this is true both for parsing over a segmentation lattice and for MWEs).

  • For dependency trees, the parsers are to deliver the standard CoNLL format
  • For constituency trees we require the standard PTB-like trees over terminals (one tree per line)

For the scenario based on fully raw text, we additionally require a file containing the token IDs. That is, for the Hebrew sentence BCLM HNEIM, disambiguated as B CL FL HM H NEIM, we would get a (constituent/dependency) tree for these 6 morphological segments and, in a parallel file, the following line:

1 1 1 1 2 2

saying that the first 4 leaves in the syntax structure correspond to the first word in the raw text, and the last 2 leaves correspond to the second word.

This may also be used for multiword expressions: for “I live in Tel Aviv” we would get the tree as usual over the 5 terminals, and the line

1 2 3 4 4

In all cases, the tree terminals hang separately under their parents (in the case of an MWE they will hang flat under a shared parent).
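
To illustrate how the parallel token-ID line is meant to be read, here is a small Python sketch (our own illustration; the function name group_leaves is hypothetical) that groups the tree terminals by the original raw-text token they belong to.

  # Illustrative sketch: map tree terminals back to the original raw-text tokens
  # using the parallel token-ID line (e.g. "1 1 1 1 2 2").
  def group_leaves(leaves, token_id_line):
      ids = [int(i) for i in token_id_line.split()]
      assert len(ids) == len(leaves), "expect one token ID per tree terminal"
      groups = {}
      for leaf, tok_id in zip(leaves, ids):
          groups.setdefault(tok_id, []).append(leaf)
      return groups

  # For the Hebrew example above:
  #   group_leaves(["B", "CL", "FL", "HM", "H", "NEIM"], "1 1 1 1 2 2")
  #   -> {1: ['B', 'CL', 'FL', 'HM'], 2: ['H', 'NEIM']}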

Evaluating All Scenarios

Update (February 2014): the evalb package that was available on Djame's site was not the correct one. If your version doesn't have the -X switch, it's the buggy one.

  • French MWE Evaluation

On top of the classical evalb and eval07.pl evaluation, we will also provide results on multiword expressions. Thanks to Marie Candito, the evaluator for dependency output is provided on the tools page (see test/tools/do_eval_dep_mwe.pl). In the next few days, we'll provide the same script for MWE evaluation of constituency parses; in the meantime, here is the readme of the current tool.

SPMRL 2013 shared task dependency evaluation script for French.

EXPECTED FORMAT for marking MWEs:

 The script assumes that all MWEs are flat, with one component governing
 all the other components of the MWE with dependencies labeled <MWE_LABEL>.
 If provided, the part-of-speech of the MWE is expected to be given as the
 value of a <MWE_POS_FEAT> feature on the head token of the MWE.

OUTPUT:

 The script always outputs two evaluations, and possibly a third one:

  1. precision/recall/F-measure on components of MWEs (excluding heads of MWEs).
     A component of an MWE is counted as correct if it is attached to the same
     token as in the gold file, with label <MWE_LABEL>.
  2. precision/recall/F-measure on full MWEs.
     An MWE is counted as correct if its sequence of tokens also forms
     an MWE in the gold file.
  3. if both the gold file and the system file contain at least one <MWE_POS_FEAT> feature,
     a third evaluation is also provided, which uses a stricter criterion
     for full MWEs: they have to be composed of the same tokens as in the gold file AND
     the gold and predicted part-of-speech for the MWE have to match.

USAGE: perl do_eval_dep_mwe.pl [OPTIONS] -g <gold standard conll> -s <system output conll>

 [ -mwe_label <MWE_LABEL> ] label used for components of MWEs. Default = dep_cpd
 [ -mwe_pos_feat <MWE_POS_FEAT> ] use to define the feature name that marks heads of MWEs. Default = mwehead
 [ -help ] 
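
To make criterion 1 above concrete, here is a minimal Python sketch (our own illustration, not the official do_eval_dep_mwe.pl) of the component-level MWE scores; it assumes each sentence is represented as a dict mapping token positions to (head, label) pairs, and that MWE components are exactly the tokens bearing the MWE label (dep_cpd by default).

  # Illustrative sketch of the component-level MWE evaluation (criterion 1);
  # not the official do_eval_dep_mwe.pl script.
  def component_prf(gold, system, mwe_label="dep_cpd"):
      # keep only the MWE components, i.e. tokens bearing the MWE label
      gold_comp = {i: h for i, (h, lab) in gold.items() if lab == mwe_label}
      sys_comp = {i: h for i, (h, lab) in system.items() if lab == mwe_label}
      # a predicted component is correct if it is attached to the same token
      # as in the gold file, with the MWE label
      correct = sum(1 for i, h in sys_comp.items() if gold_comp.get(i) == h)
      precision = correct / len(sys_comp) if sys_comp else 0.0
      recall = correct / len(gold_comp) if gold_comp else 0.0
      f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
      return precision, recall, f_measure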