Frequently Asked Questions

Do I have to scan the signed license files?

If you don't have access to a scanner, you can also use the “signature” option commonly found in recent PDF viewers (e.g. OS X's Preview.app).


Do I need to submit for all languages?

We received some questions about the submission procedure that we'd like to clarify. Even though we did not make it mandatory on our websites (the wiki and spmrl.org's shared task page), we would ideally prefer that teams submit results for all languages. However, we realize that the time frame is very tight, that teams may not have the resources to tackle all languages, and that certain approaches may make more sense for some languages than for others. So we will accept submissions for subsets of languages. Please note that papers describing full-language-set submissions will be allowed more space and may have priority for oral presentations.


What data do you provide?

We provide constituency and dependency treebanks, aligned at the token and part-of-speech levels (the tokens are identical, so for example all brackets are in Penn Treebank style), for 9 languages:
Arabic, Basque, French, German, Hebrew, Hungarian, Korean, Polish and Swedish.

For all languages we provide large unlabeled data sets with automatic morpho-syntactic annotation.

Language    Source           Type           Size (tokens)   Morph         Parsed    Status
Arabic      Gigaword         news           100M            yes (SPMRL)   not yet   upcoming
Basque      web              balanced       120M            yes (SPMRL)   yes       ready
French      Est Republicain  newswire       120M            yes (SPMRL)   not yet   upcoming
German      Wikipedia        wiki (edited)  205M            yes (SPMRL)   yes       ready
Hebrew      Wikipedia        wiki (edited)  160M            yes (SPMRL)   yes       ready
Hungarian   news domain      newswire       100M            yes (SPMRL)   yes       ready
Korean      news domain      newswire       80M             yes (SPMRL)   not yet   upcoming
Polish      Wikipedia        wiki (edited)  100M            yes (SPMRL)   yes       ready
Swedish     PAROLE           newswire       24M             yes (SPMRL)   yes       ready

How are the data organized?

Directory structure

- Each language root directory (e.g. GERMAN_SPMRL) contains two directories, pred and gold, where pred is a shortcut for the treebank with predicted morphology (POS tags and morphological features, additionally lemmas if predicted lemmas were made available) and gold means manually validated morphological data.

- Each of those directories contains two directories, ptb and conll, where ptb means labelled bracketed data (e.g. Penn Treebank style) and conll means the CoNLL-X dependency format. In addition, some treebanks come with a native TIGER 2 XML description (with crossing branches); in that case a third directory named xml is present at the same hierarchical level as the others.

- Each of these directories is then divided into dev, train and train5k directories. For small treebanks, whose train set is exactly 5,000 sentences, the train directory is omitted. (The test set is of course not included.)
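Putting those layers together, the path to a given data directory can be assembled mechanically. A minimal sketch of the described hierarchy, assuming only the directory names listed above (the `treebank_path` helper and its defaults are ours, not part of the release):

```python
from pathlib import Path

def treebank_path(root, language, morph="pred", fmt="conll", split="dev"):
    # Hypothetical helper: encodes the layout described above,
    # i.e. <LANGUAGE>_SPMRL / {gold,pred} / {ptb,conll,xml} / {dev,train,train5k}
    assert morph in {"gold", "pred"}
    assert fmt in {"ptb", "conll", "xml"}
    assert split in {"dev", "train", "train5k"}
    return Path(root) / f"{language}_SPMRL" / morph / fmt / split

print(treebank_path("/data", "GERMAN").as_posix())
# -> /data/GERMAN_SPMRL/pred/conll/dev
```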


Provided Files

Each final directory (i.e. dev, test, …) contains at least the following 4 files:

- *.{gold,pred}.{conll,ptb} : the treebank data (with either gold or predicted morphology)

- *.{gold,pred}.{conll,ptb}.tobeparsed.raw : the leaf nodes of the constituency treebank (one sentence per line), or the CoNLL word-form column.

- *.{gold,pred}.{conll,ptb}.tobeparsed.tagged : files provided with either gold or predicted POS and morphological features, in the following format for constituency files: Word POS Features

e.g. (Hungarian, tab-separated):

mind C lem=mind|SubPOS=c|Form=c|Coord=w
pedig C lem=pedig|SubPOS=c|Form=s|Coord=w
a T lem=a|SubPOS=f

and the first 6 columns of the corresponding CoNLL file (ID, FORM, LEMMA, CPOS, FPOS, FEAT):

25 mind mind C C SubPOS=c|Form=c|Coord=w
26 pedig pedig C C SubPOS=c|Form=s|Coord=w
27 a a T T SubPOS=f
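The tagged format above is straightforward to consume programmatically. A minimal sketch, assuming tab-separated columns and pipe-separated key=value feature pairs as in the Hungarian example (the helper name is ours):

```python
def parse_tagged_line(line):
    # Word \t POS \t Features, where Features is a pipe-separated
    # list of key=value pairs (e.g. "lem=mind|SubPOS=c|Form=c|Coord=w")
    word, pos, feats = line.rstrip("\n").split("\t")
    features = dict(f.split("=", 1) for f in feats.split("|"))
    return word, pos, features

word, pos, feats = parse_tagged_line("mind\tC\tlem=mind|SubPOS=c|Form=c|Coord=w")
# word == "mind", pos == "C", feats["SubPOS"] == "c"
```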

- *.{gold,pred}.{conll,ptb}.tobeparsed.tagged.lattices : the previous files extended with lattice token ids, as in

(ptb)
24 25 mind C lem=mind|SubPOS=c|Form=c|Coord=w 25
25 26 pedig C lem=pedig|SubPOS=c|Form=s|Coord=w 26
26 27 a T lem=a|SubPOS=f 27

(conll)
24 25 mind mind C C SubPOS=c|Form=c|Coord=w 25
25 26 pedig pedig C C SubPOS=c|Form=s|Coord=w 26
26 27 a a T T SubPOS=f 27

where the first two numerical ids are the start_token and end_token, and the last numerical id is the source token (see the Shared Task Description).
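In other words, each lattice entry carries a token span plus a pointer back to the source token. A minimal reader sketch, assuming tab-separated fields (everything between the span ids and the source id is kept opaque, since it differs between the ptb and conll variants):

```python
def parse_lattice_line(line):
    # Assumed layout: <start_token> \t ... \t <source_token>, where the
    # first two and the last fields are the numeric ids described above.
    fields = line.rstrip("\n").split("\t")
    return {
        "start": int(fields[0]),    # start_token of the span
        "end": int(fields[1]),      # end_token of the span
        "body": fields[2:-1],       # form/POS/features (format-dependent)
        "source": int(fields[-1]),  # id of the source token
    }

entry = parse_lattice_line("24\t25\tmind\tC\tlem=mind|SubPOS=c|Form=c|Coord=w\t25")
# entry["start"] == 24, entry["end"] == 25, entry["source"] == 25
```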


Which resources are allowed?

This concerns the use of e.g. morphological analyzers, dictionaries, unlabeled data.

Regarding the resources, we do provide additional, large, (base-)parsed data sets for all languages.

In addition to what we provide, you can use any lexicon you want, assuming it is made available to the participants. The same holds for additional unlabeled data, word clusters, and externally acquired word signatures. Please contact us if you are not sure about what to use.


For Task 2, can we predict our own tags instead of using the ones provided by the organizers?

Yes, of course. All provided predicted data are supplied as a baseline (even a strong one in some cases), so you can use any alternative tool you might prefer.


Are all tasks applicable for all languages?

In other words, are Task 2 and Task 3 mutually exclusive?

Short story: Task 2 and Task 3 are the same except that for 6 languages the gold tokenization equals the predicted one. So they are not exclusive; they are complementary (the evaluation procedure is not the same).
Details:
Some languages (Hebrew and Arabic) do not come with gold tokenization + predicted morphology, so they are only available with predicted tokenization + predicted morphology or gold tokenization + gold morphology (hence no Task 2 for those two). For the 2013 shared task, we received comments on the difficulty of dealing with the tokenization of Semitic languages. We are of course aware of that point; it was even one of our main concerns about this shared task: the entry cost for “newcomers” could be quite high. Nevertheless, we discussed this at length and decided that being able to compare gold token + pred morph vs. pred token + pred morph, and to investigate the bottlenecks more closely, is more important than trying to avoid at all costs any potential, though very unlikely, misplaced curiosity about the test data.

So, this is why teams that would prefer to submit results on Arabic and Hebrew with gold tokens and predicted morphology are now allowed to do so, with the restriction that they must also submit results on the pred token + pred morph data. The idea here is to see how those models compare across these two scenarios.

For Hebrew, the gold token + pred morph data set is described in the README.spmrl file (using Morfette (Chrupała et al., 2008) trained on FORM \t CPOS+FEAT; no lemma available).
For Arabic, we trained Morfette directly on a subset of the feature field (the atbpos feature, namely the original treebank's Buckwalter tagset), then converted it to the CATIB tagset (CPOS).
For the FPOS field (the BIES tagset, also used as the main tagset for the ptb files; see the README files), we trained a separate model (the conversion gave only 96.20% POS accuracy, this one 96.60%) and merged the two predicted tagsets back together.

Model performance on the dev set (no lexicon used, 3.66% OOVs), per tagset: all / seen / unseen

atbpos : 90.37 / 91.14 / 70.14
bies   : 96.60 / 97.25 / 79.58
catib  : 97.94 / 98.33 / 87.90

For those working on constituency, the stanpos attribute was also mapped back from the atbpos field.

The case of French is a bit different, as that treebank includes MWE annotations at the syntactic level (a special category between POS and non-terminal). Because the shared task's goal is to evaluate MRL parsing in a realistic scenario, you will need to parse the French data without MWE annotations. In this case, the tokenization will be evaluated as part of the syntactic annotation (i.e. a special label matching for dependencies).


How will TedEval deal with non-projective dependencies?

TedEval accepts projective trees only. We will re-projectivise the dependency structures for TedEval.
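For reference, projectivity can be checked by looking for crossing arcs: two arcs with spans (l1, r1) and (l2, r2) cross iff l1 < l2 < r1 < r2. A small sketch of that check (not the organizers' tool; quadratic for clarity), treating head 0 as the root:

```python
def is_projective(heads):
    # heads[i] is the head of token i+1 (1-based token ids, 0 = root)
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    # The tree is projective iff no pair of arcs crosses.
    return not any(
        l1 < l2 < r1 < r2
        for (l1, r1) in arcs
        for (l2, r2) in arcs
    )

is_projective([2, 0, 2])     # True: no arcs cross
is_projective([3, 4, 0, 3])  # False: arc (1,3) crosses arc (2,4)
```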


Is there a reason to exclude labelled evaluation in the raw scenario?

It is because TedEval requires all functional labels to be the same between constituency and dependency structures (to perform cross-framework evaluation), but this is only the case for Swedish and Hebrew (we did not have the resources to do that level of normalization for all languages). So, in order to provide a fair and meaningful comparison with TedEval, we decided to target the lowest common denominator. For individual treebanks, TedEval labeled evaluation will be used where it is not redundant.

Table synthesis:

Scenario              Labeled dependency      Evalb F-score       Cross-framework
raw Arabic / Hebrew   labeled TedEval         labeled TedEval     unlabeled TedEval
French                eval07 + MWE accuracy   evalb + MWE acc.    unlabeled TedEval
all other languages   eval07                  evalb               unlabeled TedEval

Can we make multiple submissions for the same setup?

This will depend on the total number of submitted system results. Let's say we limit the number of entries to 5 for now; we will adjust that number around the end of July.


What parser(s) should I use?

Well, nice question. We wish we knew the answer :)

More seriously, there are many available options these days. We do not force anyone to choose a given architecture (joint model, pipeline, …). Please note that this list is neither exhaustive nor an endorsement from our side.

For constituency parsing, even though lexicalized models provide state-of-the-art performance on English, the picture is less clear from the viewpoint of a small, morphologically rich language treebank. So, focusing on PCFG-LA based parsers, we can point to 3 options, each with its own strong points.

  • Yoav Goldberg's Lattice enabled version of the Berkeley parser: This version can output parses directly suitable for Tedeval.
    webpage Notes
  • DCU's Lorg parser. A multi-threaded C++ implementation of a PCFG-LA based parser. It can process lattice input and has the possibility to extract interesting word signatures acquired via information gain.
    webpage Notes
  • UMD's Feature-Rich PCFG-LA Parser: a multi-threaded version of the Berkeley parser that features, among others things, a very rich lexical model that can handle morphological features.
    webpage Notes

For dependency parsing, we are not aware of an available, working lattice-input parser. There were some efforts by Benoit Favre on adding lattice-input parsing to Malt (webpage) but, according to the author, it does not work that well. Regarding morphological support, there are of course plenty of choices (Malt, MST, Mate, ISBN, …); we won't list them all.

For more powerful models capable of handling trees with crossing branches (the native format for German, Swedish and Polish), the choice is currently more limited.

  • Wolfgang Maier's rparse: a data-driven LCFRS parser. Provides a formally sound way of handling long-distance dependencies.
    webpage
  • Andreas van Cranenburgh's Disco-DOP parser: a DOP-LCFRS parser!
    webpage

Please do not use ready-made models for any of those parsers; all parsing models must be trained on the shared task training sets.

frequently_asked_questions.txt · Last modified: 2014/05/26 04:57 by seddah