Introduction

The XMG system is what is usually called a “metagrammar compiler” (see below). More precisely, it is a tool for designing large-scale grammars for natural language. Given a compact representation of grammatical information, XMG combines elementary fragments of information to produce a fully redundant, strongly lexicalised grammar. Note that the name XMG refers to both:

  • a formalism allowing one to describe the linguistic information contained in a grammar,
  • a device computing grammar rules from a description based on this formalism.

What is a metagrammar?

This term was introduced at the end of the 1990s by M.-H. Candito. During her PhD, she proposed a new process for semi-automatically generating a Tree Adjoining Grammar (TAG) from a reduced description that captures the linguistic generalizations appearing among the trees of the grammar. This reduced description is the metagrammar.

What is a metagrammar compiler?

Once we have described the grammar rules by specifying the way structure is shared, i.e. by defining reusable fragments, we use a specific tool to combine these. Such a tool is called a metagrammar compiler.

What is XMG-2?

A distinction has to be made between XMG and XMG-2 (sometimes called XMG-NG). XMG is a metagrammar compiler dedicated to the generation of Tree Adjoining Grammars and Interaction Grammars. XMG-2 is a whole new project which has been developed at the LIFO (University of Orléans) and the SFB 991 (University of Düsseldorf). XMG-2 makes it possible to create new compilers, adapted to other generation tasks. Its modularity makes it simple to assemble Domain Specific Languages and to automatically generate the processing chain for these languages.

In other words, XMG-2 is a tool for generating compilers such as XMG: a metacompiler (or compiler compiler).

This user documentation of XMG-2 is based on the documentation for XMG, and includes the new features provided by the recent extensions.

First steps

This section presents the different ways XMG can be used, and how to use it to generate a first resource from a toy example.

Installation

There are several ways to use XMG: it can be installed on the system (Linux only), used on a virtual system, or used through a web page. The first two options are recommended for developing large-scale resources.

Option 1: standard installation

If you are using a Debian-based distribution (such as Ubuntu), open a terminal and follow these steps:

Git:

  sudo apt-get install git

Download and install Gecode (4.0 not supported yet):

From here: http://www.gecode.org/download.html (recommended: http://www.gecode.org/download/gecode-3.7.3.tar.gz, also available here: Gecode 3.7.3).

  ./configure --disable-qt --disable-gist
  make 

If the compilation is successful, you should see the following message:

Compilation of Gecode finished successfully. To use Gecode, either add 
  /.../gecode-3.7.3 to your search path for libraries, or install Gecode using
 make install       

Then, as suggested, you can type:

  make install    

Note that you might need to run this command as superuser (sudo make install). If the installation succeeds, you should be able to run Gecode. You can try it by typing:

./examples/queens

If the installation fails, a dependency is probably missing. To install the required build tools:

  sudo apt-get install g++    
  sudo apt-get install make

Download and install YAP (Yet Another Prolog):

  git clone https://github.com/spetitjean/yap-6.3.git

Then:

  ./configure --without-readline
  make 
  make install   

For dependencies:

  sudo apt-get install libgmp3-dev

Install Python3 (>3.2):

  sudo apt-get install python3 python3-yaml python3-pyqt4

Download XMG:

  git clone https://github.com/spetitjean/XMG-2.git    

You can also get it as an archive here: XMG-NG (this solution will not allow you to update XMG-2).

Add XMG-2 to your PATH

Edit your ~/.bashrc file and add this line (your path_to_xmg should be for example ~/xmg-ng):

  export PATH=path_to_xmg:$PATH    

To edit the bashrc file, you can type:

emacs ~/.bashrc

Option 2: using VirtualBox

A VirtualBox image of XMG is available for an easier installation. Use VirtualBox and download one of the XMG virtual images:

Using XMG without installing anything

An online compiler is available at this address: http://xmg.phil.hhu.de/index.php/upload/workbench.

Updating XMG-2

To get the latest version of XMG-2, regardless of the installation option you chose, you can type this command (in the xmg-ng directory):

git pull

Creating a first compiler

The instructions detailed here are equivalent to using the script reinstall.sh (see section Scripts). This means that you can skip this section by simply typing:

./reinstall.sh
(at the root of the XMG-2 installation directory)

Before compiling a metagrammar, a compiler needs to be created. XMG-2 assembles compilers by combining compiler fragments called bricks. These bricks are distributed into packages called contributions. For example:

  • the contribution core provides bricks offering support for the basic features of a compiler
  • the contribution treemg makes it possible to process tree descriptions
  • the contribution synsemCompiler makes the synsem compiler (equivalent to XMG-1) available

Installing a contribution with the command install makes all the bricks of this contribution available for assembly.

First, install the needed contributions for the synsem compiler:

xmg bootstrap               
cd contributions            
xmg install core          
xmg install treemg         
xmg install compat         
xmg install synsemCompiler  

Then, build the compiler:

cd synsemCompiler/compilers/synsem
xmg build

After these operations, the compiler synsem (Tree Adjoining Grammar with semantics based on predicate logic) is available.

Compiling a toy-metagrammar

The XMG system includes toy metagrammars that we highly recommend experimenting with. The files containing these metagrammars should be in the MetaGrammars directory of the XMG installation. To compile one of the synsem examples (adapted to the compiler we just built), just type:

  xmg compile synsem MetaGrammars/synsem/TagExample.mg

(see also List of XMG's options below). The result of this compilation will be a file named TagExample.xml.

To launch the GUI, type:

  xmg gui tag

You can then open the grammar file (.xml) which was generated by the compiler (Fichier → Ouvrir un XML).

Compiling an existing metagrammar

To compile metagrammars which were created using XMG-1, it is usually necessary to use the --notype option to skip the type checking steps, which did not exist in XMG-1.

Writing a Metagrammar

This section gives the general shape of a Metagrammar. The resource itself is described with domain specific languages (depending on the type of resource) which are provided by XMG dimensions. The different description languages available will be presented in the section Dimensions.

Choosing a compiler

The first decision which needs to be made is the choice of the compiler. This decision depends on the type of linguistic resource to describe. Each compiler was created for a specific grammar engineering task, and features a set of dimensions. This means that each compiler comes with its own language. The list of available dimensions is given in the next section (Dimensions), and a list of existing compilers using these dimensions is given in the section Bricks, contributions and commands.

Getting started

A metagrammar is composed of one or several text files, which usually have the extension .mg or .xmg. Any text editor can be used to write XMG code, although Emacs is recommended because of the different XMG modes created for it:

  • the Emacs and Vim modes for XMG-1 (only tree descriptions and predicate semantics).
  • new Emacs modes inspired by the XMG-1 mode, which are automatically generated when a compiler is built (in the file .install/yap/xmg/compiler/X/generated/emacs_mode, where X is the name of the compiler).
  • a more advanced Emacs mode for tree descriptions and frame semantics: https://github.com/xmg-hhu/xmg-mode

The XMG online compiler also provides an online interactive editor: http://xmg.phil.hhu.de/index.php/upload/workbench.

Including data from other files

To ease their development and reuse, metagrammars can be split across separate files. For example, all type declarations can be isolated in one file. To include the code of a file in another:

include file_to_include.mg

Principles and Constants

Principles

The first piece of information one has to give in a metagrammar is the set of principles that will be needed to compute the grammar structures. The instruction used to do this is the use principle with (constraints) dims (dimensions) statement. For instance, one may decide to force the syntactic structures of the output grammar to have the grammatical function gf with the value subj only once. This is expressed by:

  use unicity with (gf = subj) dims (syn)

In the syn dimension, we use the unicity principle on the attribute-value gf = subj. The description of the unicity principle, together with all information about principles and how to use/create them, can be found in the section Principles and plugins.
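
To make the effect of such a principle concrete, here is a minimal Python sketch of a unicity check. It is a model for illustration only, not XMG's implementation: the representation of a tree as a list of property dictionaries and the name satisfies_unicity are assumptions, and “only once” is read here as “at most once”.

```python
def satisfies_unicity(nodes, prop, value):
    """Return True if at most one node carries the property prop=value.

    nodes is a hypothetical flat view of a tree: one dict of properties per node.
    """
    return sum(1 for node in nodes if node.get(prop) == value) <= 1


# Hypothetical trees: the first has a single subject, the second has two.
good_tree = [{"gf": "subj"}, {}, {"gf": "obj"}]
bad_tree = [{"gf": "subj"}, {"gf": "subj"}]

print(satisfies_unicity(good_tree, "gf", "subj"))
print(satisfies_unicity(bad_tree, "gf", "subj"))
```

Under this reading, a structure shaped like bad_tree would not be part of the output grammar when use unicity with (gf = subj) dims (syn) is active.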

Note that principles use as parameters pieces of information that are associated to nodes with the status property (see below).

Types and Constants

Every piece of information in an XMG metagrammar is typed. This is of course the case for values in feature structures, but also for syntactic nodes, dimensions, classes, etc. There are four ways of defining types:

  • as an enumerated type, using the syntax type Id = {Val1,…,ValN}, such as in:
      type CAT={n,v,p}
    (note that the values associated with a type are constants)
  • as an integer interval, using the syntax type Id = [I1 .. I2], such as in:
      type PERS=[1 .. 3]
  • as a structured definition (T1 … Tn represent types), using the syntax type Id = [ id1 : T1 , id2 : T2 , …, idn : Tn ], such as in:
      type ATOMIC=[
           mode : MODE,
           num : NUMBER,
           gen : GENDER,
           pers : PERS]
  • as an unspecified type, using the syntax type Id !, such as in:
      type LABEL !
    (this is useful when one wants to avoid having to define acceptable values for every single piece of information)

Note that XMG integrates 3 predefined types: int, bool (whose associated values are + and -) and string.
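
The first three kinds of type definition can be pictured with a small Python model. This is only a sketch of what such a declaration constrains, not how XMG checks types: the function has_type and the structured type AGR are hypothetical, and accepting partially specified structures is an assumption made here.

```python
# Hypothetical Python counterparts of the declarations above.
CAT = frozenset({"n", "v", "p"})     # type CAT={n,v,p}
PERS = range(1, 4)                   # type PERS=[1 .. 3]
AGR = {"cat": CAT, "pers": PERS}     # made-up structured type [cat : CAT, pers : PERS]


def has_type(value, typ):
    """Check a value against an enumerated, interval or structured type."""
    if isinstance(typ, dict):
        # Structured type: only declared attributes, each with a well-typed value.
        return (isinstance(value, dict)
                and set(value) <= set(typ)
                and all(has_type(v, typ[k]) for k, v in value.items()))
    if isinstance(value, dict):
        # A structure can never belong to an enumerated or interval type.
        return False
    return value in typ


print(has_type("n", CAT))
print(has_type(4, PERS))
print(has_type({"cat": "v", "pers": 1}, AGR))
```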

Properties

Once types have been defined, we can define typed properties that will be associated to the nodes used in the tree descriptions. The role of these properties is either

  1. to provide specific information to the compiler so that additional treatments can be done on the output structures to ensure their well-formedness or
  2. to decorate nodes with a label that is linked to the target formalism and that will appear in the output (see XMG's graphical output).

The syntax used to define properties is property Id : Type, such as in:
  property extraction : bool    

Some properties are specific to principles: this is the case for the properties color and rank. This means that when using these principles, these properties must be declared. See the section Principles and plugins for more information about how to use these properties.

Properties can also be used to give a “global” name to a node, thanks to the name property. To perform interfacing with the lexicon, one may want to give global names to some specific nodes, in order to be able to refer to these nodes in the lexicon. Such an interfacing can be used for instance to manage semantic information. To associate global names that will appear in the semi-automatically produced grammar, you have to:

  • declare an enumerated type containing all the names you will use:
type NAME = {subjNode, objNode, anchor}
  • declare a property name of this type:
property name  : NAME
  • associate to the specific nodes the predefined names:
node (mark=subst,name=objNode)[cat=n]

N.B.: make sure these name properties will not cause node unification failures, i.e. do not give different names to nodes that will be merged. In the end, the node names are visible in the output file (as an attribute of the node element):

<node type="subst" name="objNode">

Features

Finally, we have to define typed features that are associated with nodes in several syntactic formalisms such as Feature-Based Tree Adjoining Grammars (FBTAG) or Interaction Grammars (IG). A feature is defined by writing feature Id : Type, such as in:

  feature num : NUMBER

Up to now, we have seen the declarations that are needed by the compiler to perform different tasks (syntax checking, output processing, etc). Next we will see the heart of the metagrammar: the definition of the clauses, i.e. the classes.

Classes

Here we will see how to define classes (i.e. the abstractions in the XMG formalism). Note that in TAG these classes refer to tree fragments. A class always begins with class Id, such as in:

  class CanonicalSubj

N.B. A class may be parametrized, in that case the parameters are between square brackets and separated by a colon. Parameters should be identifiers which do not appear in the namespace of the class. The values for the parameters are given when a class is instantiated. Values can be constants, variables, or other class instances.

Import

To reach a better factorization, a class can inherit from another one. This is done by invoking import Id (where Id is a class name), such as in:

  import TopClass[]

That is to say, the metagrammar corresponds to an inheritance hierarchy. But what does inherit mean here? The content of the imported class is made available to the daughter class. More precisely, a class uses identifiers to refer to specific pieces of information. When a class inherits from another one, it can reuse the identifiers of its mother class (provided they have been exported, see below). Thus, a node can be specialized by adding new features, and so on.

Note that XMG allows multiple inheritance, and besides it offers an extended control of the scope of the inherited identifiers, since one can restrict the import to specific identifiers, and also rename imported identifiers. Restriction is done by using the keyword as:

import Class[] as [?V1,..., ?Vn]

will only import the variables listed (?V1,…,?Vn) to the scope of the current class. Renaming is also made possible by the keyword as, by using the = sign:

import Class[] as [?V1,...,?Vi=?X,...,?Vn] 

will do the same as the previous example, except that the variable initially named ?Vi will be known in this namespace as ?X. This is especially useful to avoid name conflicts.
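
The restriction and renaming behaviour of as can be modelled as a simple operation on a table of exported identifiers. The Python sketch below is an illustration under assumptions: the function import_as, the dictionary representation of exports and the placeholder values node1 … node3 are all hypothetical and do not reflect XMG's internals.

```python
def import_as(exported, selection):
    """Model `import Class[] as [...]` on a dict of exported identifiers.

    selection is a list whose items are either a name (plain restriction)
    or an (old, new) pair (renaming, written ?old=?new in XMG).
    """
    scope = {}
    for item in selection:
        old, new = item if isinstance(item, tuple) else (item, item)
        if old not in exported:
            raise KeyError(f"{old} is not exported by the imported class")
        scope[new] = exported[old]
    return scope


# Hypothetical exports of the imported class.
exported = {"?V1": "node1", "?V2": "node2", "?V3": "node3"}

# Counterpart of: import Class[] as [?V1, ?V2=?X]
print(import_as(exported, ["?V1", ("?V2", "?X")]))
```

In this model ?V3 is not visible in the importing class, and the node known as ?V2 in the mother class is reachable only under the name ?X.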

Export

As we just saw, we use identifiers in each class. One important point when defining a class is the scope we want these identifiers to have. More precisely, we can give (or not) external visibility to each identifier by using the export declaration. Only exported identifiers will be available when inheriting or calling (i.e. instantiating) a class. Identifiers are exported using export id1 id2 … idn such as in:

  export X Y

Identifiers

In XMG, identifiers can refer either to a node, the value of a node property, or the value of a node feature. But whatever an identifier refers to, it must have been declared before by typing declare id1 id2 … idn, such as in:

  declare ?X ?Y ?Z

Note that in the declare section the prefixes ? (for variables) and ! (for skolem constants) are mandatory.

Content

Once the identifiers have been declared and their scope defined, we can start describing the content of the class. Basically this content is given between curly-brackets. This content can either be:

  • a statement
  • a conjunction of statements represented by S1 ; S2 in the XMG formalism
  • a disjunction of statements represented by S1 | S2
  • a statement associated to an interface (see Interface)

By statement we mean:

  • an expression: E (that is a variable, a constant, an attribute-value matrix, a reference (by using a dot operator, see the example below), a disjunction of expressions, or an atomic disjunction of constant values such as @{n,v,s}),
  • a unification equation: E1=E2,
  • a class instantiation: ClassId[] (note that the square-brackets after the class id are mandatory even if the instantiated class has no parameter),
  • a description belonging to a dimension: this is where the main description task takes place (see section Dimensions)

Mutexes

Mutexes are the way provided by XMG to specify which classes are incompatible. To specify which classes are mutually exclusive, you first have to define a mutual exclusion set by typing mutex Id such as in:

mutex SUBJ-INV

Then classes need to be added to this set by invoking mutex Id += ClassId such as in:

mutex SUBJ-INV += CanonicalObject
mutex SUBJ-INV += InvertedNominalSubject

Here we specify that we cannot use in the same description both the CanonicalObject and the InvertedNominalSubject classes.

Note that in the metagrammar file, the mutex definitions have to be placed after the type, property and feature declarations and before the valuations. This means that they can appear just before class definitions, between them, or right after.
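
The intent of a mutex set can be sketched in Python as a check on the classes used by one description. This is a model only: the function violated_mutexes and the dictionary layout are assumptions for illustration, and Active is used here merely as an example of a class outside the set.

```python
def violated_mutexes(mutexes, used_classes):
    """Return the names of mutex sets with two or more members in use."""
    used = set(used_classes)
    return sorted(name for name, members in mutexes.items()
                  if len(used & set(members)) > 1)


# Counterpart of the mutex SUBJ-INV declarations above.
mutexes = {"SUBJ-INV": {"CanonicalObject", "InvertedNominalSubject"}}

print(violated_mutexes(mutexes, ["CanonicalObject", "InvertedNominalSubject", "Active"]))
print(violated_mutexes(mutexes, ["CanonicalObject", "Active"]))
```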

Valuations

Once all the classes have been defined, we can ask for the evaluation of the classes that will trigger the combination of the fragments (i.e. classes calling classes that contain disjunctions and/or conjunctions of fragments). For each of these specific classes, we will obtain an accumulated tree description that may lead to the building of 0, 1 or more TAG trees. The syntax of the evaluation instruction in XMG is value Id, such as in:

  value n0Vn1 

Dimensions

Dimensions contain the linguistic descriptions, which are composed of constraints. Each XMG dimension comes with a specific set of constraints, which make it possible to describe different structures (trees, feature structures, etc). This section presents the different dimensions supported by XMG, and their description languages.

SYN: a tree description language

The <syn> dimension allows one to describe trees, initially to create Tree Adjoining Grammars or Interaction Grammars. To use this language, you can either build a new compiler using the brick syn (contribution treemg) or use one of the existing compilers including the dimension: synsem (contribution synsemCompiler, with predicate based semantics) or synframe (contribution synframeCompiler, with frame based semantics).

A syntactic description is given following the pattern <syn>{ formulas }. What kind of formulas does a syntactic description contain? The answer is nodes, which are in relation with each other. In XMG, you may give a name to a node by using a variable, and also associate properties and/or features with it. The classic node definition is node ?id ( prop1=val1 , … , propN=valN ) [ feat1=val1 , … , featN=valN ] such as in:

  node ?Y (gf=subj)[cat=n]

Here we have a node that we refer to by using the ?Y variable. This node has the property gf (grammatical function) associated with the value subj, and the feature structure [cat=n] (note that associating a variable to a node is optional).

Once you have defined the nodes of the tree fragment, you can describe how they are related to each other. To do this, you have the following operators:

  • -> strict dominance
  • ->+ strict large dominance (transitive non-reflexive closure)
  • ->* large dominance (transitive reflexive closure)
  • >> strict precedence
  • >>+ strict large precedence (transitive non-reflexive closure)
  • >>* large precedence (transitive reflexive closure)
  • = node equation

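Each subformula you define can be added conjunctively (using “;”) or disjunctively (using “|”) to the description. For instance, a fragment in which a node of category S immediately dominates a node of category N and a node of category V, with N immediately preceding V, can be represented by the following code in XMG: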

  class Example
    declare ?X ?Y ?Z
    {<syn>{
      node ?X [cat=S] ; node ?Y [cat=N] ; node ?Z [cat=V] ;
      ?X -> ?Y ; ?X -> ?Z ; ?Y >> ?Z
    }
  }

XMG also supports an alternative way of specifying how the nodes are related to each other. This alternative syntax allows the user to define the nodes and give their relations at the same time:

  • node { node } strict dominance
  • node { ...+node } strict large dominance (transitive non-reflexive closure)
  • node { ...node } large dominance (transitive reflexive closure)
  • node node strict precedence
  • node ,,,+node strict large precedence (transitive non-reflexive closure)
  • node ,,,node large precedence (transitive reflexive closure)
  • = node equation

Thus the tree fragment above could be defined in the XMG syntax the following way:

  class Example
  {<syn>{
     node [cat=S] {
             node [cat=N]
             node [cat=V]
             }
     }
  }

Note that the use of variables to refer to the nodes becomes unnecessary inside the fragment. Nonetheless, we may want to assign variables to nodes in order to reuse them later through inheritance.
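
The meaning of the dominance and precedence relations used in this section can be checked on a concrete tree with a few lines of Python. This is a sketch under assumptions, not XMG's solver: the tree is a hypothetical dictionary mapping each node to its ordered list of daughters, node names stand for categories, and precedence is only tested between sisters, which is a simplification.

```python
def dominates(tree, a, b):
    """a -> b: b is a daughter of a."""
    return b in tree[a]


def dominates_plus(tree, a, b):
    """a ->+ b: b is reachable from a in one or more dominance steps."""
    return any(c == b or dominates_plus(tree, c, b) for c in tree[a])


def dominates_star(tree, a, b):
    """a ->* b: like ->+, but a node also dominates itself."""
    return a == b or dominates_plus(tree, a, b)


def precedes(tree, a, b):
    """a >> b: b is the sister immediately to the right of a."""
    for daughters in tree.values():
        if a in daughters and b in daughters:
            return daughters.index(b) - daughters.index(a) == 1
    return False


def precedes_plus(tree, a, b):
    """a >>+ b: b is a sister somewhere to the right of a."""
    for daughters in tree.values():
        if a in daughters and b in daughters:
            return daughters.index(a) < daughters.index(b)
    return False


# S dominates N and V, in that order (the fragment of the Example class).
tree = {"S": ["N", "V"], "N": [], "V": []}

# The three constraints ?X -> ?Y ; ?X -> ?Z ; ?Y >> ?Z hold in this tree.
print(dominates(tree, "S", "N") and dominates(tree, "S", "V") and precedes(tree, "N", "V"))
```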

IFACE: connecting dimensions

Interfaces correspond to attribute-value matrices, allowing one to associate a global name to an identifier. The syntax of the interface is the following (the interface is between square-brackets):

class Id
{ ... }*= [Name1=Id1, ... , NameN=IdN]

The *= operator represents unifying extension. When a class is valuated, the descriptions (contained in the classes) it refers to are accumulated. At the same time, the interfaces associated with these descriptions are accumulated. The semantics of their accumulation may correspond to unification.

Let us see the use of an interface in an example. Consider the tree fragment used so far. Imagine we want to refer to the N node outside of the class. To do so, we give this node a global name. We can do this by using the following interface:

class Example
declare ?X ?Y ?Z
{<syn>{
     node ?X [cat=S] {
                      node ?Y [cat=N]
                      node ?Z [cat=V]
                     }
     }*=[subj = ?Y]
}

In a class A which is combined with Example, you can constrain a local node ?X to be identified with the subj node of Example by reusing the feature subj in the interface of A: *=[subj=?X]. Note that the interface may also be used to give names to properties or feature values.
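
The way two interfaces sharing a global name lead to the identification of two nodes can be modelled with a small union-find structure. The Python sketch below is a deliberately simplified illustration: it handles flat interfaces only, and the names Bindings, accumulate and the qualified variable strings such as "Example.?Y" are assumptions, not part of XMG.

```python
class Bindings:
    """Tiny union-find recording which variables denote the same node."""

    def __init__(self):
        self.parent = {}

    def find(self, var):
        self.parent.setdefault(var, var)
        while self.parent[var] != var:
            var = self.parent[var]
        return var

    def unify(self, a, b):
        self.parent[self.find(a)] = self.find(b)


def accumulate(interfaces, bindings):
    """Merge flat interfaces; variables sharing a global name are unified."""
    merged = {}
    for interface in interfaces:
        for name, var in interface.items():
            if name in merged:
                bindings.unify(merged[name], var)
            else:
                merged[name] = var
    return merged


bindings = Bindings()
# Interface of Example: [subj = ?Y]; interface of a class A: [subj = ?X].
merged = accumulate([{"subj": "Example.?Y"}, {"subj": "A.?X"}], bindings)
print(bindings.find("Example.?Y") == bindings.find("A.?X"))
```

After accumulation, both variables have the same representative, which models the fact that the local node of A and the N node of Example become one node.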

The interface can also be accessed as a regular dimension, meaning that the *= operator can be replaced as follows:

class Example
declare ?X ?Y ?Z
{<syn>{
     node ?X [cat=S] {
                      node ?Y [cat=N]
                      node ?Z [cat=V]
                     }
     };
 <iface>{[subj = ?Y]}
} 
  

FRAME: describing semantics using typed feature structures

The <frame> dimension can be used in a compiler by using the frame brick (contribution framemg). A set of pre-assembled compilers use this brick: synframeCompiler (with Tree Adjoining Grammars) and framelpcompiler (with morphological descriptions).

This dimension allows one to describe typed feature structures. These structures use conjunctive types, which means that types are not atomic, but rather sets of elementary types. When two typed feature structures get unified, the type of the resulting structure is determined by a type hierarchy. In the simple case, and if the types are compatible, the resulting type is the union of both types.

Type hierarchies are defined in two steps. First, the declaration of the atomic types:

frame-types = {t1,t2,...,tn}

where t1, t2, …, tn are constants.

Second, the atomic types are organized into a hierarchy by specifying constraints:

frame-constraints = {c1, c2,..., cn }

where c1, c2, …, cn are type constraints. Several sorts of them are available:

  • constraints concerning subtyping: t1 t2 ... tn -> tt1 tt2 ... ttn
  • incompatibility constraints: t1 t2 -> -
  • constraints concerning attributes: t1 t2 ... tn -> c1 ... cn, with c1 ... cn constraints on attributes

Constraints on attributes can be of the following types:

  • existence constraint: att : +
  • value constraint: att : val
  • path equality att1 = att2

Note that all attributes in these constraints can be paths, using dots. For example, actor.name : + means that there is an attribute actor, and that the value of this attribute has an attribute name.

The following example makes use of all the types of constraints:

frame-types = {event, motion, activity, causation, locomotion}
frame-constraints = { 
        causation -> event,
        motion -> event,
        activity -> event,
        motion causation -> -,
        activity causation -> -,
        activity motion -> locomotion,
        activity -> actor:+,
        motion -> mover:+,
        causation -> cause:+ effect:+
}

The first three constraints are subsumption constraints. For example, causation -> event means that all frames of type causation also have type event. The next two constraints express incompatibilities between types, meaning for instance that a frame cannot have both types motion and causation. activity motion -> locomotion means that all frames having both types activity and motion will also have type locomotion. The last three constraints concern attributes. For instance, causation -> cause:+ effect:+ makes sure that all frames of type causation will have the attributes cause and effect (+ marks an existence constraint). A frame using these types can then be described as follows:

 <frame>{
  [causation,
    actor:?X1,
    theme:?X2,
    cause:    [activity,
                 actor: ?X1,
                 theme: ?X2],
    effect:?IN[activity,
                 actor: ?X2]
  ]
 }
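
The behaviour of the example constraints on conjunctive types can be modelled in Python as a closure computation. This is a sketch of the idea, not of XMG's actual hierarchy computation: the table names SUBTYPE, INCOMPATIBLE and ATTRIBUTES and the functions close, unify_types and required_attributes are assumptions made for illustration.

```python
# The frame-constraints of the example, split by kind.
SUBTYPE = [  # (antecedent types, implied types)
    ({"causation"}, {"event"}),
    ({"motion"}, {"event"}),
    ({"activity"}, {"event"}),
    ({"activity", "motion"}, {"locomotion"}),
]
INCOMPATIBLE = [{"motion", "causation"}, {"activity", "causation"}]
ATTRIBUTES = [  # (antecedent types, attributes that must exist)
    ({"activity"}, {"actor"}),
    ({"motion"}, {"mover"}),
    ({"causation"}, {"cause", "effect"}),
]


def close(types):
    """Return the closure of a conjunctive type, or None if it is inconsistent."""
    closed = set(types)
    changed = True
    while changed:
        changed = False
        for antecedent, implied in SUBTYPE:
            if antecedent <= closed and not implied <= closed:
                closed |= implied
                changed = True
    if any(bad <= closed for bad in INCOMPATIBLE):
        return None
    return closed


def unify_types(t1, t2):
    """Type of the unification of two frames: closed union, or None on clash."""
    return close(set(t1) | set(t2))


def required_attributes(types):
    """Attributes that a frame with the given types must have."""
    closed = close(types)
    if closed is None:
        return None
    return {attr for antecedent, attrs in ATTRIBUTES
            if antecedent <= closed for attr in attrs}


print(unify_types({"activity"}, {"motion"}))
print(unify_types({"motion"}, {"causation"}))
```

In this model, unifying an activity frame with a motion frame yields a frame that also has the types event and locomotion, while unifying motion with causation fails.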
 

Exporting the type hierarchy

XMG computes the type hierarchy to handle the unification of typed feature structures during the compilation of the metagrammar. However, to be able to reuse this type hierarchy with the generated resource (with a parser for example), the hierarchy needs to be exported. When compiling the metagrammar, the option --more activates the export of additional useful resources, which is the hierarchy in our case. For a file called example.mg, the complete command is the following:

xmg compile synframe example.mg --force --more

The compiled grammar can then be found in the file example.xml and the type hierarchy in the file more.mac.

SEM: describing semantics using predicates

Here we will see how to describe semantic information with predicates. Basically, this dimension allows one to describe:

  • predicates with 0, 1 or more arguments and a label,
  • negation,
  • a specific relation called “scope_over” for dealing with quantifiers,
  • and semantic identifiers.

So the language of the semantic dimension is:

Description ::= l:p(E_1,...,E_n) | ~ l:p(E_1,...,E_n) | E_i << E_j | E 

In XMG concrete syntax, one may define a class with a semantic content by:

class BinaryRel
declare !L ?X ?Y ?P
{ 
  <sem>{ !L:?P(?X,?Y) }*=[pred=?P]
}

That is to say, we define the class BinaryRel in which 3 variables and a skolem constant (prefixed by “!”) are declared. This class only contains semantic information (dimension <sem>): more precisely, it contains a predicate (whose value is the variable ?P) of arity 2, whose arguments are the variables ?X and ?Y. !L represents the label associated with this predicate. Note that we use the interface dimension to give the name pred to ?P. This variable may later be unified with a constant, which gives the predicate its value. Finally, it is possible to define a class containing both a semantic and a syntactic dimension, and these dimensions may share identifiers. Identifiers may also be shared by using the interface dimension. Thus XMG provides efficient devices to define a syntax / semantics interface within the metagrammar.

MORPH_LP: describing morphology with ordered fields

This dimension allows one to form words by assembling morphemes. First, fields need to be defined and ordered (using constraints), then information can be added to the fields. The description language offered by this dimension consists of only one keyword and two operators.

  • field definition of a field
  • >> linear precedence between fields
  • : assignment of a value to an attribute

The following class shows concretely how the dimension can be used:

class plural_suffix
{
  <morph>{
  field suffix;
  root >> suffix;
        suffix <- "s"
  }
}

A field suffix is created, and placed on the right of another field root (defined in another class). The string “s” is added into the new field. This complete example can be found in the MetaGrammars/lp_morph/example.mg file of XMG-2 (or on GitHub).

Metagrammars containing contributions to the morph_lp dimension can be compiled with the compilers named lp (only morphology) and framelp (with semantic frames):

xmg compile lp file.mg

LEMMA: describing lexicons of lemmas

This dimension is used when parsing (with a TAG for example). The typical use of these lexicons is to list which TAG families are compatible with the lemmas of the language. The description language basically allows one to associate values with different attributes. An example of a class using the lemma dimension is as follows:

class LemmeAller
{
  <lemma> {
    entry <- "aller";
    cat   <- v;
    fam   <- n0Vloc1
   }
}

where entry is the lemma, fam a TAG family which can use this lemma as anchor, and cat the syntactic category of the anchor.

Metagrammars containing contributions to the <lemma> dimension (only) must be compiled with the compiler named lex:

xmg compile lex lemma.mg

MORPHO: describing lexicons of inflected forms

This dimension is used when parsing (with a TAG for example). The typical use of these lexicons is to list which lemmas are compatible with the inflected forms of the language. The description language basically allows one to associate values with different attributes. An example of a class using the morpho dimension is as follows:

class a
{
  <morpho> {
    morph <- "a";
    lemma <- "avoir";
    cat   <- v
   }
}

where morph is the inflected form, lemma is the lemma associated to this inflected form, and cat the syntactic category of the inflected form.

Metagrammars containing contributions to the <morpho> dimension (only) must be compiled with the compiler named mph:

xmg compile mph morph.mg

Examples

Simple TAG example

Now, we will see in detail how to write a metagrammar. We will define a metagrammar generating a small TAG for French. This small TAG will contain 2 trees, namely the ones representing a transitive verb either with a canonical subject or a subject in relative position.

Specifying data

The first thing to do is to define the principles, types, properties and features we will use. For the sake of clarity, we will only constrain the produced trees to have no duplicate grammatical function. That is to say, we will only activate the unicity principle with the gf property as parameter:

use unicity with (gf = subj) dims (syn)
use unicity with (gf = obj) dims (syn)

We will deal with only a few types in this example, paying attention only to grammatical functions and syntactic categories. The former is a node property and the latter a node feature (i.e. part of the TAG formalism):

type CAT = {n,v,s}
type GF = {subj, obj}
property gf : GF
feature cat : CAT

Defining blocks (tree fragments)

The metagrammatical rule we will use is the following:

transitive = (CanSubject | RelSubject) ; Active ; Object 
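
Read as a conjunction of disjunctions, this rule can be expanded mechanically. The Python sketch below only enumerates the combinations of class names; it is an illustration under assumptions (the function expand and the list-of-lists encoding of the rule are hypothetical) and says nothing about how XMG merges the corresponding tree fragments.

```python
from itertools import product


def expand(rule):
    """Expand a conjunction (list) of disjunctions (lists of class names)."""
    return [list(choice) for choice in product(*rule)]


# transitive = (CanSubject | RelSubject) ; Active ; Object
transitive = [["CanSubject", "RelSubject"], ["Active"], ["Object"]]

for combination in expand(transitive):
    print(" ; ".join(combination))
```

The rule yields two combinations, one per choice of subject realisation, which matches the two trees expected from this metagrammar.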

So we will handle 4 tree fragments: Active, CanSubject, Object, and RelSubject. The class transitive will consist of an abstraction on a conjunctive combination including a disjunction on the subject that is used. The Active class corresponds to the verbal spine:

class Active
export ?X ?Y
declare ?X ?Y
{<syn>{
      ?X -> ?Y
      }
}

The CanSubject class corresponds to the Example class introduced previously:

class CanSubject
export ?X ?Y ?Z
declare ?X ?Y ?Z
{ <syn>{
      node ?X [cat = s]{
              node ?Y (gf=subj)[cat=n]
              node ?Z [cat = v]
              }
      }
}

The Object class is the symmetric counterpart of CanSubject:

class Object
export ?X ?Y ?Z
declare ?X ?Y ?Z
{ <syn>{
      node ?X [cat = s]{
              node ?Y [cat = v]
              node ?Z (gf=obj)[cat=n]
              }
      }
}

The RelSubject class and its concrete syntax are given below:

class RelSubject
export ?X ?Y ?Z
declare ?X ?Y ?Z ?U ?V
{ <syn>{
      node ?U [cat = n]{
              node ?V [cat = n]
              node ?X [cat = s]{
                      node ?Y (gf=subj)[cat=n]
                      node ?Z [cat = v]
                      }
              }
      }
}

At this point, we may wonder why variables are associated with nodes. The answer is that we still have to merge these fragments, and we will use the exported variables to unify specific nodes.

From tree fragments to trees

Once the basic blocks have been defined, we can combine them to produce the expected trees. We define the transitive class:

class transitive
declare ?SU ?OB ?AC
{
      { ?SU=CanSubject[] | ?SU=RelSubject[] } ; ?OB = Object[]  ; ?AC = Active[] ;
      ?SU.?X = ?OB.?X ; ?SU.?Z = ?OB.?Y ; ?SU.?X = ?AC.?X ;
      ?SU.?Z = ?AC.?Y 
}

In this class, we use the dot operator to access the record of identifiers exported by a class. For instance, since ?OB is the variable bound to the Object class, ?OB.?X refers to the ?X variable of this class, provided it has been exported. In the transitive class we conjunctively combine 3 classes (the subject, which is either CanSubject or RelSubject, then Object and Active). We also unify their s and v nodes so that the tree fragments get merged. Note that we may prefer using a color system to semi-automate this node unification (see Controlling fragment combination semi automatically by coloring nodes). Finally, we know that the transitive class contains all the information needed to build 2 TAG trees. So we ask for its evaluation by invoking:

value transitive

As a result we obtain the following 2 trees (the first one represents the relative subject, and the second one the canonical subject):
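To see why exactly two trees result, the rule can be read as a conjunction of three slots, one of which offers two alternatives. The Python sketch below only enumerates these choices; it does not build or unify any tree and is not how XMG computes models. The name expand and the list-of-alternatives encoding are assumptions made for this illustration.

```python
from itertools import product

# Each slot of the rule lists its alternative classes:
# (CanSubject | RelSubject) ; Active ; Object
TRANSITIVE = [
    ["CanSubject", "RelSubject"],  # disjunction: one of the two subjects
    ["Active"],                    # verbal spine
    ["Object"],                    # object fragment
]

def expand(rule):
    """Return every conjunctive combination licensed by the rule."""
    return [tuple(choice) for choice in product(*rule)]

combinations = expand(TRANSITIVE)
for combo in combinations:
    print(" ; ".join(combo))
```

Each printed line corresponds to one candidate description; in XMG the solver then has to find the tree models of each of them.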

The whole metagrammar

use unicity with (gf = subj) dims (syn)
use unicity with (gf = obj) dims (syn)
type CAT = {n,v,s}
type GF = {subj, obj}
property gf : GF
feature cat : CAT

class Active
export ?X ?Y
declare ?X ?Y
{<syn>{
      ?X -> ?Y
      }
}

class CanSubject
export ?X ?Y ?Z
declare ?X ?Y ?Z
{ <syn>{
      node ?X [cat = s]{
              node ?Y (gf=subj)[cat=n]
              node ?Z [cat = v]
              }
      }
}

class Object
export ?X ?Y ?Z
declare ?X ?Y ?Z
{ <syn>{
      node ?X [cat = s]{
              node ?Y [cat = v]
              node ?Z (gf=obj)[cat=n]
              }
      }
}

class RelSubject
export ?X ?Y ?Z
declare ?X ?Y ?Z ?U ?V
{ <syn>{
      node ?U [cat = n]{
              node ?V [cat = n]
              node ?X [cat = s]{
                      node ?Y (gf=subj)[cat=n]
                      node ?Z [cat = v]
                      }
              }
      }
}

class transitive
declare ?SU ?OB ?AC
{
      { ?SU=CanSubject[] | ?SU=RelSubject[] } ; ?OB = Object[]  ; ?AC = Active[] ;
      ?SU.?X = ?OB.?X ; ?SU.?Z = ?OB.?Y ; ?SU.?X = ?AC.?X ;
      ?SU.?Z = ?AC.?Y 
}

value transitive

More examples

More examples can be found in the Metagrammars folder of the XMG installation directory. Some grammars are also available on the resources page of the XMG website.

Principles and plugins

This section describes the existing XMG principles and a method for users to create their own principles.

Solvers and principles

XMG's most complex task is to compute all possible models for the descriptions of the metagrammar. These descriptions are sets of constraints, so extracting the models is a constraint satisfaction problem. Every dimension comes with its own solver (sometimes the identity), which builds only structures with the right properties. For example, the syn dimension (which is used to describe trees) accumulates tree descriptions: nodes, dominance and precedence constraints. The syn dimension comes with a solver called tree, which makes sure that every solution of the constraint satisfaction problem is a well-formed tree (one root, etc.) that also satisfies the constraints given in the metagrammar.
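To make the notion of a well-formed tree more concrete, here is a minimal Python sketch of the kind of property a tree solver has to guarantee. It is not XMG's solver: it only validates a finished set of immediate-dominance edges and ignores precedence. The function name is_well_formed_tree and the edge encoding are assumptions made for illustration.

```python
def is_well_formed_tree(nodes, edges):
    """Check that (nodes, edges) is a tree.

    edges is a list of (mother, daughter) pairs. A tree has exactly one
    root, every other node has exactly one mother, and there is no cycle.
    """
    mothers = {}
    for mother, daughter in edges:
        if daughter in mothers:      # a node with two mothers
            return False
        mothers[daughter] = mother
    roots = [n for n in nodes if n not in mothers]
    if len(roots) != 1:
        return False
    # Every node must reach the root by following mother links;
    # revisiting a node on the way up means there is a cycle.
    for node in nodes:
        seen = set()
        while node in mothers:
            if node in seen:
                return False
            seen.add(node)
            node = mothers[node]
        if node != roots[0]:
            return False
    return True

print(is_well_formed_tree(["s", "n", "v"], [("s", "n"), ("s", "v")]))
```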

Principles are sets of constraints which come in addition to a solver. They are useful to describe constraints which cannot be described in classes. The metagrammar makes it possible to express constraints between two objects (two nodes for instance) when they can be referred to with variables, but it is sometimes necessary to express constraints between one structure from a class and a structure that might appear in another class, or in several classes, or not appear at all. In other words, the XMG constraints can be considered as constraints local to the class, whereas principles express global constraints.

Three “historical” principles are provided by XMG, for the dimension syn, namely:

  • unicity: uniqueness of a specific attribute-value pair
  • rank: ordering of clitics by means of associating the rank property to nodes
  • color: automatization of the node merging by assigning color to nodes

but more principles can also be added as plugins. This is what led to the creation of the new principles:

  • precedes: two properties are given as parameters. A node with the first property must precede one with the second property
  • requires: two properties are given as parameters. If a node with the first property exists, then a node with the second property must also exist.
  • excludes: two properties are given as parameters. If a node with the first property exists, then no node with the second property can exist.

Colors

The colors principle consists in the use of a color language to semi-automatize node unification during tree description solving. This idea has been proposed by B. Crabbé (see [Crabbé and Duchier, 04]). The process is the following:

  1. we decorate nodes with colors (red, black or white),
  2. the description solving is extended so that the nodes are unified according to specific color combination rules:

That is to say: a black node may be unified with zero, one or more white nodes, producing a black node; a white node has to be unified with a black one, producing a black node; and finally a red node cannot be merged with any other node. As a result, a satisfying model is a model where all the nodes are either black or red.
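The combination rules can be written down as a small function. The Python sketch below is only a toy model of the color language, not the XMG implementation. The white + white and black + black cases are not spelled out in the text above; the values used here (white and forbidden, respectively) follow the usual presentation of the color table and should be read as an assumption, as should the names merge_colors and is_saturated.

```python
RED, BLACK, WHITE = "red", "black", "white"

def merge_colors(a, b):
    """Return the color of the node obtained by unifying two nodes.

    None means that the unification is forbidden.
    """
    if RED in (a, b):
        return None      # a red node cannot be merged with any other node
    if a == BLACK and b == BLACK:
        return None      # assumption: two black nodes cannot be unified
    if a == WHITE and b == WHITE:
        return WHITE     # assumption: still waiting for a black node
    return BLACK         # black + white (in either order) gives black

def is_saturated(colors):
    """A satisfying model only contains black or red nodes."""
    return all(color in (BLACK, RED) for color in colors)

print(merge_colors(BLACK, WHITE))
print(is_saturated([BLACK, RED, WHITE]))
```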

The main advantage of this color labelling process is that we do not need to explicitly specify all the node unifications that have to be performed: the saturation of colors triggers these unifications. In other words, we can think of nodes in terms of “relative addresses”. This means that we do not have to manage node variables (which correspond to “absolute addresses”), as the colors give a way to refer to “mergeable” nodes, i.e. black nodes that can be unified (and thus can receive a fragment). By lessening the use of variables, we prevent name conflicts, and we can for instance easily reuse the same tree fragment several times within the same tree description (this happens in TAG for trees with a double prepositional phrase).

To use the colors principle, the metagrammar must include the following declarations:

use color with () dims (syn)
type COLOR = {red,black,white}
property color : COLOR

When the principle is used, every node needs to be assigned a color (or to be unified with a node having one). As a reminder, giving such a property to a node is done as follows:

node ?X (color=red)

Rank

The rank principle is used to express linear orders between nodes which do not appear in the same classes. For example, in languages where there are strict constraints on the order of clitic pronouns, this principle makes the description task easier. The idea is to give a rank to every node representing a clitic; this rank makes sure that if other clitics are added to the description (with their own ranks), they will be placed on the correct side of this clitic. In a description where two nodes have ranks 3 and 4 respectively, the only valid solutions are the ones where the node with rank 3 precedes (not necessarily immediately) the one with rank 4.

Warning: the rank principle only applies to sister nodes. If two ranked nodes have different mother nodes, no linear precedence constraint will apply on them.

To use the rank principle, the metagrammar must include the following declarations:

use rank with () dims (syn)
type RANK = [X..Y]
property rank : RANK

where X and Y are integers indicating the lowest and highest values for a rank. When the principle is used, nodes can be assigned a rank. As a reminder, giving such a property to a node is done as follows:

node ?X (rank=3)
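The ordering constraint imposed by ranks can be illustrated with a short Python check over the daughters of a single mother node, since the principle only applies to sisters. This is a sketch and not XMG code: the function name respects_rank and the (name, rank) encoding are assumptions, and so is the decision to accept two sisters carrying the same rank, a case the text above does not settle.

```python
def respects_rank(sisters):
    """Check the rank order among the daughters of one mother node.

    sisters is the left-to-right list of (name, rank) pairs, where rank
    is None for nodes without a rank. Unranked nodes are ignored.
    """
    ranks = [rank for _, rank in sisters if rank is not None]
    # Assumption: equal ranks are tolerated, lower ranks must come first.
    return all(left <= right for left, right in zip(ranks, ranks[1:]))

# Rank 3 precedes rank 4, with an unranked node in between.
print(respects_rank([("cl1", 3), ("x", None), ("cl2", 4)]))
# Rank 4 precedes rank 3: rejected.
print(respects_rank([("cl2", 4), ("cl1", 3)]))
```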

Unicity

To use the unicity principle, the metagrammar must include the following declaration:

use unicity with (attribute=value) dims (syn)

where the attribute-value pair given as parameter may appear in at most one node of every valid model. The unicity principle can be used with either features or properties. For example, with the following instance of the principle:

use unicity with (rank=1) dims (syn)

valid models contain at most one node of rank 1.
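As a rough illustration of what this principle filters out, a model can be pictured as a list of nodes, each carrying attribute-value pairs. The Python sketch below is a simplified stand-in for the principle, not its implementation; the dictionary encoding of nodes and the name satisfies_unicity are assumptions.

```python
def satisfies_unicity(nodes, attribute, value):
    """Return True if at most one node carries the given attribute-value pair.

    nodes is a list of dicts mapping attribute names to values.
    """
    matches = [node for node in nodes if node.get(attribute) == value]
    return len(matches) <= 1

model = [
    {"cat": "s"},
    {"cat": "n", "gf": "subj"},
    {"cat": "v"},
    {"cat": "n", "gf": "obj"},
]
print(satisfies_unicity(model, "gf", "subj"))
```

With the declarations used in the TAG example (gf = subj and gf = obj), a model containing two subject nodes would be rejected.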

Requires

To use the requires principle, the metagrammar must include the following declaration:

use requires with ( attribute1= value1, attribute2=value2 ) dims (syn)

Excludes

To use the excludes principle, the metagrammar must include the following declaration:

use excludes with ( attribute1= value1, attribute2=value2 ) dims (syn)

Precedes

To use the precedes principle, the metagrammar must include the following declaration:

use precedes with ( attribute1= value1, attribute2=value2 ) dims (syn)
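The three plugin principles (requires, excludes and precedes) can be pictured as simple checks over the nodes of a model. The Python sketch below is a toy reading of their descriptions, not the plugin code. The helper names, the dictionary encoding of nodes and the flat left-to-right list used for precedes are assumptions; in particular, precedes is read here as "every node with the first pair comes before every node with the second pair", which is one possible interpretation.

```python
def has(node, pair):
    """True if the node carries the (attribute, value) pair."""
    attribute, value = pair
    return node.get(attribute) == value

def requires(nodes, first, second):
    """If some node has `first`, some node must have `second`."""
    if not any(has(n, first) for n in nodes):
        return True
    return any(has(n, second) for n in nodes)

def excludes(nodes, first, second):
    """If some node has `first`, no node may have `second`."""
    if not any(has(n, first) for n in nodes):
        return True
    return not any(has(n, second) for n in nodes)

def precedes(ordered_nodes, first, second):
    """Assumed reading: all `first` nodes occur before all `second` nodes."""
    firsts = [i for i, n in enumerate(ordered_nodes) if has(n, first)]
    seconds = [i for i, n in enumerate(ordered_nodes) if has(n, second)]
    return all(i < j for i in firsts for j in seconds)

nodes = [{"gf": "subj"}, {"cat": "v"}, {"gf": "obj"}]
print(requires(nodes, ("gf", "obj"), ("gf", "subj")))
print(precedes(nodes, ("gf", "subj"), ("gf", "obj")))
```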

Plugins

XMG-2 makes it possible for users to create their own principles without much programming effort. New principles are created as plugins for solvers. More documentation coming soon.

Bricks, contributions and commands

As stated previously, XMG-2 compilers are built using bricks, which implement the compiling steps for parts of metagrammatical languages. For example, the avm brick contains all the support for feature structures, and the syn brick supports the language of the <syn> dimension. Bricks are distributed in contributions, such as the core contribution, which contains all the basic features of XMG-2. The treemg contribution contains all the bricks which, in addition to those of the core contribution, are needed to build the synsem compiler (equivalent to XMG-1).

Making bricks available is done by installing them. The install command takes as parameter a contribution and installs all the bricks provided by it.

All contributions with names ending with “compiler” are special, as they contain a different type of brick. These bricks contain descriptions of compilers, which need to be created before one can use them. Creating a compiler is done with the command build.

Compilers are usually named after the dimensions they feature (synframe provides both the <syn> and the <frame> dimensions). The following compilers can be installed:

framelpcompiler: for morphological description with frame semantics.

lexCompiler: to create lemma files for a TAG parser (as LexConverter).

lpcompiler: for morphological descriptions.

mphcompiler: to create files of inflected forms for a TAG parser (as LexConverter).

synframeCompiler: for TAG descriptions with frame semantics.

syn2frameCompiler: the same with a slightly different tree description language.

synsemCompiler: for TAG descriptions with predicate semantics (XMG-1).

tfcompiler: for morphological descriptions specified using topological fields.

To learn more about how these compilers were assembled, and how to assemble customized compilers for specific description tasks, see [Petitjean et al., 2016].

The commands provided by XMG fall into two categories: some are used by any user writing a linguistic resource, while the others are reserved for developers of XMG extensions.

User commands:

  • xmg bootstrap: installs the basic features of XMG.
  • xmg install path_to_contribution: makes a contribution available.
  • xmg build: assembles a compiler according to a yaml description.
  • xmg gui gui_name: starts a GUI. The only GUI provided up to now is tag.
  • xmg compile compiler_name path_to_metagrammar: compiles a metagrammar with a given compiler. The options for this command are:
    • --force to generate the grammar even if an XML file already exists
    • --latin to manipulate metagrammars written in latin encoding
    • --debug to print some useful information about compilation
    • --notype to disable the strong type checking (equivalent to XMG1)
    • --more to generate additional files (type hierarchy, etc.)
    • --output or -o to specify an output file (the default is the name of the metagrammar file with the xml or json extension)
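When compilation is driven from a script, the command line can be assembled programmatically. The Python helper below is purely illustrative and not part of XMG: the name compile_command is hypothetical, the options are assumed to be spelled with a double dash, and placing them after the metagrammar path is an assumption as well. It reuses the mph compiler and the morph.mg file from the earlier example and only builds the argument list, without running anything.

```python
def compile_command(compiler, metagrammar, force=False, output=None):
    """Assemble the argument list for `xmg compile`.

    Assumption: options are written with a double dash and may follow
    the positional arguments.
    """
    argv = ["xmg", "compile", compiler, metagrammar]
    if force:
        argv.append("--force")
    if output is not None:
        argv += ["--output", output]
    return argv

command = compile_command("mph", "morph.mg", force=True)
print(" ".join(command))
```

The resulting list could be handed to subprocess.run on a machine where XMG is installed.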

Developer commands:

  • xmg startcommand
  • xmg startyaplib
  • xmg startbrick
  • xmg startcompiler
  • xmg startpylib
  • xmg startcontrib

Scripts

To ease the installation of some compilers, scripts are available at the root of the XMG-2 installation directory.

Running:

./reinstall.sh

builds and installs the synsem compiler (all the other contributions are uninstalled). To add the lex and the mph compilers, one can use the script

./install_lex_mph.sh

Other scripts are by convention named after the compiler(s) they install. Scripts whose names start with reinstall first uninstall all existing compilers, whereas scripts whose names start with install add compilers to the set already installed.

./reinstall_all.sh

will make all other bricks available (compilers still need to be built).

Tools

XMGTOOL

XMGTOOL, packaged with XMG-2, is a utility to compare outputs generated by different metagrammars. It is typically used for debugging, while extending the grammar, or to compare the outputs of XMG-1 and XMG-2. XMGTOOL helps track which classes generate more or fewer models, or where the entries of the grammar differ.

First, the command pickle transforms the grammar into a format XMGTOOL can handle: xmgtool pickle grammar_file output_file produces the file output_file, which makes it possible to analyse the grammar contained in grammar_file (produced by XMG).

The command fstat compares the numbers of models for each class contained in two grammars. xmgtool fstat file1 file2 prints all classes for which the number of entries differs between the two grammars contained in file1 and file2 (these files must have been produced by the command pickle).
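The kind of comparison fstat reports can be sketched in a few lines of Python. This is not XMGTOOL's code and does not read its pickled files: the per-class counts are given directly as dictionaries, and the function name differing_classes and the sample counts are invented for the illustration.

```python
def differing_classes(counts_a, counts_b):
    """Return {class_name: (n_a, n_b)} for classes whose entry counts differ.

    A class missing from one grammar is counted as having 0 entries.
    """
    names = sorted(set(counts_a) | set(counts_b))
    return {
        name: (counts_a.get(name, 0), counts_b.get(name, 0))
        for name in names
        if counts_a.get(name, 0) != counts_b.get(name, 0)
    }

# Hypothetical counts for two versions of a grammar.
old = {"transitive": 2, "intransitive": 1}
new = {"transitive": 3, "intransitive": 1, "ditransitive": 4}
print(differing_classes(old, new))
```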

Viewers

Grammars generated with XMG can be viewed using the default GUI packaged with XMG, as shown in the introduction. Other options are available to visualize the generated resource, each of them offering support for different types of grammars:

Parsers

The resources created with XMG can of course be used for parsing. TuLiPA can parse LTAG grammars with predicate-based semantics. Its new version, TuLiPA-frames, provides a parser for LTAG with frame semantics.

Errors and support

Common errors

This section (still under construction) lists some errors that can be encountered while developing a resource with XMG.

Tokenizer errors

  • unrecognized: the given symbol is not supported by the tokenizer. You may check the encoding of the file or try to use the --latin option.

Syntax errors

  • expected: syntax error. Check the different languages of the section Dimensions. Maybe the used compiler is not the right one?

Type errors

  • incompatible types: the value of an attribute does not have the expected type. Check the type declarations.
  • unknown constant: all constants, except for the boolean values + and - (type bool), must be declared (a value for an enumerated type for example).
  • multiple definitions of constant (in type definitions): the same identifier is used to refer to two constants (a value can only have one type).
  • variable not declared: a variable appears in the class, but is not imported, nor declared, nor a class parameter.
  • property_not_declared: a node is given a property which was not defined in the headers.
  • feature not declared: a node is given a feature which was not defined in the headers.
  • multiple definitions of feature: the same identifier is used to refer to two different features.
  • type not defined: a structure is given (in the headers) a type which is not defined.
  • multiple definitions of type: the same identifier is used to refer to two types.
  • incompatible expressions: the types of two expressions are not compatible.
  • value not in range: an integer variable has a value incompatible with its definition (out of the bounds).

Unfolder errors

  • cycle detected with class: the given class creates a cycle in the class hierarchies (it calls a class which is already one of its ancestors in the hierarchy). This is forbidden as the resource generated would be infinite.
  • no class set to be valued: there should be at least one axiom in a metagrammar (see Valuations).
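The cycle error above can be pictured as a search in the graph of class calls. The following Python sketch is an illustration under assumptions, not the unfolder's algorithm: classes and their calls are given as a plain dictionary, and the function name find_cycle is hypothetical.

```python
def find_cycle(calls, start):
    """Return a cyclic call path reachable from `start`, or None.

    calls maps each class name to the list of classes it calls.
    """
    def visit(name, path):
        if name in path:
            # The class is already one of its own ancestors: cycle found.
            return path[path.index(name):] + [name]
        for callee in calls.get(name, []):
            cycle = visit(callee, path + [name])
            if cycle:
                return cycle
        return None

    return visit(start, [])

# The example metagrammar has no cycle.
acyclic = {"transitive": ["CanSubject", "RelSubject", "Object", "Active"]}
print(find_cycle(acyclic, "transitive"))

# A hypothetical metagrammar in which A calls B, B calls C and C calls A.
cyclic = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(find_cycle(cyclic, "A"))
```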

Other common problems

  • Uncolored nodes: in an XMG metagrammar using the syn dimension and the colors principle, a color needs to be given to all nodes appearing in the accumulation. Warnings will be displayed if some nodes do not have colors. Grammars developed with XMG-1 can contain uncolored nodes, but they are ignored by the compiler. With XMG-2, you can simply remove these nodes from the metagrammar to obtain the same result.

Support

To report any bug concerning XMG, please use the issue tracker. You can also use the tracker for any question or request for assistance (installing, developing with XMG).

Please also use the GitHub page if you would like to request new extensions for XMG, or to share your own extensions or resources.

Bibliography

Related papers:

  • [Crabbé et al., 2013] Crabbé, B., Duchier, D., Gardent, C., Le Roux, J., and Parmentier, Y. (2013). XMG : eXtensible MetaGrammar. Computational Linguistics, 39(3):1–66.
  • [Petitjean et al., 2016] Petitjean, S., Duchier, D., and Parmentier, Y. (2016). XMG 2: Describing Description Languages. In Logical Aspects of Computational Linguistics. Celebrating 20 Years of LACL (1996–2016) 9th International Conference, LACL 2016, Nancy, France, December 5-7, 2016, Proceedings 9, pages 255–272. Springer Berlin Heidelberg.
  • [Crabbé and Duchier, 04] Crabbé, B. and Duchier, D. (2004). Metagrammar Redux. In Proceedings of CSLP’04, Roskilde, Denmark.