Extensible grammars

This chapter describes the whole syntax and semantics of the extensible grammars of camlp5.

The extensible grammars are the most advanced parsing tool of camlp5. They apply to streams of characters using a lexer which has to be previously defined by the programmer. In camlp5, the syntax of the ocaml language is defined with extensible grammars, which makes camlp5 a bootstrapped system (it compiles its own features by itself).

Getting started

The extensible grammars are a system to build grammar entries which can be extended dynamically. A grammar entry is an abstract value internally containing a stream parser. The type of a grammar entry is "Grammar.Entry.e t" where "t" is the type of the values returned by the grammar entry.

To start with extensible grammars, it is necessary to build a grammar, a value of type "Grammar.g", using the function "Grammar.gcreate":

value g = Grammar.gcreate lexer;

where "lexer" is a previously defined lexer. See the section explaining the interface with lexers. To begin with, it is possible to use a lexer of the module "Plexer" provided by camlp5:

value g = Grammar.gcreate (Plexer.gmake ());

Each grammar entry is associated with a grammar. Only grammar entries of the same grammar can call each other. To create a grammar entry, one has to use the function "Grammar.Entry.create", which takes the grammar as first parameter and a name as second parameter. This name is used in case of syntax errors. For example:

value exp = Grammar.Entry.create g "expression";

To apply a grammar entry, the function "Grammar.Entry.parse" can be used. Its first parameter is the grammar entry, the second one a stream of characters:

Grammar.Entry.parse exp (Stream.of_string "hello");

But if you try this, since the entry was just created without any rules, you receive an error message:

Stream.Error "entry [expression] is empty"

To add grammar rules to the grammar entry, it is necessary to extend it, using a specific syntactic statement: "EXTEND".
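As a first taste before the full syntax, here is a minimal sketch of a complete session. It assumes the revised syntax and that "pa_extend.cmo" is loaded; the single rule and the string it builds are illustrative choices, not taken from camlp5 itself.

```ocaml
(* Sketch only: assumes revised syntax with "pa_extend.cmo" loaded. *)
value g = Grammar.gcreate (Plexer.gmake ());
value exp = Grammar.Entry.create g "expression";

(* One level containing one rule: accept a lowercase identifier and
   return a string built from it (illustrative semantic action). *)
EXTEND
  exp: [ [ id = LIDENT -> "identifier: " ^ id ] ];
END;

(* The entry now has a rule, so parsing no longer reports an empty entry. *)
value result = Grammar.Entry.parse exp (Stream.of_string "hello");
```

With this rule, the entry has the type "Grammar.Entry.e string" and "result" should be the string "identifier: hello".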

Syntax of the EXTEND statement

The "EXTEND" statement is added in the expressions of the ocaml language when the syntax extension kit "pa_extend.cmo" is loaded. Its syntax is:

          expression ::= extend
              extend ::= "EXTEND" extend-body "END"
         extend-body ::= global-opt entries
          global-opt ::= "GLOBAL" ":" entry-names ";"
                       | <nothing>
         entry-names ::= entry-name entry-names
                       | entry-name
             entries ::= entry ";" entries
                       | entry ";"
               entry ::= entry-name ":" position-opt "[" levels "]"
        position-opt ::= "FIRST"
                       | "LAST"
                       | "BEFORE" label
                       | "AFTER" label
                       | "LEVEL" label
                       | <nothing>
              levels ::= level "|" levels
                       | level
               level ::= label-opt assoc-opt "[" rules "]"
           label-opt ::= label
                       | <nothing>
           assoc-opt ::= "LEFTA"
                       | "RIGHTA"
                       | "NONA"
                       | <nothing>
               rules ::= rule "|" rules
                       | rule
                rule ::= psymbols-opt "->" expression
                       | psymbols-opt
        psymbols-opt ::= psymbols
                       | <nothing>
            psymbols ::= psymbol ";" psymbols
                       | psymbol
             psymbol ::= symbol
                       | pattern "=" symbol
              symbol ::= keyword
                       | token
                       | token string
                       | entry-name
                       | entry-name "LEVEL" label
                       | "SELF"
                       | "NEXT"
                       | "LIST0" symbol
                       | "LIST0" symbol "SEP" symbol
                       | "LIST1" symbol
                       | "LIST1" symbol "SEP" symbol
                       | "OPT" symbol
                       | "[" rules "]"
                       | "(" symbol ")"
             keyword ::= string
               token ::= uident
               label ::= string
          entry-name ::= qualid
              qualid ::= qualid "." qualid
                       | uident
                       | lident
              uident ::= 'A'-'Z' ident
              lident ::= ('a'-'z' | '_' | utf8-char) ident
               ident ::= ident-char*
          ident-char ::= ('a'-'z' | 'A'-'Z' | '0'-'9' | '_' | ''' | utf8-char)
           utf8-char ::= '\128'-'\255'

Other statements, "GEXTEND", "DELETE_RULE" and "GDELETE_RULE", are also defined by the same syntax extension kit. See further.

In the description above, only "EXTEND" and "END" are new keywords (reserved words which cannot be used as variable, constructor or module names). The other strings (e.g. "GLOBAL", "LEVEL", "LIST0", "LEFTA", etc.) are not reserved.

Semantics of the EXTEND statement

The EXTEND statement starts with the "EXTEND" keyword and ends with the "END" keyword.

Global indicator

After the first keyword, the identifier "GLOBAL" may appear, followed by a colon, a list of entry names and a semicolon. It states that these entries correspond to visible (previously defined) entry variables in the context of the EXTEND statement; all the other entries are locally and silently defined inside the statement.

Example:

value exp = Grammar.Entry.create g "exp";
EXTEND
  GLOBAL: exp;
  exp: [ [ x = foo; y = bar ] ];
  foo: [ [ "foo" ] ];
  bar: [ [ "bar" ] ];
END;

The entry "exp" is an existing variable (defined by value exp = ...). On the other hand, the entries "foo" and "bar" have not been defined. Because of the GLOBAL indicator, the system defines them locally.

Without the GLOBAL indicator, the three entries would have been considered as global variables, so the ocaml compiler would report "unbound variable" on the first undefined entry, "foo".

Entries list

Then the list of entry extensions follows. An entry extension starts with the entry name followed by a colon. An entry may have several levels, corresponding to several stream parsers which call one another (see further).

Optional position

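After the colon, it is possible to specify where to insert the defined levels:

  - "FIRST" (resp. "LAST"): the levels are inserted before (resp. after) all the existing levels of the entry.
  - "BEFORE" (resp. "AFTER") followed by a level label: the levels are inserted before (resp. after) the level having that label.
  - "LEVEL" followed by a level label: the first defined level extends the existing level having that label, and the other defined levels are inserted after it.
  - Nothing: if the entry has no level yet, the defined levels become its levels; otherwise the first defined level extends the first level of the entry and the other ones are inserted after it.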

Levels

After the optional "position", the level list follows. The levels are separated by vertical bars, the whole list being between brackets.

A level starts with an optional label, which corresponds to its name. This label is useful to specify this level in case of future extension, using the position (see previous section).

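The level continues with an optional associativity indicator, which can be:

  - "LEFTA" for left associativity (the default),
  - "RIGHTA" for right associativity,
  - "NONA" for no associativity.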
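To see labels, associativity and positions working together, here is a small sketch of an integer calculator. The entry name "expr" and the labels "add", "mul" and "simple" are illustrative choices; the code assumes the revised syntax with "pa_extend.cmo" loaded.

```ocaml
(* Sketch only: assumes revised syntax with "pa_extend.cmo" loaded. *)
value g = Grammar.gcreate (Plexer.gmake ());
value expr = Grammar.Entry.create g "expr";

(* Three labelled levels, from lowest to highest precedence. *)
EXTEND
  expr:
    [ "add" LEFTA
      [ x = SELF; "+"; y = SELF -> x + y
      | x = SELF; "-"; y = SELF -> x - y ]
    | "mul" LEFTA
      [ x = SELF; "*"; y = SELF -> x * y
      | x = SELF; "/"; y = SELF -> x / y ]
    | "simple"
      [ x = INT -> int_of_string x
      | "("; x = SELF; ")" -> x ] ]
  ;
END;

(* A later extension: the position LEVEL "mul" adds a rule to the
   existing level labelled "mul" instead of creating a new level. *)
EXTEND
  expr: LEVEL "mul"
    [ [ x = SELF; "mod"; y = SELF -> x mod y ] ];
END;

value r1 = Grammar.Entry.parse expr (Stream.of_string "1 + 2 * 3");
value r2 = Grammar.Entry.parse expr (Stream.of_string "10 - 2 - 3");
value r3 = Grammar.Entry.parse expr (Stream.of_string "7 mod 4 + 1");
```

Since "mul" comes after "add", multiplication binds tighter, so "r1" should be 7. Left associativity makes "r2" equal to (10 - 2) - 3, that is 5, and the rule added at the level "mul" gives 4 for "r3".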

Rules

Finally, the grammar rule list appears. The rules are separated by vertical bars, the whole list being between brackets.

A rule looks like a match case in the "match" statement or a parser case in the "parser" statement: a list of psymbols (see next paragraph) separated by semicolons, followed by a right arrow and an expression, the semantic action. Actually, the right arrow and expression are optional: in this case, it is equivalent to an expression which would be the unit "()" constructor.

A psymbol is either a pattern followed by the equal sign and a symbol, or a symbol alone. It corresponds to a test of this symbol, whose value is bound to the pattern, if any.
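The two forms of rules can be sketched as follows. The entries "binding" and "separator" and the keyword "let" are hypothetical names chosen for the illustration; the code assumes the revised syntax with "pa_extend.cmo" loaded.

```ocaml
(* Sketch only: assumes revised syntax with "pa_extend.cmo" loaded. *)
value g = Grammar.gcreate (Plexer.gmake ());
value binding = Grammar.Entry.create g "binding";
value separator = Grammar.Entry.create g "separator";

EXTEND
  (* "name" and "v" are patterns bound to the values of the symbols
     LIDENT and INT; the keywords "let" and "=" are tested only. *)
  binding:
    [ [ "let"; name = LIDENT; "="; v = INT -> (name, int_of_string v) ] ];
  (* No right arrow: the semantic action is the unit value "()". *)
  separator: [ [ ";" ] ];
END;

value b = Grammar.Entry.parse binding (Stream.of_string "let x = 42");
value u = Grammar.Entry.parse separator (Stream.of_string ";");
```

Here "b" should be the pair ("x", 42) and "u" the unit value.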

Symbols

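A symbol is either:

  - a keyword (a string): the input must match this keyword,
  - a token name (an identifier starting with an uppercase character), optionally followed by a string: the input must match this token, with any value if no string is given, or with that string otherwise; the available tokens depend on the associated lexer,
  - an entry name: a call to this entry,
  - an entry name followed by "LEVEL" and a level label: a call to this entry at that level,
  - "SELF": a recursive call to the current entry,
  - "NEXT": a call to the next level of the current entry,
  - a list of rules between brackets: equivalent to a call to an anonymous entry having these rules,
  - a meta symbol ("LIST0", "LIST1" or "OPT", see the section "Meta symbols"),
  - a symbol between parentheses.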

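The syntactic analysis follows the list of symbols. If it fails, depending on the first items of the rule (see the section about the kind of grammars recognized), the parsing may either fail by raising the exception "Stream.Error" or continue with the next rule.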

Rules insertion

Remember that "EXTEND" is a statement, not a declaration: the rules are added to the entries at run time. Each rule is internally inserted in a tree, allowing the left factorization of the rules. For example, if two rules have the following lists of symbols:

"if"; e1 = expr; "then"; e2 = expr
"if"; e1 = expr; "then"; e2 = expr; "else"; e3 = expr

the rules are inserted in a tree and the result looks like:

"if"
   expr
     "then"
       expr
         "else"
           expr
         <nothing>

This tree is built incrementally as rules are inserted. When the entry is used, by applying the function "Grammar.Entry.parse" to it, the input is matched against that tree, starting from the root and descending as long as the parsing advances.

There is a separate tree for each entry level.
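The effect of left factorization can be observed with two rules sharing a common prefix. The sketch below uses a hypothetical type "stmt" and plain identifiers in place of real expressions; it assumes the revised syntax with "pa_extend.cmo" loaded.

```ocaml
(* Sketch only: assumes revised syntax with "pa_extend.cmo" loaded.
   The type and its constructors are illustrative. *)
type stmt =
  [ If of string and string
  | IfElse of string and string and string ];

value g = Grammar.gcreate (Plexer.gmake ());
value stmt = Grammar.Entry.create g "stmt";

(* Both rules start with the same four symbols: they share one path in
   the tree, which forks only after the second identifier. *)
EXTEND
  stmt:
    [ [ "if"; c = LIDENT; "then"; a = LIDENT -> If c a
      | "if"; c = LIDENT; "then"; a = LIDENT; "else"; b = LIDENT ->
          IfElse c a b ] ];
END;

value s1 = Grammar.Entry.parse stmt (Stream.of_string "if c then a");
value s2 = Grammar.Entry.parse stmt (Stream.of_string "if c then a else b");
```

Both inputs should be accepted: "s1" by the branch that ends after the second identifier, "s2" by the branch that continues with "else".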

Meta symbols

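Extra symbols exist to handle lists and optional symbols. They are:

  - "LIST0" followed by a symbol: a possibly empty list of this symbol,
  - "LIST0" followed by a symbol, "SEP" and another symbol: a possibly empty list of the first symbol, separated by the second one,
  - "LIST1" followed by a symbol: a list of this symbol with at least one element,
  - "LIST1" followed by a symbol, "SEP" and another symbol: a list of the first symbol with at least one element, separated by the second one,
  - "OPT" followed by a symbol: this symbol or nothing, returning a value of type "option".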
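A short sketch of the list and option symbols follows. The entries "int_list" and "decl" and the keyword "var" are hypothetical; the code assumes the revised syntax with "pa_extend.cmo" loaded.

```ocaml
(* Sketch only: assumes revised syntax with "pa_extend.cmo" loaded. *)
value g = Grammar.gcreate (Plexer.gmake ());
value int_list = Grammar.Entry.create g "int_list";
value decl = Grammar.Entry.create g "decl";

EXTEND
  (* LIST0 ... SEP ...: integers separated by semicolons, returned as
     an OCaml list; the inner bracketed rule converts each token. *)
  int_list:
    [ [ "["; l = LIST0 [ i = INT -> int_of_string i ] SEP ";"; "]" -> l ] ];
  (* OPT: the initializer may be absent, hence a value of type option. *)
  decl:
    [ [ "var"; name = LIDENT;
        init = OPT [ "="; i = INT -> int_of_string i ] ->
          (name, init) ] ];
END;

value l = Grammar.Entry.parse int_list (Stream.of_string "[1; 2; 3]");
value d1 = Grammar.Entry.parse decl (Stream.of_string "var x = 3");
value d2 = Grammar.Entry.parse decl (Stream.of_string "var y");
```

The value "l" should be the list [1; 2; 3], "d1" the pair ("x", Some 3) and "d2" the pair ("y", None).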

Machinery

Each entry level contains internally two trees:

... to be completed ...

Kind of grammar

... to be completed ...

Interface with the lexer

To create a grammar, the function "Grammar.gcreate" must be called, with a lexer as parameter.

A simple solution is to use the predefined lexer built by "Plexer.gmake ()", the lexer used for the ocaml grammar of camlp5. In this case, you can just pass it as parameter to "Grammar.gcreate" and it is not necessary to read this section.

Otherwise, if you want to create your own lexer, and interface it with the extensible grammars, it is necessary to follow some steps. This section explains the notions of "token pattern", "token" and the "match function", necessary to build the interface.

Token patterns

In grammar rules, the keyword and token symbols are internally represented by values of type "Token.pattern", defined like this:

type pattern = (string * string);

where the two strings represent:

  1. The constructor name, an identifier starting with an uppercase character,
  2. The possible parameter to match with, if any.

The empty constructor name corresponds to a keyword. The empty parameter corresponds to a match to any value of the symbol.

For example, in the rule containing:

"for"; i = LIDENT; "="; e1 = SELF; "to"; e2 = SELF

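the token patterns are:

  - ("", "for") for the keyword "for",
  - ("LIDENT", "") for the token symbol LIDENT,
  - ("", "=") for the keyword "=",
  - ("", "to") for the keyword "to".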

The symbol UIDENT "Foo" in a rule would be represented by the token pattern:

("UIDENT", "Foo")

On the other hand, the symbol "SELF" is a specific symbol of the EXTEND syntax: it does not correspond to a token pattern and is represented differently. A token constructor name must not belong to the specific symbols: SELF, NEXT, LIST0, LIST1 and OPT.

Match function

When one writes a lexer, one generally creates a type for the tokens that the lexer returns. Any type is possible. One has to provide a function telling how a token (type defined by the programmer) is matched against a token pattern (predefined type, see above).

If you do not want to bother with that, you can choose the type "(string * string)" for your token type (just like the token pattern type) and use the default match function provided in camlp5: "Token.default_match". If you choose that solution, you can skip the rest of this section.

Otherwise, if you want to use your own specific token type (or if you don't have the choice), you have to write your specific match function.

This function takes a token pattern and returns a function matching a token, which returns the matched string or raises the exception "Stream.Failure" if the token does not match.

Example:

type mytoken = [ Ident of string | Int of int | Comma | Equal | Keyw of string ];

value mymatch =
  fun
  [ ("IDENT", "") -> fun [ Ident s -> s | _ -> raise Stream.Failure ]
  | ("INT", "") -> fun [ Int i -> string_of_int i | _ -> raise Stream.Failure ]
  | ("", ",") -> fun [ Comma -> "" | _ -> raise Stream.Failure ]
  | ("", "=") -> fun [ Equal -> "" | _ -> raise Stream.Failure ]
  | ("", s) ->
      fun
      [ Keyw k -> if k = s then "" else raise Stream.Failure
      | _ -> raise Stream.Failure ] ]
;

Notice that, for efficiency, it is necessary to write this function like above: as a match on token patterns returning, for each case, the function which matches the token. It should not be written as a function matching the token pattern and the token together and returning a string for each case.
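The two-stage shape can be made visible by applying a match function to a pattern alone. The sketch below is self-contained and uses a hypothetical token type "tok" and function "tmatch"; the final catch-all case, which rejects every token for an unknown pattern, is an illustrative choice.

```ocaml
(* Sketch only, in revised syntax; "tok" and "tmatch" are hypothetical. *)
type tok = [ Ident of string | Keyw of string ];

(* The pattern is analysed once; the returned closure is what is
   applied to each token. *)
value tmatch =
  fun
  [ ("IDENT", "") -> fun [ Ident s -> s | _ -> raise Stream.Failure ]
  | ("", s) ->
      fun [ Keyw k when k = s -> "" | _ -> raise Stream.Failure ]
  | _ -> fun _ -> raise Stream.Failure ]
;

(* Partial application selects the token matcher once... *)
value match_ident = tmatch ("IDENT", "");

(* ...and the result can then be applied to many tokens. *)
value s = match_ident (Ident "x");
```

Here "s" is the string "x", whereas applying "match_ident" to a token such as Keyw "if" raises "Stream.Failure".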

... to be completed ...

Grammar.gcreate

The function "Grammar.gcreate" takes a g-lexer as parameter. A g-lexer is a value of type "Token.glexer t" where "t" is the type of the tokens. The programmer of the lexer may choose any token type, provided that a correct function "tok_match" (see further) is also supplied, telling how a token pattern is matched against this token type.

A suitable value for a g-lexer is the result of "Plexer.gmake ()" which is the lexer used for the ocaml language in camlp5.

g-lexer type

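The type "Token.glexer" is a record type containing six fields:

  - "tok_func": the lexer function itself,
  - "tok_using": a function called by the EXTEND statement for each token pattern used in a rule,
  - "tok_removing": a function called when a rule using a token pattern is deleted,
  - "tok_match": the match function (see the section "Match function"),
  - "tok_text": a function returning the text describing a token pattern, used in error messages,
  - "tok_comm": an optional value related to comments, which can be left to "None".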

The function "tok_using" is called by the EXTEND statement while treating the grammar rules: each time a keyword or a token symbol is encountered, it is called with the corresponding token pattern.

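Since this record is a little bit complicated, default functions and values can be used to begin with:

  - "tok_using" and "tok_removing" can be functions doing nothing,
  - "tok_match" can be "Token.default_match" if the token type is "(string * string)",
  - "tok_text" can be "Token.lexer_text",
  - "tok_comm" can be "None".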

Example: if you have a lexer function named "lexer", you can make the glexer value like this:

{Token.tok_func = lexer;
 Token.tok_using _ = (); Token.tok_removing _ = ();
 Token.tok_match = Token.default_match;
 Token.tok_text = Token.lexer_text;
 Token.tok_comm = None}

... to be completed ...

Functorial interface

... to be completed ...