Extensible grammars
This chapter describes the syntax and semantics of the extensible grammars of Camlp5.
The extensible grammars are the most advanced parsing tool of Camlp5. They apply to streams of characters, using a lexer which has to be previously defined by the programmer. In Camlp5, the syntax of the OCaml language itself is defined with extensible grammars, which makes Camlp5 a bootstrapped system (it compiles its own features by itself).
Getting started
The extensible grammars are a system to build grammar entries which can be extended dynamically. A grammar entry is an abstract value internally containing a stream parser. The type of a grammar entry is "Grammar.Entry.e t" where "t" is the type of the values returned by the grammar entry.
To start with extensible grammars, it is necessary to build a grammar, a value of type "Grammar.g", using the function "Grammar.gcreate":
value g = Grammar.gcreate lexer;
where "lexer" is a lexer previously defined. See the section explaining the interface with lexers. In a first time, it is possible to use a lexer of the module "Plexer" provided by camlp5:
value g = Grammar.gcreate (Plexer.gmake ());
Each grammar entry is associated with a grammar. Only grammar entries of the same grammar can call each other. To create a grammar entry, one has to use the function "Grammar.Entry.create", which takes the grammar as first parameter and a name as second parameter. This name is used in case of syntax errors. For example:
value exp = Grammar.Entry.create g "expression";
To apply a grammar entry, the function "Grammar.Entry.parse" can be used. Its first parameter is the grammar entry, the second one a stream of characters:
Grammar.Entry.parse exp (Stream.of_string "hello");
But if you try this, since the entry was just created without any rules, you receive an error message:
Stream.Error "entry [expression] is empty"
To add grammar rules to the grammar entry, it is necessary to extend it, using a specific syntactic statement: "EXTEND".
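For example, here is a minimal hedged sketch of the whole process, assuming the syntax extension kit "pa_extend.cmo" is loaded (see the next section) and that the source is compiled in revised syntax:

value g = Grammar.gcreate (Plexer.gmake ());
value exp = Grammar.Entry.create g "expression";

(* add one rule: an expression is an integer literal; the INT token of
   Plexer carries the literal as a string, converted here to an int *)
EXTEND
  exp: [ [ i = INT -> int_of_string i ] ];
END;

(* parsing now succeeds and returns 42 *)
Grammar.Entry.parse exp (Stream.of_string "42");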
Syntax of the EXTEND statement
The "EXTEND" statement is added in the expressions of the ocaml language when the syntax extension kit "pa_extend.cmo" is loaded. Its syntax is:
     expression ::= extend
         extend ::= "EXTEND" extend-body "END"
    extend-body ::= global-opt entries
     global-opt ::= "GLOBAL" ":" entry-names ";"
                  | <nothing>
    entry-names ::= entry-name entry-names
                  | entry-name
        entries ::= entry ";" entries
                  | entry ";"
          entry ::= entry-name ":" position-opt "[" levels "]"
   position-opt ::= "FIRST"
                  | "LAST"
                  | "BEFORE" label
                  | "AFTER" label
                  | "LEVEL" label
                  | <nothing>
         levels ::= level "|" levels
                  | level
          level ::= label-opt assoc-opt "[" rules "]"
      label-opt ::= label
                  | <nothing>
      assoc-opt ::= "LEFTA"
                  | "RIGHTA"
                  | "NONA"
                  | <nothing>
          rules ::= rule "|" rules
                  | rule
           rule ::= psymbols-opt "->" expression
                  | psymbols-opt
   psymbols-opt ::= psymbols
                  | <nothing>
       psymbols ::= psymbol ";" psymbols
                  | psymbol
        psymbol ::= symbol
                  | pattern "=" symbol
         symbol ::= keyword
                  | token
                  | token string
                  | entry-name
                  | entry-name "LEVEL" label
                  | "SELF"
                  | "NEXT"
                  | "LIST0" symbol
                  | "LIST0" symbol "SEP" symbol
                  | "LIST1" symbol
                  | "LIST1" symbol "SEP" symbol
                  | "OPT" symbol
                  | "[" rules "]"
                  | "(" symbol ")"
        keyword ::= string
          token ::= uident
          label ::= string
     entry-name ::= qualid
         qualid ::= qualid "." qualid
                  | uident
                  | lident
         uident ::= 'A'-'Z' ident
         lident ::= ('a'-'z' | '_' | utf8-char) ident
          ident ::= ident-char*
     ident-char ::= ('a'-'z' | 'A'-'Z' | '0'-'9' | '_' | ''' | utf8-char)
      utf8-char ::= '\128'-'\255'
Other statements, "GEXTEND", "DELETE_RULE" and "GDELETE_RULE", are also defined by the same syntax extension kit. See further.
In the description above, only "EXTEND" and "END" are new keywords (reserved words which cannot be used as names of variables, constructors or modules). The other strings (e.g. "GLOBAL", "LEVEL", "LIST0", "LEFTA", etc.) are not reserved.
Semantics of the EXTEND statement
The EXTEND statement starts with the "EXTEND" keyword and ends with the "END" keyword.
Global indicator
After the first keyword, the identifier "GLOBAL" may appear, followed by a colon, a list of entry names and a semicolon. It says that these entries correspond to visible (previously defined) entry variables in the context of the EXTEND statement, the other entries being locally and silently defined inside.
- If an entry which is extended in the EXTEND statement is in the GLOBAL list, but is not defined in the context of the EXTEND statement, the OCaml compiler will fail with the error "unbound value".
- If there is no GLOBAL indicator, and an entry which is extended in the EXTEND statement is not defined in the context of the EXTEND statement, the OCaml compiler will also fail with the error "unbound value".
Example:
value exp = Grammar.Entry.create g "exp"; EXTEND GLOBAL: exp; exp: [ [ x = foo; y = bar ] ]; foo: [ [ "foo" ] ]; bar: [ [ "bar" ] ]; END;
The entry "exp" is an existing variable (defined by value exp = ...). On the other hand, the entries "foo" and "bar" have not been defined. Because of the GLOBAL indicator, the system define them locally.
Without the GLOBAL indicator, the three entries would have been considered as global variables, therefore the ocaml compiler would say "unbound variable" under the first undefined entry, "foo".
Entries list
Then the list of entry extensions follows. An entry extension starts with the entry name followed by a colon. An entry may have several levels, corresponding to several stream parsers which call each other (see further).
Optional position
After the colon, it is possible to specify where to insert the defined levels:
- The identifier "FIRST" (resp. "LAST") indicates that the level must be inserted before (resp. after) all possibly existing levels of the entry. They become their first (resp. last) levels.
- The identifier "BEFORE" (resp. "AFTER") followed by a level label (a string) indicates that the levels must be inserted before (resp. after) that level, if it exists. If it does not exist, the extend statement fails at run time.
- The identifier "LEVEL" followed by a level label indicates that the first level defined in the extend statement must be inserted at the given level, extending and modifying it. The other levels defined in the statement are inserted after this level, and before the possible levels following this level. If there is no level with this label, the extend statement fails at run time.
- By default, if the entry has no level, the levels defined in the statement are inserted in the entry. Otherwise the first defined level is inserted at the first level of the entry, extending or modifying it. The other levels are inserted afterwards (before the possible second level which may previously exist in the entry).
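For example, here is a hedged sketch adding one rule to an already existing level; the entry "exp" and the level label "top" are assumptions (if the entry has no level with this label, the statement fails at run time):

EXTEND
  exp: LEVEL "top"
    [ [ "succ"; x = SELF -> x + 1 ] ];
END;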
Levels
After the optional position, the level list follows. The levels are separated by vertical bars, the whole list being between brackets.
A level starts with an optional label, which corresponds to its name. This label is useful to specify this level in case of future extension, using the position (see previous section).
The level continues with an optional associativity indicator, which can be:
- LEFTA for left associativity (default),
- RIGHTA for right associativity,
- NONA for no associativity.
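As an illustration, here is a hedged sketch of an entry for arithmetic expressions with three labeled levels; the entry "exp" is assumed to be freshly created and to return integers. Left associativity makes "1-2-3" parse as "(1-2)-3":

EXTEND
  exp:
    [ "top" LEFTA
      [ x = SELF; "+"; y = SELF -> x + y
      | x = SELF; "-"; y = SELF -> x - y ]
    | "mul" LEFTA
      [ x = SELF; "*"; y = SELF -> x * y ]
    | "simple"
      [ i = INT -> int_of_string i
      | "("; e = SELF; ")" -> e ] ];
END;

In the rules of level "top", the leftmost "SELF" is a call to the same level (because of the left associativity) and the rightmost one a call to the next level, "mul"; between the parentheses, "SELF" is a call to the top level of the entry (see the section about symbols).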
Rules
At last, the list of grammar rules appears. The rules are separated by vertical bars, the whole list being between brackets.
A rule looks like a match case in the "match" statement or a parser case in the "parser" statement: a list of psymbols (see next paragraph) separated by semicolons, followed by a right arrow and an expression, the semantic action. Actually, the right arrow and expression are optional: in this case, it is equivalent to an expression which would be the unit "()" constructor.
A psymbol is either a pattern followed by the equal sign and a symbol, or a symbol alone. It corresponds to a test of this symbol, whose value is bound to the pattern, if any.
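For example, here is a hedged sketch of an entry with two rules: the first one binds a pattern and carries an action, the second one has no action, so its value is "()". The entry "command" is a fresh assumption, the tokens those of Plexer:

value command = Grammar.Entry.create g "command";

EXTEND
  command:
    [ [ "print"; s = STRING -> print_string s  (* STRING carries the string contents *)
      | "newline" ] ];                         (* no action: the value is () *)
END;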
Symbols
A symbol is either:
- a keyword (a string): the input must match this keyword,
- a token name (an identifier starting with an uppercase character), optionally followed by a string: the input must match this token (any value if no string, or that string if a string follows the token name), the list of the available tokens depending on the associated lexer (the list of tokens available with "Plexer.gmake ()" is: LIDENT, UIDENT, TILDEIDENT, TILDEIDENTCOLON, QUESTIONIDENT, INT, INT_l, INT_L, INT_n, FLOAT, CHAR, STRING, QUOTATION, ANTIQUOT and EOI; other lexers may propose other lists of tokens),
- an entry name, which corresponds to a call to this entry,
- an entry name followed by the identifier "LEVEL" and a level label, which corresponds to a call to this entry at that level,
- the identifier "SELF" which is a recursive call to the present entry, according to the associativity (i.e. it may be a call at the current level, to the next level, or to the top level of the entry): "SELF" is equivalent to the name of the entry itself,
- the identifier "NEXT", which is a call to the next level of the current entry,
- a left bracket, followed by a list of rules separated by vertical bars, and a right bracket: equivalent to a call to an entry, with these rules, inlined,
- a meta symbol (see further),
- a symbol between parentheses.
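Here is a hedged sketch combining several of these symbol kinds, reusing the entry "exp" and its level labels from the sketch above: a token with a string, a call to an entry at a given level, and inlined rules between brackets bound to a pattern:

EXTEND
  exp: LEVEL "simple"
    [ [ UIDENT "Abs"; x = exp LEVEL "mul" -> abs x
      | "choose"; n = [ "one" -> 1 | "two" -> 2 ] -> n ] ];
END;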
The syntactic analysis follows the list of symbols. If it fails, depending on the first items of the rule (see the section about the kind of grammars recognized):
- the parsing may fail by raising the exception "Stream.Error"
- the parsing may continue with the next rule.
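For instance, here is a hedged sketch catching such an error, with the entry "exp" of the previous sketches; the located exception "Ploc.Exc" wrapping the error is an assumption valid for recent versions of Camlp5 (older versions used "Stdpp.Exc_located"):

(* the input ends after "+", so the rule fails in the middle *)
try ignore (Grammar.Entry.parse exp (Stream.of_string "1 +")) with
[ Ploc.Exc _ (Stream.Error msg) -> print_endline msg
| Stream.Error msg -> print_endline msg ];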
Rules insertion
Remember that "EXTEND" is a statement, not a declaration: the rules are added in the entries at run time. Each rule is internally inserted in a tree, allowing the left factorization of the rule. For example, if two rules have as list of symbols:
"if"; e1 = expr; "then"; e2 = expr "if"; e1 = expr; "then"; e2 = expr; "else"; e3 = expr
the rules are inserted in a tree and the result looks like:
"if" expr "then" expr "else" expr <nothing>
This tree grows as rules are inserted. When the entry is used, by applying the function "Grammar.Entry.parse" to it, the input is matched against that tree, starting from the tree root and descending it as long as the parsing advances.
There is a distinct tree for each entry level.
Meta symbols
Extra symbols exist for manipulating lists or optional symbols. They are:
- LIST0 followed by a symbol: this is a list of this symbol, possibly empty,
- LIST0 followed by a symbol, SEP and another symbol: this is a list, possibly empty, of the first symbol separated by the second one,
- LIST1 followed by a symbol: this is a list of this symbol, with at least one element,
- LIST1 followed by a symbol, SEP and another symbol: this is a list, with at least one element, of the first symbol separated by the second one,
- OPT followed by a symbol: equivalent to "this symbol or nothing".
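Here is a hedged sketch using these meta symbols, reusing the entry "exp" of the previous sketches: LIST0 exp SEP "," returns the possibly empty list of the values of "exp", and OPT ";" returns an option on the keyword:

value args = Grammar.Entry.create g "args";

EXTEND
  args:
    [ [ "("; l = LIST0 exp SEP ","; ")"; OPT ";" -> l ] ];
END;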
Machinery
Each entry level contains internally two trees:
- A tree of the rules starting with the current entry name or with the identifier "SELF",
- A tree of the rules starting neither with the current entry name nor with "SELF".
... to be completed ...
Kind of grammar
... to be completed ...
Interface with the lexer
To create a grammar, the function "Grammar.gcreate" must be called, with a lexer as parameter.
A simple solution is to use the predefined lexer built by "Plexer.gmake ()", the lexer used for the OCaml grammar of Camlp5. In this case, you can just pass it as parameter of "Grammar.gcreate" and it is not necessary to read this section.
Otherwise, if you want to create your own lexer, and interface it with the extensible grammars, it is necessary to follow some steps. This section explains the notions of "token pattern", "token" and the "match function", necessary to build the interface.
Token patterns
In grammar rules, the keyword and token symbols are internally represented by values of type "Token.pattern", defined like this:
type pattern = (string * string);
where the two strings represent:
- The constructor name, an identifier starting with an uppercase character,
- The possible parameter to match with, if any.
An empty constructor name corresponds to a keyword. An empty parameter means that any value of the token matches.
For example, in the rule containing:
"for"; i = LIDENT; "="; e1 = SELF; "to"; e2 = SELF
the token patterns are:
- for the keywords "for", "=" and "to":
- ("", "for")
- ("", "=")
- ("", "to"))
- for the token symbol LIDENT:
- ("LIDENT", "")
The symbol UIDENT "Foo" in a rule would be represented by the token pattern:
("UIDENT", "Foo")
On the other hand, the symbol "SELF" is a specific symbol of the EXTEND syntax: it does not correspond to a token pattern and is represented differently. A token constructor name must not belong to the specific symbols: SELF, NEXT, LIST0, LIST1 and OPT.
Match function
When one writes a lexer, one generally creates a type for the tokens that the lexer returns. Any type is possible. One has to provide a function telling how a token (type defined by the programmer) is matched against a token pattern (predefined type, see above).
If you don't want to bother with that, you can choose the type "(string * string)" for your token type (just like the token pattern type) and use the default match function provided by Camlp5: "Token.default_match". If you choose that solution, you can skip the rest of this section.
Otherwise, if you want to use your own specific token type (or if you don't have the choice), you have to write your specific match function.
This function takes a token pattern and returns a function matching a token, returning the matched string or raising the exception "Stream.Failure" if the token does not match.
Example:
type mytoken =
  [ Ident of string
  | Int of int
  | Comma
  | Equal
  | Keyw of string ];

value mymatch =
  fun
  [ ("IDENT", "") ->
      fun [ Ident s -> s | _ -> raise Stream.Failure ]
  | ("INT", "") ->
      fun [ Int i -> string_of_int i | _ -> raise Stream.Failure ]
  | ("", ",") ->
      fun [ Comma -> "" | _ -> raise Stream.Failure ]
  | ("", "=") ->
      fun [ Equal -> "" | _ -> raise Stream.Failure ]
  | ("", s) ->
      fun
      [ Keyw k -> if k = s then "" else raise Stream.Failure
      | _ -> raise Stream.Failure ] ];
Notice that, for efficiency, it is necessary to write this function as above: a match on the token pattern returning, for each case, the function which matches the token, not a function matching the token pattern and the token together and returning a string for each case.
... to be completed ...
Grammar.gcreate
The function "Grammar.gcreate" takes a g-lexer as parameter. A g-lexer is a value of type "Token.glexer t" where "t" is the type of the tokens. The programmer of the lexer may choose any type he wants, on condition that he provides also a correct function "tok_match" (see further) which tells how a token pattern is matched against this token type.
A suitable value for a g-lexer is the result of "Plexer.gmake ()", the lexer used for the OCaml language in Camlp5.
g-lexer type
The type "Token.glexer" is a record type containing six fields:
- tok_func: function returning a token stream from a char stream,
- tok_using: function checking if a token pattern is recognized by the lexer,
- tok_removing: function telling the lexer that a token pattern has been removed,
- tok_match: function telling how a token pattern is parsed,
- tok_text: function returning the name of a token pattern,
- tok_comm: option value to ask the lexer to record the locations of the comments.
The function "tok_using" is called by the EXTEND statement while treating the grammar rules. When encountering a keyword or a token symbol, this function is called:
- if it is a token pattern, to check that this token is really recognized by the lexer,
- if it is a keyword, to perform the same check, or to allow the lexer to record this keyword in its tables.
Since this interface is a little bit complicated, there are default functions and values that can be used to get started:
- Token.lexer_func_of_parser is a provided function returning a value suitable for the field tok_func from a char stream parser,
- Token.default_match is an acceptable function, suitable for the field tok_match,
- Token.lexer_text is an acceptable function, suitable for the field tok_text,
- the function "fun _ -> ()" can be used for the fields tok_using and tok_removing to get started,
- the field tok_comm can be set to None.
Example: if you have a lexer function named 'lexer', you can make the glexer value like this:
{Token.tok_func = lexer;
 Token.tok_using _ = ();
 Token.tok_removing _ = ();
 Token.tok_match = Token.default_match;
 Token.tok_text = Token.lexer_text;
 Token.tok_comm = None}
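Continuing the example of the section about the match function, here is a hedged sketch wiring the custom token type "mytoken" into a g-lexer; the char stream parser "mytoken_parser" is hypothetical (you would have to write it yourself, in the form expected by "Token.lexer_func_of_parser"):

value mylexer =
  {Token.tok_func = Token.lexer_func_of_parser mytoken_parser;
   Token.tok_using _ = ();
   Token.tok_removing _ = ();
   Token.tok_match = mymatch;  (* the match function defined above *)
   Token.tok_text = Token.lexer_text;
   Token.tok_comm = None};

value g = Grammar.gcreate mylexer;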
... to be completed ...
Functorial interface
... to be completed ...