Extensible grammars
This chapter describes the whole syntax and semantics of the extensible grammars of camlp5.
The extensible grammars are the most advanced parsing tool of camlp5. They apply to streams of characters using a lexer which has to be previously defined by the programmer. In camlp5, the syntax of the ocaml language is defined with extensible grammars, which makes camlp5 a bootstrapped system (it compiles its own features by itself).
Getting started
The extensible grammars are a system to build grammar entries which can be extended dynamically. A grammar entry is an abstract value internally containing a stream parser. The type of a grammar entry is "Grammar.Entry.e t" where "t" is the type of the values returned by the grammar entry.
To start with extensible grammars, it is necessary to build a grammar, a value of type "Grammar.g", using the function "Grammar.gcreate":
value g = Grammar.gcreate lexer;
where "lexer" is a lexer previously defined. See the section explaining the interface with lexers. In a first time, it is possible to use a lexer of the module "Plexer" provided by camlp5:
value g = Grammar.gcreate (Plexer.gmake ());
Each grammar entry is associated with a grammar. Only grammar entries of the same grammar can call each other. To create a grammar entry, one has to use the function "Grammar.Entry.create", which takes the grammar as first parameter and a name as second parameter. This name is used in case of syntax errors. For example:
value exp = Grammar.Entry.create g "expression";
To apply a grammar entry, the function "Grammar.Entry.parse" can be used. Its first parameter is the grammar entry, the second one a stream of characters:
Grammar.Entry.parse exp (Stream.of_string "hello");
But if you try this, since the entry was just created without any rules, you get an error message:
Stream.Error "entry [expression] is empty"
To add grammar rules to the grammar entry, it is necessary to extend it, using a specific syntactic statement: "EXTEND".
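For example, once the EXTEND statement described in the next sections is available (with the kit "pa_extend.cmo" loaded), a first rule can be added so that the call above succeeds. This is only a minimal sketch:

  EXTEND
    exp: [ [ LIDENT "hello" -> "world" ] ];
  END;

  Grammar.Entry.parse exp (Stream.of_string "hello");
  (* - : string = "world" *)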
Syntax of the EXTEND statement
The "EXTEND" statement is added in the expressions of the ocaml language when the syntax extension kit "pa_extend.cmo" is loaded. Its syntax is:
     expression ::= extend
         extend ::= "EXTEND" extend-body "END"
    extend-body ::= global-opt entries
     global-opt ::= "GLOBAL" ":" entry-names ";"
                  | <nothing>
    entry-names ::= entry-name entry-names
                  | entry-name
        entries ::= entry ";" entries
                  | entry ";"
          entry ::= entry-name ":" position-opt "[" levels "]"
   position-opt ::= "FIRST"
                  | "LAST"
                  | "BEFORE" label
                  | "AFTER" label
                  | "LEVEL" label
                  | <nothing>
         levels ::= level "|" levels
                  | level
          level ::= label-opt assoc-opt "[" rules "]"
      label-opt ::= label
                  | <nothing>
      assoc-opt ::= "LEFTA"
                  | "RIGHTA"
                  | "NONA"
                  | <nothing>
          rules ::= rule "|" rules
                  | rule
           rule ::= psymbols-opt "->" expression
                  | psymbols-opt
   psymbols-opt ::= psymbols
                  | <nothing>
       psymbols ::= psymbol ";" psymbols
                  | psymbol
        psymbol ::= symbol
                  | pattern "=" symbol
         symbol ::= keyword
                  | token
                  | token string
                  | entry-name
                  | entry-name "LEVEL" label
                  | "SELF"
                  | "NEXT"
                  | "LIST0" symbol
                  | "LIST0" symbol "SEP" symbol
                  | "LIST1" symbol
                  | "LIST1" symbol "SEP" symbol
                  | "OPT" symbol
                  | "[" rules "]"
                  | "(" symbol ")"
        keyword ::= string
          token ::= uident
          label ::= string
     entry-name ::= qualid
         qualid ::= qualid "." qualid
                  | uident
                  | lident
         uident ::= 'A'-'Z' ident
         lident ::= ('a'-'z' | '_' | utf8-char) ident
          ident ::= ident-char*
     ident-char ::= ('a'-'z' | 'A'-'Z' | '0'-'9' | '_' | ''' | utf8-char)
      utf8-char ::= '\128'-'\255'
Other statements, "GEXTEND", "DELETE_RULE", "GDELETE_RULE" are also defined by the same syntax extension kit. See further.
In the description above, only "EXTEND" and "END" are new keywords (reserved words which cannot be used in variables, constructors or module names). The other strings (e.g. "GLOBAL", "LEVEL", "LIST0", "LEFTA", etc.) are not reserved.
Semantics of the EXTEND statement
The EXTEND statement starts with the "EXTEND" keyword and ends with the "END" keyword.
GLOBAL indicator
After the first keyword, the identifier "GLOBAL" may appear, followed by a colon, a list of entry names and a semicolon. It states that these entries correspond to visible (previously defined) entry variables in the context of the EXTEND statement, the other ones being locally and silently defined inside.
- If an entry, which is extended in the EXTEND statement, is in the GLOBAL list, but is not defined in the context of the EXTEND statement, the ocaml compiler will fail with the error "unbound value".
- If there is no GLOBAL indicator, and an entry, which is extended in the EXTEND statement, is not defined in the context of the EXTEND statement, the ocaml compiler will also fail with the error "unbound value".
Example:
value exp = Grammar.Entry.create g "exp";

EXTEND
  GLOBAL: exp;
  exp: [ [ x = foo; y = bar ] ];
  foo: [ [ "foo" ] ];
  bar: [ [ "bar" ] ];
END;
The entry "exp" is an existing variable (defined by value exp = ...). On the other hand, the entries "foo" and "bar" have not been defined. Because of the GLOBAL indicator, the system define them locally.
Without the GLOBAL indicator, the three entries would all have been considered global variables, and the ocaml compiler would fail with "unbound value" at the first undefined entry, "foo".
Entries list
Then follows the list of entry extensions. An entry extension starts with the entry name followed by a colon. An entry may have several levels, corresponding to several stream parsers which call each other (see further).
Optional position
After the colon, it is possible to specify where to insert the defined levels:
- The identifier "FIRST" (resp. "LAST") indicates that the level must be inserted before (resp. after) all possibly existing levels of the entry. They become their first (resp. last) levels.
- The identifier "BEFORE" (resp. "AFTER") followed by a level label (a string) indicates that the levels must be inserted before (resp. after) that level, if it exists. If it does not exist, the extend statement fails at run time.
- The identifier "LEVEL" followed by a level label indicates that the first level defined in the extend statement must be inserted at the given level, extending and modifying it. The other levels defined in the statement are inserted after this level, and before the possible levels following this level. If there is no level with this label, the extend statement fails at run time.
- By default, if the entry has no level, the levels defined in the statement are inserted in the entry. Otherwise the first defined level is inserted at the first level of the entry, extending or modifying it. The other levels are inserted afterwards (before the possible second level which may previously exist in the entry).
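For example, assuming the entry expr of the section "Associativity" below already has a level labelled "simple", a later EXTEND statement could insert a new level just before it; the rule itself is only a sketch:

  EXTEND
    expr: BEFORE "simple"
      [ "unary" [ "-"; x = SELF -> 0.0 -. x ] ]
    ;
  END;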
Levels
After the optional "position", the level list follow. The levels are separated by vertical bars, the whole list being between brackets.
A level starts with an optional label, which corresponds to its name. This label is useful to specify this level in case of future extension, using the position (see previous section).
The level continues with an optional associativity indicator, which can be:
- LEFTA for left associativity (default),
- RIGHTA for right associativity,
- NONA for no associativity.
Rules
Finally comes the list of grammar rules. The rules are separated by vertical bars, the whole list being between brackets.
A rule looks like a match case in the "match" statement or a parser case in the "parser" statement: a list of psymbols (see next paragraph) separated by semicolons, followed by a right arrow and an expression, the semantic action. Actually, the right arrow and expression are optional: in this case, it is equivalent to an expression which would be the unit "()" constructor.
A psymbol is either a pattern followed by the equal sign and a symbol, or a symbol alone. It corresponds to a test of this symbol, whose value is bound to the pattern, if any.
Symbols
A symbol is either:
- a keyword (a string): the input must match this keyword,
- a token name (an identifier starting with an uppercase character), optionally followed by a string: the input must match this token (any value if no string, or that string if a string follows the token name), the list of the available tokens depending on the associated lexer (the list of tokens available with "Plexer.gmake ()" is: LIDENT, UIDENT, TILDEIDENT, TILDEIDENTCOLON, QUESTIONIDENT, INT, INT_l, INT_L, INT_n, FLOAT, CHAR, STRING, QUOTATION, ANTIQUOT and EOI; other lexers may propose other lists of tokens),
- an entry name, which corresponds to a call to this entry,
- an entry name followed by the identifier "LEVEL" and a level label, which corresponds to a call to this entry at that level,
- the identifier "SELF" which is a recursive call to the present entry, according to the associativity (i.e. it may be a call at the current level, to the next level, or to the top level of the entry): "SELF" is equivalent to the name of the entry itself,
- the identifier "NEXT", which is a call to the next level of the current entry,
- a left bracket, followed by a list of rules separated by vertical bars, and a right bracket: equivalent to a call to an entry with these rules, inlined,
- a meta symbol (see further),
- a symbol between parentheses.
The syntactic analysis follows the list of symbols. If it fails, then depending on the first items of the rule (see the section about the kind of grammar recognized):
- the parsing may fail by raising the exception "Stream.Error"
- the parsing may continue with the next rule.
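As an illustration of the kinds of symbols above, here is a hypothetical rule combining a keyword, a token with a specific value, and an inlined rule list; this is only a sketch, reusing the grammar g from the "Getting started" section, and the entry name "command" is invented:

  value command = Grammar.Entry.create g "command";

  EXTEND
    command:
      [ [ "set"; UIDENT "DEBUG"; v = [ "on" -> True | "off" -> False ] -> v ] ]
    ;
  END;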
Meta symbols
Extra symbols exist for manipulating lists or optional symbols. They are:
- LIST0 followed by a symbol: this is a list of this symbol, possibly empty,
- LIST0 followed by a symbol, SEP and another symbol: this is a list, possibly empty, of the first symbol separated by the second one,
- LIST1 followed by a symbol: this is a list of this symbol, with at least one element,
- LIST1 followed by a symbol, SEP and another symbol: this is a list, with at least one element, of the first symbol separated by the second one,
- OPT followed by a symbol: equivalent to "this symbol or nothing".
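For example, here is a sketch using LIST0 with SEP and OPT, reusing the grammar g from the "Getting started" section; the entry name "params" is invented:

  value params = Grammar.Entry.create g "params";

  EXTEND
    params:
      [ [ "("; l = LIST0 LIDENT SEP ","; ")";
          t = OPT [ ":"; ty = LIDENT -> ty ] -> (l, t) ] ]
    ;
  END;

With this sketch, parsing the string "(x, y) : t" should yield (["x"; "y"], Some "t").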
Rules insertion
Remember that "EXTEND" is a statement, not a declaration: the rules are added in the entries at run time. Each rule is internally inserted in a tree, allowing the left factorization of the rule. For example, with this list of rules (borrowed from the camlp5 sources):
"method"; "private"; "virtual"; l = label; ":"; t = poly_type "method"; "virtual"; "private"; l = label; ":"; t = poly_type "method"; "virtual"; l = label; ":"; t = poly_type "method"; "private"; l = label; ":"; t = poly_type; "="; e = expr "method"; "private"; l = label; sb = fun_binding "method"; l = label; ":"; t = poly_type; "="; e = expr "method"; l = label; sb = fun_binding
the rules are inserted in a tree and the result looks like:
"method" |-- "private" | |-- "virtual" | | |-- label | | |-- ":" | | |-- poly_type | |-- label | |-- ":" | | |-- poly_type | | |-- ":=" | | |-- expr | |-- fun_binding |-- "virtual" | |-- "private" | | |-- label | | |-- ":" | | |-- poly_type | |-- label | |-- ":" | |-- poly_type |-- label |-- ":" | |-- poly_type | |-- "=" | |-- expr |-- fun_binding
This tree is built as rules are inserted. When the entry is used, by applying the function "Grammar.Entry.parse" to it, the input is matched against that tree, starting from the tree root and descending it as the parsing advances.
There is a separate tree for each entry level.
Semantic action
The semantic action, i.e. the expression following the right arrow in rules, contains in its environment:
- the variables bound by the patterns of the symbols found in the rules,
- the specific variable "loc" which contain the location of the whole rule in the source.
The location is an abstract type defined in the module "Stdpp" of camlp5.
It is possible to change the name of this variable by using the option "-loc" of camlp5. For example, compiling a file like this:
camlp5r -loc foobar file.ml
the variable name for the location will be "foobar" instead of "loc".
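For instance, an action can use "loc" to return the location together with the parsed value; a minimal sketch, where "num" is a hypothetical entry created with "Grammar.Entry.create":

  EXTEND
    num: [ [ x = INT -> (loc, int_of_string x) ] ];
  END;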
The DELETE_RULE statement
The "DELETE_RULE" statement is also added in the expressions of the ocaml language when the syntax extension kit "pa_extend.cmo" is loaded. Its syntax is:
      expression ::= delete-rule
     delete-rule ::= "DELETE_RULE" delete-rule-body "END"
delete-rule-body ::= entry-name ":" symbols
         symbols ::= symbol symbols
                   | symbol
See the syntax of the EXTEND statement for the meaning of the syntax entries not defined above.
The entry is scanned for a rule matching the given symbol list. When found, the rule is removed. If no rule is found, the exception "Not_found" is raised.
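For example, the subtraction rule of the expr entry defined in the section "Associativity" below could be removed like this (a sketch, with the symbols written as in an EXTEND rule):

  DELETE_RULE expr: SELF; "-"; SELF END;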
Extensions FOLD0 and FOLD1
When loading "pa_extfold.cmo" after "pa_extend.cmo", the entry "symbol" of the EXTEND statement is extended with what is named the fold iterators, like this:
symbol ::= "FOLD0" simple_expr simple_expr symbol | "FOLD1" simple_expr simple_expr symbol | "FOLD0" simple_expr simple_expr symbol "SEP" symbol | "FOLD1" simple_expr simple_expr symbol "SEP" symbol simple_expr ::= expr (level "simple")
Like their equivalents among the list iterators ("LIST0", "LIST1", "LIST0SEP", "LIST1SEP"), they read a sequence of symbols, possibly with separators, but instead of building the list of these symbols, they apply a fold function to each symbol, starting with the second "expr" (which must be an expression node) and continuing with the first "expr" (which must be a function taking two expressions and returning a new expression).
The list iterators can be seen almost as a specific case of these fold iterators where the initial "expr" would be:
<:expr< [] >>
and the fold function would be:
fun e1 e2 -> <:expr< [$e1$ :: $e2$ ] >>
except that, implemented like that, they would return the list in reverse order.
Actually, a program using them could also be written with the list iterators, with a semantic action applying the function "List.fold_left" to the returned list; with the fold iterators, however, this operation is done as the symbols are read from the input, without building an intermediate list.
Example, file "sum.ml":
#load "pa_extend.cmo"; #load "pa_extfold.cmo"; #load "q_MLast.cmo"; let loc = Stdpp.dummy_loc in EXTEND Pcaml.expr: [ [ "sum"; e = FOLD0 (fun e1 e2 -> <:expr< $e2$ + $e1$ >>) <:expr< 0 >> Pcaml.expr SEP ";"; "end" -> e ] ] ; END;
which can be compiled like this:
ocamlc -pp camlp5r -I +camlp5 -c sum.ml
and tested:
ocaml -I +camlp5 camlp5r.cma sum.cmo
        Objective Caml version ...

        Camlp5 Parsing version ...

# sum 3;4;5 end;
- : int = 12
Extensions SLIST0, SLIST1 and SOPT
The specific iterators "SLIST0", "SLIST1", "SLIST0SEP", "SLIST1SEP" and "SOPT" are used in the file "q_MLast.ml". They make it possible to generate rules for antiquotations of kind "list" and "opt".
They are not supposed to be used by the programmer.
For information:
The symbol "SLIST0 symb" is equivalent to the rule symbol:
[ a = a_list -> a | a = LIST0 symb -> Qast.List a ]
Same for the other specific iterators.
The entry "a_list" and the constructor "Qast.List" are locally defined in "q_MLast.ml". This system allows the updating of the source file "q_MLast.ml" (syntax tree quotations, with antiquotations, in revised syntax) from the file "pa_r.ml" (revised syntax). These grammars are close the one to the other, except that "q_MLast.ml" can parse antiquotations, what is done by replacing the list iterators by these specific iterators. This operation is done through the shell script "mk_q_MLast.sh" in the "meta" directory of the camlp5 sources.
Grammar machinery
We explain here in detail the mechanism of parsing an entry.
Start and Continue
At each entry level, the rules are separated into two trees:
- The tree of the rules starting neither with the current entry name nor with "SELF".
- The tree of the rules starting with the current entry name or with the identifier "SELF", this first symbol not being included in the tree.
They determine two functions:
- The function named "start", analyzing the first tree.
- The function named "continue", taking, as parameter, a value previously parsed, and analyzing the second tree.
A call to an entry, using "Grammar.Entry.parse", corresponds to a call to the "start" function of the first level of the entry.
The "start" function tries its associated tree. If it works, it calls the "continue" function of the same level, giving the result of "start" as parameter. If this "continue" function fails, this parameter is simply returned. If the "start" function fails, the "start" function of the next level is tested. If there is no more levels, the parsing fails.
The "continue" function first tries the "continue" function of the next level. If it fails, or if it is the last level, it tries its associated tree, then calls itself again, giving the result as parameter. If its associated tree fails, it returns its extra parameter.
Associativity
While testing the tree, there is a special case for rules ending with SELF or with the current entry name. For this last symbol, there is a call to the "start" function: of the current level if the level is right associative, or of the next level otherwise.
There is no behavioural difference between left associative and non-associative, because, in case of syntax error, the system anyway attempts to recover by applying the "continue" function of the previous symbol (if this symbol is a call to an entry).
When a SELF or the current entry name is encountered in the middle of the rule (i.e. if it is not the last symbol), there is a call to the "start" function of the first level of the current entry.
Example. Let us consider the following grammar:
EXTEND expr: [ "minus" LEFTA [ x = SELF; "-"; y = SELF -> x -. y ] | "power" RIGHTA [ x = SELF; "**"; y = SELF -> x ** y ] | "simple" [ "("; x = SELF; ")" -> x | x = INT -> float_of_int x ] ] ; END
The left "SELF"s of the two levels "minus" and "power" correspond to a call to the next level. In the level "minus", the right "SELF" also, and the left associativity is treated by the fact that the "continue" function is called (starting with the keyword "-" since the left "SELF" is not part of the tree). On the other hand, for the level "power", the right "SELF" corresponds to a call to the current level, i.e. the level "power" again. At end, the "SELF" between parentheses of the level "simple" correspond to a call to the first level, namely "minus" in this grammar.
Errors and recovery
Like for stream parsers, two exceptions may happen: "Stream.Failure" or "Stream.Error". The first one indicates that the parsing just could not start. The second one indicates that the parsing started but failed further.
In stream parsers, when the first symbol of a rule has been accepted, all the symbols of the same rule must be accepted, otherwise the exception "Stream.Error" is raised.
Here, in extensible grammars, unlike stream parsers, before the "Stream.Error" exception, the system attempts to recover the error by the following trick: if the previous symbol of the rule was a call to another entry, the system calls the "continue" function of that entry, which may resolve the problem.
In extensible grammars, the exceptions are encapsulated with the exception "Stdpp.Exc_located" giving the location of the error together with the exception itself.
Tokens starting rules
Another improvement (besides the error recovery) is that, when a rule starts with several tokens and/or keywords, all these tokens and keywords are tested at once, and a possible "Stream.Error" can only happen from the symbol following them, if any.
Kind of grammar
The grammars are predictive parsing grammars, i.e. recursive descent parsers without backtracking, with some nuances due to the improvements (error recovery and tokens starting rules) described in the previous sections.
The Grammar module
The Grammar module contains everything necessary to manipulate grammars and entries. It contains:
Main types and values
- type g = 'abstract;
- The type of grammars, holding entries.
- value gcreate : Token.glexer (string * string) -> g;
- Create a new grammar, without keywords, using the lexer given as parameter.
- value tokens : g -> string -> list (string * int);
-
Given a grammar and a token pattern constructor, returns the list of
the corresponding values currently used in all entries of this grammar.
The integer is the number of times this pattern value is used.
Examples:
- The call [Grammar.tokens g ""] returns the keywords list.
- The call [Grammar.tokens g "IDENT"] returns the list of all usages of the pattern "IDENT" in the EXTEND statements.
- value glexer : g -> Token.glexer token;
- Return the lexer used by the grammar.
- type parsable = 'abstract;
- value parsable : g -> Stream.t char -> parsable;
- Type and value allowing one to keep the same token stream across several calls of entries of the same grammar, to prevent loss of tokens. To be used with Entry.parse_parsable below.
module Entry :
  sig
    type e 'a = 'x;
    value create : g -> string -> e 'a;
    value parse : e 'a -> Stream.t char -> 'a;
    value parse_token : e 'a -> Stream.t token -> 'a;
    value parse_parsable : e 'a -> parsable -> 'a;
    value name : e 'a -> string;
    value of_parser : g -> string -> (Stream.t token -> 'a) -> e 'a;
    value print : e 'a -> unit;
    value find : e 'a -> string -> e Obj.t;
    external obj : e 'a -> Gramext.g_entry token = "%identity";
  end;
-
Module to handle entries.
- /Entry.e/ is the type for entries returning values of type ['a].
- /Entry.create g n/ creates a new entry named [n] in the grammar [g].
- /Entry.parse e/ returns the stream parser of the entry [e].
- /Entry.parse_token e/ returns the token parser of the entry [e].
- /Entry.parse_parsable e/ returns the parsable parser of the entry [e].
- /Entry.name e/ returns the name of the entry [e].
- /Entry.of_parser g n p/ makes an entry from a token stream parser.
- /Entry.print e/ displays the entry [e] using [Format].
- /Entry.find e s/ finds the entry named [s] in [e]'s rules.
- /Entry.obj e/ converts an entry into a [Gramext.g_entry] allowing to see what it holds ([Gramext] is visible, but not documented).
- value of_entry : Entry.e 'a -> g;
- Return the grammar associated with an entry.
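As an illustration of the parsable type documented above, two successive entry calls can share the same token stream without losing tokens between them; a sketch, reusing g and exp from the "Getting started" section and assuming exp has been given rules that consume a single word:

  value p = Grammar.parsable g (Stream.of_string "hello world");
  value v1 = Grammar.Entry.parse_parsable exp p;   (* consumes "hello" *)
  value v2 = Grammar.Entry.parse_parsable exp p;   (* consumes "world" *)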
Printing grammar entries
The function "Grammar.Entry.print" displays the current contents of an entry. Interesting for debugging, to look at the result of a syntax extension, to see the names of the levels.
The display does not include the patterns nor the semantic actions, whose sources are not recorded in the grammar entries data.
Moreover, the local entries (see the section about the GLOBAL indicator) are indicated with a star ("*") to inform that they are not directly accessible.
Clearing grammars and entries
module Unsafe :
  sig
    value gram_reinit : g -> Token.glexer token -> unit;
    value clear_entry : Entry.e 'a -> unit;
  end;
-
Module for clearing grammars and entries. To be manipulated with care, because: 1) reinitializing a grammar destroys all its tokens, and there may be problems with the associated lexer if there are keywords; 2) clearing an entry does not destroy the tokens used only by that entry.
- /Unsafe.gram_reinit g lex/ removes the tokens of the grammar [g] and sets [lex] as its new lexer. Warning: the lexer itself is not reinitialized.
- /Unsafe.clear_entry e/ removes all rules of the entry [e].
Functorial interface
An alternative way to use grammars. Grammars are no longer OCaml values: there is no type for them. The generated modules enforce the rule "an entry cannot call an entry of another grammar" through normal OCaml typing.
module type GLexerType =
  sig
    type te = 'x;
    value lexer : Token.glexer te;
  end;
- The input signature for the functor [Grammar.GMake]: [te] is the type of the tokens.
module type S =
  sig
    type te = 'x;
    type parsable = 'x;
    value parsable : Stream.t char -> parsable;
    value tokens : string -> list (string * int);
    value glexer : Token.glexer te;
    module Entry :
      sig
        type e 'a = 'y;
        value create : string -> e 'a;
        value parse : e 'a -> parsable -> 'a;
        value parse_token : e 'a -> Stream.t te -> 'a;
        value name : e 'a -> string;
        value of_parser : string -> (Stream.t te -> 'a) -> e 'a;
        value print : e 'a -> unit;
        external obj : e 'a -> Gramext.g_entry te = "%identity";
      end;
    module Unsafe :
      sig
        value gram_reinit : Token.glexer te -> unit;
        value clear_entry : Entry.e 'a -> unit;
      end;
    value extend :
      Entry.e 'a -> option Gramext.position ->
        list
          (option string * option Gramext.g_assoc *
           list (list (Gramext.g_symbol te) * Gramext.g_action)) ->
        unit;
    value delete_rule : Entry.e 'a -> list (Gramext.g_symbol te) -> unit;
  end;
-
Signature type of the functor [Grammar.GMake]. The types and functions are almost the same as in the generic interface, but:
- Grammars are not values. Functions that took a grammar as parameter no longer take this parameter.
- The type [parsable] is used in function [parse] instead of the char stream, avoiding the possible loss of tokens.
- The type of tokens (expressions and patterns) can be any type (instead of (string * string)); the module parameter must specify a way to show them as (string * string).
- module GMake (L : GLexerType) : S with type te = L.te;
Miscellaneous
- value error_verbose : ref bool;
- Flag for displaying more information in case of parsing error; default = [False].
- value warning_verbose : ref bool;
- Flag for displaying warnings during extension; default = [True].
- value strict_parsing : ref bool;
- Flag to apply strict parsing, without trying to recover errors; default = [False].
- value print_entry : Format.formatter -> Gramext.g_entry 'te -> unit;
- General printer for all kinds of entries (obj entries).
- value iter_entry : (Gramext.g_entry 'te -> unit) -> Gramext.g_entry 'te -> unit;
- [Grammar.iter_entry f e] applies [f] to the entry [e] and transitively all entries called by [e]. The order in which the entries are passed to [f] is the order they appear in each entry. Each entry is passed only once.
- value fold_entry : (Gramext.g_entry 'te -> 'a -> 'a) -> Gramext.g_entry 'te -> 'a -> 'a;
- [Grammar.fold_entry f e init] computes [(f eN .. (f e2 (f e1 init)))], where [e1 .. eN] are [e] and transitively all entries called by [e]. The order in which the entries are passed to [f] is the order they appear in each entry. Each entry is passed only once.
Interface with the lexer
To create a grammar, the function "Grammar.gcreate" must be called, with a lexer as parameter.
A simple solution, as a possible lexer, is the predefined lexer built by "Plexer.gmake ()", the lexer used for the ocaml grammar of camlp5. In this case, you can just pass it as parameter of "Grammar.gcreate" and it is not necessary to read this section.
This section first introduces the notion of "token patterns", which are the way the token and keyword symbols of the EXTEND statement are represented. Then follows the description of the type of the parameter of "Grammar.gcreate".
Token patterns
A token pattern is a value of the type defined like this:
type pattern = (string * string);
This type represents the values of the token and keyword symbols in the grammar rules.
For a token symbol in the grammar rules, the first string is the token constructor name (starting with an uppercase character), the second string indicates whether the match is "any" (the empty string) or some specific value of the token (a non-empty string).
For a keyword symbol, the first string is empty and the second string is the keyword itself.
For example, given this grammar rule:
"for"; i = LIDENT; "="; e1 = SELF; "to"; e2 = SELF
the different symbols and keywords are represented by the following couples of strings:
- the keyword "for" is represented by ("", "for"),
- the keyword "=" by ("", "="),
- the keyword "to" by ("", "to")),
- and the token symbol LIDENT by ("LIDENT", "").
The symbol UIDENT "Foo" in a rule would be represented by the token pattern:
("UIDENT", "Foo")
Notice that the symbol "SELF" is a specific symbol of the EXTEND syntax: it does not correspond to a token pattern and is represented differently. A token constructor name must not belong to the specific symbols: SELF, NEXT, LIST0, LIST1 and OPT.
The glexer record
The type of the parameter of the function "Grammar.gcreate" is "glexer", defined in the module "Token". It is a record type with the following fields:
tok_func
It is the lexer itself. Its type is:
Stream.t char -> (Stream.t (string * string) * location_function);
The lexer takes a character stream as parameter and must return a couple of a token stream (the tokens being represented by a couple of strings) and a location function.
The location function takes, as parameter, an integer corresponding to a token number in the stream (starting from zero), and returns the location of this token in the source. It is important for getting correct locations in the semantic actions of the grammar rules.
Notice that, although the lexer takes a character stream as parameter, it is not mandatory to use the stream parser technology to write the lexer. What is important is that it does the job.
tok_using
It is a function of type:
pattern -> unit
The parameter of this function is the representation of a token symbol or a keyword symbol in grammar rules. See the section about token patterns.
This function is called for each token symbol and each keyword encountered in the grammar rules of the EXTEND statement. Its goal is to allow the lexer to check that the tokens and keywords do respect the lexer rules. It checks that the tokens exist and are not misspelled. It can also be used to enter the keywords into the lexer keyword tables.
It is possible to set it to a function that does nothing, but then the correctness of tokens is not checked.
In case of error, the function must raise the exception "Token.Error" with an error message as parameter.
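As an illustration, here is a sketch of a possible "tok_using" function that records keywords in a (hypothetical) hash table of the lexer and checks token constructor names:

  value kwd_table = Hashtbl.create 301;

  value tok_using =
    fun
    [ ("", kwd) -> Hashtbl.replace kwd_table kwd ()      (* register the keyword *)
    | ("LIDENT" | "UIDENT" | "INT" | "STRING", _) -> ()  (* known tokens: nothing to do *)
    | (con, _) -> raise (Token.Error ("no such token: " ^ con)) ]
  ;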
tok_removing
It is a function of type:
pattern -> unit
It is called by the DELETE_RULE statement for each token symbol and each keyword of which one occurrence is no longer used. This can be interesting for keywords, if the lexer records the number of occurrences of the keywords: when the number of occurrences falls to zero, the keyword can be removed from the lexer tables.
tok_match
It is a function of type:
pattern -> ((string * string) -> unit)
The function tells how a token of the input stream is matched against a token pattern. Both are represented by a couple of strings.
This function takes a token pattern as parameter and returns a function matching a token, returning the matched string or raising the exception "Stream.Failure" if the token does not match.
Notice that, for efficiency, it is necessary to write this function as a match of token patterns returning, for each case, the function which matches the token, not a function matching the token pattern and the token together and returning a string for each case.
An acceptable function is provided in the module "Token" and is named "default_match". Its code looks like this:
value default_match =
  fun
  [ (p_con, "") ->
      fun (con, prm) -> if con = p_con then prm else raise Stream.Failure
  | (p_con, p_prm) ->
      fun (con, prm) ->
        if con = p_con && prm = p_prm then prm else raise Stream.Failure ]
;
tok_text
It is a function of type:
pattern -> string
Intended for error messages, it takes a token pattern as parameter and returns the string giving its name.
It is possible to use the predefined function "lexer_text" of the Token module. This function just returns the name of the token pattern constructor and its parameter if any.
For example, with this default function, the token symbol IDENT would be written as IDENT in error messages (e.g. "IDENT expected"). The "text" function may decide to print it differently, e.g., as "identifier".
tok_comm
It is a mutable field of type:
option (list location)
It asks the lexer (the lexer function should do it) to record the locations of the comments in the program. Setting this field to "None" indicates that the lexer must not record them. Setting it to "Some []" indicates that the lexer must put the comment location list in this field, which is mutable.
Minimalist version
If a lexer named "lexer" has been written, here is the minimalist version of the value suitable as parameter to "Grammar.gcreate":
{Token.tok_func = lexer;
 Token.tok_using _ = ();
 Token.tok_removing _ = ();
 Token.tok_match = Token.default_match;
 Token.tok_text = Token.lexer_text;
 Token.tok_comm = None}
Functorial interface
The normal interface for grammars described in the previous sections has two drawbacks:
- First, the type of tokens of the lexers must be "(string * string)"
- Second, since the entry type has no parameter to specify the grammar it is bound to, there is no static check that entries are compatible, i.e. belong to the same grammar. The check is done at run time.
The functorial interface resolves these two problems. The functor takes as parameter a module where the token type has to be defined, together with the lexer returning streams of tokens of this type. The resulting module defines entries that are all compatible with one another, and this is controlled by the ocaml type checker.
The syntax extension must be done with the statement GEXTEND, instead of EXTEND, and deletion by GDELETE_RULE instead of DELETE_RULE.
The glexer type
In the section about the interface with the lexer, we presented the glexer type as a record without type parameter. Actually, this type is defined as:
type glexer 'te =
  { tok_func : lexer_func 'te;
    tok_using : pattern -> unit;
    tok_removing : pattern -> unit;
    tok_match : pattern -> 'te -> string;
    tok_text : pattern -> string;
    tok_comm : mutable option (list location) }
;
where the type parameter is the type of the token, which can be any type, different from "(string * string)", provided the lexer function (tok_func) returns a stream of this token type and the match function (tok_match) indicates how to match values of this token type against the token patterns (which remain defined as "(string * string)").
Here is an example of a user token type and the associated match function:
type mytoken =
  [ Ident of string
  | Int of int
  | Comma
  | Equal
  | Keyw of string ]
;

value mymatch =
  fun
  [ ("IDENT", "") ->
      fun [ Ident s -> s | _ -> raise Stream.Failure ]
  | ("INT", "") ->
      fun [ Int i -> string_of_int i | _ -> raise Stream.Failure ]
  | ("", ",") ->
      fun [ Comma -> "" | _ -> raise Stream.Failure ]
  | ("", "=") ->
      fun [ Equal -> "" | _ -> raise Stream.Failure ]
  | ("", s) ->
      fun
      [ Keyw k -> if k = s then "" else raise Stream.Failure
      | _ -> raise Stream.Failure ]
  | _ -> raise (Token.Error "bad token in match function") ]
;
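To use this token type, the match function is placed in a glexer record together with a lexing function; a sketch, where "mylexer" is a hypothetical function of type "Stream.t char -> (Stream.t mytoken * location_function)":

  value myglexer =
    {Token.tok_func = mylexer;
     Token.tok_using _ = ();
     Token.tok_removing _ = ();
     Token.tok_match = mymatch;
     Token.tok_text = Token.lexer_text;
     Token.tok_comm = None}
  ;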
The functor parameter
The type of the functor parameter is defined as:
module type GLexerType =
  sig
    type te = 'x;
    value lexer : Token.glexer te;
  end;
The token type must be specified (type "te"), as well as the lexer, which follows the interface for lexers, of the glexer type defined above; the record fields are described in the section "Interface with the lexer", but with a general token type.
The resulting grammar module
Once a module of type "GLexerType" has been built (previous section), it is possible to create a grammar module by applying the functor "Grammar.GMake". For example:
module MyGram = Grammar.GMake MyLexer;
Notice that the function "Entry.parse" of this resulting module does not take a character stream as parameter, but a value of type "parsable". This function is equivalent to the function "parse_parsable" of the non functorial interface. In short, the parsing of some character stream "cs" by some entry "e" of the example grammar above, must be done by:
MyGram.Entry.parse e (MyGram.parsable cs)
instead of:
MyGram.Entry.parse e cs
GEXTEND and GDELETE_RULE
The "GEXTEND" and "GDELETE_RULE" statements are also added in the expressions of the ocaml language when the syntax extension kit "pa_extend.cmo" is loaded. They have to be used for grammars defined with the functorial interface. Their syntax are:
         expression ::= gextend
                      | gdelete-rule
       gdelete-rule ::= "GDELETE_RULE" gdelete-rule-body "END"
            gextend ::= "GEXTEND" gextend-body "END"
       gextend-body ::= grammar-module-name extend-body
  gdelete-rule-body ::= grammar-module-name delete-rule-body
grammar-module-name ::= qualid
See the syntax of the EXTEND statement for the meaning of the syntax entries not defined above.
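For example, with the module MyGram of the previous section (built from the mytoken lexer above), an entry could be extended like this; a sketch, the entry name "binding" being invented:

  value binding = MyGram.Entry.create "binding";

  GEXTEND MyGram
    binding: [ [ i = IDENT; "="; j = INT -> (i, int_of_string j) ] ];
  END;

It would then be used with "MyGram.Entry.parse binding (MyGram.parsable cs)", as shown above.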
An example
Here is a small calculator of expressions. They are given as parameters of the command.
File "calc.ml":
#load "pa_extend.cmo"; value g = Grammar.gcreate (Plexer.gmake ()); value e = Grammar.Entry.create g "expression"; EXTEND e: [ [ x = e; "+"; y = e -> x + y | x = e; "-"; y = e -> x - y ] | [ x = e; "*"; y = e -> x * y | x = e; "/"; y = e -> x / y ] | [ x = INT -> int_of_string x | "("; x = e; ")" -> x ] ] ; END; open Printf; for i = 1 to Array.length Sys.argv - 1 do { let r = Grammar.Entry.parse e (Stream.of_string Sys.argv.(i)) in printf "%s = %d\n" Sys.argv.(i) r; flush stdout; };
The link needs the library "gramlib.cma" provided with camlp5:
ocamlc -pp camlp5r -I +camlp5 gramlib.cma calc.ml -o calc
Examples:
$ ./calc '239*4649'
239*4649 = 1111111
$ ./calc '(47+2)/3'
(47+2)/3 = 16