Stream lexers
The file "pa_lex.cmo" is a Camlp5 syntax extension kit for parsers of streams of the type 'char'. This syntax is shorter and more readable than its equivalent version written with classical stream parsers. But, like classical parsers, they use recursive descendant parsing. They are also pure syntax sugar, and each lexer written with this syntax can be written using normal parsers syntax.
Introduction
Classical parsers in OCaml apply to streams of any type of values. For the specific type "char", it has been possible to shorten encoding in several ways, in particular by using strings to group several characters together, and by hidding the management of a "lexing buffer", a data structure recording the matched characters.
Let us take an example. The following function parses an identifier, composed of letters, digits, underscores, quotes and utf-8 bytes, and record the result in a buffer. In classical parsers syntax, this could be written like this:
value rec ident buf = parser [ [: `('A'..'Z' | 'a'..'z' | '0'..'9' | '_' | ''' | '\128'..'\255' as c); buf = ident (B.add c buf) ! :] -> buf | [: :] -> buf ] ;
With the new syntax, it is possible to write it as:
value rec ident = lexer [ "A..Za..z0..9_'\128..\255" ident! | ];
The two codes are strictly equivalent, but the lexer version is easier to write and understand, and is much shorter.
Syntax
When loading the syntax extension pa_lex.cmo
, the
OCaml syntax is extended as follows:
expression ::= lexer lexer ::= "lexer" "[" rules "]" rules ::= rules rule | <nothing> rule ::= symbols [ "->" action ] symbols ::= symbols symbol err | <nothing> symbol ::= "_" no-record-opt | STRING no-record-opt | simple-expression | "?=" "[" lookaheads "]" | "[" rules "]" no-record-opt ::= "/" | <nothing> simple-expression ::= LIDENT | CHAR | "(" <expression> ")" lookaheads ::= lookaheads "|" lookahead-sequence | lookahead-sequence lookahead-sequence ::= lookahead-symbols | STRING lookahead-symbols ::= lookahead-symbols lookahead-symbol | lookahead-symbol lookahead-symbol ::= CHAR | "_" err ::= "?" simple-expression | "!" | <nothing> action ::= expression
The identifiers "STRING
", "CHAR
" and
"LIDENT
" above represent the OCaml tokens corresponding
to string, character and lowercase identifier (identifier starting
with a lowercase character).
Moreover, together with that syntax extension, another extension is
added the entry expression
, typically for the semantics
actions of the "lexer
" statement above, but not only. It
is:
expression ::= "$" "add" STRING | "$" "buf" | "$" "empty" | "$" "pos"
Remark: the identifiers "add", "buf", "empty" and "pos" do not become keywords (they are not reserved words) but just identifiers. On the contrary, the identifier "lexer", which introduces the syntax, is a new keyword and cannot be used as variable identifier any more.
Semantics
A lexer defined in the syntax above is a shortcut version of a parser applied to the specific case of streams of characters. It could be written with a normal parser. The proposed syntax is much shorter, easier to use and to understand, and silently takes care of the lexing buffer for the programmer. The lexing buffers are data structures, which are passed as parameters to called lexers and returned by them.
Our lexers are of the type:
B.t -> Stream.t char -> u
where "u
" is a type which depends on what the lexer
returns. If there is no semantic action (since it it optional), this
type is automatically "B.t
".
"B
" is a module which must be defined by the user. It
has to contain the lexing buffer type "t
" and some
variables and functions:
empty
: the empty lexing bufferadd
: the way to add a character to a lexing bufferget
: the way to get a string from the lexing buffer
A possible implementation, using "list char
" as lexing
buffer type ("B.t
"), recording the characters at top of
the list (therefore creating a list in reverse order) could be:
(* tool function, converting a reversed list of char into a string *) value rev_implode cl = let s = String.create (List.length cl) in loop (String.length s - 1) cl where rec loop i = fun [ [c :: cl] -> do { s.[i] := c; loop (i - 1) cl } | [] -> s ] ; (* the lexing buffer module *) module B = struct type t = list char; value empty = []; value add c l = [c :: l]; value get = rev_implode; end ;
A lexer is a function with two parameters: the first one is the
lexing buffer itself, and the second one the stream. When called, it
tries to match the stream against its first rule. If it fails, it
tries its second rule, and so on, up to its last rule. If the last
rule fails, the lexer fails by raising the exception
"Stream.Failure
". All of this is
the usual behaviour of stream parsers.
In a rule, when a character is matched, it is inserted into the lexing buffer, except if the "no record" feature is used (see further).
Rules which have no semantic action return the lexing buffer itself.
Symbols
The different kinds or symbols in a rule are:
- The token "underscore", which represents any character. Fails only if the stream is empty.
- A string which represent a matching of any character in the
string. Notice that it is a choice between
all the characters, not the sequence of
these characters. To indicate a sequence, you have to use several
symbols, the ones behind the others. Character ranges can be
inserted using the characters "
..
". For example, to specify a match of any letter or digit, you can write "A..Za..z0..9
". Conversely, the beginning of a comment in the OCaml language has to be written:"(" "*"
, not"(*"
which would mean "the character left parenthesis or the character star". - An expression corresponding to a call to another lexer, which takes the buffer as first parameter and has to return another lexing buffer with all characters found in the stream catened to the lexing buffer.
- The sequence "
?=
" introducing lookahead characters. - A rule, recursively, between brackets, inlining a lexer.
In the first two cases (namely, underscore and string), the symbol can be optionally followed by the character "slash" specifying that the found symbol must not be added into the lexing buffer. By default, it is. Useful, for example, when writing a lexer parsing strings, when the initial double quote and final double quote have not to be part of the string itself.
Moreover, a symbol can be followed by an optional error indicator, which can be:
- The character
?
(question mark) followed by a string expression, telling that, if there is a syntax error at this point (i.e. the symbol is not matched although the beginning of the rule was), the exceptionStream.Error
is raised with that string as parameter. Without this indicator, it is raised with the empty string. This is the same behaviour than with classical stream parsers. - The character
!
(exclamation mark), which is just an indicator to let the syntax expander optimize the code. If the programmer is sure that the symbol never fails (i.e. never raisesStream.Failure
), in particular if this symbol recognizes the empty rule, he can add this exclamation mark. If it is used correctly (the compiler cannot check it), the behaviour is identical as without the!
, except that the code is shorter and faster, and can sometimes be tail recursive. If the indication is not correct, the behaviour of the lexer is undefined. This feature exists also in classical stream parsers (it is a new feature added in 2007).
Specific expressions
When loading this syntax extension, the
entry <expression>
, at level labelled "simple" of
the OCaml language is extended with the following
rules:
$add
followed by a string, specifing that the programmer wants to add all characters of the string in the lexing buffer. It returns the new lexing buffer. It corresponds to an iteration of calls toB.add
with all characters of the string with the current lexing buffer as initial parameter.$buf
which returns the lexing buffer converted into string.$empty
which returns an empty lexing buffer.$pos
which return the current position in the stream in number of characters (starting at zero).
Lookahead
Lookahead is useful in some cases, when factorization of rules is impossible. To understand how it is useful, a first remark must be done, about the usual behaviour of Camlp5 stream parsers.
Stream parsers (including these lexers) use a limited parsing algorithm, in a way that when the first symbol of a rule is matched, all the following symbols of the same rule must apply, otherwise it is a syntax error. There is no backtrack. In most of the cases, left factorization of rules resolve conflicting problems. For example, in parsers of tokens (which is not our case here, since we parse only characters), when one writes a parser to recognize both the typicall grammar rules "if..then..else" and the shorter "if..then..", the solution is to write a rule starting with "if..then.." followed with a call to a parser recognizing "else.." or nothing.
Sometimes, however, this left factorization is not possible. A lookahead of the stream to check the presence of some elements (these element being characters, if we are using this "lexer" syntax) might be necessary to decide whether or not it is a good idea to start the rule. This lookahead feature may unfreeze several characters from the input stream but without removing them.
Syntactically, a lookahead starts with ?=
and is
followed by one or several lookahead sequences separated by the
vertical bar |
, the whole list being enclosed by
braces.
If there are several lookaheads, they must all be of the same size (contain the same number of characters).
If the lookahead sequence is just a string, it corresponds to all characters of this string in the order (which is different for strings outside lookahead sequences, representing a choice of all characters).
Examples of lookaheads:
?= [ _ ''' | '\\' _ ] ?= [ "<<" | "<:" ]
The first line above matches a stream whose second character is a
quote or a stream whose first character is a backslash (real example
in the lexer of OCaml, in the library of Camlp5, named
"plexer.ml"). The second line matches a stream starting with the two
characters <
and <
or starting with
the two characters <
and :
(this is
another example in the same file).
Semantic actions of rules
By default, the result of a "lexer" is the current lexing buffer,
which is of type "B.t
". But it is possible to return
other values, by adding "->
" at end of rules followed
by the expression you want to return, like in usual pattern matching
in OCaml.
An interesting result, for example, could be the string
corresponding to the characters of the lexing buffer. This can be
obtained by returning the value "$buf
".
A complete example
A complete example can be seen in the sources of Camlp5, file "lib/plexer.ml". This is the lexer of OCaml, either "normal" or "revised" syntax.
Compiling
To compile a file containing lexers, just
load pa_lex.cmo
using one of the following methods:
- Either by adding
pa_lex.cmo
among the Camlp5 options. See the Camlp5 manual page or documentation. - Or by adding
#load "pa_lex.cmo";
anywhere in the file, before the usages of this "lexer" syntax.
How to display the generated code
You can see the generated code, for a file "bar.ml" containing lexers, by typing in a command line:
camlp5r pa_lex.cmo pr_r.cmo bar.ml
To see the equivalent code with stream parsers, use:
camlp5r pa_lex.cmo pr_r.cmo pr_rp.cmo bar.ml