Lexers of streams - pa_lex.cmo

The file pa_lex.cmo is a Camlp5 syntax extension for parsers for streams of the type 'char'. This syntax is shorter in length and more readable than its equivalent version written with classical stream parsers. Like classical parsers, they use recursive descendant parsing. They are also pure syntax sugar, and each lexer written with this syntax can be written using normal parsers syntax.

Introduction

Classical parsers in OCaml apply to streams of any type of values. But, for the specific type "char", it has been possible to shorten codes in several ways, in particular by using strings to group several characters together, and by hidding the management of a "lexing buffer", a data structure recording the matched characters.

In order to motivate our work, here is a small example. Let us look at the following parser. It parses an identifier, composed of letters, digits, underscores, quotes and utf-8 characters, and record the result in a buffer. In classical parsers syntax, this could be written like this:

value rec ident buf =
  parser
  [ [: `('A'..'Z' | 'a'..'z' | '0'..'9' | '_' | ''' | '\128'..'\255'
       as c); buf = ident (B.add c buf) ! :] -> buf
  | [: :] -> buf ];

With the new syntax, it is possible to write it as:

value rec ident = lexer [ "A..Za..z0..9_'\128..\255" ident! | ];

The two codes are strictly equivalent, but the lexer version is easier to write and understand, and is much shorter.

Syntax

When loading the syntax extension pa_lex.cmo, the OCaml syntax is extended as follows:

        expression ::= lexer
             lexer ::= "lexer" "[" rules "]"
             rules ::= rules rule
                     | <nothing>
              rule ::= symbols [ "->" action ]
           symbols ::= symbols symbol err
                     | <nothing>
            symbol ::= "_" no_record_opt
                     | STRING no_record_opt
                     | simple_expression
                     | "?=" "[" lookaheads "]"
                     | "[" rules "]"
     no_record_opt ::= "/"
                     | <nothing>
 simple_expression ::= LIDENT
                     | CHAR
                     | "(" <expression> ")"
        lookaheads ::= lookaheads "|" lookahead_sequence
                     | lookahead_sequence
lookahead_sequence ::= lookahead_symbols
                     | STRING
 lookahead_symbols ::= lookahead_symbols lookahead_symbol
                     | lookahead_symbol
  lookahead_symbol ::= CHAR
                     | "_"
               err ::= "?" simple_expression
                     | "!"
                     | <nothing>
            action ::= expression

The identifiers "STRING", "CHAR" and "LIDENT" above represent the OCaml tokens corresponding to string, character and lowercase identifier (identifier starting with a lowercase character).

Moreover, together with that syntax extension, another extension is added the entry expression, typically for the semantics actions of the "lexer" statement above, but not only. It is:

expression ::= "$" "add" STRING
             | "$" "buf"
             | "$" "empty"
             | "$" "pos"

Remark: the identifiers "add", "buf", "empty" and "pos" do not become keywords (they are not reserved words) but just identifiers. On the contrary, the identifier lexer, which introduces the syntax, is a new keyword and cannot be used as variable identifier any more.

Semantics

A lexer defined in the syntax above is a shortcut version of a parser applied to the specific case of streams of characters. It could be written with a normal parser. The proposed syntax is much shorter, easier to use and to understand, and silently takes care of the lexing buffer for the programmer. The lexing buffers are data structures, which are passed as parameters to called lexers and returned by them.

Our lexers are of the type:

B.t -> Stream.t char -> u

where "u" is a type which depends on what the lexer returns. If there is no semantic action (since it it optional), this type is automatically "B.t".

"B" is a module which must be defined by the user. It has to contain the lexing buffer type "t" and some variables and functions:

A possible implementation, using "list char" as lexing buffer type ("B.t"), recording the characters at top of the list (therefore creating a list in reverse order) could be:

(* tool function, converting a reversed list of char into a string *)
value rev_implode cl =
  let s = String.create (List.length cl) in
  loop (String.length s - 1) cl where rec loop i =
    fun
    [ [c :: cl] -> do { s.[i] := c; loop (i - 1) cl }
    | [] -> s ]
 ;

 (* the lexing buffer module *)
 module B =
   struct
     type t = list char;
     value empty = [];
     value add c l = [c :: l];
     value get = rev_implode;
   end
;

A lexer is a function with two parameters: the first one is the lexing buffer itself, and the second one the stream. When called, it tries to match the stream against its first rule. If it fails, it tries its second rule, and so on, up to its last rule. If the last rule fails, the lexer fails by raising the exception "Stream.Failure". All of this is the usual behaviour of stream parsers.

In a rule, when a character is matched, it is inserted into the lexing buffer, except if the "no record" feature is used (see further).

Rules which have no semantic action return the lexing buffer itself.

Symbols

The different kinds or symbols in a rule are:

In the first two cases (namely, underscore and string), the symbol can be optionally followed by the character "slash" specifying that the found symbol must not be added into the lexing buffer. By default, it is. Useful, for example, when writing a lexer parsing strings, when the initial double quote and final double quote have not to be part of the string itself.

Moreover, a symbol can be followed by an optional error indicator, which can be:

Specific expressions

When loading this syntax extension, the entry <expression>, at level labelled "simple" of the OCaml language is extended with the following rules:

Lookahead

Lookahead is useful in some cases, when factorization of rules is impossible. To understand how it is useful, a first remark must be done, about the usual behaviour of Camlp5 stream parsers.

Stream parsers (including these lexers) use a limited parsing algorithm, in a way that when the first symbol of a rule is matched, all the following symbols of the same rule must apply, otherwise it is a syntax error. There is no backtrack. In most of the cases, left factorization of rules resolve conflicting problems. For example, in parsers of tokens (which is not our case here, since we parse only characters), when one writes a parser to recognize both the typicall grammar rules "if..then..else" and the shorter "if..then..", the solution is to write a rule starting with "if..then.." followed with a call to a parser recognizing "else.." or nothing.

Sometimes, however, this left factorization is not possible. A lookahead of the stream to check the presence of some elements (these element being characters, if we are using this "lexer" syntax) might be necessary to decide whether or not it is a good idea to start the rule. This lookahead feature may unfreeze several characters from the input stream but without removing them.

Syntactically, a lookahead starts with ?= and is followed by one or several lookahead sequences separated by the vertical bar |, the whole list being enclosed by braces.

If there are several lookaheads, they must all be of the same size (contain the same number of characters).

If the lookahead sequence is just a string, it corresponds to all characters of this string in the order (which is different for strings outside lookahead sequences, representing a choice of all characters).

Examples of lookaheads:

?= [ _ ''' | '\\' _ ]
?= [ "<<" | "<:" ]

The first line above matches a stream whose second character is a quote or a stream whose first character is a backslash (real example in the lexer of OCaml, in the library of Camlp5, named "plexer.ml"). The second line matches a stream starting with the two characters < and < or starting with the two characters < and : (this is another example in the same file).

Semantic actions of rules

By default, the result of a "lexer" is the current lexing buffer, which is of type "B.t". But it is possible to return other values, by adding "->" at end of rules followed by the expression you want to return, like in usual pattern matching in OCaml.

An interesting result, for example, could be the string corresponding to the characters of the lexing buffer. This can be obtained by returning the value "$buf" (see section 3.2)

A complete example

A complete example can be seen in the sources of Camlp5, file "lib/plexer.ml". This is the lexer of OCaml, either "normal" or "revised" syntax.

Compiling

To compile a file containing lexers, just load pa_lex.cmo using one of the following methods:

How to display the generated code

You can see the generated code, for a file "bar.ml" containing lexers, by typing in a command line:

camlp5r pa_lex.cmo pr_r.cmo bar.ml

To see the equivalent code with stream parsers, use:

camlp5r pa_lex.cmo pr_r.cmo pr_rp.cmo bar.ml

Copyright 2007 Daniel de Rauglaudre (INRIA)

Valid XHTML 1.1