

The Pxp_yacc module

open Pxp_types
open Pxp_dtd
open Pxp_document
exception ID_not_unique

class type [ 'ext ] index =
object

The type of indexes over the ID attributes of the elements. This class type is the minimum the parser requires to create such an index.

  constraint 'ext = 'ext node #extension
  method add : string -> 'ext node -> unit

Adds the passed node to the index. If there is already an ID with the passed string value, the exception ID_not_unique should be raised. (However, an index is also free to accept several identical IDs.)

  method find : string -> 'ext node

Finds the node with the passed ID value, or raises Not_found.

end
;;
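
The parenthetical remark above, that an index may accept duplicate IDs, can be illustrated with a small standalone sketch. It does not use PXP: `node` is a stand-in record, `permissive_index` is a hypothetical class, and `find` is assumed to be the name of the lookup method. Here a later `add` for the same ID simply replaces the earlier entry instead of raising an exception.

```ocaml
(* Stand-in for PXP's node type, so that the sketch compiles on its own. *)
type node = { name : string }

(* The minimal contract described above: one method to add, one to look up.
   The method name "find" is an assumption. *)
class type index = object
  method add : string -> node -> unit
  method find : string -> node
end

(* A permissive index (hypothetical): duplicate IDs are accepted, and the
   most recently added node wins. *)
class permissive_index : index = object
  val mutable entries : (string * node) list = []
  method add id n = entries <- (id, n) :: List.remove_assoc id entries
  method find id = List.assoc id entries   (* raises Not_found if absent *)
end
```

Such an index cannot report violations of ID uniqueness, so it is only suitable when duplicates are tolerated on purpose.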

class [ 'ext ] hash_index : 
object

This is a simple implementation of 'index' using a hash table.

  constraint 'ext = 'ext node #extension
  method add : string -> 'ext node -> unit

See above.

  method find : string -> 'ext node

See above.

  method index : (string, 'ext node) Hashtbl.t

Returns the hash table.

end
;;
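
As a rough illustration of how a hash-table index can meet this contract, here is a standalone sketch. It is not PXP's own implementation: `node` is a stand-in record, the exception is a local copy, and the method names `find` and `index` are assumptions.

```ocaml
(* Stand-in for PXP's node type. *)
type node = { name : string }

(* Local copy of the exception named in the interface above. *)
exception ID_not_unique

(* A strict index backed by Hashtbl: a second node with the same ID
   is rejected. *)
class hash_index = object
  val tbl : (string, node) Hashtbl.t = Hashtbl.create 100
  method add (id : string) (n : node) =
    if Hashtbl.mem tbl id then raise ID_not_unique;
    Hashtbl.add tbl id n
  method find (id : string) = Hashtbl.find tbl id   (* raises Not_found *)
  method index = tbl                                (* the underlying table *)
end
```

Exposing the table lets callers iterate over all IDs after parsing, at the cost of allowing them to modify the index behind the parser's back.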
type config =
    { warner : collect_warnings;

An object that collects warnings.

      errors_with_line_numbers : bool;

Whether error messages contain line numbers or not. The parser is 10 to 20 per cent faster if line numbers are turned off; you get only byte positions in this case.

      enable_pinstr_nodes : bool;

true: turns on a special mode for processing instructions (PIs). Normally, you cannot determine the exact location of a PI; you only know in which element the PI occurs. This mode makes it possible to find out the exact location: every PI is artificially wrapped by a special node with type T_pinstr. For example, if the XML text is <a><?x?><?y?></a>, the parser normally produces only an element object for "a", and puts the PIs "x" and "y" into it (without order). In this mode, the object "a" will contain two objects with type T_pinstr; the first object will contain "x", and the second "y". The object tree looks like:

- Node with type = T_element "a"
  - Node with type = T_pinstr "x"
    + contains processing instruction "x"
  - Node with type = T_pinstr "y"
    + contains processing instruction "y"

Notes:
(1) In past versions of PXP this mode was called processing_instructions_inline, and it produced nodes of type T_element "-pi" instead of T_pinstr.
(2) The T_pinstr nodes are created from the pinstr exemplars in your spec.

      enable_super_root_node : bool;

true: the topmost element of the XML tree is not the root element, but the so-called super root. The root element is a son of the super root. The super root is a node with type T_super_root. The following behaviour changes, too:

- PIs occurring outside the root element and outside the DTD are added to the super root instead of the document object
- If enable_pinstr_nodes is also turned on, the PI wrappers are added to the super root

For example, the document <?x?><a>y</a><?y?> is normally represented by:

- document object
  + contains PIs x and y
  - reference to root node with type = T_element "a"
    - node with type = T_data: contains "y"

With enabled super root node:

- document object
  - reference to super root node with type = T_super_root
    + contains PIs x and y
    - root node with type = T_element "a"
      - node with type = T_data: contains "y"

If also enable_pinstr_nodes:

- document object
  - reference to super root node with type = T_super_root
    - node with type = T_pinstr "x"
      + contains PI "x"
    - root node with type = T_element "a"
      - node with type = T_data: contains "y"
    - node with type = T_pinstr "y"
      + contains PI "y"

Notes:
(1) In previous versions of PXP this mode was called virtual_root, and it produced an additional node of type T_element "-vr" instead of T_super_root.
(2) The T_super_root node is created from the super root exemplar in your spec.
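
The three representations of <?x?><a>y</a><?y?> can also be written down as data. The following standalone sketch uses a toy tree type, not PXP's node classes; the record fields and the function `build` are hypothetical and only model where the PIs end up under each combination of the two options.

```ocaml
(* Toy model of the node types mentioned above. *)
type node_type =
  | T_super_root
  | T_element of string
  | T_pinstr of string
  | T_data of string

type node = {
  ntype : node_type;
  pis : string list;       (* PIs attached to this node *)
  children : node list;    (* ordered sub nodes *)
}

type document = { doc_pis : string list; top : node }

(* Build the tree for <?x?><a>y</a><?y?> under the two options. *)
let build ~super_root ~pinstr_nodes =
  let data = { ntype = T_data "y"; pis = []; children = [] } in
  let a = { ntype = T_element "a"; pis = []; children = [data] } in
  let wrap t = { ntype = T_pinstr t; pis = [t]; children = [] } in
  if not super_root then
    (* PIs outside the root element stay in the document object. *)
    { doc_pis = ["x"; "y"]; top = a }
  else if pinstr_nodes then
    (* PI wrappers become ordered children of the super root. *)
    { doc_pis = [];
      top = { ntype = T_super_root; pis = [];
              children = [wrap "x"; a; wrap "y"] } }
  else
    (* PIs are attached to the super root itself. *)
    { doc_pis = [];
      top = { ntype = T_super_root; pis = ["x"; "y"]; children = [a] } }
```

Only the combination of both options preserves the position of the PIs relative to the root element.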

      enable_comment_nodes : bool;

When enabled, comments are represented as nodes with type = T_comment. To access the contents of comments, use the method "comment" for the comment nodes. These nodes behave like elements; however, they are normally empty and do not have attributes. Note that it is possible to add children to comment nodes and to set attributes, but it is strongly recommended not to do so. There are no checks on such abnormal use, because they would cost too much time, even when no comment nodes are generated at all.

Comment nodes should be disabled unless you must parse a third-party XML text which uses comments as another data container.

The nodes of type T_comment are created from the comment exemplars in your spec.

      encoding : rep_encoding;

Specifies the encoding used for the *internal* representation of any character data. Note that the default is still Enc_iso88591.

      recognize_standalone_declaration : bool;

Whether the "standalone" declaration is recognized or not. This option does not have an effect on well-formedness parsing: in this case such declarations are never recognized.

Recognizing the "standalone" declaration means that the value of the declaration is scanned and passed to the DTD, and that the "standalone-check" is performed.

Standalone-check: If a document is flagged standalone='yes', some additional constraints apply. The idea is that a parser without access to any external document subsets can still parse the document, and will still return the same values as a parser with such access. For example, if the DTD is external and there are attributes with default values, it is checked that there is no element instance where these attributes are omitted; the parser would return the default value, but this requires access to the external DTD subset.
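
The attribute-default example can be sketched in a few lines. This is a standalone illustration of that one constraint only, not PXP's implementation; the record types and the function `violates_standalone` are hypothetical.

```ocaml
(* Hypothetical, simplified attribute declaration. *)
type attr_decl = {
  attr_name : string;
  default : string option;   (* default value, if one is declared *)
  external_decl : bool;      (* declared in the external DTD subset? *)
}

(* Hypothetical element instance: only its explicitly written attributes. *)
type element = { attrs : (string * string) list }

(* True if the element relies on a default that only the external subset
   provides, which is not allowed when standalone='yes'. *)
let violates_standalone decls el =
  List.exists
    (fun d ->
       d.external_decl
       && d.default <> None
       && not (List.mem_assoc d.attr_name el.attrs))
    decls
```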

      store_element_positions : bool;

Whether the file name, the line and the column of the beginning of elements are stored in the element nodes. This option may be useful to generate error messages.

Positions are only stored for:

- Elements
- Wrapped processing instructions (see enable_pinstr_nodes)

For all other node types, no position is stored.

You can access positions by the method "position" of nodes.

      idref_pass : bool;

Whether the parser does a second pass and checks that all IDREF and IDREFS attributes contain valid references. This option works only if an ID index is available. To create an ID index, pass an index object as the id_index argument to the parsing functions (such as parse_document_entity; see below).

"Second pass" does not mean that the XML text is parsed again; only the existing document tree is traversed, and the check for bad IDREF/IDREFS attributes is performed for every node.

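The idea of the second pass can be sketched on a toy tree. This is a standalone illustration, not PXP code: the `node` record and the three functions are hypothetical, and in the real parser the index is filled during parsing rather than by a separate traversal.

```ocaml
(* Toy node: an optional ID, the values of its IDREF/IDREFS attributes,
   and its sub nodes. *)
type node = {
  id : string option;
  idrefs : string list;
  children : node list;
}

(* Stand-in for the index that the parser fills while building the tree. *)
let rec collect_ids tbl n =
  (match n.id with
   | Some i -> Hashtbl.replace tbl i n
   | None -> ());
  List.iter (collect_ids tbl) n.children

(* The second pass: traverse the finished tree and report every reference
   that has no matching ID in the index. *)
let rec dangling_refs tbl n =
  List.filter (fun r -> not (Hashtbl.mem tbl r)) n.idrefs
  @ List.concat (List.map (dangling_refs tbl) n.children)

let check root =
  let tbl = Hashtbl.create 16 in
  collect_ids tbl root;
  dangling_refs tbl root
```

Because the index is complete before the references are checked, forward references (an IDREF that precedes its ID in document order) are handled correctly.
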
      validate_by_dfa : bool;

If true, and if DFAs are available for validation, the DFAs will actually be used for validation. If false, or if no DFAs are available, the standard backtracking algorithm will be used. DFA = deterministic finite automaton.

DFAs are only available if accept_only_deterministic_models is "true" (because in this case, it is relatively cheap to construct the DFAs). DFAs are a data structure which ensures that validation can always be performed in linear time.

I strongly recommend using DFAs; however, there are examples for which validation by backtracking is faster.
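
To make the linear-time claim concrete, here is a hand-built DFA for the content model (a, b*, c?). It is a standalone sketch: the state numbering and the functions `step`, `accepting` and `validate` are hypothetical and are not how PXP represents its automata.

```ocaml
(* Transition function: state 0 is the start state; None means that the
   child element is not allowed at this point. *)
let step state child =
  match state, child with
  | 0, "a" -> Some 1
  | 1, "b" -> Some 1
  | 1, "c" -> Some 2
  | _ -> None

(* The content is complete after "a" (plus any "b"s) or after the
   optional "c". *)
let accepting = function
  | 1 | 2 -> true
  | _ -> false

(* Validate a list of child element names in a single left-to-right pass:
   one transition per child, no backtracking. *)
let validate children =
  let rec go state = function
    | [] -> accepting state
    | c :: rest ->
        (match step state c with
         | Some s -> go s rest
         | None -> false)
  in
  go 0 children
```

Each child causes exactly one transition, so the running time grows linearly with the number of children.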

      accept_only_deterministic_models : bool;

Whether only deterministic content models are accepted in DTDs.

The following options are either not implemented or intended only for internal use.

      debugging_mode : bool;
    }
type source =
    Entity of ((dtd -> Pxp_entity.entity) * Pxp_reader.resolver)
  | ExtID of (ext_id * Pxp_reader.resolver)

val from_channel : 
      ?system_encoding:encoding -> ?id:ext_id -> ?fixenc:encoding -> 
      in_channel -> source

val from_string :
      ?fixenc:encoding -> string -> source

val from_file :
      ?system_encoding:encoding -> string -> source

Notes on sources (version 2):

Sources specify where the XML text to parse comes from. Sources not only represent character streams, but also external IDs (i.e. SYSTEM or PUBLIC names), and they are interpreted as a specific encoding of characters. A source should be associated with an external ID, because otherwise it is not known how to handle relative names.

There are two primary sources, Entity and ExtID, and several functions for derived sources. First, explanations of the functions:

from_channel: The XML text is read from an in_channel. By default, the channel is not associated with an external ID, and it is impossible to resolve relative SYSTEM IDs found in the document. If the ?id argument is passed, it is assumed that the channel has this external ID. If relative SYSTEM IDs occur in the document, they can be interpreted; however, it is only possible to read from "file:" IDs. By default, the channel automatically detects the encoding. You can set a fixed encoding by passing the ?fixenc argument.

from_string: The XML text is read from a string. It is impossible to read from any external entity whose reference is found in the string. By default, the encoding of the string is detected automatically. You can set a fixed encoding by passing the ?fixenc argument.

from_file: The XML text is read from the file whose file name is passed to the function (as a UTF-8 string). Relative system IDs can be interpreted by this function. The ?system_encoding argument specifies the character encoding used for file names (sic!). By default, UTF-8 is assumed.

Examples:

from_file "/tmp/file.xml": reads from this file, which is assumed to have the ID SYSTEM "file://localhost/tmp/file.xml".

let ch = open_in "/tmp/file.xml" in
from_channel ~id:(System "file://localhost/tmp/file.xml") ch

This does the same, but uses a channel.

from_channel ~id:(System "http://host/file.xml") ch: reads from the channel ch, and it is assumed that the ID is SYSTEM "http://host/file.xml". If there is any relative SYSTEM ID, it will be interpreted relative to this location; however, there is no way to read via HTTP. If there is any "file:" SYSTEM ID, it is possible to read the file.
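
Why a source needs an external ID becomes clearer when relative names are resolved by hand. The following standalone sketch is not PXP's resolver; `has_scheme` and `resolve_relative` are hypothetical helpers that only handle the simplest case (no "..", no names starting with "/").

```ocaml
(* A name counts as absolute here if it has a scheme such as "file:" or
   "http:", i.e. a ':' that occurs before any '/'. *)
let has_scheme s =
  match String.index_opt s ':' with
  | None -> false
  | Some i ->
      (match String.index_opt s '/' with
       | None -> true
       | Some j -> i < j)

(* Interpret [rel] relative to the location [base] by replacing everything
   after the last '/' of [base]. Without a base there is nothing to
   resolve against. *)
let resolve_relative ~base rel =
  if has_scheme rel then rel
  else
    match String.rindex_opt base '/' with
    | None -> rel
    | Some i -> String.sub base 0 (i + 1) ^ rel
```

With the base "file://localhost/tmp/file.xml", the relative name "sub/part.xml" becomes "file://localhost/tmp/sub/part.xml"; with an empty base the relative name stays unresolved.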

The primary sources:

- ExtID(x,r): The identifier x (either the SYSTEM or the PUBLIC name) of the entity to read from is passed to the resolver, and the resolver finds the entity and opens it. The intention of this option is to allow customized resolvers to interpret external identifiers without any restriction. The Pxp_reader module contains several classes allowing the user to compose such a customized resolver from predefined components.

ExtID is the interface of choice for your own resolver extensions.

- Entity(m,r): You can implement any behaviour by using a customized entity class. Once the DTD object d that will be used during parsing is known, the entity e = m d is determined and used together with the resolver r. This is only for hackers.

val default_config : config

- Warnings are thrown away
- Error messages will contain line numbers
- Neither T_super_root nor T_pinstr nor T_comment nodes are generated
- The internal encoding is ISO-8859-1
- The standalone declaration is checked
- Element positions are stored
- The IDREF pass is left out
- If available, DFAs are used for validation
- Only deterministic content models are accepted
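
A custom configuration is usually derived from the default one with OCaml's record update syntax. The sketch below is standalone and uses a reduced stand-in record with only three fields, not the real config type; the field name `idref_pass` is an assumption.

```ocaml
(* Reduced stand-in for the config record: three boolean options only. *)
type config = {
  enable_pinstr_nodes : bool;
  idref_pass : bool;                        (* assumed field name *)
  accept_only_deterministic_models : bool;
}

(* Mirrors the defaults listed above for these three options. *)
let default_config = {
  enable_pinstr_nodes = false;
  idref_pass = false;
  accept_only_deterministic_models = true;
}

(* Change only what is needed; all other fields keep their defaults. *)
let my_config =
  { default_config with enable_pinstr_nodes = true; idref_pass = true }
```

Writing configurations this way keeps working if further fields are added to the record later.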

val default_extension : ('a node extension) as 'a

A "null" extension, i.e. an extension that does not extend the functionality.

val default_spec : ('a node extension as 'a) spec

Specifies that you do not want to use extensions.

val parse_dtd_entity : config -> source -> dtd

Parse an entity containing a DTD (external subset), and return this DTD.

val extract_dtd_from_document_entity : config -> source -> dtd

Parse a closed document, i.e. a document beginning with <!DOCTYPE...>, and return the DTD contained in the document. The parts of the document outside the DTD are not actually parsed, i.e. parsing stops when all declarations of the DTD have been read.

val parse_document_entity : 
  ?transform_dtd:(dtd -> dtd) ->
  ?id_index:('ext index) ->
  config -> source -> 'ext spec -> 'ext document

Parse a closed document, i.e. a document beginning with <!DOCTYPE...>, and validate the contents of the document against the DTD contained and/or referenced in the document.

If the optional argument ~transform_dtd is passed, the following modification applies: After the DTD (both the internal and external subsets) has been parsed, the function ~transform_dtd is called, and the resulting DTD is actually used to validate the document.

If the optional argument ~transform_dtd is missing, the parser behaves in the same way as if the identity were passed as ~transform_dtd.
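
The "identity by default" behaviour corresponds to an optional argument with a default value. The following standalone sketch shows the pattern with a stand-in `dtd` record and a hypothetical function `dtd_for_validation`; it does not call the parser.

```ocaml
(* Stand-in for the DTD object. *)
type dtd = { root : string option; allow_arbitrary : bool }

(* If ~transform_dtd is omitted, the identity is used, so the parsed DTD
   is the one that validation sees. *)
let dtd_for_validation ?(transform_dtd = fun (d : dtd) -> d) parsed_dtd =
  transform_dtd parsed_dtd
```

A caller could, for example, pass a function that relaxes the parsed DTD before validation starts.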

If the optional argument ~id_index is present, the parser adds any ID attribute to the passed index. An index is required to detect violations of the uniqueness of IDs.

val parse_wfdocument_entity : 
  config -> source -> 'ext spec -> 'ext document

Parse a closed document (see parse_document_entity), but do not validate it. Only checks on well-formedness are performed.

val parse_content_entity  : 
  ?id_index:('ext index) ->
  config -> source -> dtd -> 'ext spec -> 'ext node

Parse a file representing a well-formed fragment of a document. The fragment must be a single element (i.e. something like <a>...</a>; not a sequence like <a>...</a><b>...</b>). The element is validated against the passed DTD, but it is not checked whether the element is the root element specified in the DTD.

If the optional argument ~id_index is present, the parser adds any ID attribute to the passed index. An index is required to detect violations of the uniqueness of IDs.

val parse_wfcontent_entity : 
  config -> source -> 'ext spec -> 'ext node

Parse a file representing a well-formed fragment of a document (see parse_content_entity). The fragment is not validated, only checked for well-formedness.

