The type 'document' represents parsed HTML documents.

Element (name, args, subnodes): an element node for an element of type 'name' (i.e. written <name ...>...</name>) with arguments 'args' and subnodes 'subnodes' (the material within the element). The arguments are simply name/value pairs. Entity references (something like &xy;) occurring in the values are NOT resolved. Arguments without values (e.g. "multiple" in <select name="x" multiple>) are represented as (name,name), i.e. the name is returned as the value. As argument names are case-insensitive, the names are all returned in lowercase.

Data s: a character data node. Again, entity references are kept as written and are not replaced by what they mean.
type document = Element of (string * (string*string) list * document list) | Data of string ;;
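As an illustration of the representation described above, the following minimal sketch builds the tree for <select name="x" multiple><option>One</option></select> by hand and walks it. The type is copied from the declaration so that the sketch compiles on its own; the helpers 'attribute' and 'text_of' are hypothetical names and not part of the interface.

```ocaml
(* Type copied from the declaration above so the sketch compiles on its own. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* <select name="x" multiple><option>One</option></select>
   The valueless argument "multiple" appears as ("multiple", "multiple"). *)
let select =
  Element ("select", [ ("name", "x"); ("multiple", "multiple") ],
           [ Element ("option", [], [ Data "One" ]) ])

(* Hypothetical helper: look up an argument of an element node. *)
let attribute name = function
  | Element (_, args, _) -> List.assoc_opt name args
  | Data _ -> None

(* Hypothetical helper: concatenate all character data below a node.
   Entity references stay unresolved, exactly as stored in the tree. *)
let rec text_of = function
  | Data s -> s
  | Element (_, _, subnodes) -> String.concat "" (List.map text_of subnodes)
```

Because argument names are stored in lowercase, a lookup such as attribute "multiple" select only needs to handle the lowercase spelling.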
val no_end_tag : string list ref;;
List of tags which are always empty. This variable is pre-configured,
but you may want to change it.
It is important to know which elements are always empty, because HTML
allows the end tag to be omitted for them. For example,
<a><b>x</a> is parsed as
  Element("a",[],[ Element("b",[],[]); Data "x" ])
if we know that "b" is an empty element, but it is wrongly parsed as
  Element("a",[],[ Element("b",[],[ Data "x" ]) ])
if "b" is actually empty but we do not know it.
An example of such a tag is "br".
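The following sketch shows one way to extend a list such as no_end_tag, and writes out the two trees from the example as values. It reads the example as "b" being the element whose end tag is omitted. To stay runnable on its own it uses a local stand-in for no_end_tag (initialised with "br" only, which is not the real pre-configured list) and a hypothetical helper 'declare_empty'.

```ocaml
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Local stand-in for the library's no_end_tag; the real variable is
   pre-configured with more tags than this. *)
let no_end_tag : string list ref = ref [ "br" ]

(* Hypothetical helper: declare a tag as always empty, avoiding duplicates. *)
let declare_empty tag =
  if not (List.mem tag !no_end_tag) then no_end_tag := tag :: !no_end_tag

(* Tree for <a><b>x</a> when "b" is listed as always empty:
   b is closed immediately and the text becomes its sibling. *)
let b_known_empty =
  Element ("a", [], [ Element ("b", [], []); Data "x" ])

(* Tree when "b" is not listed: the text ends up inside b. *)
let b_not_known =
  Element ("a", [], [ Element ("b", [], [ Data "x" ]) ])
```

Since the variable is a plain reference, any change has to be made before the parsing functions are called.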
val special_tag : string list ref;;
Note that comments are not recognized within special elements; there, they are returned as character material.
val parse_string : string -> document list
Parses the HTML document contained in a string and returns it as a list of nodes.
val parse_file : in_channel -> document list
Parses the HTML document read from an input channel (for example, an opened file) and returns it as a list of nodes.
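Because parse_file takes an in_channel rather than a file name, the caller is responsible for opening and closing the file. A minimal sketch of a wrapper follows; 'parse_path' is a hypothetical name, and the parser is passed in as an argument so that the sketch does not assume a particular module name. It relies on Fun.protect from the OCaml standard library (4.08 or later).

```ocaml
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Hypothetical wrapper: open [path], hand the channel to [parse_file],
   and close the channel even if parsing raises an exception. *)
let parse_path (parse_file : in_channel -> document list) (path : string)
    : document list =
  let ch = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in_noerr ch)
    (fun () -> parse_file ch)
```

With the library's parse_file in scope, a call would look like parse_path parse_file "index.html", where the file name is only an example.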