Canonical XML

This document defines a subset of XML called canonical XML. The intended use of canonical XML is in testing XML processors, as a representation of the result of parsing an XML document.

Every well-formed XML document has a unique structurally equivalent canonical XML document. Any two structurally equivalent XML documents have the same canonical XML document, byte for byte. Canonicalizing an XML document requires only information that an XML processor is required to make available to an application.

A canonical XML document conforms to the following grammar:


CanonXML    ::= Pi* element Pi*

element     ::= Stag (Datachar | Pi | element)* Etag

Stag        ::= '<'  Name Atts '>'

Etag        ::= '</' Name '>'

Pi          ::= '<?' Name ' ' (((Char - S) Char*)? - (Char* '?>' Char*)) '?>'

Atts        ::= (' ' Name '=' '"' Datachar* '"')*

Datachar    ::= '&amp;' | '&lt;' | '&gt;' | '&quot;'
                 | '&#9;'| '&#10;'| '&#13;'
                 | (Char - ('&' | '<' | '>' | '"' | #x9 | #xA | #xD))

Name        ::= (see XML spec)

Char        ::= (see XML spec)

S           ::= (see XML spec)
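As an illustration of the Datachar production, the following minimal Python sketch maps each character of parsed data to its canonical form. The names DATACHAR_ESCAPES and escape_datachar are hypothetical and are not part of this document.

```python
# Hypothetical helper illustrating the Datachar production.
# The seven characters excluded from the literal alternative are
# written as the references the grammar lists; every other
# character is copied through unchanged.
DATACHAR_ESCAPES = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "\t": "&#9;",
    "\n": "&#10;",
    "\r": "&#13;",
}


def escape_datachar(text: str) -> str:
    """Return text rewritten as a sequence of Datachar tokens."""
    return "".join(DATACHAR_ESCAPES.get(ch, ch) for ch in text)
```

The same mapping applies to character data and to attribute values, since the Atts production also uses Datachar.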

Within each start-tag, attributes appear in lexicographical order of their names (in Unicode bit order).
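The ordering rule can be sketched in Python as follows. This assumes that "Unicode bit order" means comparison by code point, which is how Python compares strings; that reading is an interpretation, and the function name format_atts is hypothetical.

```python
# Hypothetical sketch of the Atts production with sorted attributes.
# Assumption: "Unicode bit order" is taken to mean code point order,
# which is what Python's built-in string comparison uses.
_ESCAPES = str.maketrans({
    "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;",
    "\t": "&#9;", "\n": "&#10;", "\r": "&#13;",
})


def format_atts(attrs: dict[str, str]) -> str:
    """Serialize attributes sorted by name, each preceded by one space."""
    parts = []
    for name in sorted(attrs):
        value = attrs[name].translate(_ESCAPES)
        parts.append(f' {name}="{value}"')
    return "".join(parts)
```

Under this assumption an upper-case name such as B sorts before a lower-case name such as a, because its code point is smaller.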

A canonical XML document is encoded in UTF-8.

Ignorable white space is considered significant and is treated equivalently to data.
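Putting the rules together, one possible canonicalizer can be sketched with Python's standard xml.parsers.expat module. This is an illustration under stated assumptions, not a reference implementation: the function name canonicalize is hypothetical, attribute names are compared by code point, and the sketch relies on expat reporting white space inside elements as ordinary character data.

```python
import xml.parsers.expat

# Characters that Datachar requires to be written as references.
_ESCAPES = str.maketrans({
    "&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;",
    "\t": "&#9;", "\n": "&#10;", "\r": "&#13;",
})


def canonicalize(source: bytes) -> bytes:
    """Hypothetical sketch: parse source and return canonical XML bytes."""
    out: list[str] = []

    def start(name, attrs):
        # Stag: name, then attributes sorted by name (code point order assumed).
        out.append("<" + name)
        for key in sorted(attrs):
            out.append(f' {key}="{attrs[key].translate(_ESCAPES)}"')
        out.append(">")

    def end(name):
        # Etag: empty elements are written as a start-tag and end-tag pair.
        out.append(f"</{name}>")

    def chars(data):
        # All character data inside elements, including white space, is kept.
        out.append(data.translate(_ESCAPES))

    def pi(target, data):
        # Pi: exactly one space between the target and the data.
        out.append(f"<?{target} {data}?>")

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start
    parser.EndElementHandler = end
    parser.CharacterDataHandler = chars
    parser.ProcessingInstructionHandler = pi
    parser.Parse(source, True)

    # A canonical XML document is encoded in UTF-8.
    return "".join(out).encode("utf-8")
```

Comments, the XML declaration and the document type declaration have no place in the grammar, so the sketch registers no handlers for them and they do not appear in the output.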

James Clark