I'm not sure I follow this either :)

Currently we emit an XML declaration which says we are using the ISO-8859-1 encoding. Unicode code points in the range 0x00 to 0xFF have the same values as the corresponding ISO-8859-1 characters. If we wish to send Unicode code points with values > 0xFF then we have to emit character references (e.g. &#x1FF;).
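To make that concrete, here is a rough Python sketch of what "emit a character reference for anything above 0xFF" amounts to. The function name to_latin1_xml is made up for this example and is not taken from our code; it also ignores markup characters such as < and &, which need separate handling.

```python
def to_latin1_xml(text: str) -> bytes:
    """Encode text as ISO-8859-1, using character references above 0xFF.

    Illustrative only: markup characters (<, &, >) are NOT escaped here.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFF:
            # Same value in Unicode and ISO-8859-1, so it can go out as-is.
            out.append(ch)
        else:
            # Not representable in ISO-8859-1: fall back to a hex reference.
            out.append(f"&#x{cp:X};")
    return "".join(out).encode("iso-8859-1")


print(to_latin1_xml("caf\u00e9 \u01ff"))  # b'caf\xe9 &#x1FF;'
```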

If we were to change the encoding to UTF-8 or UTF-16 then every Unicode code point could be represented directly, so we would never have to emit character references for encoding reasons (though we still could if we wanted to).

The XML 1.0 spec forbids some Unicode code points from appearing in a well-formed XML document. Only these code points are allowed: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] (see section 2.2 of the spec). Note that US-ASCII control characters other than HT, CR and LF are not allowed. Using a character reference doesn't make any difference: <a>&#x0;</a> is not a well-formed XML document and should be rejected by an XML parser (MinML used not to complain about this; later versions do).
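For anyone who wants to experiment, the allowed ranges translate directly into a small check. This is only a sketch in Python: is_xml_char and parses are names invented for this example, and the parser used is Python's standard xml.etree.ElementTree (which sits on expat), not MinML.

```python
import xml.etree.ElementTree as ET


def is_xml_char(cp: int) -> bool:
    """True if code point cp matches the Char production of XML 1.0, section 2.2."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses(doc: str) -> bool:
    """True if the standard-library parser accepts doc as well formed."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


print(is_xml_char(0x0), is_xml_char(0x9), is_xml_char(0xFFFE))
print(parses("<a>&#x0;</a>"))  # the reference to U+0000 should be rejected
```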

There is another little wrinkle with CR and LF. An XML parser is required to "normalise" line endings (see section 2.11 of the spec). This normalisation involves replacing each CR LF pair, and each CR not followed by LF, with a single LF. This normalisation does not occur if the CR is represented using a character reference.
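The difference is easy to see with any conforming parser. The snippet below is a small demonstration using Python's standard xml.etree.ElementTree rather than MinML; the helper name text_of is made up for this example.

```python
import xml.etree.ElementTree as ET


def text_of(doc: str) -> str:
    """Parse doc and return the character data of its root element."""
    return ET.fromstring(doc).text


# Literal CR LF and a literal lone CR are both normalised to LF.
literal = text_of("<a>one\r\ntwo\rthree</a>")

# A CR written as a character reference survives parsing untouched.
referenced = text_of("<a>one&#xD;\ntwo</a>")

print(repr(literal))
print(repr(referenced))
```

So a writer that needs a CR to reach the other side has to send it as a reference.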

So a correct XML writer should do the following:

1/ refuse to write characters with Unicode code points which are not allowed in an XML document.

2/ replace characters whose Unicode code point cannot be represented in the encoding being used with the appropriate character reference.

3/ replace <, & and > with either the predefined entities (&lt; etc.) or with character references.

4/ replace all CR characters with a character reference.
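Put together, the four steps might look something like the Python sketch below. It is an illustration under my own assumptions, not our writer: escape_text and is_xml_char are invented names, it handles element content only (attribute values would also need their quote characters escaped), and it tests encodability one character at a time, which is simple rather than fast.

```python
def is_xml_char(cp: int) -> bool:
    """True if cp matches the Char production of XML 1.0, section 2.2."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# Steps 3 and 4: markup characters and CR always go out escaped.
ALWAYS_ESCAPED = {"<": "&lt;", "&": "&amp;", ">": "&gt;", "\r": "&#xD;"}


def escape_text(text: str, encoding: str = "iso-8859-1") -> bytes:
    """Return text as element content, encoded in the given encoding."""
    out = []
    for ch in text:
        cp = ord(ch)
        if not is_xml_char(cp):
            # Step 1: refuse code points XML 1.0 does not allow at all.
            raise ValueError(f"U+{cp:04X} is not allowed in XML 1.0")
        if ch in ALWAYS_ESCAPED:
            out.append(ALWAYS_ESCAPED[ch])
            continue
        try:
            ch.encode(encoding)
            out.append(ch)
        except UnicodeEncodeError:
            # Step 2: not representable in this encoding, use a reference.
            out.append(f"&#x{cp:X};")
    return "".join(out).encode(encoding)


print(escape_text("a<b & c\r\n\u01ff"))
```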

If we wanted to have the greatest possible chance of interoperating, we should emit no XML encoding declaration and replace code points with values > 0x7F with character references.
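In Python, for instance, the "references for everything above 0x7F" part is available as a built-in codec error handler. This is just a sketch: to_ascii_xml is a name made up for this example, and it does nothing about <, &, CR or forbidden code points, which still need handling separately.

```python
def to_ascii_xml(text: str) -> bytes:
    """Encode text as pure ASCII, with decimal character references above 0x7F.

    With no encoding declaration and no byte order mark a parser assumes
    UTF-8, and pure ASCII output is valid UTF-8.
    """
    return text.encode("ascii", "xmlcharrefreplace")


print(to_ascii_xml("caf\u00e9 \u01ff"))  # b'caf&#233; &#511;'
```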


I would recommend Tim Bray's Annotated XML spec (http://www.xml.com/axml/testaxml.htm) if you would like to check that I have the details right.




John Wilson
The Wilson Partnership
http://www.wilson.co.uk



