On Tue, 13 August 2002, Marco Cimarosti wrote:

> 
> John Cowan wrote:
> > The following characters were explicitly permitted by XML 1.0 but are
> > not in the "recommended" 1.1 set:
> > 
> [...]
> > U+FEFF ZWNBSP
> 
> How do parsers detect the endianness of XML files in UTF-16 (and the very
> fact that they are UTF-16)?

I assume that U+FEFF ZWNBSP is included in this list precisely because it
is now used solely with the semantics of a Byte Order Mark, and its
original meaning as ZWNBSP is deprecated in favour of U+2060 WORD JOINER.

My understanding is that this list only refers to characters that are not
permitted within XML names. The BOM is placed at the head of the XML file,
before the XML declaration, and is thus the first character encountered by
the parser.

The parser works out the encoding and endianness of the XML file from the
first bytes of the file, i.e. the BOM as serialized in that encoding:
FE FF       = UTF-16 BE
FF FE       = UTF-16 LE
00 00 FE FF = UTF-32 BE
FF FE 00 00 = UTF-32 LE
EF BB BF    = UTF-8
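
To make the table above concrete, here is a minimal sketch in Python of how
a parser might sniff the BOM. This is only an illustration, not how any
particular XML parser does it: the names BOM_TABLE and sniff_bom are my own,
and a real parser would also fall back to other heuristics when no BOM is
present. Note that the UTF-32 LE signature begins with the UTF-16 LE one, so
the longer signatures have to be tested first.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so testing UTF-16 LE first would misidentify it.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
]


def sniff_bom(head: bytes):
    """Return (encoding name, BOM length) for the leading bytes of a file.

    Returns (None, 0) when no known BOM is found. Hypothetical helper,
    written only to illustrate the table above.
    """
    for bom, name in BOM_TABLE:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0
```

For example, sniff_bom(b"\xff\xfe<\x00?\x00") reports UTF-16 LE with a
2-byte BOM, after which the parser can skip those bytes and decode the rest.
Treating FF FE 00 00 as UTF-32 LE is safe for XML because a UTF-16 LE
document cannot legally start with U+0000.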

Andrew
