>If this is true -- that U+FEFF is a kind of meta-character that doesn't
>really belong to the text per se -- then it should be equally true for
>UTF-8, whether its role is as a true Byte Order Mark (needed in UTF-16
>and UTF-32 but not UTF-8) or as a signature (potentially useful in all
>Unicode CES's).  Only in its evil-twin role as a zero-width no-break
>space is it truly part of the text, in which case the previous
>discussion about white-space characters applies.

For what it is worth, the XML 1.0 spec
(http://www.w3.org/TR/2000/REC-xml-20001006#sec-documents) says this about
the BOM:

>Entities encoded in UTF-16 must begin with the Byte Order Mark ... This is
>an encoding signature, not part of either the markup or the character data
>of the XML document. XML processors must be able to use this character to
>differentiate between UTF-8 and UTF-16 encoded documents.

The implication seems to be that, in XML at least, UTF-8 will not have a
BOM (or an encoding declaration).  Other parts of the spec, especially
Appendix F, seem to recognize that anything can come either with or without
a BOM.  Anything that is neither UTF-8 nor UTF-16 must also have an
encoding declaration.
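
To make the "signature, not part of the text" reading concrete, here is a
rough Python sketch of how a reader might sniff a leading U+FEFF and drop it
before handing the text on.  The function names sniff_bom and
decode_without_signature are my own invention, and the fallback to UTF-8
when no signature is present is an assumption; this is not the procedure
from Appendix F, which also looks at the first bytes of the XML declaration.

```python
import codecs

# Hypothetical helper names; nothing here is taken from the XML spec itself.
# UTF-32 is checked before UTF-16 because the UTF-32LE signature
# (FF FE 00 00) begins with the UTF-16LE signature (FF FE).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, signature_length), or (None, 0) if there is no BOM."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_without_signature(data: bytes, default: str = "utf-8") -> str:
    """Decode data, treating a leading U+FEFF as a signature and not as text.

    Assumption: with no signature present, fall back to `default`.
    """
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)
```

With this, a UTF-8 entity decodes to the same string whether or not it
carries the three-byte signature.  Only a U+FEFF at the very start is
dropped; one appearing later in the data stays in the decoded text, which is
the zero-width no-break space case.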





