Off-list, as this isn't really related to the development of HTML whatsoever.

On 3 Mar 2008, at 08:54, Martin Duerst wrote:

I don't see anything making a BOM illegal in UTF-16LE/UTF-16BE, in
fact, the only mention I find of it with regards to either in Unicode
5.0 is "In UTF-16(BE|LE), an initial byte sequence <(FE FF|FF FE)> is
interpreted as U+FEFF zero width no-break space."

That's exactly it. To make it very explicit, there is one codepoint
(U+FEFF) and two functions: BOM and ZWNBSP. What the above says is
that U+FEFF at the start of files marked as UTF-16LE/UTF-16BE is
always ZWNBSP, and therefore is never a BOM. This means that a leading
BOM is forbidden.

Ah. My mistake: thinking of ZWNBSP as just being the character name, and not its specific meaning in the context (which of course is important for U+FEFF).
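
For what it's worth, the distinction is easy to see in practice. Here is a minimal sketch using Python's built-in codecs (my choice of illustration, not something from the Unicode text itself), on the assumption that they follow the rule quoted above: the same leading bytes FF FE are consumed as a BOM when the label is plain UTF-16, but kept as U+FEFF when the label is UTF-16LE.

```python
# Bytes FF FE followed by "A" encoded as little-endian UTF-16.
data = b"\xff\xfeA\x00"

# Labelled "UTF-16": the leading FF FE acts as a BOM and is consumed.
as_utf16 = data.decode("utf-16")

# Labelled "UTF-16LE": the byte order is already known, so the same
# bytes decode to U+FEFF (ZWNBSP) and stay in the text.
as_utf16le = data.decode("utf-16-le")

print(repr(as_utf16))    # 'A'
print(repr(as_utf16le))  # '\ufeffA'
```

So with an explicit LE/BE label the U+FEFF is ordinary content, which is why it then matters what the file format allows as its first character.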

If there are HTML files that can start with arbitrary characters, then
it might be okay to have a UTF-16LE or UTF-16BE file start with U+FEFF,
because this can then be interpreted as a ZWNBSP (although a ZWNBSP
at the start of a file doesn't make a lot of sense). If HTML files
have to start with markup, then a UTF-16LE or UTF-16BE HTML file
cannot start with U+FEFF, because a ZWNBSP isn't markup.
(Last time I knew HTML, it had to have at least a <title> element,
so it had to start with markup, but I don't know how that is working
out in HTML5.)

A conformant document must start with a doctype, but for a non-conforming document a (leading) ZWNBSP will just end up at the start of <body> (i.e., it gets treated like any other non-ASCII space character).


--
Geoffrey Sneddon
<http://gsnedders.com/>

