Re: BOM and UTF-16LE/BE (was: Re: several messages about handling encodings in HTML)

Geoffrey Sneddon Mon, 03 Mar 2008 08:56:43 -0800

Off-list, as this isn't really related to the development of HTML whatsoever.


On 3 Mar 2008, at 08:54, Martin Duerst wrote:

I don't see anything making a BOM illegal in UTF-16LE/UTF-16BE, in
fact, the only mention I find of it with regards to either in Unicode
5.0 is "In UTF-16(BE|LE), an initial byte sequence <(FE FF|FF FE)> is
interpreted as U+FEFF zero width no-break space."


That's exactly it. To make it very explicit, there is one codepoint
(U+FEFF) and two functions: BOM and ZWNBSP. What the above says is
that U+FEFF at the start of files marked as UTF-16LE/UTF-16BE is
always ZWNBSP, and therefore is never a BOM. This means that a leading
BOM is forbidden.

Ah. My mistake: thinking of ZWNBSP as just being the character name, and not its specific meaning in the context (which of course is important for U+FEFF).
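
For illustration, here is a small Python sketch of the distinction above. It assumes Python's built-in "utf-16" and "utf-16-le" codecs as stand-ins for the UTF-16 and UTF-16LE labels; the variable names are just for this example, and this is not how any HTML parser is required to behave.

```python
import codecs

# The bytes FF FE followed by "A" encoded as little-endian UTF-16.
data = codecs.BOM_UTF16_LE + "A".encode("utf-16-le")

# Labelled plain "UTF-16": the leading FF FE acts as a byte order mark.
# It selects little-endian and is not part of the decoded text.
as_utf16 = data.decode("utf-16")

# Labelled "UTF-16LE": the byte order is already fixed by the label, so
# the same two bytes decode to the character U+FEFF (ZWNBSP) and stay
# in the text.
as_utf16le = data.decode("utf-16-le")

print(repr(as_utf16))
print(repr(as_utf16le))
```

With the UTF-16LE label the decoded string is one character longer, because the leading U+FEFF is kept as content rather than consumed as a signature.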

If there are HTML files that can start with arbitrary characters, then
it might be okay to have a UTF-16LE or UTF-16BE file start with U+FEFF,
because this can then be interpreted as a ZWNBSP (although a ZWNBSP
at the start of a file doesn't make a lot of sense). If HTML files
have to start with markup, then a UTF-16LE or UTF-16BE HTML file
cannot start with U+FEFF, because a ZWNBSP isn't markup.
(Last time I knew HTML, it had to have at least a <title> element,
so it had to start with markup, but I don't know how that is working
out in HTML5.)

A conformant document must start with a doctype, but for a non-conforming document a (leading) ZWNBSP will just end up at the start of <body> (i.e., it gets treated like any other non-ASCII space character).



--
Geoffrey Sneddon
<http://gsnedders.com/>
