That is not sufficient. The first three bytes could be a real content
character (a ZWNBSP), or they could be a BOM. The label doesn't tell you.

This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE 0xFF
represents a BOM, and is not part of the content. In the second case, it
does *not* represent a BOM -- it represents a ZWNBSP, and must not be
stripped. The difference is that in the UTF-16 case the encoding name tells
you exactly what the situation is, whereas the single label "UTF-8" does not.
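
As a rough illustration of the distinction, here is a small Python sketch.
It assumes Python's built-in codec names; in particular "utf-8-sig" is
Python's own label for "UTF-8 with the signature stripped", not an official
charset name. The same leading bytes decode differently depending only on
which name you pass:

```python
# Byte sequences that begin with the encoded form of U+FEFF.
utf16_bytes = b"\xfe\xff\x00A"   # FE FF, then "A" in big-endian UTF-16
utf8_bytes = b"\xef\xbb\xbfA"    # EF BB BF, then "A" in UTF-8

# Under the "utf-16" label the leading FE FF is a BOM: it selects the
# byte order and is not part of the decoded content.
as_utf16 = utf16_bytes.decode("utf-16")

# Under the "utf-16-be" label the same two bytes are content: U+FEFF
# (ZWNBSP) is kept in the decoded text.
as_utf16be = utf16_bytes.decode("utf-16-be")

# Python's plain "utf-8" codec keeps a leading EF BB BF as U+FEFF, while
# its "utf-8-sig" codec strips it. The bytes alone cannot say which
# reading the author intended.
as_utf8 = utf8_bytes.decode("utf-8")
as_utf8_sig = utf8_bytes.decode("utf-8-sig")

for name, text in [("utf-16", as_utf16), ("utf-16-be", as_utf16be),
                   ("utf-8", as_utf8), ("utf-8-sig", as_utf8_sig)]:
    print(name, repr(text))
```

In each pair the input bytes are identical; only the label changes whether
the U+FEFF survives decoding.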

Mark
__________________________________
http://www.macchiato.com
►  “Eppur si muove” ◄

----- Original Message -----
From: "Murray Sargent" <[EMAIL PROTECTED]>
To: "Joseph Boyle" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Friday, November 01, 2002 12:42
Subject: RE: Names for UTF-8 with and without BOM


> Joseph Boyle says: "It would be useful to have official names to
> distinguish UTF-8 with and without BOM."
>
> To see if a UTF-8 file has no BOM, you can just look at the first three
> bytes. Is this a problem? Typically when you care about a file's
> encoding form, you plan to read the file.
>
> Thanks
> Murray
>
>
>

