In a message dated 2001-04-10 3:04:09 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

>  When looking at a document would it be safe to assume that if you found any
>  of the following Byte Order Marks 
>  *    0xFFFE (UCS-2 Little Endian)
>  *    0xFEFE (UCS-2 Big Endian)

That should be 0xFEFF.

>  *    0xEFBBBF (UTF-8)
>  That the document is encoded with that encoding format. That means that if I
>  found the first 3 octets to be EF BB EF could I assume I am dealing with a
>  UTF-8 Document.

That is usually a safe assumption and a good practice, except that if the 
first two bytes are 0xFF 0xFE, you should check the next two to see if they 
are 0x00 0x00 (which would signify little-endian UCS-4).
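
As a rough illustration of that check, here is a minimal Python sketch. The 
names sniff_bom and _BOMS are made up for this example; the BOM constants 
come from Python's standard codecs module. It tests the four-byte signatures 
before the two-byte ones, so FF FE 00 00 is reported as little-endian UCS-4 
(UTF-32) rather than UTF-16. Treat the result as a guess: FF FE 00 00 could 
in principle also be UTF-16 text that begins with U+0000.

```python
import codecs

# Longer signatures come first so that FF FE 00 00 (UTF-32 LE) is not
# mistaken for FF FE (UTF-16 LE) followed by two zero bytes.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding_name, bom_length), or (None, 0) if no signature."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    # Hypothetical sample: UTF-8 signature followed by ASCII text.
    encoding, skip = sniff_bom(b"\xef\xbb\xbfhello")
    print(encoding, skip)
```

If no signature is found, the function returns (None, 0) and the caller has 
to fall back on some other means of determining the encoding.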

Also, think in terms of UTF-16, not UCS-2.

>  Apart from UTF and Unicode/UCS encoding formats do any other "legacy"
>  character sets use Byte Order Marks?

Good question.  I have not heard of any.

To follow up, what about signatures that are not necessarily byte order 
marks?  UTF-8 does not need a BOM, so the signature 0xEF 0xBB 0xBF is useful 
for the purpose Tomás mentioned, to indicate the encoding.  Do any other 
character sets have such signatures?

-Doug Ewell
 Fullerton, California
