Markus Scherer <[EMAIL PROTECTED]> wrote:

> notepad always saves unicode-encoded files with the appropriate
> signature byte sequence, like most other microsoft-apps and many
> other well-behaved applications.
>
> they are the first 2 to 4 bytes in the text file, encode U+feff
> in the particular encoding scheme, and are as follows:
>
> utf-8:      ef bb bf
> utf-16be:   fe ff
> utf-16le:   ff fe
> utf-32be:   00 00 fe ff
> utf-32le:   ff fe 00 00 (check before utf-16le!)
> scsu:       0e fe ff (unfortunately rather rarely used)

Not even CLOSE to a complete list.  From the forthcoming(1) bestseller
"The Quadrature of Unicode":

UTF-1:       F7 64 4C
UTF-7:       2B 2F 76 38 2D        "+/v8-"
UTF-7d5:     BF FB FF
UTF-8C1:     BB ED DF
UTF-9:       93 FD FF
UTF-EBCDIC:  DD 73 66 73
UTF-mu(2):   9F 9B FF
UCN(3):      5C 75 66 65 66 66     "\ufeff"
DUCK(4):     81 FE FF

Needless to say, most of these additional encoding forms/schemes range
from the sublime to the ridiculous.  Don't use any of them in the real
world except UTF-7, UTF-EBCDIC, and UCN, and those only when you must.
(Although I'm considering recommending UTF-1 to people who insist on
C1 transparency and Latin-1 legibility in a UTF.)
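
For anyone who wants to act on these tables, here is a minimal sketch of signature sniffing in Python. It is an illustration, not a vetted detector: the function name sniff_signature and the choice of which entries to include are my own assumptions, and the byte values are simply copied from the two lists above (the UTF-7 entry covers only the "+/v8-" form shown there). Sorting by length is one easy way to honor the "check before utf-16le" warning.

```python
# Signatures copied from the lists in this thread; the selection is illustrative.
SIGNATURES = [
    ("UTF-8",      bytes.fromhex("EFBBBF")),
    ("UTF-16BE",   bytes.fromhex("FEFF")),
    ("UTF-16LE",   bytes.fromhex("FFFE")),
    ("UTF-32BE",   bytes.fromhex("0000FEFF")),
    ("UTF-32LE",   bytes.fromhex("FFFE0000")),
    ("SCSU",       bytes.fromhex("0EFEFF")),
    ("UTF-7",      bytes.fromhex("2B2F76382D")),   # "+/v8-" form only
    ("UTF-EBCDIC", bytes.fromhex("DD736673")),
    ("UCN",        b"\\ufeff"),                    # the six ASCII characters \ufeff
]

# Try longer signatures first, so FF FE 00 00 (UTF-32LE) is tested
# before its two-byte prefix FF FE (UTF-16LE).
SIGNATURES.sort(key=lambda entry: len(entry[1]), reverse=True)


def sniff_signature(data):
    """Return (name, signature_length) for the first matching signature,
    or (None, 0) if the data starts with none of the known ones."""
    for name, signature in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None, 0
```

A caller would read the first few bytes of a file, pass them to sniff_signature, and skip the reported number of bytes before decoding. Note that a UTF-16LE file whose first character after the signature is U+0000 is indistinguishable from UTF-32LE by this method.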

Notes:
(1) Don't look for it in your local bookstore any time soon.
(2) Submitted by a fellow list member (along with the book title).
(3) Universal Character Name convention, also known as Java escape
     sequences.
(4) Doug's Unicode Compression Kludge, invented in 1996 before I knew
     about any of the real UTFs.  Nicknamed "UTF-Doug" by Peter
     Constable in a 1998 discussion.

-Doug Ewell
 Fullerton, California
