Re: [unicode] More ways to encode U+FEFF (was: Re: Designing

Doug Ewell Wed, 06 Sep 2000 23:48:38 -0700
David Starner <[EMAIL PROTECTED]> wrote:

> On Tue, Jul 18, 2000 at 08:47:41PM -0800, Doug Ewell wrote:
>> Not even CLOSE to a complete list.  From the forthcoming(1) bestseller
>> "The Quadrature of Unicode":

<snip>

> Do any of these actually use an initial BOM in practice? I'm about
> to write a Unicode signature detector for Ngeadal, and I may as well
> detect anything I can. (And since Ngeadal may end up supporting any
> of the above I can get specs on . . .)

My posting from July 18 was semi-serious.

UTF-1 has been removed from the Unicode Standard.  Its advantages of C1
transparency and near-Latin-1 transparency were offset by its use of
7-bit ASCII characters in multibyte sequences and its computational
inefficiency.  It has been superseded by UTF-8.  In any case, any UTF-1
data that may exist in the real world probably would not have a BOM,
since widespread recommendation of the BOM-as-signature came after the
replacement of UTF-1 by UTF-8.

Most UTF-7 data probably does not have a BOM either, but if it did, the
exact bytes would not necessarily be 2B 2F 76 38 2D, but would depend
on the character immediately following the BOM.
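
To make the dependence concrete, here is a rough sketch in Python (the helper name utf7_bom_prefixes is made up for illustration, and this is not taken from any particular implementation). U+FEFF contributes 16 bits to the modified-base64 run, which fills two base64 digits and two-thirds of a third; the remaining two bits come from whatever is encoded next, or are zero if the run ends there.

```python
import string

# Standard base64 alphabet, as used by UTF-7's modified base64.
B64 = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def utf7_bom_prefixes():
    """Hypothetical helper: the four ways a UTF-7 stream can open with U+FEFF."""
    prefixes = []
    for top_bits in range(4):
        # 16 bits of U+FEFF followed by the top 2 bits of the next
        # base64-encoded UTF-16 code unit (00 if the run ends here).
        value = (0xFEFF << 2) | top_bits
        digits = "".join(B64[(value >> shift) & 0x3F] for shift in (12, 6, 0))
        prefixes.append(("+" + digits).encode("ascii"))
    return prefixes

for prefix in utf7_bom_prefixes():
    print(" ".join(f"{b:02X}" for b in prefix))
```

A detector would therefore look for 2B 2F 76 followed by one of 38, 39, 2B or 2F. The 2B 2F 76 38 2D form is the first of these followed by an explicit "-" terminator.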

UTR #16, which specifies UTF-EBCDIC (it may be a UTS by now; I haven't
checked), does specify the use of a BOM-as-signature.  So if there is
any UTF-EBCDIC data in the real world, you would probably want to check
for that signature.
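
A minimal sketch of such a check in Python, folded in with the usual UTF-8/16/32 signatures. The UTF-EBCDIC bytes DD 73 66 73 are my reading of UTR #16 and should be verified against the current text; the table and function names are hypothetical.

```python
# Longer signatures come first so that UTF-32LE (FF FE 00 00) is not
# mistaken for UTF-16LE (FF FE).
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xdd\x73\x66\x73", "UTF-EBCDIC"),  # assumed from UTR #16; verify
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

def detect_signature(data: bytes):
    """Return the name of the encoding whose signature opens data, or None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

Note that a UTF-16LE file whose first real character is U+0000 is indistinguishable from UTF-32LE by signature alone; this sketch simply prefers the longer match.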

UCN data probably will not have a BOM, but the sequence "\ufeff" (and
case-shifted equivalents) certainly seems as though it could only be
intended to have that meaning.
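
If you do want to catch it, a sketch along these lines would do (Python; the names are illustrative only). It assumes the four-digit lowercase-u form with the hex digits in either case, and deliberately ignores the eight-digit \U0000FEFF form.

```python
import re

# Backslash, lowercase "u", then F E F F with each hex digit in either case.
UCN_BOM = re.compile(rb"\\u[fF][eE][fF][fF]")

def starts_with_ucn_bom(data: bytes) -> bool:
    """True if data opens with the six ASCII characters spelling \\ufeff."""
    return UCN_BOM.match(data) is not None
```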

All the others are private or semi-private experiments, and regardless
of their merits or faults, you will almost certainly never encounter
any real-world data encoded in them.

-Doug Ewell
 Fullerton, California