David Starner <[EMAIL PROTECTED]> wrote:

> On Tue, Jul 18, 2000 at 08:47:41PM -0800, Doug Ewell wrote:
>> Not even CLOSE to a complete list. From the forthcoming(1) bestseller
>> "The Quadrature of Unicode":

<snip>

> Do any of these actually use an initial BOM in practice? I'm about
> to write a Unicode signature detector for Ngeadal, and I may as well
> detect anything I can. (And since Ngeadal may end up supporting any
> of the above I can get specs on . . .)

My posting from July 18 was semi-serious.

UTF-1 has been removed from the Unicode Standard. Its advantages of
C1 transparency and near-Latin-1 transparency were offset by its use
of 7-bit ASCII characters in multibyte sequences and its computational
inefficiency. It has been superseded by UTF-8. In any case, any UTF-1
data that may exist in the real world probably would not have a BOM,
since widespread recommendation of the BOM-as-signature came after the
replacement of UTF-1 by UTF-8.

Most UTF-7 data probably does not have a BOM either, but if it did,
the exact bytes would not necessarily be 2B 2F 76 38 2D, but would
depend on the character immediately following the BOM.

UTR #16, which specifies UTF-EBCDIC (it may be a UTS by now; I haven't
checked), does specify the use of a BOM-as-signature. So if there is
any UTF-EBCDIC data in the real world, you would probably want to check
for that signature.

UCN data probably will not have a BOM, but the sequence "\ufeff" (and
case-shifted equivalents) certainly seems as though it could only be
intended to have that meaning.

All the others are private or semi-private experiments, and regardless
of their merits or faults, you will almost certainly never encounter
any real-world data encoded in them.

-Doug Ewell
 Fullerton, California
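
For the detector itself, the checks discussed above could be sketched
roughly as follows. This is only an illustration in Python, not anything
taken from Ngeadal: the names SIGNATURES and detect_signature are made up
for the example, the UTF-32 entries are an extra not discussed above, and
the UTF-EBCDIC bytes (DD 73 66 73) are assumed to be the UTR #16 encoding
of U+FEFF. For UTF-7 only the first three bytes are fixed; the fourth is
one of "8", "9", "+" or "/", depending on the character that follows.

```python
# Hypothetical sketch of a BOM-as-signature detector.
# Table contents and function name are illustrative assumptions.

# Fixed-byte signatures. Longer ones come first so that FF FE 00 00
# is reported as UTF-32LE rather than UTF-16LE (the two are genuinely
# ambiguous: it could also be a UTF-16LE BOM followed by U+0000).
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xdd\x73\x66\x73", "UTF-EBCDIC"),  # assumed from UTR #16
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

# UTF-7: "+/v" is fixed; the fourth byte shares bits with the next
# character, so it can be any of these four base64 letters.
UTF7_PREFIX = b"+/v"
UTF7_FOURTH = b"89+/"


def detect_signature(data):
    """Return a guessed encoding name for the leading bytes, or None."""
    for sig, name in SIGNATURES:
        if data.startswith(sig):
            return name
    if (data.startswith(UTF7_PREFIX) and len(data) >= 4
            and data[3] in UTF7_FOURTH):
        return "UTF-7"
    # UCN: a literal backslash-u escape; hex digits compared
    # case-insensitively ("\ufeff", "\uFEFF", ...).
    if data[:2] == b"\\u" and data[2:6].lower() == b"feff":
        return "UCN"
    return None


if __name__ == "__main__":
    print(detect_signature(b"\xef\xbb\xbfhello"))
    print(detect_signature(b"+/v8-hello"))
```

A result of None means only that no signature was found, not that the
data is not Unicode, since most real-world data in these encodings will
have no BOM at all.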

