Quoting Marco Cimarosti <[EMAIL PROTECTED]>:

> Doug Ewell wrote:
> > In UTF-16 practically any sequence of bytes is valid, and since you
> > can't assume you know the language, you can't employ distribution
> > statistics. Twelve years ago, when most text was not Unicode and all
> > Unicode text was UTF-16, Microsoft documentation suggested a heuristic
> > of checking every other byte to see if it was zero, which of course
> > would only work for Latin-1 text encoded in UTF-16.
>
> I beg to differ. IMHO, analyzing zero bytes is a viable method for
> detecting BOM-less UTF-16 and UTF-32. BTW, I didn't know (and I don't
> quite care) that this method was suggested first by Microsoft: to me, it
> seems quite self-evident.
>
> It is extremely unlikely that a text file encoded in any single- or
> multi-byte encoding (including UTF-8) would contain a zero byte, so the
> presence of zero bytes is a strong enough hint for UTF-16 (or UCS-2) or
> UTF-32.
False positives can be caused by U+0000 (most often encoded as the single
byte 0x00), which some applications do use in text files. Hence you need to
look for sequences where there is a null octet every other octet, which in
turn increases the risk of false negatives:

False negatives can be caused by text that doesn't contain any Latin-1
characters. The method can be used reliably with text files that are
guaranteed to contain large amounts of Latin-1: in particular, files for
which certain ASCII characters are given an application-specific meaning,
for instance XML and HTML files, comma-delimited files, tab-delimited files,
vCards and so on. It can be particularly reliable in cases where certain
ASCII characters will always begin the document (e.g. XML).

--
Jon Hanna
<http://www.hackcraft.net/>
*Thought provoking quote goes here*
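[Editorial note: to make the zero-byte heuristic discussed above concrete, here is a minimal sketch in Python. It is not code from the original posts; the function name guess_wide_encoding, the 50% threshold, and the modulo-4 bookkeeping are illustrative assumptions, not anything Doug, Marco, or Jon specified.]

    from typing import Optional

    def guess_wide_encoding(data: bytes, threshold: float = 0.5) -> Optional[str]:
        """Guess a BOM-less wide encoding from the positions of zero bytes.

        Returns 'utf-32-le', 'utf-32-be', 'utf-16-le', 'utf-16-be', or None
        when the zero-byte pattern is not convincing. The threshold is an
        arbitrary tuning knob: the fraction of code units that must show
        zero bytes in the expected positions.
        """
        if not data:
            return None

        # Count zero bytes at each offset modulo 4; this covers both the
        # 2-byte (UTF-16) and 4-byte (UTF-32) code unit cases in one pass.
        zeros = [0, 0, 0, 0]
        for i, b in enumerate(data):
            if b == 0:
                zeros[i % 4] += 1

        units4 = len(data) / 4   # number of 32-bit code units
        units2 = len(data) / 2   # number of 16-bit code units

        # UTF-32 holding BMP text: three zero bytes per code unit.
        # Little-endian: xx 00 00 00 (zeros at offsets 1, 2, 3).
        if min(zeros[1], zeros[2], zeros[3]) / units4 >= threshold:
            return "utf-32-le"
        # Big-endian: 00 00 00 xx (zeros at offsets 0, 1, 2).
        if min(zeros[0], zeros[1], zeros[2]) / units4 >= threshold:
            return "utf-32-be"

        # UTF-16 holding Latin-1 text: the high byte of each code unit is
        # zero. Little-endian puts it at odd offsets, big-endian at even.
        odd_zeros = zeros[1] + zeros[3]
        even_zeros = zeros[0] + zeros[2]
        if odd_zeros / units2 >= threshold and even_zeros / units2 < threshold:
            return "utf-16-le"
        if even_zeros / units2 >= threshold and odd_zeros / units2 < threshold:
            return "utf-16-be"

        return None

As the discussion above notes, a sketch like this gives false negatives on UTF-16 text with few or no Latin-1 characters (the high bytes are then rarely zero), and the threshold guards against the false positives caused by occasional stray U+0000 bytes in single-byte or UTF-8 files.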

