On Jul 19, 2013, at 12:42 PM, Mark Davis ☕ <[email protected]> wrote:
> Popping up a level.
>
> ICU (and some other libraries) have heuristic encoding detection, that will
> take a sequence of bytes and come up with a likely encoding id.

However, the ICU encoding detection typically requires more than 4 bytes (usually at least 10 characters worth of bytes) in order to make a reasonable guess.

- Peter E

>
> Mark
>
> — Il meglio è l’inimico del bene —
>
> On Fri, Jul 19, 2013 at 8:40 PM, Whistler, Ken <[email protected]> wrote:
>
> > > Suppose that these hex bytes:
> > >
> > > C3 83 C2 B1
> > >
> > > show up in a message and the message contains no hint what its encoding is.
> > >
> > > Perhaps it is 8859-1, in which case the message consists of four 1-byte
> > > characters:
> > >
> > > C3 = Ã
> > > 83 = the “no break here” character
> > > C2 = Â
> > > B1 = ±
> > >
> > > Perhaps it is UTF-8, in which case the message consists of two 2-byte
> > > characters:
> > >
> > > C383 = 쎃
> > > C2B1 = 슱
> >
> > Actually, that would be interpreting it as UTF-16, not as UTF-8. That
> > can probably be quickly ruled out if the rest of the text is not obviously
> > in UTF-16.
> >
> > Interpreted as UTF-8, it would be:
> >
> > C3 83 --> U+00C3 = Ã
> > C2 B1 --> U+00B1 = ±
> >
> > More likely than the other two alternatives you cite.
> >
> > Of course, you also have to consider serial corruptions as a possibility.
> >
> > It could have started out as UTF-8 C3 B1 --> U+00F1 = ñ.
> >
> > Then the <C3 B1> got misinterpreted as Latin-1, and then re-misinterpreted
> > as UTF-8 again.
> >
> > --Ken
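
For anyone who wants to check the readings discussed above, here is a minimal Python sketch using only the standard codecs. The function names `interpretations` and `undo_double_encoding` are made up for illustration and are not part of ICU or any other library. It decodes the four bytes under each candidate encoding and then reverses one round of the serial corruption Ken describes, on the assumption that exactly one Latin-1 misreading happened.

```python
# The four bytes from the thread.
data = bytes.fromhex("C3 83 C2 B1")


def interpretations(raw: bytes) -> dict:
    """Decode the same bytes under the three readings from the thread."""
    return {
        "latin-1": raw.decode("latin-1"),      # four 1-byte characters
        "utf-16-be": raw.decode("utf-16-be"),  # two 16-bit code units
        "utf-8": raw.decode("utf-8"),          # two 2-byte sequences
    }


def undo_double_encoding(raw: bytes) -> str:
    """Reverse one round of 'UTF-8 read as Latin-1, then re-encoded as UTF-8'.

    Assumes exactly one such round took place; raises UnicodeError if the
    bytes do not fit that pattern.
    """
    once = raw.decode("utf-8")                     # 'Ã±'
    return once.encode("latin-1").decode("utf-8")  # bytes C3 B1 read as UTF-8


if __name__ == "__main__":
    for name, text in interpretations(data).items():
        print(name, [f"U+{ord(ch):04X}" for ch in text])
    print("after undoing one corruption:", undo_double_encoding(data))
```

All three decodings succeed, which is the point of the thread: four bytes alone cannot settle the question, so any choice among them is a guess about likelihood.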

