> Suppose that these hex bytes:
>
>     C3 83 C2 B1
>
> show up in a message and the message contains no hint what its encoding is.
>
> Perhaps it is 8859-1, in which case the message consists of four 1-byte
> characters:
>
>     C3 = Ã
>     83 = the “no break here” character
>     C2 = Â
>     B1 = ±
>
> Perhaps it is UTF-8, in which case the message consists of two 2-byte
> characters:
>
>     C383 = 쎃
>     C2B1 = 슱

Actually, that would be interpreting it as UTF-16, not as UTF-8. That can probably be quickly ruled out if the rest of the text is not obviously in UTF-16.

Interpreted as UTF-8, it would be:

    C3 83 --> U+00C3 = Ã
    C2 B1 --> U+00B1 = ±

That is more likely than the other two alternatives you cite.

Of course, you also have to consider serial corruptions as a possibility. It could have started out as UTF-8 C3 B1 --> U+00F1 = ñ. Then the <C3 B1> got misinterpreted as Latin-1 (giving Ã and ±), and those two characters were then re-encoded as UTF-8, producing <C3 83 C2 B1>.

--Ken
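
For anyone who wants to check the readings discussed in this message, here is a minimal Python sketch. It is only an illustration: the variable names are made up, and big-endian byte order is assumed for the UTF-16 reading. It decodes the same four bytes three ways and then undoes the suspected double encoding.

```python
# The four bytes from the message, with no encoding label attached.
raw = bytes.fromhex("C3 83 C2 B1")

# Reading 1: ISO 8859-1 (Latin-1), one character per byte.
as_latin1 = raw.decode("latin-1")    # U+00C3, U+0083, U+00C2, U+00B1

# Reading 2: UTF-8, two 2-byte sequences.
as_utf8 = raw.decode("utf-8")        # U+00C3, U+00B1

# Reading 3: UTF-16, big-endian byte order assumed.
as_utf16 = raw.decode("utf-16-be")   # U+C383, U+C2B1

# Serial corruption hypothesis: the text began as UTF-8 <C3 B1>, was read
# as Latin-1, and the result was encoded as UTF-8 again. Reversing the last
# two steps recovers the original character.
repaired = as_utf8.encode("latin-1").decode("utf-8")   # U+00F1

for label, text in [("latin-1", as_latin1), ("utf-8", as_utf8),
                    ("utf-16-be", as_utf16), ("repaired", repaired)]:
    print(label, [f"U+{ord(ch):04X}" for ch in text])
```

Note that the repair step only works because every character in the UTF-8 reading happens to fit in Latin-1. If the intermediate misreading had used windows-1252 instead, the reverse step would need that codec.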

