On 07/19/2013 11:51 AM, Costello, Roger L. wrote:
Hi Folks,
Suppose that these hex bytes:
C3 83 C2 B1
show up in a message and the message contains no hint what its encoding is.
Perhaps it is ISO 8859-1, in which case the message consists of four 1-byte
characters:
C3 = Ã
83 = the “no break here” character
C2 = Â
B1 = ±
Perhaps it is UTF-8, in which case the message consists of two 2-byte
characters:
C383 = 쎃
C2B1 = 슱
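That's not how UTF-8 works: the two bytes of a sequence are not simply
concatenated to form the code point. The lead byte (110xxxxx) and the
continuation byte (10xxxxxx) each carry part of the code point's bits.
In UTF-8 it would be:
C3 83 = U+00C3 LATIN CAPITAL LETTER A WITH TILDE (Ã)
C2 B1 = U+00B1 PLUS-MINUS SIGN (±)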
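
To make the two readings concrete, here is a minimal Python sketch that
decodes the same four bytes both ways. It uses only the standard library;
the variable names are illustrative, and "latin-1" is Python's name for
ISO 8859-1.

```python
import unicodedata

# The four bytes from the message.
data = bytes.fromhex("C3 83 C2 B1")

# UTF-8 reading: two 2-byte sequences, so two characters.
as_utf8 = data.decode("utf-8")

# ISO 8859-1 reading: every byte is one character, so four characters.
as_latin1 = data.decode("latin-1")

for label, text in (("utf-8", as_utf8), ("latin-1", as_latin1)):
    print(label)
    for ch in text:
        # Control characters have no Unicode name, hence the default.
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch, '<control>')}")
```

Both decodes succeed, which is exactly why these four bytes on their own
do not settle the question.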
It's unlikely that text in any other encoding will pass a UTF-8 validity
test once the input is longer than just a few bytes, so you can rule UTF-8
in or out fairly easily. You can also look for a byte order mark (BOM) to
detect UTF-16 and UTF-32.
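
A rough Python sketch of those two checks follows. The function names
sniff_bom and is_valid_utf8 are made up for this example, and the returned
labels are just Python codec names; treat it as an illustration rather
than a complete detector.

```python
import codecs

def sniff_bom(data: bytes):
    """Return a codec name if data starts with a known BOM, else None."""
    # UTF-32 must be tested before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
    # begins with the UTF-16 LE BOM (FF FE).
    boms = (
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, name in boms:
        if data.startswith(bom):
            return name
    return None

def is_valid_utf8(data: bytes) -> bool:
    """True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

For the bytes C3 83 C2 B1, is_valid_utf8 returns True and sniff_bom
returns None. With only four bytes that is weak evidence, which is why
longer inputs make the validity test far more convincing.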
After that, there are various heuristics that can be applied, and people
have written tools that attempt to guess encodings. One example is Perl's
Encode::Guess:
http://search.cpan.org/~dankogai/Encode-2.51/lib/Encode/Guess.pm
but it needs to be given a list of candidate encodings to try.
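
The candidate-list idea can be sketched in Python as below. This is not
Encode::Guess's algorithm, only the simplest version of the same approach:
try each candidate and keep those that decode without error. The function
name plausible_encodings is hypothetical.

```python
def plausible_encodings(data: bytes, candidates):
    """Return the candidate encodings that decode data without error."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Invalid bytes for this encoding, or an unknown codec name.
            continue
        survivors.append(name)
    return survivors

data = bytes.fromhex("C3 83 C2 B1")
print(plausible_encodings(data, ["ascii", "utf-8", "latin-1"]))
```

Note that ISO 8859-1 accepts every possible byte sequence, so it always
survives this kind of test. A successful decode there tells you nothing,
and some further heuristic is needed to choose among the survivors.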
Or, perhaps it is some other encoding.
What does one do in such a situation?
/Roger