Re: What does one do if the encoding is unknown and all you have is a sequence of bytes?

Karl Williamson Fri, 19 Jul 2013 14:27:07 -0700

On 07/19/2013 11:51 AM, Costello, Roger L. wrote:

Hi Folks,


Suppose that these hex bytes:

        C3 83 C2 B1

show up in a message and the message contains no hint what its encoding is.

Perhaps it is 8859-1, in which case the message consists of four 1-byte 
characters:

C3 = Ã
83 = the “no break here” character
C2 = Â
B1 = ±

Perhaps it is UTF-8, in which case the message consists of two 2-byte 
characters:

C383 = 쎃
C2B1 = 슱


That's not how UTF-8 works.  Instead in UTF-8 it would be:

 C3 83 = LATIN CAPITAL LETTER A WITH TILDE
 C2 B1 = PLUS-MINUS SIGN
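
To make the arithmetic concrete, here is a minimal Python sketch of how a 2-byte UTF-8 sequence maps to a code point. The helper name decode_two_byte is made up for illustration and handles only the 2-byte form; in practice one would simply call bytes.decode("utf-8").

```python
def decode_two_byte(lead: int, cont: int) -> str:
    """Decode one 2-byte UTF-8 sequence by hand (illustrative helper only)."""
    # A 2-byte sequence has the bit layout 110xxxxx 10yyyyyy.
    if lead & 0xE0 != 0xC0 or cont & 0xC0 != 0x80:
        raise ValueError("not a 2-byte UTF-8 sequence")
    # Keep the 5 payload bits of the lead byte and the 6 of the continuation.
    code_point = ((lead & 0x1F) << 6) | (cont & 0x3F)
    # Values below U+0080 must use the 1-byte form, so reject overlong forms.
    if code_point < 0x80:
        raise ValueError("overlong encoding")
    return chr(code_point)


first = decode_two_byte(0xC3, 0x83)   # U+00C3 LATIN CAPITAL LETTER A WITH TILDE
second = decode_two_byte(0xC2, 0xB1)  # U+00B1 PLUS-MINUS SIGN
```

The built-in decoder agrees: bytes.fromhex("C383C2B1").decode("utf-8") gives the same two characters.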

It's unlikely that text in any other encoding will pass a UTF-8 validity
test once the input contains more than a few non-ASCII bytes. So you can
rule UTF-8 in or out fairly easily. You can also look for BOMs to detect
UTF-16 and UTF-32.
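
A rough Python sketch of those two checks, BOM sniffing followed by a strict UTF-8 validity test, might look like this. The function name sniff and the choice to return Python codec names are assumptions made for illustration, not part of any standard API.

```python
import codecs
from typing import Optional

# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff(data: bytes) -> Optional[str]:
    """Return a Python codec name if a BOM or UTF-8 validity settles it."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        # Strict decoding rejects stray continuation bytes, truncated
        # sequences, overlong forms and surrogates.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None  # not UTF-8; fall back to other heuristics
    return "utf-8"
```

Note that pure ASCII input also passes the UTF-8 test, which is harmless because ASCII is a subset of UTF-8. For the four bytes in question, sniff(bytes.fromhex("C383C2B1")) reports "utf-8".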

After that, there are various heuristics that can be applied, and people
have written things that attempt to guess encodings. An example from
Perl is

http://search.cpan.org/~dankogai/Encode-2.51/lib/Encode/Guess.pm

but it requires a list of possible encodings that it experiments with.
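
The general idea of experimenting with a list of candidates can be sketched in Python as below. This is not how Encode::Guess is implemented; it is only a simple trial-decoding loop, and the function name plausible_encodings is hypothetical.

```python
from typing import Iterable, List


def plausible_encodings(data: bytes, candidates: Iterable[str]) -> List[str]:
    """Return the candidate encodings under which data decodes without error."""
    survivors = []
    for name in candidates:
        try:
            data.decode(name)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are invalid in this encoding.
            # LookupError: Python does not know a codec by this name.
            continue
        survivors.append(name)
    return survivors


matches = plausible_encodings(
    bytes.fromhex("C383C2B1"), ["ascii", "utf-8", "latin-1"]
)
```

Here matches comes out as ["utf-8", "latin-1"]. ISO 8859-1 assigns a character to every byte value, so it never fails to decode; a successful decode only narrows the list, and some further heuristic is still needed to choose among the survivors.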

Or, perhaps it is some other encoding.

What does one do in such a situation?

/Roger
