Cesar David Rodas Maldonado wrote:
> I wanted to ask how can i know if a given text is UTF8 or ISO-8859-1?
Well, there might be a way if you only need to tell UTF-8 apart from
ISO-8859-1 (that is, you already know the text is one or the other).

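Some byte sequences are invalid in UTF-8. If you have something like
0xE1 0x61, which in ISO-8859-1 means 'áa', you have an invalid UTF-8
sequence: 0xE1 starts a three-byte character and must be followed by
two bytes in the range 0x80-0xBF, and 0x61 is not one of them. Then you
know that the text is not UTF-8.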

How do you test it? If you're using Linux you can try to convert your
string from UTF-8 to UTF-16 using glibc's iconv. If the text is really
ISO-8859-1 and it contains a sequence that is invalid in UTF-8 (most
likely, if it has any non-ASCII characters), iconv fails with an
illegal-sequence error (EILSEQ) and then you have it!
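
For illustration, here is a minimal sketch of the same test in Python
rather than with iconv (that swap is mine: it uses Python's strict
UTF-8 decoder instead of the glibc call). The name guess_encoding is
made up for this example, and it assumes, as above, that the text is
either UTF-8 or ISO-8859-1.

```python
def guess_encoding(data: bytes) -> str:
    """Guess between UTF-8 and ISO-8859-1, assuming it is one of the two."""
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "ISO-8859-1"
    return "UTF-8"


if __name__ == "__main__":
    # 'áa' encoded as ISO-8859-1 is not a valid UTF-8 byte sequence.
    print(guess_encoding("áa".encode("iso-8859-1")))
    print(guess_encoding("áa".encode("utf-8")))
```

Note that pure ASCII text is reported as UTF-8 here, which is harmless
because ASCII bytes mean the same thing in both encodings.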

If you are not using Linux then you should learn how Unicode characters
are encoded in UTF-8 (see http://en.wikipedia.org/wiki/UTF-8) and write
your own check.
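
A hand-written check along those lines could look roughly like the
sketch below. It is only one possible way to do it, written in Python;
the function name is_valid_utf8 is hypothetical. It follows the UTF-8
bit patterns: the lead byte says how many continuation bytes (10xxxxxx)
must follow, and the decoded value must not be an overlong form, a
surrogate, or above U+10FFFF.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hand-rolled check)."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                 # 0xxxxxxx: plain ASCII
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:        # 110xxxxx (0xC0/0xC1 would be overlong)
            need, cp, lowest = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:      # 1110xxxx: three-byte sequence
            need, cp, lowest = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4:      # 11110xxx: four-byte sequence
            need, cp, lowest = 3, b & 0x07, 0x10000
        else:
            return False             # stray continuation or invalid lead byte
        if i + need >= n:
            return False             # sequence is cut off at end of input
        for k in range(1, need + 1):
            c = data[i + k]
            if (c & 0xC0) != 0x80:   # continuation bytes must be 10xxxxxx
                return False
            cp = (cp << 6) | (c & 0x3F)
        if cp < lowest:              # overlong encoding
            return False
        if 0xD800 <= cp <= 0xDFFF:   # surrogates are not allowed in UTF-8
            return False
        if cp > 0x10FFFF:            # beyond the Unicode range
            return False
        i += need + 1
    return True
```

With this, is_valid_utf8("áa".encode("iso-8859-1")) comes out False and
is_valid_utf8("áa".encode("utf-8")) comes out True, so a False result
means the text cannot be UTF-8.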

There is a very good chance it will work fine for you. Every ISO-8859-1
document with non-ASCII chars I try to open with UTF-8 editors gives me
errors.

What do you think?

Besides that, I know that there are a few probabilistic methods to
determine the charset a text is encoded in. Mozilla Firefox uses them a
lot. There are many papers on this subject on the internet. Google is
your friend.

Best regards,
Daniel Colchete
