Cesar David Rodas Maldonado wrote:
> I wanted to ask how can i know if a given text is UTF8 or ISO-8859-1?

Well, there might be a way if you only want to know whether the text is UTF-8 or ISO-8859-1 (that is, you already know that it is one or the other).

Some byte sequences are invalid in UTF-8. If you have something like 0xE1 0x61, which in ISO-8859-1 means 'áa', you have an invalid UTF-8 sequence: 0xE1 announces a three-byte character, and 0x61 is not a continuation byte. Then you know that the text is not UTF-8.

How do you test it? If you're using Linux you can try to convert your string from UTF-8 to UTF-16 using glibc's iconv. If it is ISO-8859-1 text and you are lucky enough that it is invalid UTF-8 (most likely, as long as it contains non-ASCII chars), the conversion fails and you have your answer. If you are not using Linux then you should learn how Unicode chars are encoded in UTF-8 (see http://en.wikipedia.org/wiki/UTF-8) and write your own check.

There is a very good chance it will work fine for you. Every ISO-8859-1 document with non-ASCII chars I try to open with UTF-8 editors gives me errors. What do you think?

Besides that, I know that there are a few probabilistic methods to determine the charset in which a text is encoded. Mozilla Firefox uses them a lot. There are many papers on this subject on the internet. Google is your friend.

Best regards,
Daniel Colchete
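
For reference, here is a minimal sketch of the validity check described above. It uses Python's built-in strict UTF-8 decoder instead of glibc's iconv, which is a substitution on my part, and the function name guess_encoding is made up for illustration. It assumes the input can only be UTF-8 or ISO-8859-1; pure ASCII is reported separately because it is identical in both encodings.

```python
def guess_encoding(data: bytes) -> str:
    """Guess between UTF-8 and ISO-8859-1, assuming there are no other candidates."""
    try:
        # Pure ASCII bytes read the same in both encodings, so nothing to decide.
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        # bytes.decode is strict by default: any invalid sequence raises.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so under our assumption it must be ISO-8859-1.
        return "iso-8859-1"


if __name__ == "__main__":
    print(guess_encoding(b"\xe1a"))              # 'áa' in ISO-8859-1
    print(guess_encoding("áa".encode("utf-8")))  # the same text encoded as UTF-8
    print(guess_encoding(b"plain text"))
```

Note the one-sided nature of the test: a failed decode proves the text is not UTF-8, but a successful decode only makes UTF-8 very likely, since an ISO-8859-1 text can by coincidence form valid UTF-8 sequences.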