Richard

> I have an app that needs to auto-detect Unicode and plain text, and render 
> them correctly based on that auto-detection.
> 
> I have the UTF16 stuff working, but with UTF8 I have a problem:  there is no 
> BOM to let me know if it's Unicode, and some plain text files will 
> occasionally have high-ASCII values in them (like the dagger symbol).
> 
> What patterns should I be looking for in the binary data of a file to 
> distinguish UTF8 from plain text?

These are the "Rules of Thumb" that I have used to try to determine the 
encoding of text files. I feel that I achieved more than 90 per cent 
success, but that may be because most of the files only included true ASCII 
characters (0-127). The script only tries to distinguish between ASCII, UTF-8, 
MacRoman and Windows 1252 Codepage (the US default for Windows).

Rules of Thumb, applied in the following order:

1. If the string starts with a BOM, the encoding inferred from the BOM is 
returned.

2. If the string contains only characters in the range 0x00 - 0x7F, it is an 
ASCII string.

3. If the string contains more UTF-8 multi-byte characters than it does invalid 
utf-8 characters and invalid multi-byte sequences, it is a UTF-8 string.

4. If the string contains characters in the range 0xA0 - 0xFF but none in the 
range 0x80 - 0x9F, it is an ISO-8859-1 string.

5. If the string contains any of 0x81, 0x8D, 0x8F, 0x90 or 0x9D, it is a 
MacRoman string.

6. If the string contains carriage returns but no line feeds, it is a MacRoman 
string.

7. It is a Windows 1252 Codepage string.
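
For anyone who does not read REBOL, here is a rough sketch of the rules above 
in Python. It is not a translation of my script, just an illustration of the 
same order of tests. The names count_utf8 and guess_encoding are made up for 
this example, and the returned labels are simply chosen to look like Python 
codec names. The UTF-8 check is simplified: it only looks at lead and 
continuation byte ranges, so it does not reject overlong forms or surrogates.

```python
import codecs

# Rule 1: known BOMs. The UTF-32 ones come first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

MACROMAN_ONLY = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def count_utf8(data):
    """Return (valid multi-byte sequences, invalid bytes) found in data."""
    valid = invalid = 0
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:            # plain ASCII, counts as neither
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:
            need = 1            # lead byte of a 2-byte sequence
        elif 0xE0 <= b <= 0xEF:
            need = 2            # lead byte of a 3-byte sequence
        elif 0xF0 <= b <= 0xF4:
            need = 3            # lead byte of a 4-byte sequence
        else:
            invalid += 1        # stray continuation byte or illegal lead
            i += 1
            continue
        tail = data[i + 1:i + 1 + need]
        if len(tail) == need and all(0x80 <= t <= 0xBF for t in tail):
            valid += 1
            i += 1 + need
        else:
            invalid += 1        # lead byte without its continuation bytes
            i += 1
    return valid, invalid


def guess_encoding(data):
    """Apply the rules of thumb, in order, to a bytes object."""
    for bom, name in BOMS:                          # rule 1
        if data.startswith(bom):
            return name
    high = [b for b in data if b >= 0x80]
    if not high:                                    # rule 2
        return "ascii"
    valid, invalid = count_utf8(data)
    if valid > invalid:                             # rule 3
        return "utf-8"
    if all(b >= 0xA0 for b in high):                # rule 4
        return "iso-8859-1"
    if any(b in MACROMAN_ONLY for b in high):       # rule 5
        return "mac_roman"
    if b"\r" in data and b"\n" not in data:         # rule 6
        return "mac_roman"
    return "cp1252"                                 # rule 7
```

With this sketch, a file containing the Windows 1252 dagger (0x86) and CRLF 
line endings falls through to rule 7, while the same text saved as UTF-8 is 
caught by rule 3.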

The approach I take in the script is to count the different types of characters 
in the text and then apply the rules of thumb. The script is written in REBOL, 
so it will probably not even be of help as a guide. However, the documentation 
includes a table of the differences between UTF-8, Windows 1252 and MacRoman 
which you may find useful. You can find it at 
http://www.rebol.org/documentation.r?script=str-enc-utils.r

Regards

Peter
