----- Original Message -----
From: "Markus Scherer" <[EMAIL PROTECTED]>
To: "unicode" <[EMAIL PROTECTED]>
Sent: Tuesday, October 28, 2003 11:35 PM
Subject: Re: unicode on Linux
> You should use Unicode internally - UTF-16 when you use ICU or most other
> libraries and software.
>
> Externally, that is for protocols and files and other data exchange, you
> need to identify (input: determine; output: label) the encoding of the
> data and convert between it and Unicode.
> If you can choose the output encoding, then stay with one of the Unicode
> charsets (UTF-8 or SCSU etc.),

Or else the input "determine" strategy will work fine for UTF-8 or SCSU,
provided that the leading BOM is explicitly encoded. I know this is not
recommended (at least for UTF-8), but I have several examples of files
encoded in UTF-8 without a BOM that fail to be identified correctly as
UTF-8. In that case the result of the automatic determination is still
quite random and depends on the content of the text. The idea that "if a
text (without BOM) looks like valid UTF-8, then it is UTF-8; else it uses
another legacy encoding" does not work reliably in practice and leads to
too many false positives.

> - if you are absolutely certain that they suffice - use US-ASCII or
> ISO 8859-1.

OK for US-ASCII, but even ISO-8859-1 should no longer be used without
explicit labelling of its encoding (with meta-data or other means): here
too we get problems if the text happens to look like valid UTF-8 (though
these cases are rarer). Statistically, however, the way UTF-8 places
trailing bytes in the range 0x80 to 0xBF after leading bytes at or above
0xC0 is a good indicator in many texts of whether the data is ISO-8859-1
or UTF-8: for genuine ISO-8859-1 text to look like valid UTF-8, an
accented letter (a byte at or above 0xC0) would have to be immediately
followed by one or two bytes in the 0x80 to 0xBF range, which in
ISO-8859-1 are C1 control codes and signs such as ¡, ¢ or © - not
impossible, but very unlikely in real text. The UTF-8 BOM itself
(0xEF 0xBB 0xBF) reads as "ï»¿" in ISO-8859-1, a sequence that is valid
there but extremely unlikely in actual texts. Exceptions exist, but they
will mostly occur in very short texts containing non-letter Latin-1
characters.

That is why an algorithm that tries to guess (in the absence of explicit
labelling) whether a text is UTF-8 or ISO-8859-1 should always assume
UTF-8 if the text validates under the strict UTF-8 encoding rules (a
small illustration of both points follows at the end of this message).
Some problems do remain, however, with the relaxed rules for UTF-8 as
defined in the older IETF RFC (RFC 2279): texts that are valid only under
that old definition of UTF-8 still exist today, and may persist for some
time in relational databases that were fed with them and never scanned
for re-encoding.

I still wonder why Unicode maintains that a BOM _should not_ be used in
UTF-8 texts. I think the opposite: wherever the BOM causes no problems -
for example in complete text files that are transported without any
explicit encoding label - it should be used. If the plain text is
self-labelled (such as an XML or XHTML source file, but not an HTML 4
file, even if it uses a <meta> tag), the leading BOM may be omitted. The
one case where the BOM should not be used is when the text must be kept
very short (but then the environment where such short strings are used
should have a way to specify and transport the encoding label as part of
its basic protocol); this applies, for example, to individual table
fields in databases.
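To make the byte argument concrete, here is a small demonstration (in
Python, used purely for illustration; the sample strings and names are
mine, not taken from ICU or any library mentioned above) of how a few
valid UTF-8 sequences look when they are misread as ISO-8859-1:

# Valid UTF-8 sequences, misread as ISO-8859-1, become accented letters
# followed by symbols or C1 control codes - combinations that practically
# never occur in genuine Latin-1 text.
samples = {
    "UTF-8 BOM (U+FEFF)": b"\xef\xbb\xbf",
    "U+00E9 e-acute":     "é".encode("utf-8"),
    "U+20AC euro sign":   "€".encode("utf-8"),
}
for label, raw in samples.items():
    print(f"{label:20} bytes: {raw.hex(' '):10} as ISO-8859-1: {raw.decode('iso-8859-1')!r}")

# Expected output (roughly):
#   UTF-8 BOM (U+FEFF)   bytes: ef bb bf   as ISO-8859-1: 'ï»¿'
#   U+00E9 e-acute       bytes: c3 a9      as ISO-8859-1: 'Ã©'
#   U+20AC euro sign     bytes: e2 82 ac   as ISO-8859-1: 'â\x82¬'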
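And here is a minimal sketch (same caveat: Python for illustration only,
the function name is my own) of the guessing strategy I am arguing for:
trust a UTF-8 BOM when present, otherwise accept UTF-8 only if the bytes
pass strict validation, and fall back to ISO-8859-1 as the interpretation
of last resort:

UTF8_BOM = b"\xef\xbb\xbf"   # U+FEFF encoded in UTF-8

def guess_encoding(data: bytes) -> str:
    """Guess between UTF-8 and ISO-8859-1 for unlabelled text."""
    if data.startswith(UTF8_BOM):
        return "utf-8"               # explicit signature, no guessing needed
    try:
        # Python's UTF-8 decoder enforces the strict rules (shortest form
        # only, no surrogates), so anything that passes is well-formed UTF-8.
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        # Any byte sequence decodes as ISO-8859-1, so this branch can never
        # fail: it is only the last-resort interpretation, not a proof.
        return "iso-8859-1"

print(guess_encoding("déjà vu".encode("utf-8")))       # -> utf-8
print(guess_encoding("déjà vu".encode("utf-8-sig")))   # -> utf-8 (BOM present)
print(guess_encoding("déjà vu".encode("iso-8859-1")))  # -> iso-8859-1

The order matters: UTF-8 has to be tried first, precisely because the
ISO-8859-1 branch accepts every possible byte sequence.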