On Thursday, August 07, 2003 2:40 AM, Doug Ewell <[EMAIL PROTECTED]> wrote:
> Kenneth Whistler <kenw at sybase dot com> wrote:
>
> > But I challenge you to find anything in the standard that
> > *prohibits* such sequences from occurring.
>
> I've learned that this question of "illegal" or "invalid" character
> sequences is one of the main distinguishing factors between those who
> truly understand Unicode and those who are still on the Road to
> Enlightenment.
>
> Very, very few sequences of Unicode characters are truly "invalid" or
> "illegal." Unpaired surrogates are a rare exception.
>
> In almost all cases, a given sequence might give unexpected results
> (e.g. putting a combining diacritic before the base character) or
> might be ineffectual (e.g. putting a variation selector before an
> arbitrary character), but it is still perfectly legal to encode and
> exchange such a sequence.

For Unicode itself this is true, but what users want is interoperability of encoded text under accurate rendering rules. In practice, any undefined or unpredictable behavior means a lack of interoperability, and such sequences should not be used. The standard should therefore strongly promote what constitutes a /valid/ encoding of text with regard to interoperability across all text processing algorithms, including parsing combining sequences, collation, and computing character properties from those /valid/ encoded sequences.

We don't have to care much if some encoded text considered valid under Unicode/ISO/IEC 10646 is rendered or processed differently or unpredictably, provided that this does not affect common text in actual languages. In fact, the standard specifies that ALL sequences made of code points in U+0000 to U+10FFFF (excluding the noncharacters U+xFFFE and U+xFFFF, and the surrogates U+D800 to U+DFFF) are valid under ISO/IEC 10646, but it does not attempt to assign properties or behavior to ALL of these characters or encoded sequences; it is the job of Unicode to specify this behavior.
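The distinction Doug draws above (truly ill-formed vs. merely odd) can be demonstrated in a few lines. This is a sketch using Python's standard `unicodedata` module: an unpaired surrogate cannot even be serialized to an encoding form, while a diacritic placed before its base character encodes and round-trips without complaint.

```python
import unicodedata

# An unpaired surrogate is one of the few truly ill-formed cases:
# the UTF-8 codec refuses to serialize it at all.
lone_surrogate = "\ud800"
try:
    lone_surrogate.encode("utf-8")
    raise AssertionError("unexpectedly encoded an unpaired surrogate")
except UnicodeEncodeError:
    print("unpaired surrogate rejected by the encoder")

# By contrast, a combining acute accent placed *before* a base letter
# is merely odd: it encodes, decodes, and round-trips without error.
odd_but_legal = "\u0301A"  # COMBINING ACUTE ACCENT, then 'A'
assert odd_but_legal.encode("utf-8").decode("utf-8") == odd_but_legal
print(unicodedata.name(odd_but_legal[0]))
```

The odd sequence is "perfectly legal to encode and exchange," exactly as the quoted text says; nothing below the rendering layer will ever object to it.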
If there's something to enhance in the Unicode standard (not in ISO/IEC 10646), it's exactly the specification of interoperable encoded sequences. This certainly means that concrete examples for actual languages must be documented. Just assigning properties to individual ISO/IEC 10646 characters is not enough, and Unicode should concentrate more effort on the actual encoding of text, not only on individual characters.

So for me, the "validity" of text is an ISO/IEC 10646 concept (now shared with Unicode versions for the assignment of characters in the repertoire), related only to the legally usable code points, while Unicode speaks about "well-formed" or "ill-formed" sequences, or about "normalized" sequences and transformations that preserve the actual text semantics. There is no ambiguity in ISO/IEC 10646 about the character assignments. But composed sequences are the real problem, for which Unicode must seek agreements: the W3C character model is based only on simplified combining sequences, but Unicode should go further with much more precise rules for the encoding of actual text, even before any attempt to describe other transformation algorithms (only the NF* transformations have a stability policy for now, but text writers also need stability for the text composition rules of actual languages).

We certainly don't need more assigned code points for existing scripts, but more rules for the actual representation of text using these scripts, and for how distinct scripts can interact and be mixed. There are some rules already specified for combining jamos, for combining Latin/Cyrillic/Greek alphabets, or for Hiragana/Katakana, but we are still far from an agreement for Hebrew, and even for some Han composed sequences, which still lack a specification needed for interoperability. The current wording of "Unicode validity" is, for me, very weak and probably defective.
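The NF* stability policy mentioned above is precisely what makes canonical equivalence testable today. A short sketch with Python's `unicodedata` (which implements the Unicode normalization forms): two distinct code point sequences for "é" are mapped to a single stable representative.

```python
import unicodedata

# "é" has two canonically equivalent encodings:
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # 'e' + COMBINING ACUTE ACCENT

# They are distinct code point sequences...
assert precomposed != decomposed
# ...but each normalization form maps both to one representative,
# which is what the stability policy guarantees across versions.
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
# Normalization is idempotent: re-applying it changes nothing.
assert unicodedata.normalize("NFC", precomposed) == precomposed
```

No comparable guarantee exists for the higher-level "text composition rules" the paragraph asks for; that is exactly the gap being described.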
What it designates is only an ISO/IEC 10646 validity of the code points used, and the validity of their UTF* transformations, based on individual code points. The kind of validity rules users want from Unicode is conformance of the actually encoded scripts for actual languages, for interoperability and data exchange.

The fact that Unicode was born by trying to maximize round-trip convertibility with legacy codepages and encoded character sets has introduced many difficulties: first, the base+combining characters model was introduced as fundamental for alphabetized scripts with separate letters for vowels. Then there's the case of Brahmic scripts, which complicates things, as Unicode has chosen to support both the ISCII standard model, with nuktas and viramas in logical encoding order, and the TIS-620 model for Thai and Lao, with a physical model. By contrast, the combining jamos model is remarkably simple, and it still follows the logical model shared by alphabetized scripts.

Looking now at the difficulties of encoding Tengwar reveals most of the difficulties that already exist for Thai, and now Hebrew, and the subtle artefacts needed in existing scripts used to transliterate foreign languages. Some of these difficulties now also affect the general alphabetized scripts (Latin notably), showing that the immutable model used to encode base letters and diacritics is not universal. So Unicode will need to extend and specify its own character model much further to support more scripts and languages, including in the case of transliterations.
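The remarkable simplicity of the combining jamos model can be seen directly in normalization: a leading, vowel, and trailing jamo written in logical order compose algorithmically into a single precomposed Hangul syllable. A sketch, again with Python's `unicodedata`:

```python
import unicodedata

# Logical-order conjoining jamos:
# U+1100 CHOSEONG KIYEOK, U+1161 JUNGSEONG A, U+11A8 JONGSEONG KIYEOK
jamos = "\u1100\u1161\u11a8"

# NFC composes them by formula into the single syllable U+AC01 (GAG);
# no per-syllable mapping table is needed for Hangul.
syllable = unicodedata.normalize("NFC", jamos)
assert syllable == "\uac01"
print(unicodedata.name(syllable))
```

Because the composition is purely arithmetic over the jamo ranges, the logical model stays predictable in a way the Thai/Lao physical model, with its visually reordered vowels, cannot match.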
Maybe in the future this will lead to a new level of conformance, defined by something more precise than just the basic canonical equivalence rules (for the NF* transforms and XML), with more precise definitions of "ill-formed" or "defective" sequences (I confess that I do not understand the need to differentiate the two concepts, and this current separation is really more confusing than helpful for understanding the Unicode standard).

What this means is that we need something saying "Unicode valid text," and not just "Unicode encoded text," which relates only to the shared assignment of code points to individual characters. The current "valid" term should be left to the ISO/IEC 10646 standard, and to the very few Unicode algorithms that handle only individual code points (such as the UTF* encoding forms and schemes), but its current definition does not help implementers and writers produce interoperable textual data. If the term "valid" cannot be changed, then I suggest defining "conforming" for encoded text independently of its validity (a "conforming text" would still need to use a "valid encoding").

-- Philippe.
Spam not tolerated: any unsolicited message will be reported to your Internet service providers.
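For what it's worth, the two concepts questioned above can be told apart mechanically: "ill-formed" applies to a code unit sequence that cannot even be decoded under an encoding form, while a "defective" combining character sequence is perfectly well-formed text that merely begins with a combining mark lacking a base. A sketch of both checks (the helper names are mine, and the defective test approximates "combining character" by a nonzero canonical combining class):

```python
import unicodedata

def is_well_formed_utf8(data: bytes) -> bool:
    """Ill-formed: the code unit sequence itself cannot be decoded."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def is_defective(text: str) -> bool:
    """Defective: well-formed text that opens with a combining mark."""
    return bool(text) and unicodedata.combining(text[0]) != 0

assert not is_well_formed_utf8(b"\x80abc")  # stray continuation byte
assert is_defective("\u0301e")              # accent with no base before it
assert not is_defective("e\u0301")          # ordinary combining sequence
```

The first condition lives at the encoding-form level, the second at the text level; conflating them is arguably what makes the current "validity" wording feel weak.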

