We need an official Unicode Lint.

Jony
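
No official tool exists, but as a rough illustration, here is a minimal
sketch in Python (stdlib only) of the kind of checks such a lint could
run. The rule set below (lone surrogates, noncharacters, and combining
marks with no base character) is my own guess at a useful starting
point, not anything the standard currently defines:

    import unicodedata

    def unicode_lint(text):
        """Yield (index, message) warnings for suspect code point usage.

        The rules are illustrative only, not an official definition.
        """
        prev_is_base = False
        for i, ch in enumerate(text):
            cp = ord(ch)
            if 0xD800 <= cp <= 0xDFFF:
                yield i, "lone surrogate U+%04X" % cp
            elif (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF:
                yield i, "noncharacter U+%04X" % cp
            elif unicodedata.category(ch).startswith("M") and not prev_is_base:
                # A combining mark with nothing to combine with: legal,
                # but a "defective" sequence that rarely renders as
                # intended (compare the line-initial holam in the subject).
                yield i, "combining mark U+%04X with no base" % cp
            # Separators and most control/format characters do not act
            # as a base for a following combining mark.
            prev_is_base = unicodedata.category(ch)[0] not in ("Z", "C")

    for warning in unicode_lint("\u0301a \uFDD0"):
        print(warning)
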
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Philippe Verdy
> Sent: Thursday, August 07, 2003 4:28 PM
> To: [EMAIL PROTECTED]
> Subject: SPAM: Re: Questions on ZWNBS - for line initial
> holam plus alef
>
> On Thursday, August 07, 2003 2:40 AM, Doug Ewell
> <[EMAIL PROTECTED]> wrote:
>
> > Kenneth Whistler <kenw at sybase dot com> wrote:
> >
> > > But I challenge you to find anything in the standard that
> > > *prohibits* such sequences from occurring.
> >
> > I've learned that this question of "illegal" or "invalid" character
> > sequences is one of the main distinguishing factors between those
> > who truly understand Unicode and those who are still on the Road
> > to Enlightenment.
> >
> > Very, very few sequences of Unicode characters are truly "invalid"
> > or "illegal." Unpaired surrogates are a rare exception.
> >
> > In almost all cases, a given sequence might give unexpected results
> > (e.g. putting a combining diacritic before the base character) or
> > might be ineffectual (e.g. putting a variation selector before an
> > arbitrary character), but it is still perfectly legal to encode and
> > exchange such a sequence.
>
> For Unicode itself this is true, but what users want is
> interoperability of the encoded text with accurate rendering rules.
> In practice, this means that any undefined or unpredictable behavior
> will mean a lack of interoperability and should not be used.
>
> The standard should therefore strongly promote what is a /valid/
> encoding for text with regard to interoperability for all text
> processing algorithms, including parsing combining sequences,
> collation, and computing character properties from those /valid/
> encoded sequences.
>
> We don't have to care much if some encoded text considered valid
> under Unicode/ISO/IEC 10646 is rendered or processed differently or
> unpredictably, provided that this does not affect common text for
> actual languages.
>
> In fact, the standard specifies that ALL sequences made of code
> points in U+0000 to U+10FFFF (excluding the noncharacters U+xFFFE
> and U+xFFFF, and the surrogates U+D800 to U+DFFF) are valid under
> ISO/IEC 10646, but it does not attempt to assign properties or
> behavior to ALL of these characters or encoded sequences, as it is
> the job of Unicode to specify this behavior.
>
> If there's something to enhance in the Unicode standard (not in
> ISO/IEC 10646), it's exactly the specification of interoperable
> encoded sequences. This certainly means that concrete examples for
> actual languages must be documented. Just assigning properties to
> individual ISO/IEC 10646 characters is not enough, and Unicode
> should concentrate more effort on the actual encoding of text, not
> only on individual characters.
>
> So for me, the "validity" of text is an ISO/IEC 10646 concept (now
> shared with Unicode versions for the assignment of characters in
> the repertoire), related only to the legally usable code points,
> while Unicode speaks about "well-formed" or "ill-formed" sequences,
> or about "normalized" sequences and transformations that preserve
> the actual text semantics.
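
Doug's distinction above is easy to reproduce. In this short Python
demonstration (my own example, not from the thread), the UTF-8 encoder
rejects an unpaired surrogate outright, while a diacritic placed before
its base, or a stray variation selector, encodes without complaint:

    # A combining acute accent *before* its base letter: gives
    # unexpected rendering, but is perfectly encodable and exchangeable.
    print("\u0301a".encode("utf-8"))    # b'\xcc\x81a', no error

    # A variation selector after an arbitrary character: ineffectual,
    # but again legal to encode.
    print("a\uFE00".encode("utf-8"))    # b'a\xef\xb8\x80'

    # An unpaired surrogate is one of the few truly ill-formed cases.
    try:
        "\uD800".encode("utf-8")
    except UnicodeEncodeError as err:
        print("rejected:", err.reason)  # "surrogates not allowed"
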
> There is no ambiguity in ISO/IEC 10646 for the character
> assignments. But composed sequences are the real problem, for which
> Unicode must seek agreements: the W3C character model is based only
> on simplified combining sequences, but Unicode should go further
> with much more precise rules for the encoding of actual text, even
> before any attempt to describe other transformation algorithms
> (only the NF* transformations have a stability policy for now, but
> text authors also need stability for the text composition rules of
> actual languages).
>
> We certainly don't need more assigned code points for existing
> scripts, but rather more rules for the actual representation of
> text using these scripts, and for how distinct scripts can interact
> and be mixed. There are already some rules specified for combining
> jamos, for combining Latin/Cyrillic/Greek alphabets, or for
> Hiragana/Katakana, but we are still far from an agreement for
> Hebrew, and even for some Han composed sequences, which still lack
> a specification needed for interoperability.
>
> The current wording of "Unicode validity" is for me very weak, and
> probably defective. What it designates is only ISO/IEC 10646
> validity for the code points used, and the validity of their UTF*
> transformations, based on individual code points. The kind of
> validity rules users want from Unicode is conformance of the
> actually encoded scripts for actual languages, for interoperability
> and data exchange.
>
> The fact that Unicode was born by trying to maximize round-trip
> convertibility with legacy codepages and encoded character sets has
> introduced many difficulties: first, the base+combining characters
> model was introduced as fundamental for alphabetic scripts with
> separate letters for vowels. Then there's the case of Brahmic
> scripts, which complicates things, as Unicode has chosen to support
> both the ISCII standard model, with nuktas and viramas in logical
> encoding order, and the TIS-620 model for Thai and Lao, in visual
> order. By contrast, the combining jamos model is remarkably simple,
> and it still follows the logical model shared by alphabetic scripts.
>
> Looking now at the difficulties of encoding Tengwar reveals most of
> the difficulties that already exist for Thai, and now Hebrew, and
> the subtle artefacts needed in existing scripts used to
> transliterate foreign languages. Some of these difficulties now
> also affect the general alphabetic scripts (notably Latin), showing
> that the immutable model used to encode base letters and diacritics
> is not universal. So Unicode will need to extend and specify its
> character model much further to support more scripts and languages,
> including in the case of transliterations.
>
> Maybe in the future this will lead to defining a new level of
> conformance, something more precise than just the basic canonical
> equivalence rules (for NF* transforms and XML), with more precise
> definitions of "ill-formed" or "defective" sequences (I confess
> that I do not understand the need to differentiate the two
> concepts, and the current separation is more confusing than helpful
> for understanding the Unicode standard). What this means is that we
> need something saying "Unicode valid text" and not just "Unicode
> encoded text", which relates only to the shared assignment of code
> points to individual characters.
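
To make the "defective" vs "ill-formed" distinction concrete: as I
understand the standard, a combining mark with no base is defective
but still well-formed, and normalization leaves it untouched. A small
stdlib illustration (my own):

    import unicodedata

    # Canonically equivalent spellings converge under NFC/NFD...
    precomposed = "\u00E9"    # e-acute as a single code point
    decomposed = "e\u0301"    # e + combining acute accent
    assert unicodedata.normalize("NFC", decomposed) == precomposed
    assert unicodedata.normalize("NFD", precomposed) == decomposed

    # ...but a defective combining sequence (a mark with no base) is
    # well-formed and already normalized, and it stays defective.
    lone_mark = "\u0301"
    assert unicodedata.normalize("NFC", lone_mark) == lone_mark
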
> The current "valid" term should be left to the ISO/IEC 10646
> standard, and to the very few Unicode algorithms that handle only
> individual code points (such as the UTF* encoding forms and
> schemes), but its current definition does not help implementers and
> writers produce interoperable textual data.
>
> If the term "valid" cannot be changed, then I suggest defining
> "conforming" for encoded text independently of its validity (a
> "conforming text" would still need to use a "valid encoding").
>
> --
> Philippe.
> Spam not tolerated: any unsolicited message will be reported to
> your Internet service providers.
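
Philippe's two-level terminology maps naturally onto a layered check.
Here is a hypothetical sketch, where the level-2 rules (NFC-normalized,
no leading combining mark) are stand-ins of my own choosing rather than
anything the standard specifies:

    import unicodedata

    def is_valid_encoding(data: bytes) -> bool:
        # Level 1: the bytes are well-formed UTF-8 (no encoded
        # surrogates, no overlong forms, no truncated sequences).
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    def is_conforming_text(data: bytes) -> bool:
        # Level 2 (hypothetical rules): a valid encoding that is also
        # NFC-normalized and has no combining mark stranded at the start.
        if not is_valid_encoding(data):
            return False
        text = data.decode("utf-8")
        if unicodedata.normalize("NFC", text) != text:
            return False
        return not (text and unicodedata.category(text[0]).startswith("M"))

    print(is_valid_encoding(b"\xed\xa0\x80"))      # False: encoded surrogate
    print(is_conforming_text("e\u0301".encode()))  # False: valid but not NFC
    print(is_conforming_text("\u00E9".encode()))   # True
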

