We need an official Unicode Lint.

Jony
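
No official tool exists, but as a rough illustration, here is a minimal
sketch in Python (stdlib only) of the kind of checks such a lint could
run. The rule set below (lone surrogates, noncharacters, and combining
marks with no base character) is my own guess at a useful starting
point, not anything the standard currently defines:

    import unicodedata

    def unicode_lint(text):
        """Yield (index, message) warnings for suspect code point usage.

        The rules are illustrative only, not an official definition.
        """
        prev_is_base = False
        for i, ch in enumerate(text):
            cp = ord(ch)
            if 0xD800 <= cp <= 0xDFFF:
                yield i, "lone surrogate U+%04X" % cp
            elif (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF:
                yield i, "noncharacter U+%04X" % cp
            elif unicodedata.category(ch).startswith("M") and not prev_is_base:
                # A combining mark with nothing to combine with: legal,
                # but a "defective" sequence that rarely renders as
                # intended (compare the line-initial holam in the subject).
                yield i, "combining mark U+%04X with no base" % cp
            # Separators and most control/format characters do not act
            # as a base for a following combining mark.
            prev_is_base = unicodedata.category(ch)[0] not in ("Z", "C")

    for warning in unicode_lint("\u0301a \uFDD0"):
        print(warning)
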
> -----Original Message-----
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Philippe Verdy
> Sent: Thursday, August 07, 2003 4:28 PM
> To: [EMAIL PROTECTED]
> Subject: SPAM: Re: Questions on ZWNBS - for line initial
> holam plus alef
>
> On Thursday, August 07, 2003 2:40 AM, Doug Ewell
> <[EMAIL PROTECTED]> wrote:
>
> > Kenneth Whistler <kenw at sybase dot com> wrote:
> >
> > > But I challenge you to find anything in the standard that
> > > *prohibits* such sequences from occurring.
> >
> > I've learned that this question of "illegal" or "invalid" character
> > sequences is one of the main distinguishing factors between those
> > who truly understand Unicode and those who are still on the Road
> > to Enlightenment.
> >
> > Very, very few sequences of Unicode characters are truly "invalid"
> > or "illegal." Unpaired surrogates are a rare exception.
> >
> > In almost all cases, a given sequence might give unexpected results
> > (e.g. putting a combining diacritic before the base character) or
> > might be ineffectual (e.g. putting a variation selector before an
> > arbitrary character), but it is still perfectly legal to encode and
> > exchange such a sequence.
>
> For Unicode itself this is true, but what users want is
> interoperability of the encoded text with accurate rendering rules.
> In practice, this means that any undefined or unpredictable behavior
> will mean a lack of interoperability and should not be used.
>
> The standard should therefore strongly promote what is a /valid/
> encoding for text with regard to interoperability for all text
> processing algorithms, including parsing combining sequences,
> collation, and computing character properties from those /valid/
> encoded sequences.
>
> We don't have to care much if some encoded text considered valid
> under Unicode/ISO/IEC 10646 is rendered or processed differently or
> unpredictably, provided that this does not affect common text for
> actual languages.
>
> In fact, the standard specifies that ALL sequences made of code
> points in U+0000 to U+10FFFF (excluding the noncharacters U+xFFFE
> and U+xFFFF, and the surrogates U+D800 to U+DFFF) are valid under
> ISO/IEC 10646, but it does not attempt to assign properties or
> behavior to ALL of these characters or encoded sequences, as it is
> the job of Unicode to specify this behavior.
>
> If there's something to enhance in the Unicode standard (not in
> ISO/IEC 10646), it's exactly the specification of interoperable
> encoded sequences. This certainly means that concrete examples for
> actual languages must be documented. Just assigning properties to
> individual ISO/IEC 10646 characters is not enough, and Unicode
> should concentrate more effort on the actual encoding of text, not
> only on individual characters.
>
> So for me, the "validity" of text is an ISO/IEC 10646 concept (now
> shared with Unicode versions for the assignment of characters in
> the repertoire), related only to the legally usable code points,
> while Unicode speaks about "well-formed" or "ill-formed" sequences,
> or about "normalized" sequences and transformations that preserve
> the actual text semantics.
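
Doug's distinction above is easy to reproduce. In this short Python
demonstration (my own example, not from the thread), the UTF-8 encoder
rejects an unpaired surrogate outright, while a diacritic placed before
its base, or a stray variation selector, encodes without complaint:

    # A combining acute accent *before* its base letter: gives
    # unexpected rendering, but is perfectly encodable and exchangeable.
    print("\u0301a".encode("utf-8"))    # b'\xcc\x81a', no error

    # A variation selector after an arbitrary character: ineffectual,
    # but again legal to encode.
    print("a\uFE00".encode("utf-8"))    # b'a\xef\xb8\x80'

    # An unpaired surrogate is one of the few truly ill-formed cases.
    try:
        "\uD800".encode("utf-8")
    except UnicodeEncodeError as err:
        print("rejected:", err.reason)  # "surrogates not allowed"
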
> There is no ambiguity in ISO/IEC 10646 for the character
> assignments. But composed sequences are the real problem, for which
> Unicode must seek agreements: the W3C character model is based only
> on simplified combining sequences, but Unicode should go further
> with much more precise rules for the encoding of actual text, even
> before any attempt to describe other transformation algorithms
> (only the NF* transformations have a stability policy for now, but
> text authors also need stability for the text composition rules of
> actual languages).
>
> We certainly don't need more assigned code points for existing
> scripts, but rather more rules for the actual representation of
> text using these scripts, and for how distinct scripts can interact
> and be mixed. There are already some rules specified for combining
> jamos, for combining Latin/Cyrillic/Greek alphabets, or for
> Hiragana/Katakana, but we are still far from an agreement for
> Hebrew, and even for some Han composed sequences, which still lack
> a specification needed for interoperability.
>
> The current wording of "Unicode validity" is for me very weak, and
> probably defective. What it designates is only ISO/IEC 10646
> validity for the code points used, and the validity of their UTF*
> transformations, based on individual code points. The kind of
> validity rules users want from Unicode is conformance of the
> actually encoded scripts for actual languages, for interoperability
> and data exchange.
>
> The fact that Unicode was born by trying to maximize round-trip
> convertibility with legacy codepages and encoded character sets has
> introduced many difficulties: first, the base+combining characters
> model was introduced as fundamental for alphabetic scripts with
> separate letters for vowels. Then there's the case of Brahmic
> scripts, which complicates things, as Unicode has chosen to support
> both the ISCII standard model, with nuktas and viramas in logical
> encoding order, and the TIS-620 model for Thai and Lao, in visual
> order. By contrast, the combining jamos model is remarkably simple,
> and it still follows the logical model shared by alphabetic scripts.
>
> Looking now at the difficulties of encoding Tengwar reveals most of
> the difficulties that already exist for Thai, and now Hebrew, and
> the subtle artefacts needed in existing scripts used to
> transliterate foreign languages. Some of these difficulties now
> also affect the general alphabetic scripts (notably Latin), showing
> that the immutable model used to encode base letters and diacritics
> is not universal. So Unicode will need to extend and specify its
> character model much further to support more scripts and languages,
> including in the case of transliterations.
>
> Maybe in the future this will lead to defining a new level of
> conformance, something more precise than just the basic canonical
> equivalence rules (for NF* transforms and XML), with more precise
> definitions of "ill-formed" or "defective" sequences (I confess
> that I do not understand the need to differentiate the two
> concepts, and the current separation is more confusing than helpful
> for understanding the Unicode standard). What this means is that we
> need something saying "Unicode valid text" and not just "Unicode
> encoded text", which relates only to the shared assignment of code
> points to individual characters.
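
To make the "defective" vs "ill-formed" distinction concrete: as I
understand the standard, a combining mark with no base is defective
but still well-formed, and normalization leaves it untouched. A small
stdlib illustration (my own):

    import unicodedata

    # Canonically equivalent spellings converge under NFC/NFD...
    precomposed = "\u00E9"    # e-acute as a single code point
    decomposed = "e\u0301"    # e + combining acute accent
    assert unicodedata.normalize("NFC", decomposed) == precomposed
    assert unicodedata.normalize("NFD", precomposed) == decomposed

    # ...but a defective combining sequence (a mark with no base) is
    # well-formed and already normalized, and it stays defective.
    lone_mark = "\u0301"
    assert unicodedata.normalize("NFC", lone_mark) == lone_mark
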
> The current "valid" term should be left to the ISO/IEC 10646
> standard, and to the very few Unicode algorithms that handle only
> individual code points (such as the UTF* encoding forms and
> schemes), but its current definition does not help implementers and
> writers produce interoperable textual data.
>
> If the term "valid" cannot be changed, then I suggest defining
> "conforming" for encoded text independently of its validity (a
> "conforming text" would still need to use a "valid encoding").
>
> --
> Philippe.
> Spam not tolerated: any unsolicited message will be reported to
> your Internet service providers.
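
Philippe's two-level terminology maps naturally onto a layered check.
Here is a hypothetical sketch, where the level-2 rules (NFC-normalized,
no leading combining mark) are stand-ins of my own choosing rather than
anything the standard specifies:

    import unicodedata

    def is_valid_encoding(data: bytes) -> bool:
        # Level 1: the bytes are well-formed UTF-8 (no encoded
        # surrogates, no overlong forms, no truncated sequences).
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False

    def is_conforming_text(data: bytes) -> bool:
        # Level 2 (hypothetical rules): a valid encoding that is also
        # NFC-normalized and has no combining mark stranded at the start.
        if not is_valid_encoding(data):
            return False
        text = data.decode("utf-8")
        if unicodedata.normalize("NFC", text) != text:
            return False
        return not (text and unicodedata.category(text[0]).startswith("M"))

    print(is_valid_encoding(b"\xed\xa0\x80"))      # False: encoded surrogate
    print(is_conforming_text("e\u0301".encode()))  # False: valid but not NFC
    print(is_conforming_text("\u00E9".encode()))   # True
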

