Peter Kirk followed up:

> On 07/08/2003 07:27, Philippe Verdy wrote:
>
> >On Thursday, August 07, 2003 2:40 AM, Doug Ewell <[EMAIL PROTECTED]> wrote:
> >
> >>Kenneth Whistler <kenw at sybase dot com> wrote:
> >>
> >>>But I challenge you to find anything in the standard that
> >>>*prohibits* such sequences from occurring.
> >>>
> >>I've learned that this question of "illegal" or "invalid" character
> >>sequences is one of the main distinguishing factors between those who
> >>truly understand Unicode and those who are still on the Road to
> >>Enlightenment.
> >>
> >>...
> >>
> >If the term "valid" cannot be changed, then I suggest defining
> >"conforming" for encoded text independantly of its validity (a
> >"conforming text" would still need to use a "valid encoding").
> >
> As a very quick thought, maybe what we need is not restrictions to the
> Unicode standard but a set of rules for each language or group of
> languages, defining exactly how Unicode characters should be used to
> write the words etc of that language. Such definitions might be
> independent of the actual Unicode standard.

I emphatically agree with Peter on this. The impulse to get the Unicode Standard to head down the road to becoming the "spelling standard" for all languages of the world has to be constrained, simply because there is not the expertise or the bandwidth in the UTC to accomplish this and because it isn't the business of the UTC in the first place. This is the kind of task which *must* be distributed to the relevant stakeholders around the world, wherever they may be and however their relevant jurisdictions are defined and constituted.

The establishment of orthographic rules for a particular language in the context of the Unicode Standard means transferring the notion of what the printed conventions for that language are -- whatever they may be -- into a determination of exactly which Unicode characters are to be used to represent those conventions, including any constraints on co-occurrence with particular format control characters, and so on.

The scope of the task of defining rendering rules in the Unicode Standard is generic to script behavior -- establishing the general rules of the road, as it were, for how the scripts behave in the encoding, so that people and implementations have a determinate sense of what order characters should be in, what it means for combining characters to "combine" with base characters, how format control characters may impact script rendering generically, and so on.

But beyond that, one is getting into the realm of orthographic rules for particular languages or jurisdictions and the realm of typographic conventions for particular styles and regions. Making those determinations belongs to the stakeholders themselves: ministries, academies, associations, type designers, whoever.

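
To make the split between the two layers concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the "at most one combining mark per base character" rule and the function names are hypothetical, invented for this example, and are not taken from the Unicode Standard or from any real orthography. The first check stands in for the generic, encoding-level layer; the second stands in for a rule set that some external stakeholder might maintain.

```python
import unicodedata


def is_encodable(text: str) -> bool:
    """Generic, encoding-level check: can the text be serialized as UTF-8?

    In Python this fails only for lone surrogate code points; it says
    nothing about whether the sequence makes sense in any language.
    """
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def follows_example_orthography(text: str, max_marks: int = 1) -> bool:
    """Hypothetical language-level rule, maintained outside the standard.

    Assumption for illustration only: this imaginary orthography allows
    at most `max_marks` combining marks in a row.
    """
    run = 0
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > max_marks:
                return False
        else:
            run = 0
    return True


# 'a' followed by three combining acute accents: unusual, but not
# prohibited at the encoding level.
odd = "a\u0301\u0301\u0301"

print(is_encodable(odd))
print(follows_example_orthography(odd))
```

The same string passes the generic check and fails the language-specific one, which is the division of labor described above: the second function could be replaced or extended by whoever owns the conventions for a given language, without any change to the first.
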
It is precisely because the developers of the Unicode Standard cannot foresee all possible orthographic conventions and uses to which the standard may be put in representing text that it is deliberately permissive: essentially any sequence of characters is "legal", and it is up to the users of the standard to determine, for them, what is a *sensible* sequence of characters for their multitudinous purposes.

--Ken

