On Tue, 20 Jan 2004 10:32:06 -0700, John Jenkins wrote:
>
> 1) U+9CE6 is a traditional Chinese character (a kind of swallow)
> without a SC counterpart encoded. However, applying the usual rules
> for simplifications, it would be easy to derive a simplified form which
> one could conceivably see in a book printed in the PRC. Rather than
> encode the simplified form, the UTC would prefer to represent the SC
> form using U+9CE6 + a variation selector.
>
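For anyone unfamiliar with the mechanism being proposed: in plain text a variation sequence is simply the base character followed by one of the 256 variation selector code points (U+FE00..U+FE0F and U+E0100..U+E01EF). The Python sketch below is only an illustration. The helper names are hypothetical, and the pairing of U+9CE6 with VS17 is an assumption made for the example, not a sequence I know to be defined anywhere.

```python
# The 256 variation selectors occupy two ranges:
# VS1..VS16 at U+FE00..U+FE0F and VS17..VS256 at U+E0100..U+E01EF.
VARIATION_SELECTORS = (
    [chr(cp) for cp in range(0xFE00, 0xFE10)]
    + [chr(cp) for cp in range(0xE0100, 0xE01F0)]
)
_VS_SET = frozenset(VARIATION_SELECTORS)


def make_variation_sequence(base: str, vs_number: int) -> str:
    """Return `base` followed by variation selector number `vs_number` (1..256).

    Hypothetical helper: it does not check whether the resulting
    sequence is actually defined by Unicode.
    """
    if len(base) != 1:
        raise ValueError("base must be a single code point")
    if not 1 <= vs_number <= len(VARIATION_SELECTORS):
        raise ValueError("vs_number must be between 1 and 256")
    return base + VARIATION_SELECTORS[vs_number - 1]


def strip_variation_selectors(text: str) -> str:
    """Drop every variation selector, leaving only the base characters."""
    return "".join(ch for ch in text if ch not in _VS_SET)


# Assumed example: U+9CE6 followed by VS17 (U+E0100).
swallow_sc = make_variation_sequence("\u9ce6", 17)
print([f"U+{ord(ch):04X}" for ch in swallow_sc])
```

A process that ignores variation selectors sees only U+9CE6, which is what strip_variation_selectors() imitates.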
If a simplified form of a given CJK ideograph is used, then it deserves encoding properly. There are newly-coined simplified forms in CJK-B and CJK-C, so why not add newly used simplified forms to CJK-C or wherever if they are really needed? To borrow Michael's term, this use of variation selectors is simply pseudo-coding.

If a Chinese publishing house were going to print a book in simplified characters that included a simplified form of U+9CE6, would they go to the lengths of applying to Unicode to define an appropriate standardised variant for U+9CE6, and then trying to create a font that implemented variation selectors? Or would they simply use a font that mapped a simplified glyph form to U+9CE6 (or the PUA)?

If it is so important to formally define the existence of a simplified form of an existing character, then why not encode it properly?

> 2) Your best friend has the last name of "turtle," but he doesn't use
> any of the encoded forms for the turtle character to represent it. He
> insists on writing it in yet another way and wants to be able to
> include his name as he writes it in the source code he edits. The UTC
> ends up accommodating him using U+2A6C9 (which is the closest turtle to
> his last name) + a variation selector.

1. Unicode Design Principle 3: "The Unicode Standard encodes characters, not glyphs."

This is a simple glyph variant. I insist on writing the "A" in my name with two cross-bars. Will the UTC kindly accommodate me by providing an appropriate standardised variant for U+0041? (In fact, come to think of it, I have idiosyncratic ways of writing all of the letters in my name ...)

The plain fact of the matter is that the *character* turtle is already encoded, and if someone wants to use a different glyph form for this character then he or she should design their own font with the appropriate glyph mapped to U+9F9C or U+9F9F.

2. Unicode does not encode private-use characters.

I can't find chapter and verse for it, but I was always under the impression that Unicode did not encode private-use characters.

> 3) You're editing a critical edition of an ancient MS, and you find
> that your author, who talks a lot about handkerchiefs, uses U+5E28
> quite a bit, but varies between the "ears-in" form and the "ears-out"
> form almost at random. Rather than lose the distinction which *may* be
> meaningful, you (with the UTC's blessing) use U+5E28 for the ears-in
> form (as Unicode uses) and U+5E28 + a variation selector for the
> ears-out form.

This example actually opens up the biggest can of worms.

As someone who has a passion for transcribing ancient manuscripts, in Chinese and other scripts, I fully appreciate the desire to be able to represent every little idiosyncrasy of a manuscript or inscription in plain text Unicode. But the simple fact of the matter is that you can't. My apologies for repeating myself, but Unicode Design Principle 3 states that "The Unicode Standard encodes characters, not glyphs" (and Section 2.2 of TUS elaborates on this statement).

Unless Unicode becomes a Glyph Encoding Standard instead of a Character Encoding Standard, how on earth can the UTC allow VSs to be used for simple glyph variants? And if it's OK for CJK ideographs, then why not for every other Unicoded script?

Glyph variations are of paramount interest to textual scholars and epigraphers of all scripts, not just Chinese. To take a random example from the Celtic Inscribed Stones Project (CISP), this is a palaeographic description of a cross slab at Kirk Maughold in the Isle of Man, inscribed [--]I IN CHRISTI NOMINE CRUCIS CHRISTI IMAGENEM:

Kermode/1907, 112: `we have here the diamond-shaped O, the N like an H, and the M like a double H, all characteristics of the Hiberno-Saxon manuscripts and sculptured stones of the period. Other characteristic forms are the square-shaped C and the peculiar G, the like of which I have not seen elsewhere.
But some of the letters are minuscules, as p, d, b, r, and a; while in the contraction for CHRISTI, in each case the R differs from the ordinary small R in CRUCIS, representing, in fact, the Greek Rho!'.
[http://www.ucl.ac.uk/archaeology/cisp/database/stone/maugh_4.html]

If we go down the road of encoding epigraphic and palaeographic glyph variants for CJK and other scripts, I'm afraid that we'll soon find that 256 Variation Selectors just aren't enough.

Andrew

