> a. Modify the grapheme cluster boundary rules to account for
>    X CGJ NSM as a grapheme cluster.
>
> b. Change CGJ from Mn to Me.

It doesn't even need (a) to make this work. Because the committee changed
NSMs to be ignorable, X NSM* CGJ NSM* Y doesn't break. So only (b) would
need to be changed.

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Tuesday, March 05, 2002 18:00
Subject: Re: How to make "oo" with combining breve/macron over pair?

> David Hopwood said:
>
> > Kenneth Whistler wrote:
> > > Kent Karlsson's suggestion:
> > >
> > > > I vaguely suggested adding
> > > > an enclosing (in some sense) invisible combining character to
> > > > solve this: <o, CGJ, o, invisible-enclosing, combining breve>.
> > > > No character has been designated for such use, though. And I
> > > > haven't made a formal proposal yet.
> > >
> > > (i.e. create a generic way to make a non-enclosing combining mark
> > > apply to a grapheme cluster, by encoding an invisible enclosing
> > > combining mark)
> >
> > For this approach to work, <invisible-enclosing> must have combining
> > class 0, and be in Grapheme_Extend and general category Mn.
>
> Actually, it must be general category Me, since that is what indicates
> a combining *enclosing* mark.
>
> > Because it
> > involves a new character, it can't be included in the standard until
> > Unicode 3.3,
>   ^^^^^^^^^^^
>
> Aghhh! Don't even introduce that nasty concept. The UTC and the
> editorial committee are already working on the Unicode 4.0 book
> draft, and many people would be sorely tempted to quit in disgust
> if we had to produce yet another UAX for Unicode 3.3 before
> 4.0 was finished!
>
> >
> > An alternative is to use CGJ itself for <invisible-enclosing>, i.e.
> > <o, CGJ, o, CGJ, combining breve>. This works because:
> >
> > - CGJ has combining class 0, so it prevents the breve from composing
> >   with the second o.
> > - CGJ has general category Mn and is invisible, as required.
>
> It currently has general category Mn, but would have to be changed
> to Me to make this work.
>
> > - it is straightforward to modify the grapheme breaking rules to
> >   treat this as a single cluster, by adding the rule "Link Extend".
> >   (This assumes the corrections to the other rules that I described
> >   in my comments.)
>
> Actually, I am finding myself attracted to the parsimony of this
> approach. In answer to Rick's suggestion to just encode the two we
> know about and be done with it, and his concern that we are headed here
> for terminal Markupville, note the following:
>
> 1. Rendering applications already have to deal with combining
>    enclosing marks (well, at least if they choose to support them).
>    That means identifying what they enclose, and then adjusting any
>    following combining mark to apply to the enclosure. (cf. TUS 3.0,
>    p. 50). If the CGJ is just an invisible combining enclosing mark,
>    then effectively it encloses the (invisible) bounding box of
>    the preceding characters in its scope, and any following
>    combining marks are adjusted to apply to that bounding box, which
>    is the enclosure. A simple generalization without any new architectural
>    implications.
>
> 2. Applications concerned with grapheme cluster boundaries already
>    (as of Unicode 3.2, at least) have to deal with the function
>    of CGJ in creating grapheme clusters. That is, they will have
>    to cope with the modified rules in Unicode 3.2 for grapheme
>    cluster boundaries, and the new Grapheme_XXX properties that
>    take the CGJ into account.
>
> So no new characters and no new architectural implications. Simply
> two minor tweaks:
>
> a. Modify the grapheme cluster boundary rules to account for
>    X CGJ NSM as a grapheme cluster.
>
> b. Change CGJ from Mn to Me.
>
> That appears to be it, and in principle it should solve the
> missing double (or treble) diacritic representation problem permanently.
> On the downside, it might be awhile before rendering engines
> and font definitions really catch up to it. That is, the whole
> notion of "adjusting" a diacritic to apply to an enclosure is
> fairly sophisticated, since it may involve context-dependent
> rules and arbitrary shape modifications -- not merely moving
> a glyph origin point based on a preceding glyph's metrics.
>
> On the other hand, hacked up fonts for limited dictionary
> usage could be rather quick and easy. For the old Webster's
> pronunciation guides, the entities are really the oomacr
> and oobreve shown in the examples that started this thread.
> Simply preform those entities as glyphs in a font, and map them
> to <o, CGJ, o, CGJ, combining_macron> and
> to <o, CGJ, o, CGJ, combining_breve> respectively. Presto,
> you have a Unicode representation for the text, and a
> reliable font rendering for them, without any fancy-dancing
> about dynamic positional adjustments. The fallback rendering,
> in applications and fonts not wise to the CGJ rules would
> be {o o-macron} and {o o-breve}, which while not exact,
> is at least comprehensible and close enough for gummint work.
>
> I think this might be the way to go, but it is too late to
> sneak into Unicode 3.2, as any such changes clearly would
> require UTC debate and agreement. But it is simple enough that
> it might be accomplished fairly quickly after Unicode 3.2.
>
> --Ken
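
For anyone who wants to poke at the sequences discussed above, here is a rough Python sketch using only the standard unicodedata module. It builds <o, CGJ, o, CGJ, combining breve> and the macron variant, looks up the general category and combining class of CGJ, and checks that NFC leaves the breve uncomposed. The helper names (properties, survives_nfc, fallback_text) are made up for illustration, and fallback_text is only an assumption about how to mimic the {o o-breve} fallback as text; it is not how a rendering engine or font would actually do it.

```python
import unicodedata

CGJ = "\u034f"     # U+034F COMBINING GRAPHEME JOINER
BREVE = "\u0306"   # U+0306 COMBINING BREVE
MACRON = "\u0304"  # U+0304 COMBINING MACRON

# The two sequences proposed in the thread for the dictionary "oo" symbols.
oo_breve = "o" + CGJ + "o" + CGJ + BREVE
oo_macron = "o" + CGJ + "o" + CGJ + MACRON


def properties(ch):
    """Return (general category, canonical combining class) for one character."""
    return unicodedata.category(ch), unicodedata.combining(ch)


def survives_nfc(seq):
    """True if NFC normalization leaves the sequence unchanged."""
    return unicodedata.normalize("NFC", seq) == seq


def fallback_text(seq):
    """Hypothetical fallback: drop every CGJ, then compose what is left.

    This only imitates the {o o-breve} fallback described in the thread;
    a real renderer would not rewrite the text like this.
    """
    return unicodedata.normalize("NFC", seq.replace(CGJ, ""))


if __name__ == "__main__":
    # Category and combining class as recorded in Python's Unicode data.
    print("CGJ properties:", properties(CGJ))
    # With CGJ before the mark, the breve has no base to compose with.
    print("oo_breve unchanged by NFC:", survives_nfc(oo_breve))
    print("plain o + breve unchanged by NFC:", survives_nfc("o" + BREVE))
    # Code points of the fallback text for each sequence.
    for seq in (oo_breve, oo_macron):
        print([f"U+{ord(c):04X}" for c in fallback_text(seq)])
```

The category printed for CGJ comes from whatever Unicode data ships with the Python build, so it shows what the standard records today, not what proposal (b) would change it to.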

