Peter countered:

> > > Could this finally be the missing "killer ap" for the CGJ?
> >
> > It will be perfect to allow an application like XML to encode Hebrew
> > text using Unicode 4.0 rules (and before).
>
> It is not perfect. CGJ is supposed to be significant (and kept in the
> text) for a variety of processes, such as searching and sorting. To use
> this for Biblical Hebrew, though, it should be ignored in such processes.

Why? The point is that:

   <patah, CGJ, hiriq> is one thing, and
   <hiriq, CGJ, patah> is another.

You *want* those sequences to be distinct, right? Even if the text
has been normalized, right? That was the whole problem with:

   <patah, hiriq>
   <hiriq, patah>

which are canonically equivalent, since they both normalize to:

   <hiriq, patah>

So the CGJ *is* significant for searching (and sorting). If you want
one sequence, you search for <patah, CGJ, hiriq>, if you want the
other, you search for <hiriq, CGJ, patah>. If you don't care, and
want to find either, *then* you strip out the CGJ and normalize
before comparison.

This, by the way, is completely in keeping with the intended treatment
of CGJ in other instances and falls out automatically from the
definition of the UCA for collation. CGJ defaults to null weights in
the UCA. You tailor combinations of characters in contractions with it
to get special weights for sequences like <c, CGJ, h> if they have to
contrast with <c, h>. But for Biblical Hebrew, you don't even have to
do that, because to get the contrast between <patah, CGJ, hiriq> and
<hiriq, CGJ, patah>, you simply have to have the weights for patah and
for hiriq and then block the reordering. Voilà, it just works.

Of course, you are going to have to tailor for Biblical Hebrew, anyway,
since the points for Hebrew default to ignorable, so if you want to
search and sort on them, you have to give them significant weight
differences to start with.

For a direct search on the binary string, you also don't have to do
anything to get the appropriate distinction between the two
representations.

I thought this was the goal all along: we just have the two vowels,
one after the other, and they should stay put, not reordering.

> It's another hack.

And cloning 14 Hebrew vowels and diacritic marks to give them new
combining classes is not? It seems to me that the suggestion of this
use of the CGJ is much more in keeping with its narrowed semantic as
defined by the UTC.

Remember, we used to think the CGJ was a "grapheme cluster constructor"
and could be used to build targets for enclosing combining marks. For a
variety of reasons we gave up on that. The text convention I am
suggesting for Biblical Hebrew is much less of a stretch for CGJ than
trying to make it serve as a "grapheme cluster constructor" was.
Essentially it is a no-op.

Given CGJ's current definition and set of properties, a CGJ introduced
into the particular vowel contexts you are concerned about should
result in all the effects you are asking for. It might take a while for
Uniscribe and other implementations to catch up to that actual
behavior, but as I read the standard, that is how they *should* behave.

--Ken
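
For anyone who wants to check the normalization argument above, here is a minimal sketch using Python's standard unicodedata module. The base letter lamed is only an illustrative carrier for the two points. The weights in TOY_WEIGHTS and the helpers loose_key and toy_sort_key are hypothetical: they stand in for a real Biblical Hebrew tailoring and are not taken from the UCA default table.

```python
import unicodedata

LAMED = "\u05DC"  # HEBREW LETTER LAMED, base letter (combining class 0)
PATAH = "\u05B7"  # HEBREW POINT PATAH (combining class 17)
HIRIQ = "\u05B4"  # HEBREW POINT HIRIQ (combining class 14)
CGJ = "\u034F"    # COMBINING GRAPHEME JOINER (combining class 0)


def nfc(text):
    """Normalize to NFC, which applies canonical reordering of marks."""
    return unicodedata.normalize("NFC", text)


def loose_key(text):
    """Key for 'either vowel order' matching: strip CGJ, then normalize."""
    return nfc(text.replace(CGJ, ""))


# Hypothetical tailored weights; CGJ deliberately has none (ignorable).
TOY_WEIGHTS = {LAMED: 100, HIRIQ: 201, PATAH: 202}


def toy_sort_key(text):
    """Toy collation key: weights in normalized order, skipping ignorables."""
    return tuple(TOY_WEIGHTS[ch] for ch in nfc(text) if ch in TOY_WEIGHTS)


plain_ph = LAMED + PATAH + HIRIQ        # <patah, hiriq>
plain_hp = LAMED + HIRIQ + PATAH        # <hiriq, patah>
cgj_ph = LAMED + PATAH + CGJ + HIRIQ    # <patah, CGJ, hiriq>
cgj_hp = LAMED + HIRIQ + CGJ + PATAH    # <hiriq, CGJ, patah>

# Without CGJ the two orders collapse under normalization.
assert nfc(plain_ph) == nfc(plain_hp) == plain_hp

# With CGJ (class 0) between the points, reordering is blocked.
assert nfc(cgj_ph) == cgj_ph
assert nfc(cgj_hp) == cgj_hp
assert nfc(cgj_ph) != nfc(cgj_hp)

# A binary search distinguishes the two; the loose key matches either.
assert loose_key(cgj_ph) == loose_key(cgj_hp)

# With weights on the points and none on CGJ, the orders sort differently.
assert toy_sort_key(cgj_ph) != toy_sort_key(cgj_hp)
```

The toy key only shows why blocking the reordering is enough once the points carry weights; a real implementation would go through a proper UCA tailoring rather than a flat weight table.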

