On 28/10/2003 04:49, Kent Karlsson wrote:

Philippe Verdy wrote:


There's a counter example with the position of the circumflex on the
lowercase t (I can't remember for which language it occurs, sorry), which is
in some cases not the one that its combining class would normally take.



There are also the cases of comma below a small g (Latvian),
which is rendered turned above the g, and of ring below g (IPA),
which should be rendered above the g... Neither of these invalidates,
or calls into question, the combining classes of comma below (and
cedilla...) or ring below, as far as I can see.


Also, in the commonly used Hebrew *transliteration*, the same function (fricative pronunciation) is indicated by a macron above g and p but below b, d, k and t, for the same reason. It occurs only with these letters (sometimes also written below h). There might be an argument for encoding these as g and p plus combining line below rather than g and p plus combining macron, especially as the line would probably be moved below if these letters were ever capitalised. But there would then need to be a clear rule that such combining marks are moved from below to above g and p in rendering.

So far, it has been noticed that some Hebrew and Arabic marks,
mostly the vowel marks, ...

For Hebrew also dagesh, rafe, sin and shin dots, and meteg; and for Arabic, shadda. Basically anything with "unique" combining classes, a concept which seems to have been removed from the text, but not removed from the database as it should have been.

... have inappropriate combining classes.
The solution suggested by the UTC is to use CGJ. But it also has
to be simple and practicable. Putting a CGJ after each occurrence
of the characters with badly assigned combining class effectively
gives them a combining class of 0. Perhaps not ideal, and indeed
a kludge. But simple and practical. A keyboard layout, for instance,
can just generate a CGJ after each troublesome Arabic and Hebrew
mark. With current keyboard layout specification mechanisms,
that's about the best that can be done on the keyboard side of it.
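The effect described above can be checked with a minimal sketch using Python's standard unicodedata module. The choice of bet, dagesh and patah as the test sequence is mine, not from the thread: dagesh (combining class 21) typed before patah (combining class 17) is reordered by normalisation, while a CGJ (combining class 0) between them blocks the reordering.

```python
import unicodedata

BET = "\u05D1"     # HEBREW LETTER BET (base character, ccc 0)
DAGESH = "\u05BC"  # HEBREW POINT DAGESH OR MAPIQ (ccc 21)
PATAH = "\u05B7"   # HEBREW POINT PATAH (ccc 17)
CGJ = "\u034F"     # COMBINING GRAPHEME JOINER (ccc 0)

def describe(s):
    """List each code point with its canonical combining class."""
    return " ".join(f"U+{ord(c):04X}(ccc={unicodedata.combining(c)})" for c in s)

typed = BET + DAGESH + PATAH            # order as typed
protected = BET + DAGESH + CGJ + PATAH  # same, with CGJ after the dagesh

# Canonical ordering sorts adjacent marks by combining class, so the
# unprotected sequence comes out as bet, patah, dagesh.
normalized_typed = unicodedata.normalize("NFC", typed)

# CGJ has combining class 0, so it splits the run of marks and
# neither mark can move past it.
normalized_protected = unicodedata.normalize("NFC", protected)

print(describe(normalized_typed))
print(describe(normalized_protected))
```

The cost is one extra code point per protected mark, which is the storage overhead the thread goes on to discuss.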


That depends on the mechanism. With a mechanism such as Keyman from www.tavultesoft.com, it is possible to define that, for example, key A generates <CGJ, patah> if the previous key press generated a dagesh or a sin or shin dot, but just patah if the previous key press generated just a base character. Such a mechanism can stop superfluous CGJs being generated in continuously typed text, but it cannot cope properly with editing, as it has no access to the text already entered, only to what the keyboard itself has previously generated. More comprehensive mechanisms can be defined, but they require the keyboard to have access to the backing store.
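As a rough illustration of that context rule, here is a sketch in Python. It is not Keyman's rule syntax or API; the class KeyboardSketch and its press method are hypothetical names invented for this example. The only state it keeps is its own previous output, which is exactly the limitation noted above.

```python
CGJ = "\u034F"       # COMBINING GRAPHEME JOINER
BET = "\u05D1"       # HEBREW LETTER BET
PATAH = "\u05B7"     # HEBREW POINT PATAH
DAGESH = "\u05BC"    # HEBREW POINT DAGESH OR MAPIQ
SHIN_DOT = "\u05C1"  # HEBREW POINT SHIN DOT
SIN_DOT = "\u05C2"   # HEBREW POINT SIN DOT

# Marks after which a following patah needs a CGJ to keep its typed position.
NEEDS_CGJ_AFTER = {DAGESH, SHIN_DOT, SIN_DOT}

class KeyboardSketch:
    """Hypothetical layout that remembers only what it last generated."""

    def __init__(self):
        self.last_output = ""

    def press(self, char):
        """Return the code points generated for one key press."""
        if char == PATAH and self.last_output[-1:] in NEEDS_CGJ_AFTER:
            out = CGJ + char   # previous key produced dagesh or sin/shin dot
        else:
            out = char         # previous key produced a base character, etc.
        self.last_output = out
        return out

kb = KeyboardSketch()
with_dagesh = "".join(kb.press(c) for c in (BET, DAGESH, PATAH))

kb = KeyboardSketch()
without_dagesh = "".join(kb.press(c) for c in (BET, PATAH))
```

If the user moves the cursor and types patah after an existing dagesh, this sketch sees only its own last output and emits a bare patah; handling that case would need access to the backing store.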

Removing superfluous CGJs should be done by a separate utility.
Trying to build that into normalisation is probably not such a good
idea.


Understood. Could it perhaps be defined in Unicode as an additional pre-normalisation step which is recommended but not required?
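One way such a separate utility or pre-normalisation step might work is sketched below in Python. The function name strip_superfluous_cgj and the definition of "superfluous" are assumptions of mine, not anything specified by Unicode: a CGJ is dropped only if dropping it leaves the canonically ordered sequence of all other characters unchanged. The sketch looks only at mark ordering and ignores any other purpose a CGJ might serve, such as blocking composition or marking a collation distinction.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER

def _order_without_cgj(s):
    """Canonically ordered (NFD) form of s with every CGJ dropped."""
    return unicodedata.normalize("NFD", s).replace(CGJ, "")

def strip_superfluous_cgj(text):
    """Remove each CGJ whose removal does not change normalised mark order.

    Assumption: 'superfluous' means only that the CGJ is not needed to
    stop canonical reordering. Quadratic in the worst case; a sketch only.
    """
    i = 0
    while i < len(text):
        if text[i] == CGJ:
            candidate = text[:i] + text[i + 1:]
            if _order_without_cgj(candidate) == _order_without_cgj(text):
                text = candidate
                continue  # re-examine whatever now sits at index i
        i += 1
    return text
```

For example, a CGJ between a base letter and patah is removed, while a CGJ between dagesh and a following patah is kept, because without it normalisation would move the patah in front of the dagesh.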

It would of course be trivial to specify that CGJ or CCO is ignored in collation. In fact I think CGJ already is. This implies that superfluous CGJs do not affect searching, sorting and spell checking. As long as fonts also ignore them (except in special "show all characters" modes), the main detrimental effect will be to waste a lot of storage space.

Defining new characters to replace the troublesome ones, a more elegant solution, has been rejected by the UTC. On compatibility
grounds, IIRC.


/kent k


Was this actually considered and rejected by the UTC? I understood that the proposal for Hebrew (http://scripts.sil.org/cms/sites/nrsi/media/BibHebAltCharsProposal.pdf) had simply not been proceeded with, on the basis of widespread opposition expressed on this list and the general acceptance (including by the UTC, http://www.unicode.org/consortium/utc-minutes/UTC-096-200308.html items 96-C20 and 96-A72) of the CGJ alternative. I am not trying to resurrect the proposal, which I oppose, but there are people who are still concerned that it might reappear and be pushed through the UTC by the consortium members who support it, without adequate reference back to the objectors who are not represented on the UTC. So it would be good news if the UTC had actually rejected it.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




