Ken, I am trying to get a grasp on the problem. Thanks for your explanations. If you continue typing slowly enough, perhaps it will eventually get through.
>>And the fact that you and others arguing for the canonical ordering
>>change don't seem to recognize the distinction is part of the reason
>>why we appear to be talking past each other.

I agree. But the implications of keeping the current canonical order are also staggering. It seems there must be extra rules* for biblical Hebrew which will have to be written into every keyboard, search engine, and conversion table. For example, if someone wants to search for "laim", the keyboard will have to insert a character such as CGJ between the two vowels before searching the normalized data. If the keyboard doesn't know about the required CGJ, then the search engine must insert it before searching. The search engine returns the results with the CGJ and the font used to display must know how to handle it. Also Uniscribe must know. Ultimately, it seems that every process will have to recognize and maintain only normalized data. Or am I off-base?

And how will the keyboard know when to insert a CGJ? The user is not supposed to know about it. So will we program the keyboard to recognize all forms of "Yerushalaim"? Or perhaps we will just always insert CGJ between any two vowels? To me, the problem is expanding exponentially.

> There are many other examples of problems with the current
> canonical order.

>>Many other examples that aren't merely more examples of the generic
>>issue which can be addressed by CGJ insertion?
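The "always insert CGJ between any two vowels" idea can be sketched in Python with the standard `unicodedata` module. This is only an illustration under stated assumptions: the "laim" sequence is taken to be LAMED + PATAH + HIRIQ + FINAL MEM, and the helper names (`is_vowel_point`, `insert_cgj_between_vowels`, `prepare`) and the vowel range U+05B0..U+05BB are hypothetical choices, not an agreed rule.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER (canonical combining class 0)

def is_vowel_point(ch):
    # Assumption: treat U+05B0 (sheva) through U+05BB (qubuts) as "vowels".
    return "\u05B0" <= ch <= "\u05BB"

def insert_cgj_between_vowels(text):
    """Hypothetical keyboard/search rule: CGJ between adjacent vowel points."""
    out = []
    for ch in text:
        if out and is_vowel_point(ch) and is_vowel_point(out[-1]):
            out.append(CGJ)
        out.append(ch)
    return "".join(out)

def prepare(text):
    # Every process (keyboard, converter, search) applies the same two steps.
    return unicodedata.normalize("NFC", insert_cgj_between_vowels(text))

LAMED, PATAH, HIRIQ, FINAL_MEM = "\u05DC", "\u05B7", "\u05B4", "\u05DD"
typed = LAMED + PATAH + HIRIQ + FINAL_MEM  # assumed spelling of "laim"

plain = unicodedata.normalize("NFC", typed)  # vowels get reordered
protected = prepare(typed)                   # CGJ keeps the typed order

print([f"U+{ord(c):04X}" for c in plain])
print([f"U+{ord(c):04X}" for c in protected])

# Stored data and query match only if both went through prepare().
stored = prepare("\u05D9\u05B0\u05E8\u05D5\u05BC\u05E9\u05C1\u05B8" + typed)
print(prepare(typed) in stored)
```

The sketch also shows the cost being described: the match succeeds only because the stored text and the query both passed through the same `prepare` step.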
Short List of *Extra Rules or Things I Need a Solution for

right meteg
left meteg after a hataf vowel
Upper Punctum
Lower Punctum
Upper Double (thousands) dot, if 05C4 is the upper single (hundreds) dot
Reversed Nun
Any sequence of two vowels, including "laim" example
Any second vowel, such as for alternate pronunciation, which appears after the final low cant - thus a vowel-cant-vowel sequence
Another example I believe is the Adonai vowel markings on the name of God

The current mix of high-low, left and right is extraordinarily and inordinately complex, as if it were intended to be impossible to program. The top 6 can be handled by adding characters to the Unicode set for Hebrew, if the canonical classes are set reasonably. In the meantime, we are trying to substitute Latin marks in the 0300 series, but there seem to be conflicts there. We've talked about inserting a control character and perhaps that would work on the next two problems, although it is not working at present.

I would really have to go back and re-think the entire project if I were to accept canonical order as the required store order, rather than the sort order it was designed to be.

Joan Wardell
NRSI-SIL


Kenneth Whistler <[EMAIL PROTECTED]>
07/28/2003 05:32 PM
Please respond to Kenneth Whistler

To: [EMAIL PROTECTED]
cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: Re: Yerushala(y)im - or Biblical Hebrew


Joan Wardell responded to:

> > Why can't we just fix the database? :)
>
> KW:
> Because changing the canonical ordering classes (in ways not
> allowed by the stability policies) breaks the normalization
> *algorithm* and the expected test results it is tested against.
>
> JW:
> If the "expected test results" are bad data, it shouldn't matter
> then if it is consistent.

O.k. Stop right there. The expected test results are, in fact, *good* data. They accurately reflect the current statement of the algorithm, which was the point.
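What "the current statement of the algorithm" amounts to for a few of the marks under discussion can be inspected with Python's `unicodedata` module. This is a small sketch, not part of the original exchange: the selection of characters and the SHIN example are my own, and the values printed are whatever the interpreter's Unicode database reports.

```python
import unicodedata

# Hiriq, patah, meteg, shin dot, sin dot
marks = ["\u05B4", "\u05B7", "\u05BD", "\u05C1", "\u05C2"]
classes = {ch: unicodedata.combining(ch) for ch in marks}

for ch, ccc in classes.items():
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: ccc={ccc}")

# Marks on one base are sorted by class, so a meteg typed after
# the shin dot moves in front of it when the text is normalized.
SHIN, SHIN_DOT, METEG = "\u05E9", "\u05C1", "\u05BD"
typed = SHIN + SHIN_DOT + METEG
normalized = unicodedata.normalize("NFD", typed)
print([unicodedata.name(c) for c in normalized])
```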
> Are you saying that somewhere there are lots of people who have worked
> very hard to implement Hebrew as it is currently described in Unicode 3
> and they would have to "start over" if we changed the canonical order?
> And the biggest fear is that the data today won't be consistent with
> the data in the new order?

No, I am not. And the fact that you and others arguing for the canonical ordering change don't seem to recognize the distinction is part of the reason why we appear to be talking past each other.

The reason why the UTC defends the stabilization of the Unicode normalization specification is generic: it is the stability of the specification itself which is at issue and which impacts implementations in libraries, databases, applications, protocols, ...

In the case of people reporting that one or another particular fixed position class doesn't result in optimal text representation or ordering distinctions in combining marks for Hebrew, or Arabic, or Burmese, or ..., those considerations are utterly beside the point for stability of normalization per se. *Any* such changes to "correct" behavior would result in what would be considered by many others to be a fatal flaw in normalization itself.

That is why I have been assiduously promoting an alternate approach (insertion of CGJ) which does *not* impact normalization, but which gives Biblical Hebrew a straightforward means of representing all the distinctions it needs to maintain, even in normalized text.

> My point is that there *is* no data today, because anyone who has
> attempted to produce biblical Hebrew data in the current canonical
> order would have stopped and said "Wait a minute! This won't work".

It "won't work" (by which is meant, it won't maintain all the distinctions you want to maintain in plain text, under the assumption that plain text will be normalized) under certain assumptions about how Biblical Hebrew data should be "spelled".
It *will* work under other assumptions about spelling, which is what the CGJ proposal is all about.

> That's what I'm saying. And I have no particular problem with the CGJ
> suggestion, but it doesn't go far enough. I don't think we can use it
> to fix meteg, a mark which occurs in three different positions around
> a low vowel, yet is canonically ordered before the shin/sin dots! Will
> we put one CGJ on the right to indicate a right meteg and one on the
> left to indicate a left meteg?

No. I have no objection to encoding one more meteg character, as has been proposed, if it is reliably distinguished from the existing meteg. John Hudson has already argued that that is enough to enable dealing with the rest of the rendering distinctions contextually.

> There are many other examples of problems with the current
> canonical order.

Many other examples that aren't merely more examples of the generic issue which can be addressed by CGJ insertion?

> The apparent simplest solution to all the problems is to correct the
> canonical order.

In this case the "apparent simplest" solution is actually the worst, for the reasons I enumerated earlier in this thread.

> Yes, I am talking about the person writing a batch conversion from
> existing data into Unicode. That would be me. If you were only
> suggesting we insert one CGJ, I wouldn't complain.

O.k. Don't. ;-)

> But we are looking at re-writing the font, the keyboards, and the
> conversion so that we can work around the numerous problems with
> canonical order. I am selfishly preferring that you "normalizers"
> re-write your code. :)

I understand the impetus for this. It would be wonderful if the UTC could wave a magic wand over this, and then at such-and-such a date the problem would just go away.
But while, sure, I can locate the particular places in the code for my own library implementation of normalization where the canonical combining classes for hiriq and patah are defined, and yes, it would be a simple matter for me to change two numbers there, here is *my* point: that doesn't fix the problem. It creates a new version of normalization incompatible with the last version, and while I can control the two numbers in my own source code, I can*not* control the worldwide deployment of everybody's normalization code in infrastructure, applications, and protocols. All I could do at that point would be to watch (in either ignorance or horror) as incompatible versions of normalization, rolled out asynchronously over time, started creating interoperability problems.

*You* should, in fact, be concerned about such a prospect, because it is the Biblical Hebrew data which would be most impacted by inconsistent, dueling versions of Unicode normalization, if it ever came to that.

--Ken
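
The "change two numbers" scenario can be illustrated with a toy model in Python. This is a sketch under stated assumptions, not real normalization code: `canonical_reorder` is a simplified stand-in for the canonical ordering step only, and `changed_ccc` is a hypothetical table in which hiriq and patah simply trade their current classes.

```python
import unicodedata

LAMED, PATAH, HIRIQ = "\u05DC", "\u05B7", "\u05B4"

def canonical_reorder(text, ccc):
    """Simplified canonical ordering: stable-sort each run of
    non-starters (class != 0) by combining class."""
    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            out.extend(sorted(run, key=ccc))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)

current_ccc = unicodedata.combining  # classes as deployed today

def changed_ccc(ch):
    # Hypothetical "corrected" table: hiriq and patah swap classes.
    swapped = {HIRIQ: current_ccc(PATAH), PATAH: current_ccc(HIRIQ)}
    return swapped.get(ch, current_ccc(ch))

text = LAMED + PATAH + HIRIQ
old = canonical_reorder(text, current_ccc)  # what deployed code produces
new = canonical_reorder(text, changed_ccc)  # what updated code would produce

print(old == new)
# Each side also "re-normalizes" the other's output back to its own form.
print(canonical_reorder(new, current_ccc) == old)
```

Two systems running the two tables would each regard the other's output as unnormalized, so a binary comparison of the same text fails across them.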