On 14/07/2004 18:40, Kenneth Whistler wrote:

...

OK. But this is not a unique case. For example, in Hebrew, Silluq and Meteg, and likewise Dagesh and Shuruq, are pairs of different marks which share a glyph, and so a Unicode character, but may need to be distinguished for certain processes.


Can you show a pre-existing ISO character encoding standard, such as ISO 5429, for which there are bibliographic implementations whose conversion to Unicode is blocked by an encoding distinction not maintained in Unicode for these particular cases? ...


No, but I can show a pre-existing, clearly defined encoding: see http://wts.edu/hebrew/whmcodemanual.html dated 1982, especially point 1, "We now distinguish holem waw (`OW') from waw followed by holem", i.e. Holam Male from Vav Haluma, and point 2 on the three variants of Meteg. Texts based on these encodings have been in the public domain and circulated widely since 1982, and are available from such repositories as CCAT and the Oxford Text Archive. Conversion of these texts to Unicode is blocked by the current failure of Unicode to distinguish Holam Male from Vav Haluma and to distinguish three positions of Meteg.


... If so, then
you would have an analogous situation. ...


The only lack of analogy is that no one sought to get official ISO approval for an encoding which has been a de facto standard among Hebraists for more than 20 years.


... If not, then you are simply
talking about functional distinctions for the same encoded diacritic,
which might be needed to be maintained for some kinds of processing,
for which people can use whatever kinds of conventions they see
fit to deal with the issue -- but the issue doesn't rise to the
level of an encoding issue requiring formal intervention by WG2,
in my opinion.



I accept that this may be true of the Meteg/Silluq and Dagesh/Shuruq distinctions; but not of the Holam male/Vav Haluma and Meteg positioning distinctions which do involve graphical distinctions.


...

Should similar encodings with CGJ be proposed to make these distinctions?


If formal maintenance of a collation distinction between two otherwise identically *appearing* pieces of text -- based on whatever analytic status of the text is relevant -- is at issue, then representation of one sequence with CGJ and one without is a way recommended by the Unicode Standard to introduce a distinction which a tailored collation can then weight differently to get the required collation difference.



OK. But the problem here is that sometimes there *is* a graphical distinction between umlaut and tréma, and one might expect bibliographers to make use of fonts which do make the distinction to view their data. Unfortunately the chosen encoding with CGJ is not supposed to support such graphical distinctions, even when they would be very helpful for maintaining a database of mixed data. It seems to me that this solution will also "result in massive data representation ambiguities for German data" (quote from N2819). But then my main interest is not in German but in Hebrew.


...

256 variation selectors won't do if they have all been defined unchangeably with the wrong properties, e.g. combining class. On the other hand, if the UTC is prepared to ignore the combining class and normalisation problems involved in using one combining class zero character, CGJ, to modify a combining mark,


This completely misconstrues the solution in question for the German umlaut and tréma in bibliographic records. The CGJ is not introduced "to modify a combining mark". Instead, two text elements required to be distinguished in German bibliographic data are represented by two distinct sequences:

X + COMBINING DIAERESIS
X + CGJ + COMBINING DIAERESIS

This is completely in keeping with the intent of the CGJ in the
standard, and the proposal did not, in any way, "ignore the
combining class and normalisation problems" in this case.
... Which, by the way, is why the solution met with unanimous
approval in WG2, without objection from the UTC liaison.
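As an illustration of the two sequences above, here is a minimal Python sketch using the standard `unicodedata` module. The helper `encode` and the choice of "a" as the base character X are assumptions made for the example and are not taken from N2819. It checks only that the two sequences stay distinct under the four normalisation forms; it says nothing about rendering or collation.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER (combining class 0)
DIAERESIS = "\u0308"  # COMBINING DIAERESIS (combining class 230)


def encode(base, trema=False):
    """Hypothetical helper: umlaut is X + DIAERESIS, tréma is X + CGJ + DIAERESIS."""
    return base + (CGJ if trema else "") + DIAERESIS


def codepoints(text):
    """Show a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


umlaut = encode("a")
trema = encode("a", trema=True)

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    u = unicodedata.normalize(form, umlaut)
    t = unicodedata.normalize(form, trema)
    # The CGJ is a starter, so the diaeresis cannot compose with the base across it.
    print(form, "| umlaut:", codepoints(u), "| tréma:", codepoints(t))
```

Because the CGJ sits between the base and the diaeresis, the second sequence is left unchanged by normalisation, while the first composes to a single precomposed character under NFC and NFKC.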



N2819 does not deal with the issue of how to encode a base character (X) plus tréma and another combining mark (M). Should this be <X, M, CGJ, COMBINING DIAERESIS>, or <X, CGJ, M, COMBINING DIAERESIS>, or <X, CGJ, COMBINING DIAERESIS, M>? How is this issue affected by whether the combining class of M is less than, equal to or greater than that of COMBINING DIAERESIS? How do these sequences behave when normalised? The distinction is not necessarily theoretical because in some languages (certainly in Greek although I guess there is no ambiguity with umlaut there) a diaeresis indicating separation can co-occur with other accents. The German bibliographers need guidance on how to convert such combinations to Unicode while preserving the distinction from umlaut.
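The normalisation question can at least be explored mechanically. The following Python sketch, again using only the standard `unicodedata` module, applies NFD to the three candidate orders for three sample marks M. The base "u", the labels, the function names and the choice of sample marks (combining classes 220, 230 and 240) are assumptions for illustration; the sketch does not say which order ought to be recommended.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER (combining class 0)
DIAERESIS = "\u0308"  # COMBINING DIAERESIS (combining class 230)


def candidate_orders(base, mark):
    """The three candidate encodings of X + tréma + M (labels are illustrative)."""
    return {
        "X M CGJ D": base + mark + CGJ + DIAERESIS,
        "X CGJ M D": base + CGJ + mark + DIAERESIS,
        "X CGJ D M": base + CGJ + DIAERESIS + mark,
    }


def nfd_report(base, mark):
    """Return the NFD form of each candidate order."""
    return {
        label: unicodedata.normalize("NFD", text)
        for label, text in candidate_orders(base, mark).items()
    }


def codepoints(text):
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Sample marks with combining class below, equal to and above 230.
for mark in ("\u0323", "\u0301", "\u0345"):
    print(unicodedata.name(mark), "ccc =", unicodedata.combining(mark))
    for label, text in nfd_report("u", mark).items():
        print("   ", label, "->", codepoints(text))
```

In this sketch the first order is never reordered, because the CGJ separates M from the diaeresis. The second and third orders stay distinct only when M has the same combining class as the diaeresis; otherwise canonical reordering makes them identical, so they cannot carry different meanings.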

it may as well ignore the identical problems involved in using variation selectors, also combining class zero, with combining marks.



What you have been suggesting to do, however, *does* advocate ignoring the problems involved in attempting to use variation selectors to formally distinguish variants of combining marks.



No, I have attempted to deal with these issues, in the old thread on "Variation selectors and vowel marks", and have described in some detail what might be done in situations where the modified combining mark and another mark are on the same base character. I accept that I did not find a fully satisfactory solution, but I certainly did not ignore the problem. But the umlaut/tréma proposal fails to discuss this problem at all and so can reasonably be accused of ignoring it.


-- Peter Kirk [EMAIL PROTECTED] (personal) [EMAIL PROTECTED] (work) http://www.qaya.org/



