> From: Peter Kirk [mailto:[EMAIL PROTECTED]
> Sent: Friday, May 28, 2004 1:40 PM

> Well, I understood the semantic content of a text to be the meaning of
> the words

Unicode encodes characters, not languages, not morphemes, not senses of words. The character semantics of "Sally" and of "Sally" transliterated into Hebrew are not the same.

> >Her "Phoenician words" in this case are probably something like her
> >name, or a transliteration of English words.

> Is it really in the scope of Unicode to encode such trivialities? I have
> a key ring with my name "written" in an Egyptian hieroglyphic
> pseudo-alphabet. Will such abuse of Egyptian hieroglyphs have to be
> taken into account in the possible Unicode proposal for this script?

Why is that an abuse of hieroglyphs any more than Hebrew text transliterated or transcribed in Latin characters, or Arabic text transcribed in Hangul characters? Unicode is uninterested in what the content of the text is; it encodes characters, not text. It is up to users and implementers to decide what texts those characters can represent. So, absolutely, it is in the scope of Unicode.

> Children invent all kinds of alphabets in which to write their names;
> will all of these have to be encoded in Unicode?

The scenario did not involve children inventing an alphabet; it involved students making a history presentation that touched on, among other things, the Phoenician script.

> Well, if anyone has another scenario to propose, let's see it.

Fine.

Scenario (undesirable): The editor of a UCLA journal on ancient Indo-European linguistics receives submissions from numerous sources for publication in the journal. Certain formatting requirements are specified for submissions with respect to the kinds of document elements used, paragraph formatting and overall page layout. As is often the case in similar situations, however, no constraints are placed on the fonts used. Submissions are accepted in various file formats, including Word DOC, RTF, and certain XML or SGML languages.
Once approved for publication, submissions will be converted to one common file format and will be typeset using one collection of fonts. Submissions regularly contain characters in a variety of scripts / writing systems: Latin, Cyrillic, Old Italic, IPA, various Latin transliteration schemes, etc. Very often, the submitted text is formatted using fonts that the editor does not herself have. In some cases, the submission is formatted but not consistently marked up; in other cases, the text is marked up to identify document elements but not formatted at all. Markup does not always identify the language of the text, because in some cases the language is unknown or the text is an analytical reconstruction rather than any actual, known language, and because authors cannot be assumed to know how to do this with their applications.

With some regularity, a submission makes reference to Phoenician characters or includes examples in Phoenician-script text. Also, on rare occasions, submissions will cite Hebrew-language words, which are intended to be presented with square Hebrew glyphs. The Phoenician characters have exactly the same encoded representation as the square Hebrew text. As a result, fallback or default formatting will cause all such text to appear with square Hebrew glyphs. Therefore, before the editor can provide a draft to her panel of reviewers, she must go through the laborious process of carefully reading each submission to ensure that what she provides to reviewers has the intended presentation, as either Phoenician glyphs or square Hebrew glyphs.

This adds to her workload in reviewing all submissions, and especially so for any submissions that contain either Hebrew or Phoenician. On some occasions, this leads to costly delays in publication. On other occasions, incorrect glyphs are not spotted until after publication, requiring additional work to add corrigenda to subsequent editions, and detracting from the perceived quality of the journal as a whole.
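
The editor's difficulty can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any real workflow: the function name `script_of` and the sample strings are hypothetical, and the distinct-encoding case assumes Phoenician occupies its own range (the range used below, U+10900..U+1091F, is the block the standard later assigned to Phoenician).

```python
# Illustrative sketch only: script_of and the sample strings are hypothetical.

HEBREW_BLOCK = range(0x0590, 0x0600)        # Hebrew block, U+0590..U+05FF
PHOENICIAN_BLOCK = range(0x10900, 0x10920)  # assumed distinct Phoenician range


def script_of(text: str) -> str:
    """Guess the script of a run of text from its code points alone."""
    scripts = set()
    for ch in text:
        cp = ord(ch)
        if cp in HEBREW_BLOCK:
            scripts.add("Hebrew")
        elif cp in PHOENICIAN_BLOCK:
            scripts.add("Phoenician")
    if len(scripts) == 1:
        return scripts.pop()
    return "mixed" if scripts else "unknown"


# Unified encoding: a Hebrew citation and a Phoenician example are the
# same code points, so nothing in the plain text tells them apart.
hebrew_citation = "\u05d0\u05d1\u05d2"     # alef, bet, gimel
phoenician_unified = "\u05d0\u05d1\u05d2"  # same letters, meant to show Phoenician glyphs

# Distinct encoding: the Phoenician example carries its own code points.
phoenician_distinct = "\U00010900\U00010901\U00010902"

for sample in (hebrew_citation, phoenician_unified, phoenician_distinct):
    print(script_of(sample))
```

In the unified case the second sample is indistinguishable from the first without font or markup information, which is exactly what the submissions in this scenario cannot be relied on to provide.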

Alternate scenario (desirable): The editor receives submissions as described above. Because Phoenician script and Hebrew script are encoded distinctly, there is never any concern as to how text provided to reviewers will appear. She saves many hours of work both in preparing submissions for reviewers and in final typesetting. Embarrassing errors and the need to publish corrigenda are significantly reduced.

Now tell me that's an unrealistic or trivial scenario.

> Well, I have used Shoebox and Toolbox. I have also used your company's
> products, which at least allow me to add a script name field to my
> database but don't allow me to tailor collations. But I was thinking in
> terms of tailored collation weights for the Unicode collation algorithm.
> These are much more complex than setting up a new language configuration
> for Shoebox or Toolbox.

I suspect few Semitic paleographers are using MS database products. Also, from what I have seen, it is not at all uncommon for researchers in academia to have access to technology-support staff, including programmers. Not necessarily in every case, but every time I've interacted with someone associated with a university on such issues, they have had access to some kind of support of this type. (That's one of the things their funding requests are for.)

Moreover, unless I'm mistaken, the collation weights in this case would *not* be difficult to deal with, and in addition there have already been offers to do that work. Finally, the Semitic paleographers have indicated that their preference is to encode all of their text using the square Hebrew characters, so the character-folding issue is at best an occasional concern that many will never actually have to deal with.

I'm still completely unconvinced that the need for character folding is a significant impediment.

Peter

Peter Constable
Globalization Infrastructure and Font Technologies
Microsoft Windows Division

