Re: Interleaved collation of related scripts

Peter Kirk Fri, 14 May 2004 08:07:32 -0700

On 13/05/2004 14:33, Kenneth Whistler wrote:

Peter Kirk noted:

PS Multi-language bibliographies are common in Russian books. They are usually printed with the Latin script entries following the Cyrillic script ones, but I have seen interleaved ones.

Chris Jacobs noted:

has an index in which Greek and Latin script are interleaved.

The greek words are sorted according to their transliteration:

 ̔ (rough breathing) sorts as h
φ sorts as ph
These illustrate the typical situation with cross-script,
cross-language interfiling: They are *custom* solutions for
particular indexing problems. And they may involve issues of
transliteration or other adaptation to make like match with
like for the purposes of the people using the interfiled list.
Such tasks should *not* be attributed to the default collation
element table for the Unicode Collation Algorithm. ...
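A custom interfiling of the kind Chris describes can be sketched as a sort key that transliterates Greek into Latin before comparing. This is only an illustration: the table `GREEK_TO_LATIN` and the function `index_key` are hypothetical names, the mapping is one simplified scheme among many, and it is not taken from the index Chris mentions.

```python
import unicodedata

# Hypothetical, simplified transliteration table (an assumption for this
# sketch, not the scheme used by any particular index).
GREEK_TO_LATIN = {
    "α": "a", "β": "b", "γ": "g", "δ": "d", "ε": "e", "ζ": "z",
    "η": "e", "θ": "th", "ι": "i", "κ": "k", "λ": "l", "μ": "m",
    "ν": "n", "ξ": "x", "ο": "o", "π": "p", "ρ": "r", "σ": "s",
    "ς": "s", "τ": "t", "υ": "u", "φ": "ph", "χ": "ch", "ψ": "ps",
    "ω": "o",
}
ROUGH_BREATHING = "\u0314"  # COMBINING REVERSED COMMA ABOVE

def index_key(word):
    """Return a Latin-script key so Greek and Latin entries interfile."""
    out = []
    # NFD splits precomposed letters into base letter + combining marks.
    for ch in unicodedata.normalize("NFD", word.lower()):
        if ch == ROUGH_BREATHING:
            # The mark follows its vowel in NFD, but the "h" is filed
            # before it. (Simplification: rho with rough breathing would
            # come out as "hr" rather than the conventional "rh".)
            out.insert(max(len(out) - 1, 0), "h")
        elif ch in GREEK_TO_LATIN:
            out.append(GREEK_TO_LATIN[ch])
        elif unicodedata.combining(ch):
            continue  # ignore accents and other marks
        else:
            out.append(ch)
    return "".join(out)

entries = ["zebra", "φίλος", "hello", "ἥλιος", "phase"]
interfiled = sorted(entries, key=index_key)
```

The point of the sketch is that the interleaving lives entirely in the key function chosen by the indexer, which is why it is a custom solution rather than something a default table could supply.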

I agree that such situations are typical of cross-script interfiling, and so I do not support any suggestion of including a general mechanism for this in the default collation table. This table is not the place to define general purpose transliteration schemes.

But there is an exceptional issue within the family of north-west Semitic scripts, which may apply also to others, e.g. Greek, Coptic and archaic Greek, and possibly also the Indic scripts. Within these sets of scripts there is NO ambiguity about which characters correspond to which, as they have identical repertoires, possibly with additional letters in some of the scripts for which no equivalent can be defined in the others. These are marginal cases where some users prefer disunification and others prefer unification. Furthermore, they are cases where texts originally in the same language and script are encoded in Unicode in a variety of scripts, both because of changes in Unicode, e.g. Coptic disunification, and because of different scholarly preferences.

For such cases, in my opinion, a good case can be made for interfiling the scripts in the default algorithm. The major advantage of doing this is to allow integrated searching of text corpora in which texts have been encoded in more than one script.

...

Mike Ayers is on the right track here, I believe. The scenarios which people are adducing in arguing for interfiling should be addressed instead by appropriately designed normalizations -- which can be implemented using fairly easy-to-program, reusable scripts. Then sort on the *normalized* data using a much, much simpler collation table to accomplish what you need.
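The normalise-then-sort approach described here might look roughly like the following sketch. It assumes the Phoenician letters sit at U+10900..U+10915 (where the block was later encoded) and correspond one-to-one, in alphabetical order, to the 22 non-final Hebrew letters. The names `FOLD_TABLE`, `fold_to_hebrew` and `interfiled_sort` are hypothetical, and plain code-point order stands in for the "much simpler collation table".

```python
# Hebrew final forms mapped to their ordinary forms, so that e.g.
# final kaf and kaf compare equal after folding.
HEBREW_FINALS = {
    0x05DA: 0x05DB,  # final kaf   -> kaf
    0x05DD: 0x05DE,  # final mem   -> mem
    0x05DF: 0x05E0,  # final nun   -> nun
    0x05E3: 0x05E4,  # final pe    -> pe
    0x05E5: 0x05E6,  # final tsadi -> tsadi
}

# The 22 non-final Hebrew letters, alef..tav, in alphabetical order.
HEBREW_BASE = [c for c in range(0x05D0, 0x05EB) if c not in HEBREW_FINALS]

# Assumption: Phoenician U+10900..U+10915 lines up one-to-one with them.
FOLD_TABLE = {0x10900 + i: c for i, c in enumerate(HEBREW_BASE)}
FOLD_TABLE.update(HEBREW_FINALS)

def fold_to_hebrew(text):
    """Normalise Phoenician letters and Hebrew finals to base Hebrew."""
    return text.translate(FOLD_TABLE)

def interfiled_sort(words):
    """Sort on the folded form; the original breaks ties deterministically."""
    return sorted(words, key=lambda w: (fold_to_hebrew(w), w))

mlk_phoenician = "\U0001090C\U0001090B\U0001090A"  # mem, lamd, kaf
mlk_hebrew = "\u05DE\u05DC\u05DA"                  # mem, lamed, final kaf
same_word = fold_to_hebrew(mlk_phoenician) == fold_to_hebrew(mlk_hebrew)
```

Note that this only works when the data can be passed through the folding step first, which is exactly the limitation raised in the reply below.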

Mike Ayers suggested that users should write Perl scripts. This is something which computer geeks may be able to do, but it is simply impossible for the rest of humanity including scholars of ancient languages. Perl is not "God's gift to academic researchers" in general, although it may be God's gift to computer geeks.

The other problem with this is that the large corpora to be searched are not necessarily directly available to the users for normalisation. I can't normalise the whole Internet before doing a Google search for a Coptic or Phoenician word. What I need is a search engine which can (at least as a tailoring) collate together Coptic and Greek, Phoenician and Hebrew.

Ken wrote separately, to Dean Snyder:

Nobody plans to take away your rights and ability to continue doing what you now do, if it works very well for you. Please, sir, continue doing what you are doing with your current data.

Understood, and I note the smiley. But if some people continue to do what they are doing and others follow a new script, that is a recipe for confusion. The whole point of Unicode is to bring some consistency into the previous mess of different character encodings and masquerades. If the Unicode staff are now saying that it is OK to write Phoenician either with Hebrew characters masquerading as Phoenician or with the proposed Phoenician block, that opens the way to perpetuation of the confusion which existed before Unicode. It really would be far better, in the long run, if you said openly that anyone who continues to write Phoenician with Hebrew characters after the new block is accepted is wrong and breaking the standard, and should change their practices immediately.

But then, if you said that, you would of course add a lot more fuel to the fire, and you would be forced to consider properly whether such proposals as the separate Phoenician script have consensus support from the majority of regular professional users of the script.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/
