Peter Kirk noted:

> > PS Multi-language bibliographies are common in Russian books. They are
> > usually printed with the Latin script entries following the Cyrillic
> > script ones, but I have seen interleaved ones.
Chris Jacobs noted:

> has an index in which greek and latin script are interleaved.
>
> The greek words are sorted according to their transliteration:
>
> ̔ sorts as h
> φ sorts as ph

These illustrate the typical situation with cross-script, cross-language
interfiling: they are *custom* solutions for particular indexing
problems, and they may involve transliteration or other adaptation to
make like match with like for the purposes of the people using the
interfiled list.

Such tasks should *not* be attributed to the default collation element
table for the Unicode Collation Algorithm. That is simply inappropriate
design, failing to separate functions into appropriate layers. Throwing
too many requirements at the default table has at least two bad
results:

A. It makes the table itself more complex, which means that *all*
implementations that deal with it have to handle additional
complexity -- complexity that is irrelevant to all but a small minority
of specialized users of sorting.

B. It makes it more difficult to figure out how to tailor and customize
the base tables and their behavior for those instances where something
really specialized actually *is* needed (such as the Greek and Latin
index cited above).

It is the same kind of error, in my opinion, as designing a language
parser and then requiring that it handle character input in any
encoding. If that task is pushed into the *lexer* itself, you end up
with an unholy mess. The correct design is to use a properly
architected character set conversion module, convert all the input into
Unicode, and design the lexer to handle Unicode character input.

Mike Ayers is on the right track here, I believe. The scenarios which
people are adducing in arguing for interfiling should instead be
addressed by appropriately designed normalizations -- which can be
implemented using fairly easy-to-program, reusable scripts.
Then sort on the *normalized* data, using a much, much simpler
collation table, to accomplish what you need.

People who expect to import their particular normalization needs *into*
the default collation element table -- expecting thereby to get the
behavior they want "for free," right off the shelf, from the Windows
sorting APIs -- are in effect doing harm to all users of the UCA,
without actually buying themselves the flexibility they need to
accomplish their task in the end anyway.

--Ken
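[Editorial note: to make the normalize-then-sort suggestion concrete,
here is a minimal sketch in Python. The transliteration table is a
small, hypothetical fragment chosen to match Chris Jacobs's examples
(rough breathing sorts as "h", φ as "ph"); it is not a complete or
standard romanization scheme.]

```python
import unicodedata

# Hypothetical, partial Greek-to-Latin transliteration table,
# for illustration only.
GREEK_TO_LATIN = {
    "α": "a", "β": "b", "γ": "g", "ι": "i", "λ": "l",
    "ο": "o", "σ": "s", "ς": "s", "φ": "ph", "ω": "o",
}
ROUGH_BREATHING = "\u0314"  # sorts as "h", before its vowel

def sort_key(word: str) -> str:
    """Normalize a headword to a Latin-script key; Latin input passes through."""
    out = []
    for ch in unicodedata.normalize("NFD", word.lower()):
        if ch == ROUGH_BREATHING:
            # After decomposition the breathing follows its vowel;
            # move the "h" in front of that vowel for sorting.
            out.insert(len(out) - 1, "h")
        elif unicodedata.combining(ch):
            continue  # drop accents for collation purposes
        else:
            out.append(GREEK_TO_LATIN.get(ch, ch))
    return "".join(out)

entries = ["φίλος", "logos", "λόγος", "high", "ἁγιος"]
print(sorted(entries, key=sort_key))
```

Once the data is normalized this way, a plain lexicographic comparison
on the keys interleaves the two scripts; no change to the default UCA
table is needed.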

