Re: Unicode collation algorithm - interpretation

Jim Melton Mon, 19 Feb 2001 17:58:10 -0800
Mike,

Thanks for your response.  I find myself disappointed that there isn't more 
participation in this discussion (from people other than you and me), but it 
will undoubtedly come ;^)

At 05:05 PM 02/11/2001 +0000 Sunday, J M Sykes wrote:
>I think you misunderstand me. The "maximum level" I was referring to is that
>mentioned in UTR#10, section 4, "Main algorithm", 4.3 "Form a sort key for
>each string", para 2, which reads:
>
><quote>
>An implementation may allow the maximum level to be set to a smaller level
>than the available levels in the collation element array. For example, if
>the maximum level is set to 2, then level 3 and higher weights (including
>the normalized Unicode string) are not appended to the sort key. Thus any
>differences at levels 3 and higher will be ignored, leveling any such
>differences in string comparison.
></quote>

I couldn't have given you the exact quote or reference, but I was aware of 
this fact.  However, I interpreted it in a manner that suggested that the 
operation "set to a smaller level" was not (necessarily?) dynamic at a 
given invocation of a collation.  I interpreted it to mean that a given 
collation could be created from the array with one or more of the upper 
levels/weights not appearing.
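
To make the quoted paragraph concrete, here is a minimal sketch of forming a
sort key with a settable maximum level. The table, its weights, and the names
TOY_TABLE and sort_key are all made up for illustration; they do not come
from UTR#10 or from any real collation element table.

```python
# Toy collation element table: hypothetical weights, not taken from any real
# table. Each character maps to one element (primary, secondary, tertiary).
TOY_TABLE = {
    "a": (0x10, 0x20, 0x02),
    "A": (0x10, 0x20, 0x08),
    "\u00e1": (0x10, 0x21, 0x02),  # a with acute accent
    "b": (0x11, 0x20, 0x02),
    "B": (0x11, 0x20, 0x08),
}

LEVEL_SEPARATOR = 0  # lower than any weight in the toy table


def sort_key(text, max_level=3):
    """Build a sort key, appending weights level by level up to max_level."""
    elements = [TOY_TABLE[ch] for ch in text]
    key = []
    for level in range(max_level):
        if level > 0:
            key.append(LEVEL_SEPARATOR)
        # A zero weight means "ignorable at this level" and is skipped.
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)
```

With these toy weights, sort_key("ab", 2) equals sort_key("AB", 2) because
the case difference lives only at level 3, while the level 3 keys differ.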

>We can safely assume that at least some users will require sometimes exact,
>sometimes inexact comparisons (at least for pseudo-equality, to a lesser
>extent for sorting).

No disagreement here!  In fact, when I was at Digital, a hard-fought topic 
was specifically the one you've been raising: case-[in]sensitive and 
"accent"-[in]sensitive comparisons and ordering.

>We can also safely assume that users will wish to get the performance
>benefit of some preprocessing.
>
>It is clearly possible to preprocess as far as the end of step 2 of the
>Unicode collation algorithm without committing to a level. I understand you
>to say that several implementors have concluded that this level of
>preprocessing is not cost-effective, in comparison to going all the way to
>the sort key. I am in no position to dispute that conclusion.

Actually, that may not be what I meant.  I say "may" because I'm still not 
sure that we're talking about the same thing.  What I meant was that I 
believe that some implementations produce a code module that provides the 
behaviors of the Unicode collation algorithm for a specific collation 
element table (I believe this is the right term --- I mean the table that 
indicates the weights applied to each text element for a particular 
culture, script, language, etc.).  The code that this module contains would 
implement each step of the algorithm, but would have a preset, unchangeable 
answer for "the maximum level in the collation element array" mentioned in 
step 3.1 of the algorithm in UTR#10.

However, having read the collation algorithm *after* reading recent messages 
from you, I now see a different interpretation, one that I have no reason to 
believe is prohibited or unintended: that the choice to "de-append" one or 
more levels might be made dynamically.

Nonetheless, I believe that there may be implementations (conforming ones, 
I think?) that do not support such dynamic selection of "maximum level".
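
The two readings differ only in when the level gets fixed. The sketch below
is an assumption-laden illustration, not a description of any actual
implementation: the weights and every name in it (TOY_TABLE,
collation_elements, key_from_elements, make_fixed_collation) are invented.
It stores the level-independent collation elements once and then either
picks the level per comparison or bakes it in when the collation is built.

```python
# Hypothetical weights (primary, secondary, tertiary); not from a real table.
TOY_TABLE = {
    "a": (0x10, 0x20, 0x02),
    "A": (0x10, 0x20, 0x08),
    "\u00e1": (0x10, 0x21, 0x02),  # a with acute accent
}


def collation_elements(text):
    """Level-independent preprocessing: look up one element per character."""
    return [TOY_TABLE[ch] for ch in text]


def key_from_elements(elements, max_level):
    """Dynamic reading: the level is chosen when the key is formed."""
    key = []
    for level in range(max_level):
        if level > 0:
            key.append(0)  # level separator
        key.extend(ce[level] for ce in elements if ce[level] != 0)
    return tuple(key)


def make_fixed_collation(max_level):
    """Preset reading: the level is baked in when the collation is built."""
    def fixed_key(text):
        return key_from_elements(collation_elements(text), max_level)
    return fixed_key


# Preprocess once, without committing to a level.
stored = {word: collation_elements(word) for word in ("a", "A", "\u00e1")}

# A collation whose level cannot be changed by its caller.
case_blind = make_fixed_collation(2)
```

Here key_from_elements(stored["a"], 2) matches the key for "A", and the same
stored elements still yield distinct keys at level 3, whereas case_blind
can only ever answer at level 2.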

>I'm also unclear what an SQL-implementor is likely to supply as "a
>collation", though I imagine (only!) that it might be a part only of the
>CTT/CET appropriate to the script used by a particular culture, and with
>appropriate tailoring. But I have no reason to expect the executable
>("compiled"?) code that implements the algorithm to vary depending on the
>collation, or on the level (case-blind &c) specified by the user for a
>particular comparison.

As I stated above, I think there may be such implementations, but I would 
be very happy to have this refuted (even if by a statement from the Unicode 
people and by the ISO 14651 people that such an implementation would be 
non-conforming).  It is certainly very useful to Western cultures to 
quickly and inexpensively provide case-varying and "accent"-varying 
collations, even though these notions may be totally alien to many other 
cultures.

> > Of course, if you really want to specify an SQL collation name that
> > somehow identifies 2 or 3 or 4 (or more) collations built in
> > conformance with ISO
> > 14651 and then use an additional parameter to choose between them, I guess
> > that's possible (but not, IMHO, desirable).
>
>Unless you mean for performance reasons, I'd be interested to know why not
>desirable.

Actually, I meant that I would find it undesirable to build a "wrapper" 
around three, four, or more collation routines that merely accepts the 
additional parameter and selects among the "nested" collation 
routines.  That seems unnecessarily awkward and clumsy.
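
For illustration only, the kind of wrapper meant here might look like the
sketch below. The three routines are simple stand-ins built on Python's
str.casefold and unicodedata, not implementations of the Unicode collation
algorithm or of ISO 14651, and all names (NESTED, wrapped_equal, and the
three key functions) are hypothetical.

```python
import unicodedata


def exact_key(s):
    """Stand-in for a collation that keeps every difference."""
    return s


def case_blind_key(s):
    """Stand-in for a collation that ignores case but keeps accents."""
    return s.casefold()


def accent_and_case_blind_key(s):
    """Stand-in for a collation that ignores both case and accents."""
    decomposed = unicodedata.normalize("NFD", s.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Three separately built routines, keyed by the extra parameter.
NESTED = {
    3: exact_key,
    2: case_blind_key,
    1: accent_and_case_blind_key,
}


def wrapped_equal(a, b, level):
    """The wrapper does nothing but select one of the nested routines."""
    key = NESTED[level]
    return key(a) == key(b)
```

The wrapper adds no logic of its own, and each nested routine repeats work
that a single routine taking the level as a parameter would do once.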

I think I see that we've been talking at cross purposes (or at least with 
different assumptions and understandings), which is not uncommon in such 
discussions.  I hope I've sorted things out now, at least for myself.  If 
you, or somebody, is able to convince me that all (conforming) 
implementations of the Unicode collation algorithm must be able to select 
the maximum level dynamically, then I think we are going to be in firm 
agreement about approaching this.  If not, then I will probably remain a 
bit skeptical ;^)

I note that Tex Texin sent out a message that certainly seems to support 
your interpretation that the levels can be set dynamically.  In fact, Tex's 
explanations were enormously helpful to me in understanding the 
implementation approaches that are likely to be taken (thanks, Tex!).  If 
Tex's explanation is authoritative, then we're probably done and I am both 
happy and in agreement with this approach.  However, I had not previously 
heard the interpretations that I got from Tex's note so I am obviously 
still learning...

Thanks!
    Jim
========================================================================
Jim Melton --- Editor of ISO/IEC 9075-* (SQL)     Phone: +1.801.942.0144
Oracle Corporation            Oracle Email: mailto:[EMAIL PROTECTED]
1930 Viscounti Drive          Standards email: mailto:[EMAIL PROTECTED]
Sandy, UT 84093-1063           Personal email: mailto:[EMAIL PROTECTED]
USA                                                Fax : +1.801.942.3345
========================================================================
=  Facts are facts.  However, any opinions expressed are the opinions  =
=  only of myself and may or may not reflect the opinions of anybody   =
=  else with whom I may or may not have discussed the issues at hand.  =
========================================================================