RE: locale-aware string comparisons

Whistler, Ken Mon, 31 Dec 2012 15:43:50 -0800

Well, in answering the question which was actually posed here:

1. ISO/IEC 10646 has absolutely nothing to say about this issue, because 10646 
does not define case mapping at all.


2. The Unicode Standard *does* define case mapping, of course, as well as case 
folding. The relevant details are in Section 3.13 of the standard, supported by 
various data files in the Unicode Character Database. TUS 6.2, Section 3.13, p. 
117, does define toUpperCase(X) and toLowerCase(X), but those are string 
mapping operations, not directly comparable to Linux (and in general Unix) 
toupper() and tolower(), which are character mapping functions. The closer 
correlates to Linux toupper() and tolower() are Unicode's definitions of 
Uppercase_Mapping(C) and Lowercase_Mapping(C). However, there is a significant 
difference lurking, in that the Unicode case mapping definitions are not 
locale-sensitive. The full case mappings do include two conditional sets of 
mappings (from SpecialCasing.txt) for Lithuanian and for Turkish and Azeri, 
mostly affecting the behavior of the dot on "i", but the use of those 
conditional mappings depends on the availability of explicit language context.

This contrasts with the Linux (and in general Unix) toupper() and tolower() 
functions, which in principle, at least, are locale-sensitive, depending on the 
current locale setting, and in particular on whether the LC_CTYPE category in 
the locale has a non-null list of mappings for toupper and/or tolower in it.

Perhaps even more importantly, the Unicode Standard does not state anything 
regarding the details of the behavior of the APIs strcasecmp() or tolower() or 
toupper() in libc. Those are the concerns of the C and POSIX specs, not the 
Unicode Standard. Nor could the Unicode Standard really get involved in this, 
precisely because that behavior involves locales, and locales are outside the 
scope of the Unicode Standard.
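
As a rough illustration of the string-mapping versus character-mapping distinction in point 2, here is a minimal Python sketch. It assumes Python 3, whose built-in str.upper() and str.lower() apply the default (untailored) Unicode full case mappings and take no language context. It is not a model of libc's toupper() or tolower(), and the variable names are only for illustration.

```python
# Python's str methods use the default Unicode full case mappings,
# independent of the process locale (an assumption worth checking on
# any runtime other than CPython 3).

# A string mapping may change length, which a character-to-character
# function such as C's toupper() cannot express.
sharp_s = "\u00DF"                 # LATIN SMALL LETTER SHARP S
upper_sharp_s = sharp_s.upper()    # two characters, "SS"

# With no language context, capital I lowercases to dotted i (U+0069),
# never to U+0131 LATIN SMALL LETTER DOTLESS I as Turkish would require.
plain_lower = "I".lower()

# U+0130 lowercases to the two-character sequence <U+0069, U+0307>
# under the default full mapping.
dotted_lower = "\u0130".lower()

print(len(upper_sharp_s), [hex(ord(c)) for c in dotted_lower])
```

Getting the Turkish or Lithuanian conditional mappings would require a library that accepts an explicit language, which this sketch deliberately leaves out.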

3. Regarding LDML and CLDR, somebody with specific expertise on CLDR may have 
to jump in here, but while locales clearly *are* in the scope of LDML and CLDR, 
there is currently little if anything they have to say about specific case 
mapping rules.

As regards the particulars of the question, I suspect that it would depend in 
part on how strcasecmp(), str_tolower() and str_toupper() are implemented (I am 
assuming string conversions APIs here based on the tolower() and toupper() 
APIs), but there probably *are* instances where the results would diverge. The 
most likely source of trouble would be Turkish case mapping. In particular, if 
you compare U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE to a canonically 
equivalent sequence of <U+0049, U+0307>, there may be conundrums. If 
strcasecmp() is implemented based on Turkish case folding, then 
strcasecmp( U+0130, <U+0049, U+0307> ) == 0. If str_tolower() is based on 
Turkish case mapping, then str_tolower( U+0130 ) == <U+0069, U+0307>, so 
strcmp( str_tolower( U+0130 ), str_tolower( <U+0049, U+0307> ) ) == 0, *but* 
str_toupper( U+0130 ) == U+0130 and 
str_toupper( <U+0049, U+0307> ) == <U+0049, U+0307>, so 
strcmp( str_toupper( U+0130 ), str_toupper( <U+0049, U+0307> ) ) != 0. The two 
uppercased versions are *canonically* equivalent, but you wouldn't expect a 
strcmp() operation to be checking normalization of strings. So unless the 
implementations of str_tolower() and str_toupper() were doing canonical 
normalization as well as case mapping, you could indeed find some odd edge 
cases for Turkish casing, at least.
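
The edge case above can be reproduced in a small Python sketch. Note the assumptions: Python's str.lower(), str.upper() and str.casefold() use the default Unicode mappings rather than Turkish-tailored ones, and plain == on strings stands in for strcmp(). The str_tolower() and str_toupper() functions discussed above are hypothetical, so this only approximates them.

```python
import unicodedata

precomposed = "\u0130"     # LATIN CAPITAL LETTER I WITH DOT ABOVE
decomposed = "I\u0307"     # <U+0049, U+0307>, canonically equivalent

def nfc(s):
    """Return the NFC (canonically composed) form of s."""
    return unicodedata.normalize("NFC", s)

# Lowercasing both forms yields the same code point sequence.
lower_equal = precomposed.lower() == decomposed.lower()

# Uppercasing leaves each form unchanged, so a code point comparison
# sees two different strings.
upper_equal = precomposed.upper() == decomposed.upper()

# Normalizing after case mapping removes the discrepancy.
upper_equal_nfc = nfc(precomposed.upper()) == nfc(decomposed.upper())

# Default full case folding also treats the two forms as a match.
fold_equal = precomposed.casefold() == decomposed.casefold()

print(lower_equal, upper_equal, upper_equal_nfc, fold_equal)
```

Under these assumptions the lowercase and case-folded comparisons match while the uppercase comparison does not, unless normalization is applied as well.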

--Ken

> Given (just) the data in 10646, Unicode and cldr, are there any locales
> where a case-insensitive match should be different than a case-preserving
> match of the results of lower-casing the two strings?
> 
> Ie, in terms of locale-aware versions of the typical libc functions,
> should strcasecmp(s1,s2) ever generate different results than
> strcmp(tolower(s1),tolower(s2)) or strcmp(toupper(s1),toupper(s2))?
> (By mentioning strcmp() et al, I do not exclude mb or w versions of
> those functions.)
> 
> And to be clear, the question isn't about any specific, existing
> implementation but only about what the 10646, unicode and cldr suite
> of standards have to say on the matter.
> 
> Thanks,
> 
> -JimC
> --
> James Cloos <[email protected]>         OpenPGP: 1024D/ED7DAEA6
