Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On Fri, Jun 9, 2017 at 11:09 AM, Robert Haas wrote:
>> Isn't that what strxfrm() is?
>
> Yeah, just with bugs. If ICU has a non-buggy equivalent, then we can
> make this work.

I agree that it probably isn't worth using strxfrm() again, simply because the glibc implementation is buggy, and glibc as a project is not at all concerned about how badly that would affect PostgreSQL.

I would like to point out on this thread that the strcmp() tie-breaker is also a big blocker to implementing normalized keys in B-Tree indexes (at least, if you want to get them for collated text, which I think you really need to make the implementation effort worth it). This is something that is discussed in a section on the normalized keys wiki page I created recently [1].

[1] https://wiki.postgresql.org/wiki/Key_normalization#ICU.2C_text_equality_semantics.2C_and_hashing

--
Peter Geoghegan

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On Fri, Jun 9, 2017 at 1:45 PM, Peter Eisentraut wrote:
> On 6/9/17 12:17, Robert Haas wrote:
>> IOW, suppose there
>> were a collation API call distill() which had the property that
>> strcmp(distill(X), distill(Y)) == 0 iff X and Y are considered equal
>> under that collation. Then, you could define your hash function as
>> hash_any(distill(X)). Alternatively, if the collation library
>> provided its own hashing function, that would be fine too, and
>> probably faster.
>
> Isn't that what strxfrm() is?

Yeah, just with bugs. If ICU has a non-buggy equivalent, then we can make this work.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On Fri, Jun 9, 2017 at 10:45 AM, Robert Haas wrote:
>> But they are getting the sort order they need. They just don't get the
>> equality semantics they expect.
>
> You're right.

If we happened to ever guarantee the user a stable sort, then I'd be wrong. We don't, though.

--
Peter Geoghegan
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On 6/9/17 12:17, Robert Haas wrote:
> IOW, suppose there
> were a collation API call distill() which had the property that
> strcmp(distill(X), distill(Y)) == 0 iff X and Y are considered equal
> under that collation. Then, you could define your hash function as
> hash_any(distill(X)). Alternatively, if the collation library
> provided its own hashing function, that would be fine too, and
> probably faster.

Isn't that what strxfrm() is?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On Fri, Jun 9, 2017 at 12:18 PM, Peter Geoghegan wrote:
> On Fri, Jun 9, 2017 at 9:17 AM, Robert Haas wrote:
>> I'm not exactly sure what is possible or
>> desirable, but I would not be too surprised to hear complaints about
>> the observed behavior different from the "pure" ICU behavior because
>> of the tiebreak, and at least some users might even find it worth
>> giving up hashing in order to get the exact sort order they need.
>
> But they are getting the sort order they need. They just don't get the
> equality semantics they expect.

You're right.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On Fri, Jun 9, 2017 at 9:17 AM, Robert Haas wrote:
> I'm not exactly sure what is possible or
> desirable, but I would not be too surprised to hear complaints about
> the observed behavior different from the "pure" ICU behavior because
> of the tiebreak, and at least some users might even find it worth
> giving up hashing in order to get the exact sort order they need.

But they are getting the sort order they need. They just don't get the equality semantics they expect.

--
Peter Geoghegan
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On Fri, Jun 9, 2017 at 11:46 AM, Tom Lane wrote:
> Peter Eisentraut writes:
>> On 6/9/17 11:12, Tom Lane wrote:
>>> https://www.postgresql.org/message-id/27064.1134753...@sss.pgh.pa.us
>
>> Good to know. That just says that if we were to go with the strcoll()
>> result only, things would work correctly.
>
> There's still the hashing problem.

Tom, that mailing list discussion is very illuminating. Thanks for digging it up.

Regarding the question of hashing, one way to support that would be if we had some sort of canonicalization function. IOW, suppose there were a collation API call distill() which had the property that strcmp(distill(X), distill(Y)) == 0 iff X and Y are considered equal under that collation. Then, you could define your hash function as hash_any(distill(X)). Alternatively, if the collation library provided its own hashing function, that would be fine too, and probably faster.

On the other hand, is there any rule that says we have to support hashing? Certainly, if we defined a new datatype collated_text, it could have a btree opfamily and no hash opfamily. It's trickier with only one datatype, but possibly we could come up with a way for an opfamily to be consulted about whether it is available for a given choice of collation. I'm not exactly sure what is possible or desirable, but I would not be too surprised to hear complaints about the observed behavior being different from the "pure" ICU behavior because of the tiebreak, and at least some users might even find it worth giving up hashing in order to get the exact sort order they need.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
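As an illustration of the distill() idea above, here is a minimal sketch in plain C. Everything in it is an assumption made for the example: toy_distill() is a hypothetical canonicalization that merely folds ASCII case, and toy_hash() is an FNV-1a hash standing in for hash_any(). No real collation library is involved.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical distill(): writes a canonical form of src into dst.
 * Here "canonical" only means ASCII lower case; input longer than
 * dstlen - 1 bytes is truncated, which a real version must not do.
 */
static void toy_distill(const char *src, char *dst, size_t dstlen)
{
    size_t i = 0;

    assert(src != NULL && dst != NULL && dstlen > 0);
    for (; src[i] != '\0' && i + 1 < dstlen; i++)
        dst[i] = (char) tolower((unsigned char) src[i]);
    dst[i] = '\0';
}

/* 32-bit FNV-1a over the bytes of s; stand-in for a byte-wise hash. */
static uint32_t toy_hash(const char *s)
{
    uint32_t h = 2166136261u;

    for (; *s != '\0'; s++) {
        h ^= (unsigned char) *s;
        h *= 16777619u;
    }
    return h;
}

/* Equality defined as strcmp(distill(a), distill(b)) == 0. */
static bool distilled_equal(const char *a, const char *b)
{
    char da[256];
    char db[256];

    toy_distill(a, da, sizeof da);
    toy_distill(b, db, sizeof db);
    return strcmp(da, db) == 0;
}

/* Hash defined as hash(distill(s)), so equal strings hash alike. */
static uint32_t distilled_hash(const char *s)
{
    char d[256];

    toy_distill(s, d, sizeof d);
    return toy_hash(d);
}
```

Under these assumptions, any two strings for which distilled_equal() returns true also get the same distilled_hash() value, which is the property a hash opfamily would need.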
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
Peter Eisentraut writes:
> On 6/9/17 11:12, Tom Lane wrote:
>> https://www.postgresql.org/message-id/27064.1134753...@sss.pgh.pa.us

> Good to know. That just says that if we were to go with the strcoll()
> result only, things would work correctly.

There's still the hashing problem.

regards, tom lane
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On 6/9/17 11:12, Tom Lane wrote:
> Robert Haas writes:
>> I have to admit that I'm still a little confused about what's actually
>> going on here. That commit says that it "fixes inconsistent behavior under
>> glibc's hu_HU locale", but it doesn't say what sort of inconsistent
>> behavior it fixes.
>
> Unfortunately we were not good back then about linking commits to
> list discussions, but a bit of excavation in the archives found this:
>
> https://www.postgresql.org/message-id/27064.1134753...@sss.pgh.pa.us

Good to know. That just says that if we were to go with the strcoll() result only, things would work correctly. Again, some other details to work out.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
Robert Haas writes:
> I have to admit that I'm still a little confused about what's actually
> going on here. That commit says that it "fixes inconsistent behavior under
> glibc's hu_HU locale", but it doesn't say what sort of inconsistent
> behavior it fixes.

Unfortunately we were not good back then about linking commits to list discussions, but a bit of excavation in the archives found this:

https://www.postgresql.org/message-id/27064.1134753...@sss.pgh.pa.us

regards, tom lane
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On 6/9/17 10:31, Robert Haas wrote:
> + * In some locales strcoll() can claim that nonidentical strings are
> + * equal. Believing that would be bad news for a number of reasons,
> + * so we follow Perl's lead and sort "equal" strings according to
> + * strcmp().
>
> Again, however, the reasons why believing it would be bad news are not
> enumerated. It is merely asserted that there is more than one such
> reason.

I suspect that there were just issues that haven't been thought through yet, including hashing.

More generally, the code's receptiveness to internationalization issues is ever expanding. Early code probably also thought that using multibyte characters or non-C locales was bad news. Over time, we have worked those issues out. This might just be one more.

> So, what's special about text that it can never report two
> non-byte-for-byte values as equal? And could we consider changing
> that, so that users can select an ICU collator and get exactly the
> behavior ICU delivers, without the extra tiebreak?

I don't think there is anything special. We just need to work through the details.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On Fri, Jun 2, 2017 at 2:22 PM, Peter Geoghegan wrote:
> On Fri, Jun 2, 2017 at 10:34 AM, Amit Khandekar
> wrote:
>> Ok. I was thinking we are doing the tie-breaker because specifically
>> strcoll_l() was unexpectedly returning 0 for some cases. Now I get it,
>> that we do that to be compatible with texteq().
>
> Both of these explanations are correct, in a way. See commit 656beff.

I have to admit that I'm still a little confused about what's actually going on here. That commit says that it "fixes inconsistent behavior under glibc's hu_HU locale", but it doesn't say what sort of inconsistent behavior it fixes. It added a comment - which remains to this day - saying this:

+ * In some locales strcoll() can claim that nonidentical strings are
+ * equal. Believing that would be bad news for a number of reasons,
+ * so we follow Perl's lead and sort "equal" strings according to
+ * strcmp().

Again, however, the reasons why believing it would be bad news are not enumerated. It is merely asserted that there is more than one such reason.

Now, it is obviously not true in general that a comparison operator can never deem two values which are not byte-for-byte identical as equal, because citext does exactly that (indeed, that's the point). I thought maybe citext could get away with it because it lacked indexing support but, nope, it has indexing support. Also, the in-core numeric data type has the same property ('1.0'::numeric = '1'::numeric, but scale() reveals that they are not byte-for-byte identical).

So, what's special about text that it can never report two non-byte-for-byte values as equal? And could we consider changing that, so that users can select an ICU collator and get exactly the behavior ICU delivers, without the extra tiebreak?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On 2 June 2017 at 23:52, Peter Geoghegan wrote:
> On Fri, Jun 2, 2017 at 10:34 AM, Amit Khandekar
> wrote:
>> Ok. I was thinking we are doing the tie-breaker because specifically
>> strcoll_l() was unexpectedly returning 0 for some cases. Now I get it,
>> that we do that to be compatible with texteq().
>
> Both of these explanations are correct, in a way. See commit 656beff.
>
>> Secondly, I was also considering if ICU especially has a way to
>> customize an ICU locale by setting some attributes which dictate
>> comparison or sorting rules for a set of characters. I mean, if there
>> is such customized ICU locale defined in the system, and we use that
>> to create PG collation, I thought we might have to strictly follow
>> those rules without a tie-breaker, so as to be 100% conformant to ICU.
>> I can't come up with an example, or maybe there isn't one, but, say,
>> there is a locale which is supposed to sort only by lowest comparison
>> strength (de@strength=1 ?? ). In that case, there might be many
>> characters considered equal, but PG < operator or > operator would
>> still return true for those chars.
>
> In the terminology of the Unicode collation algorithm, PostgreSQL
> "forces deterministic comparisons" [1]. There is a lot of information
> on the details of that within the UCA spec.
>
> If we ever wanted to offer a case insensitive collation feature, then
> we wouldn't necessarily have to do the equivalent of a full strxfrm()
> when hashing, at least with collations controlled by ICU. Perhaps we
> could instead use a collator whose UCOL_STRENGTH is only UCOL_PRIMARY
> to build binary sort keys, and leave the rest to a ucol_equal() call
> (within texteq()) that has the usual UCOL_STRENGTH for the underlying
> PostgreSQL collation.
>
> I don't think it would be possible to implement case insensitive
> collations by using some pre-existing ICU collation that is case
> insensitive. Instead, an implementation might directly vary collation
> strength of any given collation to achieve case insensitivity.
> PostgreSQL would know that this collation was case insensitive, so
> regular collations wouldn't need to change their
> behavior/implementation (to use ucol_equal() within texteq(), and so
> on).

Ah ok. Understood, thanks.

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On Fri, Jun 2, 2017 at 10:34 AM, Amit Khandekar wrote:
> Ok. I was thinking we are doing the tie-breaker because specifically
> strcoll_l() was unexpectedly returning 0 for some cases. Now I get it,
> that we do that to be compatible with texteq().

Both of these explanations are correct, in a way. See commit 656beff.

> Secondly, I was also considering if ICU especially has a way to
> customize an ICU locale by setting some attributes which dictate
> comparison or sorting rules for a set of characters. I mean, if there
> is such customized ICU locale defined in the system, and we use that
> to create PG collation, I thought we might have to strictly follow
> those rules without a tie-breaker, so as to be 100% conformant to ICU.
> I can't come up with an example, or maybe there isn't one, but, say,
> there is a locale which is supposed to sort only by lowest comparison
> strength (de@strength=1 ?? ). In that case, there might be many
> characters considered equal, but PG < operator or > operator would
> still return true for those chars.

In the terminology of the Unicode collation algorithm, PostgreSQL "forces deterministic comparisons" [1]. There is a lot of information on the details of that within the UCA spec.

If we ever wanted to offer a case insensitive collation feature, then we wouldn't necessarily have to do the equivalent of a full strxfrm() when hashing, at least with collations controlled by ICU. Perhaps we could instead use a collator whose UCOL_STRENGTH is only UCOL_PRIMARY to build binary sort keys, and leave the rest to a ucol_equal() call (within texteq()) that has the usual UCOL_STRENGTH for the underlying PostgreSQL collation.

I don't think it would be possible to implement case insensitive collations by using some pre-existing ICU collation that is case insensitive. Instead, an implementation might directly vary collation strength of any given collation to achieve case insensitivity. PostgreSQL would know that this collation was case insensitive, so regular collations wouldn't need to change their behavior/implementation (to use ucol_equal() within texteq(), and so on).

[1] http://unicode.org/reports/tr10/#Forcing_Deterministic_Comparisons

--
Peter Geoghegan
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On 2 June 2017 at 03:18, Thomas Munro wrote:
> On Fri, Jun 2, 2017 at 9:27 AM, Peter Geoghegan wrote:
>> On Thu, Jun 1, 2017 at 2:24 PM, Thomas Munro
>> wrote:
>>> Why should ICU be any different than the system provider in this
>>> respect? In both cases, we have a two-level comparison: first we use
>>> the collation-aware comparison, and then as a tie breaker, we use a
>>> binary comparison. If we didn't do a binary comparison as a
>>> tie-breaker, wouldn't the result be logically incompatible with the =
>>> operator, which does a binary comparison?

Ok. I was thinking we are doing the tie-breaker because specifically strcoll_l() was unexpectedly returning 0 for some cases. Now I get it, that we do that to be compatible with texteq().

Secondly, I was also considering if ICU especially has a way to customize an ICU locale by setting some attributes which dictate comparison or sorting rules for a set of characters. I mean, if there is such customized ICU locale defined in the system, and we use that to create PG collation, I thought we might have to strictly follow those rules without a tie-breaker, so as to be 100% conformant to ICU. I can't come up with an example, or maybe there isn't one, but, say, there is a locale which is supposed to sort only by lowest comparison strength (de@strength=1 ?? ). In that case, there might be many characters considered equal, but PG < operator or > operator would still return true for those chars.

-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On Fri, Jun 2, 2017 at 9:27 AM, Peter Geoghegan wrote:
> On Thu, Jun 1, 2017 at 2:24 PM, Thomas Munro
> wrote:
>> Why should ICU be any different than the system provider in this
>> respect? In both cases, we have a two-level comparison: first we use
>> the collation-aware comparison, and then as a tie breaker, we use a
>> binary comparison. If we didn't do a binary comparison as a
>> tie-breaker, wouldn't the result be logically incompatible with the =
>> operator, which does a binary comparison?
>
> I agree with that assessment.

I think you *could* make a logically consistent set of operations with no binary tie-breaker. = could be defined in terms of strcoll and hash could hash the output of strxfrm, but it'd be impractical and slow. In order to take advantage of simple and fast = and hash, we go the other way and teach < and > about binary order.

--
Thomas Munro
http://www.enterprisedb.com
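For reference, the "no tie-breaker" alternative described above might look roughly like this with the standard C library. This is a sketch under assumptions, not PostgreSQL code: collated_eq() and hash_xfrm() are hypothetical names, the hash is a plain FNV-1a, and the program is assumed to run in whatever locale the caller has set (the "C" locale if setlocale() was never called). It also shows where the extra cost comes from: every hash needs an allocation and a full strxfrm() pass.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* "=" defined purely by the collation: equal when strcoll() returns 0. */
static bool collated_eq(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcoll(a, b) == 0;
}

/* Hash of the strxfrm() output, so collation-equal strings hash alike. */
static uint32_t hash_xfrm(const char *s)
{
    /* With n == 0, strxfrm() only reports the length it would need. */
    size_t len = strxfrm(NULL, s, 0);
    char *key = malloc(len + 1);
    uint32_t h = 2166136261u;
    size_t i;

    if (key == NULL)
        abort();
    strxfrm(key, s, len + 1);

    /* 32-bit FNV-1a over the transformed bytes. */
    for (i = 0; i < len; i++) {
        h ^= (unsigned char) key[i];
        h *= 16777619u;
    }
    free(key);
    return h;
}
```

This relies on strxfrm() being consistent with strcoll(), which is exactly the guarantee that other messages in this thread report as unreliable in glibc.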
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
Peter Geoghegan writes:
> On Thu, Jun 1, 2017 at 2:24 PM, Thomas Munro
> wrote:
>> Why should ICU be any different than the system provider in this
>> respect? In both cases, we have a two-level comparison: first we use
>> the collation-aware comparison, and then as a tie breaker, we use a
>> binary comparison. If we didn't do a binary comparison as a
>> tie-breaker, wouldn't the result be logically incompatible with the =
>> operator, which does a binary comparison?

> I agree with that assessment.

The critical reason why this is not optional is that if texteq were to return true for strings that aren't bitwise identical, that breaks hashing --- unless you can guarantee that the hash values for such strings will be equal anyway. That's hardly possible when we don't even know what the collation's comparison rule is, and would likely be difficult even if we had complete knowledge. So no, we're not going there for ICU any more than we did for libc.

regards, tom lane
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On Thu, Jun 1, 2017 at 2:24 PM, Thomas Munro wrote:
> Why should ICU be any different than the system provider in this
> respect? In both cases, we have a two-level comparison: first we use
> the collation-aware comparison, and then as a tie breaker, we use a
> binary comparison. If we didn't do a binary comparison as a
> tie-breaker, wouldn't the result be logically incompatible with the =
> operator, which does a binary comparison?

I agree with that assessment.

--
Peter Geoghegan
Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings
On Fri, Jun 2, 2017 at 6:58 AM, Amit Khandekar wrote:
> While comparing two text strings using varstr_cmp(), if *strcoll*()
> call returns 0, we do strcmp() tie-breaker to do binary comparison,
> because strcoll() can return 0 for non-identical strings :
>
> varstr_cmp()
> {
>     ...
>     /*
>      * In some locales strcoll() can claim that nonidentical strings are
>      * equal. Believing that would be bad news for a number of reasons,
>      * so we follow Perl's lead and sort "equal" strings according to
>      * strcmp().
>      */
>     if (result == 0)
>         result = strcmp(a1p, a2p);
>     ...
> }
>
> But is this supposed to apply for ICU collations as well ? If
> collation provider is icu, the comparison is done using
> ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally returns
> some characters as being identical, so doing strcmp() may not make
> sense.
>
> For example, if the below two characters are compared using
> ucol_strcollUTF8(), it returns 0, meaning the strings are identical :
> Greek Oxia : UTF-16 encoding : 0x1FFD
> (http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
> Greek Tonos : UTF-16 encoding : 0x0384
> (http://www.fileformat.info/info/unicode/char/0384/index.htm)
>
> The characters are displayed like this :
> postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
>  ?column? | ?column?
> ----------+----------
>  ´        | ΄
> (Although this example has similar looking characters, this might not
> be a factor behind treating them equal)
>
> Now since ucol_strcoll*() returns 0, these strings are always compared
> using strcmp(), so 1FFD > 0384 returns true :
>
> create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');
>
> postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  t
>
> Whereas, if strcmp() is skipped for ICU collations :
> if (result == 0 && !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
>     result = strcmp(a1p, a2p);
>
> ... then the comparison using ICU collation tells they are identical strings :
>
> postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  f
> (1 row)
>
> postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  f
> (1 row)
>
> postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  t
>
>
> Now I have verified that strcoll() returns true for 1FFD > 0384. So,
> it looks like ICU API function ucol_strcoll() returns false by
> intention. That's the reason I feel like the
> strcmp-if-strcoll-returns-0 thing might not be applicable for ICU. But
> I may be wrong, please correct me if I may be missing something.

I may not have had enough coffee yet, but...

Why should ICU be any different than the system provider in this respect? In both cases, we have a two-level comparison: first we use the collation-aware comparison, and then as a tie breaker, we use a binary comparison. If we didn't do a binary comparison as a tie-breaker, wouldn't the result be logically incompatible with the = operator, which does a binary comparison?

Put another way, if we didn't use binary order tie-breaking, we'd have to teach texteq to understand collations (ie be defined as not (a < b) and not (a > b)), otherwise we'd permit contradictions like a != b and not (a < b) and not (a > b).

--
Thomas Munro
http://www.enterprisedb.com
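The contradiction described above can be reproduced with a toy model. The sketch below assumes strcasecmp() (POSIX) as a stand-in for a collation that treats some distinct strings as equal; every function name is hypothetical and none of this is PostgreSQL code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>   /* strcasecmp(), POSIX; stand-in for strcoll() */

/* Binary equality, like an "=" operator that compares bytes. */
static bool eq_binary(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}

/* "<" that trusts the collation result alone (no tie-breaker). */
static bool lt_collation_only(const char *a, const char *b)
{
    return strcasecmp(a, b) < 0;
}

/* "<" with the binary tie-breaker applied when the collation says 0. */
static bool lt_tiebreak(const char *a, const char *b)
{
    int r = strcasecmp(a, b);

    if (r == 0)
        r = strcmp(a, b);
    return r < 0;
}

/* For a sane ordering exactly one of a < b, a = b, a > b holds. */
static bool trichotomy(bool lt, bool eq, bool gt)
{
    return ((int) lt + (int) eq + (int) gt) == 1;
}
```

For the pair "abc" and "ABC", combining eq_binary() with lt_collation_only() makes all three relations false, which is the contradiction in question. Replacing the ordering with lt_tiebreak() restores exactly one true relation.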
[HACKERS] strcmp() tie-breaker for identical ICU-collated strings
While comparing two text strings using varstr_cmp(), if the *strcoll*() call returns 0, we do a strcmp() tie-breaker to do binary comparison, because strcoll() can return 0 for non-identical strings:

varstr_cmp()
{
    ...
    /*
     * In some locales strcoll() can claim that nonidentical strings are
     * equal. Believing that would be bad news for a number of reasons,
     * so we follow Perl's lead and sort "equal" strings according to
     * strcmp().
     */
    if (result == 0)
        result = strcmp(a1p, a2p);
    ...
}

But is this supposed to apply for ICU collations as well? If the collation provider is icu, the comparison is done using ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally returns some characters as being identical, so doing strcmp() may not make sense.

For example, if the below two characters are compared using ucol_strcollUTF8(), it returns 0, meaning the strings are identical:
Greek Oxia : UTF-16 encoding : 0x1FFD
(http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
Greek Tonos : UTF-16 encoding : 0x0384
(http://www.fileformat.info/info/unicode/char/0384/index.htm)

The characters are displayed like this:
postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
 ?column? | ?column?
----------+----------
 ´        | ΄
(Although this example has similar looking characters, this might not be a factor behind treating them equal)

Now since ucol_strcoll*() returns 0, these strings are always compared using strcmp(), so 1FFD > 0384 returns true:

create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
 ?column?
----------
 t

Whereas, if strcmp() is skipped for ICU collations:
if (result == 0 && !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
    result = strcmp(a1p, a2p);

... then the comparison using the ICU collation says they are identical strings:

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
 ?column?
----------
 f
(1 row)

postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
 ?column?
----------
 f
(1 row)

postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
 ?column?
----------
 t

Now I have verified that with strcoll(), 1FFD > 0384 comes out true. So it looks like the ICU API function ucol_strcoll() returns 0 for this pair by intention. That's the reason I feel like the strcmp-if-strcoll-returns-0 thing might not be applicable for ICU. But I may be wrong, please correct me if I am missing something.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
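To make the question above concrete, here is a minimal sketch of the two-level comparison in plain C. It is not the PostgreSQL source: strcasecmp() (POSIX) is only an assumed stand-in for a collation routine that reports some non-identical strings as equal, and cmp_with_tiebreak() is a hypothetical name.

```c
#include <assert.h>
#include <string.h>
#include <strings.h>   /* strcasecmp(), POSIX */

/*
 * Two-level comparison: ask the (stand-in) collation first, and fall
 * back to a byte-wise strcmp() only when the collation reports a tie.
 */
static int cmp_with_tiebreak(const char *a, const char *b)
{
    int result;

    assert(a != NULL && b != NULL);

    result = strcasecmp(a, b);      /* stand-in for strcoll() */
    if (result == 0)
        result = strcmp(a, b);      /* binary tie-breaker */
    return result;
}
```

With this stand-in, "abc" and "ABC" collate as equal, yet cmp_with_tiebreak() still orders them by their bytes. Only byte-identical strings compare as equal, which is the behavior being questioned for ICU in this message.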