Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-07-10 Thread Peter Geoghegan
On Fri, Jun 9, 2017 at 11:09 AM, Robert Haas  wrote:
>> Isn't that what strxfrm() is?
>
> Yeah, just with bugs.  If ICU has a non-buggy equivalent, then we can
> make this work.

I agree that it probably isn't worth using strxfrm() again, simply
because the glibc implementation is buggy, and glibc as a project is
not at all concerned about how badly that would affect PostgreSQL.

I would like to point out on this thread that the strcmp() tie-breaker
is also a big blocker to implementing normalized keys in B-Tree
indexes (at least, if you want to get them for collated text, which I
think you really need to make the implementation effort worth it).
This is something that is discussed in a section on the normalized
keys wiki page I created recently [1].

[1] https://wiki.postgresql.org/wiki/Key_normalization#ICU.2C_text_equality_semantics.2C_and_hashing
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Robert Haas
On Fri, Jun 9, 2017 at 1:45 PM, Peter Eisentraut wrote:
> On 6/9/17 12:17, Robert Haas wrote:
>> IOW, suppose there
>> were a collation API call distill() which had the property that
>> strcmp(distill(X), distill(Y)) == 0 iff X and Y are considered equal
>> under that collation.  Then, you could define your hash function as
>> hash_any(distill(X)).  Alternatively, if the collation library
>> provided its own hashing function, that would be fine too, and
>> probably faster.
>
> Isn't that what strxfrm() is?

Yeah, just with bugs.  If ICU has a non-buggy equivalent, then we can
make this work.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Peter Geoghegan
On Fri, Jun 9, 2017 at 10:45 AM, Robert Haas  wrote:
>> But they are getting the sort order they need. They just don't get the
>> equality semantics they expect.
>
> You're right.

If we happened to ever guarantee the user a stable sort, then I'd be
wrong. We don't, though.


-- 
Peter Geoghegan




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Peter Eisentraut
On 6/9/17 12:17, Robert Haas wrote:
> IOW, suppose there
> were a collation API call distill() which had the property that
> strcmp(distill(X), distill(Y)) == 0 iff X and Y are considered equal
> under that collation.  Then, you could define your hash function as
> hash_any(distill(X)).  Alternatively, if the collation library
> provided its own hashing function, that would be fine too, and
> probably faster.

Isn't that what strxfrm() is?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Robert Haas
On Fri, Jun 9, 2017 at 12:18 PM, Peter Geoghegan  wrote:
> On Fri, Jun 9, 2017 at 9:17 AM, Robert Haas  wrote:
>> I'm not exactly sure what is possible or
>> desirable, but I would not be too surprised to hear complaints about
>> the observed behavior different from the "pure" ICU behavior because
>> of the tiebreak, and at least some users might even find it worth
>> giving up hashing in order to get the exact sort order they need.
>
> But they are getting the sort order they need. They just don't get the
> equality semantics they expect.

You're right.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Peter Geoghegan
On Fri, Jun 9, 2017 at 9:17 AM, Robert Haas  wrote:
> I'm not exactly sure what is possible or
> desirable, but I would not be too surprised to hear complaints about
> the observed behavior different from the "pure" ICU behavior because
> of the tiebreak, and at least some users might even find it worth
> giving up hashing in order to get the exact sort order they need.

But they are getting the sort order they need. They just don't get the
equality semantics they expect.


-- 
Peter Geoghegan




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Robert Haas
On Fri, Jun 9, 2017 at 11:46 AM, Tom Lane  wrote:
> Peter Eisentraut  writes:
>> On 6/9/17 11:12, Tom Lane wrote:
>>> https://www.postgresql.org/message-id/27064.1134753...@sss.pgh.pa.us
>
>> Good to know.  That just says that if we were to go with the strcoll()
>> result only, things would work correctly.
>
> There's still the hashing problem.

Tom, that mailing list discussion is very illuminating.  Thanks for
digging it up.

Regarding the question of hashing, one way to support that would be if
we had some sort of canonicalization function.  IOW, suppose there
were a collation API call distill() which had the property that
strcmp(distill(X), distill(Y)) == 0 iff X and Y are considered equal
under that collation.  Then, you could define your hash function as
hash_any(distill(X)).  Alternatively, if the collation library
provided its own hashing function, that would be fine too, and
probably faster.
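
To make that concrete, here is a rough sketch rather than a proposal.
It assumes strxfrm() as the stand-in for the hypothetical distill() and
a 32-bit FNV-1a loop as the stand-in for hash_any(); the function names
distill and distilled_hash are made up for illustration, and the result
is only as trustworthy as the platform's strxfrm().

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical distill(): returns a malloc'd strxfrm() image of s under
 * the current LC_COLLATE setting, or NULL if out of memory.  The caller
 * frees the result.
 */
char *
distill(const char *s)
{
    size_t  need;
    char   *key;

    assert(s != NULL);
    /* With n == 0 the destination may be NULL; the return is the length. */
    need = strxfrm(NULL, s, 0);
    key = malloc(need + 1);
    if (key != NULL)
        strxfrm(key, s, need + 1);
    return key;
}

/*
 * Hash the distilled key instead of the raw bytes.  FNV-1a is only an
 * assumed placeholder for hash_any().
 */
uint32_t
distilled_hash(const char *s)
{
    char       *key = distill(s);
    uint32_t    h = 2166136261u;

    if (key == NULL)
        abort();
    for (const char *p = key; *p != '\0'; p++)
    {
        h ^= (unsigned char) *p;
        h *= 16777619u;
    }
    free(key);
    return h;
}
```

Under this scheme, two strings that compare equal under strcoll() should
get the same distilled_hash() value, provided strxfrm() honors its
contract.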

On the other hand, is there any rule that says we have to support
hashing?  Certainly, if we defined a new datatype collated_text, it
could have a btree opfamily and no hash opfamily.  It's trickier with
only one datatype, but possibly we could come up with a way for an
opfamily to be consulted about whether it is available for a given
choice of collation.  I'm not exactly sure what is possible or
desirable, but I would not be too surprised to hear complaints about
the observed behavior different from the "pure" ICU behavior because
of the tiebreak, and at least some users might even find it worth
giving up hashing in order to get the exact sort order they need.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Tom Lane
Peter Eisentraut  writes:
> On 6/9/17 11:12, Tom Lane wrote:
>> https://www.postgresql.org/message-id/27064.1134753...@sss.pgh.pa.us

> Good to know.  That just says that if we were to go with the strcoll()
> result only, things would work correctly.

There's still the hashing problem.

regards, tom lane




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Peter Eisentraut
On 6/9/17 11:12, Tom Lane wrote:
> Robert Haas  writes:
>> I have to admit that I'm still a little confused about what's actually
>> going on here.  Commit says that it "fixes inconsistent behavior under
>> glibc's hu_HU locale", but it doesn't say what sort of inconsistent
>> behavior it fixes.
> 
> Unfortunately we were not good back then about linking commits to
> list discussions, but a bit of excavation in the archives found this:
> 
> https://www.postgresql.org/message-id/27064.1134753...@sss.pgh.pa.us

Good to know.  That just says that if we were to go with the strcoll()
result only, things would work correctly.

Again, some other details to work out.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Tom Lane
Robert Haas  writes:
> I have to admit that I'm still a little confused about what's actually
> going on here.  Commit says that it "fixes inconsistent behavior under
> glibc's hu_HU locale", but it doesn't say what sort of inconsistent
> behavior it fixes.

Unfortunately we were not good back then about linking commits to
list discussions, but a bit of excavation in the archives found this:

https://www.postgresql.org/message-id/27064.1134753...@sss.pgh.pa.us

regards, tom lane




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Peter Eisentraut
On 6/9/17 10:31, Robert Haas wrote:
> + * In some locales strcoll() can claim that nonidentical strings are
> + * equal.  Believing that would be bad news for a number of reasons,
> + * so we follow Perl's lead and sort "equal" strings according to
> + * strcmp().
> 
> Again, however, the reasons why believing it would be bad news are not
> enumerated.  It is merely asserted that there is more than one such
> reason.

I suspect that there were just issues that haven't been thought through
yet, including hashing.

More generally, the code's receptiveness to internationalization issues
is ever expanding.  Early code probably also thought that using
multibyte characters or non-C locales was bad news.  Over time, we have
worked those issues out.  This might just be one more.

> So, what's special about text that it can never report two
> non-byte-for-byte values as equal?  And could we consider changing
> that, so that users can select an ICU collator and get exactly the
> behavior ICU delivers, without the extra tiebreak?

I don't think there is anything special.  We just need to work through
the details.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Robert Haas
On Fri, Jun 2, 2017 at 2:22 PM, Peter Geoghegan  wrote:
> On Fri, Jun 2, 2017 at 10:34 AM, Amit Khandekar  
> wrote:
>> Ok. I was thinking we are doing the tie-breaker because specifically
>> strcoll_l() was unexpectedly returning 0 for some cases. Now I get it,
>> that we do that to be compatible with texteq().
>
> Both of these explanations are correct, in a way. See commit 656beff.

I have to admit that I'm still a little confused about what's actually
going on here.  That commit says that it "fixes inconsistent behavior under
glibc's hu_HU locale", but it doesn't say what sort of inconsistent
behavior it fixes.  It added a comment - which remains to this day -
saying this:

+ * In some locales strcoll() can claim that nonidentical strings are
+ * equal.  Believing that would be bad news for a number of reasons,
+ * so we follow Perl's lead and sort "equal" strings according to
+ * strcmp().

Again, however, the reasons why believing it would be bad news are not
enumerated.  It is merely asserted that there is more than one such
reason.

Now, it is obviously not true in general that a comparison operator
can never deem two values which are not byte-for-byte identical as
equal, because citext does exactly that (indeed, that's the point).  I
thought maybe citext could get away with it because it lacked indexing
support but, nope, it has indexing support.  Also, the in-core numeric
data type has the same property ('1.0'::numeric = '1'::numeric, but
scale() reveals that they are not byte-for-byte identical).

So, what's special about text that it can never report two
non-byte-for-byte values as equal?  And could we consider changing
that, so that users can select an ICU collator and get exactly the
behavior ICU delivers, without the extra tiebreak?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-09 Thread Amit Khandekar
On 2 June 2017 at 23:52, Peter Geoghegan  wrote:
> On Fri, Jun 2, 2017 at 10:34 AM, Amit Khandekar  
> wrote:
>> Ok. I was thinking we are doing the tie-breaker because specifically
>> strcoll_l() was unexpectedly returning 0 for some cases. Now I get it,
>> that we do that to be compatible with texteq().
>
> Both of these explanations are correct, in a way. See commit 656beff.
>
>> Secondly, I was also considering if ICU especially has a way to
>> customize an ICU locale by setting some attributes which dictate
>> comparison or sorting rules for a set of characters. I mean, if there
>> is such customized ICU locale defined in the system, and we use that
>> to create PG collation, I thought we might have to strictly follow
>> those rules without a tie-breaker, so as to be 100% conformant to ICU.
>> I can't come up with an example, or maybe there isn't one, but, say,
>> there is a locale which is supposed to sort only by lowest comparison
>> strength (de@strength=1 ?? ). In that case, there might be many
>> characters considered equal, but PG < operator or > operator would
>> still return true for those chars.
>
> In the terminology of the Unicode collation algorithm, PostgreSQL
> "forces deterministic comparisons" [1]. There is a lot of information
> on the details of that within the UCA spec.
>
> If we ever wanted to offer a case insensitive collation feature, then
> we wouldn't necessarily have to do the equivalent of a full strxfrm()
> when hashing, at least with collations controlled by ICU. Perhaps we
> could instead use a collator whose UCOL_STRENGTH is only UCOL_PRIMARY
> to build binary sort keys, and leave the rest to a ucol_equal() call
> (within texteq()) that has the usual UCOL_STRENGTH for the underlying
> PostgreSQL collation.
>
> I don't think it would be possible to implement case insensitive
> collations by using some pre-existing ICU collation that is case
> insensitive. Instead, an implementation might directly vary collation
> strength of any given collation to achieve case insensitivity.
> PostgreSQL would know that this collation was case insensitive, so
> regular collations wouldn't need to change their
> behavior/implementation (to use ucol_equal() within texteq(), and so
> on).

Ah ok. Understood, thanks.


Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-02 Thread Peter Geoghegan
On Fri, Jun 2, 2017 at 10:34 AM, Amit Khandekar  wrote:
> Ok. I was thinking we are doing the tie-breaker because specifically
> strcoll_l() was unexpectedly returning 0 for some cases. Now I get it,
> that we do that to be compatible with texteq().

Both of these explanations are correct, in a way. See commit 656beff.

> Secondly, I was also considering if ICU especially has a way to
> customize an ICU locale by setting some attributes which dictate
> comparison or sorting rules for a set of characters. I mean, if there
> is such customized ICU locale defined in the system, and we use that
> to create PG collation, I thought we might have to strictly follow
> those rules without a tie-breaker, so as to be 100% conformant to ICU.
> I can't come up with an example, or maybe there isn't one, but, say,
> there is a locale which is supposed to sort only by lowest comparison
> strength (de@strength=1 ?? ). In that case, there might be many
> characters considered equal, but PG < operator or > operator would
> still return true for those chars.

In the terminology of the Unicode collation algorithm, PostgreSQL
"forces deterministic comparisons" [1]. There is a lot of information
on the details of that within the UCA spec.

If we ever wanted to offer a case insensitive collation feature, then
we wouldn't necessarily have to do the equivalent of a full strxfrm()
when hashing, at least with collations controlled by ICU. Perhaps we
could instead use a collator whose UCOL_STRENGTH is only UCOL_PRIMARY
to build binary sort keys, and leave the rest to a ucol_equal() call
(within texteq()) that has the usual UCOL_STRENGTH for the underlying
PostgreSQL collation.

I don't think it would be possible to implement case insensitive
collations by using some pre-existing ICU collation that is case
insensitive. Instead, an implementation might directly vary collation
strength of any given collation to achieve case insensitivity.
PostgreSQL would know that this collation was case insensitive, so
regular collations wouldn't need to change their
behavior/implementation (to use ucol_equal() within texteq(), and so
on).

[1] http://unicode.org/reports/tr10/#Forcing_Deterministic_Comparisons
-- 
Peter Geoghegan




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-02 Thread Amit Khandekar
On 2 June 2017 at 03:18, Thomas Munro  wrote:
> On Fri, Jun 2, 2017 at 9:27 AM, Peter Geoghegan  wrote:
>> On Thu, Jun 1, 2017 at 2:24 PM, Thomas Munro
>>  wrote:
>>> Why should ICU be any different than the system provider in this
>>> respect?  In both cases, we have a two-level comparison: first we use
>>> the collation-aware comparison, and then as a tie breaker, we use a
>>> binary comparison.  If we didn't do a binary comparison as a
>>> tie-breaker, wouldn't the result be logically incompatible with the =
>>> operator, which does a binary comparison?

Ok. I was thinking we are doing the tie-breaker because specifically
strcoll_l() was unexpectedly returning 0 for some cases. Now I get it,
that we do that to be compatible with texteq().

Secondly, I was also considering if ICU especially has a way to
customize an ICU locale by setting some attributes which dictate
comparison or sorting rules for a set of characters. I mean, if there
is such customized ICU locale defined in the system, and we use that
to create PG collation, I thought we might have to strictly follow
those rules without a tie-breaker, so as to be 100% conformant to ICU.
I can't come up with an example, or maybe there isn't one, but, say,
there is a locale which is supposed to sort only by lowest comparison
strength (de@strength=1 ?? ). In that case, there might be many
characters considered equal, but PG < operator or > operator would
still return true for those chars.


-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-01 Thread Thomas Munro
On Fri, Jun 2, 2017 at 9:27 AM, Peter Geoghegan  wrote:
> On Thu, Jun 1, 2017 at 2:24 PM, Thomas Munro
>  wrote:
>> Why should ICU be any different than the system provider in this
>> respect?  In both cases, we have a two-level comparison: first we use
>> the collation-aware comparison, and then as a tie breaker, we use a
>> binary comparison.  If we didn't do a binary comparison as a
>> tie-breaker, wouldn't the result be logically incompatible with the =
>> operator, which does a binary comparison?
>
> I agree with that assessment.

I think you *could* make a logically consistent set of operations with
no binary tie-breaker.  = could be defined in terms of strcoll and
hash could hash the output of strxfrm, but it'd be impractical and
slow.  In order to take advantage of simple and fast = and hash, we go
the other way and teach < and > about binary order.

-- 
Thomas Munro
http://www.enterprisedb.com




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-01 Thread Tom Lane
Peter Geoghegan  writes:
> On Thu, Jun 1, 2017 at 2:24 PM, Thomas Munro
>  wrote:
>> Why should ICU be any different than the system provider in this
>> respect?  In both cases, we have a two-level comparison: first we use
>> the collation-aware comparison, and then as a tie breaker, we use a
>> binary comparison.  If we didn't do a binary comparison as a
>> tie-breaker, wouldn't the result be logically incompatible with the =
>> operator, which does a binary comparison?

> I agree with that assessment.

The critical reason why this is not optional is that if texteq were to
return true for strings that aren't bitwise identical, that breaks hashing
--- unless you can guarantee that the hash values for such strings will be
equal anyway.  That's hardly possible when we don't even know what the
collation's comparison rule is, and would likely be difficult even if
we had complete knowledge.

So no, we're not going there for ICU any more than we did for libc.

regards, tom lane




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-01 Thread Peter Geoghegan
On Thu, Jun 1, 2017 at 2:24 PM, Thomas Munro
 wrote:
> Why should ICU be any different than the system provider in this
> respect?  In both cases, we have a two-level comparison: first we use
> the collation-aware comparison, and then as a tie breaker, we use a
> binary comparison.  If we didn't do a binary comparison as a
> tie-breaker, wouldn't the result be logically incompatible with the =
> operator, which does a binary comparison?

I agree with that assessment.


-- 
Peter Geoghegan




Re: [HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-01 Thread Thomas Munro
On Fri, Jun 2, 2017 at 6:58 AM, Amit Khandekar  wrote:
> While comparing two text strings using varstr_cmp(), if *strcoll*()
> call returns 0, we do strcmp() tie-breaker to do binary comparison,
> because strcoll() can return 0 for non-identical strings :
>
> varstr_cmp()
> {
>     ...
>     /*
>      * In some locales strcoll() can claim that nonidentical strings are
>      * equal.  Believing that would be bad news for a number of reasons,
>      * so we follow Perl's lead and sort "equal" strings according to
>      * strcmp().
>      */
>     if (result == 0)
>         result = strcmp(a1p, a2p);
>     ...
> }
>
> But is this supposed to apply for ICU collations as well ? If
> collation provider is icu, the comparison is done using
> ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally returns
> some characters as being identical, so doing strcmp() may not make
> sense.
>
> For e.g. , if the below two characters are compared using
> ucol_strcollUTF8(), it returns 0, meaning the strings are identical :
> Greek Oxia : UTF-16 encoding : 0x1FFD
> (http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
> Greek Tonos : UTF-16 encoding : 0x0384
> (http://www.fileformat.info/info/unicode/char/0384/index.htm)
>
> The characters are displayed like this :
> postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
>  ?column? | ?column?
> ----------+----------
>  ´        | ΄
> (Although this example has similar looking characters, this might not
> be a factor behind treating them equal)
>
> Now since ucol_strcoll*() returns 0, these strings are always compared
> using strcmp(), so 1FFD > 0384 returns true :
>
> create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');
>
> postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  t
>
> Whereas, if strcmp() is skipped for ICU collations :
> if (result == 0 && !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
>     result = strcmp(a1p, a2p);
>
> ... then the comparison using ICU collation tells they are identical strings :
>
> postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  f
> (1 row)
>
> postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  f
> (1 row)
>
> postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
>  ?column?
> ----------
>  t
>
>
> Now I have verified that strcoll() returns true for 1FFD > 0384. So,
> it looks like ICU API function ucol_strcoll() returns false by
> intention. That's the reason I feel like the
> strcmp-if-strcoll-returns-0 thing might not be applicable for ICU. But
> I may be wrong, please correct me if I may be missing something.

I may not have had enough coffee yet, but...

Why should ICU be any different than the system provider in this
respect?  In both cases, we have a two-level comparison: first we use
the collation-aware comparison, and then as a tie breaker, we use a
binary comparison.  If we didn't do a binary comparison as a
tie-breaker, wouldn't the result be logically incompatible with the =
operator, which does a binary comparison?

Put another way, if we didn't use binary order tie-breaking, we'd have
to teach texteq to understand collations (ie be defined as not (a < b)
and not (a > b)) otherwise we'd permit contradictions like a != b and
not (a < b) and not (a > b).
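
A minimal sketch of that two-level comparison, assuming a toy ASCII
case-insensitive comparator (toy_coll_cmp, a hypothetical stand-in for
strcoll() or ucol_strcoll()); the names are made up and this is not the
varstr_cmp() code itself.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical collation-aware comparison: ASCII case-insensitive. */
int
toy_coll_cmp(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    for (;; a++, b++)
    {
        int     ca = tolower((unsigned char) *a);
        int     cb = tolower((unsigned char) *b);

        if (ca != cb)
            return (ca < cb) ? -1 : 1;
        if (ca == '\0')
            return 0;           /* both strings ended together */
    }
}

/* Level one: the collation.  Level two: binary order as a tie-breaker. */
int
tiebreak_cmp(const char *a, const char *b)
{
    int     result = toy_coll_cmp(a, b);

    if (result == 0)
        result = strcmp(a, b);
    return result;
}

/* Binary equality, in the spirit of what = does for text. */
int
binary_eq(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

With the tie-breaker, tiebreak_cmp(a, b) is 0 exactly when
binary_eq(a, b) is true, so <, > and = can never disagree.  Dropping the
strcmp() step would leave "abc" and "ABC" neither less than, greater
than, nor equal to each other.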

-- 
Thomas Munro
http://www.enterprisedb.com




[HACKERS] strcmp() tie-breaker for identical ICU-collated strings

2017-06-01 Thread Amit Khandekar
While comparing two text strings in varstr_cmp(), if the *strcoll*()
call returns 0, we apply a strcmp() tie-breaker to do a binary comparison,
because strcoll() can return 0 for non-identical strings:

varstr_cmp()
{
    ...
    /*
     * In some locales strcoll() can claim that nonidentical strings are
     * equal.  Believing that would be bad news for a number of reasons,
     * so we follow Perl's lead and sort "equal" strings according to
     * strcmp().
     */
    if (result == 0)
        result = strcmp(a1p, a2p);
    ...
}

But is this supposed to apply to ICU collations as well? If the
collation provider is icu, the comparison is done using
ucol_strcoll*(). I suspect that ucol_strcoll*() intentionally reports
some characters as being equal, so doing strcmp() may not make
sense.

For example, if the two characters below are compared using
ucol_strcollUTF8(), it returns 0, meaning the strings are considered equal:
Greek Oxia : UTF-16 encoding : 0x1FFD
(http://www.fileformat.info/info/unicode/char/1ffd/index.htm)
Greek Tonos : UTF-16 encoding : 0x0384
(http://www.fileformat.info/info/unicode/char/0384/index.htm)

The characters are displayed like this :
postgres=# select (U&'\+001FFD') , (U&'\+000384') collate ucatest;
 ?column? | ?column?
----------+----------
 ´        | ΄
(Although this example has similar looking characters, this might not
be a factor behind treating them equal)

Since ucol_strcoll*() returns 0, these strings always end up compared
using strcmp(), so 1FFD > 0384 returns true:

create collation ucatest (locale = 'en_US.UTF8', provider = 'icu');

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
 ?column?
----------
 t

Whereas, if strcmp() is skipped for ICU collations :
if (result == 0 && !(mylocale && mylocale->provider == COLLPROVIDER_ICU))
   result = strcmp(a1p, a2p);

... then the comparison using the ICU collation reports that the strings are equal:

postgres=# select (U&'\+001FFD') > (U&'\+000384') collate ucatest;
 ?column?
----------
 f
(1 row)

postgres=# select (U&'\+001FFD') < (U&'\+000384') collate ucatest;
 ?column?
----------
 f
(1 row)

postgres=# select (U&'\+001FFD') <= (U&'\+000384') collate ucatest;
 ?column?
----------
 t


Now, I have verified that with strcoll(), 1FFD > 0384 is true. So it
looks like the ICU API function ucol_strcoll() reports them as equal by
intention. That is why I feel that the strcmp-if-strcoll-returns-0
step might not be applicable to ICU. But I may be wrong; please correct
me if I am missing something.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

