Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-10-01 Thread Noah Misch
On Sun, Oct 01, 2017 at 09:56:11AM -0400, Peter Eisentraut wrote:
> On 9/30/17 03:01, Noah Misch wrote:
> > This PostgreSQL 10 open item is past due for your status update.  On the 
> > worst
> > week to be violating open item policies.  Kindly send a status update within
> > 24 hours, and include a date for your subsequent status update.  Refer to 
> > the
> > policy on open item ownership:
> > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> 
> I had previously replied that I think nothing should be changed.  What
> process should be applied in that case?

If you determine community decision-making is not against that conclusion,
edit the list to move the item from "Open Issues" to "Non-bugs".


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-10-01 Thread Peter Eisentraut
On 9/30/17 03:01, Noah Misch wrote:
> On Mon, Sep 25, 2017 at 08:26:21AM +, Noah Misch wrote:
>> On Tue, Sep 19, 2017 at 07:01:47PM -0700, Peter Geoghegan wrote:
>>> On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
>>>  wrote:
 On 9/18/17 18:46, Peter Geoghegan wrote:
> As I pointed out a couple of times already [1], we don't currently
> sanitize ICU's BCP 47 language tags within CREATE COLLATION.

 There is no requirement that the locale strings for ICU need to be BCP
 47.  ICU locale names like 'de@collation=phonebook' are also acceptable.
>>>
>>> Right. But, we only document that BCP 47 is supported by Postgres.
>>> Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure
>>> that we end up with BCP 47, even when the user happens to specify the
>>> legacy syntax. Should we be "canonicalizing" legacy ICU locale strings
>>> as BCP 47, too?
>>>
 The reason they are not validated is that, as you know, ICU accepts any
 locale string as valid.  You appear to have found a way to do some
 validation, but I would like to see that code.
>>>
>>> As I mentioned, I'm simply calling
>>> get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization.
>>> The code to get the extra sanitization is completely trivial.
>>>
>>> I didn't post the patch that generates the errors in my opening e-mail
>>> because I'm not confident it's correct just yet. And, I think that I
>>> see a bigger problem: we pass a string that is almost certainly a BCP
>>> 47 string to ucol_open() from within pg_newlocale_from_collation(). We
>>> do so despite the fact that ucol_open() apparently doesn't accept BCP
>>> 47 syntax locale strings until ICU 54 [1]. Seems entirely possible
>>> that this accounts for the problems you saw on ICU 4.2, back when we
>>> were still creating keyword variants (I guess that the keyword
>>> variants seem very "BCP 47-ish" to me).
>>>
>>> I really think we need to add some kind of debug mode that makes ICU
>>> optionally spit out a locale display name at key points. We need this
>>> to gain confidence that the behavior that ICU provides actually
>>> matches what is expected across ICU different versions for different
>>> locale formats. We talked about this as a user-facing feature before,
>>> which can wait till v11; I just want this to debug problems like this
>>> one.
>>>
>>> [1] 
>>> https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590
>>
>> [Action required within three days.  This is a generic notification.]
>>
>> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
>> since you committed the patch believed to have created it, you own this open
>> item.  If some other commit is more relevant or if this does not belong as a
>> v10 open item, please let us know.  Otherwise, please observe the policy on
>> open item ownership[1] and send a status update within three calendar days of
>> this message.  Include a date for your subsequent status update.  Testers may
>> discover new open items at any time, and I want to plan to get them all fixed
>> well in advance of shipping v10.  Consequently, I will appreciate your 
>> efforts
>> toward speedy resolution.  Thanks.
>>
>> [1] 
>> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com
> 
> This PostgreSQL 10 open item is past due for your status update.  On the worst
> week to be violating open item policies.  Kindly send a status update within
> 24 hours, and include a date for your subsequent status update.  Refer to the
> policy on open item ownership:
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

I had previously replied that I think nothing should be changed.  What
process should be applied in that case?


-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-10-01 Thread Peter Eisentraut
On 9/30/17 15:28, Tom Lane wrote:
> This suggests to me that arguing about canonicalization is moot so
> far as avoiding reindexing goes: if you change ICU library versions,
> you're screwed and will have to jump through all the reindexing hoops,
> no matter what we do or don't do.

One reason for that is that the collation version also encodes things
like if the internal method for computing sort keys changes.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-10-01 Thread Peter Eisentraut
On 9/30/17 15:28, Tom Lane wrote:
> Now, it may still be worthwhile to argue about whether canonicalization
> will help the other component of the problem, which is whether you can
> dump and reload CREATE COLLATION commands into a new ICU version and
> expect to get more-or-less-the-same behavior as before.

Arguably, you don't always want that.  Maybe you selected a locale name
like 'en-NEWCOUNTRY', and in old ICU versions you want that to fall back
to default 'en' behavior and in newer you get the updated custom behavior.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-09-30 Thread Peter Geoghegan
On Sat, Sep 30, 2017 at 12:28 PM, Tom Lane  wrote:
> This suggests to me that arguing about canonicalization is moot so
> far as avoiding reindexing goes: if you change ICU library versions,
> you're screwed and will have to jump through all the reindexing hoops,
> no matter what we do or don't do.  (Maybe we are looking at the wrong
> information to populate collversion?)

Just to be clear, I was never arguing that canonicalization would
avoid reindexing, and I didn't think anyone else was. I believe that
the version string will change when the behavior of its collator
changes for any reason and in any way. This includes changes to how
binary sort keys are generated.

(We don't currently store binary keys on disk, so a change to that
representation doesn't really necessitate a REINDEX for us. The UCA
spec explicitly decouples compatibility for indexing with binary keys
from changes needed due to human requirements. ICU binary sort keys
are compressed, and this is sometimes improved, independently of the
evolution of how natural language experts say text should be sorted.)

> Now, it may still be worthwhile to argue about whether canonicalization
> will help the other component of the problem, which is whether you can
> dump and reload CREATE COLLATION commands into a new ICU version and
> expect to get more-or-less-the-same behavior as before.

Again, to be clear, that's what I was arguing all along. I think that
users will get very good results with this approach. When the sorting
behavior of a locale is refined by natural language experts, it's
almost certainly a very small change that most users that are affected
won't actually notice. For example, when en-us-x-icu is changed,
that's actually due to a general change in the inherited root collator
that doesn't really affect English speakers. There is no practical
question about how you're supposed to sort English text.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-09-30 Thread Tom Lane
Noah Misch  writes:
> On Sat, Sep 30, 2017 at 11:25:43AM -0400, Tom Lane wrote:
>> Sure, but dealing with that is mechanical: reindex the necessary indexes
>> and you're done.

> In the general case, one must revalidate CHECK constraints, re-partition
> tables, revalidate range values, and reindex.

True, but that is what it is: nothing we can do is going to affect the
consequences of a collation behavior change, if there is one.  What's more
useful for our immediate purposes is to ask whether we can reliably detect
a collation behavior change.  False negatives are bad, but so are false
positives, because those would force DBAs to jump through lots of hoops
unnecessarily.

So: are canonicalized locale descriptions any better or worse by that
metric than non-canonicalized descriptions?  In principle I think a
canonicalized description might be more likely to be recognized as
the "same" locale by another ICU version than one that isn't, but
I don't know whether there's any meaningful difference in practice.

Another point here is whether, even if a new ICU version recognizes
a locale description as being "the same" interpretation that an old
ICU version used, will it report the same collation version?  Limited
experimentation suggests that the collversions we're actually getting
out of ICU depend on little other than the libicu version.  "select
distinct collversion from pg_collation where collversion is not null"
produces this on ICU 4.2.1:

 49.192.5.41
 49.192.0.41

and this on 52.1:

 58.0.6.50
 58.0.0.50

and this on 57.1:

 153.64.29
 153.64

This suggests to me that arguing about canonicalization is moot so
far as avoiding reindexing goes: if you change ICU library versions,
you're screwed and will have to jump through all the reindexing hoops,
no matter what we do or don't do.  (Maybe we are looking at the wrong
information to populate collversion?)

Now, it may still be worthwhile to argue about whether canonicalization
will help the other component of the problem, which is whether you can
dump and reload CREATE COLLATION commands into a new ICU version and
expect to get more-or-less-the-same behavior as before.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-09-30 Thread Noah Misch
On Sat, Sep 30, 2017 at 11:25:43AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Mon, Sep 25, 2017 at 09:36:44AM -0700, Peter Geoghegan wrote:
> >>> I think it's inevitable that a certain number of users are going to
> >>> have to cope with ICU version changes breaking stuff.
> 
> >> Wasn't the main point of adopting ICU that that doesn't happen when it
> >> isn't essential?
> 
> > I wouldn't describe it that way.  I agree that few, if any, ICU upgrades 
> > will
> > remove country or language codes.  Overall, though, almost every ICU upgrade
> > will be difficult.  Each ICU release, even a minor release like 58.2, 
> > changes
> > the sorting rules in some tiny way.  You then see "Rebuild all objects
> > affected by this collation" messages.
> 
> Sure, but dealing with that is mechanical: reindex the necessary indexes
> and you're done.

In the general case, one must revalidate CHECK constraints, re-partition
tables, revalidate range values, and reindex.

> In the libc world,
> when you upgrade libc's locale definitions, you have no idea what the
> consequences are.

Yep.  It's strictly better than the libc case.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-09-30 Thread Peter Geoghegan
On Sat, Sep 30, 2017 at 8:25 AM, Tom Lane  wrote:
> I'd also argue that the point of adopting ICU was exactly so we *could*
> distinguish those cases, and limit the scope of a normal upgrade to
> "reindex these identifiable indexes and you're done".  In the libc world,
> when you upgrade libc's locale definitions, you have no idea what the
> consequences are.

Right. With libc, we think of collations as something that there is a
small, fixed number of on a system, that we cannot safely assume
anything about. But with ICU, all of the semantics of how natural
languages should be sorted are exposed via various APIs, and there are
literally more possible sets of collation behaviors than there are
grains of sand in the Sahara (there are hundreds of distinct scripts,
which we can change the overall ordering of arbitrarily, on top of all
the other customizations). Clearly the libc way of looking at things
doesn't really carry over.

BCP 47 is supposed to be universal -- it's an IETF standard. That's
where all the stability guarantees are. The officially recognized 'u'
extension that ICU uses is a CLDR/Unicode thing, not an ICU thing. The
same format could, in the future, be used by other collation
providers, since there actually are other CLDR consumers/UCA
implementations. And, ICU have said that they have deprecated the old
locale format, and have standardized on BCP 47. As of ICU 54, it is
recommended that ucol_open() be passed a string in BCP 47 format.

I'm surprised that this issue was not resolved earlier in the week. I
presumed that all of this was obvious to Peter E., but I seem to have
been wrong about that.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-09-30 Thread Tom Lane
Noah Misch  writes:
> On Mon, Sep 25, 2017 at 09:36:44AM -0700, Peter Geoghegan wrote:
>>> I think it's inevitable that a certain number of users are going to
>>> have to cope with ICU version changes breaking stuff.

>> Wasn't the main point of adopting ICU that that doesn't happen when it
>> isn't essential?

> I wouldn't describe it that way.  I agree that few, if any, ICU upgrades will
> remove country or language codes.  Overall, though, almost every ICU upgrade
> will be difficult.  Each ICU release, even a minor release like 58.2, changes
> the sorting rules in some tiny way.  You then see "Rebuild all objects
> affected by this collation" messages.

Sure, but dealing with that is mechanical: reindex the necessary indexes
and you're done.  I think it's important to distinguish that from a case
where users have to change their collation definitions.  That is a
qualitatively different, and much harder, upgrade problem.

I'd also argue that the point of adopting ICU was exactly so we *could*
distinguish those cases, and limit the scope of a normal upgrade to
"reindex these identifiable indexes and you're done".  In the libc world,
when you upgrade libc's locale definitions, you have no idea what the
consequences are.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-09-30 Thread Noah Misch
On Mon, Sep 25, 2017 at 09:36:44AM -0700, Peter Geoghegan wrote:
> On Mon, Sep 25, 2017 at 9:06 AM, Robert Haas  wrote:
> >> The big concern I have here is that this feels a lot like something that
> >> we'll regret at leisure, if it's not right in the first release.  I'd
> >> much rather be restrictive in v10 and then loosen the rules later, than
> >> be lax in v10 and then have to argue about whether to break backwards
> >> compatibility in order to gain saner behavior.
> >
> > I think it's inevitable that a certain number of users are going to
> > have to cope with ICU version changes breaking stuff.
> 
> Wasn't the main point of adopting ICU that that doesn't happen when it
> isn't essential? There will be some risk with pg_dump across ICU
> versions, due rare to political changes. But, on pg_upgrade, the old
> collations will continue to work, even if they would not have appeared
> at initdb time on that ICU version, because ICU has all kinds of
> fallbacks.

I wouldn't describe it that way.  I agree that few, if any, ICU upgrades will
remove country or language codes.  Overall, though, almost every ICU upgrade
will be difficult.  Each ICU release, even a minor release like 58.2, changes
the sorting rules in some tiny way.  You then see "Rebuild all objects
affected by this collation" messages.


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-09-30 Thread Noah Misch
On Mon, Sep 25, 2017 at 08:26:21AM +, Noah Misch wrote:
> On Tue, Sep 19, 2017 at 07:01:47PM -0700, Peter Geoghegan wrote:
> > On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
> >  wrote:
> > > On 9/18/17 18:46, Peter Geoghegan wrote:
> > >> As I pointed out a couple of times already [1], we don't currently
> > >> sanitize ICU's BCP 47 language tags within CREATE COLLATION.
> > >
> > > There is no requirement that the locale strings for ICU need to be BCP
> > > 47.  ICU locale names like 'de@collation=phonebook' are also acceptable.
> > 
> > Right. But, we only document that BCP 47 is supported by Postgres.
> > Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure
> > that we end up with BCP 47, even when the user happens to specify the
> > legacy syntax. Should we be "canonicalizing" legacy ICU locale strings
> > as BCP 47, too?
> > 
> > > The reason they are not validated is that, as you know, ICU accepts any
> > > locale string as valid.  You appear to have found a way to do some
> > > validation, but I would like to see that code.
> > 
> > As I mentioned, I'm simply calling
> > get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization.
> > The code to get the extra sanitization is completely trivial.
> > 
> > I didn't post the patch that generates the errors in my opening e-mail
> > because I'm not confident it's correct just yet. And, I think that I
> > see a bigger problem: we pass a string that is almost certainly a BCP
> > 47 string to ucol_open() from within pg_newlocale_from_collation(). We
> > do so despite the fact that ucol_open() apparently doesn't accept BCP
> > 47 syntax locale strings until ICU 54 [1]. Seems entirely possible
> > that this accounts for the problems you saw on ICU 4.2, back when we
> > were still creating keyword variants (I guess that the keyword
> > variants seem very "BCP 47-ish" to me).
> > 
> > I really think we need to add some kind of debug mode that makes ICU
> > optionally spit out a locale display name at key points. We need this
> > to gain confidence that the behavior that ICU provides actually
> > matches what is expected across ICU different versions for different
> > locale formats. We talked about this as a user-facing feature before,
> > which can wait till v11; I just want this to debug problems like this
> > one.
> > 
> > [1] 
> > https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590
> 
> [Action required within three days.  This is a generic notification.]
> 
> The above-described topic is currently a PostgreSQL 10 open item.  Peter,
> since you committed the patch believed to have created it, you own this open
> item.  If some other commit is more relevant or if this does not belong as a
> v10 open item, please let us know.  Otherwise, please observe the policy on
> open item ownership[1] and send a status update within three calendar days of
> this message.  Include a date for your subsequent status update.  Testers may
> discover new open items at any time, and I want to plan to get them all fixed
> well in advance of shipping v10.  Consequently, I will appreciate your efforts
> toward speedy resolution.  Thanks.
> 
> [1] 
> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com

This PostgreSQL 10 open item is past due for your status update.  On the worst
week to be violating open item policies.  Kindly send a status update within
24 hours, and include a date for your subsequent status update.  Refer to the
policy on open item ownership:
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?

2017-09-25 Thread Noah Misch
On Tue, Sep 19, 2017 at 07:01:47PM -0700, Peter Geoghegan wrote:
> On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut
>  wrote:
> > On 9/18/17 18:46, Peter Geoghegan wrote:
> >> As I pointed out a couple of times already [1], we don't currently
> >> sanitize ICU's BCP 47 language tags within CREATE COLLATION.
> >
> > There is no requirement that the locale strings for ICU need to be BCP
> > 47.  ICU locale names like 'de@collation=phonebook' are also acceptable.
> 
> Right. But, we only document that BCP 47 is supported by Postgres.
> Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure
> that we end up with BCP 47, even when the user happens to specify the
> legacy syntax. Should we be "canonicalizing" legacy ICU locale strings
> as BCP 47, too?
> 
> > The reason they are not validated is that, as you know, ICU accepts any
> > locale string as valid.  You appear to have found a way to do some
> > validation, but I would like to see that code.
> 
> As I mentioned, I'm simply calling
> get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization.
> The code to get the extra sanitization is completely trivial.
> 
> I didn't post the patch that generates the errors in my opening e-mail
> because I'm not confident it's correct just yet. And, I think that I
> see a bigger problem: we pass a string that is almost certainly a BCP
> 47 string to ucol_open() from within pg_newlocale_from_collation(). We
> do so despite the fact that ucol_open() apparently doesn't accept BCP
> 47 syntax locale strings until ICU 54 [1]. Seems entirely possible
> that this accounts for the problems you saw on ICU 4.2, back when we
> were still creating keyword variants (I guess that the keyword
> variants seem very "BCP 47-ish" to me).
> 
> I really think we need to add some kind of debug mode that makes ICU
> optionally spit out a locale display name at key points. We need this
> to gain confidence that the behavior that ICU provides actually
> matches what is expected across ICU different versions for different
> locale formats. We talked about this as a user-facing feature before,
> which can wait till v11; I just want this to debug problems like this
> one.
> 
> [1] 
> https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Peter,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] 
https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers