Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
On Sun, Oct 01, 2017 at 09:56:11AM -0400, Peter Eisentraut wrote: > On 9/30/17 03:01, Noah Misch wrote: > > This PostgreSQL 10 open item is past due for your status update. On the > > worst > > week to be violating open item policies. Kindly send a status update within > > 24 hours, and include a date for your subsequent status update. Refer to > > the > > policy on open item ownership: > > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > I had previously replied that I think nothing should be changed. What > process should be applied in that case? If you determine community decision-making is not against that conclusion, edit the list to move the item from "Open Issues" to "Non-bugs". -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
On 9/30/17 03:01, Noah Misch wrote: > On Mon, Sep 25, 2017 at 08:26:21AM +, Noah Misch wrote: >> On Tue, Sep 19, 2017 at 07:01:47PM -0700, Peter Geoghegan wrote: >>> On Tue, Sep 19, 2017 at 5:52 PM, Peter Eisentraut >>>wrote: On 9/18/17 18:46, Peter Geoghegan wrote: > As I pointed out a couple of times already [1], we don't currently > sanitize ICU's BCP 47 language tags within CREATE COLLATION. There is no requirement that the locale strings for ICU need to be BCP 47. ICU locale names like 'de@collation=phonebook' are also acceptable. >>> >>> Right. But, we only document that BCP 47 is supported by Postgres. >>> Maybe we can use get_icu_language_tag()/uloc_toLanguageTag() to ensure >>> that we end up with BCP 47, even when the user happens to specify the >>> legacy syntax. Should we be "canonicalizing" legacy ICU locale strings >>> as BCP 47, too? >>> The reason they are not validated is that, as you know, ICU accepts any locale string as valid. You appear to have found a way to do some validation, but I would like to see that code. >>> >>> As I mentioned, I'm simply calling >>> get_icu_language_tag()/uloc_toLanguageTag() to do that sanitization. >>> The code to get the extra sanitization is completely trivial. >>> >>> I didn't post the patch that generates the errors in my opening e-mail >>> because I'm not confident it's correct just yet. And, I think that I >>> see a bigger problem: we pass a string that is almost certainly a BCP >>> 47 string to ucol_open() from within pg_newlocale_from_collation(). We >>> do so despite the fact that ucol_open() apparently doesn't accept BCP >>> 47 syntax locale strings until ICU 54 [1]. Seems entirely possible >>> that this accounts for the problems you saw on ICU 4.2, back when we >>> were still creating keyword variants (I guess that the keyword >>> variants seem very "BCP 47-ish" to me). >>> >>> I really think we need to add some kind of debug mode that makes ICU >>> optionally spit out a locale display name at key points. We need this >>> to gain confidence that the behavior that ICU provides actually >>> matches what is expected across ICU different versions for different >>> locale formats. We talked about this as a user-facing feature before, >>> which can wait till v11; I just want this to debug problems like this >>> one. >>> >>> [1] >>> https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a4721e4c0a519bb0139a874e191223590 >> >> [Action required within three days. This is a generic notification.] >> >> The above-described topic is currently a PostgreSQL 10 open item. Peter, >> since you committed the patch believed to have created it, you own this open >> item. If some other commit is more relevant or if this does not belong as a >> v10 open item, please let us know. Otherwise, please observe the policy on >> open item ownership[1] and send a status update within three calendar days of >> this message. Include a date for your subsequent status update. Testers may >> discover new open items at any time, and I want to plan to get them all fixed >> well in advance of shipping v10. Consequently, I will appreciate your >> efforts >> toward speedy resolution. Thanks. >> >> [1] >> https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com > > This PostgreSQL 10 open item is past due for your status update. On the worst > week to be violating open item policies. Kindly send a status update within > 24 hours, and include a date for your subsequent status update. Refer to the > policy on open item ownership: > https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com I had previously replied that I think nothing should be changed. What process should be applied in that case? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
On 9/30/17 15:28, Tom Lane wrote: > This suggests to me that arguing about canonicalization is moot so > far as avoiding reindexing goes: if you change ICU library versions, > you're screwed and will have to jump through all the reindexing hoops, > no matter what we do or don't do. One reason for that is that the collation version also encodes things like if the internal method for computing sort keys changes. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
On 9/30/17 15:28, Tom Lane wrote: > Now, it may still be worthwhile to argue about whether canonicalization > will help the other component of the problem, which is whether you can > dump and reload CREATE COLLATION commands into a new ICU version and > expect to get more-or-less-the-same behavior as before. Arguably, you don't always want that. Maybe you selected a locale name like 'en-NEWCOUNTRY', and in old ICU versions you want that to fall back to default 'en' behavior and in newer you get the updated custom behavior. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
On Sat, Sep 30, 2017 at 12:28 PM, Tom Lanewrote: > This suggests to me that arguing about canonicalization is moot so > far as avoiding reindexing goes: if you change ICU library versions, > you're screwed and will have to jump through all the reindexing hoops, > no matter what we do or don't do. (Maybe we are looking at the wrong > information to populate collversion?) Just to be clear, I was never arguing that canonicalization would avoid reindexing, and I didn't think anyone else was. I believe that the version string will change when the behavior of its collator changes for any reason and in any way. This includes changes to how binary sort keys are generated. (We don't currently store binary keys on disk, so a change to that representation doesn't really necessitate a REINDEX for us. The UCA spec explicitly decouples compatibility for indexing with binary keys from changes needed due to human requirements. ICU binary sort keys are compressed, and this is sometimes improved, independently of the evolution of how natural language experts say text should be sorted.) > Now, it may still be worthwhile to argue about whether canonicalization > will help the other component of the problem, which is whether you can > dump and reload CREATE COLLATION commands into a new ICU version and > expect to get more-or-less-the-same behavior as before. Again, to be clear, that's what I was arguing all along. I think that users will get very good results with this approach. When the sorting behavior of a locale is refined by natural language experts, it's almost certainly a very small change that most users that are affected won't actually notice. For example, when en-us-x-icu is changed, that's actually due to a general change in the inherited root collator that doesn't really affect English speakers. There is no practical question about how you're supposed to sort English text. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Noah Mischwrites: > On Sat, Sep 30, 2017 at 11:25:43AM -0400, Tom Lane wrote: >> Sure, but dealing with that is mechanical: reindex the necessary indexes >> and you're done. > In the general case, one must revalidate CHECK constraints, re-partition > tables, revalidate range values, and reindex. True, but that is what it is: nothing we can do is going to affect the consequences of a collation behavior change, if there is one. What's more useful for our immediate purposes is to ask whether we can reliably detect a collation behavior change. False negatives are bad, but so are false positives, because those would force DBAs to jump through lots of hoops unnecessarily. So: are canonicalized locale descriptions any better or worse by that metric than non-canonicalized descriptions? In principle I think a canonicalized description might be more likely to be recognized as the "same" locale by another ICU version than one that isn't, but I don't know whether there's any meaningful difference in practice. Another point here is whether, even if a new ICU version recognizes a locale description as being "the same" interpretation that an old ICU version used, will it report the same collation version? Limited experimentation suggests that the collversions we're actually getting out of ICU depend on little other than the libicu version. "select distinct collversion from pg_collation where collversion is not null" produces this on ICU 4.2.1: 49.192.5.41 49.192.0.41 and this on 52.1: 58.0.6.50 58.0.0.50 and this on 57.1: 153.64.29 153.64 This suggests to me that arguing about canonicalization is moot so far as avoiding reindexing goes: if you change ICU library versions, you're screwed and will have to jump through all the reindexing hoops, no matter what we do or don't do. (Maybe we are looking at the wrong information to populate collversion?) Now, it may still be worthwhile to argue about whether canonicalization will help the other component of the problem, which is whether you can dump and reload CREATE COLLATION commands into a new ICU version and expect to get more-or-less-the-same behavior as before. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
On Sat, Sep 30, 2017 at 11:25:43AM -0400, Tom Lane wrote: > Noah Mischwrites: > > On Mon, Sep 25, 2017 at 09:36:44AM -0700, Peter Geoghegan wrote: > >>> I think it's inevitable that a certain number of users are going to > >>> have to cope with ICU version changes breaking stuff. > > >> Wasn't the main point of adopting ICU that that doesn't happen when it > >> isn't essential? > > > I wouldn't describe it that way. I agree that few, if any, ICU upgrades > > will > > remove country or language codes. Overall, though, almost every ICU upgrade > > will be difficult. Each ICU release, even a minor release like 58.2, > > changes > > the sorting rules in some tiny way. You then see "Rebuild all objects > > affected by this collation" messages. > > Sure, but dealing with that is mechanical: reindex the necessary indexes > and you're done. In the general case, one must revalidate CHECK constraints, re-partition tables, revalidate range values, and reindex. > In the libc world, > when you upgrade libc's locale definitions, you have no idea what the > consequences are. Yep. It's strictly better than the libc case. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
On Sat, Sep 30, 2017 at 8:25 AM, Tom Lanewrote: > I'd also argue that the point of adopting ICU was exactly so we *could* > distinguish those cases, and limit the scope of a normal upgrade to > "reindex these identifiable indexes and you're done". In the libc world, > when you upgrade libc's locale definitions, you have no idea what the > consequences are. Right. With libc, we think of collations as something that there is a small, fixed number of on a system, that we cannot safely assume anything about. But with ICU, all of the semantics of how natural languages should be sorted are exposed via various APIs, and there are literally more possible sets of collation behaviors than there are grains of sand in the Sahara (there are hundreds of distinct scripts, which we can change the overall ordering of arbitrarily, on top of all the other customizations). Clearly the libc way of looking at things doesn't really carry over. BCP 47 is supposed to be universal -- it's an IETF standard. That's where all the stability guarantees are. The officially recognized 'u' extension that ICU uses is a CLDR/Unicode thing, not an ICU thing. The same format could, in the future, be used by other collation providers, since there actually are other CLDR consumers/UCA implementations. And, ICU have said that they have deprecated the old locale format, and have standardized on BCP 47. As of ICU 54, it is recommended that ucol_open() be passed a string in BCP 47 format. I'm surprised that this issue was not resolved earlier in the week. I presumed that all of this was obvious to Peter E., but I seem to have been wrong about that. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: CREATE COLLATION does not sanitize ICU's BCP 47 language tags. Should it?
Noah Mischwrites: > On Mon, Sep 25, 2017 at 09:36:44AM -0700, Peter Geoghegan wrote: >>> I think it's inevitable that a certain number of users are going to >>> have to cope with ICU version changes breaking stuff. >> Wasn't the main point of adopting ICU that that doesn't happen when it >> isn't essential? > I wouldn't describe it that way. I agree that few, if any, ICU upgrades will > remove country or language codes. Overall, though, almost every ICU upgrade > will be difficult. Each ICU release, even a minor release like 58.2, changes > the sorting rules in some tiny way. You then see "Rebuild all objects > affected by this collation" messages. Sure, but dealing with that is mechanical: reindex the necessary indexes and you're done. I think it's important to distinguish that from a case where users have to change their collation definitions. That is a qualitatively different, and much harder, upgrade problem. I'd also argue that the point of adopting ICU was exactly so we *could* distinguish those cases, and limit the scope of a normal upgrade to "reindex these identifiable indexes and you're done". In the libc world, when you upgrade libc's locale definitions, you have no idea what the consequences are. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers