Re: [HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)
On 8/7/17 21:00, Peter Geoghegan wrote:
> Actually, it's *impossible* for ICU to fail to accept any string as a
> valid locale within CREATE COLLATION, because CollationCreate() simply
> doesn't sanitize ICU names. It doesn't do something like call
> get_icu_language_tag(), unlike initdb (within
> pg_import_system_collations()).
>
> If I add such a test to CollationCreate(), it does a reasonable job of
> sanitizing, while preserving the spirit of the BCP 47 language tag
> format by not assuming that the user didn't specify a brand new locale
> that it hasn't heard of.

I'm not sure what you are proposing here. Convert the input to CREATE
COLLATION to a BCP 47 language tag?

--
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)
On Mon, Aug 7, 2017 at 3:23 PM, Tom Lane wrote:
> The thing that I'm particularly thinking about is that if someone wants
> an ICU variant collation that we didn't make initdb provide, they'll do
> a CREATE COLLATION and go use it. At update time, pg_dump or pg_upgrade
> will export/import that via CREATE COLLATION, and the only way it fails
> is if ICU rejects the collation name as garbage. (Which, as we already
> established upthread, it's quite unlikely to do.)

Actually, it's *impossible* for ICU to fail to accept any string as a
valid locale within CREATE COLLATION, because CollationCreate() simply
doesn't sanitize ICU names. It doesn't do something like call
get_icu_language_tag(), unlike initdb (within
pg_import_system_collations()).

If I add such a test to CollationCreate(), it does a reasonable job of
sanitizing, while preserving the spirit of the BCP 47 language tag
format by not assuming that the user didn't specify a brand new locale
that it hasn't heard of.

All of these are accepted with unmodified master. With the test added
to CollationCreate(), the results are:

postgres=# CREATE COLLATION test1 (provider = icu, locale = 'en-x-icu');
CREATE COLLATION
postgres=# CREATE COLLATION test2 (provider = icu, locale = 'foo bar baz');
ERROR:  XX000: could not convert locale name "foo bar baz" to language tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION:  get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test3 (provider = icu, locale = 'en-gb-icu');
ERROR:  XX000: could not convert locale name "en-gb-icu" to language tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION:  get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test4 (provider = icu, locale = 'not-a-country');
CREATE COLLATION

If it's mandatory for get_icu_language_tag() to not throw an error
during initdb import when passed strings like these (that are
generated mechanically), why should we not do the same with CREATE
COLLATION?

While the choice to preserve BCP 47's tolerance of missing collations
is debatable, not doing at least this much up-front is a bug IMV.

--
Peter Geoghegan

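For readers wondering why 'en-gb-icu' is rejected while 'not-a-country'
is not, the difference can be explained by BCP 47 syntax alone. What
follows is a minimal sketch in Python, not the check that
get_icu_language_tag() or ICU actually performs: it tests only the
"langtag" grammar from RFC 5646, ignores grandfathered and pure
private-use tags, and does no registry lookup. The names LANGTAG_RE and
is_well_formed_langtag are made up for this illustration.

```python
import re

# Simplified RFC 5646 "langtag" grammar (an assumption-laden sketch):
# grandfathered tags and tags that are only private-use are not handled,
# and subtags are never looked up in any registry.
LANGTAG_RE = re.compile(
    r"(?:[A-Za-z]{2,3}(?:-[A-Za-z]{3}){0,3}|[A-Za-z]{4,8})"    # language (+ extlang)
    r"(?:-[A-Za-z]{4})?"                                        # script
    r"(?:-(?:[A-Za-z]{2}|[0-9]{3}))?"                           # region
    r"(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*"           # variants
    r"(?:-[0-9A-WY-Za-wy-z](?:-[A-Za-z0-9]{2,8})+)*"            # extensions
    r"(?:-[xX](?:-[A-Za-z0-9]{1,8})+)?"                         # private use
)


def is_well_formed_langtag(locale: str) -> bool:
    """Return True if the whole string fits the simplified langtag grammar."""
    return LANGTAG_RE.fullmatch(locale) is not None


if __name__ == "__main__":
    for name in ["en-x-icu", "foo bar baz", "en-gb-icu", "not-a-country"]:
        print(f"{name!r}: {is_well_formed_langtag(name)}")
```

Under this grammar 'not-a-country' parses as language "not" plus an
extension with singleton "a", so it is well-formed even though it names
nothing real, while the three-letter "icu" in 'en-gb-icu' fits no slot
after the region. That matches the four results in the session above,
but it is only a syntactic reading and says nothing about what any
given ICU version will do with other inputs.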
Re: [HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)
Peter Eisentraut writes:
> On 8/6/17 20:07, Peter Geoghegan wrote:
>> I've looked into this. I'll give an example of what keyword variants
>> there are for Greek, and then discuss what I think each is.

> I'm not sure why we want to get into editorializing this. We query ICU
> for the names of distinct collations and use that. It's more than most
> people need, sure, but it doesn't cost us anything.

Yes, *it does*. The cost will be borne by users who get screwed at update
time, not by developers, but that doesn't make it insignificant.

> The alternatives are hand-maintaining a list of collations, or
> installing no collations by default. Both of those are arguably worse
> for users or for future code maintenance or both.

I'm not (yet) convinced that we need a hand-maintained whitelist. But
I am wondering why we're expending extra code to import keyword variants.
Who is that catering to, really?

The thing that I'm particularly thinking about is that if someone wants
an ICU variant collation that we didn't make initdb provide, they'll do
a CREATE COLLATION and go use it. At update time, pg_dump or pg_upgrade
will export/import that via CREATE COLLATION, and the only way it fails
is if ICU rejects the collation name as garbage. (Which, as we already
established upthread, it's quite unlikely to do.)

On the other hand, if someone relies on an ICU variant collation that
initdb did import, and then in the next release that collation doesn't
get imported because ICU changed their minds on what to advertise, the
update situation is not pretty at all. Certainly it won't get handled
transparently.

This line of thinking leads me to believe that we ought to be pretty
conservative about what we import during initdb.

			regards, tom lane

Re: [HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)
On Mon, Aug 7, 2017 at 2:50 PM, Peter Eisentraut wrote:
> On 8/6/17 20:07, Peter Geoghegan wrote:
>> I've looked into this. I'll give an example of what keyword variants
>> there are for Greek, and then discuss what I think each is.
>
> I'm not sure why we want to get into editorializing this. We query ICU
> for the names of distinct collations and use that.

We ask ucol_getKeywordValuesForLocale() to get only "commonly used
[variant] values with the given locale" within
pg_import_system_collations(). So the editorializing has already
begun.

> It's more than most
> people need, sure, but it doesn't cost us anything.

It's also *less* than what other users need. I disagree on the cost of
redundancy among entries after initdb. It's just confusing to users,
and seems avoidable without adding special case logic. What's the
difference between el-u-co-standard-x-icu and el-x-icu?

> The alternatives
> are hand-maintaining a list of collations, or installing no collations
> by default.

A better alternative would be to actively take an interest in what
collations are created, by further refining the rules by which they
are created. We have a stable API, described by various standards,
that we can work with for this. This doesn't have to be a
maintainability burden. We can provide general guidance about how to
add stuff back within documentation.

I do think that we should actually list all the collations that are
available by default on some representative ICU version, once that
list is tightened up, just as other database systems list them. That
necessitates a little weasel wording that notes that later ICU
versions might add more, but that's not a problem IMV. I don't think
that CLDR will ever omit anything previously available, at least
within a reasonable timeframe [1].

[1] http://cldr.unicode.org/index/process/cldr-data-retention-policy
--
Peter Geoghegan

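The redundancy question raised in this message (el-u-co-standard-x-icu
versus el-x-icu) can at least be measured on a given installation. Here
is a rough Python sketch, assuming the collation names have been copied
out of a \dOS+ listing; split_collation_name and standard_pairs are
hypothetical helpers, not anything in PostgreSQL. It only compares
spellings and says nothing about whether ICU sorts the two entries
identically.

```python
def split_collation_name(name: str):
    """Split 'el-u-co-standard-x-icu' into ('el', 'standard').

    A bare name such as 'el-x-icu' yields ('el', None). Only the
    '-u-co-' keyword is recognized; other keywords are not handled.
    """
    suffix = "-x-icu"
    if not name.endswith(suffix):
        raise ValueError(f"not an initdb-style ICU collation name: {name!r}")
    tag = name[: -len(suffix)]
    base, sep, variant = tag.partition("-u-co-")
    return base, (variant if sep else None)


def standard_pairs(names):
    """Return bases that have both a bare entry and a 'standard' variant."""
    variants_by_base = {}
    for name in names:
        base, variant = split_collation_name(name)
        variants_by_base.setdefault(base, set()).add(variant)
    return sorted(
        base
        for base, variants in variants_by_base.items()
        if None in variants and "standard" in variants
    )


if __name__ == "__main__":
    # Names as they appear for Greek in this thread.
    greek = [
        "el-u-co-emoji-x-icu",
        "el-u-co-eor-x-icu",
        "el-u-co-search-x-icu",
        "el-u-co-standard-x-icu",
        "el-x-icu",
    ]
    print(standard_pairs(greek))
```

Running something like this against the full list from a fresh initdb
would show how many bare entries come with a "standard" twin, which is
the set of pairs whose difference, if any, is being asked about.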
Re: [HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)
On 8/6/17 20:07, Peter Geoghegan wrote:
> I've looked into this. I'll give an example of what keyword variants
> there are for Greek, and then discuss what I think each is.

I'm not sure why we want to get into editorializing this. We query ICU
for the names of distinct collations and use that. It's more than most
people need, sure, but it doesn't cost us anything.

The alternatives are hand-maintaining a list of collations, or
installing no collations by default. Both of those are arguably worse
for users or for future code maintenance or both.

--
Peter Eisentraut              http://www.2ndQuadrant.com/

[HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)
On Sun, Aug 6, 2017 at 1:06 PM, Peter Geoghegan wrote:
> On Sat, Aug 5, 2017 at 8:26 PM, Tom Lane wrote:
>> I'm quite disturbed though that the set of installed collations on these
>> two test cases seem to be entirely different both from each other and from
>> what you reported. The base collations look generally similar, but the
>> "keyword variant" versions are not comparable at all. Considering that
>> the entire reason we are interested in ICU in the first place is its
>> alleged cross-version collation behavior stability, this gives me the
>> exact opposite of a warm fuzzy feeling. We need to understand why it's
>> like that and what we can do to reduce the variation, or else we're just
>> buying our users enormous future pain. At least with the libc collations,
>> you can expect that if you have en_US.utf8 available today you will
>> probably still have en_US.utf8 available tomorrow. I am not seeing any
>> reason to believe that the same holds for ICU collations.
>
> +1. That seems like something that is important to get right up-front.

I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is.

These are the keyword variant collations on my machine with master +
ICU support (ICU 55):

postgres=# \dOS+ el-*
                                        List of collations
   Schema   │          Name          │     Collate      │      Ctype       │ Provider │ Description
────────────┼────────────────────────┼──────────────────┼──────────────────┼──────────┼─────────────
 pg_catalog │ el-u-co-emoji-x-icu    │ el-u-co-emoji    │ el-u-co-emoji    │ icu      │ Greek
 pg_catalog │ el-u-co-eor-x-icu      │ el-u-co-eor      │ el-u-co-eor      │ icu      │ Greek
 pg_catalog │ el-u-co-search-x-icu   │ el-u-co-search   │ el-u-co-search   │ icu      │ Greek
 pg_catalog │ el-u-co-standard-x-icu │ el-u-co-standard │ el-u-co-standard │ icu      │ Greek
 pg_catalog │ el-x-icu               │ el               │ el               │ icu      │ Greek
(5 rows)

Greek has only one region, standard Greek. A few other
language-regions have variations like multiple regions (e.g. Austrian
German), or a phonebook variant, which you don't see here. Almost all
have -emoji, -search, and -standard, which you do see here.

We pass "commonlyUsed = true" to ucol_getKeywordValuesForLocale()
within pg_import_system_collations(), and so it "will return only
commonly used values with the given locale in preferred order". But
should we go even further? If the charter of
pg_import_system_collations() is to import every possible valid
collation for pg_collation, then it's already failing at that by
limiting itself to "common variants". I agree with the decision to do
that, though, and I think we probably need to go a bit further.

Possible issues with current ICU pg_collation entries after initdb:

* I don't think we should have user-visible "search" collations at all.

Apparently "search" collations are useful because "primary- and
secondary-level distinctions for searching may not be the same as
those for sorting; in ICU, many languages provide a special "search"
collator with the appropriate level settings for search" [1]. I don't
think that we should expose "search" keyword variants at all, because
clearly they're an implementation detail that Postgres may one day
have special knowledge of [2], to correctly mix searching and sorting
semantics. For the time being, those should simply not be added within
pg_import_system_collations(). Someone could still create the entries
themselves, which seems harmless. Let's avoid establishing the
expectation that they'll be in pg_collation.

* Redundant ICU spellings for the same collation seem to appear.

I find it questionable that there is both an "el-x-icu" and an
"el-u-co-standard-x-icu". That looks like an artifact of how
pg_import_system_collations() was written, as opposed to a bona fide
behavioral difference. I cannot find an example of a
"$COUNTRY_CODE-x-icu" collation without a corresponding
"$COUNTRY_CODE-*-u-standard-x-icu" (The situation is similar for
regional variants, like Austrian German). What, if anything, is the
difference between each such pair of collations? Can we find a way to
provide only one canonical entry if those are simply different ICU
spellings?

* Many emoji variant collations.

I have to wonder if there is much value in creating so many
pg_collation entries that are mere variants to do pictographic emoji
sorting. Call me a killjoy, but I think that users that want that
behavior can create the collations themselves. We could still document
it. I wouldn't mind it if there weren't so many emoji collations.

* Many EOR variant collations.

EOR as a collation variant is an ICU hack to get around the fact that
EOR doesn't fit with their taxonomy for locales. My understanding is
that there is supposed to be one EOR collation, used across Europe,
per the ISO standard. I think ICU structures it