Re: [HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

2017-08-14 Thread Peter Eisentraut
On 8/7/17 21:00, Peter Geoghegan wrote:
> Actually, it's *impossible* for ICU to fail to accept any string as a
> valid locale within CREATE COLLATION, because CollationCreate() simply
> doesn't sanitize ICU names. It doesn't do something like call
> get_icu_language_tag(), unlike initdb (within
> pg_import_system_collations()).
> 
> If I add such a test to CollationCreate(), it does a reasonable job of
> sanitizing, while preserving the spirit of the BCP 47 language tag
> format by not assuming that the user didn't specify a brand new locale
> that it hasn't heard of.

I'm not sure what you are proposing here.  Convert the input to CREATE
COLLATION to a BCP 47 language tag?

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

2017-08-07 Thread Peter Geoghegan
On Mon, Aug 7, 2017 at 3:23 PM, Tom Lane  wrote:
> The thing that I'm particularly thinking about is that if someone wants
> an ICU variant collation that we didn't make initdb provide, they'll do
> a CREATE COLLATION and go use it.  At update time, pg_dump or pg_upgrade
> will export/import that via CREATE COLLATION, and the only way it fails
> is if ICU rejects the collation name as garbage.  (Which, as we already
> established upthread, it's quite unlikely to do.)

Actually, it's *impossible* for ICU to fail to accept any string as a
valid locale within CREATE COLLATION, because CollationCreate() simply
doesn't sanitize ICU names. It doesn't do something like call
get_icu_language_tag(), unlike initdb (within
pg_import_system_collations()).

If I add such a test to CollationCreate(), it does a reasonable job of
sanitizing, while preserving the spirit of the BCP 47 language tag
format by not assuming that the user didn't specify a brand new locale
that it hasn't heard of. All of these are accepted with unmodified
master:

postgres=# CREATE COLLATION test1 (provider = icu, locale = 'en-x-icu');
CREATE COLLATION
postgres=# CREATE COLLATION test2 (provider = icu, locale = 'foo bar baz');
ERROR:  XX000: could not convert locale name "foo bar baz" to language
tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION:  get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test3 (provider = icu, locale = 'en-gb-icu');
ERROR:  XX000: could not convert locale name "en-gb-icu" to language
tag: U_ILLEGAL_ARGUMENT_ERROR
LOCATION:  get_icu_language_tag, collationcmds.c:454
postgres=# CREATE COLLATION test4 (provider = icu, locale = 'not-a-country');
CREATE COLLATION

If it's mandatory for get_icu_language_tag() to not throw an error
during initdb import when passed strings like these (that are
generated mechanically), why should we not do the same with CREATE
COLLATION? While the choice to preserve BCP 47's tolerance of missing
collations is debatable, not doing at least this much up-front is a
bug IMV.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

2017-08-07 Thread Tom Lane
Peter Eisentraut  writes:
> On 8/6/17 20:07, Peter Geoghegan wrote:
>> I've looked into this. I'll give an example of what keyword variants
>> there are for Greek, and then discuss what I think each is.

> I'm not sure why we want to get into editorializing this.  We query ICU
> for the names of distinct collations and use that.  It's more than most
> people need, sure, but it doesn't cost us anything.

Yes, *it does*.  The cost will be borne by users who get screwed at update
time, not by developers, but that doesn't make it insignificant.

> The alternatives are hand-maintaining a list of collations, or
> installing no collations by default.  Both of those are arguably worse
> for users or for future code maintenance or both.

I'm not (yet) convinced that we need a hand-maintained whitelist.  But
I am wondering why we're expending extra code to import keyword variants.
Who is that catering to, really?

The thing that I'm particularly thinking about is that if someone wants
an ICU variant collation that we didn't make initdb provide, they'll do
a CREATE COLLATION and go use it.  At update time, pg_dump or pg_upgrade
will export/import that via CREATE COLLATION, and the only way it fails
is if ICU rejects the collation name as garbage.  (Which, as we already
established upthread, it's quite unlikely to do.)  On the other hand,
if someone relies on an ICU variant collation that initdb did import,
and then in the next release that collation doesn't get imported because
ICU changed their minds on what to advertise, the update situation is not
pretty at all.  Certainly it won't get handled transparently.  This line
of thinking leads me to believe that we ought to be pretty conservative
about what we import during initdb.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

2017-08-07 Thread Peter Geoghegan
On Mon, Aug 7, 2017 at 2:50 PM, Peter Eisentraut
 wrote:
> On 8/6/17 20:07, Peter Geoghegan wrote:
>> I've looked into this. I'll give an example of what keyword variants
>> there are for Greek, and then discuss what I think each is.
>
> I'm not sure why we want to get into editorializing this.  We query ICU
> for the names of distinct collations and use that.

We ask ucol_getKeywordValuesForLocale() to get only "commonly used
[variant] values with the given locale" within
pg_import_system_collations(). So the editorializing has already
begun.

> It's more than most
> people need, sure, but it doesn't cost us anything.

It's also *less* than what other users need. I disagree on the cost of
redundancy among entries after initdb. It's just confusing to users,
and seems avoidable without adding special case logic. What's the
difference between el-u-co-standard-x-icu and el-x-icu?

> The alternatives
> are hand-maintaining a list of collations, or installing no collations
> by default.

A better alternative would be to actively take an interest in what
collations are created, by further refining the rules by which they
are created. We have a stable API, described by various standards,
that we can work with for this. This doesn't have to be a
maintainability burden. We can provide general guidance about how to
add stuff back within documentation.

I do think that we should actually list all the collations that are
available by default on some representative ICU version, once that
list is tightened up, just as other database systems list them. That
necessitates a little weasel wording that notes that later ICU
versions might add more, but that's not a problem IMV. I don't think
that CLDR will ever omit anything previously available, at least
within a reasonable timeframe [1].

[1] http://cldr.unicode.org/index/process/cldr-data-retention-policy
-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

2017-08-07 Thread Peter Eisentraut
On 8/6/17 20:07, Peter Geoghegan wrote:
> I've looked into this. I'll give an example of what keyword variants
> there are for Greek, and then discuss what I think each is.

I'm not sure why we want to get into editorializing this.  We query ICU
for the names of distinct collations and use that.  It's more than most
people need, sure, but it doesn't cost us anything.  The alternatives
are hand-maintaining a list of collations, or installing no collations
by default.  Both of those are arguably worse for users or for future
code maintenance or both.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] ICU collation variant keywords and pg_collation entries (Was: [BUGS] Crash report for some ICU-52 (debian8) COLLATE and work_mem values)

2017-08-06 Thread Peter Geoghegan
On Sun, Aug 6, 2017 at 1:06 PM, Peter Geoghegan  wrote:
> On Sat, Aug 5, 2017 at 8:26 PM, Tom Lane  wrote:
>> I'm quite disturbed though that the set of installed collations on these
>> two test cases seem to be entirely different both from each other and from
>> what you reported.  The base collations look generally similar, but the
>> "keyword variant" versions are not comparable at all.  Considering that
>> the entire reason we are interested in ICU in the first place is its
>> alleged cross-version collation behavior stability, this gives me the
>> exact opposite of a warm fuzzy feeling.  We need to understand why it's
>> like that and what we can do to reduce the variation, or else we're just
>> buying our users enormous future pain.  At least with the libc collations,
>> you can expect that if you have en_US.utf8 available today you will
>> probably still have en_US.utf8 available tomorrow.  I am not seeing any
>> reason to believe that the same holds for ICU collations.
>
> +1. That seems like something that is important to get right up-front.

I've looked into this. I'll give an example of what keyword variants
there are for Greek, and then discuss what I think each is. These
keyword variant locations on my machine with master + ICU support (ICU
55):

postgres=# \dOS+ el-*
 List of collations
   Schema   │  Name  │ Collate  │  Ctype
│ Provider │ Description
┼┼──┼──┼──┼─
 pg_catalog │ el-u-co-emoji-x-icu│ el-u-co-emoji│
el-u-co-emoji│ icu  │ Greek
 pg_catalog │ el-u-co-eor-x-icu  │ el-u-co-eor  │ el-u-co-eor
│ icu  │ Greek
 pg_catalog │ el-u-co-search-x-icu   │ el-u-co-search   │
el-u-co-search   │ icu  │ Greek
 pg_catalog │ el-u-co-standard-x-icu │ el-u-co-standard │
el-u-co-standard │ icu  │ Greek
 pg_catalog │ el-x-icu   │ el   │ el
│ icu  │ Greek
(5 rows)

Greek has only one region, standard Greek. A few other
language-regions have variations like multiple regions (e.g. Austrian
German), or a phonebook variant, which you don't see here. Almost all
have -emoji, -search, and -standard, which you do see here.

We pass "commonlyUsed = true" to ucol_getKeywordValuesForLocale()
within pg_import_system_collations(), and so it "will return only
commonly used values with the given locale in preferred order". But
should we go even further? If the charter of
pg_import_system_collations() is to import every possible valid
collation for pg_collation, then it's already failing at that by
limiting itself to "common variants". I agree with the decision to do
that, though, and I think we probably need to go a bit further.

Possible issues with current ICU pg_collation entries after initdb:

* I don't think we should have user-visible "search" collations at all.

Apparently "search" collations are useful because "primary- and
secondary-level distinctions for searching may not be the same as
those for sorting; in ICU, many languages provide a special "search"
collator with the appropriate level settings for search" [1]. I don't
think that we should expose "search" keyword variants at all, because
clearly they're an implementation detail that Postgres may one day
have special knowledge of [2], to correctly mix searching and sorting
semantics. For the time being, those should simply not be added within
pg_import_system_collations(). Someone could still create the entries
themselves, which seems harmless. Let's avoid establishing the
expectation that they'll be in pg_collation.

* Redundant ICU spellings for the same collation seem to appear.

I find it questionable that there is both a "el-x-icu" and a
"el-u-co-standard-x-icu". That looks like an artifact of how
pg_import_system_collations() was written, as opposed to a bonafide
behavioral difference. I cannot find an example of a
"$COUNTRY_CODE-x-icu" collation without a corresponding
"$COUNTRY_CODE-*-u-standard-x-icu" (The situation is similar for
regional variants, like Austrian German). What, if anything, is the
difference between each such pair of collations? Can we find a way to
provide only one canonical entry if those are simply different ICU
spellings?

* Many emoji variant collations.

I have to wonder if there is much value in creating so many
pg_collation entries that are mere variants to do pictographic emoji
sorting. Call me a killjoy, but I think that users that want that
behavior can create the collations themselves. We could still document
it. I wouldn't mind it if there wasn't so many emoji collations.

* Many EOR variant collations.

EOR as a collation variant is an ICU hack to get around the fact that
EOR doesn't fit with their taxonomy for locales. My understanding is
that there is supposed to be one EOR collation, used across Europe,
per the ISO standard. I think ICU structures it