subject:"\[HACKERS\] ICU integration"

Re: [HACKERS] ICU integration

2017-03-24 Thread Peter Geoghegan

On Thu, Mar 23, 2017 at 6:31 PM, Craig Ringer  wrote:
> Congratulations on getting this done. It's great work, and it'll make
> a whole class of potential bugs and platform portability warts go away
> if widely adopted.

+1

I would like to see us rigorously define a standard for when Postgres
installations are binary compatible. There should be a simple tool.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-03-23 Thread Craig Ringer

On 24 March 2017 at 04:54, Peter Eisentraut
 wrote:

> Fixed.

Congratulations on getting this done. It's great work, and it'll make
a whole class of potential bugs and platform portability warts go away
if widely adopted.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-03-23 Thread Peter Eisentraut

On 3/23/17 16:45, Thomas Munro wrote:
> On Fri, Mar 24, 2017 at 8:34 AM, Peter Eisentraut
>  wrote:
>> Committed.
> 
> On a macOS system using MacPorts, I have "icu" installed.  MacPorts is
> a package manager that installs stuff under /opt/local.  I have
> /opt/local/bin in $PATH so that pkg-config can be found.  I run
> ./configure with --with-icu but without mentioning any special paths
> and it says:
> 
> checking whether to build with ICU support... yes
> checking for pkg-config... /opt/local/bin/pkg-config
> checking pkg-config is at least version 0.9.0... yes
> checking for icu-uc icu-i18n... yes
> 
> ... but then building fails, because there are no headers in the search path:

Fixed.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-03-23 Thread Thomas Munro

On Fri, Mar 24, 2017 at 8:34 AM, Peter Eisentraut
 wrote:
> Committed.

On a macOS system using MacPorts, I have "icu" installed.  MacPorts is
a package manager that installs stuff under /opt/local.  I have
/opt/local/bin in $PATH so that pkg-config can be found.  I run
./configure with --with-icu but without mentioning any special paths
and it says:

checking whether to build with ICU support... yes
checking for pkg-config... /opt/local/bin/pkg-config
checking pkg-config is at least version 0.9.0... yes
checking for icu-uc icu-i18n... yes

... but then building fails, because there are no headers in the search path:

gcc -Wall -Wmissing-prototypes -Wpointer-arith
-Wdeclaration-after-statement -Wendif-labels
-Wmissing-format-attribute -Wformat-security -fno-strict-aliasing
-fwrapv -Wno-unused-command-line-argument -g -O2 -Wall -Werror
-I../../../src/include/snowball
-I../../../src/include/snowball/libstemmer -I../../../src/include   -c
-o dict_snowball.o dict_snowball.c -MMD -MP -MF .deps/dict_snowball.Po
...
../../../src/include/utils/pg_locale.h:19:10: fatal error:
'unicode/ucol.h' file not found

But pkg-config does know where to find those headers:

$ pkg-config --cflags icu-i18n
-I/opt/local/include

... and it's not wrong:

$ ls /opt/local/include/unicode/ucol.h
/opt/local/include/unicode/ucol.h

So I think there may be a bug in the configure script: I think it
should be putting pkg-config's --cflags output into one of the
CFLAGS-like variables.

Or am I misunderstanding what pkg-config is supposed to be doing for us here?

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-03-23 Thread Jeff Janes

On Thu, Mar 23, 2017 at 12:34 PM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 3/23/17 05:34, Andreas Karlsson wrote:
> > I am fine with this version of the patch. The issues I have with it,
> > which I mentioned earlier in this thread, seem to be issues with ICU
> > rather than with this patch. For example there seems to be no way for
> > ICU to validate the syntax of the BCP 47 locales (or ICU's old format).
> > But I think we will just have to accept the weirdness of how ICU handles
> > locales.
> >
> > I think this patch is ready to be committed.
> >
> > Found a typo in the documentation:
> >
> > "The inspect the currently available locales" should be "To inspect the
> > currently available locales".
>
> Committed.
>

This has broken the C locale, and the build farm.

if (pg_database_encoding_max_length() > 1 || locale->provider ==
COLLPROVIDER_ICU)

segfaults because locale is null.  (locale_is_c is true)

Cheers,

Jeff

Re: [HACKERS] ICU integration

2017-03-23 Thread Peter Eisentraut

On 3/23/17 05:34, Andreas Karlsson wrote:
> I am fine with this version of the patch. The issues I have with it, 
> which I mentioned earlier in this thread, seem to be issues with ICU 
> rather than with this patch. For example there seems to be no way for 
> ICU to validate the syntax of the BCP 47 locales (or ICU's old format). 
> But I think we will just have to accept the weirdness of how ICU handles 
> locales.
> 
> I think this patch is ready to be committed.
> 
> Found a typo in the documentation:
> 
> "The inspect the currently available locales" should be "To inspect the 
> currently available locales".

Committed.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-03-23 Thread Andreas Karlsson

I am fine with this version of the patch. The issues I have with it, 
which I mentioned earlier in this thread, seem to be issues with ICU 
rather than with this patch. For example there seems to be no way for 
ICU to validate the syntax of the BCP 47 locales (or ICU's old format). 
But I think we will just have to accept the weirdness of how ICU handles 
locales.


I think this patch is ready to be committed.

Found a typo in the documentation:

"The inspect the currently available locales" should be "To inspect the 
currently available locales".


Andreas


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-03-19 Thread Andreas Karlsson


On 03/15/2017 05:33 PM, Peter Eisentraut wrote:

Updated patch attached.


Ok, I applied to patch again and ran the tests. I get a test failure on 
make check-world in the pg_dump tests but it can be fixed with the below.


diff --git a/src/bin/pg_dump/t/002_pg_dump.pl 
b/src/bin/pg_dump/t/002_pg_dump.pl

index 3cac4a9ae0..7d9c90363b 100644
--- a/src/bin/pg_dump/t/002_pg_dump.pl
+++ b/src/bin/pg_dump/t/002_pg_dump.pl
@@ -2422,7 +2422,7 @@ qr/^\QINSERT INTO test_fifth_table (col1, col2, 
col3, col4, col5) VALUES (NULL,

  'CREATE COLLATION test0 FROM "C";',
regexp =>
  qr/^
- \QCREATE COLLATION test0 (lc_collate = 'C', lc_ctype = 'C');\E/xm,
+ \QCREATE COLLATION test0 (provider = 'libc', locale = 'C');\E/xm,
like => {
binary_upgrade   => 1,
clean=> 1,


- I do not like how it is not obvious which is the default version of
every locale. E.g. I believe that "sv%icu" is "sv_reformed%icu" and not
"sv_standard%icu" as one might expect. Is this inherent to ICU or
something we can work around?


We get these keywords from ucol_getKeywordValuesForLocale(), which says
"Given a key and a locale, returns an array of string values in a
preferred order that would make a difference."  So all those are
supposedly different from each other.


I believe you are mistaken. The locale "sv" is just an alias for 
"sv-u-standard" as far as I can tell. See the definition of the Swedish 
locale 
(http://source.icu-project.org/repos/icu/trunk/icu4c/source/data/coll/sv.txt) 
and there are just three collations: reformed (default), standard, 
search (plus eot and emoji which are inherited).


I am also quite annoyed at col-emoji and col-eor (European Ordering 
Rules). They are defined at the root and inherited by all languages, but 
no language modifies col-emoji for their needs which makes it a foot 
gun. See the Danish sorting example below where at least I expected the 
same order. For col-eor it makes a bit more sense since I would expect 
the locales en_GB-u-col-eot and sv_SE-u-col-eot to collate exactly the same.


# SELECT * FROM (VALUES ('a'), ('k'), ('aa')) q (x) ORDER BY x COLLATE 
"da-x-icu";

 x

 a
 k
 aa
(3 rows)

# SELECT * FROM (VALUES ('a'), ('k'), ('aa')) q (x) ORDER BY x COLLATE 
"da-u-co-emoji-x-icu";

 x

 a
 aa
 k
(3 rows)

It seems ICU has made quite the mess here, and I am not sure to what 
degree we need to handle it to avoid our users getting confused. I need 
to think some of it, and would love input from others. Maybe the right 
thing to do is to ignore the issue with col-emoji, but I would want to 
do something about the default collations.



- ICU talks about a new syntax for locale extensions (old:
de_DE@collation=phonebook, new: de_DE_u_co_phonebk) at this page
http://userguide.icu-project.org/collation/api. Is this something we
need to car about? It looks like we currently use the old format, and
while I personally prefer it I am not sure we should rely on an old syntax.


Interesting.  I hadn't see this before, and the documentation is sparse.
 But it seems that this refers to BCP 47 language tags, which seem like
a good idea.

So what I have done is change all the predefined ICU collations to be
named after the BCP 47 scheme, with a "private use" -x-icu suffix
(instead of %icu).  The preserves the original idea but uses a standard
naming scheme.

I'm not terribly worried that they are going to remove the "old" locale
naming, but just to be forward looking, I have changed it so that the
collcollate entries are made using the "new" naming for ICU >=54.


Sounds good.


- I get an error when creating a new locale.

#CREATE COLLATION sv2 (LOCALE = 'sv');
ERROR:  could not create locale "sv": Success

# CREATE COLLATION sv2 (LOCALE = 'sv');
ERROR:  could not create locale "sv": Resource temporarily unavailable
Time: 1.109 ms


Hmm, that's pretty straightforward code.  What is your operating system?
 What are the build options?  Does it work without this patch?


This issue is unrelated to ICU. I had forgot to specify provider so the 
eorrs are correct (even though that the value of the errno is weird).



- For the collprovider is it really correct to say that 'd' is the
default value as it does in catalogs.sgml?


It doesn't say it's the default value, it says it uses the database
default.  This is all a bit confusing.  We have a collation object named
"default", which uses the locale set for the database.  That's been that
way for a while.  Now when introducing the collation providers, that
"default" collation gets its own collprovider category 'd'.  That is not
really used anywhere, but we have to put something there.


Ah, I see now. Hm, that is a bit awkward but I cannot think of a cleaner 
alternative.



- I do not know what the policy for formatting the documentation is, but
some of the paragraphs are in need of re-wrapping.


Hmm, I don't see anything terribly bad.


Maybe it is just me

Re: [HACKERS] ICU integration

2017-03-14 Thread Andreas Karlsson


On 03/09/2017 10:13 PM, Peter Eisentraut wrote:

- Naming of collations:  Are we happy with the "de%icu" naming?  I might
have come up with that while reviewing the IPv6 zone index patch. ;-)
An alternative might be "de$icu" for more Oracle vibe and avoiding the
need for double quotes in some cases.  (But we have mixed-case names
like "de_AT%icu", so further changes would be necessary to fully get rid
of the need for quoting.)  A more radical alternative would be to
install ICU locales in a different schema and use the search_path, or
even have a separate search path setting for collations only.  Which
leads to ...


I do not like the schema based solution since search_path already gives 
us enough headaches. As for the naming I am fine with the current scheme.


Would be nice with something we did not have to quote, but I do not like 
using dollar signs since they are already use for other things.



- Selecting default collation provider:  Maybe we want a setting, say in
initdb, to determine which provider's collations get the "good" names?
Maybe not necessary for this release, but something to think about.


This does not seem necessary for the initial release.


- Currently (in this patch), we check a collation's version when it is
first used.  But, say, after pg_upgrade, you might want to check all of
them right away.  What might be a good interface for that?  (Possibly,
we only have to check the ones actually in use, and we have dependency
information for that.)


How about adding a SQL level function for checking this which can be 
called by pg_upgrade?


= Review

Here is an initial review. I will try to find some time to do more 
testing later this week.


This is a really useful feature given the poor quality of collation 
support libc. Just that ICU versions the encodings is huge, and the 
larger range of available collations with high configurability. 
Additionally being able to use abbreviated keys again would be huge.


ICU also supports writing your own collations and modifying existing 
ones, something which might be possible to expose in the future. In 
general ICU offers a lot for users who care about the details of text 
collation.


== Functional

- Generally things seem to work fine and as expected.

- I get a test failures in the default test suite due to not having the 
tr_TR locale installed. I would assume that this would be pretty common 
for hackers.


***
*** 428,443 

  -- to_char
  SET lc_time TO 'tr_TR';
  SELECT to_char(date '2010-04-01', 'DD TMMON ');
 to_char
  -
!  01 NIS 2010
  (1 row)

  SELECT to_char(date '2010-04-01', 'DD TMMON ' COLLATE "tr%icu");
 to_char
  -
!  01 NİS 2010
  (1 row)

  -- backwards parsing
--- 428,444 

  -- to_char
  SET lc_time TO 'tr_TR';
+ ERROR:  invalid value for parameter "lc_time": "tr_TR"
  SELECT to_char(date '2010-04-01', 'DD TMMON ');
 to_char
  -
!  01 APR 2010
  (1 row)

  SELECT to_char(date '2010-04-01', 'DD TMMON ' COLLATE "tr%icu");
 to_char
  -
!  01 APR 2010
  (1 row)

  -- backwards parsing

==

- The code no longer compiles without HAVE_LOCALE_T.

- I do not like how it is not obvious which is the default version of 
every locale. E.g. I believe that "sv%icu" is "sv_reformed%icu" and not 
"sv_standard%icu" as one might expect. Is this inherent to ICU or 
something we can work around?


- ICU talks about a new syntax for locale extensions (old: 
de_DE@collation=phonebook, new: de_DE_u_co_phonebk) at this page 
http://userguide.icu-project.org/collation/api. Is this something we 
need to car about? It looks like we currently use the old format, and 
while I personally prefer it I am not sure we should rely on an old syntax.


- I get an error when creating a new locale.

#CREATE COLLATION sv2 (LOCALE = 'sv');
ERROR:  could not create locale "sv": Success

# CREATE COLLATION sv2 (LOCALE = 'sv');
ERROR:  could not create locale "sv": Resource temporarily unavailable
Time: 1.109 ms

== Code review

- For the collprovider is it really correct to say that 'd' is the 
default value as it does in catalogs.sgml?


- I do not know what the policy for formatting the documentation is, but 
some of the paragraphs are in need of re-wrapping.


- Add a hint to "ERROR:  conflicting or redundant options". The error 
message is pretty unclear.


- I am not a fan of this patch comparing the encoding with a -1 literal. 
How about adding -1 as a value to the enum? See the example below for a 
place where the patch compares with -1.


ereport(NOTICE,
(errcode(ERRCODE_DUPLICATE_OBJECT),
-errmsg("collation \"%s\" for encoding \"%s\" already 
exists, skipping",

-   collname, pg_encoding_to_char(collencoding;
+collencoding == -1
+? errmsg("collation \"%s\" already exists, skipping",
+

Re: [HACKERS] ICU integration

2017-02-20 Thread Peter Geoghegan

On Mon, Feb 20, 2017 at 3:51 PM, Bruce Momjian  wrote:
> So we don't have any other cases where we warn about possible corruption
> except this?

I'm not sure that I understand the distinction you're making.

> Also, I will go back to my previous concern, that while I like the fact
> we can detect collation changes with ICU, we don't know if ICU
> collations change more often than OS collations.

We do know that ICU collations can never change behaviorally in a
given stable release. Bug fixes are allowed in point releases, but
these never change the user-visible behavior of collations. That's
very clear, because an upstream Unciode UCA version is used by a given
major release of ICU. This upstream data describes the behavior of a
collation using a high-level declarative language, that non-programmer
experts in natural languages write.

ICU versions many different things, in fact. Importantly, it
explicitly decouples behavioral issues (user visible sort order -- UCA
version) from technical issues (collator implementation details). So,
my original point is that that could change, and if that happens we
ought to have a plan. But, it won't change unless it really has to.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-20 Thread Bruce Momjian

On Mon, Feb 20, 2017 at 03:29:07PM -0800, Peter Geoghegan wrote:
> Marking all indexes as invalid would be an enormous overkill. We don't
> even know for sure that the values the user has indexed happens to be
> affected. In order for there to have been a bug in ICU in the first
> place, the likelihood is that it only occurs in what are edge cases
> for the collator.
> 
> ICU is a very popular library, used in software that I personally
> interact with every day [1]. Any bugs like this should be exceptional.
> They very clearly appreciate how sensitive software like Postgres is
> to changes like this, which is why the versioning API exists.
> 
> [1] http://site.icu-project.org/#TOC-Who-Uses-ICU-

So we don't have any other cases where we warn about possible corruption
except this?

Also, I will go back to my previous concern, that while I like the fact
we can detect collation changes with ICU, we don't know if ICU
collations change more often than OS collations.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-20 Thread Peter Geoghegan

On Mon, Feb 20, 2017 at 3:19 PM, Bruce Momjian  wrote:
> Well, the release notes are a clear-enough communication for a need to
> reindex.  I don't see a LOG message as similar.  Don't we have other
> cases where we have to warn administrators?  We can mark the indexes as
> invalid but how would we report that?

Marking all indexes as invalid would be an enormous overkill. We don't
even know for sure that the values the user has indexed happens to be
affected. In order for there to have been a bug in ICU in the first
place, the likelihood is that it only occurs in what are edge cases
for the collator.

ICU is a very popular library, used in software that I personally
interact with every day [1]. Any bugs like this should be exceptional.
They very clearly appreciate how sensitive software like Postgres is
to changes like this, which is why the versioning API exists.

[1] http://site.icu-project.org/#TOC-Who-Uses-ICU-
-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-20 Thread Bruce Momjian

On Mon, Feb 20, 2017 at 02:54:22PM -0800, Peter Geoghegan wrote:
> On Mon, Feb 20, 2017 at 2:38 PM, Bruce Momjian  wrote:
> > I can't think of any cases where we warn of possible corruption only in
> > the server logs.
> 
> It's just like any case where there was a bug in Postgres that could
> have subtly broken index builds. We don't make the next point release
> force users to REINDEX every possibly-affected index every time that
> happens. Presumably this is because they aren't particularly likely to
> be affected by any given bug, and because it's pretty infeasible for
> us to do so anyway. There aren't always easy ways to detect the
> problem, and users will never learn to deal with issues like this well
> when it is by definition something that should never happen.

Well, the release notes are a clear-enough communication for a need to
reindex.  I don't see a LOG message as similar.  Don't we have other
cases where we have to warn administrators?  We can mark the indexes as
invalid but how would we report that?

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-20 Thread Peter Geoghegan

On Mon, Feb 20, 2017 at 2:38 PM, Bruce Momjian  wrote:
> I can't think of any cases where we warn of possible corruption only in
> the server logs.

It's just like any case where there was a bug in Postgres that could
have subtly broken index builds. We don't make the next point release
force users to REINDEX every possibly-affected index every time that
happens. Presumably this is because they aren't particularly likely to
be affected by any given bug, and because it's pretty infeasible for
us to do so anyway. There aren't always easy ways to detect the
problem, and users will never learn to deal with issues like this well
when it is by definition something that should never happen.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-20 Thread Bruce Momjian

On Wed, Feb 15, 2017 at 10:35:34PM -0800, Peter Geoghegan wrote:
> On Wed, Feb 15, 2017 at 9:17 PM, Peter Eisentraut
>  wrote:
> > Stable operating systems shouldn't do major library upgrades, so I feel
> > pretty confident about this.
> 
> There is one case where it *is* appropriate for a bugfix release of
> ICU to bump collator version: When there was a bug in the collator
> itself, leading to broken binary blobs [1]. We should make the user
> REINDEX when this happens.
> 
> I think that ICU may well do this in point releases, which is actually
> a good thing. So, it seems like a good idea to test that there is no
> change, in a place like check_strxfrm_bug(). In my opinion, we should
> LOG a complaint in the event of a mismatch rather than refusing to
> start the server, since there probably isn't that much chance of the
> user being harmed by the bug. The cost of not starting the server
> properly until a REINDEX finishes or similar looks particularly
> unappealing when one considers that the app was probably affected by
> any corruption for weeks or months before the ICU update enabled its
> detection.
> http://www.postgresql.org/mailpref/pgsql-hackers

I can't think of any cases where we warn of possible corruption only in
the server logs.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+  Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-16 Thread Peter Eisentraut

On 2/16/17 01:13, Peter Geoghegan wrote:
> On Wed, Feb 15, 2017 at 9:17 PM, Peter Eisentraut
>  wrote:
>> I have changed it to use ucol_nextSortKeyPart() when appropriate.
> 
> I think it would be fine if the second last argument was
> "sizeof(Datum)", not "Min(sizeof(Datum), sss->buflen2)" here:

I don't see anything that guarantees that the buffer length is always at
least sizeof(Datum).

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-16 Thread Alvaro Herrera

Peter Eisentraut wrote:

> - Naming of collations:  Are we happy with the "de%icu" naming?  I might
> have come up with that while reviewing the IPv6 zone index patch. ;-)
> An alternative might be "de$icu" for more Oracle vibe and avoiding the
> need for double quotes in some cases.  (But we have mixed-case names
> like "de_AT%icu", so further changes would be necessary to fully get rid
> of the need for quoting.)  A more radical alternative would be to
> install ICU locales in a different schema and use the search_path, or
> even have a separate search path setting for collations only.  Which
> leads to ...
> 
> - Selecting default collation provider:  Maybe we want a setting, say in
> initdb, to determine which provider's collations get the "good" names?
> Maybe not necessary for this release, but something to think about.

I'm not sure I like "default collations" to depend on either search_path
or some hypothetical future setting.  That seems to lead to unproductive
discussions on mailing list questions,

User: hey, my sort doesn't work sanely
Pg person: okay, but what collation are you using?  Look for the collate
clause
User: Ah, it's de_DE
Pg person: oh, okay, but is it ICU's de_DE or the libc's?
User: ...

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-15 Thread Peter Geoghegan

On Wed, Feb 15, 2017 at 9:17 PM, Peter Eisentraut
 wrote:
>> Clearly when you upgrade your system from (say) Debian 8 to Debian 9
>> and the ICU major version changes we expect to have to REINDEX, but
>> does anyone know if such data changes are ever pulled into the minor
>> version package upgrades you get from regular apt-get update of (say)
>> a Debian 8 or CentOS 7 or FreeBSD 11 system?  In other words, do we
>> expect collversion changes to happen basically any time in response to
>> regular system updates, or only when you're doing a major upgrade of
>> some kind, as the above-quoted documentation patch implies?
>
> Stable operating systems shouldn't do major library upgrades, so I feel
> pretty confident about this.

There is one case where it *is* appropriate for a bugfix release of
ICU to bump collator version: When there was a bug in the collator
itself, leading to broken binary blobs [1]. We should make the user
REINDEX when this happens.

I think that ICU may well do this in point releases, which is actually
a good thing. So, it seems like a good idea to test that there is no
change, in a place like check_strxfrm_bug(). In my opinion, we should
LOG a complaint in the event of a mismatch rather than refusing to
start the server, since there probably isn't that much chance of the
user being harmed by the bug. The cost of not starting the server
properly until a REINDEX finishes or similar looks particularly
unappealing when one considers that the app was probably affected by
any corruption for weeks or months before the ICU update enabled its
detection.

[1] http://userguide.icu-project.org/collation/api#TOC-Sort-Key-Features
-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-15 Thread Peter Geoghegan

On Wed, Feb 15, 2017 at 9:17 PM, Peter Eisentraut
 wrote:
> I have changed it to use ucol_nextSortKeyPart() when appropriate.

I think it would be fine if the second last argument was
"sizeof(Datum)", not "Min(sizeof(Datum), sss->buflen2)" here:

> +   if (GetDatabaseEncoding() == PG_UTF8)
> +   {
> +   UCharIterator iter;
> +   uint32_tstate[2];
> +   UErrorCode  status;
> +
> +   uiter_setUTF8(, sss->buf1, len);
> +   state[0] = state[1] = 0;  /* won't need that again */
> +   status = U_ZERO_ERROR;
> +   bsize = ucol_nextSortKeyPart(sss->locale->info.icu.ucol,
> +,
> +state,
> +(uint8_t *) sss->buf2,
> +Min(sizeof(Datum), 
> sss->buflen2),
> +);
> +   if (U_FAILURE(status))
> +   ereport(ERROR,
> +   (errmsg("sort key generation failed: %s", 
> u_errorName(status;
> +   }

Does this really need to be in the strxfrm() loop at all? I don't see
why you need to ever retry here.

There is an issue of style here:

> -   /* Just like strcoll(), strxfrm() expects a NUL-terminated string */
> memcpy(sss->buf1, authoritative_data, len);
> +   /* Just like strcoll(), strxfrm() expects a NUL-terminated string.
> +* Not necessary for ICU, but doesn't hurt. */

I see that you have preserved strcoll() comparison caching (or should
I say ucol_strcollUTF8() comparison caching?), at the cost of having
to keep around a buffer which we must continue to copy every text
string into within varstr_abbrev_convert(). That was probably the
right trade-off. It doesn't hurt that it required minimal divergence
within new ICU code, too.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-15 Thread Peter Eisentraut

On 1/31/17 16:59, Thomas Munro wrote:
> I assume (but haven't checked) that ucol_nextSortKeyPart accesses only
> the start of the string via the UCharIterator passed in, unless you
> have the rare reverse-accent-sort feature enabled (maybe used only in
> fr_CA, it looks like it is required to scan the whole string looking
> for the last accent).  So I assume that uiter_setUTF8 and
> ucol_nextSortKeyPart would usually do a small fixed amount of work,
> whereas this patch's icu_to_uchar allocates space and converts the
> whole variable length string every time.

I have changed it to use ucol_nextSortKeyPart() when appropriate.

> That's about abbreviation, but I note that you can also compare
> strings using iterators with ucol_strcollIter, avoiding the need to
> allocate and transcode up front.  I have no idea whether that'd pay
> off.

These iterators don't actually transcode on the fly.  You can only set
up iterators for UTF-8 or UTF-16 strings.  And for UTF-8, we already
have a special code path using ucol_strcollUTF8(), which uses an
interator internally.

> Clearly when you upgrade your system from (say) Debian 8 to Debian 9
> and the ICU major version changes we expect to have to REINDEX, but
> does anyone know if such data changes are ever pulled into the minor
> version package upgrades you get from regular apt-get update of (say)
> a Debian 8 or CentOS 7 or FreeBSD 11 system?  In other words, do we
> expect collversion changes to happen basically any time in response to
> regular system updates, or only when you're doing a major upgrade of
> some kind, as the above-quoted documentation patch implies?

Stable operating systems shouldn't do major library upgrades, so I feel
pretty confident about this.

> 
> +   collversion = ntohl(*((uint32 *) versioninfo));
> 
> UVersionInfo is an array of four uint8_t.  That doesn't sound like
> something that needs to be bit-swizzled... is it really?  Even if it
> is arranged differently on different architectures, I'm not sure why
> you care since we only ever use it to compare for equality on the same
> system.  But aside from that, I don't love this cast to uint32.  I
> wonder if we should use u_versionToString to respect boundaries a bit
> better?

Makes sense, changed the column type to text.

> I have another motivation for wanting to model collation versions as
> strings: I have been contemplating a version check for system-provided
> collations too, piggy-backing on your work here.  Obviously PostgreSQL
> can't directly know anything about system collation versions, but I'm
> thinking of a GUC system_collation_version_command which you could
> optionally set to a script that knows how to inspect the local
> operating system.  For example, a package maintainer might be
> interested in writing such a script that knows how to ask the package
> system for a locale data version.  A brute force approach that works
> on many systems could be 'tar c /usr/share/local/*/LC_COLLATE | md5'.

That sounds like an idea worth pursuing.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-09 Thread Peter Geoghegan

On Thu, Feb 9, 2017 at 7:58 AM, Peter Eisentraut
 wrote:
> I think I have this sorted out.  What I don't understand, however, is
> why varstr_abbrev_convert() makes a point of looping until the result of
> strxfrm() fits into the output buffer (buf2), when we only need 8 bytes,
> and we throw away the rest later.  Wouldn't it work to just request 8 bytes?

Maybe. We do that because strxfrm() is not required by the standard to
produce well defined contents for the buffer when the return value
indicates that it didn't fit entirely. This is a standard idiom, I
think.

> If there is a problem with just requesting 8 bytes, then I'm wondering
> how this would affect the ICU code branch.

This must be fine with ICU's ucol_nextSortKeyPart(), because it is
designed for the express purpose of producing only a few bytes of the
final blob at a time.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-09 Thread Peter Eisentraut

On 1/24/17 12:45 PM, Peter Eisentraut wrote:
> On 1/9/17 3:45 PM, Peter Geoghegan wrote:
>> * I think it's worth looking into ucol_nextSortKeyPart(), and using
>> that as an alternative to ucol_getSortKey(). It doesn't seem any
>> harder, and when I tested it it was clearly faster. (I think that
>> ucol_nextSortKeyPart() is more or less intended to be used for
>> abbreviated keys.)
> I will try to look into that.

I think I have this sorted out.  What I don't understand, however, is
why varstr_abbrev_convert() makes a point of looping until the result of
strxfrm() fits into the output buffer (buf2), when we only need 8 bytes,
and we throw away the rest later.  Wouldn't it work to just request 8 bytes?

If there is a problem with just requesting 8 bytes, then I'm wondering
how this would affect the ICU code branch.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-01 Thread Tom Lane

I wrote:
> Evidently collateral damage from 2f5c9d9c9.  But I'd suggest waiting
> to fix it until you can also do s/simple_heap_delete/CatalogTupleDelete/
> as I proposed in
> https://www.postgresql.org/message-id/462.1485902...@sss.pgh.pa.us
> I'll go make that happen right now, now that I realize there are patch(es)
> waiting on it.

Done.  There's some more loose ends but they won't affect very many
call sites, so you should be able to rebase now.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-01 Thread Tom Lane

Pavel Stehule  writes:
> This patch is not possible to compile on today master
> commands/collationcmds.o: In function `AlterCollation':
> /home/pavel/src/postgresql/src/backend/commands/collationcmds.c:297:
> undefined reference to `CatalogUpdateIndexes'

Evidently collateral damage from 2f5c9d9c9.  But I'd suggest waiting
to fix it until you can also do s/simple_heap_delete/CatalogTupleDelete/
as I proposed in
https://www.postgresql.org/message-id/462.1485902...@sss.pgh.pa.us

I'll go make that happen right now, now that I realize there are patch(es)
waiting on it.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-02-01 Thread Pavel Stehule

Hi

2017-01-24 18:44 GMT+01:00 Peter Eisentraut <
peter.eisentr...@2ndquadrant.com>:

> On 1/15/17 5:53 AM, Pavel Stehule wrote:
> > the regress test fails
> >
> >  Program received signal SIGSEGV, Segmentation fault.
> > 0x007bbc2b in pattern_char_isalpha (locale_is_c=0 '\000',
> > locale=0x1a73220, is_multibyte=1 '\001', c=97 'a') at selfuncs.c:5291
> > 5291return isalpha_l((unsigned char) c, locale->lt);
>
> Here is an updated patch that fixes this crash and is rebased on top of
> recent changes.
>

This patch is not possible to compile on today master

commands/collationcmds.o: In function `AlterCollation':
/home/pavel/src/postgresql/src/backend/commands/collationcmds.c:297:
undefined reference to `CatalogUpdateIndexes'
collect2: error: ld returned 1 exit status

Regards

Pavel



>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

Re: [HACKERS] ICU integration

2017-01-31 Thread Michael Paquier

On Wed, Jan 25, 2017 at 2:44 AM, Peter Eisentraut
 wrote:
> On 1/15/17 5:53 AM, Pavel Stehule wrote:
>> the regress test fails
>>
>>  Program received signal SIGSEGV, Segmentation fault.
>> 0x007bbc2b in pattern_char_isalpha (locale_is_c=0 '\000',
>> locale=0x1a73220, is_multibyte=1 '\001', c=97 'a') at selfuncs.c:5291
>> 5291return isalpha_l((unsigned char) c, locale->lt);
>
> Here is an updated patch that fixes this crash and is rebased on top of
> recent changes.

Patch that still applies + no reviews = moved to CF 2017-03.
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-01-31 Thread Thomas Munro

On Tue, Jan 10, 2017 at 9:45 AM, Peter Geoghegan  wrote:
> * I think it's worth looking into ucol_nextSortKeyPart(), and using
> that as an alternative to ucol_getSortKey(). It doesn't seem any
> harder, and when I tested it it was clearly faster. (I think that
> ucol_nextSortKeyPart() is more or less intended to be used for
> abbreviated keys.)

+1

I assume (but haven't checked) that ucol_nextSortKeyPart accesses only
the start of the string via the UCharIterator passed in, unless you
have the rare reverse-accent-sort feature enabled (maybe used only in
fr_CA, it looks like it is required to scan the whole string looking
for the last accent).  So I assume that uiter_setUTF8 and
ucol_nextSortKeyPart would usually do a small fixed amount of work,
whereas this patch's icu_to_uchar allocates space and converts the
whole variable length string every time.

On the other hand, I see that this patch's icu_to_uchar can deal with
encodings other than UTF8.  At first glance, the UCharIterator
interface provides a way to make a UCharIterator that iterates over a
string of UChar or a string of UTF8, but I'm not sure how to make it
do character-by-character transcoding from arbitrary encodings via the
C API.  It seems like that should be possible using ucnv_getNextUChar
as the source of transcoded characters, but I'm not sure how to wire
those two things together (in C++ I think you'd subclass
icu::CharacterIterator).

That's about abbreviation, but I note that you can also compare
strings using iterators with ucol_strcollIter, avoiding the need to
allocate and transcode up front.  I have no idea whether that'd pay
off.

+   A change in collation definitions can lead to corrupt indexes and other
+   problems where the database system relies on stored objects having a
+   certain sort order.  Generally, this should be avoided, but it can happen
+   in legitimate circumstances, such as when
+   using pg_upgrade to upgrade to server binaries linked
+   with a newer version of ICU.  When this happens, all objects depending on
+   the collation should be rebuilt, for example,
+   using REINDEX.  When that is done, the collation version
+   can be refreshed using the command ALTER COLLATION ... REFRESH
+   VERSION.  This will update the system catalog to record the
+   current collator version and will make the warning go away.  Note that this
+   does not actually check whether all affected objects have been rebuilt

I think this is a pretty reasonable first approach to this problem.
It's a simple way to flag up a problem to the DBA, but leaves all the
responsibility for figuring out how to fix it to the DBA.  I think we
should considering going further in later patches (tracking the
version used at last rebuild per index etc as discussed, so that the
condition is cleared only by rebuilding the affected things).

(REPARTITION anyone?)

As far as I know there are two things moving: ICU code and CLDR data.
Here we can see the CLDR versions being pulled into ICU:

http://bugs.icu-project.org/trac/log/icu/trunk/source/data/locales?rev=39273

Clearly when you upgrade your system from (say) Debian 8 to Debian 9
and the ICU major version changes we expect to have to REINDEX, but
does anyone know if such data changes are ever pulled into the minor
version package upgrades you get from regular apt-get update of (say)
a Debian 8 or CentOS 7 or FreeBSD 11 system?  In other words, do we
expect collversion changes to happen basically any time in response to
regular system updates, or only when you're doing a major upgrade of
some kind, as the above-quoted documentation patch implies?

+   collversion = ntohl(*((uint32 *) versioninfo));

UVersionInfo is an array of four uint8_t.  That doesn't sound like
something that needs to be bit-swizzled... is it really?  Even if it
is arranged differently on different architectures, I'm not sure why
you care since we only ever use it to compare for equality on the same
system.  But aside from that, I don't love this cast to uint32.  I
wonder if we should use u_versionToString to respect boundaries a bit
better?

I have another motivation for wanting to model collation versions as
strings: I have been contemplating a version check for system-provided
collations too, piggy-backing on your work here.  Obviously PostgreSQL
can't directly know anything about system collation versions, but I'm
thinking of a GUC system_collation_version_command which you could
optionally set to a script that knows how to inspect the local
operating system.  For example, a package maintainer might be
interested in writing such a script that knows how to ask the package
system for a locale data version.  A brute force approach that works
on many systems could be 'tar c /usr/share/local/*/LC_COLLATE | md5'.
A string would provide more leeway than a measly int32.  That's
independent of this patch and you might hate the whole idea, but seems
to be the kind of thing you anticipated when you

Re: [HACKERS] ICU integration

2017-01-24 Thread Peter Eisentraut

On 1/9/17 3:45 PM, Peter Geoghegan wrote:
> * I think it's worth looking into ucol_nextSortKeyPart(), and using
> that as an alternative to ucol_getSortKey(). It doesn't seem any
> harder, and when I tested it it was clearly faster. (I think that
> ucol_nextSortKeyPart() is more or less intended to be used for
> abbreviated keys.)

I will try to look into that.

> * I think that it's not okay that convert_string_datum() still uses
> strxfrm() without considering if it's an ICU build. That's why I
> raised the idea of a pg_strxfrm() wrapper at one point.

That code works in a locale-agnostic way now, which might be
questionable, but it's not the fault of this patch, I think.

> * Similarly, I think that check_strxfrm_bug() should have something
> about ICU. It's looking for a particular bug in some very old version
> of Solaris 8. At a minimum, check_strxfrm_bug() should now not run at
> all (a broken OS strxfrm() shouldn't be a problem with ICU).

But we'll still be using strxfrm for the database default locale, so we
need to keep that check.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-01-15 Thread Pavel Stehule

Hi

2016-12-28 3:50 GMT+01:00 Peter Eisentraut :

> Updated patch attached.
>
> The previous round of reviews showed that there was general agreement on
> the approach.  So I have focused on filling in the gaps, added ICU
> support to all the locale-using places, added documentation, fixed some
> minor issues that have been pointed out.  Everything appears to work
> correctly now, and the user-facing feature set is pretty well-rounded.
>
> I don't have much experience with the abbreviated key stuff.  I have
> filled in what I think should work, but it needs detailed review.
>
> Similarly, some of the stuff in the regular expression code was hacked
> in blindly.
>
> One minor problem is that we now support adding any-encoding collation
> entries, which violates some of the comments in CollationCreate().  See
> FIXME there.  It doesn't seem worth major contortions to fix that; maybe
> it just has to be documented better.
>

the regress test fails

 Program received signal SIGSEGV, Segmentation fault.
0x007bbc2b in pattern_char_isalpha (locale_is_c=0 '\000',
locale=0x1a73220, is_multibyte=1 '\001', c=97 'a') at selfuncs.c:5291
5291 return isalpha_l((unsigned char) c, locale->lt);


(gdb) bt
#0  0x007bbc2b in pattern_char_isalpha (locale_is_c=0 '\000',
locale=0x1a73220, is_multibyte=1 '\001', c=97 'a') at selfuncs.c:5291
#1  like_fixed_prefix (patt_const=,
case_insensitive=, collation=,
prefix_const=0x7ffc0963e800,
rest_selec=0x7ffc0963e808) at selfuncs.c:5389
#2  0x007c1076 in patternsel (fcinfo=,
ptype=ptype@entry=Pattern_Type_Like_IC, negate=negate@entry=0 '\000')
at selfuncs.c:1228
#3  0x007c1680 in iclikesel (fcinfo=) at
selfuncs.c:1406
#4  0x0080db56 in OidFunctionCall4Coll (functionId=,
collation=collation@entry=12886, arg1=arg1@entry=28299032,
arg2=arg2@entry=1627, arg3=arg3@entry=28300096, arg4=arg4@entry=0) at
fmgr.c:1674
#5  0x0068e424 in restriction_selectivity (root=root@entry=0x1afcf18,
operatorid=1627, args=0x1afd340, inputcollid=12886,
varRelid=varRelid@entry=0) at plancat.c:1583
#6  0x0065457e in clause_selectivity (root=0x1afcf18,
clause=, varRelid=0, jointype=JOIN_INNER, sjinfo=0x0)
at clausesel.c:657
#7  0x0065485c in clauselist_selectivity (root=root@entry=0x1afcf18,
clauses=, varRelid=varRelid@entry=0,
jointype=jointype@entry=JOIN_INNER, sjinfo=sjinfo@entry=0x0) at
clausesel.c:107
#8  0x006599d4 in set_baserel_size_estimates
(root=root@entry=0x1afcf18,
rel=rel@entry=0x1afd500) at costsize.c:3771
#9  0x006526e5 in set_plain_rel_size (rte=,
rel=, root=) at allpaths.c:492
#10 set_rel_size (root=root@entry=0x1afcf18, rel=rel@entry=0x1afd500,
rti=rti@entry=1, rte=0x1acfb68) at allpaths.c:352
#11 0x00653ebd in set_base_rel_sizes (root=) at
allpaths.c:272
#12 make_one_rel (root=root@entry=0x1afcf18, joinlist=joinlist@entry=0x1afd810)
at allpaths.c:170
#13 0x00670ad4 in query_planner (root=root@entry=0x1afcf18,
tlist=tlist@entry=0x1afd1b0,
qp_callback=qp_callback@entry=0x6710c0 ,
qp_extra=qp_extra@entry=0x7ffc0963f020) at planmain.c:254
#14 0x006727cc in grouping_planner (root=root@entry=0x1afcf18,
inheritance_update=inheritance_update@entry=0 '\000',
tuple_fraction=, tuple_fraction@entry=0) at
planner.c:1729
#15 0x006752c6 in subquery_planner (glob=glob@entry=0x1afcbe8,
parse=parse@entry=0x1acfa50, parent_root=parent_root@entry=0x0,
hasRecursion=hasRecursion@entry=0 '\000',
tuple_fraction=tuple_fraction@entry=0) at planner.c:789
#16 0x0067619f in standard_planner (parse=0x1acfa50,
cursorOptions=256, boundParams=0x0) at planner.c:301
#17 0x007095bd in pg_plan_query (querytree=0x1acfa50,
cursorOptions=256, boundParams=0x0) at postgres.c:798
#18 0x0070968e in pg_plan_queries (querytrees=,
cursorOptions=cursorOptions@entry=256, boundParams=boundParams@entry=0x0)
at postgres.c:864
#19 0x0070b1e7 in exec_simple_query (query_string=0x1ace8c8 "SELECT
* FROM collate_test1 WHERE b ILIKE 'abc';") at postgres.c:1029
#20 PostgresMain (argc=, argv=argv@entry=0x1a78988,
dbname=, username=) at postgres.c:4067
#21 0x0046fc7d in BackendRun (port=0x1a6e6e0) at postmaster.c:4300
#22 BackendStartup (port=0x1a6e6e0) at postmaster.c:3972

[root@localhost backend]# dnf info libicu
Last metadata expiration check: 1:50:20 ago on Sun Jan 15 09:58:29 2017.
Installed Packages
Name: libicu
Arch: x86_64
Epoch   : 0
Version : 57.1
Release : 4.fc25
Size: 29 M
Repo: @System

Regards

Pavel



> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers
>
>

Re: [HACKERS] ICU integration

2017-01-09 Thread Peter Geoghegan

On Tue, Dec 27, 2016 at 6:50 PM, Peter Eisentraut
 wrote:
> Updated patch attached.

Some more things I noticed following another quick read over the patch:

* I think it's worth looking into ucol_nextSortKeyPart(), and using
that as an alternative to ucol_getSortKey(). It doesn't seem any
harder, and when I tested it it was clearly faster. (I think that
ucol_nextSortKeyPart() is more or less intended to be used for
abbreviated keys.)

* I think that it's not okay that convert_string_datum() still uses
strxfrm() without considering if it's an ICU build. That's why I
raised the idea of a pg_strxfrm() wrapper at one point.

* Similarly, I think that check_strxfrm_bug() should have something
about ICU. It's looking for a particular bug in some very old version
of Solaris 8. At a minimum, check_strxfrm_bug() should now not run at
all (a broken OS strxfrm() shouldn't be a problem with ICU).
Otherwise, it could do some kind of testing on our pg_strxfrm()
wrapper (or similar).

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-01-09 Thread Peter Geoghegan

On Mon, Jan 9, 2017 at 12:25 PM, Peter Eisentraut
 wrote:
> On 1/7/17 10:01 PM, Peter Geoghegan wrote:
>> It occurs to me that the comparison caching stuff added by commit
>> 0e57b4d8b needs to be considered here, too.

> That might be worth looking into, but it seems a bit daunting to
> construct a benchmark specifically for this, unless we have the one that
> was originally used lying around somewhere.

The benchmark used when 0e57b4d8b went in only had to prove that there
was no measurable overhead when the optimization didn't help (It was
quite clear that it was worthwhile in good cases). I think that
comparison caching will continue to be about as effective as before in
good cases, but you don't do comparison caching anymore. That might be
fine, but let's be sure that that's the right trade-off.

To demonstrate the effectiveness of the patch, I used this cities sample data:

http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/data/cities.dump

Test query: select country, province, count(*) from cities group by
rollup (country, province);

This was shown to be about 25% faster, although that was with
abbreviated keys (plus caching of abbreviated keys), and not just the
comparison caching optimization.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-01-09 Thread Peter Eisentraut

On 1/7/17 10:01 PM, Peter Geoghegan wrote:
> It occurs to me that the comparison caching stuff added by commit
> 0e57b4d8b needs to be considered here, too. When we had to copy the
> string to a temp buffer anyway, in order to add the terminating NUL
> byte expected by strcoll(), there was an opportunity to do caching of
> comparisons at little additional cost. However, since ICU offers an
> interface that you're using that doesn't require any NUL byte, there
> is a new trade-off to be considered -- swallow the cost of copying
> into our own temp buffer solely for the benefit of comparison caching,
> or don't do comparison caching. (Note that glibc had a similar
> comparison caching optimization itself at one point, built right into
> strcoll(), but it was subsequently disabled.)

That might be worth looking into, but it seems a bit daunting to
construct a benchmark specifically for this, unless we have the one that
was originally used lying around somewhere.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2017-01-07 Thread Peter Geoghegan

On Tue, Dec 27, 2016 at 6:50 PM, Peter Eisentraut
 wrote:
> I don't have much experience with the abbreviated key stuff.  I have
> filled in what I think should work, but it needs detailed review.

Thanks.

It occurs to me that the comparison caching stuff added by commit
0e57b4d8b needs to be considered here, too. When we had to copy the
string to a temp buffer anyway, in order to add the terminating NUL
byte expected by strcoll(), there was an opportunity to do caching of
comparisons at little additional cost. However, since ICU offers an
interface that you're using that doesn't require any NUL byte, there
is a new trade-off to be considered -- swallow the cost of copying
into our own temp buffer solely for the benefit of comparison caching,
or don't do comparison caching. (Note that glibc had a similar
comparison caching optimization itself at one point, built right into
strcoll(), but it was subsequently disabled.)

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-10-12 Thread Peter Eisentraut

On 9/30/16 4:32 PM, Thomas Munro wrote:
>> Hmm, yeah, that will need more work.  To be completely correct, I think,
>> > we'd also need to record the version in each expression node, so that
>> > check constraints of the form CHECK (x > 'abc') can be handled.
> Hmm.  That is quite a rabbit hole.  In theory you need to recheck such
> a constraint, but it's not at all clear when you should recheck and
> what you should do about it if it fails.  Similar for the future
> PARTITION feature.

I think it's not worth dealing with this in that much detail at the
moment.  It's not like the collation will just randomly change under you
(unlike with glibc).  It would have to involve pg_upgrade, physical
replication, or a rebuilt installation.  So I think I will change the
message to something to the effect of "however you got here, you can't
do that".  We can develop some recipes and ideas on the side for how to
recover situations like that and then maybe integrate tooling for that
later.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-30 Thread Thomas Munro

On Sat, Oct 1, 2016 at 3:30 AM, Peter Eisentraut
 wrote:
> On 9/23/16 2:27 AM, Thomas Munro wrote:
>> The warning persists even after I reindex all affected tables (and
>> start a new backend), because you're only recording the collation at
>> pg_collation level and then only setting it at initdb time, so that
>> HINT isn't much use for clearing the warning.  I think you'd need to
>> record a snapshot of the version used for each collation that was used
>> on *each index*, and update it whenever you CREATE INDEX or REINDEX
>> etc.  There can of course be more than one collation and thus version
>> involved, if you have columns with different collations, so I guess
>> you'd need an column to hold an array of version snapshots whose order
>> corresponds to pg_index.indcollation's.
>
> Hmm, yeah, that will need more work.  To be completely correct, I think,
> we'd also need to record the version in each expression node, so that
> check constraints of the form CHECK (x > 'abc') can be handled.

Hmm.  That is quite a rabbit hole.  In theory you need to recheck such
a constraint, but it's not at all clear when you should recheck and
what you should do about it if it fails.  Similar for the future
PARTITION feature.

-- 
Thomas Munro
http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-30 Thread Peter Eisentraut

On 9/23/16 2:27 AM, Thomas Munro wrote:
> This patch adds pkg-config support to our configure script, in order
> to produce the build options for ICU.  That's cool, and I'm a fan of
> pkg-config, but it's an extra dependency that I just wanted to
> highlight.  For example MacOSX appears to ship with ICU but has is no
> pkg-config or ICU .pc files out of the box (erm, I can't even find the
> headers, so maybe that copy of ICU is useless to everyone except
> Apple).

The Homebrew package icu4c contains this note:

OS X provides libicucore.dylib (but nothing else).

That probably explains what you are seeing.

> There is something missing from the configure script however: the
> output of pkg-config is not making it into CFLAGS or LDFLAGS, so
> building fails on FreeBSD and MacOSX where for example
>  doesn't live in the default search path.

It sets ICU_CFLAGS and ICU_LIBS, but it seems I didn't put ICU_CFLAGS in
the backend makefiles.  So that should be easy to fix.

> You call the built-in strcoll/strxfrm collation provider "posix", and
> although POSIX does define those functions, it only does so because it
> inherits them from ISO C90.

POSIX defines strcoll_l() and such.  But I agree SYSTEM might be better.

> I notice that you use encoding -1, meaning "all".

The use of encoding -1 is existing behavior.

> I haven't fully
> grokked what that really means but it appears that we won't be able to
> create any new such collations after initdb using CREATE COLLATION, if
> for example you update your ICU installation and now have a new
> collation available.  When I tried dropping some and recreating them
> they finished up with collencoding = 6.  Should the initdb-created
> rows use 6 too?

Good observation.  That will need some fine-tuning.

> The warning persists even after I reindex all affected tables (and
> start a new backend), because you're only recording the collation at
> pg_collation level and then only setting it at initdb time, so that
> HINT isn't much use for clearing the warning.  I think you'd need to
> record a snapshot of the version used for each collation that was used
> on *each index*, and update it whenever you CREATE INDEX or REINDEX
> etc.  There can of course be more than one collation and thus version
> involved, if you have columns with different collations, so I guess
> you'd need an column to hold an array of version snapshots whose order
> corresponds to pg_index.indcollation's.

Hmm, yeah, that will need more work.  To be completely correct, I think,
we'd also need to record the version in each expression node, so that
check constraints of the form CHECK (x > 'abc') can be handled.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-28 Thread Thomas Munro

On Fri, Sep 23, 2016 at 6:27 PM, Thomas Munro
 wrote:
> On Wed, Aug 31, 2016 at 2:46 PM, Peter Eisentraut
>  wrote:
>> Here is a patch I've been working on to allow the use of ICU for sorting
>> and other locale things.
>
> This is very interesting work, and it's great to see some development
> in this area.  I've been peripherally involved in more than one
> collation-change-broke-my-data incident over the years.  I took the
> patch for a quick spin today.  Here are a couple of initial
> observations.

This seems like a solid start, but there are unresolved questions
about both high level goals (versioning strategy etc) and also some
technical details with this WIP patch.  It looks like several people
have an interest and ideas in this area, but clearly there isn't going
to be a committable patch in the next 48 hours.  So I will set this to
'Returned with Feedback' for now.  If you think you'll have a new
patch for the next CF then it looks like you can still 'Move to Next
CF' from 'Returned with Feedback' state if appropriate.  Thanks!

-- 
Thomas Munro
http://www.enterprisedb.com

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-24 Thread Thomas Munro

On Sat, Sep 24, 2016 at 10:13 PM, Peter Geoghegan  wrote:
> On Fri, Sep 23, 2016 at 7:27 AM, Thomas Munro
>  wrote:
>> It looks like varstr_abbrev_convert calls strxfrm unconditionally
>> (assuming TRUST_STRXFRM is defined).  This needs to
>> use ucol_getSortKey instead when appropriate.  It looks like it's a
>> bit more helpful than strxfrm about telling you the output buffer size
>> it wants, and it doesn't need nul termination, which is nice.
>> Unfortunately it is like strxfrm in that the output buffer's contents
>> is unspecified if it ran out of space.
>
> One can use the ucol_nextSortKeyPart() interface to just get the first
> 4/8 bytes of an abbreviated key, reducing the overhead somewhat, so
> the output buffer size limitation is probably irrelevant. The ICU
> documentation says something about this being useful for Radix sort,
> but I suspect it's more often used to generate abbreviated keys.
> Abbreviated keys were not my original idea. They're really just a
> standard technique.

Nice!  The other advantage of ucol_nextSortKeyPart is that you don't have
to convert the whole string to UChar (UTF16) first, as I think you would
need to with ucol_getSortKey, because the UCharIterator mechanism can read
directly from a UTF8 string.  I see in the documentation that
ucol_nextSortKeyPart and ucol_getSortKey don't have compatible output, and
this caveat may be related to whether sort key compression is used.  I
don't understand what sort of compression is involved but out of curiosity
I asked ICU to spit out some sort keys from ucol_nextSortKeyPart so I could
see their size.  As you say, we can ask it to stop at 4 or 8 bytes which is
very convenient for our purposes, but here I asked for more to get the full
output so I could see where the primary weight part ends.  The primary
weight took one byte for the Latin letters I tried and two for the Japanese
characters I tried (except 一 which was just 0xaa).

ucol_nextSortKeyPart(en_US, "a", ...) -> 29 01 05 01 05
ucol_nextSortKeyPart(en_US, "ab", ...) -> 29 2b 01 06 01 06
ucol_nextSortKeyPart(en_US, "abc", ...) -> 29 2b 2d 01 07 01 07
ucol_nextSortKeyPart(en_US, "abcd", ...) -> 29 2b 2d 2f 01 08 01 08
ucol_nextSortKeyPart(en_US, "A", ...) -> 29 01 05 01 dc
ucol_nextSortKeyPart(en_US, "AB", ...) -> 29 2b 01 06 01 dc dc
ucol_nextSortKeyPart(en_US, "ABC", ...) -> 29 2b 2d 01 07 01 dc dc dc
ucol_nextSortKeyPart(en_US, "ABCD", ...) -> 29 2b 2d 2f 01 08 01 dc dc dc dc
ucol_nextSortKeyPart(ja_JP, "一", ...) -> aa 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "一二", ...) -> aa d0 0f 01 06 01 06
ucol_nextSortKeyPart(ja_JP, "一二三", ...) -> aa d0 0f cb b8 01 07 01 07
ucol_nextSortKeyPart(ja_JP, "一二三四", ...) -> aa d0 0f cb b8 cb d5 01 08 01 08
ucol_nextSortKeyPart(ja_JP, "日", ...) -> d0 18 01 05 01 05
ucol_nextSortKeyPart(ja_JP, "日本", ...) -> d0 18 d1 d0 01 06 01 06
ucol_nextSortKeyPart(fr_FR, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_FR, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_FR, "coté", ...) -> 2d 45 4f 31 01 42 88 01 09
ucol_nextSortKeyPart(fr_FR, "côté", ...) -> 2d 45 4f 31 01 44 8e 44 88 01 0a
ucol_nextSortKeyPart(fr_CA, "cote", ...) -> 2d 45 4f 31 01 08 01 08
ucol_nextSortKeyPart(fr_CA, "côte", ...) -> 2d 45 4f 31 01 44 8e 06 01 09
ucol_nextSortKeyPart(fr_CA, "coté", ...) -> 2d 45 4f 31 01 88 08 01 09
ucol_nextSortKeyPart(fr_CA, "côté", ...) -> 2d 45 4f 31 01 88 44 8e 06 01 0a

I wonder how it manages to deal with fr_CA's reversed secondary weighting
rule which requires you to consider diacritics in reverse order --
apparently abandoned in France but still used in Canada -- using a fixed
size space for state between calls.

-- 
Thomas Munro
http://www.enterprisedb.com

Re: [HACKERS] ICU integration

2016-09-24 Thread Peter Geoghegan

On Fri, Sep 23, 2016 at 7:27 AM, Thomas Munro
 wrote:
> A couple of thoughts about abbreviated keys:
>
> #ifndef TRUST_STRXFRM
> if (!collate_c)
> abbreviate = false;
> #endif
>
> I think this macro should affect only strxfrm, and we should trust
> ucol_getSortKey or disable it independently.  ICU's manual says
> reassuring things like "Sort keys are most useful in databases" and
> "Sort keys are generally only useful in databases or other
> circumstances where function calls are extremely expensive".

+1. Abbreviated keys are essential to get competitive performance
while sorting text, and the fact that ICU makes them safe to
reintroduce is a major advantage of adopting ICU. Perhaps we should
consider wrapping strxfrm() instead, though, so that other existing
callers of strxfrm() (I'm thinking of convert_string_datum()) almost
automatically do the right thing. In other words, maybe there should
be a pg_strxfrm() or something, with TRUST_STRXFRM changed to be
something that can dynamically resolve whether or not it's a collation
managed by a trusted collation provider (this could only be resolved
at runtime, which I think is fine).

> It looks like varstr_abbrev_convert calls strxfrm unconditionally
> (assuming TRUST_STRXFRM is defined).  This needs to
> use ucol_getSortKey instead when appropriate.  It looks like it's a
> bit more helpful than strxfrm about telling you the output buffer size
> it wants, and it doesn't need nul termination, which is nice.
> Unfortunately it is like strxfrm in that the output buffer's contents
> is unspecified if it ran out of space.

One can use the ucol_nextSortKeyPart() interface to just get the first
4/8 bytes of an abbreviated key, reducing the overhead somewhat, so
the output buffer size limitation is probably irrelevant. The ICU
documentation says something about this being useful for Radix sort,
but I suspect it's more often used to generate abbreviated keys.
Abbreviated keys were not my original idea. They're really just a
standard technique.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-23 Thread Thomas Munro

On Wed, Aug 31, 2016 at 2:46 PM, Peter Eisentraut
 wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.

This is very interesting work, and it's great to see some development
in this area.  I've been peripherally involved in more than one
collation-change-broke-my-data incident over the years.  I took the
patch for a quick spin today.  Here are a couple of initial
observations.

This patch adds pkg-config support to our configure script, in order
to produce the build options for ICU.  That's cool, and I'm a fan of
pkg-config, but it's an extra dependency that I just wanted to
highlight.  For example MacOSX appears to ship with ICU but has is no
pkg-config or ICU .pc files out of the box (erm, I can't even find the
headers, so maybe that copy of ICU is useless to everyone except
Apple).  The MacPorts ICU port ships with .pc files, so that's easy to
deal with, and I don't expect it to be difficult to get a working
pkg-config and meta-data installed alongside ICU on any platform that
doesn't already ship with them.  It may well be useful for configuring
other packages.  (There is also other cool stuff that would become
possible once ICU is optionally around, off topic.)

There is something missing from the configure script however: the
output of pkg-config is not making it into CFLAGS or LDFLAGS, so
building fails on FreeBSD and MacOSX where for example
 doesn't live in the default search path.  I tried
very briefly to work out what but autoconfuse defeated me.

You call the built-in strcoll/strxfrm collation provider "posix", and
although POSIX does define those functions, it only does so because it
inherits them from ISO C90.  As far as I can tell Windows provides
those functions because it conforms to the C spec, not the POSIX spec,
though I suppose you could argue that in that respect it DOES conform
to the POSIX spec...  Also, we already have a collation called
"POSIX".  Maybe we should avoid confusing people with a "posix"
provider and a "POSIX" collation?  But then I'm not sure what else to
call it, but perhaps "system" as Petr Jelinek suggested[1], or "libc".

postgres=# select * from pg_collation where collname like 'en_NZ%';
┌──┬───┬───┬──┬──┬──┬──┬─┐
│ collname │ collnamespace │ collowner │ collprovider │
collencoding │   collcollate│collctype │ collversion │
├──┼───┼───┼──┼──┼──┼──┼─┤
│ en_NZ│11 │10 │ p│
6 │ en_NZ│ en_NZ│   0 │
│ en_NZ│11 │10 │ p│
8 │ en_NZ.ISO8859-1  │ en_NZ.ISO8859-1  │   0 │
│ en_NZ│11 │10 │ p│
   16 │ en_NZ.ISO8859-15 │ en_NZ.ISO8859-15 │   0 │
│ en_NZ.ISO8859-1  │11 │10 │ p│
8 │ en_NZ.ISO8859-1  │ en_NZ.ISO8859-1  │   0 │
│ en_NZ.ISO8859-15 │11 │10 │ p│
   16 │ en_NZ.ISO8859-15 │ en_NZ.ISO8859-15 │   0 │
│ en_NZ.UTF-8  │11 │10 │ p│
6 │ en_NZ.UTF-8  │ en_NZ.UTF-8  │   0 │
│ en_NZ%icu│11 │10 │ i│
   -1 │ en_NZ│ en_NZ│ -1724383232 │
└──┴───┴───┴──┴──┴──┴──┴─┘
(7 rows)

I notice that you use encoding -1, meaning "all".  I haven't fully
grokked what that really means but it appears that we won't be able to
create any new such collations after initdb using CREATE COLLATION, if
for example you update your ICU installation and now have a new
collation available.  When I tried dropping some and recreating them
they finished up with collencoding = 6.  Should the initdb-created
rows use 6 too?

+ ucol_getVersion(collator, versioninfo);
+ numversion = ntohl(*((uint32 *) versioninfo));
+
+ if (numversion != collform->collversion)
+ ereport(WARNING,
+ (errmsg("ICU collator version mismatch"),
+ errdetail("The database was created using version 0x%08X, the
library provides version 0x%08X.",
+   (uint32) collform->collversion, (uint32) numversion),
+ errhint("Rebuild affected indexes, or build PostgreSQL with the
right version of ICU.")));

I played around with bumping version numbers artificially.  That gives
each session that accesses the collation one warning:

postgres=# select * from foo order by id;
WARNING:  ICU collator version mismatch
DETAIL:  The database was created using version 0x99380001, the
library provides version 0x9938.
HINT:  Rebuild affected indexes, or build PostgreSQL with the right
version of ICU.
┌┐
│ id │
├┤
└┘
(0 rows)

postgres=# select * from foo

Re: [HACKERS] ICU integration

2016-09-22 Thread Tom Lane

Alvaro Herrera  writes:
> Petr Jelinek wrote:
>>> I'm not sure how well it will work to replace all the bits of LIKE and
>>> regular expressions with ICU API calls.  One problem is that ICU likes
>>> to do case folding as a whole string, not by character.  I need to do
>>> more research about that.  

>> Can't we use the regular expression api provided by ICU, or is that too
>> incompatible?

> That probably won't fly very well -- it would be rather absurd to have
> different regex behavior when ICU is compiled compared to when it's
> not.  Perhaps down the road we could have an extension that provides
> ICU-based regex operators, but most likely not in core.

The regex stuff does not handle case-insensitivity by case-folding the
data string, anyway.  Rather, it does that by converting 'a' or 'A'
in the pattern to the equivalent of '[aA]'.  So performance for that
step is not terribly critical.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-22 Thread Alvaro Herrera

Petr Jelinek wrote:

> > I'm not sure how well it will work to replace all the bits of LIKE and
> > regular expressions with ICU API calls.  One problem is that ICU likes
> > to do case folding as a whole string, not by character.  I need to do
> > more research about that.  
> 
> Can't we use the regular expression api provided by ICU, or is that too
> incompatible?

That probably won't fly very well -- it would be rather absurd to have
different regex behavior when ICU is compiled compared to when it's
not.  Perhaps down the road we could have an extension that provides
ICU-based regex operators, but most likely not in core.

-- 
Álvaro Herrerahttps://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-21 Thread Petr Jelinek

On 31/08/16 04:46, Peter Eisentraut wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.
>

Hi, first pass review of the patch (somewhat high level at this point).

First, there was some discussion about ICU versioning and collation
versioning and support for multiple versions of ICU at the same time. My
opinion on this is that the approach you have chosen is good, at least
for first iteration of the feature. We might want to support multiple
versions of the library in the future (I am not all that convinced of
that though), but we definitely don't at the moment. It is important to
have version checking against the data directory which you have done.

> 
> What I have done is extend collation objects with a collprovider column
> that tells whether the collation is using POSIX (appropriate name?) or
> ICU facilities.  The pg_locale_t type is changed to a struct that
> contains the provider-specific locale handles.  Users of locale
> information are changed to look into that struct for the appropriate
> handle to use.
> 

This sounds reasonable. I would prefer the POSIX to be named something
like SYSTEM though (think windows).

> In initdb, I initialize the default collation set as before from the
> `locale -a` output, but also add all available ICU locales with a "%icu"
> appended (so "fr_FR%icu").  I suppose one could create a configuration
> option perhaps in initdb to change the default so that, say, "fr_FR"
> uses ICU and "fr_FR%posix" uses the old stuff.

I wonder if we should have both %icu and %posix unconditionally and then
have option to pick one of them for the list of collations without the
suffix (defaulting to posix for now but maybe not in the future).

> 
> That all works well enough for named collations and for sorting.  The
> thread about the FreeBSD ICU patch discusses some details of how to best
> use the ICU APIs to do various aspects of the sorting, so I didn't focus
> on that too much.

That's something for follow-up patches IMHO.

> I'm not sure how well it will work to replace all the bits of LIKE and
> regular expressions with ICU API calls.  One problem is that ICU likes
> to do case folding as a whole string, not by character.  I need to do
> more research about that.  

Can't we use the regular expression api provided by ICU, or is that too
incompatible?

> Another problem, which was also previously
> discussed is that ICU does case folding in a locale-agnostic manner, so
> it does not consider things such as the Turkish special cases.  This is
> per Unicode standard modulo weasel wording, but it breaks existing tests
> at least.
> 

I am quite sure it will break existing usecases as well. Don't know if
that's an issue if we keep multiple locale providers though. It's
expected that you get different behavior from them.

> 
> Where it gets really interesting is what to do with the database
> locales.  They just set the global process locale.  So in order to port
> that to ICU we'd need to check every implicit use of the process locale
> and tweak it.  We could add a datcollprovider column or something.  But
> we also rely on the datctype setting to validate the encoding of the
> database.  Maybe we wouldn't need that anymore, but it sounds risky.

I think we'll need to separate the process locale and the locale used
for string operations.

> 
> We could have a datcollation column that by OID references a collation
> defined inside the database.  With a background worker, we can log into
> the database as it is being created and make adjustments, including
> defining or adjusting collation definitions.  This would open up
> interesting new possibilities.

I don't really follow this but it sounds icky.

> 
> What is a way to go forward here?  What's a minimal useful feature that
> is future-proof?  Just allow named collations referencing ICU for now?

I think that's minimal useful feature indeed, it will move us forward,
especially on systems where posix locales are not that well implemented
but will not be world changer just yet.

> Is throwing out POSIX locales even for the process locale reasonable?
> 

I don't think we can really do that, we'll need some kind of mapping
table between system locales and ICU locales (IIRC we build something
like that on windows as well). Maybe that mapping can be used for the
datctype to process locale conversion (datctype would then be provider
specific).

We also need to keep posix locale support anyway otherwise we'd break
pg_upgrade I think (at leas to the point that pg_upgrade where you have
to reindex whole database is much less feasible).

> Oh, that case folding code in formatting.c needs some refactoring.
> There are so many ifdefs there and it's repeated almost identically
> three times, it's crazy to work in that.

Yuck, yes it wasn't pretty before and it's quite unreadable now, I think
this patch will have to do something about that.

The message about collate provider version

Re: [HACKERS] ICU integration

2016-09-21 Thread Bruce Momjian

On Thu, Sep  8, 2016 at 12:19:39PM -0400, Peter Eisentraut wrote:
> On 9/8/16 11:16 AM, Tom Lane wrote:
> > This is a problem, if ICU won't guarantee cross-version compatibility,
> > because it destroys the argument that moving to ICU would offer us
> > collation behavior stability.
> 
> It would offer a significant upgrade over the current situation.
> 
> First, it offers stability inside the same version.  Whereas glibc might
> change a collation in a minor upgrade, ICU won't do that.  And the
> postgres binary is bound to a major version of ICU by the soname (which
> changes with every major release).  So this would avoid the situation
> that a simple OS update could break collations.

Uh, how do we know that ICU doesn't change collations in minor versions?
Couldn't we get into a case where the OS changes the ICU version or
collations more frequently than glibc does?  Seems that would be a
negative.

I don't see how having our binary bound to a ICU major version gives us
any benefit.

It seems we are still hostage to the OS version.

> Second, it offers a way to detect that something has changed.  With
> glibc, you don't know anything unless you read the source diffs.  With
> ICU, you can compare the collation version before and after and at least
> tell the user that they need to refresh indexes or whatever.

Yes, that is a clear win.

-- 
  Bruce Momjian  http://momjian.us
  EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-12 Thread Peter Geoghegan

On Fri, Sep 9, 2016 at 6:39 AM, Dave Page  wrote:
> Looking back at my old emails, apparently ICU 5.0 and later include
> ucol_strcollUTF8() which avoids the need to convert UTF-8 characters
> to 16 bit before sorting. RHEL 6 has the older 4.2 version of ICU.

At the risk of stating the obvious, there is a reason why ICU
traditionally worked with UTF-16 natively. It's the same reason why
many OSes and application frameworks (e.g., Java) use UTF-16
internally, even though UTF-8 is much more popular on the web. Which
is: there are certain low-level optimizations possible that are not
possible with UTF-8.

I'm not saying that it would be just as good if we were to not use the
UTF-8 optimized stuff that ICU now has. My point is that it's not
useful to prejudge whether or not performance will be acceptable based
on a factor like this, which is ultimately just an implementation
detail. The ICU patch either performs acceptably as a substitute for
something like glibc, or it does not.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-09 Thread Dave Page

On Fri, Sep 9, 2016 at 2:27 PM, Tom Lane  wrote:
> Dave Page  writes:
>> We needed a specific version that was newer than that shipped with
>> RHEL 6 (or in EPEL) iirc.
>
> Sure hope that's not true of the currently-proposed patch :-(

Looking back at my old emails, apparently ICU 5.0 and later include
ucol_strcollUTF8() which avoids the need to convert UTF-8 characters
to 16 bit before sorting. RHEL 6 has the older 4.2 version of ICU.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-09 Thread Tom Lane

Dave Page  writes:
> We needed a specific version that was newer than that shipped with
> RHEL 6 (or in EPEL) iirc.

Sure hope that's not true of the currently-proposed patch :-(

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-09 Thread Dave Page

On Fri, Sep 9, 2016 at 9:02 AM, Devrim Gündüz  wrote:
>
> Hi,
>
> On Fri, 2016-09-09 at 09:48 +0800, Craig Ringer wrote:
>> Personally I would be pretty reluctant to package libicu when it's
>> already in RH/Fedora. If it were bundled in Pg's source tree and a
>> private copy was built/installed by the build system so it was part of
>> the main postgresql-server package that'd be different.
>
> Agreed. I did not read the whole thread (yet), but if this is something like
> tzdata, I would personally want to use the libuci supplied by OS, like we do
> for tzdata.
>
> (That said, just checked EDB's ICU support. We currently ship our own libicu
> there, as a part of EPAS, but I don't know the reasoning/history behind 
> there.)

We needed a specific version that was newer than that shipped with
RHEL 6 (or in EPEL) iirc.

-- 
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-09 Thread Peter Eisentraut

On 9/8/16 8:07 PM, Peter Geoghegan wrote:
> Not exactly. Peter E. didn't seem to be aware that there is an ICU
> collator versioning concept (perhaps I misunderstood, though).

You appear to have missed the part about the collversion column that my
patch adds.  That's exactly the collator version of ICU.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-09 Thread Peter Eisentraut

On 9/8/16 11:08 PM, Peter Geoghegan wrote:
> In principle, and assuming I haven't gotten something wrong, it ought
> to be possible to unambiguously identify a collation based on a
> matching UCA version (i.e. DUCET table), plus the collation tailorings
> matching exactly, even across ICU versions that happen to be based on
> the same UCA version (they only seem to put out a new UCA version
> about once a year [4]).  It *might* be fine, practically speaking, to
> assume that a collation with a matching iso-code and UCA version is
> compatible forever and always across any ICU version. If not, it might
> instead be feasible to write a custom fingerprinter for collation
> tailorings that ran at initdb time.

The documentation [0] states that the collation version covers a broad
range of things.  So I don't think these additional mechanisms you
describe are necessary.

[0]: http://userguide.icu-project.org/collation/architecture#TOC-Versioning

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-09 Thread Craig Ringer

On 9 September 2016 at 16:21, Magnus Hagander  wrote:
> On Thu, Sep 8, 2016 at 6:19 PM, Peter Eisentraut
>  wrote:
>>
>> On 9/8/16 11:16 AM, Tom Lane wrote:
>> > This is a problem, if ICU won't guarantee cross-version compatibility,
>> > because it destroys the argument that moving to ICU would offer us
>> > collation behavior stability.
>>
>> It would offer a significant upgrade over the current situation.
>>
>> First, it offers stability inside the same version.  Whereas glibc might
>> change a collation in a minor upgrade, ICU won't do that.  And the
>> postgres binary is bound to a major version of ICU by the soname (which
>> changes with every major release).  So this would avoid the situation
>> that a simple OS update could break collations.
>>
>> Second, it offers a way to detect that something has changed.  With
>> glibc, you don't know anything unless you read the source diffs.  With
>> ICU, you can compare the collation version before and after and at least
>> tell the user that they need to refresh indexes or whatever.
>>
>
> +1 on the importance of this last part.
>
> We may not be able to handle it directly, but just being able to point out
> to the user that "this index is incorrect, you have to reindex" and then
> refuse to use the index until that has been done would be a *huge*
> improvement.  And it would definitely help solve an existing real-world
> problem, which is what can happen when you restore a physical backup onto a
> different version of an operating system at least.
>
> Sure, it would be even better if we could automatically *deal* with it. But
> failing in a loud and obvious way is a *lot* better than silently returning
> incorrect data...

Yep, I strongly agree. That's part of why I think this is well worth
doing even though it doesn't look like it'll give us a full solution
for stable collations.

It's likely also a step or three toward case-insensitive
locales/collations, which I'm sure many people would like. Though far
from the whole picture.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-09 Thread Magnus Hagander

On Thu, Sep 8, 2016 at 6:19 PM, Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 9/8/16 11:16 AM, Tom Lane wrote:
> > This is a problem, if ICU won't guarantee cross-version compatibility,
> > because it destroys the argument that moving to ICU would offer us
> > collation behavior stability.
>
> It would offer a significant upgrade over the current situation.
>
> First, it offers stability inside the same version.  Whereas glibc might
> change a collation in a minor upgrade, ICU won't do that.  And the
> postgres binary is bound to a major version of ICU by the soname (which
> changes with every major release).  So this would avoid the situation
> that a simple OS update could break collations.
>
> Second, it offers a way to detect that something has changed.  With
> glibc, you don't know anything unless you read the source diffs.  With
> ICU, you can compare the collation version before and after and at least
> tell the user that they need to refresh indexes or whatever.
>
>
+1 on the importance of this last part.

We may not be able to handle it directly, but just being able to point out
to the user that "this index is incorrect, you have to reindex" and then
refuse to use the index until that has been done would be a *huge*
improvement.  And it would definitely help solve an existing real-world
problem, which is what can happen when you restore a physical backup onto a
different version of an operating system at least.

Sure, it would be even better if we could automatically *deal* with it. But
failing in a loud and obvious way is a *lot* better than silently returning
incorrect data...

-- 
 Magnus Hagander
 Me: http://www.hagander.net/
 Work: http://www.redpill-linpro.com/

Re: [HACKERS] ICU integration

2016-09-09 Thread Devrim Gündüz

Hi,

On Fri, 2016-09-09 at 09:48 +0800, Craig Ringer wrote:
> Personally I would be pretty reluctant to package libicu when it's
> already in RH/Fedora. If it were bundled in Pg's source tree and a
> private copy was built/installed by the build system so it was part of
> the main postgresql-server package that'd be different.

Agreed. I did not read the whole thread (yet), but if this is something like
tzdata, I would personally want to use the libuci supplied by OS, like we do
for tzdata.

(That said, just checked EDB's ICU support. We currently ship our own libicu
there, as a part of EPAS, but I don't know the reasoning/history behind there.)

Regards,
-- 
Devrim GÜNDÜZ
EnterpriseDB: http://www.enterprisedb.com
PostgreSQL Danışmanı/Consultant, Red Hat Certified Engineer
Twitter: @DevrimGunduz , @DevrimGunduzTR

signature.asc
Description: This is a digitally signed message part

Re: [HACKERS] ICU integration

2016-09-08 Thread Peter Geoghegan

On Thu, Sep 8, 2016 at 6:48 PM, Craig Ringer  wrote:
> Pity ICU doesn't offer versioned collations within a single install.
> Though I can understand why they don't.

There are two separate issues with collator versioning. ICU can
probably be used in a way that clearly decouples these two issues,
which is very important. The first is that the rules of collations
change. The second is that the binary key that collators create (i.e.
the equivalent of strxfrm()) can change for various reasons that have
nothing to do with culture or natural languages -- purely technical
reasons. For example, they can add new optimizations to make
generating new binary keys faster. If there are bugs in how that
works, they can fix the bugs and increment the identifier [1], which
could allow Postgres to insist on a REINDEX (if abbreviated keys for
collated text were reenabled, although I don't think that problems
like that are limited to binary key generation).

So, to bring it back to that little program I wrote:

$ ./icu-coll-versions | head
Collator  | ICU Version | UCA Version
-
Afrikaans | 99-38-00-00 | 07-00-00-00
Afrikaans (Namibia)   | 99-38-00-00 | 07-00-00-00
Afrikaans (South Africa)  | 99-38-00-00 | 07-00-00-00
Aghem | 99-38-00-00 | 07-00-00-00
Aghem (Cameroon)  | 99-38-00-00 | 07-00-00-00
Akan  | 99-38-00-00 | 07-00-00-00
Akan (Ghana)  | 99-38-00-00 | 07-00-00-00
Amharic   | 99-38-00-00 | 07-00-00-00

Here, what appears as "ICU version" has the identifier [1] baked in,
although this is undocumented (it also has any "custom tailorings"
that might be used, say if we had user defined customizations to
collations, as Firebird apparently does [2] [3]). I'm pretty sure that
UCA version relates to a version of the Unicode collation algorithm,
and its associated DUCET table (this is all subject to ISO
standardization). I gather that a particular collation is actually
comprised of a base UCA version (and DUCET table -- I think that ICU
sometimes calls this the "root"), with custom tailorings that a locale
provides for a given culture or country. These collators may in turn
be further "tailored" to get that fancy user defined customization
stuff.

In principle, and assuming I haven't gotten something wrong, it ought
to be possible to unambiguously identify a collation based on a
matching UCA version (i.e. DUCET table), plus the collation tailorings
matching exactly, even across ICU versions that happen to be based on
the same UCA version (they only seem to put out a new UCA version
about once a year [4]).  It *might* be fine, practically speaking, to
assume that a collation with a matching iso-code and UCA version is
compatible forever and always across any ICU version. If not, it might
instead be feasible to write a custom fingerprinter for collation
tailorings that ran at initdb time. Maybe the tailorings, which are
abstract rules, could even be stored in system catalogs, so the only
thing that need match is ICU's UCA version (the "root" collators must
still match), since replicas may reconstruct the serialized tailorings
that comprise a collation as needed [5][6], since the tailoring that a
default collator for a locale uses isn't special, technically
speaking.

Of course, this is all pretty hand-wavey right now, and much more
research is needed. I am very intrigued about the idea of storing the
collators in the system catalogs wholesale, since ICU provides
facilities that make that seem possible. If a "serialized unicode set"
build from a collators tailoring rules, or, alternatively, a collator
saved as a binary representation [7] were stored in the system
catalogs, perhaps it wouldn't matter as much that the stuff
distributed with different ICU versions didn't match, at least in
theory. It's unclear that the system catalog representation could be
usable with a fair cross section of ICU versions, but if it could then
that would be perfect. This also seems to be how Firebird style
user-defined tailorings might be implemented anyway, and it seems very
appealing to add that as a light layer on top of how the base system
works, if at all possible.

[1] 
https://github.com/svn2github/libicu/blob/c43ec130ea0ee6cd565d87d70088e1d70d892f32/common/unicode/uvernum.h#L149
[2] http://www.firebirdsql.org/refdocs/langrefupd25-ddl-collation.html
[3] 
http://userguide.icu-project.org/collation/customization#TOC-Building-on-Existing-Locales
[4] http://unicode.org/reports/tr10/#Synch_14651_Table
[5] 
https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#a1982f184bca8adaa848144a1959ff235
[6]

Re: [HACKERS] ICU integration

2016-09-08 Thread Craig Ringer

On 9 September 2016 at 08:51, Tatsuo Ishii  wrote:
>> On 9 September 2016 at 00:19, Peter Eisentraut
>>  wrote:
>>> On 9/8/16 11:16 AM, Tom Lane wrote:
 This is a problem, if ICU won't guarantee cross-version compatibility,
 because it destroys the argument that moving to ICU would offer us
 collation behavior stability.
>>>
>>> It would offer a significant upgrade over the current situation.
>>>
>>> First, it offers stability inside the same version.  Whereas glibc might
>>> change a collation in a minor upgrade, ICU won't do that.  And the
>>> postgres binary is bound to a major version of ICU by the soname (which
>>> changes with every major release).  So this would avoid the situation
>>> that a simple OS update could break collations.
>>
>> It also lets *users* and PostgreSQL-specific distributors bundle ICU
>> and get stable collations. We can't exactly bundle glibc.
>
> I would like to know the fate of community RPMs because if PostgreSQL
> does not include ICU source, the effort of integrating ICU is totally
> up to packagers.

CC'ing Devrim.

Personally I would be pretty reluctant to package libicu when it's
already in RH/Fedora. If it were bundled in Pg's source tree and a
private copy was built/installed by the build system so it was part of
the main postgresql-server package that'd be different. I can't really
imagine that being acceptable for upstream development though.

RH and Debian would instantly rip it out and replace it with their
packaged ICU anyway.

Pity ICU doesn't offer versioned collations within a single install.
Though I can understand why they don't.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-08 Thread Tatsuo Ishii

> On 9 September 2016 at 00:19, Peter Eisentraut
>  wrote:
>> On 9/8/16 11:16 AM, Tom Lane wrote:
>>> This is a problem, if ICU won't guarantee cross-version compatibility,
>>> because it destroys the argument that moving to ICU would offer us
>>> collation behavior stability.
>>
>> It would offer a significant upgrade over the current situation.
>>
>> First, it offers stability inside the same version.  Whereas glibc might
>> change a collation in a minor upgrade, ICU won't do that.  And the
>> postgres binary is bound to a major version of ICU by the soname (which
>> changes with every major release).  So this would avoid the situation
>> that a simple OS update could break collations.
> 
> It also lets *users* and PostgreSQL-specific distributors bundle ICU
> and get stable collations. We can't exactly bundle glibc.

I would like to know the fate of community RPMs because if PostgreSQL
does not include ICU source, the effort of integrating ICU is totally
up to packagers.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-08 Thread Craig Ringer

On 9 September 2016 at 00:19, Peter Eisentraut
 wrote:
> On 9/8/16 11:16 AM, Tom Lane wrote:
>> This is a problem, if ICU won't guarantee cross-version compatibility,
>> because it destroys the argument that moving to ICU would offer us
>> collation behavior stability.
>
> It would offer a significant upgrade over the current situation.
>
> First, it offers stability inside the same version.  Whereas glibc might
> change a collation in a minor upgrade, ICU won't do that.  And the
> postgres binary is bound to a major version of ICU by the soname (which
> changes with every major release).  So this would avoid the situation
> that a simple OS update could break collations.

It also lets *users* and PostgreSQL-specific distributors bundle ICU
and get stable collations. We can't exactly bundle glibc.

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-08 Thread Peter Geoghegan

On Thu, Sep 8, 2016 at 8:16 AM, Tom Lane  wrote:
>> I understand that in principle, but I don't see operating system
>> providers shipping a bunch of ICU versions to facilitate that.  They
>> will usually ship one.
>
> I agree with that estimate, and I would further venture that even if we
> wanted to bundle ICU into our tarballs, distributors would rip it out
> again on security grounds.

I agree that we're not going to bundle our own ICU. And, that
packagers have to be more or less on board with whatever plan we come
up with for any this to be of much practical value. The plan itself is
at least as important as the patch.

> This is a problem, if ICU won't guarantee cross-version compatibility,
> because it destroys the argument that moving to ICU would offer us
> collation behavior stability.

Not exactly. Peter E. didn't seem to be aware that there is an ICU
collator versioning concept (perhaps I misunderstood, though). It
might be that in practice, the locales are very stable, so it almost
doesn't matter that it's annoying when they change. Note that
"collators" are versioned in a sophisticated way, not locales.

You can build the attached simple C program to see the versions of
available collators from each locale, as follows:

$ gcc icu-test.c -licui18n -licuuc -o icu-coll-versions
$ ./icu-coll-versions | head -n 20
Collator  | ICU Version | UCA Version
-
Afrikaans | 99-38-00-00 | 07-00-00-00
Afrikaans (Namibia)   | 99-38-00-00 | 07-00-00-00
Afrikaans (South Africa)  | 99-38-00-00 | 07-00-00-00
Aghem | 99-38-00-00 | 07-00-00-00
Aghem (Cameroon)  | 99-38-00-00 | 07-00-00-00
Akan  | 99-38-00-00 | 07-00-00-00
Akan (Ghana)  | 99-38-00-00 | 07-00-00-00
Amharic   | 99-38-00-00 | 07-00-00-00
Amharic (Ethiopia)| 99-38-00-00 | 07-00-00-00
Arabic| 99-38-1B-01 | 07-00-00-00
Arabic (World)| 99-38-1B-01 | 07-00-00-00
Arabic (United Arab Emirates) | 99-38-1B-01 | 07-00-00-00
Arabic (Bahrain)  | 99-38-1B-01 | 07-00-00-00
Arabic (Djibouti) | 99-38-1B-01 | 07-00-00-00
Arabic (Algeria)  | 99-38-1B-01 | 07-00-00-00
Arabic (Egypt)| 99-38-1B-01 | 07-00-00-00
Arabic (Western Sahara)   | 99-38-1B-01 | 07-00-00-00
Arabic (Eritrea)  | 99-38-1B-01 | 07-00-00-00

I also attach a full list from my Ubuntu 16.04 laptop. I'll try to
find some other system to generate output from, to see how close it
matches what I happen to have here.

"ICU version" here is an opaque 32-bit integer [1]. I'd be interested
to see how much the output of this program differs from one major
version of ICU to the next. Collations will change. of course, but not
that often. It's not the end of the world if somebody has to REINDEX
when they change major OS version. It would be nice if everything just
continued to work with no further input from the user, but it's not
essential, assuming that collation are pretty stable in practice,
which I think they are. It is a total disaster if a mismatch in
collations is initially undetected, though.

Another issue that nobody has mentioned here, I think, is that the
glibc people just don't seem to care about our use-case (Carlos
O'Donnell basically said as much, during the strxfrm() debacle earlier
this year, but it wasn't limited to how we were relying on strxfrm()
at that time). Since it's almost certainly true that other major
database systems are critically reliant on ICU's strxfrm() agreeing
with strcoll (substitute ICU equivalent spellings), and issues beyond
that, it stands to reason that they take that stuff very seriously. It
would be really nice to get back abbreviated keys for collated text,
IMV. I think ICU gets us that. Even if we used ICU in exactly the same
way as we use the C standard library today, that general sense of
stability being critical that ICU has would still be a big advantage.
If ICU drops the ball on collation stability, or strxfrm() disagreeing
with strcoll(), it's a huge problem for lots of groups of people, not
just us.

[1] 
https://ssl.icu-project.org/apiref/icu4c/ucol_8h.html#af756972781ac556a62e48cbd509ea4a6
-- 
Peter Geoghegan
Collator  | ICU Version | UCA Version
-
Afrikaans | 99-38-00-00 |

Re: [HACKERS] ICU integration

2016-09-08 Thread Peter Eisentraut

On 9/8/16 11:16 AM, Tom Lane wrote:
> This is a problem, if ICU won't guarantee cross-version compatibility,
> because it destroys the argument that moving to ICU would offer us
> collation behavior stability.

It would offer a significant upgrade over the current situation.

First, it offers stability inside the same version.  Whereas glibc might
change a collation in a minor upgrade, ICU won't do that.  And the
postgres binary is bound to a major version of ICU by the soname (which
changes with every major release).  So this would avoid the situation
that a simple OS update could break collations.

Second, it offers a way to detect that something has changed.  With
glibc, you don't know anything unless you read the source diffs.  With
ICU, you can compare the collation version before and after and at least
tell the user that they need to refresh indexes or whatever.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-08 Thread Tom Lane

Peter Eisentraut  writes:
> On 9/6/16 1:40 PM, Doug Doole wrote:
>> We carried the ICU version numbers around on our collation and locale
>> IDs (such as fr_FR%icu36) . The database would load multiple versions of
>> the ICU library so that something created with ICU 3.6 would always be
>> processed with ICU 3.6. This avoided the problems of trying to change
>> the rules on the user. (We'd always intended to provide tooling to allow
>> the user to move an existing object up to a newer version of ICU, but we
>> never got around to doing it.) In the code, this meant we were
>> explicitly calling the versioned API so that we could keep the calls
>> straight.

> I understand that in principle, but I don't see operating system
> providers shipping a bunch of ICU versions to facilitate that.  They
> will usually ship one.

I agree with that estimate, and I would further venture that even if we
wanted to bundle ICU into our tarballs, distributors would rip it out
again on security grounds.  I am dead certain Red Hat would do so; less
sure that other vendors have similar policies, but it seems likely.
They don't want to have to fix security bugs in more than one place.

This is a problem, if ICU won't guarantee cross-version compatibility,
because it destroys the argument that moving to ICU would offer us
collation behavior stability.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-07 Thread Alvaro Herrera

> - I can't remember the specific language but they had the collation rule
> that "CH" was treated as a distinct entity between C and D. This gave the
> order C < CG < CI < CZ < CH < D. Then they removed CH as special which gave
> C < CG < CH < CI < CZ < D. Suppose there was the constraint CHECK (COL
> BETWEEN 'C' AND 'CH'). Originally it would allow (almost) all strings that
> started with C. After the change it the constraint would block everything
> that started with CI - CZ leaving many rows that no longer qualify in the
> database.

(This was probably Spanish.)

-- 
Álvaro Herrerahttp://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-07 Thread Doug Doole

>
> This isn't a problem for Postgres, or at least wouldn't be right now,
> because we don't have case insensitive collations.


I was wondering if Postgres might be that way. It does avoid the RI
constraint problem, but there are still troubles with range based
predicates. (My previous project wanted case/accent insensitive collations,
so we got to deal with it all.)


> So, we use a strcmp()/memcmp() tie-breaker when strcoll() indicates
> equality, while also making the general notion of text equality actually
> mean binary equality.


We used a similar tie breaker in places. (e.g. Index keys needed to be
identical, not just equal. We also broke ties in sort to make its behaviour
more deterministic.)

I would like to get case insensitive collations some day, and was
> really hoping that ICU would help. That being said, the need for a
> strcmp() tie-breaker makes that hard. Oh well.
>

Prior to adding ICU to my previous project, it had the assumption that
equal meant identical as well. It turned out to be a lot easier to break
this assumption than I expected, but that code base had religiously used
its own string comparison function for user data - strcmp()/memcmp() was
never called for user data. (I don't know if the same can be said for
Postgres.) We found that very few places needed to be aware of values that
were equal but not identical. (Index and sort were the big two.)

Hopefully Postgres will be the same.

--
Doug Doole

Re: [HACKERS] ICU integration

2016-09-07 Thread Doug Doole

> I understand that in principle, but I don't see operating system
> providers shipping a bunch of ICU versions to facilitate that.  They
> will usually ship one.

Yep. If you want the protection I've described, you can't depend on the OS
copy of ICU. You need to have multiple ICU libraries that are
named/installed such that you can load the specific version you want. It
also means that you can have dependencies on versions of ICU that are no
longer supported. (In my previous project, we were shipping 3 copies of the
ICU library, going back to 2.x. Needless to say, we didn't add support for
every drop of ICU.)

On Wed, Sep 7, 2016 at 5:53 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 9/6/16 1:40 PM, Doug Doole wrote:
> > We carried the ICU version numbers around on our collation and locale
> > IDs (such as fr_FR%icu36) . The database would load multiple versions of
> > the ICU library so that something created with ICU 3.6 would always be
> > processed with ICU 3.6. This avoided the problems of trying to change
> > the rules on the user. (We'd always intended to provide tooling to allow
> > the user to move an existing object up to a newer version of ICU, but we
> > never got around to doing it.) In the code, this meant we were
> > explicitly calling the versioned API so that we could keep the calls
> > straight.
>
> I understand that in principle, but I don't see operating system
> providers shipping a bunch of ICU versions to facilitate that.  They
> will usually ship one.
>
> --
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>

Re: [HACKERS] ICU integration

2016-09-07 Thread Peter Eisentraut

On 9/6/16 1:40 PM, Doug Doole wrote:
> We carried the ICU version numbers around on our collation and locale
> IDs (such as fr_FR%icu36) . The database would load multiple versions of
> the ICU library so that something created with ICU 3.6 would always be
> processed with ICU 3.6. This avoided the problems of trying to change
> the rules on the user. (We'd always intended to provide tooling to allow
> the user to move an existing object up to a newer version of ICU, but we
> never got around to doing it.) In the code, this meant we were
> explicitly calling the versioned API so that we could keep the calls
> straight.

I understand that in principle, but I don't see operating system
providers shipping a bunch of ICU versions to facilitate that.  They
will usually ship one.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-06 Thread Peter Geoghegan

On Tue, Sep 6, 2016 at 10:40 AM, Doug Doole  wrote:
> - Suppose in ICU X.X, AA = Å but in ICU Y.Y AA != Å. Further suppose there
> was an RI constraint where the primary key used AA and the foreign key used
> Å. If ICU was updated, the RI constraint between the rows would break,
> leaving an orphaned foreign key.

This isn't a problem for Postgres, or at least wouldn't be right now,
because we don't have case insensitive collations. So, we use a
strcmp()/memcmp() tie-breaker when strcoll() indicates equality, while
also making the general notion of text equality actually mean binary
equality. In short, we are aware that cases like this exist. IIRC
Unicode Technical Standard #10 independently recommends that this
tie-breaker strategy is one way of dealing with problems like this, in
a pinch, though I think we came up with the idea independently of that
recommendation. This was in response to a bug report over 10 years
ago.

I would like to get case insensitive collations some day, and was
really hoping that ICU would help. That being said, the need for a
strcmp() tie-breaker makes that hard. Oh well.

-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-06 Thread Peter Geoghegan

On Tue, Sep 6, 2016 at 10:40 AM, Doug Doole  wrote:
>> The ICU ABI (not API) is also versioned.  The way that this is done is
>> that all functions are actually macros to a versioned symbol.  So
>> ucol_open() is actually a macro that expands to, say, ucol_open_57() in
>> ICU version 57.  (They also got rid of a dot in their versions a while
>> ago.)  It's basically hand-crafted symbol versioning.  That way, you can
>> link with multiple versions of ICU at the same time.  However, the
>> purpose of that, as I understand it, is so that plugins can have a
>> different version of ICU loaded than the main process or another plugin.
>  > In terms of postgres using the right version of ICU, it doesn't buy
>> anything beyond what the soname mechanism does.
>
> You can access the versioned API as well, it's just not documented. (The ICU
> team does support this - we worked very closely with them when doing all
> this.) We exploited the versioned API when we learned that there is no
> guarantee of backwards compatibility in collations. You can't just change a
> collation under a user (at least that was our opinion) since it can cause
> all sorts of problems. Refreshing a collation (especially on the fly) is a
> lot more work than we were prepared to take on. So we exploited the
> versioned APIs.

I originally got some of this information from the ICU Doxygen site
for the C API, which isn't great documentation, but also isn't bad. I
admit that there are certainly gaps in my understanding of how to
bridge our requirements with versioning to what ICU can give us.


-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-06 Thread Doug Doole

> The ICU ABI (not API) is also versioned.  The way that this is done is
> that all functions are actually macros to a versioned symbol.  So
> ucol_open() is actually a macro that expands to, say, ucol_open_57() in
> ICU version 57.  (They also got rid of a dot in their versions a while
> ago.)  It's basically hand-crafted symbol versioning.  That way, you can
> link with multiple versions of ICU at the same time.  However, the
> purpose of that, as I understand it, is so that plugins can have a
> different version of ICU loaded than the main process or another plugin.
 > In terms of postgres using the right version of ICU, it doesn't buy
> anything beyond what the soname mechanism does.

You can access the versioned API as well, it's just not documented. (The
ICU team does support this - we worked very closely with them when doing
all this.) We exploited the versioned API when we learned that there is no
guarantee of backwards compatibility in collations. You can't just change a
collation under a user (at least that was our opinion) since it can cause
all sorts of problems. Refreshing a collation (especially on the fly) is a
lot more work than we were prepared to take on. So we exploited the
versioned APIs.

We carried the ICU version numbers around on our collation and locale IDs
(such as fr_FR%icu36) . The database would load multiple versions of the
ICU library so that something created with ICU 3.6 would always be
processed with ICU 3.6. This avoided the problems of trying to change the
rules on the user. (We'd always intended to provide tooling to allow the
user to move an existing object up to a newer version of ICU, but we never
got around to doing it.) In the code, this meant we were explicitly calling
the versioned API so that we could keep the calls straight. (Of course this
was abstracted in a set of our own locale functions so that the rest of the
engine was ignorant of the ICU library fun that was going on.)

> We can refine the guidance.  But indexes are the most important issue, I
> think, because changing the sorting rules in the background makes data
> silently disappear.

I'd say that collation is the most important issue, but collation impacts a
lot more than indexes.

Unfortunately as part of changing companies I had to leave my "screwy stuff
that has happened in collations" presentation behind so I don't have
concrete examples to point to, but I can cook up illustrative examples:

- Suppose in ICU X.X, AA = Å but in ICU Y.Y AA != Å. Further suppose there
was an RI constraint where the primary key used AA and the foreign key
used Å. If ICU was updated, the RI constraint between the rows would break,
leaving an orphaned foreign key.

- I can't remember the specific language but they had the collation rule
that "CH" was treated as a distinct entity between C and D. This gave the
order C < CG < CI < CZ < CH < D. Then they removed CH as special which gave
C < CG < CH < CI < CZ < D. Suppose there was the constraint CHECK (COL
BETWEEN 'C' AND 'CH'). Originally it would allow (almost) all strings that
started with C. After the change it the constraint would block everything
that started with CI - CZ leaving many rows that no longer qualify in the
database.

It could be argued that these are edge cases and not likely to be hit.
That's likely true for a lot of users. But for a user who hits this, their
database is going to be left in a mess.

--
Doug Doole

On Tue, Sep 6, 2016 at 8:37 AM Peter Eisentraut <
peter.eisentr...@2ndquadrant.com> wrote:

> On 8/31/16 4:24 PM, Doug Doole wrote:
> > ICU explicitly does not provide stability in their locales and
> collations. We pushed them hard to provide this, but between changes to the
> CLDR data and changes to the ICU code it just wasn’t feasible for them to
> provide version to version stability.
> >
> > What they do offer is a compile option when building ICU to version all
> their APIs. So instead of calling icu_foo() you’d call icu_foo46(). (Or
> something like this - it’s been a few years since I actually worked with
> the ICU code.) This ultimately allows you to load multiple versions of the
> ICU library into a single program and provide stability by calling the
> appropriate version of the library. (Unfortunately, the OS - at least my
> Linux box - only provides the generic version of ICU and not the version
> annotated APIs, which means a separate compile of ICU is needed.)
> >
> > The catch with this is that it means you likely want to expose the
> version information. In another note it was suggested to use something like
> fr_FR%icu. If you want to pin it to a specific version of ICU, you’ll
> likely need something like fr_FR%icu46. (There’s nothing wrong with
> supporting fr_FR%icu to give users an easy way of saying “give me the
> latest and greatest”, but you’d probably want to harden it to a specific
> ICU version internally.)
>
> There are multiple things going on.
>
> Collations in ICU are versioned.  You can find out the version

Re: [HACKERS] ICU integration

2016-09-06 Thread Peter Eisentraut

On 8/31/16 1:32 PM, Peter Geoghegan wrote:
> ICU is happy to have multiple versions
> of a collation at a time, and you'll probably retain the old collation
> version in ICU.
> 
> Even if your old collation version isn't available in a new ICU
> release (which I think is unlikely in practice)

I think this is wrong, or I have misunderstood the ICU documentation.
There is no way to "choose" a collation version.  You can only record
the one you have gotten and check that you get the same one next time.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-06 Thread Peter Eisentraut

On 8/31/16 4:24 PM, Doug Doole wrote:
> ICU explicitly does not provide stability in their locales and collations. We 
> pushed them hard to provide this, but between changes to the CLDR data and 
> changes to the ICU code it just wasn’t feasible for them to provide version 
> to version stability.
> 
> What they do offer is a compile option when building ICU to version all their 
> APIs. So instead of calling icu_foo() you’d call icu_foo46(). (Or something 
> like this - it’s been a few years since I actually worked with the ICU code.) 
> This ultimately allows you to load multiple versions of the ICU library into 
> a single program and provide stability by calling the appropriate version of 
> the library. (Unfortunately, the OS - at least my Linux box - only provides 
> the generic version of ICU and not the version annotated APIs, which means a 
> separate compile of ICU is needed.)
> 
> The catch with this is that it means you likely want to expose the version 
> information. In another note it was suggested to use something like 
> fr_FR%icu. If you want to pin it to a specific version of ICU, you’ll likely 
> need something like fr_FR%icu46. (There’s nothing wrong with supporting 
> fr_FR%icu to give users an easy way of saying “give me the latest and 
> greatest”, but you’d probably want to harden it to a specific ICU version 
> internally.)

There are multiple things going on.

Collations in ICU are versioned.  You can find out the version of the
collation you are currently using using an API call.  A collation
version does not change during the life of a single version of ICU.  But
it might well change in the next version of ICU, as bugs are fixed and
things are refined.  There is no way in the API to call for a collation
of a specific version, since there is only one version of a collation in
a specific installation of ICU.  So my implementation is that we store
the version of the collation in the catalog when we create the
collation, and if we later on find at run time that the collation is of
a different version, we warn about it.

The ICU ABI (not API) is also versioned.  The way that this is done is
that all functions are actually macros to a versioned symbol.  So
ucol_open() is actually a macro that expands to, say, ucol_open_57() in
ICU version 57.  (They also got rid of a dot in their versions a while
ago.)  It's basically hand-crafted symbol versioning.  That way, you can
link with multiple versions of ICU at the same time.  However, the
purpose of that, as I understand it, is so that plugins can have a
different version of ICU loaded than the main process or another plugin.
 In terms of postgres using the right version of ICU, it doesn't buy
anything beyond what the soname mechanism does.

>> +   if (numversion != collform->collversion)
>> +   ereport(WARNING,
>> +   (errmsg("ICU collator version mismatch"),
>> +errdetail("The database was created using
>> version 0x%08X, the library provides version 0x%08X.",
>> +  (uint32) collform->collversion,
>> (uint32) numversion),
>> +errhint("Rebuild affected indexes, or build
>> PostgreSQL with the right version of ICU.")));
>>
>> So you still need to manage this carefully, but at least you have a
>> chance to learn about it.
> 
> Indexes are the obvious place where collation comes into play, and are 
> relatively easy to address. But consider all the places where string 
> comparisons can be done. For example, check constraints and referential 
> constraints can depend on string comparisons. If the collation rules change 
> because of a new version of ICU, the database can become inconsistent and 
> will need a lot more work than an index rebuild.

We can refine the guidance.  But indexes are the most important issue, I
think, because changing the sorting rules in the background makes data
silently disappear.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-09-06 Thread Peter Eisentraut

On 8/31/16 12:32 AM, Michael Paquier wrote:
> On Wed, Aug 31, 2016 at 1:12 PM, Peter Eisentraut
>  wrote:
>> On 8/30/16 11:27 PM, Craig Ringer wrote:
>>> Speaking of which, have you had a chance to try it on Windows yet?
>>
>> nope
> 
> +SELECT a, b FROM collate_test2 ORDER BY b;
> + a |  b
> +---+-
> + 1 | abc
> + 4 | ABC
> + 3 | bbc
> + 2 | äbc
> +(4 rows)

It take that to mean, "it works".

> Be careful with non-ASCII characters in the regression tests, remember
> for example that where jacana was not happy:
> https://www.postgresql.org/message-id/cab7npqrod2mxqy_5+czjvhw0whrrz6p8jv_rsblcrxrtwlh...@mail.gmail.com

That thread didn't tell me much, except that the client encoding didn't
handle some of the characters.  That doesn't seem specific to Windows.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-08-31 Thread Doug Doole

Hi all. I’m new to the PostgreSQL code and the mailing list, but I’ve had a lot 
of experience with using ICU in a different database product. So while I’m not 
up to speed on the code yet, I can offer some insights on using ICU.

> On Aug 30, 2016, at 9:12 PM, Peter Eisentraut 
>  wrote:
>> How stable are the UCU locales? Most importantly, does ICU offer any
>> way to "pin" a locale version, so we can say "we want de_DE as it was
>> in ICU 4.6" and get consistent behaviour when the user sets up a
>> replica on some other system with ICU 4.8? Even if the German
>> government has changed its mind (again) about some details of the
>> language and 4.8 knows about the changes but 4.4 doesn’t?

ICU explicitly does not provide stability in their locales and collations. We 
pushed them hard to provide this, but between changes to the CLDR data and 
changes to the ICU code it just wasn’t feasible for them to provide version to 
version stability.

What they do offer is a compile option when building ICU to version all their 
APIs. So instead of calling icu_foo() you’d call icu_foo46(). (Or something 
like this - it’s been a few years since I actually worked with the ICU code.) 
This ultimately allows you to load multiple versions of the ICU library into a 
single program and provide stability by calling the appropriate version of the 
library. (Unfortunately, the OS - at least my Linux box - only provides the 
generic version of ICU and not the version annotated APIs, which means a 
separate compile of ICU is needed.)

The catch with this is that it means you likely want to expose the version 
information. In another note it was suggested to use something like fr_FR%icu. 
If you want to pin it to a specific version of ICU, you’ll likely need 
something like fr_FR%icu46. (There’s nothing wrong with supporting fr_FR%icu to 
give users an easy way of saying “give me the latest and greatest”, but you’d 
probably want to harden it to a specific ICU version internally.)

> I forgot to mention this, but the patch adds a collversion column that
> stores the collation version (provided by ICU).  And then when you
> upgrade ICU to something incompatible you get
> 
> +   if (numversion != collform->collversion)
> +   ereport(WARNING,
> +   (errmsg("ICU collator version mismatch"),
> +errdetail("The database was created using
> version 0x%08X, the library provides version 0x%08X.",
> +  (uint32) collform->collversion,
> (uint32) numversion),
> +errhint("Rebuild affected indexes, or build
> PostgreSQL with the right version of ICU.")));
> 
> So you still need to manage this carefully, but at least you have a
> chance to learn about it.

Indexes are the obvious place where collation comes into play, and are 
relatively easy to address. But consider all the places where string 
comparisons can be done. For example, check constraints and referential 
constraints can depend on string comparisons. If the collation rules change 
because of a new version of ICU, the database can become inconsistent and will 
need a lot more work than an index rebuild.

> Suggestions for refining this are welcome.
> 
> -- 
> Peter Eisentraut  http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Doug Doole
Salesforce

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-08-31 Thread Peter Geoghegan

On Tue, Aug 30, 2016 at 7:46 PM, Peter Eisentraut
 wrote:
> In initdb, I initialize the default collation set as before from the
> `locale -a` output, but also add all available ICU locales with a "%icu"
> appended (so "fr_FR%icu").  I suppose one could create a configuration
> option perhaps in initdb to change the default so that, say, "fr_FR"
> uses ICU and "fr_FR%posix" uses the old stuff.

I suspect that we'd be better off adding a mechanism for adding a new
collation after initdb runs, on a live production instance. Maybe that
part can come later.

-- 
Peter Geoghegan


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-08-31 Thread Peter Geoghegan

On Tue, Aug 30, 2016 at 7:46 PM, Peter Eisentraut
 wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.

I'm really happy that you're working on this. This is more important
than is widely appreciated, and very important overall.

In a world where ICU becomes the defacto standard (i.e. is used on
major platforms by default), what remaining barriers are there to
documenting and enforcing the binary compatibility of replicas? I may
be mistaken, but offhand I can't think of any. Being able to describe
exactly what works and what doesn't is very important. After all,
failure to adhere to "the rules" today, such as they are, can leave
you with a subtly broken replica. I'd like to make that scenario
mechanically impossible, by locking everything down.

> I'm not sure how well it will work to replace all the bits of LIKE and
> regular expressions with ICU API calls.  One problem is that ICU likes
> to do case folding as a whole string, not by character.  I need to do
> more research about that.

My guess is that there are cultural reasons why it wants to operate on
a whole string, at least in some cases.

> Also note that ICU locales are encoding-independent and don't support a
> separate collcollate and collctype, so the existing catalog structure is
> not optimal.

That makes more sense to me, personally. ICU very explicitly decouples
technical issues (like the representation of strxfrm() keys, and, I
gather, encoding) from cultural issues (the actual user-visible
behaviors). This allows us to use strxfrm()-style binary keys in
indexes directly, since they're versioned independently from their
underlying collation; they can add a new optimization to strxfrm()-key
generation to the next ICU version, and we can detect that and require
a REINDEX, even when the collation version itself (the user-visible
behaviors) are unchanged. I'm getting ahead of myself here, but that
does seem very useful.

The Unicode collation algorithm [1] that ICU is directly based on
knows plenty about the requirements of indexing. It contains guidance
about equivalence vs. equality that we learned the hard way in commit
656beff5, for example.

> Where it gets really interesting is what to do with the database
> locales.  They just set the global process locale.  So in order to port
> that to ICU we'd need to check every implicit use of the process locale
> and tweak it.  We could add a datcollprovider column or something.  But
> we also rely on the datctype setting to validate the encoding of the
> database.  Maybe we wouldn't need that anymore, but it sounds risky.

Not sure about that.

Whatever we come up with here needs to mesh well with the existing
conventions around collation versioning that ICU has, in the context
of various operating system packages in particular. We can arrange it
so that in practice, an ICU upgrade doesn't often break your indexes
due to a collation rule change; ICU is happy to have multiple versions
of a collation at a time, and you'll probably retain the old collation
version in ICU.

Even if your old collation version isn't available in a new ICU
release (which I think is unlikely in practice), or you downgrade ICU,
it might be possible to give guidance on how to download a "Collation
Resource Bundle" [2][3] that *does* have the right collation version,
which presumably satisfies the requirement immediately.

Firebird already uses ICU. Maybe we have something to learn from them
here. In particular, where do they (by which I mean the ICU version
that Firebird links to) get its collations from in practice? I think
that the CLDR Data collations were at one time not even distributed
with ICU source. It might be a matter of individual OS packagers of
ICU deciding what exact CLDR data to use, which may or may not be of
any significant consequence in practice.

[1] http://unicode.org/reports/tr10
[2] http://site.icu-project.org/design/size/collation
[3] http://userguide.icu-project.org/icudata
-- 
Peter Geoghegan

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-08-30 Thread Michael Paquier

On Wed, Aug 31, 2016 at 1:12 PM, Peter Eisentraut
 wrote:
> On 8/30/16 11:27 PM, Craig Ringer wrote:
>> Speaking of which, have you had a chance to try it on Windows yet?
>
> nope

+SELECT a, b FROM collate_test2 ORDER BY b;
+ a |  b
+---+-
+ 1 | abc
+ 4 | ABC
+ 3 | bbc
+ 2 | äbc
+(4 rows)
Be careful with non-ASCII characters in the regression tests, remember
for example that where jacana was not happy:
https://www.postgresql.org/message-id/cab7npqrod2mxqy_5+czjvhw0whrrz6p8jv_rsblcrxrtwlh...@mail.gmail.com
-- 
Michael


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-08-30 Thread Peter Eisentraut

On 8/30/16 11:27 PM, Craig Ringer wrote:
> Speaking of which, have you had a chance to try it on Windows yet?

nope

> How stable are the UCU locales? Most importantly, does ICU offer any
> way to "pin" a locale version, so we can say "we want de_DE as it was
> in ICU 4.6" and get consistent behaviour when the user sets up a
> replica on some other system with ICU 4.8? Even if the German
> government has changed its mind (again) about some details of the
> language and 4.8 knows about the changes but 4.4 doesn't?

I forgot to mention this, but the patch adds a collversion column that
stores the collation version (provided by ICU).  And then when you
upgrade ICU to something incompatible you get

+   if (numversion != collform->collversion)
+   ereport(WARNING,
+   (errmsg("ICU collator version mismatch"),
+errdetail("The database was created using
version 0x%08X, the library provides version 0x%08X.",
+  (uint32) collform->collversion,
(uint32) numversion),
+errhint("Rebuild affected indexes, or build
PostgreSQL with the right version of ICU.")));

So you still need to manage this carefully, but at least you have a
chance to learn about it.

Suggestions for refining this are welcome.

-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] ICU integration

2016-08-30 Thread Craig Ringer

On 31 August 2016 at 10:46, Peter Eisentraut
 wrote:
> Here is a patch I've been working on to allow the use of ICU for sorting
> and other locale things.

Great to see you working on this. We've got some icky head-in-the-sand
issues in this area when it comes to OS upgrades, cross-OS-release
replication, etc, and having more control over our locale handling
would be nice. It should also get rid of the Windows-vs-*nix locale
naming wart.

Speaking of which, have you had a chance to try it on Windows yet? If
not, I'm happy to offer some pointers and a working test env.

How stable are the UCU locales? Most importantly, does ICU offer any
way to "pin" a locale version, so we can say "we want de_DE as it was
in ICU 4.6" and get consistent behaviour when the user sets up a
replica on some other system with ICU 4.8? Even if the German
government has changed its mind (again) about some details of the
language and 4.8 knows about the changes but 4.4 doesn't?

Otherwise we'll just have a new version of the same problem when it
comes to replication and upgrades. User upgrades ICU or replicates to
host with newer ICU and the data in indexes no longer correctly
reflects the runtime.

Even if ICU doesn't help solve this problem it's still valuable. I
just think it's something to think about.

We could always bundle a specific ICU version, but I know how well
that'd go down with distributors. They'd just ignore us and unbundle
it then complain about it. Not wholly without reason either; they
don't want to push security updates and bug fixes for bundled
libraries. Also, ICU isn't exactly a pocket-sized library we can stash
in some 3rdpty_libs/ dir .

-- 
 Craig Ringer   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

[HACKERS] ICU integration

2016-08-30 Thread Peter Eisentraut

Here is a patch I've been working on to allow the use of ICU for sorting
and other locale things.

This is mostly complementary to the existing FreeBSD ICU patch, most
recently discussed in [0].  While that patch removes the POSIX locale
use and replaces it with ICU, my interest was on allowing the use of
both.  I think that is necessary for upgrading, compatibility, and maybe
because someone likes it.

What I have done is extend collation objects with a collprovider column
that tells whether the collation is using POSIX (appropriate name?) or
ICU facilities.  The pg_locale_t type is changed to a struct that
contains the provider-specific locale handles.  Users of locale
information are changed to look into that struct for the appropriate
handle to use.

In initdb, I initialize the default collation set as before from the
`locale -a` output, but also add all available ICU locales with a "%icu"
appended (so "fr_FR%icu").  I suppose one could create a configuration
option perhaps in initdb to change the default so that, say, "fr_FR"
uses ICU and "fr_FR%posix" uses the old stuff.

That all works well enough for named collations and for sorting.  The
thread about the FreeBSD ICU patch discusses some details of how to best
use the ICU APIs to do various aspects of the sorting, so I didn't focus
on that too much.  I took the existing collate.linux.utf8.sql test and
ported it to the ICU setup, and it passes except for the case noted below.

I'm not sure how well it will work to replace all the bits of LIKE and
regular expressions with ICU API calls.  One problem is that ICU likes
to do case folding as a whole string, not by character.  I need to do
more research about that.  Another problem, which was also previously
discussed is that ICU does case folding in a locale-agnostic manner, so
it does not consider things such as the Turkish special cases.  This is
per Unicode standard modulo weasel wording, but it breaks existing tests
at least.

So right now the entries in collcollate and collctype need to be valid
for ICU *and* POSIX for everything to work.

Also note that ICU locales are encoding-independent and don't support a
separate collcollate and collctype, so the existing catalog structure is
not optimal.

Where it gets really interesting is what to do with the database
locales.  They just set the global process locale.  So in order to port
that to ICU we'd need to check every implicit use of the process locale
and tweak it.  We could add a datcollprovider column or something.  But
we also rely on the datctype setting to validate the encoding of the
database.  Maybe we wouldn't need that anymore, but it sounds risky.

We could have a datcollation column that by OID references a collation
defined inside the database.  With a background worker, we can log into
the database as it is being created and make adjustments, including
defining or adjusting collation definitions.  This would open up
interesting new possibilities.

What is a way to go forward here?  What's a minimal useful feature that
is future-proof?  Just allow named collations referencing ICU for now?
Is throwing out POSIX locales even for the process locale reasonable?

Oh, that case folding code in formatting.c needs some refactoring.
There are so many ifdefs there and it's repeated almost identically
three times, it's crazy to work in that.


[0]:
https://www.postgresql.org/message-id/flat/789A2F56-0E42-409D-A840-6AF5110D6085%40pingpong.net


-- 
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
diff --git a/aclocal.m4 b/aclocal.m4
index 6f930b6..5ca902b 100644
--- a/aclocal.m4
+++ b/aclocal.m4
@@ -7,6 +7,7 @@ m4_include([config/docbook.m4])
 m4_include([config/general.m4])
 m4_include([config/libtool.m4])
 m4_include([config/perl.m4])
+m4_include([config/pkg.m4])
 m4_include([config/programs.m4])
 m4_include([config/python.m4])
 m4_include([config/tcl.m4])
diff --git a/config/pkg.m4 b/config/pkg.m4
new file mode 100644
index 000..82bea96
--- /dev/null
+++ b/config/pkg.m4
@@ -0,0 +1,275 @@
+dnl pkg.m4 - Macros to locate and utilise pkg-config.   -*- Autoconf -*-
+dnl serial 11 (pkg-config-0.29.1)
+dnl
+dnl Copyright Â© 2004 Scott James Remnant .
+dnl Copyright Â© 2012-2015 Dan Nicholson 
+dnl
+dnl This program is free software; you can redistribute it and/or modify
+dnl it under the terms of the GNU General Public License as published by
+dnl the Free Software Foundation; either version 2 of the License, or
+dnl (at your option) any later version.
+dnl
+dnl This program is distributed in the hope that it will be useful, but
+dnl WITHOUT ANY WARRANTY; without even the implied warranty of
+dnl MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+dnl General Public License for more details.
+dnl
+dnl You should have received a copy of the GNU General Public License
+dnl along with this program; if not, write to the

77 matches

Mail list logo