Re: 16: Collation versioning and dependency helpers

2022-12-06 Thread Jeff Davis
On Tue, 2022-12-06 at 10:53 -0800, Andres Freund wrote:
> Perhaps worth posting a new version? Or are the remaining patches
> abandoned in
> favor of the other threads?

Marked what is there as committed, and the remainder is abandoned in
favor of other threads.

Thanks,
Jeff Davis





Re: 16: Collation versioning and dependency helpers

2022-12-06 Thread Andres Freund
Hi,

On 2022-10-31 16:36:54 -0700, Jeff Davis wrote:
> Committed these two small changes.

FWIW, as it stands cfbot can't apply the remaining changes:
http://cfbot.cputube.org/patch_40_3977.log

Perhaps worth posting a new version? Or are the remaining patches abandoned in
favor of the other threads?

Greetings,

Andres Freund




Re: 16: Collation versioning and dependency helpers

2022-10-31 Thread Jeff Davis
On Sun, 2022-10-30 at 19:10 +1300, Thomas Munro wrote:
> FWIW we did this (plus a lot more) in the per-index version tracking
> feature reverted from 14.

Thank you. I will catch up on that patch/thread.

> >   0002: Enable pg_collation_actual_version() to work on the default
> 
> Makes sense.
> 
> >   0003: Fix ALTER COLLATION "default" REFRESH VERSION, which 
> 
> Makes sense.

Committed these two small changes.

> >   0004: Add system views:
> >     pg_collation_versions: quickly see the current (from the
> > catalog)
> > and actual (from the provider) versions of each collation
> >     pg_collation_dependencies: map of objects to the collations
> > they
> > depend on
> > 
> > Along with these patches, you can use some tricks to verify data,
> > such
> > as /contrib/amcheck; or fix the data with things like:
> > 
> >   * REINDEX
> >   * VACUUM FULL/TRUNCATE/CLUSTER
> >   * REFRESH MATERIALIZED VIEW
> > 
> > And then refresh the collation version when you're confident that
> > your
> > data is valid.
> 
> Here you run into an argument that we had many times in that cycle:
> what's the point of views that suffer both false positives and false
> negatives?

The pg_collation_versions view is just a convenience, useful because
the default collation isn't represented normally in pg_collation so it
needs to be special-cased.

I could see how it would be tricky to precisely track the dependencies
through composite types (that is, create the proper pg_depend records),
but to just provide a view of the affected-by relationship seems more
doable. I'll review the previous discussion and see what I come up
with.

Of course, the view will just show an "affected by" relationship, it
won't show which objects are actually in violation of the current
collation version. But it at least gives the administrator a starting
place.

Regards,
Jeff Davis





Re: 16: Collation versioning and dependency helpers

2022-10-30 Thread Thomas Munro
On Sun, Oct 30, 2022 at 5:41 PM Jeff Davis  wrote:
> We haven't fully solved the changing collation-provider problem. An
> upgrade of the OS may change the version of libc or icu, and that might
> affect the collation, which could leave you with various corrupt
> database objects including:
>
>   * indexes
>   * constraints
>   * range types or multiranges (or other types dependent
> on collation for internal consistency)
>   * materialized views
>   * partitioned tables (range or hash)

Check.

> There's discussion about trying to reliably detect these changes and
> remedy them. But there are major challenges; for instance, glibc
> doesn't give a reliable signal that a collation may have changed, which
> would leave us with a lot of false positives and create a new set of
> problems (e.g. reindexing when it's unnecessary). And even with ICU, we
> don't have a way to support multiple versions of a provider or of a
> single collation, so trying to upgrade would still be a hassle.

FWIW some experimental code for multi-version ICU is proposed for
discussion here:

https://commitfest.postgresql.org/40/3956/

> Proposal:
>
> Add in some tools to make it easier for administrators to find out if
> they are at risk and solve the problem for themselves in a systematic
> way.

Excellent goal.

> Patches:
>
>   0001: Treat "default" collation as unpinned, so that entries in
> pg_depend are created. The rationale is that, since the "default"
> collation can change, it's not really an immutable system object, and
> it's worth tracking which objects are affected by it. It seems to bloat
> pg_depend by about 5-10% though -- that doesn't seem great, but I'm not
> sure if it's a real problem or not.

FWIW we did this (plus a lot more) in the per-index version tracking
feature reverted from 14.

>   0002: Enable pg_collation_actual_version() to work on the default
> collation (oid=100) so that it doesn't need to be treated as a special
> case.

Makes sense.

>   0003: Fix ALTER COLLATION "default" REFRESH VERSION, which currently
> throws an unhelpful internal error. Instead, issue a more helpful error
> that suggests "ALTER DATABASE ... REFRESH COLLATION VERSION" instead.

Makes sense.

>   0004: Add system views:
> pg_collation_versions: quickly see the current (from the catalog)
> and actual (from the provider) versions of each collation
> pg_collation_dependencies: map of objects to the collations they
> depend on
>
> Along with these patches, you can use some tricks to verify data, such
> as /contrib/amcheck; or fix the data with things like:
>
>   * REINDEX
>   * VACUUM FULL/TRUNCATE/CLUSTER
>   * REFRESH MATERIALIZED VIEW
>
> And then refresh the collation version when you're confident that your
> data is valid.

Here you run into an argument that we had many times in that cycle:
what's the point of views that suffer both false positives and false
negatives?

> TODO:

>   * Consider better tracking of which collation versions were active on
> a particular object since the last REINDEX (or REFRESH MATERIALIZED
> VIEW, TRUNCATE, or other command that would remove any trace of data
> affected by the previous collation version).

Right, the per-object dependency tracking feature, reverted from 14,
aimed to do exactly that.  It fell down on (1) some specific bugs that
were hard to fix, like dependencies inherited via composite types when
you change the composite type, and (2) doubt expressed by Tom, and
earlier Stephen, that pg_depend was a good place to store version
information.




16: Collation versioning and dependency helpers

2022-10-29 Thread Jeff Davis

Motivation:

We haven't fully solved the changing collation-provider problem. An
upgrade of the OS may change the version of libc or icu, and that might
affect the collation, which could leave you with various corrupt
database objects including:

  * indexes
  * constraints
  * range types or multiranges (or other types dependent
on collation for internal consistency)
  * materialized views
  * partitioned tables (range or hash)

There's discussion about trying to reliably detect these changes and
remedy them. But there are major challenges; for instance, glibc
doesn't give a reliable signal that a collation may have changed, which
would leave us with a lot of false positives and create a new set of
problems (e.g. reindexing when it's unnecessary). And even with ICU, we
don't have a way to support multiple versions of a provider or of a
single collation, so trying to upgrade would still be a hassle.

Proposal:

Add in some tools to make it easier for administrators to find out if
they are at risk and solve the problem for themselves in a systematic
way.

Patches:

  0001: Treat "default" collation as unpinned, so that entries in
pg_depend are created. The rationale is that, since the "default"
collation can change, it's not really an immutable system object, and
it's worth tracking which objects are affected by it. It seems to bloat
pg_depend by about 5-10% though -- that doesn't seem great, but I'm not
sure if it's a real problem or not.

  0002: Enable pg_collation_actual_version() to work on the default
collation (oid=100) so that it doesn't need to be treated as a special
case.

  0003: Fix ALTER COLLATION "default" REFRESH VERSION, which currently
throws an unhelpful internal error. Instead, issue a more helpful error
that suggests "ALTER DATABASE ... REFRESH COLLATION VERSION" instead.

  0004: Add system views:
pg_collation_versions: quickly see the current (from the catalog)
and actual (from the provider) versions of each collation
pg_collation_dependencies: map of objects to the collations they
depend on

Along with these patches, you can use some tricks to verify data, such
as /contrib/amcheck; or fix the data with things like:

  * REINDEX
  * VACUUM FULL/TRUNCATE/CLUSTER
  * REFRESH MATERIALIZED VIEW

And then refresh the collation version when you're confident that your
data is valid.

TODO:

  * The dependencies view is not rigorously complete, because the
directed dependency graph doesn't quite establish an "affected by"
relationship. One exception is that a composite type doesn't depend on
its associated relation, so a composite type over a range type doesn't
depend on the range type.
  * Consider adding in some verification helpers that can verify that a
value is still valid (e.g. a range type that depends on a collation
might have corrupt values). We could have a collation verifier for
types that are collation-dependenent, or perhaps just go through the
input and output functions and catch any errors.
  * Consider better tracking of which collation versions were active on
a particular object since the last REINDEX (or REFRESH MATERIALIZED
VIEW, TRUNCATE, or other command that would remove any trace of data
affected by the previous collation version).

Regards,
Jeff Davis

From eabc71acbce4c008e618a057c123ca15d8eb3eb5 Mon Sep 17 00:00:00 2001
From: Jeff Davis 
Date: Sat, 29 Oct 2022 12:38:42 -0700
Subject: [PATCH 1/4] Count the default collation as an unpinned dependency.

Upgrading the collation provider library can affect the default
collation in subtle ways, so track the dependency like any other. That
will at least preserve the information about which objects might be
affected, enabling administrator intervention.
---
 src/backend/catalog/catalog.c  | 12 +
 src/backend/catalog/dependency.c   | 28 +---
 src/backend/catalog/heap.c |  8 +---
 src/backend/catalog/index.c|  4 +-
 src/backend/catalog/pg_type.c  |  6 +--
 src/backend/commands/tablecmds.c   |  3 +-
 src/test/regress/expected/create_index.out | 52 +-
 7 files changed, 58 insertions(+), 55 deletions(-)

diff --git a/src/backend/catalog/catalog.c b/src/backend/catalog/catalog.c
index 2abd6b007a..2cf3f38d1e 100644
--- a/src/backend/catalog/catalog.c
+++ b/src/backend/catalog/catalog.c
@@ -29,6 +29,7 @@
 #include "catalog/namespace.h"
 #include "catalog/pg_auth_members.h"
 #include "catalog/pg_authid.h"
+#include "catalog/pg_collation.h"
 #include "catalog/pg_database.h"
 #include "catalog/pg_db_role_setting.h"
 #include "catalog/pg_largeobject.h"
@@ -355,6 +356,17 @@ IsPinnedObject(Oid classId, Oid objectId)
 	if (classId == DatabaseRelationId)
 		return false;
 
+	/*
+	 * The default collation is never pinned. While logically the default
+	 * collation should not change, the collation provider may change in
+	 * subtle ways when upgraded. Recording the dependencies on the default
+	 * collation makes it 

Re: Collation versioning

2021-04-15 Thread Thomas Munro
On Mon, Mar 15, 2021 at 2:25 PM Thomas Munro  wrote:
> FYI I have added this as an open item for PostgreSQL 14.  My default
> action will be to document this limitation, if we can't come up with
> something better in time.

Here is a short doc update to explain the situation on Windows and
close that open item.

PS  While trying to find official names to use to refer to the "en-US"
and "English_United States.1252" forms, I came across these sentences
in the Windows documentation[1], which support the idea already
discussed of trying to prevent the latter format from ever entering
our catalogs, in some future release:

"The locale-name form is a short, IETF-standardized string; for
example, en-US for English (United States) or bs-Cyrl-BA for Bosnian
(Cyrillic, Bosnia and Herzegovina).  These forms are preferred. [...]"

"The language[_country-region[.code-page]] form is stored in the
locale setting for a category when a language string, or language
string and country or region string, is used to create the locale.
[...] We do not recommend this form for locale strings embedded in
code or serialized to storage, because these strings are more likely
to be changed by an operating system update than the locale name
form."

[1] 
https://docs.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-160
From fd6c376dba21fdb0020d9b9de08fb878bb66f23d Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Fri, 16 Apr 2021 10:21:48 +1200
Subject: [PATCH] Doc: Document known problem with Windows collation versions.

Warn users that locales with traditional Windows NLS names like
"English_United States.1252" won't provide version information, and that
something like initdb --lc-collate=en-US would be needed to fix that
problem for the database default collation.

Discussion: https://postgr.es/m/CA%2BhUKGJ_hk3rU%3D%3Dg2FpAMChb_4i%2BTJacpjjqFsinY-tRM3FBmA%40mail.gmail.com
---
 doc/src/sgml/charset.sgml | 9 +
 1 file changed, 9 insertions(+)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 1b00e543a6..9630b18988 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -985,6 +985,15 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
  approach is imperfect as maintainers are free to back-port newer
  collation definitions to older C library releases.
 
+
+ When using Windows collations, version information is only available for
+ collations defined with IETF BCP47 locale names such as
+ en-US.  Currently, initdb selects
+ a default locale using a traditional Windows language and country
+ string such as English_United States.1252.  The
+ --lc-collate option can be used to provide an explicit
+ locale name in IETF-standardized form.
+

   
  
-- 
2.30.1



Re: Collation versioning

2021-03-14 Thread Thomas Munro
On Fri, Nov 6, 2020 at 10:56 AM Thomas Munro  wrote:
> On Wed, Nov 4, 2020 at 9:11 PM Michael Paquier  wrote:
> > On Wed, Nov 04, 2020 at 08:44:15AM +0100, Juan José Santamaría Flecha wrote:
> > >  We could create a static table with the conversion based on what was
> > > discussed for commit a169155, please find attached a spreadsheet with the
> > > comparison. This would require maintenance as new LCIDs are released [1].

> > I am honestly not a fan of something like that as it has good chances
> > to rot.

> No opinion on that, other than that we'd surely want a machine
> readable version.  As for *when* we use that information, I'm
> wondering if it would make sense to convert datcollate to a language
> tag in initdb, and also change pg_upgrade's equivalent_locale()
> function to consider "English_United States.*" and "en-US" to be
> equivalent when upgrading to 14 (which would then be the only point
> you'd ever have to have faith that we can convert the old style names
> to the new names correctly).  I'm unlikely to work on this myself as I
> have other operating systems to fix, but I'll certainly be happy if
> somehow we can get versioning for default on Windows in PG14 and not
> have to come up with weasel words in the manual.

FYI I have added this as an open item for PostgreSQL 14.  My default
action will be to document this limitation, if we can't come up with
something better in time.




Have collation versioning to ignore hash and similar AM

2020-11-25 Thread Julien Rouhaud
Hello,

Collation versioning has been committed, but there are still at least 2 cases
that aren't perfectly handled:

- AMs which don't rely on stable collation ordering (hash, bloom...)
- expressions which don't rely on a stable collation ordering (e.g. md5())

Handling expressions will probably require a lot of work, so I'd like to start
with AM handling.

My initial idea was to add a new field in IndexAmRoutine for that (see patch
[1] that was present in v30 [2], and the rest of the patches that used it).
Does anyone have any suggestion on a better way to handle that?

Note also that in the patch I named the field "amnostablecollorder", which is a
bit weird but required so that a custom access method won't be assumed to not
rely on a stable collation ordering if for some reason the author forgot to
correctly set it up.  If we end up with a new field in IndexAmRoutine, maybe we
could also add an API version field that could be bumped in case of changes so
that authors get an immediate error?  This way we wouldn't have to be worried
of a default value anymore.

[1] 
https://www.postgresql.org/message-id/attachment/114354/v30-0001-Add-a-new-amnostablecollorder-flag-in-IndexAmRou.patch
[2] https://www.postgresql.org/message-id/20200924094854.abjmpfqixq6xd4o5%40nol




Re: Collation versioning

2020-11-08 Thread Thomas Munro
On Fri, Oct 4, 2019 at 1:25 AM Thomas Munro  wrote:
> Ok.  Here's one like that.  Also, a WIP patch for FreeBSD.

Here's an updated patch for FreeBSD, which I'll sit on for a bit
longer.  It needs bleeding edge 13-CURRENT (due out Q1ish).
From b9cb5562457c214c48c0a6334b8ed3264f50e3d6 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Sun, 8 Nov 2020 21:12:12 +1300
Subject: [PATCH] Add collation versions for FreeBSD.

On FreeBSD 13, use querylocale() to read the current version of libc
collations.
---
 doc/src/sgml/charset.sgml |  3 ++-
 src/backend/utils/adt/pg_locale.c | 20 
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/charset.sgml b/doc/src/sgml/charset.sgml
index 832a701523..e151987eff 100644
--- a/doc/src/sgml/charset.sgml
+++ b/doc/src/sgml/charset.sgml
@@ -973,7 +973,8 @@ CREATE COLLATION ignore_accents (provider = icu, locale = 'und-u-ks-level1-kc-tr
 Version information is available from the
 icu provider on all operating systems.  For the
 libc provider, versions are currently only available
-on systems using the GNU C library (most Linux systems) and Windows.
+on systems using the GNU C library (most Linux systems), FreeBSD and
+Windows.

 

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 1dfe343b79..b1a8eb9a4e 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1684,6 +1684,26 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 
 		/* Use the glibc version because we don't have anything better. */
 		collversion = pstrdup(gnu_get_libc_version());
+#elif defined(LC_VERSION_MASK)
+		locale_tloc;
+
+		/* C[.encoding] and POSIX never change. */
+		if (strcmp("C", collcollate) == 0 ||
+			strncmp("C.", collcollate, 2) == 0 ||
+			strcmp("POSIX", collcollate) == 0)
+			return NULL;
+
+		/* Look up FreeBSD collation version. */
+		loc = newlocale(LC_COLLATE, collcollate, NULL);
+		if (loc)
+		{
+			collversion =
+pstrdup(querylocale(LC_COLLATE_MASK | LC_VERSION_MASK, loc));
+			freelocale(loc);
+		}
+		else
+			ereport(ERROR,
+	(errmsg("could not load locale \"%s\"", collcollate)));
 #elif defined(WIN32) && _WIN32_WINNT >= 0x0600
 		/*
 		 * If we are targeting Windows Vista and above, we can ask for a name
-- 
2.28.0



Re: Collation versioning

2020-11-05 Thread Thomas Munro
On Wed, Nov 4, 2020 at 9:11 PM Michael Paquier  wrote:
> On Wed, Nov 04, 2020 at 08:44:15AM +0100, Juan José Santamaría Flecha wrote:
> >  We could create a static table with the conversion based on what was
> > discussed for commit a169155, please find attached a spreadsheet with the
> > comparison. This would require maintenance as new LCIDs are released [1].
> >
> > [1]
> > https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba9f-294e-491e-b6eb-56532684c37f
>
> I am honestly not a fan of something like that as it has good chances
> to rot.

No opinion on that, other than that we'd surely want a machine
readable version.  As for *when* we use that information, I'm
wondering if it would make sense to convert datcollate to a language
tag in initdb, and also change pg_upgrade's equivalent_locale()
function to consider "English_United States.*" and "en-US" to be
equivalent when upgrading to 14 (which would then be the only point
you'd ever have to have faith that we can convert the old style names
to the new names correctly).  I'm unlikely to work on this myself as I
have other operating systems to fix, but I'll certainly be happy if
somehow we can get versioning for default on Windows in PG14 and not
have to come up with weasel words in the manual.

Just by the way, I think Windows does one thing pretty nicely here: it
has versions with a major and a minor part.  If the minor part goes
up, it means that they only added new code points, but didn't change
the ordering of any existing code points, so in some circumstances you
don't have to rebuild (which I think is the case for many Unicode
updates, adding new Chinese characters or emojis or whatever).  I
thought about whether we should replace the strcmp() comparison with a
call into provider-specific code, and in the case of Win32 locales it
could maybe understand that.  But there are two problems of limited
surmountability: (1) You have an idex built with version 42.1, and now
version 42.3 is present; OK, we can read this index, but if we write
any new data, then a streaming replica that has 42.2 will think it's
OK to read data, but it's not OK; so as soon as you write, you'd need
to update the catalogue, which is quite complicated (cf enum types);
(2) The whole theory only holds together if you didn't actually use
any of the new codepoints introduced by 42.3 while the index said
42.1, yet PostgreSQL isn't validating the codepoints you use against
the collation provider's internal map of valid code points.  So I gave
up with that line of thinking for now.




Re: Collation versioning

2020-11-04 Thread Laurenz Albe
On Tue, 2020-11-03 at 23:14 +0100, Christoph Berg wrote:
> Re: Thomas Munro
> > for all I know, "en" variants might change
> > independently (I doubt it in practice, but in theory it's wrong).
> 
> Long before the glibc 2.28 incident, the same collation change
> had already happened twice, namely between RedHat 5 -> 6 -> 7, for
> de_DE.UTF-8 only. de_AT.UTF-8 and all other locales were unaffected.
> 
> At the time I didn't connect the dots to check if Debian was affected
> as well, but of course later testing revealed it was since it was a
> change in glibc.

Yes, this is a persistent pain; I had several customers suffering from
these issues too.

I wish
https://postgr.es/m/5e756dd6-0e91-d778-96fd-b1bcb06c161a%402ndquadrant.com
hade made it into core.

Yours,
Laurenz Albe





Re: Collation versioning

2020-11-04 Thread Julien Rouhaud
On Wed, Nov 4, 2020 at 4:11 PM Michael Paquier  wrote:
>
> On Wed, Nov 04, 2020 at 08:44:15AM +0100, Juan José Santamaría Flecha wrote:
> >  We could create a static table with the conversion based on what was
> > discussed for commit a169155, please find attached a spreadsheet with the
> > comparison. This would require maintenance as new LCIDs are released [1].
> >
> > [1]
> > https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba9f-294e-491e-b6eb-56532684c37f
>
> I am honestly not a fan of something like that as it has good chances
> to rot.

Same here.




Re: Collation versioning

2020-11-04 Thread Michael Paquier
On Wed, Nov 04, 2020 at 08:44:15AM +0100, Juan José Santamaría Flecha wrote:
>  We could create a static table with the conversion based on what was
> discussed for commit a169155, please find attached a spreadsheet with the
> comparison. This would require maintenance as new LCIDs are released [1].
> 
> [1]
> https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba9f-294e-491e-b6eb-56532684c37f

I am honestly not a fan of something like that as it has good chances
to rot.
--
Michael


signature.asc
Description: PGP signature


Re: Collation versioning

2020-11-03 Thread Juan José Santamaría Flecha
On Tue, Nov 3, 2020 at 10:49 PM Thomas Munro  wrote:

>
> So we have:
>
> 1.  Windows locale names, like "English_United States.1252".  Windows
> still returns these from setlocale(), so they finish up in datcollate,
> and yet some relevant APIs don't accept them, at least on some
> machines.
>
> 2.  BCP 47/RFC 5646 language tags, like "en-US".  Windows uses these
> in relevant new APIs, including the case in point.
>
> 3.  Unix-style (XPG?  ISO/IEC 15897?) locale names, like "en_US"
> ("language[_territory[.codeset]][@modifier]").  These are used for
> message catalogues.
>
> We have a VS2015+ way of converting from form 1 to form 2 (and thence
> 3 by s/-/_/), and an older way.  Unfortunately, the new way looks a
> little too fuzzy: if i'm reading it right, search_locale_enum() might
> stop on either "en" or "en-AU", given "English_Australia", depending
> on the search order, no?


No, that is not the case. "English" could match any locale if the
enumeration order was to be changed in the future, right now the order is a
given (Language, Location), but "English_Australia" can only match  "en-AU".

This may be fine for the purpose of looking
> up error messages with gettext() (where there is only one English
> language message catalogue, we haven't got around to translating our
> errors into 'strayan yet), but it doesn't seem like a good way to look
> up the collation version; for all I know, "en" variants might change
> independently (I doubt it in practice, but in theory it's wrong).  We
> want the same algorithm that Windows uses internally to resolve the
> old style name to a collation; in other words we probably want
> something more like the code path that they took away in VS2015 :-(.
>

 We could create a static table with the conversion based on what was
discussed for commit a169155, please find attached a spreadsheet with the
comparison. This would require maintenance as new LCIDs are released [1].

[1]
https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-lcid/70feba9f-294e-491e-b6eb-56532684c37f

Regards,

Juan José Santamaría


WindowsNLSLocales.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Re: Collation versioning

2020-11-03 Thread Thomas Munro
On Wed, Nov 4, 2020 at 2:56 PM David Rowley  wrote:
> initdb works fine. I ran vcregress upgradecheck and it passes.
>
> With my default locale of English.New Zealand.1252 I get zero rows from:
>
> select * from pg_depend where coalesce(refobjversion,'') <> '';
>
> if I initdb with --lc-collate=en-NZ, it works and I see:
>
> postgres=# select * from pg_depend where coalesce(refobjversion,'') <> '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> deptype |  refobjversion
> -+---+--++--+-+-+-
> 2606 | 12512 |0 |   3456 |  100 |   0 | n
>  | 1538.14,1538.14
> (1 row)

Thanks for all the help and testing!  Pushed.  If we don't come up
with something better I'll need to figure out how to explain this in
the manual.  (Will no one rid us of these meddlesome old format locale
names?  It seems like pg_locale.c could drop a lot of rather
unpleasant code if initdb, CREATE COLLATION, and CREATE DATABASE
didn't allow them into the catalogue in the first place...)




Re: Collation versioning

2020-11-03 Thread David Rowley
On Wed, 4 Nov 2020 at 14:21, Thomas Munro  wrote:
>
> On Wed, Nov 4, 2020 at 10:56 AM Thomas Munro  wrote:
> > On Wed, Nov 4, 2020 at 10:52 AM Tom Lane  wrote:
> > > Thomas Munro  writes:
> > > > We want the same algorithm that Windows uses internally to resolve the
> > > > old style name to a collation; in other words we probably want
> > > > something more like the code path that they took away in VS2015 :-(.
> > >
> > > Yeah.  In the short run, though, it'd be nice to un-break the buildfarm.
> > > Maybe we could push David's code or something similar, and then
> > > contemplate better ways at leisure?
> >
> > Ok, yeah, I'll do that in the next few hours.
>
> I can't bring myself to commit that, it's not really in the spirit of
> this data integrity feature, and it's not our business to second guess
> the relationship between different locale naming schemes through fuzzy
> logic.  Instead, I propose to just neuter the feature if Windows
> decides it can't understand a locale names that it gave us.  It should
> still work fine with something like initdb --lc-collate=en-US.  Here's
> an untested patch.  Thoughts?

I gave this a quick test.

initdb works fine. I ran vcregress upgradecheck and it passes.

With my default locale of English.New Zealand.1252 I get zero rows from:

select * from pg_depend where coalesce(refobjversion,'') <> '';

if I initdb with --lc-collate=en-NZ, it works and I see:

postgres=# select * from pg_depend where coalesce(refobjversion,'') <> '';
 classid | objid | objsubid | refclassid | refobjid | refobjsubid |
deptype |  refobjversion
-+---+--++--+-+-+-
2606 | 12512 |0 |   3456 |  100 |   0 | n
 | 1538.14,1538.14
(1 row)

David




Re: Collation versioning

2020-11-03 Thread Tom Lane
Thomas Munro  writes:
> I can't bring myself to commit that, it's not really in the spirit of
> this data integrity feature, and it's not our business to second guess
> the relationship between different locale naming schemes through fuzzy
> logic.  Instead, I propose to just neuter the feature if Windows
> decides it can't understand a locale names that it gave us.  It should
> still work fine with something like initdb --lc-collate=en-US.  Here's
> an untested patch.  Thoughts?

Works for me, at least as a short-term solution.

regards, tom lane




Re: Collation versioning

2020-11-03 Thread Thomas Munro
On Wed, Nov 4, 2020 at 10:56 AM Thomas Munro  wrote:
> On Wed, Nov 4, 2020 at 10:52 AM Tom Lane  wrote:
> > Thomas Munro  writes:
> > > We want the same algorithm that Windows uses internally to resolve the
> > > old style name to a collation; in other words we probably want
> > > something more like the code path that they took away in VS2015 :-(.
> >
> > Yeah.  In the short run, though, it'd be nice to un-break the buildfarm.
> > Maybe we could push David's code or something similar, and then
> > contemplate better ways at leisure?
>
> Ok, yeah, I'll do that in the next few hours.

I can't bring myself to commit that, it's not really in the spirit of
this data integrity feature, and it's not our business to second guess
the relationship between different locale naming schemes through fuzzy
logic.  Instead, I propose to just neuter the feature if Windows
decides it can't understand a locale names that it gave us.  It should
still work fine with something like initdb --lc-collate=en-US.  Here's
an untested patch.  Thoughts?
From 0e99688adb3c1b18c01fbfd8e2488e5769309fc7 Mon Sep 17 00:00:00 2001
From: Thomas Munro 
Date: Wed, 4 Nov 2020 13:49:05 +1300
Subject: [PATCH] Tolerate version lookup failure for old-style Windows locale
 names.

Accept that we can't get versions for such locale names for now, leaving
it up to the user to provide the modern language tag format to benefit
from the collation versioning feature.  It's not clear that we can do
the conversion from the old style to the new style reliably enough for
this purpose.

Unfortunately, this includes the values that are typically installed as
database default by initdb, so it'd be nice to find a better solution
than this.

Discussion: https://postgr.es/m/CAEepm%3D0uEQCpfq_%2BLYFBdArCe4Ot98t1aR4eYiYTe%3DyavQygiQ%40mail.gmail.com
---
 src/backend/utils/adt/pg_locale.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/src/backend/utils/adt/pg_locale.c b/src/backend/utils/adt/pg_locale.c
index 3b0324ce18..d5a0169420 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -1702,10 +1702,22 @@ get_collation_actual_version(char collprovider, const char *collcollate)
 		MultiByteToWideChar(CP_ACP, 0, collcollate, -1, wide_collcollate,
 			LOCALE_NAME_MAX_LENGTH);
 		if (!GetNLSVersionEx(COMPARE_STRING, wide_collcollate, ))
+		{
+			/*
+			 * GetNLSVersionEx() wants a language tag such as "en-US", not a
+			 * locale name like "English_United States.1252".  Until those
+			 * values can be prevented from entering the system, or 100%
+			 * reliably converted to the more useful tag format, tolerate the
+			 * resulting error and report that we have no version data.
+			 */
+			if (GetLastError() == ERROR_INVALID_PARAMETER)
+return NULL;
+
 			ereport(ERROR,
 	(errmsg("could not get collation version for locale \"%s\": error code %lu",
 			collcollate,
 			GetLastError(;
+		}
 		collversion = psprintf("%d.%d,%d.%d",
 			   (version.dwNLSVersion >> 8) & 0x,
 			   version.dwNLSVersion & 0xFF,
-- 
2.20.1



Re: Collation versioning

2020-11-03 Thread Christoph Berg
Re: Thomas Munro
> for all I know, "en" variants might change
> independently (I doubt it in practice, but in theory it's wrong).

Long before the glibc 2.28 incident, the same collation change
had already happened twice, namely between RedHat 5 -> 6 -> 7, for
de_DE.UTF-8 only. de_AT.UTF-8 and all other locales were unaffected.

At the time I didn't connect the dots to check if Debian was affected
as well, but of course later testing revealed it was since it was a
change in glibc.

(German blogpost: 
https://www.credativ.de/blog/postgresql/postgresql-und-inkompatible-deutsche-spracheigenschaften-in-centos-rhel/)

Christoph




Re: Collation versioning

2020-11-03 Thread Thomas Munro
On Wed, Nov 4, 2020 at 10:52 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > We want the same algorithm that Windows uses internally to resolve the
> > old style name to a collation; in other words we probably want
> > something more like the code path that they took away in VS2015 :-(.
>
> Yeah.  In the short run, though, it'd be nice to un-break the buildfarm.
> Maybe we could push David's code or something similar, and then
> contemplate better ways at leisure?

Ok, yeah, I'll do that in the next few hours.




Re: Collation versioning

2020-11-03 Thread Tom Lane
Thomas Munro  writes:
> On Tue, Nov 3, 2020 at 4:38 PM Thomas Munro  wrote:
>> On Tue, Nov 3, 2020 at 1:51 PM David Rowley  wrote:
>>> FWIW, the attached does fix the issue for me.  It basically just calls
>>> the function that converts the windows-type "English_New Zealand.1252"
>>> locale name string into, e.g. "en_NZ".

> We want the same algorithm that Windows uses internally to resolve the
> old style name to a collation; in other words we probably want
> something more like the code path that they took away in VS2015 :-(.

Yeah.  In the short run, though, it'd be nice to un-break the buildfarm.
Maybe we could push David's code or something similar, and then
contemplate better ways at leisure?

regards, tom lane




Re: Collation versioning

2020-11-03 Thread Thomas Munro
On Tue, Nov 3, 2020 at 4:38 PM Thomas Munro  wrote:
> On Tue, Nov 3, 2020 at 1:51 PM David Rowley  wrote:
> > On Tue, 3 Nov 2020 at 12:29, David Rowley  wrote:
> > > Running low on ideas for now, so thought I'd post this in case it
> > > someone thinks of something else.
> >
> > FWIW, the attached does fix the issue for me.  It basically just calls
> > the function that converts the windows-type "English_New Zealand.1252"
> > locale name string into, e.g. "en_NZ". Then, since GetNLSVersionEx()
> > wants yet another variant with a - rather than an _, I've just added a
> > couple of lines to swap the _ for a -.  There's a bit of extra work
> > there since IsoLocaleName() just did the opposite, so perhaps doing it
> > that way was lazy of me.  I'd have invented some other function if I
> > could have thought of a meaningful name for it, then just have the ISO
> > version of it swap - for _.
>
> Thanks!  Hmm, it looks like Windows calls the hyphenated ISO
> language-country form a "tag".  It makes me slightly nervous to ask
> for the version of a transformed name with the encoding stripped, but
> it does seem entirely plausible that it gives the answer we seek.  I
> suppose if we were starting from a clean slate we might want to
> perform this transformation up front so that we have it in datcollate
> and then not have to think about the older form ever again.  If we
> decided to do that going forward, the last trace of that problem would
> live in pg_upgrade.  If we ever extend pg_import_system_collations()
> to cover Windows, we should make sure it captures the tag form.

So we have:

1.  Windows locale names, like "English_United States.1252".  Windows
still returns these from setlocale(), so they finish up in datcollate,
and yet some relevant APIs don't accept them, at least on some
machines.

2.  BCP 47/RFC 5646 language tags, like "en-US".  Windows uses these
in relevant new APIs, including the case in point.

3.  Unix-style (XPG?  ISO/IEC 15897?) locale names, like "en_US"
("language[_territory[.codeset]][@modifier]").  These are used for
message catalogues.

We have a VS2015+ way of converting from form 1 to form 2 (and thence
3 by s/-/_/), and an older way.  Unfortunately, the new way looks a
little too fuzzy: if i'm reading it right, search_locale_enum() might
stop on either "en" or "en-AU", given "English_Australia", depending
on the search order, no?  This may be fine for the purpose of looking
up error messages with gettext() (where there is only one English
language message catalogue, we haven't got around to translating our
errors into 'strayan yet), but it doesn't seem like a good way to look
up the collation version; for all I know, "en" variants might change
independently (I doubt it in practice, but in theory it's wrong).  We
want the same algorithm that Windows uses internally to resolve the
old style name to a collation; in other words we probably want
something more like the code path that they took away in VS2015 :-(.




Re: Collation versioning

2020-11-03 Thread Juan José Santamaría Flecha
On Tue, Nov 3, 2020 at 4:39 AM Thomas Munro  wrote:

> On Tue, Nov 3, 2020 at 1:51 PM David Rowley  wrote:
>
> > It would be good if this could also be tested on Visual Studio version
> > 12 as I see IsoLocaleName() does something else for anything before
> > 15.  I only have 10 and 17 installed and I see we don't support
> > anything before 12 on master per:
>
> I think others have mentioned that it might be time to drop some older
> Windows versions.  I don't follow that stuff, so I've quietly added a
> name to the CC list and will hope for the best :-)
>

There has been some talk about pushing _WIN32_WINNT to newer releases, but
ended without an actual patch for doing so. Maybe we can revisit that in
another thread.

[1]
https://www.postgresql.org/message-id/20200218065418.GK4176%40paquier.xyz

Regards,

Juan José Santamaría Flecha


Re: Collation versioning

2020-11-02 Thread Thomas Munro
On Tue, Nov 3, 2020 at 1:51 PM David Rowley  wrote:
> On Tue, 3 Nov 2020 at 12:29, David Rowley  wrote:
> > Running low on ideas for now, so thought I'd post this in case it
> > someone thinks of something else.
>
> FWIW, the attached does fix the issue for me.  It basically just calls
> the function that converts the windows-type "English_New Zealand.1252"
> locale name string into, e.g. "en_NZ". Then, since GetNLSVersionEx()
> wants yet another variant with a - rather than an _, I've just added a
> couple of lines to swap the _ for a -.  There's a bit of extra work
> there since IsoLocaleName() just did the opposite, so perhaps doing it
> that way was lazy of me.  I'd have invented some other function if I
> could have thought of a meaningful name for it, then just have the ISO
> version of it swap - for _.

Thanks!  Hmm, it looks like Windows calls the hyphenated ISO
language-country form a "tag".  It makes me slightly nervous to ask
for the version of a transformed name with the encoding stripped, but
it does seem entirely plausible that it gives the answer we seek.  I
suppose if we were starting from a clean slate we might want to
perform this transformation up front so that we have it in datcollate
and then not have to think about the older form ever again.  If we
decided to do that going forward, the last trace of that problem would
live in pg_upgrade.  If we ever extend pg_import_system_collations()
to cover Windows, we should make sure it captures the tag form.

> It would be good if this could also be tested on Visual Studio version
> 12 as I see IsoLocaleName() does something else for anything before
> 15.  I only have 10 and 17 installed and I see we don't support
> anything before 12 on master per:

I think others have mentioned that it might be time to drop some older
Windows versions.  I don't follow that stuff, so I've quietly added a
name to the CC list and will hope for the best :-)




Re: Collation versioning

2020-11-02 Thread David Rowley
On Tue, 3 Nov 2020 at 12:29, David Rowley  wrote:
> Running low on ideas for now, so thought I'd post this in case it
> someone thinks of something else.

FWIW, the attached does fix the issue for me.  It basically just calls
the function that converts the windows-type "English_New Zealand.1252"
locale name string into, e.g. "en_NZ". Then, since GetNLSVersionEx()
wants yet another variant with a - rather than an _, I've just added a
couple of lines to swap the _ for a -.  There's a bit of extra work
there since IsoLocaleName() just did the opposite, so perhaps doing it
that way was lazy of me.  I'd have invented some other function if I
could have thought of a meaningful name for it, then just have the ISO
version of it swap - for _.

It would be good if this could also be tested on Visual Studio version
12 as I see IsoLocaleName() does something else for anything before
15.  I only have 10 and 17 installed and I see we don't support
anything before 12 on master per:

"Unable to determine Visual Studio version: Visual Studio versions
before 12.0 aren't supported. at
L:/Projects/Postgres/d/src/tools/msvc/Mkvcbuild.pm line 93."

David
diff --git a/src/backend/utils/adt/pg_locale.c 
b/src/backend/utils/adt/pg_locale.c
index 3b0324ce18..4982fc7eb1 100644
--- a/src/backend/utils/adt/pg_locale.c
+++ b/src/backend/utils/adt/pg_locale.c
@@ -943,7 +943,7 @@ cache_locale_time(void)
 }
 
 
-#if defined(WIN32) && defined(LC_MESSAGES)
+#if defined(WIN32)
 /*
  * Convert a Windows setlocale() argument to a Unix-style one.
  *
@@ -1692,6 +1692,16 @@ get_collation_actual_version(char collprovider, const 
char *collcollate)
 */
NLSVERSIONINFOEX version = {sizeof(NLSVERSIONINFOEX)};
WCHAR   wide_collcollate[LOCALE_NAME_MAX_LENGTH];
+   char   *shortlocale = IsoLocaleName(collcollate);
+   char   *underscore;
+
+   /*
+* GetNLSVersionEx wants - format, whereas 
the ISO
+* format is _.  So replace the _ with -
+*/
+   underscore = strchr(shortlocale, '_');
+   if (underscore)
+   *underscore = '-';
 
/* These would be invalid arguments, but have no version. */
if (pg_strcasecmp("c", collcollate) == 0 ||
@@ -1699,12 +1709,12 @@ get_collation_actual_version(char collprovider, const 
char *collcollate)
return NULL;
 
/* For all other names, ask the OS. */
-   MultiByteToWideChar(CP_ACP, 0, collcollate, -1, 
wide_collcollate,
+   MultiByteToWideChar(CP_ACP, 0, shortlocale, -1, 
wide_collcollate,
LOCALE_NAME_MAX_LENGTH);
if (!GetNLSVersionEx(COMPARE_STRING, wide_collcollate, 
))
ereport(ERROR,
(errmsg("could not get collation 
version for locale \"%s\": error code %lu",
-   collcollate,
+   shortlocale,
GetLastError(;
collversion = psprintf("%d.%d,%d.%d",
   
(version.dwNLSVersion >> 8) & 0x,


Re: Collation versioning

2020-11-02 Thread David Rowley
On Tue, 3 Nov 2020 at 09:43, Thomas Munro  wrote:
> Fortunately David Rowley is able to repro this on his Windows box (it
> fails even with strings that are succeeding on the other BF machines),
> so we have something to work with.  The name mangling that is done in
> get_iso_localename() looks pretty interesting...  It does feel a bit
> like there is some other hidden environmental factor or setting here,
> because commit 352f6f2df60 tested OK on Juan Jose's machine too.
> Hopefully more soon.

It seems to boil down to GetNLSVersionEx() not liking the "English_New
Zealand.1252" string.  The theory about it having a space does not
seem to be a factor as if I change it to "English_Australia.1252", I
get the same issue.

Going by the docs in [1] and following the "local name" link to [2],
there's a description there that mentions: "Generally, the pattern
- is used.".  So, if I just hack the code in
get_collation_actual_version() to pass "en-NZ" to GetNLSVersionEx(),
that works fine.

In [3], Juan José was passing in en-US rather than these more weird
Windows-specific locale strings, so the testing that code got when it
went in didn't include seeing if something like "English_New
Zealand.1252" would be accepted.

The "English_New Zealand.1252" string seems to come from the
setlocales() call in initdb via check_locale_name(LC_COLLATE,
lc_collate, ), and fundamentally setlocale(LC_COLLATE).

I'm still a bit mystified why whelk seems unphased by this change. You
can see from [4] that it must be passing "German_Germany.1252" to
GetNLSVersionEx().  I've tested both on Windows 8.1 and Windows 10 and
I can't get GetNLSVersionEx() to accept that. So maybe Windows 7
allowed these non-ISO formats?  That theory seems to break down a bit
when you see that walleye is perfectly happy on Windows 10 (MinGW64).
You can see from [5] it mentions "The database cluster will be
initialized with locale "English_United States.1252".".

Running low on ideas for now, so thought I'd post this in case it
someone thinks of something else.

David

[1] 
https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getnlsversionex
[2] https://docs.microsoft.com/en-us/windows/win32/intl/locale-names
[3] 
https://www.postgresql.org/message-id/cac+axb0eat3aletrbdobb9jx863cu_+rsbgiajced5dcxob...@mail.gmail.com
[4] 
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=whelk=2020-11-02%2020%3A41%3A40=check-pg_upgrade
[5] 
https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=walleye=2020-11-02%2020%3A55%3A31=check-pg_upgrade




Re: Collation versioning

2020-11-02 Thread Thomas Munro
On Tue, Nov 3, 2020 at 6:51 AM Tom Lane  wrote:
> Thomas Munro  writes:
> > Hmm, a failure from dory (WIndows) during pg_upgrade:
>
> > performing post-bootstrap initialization ... 2020-11-02 08:08:22.213
> > EST [5392] FATAL:  could not get collation version for locale
> > "English_United States.1252": error code 87
>
> > 87 means invalid parameter.  I'm surprised it got through various
> > other tests and then failed here.  Whelk (also Windows) passed using
> > "German_Germany.1252".  Hmm.  I'll wait for more Windows systems to
> > report.
>
> drongo just did it too, and it seems repeatable on dory.  I'm not 100%
> sure, but I think the buildfarm's initial "check" step may be run under C
> locale while pg_upgrade sees whatever the machine's prevailing locale is.
> If that's correct, it seems like the simplest explanation is just that
> extraction of a collation version is busted for (some?) non-C locales on
> Windows.  Could be something as dumb as spaces in the locale name
> being problematic.

Fortunately David Rowley is able to repro this on his Windows box (it
fails even with strings that are succeeding on the other BF machines),
so we have something to work with.  The name mangling that is done in
get_iso_localename() looks pretty interesting...  It does feel a bit
like there is some other hidden environmental factor or setting here,
because commit 352f6f2df60 tested OK on Juan Jose's machine too.
Hopefully more soon.




Re: Collation versioning

2020-11-02 Thread Tom Lane
Thomas Munro  writes:
> Hmm, a failure from dory (WIndows) during pg_upgrade:

> performing post-bootstrap initialization ... 2020-11-02 08:08:22.213
> EST [5392] FATAL:  could not get collation version for locale
> "English_United States.1252": error code 87

> 87 means invalid parameter.  I'm surprised it got through various
> other tests and then failed here.  Whelk (also Windows) passed using
> "German_Germany.1252".  Hmm.  I'll wait for more Windows systems to
> report.

drongo just did it too, and it seems repeatable on dory.  I'm not 100%
sure, but I think the buildfarm's initial "check" step may be run under C
locale while pg_upgrade sees whatever the machine's prevailing locale is.
If that's correct, it seems like the simplest explanation is just that
extraction of a collation version is busted for (some?) non-C locales on
Windows.  Could be something as dumb as spaces in the locale name
being problematic.

regards, tom lane




Re: Collation versioning

2020-11-02 Thread Thomas Munro
On Tue, Nov 3, 2020 at 2:08 AM Julien Rouhaud  wrote:
> On Mon, Nov 2, 2020 at 9:04 PM Thomas Munro  wrote:
> > Ok, moved and renamed, and finally pushed.  Thanks for all the help!
>
> \o/  Thanks a lot Thomas!

Hmm, a failure from dory (WIndows) during pg_upgrade:

performing post-bootstrap initialization ... 2020-11-02 08:08:22.213
EST [5392] FATAL:  could not get collation version for locale
"English_United States.1252": error code 87

87 means invalid parameter.  I'm surprised it got through various
other tests and then failed here.  Whelk (also Windows) passed using
"German_Germany.1252".  Hmm.  I'll wait for more Windows systems to
report.




Re: Collation versioning

2020-11-02 Thread Julien Rouhaud
On Mon, Nov 2, 2020 at 9:04 PM Thomas Munro  wrote:
>
> On Mon, Nov 2, 2020 at 10:28 PM Julien Rouhaud  wrote:
> > +   /* Remember not to complain about this collation again. */
> > +   context->checked_colls = lappend_oid(context->checked_colls,
> > +otherObject->objectId);
> >
> > It's maybe not immediately obvious that it's safe to save the
> > collation oid at that point, or that it'll always be.  Maybe move it
> > in the if branch above to make it extra clear?
>
> Ok, moved and renamed, and finally pushed.  Thanks for all the help!

\o/  Thanks a lot Thomas!




Re: Collation versioning

2020-11-02 Thread Thomas Munro
On Mon, Nov 2, 2020 at 10:28 PM Julien Rouhaud  wrote:
> +   /* Remember not to complain about this collation again. */
> +   context->checked_colls = lappend_oid(context->checked_colls,
> +otherObject->objectId);
>
> It's maybe not immediately obvious that it's safe to save the
> collation oid at that point, or that it'll always be.  Maybe move it
> in the if branch above to make it extra clear?

Ok, moved and renamed, and finally pushed.  Thanks for all the help!




Re: Collation versioning

2020-11-02 Thread Julien Rouhaud
On Mon, Nov 2, 2020 at 3:56 PM Thomas Munro  wrote:
>
> On Mon, Nov 2, 2020 at 7:59 PM Julien Rouhaud  wrote:
> > Note that v34 now fails when run on a without that don't have
> > defined(__GLIBC__) (e.g. macos).  The failure are in
> > collate.icu.utf8.sql, of the form:
> >
> > - icuidx06_d_en_fr_ga   | "default"  | up to date
> > + icuidx06_d_en_fr_ga   | "default"  | version not tracked
> >
> > Given the new approach, the only option I can see is to simply remove
> > any attempt to cover the default collation in the tests.  An
> > alternative file would be a pain to maintain and it wouldn't bring any
> > value apart from checking that the default collation is either always
> > tracked or never, but not a mix of those.
>
> Blah, right.  Ok, I changed it to output XXX for "default".
>
> I did a bit more language clean up.  I fixed the tab completion for
> ALTER INDEX ... ALTER COLLATION.  I simplified a couple of tests.  I
> dropped the pg_dump TAP test for now (I might see if I can find a
> simpler way to test refobjversion restoration later).  I dropped the
> special handling for T_CollateExpr in find_expr_references_walker()
> (it wasn't affecting the test cases we have, and I wasn't entirely
> sure about it; do you see a problem?).  I dropped the VACUUM-log-spam
> feature for now (I'm not against it, but it seems like an independent
> enough thing to not include in the initial commit).  This brought the
> patch mercifully under 100kB.  This is the version I'm planning to
> commit if you don't see anything else.

Thanks a lot for the updated patch!  I agree with dropping the
T_CollateExpr test in find_exp_references_walker().  This was more a
hack than anything, and it should be better addressed by an approach
that can actually handle all cases rather than some random ones.

I'm also ok with dropping the logging from VACUUM from the initial
patch, but this seems like an important codepath to handle (and an
easy one), so I hope we can address that afterwards.

I just have a minor nit:

+   /* Do they match? */
+   if (strcmp(current_version, version) != 0)
+   {
+   /*
+* The version has changed, probably due to an OS/library upgrade or
+* streaming replication between different OS/library versions.
+*/
+   ereport(WARNING,
+   (errmsg("index \"%s\" depends on collation \"%s\"
version \"%s\", but the current version is \"%s\"",
+   get_rel_name(context->relid),
+   get_collation_name(otherObject->objectId),
+   version,
+   current_version),
+errdetail("The index may be corrupted due to changes
in sort order."),
+errhint("REINDEX to avoid the risk of corruption.")));
+   }
+
+   /* Remember not to complain about this collation again. */
+   context->checked_colls = lappend_oid(context->checked_colls,
+otherObject->objectId);

It's maybe not immediately obvious that it's safe to save the
collation oid at that point, or that it'll always be.  Maybe move it
in the if branch above to make it extra clear?

and also this will probably need an update:

-#define CATALOG_VERSION_NO 202010291
+#define CATALOG_VERSION_NO 202011013




Re: Collation versioning

2020-11-01 Thread Julien Rouhaud
On Sat, Oct 31, 2020 at 10:28 AM Thomas Munro  wrote:
>
> On Fri, Oct 30, 2020 at 1:20 AM Thomas Munro  wrote:
> > 4.  I didn't really like the use of '' for unknown.  I figured out how
> > to use NULL for that.
>
> Hrmph.  That logic didn't work for the pattern ops case.  I fixed
> that, by partially reintroducing a special value, but this time
> restricting the code that knows about that to just pg_dump, and I used
> the value 'unknown', so really it's not special at all as far as the
> server is concerned and there is only one kind of warning message.

Ok, I'm fine with that.

> Furthermore, I realised that I really don't like the policy of
> assuming that all text-related indexes imported from older releases
> need the "unknown" warning.  That'll just make this feature
> unnecessarily noisy and unpopular when 14 comes out, essentially
> crying wolf, even though it's technically true that the collations in
> imported-from-before-14 indexes are of unknown version.  Even worse,
> instructions might be shared around the internet to show how to shut
> the warnings up without reindexing, and then later when there really
> is a version change, someone might find those instructions and follow
> them!  So I propose that the default should be to assume that indexes
> are not corrupted, unless you opt into the more pessimistic policy
> with --index-collation-versions-unknown.  Done like that.

Yes, I was also worried about spamming this kind of messages after an
upgrade.  Note that this was initially planned for REL_13_STABLE,
whose release date was very close to glibc 2.28, so at that time this
would have been more likely to have a real corruption on the indexes.
I'm fine with the new behavior.

> I also realised that I don't like carrying a bunch of C code to
> support binary upgrades, when it's really just a hand-coded trivial
> UPDATE of pg_depend.  Is there any reason pg_dump --binary-upgrade
> shouldn't just dump UPDATE statements, and make this whole feature a
> lot less mysterious, and shorter?  Done like that.

I just thought that it wouldn't be acceptable to do plain DML on the
catalogs.  If that's ok, then definitely this approach is better.

> While testing on another OS that will be encountered in the build farm
> when I commit this, I realised that I needed to add --encoding=UTF8 to
> tests under src/bin/pg_dump and src/test/locale, because they now do
> things with ICU collations (if built with ICU support) and that only
> works with UTF8.

Oh I didn't know that.

> Another option would be to find a way to skip those
> tests if the encoding is not UTF8.  Hmm, I wonder if it's bad to
> effectively remove the testing that comes for free from buildfarm
> animals running this under non-UTF8 encodings; but if we actually
> valued that, I suppose we'd do it explicitly as another test pass with
> SQL_ASCII.

I guess it would be better to keep checking non-UTF8 encodings, but
the current approach looks quite random and I don't have any better
suggestions.

Note that v34 now fails when run on a without that don't have
defined(__GLIBC__) (e.g. macos).  The failure are in
collate.icu.utf8.sql, of the form:

- icuidx06_d_en_fr_ga   | "default"  | up to date
+ icuidx06_d_en_fr_ga   | "default"  | version not tracked

Given the new approach, the only option I can see is to simply remove
any attempt to cover the default collation in the tests.  An
alternative file would be a pain to maintain and it wouldn't bring any
value apart from checking that the default collation is either always
tracked or never, but not a mix of those.




Re: Collation versioning

2020-10-26 Thread Julien Rouhaud
On Sun, Oct 25, 2020 at 7:13 PM Julien Rouhaud  wrote:
>
> On Sun, Oct 25, 2020 at 10:36 AM Thomas Munro  wrote:
> >
> > On Fri, Oct 23, 2020 at 2:07 AM Julien Rouhaud  wrote:
> > > While reviewing the changes, I found a couple of minor issues
> > > (inherited from the previous versions).  It's missing a
> > > free_objects_addresses in recordDependencyOnCollations, and there's a
> > > small typo.  Inline diff:
> >
> > Thanks, fixed.
> >
> > I spent the past few days trying to simplify this patch further.
> > Here's what I came up with:
>
> Thanks!
>
> >
> > 1.  Drop the function dependencyExists() and related code, which in
> > earlier versions tried to avoid creating duplicate pg_depends rows by
> > merging with existing rows.  This was rather complicated, and there
> > are not many duplicates anyway, and it's easier to suppress duplicate
> > warnings at warning time (see do_collation_version_check()).  (I'm not
> > against working harder to make pg_depend rows unique, but it's not
> > required for this and I didn't like that way of doing it.)
>
> I didn't review all the changes yet, so I'll probably post a deeper
> review tomorrow.  I'm not opposed to this new approach, as it indeed
> saves a lot of code.  However, looking at
> do_collation_version_check(), it seems that you're saving the
> collation in context->checked_calls even if it didn't raise a WARNING.
> Since you can now have indexes with dependencies on a same collation
> with both version tracked and untracked (see for instance
> icuidx00_val_pattern_where in the regression tests), can't this hide
> corruption warning reports if the untracked version is found first?
> That can be easily fixed, so no objection to that approach of course.

I finish looking at the rest of the patches.  I don't have much to
say, it all looks good and I quite like how much useless code you got
rid of!




Re: Collation versioning

2020-10-25 Thread Julien Rouhaud
On Sun, Oct 25, 2020 at 10:36 AM Thomas Munro  wrote:
>
> On Fri, Oct 23, 2020 at 2:07 AM Julien Rouhaud  wrote:
> > While reviewing the changes, I found a couple of minor issues
> > (inherited from the previous versions).  It's missing a
> > free_objects_addresses in recordDependencyOnCollations, and there's a
> > small typo.  Inline diff:
>
> Thanks, fixed.
>
> I spent the past few days trying to simplify this patch further.
> Here's what I came up with:

Thanks!

>
> 1.  Drop the function dependencyExists() and related code, which in
> earlier versions tried to avoid creating duplicate pg_depends rows by
> merging with existing rows.  This was rather complicated, and there
> are not many duplicates anyway, and it's easier to suppress duplicate
> warnings at warning time (see do_collation_version_check()).  (I'm not
> against working harder to make pg_depend rows unique, but it's not
> required for this and I didn't like that way of doing it.)

I didn't review all the changes yet, so I'll probably post a deeper
review tomorrow.  I'm not opposed to this new approach, as it indeed
saves a lot of code.  However, looking at
do_collation_version_check(), it seems that you're saving the
collation in context->checked_calls even if it didn't raise a WARNING.
Since you can now have indexes with dependencies on a same collation
with both version tracked and untracked (see for instance
icuidx00_val_pattern_where in the regression tests), can't this hide
corruption warning reports if the untracked version is found first?
That can be easily fixed, so no objection to that approach of course.




Re: Collation versioning

2020-10-22 Thread Julien Rouhaud
On Thu, Oct 22, 2020 at 8:00 PM Thomas Munro  wrote:
>
> On Thu, Sep 24, 2020 at 9:49 PM Julien Rouhaud  wrote:
> > On Sun, Sep 20, 2020 at 10:24:26AM +0800, Julien Rouhaud wrote:
> > > On the other hand the *_pattern_ops are entirely hardcoded, and I
> > > don't think that we'll ever have an extensible way to have this kind
> > > of magic exception.  So maybe having a flag at the am level is
> > > acceptable?
> >
> > Hearing no complaint, I kept the flag at the AM level and added hardcoded
> > exceptions for the *_pattern_ops opclasses to avoid false positive as much 
> > as
> > possible, and no false negative (at least that I'm aware of).  I added many
> > indexes to the regression tests to make sure that all the cases are 
> > correctly
> > handled.
> >
> > Unfortunately, there's still one case that can't be fixed easily.  Here's an
> > example of such case:
> >
> > CREATE INDEX ON sometable ((collatable_col || collatable_col) 
> > text_pattern_ops)
>
> I think we should try to get the basic feature into the tree, and then
> look at these kinds of subtleties as later improvements.

Agreed.

>  Here's a new
> version with the following changes:
>
> 1.  Update the doc patch to mention that this stuff now works on
> Windows too (see commit 352f6f2d).
> 2.  Drop non_deterministic_only argument for from GetTypeCollations();
> it was unused.
> 3.  Drop that "stable collation order" field at the AM level for now.
> This means that we'll warn you about collation versions changes for
> hash and bloom indexes, even when it's technically unnecessary, for
> now.
>
> The pattern ops stuff seems straightforward however, so let's keep
> that bit in the initial commit of the feature.  That's a finite set of
> hard coded op classes which exist specifically to ignore collations.

Thanks a lot!  I'm fine with all the changes.  The "stable collation
order" part would definitely benefit from more thoughts, so it's good
if we can focus on that later.

While reviewing the changes, I found a couple of minor issues
(inherited from the previous versions).  It's missing a
free_objects_addresses in recordDependencyOnCollations, and there's a
small typo.  Inline diff:

diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c
index cab552eb32..4680b4e538 100644
--- a/src/backend/catalog/dependency.c
+++ b/src/backend/catalog/dependency.c
@@ -1674,6 +1674,8 @@ recordDependencyOnCollations(ObjectAddress *myself,
eliminate_duplicate_dependencies(addrs);
recordMultipleDependencies(myself, addrs->refs, addrs->numrefs,
   DEPENDENCY_NORMAL, record_version);
+
+   free_object_addresses(addrs);
 }

 /*
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index 69978fb409..048a41f446 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -1154,7 +1154,7 @@ index_create(Relation heapRelation,
colls_pattern_ops = list_difference_oid(colls_pattern_ops, colls);

/*
-* Record the dependencies for collation declares with any of the
+* Record the dependencies for collation declared with any of the
 * *_pattern_ops opclass, without version tracking.
 */
if (colls_pattern_ops != NIL)

Other than that it all looks good to me!




Re: Collation versioning

2020-09-19 Thread Julien Rouhaud
On Sun, Sep 20, 2020 at 6:36 AM Thomas Munro  wrote:
>
> On Thu, Sep 17, 2020 at 5:41 AM Julien Rouhaud  wrote:
> > On Tue, Sep 15, 2020 at 2:26 PM Peter Eisentraut
> >  wrote:
> > > I'm confused now.  I think we had mostly agreed on the v28 patch set,
> > > without this additional AM flag.  There was still some discussion on
> > > what the AM flag's precise semantics should be.  Do we want to work that
> > > out first?
> >
> > That was my understanding too, but since Michael raised a concern I
> > wrote some initial implementation for that part.  I'm assuming that
> > this new flag will raise some new discussion, and I hope this can be
> > discussed later, or at least in parallel, without interfering with the
> > rest of the patchset.
>
> If we always record dependencies we'll have the option to invent
> clever ways to ignore them during version checking in later releases.

But in any case we need to record the dependencies for all collations
right?  The only difference is that we shouldn't record the collation
version if there's no risk of corruption if the underlying sort order
changes.
So while I want to address this part in pg14, if that wasn't the case
the problem would anyway be automatically fixed in the later version
by doing a reindex I think, as the version would be cleared.

There could still be a possible false positive warning in that case if
the lib is updated, but users could clear it with the infrastructure
proposed.  Or alternatively if we add a new backend filter, say
REINDEX (COLLATION NOT CURRENT), we could add there additional
knowledge to ignore such cases.

> > > Btw., I'm uneasy about the term "stable collation order".  "Stable" has
> > > an established meaning for sorting.  It's really about whether the AM
> > > uses collations at all, right?
> >
> > Well, at the AM level I guess it's only about whether it's using some
> > kind of sorting or not, as the collation information is really at the
> > opclass level.  It makes me realize that this approach won't be able
> > to cope with an index built using (varchar|text)_pattern_ops, and
> > that's probably something we should handle correctly.
>
> Hmm.

On the other hand the *_pattern_ops are entirely hardcoded, and I
don't think that we'll ever have an extensible way to have this kind
of magic exception.  So maybe having a flag at the am level is
acceptable?




Re: Collation versioning

2020-09-19 Thread Thomas Munro
On Thu, Sep 17, 2020 at 5:41 AM Julien Rouhaud  wrote:
> On Tue, Sep 15, 2020 at 2:26 PM Peter Eisentraut
>  wrote:
> > I'm confused now.  I think we had mostly agreed on the v28 patch set,
> > without this additional AM flag.  There was still some discussion on
> > what the AM flag's precise semantics should be.  Do we want to work that
> > out first?
>
> That was my understanding too, but since Michael raised a concern I
> wrote some initial implementation for that part.  I'm assuming that
> this new flag will raise some new discussion, and I hope this can be
> discussed later, or at least in parallel, without interfering with the
> rest of the patchset.

If we always record dependencies we'll have the option to invent
clever ways to ignore them during version checking in later releases.

> > Btw., I'm uneasy about the term "stable collation order".  "Stable" has
> > an established meaning for sorting.  It's really about whether the AM
> > uses collations at all, right?
>
> Well, at the AM level I guess it's only about whether it's using some
> kind of sorting or not, as the collation information is really at the
> opclass level.  It makes me realize that this approach won't be able
> to cope with an index built using (varchar|text)_pattern_ops, and
> that's probably something we should handle correctly.

Hmm.




Re: Collation versioning

2020-09-16 Thread Julien Rouhaud
On Tue, Sep 15, 2020 at 2:26 PM Peter Eisentraut
 wrote:
>
> On 2020-09-08 16:45, Julien Rouhaud wrote:
> > I usually agree with that approach, I'm just afraid that getting a 
> > consensus on
> > the best way to do that will induce a lot of discussions, while this is
> > probably a corner case due to general usage of hash and bloom indexes.
> >
> > Anyway, in order to make progress on that topic I attach an additional POC
> > commit to add the required infrastructure to handle this case in
> > v29-0001-Add-a-new-amnostablecollorder-flag-in-IndexAmRou.patch.
>
> I'm confused now.  I think we had mostly agreed on the v28 patch set,
> without this additional AM flag.  There was still some discussion on
> what the AM flag's precise semantics should be.  Do we want to work that
> out first?

That was my understanding too, but since Michael raised a concern I
wrote some initial implementation for that part.  I'm assuming that
this new flag will raise some new discussion, and I hope this can be
discussed later, or at least in parallel, without interfering with the
rest of the patchset.

> Btw., I'm uneasy about the term "stable collation order".  "Stable" has
> an established meaning for sorting.  It's really about whether the AM
> uses collations at all, right?

Well, at the AM level I guess it's only about whether it's using some
kind of sorting or not, as the collation information is really at the
opclass level.  It makes me realize that this approach won't be able
to cope with an index built using (varchar|text)_pattern_ops, and
that's probably something we should handle correctly.




Re: Collation versioning

2020-09-15 Thread Peter Eisentraut

On 2020-09-08 16:45, Julien Rouhaud wrote:

I usually agree with that approach, I'm just afraid that getting a consensus on
the best way to do that will induce a lot of discussions, while this is
probably a corner case due to general usage of hash and bloom indexes.

Anyway, in order to make progress on that topic I attach an additional POC
commit to add the required infrastructure to handle this case in
v29-0001-Add-a-new-amnostablecollorder-flag-in-IndexAmRou.patch.


I'm confused now.  I think we had mostly agreed on the v28 patch set, 
without this additional AM flag.  There was still some discussion on 
what the AM flag's precise semantics should be.  Do we want to work that 
out first?


Btw., I'm uneasy about the term "stable collation order".  "Stable" has 
an established meaning for sorting.  It's really about whether the AM 
uses collations at all, right?


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Collation versioning

2020-09-10 Thread Julien Rouhaud
On Thu, Sep 10, 2020 at 6:52 PM Peter Eisentraut
 wrote:
>
> On 2020-09-08 21:33, Christoph Berg wrote:
> > IOW, I think we should aim at simply tracking the version, and leave
> > it to the admin (with the help of supplied SQL queries) to either
> > rebuild indexes or waive them.
>
> It's certainly safer to track dependency for all indexes and then
> carefully create exceptions afterwards.

Do you mean storing the collation version even when those are not
relevant, and let client code (or backend command) deal with it?  This
would require to store a dependency per index and column (or at least
if it's a column or an expression to avoid bloating the dependencies
too much), as it's otherwise impossible to know if a version mismatch
can be safely ignored or not.
I'm also wondering how much more complexity it would add to people who
want to actively monitor such mismatch using SQL queries.




Re: Collation versioning

2020-09-10 Thread Peter Eisentraut

On 2020-09-08 21:33, Christoph Berg wrote:

IOW, I think we should aim at simply tracking the version, and leave
it to the admin (with the help of supplied SQL queries) to either
rebuild indexes or waive them.


It's certainly safer to track dependency for all indexes and then 
carefully create exceptions afterwards.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Collation versioning

2020-09-08 Thread Julien Rouhaud
On Tue, Sep 08, 2020 at 09:33:26PM +0200, Christoph Berg wrote:
> Re: Julien Rouhaud
> > Here's the rationale for this new flag, extracted from the patch's commit
> > message:
> > 
> > Add a new amnostablecollorder flag in IndexAmRoutine.
> > 
> > This flag indicates if the access method does not rely on a stable collation
> > ordering for deterministic collation, i.e. would not be corrupted if the
> > underlying collation library changes its ordering.  This is done this way to
> > make sure that if any external access method isn't updated to correctly 
> > setup
> > this flag it will be assumed to rely on a stable collation ordering as this
> > should be the case for the majority of access methods.
> > 
> > This flag will be useful for an upcoming commit that will add detection of
> > possibly corrupted index due to changed collation library version.
> 
> Hmm. Does that flag gain us much? What about non-collation locale
> changes that might still affect indexes like lower() and the citext
> extension? That still depends on locale changes, but that flag
> wouldn't be able to help with "this index is (not) affected by this
> locale change".
> 
> IOW, I think we should aim at simply tracking the version, and leave
> it to the admin (with the help of supplied SQL queries) to either
> rebuild indexes or waive them.
> 
> Or maybe I misunderstood the problem.
> 

I see your point, and indeed this isn't really clear how the flag will be used
given this description.

I guess that my idea is that how exactly an index depends on a stable
collation ordering isn't part of this flag definition, as it should be the same
for any access method.

In the commit that uses the infrastructure, the lack of stable ordering
requirement is only used for index key column, but not for any
expression, whether on index key or qual, because as you mention there's no
guarantee that the expression itself depends on a stable ordering or not.

There could be some improvements done for some simple case (like maybe md5() is
used often enough in index keys that it could be worth to detect), but nothing
in done to attempt that for now.




Re: Collation versioning

2020-09-08 Thread Christoph Berg
Re: Julien Rouhaud
> Here's the rationale for this new flag, extracted from the patch's commit
> message:
> 
> Add a new amnostablecollorder flag in IndexAmRoutine.
> 
> This flag indicates if the access method does not rely on a stable collation
> ordering for deterministic collation, i.e. would not be corrupted if the
> underlying collation library changes its ordering.  This is done this way to
> make sure that if any external access method isn't updated to correctly setup
> this flag it will be assumed to rely on a stable collation ordering as this
> should be the case for the majority of access methods.
> 
> This flag will be useful for an upcoming commit that will add detection of
> possibly corrupted index due to changed collation library version.

Hmm. Does that flag gain us much? What about non-collation locale
changes that might still affect indexes like lower() and the citext
extension? That still depends on locale changes, but that flag
wouldn't be able to help with "this index is (not) affected by this
locale change".

IOW, I think we should aim at simply tracking the version, and leave
it to the admin (with the help of supplied SQL queries) to either
rebuild indexes or waive them.

Or maybe I misunderstood the problem.

Christoph




Re: Collation versioning

2020-09-06 Thread Michael Paquier
On Fri, Aug 14, 2020 at 11:02:35AM +0200, Julien Rouhaud wrote:
> On Fri, Aug 14, 2020 at 04:37:46PM +0900, Michael Paquier wrote:
>> +   /*
>> +* XXX For deterministic transaction, se should only track the
>> version
>> +* if the AM relies on a stable ordering.
>> +*/
>> +   if (determ_colls)
>> +   {
>> +   /* XXX check if the AM relies on a stable ordering */
>> +   recordDependencyOnCollations(, determ_colls, true);
>> Some cleanup needed here?  Wouldn't it be better to address the issues
>> with stable ordering first?
> 
> Didn't we just agreed 3 mails ago to *not* take care of that in this patch, 
> and
> add an extensible solution for that later?  I kept the XXX comment to make it
> extra clear that this will be addressed.

FWIW, I tend to prefer the approach where we put in place the
necessary infrastructure first, and then have a feature rely on what
we think is the most correct.  This way, we avoid having any moment in
the code history where we have something that we know from the start
is not covered.

The patch set needs a rebase.  There are conflicts coming at least
from pg_depend.c where I switched the code to use multi-INSERTs for
catalog insertions.
--
Michael


signature.asc
Description: PGP signature


Re: Collation versioning

2020-08-14 Thread Michael Paquier
On Fri, Aug 14, 2020 at 02:21:58PM +1200, Thomas Munro wrote:
> Thanks Julien.  I'm planning to do a bit more testing and review, and
> then hopefully commit this next week.  If anyone else has objections
> to this design, now would be a good time to speak up.

The design to use pg_depend for the version string and rely on an
unknown state for indexes whose collations are unknown has a clear
consensus, so nothing to say about that.  It looks like this will
benefit from using multi-INSERTs with pg_depend, actually.

I have read through the patch, and there are a couple of portions that
could be improved and/or simplified.

 /*
- * Adjust all dependency records to come from a different object of the same 
type
  * Swap all dependencies of and on the old index to the new one, and
+ * vice-versa, while preserving any referenced version for the original owners.
+ * Note that a call to CommandCounterIncrement() would cause duplicate entries
+ * in pg_depend, so this should not be done.
+ */
+void
+swapDependencies(Oid classId, Oid firstObjectId, Oid secondObjectId)
+{
+   changeDependenciesOf(classId, firstObjectId, secondObjectId, true);
+   changeDependenciesOn(classId, firstObjectId, secondObjectId);
+
+   changeDependenciesOf(classId, secondObjectId, firstObjectId, true);
+   changeDependenciesOn(classId, secondObjectId, firstObjectId);
+}

The comment on top of the routine is wrong, as it could apply to
something else than indexes.  Anyway, I don't think there is much
value in adding this API as the only part where this counts is
relation swapping for reindex concurrently.  It could also be possible
that this breaks some extension code by making those static to
pg_depend.c.

-long
+static long
 changeDependenciesOf(Oid classId, Oid oldObjectId,
-Oid newObjectId)
+Oid newObjectId, bool preserve_version)
All the callers of changeDependenciesOf() set the new argument to
true, making the false path dead, even if it just implies that the
argument is null.  I would suggest to keep the original function
signature.  If somebody needs a version where they don't want to
preserve the version, it could just be added later.

+* We don't want to record redundant depedencies that are used
+* to track versions to avoid redundant warnings in case of
s/depedencies/dependencies/

+   /*
+* XXX For deterministic transaction, se should only track the
version
+* if the AM relies on a stable ordering.
+*/
+   if (determ_colls)
+   {
+   /* XXX check if the AM relies on a stable ordering */
+   recordDependencyOnCollations(, determ_colls, true);
Some cleanup needed here?  Wouldn't it be better to address the issues
with stable ordering first?

+   /* recordDependencyOnSingleRelExpr get rid of duplicated
entries */
s/get/gets/, incorrect grammar.

+   /* XXX should we warn about "disappearing" versions? */
+   if (current_version)
+   {
Something to do here?

+   /*
+* We now support versioning for the underlying collation library on
+* this system, or previous version is unknown.
+*/
+   if (!version || (strcmp(version, "") == 0 && strcmp(current_version,
+   "") != 0))
Strange diff format here.

+static char *
+index_check_collation_version(const ObjectAddress *otherObject,
+ const char *version,
+ void *userdata)
All the new functions in index.c should have more documentation and
comments to explain what they do.

+   foreach(lc, collations)
+   {
+   ObjectAddress referenced;
+
+   ObjectAddressSet(referenced, CollationRelationId, lfirst_oid(lc));
+
+   recordMultipleDependencies(myself, , 1,
+  DEPENDENCY_NORMAL, record_version);
+   }
I think that you could just use an array of ObjectAddresses here, fill
in a set of ObjectAddress objects and just call
recordMultipleDependencies() for all of them?  Just create a set using
new_object_addresses(), register them with add_exact_object_address(),
and then finish the job with record_object_address_dependencies().
--
Michael


signature.asc
Description: PGP signature


Re: Collation versioning

2020-08-13 Thread Thomas Munro
On Thu, Aug 13, 2020 at 9:52 PM Julien Rouhaud  wrote:
> Here's a rebased v27 that removes the current approach to ignore indexes that
> don't rely on a stable ordering.  I'll start a new thread on that matter once
> the infrastructure pieces will be committed.

Thanks Julien.  I'm planning to do a bit more testing and review, and
then hopefully commit this next week.  If anyone else has objections
to this design, now would be a good time to speak up.




Re: Collation versioning

2020-07-09 Thread Thomas Munro
On Thu, Jul 9, 2020 at 11:13 PM Julien Rouhaud  wrote:
> On Thu, Jul 9, 2020 at 10:00 AM Peter Eisentraut
>  wrote:
> > In order not to derail this patch set I think it would be okay for now
> > to just include all index AMs in dependency tracking and invent a
> > mechanism later that excludes hash and bloom in an extensible manner.
>
> FTR I'll be happy to take care of that.

Ok, thanks!  Let's go with that.




Re: Collation versioning

2020-07-09 Thread Julien Rouhaud
On Thu, Jul 9, 2020 at 10:00 AM Peter Eisentraut
 wrote:
>
> On 2020-07-08 08:26, Michael Paquier wrote:
> > On Wed, Jul 08, 2020 at 06:12:51PM +1200, Thomas Munro wrote:
> >> I still wish I had a better idea than this:
> >>
> >> +/*
> >> + * Returns whether the given index access method depend on a stable 
> >> collation
> >> + * order.
> >> + */
> >> +static bool
> >> +index_depends_stable_coll_order(Oid amoid)
> >> +{
> >> +   return (amoid != HASH_AM_OID &&
> >> +   strcmp(get_am_name(amoid), "bloom") != 0);
> >> +}
> >>
> >> I'm doing some more testing and looking for weird cases...  More soon.
> >
> > Wouldn't the normal way to track that a new field in IndexAmRoutine?
> > What you have here is not extensible.
>
> Yeah, this should be decided and communicated by the index AM somehow.
>
> Perhaps it would also make sense to let the index AM handle the
> differences between deterministic and nondeterministic collations.  I
> don't know how the bloom AM works, though, to determine whether that
> makes sense.
>
> In order not to derail this patch set I think it would be okay for now
> to just include all index AMs in dependency tracking and invent a
> mechanism later that excludes hash and bloom in an extensible manner.

FTR I'll be happy to take care of that.




Re: Collation versioning

2020-07-09 Thread Peter Eisentraut

On 2020-07-08 08:26, Michael Paquier wrote:

On Wed, Jul 08, 2020 at 06:12:51PM +1200, Thomas Munro wrote:

I still wish I had a better idea than this:

+/*
+ * Returns whether the given index access method depend on a stable collation
+ * order.
+ */
+static bool
+index_depends_stable_coll_order(Oid amoid)
+{
+   return (amoid != HASH_AM_OID &&
+   strcmp(get_am_name(amoid), "bloom") != 0);
+}

I'm doing some more testing and looking for weird cases...  More soon.


Wouldn't the normal way to track that a new field in IndexAmRoutine?
What you have here is not extensible.


Yeah, this should be decided and communicated by the index AM somehow.

Perhaps it would also make sense to let the index AM handle the 
differences between deterministic and nondeterministic collations.  I 
don't know how the bloom AM works, though, to determine whether that 
makes sense.


In order not to derail this patch set I think it would be okay for now 
to just include all index AMs in dependency tracking and invent a 
mechanism later that excludes hash and bloom in an extensible manner.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Collation versioning

2020-07-08 Thread Michael Paquier
On Wed, Jul 08, 2020 at 06:12:51PM +1200, Thomas Munro wrote:
> I still wish I had a better idea than this:
> 
> +/*
> + * Returns whether the given index access method depend on a stable collation
> + * order.
> + */
> +static bool
> +index_depends_stable_coll_order(Oid amoid)
> +{
> +   return (amoid != HASH_AM_OID &&
> +   strcmp(get_am_name(amoid), "bloom") != 0);
> +}
> 
> I'm doing some more testing and looking for weird cases...  More soon.

Wouldn't the normal way to track that a new field in IndexAmRoutine?
What you have here is not extensible.
--
Michael


signature.asc
Description: PGP signature


Re: Collation versioning

2020-03-19 Thread Julien Rouhaud
On Thu, Mar 19, 2020 at 12:31:54PM +0900, Michael Paquier wrote:
> On Wed, Mar 18, 2020 at 04:35:43PM +0100, Julien Rouhaud wrote:
> > On Wed, Mar 18, 2020 at 09:56:40AM +0100, Julien Rouhaud wrote:
> > AFAICT it was only missing a call to index_update_collation_versions() in
> > ReindexRelationConcurrently.  I added regression tests to make sure that
> > REINDEX, REINDEX [INDEX|TABLE] CONCURRENTLY and VACUUM FULL are doing what's
> > expected.
> 
> If you add a call to index_update_collation_versions(), the old and
> invalid index will use the same refobjversion as the new index, which
> is the latest collation version of the system, no?  If the operation
> is interrupted before the invalid index is dropped, then we would keep
> a confusing value for refobjversion, because the old invalid index
> does not rely on the new collation version, but on the old one.
> Hence, it seems to me that it would be correct to have the old invalid
> index either use an empty version string to say "we don't know"
> because the index is invalid anyway, or keep a reference to the old
> collation version intact.  I think that the latter is much more useful
> for debugging issues when upgrading a subset of indexes if the
> operation is interrupted for a reason or another.

Indeed, I confused the _ccold and _ccnew indexes.  So, the root cause is phase
4, more precisely the dependency swap in index_concurrently_swap.

A possible fix would be to teach changeDependenciesOf() to preserve the
dependency version.  It'd be quite bit costly as this would mean an extra index
search for each dependency row found.  We could probably skip the lookup if the
row have a NULL recorded version, as version should either be null or non null
for both objects.

I'm wondering if that's a good time to make changeDependenciesOf and
changeDependenciesOn private, and instead expose a swapDependencies(classid,
obj1, obj2) that would call both, as preserving the version doesn't really
makes sense outside a switch.  It's als oa good way to ensure that no CCI is
performed in the middle.

> > Given discussion in nearby threads, I obviously can't add tests for failed
> > REINDEX CONCURRENTLY, so here's what's happening with a manual repro:
> > 
> > =# UPDATE pg_depend SET refobjversion = 'meh' WHERE refobjversion = 
> > '153.97';
> > UPDATE 1
> 
> Updates to catalogs are not an existing practice in the core
> regression tests, so patches had better not do that. :p

I already heavily relied on that in the previous version of the patchset.  The
only possible alternative would be to switch to TAP tests, and constantly
restart the instance in binary upgrade mode to be able to call
binary_upgrade_set_index_coll_version.  I'd prefer to avoid that if that's
possible, as it'll make the test way more complex and quite unreadable.

> > =# REINDEX TABLE CONCURRENTLY t1 ;
> > LOCATION:  ReindexRelationConcurrently, indexcmds.c:2839
> > ^CCancel request sent
> > ERROR:  57014: canceling statement due to user request
> > LOCATION:  ProcessInterrupts, postgres.c:3171
> 
> I guess that you used a second session here beginning a transaction
> before REINDEX CONCURRENTLY ran here so as it would stop after
> swapping dependencies, right?

Yes, sorry for eluding that.  I'm using a SELECT FOR UPDATE, same scenario as
the recent issue with TOAST tables with REINDEX CONCURRENTLY.




Re: Collation versioning

2020-03-18 Thread Michael Paquier
On Wed, Mar 18, 2020 at 04:35:43PM +0100, Julien Rouhaud wrote:
> On Wed, Mar 18, 2020 at 09:56:40AM +0100, Julien Rouhaud wrote:
> AFAICT it was only missing a call to index_update_collation_versions() in
> ReindexRelationConcurrently.  I added regression tests to make sure that
> REINDEX, REINDEX [INDEX|TABLE] CONCURRENTLY and VACUUM FULL are doing what's
> expected.

If you add a call to index_update_collation_versions(), the old and
invalid index will use the same refobjversion as the new index, which
is the latest collation version of the system, no?  If the operation
is interrupted before the invalid index is dropped, then we would keep
a confusing value for refobjversion, because the old invalid index
does not rely on the new collation version, but on the old one.
Hence, it seems to me that it would be correct to have the old invalid
index either use an empty version string to say "we don't know"
because the index is invalid anyway, or keep a reference to the old
collation version intact.  I think that the latter is much more useful
for debugging issues when upgrading a subset of indexes if the
operation is interrupted for a reason or another.

> Given discussion in nearby threads, I obviously can't add tests for failed
> REINDEX CONCURRENTLY, so here's what's happening with a manual repro:
> 
> =# UPDATE pg_depend SET refobjversion = 'meh' WHERE refobjversion = '153.97';
> UPDATE 1

Updates to catalogs are not an existing practice in the core
regression tests, so patches had better not do that. :p 

> =# REINDEX TABLE CONCURRENTLY t1 ;
> LOCATION:  ReindexRelationConcurrently, indexcmds.c:2839
> ^CCancel request sent
> ERROR:  57014: canceling statement due to user request
> LOCATION:  ProcessInterrupts, postgres.c:3171

I guess that you used a second session here beginning a transaction
before REINDEX CONCURRENTLY ran here so as it would stop after
swapping dependencies, right?

> =# SELECT objid::regclass, indisvalid, refobjversion
>FROM pg_depend d
>JOIN pg_index i ON i.indexrelid = d.objid
>WHERE refobjversion IS NOT NULL;
>   objid   | indisvalid | refobjversion
> --++---
>  t1_val_idx_ccold | f  | 153.97
>  t1_val_idx   | t  | meh
> (2 rows)
> 
> =# REINDEX TABLE t1;
> WARNING:  0A000: cannot reindex invalid index 
> "pg_toast.pg_toast_16418_index_ccold" on TOAST table, skipping
> LOCATION:  reindex_relation, index.c:3987
> REINDEX
> 
> =# SELECT objid::regclass, indisvalid, refobjversion
>FROM pg_depend d
>JOIN pg_index i ON i.indexrelid = d.objid
>WHERE refobjversion IS NOT NULL;
>   objid   | indisvalid | refobjversion
> --++---
>  t1_val_idx_ccold | t  | 153.97
>  t1_val_idx   | t  | 153.97
> (2 rows)
> 
> ISTM that it's working as intended.

After a non-concurrent reindex, this information is right.  However
based on the output of your test here, after REINDEX CONCURRENTLY the
information held in refobjversion is incorrect for t1_val_idx_ccold
and t1_val_idx.  They should be reversed.
--
Michael


signature.asc
Description: PGP signature


Re: Collation versioning

2020-03-18 Thread Michael Paquier
On Wed, Mar 18, 2020 at 09:29:55PM +0100, Peter Eisentraut wrote:
> OK, I have committed the regcollation patch, and some surrounding cleanup of
> the reg* types documentation.

Thanks, Peter.
--
Michael


signature.asc
Description: PGP signature


Re: Collation versioning

2020-03-18 Thread Julien Rouhaud
On Wed, Mar 18, 2020 at 09:29:55PM +0100, Peter Eisentraut wrote:
> On 2020-03-17 18:43, Julien Rouhaud wrote:
> > On Tue, Mar 17, 2020 at 05:31:47PM +0100, Christoph Berg wrote:
> > > Re: Peter Eisentraut 
> > > 2020-03-17
> > > > Did we discuss the regcollation type?  In the current patch set, it's 
> > > > only
> > > > used in two places in a new regression test, where it can easily be 
> > > > replaced
> > > > by a join.  Do we need it?
> > 
> > I originally wrote it for a previous version of the patchset, to shorten the
> > pg_dump query, but that went out when I replaced the DDL command with native
> > functions instead.  It didn't seem to hurt to keep it, so I relied on it in 
> > the
> > regression tests.
> 
> OK, I have committed the regcollation patch, and some surrounding cleanup of
> the reg* types documentation.

Thanks!

> Note that your patch updated the pg_upgrade documentation to say that tables
> with regcollation columns cannot be upgraded but didn't actually patch the
> pg_upgrade code to make that happen.

Oh right, sorry for that I shouldn't have miss it:(




Re: Collation versioning

2020-03-18 Thread Peter Eisentraut

On 2020-03-17 18:43, Julien Rouhaud wrote:

On Tue, Mar 17, 2020 at 05:31:47PM +0100, Christoph Berg wrote:

Re: Peter Eisentraut 
2020-03-17

Did we discuss the regcollation type?  In the current patch set, it's only
used in two places in a new regression test, where it can easily be replaced
by a join.  Do we need it?


I originally wrote it for a previous version of the patchset, to shorten the
pg_dump query, but that went out when I replaced the DDL command with native
functions instead.  It didn't seem to hurt to keep it, so I relied on it in the
regression tests.


OK, I have committed the regcollation patch, and some surrounding 
cleanup of the reg* types documentation.


Note that your patch updated the pg_upgrade documentation to say that 
tables with regcollation columns cannot be upgraded but didn't actually 
patch the pg_upgrade code to make that happen.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Collation versioning

2020-03-18 Thread Julien Rouhaud
On Wed, Mar 18, 2020 at 04:55:25PM +0900, Michael Paquier wrote:
> On Tue, Mar 17, 2020 at 11:42:34AM +0100, Julien Rouhaud wrote:
> > On Tue, Mar 17, 2020 at 08:19:30AM +0100, Julien Rouhaud wrote:
>
> It would be good to be careful about the indentation.  Certain parts
> of 0003 don't respect the core indentation.  Not necessarily your job
> though.  Other than that 0003 seems to be in good shape.


I'll try to do a partial pgindent run on all patches before next patchset.


>
> @@ -54,6 +55,7 @@ recordDependencyOn(const ObjectAddress *depender,
>  void
>  recordMultipleDependencies(const ObjectAddress *depender,
> const ObjectAddress *referenced,
> +   const char *version,
> int nreferenced,
> DependencyType behavior)
> Nit from patch 0002: the argument "version" should be fourth I think,
> keeping the number of referenced objects and the referenced objects
> close.  And actually, this "version" argument is removed in patch
> 0004, replaced by the boolean track_version.  (By reading the
> arguments below I'd rather keep *version).
>
> So, 0004 is the core of the changes.  I have found a bug with the
> handling of refobjversion and pg_depend entries.  When swapping the
> dependencies of the old and new indexes in index_concurrently_swap(),
> refobjversion remains set to the value of the old index.  I used a
> manual UPDATE on pg_depend to emulate that with a past fake version
> string to emulate that (sneaky I know), but you would get the same
> behavior with an upgraded instance.  refobjversion should be updated
> to the version of the new index.


Oh good catch.  I'll dig into it.


> +if (track_version)
> +{
> +/* Only dependency on a collation is supported. */
> +if (referenced->classId == CollationRelationId)
> +{
> +/* C and POSIX collations don't require tracking the version 
> */
> +if (referenced->objectId == C_COLLATION_OID ||
> +referenced->objectId == POSIX_COLLATION_OID)
> +continue;
> I don't think that the API is right for this stuff, as you introduce
> collation-level knowledge into something which has been much more
> flexible until now.  Wouldn't it be better to move the refobjversion
> string directly into ObjectAddress?


We could, but we would then need to add code to retrieve the collation version
in multiple places (at least RecordDependencyOnCollation and
recordDependencyOnSingleRelExpr).  I'm afraid that'll open room for bugs if
some other places are missed, now or later, even more if more objects get a
versionning support.


> + * Test if a record exists for the given depender and referenceds addresses.
> [...]
> +   /* recordDependencyOnSingleRelExpr get rid of duplicate entries */
> Typos.
>
> +   /* XXX should we warn about "disappearing" versions? */
> +   if (current_version)
> What are disappearing version strings?


A collation for which a version was previously recorded but that now doesn't
report a version anymore.  For instance if upgrading from glibc X.Y to X.Z
changes gnu_get_libc_version() to return NULL, or if a new major pg version
removes support for glibc (or other lib) versioning.  It seems unlikely to
happen, and if that happens there's nothing we can do anymore to warn about
possible corruption anyway.


> +/*
> + * Perform version sanity checks on the relation underlying indexes if
> + * it's not a VACUUM FULL
> + */
> +if (!(options & VACOPT_FULL) && onerel && !IsSystemRelation(onerel) &&
> +onerel->rd_rel->relhasindex)
> Should this explain why?


I was assuming it's self explanatory, since VACUUM FULL is one of the 3 only
ways to fix a possibly corrupt index (on top of REINDEX and ALTER INDEX ...
ALTER COLLATION ... REFRESH VERSION).  I can mention it if needed though.


>
> +/* Record collations from the type itself, or underlying in case 
> of
> + * complex type.  Note that if the direct parent is a CollateExpr
> + * node, there's no need to record the type underlying collation 
> if
> Comment block format.

Oops, will fix.


> +-- for key columns, hash indexes should record dependency on the collation 
> but
> +-- not the version
> +CREATE INDEX icuidx18_hash_d_es ON collate_test USING hash (d_es);
> Why is that?


Because hash indexes don't rely on the sort order for the key columns?  So even
if the sort order changes the index won't get corrupted (unless it's a non
deterministic collation of course).


> The code in 0004 has no mention of that, and relies on
> this code path:
> +/*
> + * Returns whether the given index access method depend on a stable collation
> + * order.
> + */
> +static bool
> +index_depends_stable_coll_order(Oid amoid)
> +{
> +   return (amoid != HASH_AM_OID &&
> +   strcmp(get_am_name(amoid), "bloom") != 0);
> +}



Re: Collation versioning

2020-03-18 Thread Michael Paquier
On Tue, Mar 17, 2020 at 11:42:34AM +0100, Julien Rouhaud wrote:
> On Tue, Mar 17, 2020 at 08:19:30AM +0100, Julien Rouhaud wrote:
>> On Tue, Mar 17, 2020 at 03:37:49PM +0900, Michael Paquier wrote:
>>> It seems to me that you could add an extra test with a catalog that
>>> does not exist, making sure that NULL is returned:
>>> SELECT to_regtype('ng_catalog."POSIX"');
>>
>>
>> Agreed, I'll add that, but using a name that looks less like a typo :)
> 
> Tests added, including one with an error output, as the not existing schema
> doesn't reveal the encoding.

Yep.

>> Ah good catch, I missed that during the NameData/text refactoring.  I'll fix 
>> it
>> anyway, better to have clean history.
> 
> And this should be fixed too.

Thanks.

It would be good to be careful about the indentation.  Certain parts
of 0003 don't respect the core indentation.  Not necessarily your job
though.  Other than that 0003 seems to be in good shape.

@@ -54,6 +55,7 @@ recordDependencyOn(const ObjectAddress *depender,
 void
 recordMultipleDependencies(const ObjectAddress *depender,
const ObjectAddress *referenced,
+   const char *version,
int nreferenced,
DependencyType behavior)
Nit from patch 0002: the argument "version" should be fourth I think,
keeping the number of referenced objects and the referenced objects
close.  And actually, this "version" argument is removed in patch
0004, replaced by the boolean track_version.  (By reading the
arguments below I'd rather keep *version).

So, 0004 is the core of the changes.  I have found a bug with the
handling of refobjversion and pg_depend entries.  When swapping the
dependencies of the old and new indexes in index_concurrently_swap(),
refobjversion remains set to the value of the old index.  I used a
manual UPDATE on pg_depend to emulate that with a past fake version
string to emulate that (sneaky I know), but you would get the same
behavior with an upgraded instance.  refobjversion should be updated
to the version of the new index. 

+void recordDependencyOnCollations(ObjectAddress *myself,
+  List *collations,
+  bool record_version)
Incorrect declaration format.

+if (track_version)
+{
+/* Only dependency on a collation is supported. */
+if (referenced->classId == CollationRelationId)
+{
+/* C and POSIX collations don't require tracking the version */
+if (referenced->objectId == C_COLLATION_OID ||
+referenced->objectId == POSIX_COLLATION_OID)
+continue;
I don't think that the API is right for this stuff, as you introduce
collation-level knowledge into something which has been much more
flexible until now.  Wouldn't it be better to move the refobjversion
string directly into ObjectAddress?

+ * Test if a record exists for the given depender and referenceds addresses.
[...]
+   /* recordDependencyOnSingleRelExpr get rid of duplicate entries */
Typos.

+   /* XXX should we warn about "disappearing" versions? */
+   if (current_version)
What are disappearing version strings?

+/*
+ * Perform version sanity checks on the relation underlying indexes if
+ * it's not a VACUUM FULL
+ */
+if (!(options & VACOPT_FULL) && onerel && !IsSystemRelation(onerel) &&
+onerel->rd_rel->relhasindex)
Should this explain why?

+/* Record collations from the type itself, or underlying in case of
+ * complex type.  Note that if the direct parent is a CollateExpr
+ * node, there's no need to record the type underlying collation if
Comment block format.

+-- for key columns, hash indexes should record dependency on the collation but
+-- not the version
+CREATE INDEX icuidx18_hash_d_es ON collate_test USING hash (d_es);
Why is that?  The code in 0004 has no mention of that, and relies on
this code path:
+/*
+ * Returns whether the given index access method depend on a stable collation
+ * order.
+ */
+static bool
+index_depends_stable_coll_order(Oid amoid)
+{
+   return (amoid != HASH_AM_OID &&
+   strcmp(get_am_name(amoid), "bloom") != 0);
+}
And how is that even extensible for custom index AMs?  There are other
things than bloom out there.

+   /*
+* We only care about dependencies on a specific collation if a valid Oid
+* was given.=
+*/
[...]
+   /*
+* do not issue UNKNOWN VERSION is caller specified that those are
+* compatible
+*/
Typos from patch 5.

-   $self->logfile, '-o', "--cluster-name=$name", 'start');
+   $self->logfile, '-o', $options, 'start');
This needs to actually be shaped with two separate arguments for
--cluster-name or using quotes would not work properly if I recall
correctly.  Not your patch's fault, so I would fix that separately.
--
Michael



Re: Collation versioning

2020-03-17 Thread Michael Paquier
On Tue, Mar 17, 2020 at 06:43:51PM +0100, Julien Rouhaud wrote:
> On Tue, Mar 17, 2020 at 05:31:47PM +0100, Christoph Berg wrote:
>> Not sure if that's the case there, but reg* typecasts are very handy
>> when used interactively in ad-hoc queries.
> 
> +1.  I'm obviously biased, but I find it quite useful when querying pg_depend,
> which may become more frequent once we start generating warnings about 
> possibly
> corrupted indexes.

That means less joins for lookup queries, and collations can be
schema-qualified, so I would be in favor of adding it rather than not.
Now, it is true as well that ::regcollation is not a mandatory
requirement for the feature discussed on this thread.
--
Michael


signature.asc
Description: PGP signature


Re: Collation versioning

2020-03-17 Thread Julien Rouhaud
On Tue, Mar 17, 2020 at 05:31:47PM +0100, Christoph Berg wrote:
> Re: Peter Eisentraut 2020-03-17 
> 
> > Did we discuss the regcollation type?  In the current patch set, it's only
> > used in two places in a new regression test, where it can easily be replaced
> > by a join.  Do we need it?


I originally wrote it for a previous version of the patchset, to shorten the
pg_dump query, but that went out when I replaced the DDL command with native
functions instead.  It didn't seem to hurt to keep it, so I relied on it in the
regression tests.


> > I realize we've been adding new reg* types lately; I'm not sure what the
> > current idea is on that.
>
> Not sure if that's the case there, but reg* typecasts are very handy
> when used interactively in ad-hoc queries.


+1.  I'm obviously biased, but I find it quite useful when querying pg_depend,
which may become more frequent once we start generating warnings about possibly
corrupted indexes.




Re: Collation versioning

2020-03-17 Thread Christoph Berg
Re: Peter Eisentraut 2020-03-17 

> Did we discuss the regcollation type?  In the current patch set, it's only
> used in two places in a new regression test, where it can easily be replaced
> by a join.  Do we need it?
> 
> I realize we've been adding new reg* types lately; I'm not sure what the
> current idea is on that.

Not sure if that's the case there, but reg* typecasts are very handy
when used interactively in ad-hoc queries.

Christoph




Re: Collation versioning

2020-03-17 Thread Peter Eisentraut
Did we discuss the regcollation type?  In the current patch set, it's 
only used in two places in a new regression test, where it can easily be 
replaced by a join.  Do we need it?


I realize we've been adding new reg* types lately; I'm not sure what the 
current idea is on that.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Collation versioning

2020-03-17 Thread opolofdez
njhjo



Enviado desde mi Redmi 4AEl Julien Rouhaud , 16 mar. 2020 3:05 p. m. escribió:On Mon, Mar 16, 2020 at 04:57:38PM +0900, Michael Paquier wrote:
> On Thu, Mar 12, 2020 at 03:00:26PM +0100, Julien Rouhaud wrote:
> > And v15 due to conflict with b08dee24a5 (Add pg_dump support for ALTER obj
> > DEPENDS ON EXTENSION).
>
> I have looked at patches 0001~0003 in the set for now.


Thanks!


> In patch 0002, you have the following addition:
> @@ -103,9 +103,10 @@ ORDER BY 1, 2;
>   pg_class    | relacl    | aclitem[]
>   pg_class    | reloptions    | text[]
>   pg_class    | relpartbound  | pg_node_tree
> + pg_depend   | refobjversion | text
> This comes from a regression test doing a sanity check to look for
> catalogs which have a toastable column but no toast tables.  As an
> exception, it should be documented in the test's comment.  Actually,
> does it need to be an exception?  This does not depend on
> relation-level facilities so there should be no risk of recursive
> dependencies, though I have not looked in details at this part.


I totally missed that, and I agree that there's no need for an exception, so
fixed.


> +  
> +   The only current use of refobjversion is to
> +   record dependencies between indexes and collation versions.
> +  
> [...]
> + 
> +  refobjversion
> +  text
> +  
> +  
> +   An optional version for the referenced object; see text
> +  
> + 
> Couldn't you merge both paragraphs here?


Done.


> Regarding patch 0003, it would be nice to include some tests
> independent on the rest and making use of the new functions.  These
> normally go in regproc.sql.  For example with a collation that needs
> double quotes as this is not obvious:
> =# select regcollation('"POSIX"');
> regcollation
>  --
>  "POSIX"
> (1 row)
>
> On top of that, this needs tests with to_regcollation() and tests with
> schema-qualified collations.


Done too, using the same collation name, for both with and without schema
qualification.


> Documentation for to_regcollation() is missing.  And it looks that
> many parts of the documentation are missing an update.  One example in
> datatype.sgml:
> Type oid represents an object identifier.  There are also
> several alias types for oid: regproc,
> regprocedure, regoper, regoperator,
> regclass, regtype, regrole,
> regnamespace, regconfig, and
> regdictionary.   shows an
> overview.
> At quick glance, there are more sections in need of a refresh..


Indeed.  I found missing reference in datatype.sgml; func.sgml and
pgupgrade.sgml.

v16 attached.



Re: Collation versioning

2020-03-17 Thread Julien Rouhaud
On Tue, Mar 17, 2020 at 03:37:49PM +0900, Michael Paquier wrote:
> On Mon, Mar 16, 2020 at 03:05:20PM +0100, Julien Rouhaud wrote:
> > On Mon, Mar 16, 2020 at 04:57:38PM +0900, Michael Paquier wrote:
> >> This comes from a regression test doing a sanity check to look for
> >> catalogs which have a toastable column but no toast tables.  As an
> >> exception, it should be documented in the test's comment.  Actually,
> >> does it need to be an exception?  This does not depend on
> >> relation-level facilities so there should be no risk of recursive
> >> dependencies, though I have not looked in details at this part.
> >
> > I totally missed that, and I agree that there's no need for an exception, so
> > fixed.
>
> How long can actually be collation version strings?  Note also
> 96cdeae, which makes sense for pg_depend to have one.


Versions shouldn't be that long usually, but there were some previous
discussions on how to try to come up with some workaround on systems that don't
provide a version, using a hash of the underlying file or something like that.
Using a hash value big enough to require toasting wouldn't make much sense, but
it feels safer to be ready to handle any later use, whether for collation or
other kind of objects


> >> Regarding patch 0003, it would be nice to include some tests
> >> independent on the rest and making use of the new functions.  These
> >> normally go in regproc.sql.  For example with a collation that needs
> >> double quotes as this is not obvious:
> >> =# select regcollation('"POSIX"');
> >> regcollation
> >>  --
> >>  "POSIX"
> >> (1 row)
> >>
> >> On top of that, this needs tests with to_regcollation() and tests with
> >> schema-qualified collations.
> >
> >
> > Done too, using the same collation name, for both with and without schema
> > qualification.
>
> It seems to me that you could add an extra test with a catalog that
> does not exist, making sure that NULL is returned:
> SELECT to_regtype('ng_catalog."POSIX"');


Agreed, I'll add that, but using a name that looks less like a typo :)


>
> pg_collation_actual_version
> -   
> pg_collation_actual_version(oid)
> +   
> pg_collation_actual_version(regcollation)
>
> The function's input argument is not changed, why?


That's a mistake, will fix.


> Patch 0003 is visibly getting in shape, and that's an independent
> piece.  I guess that Thomas is looking at that, so let's wait for his
> input.
>
> Note that patch 0002 fails to compile because it is missing to include
> utils/builtins.h for CStringGetTextDatum(), and that you cannot pass
> down a NameData to this API, because it needs a simple char string or
> you would need NameStr() or such.  Anyway, it happens that you don't
> need recordDependencyOnVersion() at all, because it is removed by
> patch 0004 in the set, so you could just let it go.


Ah good catch, I missed that during the NameData/text refactoring.  I'll fix it
anyway, better to have clean history.




Re: Collation versioning

2020-03-17 Thread Michael Paquier
On Mon, Mar 16, 2020 at 03:05:20PM +0100, Julien Rouhaud wrote:
> On Mon, Mar 16, 2020 at 04:57:38PM +0900, Michael Paquier wrote:
>> This comes from a regression test doing a sanity check to look for
>> catalogs which have a toastable column but no toast tables.  As an
>> exception, it should be documented in the test's comment.  Actually,
>> does it need to be an exception?  This does not depend on
>> relation-level facilities so there should be no risk of recursive
>> dependencies, though I have not looked in details at this part.
> 
> I totally missed that, and I agree that there's no need for an exception, so
> fixed.

How long can actually be collation version strings?  Note also
96cdeae, which makes sense for pg_depend to have one.

>> Regarding patch 0003, it would be nice to include some tests
>> independent on the rest and making use of the new functions.  These
>> normally go in regproc.sql.  For example with a collation that needs
>> double quotes as this is not obvious:
>> =# select regcollation('"POSIX"');
>> regcollation
>>  --
>>  "POSIX"
>> (1 row)
>>
>> On top of that, this needs tests with to_regcollation() and tests with
>> schema-qualified collations.
> 
> 
> Done too, using the same collation name, for both with and without schema
> qualification.

It seems to me that you could add an extra test with a catalog that
does not exist, making sure that NULL is returned:
SELECT to_regtype('ng_catalog."POSIX"');

The other two cases are not really doable in regproc.sql as they would
show up the encoding used in the error message, but there could be a
point to include them in collate.icu.utf8.sql or equivalent.

> Indeed.  I found missing reference in datatype.sgml; func.sgml and
> pgupgrade.sgml.

That looks right.

   
pg_collation_actual_version
-   
pg_collation_actual_version(oid)
+   
pg_collation_actual_version(regcollation)
   
The function's input argument is not changed, why?

Patch 0003 is visibly getting in shape, and that's an independent
piece.  I guess that Thomas is looking at that, so let's wait for his
input.

Note that patch 0002 fails to compile because it is missing to include
utils/builtins.h for CStringGetTextDatum(), and that you cannot pass
down a NameData to this API, because it needs a simple char string or
you would need NameStr() or such.  Anyway, it happens that you don't
need recordDependencyOnVersion() at all, because it is removed by
patch 0004 in the set, so you could just let it go.

I am still looking at the rest as of 0004~0007, the largest pieces.
--
Michael


signature.asc
Description: PGP signature


Re: Collation versioning

2020-03-16 Thread Michael Paquier
On Thu, Mar 12, 2020 at 03:00:26PM +0100, Julien Rouhaud wrote:
> And v15 due to conflict with b08dee24a5 (Add pg_dump support for ALTER obj
> DEPENDS ON EXTENSION).

I have looked at patches 0001~0003 in the set for now.  0001 looks
clean to me.

In patch 0002, you have the following addition:
@@ -103,9 +103,10 @@ ORDER BY 1, 2;
  pg_class| relacl| aclitem[]
  pg_class| reloptions| text[]
  pg_class| relpartbound  | pg_node_tree
+ pg_depend   | refobjversion | text
This comes from a regression test doing a sanity check to look for
catalogs which have a toastable column but no toast tables.  As an
exception, it should be documented in the test's comment.  Actually,
does it need to be an exception?  This does not depend on
relation-level facilities so there should be no risk of recursive
dependencies, though I have not looked in details at this part.

+  
+   The only current use of refobjversion is to
+   record dependencies between indexes and collation versions.
+  
[...]
+ 
+  refobjversion
+  text
+  
+  
+   An optional version for the referenced object; see text
+  
+ 
Couldn't you merge both paragraphs here?

Regarding patch 0003, it would be nice to include some tests
independent on the rest and making use of the new functions.  These
normally go in regproc.sql.  For example with a collation that needs
double quotes as this is not obvious:
=# select regcollation('"POSIX"');
regcollation
 --
 "POSIX"
(1 row)

On top of that, this needs tests with to_regcollation() and tests with
schema-qualified collations.

Documentation for to_regcollation() is missing.  And it looks that
many parts of the documentation are missing an update.  One example in
datatype.sgml:
Type oid represents an object identifier.  There are also
several alias types for oid: regproc,
regprocedure, regoper, regoperator,
regclass, regtype, regrole,
regnamespace, regconfig, and
regdictionary.   shows an
overview.
At quick glance, there are more sections in need of a refresh..
--
Michael


signature.asc
Description: PGP signature


Re: Collation versioning

2020-02-26 Thread Thomas Munro
On Thu, Feb 27, 2020 at 3:29 AM Julien Rouhaud  wrote:
> [v10]

Thanks.  I'll do some more testing and review soon.  It'd be really
cool to get this into PG13.

FYI cfbot said:

+++ 
/home/travis/build/postgresql-cfbot/postgresql/src/test/regress/results/collate.icu.utf8.out
2020-02-26 14:45:52.114401999 +
...
- icuidx06_d_en_fr_ga | "default" | up to date
+ icuidx06_d_en_fr_ga   | "default"  | out of date




Re: Collation versioning

2020-02-18 Thread Julien Rouhaud
On Mon, Feb 17, 2020 at 5:58 AM Thomas Munro  wrote:
>
> On Thu, Feb 13, 2020 at 8:13 AM Julien Rouhaud  wrote:
> > Hearing no complaints on the suggestions, I'm attaching v8 to address that:
> >
> > - pg_dump is now using a binary_upgrade_set_index_coll_version() function
> >   rather than plain DDL
> > - the additional DDL is now of the form:
> >   ALTER INDEX name ALTER COLLATION name REFRESH VERSION
>
> +1
>
> A couple of thoughts:
>
> @@ -1115,21 +1117,44 @@ index_create(Relation heapRelation,
> ...
> +   /*
> +* Get required distinct dependencies on collations
> for all index keys.
> +* Collations of directly referenced column in hash
> indexes can be
> +* skipped is they're deterministic.
> +*/
> for (i = 0; i < indexInfo->ii_NumIndexKeyAttrs; i++)
> {
> -   if (OidIsValid(collationObjectId[i]) &&
> -   collationObjectId[i] != DEFAULT_COLLATION_OID)
> +   Oid colloid = collationObjectId[i];
> +
> +   if (OidIsValid(colloid))
> {
> -   referenced.classId = CollationRelationId;
> -   referenced.objectId = collationObjectId[i];
> -   referenced.objectSubId = 0;
> +   if ((indexInfo->ii_Am != HASH_AM_OID) ||
> +
> !get_collation_isdeterministic(colloid))
>
> I still don't like the way catalog/index.c has hard-coded knowledge of
> HASH_AM_OID here.  Although it errs on the side of the assuming that
> there *is* a version dependency (good)

Oh, but it also means that it fails to create a versionless
dependency, which is totally wrong.  What we should do is instead
setup a "track_version" flag to pass down.

It also means that the current way I handled unknown version (empty
string) vs unknown collation lib version (null) will be problematic,
both for runtime check and pg_upgrade.  I think we should record an
empty string for both cases, and keep NULL for when explicitly no
version has to be recorded (whether it's not a dependency on
collation, or because the depender doesn't care about version).  It
also mean that I'm missing regression tests using such an access
method.

> there is already another AM in
> the tree that could safely skip it for deterministic collations AFAIK:
> Bloom indexes.  I suppose that any extension AM that is doing some
> kind of hashing would also like to be able to be able to opt out of
> collation version checking, when that is safe.  The question is how to
> model that in our system...

Oh indeed.

> One way would be for each AM to declare whether it is affected by
> collations; the answer could be yes/maybe (default), no,
> only-non-deterministic-ones.  But that still feels like the wrong
> level, not taking advantage of knowledge about operators.

On the other hand, would it be possible that some AM only supports
collation-dependency-free operators while still internally relying on
a stable sort order?

> A better way might be to make declarations about that sort of thing in
> the catalog, somewhere in the vicinity of the operator classes, or
> maybe just to have hard coded knowledge about operator classes (ie
> making declarations in the manual about what eg hash functions are
> allowed to consult and when), and then check which of those an index
> depends on.  I am not sure what would be best, I'd need to spend some
> time studying the am operator system.

I think this will be required at some point anyway, if we want to
eventually avoid warning about possible corruption when an
expression/where clause isn't depending on the collation ordering.

> Perhaps for the first version of this feature, we should just add a
> new local function
> index_can_skip_collation_version_dependency(indexInfo, colloid) to
> encapsulate your existing logic, but add a comment that in future we
> might be able to support skipping in more cases through analysis of
> the catalogs.

That'd be convenient, but would also break extensibility as bloom
would get a preferential treatment (even if such AM doesn't already
exist).

>
> +   
> +ALTER COLLATION
> +
> + 
> +  This command declares that the index is compatible with the currently
> +  installed version of the collations that determine its order.  It is 
> used
> +  to silence warnings caused by collation version incompatibilities and
> +  should be called after rebuilding the index or otherwise verifying its
> +  consistency.  Be aware that incorrect use of this command can hide 
> index
> +  corruption.
> + 
> +
> +   
>
> This sounds like something that you need to do after you reindex, but
> that's not true, is it?  This is something you can do *instead* of
> reindex, to make it shut up about versions.  Shouldn't it be something
> like "... should be issued only if the 

Re: Collation versioning

2020-02-16 Thread Thomas Munro
On Thu, Feb 13, 2020 at 8:13 AM Julien Rouhaud  wrote:
> Hearing no complaints on the suggestions, I'm attaching v8 to address that:
>
> - pg_dump is now using a binary_upgrade_set_index_coll_version() function
>   rather than plain DDL
> - the additional DDL is now of the form:
>   ALTER INDEX name ALTER COLLATION name REFRESH VERSION

+1

A couple of thoughts:

@@ -1115,21 +1117,44 @@ index_create(Relation heapRelation,
...
+   /*
+* Get required distinct dependencies on collations
for all index keys.
+* Collations of directly referenced column in hash
indexes can be
+* skipped is they're deterministic.
+*/
for (i = 0; i < indexInfo->ii_NumIndexKeyAttrs; i++)
{
-   if (OidIsValid(collationObjectId[i]) &&
-   collationObjectId[i] != DEFAULT_COLLATION_OID)
+   Oid colloid = collationObjectId[i];
+
+   if (OidIsValid(colloid))
{
-   referenced.classId = CollationRelationId;
-   referenced.objectId = collationObjectId[i];
-   referenced.objectSubId = 0;
+   if ((indexInfo->ii_Am != HASH_AM_OID) ||
+
!get_collation_isdeterministic(colloid))

I still don't like the way catalog/index.c has hard-coded knowledge of
HASH_AM_OID here.  Although it errs on the side of the assuming that
there *is* a version dependency (good), there is already another AM in
the tree that could safely skip it for deterministic collations AFAIK:
Bloom indexes.  I suppose that any extension AM that is doing some
kind of hashing would also like to be able to be able to opt out of
collation version checking, when that is safe.  The question is how to
model that in our system...

One way would be for each AM to declare whether it is affected by
collations; the answer could be yes/maybe (default), no,
only-non-deterministic-ones.  But that still feels like the wrong
level, not taking advantage of knowledge about operators.

A better way might be to make declarations about that sort of thing in
the catalog, somewhere in the vicinity of the operator classes, or
maybe just to have hard coded knowledge about operator classes (ie
making declarations in the manual about what eg hash functions are
allowed to consult and when), and then check which of those an index
depends on.  I am not sure what would be best, I'd need to spend some
time studying the am operator system.

Perhaps for the first version of this feature, we should just add a
new local function
index_can_skip_collation_version_dependency(indexInfo, colloid) to
encapsulate your existing logic, but add a comment that in future we
might be able to support skipping in more cases through analysis of
the catalogs.

+   
+ALTER COLLATION
+
+ 
+  This command declares that the index is compatible with the currently
+  installed version of the collations that determine its order.  It is used
+  to silence warnings caused by collation version incompatibilities and
+  should be called after rebuilding the index or otherwise verifying its
+  consistency.  Be aware that incorrect use of this command can hide index
+  corruption.
+ 
+
+   

This sounds like something that you need to do after you reindex, but
that's not true, is it?  This is something you can do *instead* of
reindex, to make it shut up about versions.  Shouldn't it be something
like "... should be issued only if the ordering is known not to have
changed since the index was built"?

+-- Test ALTER INDEX name ALTER COLLATION name REFRESH VERSION
+UPDATE pg_depend SET refobjversion = 'not a version'
+WHERE refclassid = 'pg_collation'::regclass
+AND objid::regclass::text = 'icuidx17_part'
+AND refobjversion IS NOT NULL;
+SELECT objid::regclass FROM pg_depend WHERE refobjversion = 'not a version';
+ objid
+---
+ icuidx17_part
+(1 row)
+
+ALTER INDEX icuidx17_part ALTER COLLATION "en-x-icu" REFRESH VERSION;
+SELECT objid::regclass FROM pg_depend WHERE refobjversion = 'not a version';
+ objid
+---
+(0 rows)
+

Would it be better to put refobjversion = 'not a version' in the
SELECT list, instead of the WHERE clause, with a WHERE clause that
hits that one row, so that we can see that the row still exists after
the REFRESH VERSION (while still hiding the unstable version string)?




Re: Collation versioning

2020-02-12 Thread Thomas Munro
On Thu, Feb 13, 2020 at 9:16 AM Julien Rouhaud  wrote:
> On Wed, Feb 12, 2020 at 08:55:06PM +0100, Laurenz Albe wrote:
> > I didn't study the patch in detail, but do I get it right that there will 
> > be no
> > warnings about version incompatibilities with libc collations?
>
> No, libc is also be supported (including the default collation), as long as we
> have a way to get the version.  Unfortunately, that means only linux/glibc.  I
> think that there was some previous discussion to work around that limitation
> for other systems, using some kind of hash of the underlying collation files,
> as Peter mentioned recently, but that's not part of this patchset.

Yeah, this is about the cataloguing infrastructure part, to get the
model and mechanisms right.  To actually get version information from
the underlying collation provider, there will need to be a series of
per-OS projects.  For glibc right now, it's done, but we just use the
whole glibc version as a proxy (sadly we know this can give false
positives and false negatives, but is expected to work a lot better
than nothing).  I hope we can get a proper CLDR version out of that
library one day.  For FreeBSD libc, I have patches, I just need more
round tuits.  For Windows, there is
https://commitfest.postgresql.org/27/2351/ which I'm planning to
commit soonish, after some more thought about the double-version
thing.  Then there is the "run a user-supplied script that gives me a
version" concept, which might work and perhaps allow package
maintainers to supply a script that works on each system.  Again,
that'd be a separate project.  I guess there will probably always be
some OSes that we can't get the data from so we'll probably always
have to support "don't know" mode.




Re: Collation versioning

2020-02-12 Thread Julien Rouhaud
On Wed, Feb 12, 2020 at 08:55:06PM +0100, Laurenz Albe wrote:
> On Wed, 2020-02-12 at 20:13 +0100, Julien Rouhaud wrote:
> > On Wed, Feb 05, 2020 at 05:17:25PM +0100, Julien Rouhaud wrote:
> > > Note that I didn't change any syntax (or switched to native functions
> > > for the binary pg_dump) as it's still not clear to me what exactly
> > > should be implemented.
> >
> > Hearing no complaints on the suggestions, I'm attaching v8 to address that:
> >
> > - pg_dump is now using a binary_upgrade_set_index_coll_version() function
> >   rather than plain DDL
> > - the additional DDL is now of the form:
> >   ALTER INDEX name ALTER COLLATION name REFRESH VERSION
> >
> > I also added an alternate file for the collate.icu.utf8, so the build farm 
> > bot
> > should turn green for the linux part.
>
> diff --git a/doc/src/sgml/ref/alter_index.sgml 
> b/doc/src/sgml/ref/alter_index.sgml
> index 6d34dbb74e..8661b031e9 100644
> --- a/doc/src/sgml/ref/alter_index.sgml
> +++ b/doc/src/sgml/ref/alter_index.sgml
> @@ -109,6 +110,18 @@ ALTER INDEX ALL IN TABLESPACE  class="parameter">name
>  
> 
>
> +   
> +ALTER COLLATION
> +
> + 
> +  This form update the index existing dependency on a specific collation,
> +  to specificy the the currently installed collation version is 
> compatible
> +  with the version used the last time the index was built.  Be aware that
> +  an incorrect use of this form can hide a corruption on the index.
> + 
> +
> +   
> +
> 
>  SET (  class="parameter">storage_parameter =  class="parameter">value [, ... ] )
>  
>
> This description could do with some love.  Perhaps:
>
> This command declares that the index is compatible with the currently
> installed version of the collations that determine its order.  It is used
> to silence warnings caused by collation
> version incompatibilities and
> should be called after rebuilding the index or otherwise verifying its
> consistency.  Be aware that incorrect use of this command can hide
> index corruption.

Thanks a lot, that's indeed way better!  I'll add it in the round of patch.

> I didn't study the patch in detail, but do I get it right that there will be 
> no
> warnings about version incompatibilities with libc collations?

No, libc is also be supported (including the default collation), as long as we
have a way to get the version.  Unfortunately, that means only linux/glibc.  I
think that there was some previous discussion to work around that limitation
for other systems, using some kind of hash of the underlying collation files,
as Peter mentioned recently, but that's not part of this patchset.




Re: Collation versioning

2020-02-12 Thread Laurenz Albe
On Wed, 2020-02-12 at 20:13 +0100, Julien Rouhaud wrote:
> On Wed, Feb 05, 2020 at 05:17:25PM +0100, Julien Rouhaud wrote:
> > Note that I didn't change any syntax (or switched to native functions
> > for the binary pg_dump) as it's still not clear to me what exactly
> > should be implemented.
> 
> Hearing no complaints on the suggestions, I'm attaching v8 to address that:
> 
> - pg_dump is now using a binary_upgrade_set_index_coll_version() function
>   rather than plain DDL
> - the additional DDL is now of the form:
>   ALTER INDEX name ALTER COLLATION name REFRESH VERSION
> 
> I also added an alternate file for the collate.icu.utf8, so the build farm bot
> should turn green for the linux part.

diff --git a/doc/src/sgml/ref/alter_index.sgml 
b/doc/src/sgml/ref/alter_index.sgml
index 6d34dbb74e..8661b031e9 100644
--- a/doc/src/sgml/ref/alter_index.sgml
+++ b/doc/src/sgml/ref/alter_index.sgml
@@ -109,6 +110,18 @@ ALTER INDEX ALL IN TABLESPACE name
 

 
+   
+ALTER COLLATION
+
+ 
+  This form update the index existing dependency on a specific collation,
+  to specificy the the currently installed collation version is compatible
+  with the version used the last time the index was built.  Be aware that
+  an incorrect use of this form can hide a corruption on the index.
+ 
+
+   
+

 SET ( storage_parameter = value [, ... ] )
 

This description could do with some love.  Perhaps:

This command declares that the index is compatible with the currently
installed version of the collations that determine its order.  It is used
to silence warnings caused by collation
version incompatibilities and
should be called after rebuilding the index or otherwise verifying its
consistency.  Be aware that incorrect use of this command can hide
index corruption.

I didn't study the patch in detail, but do I get it right that there will be no
warnings about version incompatibilities with libc collations?

Yours,
Laurenz Albe





Re: Collation versioning

2020-01-29 Thread Julien Rouhaud
On Tue, Jan 28, 2020 at 1:06 PM Peter Eisentraut
 wrote:
>
> On 2019-12-17 14:25, Julien Rouhaud wrote:
> > PFA rebased v6, with the following changes:
>
> Some thoughts on this patch set.

Thanks for looking at it!

> I think we are all agreed on the general idea.
>
> 0001-0003 seem pretty much OK.  Why is pg_depend.refobjversion of type
> "name" whereas the previous pg_collation.collversion was type "text"?
> Related to that, we previously used null to indicate an unknown
> collation version, and now it's an empty string.

That's what Thomas implemented when he started to work on it and I
simply kept it that way until now.  I'm assuming that it was simply to
avoid wasting time on the varlena stuff while working on the
prototype, so barring any objections I'll change to nullable text
column in the next revision.

> For 0005, if the new commands are only to be used in binary upgrades,
> then they should be implemented as built-in functions instead of full
> DDL commands.  There is precedent for that.

Oh, I wasn't aware of that.  I can definitely use built-in functions
instead, but some people previously argued that those command should
be available even in non binary upgrade and I'm not clear on whether
this is wanted or not.  Do you have any thoughts on that?

> The documentation snippet for this patch talks about upgrades from PG12
> to later.  But what about upgrades on platforms where we currently don't
> have collation versioning but might introduce it later (FreeBSD,
> Windows)?  This needs to be generalized.

Good point, I'll try to improve that.

> For 0006 ("Add a new ALTER INDEX name DEPENDS ON COLLATION name
>   CURRENT VERSION"), I find the syntax misleading.  This command doesn't
> (or shouldn't) add a new dependency, it only changes the version of an
> existing dependency.  The previously used syntax ALTER COLLATION /
> REFRESH VERSION is a better vocabulary, I think.

I agree and also complained about that syntax too.  I'm however
struggling on coming up with a syntax that makes it clear it's
refreshing the version of a collation the index already depends on.
E.g.:

ALTER INDEX name ALTER COLLATION name REFRESH VERSION

is still quite poor, but I don't have anything better.  Do you have
some better suggestion or should I go with that?

> I think this whole thing needs more tests.  We are designing this
> delicate mechanism but have no real tests for it.  We at least need to
> come up with a framework for how to test this automatically, so that we
> can add more test cases over time as people will invariably report
> issues when this hits the real world.

Indeed.  I have some unlikely index test cases I'm for now using to
validate the behavior, but didn't start a real test infrastructure.
I'll also work on that for the next revision, although I'll need some
more thinking on how to properly test the pg_upgrade part.




Re: Collation versioning

2020-01-28 Thread Peter Eisentraut

On 2019-12-17 14:25, Julien Rouhaud wrote:

PFA rebased v6, with the following changes:


Some thoughts on this patch set.

I think we are all agreed on the general idea.

0001-0003 seem pretty much OK.  Why is pg_depend.refobjversion of type 
"name" whereas the previous pg_collation.collversion was type "text"? 
Related to that, we previously used null to indicate an unknown 
collation version, and now it's an empty string.


Also, this would limit collation versions to 63 characters.  Perhaps not 
a problem right now, but if someone wants to implement Thomas's previous 
md5-the-file idea with sha256, we'll run out of space.


For 0005, if the new commands are only to be used in binary upgrades, 
then they should be implemented as built-in functions instead of full 
DDL commands.  There is precedent for that.


The documentation snippet for this patch talks about upgrades from PG12 
to later.  But what about upgrades on platforms where we currently don't 
have collation versioning but might introduce it later (FreeBSD, 
Windows)?  This needs to be generalized.


For 0006 ("Add a new ALTER INDEX name DEPENDS ON COLLATION name
 CURRENT VERSION"), I find the syntax misleading.  This command doesn't 
(or shouldn't) add a new dependency, it only changes the version of an 
existing dependency.  The previously used syntax ALTER COLLATION / 
REFRESH VERSION is a better vocabulary, I think.


I think this whole thing needs more tests.  We are designing this 
delicate mechanism but have no real tests for it.  We at least need to 
come up with a framework for how to test this automatically, so that we 
can add more test cases over time as people will invariably report 
issues when this hits the real world.


--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Collation versioning

2019-12-12 Thread Julien Rouhaud
Hello Thomas,

Thanks for looking at it!

On Thu, Dec 12, 2019 at 5:01 AM Thomas Munro  wrote:
>
> We create duplicate pg_depend records:
>
> [...]
>
> I wondered if that was harmless

That's the assumed behavior of recordMultipleDependencies:

/*
* Record the Dependency.  Note we don't bother to check for
* duplicate dependencies; there's no harm in them.
*/

We could add a check to skip duplicates for the "track_version ==
true" path, or switch to flags if we want to also skip duplicates in
other cases, but it'll make recordMultipleDependencies a little bit
more specialised.

> but for one thing it causes duplicate warnings:

Yes, that should be avoided.

> Here's another way to get a duplicate, and in this example you also
> get an unnecessary dependency on 100 "default" for this index:
>
> postgres=# create index on t(x collate "fr_FR") where x > 'helicopter'
> collate "fr_FR";
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> -+---+--++--+-+---+-
> 1259 | 16460 |0 |   3456 |12603 |   0 |
> 0:34.0| n
> 1259 | 16460 |0 |   3456 |12603 |   0 |
> 0:34.0| n
> 1259 | 16460 |0 |   3456 |  100 |   0 |
> 0:34.0| n
> (3 rows)
>
> Or... maybe 100 should be there, by simple analysis of the x in the
> WHERE clause, but it's the same if you write x collate "fr_FR" >
> 'helicopter' collate "fr_FR", and in that case there are no
> expressions of collation "default" anywhere.

Ah good point.  That's because expression_tree_walker() will dig into
CollateExpr->args and eventually reach the underlying Var.  I don't
see an easy way to avoid that while still properly recording the
required dependency for an even more realistic index such as

CREATE INDEX ON t(x COLLATE "fr_FR") WHERE x > ((x COLLATE "en_US" >
'helicopter' COLLATE "en_US")::text) collate "fr_FR";

and for instance not for

CREATE INDEX ON t(x COLLATE "fr_FR") WHERE x > ((x COLLATE "en_US" ||
'helicopter' COLLATE "en_US")) collate "fr_FR";


> The indirection through composite types works nicely:
>
> postgres=# create type foo_t as (en text collate "en_CA", fr text
> collate "fr_CA");
> CREATE TYPE
> postgres=# create table t (foo foo_t);
> CREATE TABLE
> postgres=# create index on t(foo);
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_foo_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> -+---+--++--+-+---+-
> 1259 | 16444 |0 |   3456 |12554 |   0 |
> 0:34.0| n
> 1259 | 16444 |0 |   3456 |12597 |   0 |
> 0:34.0| n
> (2 rows)
>
> ... but again it shows the extra and technically unnecessary
> dependencies (only 12603 "fr_FR" is really needed):
>
> postgres=# create index on t(((foo).fr collate "fr_FR"));
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_fr_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> -+---+--++--+-+---+-
> 1259 | 16445 |0 |   3456 |12603 |   0 |
> 0:34.0| n
> 1259 | 16445 |0 |   3456 |12597 |   0 |
> 0:34.0| n
> 1259 | 16445 |0 |   3456 |12554 |   0 |
> 0:34.0| n
> (3 rows)

Yes :(

> I check that nested types are examined recursively, as appropriate.  I
> also tested domains, arrays, arrays of domains, expressions extracting
> an element from an array of a domain with an explicit collation, and
> the only problem I could find was more ways to get duplicates.  Hmm...
> what else is there that can contain a collatable type...?  Ranges!
>
> postgres=# create type myrange as range (subtype = text);
> CREATE TYPE
> postgres=# drop table t;
> DROP TABLE
> postgres=# create table t (x myrange);
> CREATE TABLE
> postgres=# create index on t(x);
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> and refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> -+---+--++--+-+---+-
> (0 rows)
>
> ... or perhaps, more realistically, a GIST index might actually be
> useful for range queries, and we're not capturing the dependency:
>
> postgres=# create index t_x_idx on t using gist (x);
> CREATE INDEX
> postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
> and refobjversion != '';
>  classid | objid 

Re: Collation versioning

2019-12-11 Thread Thomas Munro
On Thu, Dec 12, 2019 at 6:32 PM Tom Lane  wrote:
> Thomas Munro  writes:
> > Erm, but I shouldn't have to reindex my hash indexes (at least not
> > until someone invents collation-based equality and therefore
> > necessarily also collation-based hashing).  How can we exclude that?
>
> Um, we already invented that with nondeterministic collations, no?

Urghlgh, right, thanks, somehow I missed/forgot that that stuff
already works for hashing (neat).  So we do need to track collation
version dependencies for hash indexes, but only for non-deterministic
collations.  I wonder how best to code that.




Re: Collation versioning

2019-12-11 Thread Tom Lane
Thomas Munro  writes:
> Erm, but I shouldn't have to reindex my hash indexes (at least not
> until someone invents collation-based equality and therefore
> necessarily also collation-based hashing).  How can we exclude that?

Um, we already invented that with nondeterministic collations, no?

regards, tom lane




Re: Collation versioning

2019-12-11 Thread Thomas Munro
On Thu, Dec 12, 2019 at 5:00 PM Thomas Munro  wrote:
> Then, as a special case, there is the collation of the actual indexed
> value, because that will implicitly be used as input to the btree ops
> that would be collation sensitive.  [...]

Erm, but I shouldn't have to reindex my hash indexes (at least not
until someone invents collation-based equality and therefore
necessarily also collation-based hashing).  How can we exclude that?
amcanorder seems somehow right but also wrong.




Re: Collation versioning

2019-12-11 Thread Thomas Munro
On Thu, Dec 12, 2019 at 2:45 AM Julien Rouhaud  wrote:
> Hearing no objection in [1], attached v5 with following changes:
>
> - fix the spurious warnings by doing the version check in
> get_relation_info and vacuum_open_relation, and add a flag in
> RelationData to mark the check as already being done
> - change the IsCatalogRelation() check to IsSystemRelation() to also
> ignore toast indexes, as those can also be safely ignored too
> - add a new ALTER INDEX idxname DEPENDS ON COLLATION collname CURRENT
> VERSION to let users remove the warnings for safe library upgrade.
> Documentation and tab completion added
>
> If I'm not mistaken, all issues I was aware of are now fixed.

Thanks!  This is some great progress and I'm feeling positive about
getting this into PostgreSQL 13.  I haven't (re)reviewed the code yet,
but I played with it a bit and have some more feedback.

There are some missing semi-colons on the ALTER INDEX statements in
pg_dump.c that make the pg_upgrade test fail (at least, if LC_ALL is
set).

We create duplicate pg_depend records:

postgres=# create table t (x text);
CREATE TABLE
postgres=# create index on t(x) where x > 'hello';
CREATE INDEX
postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
and refobjversion != '';
 classid | objid | objsubid | refclassid | refobjid | refobjsubid |
refobjversion | deptype
-+---+--++--+-+---+-
1259 | 16424 |0 |   3456 |  100 |   0 |
0:34.0| n
1259 | 16424 |0 |   3456 |  100 |   0 |
0:34.0| n
(2 rows)

I wondered if that was harmless, but for one thing it causes duplicate warnings:

postgres=# update pg_depend set refobjversion = 'BANANA' where
refobjversion = '0:34.0';
UPDATE 2

[new session]
postgres=# select count(*) from t;
WARNING:  index "t_x_idx" depends on collation "default" version
"BANANA", but the current version is "0:34.0"
DETAIL:  The index may be corrupted due to changes in sort order.
HINT:  REINDEX to avoid the risk of corruption.
WARNING:  index "t_x_idx" depends on collation "default" version
"BANANA", but the current version is "0:34.0"
DETAIL:  The index may be corrupted due to changes in sort order.
HINT:  REINDEX to avoid the risk of corruption.

Here's another way to get a duplicate, and in this example you also
get an unnecessary dependency on 100 "default" for this index:

postgres=# create index on t(x collate "fr_FR") where x > 'helicopter'
collate "fr_FR";
CREATE INDEX
postgres=# select * from pg_depend where objid = 't_x_idx'::regclass
and refobjversion != '';
 classid | objid | objsubid | refclassid | refobjid | refobjsubid |
refobjversion | deptype
-+---+--++--+-+---+-
1259 | 16460 |0 |   3456 |12603 |   0 |
0:34.0| n
1259 | 16460 |0 |   3456 |12603 |   0 |
0:34.0| n
1259 | 16460 |0 |   3456 |  100 |   0 |
0:34.0| n
(3 rows)

Or... maybe 100 should be there, by simple analysis of the x in the
WHERE clause, but it's the same if you write x collate "fr_FR" >
'helicopter' collate "fr_FR", and in that case there are no
expressions of collation "default" anywhere.

The indirection through composite types works nicely:

postgres=# create type foo_t as (en text collate "en_CA", fr text
collate "fr_CA");
CREATE TYPE
postgres=# create table t (foo foo_t);
CREATE TABLE
postgres=# create index on t(foo);
CREATE INDEX
postgres=# select * from pg_depend where objid = 't_foo_idx'::regclass
and refobjversion != '';
 classid | objid | objsubid | refclassid | refobjid | refobjsubid |
refobjversion | deptype
-+---+--++--+-+---+-
1259 | 16444 |0 |   3456 |12554 |   0 |
0:34.0| n
1259 | 16444 |0 |   3456 |12597 |   0 |
0:34.0| n
(2 rows)

... but again it shows the extra and technically unnecessary
dependencies (only 12603 "fr_FR" is really needed):

postgres=# create index on t(((foo).fr collate "fr_FR"));
CREATE INDEX
postgres=# select * from pg_depend where objid = 't_fr_idx'::regclass
and refobjversion != '';
 classid | objid | objsubid | refclassid | refobjid | refobjsubid |
refobjversion | deptype
-+---+--++--+-+---+-
1259 | 16445 |0 |   3456 |12603 |   0 |
0:34.0| n
1259 | 16445 |0 |   3456 |12597 |   0 |
0:34.0| n
1259 | 16445 |0 |   3456 |12554 |   0 |
0:34.0| n
(3 rows)

I check that nested types are examined recursively, as appropriate.  I
also tested domains, arrays, arrays of domains, expressions extracting
an element from an array of a domain with an explicit collation, and

Re: Collation versioning

2019-12-10 Thread Julien Rouhaud
On Fri, Nov 8, 2019 at 2:24 AM Thomas Munro  wrote:
>
> On Thu, Nov 7, 2019 at 10:20 PM Julien Rouhaud  wrote:
> >
> > I didn't do anything for the spurious warning when running a reindex,
> > and kept original approach of pg_depend catalog.
>
> I see three options:
>
> 1.  Change all or some of index_open(), relation_open(),
> RelationIdGetRelation(), RelationBuildDesc() and
> RelationInitIndexAccessInfo() to take some kind of flag so we can say
> NO_COLLATION_VERSION_CHECK_PLEASE, and then have ReindexIndex() pass
> that flag down when opening it for the purpose of rebuilding it.
> 2.  Use a global state to suppress these warnings while opening that
> index.  Perhaps ReindexIndex() would call RelCacheWarnings(false)
> before index_open(), and use PG_FINALLY to make sure it restores
> warnings with RelCacheWarnings(true).  (This is a less-code-churn
> bad-programming-style version of #1.)
> 3.  Move the place that we run the collation version check.  Instead
> of doing it in RelationInitIndexAccessInfo() so that it runs when the
> relation cache first loads the index, put it into a new routine
> RelationCheckValidity() and call that from ... I don't know, some
> other place that runs whenever we access indexes but not when we
> rebuild them.
>
> 3's probably a better idea, if you can find a reasonable place to call
> it from.  I'm thinking some time around the time the executor acquires
> locks, but using a flag in the relcache entry to make sure it doesn't
> run the check again if there were no warnings last time (so one
> successful version check turns the extra work off for the rest of this
> relcache entry's lifetime).  I think it'd be a feature, not a bug, if
> it gave you the warning every single time you executed a query using
> an index that has a mismatch...  but a practical alternative would be
> to check only once per index and that probably makes more sense.

I tried to dig into approach #3.  I think that trying to perform this
check when the executor acquires locks won't work well with at least
with partitioned table.  I instead tried to handle that in
get_relation_info(), adding a flag in the relcache to emit each
warning only once per backend lifetime, and this seems to work quite
well.

I think that on top of that, we should make sure that non-full vacuum
and analyze also emit such warnings, so that autovacuum can spam the
logs too.




Re: Collation versioning

2019-12-03 Thread Julien Rouhaud
On Mon, Dec 2, 2019 at 2:00 PM Robert Haas  wrote:
>
> ALTER INDEX idx_name DEPENDS ON COLLATION blah VERSION blah;
> -- I care about collations and I know which one and which version.
>
> ALTER INDEX idx_name DEPENDS ON SOME COLLATION;
> -- I care about collations but I don't know which one.

This seems a little bit ambiguous.  I wouldn't expect this command to
trash all recorded versions given how it's spelled.

> ALTER INDEX idx_name DEPENDS ON NO COLLATION;
> -- I don't care about collations at all.
> -- Not sure if we need this.

This should be an alias for "trust me all collation are ok"?  It's
certainly useful for collation library upgrade that don't break
indexes, but I'd prefer to spell it something like "CURRENT VERSION".
I'll also work on that, but preferably in a later patch as there'll be
additional need (process all indexes with a given collation or
collation version for instance).


While looking at the list of keywords again, I think that "ANY" is a
better candidate:

ALTER INDEX idx_name DEPENDS ON [ ANY COLLATION | COLLATION blah ] [
UNKNOWN VERSION | VERSION blah ]
or
ALTER INDEX idx_name ALTER [ ANY COLLATION | COLLATION blah ] DEPENDS
ON  [ UNKNOWN VERSION  | VERSION blah ]

I like the 2nd one as it's more obvious that the command will only
modify existing dependencies.




Re: Collation versioning

2019-12-02 Thread Robert Haas
On Thu, Nov 28, 2019 at 8:08 AM Julien Rouhaud  wrote:
> What we could do is storing an empty string if the compatibility is
> unknown, and detect it in index_check_collation_version() to report a
> slightly different message.  I'm assuming that not knowing the
> compatibility would be system-wide rather than per collation, so we
> could use an sql query like this:
>
> ALTER INDEX idx_name DEPENDS ON COLLATION UNKNOWN VERSION
>
> If adding (un)reserved keywords is not an issue, we could also instead
> use something along ALTER INDEX idx_name DEPENDS ON ALL COLLATIONS
> and/or ALL VERSIONS UNKNOWN, or switch to:

Adding unreserved keywords isn't a huge issue, but it's nicer to avoid
it if we can. Bloating the parser tables isn't that expensive, but
neither is it free. Maybe spell it like this:

ALTER INDEX idx_name DEPENDS ON COLLATION blah VERSION blah;
-- I care about collations and I know which one and which version.

ALTER INDEX idx_name DEPENDS ON SOME COLLATION;
-- I care about collations but I don't know which one.

ALTER INDEX idx_name DEPENDS ON NO COLLATION;
-- I don't care about collations at all.
-- Not sure if we need this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Collation versioning

2019-11-28 Thread Julien Rouhaud
On Thu, Nov 28, 2019 at 5:50 AM Michael Paquier  wrote:
>
> On Wed, Nov 27, 2019 at 04:09:57PM -0500, Robert Haas wrote:
> > Yeah, I like #3 too. If we're going to the trouble to build all of
> > this mechanism, it seems worth it to build the additional machinery to
> > be precise about the difference between "looks like a problem" and "we
> > don't know".
>
> Indeed, #3 sounds like a sensible way of doing things.  The two others
> may cause random problems which are harder to actually detect and act
> on as we should avoid as much as possible a forced system-wide REINDEX
> after an upgrade to a post-13 PG.

Thanks everyone for the feedback!  Since there seems to be an
agreement on #3, here's a proposal.

What we could do is storing an empty string if the compatibility is
unknown, and detect it in index_check_collation_version() to report a
slightly different message.  I'm assuming that not knowing the
compatibility would be system-wide rather than per collation, so we
could use an sql query like this:

ALTER INDEX idx_name DEPENDS ON COLLATION UNKNOWN VERSION

If adding (un)reserved keywords is not an issue, we could also instead
use something along ALTER INDEX idx_name DEPENDS ON ALL COLLATIONS
and/or ALL VERSIONS UNKNOWN, or switch to:

ALTER INDEX idx_name ALTER [ COLLATION coll_name | ALL COLLATIONS ]
DEPENDS ON [ UNKOWN VERSION | VERSION 'version_string' ]

Obviously, specific versions would require a specific collation, and
at least UNKNOWN VERSION would only be allowed in binary upgrade mode,
and not documented.  I have also some other ideas for additional DDL
to let users deal with catalog update after a compatible collation
library upgrade, but let's deal with that later.

Then for pg_upgrade, we can add a --collations-are-binary-compatible
switch or similar:

If upgrading from pre-v13
  - without the switch, we'd generate the VERSION UNKNOWN for all
indexes during pg_dump in upgrade_mode
  - with the switch, do nothing as all indexes would already be
pointing to the currently installed version

If upgrading from post v13, the switch shouldn't be useful as versions
will be restored, and if there was a collation library upgrade it
should be handled manually, same as if such an upgrade is done without
pg_upgrade-ing the cluster.  I'd personally disallow it to avoid users
to shoot themselves in the foot.

Does this sounds sensible?




Re: Collation versioning

2019-11-27 Thread Michael Paquier
On Wed, Nov 27, 2019 at 04:09:57PM -0500, Robert Haas wrote:
> Yeah, I like #3 too. If we're going to the trouble to build all of
> this mechanism, it seems worth it to build the additional machinery to
> be precise about the difference between "looks like a problem" and "we
> don't know".

Indeed, #3 sounds like a sensible way of doing things.  The two others
may cause random problems which are harder to actually detect and act
on as we should avoid as much as possible a forced system-wide REINDEX
after an upgrade to a post-13 PG.
--
Michael


signature.asc
Description: PGP signature


Re: Collation versioning

2019-11-27 Thread Robert Haas
On Tue, Nov 26, 2019 at 3:44 PM Thomas Munro  wrote:
> I suppose another reason to use such a switch would be if there is a
> change in the versioning scheme; for example, as of today in master we
> are using the glibc version, but a future glibc release might offer an
> interface to query the CLDR version it's using, and then a future
> release of PostgreSQL might get support for that, so the strings would
> change between major version of PostgreSQL but you might want to be
> able to tell pg_upgrade that your indexes are good.

Yeah, I like #3 too. If we're going to the trouble to build all of
this mechanism, it seems worth it to build the additional machinery to
be precise about the difference between "looks like a problem" and "we
don't know".

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Collation versioning

2019-11-26 Thread Thomas Munro
On Fri, Nov 8, 2019 at 5:40 PM Laurenz Albe  wrote:
> On Fri, 2019-11-08 at 15:04 +1300, Thomas Munro wrote:
> > 3.  We don't know if pre-13 indexes are corrupted or not, and we'll
> > record that with a special value just as in proposal #1, except that
> > we could show a different hint for that special version value.  It
> > would tell you can you can either REINDEX, or run ALTER INDEX ...
> > DEPENDS ON COLLATION "fr_FR" VERSION '34.0' if you believe the index
> > to have been created with the current collation version on an older
> > release of PostgreSQL that didn't track versions.

> #3 is the best proposal, but there is still the need to run
> ALTER INDEX on all affected indexes to keep PostgreSQL from nagging.
> Perhaps the situation could be improved with a pg_upgrade option
> --i-know-my-indexes-are-fine that causes a result like #2.
> Together with a bold note in the release notes, this may relieve
> the pain.

I suppose another reason to use such a switch would be if there is a
change in the versioning scheme; for example, as of today in master we
are using the glibc version, but a future glibc release might offer an
interface to query the CLDR version it's using, and then a future
release of PostgreSQL might get support for that, so the strings would
change between major version of PostgreSQL but you might want to be
able to tell pg_upgrade that your indexes are good.




Re: Collation versioning

2019-11-24 Thread Julien Rouhaud
On Tue, Nov 12, 2019 at 10:16 PM Thomas Munro  wrote:
>
> On Wed, Nov 13, 2019 at 3:27 AM Julien Rouhaud  wrote:
> > On Sun, Nov 10, 2019 at 10:08 AM Thomas Munro  
> > wrote:
> > > That's because the 0003 patch only calls recordDependencyOnVersion()
> > > for simple attribute references.  When
> > > recordDependencyOnSingleRelExpr() is called by index_create() to
> > > analyse ii_Expressions and ii_Predicate, it's going to have to be
> > > smart enough to detect collation references and record the versions.
> > > There is also some more code that ignores pinned collations hiding in
> > > there.
[...]

Indeed.  Now, using a composite type in an expression index, I can see that eg.

CREATE TYPE mytype AS (fr text COLLATE "fr-x-icu", en text COLLATE "en-x-icu");
CREATE TABLE test1(id integer, myval mytype);
CREATE INDEX ON sometable (somecol) WHERE (mytype).fr_name = 'meh'

does create a dependency on fr-x-icu collation, because collations are
checked for FieldSelect nodes (which indeed ignores default
collation), but eg.

CREATE INDEX idx2 ON test1(id) WHERE myval = ('foo', 'bar');

won't, so I confirm that recordDependencyOnSingleRelExpr() isn't
bullet proof either for finding collation dependencies.


> > > I think create_index() will need to perform recursive analysis on
> > > composite types to look for text attributes, when they appear as
> > > simple attributes, and then add direct dependencies index -> collation
> > > to capture the versions.  Then we might need to do the same for
> > > composite types hiding inside ii_Expressions and ii_Predicate (once we
> > > figure out what that really means).

I did write some code that can extract all collations that are used by
a datatype, which seems to work as intended over many combinations of
composite / array / domain types used in index simple attributes.  I'm
not sure if I should change find_expr_references_walker (called by
recordDependencyOnExpr) to also track those new dependencies in
ii_Expression and ii_Predicate, as it'll also add unneeded
dependencies for other callers.  And if I should add version detection
there too or have recordMultipleDependencies() automatically take care
of this.

> > A simple and exhaustive way to deal with that would be
> > to teach recordMultipleDependencies() to override isObjectPinned() and
> > retrieve the collation version if the referenced object is a collation
> > and it's neither C or POSIX collation
> >
> That doesn't seem like the right place; that's a raw data insertion
> function, though... I guess it does already have enough brains to skip
> pinned objects.  Hmm.

Another point is that unless we also do an additional check to see
what relkind is referencing the collation, it'll record a version
string for types and other objects.

> > Isn't that actually a bug?  For instance such an index will have a 0
> > indcollation in pg_index, and according to pg_index documentation:
> >
> > " this contains the OID of the collation to use for the index, or zero
> > if the column is not of a collatable data type."
> >
> > You can't use a COLLATE expression on such data type, but it still has
> > a collation used.
>
> I don't think it's a bug.  The information is available, but you have
> to follow the graph to get it.  Considering that the composite type
> could be something like CREATE TYPE my_type AS (fr_name text COLLATE
> "fr_CA", en_name text COLLATE "en_CA"), there is no single collation
> you could put into pg_index.indcollation anyway.  As for pg_depend,
> it's currently enough for the index to depend on the type, and the
> type to depend on the collation, because the purpose of dependencies
> is to control dropping and dumping order, but for our new purpose we
> also need to create direct dependencies index -> "fr_CA", index ->
> "en_CA" so we can record the current versions.

Oh right, I didn't think about that.

> > > 3.  Test t/002_pg_dump.pl in src/bin/pg_upgrade fails.
> >
> > Apparently neither "make check" nor "make world" run this test :(
> > This was broken due collversion support in pg_dump, I have fixed it
> > locally.
>
> make check-world

Thanks!




Re: Collation versioning

2019-11-12 Thread Thomas Munro
On Wed, Nov 13, 2019 at 3:27 AM Julien Rouhaud  wrote:
> On Sun, Nov 10, 2019 at 10:08 AM Thomas Munro  wrote:
> > That's because the 0003 patch only calls recordDependencyOnVersion()
> > for simple attribute references.  When
> > recordDependencyOnSingleRelExpr() is called by index_create() to
> > analyse ii_Expressions and ii_Predicate, it's going to have to be
> > smart enough to detect collation references and record the versions.
> > There is also some more code that ignores pinned collations hiding in
> > there.
> >
> > This leads to the difficult question of how you recognise a real
> > dependency on a collation's version in an expression.  I have some
> > vague ideas but haven't seriously looked into it yet.  (The same
> > question comes up for check constraint -> collation dependencies.)
>
> Oh good point.  A simple and exhaustive way to deal with that would be
> to teach recordMultipleDependencies() to override isObjectPinned() and
> retrieve the collation version if the referenced object is a collation
> and it's neither C or POSIX collation.   It means that we could also
> remove the extra "version" argument and get rid of
> recordDependencyOnVersion to simply call recordMultipleDependencies in
> create_index for direct column references having a collation.
>
> Would it be ok to add this kind of knowledge in
> recordMultipleDependencies() or is it too hacky?

That doesn't seem like the right place; that's a raw data insertion
function, though... I guess it does already have enough brains to skip
pinned objects.  Hmm.

> > I think create_index() will need to perform recursive analysis on
> > composite types to look for text attributes, when they appear as
> > simple attributes, and then add direct dependencies index -> collation
> > to capture the versions.  Then we might need to do the same for
> > composite types hiding inside ii_Expressions and ii_Predicate (once we
> > figure out what that really means).
>
> Isn't that actually a bug?  For instance such an index will have a 0
> indcollation in pg_index, and according to pg_index documentation:
>
> " this contains the OID of the collation to use for the index, or zero
> if the column is not of a collatable data type."
>
> You can't use a COLLATE expression on such data type, but it still has
> a collation used.

I don't think it's a bug.  The information is available, but you have
to follow the graph to get it.  Considering that the composite type
could be something like CREATE TYPE my_type AS (fr_name text COLLATE
"fr_CA", en_name text COLLATE "en_CA"), there is no single collation
you could put into pg_index.indcollation anyway.  As for pg_depend,
it's currently enough for the index to depend on the type, and the
type to depend on the collation, because the purpose of dependencies
is to control dropping and dumping order, but for our new purpose we
also need to create direct dependencies index -> "fr_CA", index ->
"en_CA" so we can record the current versions.

> > 3.  Test t/002_pg_dump.pl in src/bin/pg_upgrade fails.
>
> Apparently neither "make check" nor "make world" run this test :(
> This was broken due collversion support in pg_dump, I have fixed it
> locally.

make check-world




Re: Collation versioning

2019-11-12 Thread Julien Rouhaud
On Sun, Nov 10, 2019 at 10:08 AM Thomas Munro  wrote:
>
> Some more thoughts:
>
> 1.  If you create an index on an expression that includes a COLLATE or
> a partial index that has one in the WHERE clause, you get bogus
> warnings:
>
> postgres=# create table t (v text);
> CREATE TABLE
> postgres=# create index on t(v) where v > 'hello' collate "en_NZ";
> WARNING:  index "t_v_idx3" depends on collation "en_NZ" version "",
> but the current version is "2.28"
> DETAIL:  The index may be corrupted due to changes in sort order.
> HINT:  REINDEX to avoid the risk of corruption.
> CREATE INDEX
>
> postgres=# create index on t((case when v < 'x' collate "en_NZ" then
> 'foo' else 'bar' end));
> WARNING:  index "t_case_idx" depends on collation "en_NZ" version "",
> but the current version is "2.28"
> DETAIL:  The index may be corrupted due to changes in sort order.
> HINT:  REINDEX to avoid the risk of corruption.
> CREATE INDEX
>
> That's because the 0003 patch only calls recordDependencyOnVersion()
> for simple attribute references.  When
> recordDependencyOnSingleRelExpr() is called by index_create() to
> analyse ii_Expressions and ii_Predicate, it's going to have to be
> smart enough to detect collation references and record the versions.
> There is also some more code that ignores pinned collations hiding in
> there.
>
> This leads to the difficult question of how you recognise a real
> dependency on a collation's version in an expression.  I have some
> vague ideas but haven't seriously looked into it yet.  (The same
> question comes up for check constraint -> collation dependencies.)

Oh good point.  A simple and exhaustive way to deal with that would be
to teach recordMultipleDependencies() to override isObjectPinned() and
retrieve the collation version if the referenced object is a collation
and it's neither C or POSIX collation.   It means that we could also
remove the extra "version" argument and get rid of
recordDependencyOnVersion to simply call recordMultipleDependencies in
create_index for direct column references having a collation.

Would it be ok to add this kind of knowledge in
recordMultipleDependencies() or is it too hacky?

> 2.  If you create a composite type with a text attribute (with or
> without an explicit collation), and then create an index on a column
> of that type, we don't record the dependency.
>
> postgres=# create type my_type as (x text collate "en_NZ");
> CREATE TYPE
> postgres=# create table t (v my_type);
> CREATE TABLE
> postgres=# create index on t(v);
> CREATE INDEX
> postgres=# select * from pg_depend where refobjversion != '';
>  classid | objid | objsubid | refclassid | refobjid | refobjsubid |
> refobjversion | deptype
> -+---+--++--+-+---+-
> (0 rows)
>
> I think create_index() will need to perform recursive analysis on
> composite types to look for text attributes, when they appear as
> simple attributes, and then add direct dependencies index -> collation
> to capture the versions.  Then we might need to do the same for
> composite types hiding inside ii_Expressions and ii_Predicate (once we
> figure out what that really means).

Isn't that actually a bug?  For instance such an index will have a 0
indcollation in pg_index, and according to pg_index documentation:

" this contains the OID of the collation to use for the index, or zero
if the column is not of a collatable data type."

You can't use a COLLATE expression on such data type, but it still has
a collation used.

> 3.  Test t/002_pg_dump.pl in src/bin/pg_upgrade fails.

Apparently neither "make check" nor "make world" run this test :(
This was broken due collversion support in pg_dump, I have fixed it
locally.

> 4.  In the warning message we should show get_collation_name() instead
> of the OID.

+1, I also fixed it locally.




Re: Collation versioning

2019-11-10 Thread Thomas Munro
Some more thoughts:

1.  If you create an index on an expression that includes a COLLATE or
a partial index that has one in the WHERE clause, you get bogus
warnings:

postgres=# create table t (v text);
CREATE TABLE
postgres=# create index on t(v) where v > 'hello' collate "en_NZ";
WARNING:  index "t_v_idx3" depends on collation "en_NZ" version "",
but the current version is "2.28"
DETAIL:  The index may be corrupted due to changes in sort order.
HINT:  REINDEX to avoid the risk of corruption.
CREATE INDEX

postgres=# create index on t((case when v < 'x' collate "en_NZ" then
'foo' else 'bar' end));
WARNING:  index "t_case_idx" depends on collation "en_NZ" version "",
but the current version is "2.28"
DETAIL:  The index may be corrupted due to changes in sort order.
HINT:  REINDEX to avoid the risk of corruption.
CREATE INDEX

That's because the 0003 patch only calls recordDependencyOnVersion()
for simple attribute references.  When
recordDependencyOnSingleRelExpr() is called by index_create() to
analyse ii_Expressions and ii_Predicate, it's going to have to be
smart enough to detect collation references and record the versions.
There is also some more code that ignores pinned collations hiding in
there.

This leads to the difficult question of how you recognise a real
dependency on a collation's version in an expression.  I have some
vague ideas but haven't seriously looked into it yet.  (The same
question comes up for check constraint -> collation dependencies.)

2.  If you create a composite type with a text attribute (with or
without an explicit collation), and then create an index on a column
of that type, we don't record the dependency.

postgres=# create type my_type as (x text collate "en_NZ");
CREATE TYPE
postgres=# create table t (v my_type);
CREATE TABLE
postgres=# create index on t(v);
CREATE INDEX
postgres=# select * from pg_depend where refobjversion != '';
 classid | objid | objsubid | refclassid | refobjid | refobjsubid |
refobjversion | deptype
-+---+--++--+-+---+-
(0 rows)

I think create_index() will need to perform recursive analysis on
composite types to look for text attributes, when they appear as
simple attributes, and then add direct dependencies index -> collation
to capture the versions.  Then we might need to do the same for
composite types hiding inside ii_Expressions and ii_Predicate (once we
figure out what that really means).

3.  Test t/002_pg_dump.pl in src/bin/pg_upgrade fails.

4.  In the warning message we should show get_collation_name() instead
of the OID.




Re: Collation versioning

2019-11-08 Thread Julien Rouhaud
On Fri, Nov 8, 2019 at 10:20 AM Christoph Berg  wrote:
>
> Re: Laurenz Albe 2019-11-08 
> <3c3b9ff84d21acf3188558928249d04db84ea2e9.ca...@cybertec.at>
> > #3 is the best proposal, but there is still the need to run
> > ALTER INDEX on all affected indexes to keep PostgreSQL from nagging.
> > Perhaps the situation could be improved with a pg_upgrade option
> > --i-know-my-indexes-are-fine that causes a result like #2.
> > Together with a bold note in the release notes, this may relieve
> > the pain.
>
> Ack.

+1

> We should also try to make the actual commands more accessible.
> Instead of having the user specify a version number we could as well
> determine from the current state of the system as in
>   ALTER INDEX ... DEPENDS ON 'version-number-I-never-heard-of-before'

This one is needed for pg_upgrade on later version, but I agree that
it shouldn't be exposed to users.

> could it just be
>   ALTER INDEX ... COLLATION IS CURRENT

this sounds like a better idea, though this should probably work at
the collation lever rather than index level.  I think that we should
offer users this but with multiple filter, like:

- mark all indexes' collation version dependencies as current version
- mark all indexes' dependencies on a specific collation and collation
version as current version
- mark all indexes' dependencies on a specific collation (any version)
as current version

> or, given the general action to take is reindexing, how about a no-op reindex?
>   REINDEX INDEX ... METADATA ONLY
>
> That might look less scary to the average end user.

This should be scary, as any misuse can lead to hidden corruption.  If
a user isn't sure of what to do, a plain REINDEX is the safe (and
painful) way to go.

> Do we even think people upgrade PG and the OS at the same time?
> pg_upgrade might frequently actually be invoked on an otherwise
> unchanged system, so we could even make "collations are fine" the
> default for pg_upgrade. And maybe have a switch like pg_upgrade --os-upgrade
> that reverses this.

+1




Re: Collation versioning

2019-11-08 Thread Julien Rouhaud
On Fri, Nov 8, 2019 at 2:24 AM Thomas Munro  wrote:
>
> On Thu, Nov 7, 2019 at 10:20 PM Julien Rouhaud  wrote:
> > Attached 4th patch handles default collation.  I went with an
> > ignore_systempin flag in recordMultipleDependencies.
>
> Thanks for working on this!  I haven't looked closely or tested yet,
> but this seems reasonable.  Obviously it assumes that the default
> provider is really "libc" in disguise for now, but Peter's other patch
> will extend that to cover ICU later.

Yes, that should require minimal changes here.

>
> > > > * Andres mentioned off-list that pg_depend rows might get blown away
> > > > and recreated in some DDL circumstances.  We need to look into that.
> >
> > I tried various flavour of DDL but I couldn't wipe out the pg_depend
> > rows without having an index rebuild triggered (like changing the
> > underlying column datatype).  Do you have any scenario where the index
> > rebuild wouldn't be triggered?
>
> Ah, OK, if we only do that when the old index contents will also be
> destroyed, that's great news.
>
> > > > * Another is that pg_upgrade won't preserve pg_depend rows, so you'd
> > > > need some catalog manipulation (direct or via new DDL) to fix that.
> >
> > Attached 5th patch add a new "ALTER INDEX idx_name DEPENDS ON
> > COLLATION coll_oid VERSION coll_version_text" that can only be
> > executed in binary upgrade mode, and teach pg_dump to generate such
> > commands (including for indexes created for constraints).
>
> It's nice that you were able to make up a reasonable grammar out of
> existing keywords.  I wonder if we should make this user accessible...
> it could be useful for expert users.  If so, maybe it should use
> collation names, not OIDs?

I thought about it, but I'm wondering if it's a good idea to expose
this to users.  The command is not really POLA compliant, as what it
actually mean is "update the recorded version for all existing
pg_depend rows, if any, for this index and collation", while one could
assume that it'll add a new pg_depend row if none exist.  Ideally,
users would only need to use command that says "trust me this index
(or collation version) is actually compatible with this collation's
current version" or similar, but not some user provided string.

> > I didn't do anything for the spurious warning when running a reindex,
> > and kept original approach of pg_depend catalog.
>
> I see three options:
>
> 1.  Change all or some of index_open(), relation_open(),
> RelationIdGetRelation(), RelationBuildDesc() and
> RelationInitIndexAccessInfo() to take some kind of flag so we can say
> NO_COLLATION_VERSION_CHECK_PLEASE, and then have ReindexIndex() pass
> that flag down when opening it for the purpose of rebuilding it.
> 2.  Use a global state to suppress these warnings while opening that
> index.  Perhaps ReindexIndex() would call RelCacheWarnings(false)
> before index_open(), and use PG_FINALLY to make sure it restores
> warnings with RelCacheWarnings(true).  (This is a less-code-churn
> bad-programming-style version of #1.)
> 3.  Move the place that we run the collation version check.  Instead
> of doing it in RelationInitIndexAccessInfo() so that it runs when the
> relation cache first loads the index, put it into a new routine
> RelationCheckValidity() and call that from ... I don't know, some
> other place that runs whenever we access indexes but not when we
> rebuild them.
>
> 3's probably a better idea, if you can find a reasonable place to call
> it from.  I'm thinking some time around the time the executor acquires
> locks, but using a flag in the relcache entry to make sure it doesn't
> run the check again if there were no warnings last time (so one
> successful version check turns the extra work off for the rest of this
> relcache entry's lifetime).  I think it'd be a feature, not a bug, if
> it gave you the warning every single time you executed a query using
> an index that has a mismatch...  but a practical alternative would be
> to check only once per index and that probably makes more sense.

OTOH, 2 is more generic, and could maybe be a better way with Peter's
idea of new catalog that would also fit other use cases?




Re: Collation versioning

2019-11-08 Thread Christoph Berg
Re: Laurenz Albe 2019-11-08 
<3c3b9ff84d21acf3188558928249d04db84ea2e9.ca...@cybertec.at>
> #3 is the best proposal, but there is still the need to run
> ALTER INDEX on all affected indexes to keep PostgreSQL from nagging.
> Perhaps the situation could be improved with a pg_upgrade option
> --i-know-my-indexes-are-fine that causes a result like #2.
> Together with a bold note in the release notes, this may relieve
> the pain.

Ack.

We should also try to make the actual commands more accessible.
Instead of having the user specify a version number we could as well
determine from the current state of the system as in
  ALTER INDEX ... DEPENDS ON 'version-number-I-never-heard-of-before'
could it just be
  ALTER INDEX ... COLLATION IS CURRENT
or, given the general action to take is reindexing, how about a no-op reindex?
  REINDEX INDEX ... METADATA ONLY

That might look less scary to the average end user.

Do we even think people upgrade PG and the OS at the same time?
pg_upgrade might frequently actually be invoked on an otherwise
unchanged system, so we could even make "collations are fine" the
default for pg_upgrade. And maybe have a switch like pg_upgrade --os-upgrade
that reverses this.

Christoph




Re: Collation versioning

2019-11-07 Thread Laurenz Albe
On Fri, 2019-11-08 at 15:04 +1300, Thomas Munro wrote:
> So we have three proposals:
> 
> 1.  Assume that pre-13 indexes that depend on collations are
> potentially corrupted and complain until they are reindexed.  This
> could be done by having pg_upgrade run ALTER INDEX ... DEPENDS ON
> COLLATION "fr_FR" VERSION '' (empty string, or some other default
> value that we don't think is going to coincide with a real version).
> 2.  Assume that pre-13 indexes are not corrupted.  In the target 13
> database, the index will be created in the catalogs with the
> provider's current version.
> 3.  We don't know if pre-13 indexes are corrupted or not, and we'll
> record that with a special value just as in proposal #1, except that
> we could show a different hint for that special version value.  It
> would tell you can you can either REINDEX, or run ALTER INDEX ...
> DEPENDS ON COLLATION "fr_FR" VERSION '34.0' if you believe the index
> to have been created with the current collation version on an older
> release of PostgreSQL that didn't track versions.

I think #1 is bad because it would frighten all users, even those who
didn't upgrade their libc at all, only PostgreSQL.  I don't think that
shouting "wolf" for no real reason will improve trust in PostgreSQL.

#2 is bad because it may hide pre-existing index corruption.

#3 is the best proposal, but there is still the need to run
ALTER INDEX on all affected indexes to keep PostgreSQL from nagging.
Perhaps the situation could be improved with a pg_upgrade option
--i-know-my-indexes-are-fine that causes a result like #2.
Together with a bold note in the release notes, this may relieve
the pain.

Yours,
Laurenz Albe





Re: Collation versioning

2019-11-07 Thread Thomas Munro
On Fri, Nov 8, 2019 at 2:37 PM Michael Paquier  wrote:
> On Fri, Nov 08, 2019 at 02:23:54PM +1300, Thomas Munro wrote:
> > Right, so this is basically a policy decision: do we assume that all
> > pre-13 indexes that depend on collations are potentially corrupted, or
> > assume that they are not?  The "correct" thing to do would be to
> > assume they are potentially corrupted and complain until the user
> > reindexes, but I think the pragmatic thing to do would be to assume
> > that they're not and just let them adopt the current versions, even
> > though it's a lie.  I lean towards the pragmatic choice; we're trying
> > to catch future problems, not give the entire user base a load of
> > extra work to do on their next pg_upgrade for mostly theoretical
> > reasons.  (That said, given the new glibc versioning, we'll
> > effectively be giving most of our user base a load of extra work to do
> > on their next OS upgrade and that'll be a characteristic of PostgreSQL
> > going forward, once the versioning-for-default-provider patch goes
> > in.)  Any other opinions?
>
> Matching an incorrect collation version on an index which physically
> uses something else does not strike me as a good idea to me because
> you may hide corruptions, and you would actually lose the reason why
> the corruption happened (did the version bump up from an incorrect
> one?  Or what?).  Could it be possible to mark any existing indexes
> with an unknown version or something like that?  This way, we could
> just let the user decide what needs to be reindexed or not, and we
> need to offer an option to update the collation version from unknown
> to the latest one available.

Fair point.

So we have three proposals:

1.  Assume that pre-13 indexes that depend on collations are
potentially corrupted and complain until they are reindexed.  This
could be done by having pg_upgrade run ALTER INDEX ... DEPENDS ON
COLLATION "fr_FR" VERSION '' (empty string, or some other default
value that we don't think is going to coincide with a real version).
2.  Assume that pre-13 indexes are not corrupted.  In the target 13
database, the index will be created in the catalogs with the
provider's current version.
3.  We don't know if pre-13 indexes are corrupted or not, and we'll
record that with a special value just as in proposal #1, except that
we could show a different hint for that special version value.  It
would tell you can you can either REINDEX, or run ALTER INDEX ...
DEPENDS ON COLLATION "fr_FR" VERSION '34.0' if you believe the index
to have been created with the current collation version on an older
release of PostgreSQL that didn't track versions.




Re: Collation versioning

2019-11-07 Thread Thomas Munro
On Thu, Nov 7, 2019 at 1:27 AM Peter Eisentraut
 wrote:
> As I was working on that lately, I came to the conclusion that we should
> get *this* patch done first.

Cool.  Let's aim to get this into 13!

> > * Some have expressed doubt that pg_depend is the right place for
> > this; let's see if any counter-proposals appear.
>
> The only alternative is to create a new catalog that contains exactly
> the same columns as pg_depend (minus deptype) plus the version.  That
> would work but it would just create a lot of code duplication, I think.

Agreed.

> One thing I've been thinking about is whether this object-version
> concept could extend to other object types.  For example, if someone
> changes the binary layout of a type, they could change the version of
> the type, and this catalog could track the type version in the column ->
> type dependency.  Obviously, a lot more work would have to be done to
> make this work, but I think the concept of this catalog is sound.

Interesting idea.  Sounds like it requires version checks that
actually stop you from using the dependent object, instead of emitting
a few meek warnings.




Re: Collation versioning

2019-11-07 Thread Michael Paquier
On Fri, Nov 08, 2019 at 02:23:54PM +1300, Thomas Munro wrote:
> Right, so this is basically a policy decision: do we assume that all
> pre-13 indexes that depend on collations are potentially corrupted, or
> assume that they are not?  The "correct" thing to do would be to
> assume they are potentially corrupted and complain until the user
> reindexes, but I think the pragmatic thing to do would be to assume
> that they're not and just let them adopt the current versions, even
> though it's a lie.  I lean towards the pragmatic choice; we're trying
> to catch future problems, not give the entire user base a load of
> extra work to do on their next pg_upgrade for mostly theoretical
> reasons.  (That said, given the new glibc versioning, we'll
> effectively be giving most of our user base a load of extra work to do
> on their next OS upgrade and that'll be a characteristic of PostgreSQL
> going forward, once the versioning-for-default-provider patch goes
> in.)  Any other opinions?

Matching an incorrect collation version on an index which physically
uses something else does not strike me as a good idea to me because
you may hide corruptions, and you would actually lose the reason why
the corruption happened (did the version bump up from an incorrect
one?  Or what?).  Could it be possible to mark any existing indexes
with an unknown version or something like that?  This way, we could
just let the user decide what needs to be reindexed or not, and we
need to offer an option to update the collation version from unknown
to the latest one available.
--
Michael


signature.asc
Description: PGP signature


Re: Collation versioning

2019-11-07 Thread Thomas Munro
On Thu, Nov 7, 2019 at 10:20 PM Julien Rouhaud  wrote:
> Attached 4th patch handles default collation.  I went with an
> ignore_systempin flag in recordMultipleDependencies.

Thanks for working on this!  I haven't looked closely or tested yet,
but this seems reasonable.  Obviously it assumes that the default
provider is really "libc" in disguise for now, but Peter's other patch
will extend that to cover ICU later.

> > > * Andres mentioned off-list that pg_depend rows might get blown away
> > > and recreated in some DDL circumstances.  We need to look into that.
>
> I tried various flavour of DDL but I couldn't wipe out the pg_depend
> rows without having an index rebuild triggered (like changing the
> underlying column datatype).  Do you have any scenario where the index
> rebuild wouldn't be triggered?

Ah, OK, if we only do that when the old index contents will also be
destroyed, that's great news.

> > > * Another is that pg_upgrade won't preserve pg_depend rows, so you'd
> > > need some catalog manipulation (direct or via new DDL) to fix that.
>
> Attached 5th patch add a new "ALTER INDEX idx_name DEPENDS ON
> COLLATION coll_oid VERSION coll_version_text" that can only be
> executed in binary upgrade mode, and teach pg_dump to generate such
> commands (including for indexes created for constraints).

It's nice that you were able to make up a reasonable grammar out of
existing keywords.  I wonder if we should make this user accessible...
it could be useful for expert users.  If so, maybe it should use
collation names, not OIDs?

> One issue
> is that older versions don't have pg_depend information,  so pg_dump
> can't find the dependencies to generate such commands and override the
> version with anything else.  It means that the upgraded cluster will
> show all indexes as depending on the current collation provider
> version.  I'm not sure if that's the best thing to do, or if we should
> change all pg_depend rows to mention "unknown" version or something
> like that.  It would generate so much noise that it's probably not
> worth it.

Right, so this is basically a policy decision: do we assume that all
pre-13 indexes that depend on collations are potentially corrupted, or
assume that they are not?  The "correct" thing to do would be to
assume they are potentially corrupted and complain until the user
reindexes, but I think the pragmatic thing to do would be to assume
that they're not and just let them adopt the current versions, even
though it's a lie.  I lean towards the pragmatic choice; we're trying
to catch future problems, not give the entire user base a load of
extra work to do on their next pg_upgrade for mostly theoretical
reasons.  (That said, given the new glibc versioning, we'll
effectively be giving most of our user base a load of extra work to do
on their next OS upgrade and that'll be a characteristic of PostgreSQL
going forward, once the versioning-for-default-provider patch goes
in.)  Any other opinions?

> I didn't do anything for the spurious warning when running a reindex,
> and kept original approach of pg_depend catalog.

I see three options:

1.  Change all or some of index_open(), relation_open(),
RelationIdGetRelation(), RelationBuildDesc() and
RelationInitIndexAccessInfo() to take some kind of flag so we can say
NO_COLLATION_VERSION_CHECK_PLEASE, and then have ReindexIndex() pass
that flag down when opening it for the purpose of rebuilding it.
2.  Use a global state to suppress these warnings while opening that
index.  Perhaps ReindexIndex() would call RelCacheWarnings(false)
before index_open(), and use PG_FINALLY to make sure it restores
warnings with RelCacheWarnings(true).  (This is a less-code-churn
bad-programming-style version of #1.)
3.  Move the place that we run the collation version check.  Instead
of doing it in RelationInitIndexAccessInfo() so that it runs when the
relation cache first loads the index, put it into a new routine
RelationCheckValidity() and call that from ... I don't know, some
other place that runs whenever we access indexes but not when we
rebuild them.

3's probably a better idea, if you can find a reasonable place to call
it from.  I'm thinking some time around the time the executor acquires
locks, but using a flag in the relcache entry to make sure it doesn't
run the check again if there were no warnings last time (so one
successful version check turns the extra work off for the rest of this
relcache entry's lifetime).  I think it'd be a feature, not a bug, if
it gave you the warning every single time you executed a query using
an index that has a mismatch...  but a practical alternative would be
to check only once per index and that probably makes more sense.




Re: Collation versioning

2019-11-07 Thread Julien Rouhaud
On Mon, Nov 4, 2019 at 11:13 AM Julien Rouhaud  wrote:
>
> On Mon, Nov 4, 2019 at 4:58 AM Thomas Munro  wrote:
> >
> > Unfortunately I haven't had time to work on it seriously, but here's a
> > quick rebase to get the proof-of-concept back into working shape.

Thanks!  I also removed the test for REFRESH VERSION command that was
forgotten in the patch set, and run a pgindent.

> > Here are some problems to think about:
> >
> > * We'd need to track dependencies on the default collation once we
> > have versioning for that (see
> > https://www.postgresql.org/message-id/flat/5e756dd6-0e91-d778-96fd-b1bcb06c161a%402ndquadrant.com).
> > That is how most people actually consume collations out there in real
> > life, and yet we don't normally track dependencies on the default
> > collation and I don't know if that's simply a matter of ripping out
> > all the code that looks like "xxx != DEFAULT_COLLATION_ID" in the
> > dependency analysis code or if there's more to it.
>
> This isn't enough.  What would remain is:
>
> - teach get_collation_version_for_oid() to return the  default
> collation name, which is simple
> - have recordDependencyOnVersion() actually records the dependency,
> which wouldn't happen as the default collation is pinned.
>
> An easy fix would be to teach isObjectPinned() to ignore
> CollationRelationId / DEFAULT_COLLATION_OID, but that's ugly and would
> allow too many dependencies to be stored.  Not pinning the default
> collation during initdb doesn't sound a good alternative either.
> Maybe adding a force flag or a new DependencyType that'd mean "normal
> but forced" would be ok?

Attached 4th patch handles default collation.  I went with an
ignore_systempin flag in recordMultipleDependencies.

>
> > * Andres mentioned off-list that pg_depend rows might get blown away
> > and recreated in some DDL circumstances.  We need to look into that.

I tried various flavour of DDL but I couldn't wipe out the pg_depend
rows without having an index rebuild triggered (like changing the
underlying column datatype).  Do you have any scenario where the index
rebuild wouldn't be triggered?

> > * Another is that pg_upgrade won't preserve pg_depend rows, so you'd
> > need some catalog manipulation (direct or via new DDL) to fix that.

Attached 5th patch add a new "ALTER INDEX idx_name DEPENDS ON
COLLATION coll_oid VERSION coll_version_text" that can only be
executed in binary upgrade mode, and teach pg_dump to generate such
commands (including for indexes created for constraints).  One issue
is that older versions don't have pg_depend information,  so pg_dump
can't find the dependencies to generate such commands and override the
version with anything else.  It means that the upgraded cluster will
show all indexes as depending on the current collation provider
version.  I'm not sure if that's the best thing to do, or if we should
change all pg_depend rows to mention "unknown" version or something
like that.  It would generate so much noise that it's probably not
worth it.

I didn't do anything for the spurious warning when running a reindex,
and kept original approach of pg_depend catalog.


0002-Add-pg_depend.refobjversion.patch
Description: Binary data


0005-Preserve-index-dependencies-on-collation-during-pg_u.patch
Description: Binary data


0001-Remove-pg_collation.collversion.patch
Description: Binary data


0004-Also-track-default-collation-version.patch
Description: Binary data


0003-Track-collation-versions-for-indexes.patch
Description: Binary data


Re: Collation versioning

2019-11-07 Thread Julien Rouhaud
On Mon, Nov 4, 2019 at 9:59 PM Thomas Munro  wrote:
>
> On Tue, Nov 5, 2019 at 4:18 AM Julien Rouhaud  wrote:
> > On Mon, Nov 4, 2019 at 11:13 AM Julien Rouhaud  wrote:
> > > On Mon, Nov 4, 2019 at 4:58 AM Thomas Munro  
> > > wrote:
> > > > Here are some problems to think about:
> > > >
> > > > * We'd need to track dependencies on the default collation once we
> > > > have versioning for that [...]
> >
> > Another problem I just thought about is how to avoid discrepancy of
> > collation version for indexes on shared objects, such as
> > pg_database_datname_index.
>
> I didn't look closely at the code, but I think when "name" recently
> became collation-aware (commit 586b98fd), it switched to using
> C_COLLATION_OID as its typcollation, and "C" doesn't need versioning,
> so I think it would only be a problem if there are shared catalogs
> that have "name" columns that have a non-type-default collation.
> There are none of those, and you can't create them, right?  If there
> were, if we take this patch set to its logical conclusion, we'd also
> need pg_shdepend.refobjversion, but we don't need it AFAICS.

That's entirely correct, I should have checked that.




  1   2   >