Re: Supply restore_command to pg_rewind via CLI argument

2022-03-22 Thread Alexey Kondratov
Hi,

On Tue, Mar 22, 2022 at 3:32 AM Andres Freund  wrote:
>
> Doesn't apply once more: http://cfbot.cputube.org/patch_37_3213.log
>

Thanks for the reminder, a rebased version is attached.


Regards
-- 
Alexey Kondratov
From df56b5c7b882e781fdc0b92e7a83331f0baab094 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Tue, 29 Jun 2021 17:17:47 +0300
Subject: [PATCH v4] Allow providing restore_command as a command line option
 to pg_rewind

This could be useful when postgres is usually run with
-c config_file=..., so the actual configuration and restore_command
are not stored inside $PGDATA/postgresql.conf.
---
 doc/src/sgml/ref/pg_rewind.sgml  | 19 +
 src/bin/pg_rewind/pg_rewind.c| 45 ++-
 src/bin/pg_rewind/t/001_basic.pl |  1 +
 src/bin/pg_rewind/t/RewindTest.pm| 95 ++--
 src/test/perl/PostgreSQL/Test/Cluster.pm |  5 +-
 5 files changed, 106 insertions(+), 59 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 33e6bb64ad..af75f35867 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -241,6 +241,25 @@ PostgreSQL documentation
   
  
 
+ 
+  -C restore_command
+  --target-restore-command=restore_command
+  
+   
+Specifies the restore_command to use for retrieving
+WAL files from the WAL archive if these files are no longer available
+in the pg_wal directory of the target cluster.
+   
+   
+If restore_command is already set in
+postgresql.conf, you can provide the
+--restore-target-wal option instead. If both options
+are provided, then --target-restore-command
+will be used.
+   
+  
+ 
+
  
   --debug
   
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index b39b5c1aac..9aca041425 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -85,21 +85,22 @@ usage(const char *progname)
 	printf(_("%s resynchronizes a PostgreSQL cluster with another copy of the cluster.\n\n"), progname);
 	printf(_("Usage:\n  %s [OPTION]...\n\n"), progname);
 	printf(_("Options:\n"));
-	printf(_("  -c, --restore-target-wal   use restore_command in target configuration to\n"
-			 " retrieve WAL files from archives\n"));
-	printf(_("  -D, --target-pgdata=DIRECTORY  existing data directory to modify\n"));
-	printf(_("  --source-pgdata=DIRECTORY  source data directory to synchronize with\n"));
-	printf(_("  --source-server=CONNSTRsource server to synchronize with\n"));
-	printf(_("  -n, --dry-run  stop before modifying anything\n"));
-	printf(_("  -N, --no-sync  do not wait for changes to be written\n"
-			 " safely to disk\n"));
-	printf(_("  -P, --progress write progress messages\n"));
-	printf(_("  -R, --write-recovery-conf  write configuration for replication\n"
-			 " (requires --source-server)\n"));
-	printf(_("  --debugwrite a lot of debug messages\n"));
-	printf(_("  --no-ensure-shutdown   do not automatically fix unclean shutdown\n"));
-	printf(_("  -V, --version  output version information, then exit\n"));
-	printf(_("  -?, --help show this help, then exit\n"));
+	printf(_("  -c, --restore-target-wal  use restore_command in target configuration to\n"
+			 "retrieve WAL files from archives\n"));
+	printf(_("  -C, --target-restore-command=COMMAND  target WAL restore_command\n"));
+	printf(_("  -D, --target-pgdata=DIRECTORY existing data directory to modify\n"));
+	printf(_("  --source-pgdata=DIRECTORY source data directory to synchronize with\n"));
+	printf(_("  --source-server=CONNSTR   source server to synchronize with\n"));
+	printf(_("  -n, --dry-run stop before modifying anything\n"));
+	printf(_("  -N, --no-sync do not wait for changes to be written\n"
+			 "safely to disk\n"));
+	printf(_("  -P, --progresswrite progress messages\n"));
+	printf(_("  -R, --write-recovery-conf write configuration for replication\n"
+			 "(requires --source-server)\n"));
+	printf(_("  --debug   write a lot of debug messages\n"));
+	printf(_("  --no-ensure-shutdown  do not automatically fix 

Re: Supply restore_command to pg_rewind via CLI argument

2021-08-27 Thread Alexey Kondratov
On Fri, Aug 27, 2021 at 10:05 AM Andrey Borodin  wrote:
> There is a small bug
> +   /*
> +* Take restore_command from the postgresql.conf only if it is not 
> already
> +* provided as a command line option.
> +*/
> +   if (!restore_wal && restore_command == NULL)
> return;
>
> I think we should use condition (!restore_wal || restore_command != NULL).
>

Yes, you are right, thanks. Attached is a fixed version. The tests were
still passing because PostgresNode->enable_restoring adds restore_command
to postgresql.conf anyway.
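To make the corrected guard easier to eyeball, here is a tiny stand-alone C
sketch of the intended logic (the helper name and flag names are illustrative
only, not the actual pg_rewind code):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Illustrative stand-in for the guard discussed above: read restore_command
 * from postgresql.conf only when WAL restore was requested (-c) and no
 * restore_command was already given on the command line (-C).
 */
static bool
need_restore_command_from_conf(bool restore_wal, const char *cli_restore_command)
{
	if (!restore_wal || cli_restore_command != NULL)
		return false;			/* the corrected condition: skip the conf */
	return true;
}

int
main(void)
{
	printf("%d\n", need_restore_command_from_conf(true, NULL));		/* 1: -c only, read conf */
	printf("%d\n", need_restore_command_from_conf(true, "cp %f %p"));	/* 0: -C wins over conf */
	printf("%d\n", need_restore_command_from_conf(false, NULL));		/* 0: no WAL restore at all */
	return 0;
}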

>
> Besides this patch looks good and is ready for committer IMV.
>


-- 
Alexey Kondratov


v2-0001-Allow-providing-restore_command-as-a-command-line.patch
Description: Binary data


Re: Supply restore_command to pg_rewind via CLI argument

2021-06-29 Thread Alexey Kondratov
On Fri, Jun 18, 2021 at 10:06 PM Alexey Kondratov
 wrote:
> On Fri, Jun 18, 2021 at 5:42 PM Andrey Borodin  wrote:
> >
> > If we run 'pg_rewind --restore-target-wal' there must be restore_command in 
> > config of target installation. But if the config is not within 
> > $PGDATA\postgresql.conf pg_rewind cannot use it.
> > If we run postmaster with `-c 
> > config_file=/etc/postgresql/10/data/postgresql.conf`, we simply cannot use 
> > the feature. We solved the problem by putting config into PGDATA only 
> > during pg_rewind, but this does not seem like a very robust solution.
> >
>
> Yeah, Michael was against it, while we had no good arguments, so
> Alexander removed it, IIRC. This example sounds reasonable to me. I
> also recall some complaints from PostgresPro support folks, that it is
> sad to not have a cli option to pass restore_command. However, I just
> thought about another recent feature --- ensure clean shutdown, which
> is turned on by default. So you usually run Postgres with one config,
> but pg_rewind may start it with another, although in single-user mode.
> Is it fine for you?
>
> >
> > Maybe we could add "-C, --target-restore-command=COMMAND  target WAL 
> > restore_command\n" as was proposed within earlier versions of the patch[0]? 
> > Or instruct pg_rewind to pass config to 'postgres -C restore_command' run?
>
> Hm, adding --target-restore-command is the simplest way, sure, but
> forwarding something like '-c config_file=...' to postgres sounds
> interesting too. Could it have any use case beside providing a
> restore_command? I cannot imagine anything right now, but if any
> exist, then it could be a more universal approach.
>
> >
> > From my POV adding --target-restore-command is simplest way, I can extract 
> > corresponding portions from original patch.
> >
>
> I will have a look, maybe I even already have this patch separately. I
> remember that we were considering adding this option to PostgresPro,
> when we did a backport of this feature.
>

Here it is. I have slightly adapted the previous patch to the recent
pg_rewind changes. In this version -C does not conflict with -c; it
simply overrides it.


-- 
Alexey Kondratov


v1-0001-Allow-providing-restore_command-as-a-command-line.patch
Description: Binary data


Re: Supply restore_command to pg_rewind via CLI argument

2021-06-18 Thread Alexey Kondratov
Hi,

On Fri, Jun 18, 2021 at 5:42 PM Andrey Borodin  wrote:
>
> If we run 'pg_rewind --restore-target-wal' there must be restore_command in 
> config of target installation. But if the config is not within 
> $PGDATA\postgresql.conf pg_rewind cannot use it.
> If we run postmaster with `-c 
> config_file=/etc/postgresql/10/data/postgresql.conf`, we simply cannot use 
> the feature. We solved the problem by putting config into PGDATA only during 
> pg_rewind, but this does not seem like a very robust solution.
>

Yeah, Michael was against it and we had no good arguments, so
Alexander removed it, IIRC. This example sounds reasonable to me. I
also recall some complaints from PostgresPro support folks that it is
unfortunate not to have a CLI option to pass restore_command. However, I just
thought about another recent feature --- ensure clean shutdown, which
is turned on by default. So you usually run Postgres with one config,
but pg_rewind may start it with another, although in single-user mode.
Is that fine for you?

>
> Maybe we could add "-C, --target-restore-command=COMMAND  target WAL 
> restore_command\n" as was proposed within earlier versions of the patch[0]? 
> Or instruct pg_rewind to pass config to 'postgres -C restore_command' run?

Hm, adding --target-restore-command is the simplest way, sure, but
forwarding something like '-c config_file=...' to postgres sounds
interesting too. Could it have any use case besides providing a
restore_command? I cannot imagine anything right now, but if any
exist, then it could be a more universal approach.

>
> From my POV adding --target-restore-command is simplest way, I can extract 
> corresponding portions from original patch.
>

I will have a look, maybe I even already have this patch separately. I
remember that we were considering adding this option to PostgresPro,
when we did a backport of this feature.


--
Alexey Kondratov




Re: Free port choosing freezes when PostgresNode::use_tcp is used on BSD systems

2021-04-20 Thread Alexey Kondratov

On 2021-04-20 18:03, Tom Lane wrote:

Andrew Dunstan  writes:

On 4/19/21 7:22 PM, Tom Lane wrote:

I wonder whether we could get away with just replacing the $use_tcp
test with $TestLib::windows_os.  It's not really apparent to me
why we should care about 127.0.0.not-1 on Unix-oid systems.



Yeah
The comment is a bit strange anyway - Cygwin is actually going to use
Unix sockets, not TCP.
I think I would just change the test to this: $use_tcp &&
$TestLib::windows_os.


Works for me, but we need to revise the comment to match.



Then it could be something like the attached, I guess.


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index db47a97d196..f7b488ed464 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1191,19 +1191,19 @@ sub get_free_port
 		# Check to see if anything else is listening on this TCP port.
 		# Seek a port available for all possible listen_addresses values,
 		# so callers can harness this port for the widest range of purposes.
-		# The 0.0.0.0 test achieves that for post-2006 Cygwin, which
-		# automatically sets SO_EXCLUSIVEADDRUSE.  The same holds for MSYS (a
-		# Cygwin fork).  Testing 0.0.0.0 is insufficient for Windows native
-		# Perl (https://stackoverflow.com/a/14388707), so we also test
-		# individual addresses.
+		# The 0.0.0.0 test achieves that for MSYS, which automatically sets
+		# SO_EXCLUSIVEADDRUSE.  Testing 0.0.0.0 is insufficient for Windows
+		# native Perl (https://stackoverflow.com/a/14388707), so we also
+		# have to test individual addresses.  Doing that for 127.0.0/24
+		# addresses other than 127.0.0.1 might fail with EADDRNOTAVAIL on
+		# non-Linux, non-Windows kernels.
 		#
-		# On non-Linux, non-Windows kernels, binding to 127.0.0/24 addresses
-		# other than 127.0.0.1 might fail with EADDRNOTAVAIL.  Binding to
-		# 0.0.0.0 is unnecessary on non-Windows systems.
+		# That way, 0.0.0.0 and individual 127.0.0/24 addresses are tested
+		# only on Windows when TCP usage is requested.
 		if ($found == 1)
 		{
 			foreach my $addr (qw(127.0.0.1),
-$use_tcp ? qw(127.0.0.2 127.0.0.3 0.0.0.0) : ())
+$use_tcp && $TestLib::windows_os ? qw(127.0.0.2 127.0.0.3 0.0.0.0) : ())
 			{
 if (!can_bind($addr, $port))
 {


Free port choosing freezes when PostgresNode::use_tcp is used on BSD systems

2021-04-19 Thread Alexey Kondratov

Hi Hackers,

Inside PostgresNode.pm there is a free port choosing routine --- 
get_free_port(). The comment section there says:


# On non-Linux, non-Windows kernels, binding to 127.0.0/24 addresses
# other than 127.0.0.1 might fail with EADDRNOTAVAIL.

And this is absolutely true: on BSD-like systems (macOS and FreeBSD
tested) it hangs, looping through the entire port range over and over
when $PostgresNode::use_tcp = 1 is set, since bind fails with:


# Checking port 52208
# bind: 127.0.0.1 52208
# bind: 127.0.0.2 52208
bind: Can't assign requested address

To reproduce just apply reproduce.diff and try to run 'make -C 
src/bin/pg_ctl check'.


This is not an issue with the standard Postgres tests, since TestLib.pm
automatically chooses Unix sockets everywhere outside Windows. However,
we ran into this problem when we tried to run a custom TAP test that
required TCP to run stably.
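For illustration, a minimal stand-alone C reproducer of the underlying bind()
behaviour, independent of the Perl test framework (the port number is
arbitrary; on macOS/FreeBSD the 127.0.0.2 bind is the one that typically
fails with EADDRNOTAVAIL unless such a loopback alias is configured):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static void
try_bind(const char *addr, unsigned short port)
{
	struct sockaddr_in sa;
	int			fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
	{
		perror("socket");
		return;
	}
	memset(&sa, 0, sizeof(sa));
	sa.sin_family = AF_INET;
	sa.sin_port = htons(port);
	sa.sin_addr.s_addr = inet_addr(addr);

	if (bind(fd, (struct sockaddr *) &sa, sizeof(sa)) != 0)
		perror(addr);			/* e.g. "127.0.0.2: Can't assign requested address" */
	else
		printf("%s: bind ok\n", addr);
	close(fd);
}

int
main(void)
{
	try_bind("127.0.0.1", 52208);	/* normally succeeds */
	try_bind("127.0.0.2", 52208);	/* EADDRNOTAVAIL on BSD-like kernels */
	return 0;
}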


That way, if it really can happen, why not just skip binding to
127.0.0/24 addresses other than 127.0.0.1 outside of Linux/Windows, as
per the attached patch_PostgresNode.diff?



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm
index db47a97d196..9add9bde2a4 100644
--- a/src/test/perl/PostgresNode.pm
+++ b/src/test/perl/PostgresNode.pm
@@ -1203,7 +1203,7 @@ sub get_free_port
 		if ($found == 1)
 		{
 			foreach my $addr (qw(127.0.0.1),
-$use_tcp ? qw(127.0.0.2 127.0.0.3 0.0.0.0) : ())
+$use_tcp && ($^O eq "linux" || $TestLib::windows_os) ? qw(127.0.0.2 127.0.0.3 0.0.0.0) : ())
 			{
 if (!can_bind($addr, $port))
 {
diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl
index b1e419f02e9..c25c0793537 100644
--- a/src/bin/pg_ctl/t/001_start_stop.pl
+++ b/src/bin/pg_ctl/t/001_start_stop.pl
@@ -11,6 +11,8 @@ use Test::More tests => 24;
 my $tempdir   = TestLib::tempdir;
 my $tempdir_short = TestLib::tempdir_short;
 
+$PostgresNode::use_tcp = 1;
+
 program_help_ok('pg_ctl');
 program_version_ok('pg_ctl');
 program_options_handling_ok('pg_ctl');


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-02-03 Thread Alexey Kondratov

On 2021-02-03 09:37, Michael Paquier wrote:

On Tue, Feb 02, 2021 at 10:32:19AM +0900, Michael Paquier wrote:

On Mon, Feb 01, 2021 at 06:28:57PM +0300, Alexey Kondratov wrote:
> Hm, IIUC, REINDEX CONCURRENTLY doesn't use either of them. It directly uses
> index_create() with a proper tablespaceOid instead of
> SetRelationTableSpace(). And its checks structure is more restrictive even
> without tablespace change, so it doesn't use CheckRelationTableSpaceMove().

Sure.  I have not checked the patch in details, but even with that it
would be much safer to me if we apply the same sanity checks
everywhere.  That's less potential holes to worry about.


Thanks Alexey for the new patch.  I have been looking at the main
patch in details.

/*
-* Don't allow reindex on temp tables of other backends ... their local
-* buffer manager is not going to cope.
+* We don't support moving system relations into different tablespaces
+* unless allow_system_table_mods=1.
 */
If you remove the check on RELATION_IS_OTHER_TEMP() in
reindex_index(), you would allow the reindex of a temp relation owned
by a different session if its tablespace is not changed, so this
cannot be removed.

+!allowSystemTableMods && IsSystemRelation(iRel))
 ereport(ERROR,
-(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
- errmsg("cannot reindex temporary tables of other sessions")));
+(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
+ errmsg("permission denied: \"%s\" is a system catalog",
+RelationGetRelationName(iRel))));
Indeed, a system relation with a relfilenode should be allowed to move
under allow_system_table_mods.  I think that we had better move this
check into CheckRelationTableSpaceMove() instead of reindex_index() to
centralize the logic.  ALTER TABLE does this business in
RangeVarCallbackForAlterRelation(), but our code path opening the
relation is different for the non-concurrent case.

+   if (OidIsValid(params->tablespaceOid) &&
+   IsSystemClass(relid, classtuple))
+   {
+   if (!allowSystemTableMods)
+   {
+   /* Skip all system relations, if not allowSystemTableMods *

I don't see the need for having two warnings here to say the same
thing if a relation is mapped or not mapped, so let's keep it simple.



Yeah, I just wanted to separate mapped and system relations, but 
probably it is too complicated.




I have found that the test suite was rather messy in its
organization.  Table creations were done first with a set of tests not
really ordered, so that was really hard to follow.  This has also led
to a set of tests that were duplicated, while other tests have been
missed, mainly some cross checks for the concurrent and non-concurrent
behaviors.  I have reordered the whole so as tests on catalogs, normal
tables and partitions are done separately with relations created and
dropped for each set.  Partitions use a global check for tablespaces
and relfilenodes after one concurrent reindex (didn't see the point in
doubling with the non-concurrent case as the same code path to select
the relations from the partition tree is taken).  An ACL test has been
added at the end.

The case of partitioned indexes was kind of interesting and I thought
about that a couple of days, and I took the decision to ignore
relations that have no storage as you did, documenting that ALTER
TABLE can be used to update the references of the partitioned
relations.  The command is still useful with this behavior, and the
tests I have added track that.

Finally, I have reworked the docs, separating the limitations related
to system catalogs and partitioned relations, to be more consistent
with the notes at the end of the page.



Thanks for working on this.

+   if (tablespacename != NULL)
+   {
+   params.tablespaceOid = get_tablespace_oid(tablespacename, false);
+
+   /* Check permissions except when moving to database's default */
+   if (OidIsValid(params.tablespaceOid) &&

This check for OidIsValid() seems to be excessive, since you moved the 
whole ACL check under 'if (tablespacename != NULL)' here.


+   params.tablespaceOid != MyDatabaseTableSpace)
+   {
+   AclResult   aclresult;


+CREATE INDEX regress_tblspace_test_tbl_idx ON regress_tblspace_test_tbl (num1);

+-- move to global tablespace move fails

Maybe 'move to global tablespace, fail', just to match the style of the
previous comments.


+REINDEX (TABLESPACE pg_global) INDEX regress_tblspace_test_tbl_idx;


+SELECT relid, parentrelid, level FROM pg_partition_tree('tbspace_reindex_part_index')
+  ORDER BY relid, level;
+SELECT relid, parentrelid, level FROM pg_partition_tree('tbspace_reindex_part_index')
+  ORDER BY relid, level;

Why do you do the same twice?

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-02-01 Thread Alexey Kondratov

On 2021-01-30 05:23, Michael Paquier wrote:

On Fri, Jan 29, 2021 at 08:56:47PM +0300, Alexey Kondratov wrote:

On 2021-01-28 14:42, Alexey Kondratov wrote:
No, you are right, we are doing REINDEX with AccessExclusiveLock as it
was before. This part is a more specific one. It only applies to
partitioned indexes, which do not hold any data, so we do not reindex
them directly, only their leafs. However, if we are doing a TABLESPACE
change, we have to record it in their pg_class entry, so all future
leaf partitions were created in the proper tablespace.

That way, we open partitioned index relation only for a reference,
i.e. read-only, but modify pg_class entry under a proper lock
(RowExclusiveLock). That's why I thought that ShareLock will be
enough.

IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even
for relations with no storage, since AlterTableGetLockLevel() chooses
it if AT_SetTableSpace is met. This is very similar to our case, so
probably we should do the same?

Actually it is not completely clear for me why
ShareUpdateExclusiveLock is sufficient for newly added
SetRelationTableSpace() as Michael wrote in the comment.


Nay, it was not fine.  That's something Alvaro has mentioned, leading
to 2484329.  This also means that the main patch of this thread should
refresh the comments at the top of CheckRelationTableSpaceMove() and
SetRelationTableSpace() to mention that this is used by REINDEX
CONCURRENTLY with a lower lock.



Hm, IIUC, REINDEX CONCURRENTLY doesn't use either of them. It directly 
uses index_create() with a proper tablespaceOid instead of 
SetRelationTableSpace(). And its checks structure is more restrictive 
even without tablespace change, so it doesn't use 
CheckRelationTableSpaceMove().


Changed patch to use AccessExclusiveLock in this part for now. This is 
what
'ALTER TABLE/INDEX ... SET TABLESPACE' and 'REINDEX' usually do. 
Anyway, all
real leaf partitions are processed in the independent transactions 
later.


+   if (partkind == RELKIND_PARTITIONED_INDEX)
+   {
+   Relation iRel = index_open(partoid, AccessExclusiveLock);
+
+   if (CheckRelationTableSpaceMove(iRel, params->tablespaceOid))
+   SetRelationTableSpace(iRel,
+ params->tablespaceOid,
+ InvalidOid);
+   index_close(iRel, NoLock);
Are you sure that this does not represent a risk of deadlocks as EAL
is not taken consistently across all the partitions?  A second issue
here is that this breaks the assumption of REINDEX CONCURRENTLY kicked
on partitioned relations that should use ShareUpdateExclusiveLock for
all its steps.  This would make the first transaction invasive for the
user, but we don't want that.

This makes me really wonder if we would not be better to restrict this
operation for partitioned relation as part of REINDEX as a first step.
Another thing, mentioned upthread, is that we could do this part of
the switch at the last transaction, or we could silently *not* do the
switch for partitioned indexes in the flow of REINDEX, letting users
handle that with an extra ALTER TABLE SET TABLESPACE once REINDEX has
finished on all the partitions, cascading the command only on the
partitioned relation of a tree.  It may be interesting to look as well
at if we could lower the lock used for partitioned relations with
ALTER TABLE SET TABLESPACE from AEL to SUEL, choosing AEL only if at
least one partition with storage is involved in the command,
CheckRelationTableSpaceMove() discarding anything that has no need to
change.



I am not sure right now, so I split the previous patch into two parts:

0001: Adds TABLESPACE to REINDEX with tests, docs and all the stuff we
did before, with the only exception that it doesn't move partitioned
indexes into the new tablespace.


Basically, it implements the option "we could silently *not* do the
switch for partitioned indexes in the flow of REINDEX, letting users
handle that with an extra ALTER TABLE SET TABLESPACE once REINDEX has
finished". It probably makes sense, since we do the tablespace change
together with the index relation rewrite and don't touch relations
without storage. Doing ALTER INDEX ... SET TABLESPACE on them will be
almost free, since they do not hold any data.


0002: Implements the remaining part, where the pg_class entry is also changed
for partitioned indexes. I think that we should think more about it;
maybe it is not so dangerous and a proper locking strategy can be found.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From 6322032b472e6b1a76e0ca9326974e5774371fb9 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 1 Feb 2021 15:20:29 +0300
Subject: [PATCH v10 2/2] Change tablespace of partitioned indexes during
 REINDEX.

There are some doubts about proper locking of partitions
here. AccessExclusiveLock is surely enough, but

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-29 Thread Alexey Kondratov

On 2021-01-28 14:42, Alexey Kondratov wrote:

On 2021-01-28 00:36, Alvaro Herrera wrote:



I didn't look at the patch closely enough to understand why you're
trying to do something like CLUSTER, VACUUM FULL or REINDEX without
holding full AccessExclusiveLock on the relation.  But do keep in mind
that once you hold a lock on a relation, trying to grab a weaker lock
afterwards is pretty pointless.



No, you are right, we are doing REINDEX with AccessExclusiveLock as it
was before. This part is a more specific one. It only applies to
partitioned indexes, which do not hold any data, so we do not reindex
them directly, only their leafs. However, if we are doing a TABLESPACE
change, we have to record it in their pg_class entry, so all future
leaf partitions were created in the proper tablespace.

That way, we open partitioned index relation only for a reference,
i.e. read-only, but modify pg_class entry under a proper lock
(RowExclusiveLock). That's why I thought that ShareLock will be
enough.

IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even
for relations with no storage, since AlterTableGetLockLevel() chooses
it if AT_SetTableSpace is met. This is very similar to our case, so
probably we should do the same?

Actually it is not completely clear for me why
ShareUpdateExclusiveLock is sufficient for newly added
SetRelationTableSpace() as Michael wrote in the comment.



Changed the patch to use AccessExclusiveLock in this part for now. This is
what 'ALTER TABLE/INDEX ... SET TABLESPACE' and 'REINDEX' usually do.
Anyway, all real leaf partitions are processed in independent
transactions later.


Also changed some doc/comment parts Justin pointed me to.

+  then all "mapped" and system relations will be skipped and a single
+  WARNING will be generated. Indexes on TOAST tables
+  are reindexed, but not moved the new tablespace.

moved *to* the new tablespace.



Fixed.



I don't know if that needs to be said at all.  We talked about it a lot
to arrive at the current behavior, but I think that's only due to the
difficulty of correcting the initial mistake.



I do not think that it would be a big deal to move indexes on TOAST
tables as well. I just thought that since 'ALTER TABLE/INDEX ... SET
TABLESPACE' only moves them together with the host table, we should not
do that either. Still, I am ready to change this logic if requested.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From 6e9db8d362e794edf421733bc7cade38c917bff4 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Wed, 27 Jan 2021 00:46:17 +0300
Subject: [PATCH v9] Allow REINDEX to change tablespace

REINDEX already does a full relation rewrite; this patch adds the
possibility to specify a new tablespace where the new relfilenode
will be created.
---
 doc/src/sgml/ref/reindex.sgml |  31 +++-
 src/backend/catalog/index.c   |  47 +-
 src/backend/commands/indexcmds.c  | 112 -
 src/bin/psql/tab-complete.c   |   4 +-
 src/include/catalog/index.h   |   9 +-
 src/test/regress/input/tablespace.source  | 106 +
 src/test/regress/output/tablespace.source | 181 ++
 7 files changed, 478 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 627b36300c..2b39699d42 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 
 CONCURRENTLY [ boolean ]
 VERBOSE [ boolean ]
+TABLESPACE new_tablespace
 
  
 
@@ -187,6 +188,21 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+TABLESPACE
+
+ 
+  Specifies that indexes will be rebuilt on a new tablespace.
+  Cannot be used with "mapped" or (unless allow_system_table_mods)
+  system relations. If SCHEMA,
+  DATABASE or SYSTEM are specified,
+  then all "mapped" and system relations will be skipped and a single
+  WARNING will be generated. Indexes on TOAST tables
+  are reindexed, but not moved to the new tablespace.
+ 
+
+   
+

 VERBOSE
 
@@ -210,6 +226,14 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+new_tablespace
+
+ 
+  The tablespace where indexes will be rebuilt.
+ 
+
+   
   
  
 
@@ -292,7 +316,12 @@ REINDEX [ ( option [, ...] ) ] { IN
with REINDEX INDEX or REINDEX TABLE,
respectively. Each partition of the specified partitioned relation is
reindexed in a separate transaction. Those commands cannot be used inside
-   a transaction block when working on a partitioned table or index.
+   a transaction block when working on a partitioned table or index. If
+   a REINDEX command fails when run on a partitioned
+   relation, and TABLESPACE was specified, then it may have
+   moved indexes on some partitions to the new tablespace.  Re-running th

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-28 Thread Alexey Kondratov

On 2021-01-28 00:36, Alvaro Herrera wrote:

On 2021-Jan-28, Alexey Kondratov wrote:

I have read more about lock levels and ShareLock should prevent any kind of
physical modification of indexes. We already hold ShareLock doing
find_all_inheritors(), which is higher than ShareUpdateExclusiveLock, so
using ShareLock seems to be safe here, but I will look on it closer.


You can look at lock.c where LockConflicts[] is; that would tell you
that ShareLock indeed conflicts with ShareUpdateExclusiveLock ... but it
does not conflict with itself!  So it would be possible to have more
than one process doing this thing at the same time, which surely makes
no sense.



Thanks for the explanation and for pointing me to LockConflicts[]. This
is a good reference.
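Just to spell out the consequence for the archives, here is a toy C sketch of
that corner of the conflict table (only the two modes discussed here, not the
real lock.c data):

#include <stdbool.h>
#include <stdio.h>

typedef enum
{
	TOY_SHARE,					/* ShareLock: conflicts with SUE, not with itself */
	TOY_SHARE_UPDATE_EXCLUSIVE	/* ShareUpdateExclusiveLock: self-conflicting */
} ToyLockMode;

/* Tiny excerpt of the conflict relation described above. */
static bool
conflicts(ToyLockMode a, ToyLockMode b)
{
	if (a == TOY_SHARE && b == TOY_SHARE)
		return false;			/* two sessions can both hold ShareLock ... */
	return true;				/* ... while every other pair here conflicts */
}

int
main(void)
{
	/* This is exactly the problem pointed out above: two backends taking */
	/* ShareLock would not block each other. */
	printf("Share vs Share: %d\n", conflicts(TOY_SHARE, TOY_SHARE));	/* 0 */
	printf("Share vs ShareUpdateExclusive: %d\n",
		   conflicts(TOY_SHARE, TOY_SHARE_UPDATE_EXCLUSIVE));			/* 1 */
	printf("ShareUpdateExclusive vs itself: %d\n",
		   conflicts(TOY_SHARE_UPDATE_EXCLUSIVE, TOY_SHARE_UPDATE_EXCLUSIVE));	/* 1 */
	return 0;
}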




I didn't look at the patch closely enough to understand why you're
trying to do something like CLUSTER, VACUUM FULL or REINDEX without
holding full AccessExclusiveLock on the relation.  But do keep in mind
that once you hold a lock on a relation, trying to grab a weaker lock
afterwards is pretty pointless.



No, you are right, we are doing REINDEX with AccessExclusiveLock as it
was before. This part is a more specific one. It only applies to
partitioned indexes, which do not hold any data, so we do not reindex
them directly, only their leaves. However, if we are doing a TABLESPACE
change, we have to record it in their pg_class entry, so that all future
leaf partitions are created in the proper tablespace.


That way, we open the partitioned index relation only for reference, i.e.
read-only, but modify its pg_class entry under a proper lock
(RowExclusiveLock). That's why I thought that ShareLock would be enough.


IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even for 
relations with no storage, since AlterTableGetLockLevel() chooses it if 
AT_SetTableSpace is met. This is very similar to our case, so probably 
we should do the same?


Actually it is not completely clear to me why ShareUpdateExclusiveLock
is sufficient for the newly added SetRelationTableSpace(), as Michael wrote
in the comment.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-27 Thread Alexey Kondratov

On 2021-01-27 06:14, Michael Paquier wrote:

On Wed, Jan 27, 2021 at 01:00:50AM +0300, Alexey Kondratov wrote:
In the new 0002 I moved ACL check to the upper level, i.e. ExecReindex(),
and removed expensive text generation in test. Not touched yet some of your
previously raised concerns. Also, you made SetRelationTableSpace() to accept
Relation instead of Oid, so now we have to open/close indexes in the
ReindexPartitions(), I am not sure that I use proper locking there, but it
works.


Passing down Relation to the new routines makes the most sense to me
because we force the callers to think about the level of locking
that's required when doing any tablespace moves.

+   Relation iRel = index_open(partoid, ShareLock);
+
+   if (CheckRelationTableSpaceMove(iRel, params->tablespaceOid))
+   SetRelationTableSpace(iRel,
+ params->tablespaceOid,
+ InvalidOid);
Speaking of which, this breaks the locking assumptions of
SetRelationTableSpace().  I feel that we should think harder about
this part for partitioned indexes and tables because this looks rather
unsafe in terms of locking assumptions with partition trees.  If we
cannot come up with a safe solution, I would be fine with disallowing
TABLESPACE in this case, as a first step.  Not all problems have to be
solved at once, and even without this part the feature is still
useful.



I have read more about lock levels, and ShareLock should prevent any kind
of physical modification of indexes. We already hold ShareLock when doing
find_all_inheritors(), which is higher than ShareUpdateExclusiveLock, so
using ShareLock seems to be safe here, but I will look at it more closely.




+   /* It's not a shared catalog, so refuse to move it to shared tablespace */
+   if (params->tablespaceOid == GLOBALTABLESPACE_OID)
+   ereport(ERROR,
+   (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+errmsg("cannot move non-shared relation to tablespace \"%s\"",
+get_tablespace_name(params->tablespaceOid))));
Why is that needed if CheckRelationTableSpaceMove() is used?



This is from ReindexRelationConcurrently(), where we do not use
CheckRelationTableSpaceMove(). To me it makes sense to add only this
GLOBALTABLESPACE_OID check there, since we already check for system
catalogs just before and for temp relations just after, so adding
CheckRelationTableSpaceMove() would be a double check.




- indexRelation->rd_rel->reltablespace,
+ OidIsValid(tablespaceOid) ?
+   tablespaceOid :
indexRelation->rd_rel->reltablespace,
Let's remove this logic from index_concurrently_create_copy() and let
the caller directly decide the tablespace to use, without a dependency
on InvalidOid in the inner routine.  A share update exclusive lock is
already hold on the old index when creating the concurrent copy, so
there won't be concurrent schema changes.



Changed.

Also added tests for ACL checks, relfilenode changes. Added ACL recheck 
for multi-transactional case. Added info about TOAST index reindexing. 
Changed some comments.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From f176a6e5a81ab133fee849f72e4edb8b287d6062 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Wed, 27 Jan 2021 00:46:17 +0300
Subject: [PATCH v8] Allow REINDEX to change tablespace

REINDEX already does a full relation rewrite; this patch adds the
possibility to specify a new tablespace where the new relfilenode
will be created.
---
 doc/src/sgml/ref/reindex.sgml |  31 +++-
 src/backend/catalog/index.c   |  50 +-
 src/backend/commands/indexcmds.c  | 112 -
 src/bin/psql/tab-complete.c   |   4 +-
 src/include/catalog/index.h   |   9 +-
 src/test/regress/input/tablespace.source  | 106 +
 src/test/regress/output/tablespace.source | 181 ++
 7 files changed, 481 insertions(+), 12 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 627b36300c..e610a0f52c 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 
 CONCURRENTLY [ boolean ]
 VERBOSE [ boolean ]
+TABLESPACE new_tablespace
 
  
 
@@ -187,6 +188,21 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+TABLESPACE
+
+ 
+  Specifies that indexes will be rebuilt on a new tablespace.
+  Cannot be used with "mapped" and system (unless allow_system_table_mods
+  is set to TRUE) relations. If SCHEMA,
+  DATABASE or SYSTEM are specified,
+  then all "mapped" and system relations will be skipped and a single
+  WARNING will be generated. Indexes on TOA

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-26 Thread Alexey Kondratov

On 2021-01-26 09:58, Michael Paquier wrote:

On Mon, Jan 25, 2021 at 11:11:38PM +0300, Alexey Kondratov wrote:
I updated comment with CCI info, did pgindent run and renamed new function
to SetRelationTableSpace(). New patch is attached.

[...]

Yeah, all these checks we complicated from the beginning. I will try to find
a better place tomorrow or put more info into the comments at least.


I was reviewing that, and I think that we can do a better
consolidation on several points that will also help the features
discussed on this thread for VACUUM, CLUSTER and REINDEX.

If you look closely, ATExecSetTableSpace() uses the same logic as the
code modified here to check if a relation can be moved to a new
tablespace, with extra checks for mapped relations,
GLOBALTABLESPACE_OID or if attempting to manipulate a temp relation
from another session.  There are two differences though:
- Custom actions are taken between the phase where we check if a
relation can be moved to a new tablespace, and the update of
pg_class.
- ATExecSetTableSpace() needs to be able to set a given relation
relfilenode on top of reltablespace, the newly-created one.

So I think that the heart of the problem is made of two things here:
- We should have one common routine for the existing code paths and
the new code paths able to check if a tablespace move can be done or
not.  The case of a cluster, reindex or vacuum on a list of relations
extracted from pg_class would still require a different handling
as incorrect relations have to be skipped, but the case of individual
relations can reuse the refactoring pieces done here
(see CheckRelationTableSpaceMove() in the attached).
- We need to have a second routine able to update reltablespace and
optionally relfilenode for a given relation's pg_class entry, once the
caller has made sure that CheckRelationTableSpaceMove() validates a
tablespace move.



I think that I got your idea. One comment:

+bool
+CheckRelationTableSpaceMove(Relation rel, Oid newTableSpaceId)
+{
+   Oid oldTableSpaceId;
+   Oid reloid = RelationGetRelid(rel);
+
+   /*
+* No work if no change in tablespace.  Note that MyDatabaseTableSpace
+* is stored as 0.
+*/
+   oldTableSpaceId = rel->rd_rel->reltablespace;
+   if (newTableSpaceId == oldTableSpaceId ||
+   (newTableSpaceId == MyDatabaseTableSpace && oldTableSpaceId == 0))
+   {
+   InvokeObjectPostAlterHook(RelationRelationId, reloid, 0);
+   return false;
+   }

CheckRelationTableSpaceMove() does not feel like the right place for
invoking post-alter hooks. It is intended only to check whether a tablespace
change is possible. Anyway, ATExecSetTableSpace() and
ATExecSetTableSpaceNoStorage() already do that in the no-op case.



Please note that was a bug in your previous patch 0002: shared
dependencies need to be registered if reltablespace is updated of
course, but also iff the relation has no physical storage.  So
changeDependencyOnTablespace() requires a check based on
RELKIND_HAS_STORAGE(), or REINDEX would have registered shared
dependencies even for relations with storage, something we don't
want per the recent work done by Alvaro in ebfe2db.



Yes, thanks.

I have removed this InvokeObjectPostAlterHook() from your 0001 and made
0002 work on top of it. I think that it should now look closer to
what you described above.


In the new 0002 I moved the ACL check to the upper level, i.e.
ExecReindex(), and removed the expensive text generation in the test. I have
not yet addressed some of your previously raised concerns. Also, you made
SetRelationTableSpace() accept a Relation instead of an Oid, so now we
have to open/close indexes in ReindexPartitions(); I am not sure that I use
proper locking there, but it works.
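For reference, a rough sketch of the caller pattern implied by this
refactoring, assuming the CheckRelationTableSpaceMove()/SetRelationTableSpace()
signatures from the attached patches (backend-code fragment only; lock level
and error handling deliberately simplified):

/* Hedged sketch, not the exact code from the attached patches. */
#include "postgres.h"
#include "access/genam.h"		/* index_open(), index_close() */
#include "access/xact.h"		/* CommandCounterIncrement() */
#include "commands/tablecmds.h" /* CheckRelationTableSpaceMove(), SetRelationTableSpace() */
#include "storage/lockdefs.h"
#include "utils/rel.h"

static void
move_partition_index_tablespace(Oid indexOid, Oid newTableSpaceId)
{
	/* The lock level is the open question above; ShareLock shown here. */
	Relation	iRel = index_open(indexOid, ShareLock);

	/* First routine: is the move allowed, and would it change anything? */
	if (CheckRelationTableSpaceMove(iRel, newTableSpaceId))
	{
		/* Second routine: update reltablespace in the index's pg_class row. */
		SetRelationTableSpace(iRel, newTableSpaceId, InvalidOid);

		/* The caller is responsible for making the change visible. */
		CommandCounterIncrement();
	}

	index_close(iRel, NoLock);
}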



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From 96a37399a9cf9ae08d62e28496e73b36087e5a19 Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Tue, 26 Jan 2021 15:53:06 +0900
Subject: [PATCH v7 1/2] Refactor code to detect and process tablespace moves

---
 src/backend/commands/tablecmds.c | 218 +--
 src/include/commands/tablecmds.h |   4 +
 2 files changed, 127 insertions(+), 95 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8687e9a97c..c08eedf995 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -3037,6 +3037,116 @@ SetRelationHasSubclass(Oid relationId, bool relhassubclass)
 	table_close(relationRelation, RowExclusiveLock);
 }
 
+/*
+ * CheckRelationTableSpaceMove
+ *		Check if relation can be moved to new tablespace.
+ *
+ * NOTE: caller must be holding an appropriate lock on the relation.
+ * ShareUpdateExclusiveLock is sufficient to prevent concurrent schema
+ * changes.
+ *
+ * Returns true if the relation can be moved to the new tablesp

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-25 Thread Alexey Kondratov

On 2021-01-25 11:07, Michael Paquier wrote:

On Fri, Jan 22, 2021 at 05:07:02PM +0300, Alexey Kondratov wrote:
I have updated patches accordingly and also simplified tablespaceOid checks
and assignment in the newly added SetRelTableSpace(). Result is attached as
two separate patches for an ease of review, but no objections to merge them
and apply at once if everything is fine.


 extern void SetRelationHasSubclass(Oid relationId, bool relhassubclass);
+extern bool SetRelTableSpace(Oid reloid, Oid tablespaceOid);
Seeing SetRelationHasSubclass(), wouldn't it be more consistent to use
SetRelationTableSpace() as routine name?

I think that we should document that the caller of this routine had
better do a CCI once done to make the tablespace change visible.
Except for those two nits, the patch needs an indentation run and some
style tweaks but its logic looks fine.  So I'll apply that first
piece.



I updated the comment with the CCI info, did a pgindent run and renamed the
new function to SetRelationTableSpace(). A new patch is attached.



+INSERT INTO regress_tblspace_test_tbl (num1, num2, t)
+  SELECT round(random()*100), random(), repeat('text', 100)
+  FROM generate_series(1, 10) s(i);
Repeating 1M times a text value is too costly for such a test.  And as
even for empty tables there is one page created for toast indexes,
there is no need for that?



Yes, the TOAST relation is created anyway. I just wanted to put some data
into a TOAST index, so that REINDEX did some meaningful work there, not only
creating a new relfilenode. However, you are right, and this query
increases the tablespace test execution time by more than 2 times on
my machine. I think that it is not really required.




This patch is introducing three new checks for system catalogs:
- don't use tablespace for mapped relations.
- don't use tablespace for system relations, except if
allowSystemTableMods.
- don't move non-shared relation to global tablespace.
For the non-concurrent case, all three checks are in reindex_index().
For the concurrent case, the two first checks are in
ReindexMultipleTables() and the third one is in
ReindexRelationConcurrently().  That's rather tricky to follow because
CONCURRENTLY is not allowed on system relations.  I am wondering if it
would be worth an extra comment effort, or if there is a way to
consolidate that better.



Yeah, all these checks were complicated from the beginning. I will try to
find a better place tomorrow or at least put more info into the
comments.


I am also going to check/fix the remaining points regarding 002 
tomorrow.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From 39880842d7af31dcbfcffe7219250b31102955d5 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Wed, 20 Jan 2021 20:21:12 +0300
Subject: [PATCH v5 1/2] Extract common part from ATExecSetTableSpaceNoStorage
 for a future usage

---
 src/backend/commands/tablecmds.c | 95 +++-
 src/include/commands/tablecmds.h |  2 +
 2 files changed, 58 insertions(+), 39 deletions(-)

diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 8687e9a97c..ec9c440e4e 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -13291,6 +13291,59 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode)
 	list_free(reltoastidxids);
 }
 
+/*
+ * SetRelationTableSpace - modify relation tablespace in the pg_class entry.
+ *
+ * 'reloid' is an Oid of relation to be modified.
+ * 'tablespaceOid' is an Oid of new tablespace.
+ *
+ * Catalog modification is done only if tablespaceOid is different from
+ * the one currently set.  The returned bool value indicates whether any
+ * changes were made.  Note that the caller is responsible for doing
+ * CommandCounterIncrement() to make the tablespace change visible.
+ */
+bool
+SetRelationTableSpace(Oid reloid, Oid tablespaceOid)
+{
+	Relation	pg_class;
+	HeapTuple	tuple;
+	Form_pg_class rd_rel;
+	bool		changed = false;
+
+	/* Get a modifiable copy of the relation's pg_class row. */
+	pg_class = table_open(RelationRelationId, RowExclusiveLock);
+
+	tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid));
+	if (!HeapTupleIsValid(tuple))
+		elog(ERROR, "cache lookup failed for relation %u", reloid);
+	rd_rel = (Form_pg_class) GETSTRUCT(tuple);
+
+	/* MyDatabaseTableSpace is stored as InvalidOid. */
+	if (tablespaceOid == MyDatabaseTableSpace)
+		tablespaceOid = InvalidOid;
+
+	/* No work if no change in tablespace. */
+	if (tablespaceOid != rd_rel->reltablespace)
+	{
+		/* Update the pg_class row. */
+		rd_rel->reltablespace = tablespaceOid;
+		CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);
+
+		/* Record dependency on tablespace. */
+		changeDependencyOnTablespace(RelationRelationId,
+	 reloid, rd_rel->reltablespace);
+
+		changed = true;
+	}
+
+	InvokeObjectPostAlterHook(RelationRelationId, reloid, 0);
+
+	heap_freetuple

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-22 Thread Alexey Kondratov

On 2021-01-22 00:26, Justin Pryzby wrote:

On Thu, Jan 21, 2021 at 11:48:08PM +0300, Alexey Kondratov wrote:
Attached is a new patch set of first two patches, that should resolve all
the issues raised before (ACL, docs, tests) excepting TOAST. Double thanks
for suggestion to add more tests with nested partitioning. I have found and
squashed a huge bug related to the returning back to the default tablespace
using newly added tests.

Regarding TOAST. Now we skip moving toast indexes or throw error if someone
wants to move TOAST index directly. I had a look on ALTER TABLE SET
TABLESPACE and it has a bit complicated logic:

1) You cannot move TOAST table directly.
2) But if you move basic relation that TOAST table belongs to, then they are
moved altogether.
3) Same logic as 2) happens if one does ALTER TABLE ALL IN TABLESPACE ...

That way, ALTER TABLE allows moving TOAST tables (with indexes) implicitly,
but does not allow doing that explicitly. In the same time I found docs to
be vague about such behavior it only says:

All tables in the current database in a tablespace can be moved
by using the ALL IN TABLESPACE ... Note that system catalogs are
not moved by this command

Changing any part of a system catalog table is not permitted.

So actually ALTER TABLE treats TOAST relations as system sometimes, but
sometimes not.

From the end user perspective it makes sense to move TOAST with main table
when doing ALTER TABLE SET TABLESPACE. But should we touch indexes on TOAST
table with REINDEX? We cannot move TOAST relation itself, since we are doing
only a reindex, so we end up in the state when TOAST table and its index are
placed in the different tablespaces. This state is not reachable with ALTER
TABLE/INDEX, so it seem we should not allow it with REINDEX as well, should
we?


+		 * Even if a table's indexes were moved to a new tablespace, the index
+		 * on its toast table is not normally moved.
 */
ReindexParams newparams = *params;

newparams.options &= ~(REINDEXOPT_MISSING_OK);
+   if (!allowSystemTableMods)
+   newparams.tablespaceOid = InvalidOid;


I think you're right.  So actually TOAST should never move, even if
allowSystemTableMods, right ?



I think so. I would prefer not to move TOAST indexes implicitly at
all during reindex.




@@ -292,7 +315,11 @@ REINDEX [ ( class="parameter">option [, ...] ) ] { IN
with REINDEX INDEX or REINDEX TABLE,
respectively. Each partition of the specified partitioned relation is
reindexed in a separate transaction. Those commands cannot be used inside
-   a transaction block when working on a partitioned table or index.
+   a transaction block when working on a partitioned table or index. If
+   REINDEX with TABLESPACE executed
+   on partitioned relation fails it may have moved some partitions to the new
+   tablespace. Repeated command will still reindex all partitions even if they
+   are already in the new tablespace.

Minor corrections here:

If a REINDEX command fails when run on a partitioned
relation, and TABLESPACE was specified, then it may have
moved indexes on some partitions to the new tablespace.  Re-running the command
will reindex all partitions and move previously-unprocessed indexes to the new
tablespace.


Sounds good to me.

I have updated the patches accordingly and also simplified the tablespaceOid
checks and assignment in the newly added SetRelTableSpace(). The result is
attached as two separate patches for ease of review, but I have no objections
to merging them and applying them at once if everything is fine.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From 87e47e9b5b3d6b49230045e5db8f844b14b34ba0 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 23 Mar 2020 21:10:29 +0300
Subject: [PATCH v4 2/2] Allow REINDEX to change tablespace

REINDEX already does a full relation rewrite; this patch adds the
possibility to specify a new tablespace where the new relfilenode
will be created.
---
 doc/src/sgml/ref/reindex.sgml |  30 +++-
 src/backend/catalog/index.c   |  81 ++-
 src/backend/commands/indexcmds.c  |  81 ++-
 src/bin/psql/tab-complete.c   |   4 +-
 src/include/catalog/index.h   |   2 +
 src/test/regress/input/tablespace.source  |  85 
 src/test/regress/output/tablespace.source | 159 ++
 7 files changed, 436 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 627b36300c..a1c7736aec 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 
 CONCURRENTLY [ boolean ]
 VERBOSE [ boolean ]
+TABLESPACE new_tablespace
 
  
 
@@ -187,6 +188,20 @@ R

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-21 Thread Alexey Kondratov

On 2021-01-21 17:06, Alexey Kondratov wrote:

On 2021-01-21 04:41, Michael Paquier wrote:


There are no tests for partitioned tables, aka we'd want to make sure
that the new partitioned index is on the correct tablespace, as well
as all its leaves.  It may be better to have at least two levels of
partitioned tables, as well as a partitioned table with no leaves in
the cases dealt with.



Yes, sure, it makes sense.


+*
+* Even if a table's indexes were moved to a new tablespace, the index
+* on its toast table is not normally moved.
 */
Still, REINDEX (TABLESPACE) TABLE should move all of them to be
consistent with ALTER TABLE SET TABLESPACE, but that's not the case
with this code, no?  This requires proper test coverage, but there is
nothing of the kind in this patch.


You are right, we do not move TOAST indexes now, since
IsSystemRelation() is true for TOAST indexes, so I thought that we
should not allow moving them without allow_system_table_mods=true. Now
I wonder why ALTER TABLE does that.

I am going to attach the new version of patch set today or tomorrow.



Attached is a new patch set of the first two patches, which should resolve
all the issues raised before (ACL, docs, tests) except TOAST. Double
thanks for the suggestion to add more tests with nested partitioning. Using
the newly added tests, I have found and squashed a huge bug related to
moving back to the default tablespace.


Regarding TOAST: now we skip moving TOAST indexes, or throw an error if
someone wants to move a TOAST index directly. I had a look at ALTER TABLE
SET TABLESPACE and it has somewhat complicated logic:


1) You cannot move a TOAST table directly.
2) But if you move the base relation that the TOAST table belongs to, then
they are moved together.
3) The same logic as 2) applies if one does ALTER TABLE ALL IN TABLESPACE
...


That way, ALTER TABLE allows moving TOAST tables (with indexes)
implicitly, but does not allow doing that explicitly. At the same time, I
found the docs to be vague about such behavior; they only say:


All tables in the current database in a tablespace can be moved
by using the ALL IN TABLESPACE ... Note that system catalogs are
not moved by this command

Changing any part of a system catalog table is not permitted.

So ALTER TABLE actually treats TOAST relations as system relations
sometimes, but sometimes not.


From the end-user perspective it makes sense to move TOAST with the main
table when doing ALTER TABLE SET TABLESPACE. But should we touch indexes
on a TOAST table with REINDEX? We cannot move the TOAST relation itself,
since we are doing only a reindex, so we would end up in a state where the
TOAST table and its index are placed in different tablespaces. This state
is not reachable with ALTER TABLE/INDEX, so it seems we should not allow
it with REINDEX either, should we?



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From bcd690da6bc3db16a96305b45546d3c9e400f769 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 23 Mar 2020 21:10:29 +0300
Subject: [PATCH v3 2/2] Allow REINDEX to change tablespace

REINDEX already does a full relation rewrite; this patch adds the
possibility to specify a new tablespace where the new relfilenode
will be created.
---
 doc/src/sgml/ref/reindex.sgml |  29 +++-
 src/backend/catalog/index.c   |  82 +++-
 src/backend/commands/indexcmds.c  |  81 +++-
 src/bin/psql/tab-complete.c   |   4 +-
 src/include/catalog/index.h   |   2 +
 src/test/regress/input/tablespace.source  |  79 +++
 src/test/regress/output/tablespace.source | 154 ++
 7 files changed, 425 insertions(+), 6 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 627b36300c..90fdad0b4c 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 
 CONCURRENTLY [ boolean ]
 VERBOSE [ boolean ]
+TABLESPACE new_tablespace
 
  
 
@@ -187,6 +188,20 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+TABLESPACE
+
+ 
+  Specifies that indexes will be rebuilt on a new tablespace.
+  Cannot be used with "mapped" and system (unless allow_system_table_mods
+  is set to TRUE) relations. If SCHEMA,
+  DATABASE or SYSTEM are specified, then
+  all "mapped" and system relations will be skipped and a single
+  WARNING will be generated.
+ 
+
+   
+

 VERBOSE
 
@@ -210,6 +225,14 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+new_tablespace
+
+ 
+  The tablespace where indexes will be rebuilt.
+ 
+
+   
   
  
 
@@ -292,7 +315,11 @@ REINDEX [ ( option [, ...] ) ] { IN
with REINDEX INDEX or REINDEX TABLE,
respectively. Each partition of the specified partitioned relation is
reindexed in a separat

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-21 Thread Alexey Kondratov

On 2021-01-21 04:41, Michael Paquier wrote:

On Wed, Jan 20, 2021 at 03:34:39PM -0300, Alvaro Herrera wrote:

On 2021-Jan-20, Alexey Kondratov wrote:

Ugh, forgot to attach the patches. Here they are.


Yeah, looks reasonable.



+
+   if (changed)
+   /* Record dependency on tablespace */
+   changeDependencyOnTablespace(RelationRelationId,
+reloid, rd_rel->reltablespace);


Why have a separate "if (changed)" block here instead of merging with
the above?


Yep.



Sure, this is a refactoring artifact.


+   if (SetRelTablespace(reloid, newTableSpace))
+   /* Make sure the reltablespace change is visible */
+   CommandCounterIncrement();
At quick glance, I am wondering why you just don't do a CCI within
SetRelTablespace().



I did it that way for better readability at first, since it looks more
natural when you make some change (SetRelTablespace) and then make it
visible with CCI. The second argument was that in the case of
reindex_index() we also have to call RelationAssumeNewRelfilenode() and
RelationDropStorage() before doing the CCI and making the new tablespace
visible. And this part is critical, I guess.




+  This specifies that indexes will be rebuilt on a new tablespace.
+  Cannot be used with "mapped" relations. If SCHEMA,
+  DATABASE or SYSTEM is specified, then
+  all unsuitable relations will be skipped and a single WARNING
+  will be generated.
+  will be generated.
What is an unsuitable relation?  How can the end user know that?



This was referring to mapped relations mentioned in the previous 
sentence. I have tried to rewrite this part and make it more specific in 
my current version. Also added Justin's changes to the docs and comment.



This is missing ACL checks when moving the index into a new location,
so this requires some pg_tablespace_aclcheck() calls, and the other
patches share the same issue.



I added proper pg_tablespace_aclcheck() calls to reindex_index() and 
ReindexPartitions().
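
For reference, the check is along these lines --- a sketch assuming a 
tablespaceOid variable inside reindex_index(); the exact placement in the 
patch may differ:

	/* Check that the current user may create objects in the new tablespace */
	if (OidIsValid(tablespaceOid))
	{
		AclResult	aclresult;

		aclresult = pg_tablespace_aclcheck(tablespaceOid, GetUserId(),
										   ACL_CREATE);
		if (aclresult != ACLCHECK_OK)
			aclcheck_error(aclresult, OBJECT_TABLESPACE,
						   get_tablespace_name(tablespaceOid));
	}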



+   else if (partkind == RELKIND_PARTITIONED_TABLE)
+   {
+   Relation rel = table_open(partoid, ShareLock);
+   List*indexIds = RelationGetIndexList(rel);
+   ListCell *lc;
+
+   table_close(rel, NoLock);
+   foreach (lc, indexIds)
+   {
+   Oid indexid = lfirst_oid(lc);
+   (void) set_rel_tablespace(indexid, 
params->tablespaceOid);

+   }
+   }
This is really a good question.  ReindexPartitions() would trigger one
transaction per leaf to work on.  Changing the tablespace of the
partitioned table(s) before doing any work has the advantage to tell
any new partition to use the new tablespace.  Now, I see a struggling
point here: what should we do if the processing fails in the middle of
the move, leaving a portion of the leaves in the previous tablespace?
On a follow-up reindex with the same command, should the command force
a reindex even on the partitions that have been moved?  Or could there
be a point in skipping the partitions that are already on the new
tablespace and only process the ones on the previous tablespace?  It
seems to me that the first scenario makes the most sense as currently
a REINDEX works on all the relations defined, though there could be
use cases for the second case.  This should be documented, I think.



I agree that a follow-up REINDEX should also reindex already-moved 
partitions, since REINDEX (TABLESPACE ...) is still a REINDEX first of 
all. I will try to put something about this into the docs. Also, I think 
we cannot be sure that nothing has happened to the already reindexed 
partitions between two consecutive REINDEX calls.



There are no tests for partitioned tables, aka we'd want to make sure
that the new partitioned index is on the correct tablespace, as well
as all its leaves.  It may be better to have at least two levels of
partitioned tables, as well as a partitioned table with no leaves in
the cases dealt with.



Yes, sure, it makes sense.


+*
+* Even if a table's indexes were moved to a new tablespace, the index
+* on its toast table is not normally moved.
 */
Still, REINDEX (TABLESPACE) TABLE should move all of them to be
consistent with ALTER TABLE SET TABLESPACE, but that's not the case
with this code, no?  This requires proper test coverage, but there is
nothing of the kind in this patch.


You are right, we do not move TOAST indexes now, since 
IsSystemRelation() is true for TOAST indexes, so I thought that we should 
not allow moving them without allow_system_table_mods=true. Now I wonder 
why ALTER TABLE does move them.


I am going to attach the new version of the patch set today or tomorrow.


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-20 Thread Alexey Kondratov

On 2021-01-20 21:08, Alexey Kondratov wrote:

On 2021-01-20 18:54, Alvaro Herrera wrote:

On 2021-Jan-20, Alvaro Herrera wrote:


On 2021-Jan-20, Michael Paquier wrote:

> +/*
> + * This is mostly duplicating ATExecSetTableSpaceNoStorage,
> + * which should maybe be factored out to a library function.
> + */
> Wouldn't it be better to do first the refactoring of 0002 and then
> 0001 so as REINDEX can use the new routine, instead of putting that
> into a comment?

I think merging 0001 and 0002 into a single commit is a reasonable
approach.


... except it doesn't make a lot of sense to have set_rel_tablespace in
either indexcmds.c or index.c.  I think tablecmds.c is a better place
for it.  (I would have thought catalog/storage.c, but that one's not the
right abstraction level it seems.)



I did a refactoring of ATExecSetTableSpaceNoStorage() in 0001. The new
function SetRelTablespace() is placed in tablecmds.c, and the following
0002 makes use of it. Is this close to what you and Michael suggested?



Ugh, forgot to attach the patches. Here they are.

--
AlexeyFrom 2c3876f99bc8ebdd07c532619992e7ec3093e50a Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 23 Mar 2020 21:10:29 +0300
Subject: [PATCH v2 2/2] Allow REINDEX to change tablespace

REINDEX already does full relation rewrite, this patch adds a
possibility to specify a new tablespace where new relfilenode
will be created.
---
 doc/src/sgml/ref/reindex.sgml |  22 +
 src/backend/catalog/index.c   |  72 ++-
 src/backend/commands/indexcmds.c  |  68 ++-
 src/bin/psql/tab-complete.c   |   4 +-
 src/include/catalog/index.h   |   2 +
 src/test/regress/input/tablespace.source  |  53 +++
 src/test/regress/output/tablespace.source | 102 ++
 7 files changed, 318 insertions(+), 5 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 627b36300c..4f84060c4d 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN
 
 CONCURRENTLY [ boolean ]
 VERBOSE [ boolean ]
+TABLESPACE new_tablespace
 
  
 
@@ -187,6 +188,19 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+TABLESPACE
+
+ 
+  This specifies that indexes will be rebuilt on a new tablespace.
+  Cannot be used with "mapped" relations. If SCHEMA,
+  DATABASE or SYSTEM is specified, then
+  all unsuitable relations will be skipped and a single WARNING
+  will be generated.
+ 
+
+   
+

 VERBOSE
 
@@ -210,6 +224,14 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+new_tablespace
+
+ 
+  The tablespace where indexes will be rebuilt.
+ 
+
+   
   
  
 
diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c
index b8cd35e995..ed98b17483 100644
--- a/src/backend/catalog/index.c
+++ b/src/backend/catalog/index.c
@@ -57,6 +57,7 @@
 #include "commands/event_trigger.h"
 #include "commands/progress.h"
 #include "commands/tablecmds.h"
+#include "commands/tablespace.h"
 #include "commands/trigger.h"
 #include "executor/executor.h"
 #include "miscadmin.h"
@@ -1394,9 +1395,13 @@ index_update_collation_versions(Oid relid, Oid coll)
  * Create concurrently an index based on the definition of the one provided by
  * caller.  The index is inserted into catalogs and needs to be built later
  * on.  This is called during concurrent reindex processing.
+ *
+ * "tablespaceOid" is the new tablespace to use for this index.  If
+ * InvalidOid, use the tablespace in-use instead.
  */
 Oid
-index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char *newName)
+index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId,
+			   Oid tablespaceOid, const char *newName)
 {
 	Relation	indexRelation;
 	IndexInfo  *oldInfo,
@@ -1526,7 +1531,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char
 			  newInfo,
 			  indexColNames,
 			  indexRelation->rd_rel->relam,
-			  indexRelation->rd_rel->reltablespace,
+			  OidIsValid(tablespaceOid) ?
+tablespaceOid : indexRelation->rd_rel->reltablespace,
 			  indexRelation->rd_indcollation,
 			  indclass->values,
 			  indcoloptions->values,
@@ -3591,6 +3597,8 @@ IndexGetRelation(Oid indexId, bool missing_ok)
 
 /*
  * reindex_index - This routine is used to recreate a single index
+ *
+ * See comments of reindex_relation() for details about "tablespaceOid".
  */
 void
 reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
@@ -3603,6 +3611,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence,
 	volatile bool skipped_constraint = false;
 	PGRUsage	ru0;
 	bo

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-20 Thread Alexey Kondratov

On 2021-01-20 18:54, Alvaro Herrera wrote:

On 2021-Jan-20, Alvaro Herrera wrote:


On 2021-Jan-20, Michael Paquier wrote:

> +/*
> + * This is mostly duplicating ATExecSetTableSpaceNoStorage,
> + * which should maybe be factored out to a library function.
> + */
> Wouldn't it be better to do first the refactoring of 0002 and then
> 0001 so as REINDEX can use the new routine, instead of putting that
> into a comment?

I think merging 0001 and 0002 into a single commit is a reasonable
approach.


... except it doesn't make a lot of sense to have set_rel_tablespace in
either indexcmds.c or index.c.  I think tablecmds.c is a better place
for it.  (I would have thought catalog/storage.c, but that one's not the
right abstraction level it seems.)



I did a refactoring of ATExecSetTableSpaceNoStorage() in 0001. The new 
function SetRelTablespace() is placed in tablecmds.c, and the following 
0002 makes use of it. Is this close to what you and Michael suggested?




But surely ATExecSetTableSpaceNoStorage should be using this new
routine.  (I first thought 0002 was doing that, since that commit is
calling itself a "refactoring", but now that I look closer, it's not.)



Yeah, this 'refactoring' initially referred to refactoring of what 
Justin added to one of the previous 0001 versions. It was meant to be 
merged into 0001 once agreed upon, but we got distracted by other stuff.


I have not yet addressed Michael's concerns regarding reindex of 
partitions. I am going to look closer at it tomorrow.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2021-01-13 Thread Alexey Kondratov

On 2021-01-13 14:34, Michael Paquier wrote:

On Wed, Jan 13, 2021 at 05:22:49PM +0900, Michael Paquier wrote:

Yeah, that makes sense.  I'll send an updated patch based on that.


And here you go as per the attached.  I don't think that there was
anything remaining on my radar.  This version still needs to be
indented properly though.

Thoughts?



Thanks.

+   bits32  options;/* bitmask of CLUSTEROPT_* */

This should say '/* bitmask of CLUOPT_* */', I guess, since only 
CLUOPT_* flags are defined. Otherwise, everything looks as discussed 
upthread.


By the way, something went wrong with the last email subject, so I have 
changed it back to the original in this response. I have also attached 
your patch (with only this CLUOPT_* correction) to make sure it stays in 
the thread. Although postgresql.org's web archive is clever enough to 
link your email to the same thread even with a different subject.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Companydiff --git a/src/include/catalog/index.h b/src/include/catalog/index.h
index 9904a76387..43cfdeaa6b 100644
--- a/src/include/catalog/index.h
+++ b/src/include/catalog/index.h
@@ -30,13 +30,16 @@ typedef enum
 } IndexStateFlagsAction;
 
 /* options for REINDEX */
-typedef enum ReindexOption
+typedef struct ReindexParams
 {
-	REINDEXOPT_VERBOSE = 1 << 0,	/* print progress info */
-	REINDEXOPT_REPORT_PROGRESS = 1 << 1,	/* report pgstat progress */
-	REINDEXOPT_MISSING_OK = 1 << 2, /* skip missing relations */
-	REINDEXOPT_CONCURRENTLY = 1 << 3	/* concurrent mode */
-} ReindexOption;
+	bits32		options;			/* bitmask of REINDEXOPT_* */
+} ReindexParams;
+
+/* flag bits for ReindexParams->flags */
+#define REINDEXOPT_VERBOSE		0x01	/* print progress info */
+#define REINDEXOPT_REPORT_PROGRESS 0x02 /* report pgstat progress */
+#define REINDEXOPT_MISSING_OK 	0x04	/* skip missing relations */
+#define REINDEXOPT_CONCURRENTLY	0x08	/* concurrent mode */
 
 /* state info for validate_index bulkdelete callback */
 typedef struct ValidateIndexState
@@ -146,7 +149,7 @@ extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action);
 extern Oid	IndexGetRelation(Oid indexId, bool missing_ok);
 
 extern void reindex_index(Oid indexId, bool skip_constraint_checks,
-		  char relpersistence, int options);
+		  char relpersistence, ReindexParams *params);
 
 /* Flag bits for reindex_relation(): */
 #define REINDEX_REL_PROCESS_TOAST			0x01
@@ -155,7 +158,7 @@ extern void reindex_index(Oid indexId, bool skip_constraint_checks,
 #define REINDEX_REL_FORCE_INDEXES_UNLOGGED	0x08
 #define REINDEX_REL_FORCE_INDEXES_PERMANENT 0x10
 
-extern bool reindex_relation(Oid relid, int flags, int options);
+extern bool reindex_relation(Oid relid, int flags, ReindexParams *params);
 
 extern bool ReindexIsProcessingHeap(Oid heapOid);
 extern bool ReindexIsProcessingIndex(Oid indexOid);
diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h
index 401a0827ae..1245d944dc 100644
--- a/src/include/commands/cluster.h
+++ b/src/include/commands/cluster.h
@@ -18,16 +18,17 @@
 #include "storage/lock.h"
 #include "utils/relcache.h"
 
-
 /* options for CLUSTER */
-typedef enum ClusterOption
+#define CLUOPT_RECHECK 0x01		/* recheck relation state */
+#define CLUOPT_VERBOSE 0x02		/* print progress info */
+
+typedef struct ClusterParams
 {
-	CLUOPT_RECHECK = 1 << 0,	/* recheck relation state */
-	CLUOPT_VERBOSE = 1 << 1		/* print progress info */
-} ClusterOption;
+	bits32		options;			/* bitmask of CLUOPT_* */
+} ClusterParams;
 
 extern void cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel);
-extern void cluster_rel(Oid tableOid, Oid indexOid, int options);
+extern void cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params);
 extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid,
 	   bool recheck, LOCKMODE lockmode);
 extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal);
diff --git a/src/include/commands/defrem.h b/src/include/commands/defrem.h
index e2d2a77ca4..91281d6f8e 100644
--- a/src/include/commands/defrem.h
+++ b/src/include/commands/defrem.h
@@ -14,6 +14,7 @@
 #ifndef DEFREM_H
 #define DEFREM_H
 
+#include "catalog/index.h"
 #include "catalog/objectaddress.h"
 #include "nodes/params.h"
 #include "parser/parse_node.h"
@@ -34,11 +35,7 @@ extern ObjectAddress DefineIndex(Oid relationId,
  bool check_not_in_use,
  bool skip_build,
  bool quiet);
-extern int	ReindexParseOptions(ParseState *pstate, ReindexStmt *stmt);
-extern void ReindexIndex(RangeVar *indexRelation, int options, bool isTopLevel);
-extern Oid	ReindexTable(RangeVar *relation, int options, bool isTopLevel);
-extern void ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind,
-  int optio

Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-12-30 Thread Alexey Kondratov

On 2020-12-30 17:59, Bharath Rupireddy wrote:

On Wed, Dec 30, 2020 at 5:20 PM Alexey Kondratov
 wrote:


On 2020-12-30 09:10, Bharath Rupireddy wrote:
I still have some doubts that it is worth allowing
postgres_fdw_disconnect() to be called inside an explicit transaction
block, since it adds a lot of things to care about and check for. But
otherwise the current logic looks solid.

+errdetail("Such connections get closed either in the next use or
at the end of the current transaction.")
+: errdetail("Such connection gets closed either in the next use or
at the end of the current transaction.")));

Does it really have a chance to get closed on the next use? If the foreign
server is dropped then the user mapping should be dropped as well (either
with CASCADE or manually), but we do need the user mapping for a local
cache lookup. That way, if I understand all the discussion up-thread
correctly, we can only close such connections at the end of the
transaction, can't we?


The next use of such a connection in the following query whose foreign
server would have been dropped fails because of the cascading that can
happen to drop the user mapping and the foreign table as well. During
the start of the next query such connection will be marked as
invalidated because xact_depth of that connection is > 1 and when the
fail happens, txn gets aborted due to which pgfdw_xact_callback gets
called and in that the connection gets closed. To make it more clear,
please have a look at the scenarios [1].



In my understanding, 'connection gets closed either in the next use' 
means that the connection will be closed the next time someone tries to 
use it, i.e. GetConnection() will be called and will close this 
connection because of its bad state. However, if the foreign server is 
dropped, GetConnection() cannot look up the connection, because it needs 
a user mapping OID as the key.


I had a look at your scenarios. IIUC, by **next use** you mean a SELECT 
attempt on a table belonging to the same foreign server, which leads to a 
transaction abort, and the connection gets closed in the xact callback. 
Sorry, maybe I am missing something, but this just confirms that such 
connections only get closed in the xact callback (taking into account 
your recently committed patch [1]), so 'next use' looks misleading.


[1] 
https://www.postgresql.org/message-id/8b2aa1aa-c638-12a8-cb56-ea0f0a5019cf%40oss.nttdata.com



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-12-30 Thread Alexey Kondratov

On 2020-12-30 09:10, Bharath Rupireddy wrote:

Hi,

I'm posting a v4-0001 patch for the new functions
postgres_fdw_get_connections() and postgres_fdw_disconnect().  In this
patch, I tried to address the review comments provided upthread.

Thoughts?



I still have some doubts that it is worth allowing 
postgres_fdw_disconnect() to be called inside an explicit transaction 
block, since it adds a lot of things to care about and check for. But 
otherwise the current logic looks solid.


+ errdetail("Such connections get closed either in the next use or 
at the end of the current transaction.")
+ : errdetail("Such connection gets closed either in the next use or 
at the end of the current transaction.")));


Does it really have a chance to get closed on the next use? If the 
foreign server is dropped then the user mapping should be dropped as well 
(either with CASCADE or manually), but we do need the user mapping for a 
local cache lookup. That way, if I understand all the discussion 
up-thread correctly, we can only close such connections at the end of the 
transaction, can't we?


+ * This function returns false if the cache doesn't exist.
+ * When the cache exists:

I think that pgindent will correct this later, but still: in this 
comment section, the following points 1) and 2) use a different mix of 
tabs and spaces.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres CompanyFrom 82b82b901e8f0ca84e1b21454a3b68f4fd8f0016 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy 
Date: Wed, 30 Dec 2020 11:07:35 +0530
Subject: [PATCH v4] postgres_fdw function to discard cached connections

This patch introduces new function postgres_fdw_disconnect() when
called with a foreign server name discards the associated
connections with the server name. When called without any argument,
discards all the existing cached connections.

This patch also adds another function postgres_fdw_get_connections()
to get the list of all cached connections by corresponding foreign
server names.
---
 contrib/postgres_fdw/connection.c | 319 +-
 .../postgres_fdw/expected/postgres_fdw.out| 415 +-
 contrib/postgres_fdw/postgres_fdw--1.0.sql|  15 +
 contrib/postgres_fdw/sql/postgres_fdw.sql | 166 +++
 doc/src/sgml/postgres-fdw.sgml|  57 ++-
 5 files changed, 953 insertions(+), 19 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index d841cec39b..496ef2bae2 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -22,6 +22,7 @@
 #include "postgres_fdw.h"
 #include "storage/fd.h"
 #include "storage/latch.h"
+#include "utils/builtins.h"
 #include "utils/datetime.h"
 #include "utils/hsearch.h"
 #include "utils/inval.h"
@@ -57,6 +58,8 @@ typedef struct ConnCacheEntry
 	bool		have_error;		/* have any subxacts aborted in this xact? */
 	bool		changing_xact_state;	/* xact state change in process */
 	bool		invalidated;	/* true if reconnect is pending */
+	/* Server oid to get the associated foreign server name. */
+	Oid			serverid;
 	uint32		server_hashvalue;	/* hash value of foreign server OID */
 	uint32		mapping_hashvalue;	/* hash value of user mapping OID */
 } ConnCacheEntry;
@@ -73,6 +76,12 @@ static unsigned int prep_stmt_number = 0;
 /* tracks whether any work is needed in callback functions */
 static bool xact_got_connection = false;
 
+/*
+ * SQL functions
+ */
+PG_FUNCTION_INFO_V1(postgres_fdw_get_connections);
+PG_FUNCTION_INFO_V1(postgres_fdw_disconnect);
+
 /* prototypes of private functions */
 static void make_new_connection(ConnCacheEntry *entry, UserMapping *user);
 static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user);
@@ -94,6 +103,8 @@ static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query,
 static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime,
 	 PGresult **result);
 static bool UserMappingPasswordRequired(UserMapping *user);
+static bool disconnect_cached_connections(uint32 hashvalue, bool all,
+		  bool *is_in_use);
 
 /*
  * Get a PGconn which can be used to execute queries on the remote PostgreSQL
@@ -273,6 +284,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user)
 	entry->have_error = false;
 	entry->changing_xact_state = false;
 	entry->invalidated = false;
+	entry->serverid = server->serverid;
 	entry->server_hashvalue =
 		GetSysCacheHashValue1(FOREIGNSERVEROID,
 			  ObjectIdGetDatum(server->serverid));
@@ -789,6 +801,13 @@ pgfdw_xact_callback(XactEvent event, void *arg)
 	if (!xact_got_connection)
 		return;
 
+	/*
+	 * Quick exit if the cache has been destroyed in
+	 * disconnect_cached_connections.
+	 */
+	if (!ConnectionHash)
+		return;
+
 	/*
 	 * Scan all connection cache entries to find open remote transactions, and
 	 * close them.
@@ -98

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-23 Thread Alexey Kondratov

On 2020-12-23 10:38, Michael Paquier wrote:

On Tue, Dec 22, 2020 at 03:15:37PM -0600, Justin Pryzby wrote:

Now, I really think utility.c ought to pass in a pointer to a local
ReindexOptions variable to avoid all the memory context, which is
unnecessary and prone to error.


Yeah, it sounds right to me to just bite the bullet and do this
refactoring, limiting the manipulations of the options as much as
possible across contexts.  So +1 from me to merge 0001 and 0002
together.

I have adjusted a couple of comments and simplified a bit more the
code in utility.c.  I think that this is commitable, but let's wait
more than a couple of days for Alvaro and Peter first.  This is a
period of vacations for a lot of people, and there is no point to
apply something that would need more work at the end.  Using hexas for
the flags with bitmasks is the right conclusion IMO, but we are not
alone.



After eyeballing the patch I can add that we should alter this comment:

int options;/* bitmask of VacuumOption */

as you are going to replace VacuumOption with VACOPT_* defs. So this 
should say:


/* bitmask of VACOPT_* */

Also, I have found the naming to be a bit inconsistent:

 * we have ReindexOptions, but VacuumParams
 * and ReindexOptions->flags, but VacuumParams->options

And one last thing: you have used bits32 for Cluster/ReindexOptions, but 
left VacuumParams->options as an int. Maybe we should also change it to 
bits32 for consistency?



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-23 Thread Alexey Kondratov

On 2020-12-23 08:22, Justin Pryzby wrote:

On Tue, Dec 22, 2020 at 03:22:19PM -0800, Zhihong Yu wrote:

Justin:
For reindex_index() :

+   if (options->tablespaceOid == MyDatabaseTableSpace)
+   options->tablespaceOid = InvalidOid;
...
+   oldTablespaceOid = iRel->rd_rel->reltablespace;
+   if (set_tablespace &&
+   (options->tablespaceOid != oldTablespaceOid ||
+   (options->tablespaceOid == MyDatabaseTableSpace &&
OidIsValid(oldTablespaceOid

I wonder why the options->tablespaceOid == MyDatabaseTableSpace clause
appears again in the second if statement.
Since the first if statement would assign InvalidOid
to options->tablespaceOid when the first if condition is satisfied.


Good question.  Alexey mentioned on Sept 23 that he added the first stanza
to avoid storing the DB's tablespace OID (rather than InvalidOid).

I think the 2nd half of the "or" is unnecessary since the first stanza
already sets options->tablespaceOid = InvalidOid.
If requesting to move to the DB's default tablespace, it'll now hit the
first part of the OR:

+   (options->tablespaceOid != oldTablespaceOid ||

Without the first stanza setting, it would've hit the 2nd condition:

+   (options->tablespaceOid == MyDatabaseTableSpace && OidIsValid(oldTablespaceOid

which means: "user requested to move a table to the DB's default tblspace,
and it was previously on a nondefault space".

So I think we can drop the 2nd half of the OR.  Thanks for noticing.


Yes, I had not noticed that we would have already assigned 
tablespaceOid to InvalidOid in this case. Back in v7 we were doing this 
assignment a bit later, so it could make sense then, but now it seems 
redundant. For some reason I mixed up these refactorings, which are 
separated by a dozen versions...
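
So, with the first stanza normalizing the value, the second half of the 
OR can indeed be dropped; roughly this (a sketch using the variable names 
from the quoted hunk, not the exact patch text):

	/* Never store the database's default tablespace OID explicitly */
	if (options->tablespaceOid == MyDatabaseTableSpace)
		options->tablespaceOid = InvalidOid;

	oldTablespaceOid = iRel->rd_rel->reltablespace;
	if (set_tablespace &&
		options->tablespaceOid != oldTablespaceOid)
	{
		/* ... actually move the index to the new tablespace ... */
	}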



Thanks
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-12-17 Thread Alexey Kondratov

On 2020-12-17 18:02, Bharath Rupireddy wrote:
On Thu, Dec 17, 2020 at 5:38 PM Hou, Zhijie  
wrote:

I took a look into the patch, and have a little issue:

+bool disconnect_cached_connections(uint32 hashvalue, bool all)
+   if (all)
+   {
+   hash_destroy(ConnectionHash);
+   ConnectionHash = NULL;
+   result = true;
+   }

If disconnect_cached_connections is called to disconnect all the
connections, should we reset the 'xact_got_connection' flag?


I think we must allow postgres_fdw_disconnect() to disconnect the
particular/all connections only when the corresponding entries have no
open txns or connections are not being used in that txn, otherwise
not. We may end up closing/disconnecting the connection that's still
being in use because entry->xact_dept can even go more than 1 for sub
txns. See use case [1].

+if ((all || entry->server_hashvalue == hashvalue) &&
entry->xact_depth == 0 &&
+entry->conn)
+{
+disconnect_pg_server(entry);
+result = true;
+}

Thoughts?



I think that you are right. Actually, I was thinking about a much 
simpler solution to this problem --- just restrict 
postgres_fdw_disconnect() to run only *outside* of an explicit 
transaction block. This should protect everyone from having their 
underlying connections closed, though it is a bit more restrictive than 
what you propose.

I also just thought about what happens if we start closing fdw 
connections in an open transaction block:

1) We close a couple of them.
2) We find one with xact_depth > 0 and error out.
3) We end up in a mixed state: some of the connections were closed, some 
were not, and this cannot be rolled back with the transaction.

In other words, I have some doubts about allowing a function that is 
non-transactional by nature to be called inside a transaction block.
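
For what it's worth, the restriction I have in mind is nearly a 
one-liner; a rough sketch (the error wording is made up, and the check is 
not part of the posted patch):

	Datum
	postgres_fdw_disconnect(PG_FUNCTION_ARGS)
	{
		/* Refuse to run inside an explicit transaction block */
		if (IsTransactionBlock())
			ereport(ERROR,
					(errcode(ERRCODE_ACTIVE_SQL_TRANSACTION),
					 errmsg("postgres_fdw_disconnect() cannot be executed inside a transaction block")));

		/* ... disconnect cached connections as the patch does today ... */

		PG_RETURN_BOOL(true);
	}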



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-15 Thread Alexey Kondratov

On 2020-12-15 03:14, Justin Pryzby wrote:

On Sat, Dec 12, 2020 at 01:45:26PM -0600, Justin Pryzby wrote:

On Sat, Dec 12, 2020 at 09:20:35AM +0100, Peter Eisentraut wrote:
> On 2020-12-11 21:27, Alvaro Herrera wrote:
> > By the way--  What did you think of the idea of explictly marking the
> > types used for bitmasks using types bits32 and friends, instead of plain
> > int, which is harder to spot?
>
> If we want to make it clearer, why not turn the thing into a struct, as in
> the attached patch, and avoid the bit fiddling altogether.

I like this.
It's a lot like what I wrote as [PATCH v31 1/5] ExecReindex and ReindexParams
In my v31 patch, I moved ReindexOptions to a private structure in indexcmds.c,
with an "int options" bitmask which is passed to reindex_index() et al.  Your
patch keeps/puts ReindexOptions index.h, so it also applies to reindex_index,
which I think is good.

So I've rebased this branch on your patch.

Some thoughts:

 - what about removing the REINDEXOPT_* prefix ?
 - You created local vars with initialization like "={}". But I thought it's
   needed to include at least one struct member like "={false}", or else
   they're not guaranteed to be zerod ?
 - You passed the structure across function calls.  The usual convention is to
   pass a pointer.


I think maybe Michael missed this message (?)
I had applied some changes on top of Peter's patch.

I squished those commits now, and also handled ClusterOption and VacuumOption
in the same style.

Some more thoughts:
 - should the structures be named in plural ? "ReindexOptions" etc.  Since they
   define *all* the options, not just a single bit.
 - For vacuum, do we even need a separate structure, or should the members be
   directly within VacuumParams ?  It's a bit odd to write
   params.options.verbose.  Especially since there's also ternary options which
   are directly within params.


This is exactly what I thought after looking at Peter's patch. Writing 
'params.options.some_option' looks just too verbose. I even started 
moving all vacuum options into VacuumParams on my own and was in the 
middle of the process when I realized that there are some places that do 
not fit well, like:


if (!vacuum_is_relation_owner(RelationGetRelid(onerel),
onerel->rd_rel,
params->options & VACOPT_ANALYZE))

Here we do not want to set the option permanently, but rather to trigger 
an additional code path in vacuum_is_relation_owner(), IIUC. With a 
unified VacuumParams we would have to do:


bool opt_analyze = params->analyze;
...
params->analyze = true;
if (!vacuum_is_relation_owner(RelationGetRelid(onerel),
onerel->rd_rel, params))
...
params->analyze = opt_analyze;

to achieve the same, which does not look good to me, or else change the 
whole interface. I have found at least one other place like that so far 
--- vacuum_open_relation() in analyze_rel().


Not sure how we could better cope with such logic.

 - Then, for cluster, I think it should be called ClusterParams, and 
eventually

   include the tablespaceOid, like what we're doing for Reindex.

I am awaiting feedback on these before going further since I've done 
too much

rebasing with these ideas going back and forth and back.


Yes, we have drifted a bit from my original patch set in this thread 
with all the refactoring. However, all the subsequent patches depend 
heavily on this interface, so let us decide on the proposed refactoring 
first.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-12-04 Thread Alexey Kondratov

On 2020-12-04 04:25, Justin Pryzby wrote:

On Thu, Dec 03, 2020 at 04:12:53PM +0900, Michael Paquier wrote:

> +typedef struct ReindexParams {
> +  bool concurrently;
> +  bool verbose;
> +  bool missingok;
> +
> +  int options;/* bitmask of lowlevel REINDEXOPT_* */
> +} ReindexParams;
> +

By moving everything into indexcmds.c, keeping ReindexParams within it
makes sense to me.  Now, there is no need for the three booleans
because options stores the same information, no?


 I liked the bools, but dropped them so the patch is smaller.



I had a look at 0001 and it looks mostly fine to me, except for a 
strange mixture of tabs and spaces in ExecReindex(). I also have a couple 
of substantive comments:


-   options =
-   (verbose ? REINDEXOPT_VERBOSE : 0) |
-   (concurrently ? REINDEXOPT_CONCURRENTLY : 0);
+   if (verbose)
+   params.options |= REINDEXOPT_VERBOSE;

Why do we need this intermediate 'verbose' variable here? We only use it 
once to set a bit in the mask. Maybe we can do it like this:


params.options |= defGetBoolean(opt) ?
REINDEXOPT_VERBOSE : 0;

See also the attached txt file with the diff (I wonder whether I can 
trick cfbot this way so that it does not try to apply the diff).


+   int options;/* bitmask of lowlevel REINDEXOPT_* */

I would prefer the comment to say '/* bitmask of ReindexOption */', as 
in VacuumOptions, since citing the exact enum type makes it easier to 
navigate the source code.




Regarding the REINDEX patch, I think this comment is misleading:

|+* Even if table was moved to new tablespace,
normally toast cannot move.
| */
|+   Oid toasttablespaceOid = allowSystemTableMods ?
tablespaceOid : InvalidOid;
|result |= reindex_relation(toast_relid, flags,

I think it ought to say "Even if a table's indexes were moved to a new
tablespace, its toast table's index is not normally moved"
Right ?



Yes, I think so; we are dealing only with changing the index tablespace 
here. Thanks for noticing.




Also, I don't know whether we should check for GLOBALTABLESPACE_OID after
calling get_tablespace_oid(), or in the lowlevel routines.  Note that
reindex_relation is called during cluster/vacuum, and in the later
patches, I moved the test from cluster() and ExecVacuum() to
rebuild_relation().




IIRC, I wanted to do the GLOBALTABLESPACE_OID check as early as possible 
(just after getting the OID), since it does not make sense to proceed 
further if the tablespace is set to that value. So initially there were a 
lot of duplicated GLOBALTABLESPACE_OID checks, since there were many 
reindex entry points (index, relation, concurrently, etc.). Now we are 
going to have ExecReindex(), so there are far fewer entry points, and in 
my opinion it is fine to keep this validation right after 
get_tablespace_oid().

However, this is mostly a sanity check. I can hardly imagine many users 
constantly trying to move indexes to the global tablespace, so it is also 
OK to push this check deeper into the guts.
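
Concretely, the early validation I have in mind is only a few lines 
right after resolving the tablespace name in ExecReindex() (a sketch; the 
error wording is only illustrative):

	params.tablespaceOid = tablespace ?
		get_tablespace_oid(tablespace, false) : InvalidOid;

	/* Reject the global tablespace as early as possible */
	if (OidIsValid(params.tablespaceOid) &&
		params.tablespaceOid == GLOBALTABLESPACE_OID)
		ereport(ERROR,
				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
				 errmsg("cannot move non-shared relation to tablespace \"%s\"",
						tablespace)));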



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Companydiff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index a27f8f9d83..0b1884815c 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -2472,8 +2472,6 @@ void
 ExecReindex(ParseState *pstate, ReindexStmt *stmt, bool isTopLevel)
 {
 	ReindexParams		params = {0};
-	bool		verbose = false,
-concurrently = false;
 	ListCell   	*lc;
 	char	*tablespace = NULL;
 
@@ -2483,9 +2481,11 @@ ExecReindex(ParseState *pstate, ReindexStmt *stmt, bool isTopLevel)
 		DefElem*opt = (DefElem *) lfirst(lc);
 
 		if (strcmp(opt->defname, "verbose") == 0)
-			verbose = defGetBoolean(opt);
+			params.options |= defGetBoolean(opt) ?
+REINDEXOPT_VERBOSE : 0;
 		else if (strcmp(opt->defname, "concurrently") == 0)
-			concurrently = defGetBoolean(opt);
+			params.options |= defGetBoolean(opt) ?
+REINDEXOPT_CONCURRENTLY : 0;
 		else if (strcmp(opt->defname, "tablespace") == 0)
 			tablespace = defGetString(opt);
 		else
@@ -2496,18 +2496,12 @@ ExecReindex(ParseState *pstate, ReindexStmt *stmt, bool isTopLevel)
 	 parser_errposition(pstate, opt->location)));
 	}
 
-	if (verbose)
-		params.options |= REINDEXOPT_VERBOSE;
+	params.tablespaceOid = tablespace ?
+		get_tablespace_oid(tablespace, false) : InvalidOid;
 
-	if (concurrently)
-	{
-		params.options |= REINDEXOPT_CONCURRENTLY;
+	if (params.options & REINDEXOPT_CONCURRENTLY)
 		PreventInTransactionBlock(isTopLevel,
   "REINDEX CONCURRENTLY");
-	}
-
-	params.tablespaceOid = tablespace ?
-		get_tablespace_oid(tablespace, false) : InvalidOid;
 
 	switch (stmt->kind)
 	{


Re: Notes on physical replica failover with logical publisher or subscriber

2020-11-30 Thread Alexey Kondratov

Hi Craig,

On 2020-11-30 06:59, Craig Ringer wrote:


https://wiki.postgresql.org/wiki/Logical_replication_and_physical_standby_failover



Thank you for sharing these notes. I have not dealt much with 
physical/logical replication interoperability, so most of these problems 
were new to me.


One point from the wiki page, which seems clear enough to me:

```
Logical slots can fill pg_wal and can't benefit from archiving. Teach 
the logical decoding page read callback how to use the restore_command 
to retrieve WAL segs temporarily if they're not found in pg_wal...

```

It does not look like a big deal to teach the logical decoding process 
to use restore_command, but I have some doubts about how everything will 
perform once we start getting WAL from the archive for decoding purposes. 
If we have started using restore_command, then the subscriber has lagged 
long enough to exceed max_slot_wal_keep_size. Taking into account that 
fetching WAL files from the archive has additional overhead and that the 
primary keeps generating (and archiving) new segments, there is a 
possibility that the primary ends up doing this double duty forever --- 
archiving a WAL file first and fetching it back for decoding when 
requested.


Another problem is that there may be several active decoders, IIRC, so 
they had better communicate in order to avoid fetching the same segment 
twice.




I tried to address many of these issues with failover slots, but I am
not trying to beat that dead horse now. I know that at least some
people here are of the opinion that effort shouldn't go into
logical/physical replication interoperation anyway - that we should
instead address the remaining limitations in logical replication so
that it can provide complete HA capabilities without use of physical
replication. So for now I'm just trying to save others who go looking
into these issues some time and warn them about some of the less
obvious booby-traps.



Another point to add regarding logical replication's ability to build a 
logical-only HA system --- a logical equivalent of pg_rewind. At least I 
have not noticed anything about it after a brief reading of the wiki 
page. IIUC, there is currently no way to quickly return an ex-primary 
(ex-logical publisher) to the HA cluster without doing a pg_basebackup, 
is there? It seems we have the same problem here as with physical 
replication --- the ex-primary may accept some transactions after the 
promotion of the new primary, so their histories diverge and the old 
primary has to be rewound before being returned as a standby 
(subscriber).



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-11-30 Thread Alexey Kondratov

On 2020-11-30 14:33, Michael Paquier wrote:

On Tue, Nov 24, 2020 at 09:31:23AM -0600, Justin Pryzby wrote:

@cfbot: rebased


Catching up with the activity here, I can see four different things in
the patch set attached:
1) Refactoring of the grammar of CLUSTER, VACUUM, ANALYZE and REINDEX
to support values in parameters.
2) Tablespace change for REINDEX.
3) Tablespace change for VACUUM FULL/CLUSTER.
4) Tablespace change for indexes with VACUUM FULL/CLUSTER.

I am not sure yet about the last three points, so let's begin with 1)
that is dealt with in 0001 and 0002.  I have spent some time on 0001,
renaming the rule names to be less generic than "common", and applied
it.  0002 looks to be in rather good shape, still there are a few
things that have caught my eyes.  I'll look at that more closely
tomorrow.



Thanks. I have rebased the remaining patches on top of 873ea9ee to use 
'utility_option_list' instead of 'common_option_list'.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres CompanyFrom ac3b77aec26a40016784ada9dab8b9059f424fa4 Mon Sep 17 00:00:00 2001
From: Justin Pryzby 
Date: Tue, 31 Mar 2020 20:35:41 -0500
Subject: [PATCH v31 5/5] Implement vacuum full/cluster (INDEX_TABLESPACE
 )

---
 doc/src/sgml/ref/cluster.sgml | 12 -
 doc/src/sgml/ref/vacuum.sgml  | 12 -
 src/backend/commands/cluster.c| 64 ++-
 src/backend/commands/matview.c|  3 +-
 src/backend/commands/tablecmds.c  |  2 +-
 src/backend/commands/vacuum.c | 46 +++-
 src/backend/postmaster/autovacuum.c   |  1 +
 src/include/commands/cluster.h|  6 ++-
 src/include/commands/vacuum.h |  5 +-
 src/test/regress/input/tablespace.source  | 13 +
 src/test/regress/output/tablespace.source | 20 +++
 11 files changed, 123 insertions(+), 61 deletions(-)

diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml
index cbfc0582be..6781e3a025 100644
--- a/doc/src/sgml/ref/cluster.sgml
+++ b/doc/src/sgml/ref/cluster.sgml
@@ -28,6 +28,7 @@ CLUSTER [VERBOSE] [ ( option [, ...
 
 VERBOSE [ boolean ]
 TABLESPACE new_tablespace
+INDEX_TABLESPACE new_tablespace
 
 
  
@@ -105,6 +106,15 @@ CLUSTER [VERBOSE] [ ( option [, ...
 

 
+   
+INDEX_TABLESPACE
+
+ 
+  Specifies that the table's indexes will be rebuilt on a new tablespace.
+ 
+
+   
+

 table_name
 
@@ -141,7 +151,7 @@ CLUSTER [VERBOSE] [ ( option [, ...
 new_tablespace
 
  
-  The tablespace where the table will be rebuilt.
+  The tablespace where the table or its indexes will be rebuilt.
  
 

diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml
index 5261a7c727..28cab119b6 100644
--- a/doc/src/sgml/ref/vacuum.sgml
+++ b/doc/src/sgml/ref/vacuum.sgml
@@ -36,6 +36,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ boolean ]
 PARALLEL integer
 TABLESPACE new_tablespace
+INDEX_TABLESPACE new_tablespace
 
 and table_and_columns is:
 
@@ -265,6 +266,15 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ boolean
 
@@ -314,7 +324,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ new_tablespace
 
  
-  The tablespace where the relation will be rebuilt.
+  The tablespace where the relation or its indexes will be rebuilt.
  
 

diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c
index b289a76d58..0f9f09a15a 100644
--- a/src/backend/commands/cluster.c
+++ b/src/backend/commands/cluster.c
@@ -71,7 +71,7 @@ typedef struct
 
 
 static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose,
-			 Oid NewTableSpaceOid);
+			 Oid NewTableSpaceOid, Oid NewIdxTableSpaceOid);
 static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex,
 			bool verbose, bool *pSwapToastByContent,
 			TransactionId *pFreezeXid, MultiXactId *pCutoffMulti);
@@ -107,9 +107,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
 {
 	ListCell	*lc;
 	int			options = 0;
-	/* Name and Oid of tablespace to use for clustered relation. */
-	char		*tablespaceName = NULL;
-	Oid			tablespaceOid = InvalidOid;
+	/* Name and Oid of tablespaces to use for clustered relations. */
+	char		*tablespaceName = NULL,
+*idxtablespaceName = NULL;
+	Oid			tablespaceOid,
+idxtablespaceOid;
 
 	/* Parse list of generic parameters not handled by the parser */
 	foreach(lc, stmt->params)
@@ -123,6 +125,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel)
 options &= ~CLUOPT_VERBOSE;
 		else if (strcmp(opt->defname, "tablespace") == 0)
 			tablespaceName = defGetString(opt);
+		else if (strcmp(opt->defname, "index_tablespace") == 0)
+			idxtablespaceName = defGetString(opt);
 		else
 			ereport(ERROR,
 	(errcode(ERRCODE_SYNTAX_ERROR),
@@ -1

Re: Printing LSN made easy

2020-11-27 Thread Alexey Kondratov

Hi,

On 2020-11-27 13:40, Ashutosh Bapat wrote:


Off list Peter Eisentraut pointed out that we can not use these macros
in elog/ereport since it creates problems for translations. He
suggested adding functions which return strings and use %s when doing
so.

The patch has two functions pg_lsn_out_internal() which takes an LSN
as input and returns a palloc'ed string containing the string
representation of LSN. This may not be suitable in performance
critical paths and also may leak memory if not freed. So there's
another function pg_lsn_out_buffer() which takes LSN and a char array
as input, fills the char array with the string representation and
returns the pointer to the char array. This allows the function to be
used as an argument in printf/elog etc. Macro MAXPG_LSNLEN has been
extern'elized for this purpose.



If the usage of the macros in elog/ereport can cause problems for 
translation, then even with this patch life does not get significantly 
simpler. For example, instead of just doing something like:


 elog(WARNING,
- "xlog min recovery request %X/%X is past current point 
%X/%X",

- (uint32) (lsn >> 32), (uint32) lsn,
- (uint32) (newMinRecoveryPoint >> 32),
- (uint32) newMinRecoveryPoint);
+ "xlog min recovery request " LSN_FORMAT " is past 
current point " LSN_FORMAT,

+ LSN_FORMAT_ARG(lsn),
+ LSN_FORMAT_ARG(newMinRecoveryPoint));

we have to either declare two additional local buffers, which is 
verbose; or use pg_lsn_out_internal() and rely on memory contexts (or do 
pfree() manually, which is verbose again) to prevent memory leaks.
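
For comparison, with the buffer-based helper the same message would look 
roughly like this (a sketch; I am assuming pg_lsn_out_buffer(lsn, buf) 
returns the passed buffer, as the commit message describes):

	char	minbuf[MAXPG_LSNLEN + 1];
	char	newbuf[MAXPG_LSNLEN + 1];

	elog(WARNING,
		 "xlog min recovery request %s is past current point %s",
		 pg_lsn_out_buffer(lsn, minbuf),
		 pg_lsn_out_buffer(newMinRecoveryPoint, newbuf));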




Off list Craig Ringer suggested introducing a new format specifier
similar to %m for LSN but I did not get time to take a look at the
relevant code. AFAIU it's available only to elog/ereport, so may not
be useful generally. But teaching printf variants about the new format
would be the best solution. However, I didn't find any way to do that.



It seems that this topic has been extensively discussed off-list, but 
still a strong +1 for the patch from me. I have always wanted LSN 
printing to be more concise.

I have just tried the new printing utilities in a couple of new places 
and they look good to me.


+char *
+pg_lsn_out_internal(XLogRecPtr lsn)
+{
+   charbuf[MAXPG_LSNLEN + 1];
+
+   snprintf(buf, sizeof(buf), LSN_FORMAT, LSN_FORMAT_ARG(lsn));
+
+   return pstrdup(buf);
+}

Would it be a bit more straightforward to palloc() the buffer up front 
and just return that pointer, instead of filling a stack buffer and then 
doing pstrdup()?
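
I.e., something like this (just a sketch of the suggestion):

	char *
	pg_lsn_out_internal(XLogRecPtr lsn)
	{
		char	   *buf = palloc(MAXPG_LSNLEN + 1);

		snprintf(buf, MAXPG_LSNLEN + 1, LSN_FORMAT, LSN_FORMAT_ARG(lsn));

		return buf;
	}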



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres CompanyFrom 698e481f5f55b967b5c60dba4bc577f8baa20ff4 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat 
Date: Fri, 16 Oct 2020 17:09:29 +0530
Subject: [PATCH] Make it easy to print LSN

The commit introduces following macros and functions to make it easy to
use LSNs in printf variants, elog, ereport and appendStringInfo
variants.

LSN_FORMAT - macro representing the format in which LSN is printed

LSN_FORMAT_ARG - macro to pass LSN as an argument to the above format

pg_lsn_out_internal - a function which returns palloc'ed char array
containing string representation of given LSN.

pg_lsn_out_buffer - similar to above but accepts and returns a char
array of size (MAXPG_LSNLEN + 1)

The commit also has some example usages of these.

Ashutosh Bapat
---
 contrib/pageinspect/rawpage.c|  3 +-
 src/backend/access/rmgrdesc/replorigindesc.c |  5 +-
 src/backend/access/rmgrdesc/xlogdesc.c   |  3 +-
 src/backend/access/transam/xlog.c|  8 ++--
 src/backend/utils/adt/pg_lsn.c   | 49 ++--
 src/include/access/xlogdefs.h|  7 +++
 src/include/utils/pg_lsn.h   |  3 ++
 7 files changed, 55 insertions(+), 23 deletions(-)

diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c
index c0181506a5..2cd055a5f0 100644
--- a/contrib/pageinspect/rawpage.c
+++ b/contrib/pageinspect/rawpage.c
@@ -261,8 +261,7 @@ page_header(PG_FUNCTION_ARGS)
 	{
 		char		lsnchar[64];
 
-		snprintf(lsnchar, sizeof(lsnchar), "%X/%X",
- (uint32) (lsn >> 32), (uint32) lsn);
+		snprintf(lsnchar, sizeof(lsnchar), LSN_FORMAT, LSN_FORMAT_ARG(lsn));
 		values[0] = CStringGetTextDatum(lsnchar);
 	}
 	else
diff --git a/src/backend/access/rmgrdesc/replorigindesc.c b/src/backend/access/rmgrdesc/replorigindesc.c
index 19e14f910b..a3f49b5750 100644
--- a/src/backend/access/rmgrdesc/replorigindesc.c
+++ b/src/backend/access/rmgrdesc/replorigindesc.c
@@ -29,10 +29,9 @@ replorigin_desc(StringInfo buf, XLogReaderState *record)
 
 xlrec = (xl_replorigin_set *) rec;
 
-appendStringInfo(buf, "set %u; lsn %X/%X; force: %d",
+appendStringInfo(buf, "set %u; lsn " LSN_FORMAT "; force: %d",
  xlrec->

Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-11-25 Thread Alexey Kondratov

On 2020-11-25 06:17, Bharath Rupireddy wrote:

On Wed, Nov 25, 2020 at 7:24 AM Craig Ringer
 wrote:


A quick thought here.

Would it make sense to add a hook in the DISCARD ALL implementation 
that postgres_fdw can register for?


There's precedent here, since DISCARD ALL already has the same effect 
as SELECT pg_advisory_unlock_all(); amongst other things.




IIUC, then it is like a core(server) function doing some work for the
postgres_fdw module. Earlier in the discussion, one point raised was
that it's better not to have core handling something related to
postgres_fdw. This is the reason we have come up with postgres_fdw
specific function and a GUC, which get defined when extension is
created. Similarly, dblink also has it's own bunch of functions one
among them is dblink_disconnect().



If I have understood Craig correctly, his point is that we already have 
a DISCARD ALL statement, which is processed by DiscardAll() and releases 
the internal resources known from the core perspective. That way, we 
could introduce a general-purpose hook, DiscardAll_hook(), which 
postgres_fdw could use to clean up its own resources (connections, in our 
context) if needed. In other words, it would not be a core function doing 
some work for the postgres_fdw module, but rather a callback/hook that 
postgres_fdw is able to register in order to do some additional work.

It could be a good replacement for 0001, but wouldn't it be overkill to 
drop all local caches along with the remote connections? I mean, it would 
be a nice hook to have from the extensibility perspective, but 
postgres_fdw_disconnect() still makes sense, since it does a very narrow 
and specific job.
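
To make the idea concrete, here is a rough sketch of such a hook --- 
entirely hypothetical, since neither the hook nor these names exist 
today:

	/* core side, e.g. in commands/discard.c (hypothetical) */
	typedef void (*DiscardAll_hook_type) (bool isTopLevel);
	DiscardAll_hook_type DiscardAll_hook = NULL;

	static void
	DiscardAll(bool isTopLevel)
	{
		/* ... existing cleanup: advisory locks, plans, temp tables, ... */

		if (DiscardAll_hook)
			DiscardAll_hook(isTopLevel);
	}

	/* postgres_fdw side (hypothetical) */
	static void
	postgres_fdw_discard_all(bool isTopLevel)
	{
		/* close every cached remote connection, roughly what 0001 does */
	}

	void
	_PG_init(void)
	{
		DiscardAll_hook = postgres_fdw_discard_all;
	}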



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-11-24 Thread Alexey Kondratov

On 2020-11-24 06:52, Bharath Rupireddy wrote:

Thanks for the review comments.

On Mon, Nov 23, 2020 at 9:57 PM Alexey Kondratov
 wrote:


> v1-0001-postgres_fdw-function-to-discard-cached-connections.patch

This patch looks pretty straightforward for me, but there are some
things to be addressed IMO:

+   server = GetForeignServerByName(servername, true);
+
+   if (server != NULL)
+   {

Yes, you return a false if no server was found, but for me it worth
throwing an error in this case as, for example, dblink does in the
dblink_disconnect().



dblink_disconnect() "Returns status, which is always OK (since any
error causes the function to throw an error instead of returning)."
This behaviour doesn't seem okay to me.

Since we throw true/false, I would prefer to throw a warning(with a
reason) while returning false over an error.



I thought about something a bit more sophisticated:

1) Return 'true' if there were open connections and we successfully 
closed them.
2) Return 'false' in the no-op case, i.e. there were no open connections.
3) Raise an error if something went wrong. The non-existent server case 
belongs to this last category, IMO.

That looks like semantically correct behavior to me, but let us wait for 
other opinions.
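
In rough C, the semantics I have in mind would be something like this 
sketch (keeping the patch's three-argument 
disconnect_cached_connections(); the error wording is only illustrative):

	server = GetForeignServerByName(servername, true);

	if (server == NULL)
		ereport(ERROR,
				(errcode(ERRCODE_UNDEFINED_OBJECT),
				 errmsg("foreign server \"%s\" does not exist", servername)));

	hashvalue = GetSysCacheHashValue1(FOREIGNSERVEROID,
									  ObjectIdGetDatum(server->serverid));

	/* true if at least one open connection was closed, false if none existed */
	PG_RETURN_BOOL(disconnect_cached_connections(FOREIGNSERVEROID,
												 hashvalue, false));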






+ result = disconnect_cached_connections(FOREIGNSERVEROID,
+hashvalue,
+false);

+   if (all || (!all && cacheid == FOREIGNSERVEROID &&
+   entry->server_hashvalue == hashvalue))
+   {
+   if (entry->conn != NULL &&
+   !all && cacheid == FOREIGNSERVEROID &&
+   entry->server_hashvalue == hashvalue)

These conditions look bulky to me. First, you pass FOREIGNSERVEROID to
disconnect_cached_connections(), but it actually just duplicates the
'all' flag: when it is FOREIGNSERVEROID, 'all == false'; when it is -1,
'all == true'. And that is all --- there are only two calls of
disconnect_cached_connections(). So it seems we should keep only the
'all' flag, at least for now, shouldn't we?



I added cachid as an argument to disconnect_cached_connections() for
reusability. Say, someone wants to use it with a user mapping then
they can pass cacheid USERMAPPINGOID, hash value of user mapping. The
cacheid == USERMAPPINGOID && entry->mapping_hashvalue == hashvalue can
be added to disconnect_cached_connections().



Yeah, I get your point and the motivation for adding this argument, but 
how would we use it? To disconnect all connections belonging to some 
specific user mapping? But any user mapping is hard-bound to some foreign 
server, AFAIK, so we could pass a serverid-based hash value in that case.

In the case of pgfdw_inval_callback() this argument makes sense, since 
syscache callbacks work that way, but here I can hardly imagine a case 
where we would use it. Thus, it still looks like a premature complication 
to me, since we have no plans to use it, do we? Anyway, everything seems 
to be working fine, so it is up to you whether to keep this additional 
argument.




v1-0003-postgres_fdw-server-level-option-keep_connection.patch
This patch adds a new server level option, keep_connection, default
being on, when set to off, the local session doesn't cache the
connections associated with the foreign server.



This patch looks good to me, except one note:

(entry->used_in_current_xact &&
-   !keep_connections))
+   (!keep_connections || !entry->keep_connection)))
{

Following this logic:

1) If keep_connections == true, then the per-server keep_connection has a 
*higher* priority, so one can disable caching for a single foreign 
server.

2) But if keep_connections == false, then it works as a global switch-off 
regardless of the per-server keep_connection settings, i.e. they have a 
*lower* priority.

That looks fine to me, at least I cannot propose anything better, but 
maybe it should be documented in 0004?




v1-0004-postgres_fdw-connection-cache-discard-tests-and-documentation.patch
This patch adds the tests and documentation related to this feature.



I have not read all of the text thoroughly, but here is what caught my eye:

+   A GUC, postgres_fdw.keep_connections, default being
+   on, when set to off, the local session


I think that the GUC acronym is widely used only in the source code; the 
Postgres docs tend not to use it at all, apart from the acronyms list and 
a couple of occurrences of the phrase 'GUC parameters'. And it is never 
used in the singular form there, so I think it should rather be:


A configuration parameter, 
postgres_fdw.keep_connections, default being...


+ 
+  Note that when postgres_fdw.keep_connections 
is set to
+  off, postgres_fdw discards either the 
connections
+  that are made previously and will be used by th

Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-11-23 Thread Alexey Kondratov

Hi,

On 2020-11-23 09:48, Bharath Rupireddy wrote:


Here is how I'm making 4 separate patches:

1. new function and it's documentation.
2. GUC and it's documentation.
3. server level option and it's documentation.
4. test cases for all of the above patches.



Hi, I'm attaching the patches here. Note that, though the code changes
for this feature are small, I divided them up as separate patches to
make review easy.

v1-0001-postgres_fdw-function-to-discard-cached-connections.patch



This patch looks pretty straightforward to me, but there are some things 
to be addressed, IMO:


+   server = GetForeignServerByName(servername, true);
+
+   if (server != NULL)
+   {

Yes, you return false if no server was found, but to me it is worth 
throwing an error in this case, as dblink does in dblink_disconnect(), 
for example.


+ result = disconnect_cached_connections(FOREIGNSERVEROID,
+hashvalue,
+false);

+   if (all || (!all && cacheid == FOREIGNSERVEROID &&
+   entry->server_hashvalue == hashvalue))
+   {
+   if (entry->conn != NULL &&
+   !all && cacheid == FOREIGNSERVEROID &&
+   entry->server_hashvalue == hashvalue)

These conditions look bulky to me. First, you pass FOREIGNSERVEROID to 
disconnect_cached_connections(), but it actually just duplicates the 
'all' flag: when it is FOREIGNSERVEROID, 'all == false'; when it is -1, 
'all == true'. And that is all --- there are only two calls of 
disconnect_cached_connections(). So it seems we should keep only the 
'all' flag, at least for now, shouldn't we?


Second, I think that we should just rewrite this if statement in order 
to simplify it and make it more readable, e.g.:


if ((all || entry->server_hashvalue == hashvalue) &&
entry->conn != NULL)
{
disconnect_pg_server(entry);
result = true;
}

+   if (all)
+   {
+   hash_destroy(ConnectionHash);
+   ConnectionHash = NULL;
+   result = true;
+   }

Also, I am still not sure that it is a good idea to destroy the whole 
cache even in the 'all' case, but maybe others will have a different 
opinion.




v1-0002-postgres_fdw-add-keep_connections-GUC-to-not-cache-connections.patch



+   entry->changing_xact_state) ||
+   (entry->used_in_current_xact &&
+   !keep_connections))

I am not sure, but I think that instead of adding this additional flag 
to the ConnCacheEntry structure we can look at entry->xact_depth and use 
a local variable:


bool used_in_current_xact = entry->xact_depth > 0;

for exactly the same purpose. Since we set entry->xact_depth to zero at 
the end of the transaction, it was used in the current transaction if it 
is non-zero. It is set to 1 by begin_remote_xact(), called from 
GetConnection(), so everything seems to be fine.
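
For illustration, a minimal sketch of how the check in 
pgfdw_xact_callback() could then look (based on the condition quoted 
above; this is only a sketch, not the actual patch):

	bool		used_in_current_xact = (entry->xact_depth > 0);

	/* Reset state to show we're out of a transaction */
	entry->xact_depth = 0;

	/* Discard the connection if it is not reusable or caching is off */
	if (PQstatus(entry->conn) != CONNECTION_OK ||
		PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
		entry->changing_xact_state ||
		(used_in_current_xact && !keep_connections))
	{
		elog(DEBUG3, "discarding connection %p", entry->conn);
		disconnect_pg_server(entry);
	}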


Otherwise, both patches seem to be working as expected. I am going to 
have a look at the last two patches a bit later.


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-11-19 Thread Alexey Kondratov

On 2020-11-19 07:11, Bharath Rupireddy wrote:

On Wed, Nov 18, 2020 at 10:32 PM Alexey Kondratov
 wrote:

Thanks! I will make separate patches and post them soon.



>> Attached is a small POC patch, which implements this contrib-level
>> postgres_fdw.keep_connections GUC. What do you think?

 >

> I see two problems with your patch: 1) It just disconnects the remote
> connection at the end of txn if the GUC is set to false, but it
> doesn't remove the connection cache entry from ConnectionHash.

Yes, and this looks like a valid state for postgres_fdw; it can get
into the same state even without my patch. Next time GetConnection()
will find this cache entry, figure out that entry->conn is NULL and
establish a fresh connection. It is not clear to me right now what
benefit we would get from also clearing this cache entry, except
doing it for sanity.



By clearing the cache entry we will have 2 advantages: 1) we could
save a(small) bit of memory 2) we could allow new connections to be
cached, currently ConnectionHash can have only 8 entries. IMHO, along
with disconnecting, we can also clear off the cache entry. Thoughts?



IIUC, 8 is not a hard limit, it is just a starting size. ConnectionHash 
is not a shared-memory hash table, so dynahash can expand it on the fly, 
as follows, for example, from the comment before hash_create():


 * Note: for a shared-memory hashtable, nelem needs to be a pretty good
 * estimate, since we can't expand the table on the fly.  But an unshared
 * hashtable can be expanded on-the-fly, so it's better for nelem to be
 * on the small side and let the table grow if it's exceeded.  An overly
 * large nelem will penalize hash_seq_search speed without buying much.

Also, I am not sure that by doing just a HASH_REMOVE you will free any 
memory, since the hash table is already allocated (or expanded) to some 
size. HASH_REMOVE will only put the removed entry back on the freeList, I 
guess.
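
Just to illustrate what I mean (names are from postgres_fdw's 
connection.c, but this is only a sketch, not the proposed patch), 
removing a single entry could look like:

static void
discard_cached_connection(Oid umid)
{
	bool		found;
	ConnCacheEntry *entry;

	entry = hash_search(ConnectionHash, &umid, HASH_FIND, &found);
	if (found)
	{
		if (entry->conn != NULL)
			disconnect_pg_server(entry);

		/* Only puts the entry back on dynahash's freeList */
		hash_search(ConnectionHash, &umid, HASH_REMOVE, NULL);
	}
}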


Anyway, I can hardly imagine bloating of ConnectionHash becoming a 
problem, even in the case when one has thousands of foreign servers all 
being accessed during a single backend's life span.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-11-18 Thread Alexey Kondratov

On 2020-11-18 16:39, Bharath Rupireddy wrote:

Thanks for the interest shown!

On Wed, Nov 18, 2020 at 1:07 AM Alexey Kondratov
 wrote:


Regarding the initial issue I prefer point #3, i.e. foreign server
option. It has a couple of benefits IMO: 1) it may be set separately 
on

per foreign server basis, 2) it will live only in the postgres_fdw
contrib without any need to touch core. I would only supplement this
postgres_fdw foreign server option with a GUC, e.g.
postgres_fdw.keep_connections, so one could easily define such 
behavior

for all foreign servers at once or override server-level option by
setting this GUC on per session basis.



Below is what I have in my mind, mostly inline with yours:

a) Have a server level option (keep_connection true/false, with the
default being true), when set to false the connection that's made with
this foreign server is closed and cached entry from the connection
cache is deleted at the end of txn in pgfdw_xact_callback.
b) Have postgres_fdw level GUC postgres_fdw.keep_connections default
being true. When set to false by the user, the connections, that are
used after this, are closed and removed from the cache at the end of
respective txns. If we don't use a connection that was cached prior to
the user setting the GUC as false,  then we may not be able to clear
it. We can avoid this problem by recommending users either to set the
GUC to false right after the CREATE EXTENSION postgres_fdw; or else
use the function specified in (c).
c) Have a new function that gets defined as part of CREATE EXTENSION
postgres_fdw;, say postgres_fdw_discard_connections(), similar to
dblink's dblink_disconnect(), which discards all the remote
connections and clears connection cache. And we can also have server
name as input to postgres_fdw_discard_connections() to discard
selectively.

Thoughts? If okay with the approach, I will start working on the patch.



This approach looks solid enough from my perspective to give it a try. I 
would only split it into three separate patches for ease of further 
review.




Attached is a small POC patch, which implements this contrib-level
postgres_fdw.keep_connections GUC. What do you think?



I see two problems with your patch: 1) It just disconnects the remote
connection at the end of txn if the GUC is set to false, but it
doesn't remove the connection cache entry from ConnectionHash.


Yes, and this looks like a valid state for postgres_fdw; it can get 
into the same state even without my patch. Next time GetConnection() 
will find this cache entry, figure out that entry->conn is NULL and 
establish a fresh connection. It is not clear to me right now what 
benefit we would get from also clearing this cache entry, except 
doing it for sanity.



2) What
happens if there are some cached connections, user set the GUC to
false and not run any foreign queries or not use those connections
thereafter, so only the new connections will not be cached? Will the
existing unused connections still remain in the connection cache? See
(b) above for a solution.



Yes, they will. This could be solved with that additional disconnect 
function as you proposed in c).



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit

2020-11-17 Thread Alexey Kondratov

Hi,

On 2020-11-06 18:56, Anastasia Lubennikova wrote:

Status update for a commitfest entry.

This thread was inactive for a while and from the latest messages, I
see that the patch needs some further work.
So I move it to "Waiting on Author".

The new status of this patch is: Waiting on Author


I had a look at the initial patch and discussed options [1] for how to 
proceed with this issue. I agree with Bruce about idle_session_timeout; 
it would be a nice-to-have in-core feature on its own. However, it would 
be a cluster-wide option and would start dropping all idle connections, 
not only foreign ones. So it may not be an option in some cases, when the 
same foreign server also serves other load.


Regarding the initial issue, I prefer option #3, i.e. a foreign server 
option. It has a couple of benefits IMO: 1) it may be set separately on a 
per-foreign-server basis, 2) it will live only in the postgres_fdw 
contrib without any need to touch core. I would only supplement this 
postgres_fdw foreign server option with a GUC, e.g. 
postgres_fdw.keep_connections, so one could easily define such behavior 
for all foreign servers at once or override the server-level option by 
setting this GUC on a per-session basis.


Attached is a small POC patch, which implements this contrib-level 
postgres_fdw.keep_connections GUC. What do you think?


[1] 
https://www.postgresql.org/message-id/CALj2ACUFNydy0uo0JL9A1isHQ9pFe1Fgqa_HVanfG6F8g21nSQ%40mail.gmail.com



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index ab3226287d..64f0e96635 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -28,6 +28,8 @@
 #include "utils/memutils.h"
 #include "utils/syscache.h"
 
+#include "postgres_fdw.h"
+
 /*
  * Connection cache hash table entry
  *
@@ -948,6 +950,7 @@ pgfdw_xact_callback(XactEvent event, void *arg)
 		 */
 		if (PQstatus(entry->conn) != CONNECTION_OK ||
 			PQtransactionStatus(entry->conn) != PQTRANS_IDLE ||
+			!keep_connections ||
 			entry->changing_xact_state)
 		{
 			elog(DEBUG3, "discarding connection %p", entry->conn);
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 9c5aaacc51..4cd5f71223 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -45,6 +45,8 @@
 #include "utils/sampling.h"
 #include "utils/selfuncs.h"
 
+#include "postgres_fdw.h"
+
 PG_MODULE_MAGIC;
 
 /* Default CPU cost to start up a foreign query. */
@@ -301,6 +303,8 @@ typedef struct
 	List	   *already_used;	/* expressions already dealt with */
 } ec_member_foreign_arg;
 
+bool keep_connections = true;
+
 /*
  * SQL functions
  */
@@ -505,6 +509,15 @@ static void merge_fdw_options(PgFdwRelationInfo *fpinfo,
 			  const PgFdwRelationInfo *fpinfo_o,
 			  const PgFdwRelationInfo *fpinfo_i);
 
+void
+_PG_init(void)
+{
+	DefineCustomBoolVariable("postgres_fdw.keep_connections",
+			 "Enables postgres_fdw connection caching.",
+			 "When off postgres_fdw will close connections at the end of transaction.",
			 &keep_connections, true, PGC_USERSET, 0, NULL,
+			 NULL, NULL);
+}
 
 /*
  * Foreign-data wrapper handler function: return a struct with pointers
diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h
index eef410db39..7f1bdb96d6 100644
--- a/contrib/postgres_fdw/postgres_fdw.h
+++ b/contrib/postgres_fdw/postgres_fdw.h
@@ -124,9 +124,12 @@ typedef struct PgFdwRelationInfo
 	int			relation_index;
 } PgFdwRelationInfo;
 
+extern bool keep_connections;
+
 /* in postgres_fdw.c */
 extern int	set_transmission_modes(void);
 extern void reset_transmission_modes(int nestlevel);
+extern void _PG_init(void);
 
 /* in connection.c */
 extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);


Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm

2020-11-12 Thread Alexey Kondratov

On 2020-11-11 06:59, Tom Lane wrote:

Alexey Kondratov  writes:
After looking at the autoprewarm code more closely I have realised that
this 'double dump' issue was not an issue at all. I have just
misplaced a debug elog(), so its second output in the log was only
indicating that we calculated delay_in_ms one more time.


Ah --- that explains why I couldn't see a problem.

I've pushed 0001+0002 plus some followup work to fix other places
that could usefully use TimestampDifferenceMilliseconds().  I have
not done anything with 0003 (the TAP test for pg_prewarm), and will
leave that to the judgment of somebody who's worked with pg_prewarm
before.  To me it looks like it's not really testing things very
carefully at all; on the other hand, we have exactly zero test
coverage of that module today, so maybe something is better than
nothing.



Great, thank you for generalising the issue and working on it.


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm

2020-11-10 Thread Alexey Kondratov

On 2020-11-09 23:25, Tom Lane wrote:

Alexey Kondratov  writes:

On 2020-11-09 21:53, Tom Lane wrote:
0002 seems like a pretty clear bug fix, though I wonder if this is 
exactly
what we want to do going forward.  It seems like a very large 
fraction of
the callers of TimestampDifference would like to have the value in 
msec,

which means we're doing a whole lot of expensive and error-prone
arithmetic to break down the difference to sec/usec and then put it
back together again.  Let's get rid of that by inventing, say
TimestampDifferenceMilliseconds(...).



Yeah, I got into this problem after hitting a bug in another extension,
pg_wait_sampling. I have attached 0002, which implements
TimestampDifferenceMilliseconds(), so 0003 just uses this new function
to solve the initial issues. If it looks good to you, then we can switch
all similar callers to it.


Yeah, let's move forward with that --- in fact, I'm inclined to
back-patch it.  (Not till the current release cycle is done, though.
I don't find this important enough to justify a last-moment patch.)

BTW, I wonder if we shouldn't make TimestampDifferenceMilliseconds
round any fractional millisecond up rather than down.  Rounding down
seems to create a hazard of uselessly waking just before the delay is
completed.  Better to wake just after.



Yes, it makes sense. I have changed TimestampDifferenceMilliseconds() to 
round the result up if there is a remainder.
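
For reference, here is a minimal sketch of what such a helper can look 
like (the attached 0002 may differ in details, e.g. overflow clamping):

long
TimestampDifferenceMilliseconds(TimestampTz start_time, TimestampTz stop_time)
{
	TimestampTz diff = stop_time - start_time;

	if (diff <= 0)
		return 0;

	/* TimestampTz is in microseconds; round up to whole milliseconds */
	return (long) ((diff + 999) / 1000);
}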


After looking at the autoprewarm code more closely I have realised that 
this 'double dump' issue was not an issue at all. I had just misplaced a 
debug elog(), so its second output in the log was only indicating that 
we calculated delay_in_ms one more time. Actually, even with the wrong 
calculation of delay_in_ms, the only problem was that we were busy 
looping with a ~1 second interval instead of waiting on the latch.


It is still buggy behaviour, but much less harmful than I originally 
thought.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From ce09103d9d58b611728b66366cd24e8a4069f7ac Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 9 Nov 2020 19:04:10 +0300
Subject: [PATCH v3 3/3] pg_prewarm: add tap test for autoprewarm feature

---
 contrib/pg_prewarm/Makefile |  2 +
 contrib/pg_prewarm/t/001_autoprewarm.pl | 59 +
 2 files changed, 61 insertions(+)
 create mode 100644 contrib/pg_prewarm/t/001_autoprewarm.pl

diff --git a/contrib/pg_prewarm/Makefile b/contrib/pg_prewarm/Makefile
index b13ac3c813..9cfde8c4e4 100644
--- a/contrib/pg_prewarm/Makefile
+++ b/contrib/pg_prewarm/Makefile
@@ -10,6 +10,8 @@ EXTENSION = pg_prewarm
 DATA = pg_prewarm--1.1--1.2.sql pg_prewarm--1.1.sql pg_prewarm--1.0--1.1.sql
 PGFILEDESC = "pg_prewarm - preload relation data into system buffer cache"
 
+TAP_TESTS = 1
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/pg_prewarm/t/001_autoprewarm.pl b/contrib/pg_prewarm/t/001_autoprewarm.pl
new file mode 100644
index 00..f55b2a5352
--- /dev/null
+++ b/contrib/pg_prewarm/t/001_autoprewarm.pl
@@ -0,0 +1,59 @@
+#
+# Check that pg_prewarm can dump blocks from shared buffers
+# to PGDATA/autoprewarm.blocks.
+#
+
+use strict;
+use Test::More;
+use TestLib;
+use Time::HiRes qw(usleep);
+use warnings;
+
+use PostgresNode;
+
+plan tests => 3;
+
+# Wait up to 180s for pg_prewarm to dump blocks.
+sub wait_for_dump
+{
+	my $path = shift;
+
+	foreach my $i (0 .. 1800)
+	{
+		last if -e $path;
+		usleep(100_000);
+	}
+}
+
+my $node = get_new_node("node");
+$node->init;
+$node->append_conf(
+	'postgresql.conf', qq(
+shared_preload_libraries = 'pg_prewarm'
+pg_prewarm.autoprewarm = 'on'
+pg_prewarm.autoprewarm_interval = 1
+));
+$node->start;
+
+my $blocks_path = $node->data_dir . '/autoprewarm.blocks';
+
+# Check that we can dump blocks on timeout.
+wait_for_dump($blocks_path);
+ok(-e $blocks_path, 'file autoprewarm.blocks should be present in the PGDATA');
+
+# Check that we can dump blocks on shutdown.
+$node->stop;
+$node->append_conf(
+	'postgresql.conf', qq(
+pg_prewarm.autoprewarm_interval = 0
+));
+
+# Remove autoprewarm.blocks
+unlink($blocks_path) || die "$blocks_path: $!";
+ok(!-e $blocks_path, 'sanity check, dump on timeout is turned off');
+
+$node->start;
+$node->stop;
+
+wait_for_dump($blocks_path);
+ok(-e $blocks_path, 'file autoprewarm.blocks should be present in the PGDATA after clean shutdown');
-- 
2.19.1

From fba212ed765c8c411db1ca19c2ac991662109d99 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 9 Nov 2020 19:12:00 +0300
Subject: [PATCH v3 2/3] pg_prewarm: fix autoprewarm_interval behaviour

Previously it misused seconds from TimestampDifference() as
milliseconds, so it was busy looping with ~1 second interval
instead of waiting on latch with default autoprewarm_interval = 300s.
---
 contrib/pg_prewarm/autoprewarm.c | 8 +++-
 1 file chang

Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm

2020-11-09 Thread Alexey Kondratov

On 2020-11-09 21:53, Tom Lane wrote:

Alexey Kondratov  writes:
After fixing this issue I have noticed that it still dumps blocks 
twice

at each timeout (here I set autoprewarm_interval to 15s):
...
This happens because at timeout time we were using continue, but
actually we still have to wait the entire autoprewarm_interval after
successful dumping.


I don't think your 0001 is correct.  It would be okay if apw_dump_now()
could be counted on to take negligible time, but we shouldn't assume
that should we?



Yes, it seems so, if I understand you correctly. I had a doubt about the 
possibility of pg_ctl exiting earlier than the dumping process. Now I 
have added an explicit wait for the dump file to the test.



I agree that the "continue" seems a bit bogus, because it's skipping
the ResetLatch call at the bottom of the loop; it's not quite clear
to me whether that's a good thing or not.  But the general idea of
the existing code seems to be to loop around and make a fresh 
calculation

of how-long-to-wait, and that doesn't seem wrong.


I have left the last patch intact, since it resolves the 'double dump' 
issue, but I agree with your point about the existing logic of the code, 
even though it is a bit broken. So I have to think more about how to fix 
it in a better way.


0002 seems like a pretty clear bug fix, though I wonder if this is 
exactly
what we want to do going forward.  It seems like a very large fraction 
of
the callers of TimestampDifference would like to have the value in 
msec,

which means we're doing a whole lot of expensive and error-prone
arithmetic to break down the difference to sec/usec and then put it
back together again.  Let's get rid of that by inventing, say
TimestampDifferenceMilliseconds(...).


Yeah, I got into this problem after hitting a bug in another extension, 
pg_wait_sampling. I have attached 0002, which implements 
TimestampDifferenceMilliseconds(), so 0003 just uses this new function 
to solve the initial issues. If it looks good to you, then we can switch 
all similar callers to it.



BTW, I see another bug of a related ilk.  Look what
postgres_fdw/connection.c is doing:

		TimestampDifference(now, endtime, &secs, &microsecs);

		/* To protect against clock skew, limit sleep to one minute. */
		cur_timeout = Min(60000, secs * USECS_PER_SEC + microsecs);

		/* Sleep until there's something to do */
		wc = WaitLatchOrSocket(MyLatch,
							   WL_LATCH_SET | WL_SOCKET_READABLE |
							   WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
							   PQsocket(conn),
							   cur_timeout, PG_WAIT_EXTENSION);

WaitLatchOrSocket's timeout is measured in msec not usec.  I think the
comment about "clock skew" is complete BS, and the Min() calculation 
was

put in as a workaround by somebody observing that the sleep waited too
long, but not understanding why.


I wonder how much trouble one can get into with all these unit conversions.
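
For what it's worth, with a millisecond-returning helper that place could 
be reduced to something like the sketch below (assuming the 
TimestampDifferenceMilliseconds() from 0002; not committed code):

		cur_timeout = TimestampDifferenceMilliseconds(now, endtime);

		/* Sleep until there's something to do */
		wc = WaitLatchOrSocket(MyLatch,
							   WL_LATCH_SET | WL_SOCKET_READABLE |
							   WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
							   PQsocket(conn),
							   cur_timeout, PG_WAIT_EXTENSION);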


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From c79de17014753b311858b4570ca475f713328c62 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 9 Nov 2020 19:24:55 +0300
Subject: [PATCH v2 4/4] pg_prewarm: refactor autoprewarm waits

Previously it was dumping twice at every timeout time.
---
 contrib/pg_prewarm/autoprewarm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index e5bd130bc8..872c7d51b1 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -236,7 +236,9 @@ autoprewarm_main(Datum main_arg)
 			{
 last_dump_time = GetCurrentTimestamp();
 apw_dump_now(true, false);
-continue;
+
+/* We have to sleep even after a successful dump */
+delay_in_ms = autoprewarm_interval * 1000;
 			}
 
 			/* Sleep until the next dump time. */
-- 
2.19.1

From c38c07708d57d6dec5a8a1697ca9c9810ad4d7ce Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 9 Nov 2020 19:12:00 +0300
Subject: [PATCH v2 3/4] pg_prewarm: fix autoprewarm_interval behaviour.

Previously it misused seconds from TimestampDifference() as
milliseconds, so it was dumping autoprewarm.blocks ~every second
even with default autoprewarm_interval = 300s.
---
 contrib/pg_prewarm/autoprewarm.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index d3dec6e3ec..e5bd130bc8 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -222,16 +222,14 @@ autoprewarm_main(Datum main_arg)
 		{
 			long		delay_in_ms = 0;
 			TimestampTz next_dump_time = 0;
-			long		secs = 0;
-			int			usecs = 0;
 
 			/* Compute the next dump time. */
 			next_dump_time =
 TimestampTzPlusMilliseconds(last_dump_time,
 			

Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm

2020-11-09 Thread Alexey Kondratov

Hi Hackers,

Today I accidentally noticed that the autoprewarm feature of pg_prewarm 
used TimestampDifference()'s results in a wrong way.


First, it used the *seconds* result from it as *milliseconds*. This 
caused it to write the autoprewarm.blocks dump file roughly every second 
with the default setting of autoprewarm_interval = 300s.


Here is a log part with debug output in this case:

```
2020-11-09 19:09:00.162 MSK [85328] LOG:  dumping autoprewarm.blocks
2020-11-09 19:09:01.161 MSK [85328] LOG:  dumping autoprewarm.blocks
2020-11-09 19:09:02.160 MSK [85328] LOG:  dumping autoprewarm.blocks
2020-11-09 19:09:03.159 MSK [85328] LOG:  dumping autoprewarm.blocks
```

After fixing this issue I have noticed that it still dumps blocks twice 
at each timeout (here I set autoprewarm_interval to 15s):


```
2020-11-09 19:18:59.692 MSK [85662] LOG:  dumping autoprewarm.blocks
2020-11-09 19:18:59.700 MSK [85662] LOG:  dumping autoprewarm.blocks

2020-11-09 19:19:14.694 MSK [85662] LOG:  dumping autoprewarm.blocks
2020-11-09 19:19:14.704 MSK [85662] LOG:  dumping autoprewarm.blocks
```

This happens because at timeout time we were using continue, but 
actually we still have to wait the entire autoprewarm_interval after a 
successful dump.


I have fixed both issues in the attached patches and also added a 
minimalistic TAP test as the first patch, to verify that this automatic 
dumping still works after refactoring. I put Robert in CC, since he is 
the author of this feature.


What do you think?


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From 6d4bab7f21c3661dd4dd5a0de7e097b1de3f642c Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 9 Nov 2020 19:24:55 +0300
Subject: [PATCH v1 3/3] pg_prewarm: refactor autoprewarm waits

Previously it was dumping twice at every timeout time.
---
 contrib/pg_prewarm/autoprewarm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index b18a065ed5..f52c83de1e 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -238,7 +238,9 @@ autoprewarm_main(Datum main_arg)
 			{
 last_dump_time = GetCurrentTimestamp();
 apw_dump_now(true, false);
-continue;
+
+/* We have to sleep even after a successful dump */
+delay_in_ms = autoprewarm_interval * 1000;
 			}
 
 			/* Sleep until the next dump time. */
-- 
2.19.1

From 8793b8beb6a5c1ae730f1fffb09dff64c83bc631 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 9 Nov 2020 19:12:00 +0300
Subject: [PATCH v1 2/3] pg_prewarm: fix autoprewarm_interval behaviour.

Previously it misused seconds from TimestampDifference() as
milliseconds, so it was dumping autoprewarm.blocks ~every second
even with default autoprewarm_interval = 300s.
---
 contrib/pg_prewarm/autoprewarm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c
index d3dec6e3ec..b18a065ed5 100644
--- a/contrib/pg_prewarm/autoprewarm.c
+++ b/contrib/pg_prewarm/autoprewarm.c
@@ -231,7 +231,7 @@ autoprewarm_main(Datum main_arg)
 			autoprewarm_interval * 1000);
 			TimestampDifference(GetCurrentTimestamp(), next_dump_time,
								&secs, &usecs);
-			delay_in_ms = secs + (usecs / 1000);
+			delay_in_ms = secs * 1000 + (usecs / 1000);
 
 			/* Perform a dump if it's time. */
 			if (delay_in_ms <= 0)
-- 
2.19.1

From 31dc30c97861afae9c34852afc5a5b1c91bbeadc Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 9 Nov 2020 19:04:10 +0300
Subject: [PATCH v1 1/3] pg_prewarm: add tap test for autoprewarm feature

---
 contrib/pg_prewarm/Makefile |  2 +
 contrib/pg_prewarm/t/001_autoprewarm.pl | 51 +
 2 files changed, 53 insertions(+)
 create mode 100644 contrib/pg_prewarm/t/001_autoprewarm.pl

diff --git a/contrib/pg_prewarm/Makefile b/contrib/pg_prewarm/Makefile
index b13ac3c813..9cfde8c4e4 100644
--- a/contrib/pg_prewarm/Makefile
+++ b/contrib/pg_prewarm/Makefile
@@ -10,6 +10,8 @@ EXTENSION = pg_prewarm
 DATA = pg_prewarm--1.1--1.2.sql pg_prewarm--1.1.sql pg_prewarm--1.0--1.1.sql
 PGFILEDESC = "pg_prewarm - preload relation data into system buffer cache"
 
+TAP_TESTS = 1
+
 ifdef USE_PGXS
 PG_CONFIG = pg_config
 PGXS := $(shell $(PG_CONFIG) --pgxs)
diff --git a/contrib/pg_prewarm/t/001_autoprewarm.pl b/contrib/pg_prewarm/t/001_autoprewarm.pl
new file mode 100644
index 00..b564c29931
--- /dev/null
+++ b/contrib/pg_prewarm/t/001_autoprewarm.pl
@@ -0,0 +1,51 @@
+#
+# Check that pg_prewarm can dump blocks from shared buffers
+# to PGDATA/autoprewarm.blocks.
+#
+
+use strict;
+use Test::More;
+use TestLib;
+use Time::HiRes qw(usleep);
+use warnings;
+
+use PostgresNode;
+
+plan tests => 3;
+
+my $node = get_new_node("node");
+$node->init;
+$node->append_conf(
+'postgresql.conf', qq(
+shared_preload_libraries = 'pg_

Re: Global snapshots

2020-09-21 Thread Alexey Kondratov

On 2020-09-18 00:54, Bruce Momjian wrote:

On Tue, Sep  8, 2020 at 01:36:16PM +0300, Alexey Kondratov wrote:

Thank you for the link!

After a quick look on the Sawada-san's patch set I think that there 
are two

major differences:

1. There is a built-in foreign xacts resolver in the [1], which should 
be
much more convenient from the end-user perspective. It involves huge 
in-core

changes and additional complexity that is of course worth of.

However, it's still not clear for me that it is possible to resolve 
all
foreign prepared xacts on the Postgres' own side with a 100% 
guarantee.
Imagine a situation when the coordinator node is actually a HA cluster 
group
(primary + sync + async replica) and it failed just after PREPARE 
stage of

after local COMMIT. In that case all foreign xacts will be left in the
prepared state. After failover process complete synchronous replica 
will
become a new primary. Would it have all required info to properly 
resolve

orphan prepared xacts?

Probably, this situation is handled properly in the [1], but I've not 
yet
finished a thorough reading of the patch set, though it has a great 
doc!


On the other hand, previous 0003 and my proposed patch rely on either 
manual
resolution of hung prepared xacts or usage of external 
monitor/resolver.
This approach is much simpler from the in-core perspective, but 
doesn't look

as complete as [1] though.


Have we considered how someone would clean up foreign transactions if 
the
coordinating server dies?  Could it be done manually?  Would an 
external

resolver, rather than an internal one, make this easier?


Both Sawada-san's patch [1] and the patches in this thread (e.g. mine [2]) 
use 2PC with a special gid format that includes the xid plus server 
identification info. Thus, one can select from pg_prepared_xacts, get the 
xid and coordinator info, then use txid_status() on the coordinator (or 
ex-coordinator) to get the transaction status and finally either commit 
or abort these stale prepared xacts. Of course, this could be wrapped 
into some user-level support routines, as it is done in [1].
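
Just to illustrate the idea (the exact gid format differs between the 
patch sets and the variable names below are made up), such a gid could 
be built like:

	/*
	 * Illustrative only: embed identification of the coordinator cluster,
	 * its xid and the foreign server OID into the gid used for PREPARE
	 * TRANSACTION, so that a leftover row in pg_prepared_xacts can later
	 * be traced back to the coordinator and checked with txid_status().
	 */
	char		gid[GIDSIZE];

	snprintf(gid, sizeof(gid), "pgfdw_%llu_%u_%u",
			 (unsigned long long) GetSystemIdentifier(),
			 xid,			/* coordinator's transaction id */
			 serverid);		/* foreign server OID */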


As for the benefits of using an external resolver, I think there are a 
few from the whole-system perspective:


1) If one follows the logic above, then this resolver could be stateless: 
it takes all the required info from the Postgres nodes themselves.


2) Then you can easily put it into a container, which makes it easier to 
deploy to all this 'cloud' stuff like Kubernetes.


3) Also, you can scale resolvers independently of the Postgres nodes.

I do not think that either of these points is a game changer, but we use 
a very simple external resolver together with [2] in our sharding 
prototype and it works just fine so far.



[1] 
https://www.postgresql.org/message-id/CA%2Bfd4k4HOVqqC5QR4H984qvD0Ca9g%3D1oLYdrJT_18zP9t%2BUsJg%40mail.gmail.com


[2] 
https://www.postgresql.org/message-id/3ef7877bfed0582019eab3d462a43275%40postgrespro.ru


--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Concurrency issue in pg_rewind

2020-09-17 Thread Alexey Kondratov

On 2020-09-17 15:27, Alexander Kukushkin wrote:

On Thu, 17 Sep 2020 at 14:04, Alexey Kondratov
 wrote:

With --restore-target-wal pg_rewind is trying to call restore_command 
on

its own and it can happen at two stages:

1) When pg_rewind is trying to find the last checkpoint preceding a
divergence point. In that case file map is not even yet initialized.
Thus, all fetched WAL segments at this stage will be present in the 
file

map created later.


Nope, it will fetch files you requested, and in addition to that it
will leave a child process running in the background which is doing
the prefetch (manipulating with pg_wal/.wal-g/...)



2) When it creates a data pages map. It should traverse WAL from the
last common checkpoint till the final shutdown point in order to find
all modified pages on the target. At this stage pg_rewind only updates
info about data segments in the file map. That way, I see a minor
problem that WAL segments fetched at this stage would not be deleted,
since they are absent in the file map.

Anyway, pg_rewind does not delete either WAL segments or any other
files in the middle of the file map creation, so I cannot imagine how
it could get into the same trouble on its own.


When pg_rewind was creating the map, some temporary files where there,
because the forked child process of wal-g was still running.
When the wal-g child process exits, it removes some of these files.
Specifically, it was trying to prefetch 00840A760024 into
the pg_wal/.wal-g/prefetch/running/00840A760024, but
apparently the file wasn't available on S3 and prefetch failed,
therefore the empty file was removed.



I do understand how you got into this problem with wal-g. This part of 
my answer was about bare postgres and pg_rewind. And my point was that 
from my perspective pg_rewind with --restore-target-wal cannot get into 
the same trouble on its own, without 'help' of some side tools like 
wal-g.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Concurrency issue in pg_rewind

2020-09-17 Thread Alexey Kondratov

On 2020-09-16 15:55, Alexander Kukushkin wrote:

Hello,

Today I bumped into an issue with pg_rewind which is not 100% clear
where should be better fixed.

The first call of pg_rewind failed with the following message:
servers diverged at WAL location A76/39E55338 on timeline 132
could not open file
"/home/postgres/pgdata/pgroot/data/pg_wal/00840A76001E":
No such file or directory

could not find previous WAL record at A76/1EFFE620
Failure, exiting

In order to avoid rebuilding the replica from scratch, we restored the
missing file by calling restore_command (wal-g) and repeated the call
of pg_rewind.

The second time pg_rewind also failed, but the error looked 
differently:

servers diverged at WAL location A76/39E55338 on timeline 132
rewinding from last common checkpoint at A76/1EF254B8 on timeline 132

could not remove file
"/home/postgres/pgdata/pgroot/data/pg_wal/.wal-g/prefetch/running/00840A760024":
No such file or directory
Failure, exiting

The second call left PGDATA in an inconsistent state (empty 
pg_control).


A few words about where the pg_wal/.wal-g/prefetch/running/ is coming
from: wal-g by default when fetching the WAL file is also trying to do
a prefetch of a few next WAL files. For that it forks and the child
process doing prefetch while the parent process exits. In order to
avoid multiple parallel prefetches of the same file, wal-g keeps its
state in the pg_wal/.wal-g directory. It also keeps prefetched files
there.



Hm, I cannot understand why wal-g (or any other tool) is trying to run 
pg_rewind while WAL copying (and prefetching) is still in progress. Why 
not just wait until it is finished?


It is also not clear to me why it does not put prefetched WAL files 
directly into pg_wal.




The issue occurred on 10.14, but I believe very similar might happen
with postgres 13 when pg_rewind is called with --restore-target-wal
option.



With --restore-target-wal pg_rewind is trying to call restore_command on 
its own and it can happen at two stages:


1) When pg_rewind is trying to find the last checkpoint preceding a 
divergence point. In that case file map is not even yet initialized. 
Thus, all fetched WAL segments at this stage will be present in the file 
map created later.


2) When it creates a data pages map. It should traverse WAL from the 
last common checkpoint till the final shutdown point in order to find 
all modified pages on the target. At this stage pg_rewind only updates 
info about data segments in the file map. That way, I see a minor 
problem that WAL segments fetched at this stage would not be deleted, 
since they are absent in the file map.


Anyway, pg_rewind does not delete either WAL segments or any other 
files in the middle of the file map creation, so I cannot imagine how 
it could get into the same trouble on its own.




That made me think about how it could be improved in the pg_rewind.
The thing is, that we want to have a specific file to be removed, and
it is already not there. Should it be a fatal error?
traverse_datadir()/recurse_dir() already ignoring all failed lstat()
calls with errno == ENOENT.



recurse_dir() is also called on the source, which should be running when 
--source-server is used, so it can remove files. There is actually a 
TODO there:


/*
 * File doesn't exist anymore. This is ok, if the new master
 * is running and the file was just removed. If it was a data
 * file, there should be a WAL record of the removal. If it
 * was something else, it couldn't have been anyway.
 *
 * TODO: But complain if we're processing the target dir!
 */



Basically I have to options:
1. Remove modify remove_target_file(const char *path, bool missing_ok)
function and remove the missing_ok option, that would be consistent
with recurse_dir()
2. Change the logic of remove_target_file(), so it doesn't exit with
the fatal error if the file is missing, but shows only a warning.

In addition to the aforementioned options the remove_target_dir() also
should be improved, i.e. it should check errno and behave similarly to
the remove_target_file() if the errno == ENOENT

What do you think?



Although keeping arbitrary files inside PGDATA does not look like a good 
idea to me, I do not see anything criminal in skipping a non-existing 
file when pg_rewind executes the file map.
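
For illustration, option 2 could look roughly like this (based on 
remove_target_file() in pg_rewind's file_ops.c; just a sketch, not a 
committed change):

static void
remove_target_file(const char *path, bool missing_ok)
{
	char		dstpath[MAXPGPATH];

	if (dry_run)
		return;

	snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
	if (unlink(dstpath) != 0)
	{
		if (errno == ENOENT)
		{
			/* instead of pg_fatal(), only warn about a vanished file */
			if (!missing_ok)
				pg_log_warning("file \"%s\" has disappeared, skipping", dstpath);
			return;
		}

		pg_fatal("could not remove file \"%s\": %m", dstpath);
	}
}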



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Global snapshots

2020-09-10 Thread Alexey Kondratov

On 2020-09-09 20:29, Fujii Masao wrote:

On 2020/09/09 2:00, Alexey Kondratov wrote:


According to the Sawada-san's v25 0002 the logic is pretty much the 
same there:


+2. Pre-Commit phase (1st phase of two-phase commit)

+3. Commit locally
+Once we've prepared all of them, commit the transaction locally.

+4. Post-Commit Phase (2nd phase of two-phase commit)

A brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact / 
FdwXactParticipantEndTransaction happens after 
ProcArrayEndTransaction() in CommitTransaction(). Thus, I don't 
see much difference between this approach and the CallXactCallbacks() 
usage regarding this point.


IIUC the commit logic in Sawada-san's patch looks like

1. PreCommit_FdwXact()
PREPARE TRANSACTION command is issued

2. RecordTransactionCommit()
2-1. WAL-log the commit record
2-2. Update CLOG
2-3. Wait for sync rep
2-4. FdwXactWaitForResolution()
Wait until COMMIT PREPARED commands are issued to the
remote servers and completed.

3. ProcArrayEndTransaction()
4. AtEOXact_FdwXact(true)

So ISTM that the timing of when COMMIT PREPARED is issued
to the remote server is different between the patches.
Am I missing something?



No, you are right, sorry. At first glance I thought that 
AtEOXact_FdwXact was responsible for COMMIT PREPARED as well, but it 
only calls FdwXactParticipantEndTransaction in the abort case.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-09 Thread Alexey Kondratov

On 2020-09-09 15:22, Michael Paquier wrote:

On Tue, Sep 08, 2020 at 07:17:58PM -0500, Justin Pryzby wrote:

Initially I added List *params, and Michael suggested to retire
ReindexStmt->concurrent.  I provided a patch to do so, initially by 
leaving int
options and then, after this, removing it to "complete the thought", 
and get
rid of the remnants of the "old way" of doing it.  This is also how 
vacuum and

explain are done.
https://www.postgresql.org/message-id/20200902022410.GA20149%40telsasoft.com


Defining a set of DefElem when parsing and then using the int
"options" with bitmasks where necessary at the beginning of the
execution looks like a good balance to me.  This way, you can extend
the grammar to use things like (verbose = true), etc.

By the way, skimming through the patch set, I was wondering if we
could do the refactoring of patch 0005 as a first step



Yes, I did it with the intention of making it the first patch, but wanted 
to get some feedback first. It's easier to refactor the last patch 
without rebasing the others.




until I
noticed this part:
+common_option_name:
NonReservedWord { $$ = $1; }
| analyze_keyword { $$ = "analyze"; }
This is not a good idea as you make ANALYZE an option available for
all the commands involved in the refactoring.  A portion of that could
be considered though, like the use of common_option_arg.



From the grammar perspective ANY option is available for any command 
that uses the parenthesized option list. All the checks and validations 
are performed in the corresponding command's code.
This analyze_keyword rule only normalizes the ANALYZE word if it's used 
as an option. Why would it be harmful?



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Global snapshots

2020-09-09 Thread Alexey Kondratov

On 2020-09-09 08:35, Masahiko Sawada wrote:

On Wed, 9 Sep 2020 at 02:00, Alexey Kondratov
 wrote:


On 2020-09-08 14:48, Fujii Masao wrote:
>
> IIUC, yes, the information required for automatic resolution is
> WAL-logged and the standby tries to resolve those orphan transactions
> from WAL after the failover. But Sawada-san's patch provides
> the special function for manual resolution, so there may be some cases
> where manual resolution is necessary.
>

I've found a note about manual resolution in the v25 0002:

+After that we prepare all foreign transactions by calling
+PrepareForeignTransaction() API. If we failed on any of them we 
change

to
+rollback, therefore at this time some participants might be prepared
whereas
+some are not prepared. The former foreign transactions need to be
resolved
+using pg_resolve_foreign_xact() manually and the latter ends
transaction
+in one-phase by calling RollbackForeignTransaction() API.

but it's not yet clear for me.


Sorry, the above description in README is out of date. In the v25
patch, it's true that if a backend fails to prepare a transaction on a
foreign server, it’s possible that some foreign transactions are
prepared whereas others are not. But at the end of the transaction
after changing to rollback, the process does rollback (or rollback
prepared) all of them. So the use case of pg_resolve_foreign_xact() is
to resolve orphaned foreign prepared transactions or to resolve a
foreign transaction that is not resolved for some reasons, bugs etc.



OK, thank you for the explanation!



Once the transaction is committed locally any ERROR (or higher level
message) will be escalated to PANIC.


I think this is true only inside the critical section and it's not
necessarily true for all errors happening after the local commit,
right?



It's not actually related to critical-section error escalation. Any 
error in the backend after the local commit and 
ProcArrayEndTransaction() will try to abort the current transaction and 
do RecordTransactionAbort(), but it's too late to do so and a PANIC will 
be raised:


	/*
	 * Check that we haven't aborted halfway through
	 * RecordTransactionCommit.
	 */
	if (TransactionIdDidCommit(xid))
		elog(PANIC, "cannot abort transaction %u, it was already committed",
			 xid);

At least that's how I understand it.



And I do see possible ERROR level
messages in the postgresCommitForeignTransaction() for example:

+   if (PQresultStatus(res) != PGRES_COMMAND_OK)
+   ereport(ERROR, (errmsg("could not commit transaction 
on server %s",
+  
frstate->server->servername)));


I don't think that it's very convenient to get a PANIC every time we
fail to commit one of the prepared foreign xacts, since that may not be
so rare in a distributed system. That's why I tried to get rid of
possible ERRORs as far as possible in my proposed patch.



In my patch, the second phase of 2PC is executed only by the resolver
process. Therefore, even if an error would happen during committing a
foreign prepared transaction, we just need to relaunch the resolver
process and trying again. During that, the backend process will be
just waiting. If a backend process raises an error after the local
commit, the client will see transaction failure despite the local
transaction having been committed. An error could happen even by
palloc. So the patch uses a background worker to commit prepared
foreign transactions, not by backend itself.



Yes, if it's a background process, then it seems to be safe.

BTW, it seems that I've chosen the wrong thread for posting my patch and 
starting a discussion :) Activity from this thread has moved to [1] and 
your solution with the built-in resolver is discussed in [2]. I'll try 
to take a closer look at v25 and write to [2] instead.



[1] 
https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca


[2] 
https://www.postgresql.org/message-id/CAExHW5uBy9QwjdSO4j82WC4aeW-Q4n2ouoZ1z70o%3D8Vb0skqYQ%40mail.gmail.com


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-09 Thread Alexey Kondratov

On 2020-09-09 11:45, Andrey V. Lepikhov wrote:

On 9/8/20 8:34 PM, Alexey Kondratov wrote:

On 2020-09-08 17:00, Amit Langote wrote:

 wrote:

On 2020-09-08 10:34, Amit Langote wrote:
Another ambiguous part of the refactoring was in changing
InitResultRelInfo() arguments:

@@ -1278,6 +1280,7 @@ InitResultRelInfo(ResultRelInfo 
*resultRelInfo,

  Relation resultRelationDesc,
  Index resultRelationIndex,
  Relation partition_root,
+ bool use_multi_insert,
  int instrument_options)

Why do we need to pass this use_multi_insert flag here? Would it be
better to set resultRelInfo->ri_usesMultiInsert in the
InitResultRelInfo() unconditionally like it is done for
ri_usesFdwDirectModify? And after that it will be up to the caller
whether to use multi-insert or not based on their own circumstances.
Otherwise now we have a flag to indicate that we want to check for
another flag, while this check doesn't look costly.


Hmm, I think having two flags seems confusing and bug prone,
especially if you consider partitions.  For example, if a partition's
ri_usesMultiInsert is true, but CopyFrom()'s local flag is false, 
then

execPartition.c: ExecInitPartitionInfo() would wrongly perform
BeginForeignCopy() based on only ri_usesMultiInsert, because it
wouldn't know CopyFrom()'s local flag.  Am I missing something?


No, you're right. If someone wants to share state and use 
ResultRelInfo (RRI) for that purpose, then it's fine, but CopyFrom() 
may simply override RRI->ri_usesMultiInsert if needed and pass this 
RRI further.


This is how it's done for RRI->ri_usesFdwDirectModify. 
InitResultRelInfo() initializes it to false and then 
ExecInitModifyTable() changes the flag if needed.


Probably this is just a matter of personal choice, but for me the 
current implementation with additional argument in InitResultRelInfo() 
doesn't look completely right. Maybe because a caller now should pass 
an additional argument (as false) even if it doesn't care about 
ri_usesMultiInsert at all. It also adds additional complexity and 
feels like abstractions leaking.

I didn't feel what the problem was and prepared a patch version
according to Alexey's suggestion (see Alternate.patch).


Yes, that's very close to what I meant.

+	leaf_part_rri->ri_usesMultiInsert = (leaf_part_rri->ri_usesMultiInsert &&
+		rootResultRelInfo->ri_usesMultiInsert) ? true : false;

This could be just:

+	leaf_part_rri->ri_usesMultiInsert = (leaf_part_rri->ri_usesMultiInsert &&
+		rootResultRelInfo->ri_usesMultiInsert);


This does not seem very convenient and will lead to errors in the
future. So, I agree with Amit.


And InitResultRelInfo() may set ri_usesMultiInsert to false by default, 
since it's used only by COPY now. Then you won't need this in several 
places:


+   resultRelInfo->ri_usesMultiInsert = false;

While the logic of turning multi-insert on with all the validations 
required could be factored out of InitResultRelInfo() to a separate 
routine.
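
Purely as an illustration (the helper name and the exact checks below are 
made up, not taken from the actual patch set), such a routine could look 
like:

static bool
ResultRelCanMultiInsert(ResultRelInfo *rri)
{
	/* BEFORE/INSTEAD OF row triggers require one-by-one inserts */
	if (rri->ri_TrigDesc != NULL &&
		(rri->ri_TrigDesc->trig_insert_before_row ||
		 rri->ri_TrigDesc->trig_insert_instead_row))
		return false;

	/* A foreign table qualifies only if its FDW provides the COPY API */
	if (rri->ri_FdwRoutine != NULL &&
		rri->ri_FdwRoutine->ExecForeignCopy == NULL)
		return false;

	return true;
}

so that a caller like CopyFrom() or ExecInitPartitionInfo() would simply 
do resultRelInfo->ri_usesMultiInsert = ResultRelCanMultiInsert(resultRelInfo);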


Anyway, I don't insist at all and think it's fine to stick to the 
original v7's logic.


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Global snapshots

2020-09-08 Thread Alexey Kondratov

On 2020-09-08 14:48, Fujii Masao wrote:

On 2020/09/08 19:36, Alexey Kondratov wrote:

On 2020-09-08 05:49, Fujii Masao wrote:

On 2020/09/05 3:31, Alexey Kondratov wrote:


Attached is a patch, which implements a plain 2PC in the 
postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it 
solves these errors handling issues above and tries to add proper 
comments everywhere. I think, that 0003 should be rebased on the top 
of it, or it could be a first patch in the set, since it may be used 
independently. What do you think?


Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any 
thoughts

about pros and cons between your patch and Sawada-san's?

[1]
https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com


Thank you for the link!

After a quick look on the Sawada-san's patch set I think that there 
are two major differences:


Thanks for sharing your thought! As far as I read your patch quickly,
I basically agree with your this view.




1. There is a built-in foreign xacts resolver in the [1], which should 
be much more convenient from the end-user perspective. It involves 
huge in-core changes and additional complexity that is of course worth 
of.


However, it's still not clear for me that it is possible to resolve 
all foreign prepared xacts on the Postgres' own side with a 100% 
guarantee. Imagine a situation when the coordinator node is actually a 
HA cluster group (primary + sync + async replica) and it failed just 
after PREPARE stage of after local COMMIT. In that case all foreign 
xacts will be left in the prepared state. After failover process 
complete synchronous replica will become a new primary. Would it have 
all required info to properly resolve orphan prepared xacts?


IIUC, yes, the information required for automatic resolution is
WAL-logged and the standby tries to resolve those orphan transactions
from WAL after the failover. But Sawada-san's patch provides
the special function for manual resolution, so there may be some cases
where manual resolution is necessary.



I've found a note about manual resolution in the v25 0002:

+After that we prepare all foreign transactions by calling
+PrepareForeignTransaction() API. If we failed on any of them we change 
to
+rollback, therefore at this time some participants might be prepared 
whereas
+some are not prepared. The former foreign transactions need to be 
resolved
+using pg_resolve_foreign_xact() manually and the latter ends 
transaction

+in one-phase by calling RollbackForeignTransaction() API.

but it's not yet clear to me.



Implementing 2PC feature only inside postgres_fdw seems to cause
another issue; COMMIT PREPARED is issued to the remote servers
after marking the local transaction as committed
(i.e., ProcArrayEndTransaction()).



According to Sawada-san's v25 0002, the logic is pretty much the same 
there:


+2. Pre-Commit phase (1st phase of two-phase commit)

+3. Commit locally
+Once we've prepared all of them, commit the transaction locally.

+4. Post-Commit Phase (2nd phase of two-phase commit)

A brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact / 
FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction() 
in CommitTransaction(). Thus, I don't see much difference between 
this approach and the CallXactCallbacks() usage regarding this point.



Is this safe? This issue happens
because COMMIT PREPARED is issued via
CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks()
is called after ProcArrayEndTransaction().



Once the transaction is committed locally, any ERROR (or higher-level 
message) will be escalated to PANIC. And I do see possible ERROR-level 
messages in postgresCommitForeignTransaction(), for example:


+   if (PQresultStatus(res) != PGRES_COMMAND_OK)
+   ereport(ERROR, (errmsg("could not commit transaction on server 
%s",
+  
frstate->server->servername)));

I don't think that it's very convenient to get a PANIC every time we 
fail to commit one of the prepared foreign xacts, since that may not be 
so rare in a distributed system. That's why I tried to get rid of 
possible ERRORs as far as possible in my proposed patch.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-08 Thread Alexey Kondratov

On 2020-09-08 17:00, Amit Langote wrote:

Hi Alexey,

On Tue, Sep 8, 2020 at 6:29 PM Alexey Kondratov
 wrote:

On 2020-09-08 10:34, Amit Langote wrote:
> Any thoughts on the taking out the refactoring changes out of the main
> patch as I suggested?
>

+1 for splitting the patch. It was rather difficult for me to
distinguish changes required by COPY via postgres_fdw from this
refactoring.

Another ambiguous part of the refactoring was in changing
InitResultRelInfo() arguments:

@@ -1278,6 +1280,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
  Relation resultRelationDesc,
  Index resultRelationIndex,
  Relation partition_root,
+ bool use_multi_insert,
  int instrument_options)

Why do we need to pass this use_multi_insert flag here? Would it be
better to set resultRelInfo->ri_usesMultiInsert in the
InitResultRelInfo() unconditionally like it is done for
ri_usesFdwDirectModify? And after that it will be up to the caller
whether to use multi-insert or not based on their own circumstances.
Otherwise now we have a flag to indicate that we want to check for
another flag, while this check doesn't look costly.


Hmm, I think having two flags seems confusing and bug prone,
especially if you consider partitions.  For example, if a partition's
ri_usesMultiInsert is true, but CopyFrom()'s local flag is false, then
execPartition.c: ExecInitPartitionInfo() would wrongly perform
BeginForeignCopy() based on only ri_usesMultiInsert, because it
wouldn't know CopyFrom()'s local flag.  Am I missing something?


No, you're right. If someone wants to share state and use ResultRelInfo 
(RRI) for that purpose, then it's fine, but CopyFrom() may simply 
override RRI->ri_usesMultiInsert if needed and pass this RRI further.


This is how it's done for RRI->ri_usesFdwDirectModify. 
InitResultRelInfo() initializes it to false and then 
ExecInitModifyTable() changes the flag if needed.


Probably this is just a matter of personal choice, but for me the 
current implementation with additional argument in InitResultRelInfo() 
doesn't look completely right. Maybe because a caller now should pass an 
additional argument (as false) even if it doesn't care about 
ri_usesMultiInsert at all. It also adds additional complexity and feels 
like abstractions leaking.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Global snapshots

2020-09-08 Thread Alexey Kondratov

On 2020-09-08 05:49, Fujii Masao wrote:

On 2020/09/05 3:31, Alexey Kondratov wrote:


Attached is a patch, which implements a plain 2PC in the postgres_fdw 
and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these 
errors handling issues above and tries to add proper comments 
everywhere. I think, that 0003 should be rebased on the top of it, or 
it could be a first patch in the set, since it may be used 
independently. What do you think?


Thanks for the patch!

Sawada-san was proposing another 2PC patch at [1]. Do you have any 
thoughts

about pros and cons between your patch and Sawada-san's?

[1]
https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com


Thank you for the link!

After a quick look at Sawada-san's patch set I think that there are 
two major differences:


1. There is a built-in foreign xacts resolver in [1], which should 
be much more convenient from the end-user perspective. It involves huge 
in-core changes and additional complexity that is, of course, worth it.


However, it's still not clear to me that it is possible to resolve all 
foreign prepared xacts on Postgres' own side with a 100% guarantee. 
Imagine a situation where the coordinator node is actually an HA cluster 
group (primary + sync + async replica) and it failed just after the 
PREPARE stage or after the local COMMIT. In that case all foreign xacts 
will be left in the prepared state. After the failover process completes, 
the synchronous replica will become the new primary. Would it have all 
the required info to properly resolve the orphaned prepared xacts?


Probably this situation is handled properly in [1], but I've not 
yet finished a thorough reading of the patch set, though it has great 
docs!


On the other hand, previous 0003 and my proposed patch rely on either 
manual resolution of hung prepared xacts or usage of external 
monitor/resolver. This approach is much simpler from the in-core 
perspective, but doesn't look as complete as [1] though.


2. In the patch from this thread all the 2PC logic sits in postgres_fdw, 
while [1] tries to put it into the generic FDW core, which also feels 
like a more general and architecturally correct way. However, how many 
of the currently available dozens of various FDWs are capable of 
performing 2PC? And how many of them are maintained well enough to adopt 
this new API? This is not an argument against [1] actually, since 
postgres_fdw is known to be the most advanced FDW and an early adopter 
of new features; it is just a little doubt about the usefulness of this 
preliminary generalisation.


Anyway, I think that [1] is a great work and really hope to find more 
time to investigate it deeper later this year.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-09-08 Thread Alexey Kondratov

Hi,

I've started doing a review of v7 yesterday.

On 2020-09-08 10:34, Amit Langote wrote:

On Mon, Sep 7, 2020 at 7:31 PM Andrey V. Lepikhov
 wrote:

>
v.7 (in attachment) fixes this problem.
I also accepted Amit's suggestion to rename all fdwapi routines such as
ForeignCopyIn to *ForeignCopy.




It seems that naming is quite inconsistent now:

+   /* COPY a bulk of tuples into a foreign relation */
+   BeginForeignCopyIn_function BeginForeignCopy;
+   EndForeignCopyIn_function EndForeignCopy;
+   ExecForeignCopyIn_function ExecForeignCopy;

You got rid of the 'In' in the function names, but the types still have 
it:


+typedef void (*BeginForeignCopyIn_function) (ModifyTableState *mtstate,
+   ResultRelInfo *rinfo);
+
+typedef void (*EndForeignCopyIn_function) (EState *estate,
+   ResultRelInfo *rinfo);
+
+typedef void (*ExecForeignCopyIn_function) (ResultRelInfo *rinfo,
+   TupleTableSlot **slots,
+   int nslots);

Also, the docs refer to the old function names:

+void
+BeginForeignCopyIn(ModifyTableState *mtstate,
+   ResultRelInfo *rinfo);

I think that it'd be better to choose either of these two naming schemes 
and use it everywhere for consistency.




Any thoughts on the taking out the refactoring changes out of the main
patch as I suggested?



+1 for splitting the patch. It was rather difficult for me to 
distinguish changes required by COPY via postgres_fdw from this 
refactoring.


Another ambiguous part of the refactoring was in changing 
InitResultRelInfo() arguments:


@@ -1278,6 +1280,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
  Relation resultRelationDesc,
  Index resultRelationIndex,
  Relation partition_root,
+ bool use_multi_insert,
  int instrument_options)

Why do we need to pass this use_multi_insert flag here? Would it be 
better to set resultRelInfo->ri_usesMultiInsert in InitResultRelInfo() 
unconditionally, like it is done for ri_usesFdwDirectModify? After that 
it would be up to the caller whether to use multi-insert or not based on 
its own circumstances. Otherwise we now have a flag to indicate that we 
want to check another flag, while this check doesn't look costly.




Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Global snapshots

2020-09-04 Thread Alexey Kondratov

Hi,

On 2020-07-27 09:44, Andrey V. Lepikhov wrote:

On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote:


US8356007B2 - Distributed transaction management for database systems 
with multiversioning - Google Patents

https://patents.google.com/patent/US8356007


If it is, can we circumvent this patent?



Thank you for the research (and previous links too).
I haven't seen this patent before. This should be carefully studied.


I had a look at the patch set, although it is quite outdated, especially 
0003.


Two thoughts about 0003:

First, IIUC atomicity of the distributed transaction in postgres_fdw 
is achieved by the usage of 2PC. I think that this postgres_fdw 2PC 
support should be separated from global snapshots. It could be useful to 
have such atomic distributed transactions even without proper 
visibility, which is guaranteed by the global snapshot, especially 
taking into account the doubts about Clock-SI and the general questions 
about algorithm selection criteria above in the thread.


Thus, I propose to split 0003 into two parts and add a separate GUC 
'postgres_fdw.use_twophase', which could be turned on independently from 
'postgres_fdw.use_global_snapshots'. Of course, if the latter is enabled, 
then 2PC should be forcibly turned on as well.
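
Schematically, a session under this split might look as follows (GUC 
names as proposed above; just a sketch of the intended behaviour, not of 
the final interface):

-- atomic commit via 2PC only, no global visibility
SET postgres_fdw.use_twophase = on;
SET postgres_fdw.use_global_snapshots = off;

-- full mode: enabling global snapshots forces 2PC on as well
SET postgres_fdw.use_global_snapshots = on;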


Second, there are some problems with error handling in 0003 (thanks 
to Arseny Sher for the review).


+error:
+   if (!res)
+   {
+   sql = psprintf("ABORT PREPARED '%s'", 
fdwTransState->gid);
+   BroadcastCmd(sql);
+   elog(ERROR, "Failed to PREPARE transaction on remote 
node");
+   }

It seems that we should never reach this point, just because 
BroadcastStmt will throw an ERROR if it fails to prepare the transaction 
on the foreign server:


+   if (PQresultStatus(result) != expectedStatus ||
+   (handler && !handler(result, arg)))
+   {
+elog(WARNING, "Failed command %s: status=%d, expected status=%d", 
sql, PQresultStatus(result), expectedStatus);

+   pgfdw_report_error(ERROR, result, entry->conn, 
true, sql);
+   allOk = false;
+   }

Moreover, it doesn't make much sense to try to abort prepared xacts, 
since if we failed to prepare one somewhere, then some foreign servers 
may already be unavailable and this doesn't provide us a 100% guarantee 
of cleanup.


+   /* COMMIT open transaction of we were doing 2PC */
+   if (fdwTransState->two_phase_commit &&
+   (event == XACT_EVENT_PARALLEL_COMMIT || event == 
XACT_EVENT_COMMIT))
+   {
+   BroadcastCmd(psprintf("COMMIT PREPARED '%s'", 
fdwTransState->gid));
+   }

At this point, the host (local) transaction is already committed and 
there is no way to abort it gracefully. However, BroadcastCmd may raise 
an ERROR that will cause a PANIC, since this is a non-recoverable state:


PANIC:  cannot abort transaction 487, it was already committed

Attached is a patch, which implements a plain 2PC in postgres_fdw 
and adds a GUC 'postgres_fdw.use_twophase'. It also solves the error 
handling issues above and tries to add proper comments everywhere. I 
think that 0003 should be rebased on top of it, or it could be the first 
patch in the set, since it may be used independently. What do you 
think?



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From debdffade7abcdbf29031bda6c8359a89776ad36 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Fri, 7 Aug 2020 16:50:57 +0300
Subject: [PATCH] Add postgres_fdw.use_twophase GUC to use 2PC for transactions
 involving several servers.

---
 contrib/postgres_fdw/connection.c   | 234 +---
 contrib/postgres_fdw/postgres_fdw.c |  17 ++
 contrib/postgres_fdw/postgres_fdw.h |   2 +
 3 files changed, 228 insertions(+), 25 deletions(-)

diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c
index 08daf26fdf0..d18fdd1f94e 100644
--- a/contrib/postgres_fdw/connection.c
+++ b/contrib/postgres_fdw/connection.c
@@ -66,6 +66,20 @@ typedef struct ConnCacheEntry
  */
 static HTAB *ConnectionHash = NULL;
 
+/*
+ * FdwTransactionState
+ *
+ * Holds number of open remote transactions and shared state
+ * needed for all connection entries.
+ */
+typedef struct FdwTransactionState
+{
+	char	   *gid;
+	int			nparticipants;
+	bool		two_phase_commit;
+} FdwTransactionState;
+static FdwTransactionState *fdwTransState;
+
 /* for assigning cursor numbers and prepared statement numbers */
 static unsigned int cursor_number = 0;
 static unsigned int prep_stmt_number = 0;
@@ -73,6 +87,9 @@ static unsigned int prep_stmt_number = 0;
 /* tracks whether any work is need

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-09-01 Thread Alexey Kondratov

On 2020-09-01 13:12, Justin Pryzby wrote:
This patch seems to be missing a call to RelationAssumeNewRelfilenode()
in reindex_index().

That's maybe related to the cause of the crashes I pointed out earlier
this year.

Alexey's v4 patch changed RelationSetNewRelfilenode() to accept a
tablespace parameter, but Michael seemed to object to that.  However
that seems cleaner and ~30 lines shorter.

Michael, would you comment on that?  The v4 patch and your comments
are here.

https://www.postgresql.org/message-id/attachment/105574/v4-0001-Allow-REINDEX-and-REINDEX-CONCURRENTLY-to-change-tablespace.patch
https://www.postgresql.org/message-id/20191127035416.GG5435%40paquier.xyz



Actually, the last time we discussed this point I only got the gut 
feeling that this is a subtle place and it is very easy to break things 
with these changes. However, it isn't clear to me how exactly. That 
way, I'd be glad if Michael could reword his explanation, so it'd be 
clearer for me as well.


BTW, I've started doing a review of the last patch set yesterday and 
will try to post some comments later.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-07-28 Thread Alexey Kondratov

On 2020-07-28 03:33, Andrey Lepikhov wrote:

27.07.2020 21:34, Alexey Kondratov пишет:

Hi Andrey,

On 2020-07-23 09:23, Andrey V. Lepikhov wrote:

On 7/16/20 2:14 PM, Amit Langote wrote:

Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Version 5 of the patch. With changes caused by Amit's comments.


Just got a segfault with your v5 patch by deleting from a foreign 
table. It seems that ShutdownForeignScan inside node->fdwroutine 
doesn't have a correct pointer to the required function.


I haven't had a chance to look closer on the code, but you can easily 
reproduce this error with the attached script (patched Postgres 
binaries should be available in the PATH). It works well with master 
and fails with your patch applied.


I used master a3ab7a707d and v5 version of the patch with your script.
No errors found. Can you check your test case?


Yes, my bad. I forgot to re-install the postgres_fdw extension and only 
did it for postgres core, sorry for the noise.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [POC] Fast COPY FROM command for the table with foreign partitions

2020-07-27 Thread Alexey Kondratov

Hi Andrey,

On 2020-07-23 09:23, Andrey V. Lepikhov wrote:

On 7/16/20 2:14 PM, Amit Langote wrote:

Amit Langote
EnterpriseDB: http://www.enterprisedb.com



Version 5 of the patch. With changes caused by Amit's comments.


Just got a segfault with your v5 patch by deleting from a foreign table. 
Here is a part of backtrace:


  * frame #0: 0x0001029069ec 
postgres`ExecShutdownForeignScan(node=0x7ff28c8909b0) at 
nodeForeignscan.c:385:3
frame #1: 0x0001028e7b06 
postgres`ExecShutdownNode(node=0x7ff28c8909b0) at 
execProcnode.c:779:4
frame #2: 0x00010299b3fa 
postgres`planstate_walk_members(planstates=0x7ff28c8906d8, nplans=1, 
walker=(postgres`ExecShutdownNode at execProcnode.c:752), 
context=0x) at nodeFuncs.c:3998:7
frame #3: 0x00010299b010 
postgres`planstate_tree_walker(planstate=0x7ff28c8904c0, 
walker=(postgres`ExecShutdownNode at execProcnode.c:752), 
context=0x) at nodeFuncs.c:3914:8
frame #4: 0x0001028e7ab7 
postgres`ExecShutdownNode(node=0x7ff28c8904c0) at 
execProcnode.c:771:2


(lldb) f 0
frame #0: 0x0001029069ec 
postgres`ExecShutdownForeignScan(node=0x7ff28c8909b0) at 
nodeForeignscan.c:385:3

   382  FdwRoutine *fdwroutine = node->fdwroutine;
   383
   384  if (fdwroutine->ShutdownForeignScan)
-> 385   fdwroutine->ShutdownForeignScan(node);
   386  }
(lldb) p node->fdwroutine->ShutdownForeignScan
(ShutdownForeignScan_function) $1 = 0x7f7f7f7f7f7f7f7f

It seems that ShutdownForeignScan inside node->fdwroutine doesn't have a 
correct pointer to the required function.


I haven't had a chance to look closer on the code, but you can easily 
reproduce this error with the attached script (patched Postgres binaries 
should be available in the PATH). It works well with master and fails 
with your patch applied.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

#!/usr/bin/env sh

pg_ctl -D node1 stop > /dev/null
pg_ctl -D node2 stop > /dev/null

rm -rf node1 node2
rm node1.log node2.log

initdb -D node1
initdb -D node2

echo "port = 5433" >> node2/postgresql.conf

pg_ctl -D node1 -l node1.log start
pg_ctl -D node2 -l node2.log start

createdb
createdb -p5433

psql -p5433 -c "CREATE TABLE test (id INT) PARTITION BY HASH (id)"

psql -c "CREATE EXTENSION IF NOT EXISTS postgres_fdw"
psql -c "CREATE SERVER node2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (port 
'5433'); CREATE USER MAPPING FOR current_user SERVER node2"
psql -c "CREATE FOREIGN TABLE test(id INT) SERVER node2"

psql -c "DELETE FROM test"


Re: Partitioning and postgres_fdw optimisations for multi-tenancy

2020-07-16 Thread Alexey Kondratov

On 2020-07-16 19:35, Etsuro Fujita wrote:

On Thu, Jul 16, 2020 at 8:56 PM Andrey Lepikhov
 wrote:

On 7/16/20 9:55 AM, Etsuro Fujita wrote:



>>>> On Tue, Jul 14, 2020 at 12:48 AM Alexey Kondratov
>>>>  wrote:
>>>>> Some real-life test queries show, that all single-node queries aren't
>>>>> pushed-down to the required node. For example:
>>>>>
>>>>> SELECT
>>>>>   *
>>>>> FROM
>>>>>   documents
>>>>>   INNER JOIN users ON documents.user_id = users.id
>>>>> WHERE
>>>>>   documents.company_id = 5
>>>>>   AND users.company_id = 5;



> PWJ cannot be applied
> to the join due to the limitation of the PWJ matching logic.  See the
> discussion started in [1].  I think the patch in [2] would address
> this issue as well, though the patch is under review.


I think, discussion [1] is little relevant to the current task. Here we
join not on partition attribute and PWJ can't be used at all.


The main point of the discussion is to determine whether PWJ can be
used for a join between partitioned tables, based on
EquivalenceClasses, not just join clauses created by
build_joinrel_restrictlist().  For the above join, for example, the
patch in [2] would derive a join clause "documents.company_id =
users.company_id" from an EquivalenceClass that recorded the knowledge
"documents.company_id = 5" and "users.company_id = 5", and then the
planner would consider from it that PWJ can be used for the join.



Yes, it really worked well. Thank you for the explanation, it wasn't so 
obvious for me as well. That way, I think that the patch from [1] covers 
many cases of joins targeting a single partition / foreign server.


However, there is an issue with aggregates as well. For a query like:

SELECT
count(*)
FROM
documents
WHERE
company_id = 5;

It would be great to teach the planner to understand that it's a 
partition-wise aggregate as well, even without GROUP BY company_id, 
which doesn't always help anyway. I'll try to look closer at this 
problem, but if you have any thoughts about it, then I'd be glad to 
know.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Partitioning and postgres_fdw optimisations for multi-tenancy

2020-07-16 Thread Alexey Kondratov

On 2020-07-16 14:56, Andrey Lepikhov wrote:

On 7/16/20 9:55 AM, Etsuro Fujita wrote:
On Wed, Jul 15, 2020 at 9:02 PM Etsuro Fujita 
 wrote:

On Wed, Jul 15, 2020 at 12:12 AM Alexey Kondratov
 wrote:

On 2020-07-14 15:27, Ashutosh Bapat wrote:

On Tue, Jul 14, 2020 at 12:48 AM Alexey Kondratov
 wrote:
Some real-life test queries show, that all single-node queries aren't
pushed-down to the required node. For example:

SELECT
  *
FROM
  documents
  INNER JOIN users ON documents.user_id = users.id
WHERE
  documents.company_id = 5
  AND users.company_id = 5;


There are a couple of things happening here
1. the clauses on company_id in WHERE clause are causing partition
pruning. Partition-wise join is disabled with partition pruning before
PG13.


More precisely, PWJ cannot be applied when there are no matched
partitions on the nullable side due to partition pruning before PG13.


On reflection, I think I was wrong: the limitation applies to PG13,
even with advanced PWJ.

But the join is an inner join, so I think PWJ can still be applied for
the join.


I think I was wrong in this point as well :-(.  PWJ cannot be applied
to the join due to the limitation of the PWJ matching logic.  See the
discussion started in [1].  I think the patch in [2] would address
this issue as well, though the patch is under review.



Thanks for sharing the links, Fujita-san.



I think, discussion [1] is little relevant to the current task. Here
we join not on partition attribute and PWJ can't be used at all. Here
we can use push-down join of two foreign relations.
We can analyze baserestrictinfo's of outer and inner RelOptInfo's and
may detect that only one partition from outer and inner need to be
joined.
Next, we will create joinrel from RelOptInfo's of these partitions and
replace joinrel of partitioned tables. But it is only rough outline of
a possible solution...



I was a bit skeptical after eyeballing the thread [1], but still tried 
the v3 patch with the current master and my test setup. Surprisingly, it 
just worked, though it isn't clear to me how. With this patch the 
aforementioned simple join is completely pushed down to the foreign 
server. And the speedup is approximately the same (~3 times) as when the 
required partitions are explicitly used in the query.


As a side effect, it also affected join + aggregate queries like:

SELECT
user_id,
count(*) AS documents_count
FROM
documents
INNER JOIN users ON documents.user_id = users.id
WHERE
documents.company_id = 5
AND users.company_id = 5
GROUP BY
user_id;

With patch it is executed as:

 GroupAggregate
   Group Key: documents.user_id
   ->  Sort
 Sort Key: documents.user_id
 ->  Foreign Scan
   Relations: (documents_node2 documents)
   INNER JOIN (users_node2 users)

Without patch its plan was:

 GroupAggregate
   Group Key: documents.user_id
   ->  Sort
 Sort Key: documents.user_id
 ->  Hash Join
   Hash Cond: (documents.user_id = users.id)
   ->  Foreign Scan on documents_node2 documents
   ->  Hash
 ->  Foreign Scan on users_node2 users

I cannot say that it is the most efficient plan in that case, since the 
entire query could be pushed down to the foreign server, but still it 
gives a 5-10% speedup on my setup.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Partitioning and postgres_fdw optimisations for multi-tenancy

2020-07-14 Thread Alexey Kondratov

On 2020-07-14 15:27, Ashutosh Bapat wrote:

On Tue, Jul 14, 2020 at 12:48 AM Alexey Kondratov
 wrote:

I built a simple two node multi-tenant schema for tests, which can be
easily set up with attached scripts. It creates three tables 
(companies,
users, documents) distributed over two nodes. Everything can be found in
this Gist [2] as well.

Some real-life test queries show, that all single-node queries aren't
pushed-down to the required node. For example:

SELECT
 *
FROM
 documents
 INNER JOIN users ON documents.user_id = users.id
WHERE
 documents.company_id = 5
 AND users.company_id = 5;


There are a couple of things happening here
1. the clauses on company_id in WHERE clause are causing partition
pruning. Partition-wise join is disabled with partition pruning before
PG13. In PG13 we have added advanced partition matching algorithm
which will allow partition-wise join with partition pruning.



I forgot to mention that I use a recent master (991c444e7a) for tests 
with


enable_partitionwise_join = 'on'
enable_partitionwise_aggregate = 'on'

of course. I've also tried postgres_fdw.use_remote_estimate = true 
followed by ANALYSE on both nodes (it is still used in setup.sh script).


BTW, can you, please, share a link to commit / thread about allowing 
partition-wise join and partition pruning to work together in PG13?




2. the query has no equality condition on the partition key of the
tables being joined. Partitionwise join is possible only when there's
an equality condition on the partition keys (company_id) of the
joining tables. PostgreSQL's optimizer is not smart enough to convert
the equality conditions in WHERE clause into equality conditions on
partition keys. So having those conditions just in WHERE clause does
not help. Instead please add equality conditions on partition keys in
JOIN .. ON clause or WHERE clause (only for INNER join).



With adding documents.company_id = users.company_id

SELECT *
FROM
documents
INNER JOIN users ON (documents.company_id = users.company_id
AND documents.user_id = users.id)
WHERE
documents.company_id = 5
AND users.company_id = 5;

query plan remains the same.



executed as following

   QUERY PLAN
---
  Nested Loop
Join Filter: (documents.user_id = users.id)
->  Foreign Scan on users_node2 users
->  Materialize
  ->  Foreign Scan on documents_node2 documents

i.e. it uses two foreign scans and does the final join locally. However,
once I specify target partitions explicitly, then the entire query is
pushed down to the foreign node:

QUERY PLAN
-
  Foreign Scan
Relations: (documents_node2) INNER JOIN (users_node2)

Execution time is dropped significantly as well — by more than 3 times
even for this small test database. Situation for simple queries with
aggregates or joins and aggregates followed by the sharding key filter
is the same. Something similar was briefly discussed in this thread 
[3].


IIUC, it means that push-down of queries through the postgres_fdw works
perfectly well, the problem is with partition-wise operation detection
at the planning time. Currently, the partition-wise aggregate routines,
e.g., look for a GROUP BY and check whether the sharding key exists
there or not. After that the PARTITIONWISE_AGGREGATE_* flag is set.
However, it doesn't look at the content of the WHERE clause, so frankly
speaking it isn't a problem, this functionality is just not yet
implemented.

Actually, sometimes I was able to push down queries with aggregate
simply by adding an additional GROUP BY with sharding key, like this:

SELECT
 count(*)
FROM
 documents
WHERE
 company_id = 5
GROUP BY company_id;


This gets pushed down since GROUP BY clause is on the partition key.



Sure, but it only works *sometimes*; most of such simple queries with 
aggregates are not pushed down, e.g.:


SELECT
sum(id)
FROM
documents_node2
WHERE
company_id = 5
GROUP BY
company_id;

whether 'GROUP BY company_id' is used or not.



Although it seems that it will be easier to start with aggregates,
probably we should initially plan a more general solution? For example,
check that all involved tables are filtered by partitioning key and push
down the entire query if all of them target the same foreign server.

Any thoughts?


I think adding just equality conditions on the partition key will be
enough. No need for any code change.


So, it hasn't helped. Maybe I could modify some costs to verify that 
push-down of such joins is ever possible?


Anyway, what about aggregates? Partition-wise aggregates work fine for 
queries like


SELECT
count(*)
FROM
documents
GROUP BY
company_id;

but once I narrow it to a single partition with 'WHERE company_id = 5', 
then it is being executed in a very inefficient way — takes

Partitioning and postgres_fdw optimisations for multi-tenancy

2020-07-13 Thread Alexey Kondratov

Hi Hackers,

The idea of achieving Postgres scaling via sharding using postgres_fdw + 
partitioning got a lot of attention last years. Many optimisations have 
been done in this direction: partition pruning, partition-wise 
aggregates / joins, postgres_fdw push-down of LIMIT, GROUP BY, etc. In 
many cases they work really nice.


However, there is still a broad case where postgres_fdw + native 
partitioning doesn't perform so well — multi-tenant architecture. From 
the database perspective it is presented well in this Citus tutorial 
[1]. The main idea is that there is a number of tables and all of them 
are sharded / partitioned by the same key, e.g. company_id. That way, if 
every company mostly works within its own data, then every query may be 
effectively executed on a single node without a need for internode 
communication.


I built a simple two node multi-tenant schema for tests, which can be 
easily set up with attached scripts. It creates three tables (companies, 
users, documents) distributed over two nodes. Everything can be found in 
this Gist [2] as well.


Some real-life test queries show, that all single-node queries aren't 
pushed-down to the required node. For example:


SELECT
*
FROM
documents
INNER JOIN users ON documents.user_id = users.id
WHERE
documents.company_id = 5
AND users.company_id = 5;

executed as following

  QUERY PLAN
---
 Nested Loop
   Join Filter: (documents.user_id = users.id)
   ->  Foreign Scan on users_node2 users
   ->  Materialize
 ->  Foreign Scan on documents_node2 documents

i.e. it uses two foreign scans and does the final join locally. However, 
once I specify target partitions explicitly, then the entire query is 
pushed down to the foreign node:


   QUERY PLAN
-
 Foreign Scan
   Relations: (documents_node2) INNER JOIN (users_node2)

Execution time is dropped significantly as well — by more than 3 times 
even for this small test database. Situation for simple queries with 
aggregates or joins and aggregates followed by the sharding key filter 
is the same. Something similar was briefly discussed in this thread [3].


IIUC, it means that push-down of queries through the postgres_fdw works 
perfectly well, the problem is with partition-wise operation detection 
at the planning time. Currently, the partition-wise aggregate routines, 
e.g., look for a GROUP BY and check whether the sharding key exists there 
or not. After that the PARTITIONWISE_AGGREGATE_* flag is set. However, it 
doesn't look at the content of the WHERE clause, so frankly speaking it 
isn't a problem, this functionality is just not yet implemented.


Actually, sometimes I was able to push down queries with aggregate 
simply by adding an additional GROUP BY with sharding key, like this:


SELECT
count(*)
FROM
documents
WHERE
company_id = 5
GROUP BY company_id;

where this GROUP BY obviously doesn't change the results, it just allows 
the planner to choose from more possible paths.


Also, I have tried to hack it a bit and forcedly set 
PARTITIONWISE_AGGREGATE_FULL for this particular query. Everything 
executed fine and returned result was correct, which means that all 
underlying machinery is ready.


That way, I propose a change to the planner, which will check whether 
the partitioning key exists in the WHERE clause and will set the 
PARTITIONWISE_AGGREGATE_* flags if appropriate. The whole logic may look 
like:


1. If only one condition on the partitioning key is used (like above), 
then it is PARTITIONWISE_AGGREGATE_FULL.
2. If several conditions are used, then it should be 
PARTITIONWISE_AGGREGATE_PARTIAL.
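
For illustration, this is how I would read these two cases, assuming 
company_id is the partitioning key as in the test schema (the IN-list 
query is only my own example of the second case):

-- case 1: a single condition on the partitioning key, the whole
-- aggregate could be computed on one partition / foreign server
SELECT count(*) FROM documents WHERE company_id = 5;

-- case 2: several partitions are involved, so only a partial
-- partition-wise aggregate would be possible
SELECT count(*) FROM documents WHERE company_id IN (5, 6);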


I'm aware that the WHERE clause may be extremely complex in general, but 
we could narrow this possible optimisation to the same restrictions as 
for postgres_fdw push-down: "only WHERE clauses using built-in operators 
and functions will be considered for execution on the remote server".


Although it seems that it will be easier to start with aggregates, 
probably we should initially plan a more general solution? For example, 
check that all involved tables are filtered by partitioning key and push 
down the entire query if all of them target the same foreign server.


Any thoughts?


[1] 
https://docs.citusdata.com/en/v9.3/get_started/tutorial_multi_tenant.html

[2] https://gist.github.com/ololobus/8fba33241f68be2e3765d27bf04882a3
[3] 
https://www.postgresql.org/message-id/flat/CAFT%2BaqL1Tt0qfYqjHH%2BshwPoW8qdFjpJ8vBR5ABoXJDUcHyN1w%40mail.gmail.com


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

DROP TABLE IF EXISTS companies CASCADE;
DROP TABLE IF EXISTS users CASCADE;
DROP TABLE IF EXISTS documents CASCADE;
DROP SERVER IF EXISTS node2 CASCADE;

CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER node2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (p

Re: [PATCH] Allow to specify restart_lsn in pg_create_physical_replication_slot()

2020-06-26 Thread Alexey Kondratov

On 2020-06-23 04:18, Michael Paquier wrote:

On Mon, Jun 22, 2020 at 08:18:58PM +0300, Alexey Kondratov wrote:

Things get worse when we allow specifying an older LSN, since it has a
higher chances to be at the horizon of deletion by checkpointer. 
Anyway, if
I get it correctly, with a current patch slot will be created 
successfully,

but will be obsolete and should be invalidated by the next checkpoint.


Is that a behavior acceptable for the end user?  For example, a
physical slot that is created to immediately reserve WAL may get
invalidated, causing it to actually not keep WAL around contrary to
what the user has wanted the command to do.



I can imagine that it could be acceptable in the initially proposed 
scenario for someone, since creation of a slot with a historical 
restart_lsn is already unpredictable — the required segment may or may 
not exist. However, adding undefined behaviour here even after the slot 
creation does not look good to me anyway.


I have looked closely at the checkpointer code and another problem is 
that it decides once which WAL segments to delete based on the 
replicationSlotMinLSN, and does not check anything before the actual 
file deletion. That way the gap for a possible race is even wider. I do 
not know how to completely get rid of this race without introducing some 
locking mechanism, which may be costly.


Thanks for feedback
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [PATCH] Allow to specify restart_lsn in pg_create_physical_replication_slot()

2020-06-22 Thread Alexey Kondratov

On 2020-06-19 21:57, Fujii Masao wrote:

On 2020/06/19 23:20, Alexey Kondratov wrote:

On 2020-06-19 03:59, Michael Paquier wrote:

On Thu, Jun 18, 2020 at 03:39:09PM +0300, Vyacheslav Makarov wrote:
If the WAL segment for the specified restart_lsn (STOP_LSN of the 
backup)
exists, then the function will create a physical replication slot 
and will
keep all the WAL segments required by the replica to catch up with 
the
primary. Otherwise, it returns error, which means that the required 
WAL
segments have been already utilised, so we do need to take a new 
backup.

Without passing this newly added parameter
pg_create_physical_replication_slot() works as before.

What do you think about this?


Currently pg_create_physical_replication_slot() and CREATE_REPLICATION_SLOT
replication command seem to be "identical". So if we add new option into one,
we should add it also into another?



I wonder how it could be used via the replication protocol, but probably 
this option should be added there as well for consistency.




What happen if future LSN is specified in restart_lsn? With the patch,
in this case, if the segment at that LSN exists (e.g., because it's
recycled one), the slot seems to be successfully created. However if the
LSN is far future and the segment doesn't exist, the creation of slot
seems to fail.

This behavior looks fragile and confusing. We should accept future LSN
whether its segment currently exists or not?



But what about a possible timeline switch? If we allow specifying an LSN 
as far in the future as one wants, then the segment containing the 
specified LSN may end up being created on a different timeline if a 
switch happens, so it may be misleading. I am not even sure about 
allowing future LSNs for existing segments, since PITR / a timeline 
switch may occur just after the slot creation, so the pointer may never 
be valid. Would it be better to completely disallow future LSNs?


And here I noticed another moment in the patch. The TimeLineID of the 
last restart/checkpoint is used to detect whether the WAL segment file 
exists or not. It means that if we try to create a slot just after a 
timeline switch, then we cannot specify the oldest LSN actually available 
on disk, since it may be from the previous timeline. One can use an LSN 
only within the current timeline. It seems to be fine, but should be 
covered in the docs.




+   if (!RecoveryInProgress() && !SlotIsLogical(MyReplicationSlot))

With the patch, the given restart_lsn seems to be ignored during recovery.
Why?



I have the same question, not sure that this is needed here. It looks 
more like a forgotten copy-paste from ReplicationSlotReserveWal().




I think that this was discussed in the past (perhaps one of the
threads related to WAL advancing actually?),



I have searched through the archives a bit and found one thread 
related to slots advancing [1]. It was dedicated to a problem of 
advancing slots which do not reserve WAL yet, if I get it correctly. 
Although it is somehow related to the topic, it was a slightly 
different issue, IMO.




and this stuff is full of
holes when it comes to think about error handling with checkpoints
running in parallel, potentially doing recycling of segments you 
would

expect to be around based on your input value for restart_lsn *while*
pg_create_physical_replication_slot() is still running and
manipulating the on-disk slot information.
...


These are the right concerns, but all of them should be applicable to 
the pg_create_physical_replication_slot() + immediately_reserve == 
true in the same way, doesn't it? I think so, since in that case we 
are doing a pretty similar thing — trying to reserve some WAL segment 
that may be concurrently deleted.


And this is exactly the reason why ReplicationSlotReserveWal() does it 
in several steps in a loop:


1. Creates a slot with some restart_lsn.
2. Does ReplicationSlotsComputeRequiredLSN() to prevent removal of the 
WAL segment with this restart_lsn.

3. Checks that required WAL segment is still there.
4. Repeat if this attempt to prevent WAL removal has failed.


What happens if concurrent checkpoint decides to remove the segment
at restart_lsn before #2 and then actually removes it after #3?
The replication slot is successfully created with the given restart_lsn,
but the reserved segment has already been removed?



I thought about it a bit more and it seems that yes, there is a race even 
for the current pg_create_physical_replication_slot() + 
immediately_reserve == true, i.e. ReplicationSlotReserveWal(). However, 
the chance is very slim, since we take a current GetRedoRecPtr() there. 
Probably one could reproduce it with wal_keep_segments = 1 by holding / 
releasing the backend doing the slot creation and the checkpointer with 
gdb, but I am not sure that it is an issue anywhere in the real world.


Maybe I am wrong, but it is not clear for me why current 
ReplicationSlotReserveWal() routine does not have tha

Re: [PATCH] Allow to specify restart_lsn in pg_create_physical_replication_slot()

2020-06-19 Thread Alexey Kondratov

On 2020-06-19 03:59, Michael Paquier wrote:

On Thu, Jun 18, 2020 at 03:39:09PM +0300, Vyacheslav Makarov wrote:
If the WAL segment for the specified restart_lsn (STOP_LSN of the backup)
exists, then the function will create a physical replication slot and will
keep all the WAL segments required by the replica to catch up with the
primary. Otherwise, it returns error, which means that the required WAL
segments have been already utilised, so we do need to take a new backup.

Without passing this newly added parameter
pg_create_physical_replication_slot() works as before.

What do you think about this?


I think that this was discussed in the past (perhaps one of the
threads related to WAL advancing actually?),



I have searched through the archives a bit and found one thread related 
to slots advancing [1]. It was dedicated to a problem of advancing slots 
which do not reserve WAL yet, if I get it correctly. Although it is 
somehow related to the topic, it was a slightly different issue, IMO.




and this stuff is full of
holes when it comes to think about error handling with checkpoints
running in parallel, potentially doing recycling of segments you would
expect to be around based on your input value for restart_lsn *while*
pg_create_physical_replication_slot() is still running and
manipulating the on-disk slot information. I suspect that this also
breaks a couple of assumptions behind concurrent calls of the minimum
LSN calculated across slots when a caller sees fit to recompute the
thresholds (WAL senders mainly here, depending on the replication
activity).



These are the right concerns, but all of them should be applicable to 
pg_create_physical_replication_slot() + immediately_reserve == true 
in the same way, shouldn't they? I think so, since in that case we are 
doing a pretty similar thing — trying to reserve some WAL segment that 
may be concurrently deleted.


And this is exactly the reason why ReplicationSlotReserveWal() does it 
in several steps in a loop:


1. Creates a slot with some restart_lsn.
2. Does ReplicationSlotsComputeRequiredLSN() to prevent removal of the 
WAL segment with this restart_lsn.

3. Checks that required WAL segment is still there.
4. Repeat if this attempt to prevent WAL removal has failed.

I guess that the only difference in the case of the proposed scenario is 
that we do not have a chance for step 4, since we need some specific 
restart_lsn, not just any recent restart_lsn, i.e. in this case we have to:

1. Create a slot with the restart_lsn specified by the user.
2. Do ReplicationSlotsComputeRequiredLSN() to prevent WAL removal.
3. Check that the required WAL segment is still there and report an ERROR 
to the user if it is not.
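
For example, a backup tool following this scenario could do something 
like the following right after the backup (the extra restart_lsn 
argument here is hypothetical and only sketches the interface discussed 
in this thread; the LSN value is illustrative):

-- hypothetical call with the proposed restart_lsn parameter
SELECT pg_create_physical_replication_slot('after_backup_slot', true, false, '0/3000158');
-- an ERROR is expected if the segment holding '0/3000158' is already gone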


I have eyeballed the attached patch and it looks like it does exactly 
the same, so issues with concurrent deletion are not obvious to me. Or, 
there should be the same issues for 
pg_create_physical_replication_slot() + immediately_reserve == true with 
the current master implementation.


[1] 
https://www.postgresql.org/message-id/flat/20180626071305.GH31353%40paquier.xyz



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Physical replication slot advance is not persistent

2020-06-18 Thread Alexey Kondratov

On 2020-06-16 10:27, Michael Paquier wrote:

On Wed, Jun 10, 2020 at 08:57:17PM +0300, Alexey Kondratov wrote:
New test reproduces this issue well. Left it running for a couple of 
hours

in repeat and it seems to be stable.


Thanks for testing.  I have been thinking about the minimum xmin and
LSN computations on advancing, and actually I have switched the
recomputing to be called at the end of pg_replication_slot_advance().
This may be a waste if no advancing is done, but it could also be an
advantage to enforce a recalculation of the thresholds for each
function call.  And that's more consistent with the slot copy, drop
and creation.



Sorry for the slightly late response, but I see a couple of issues with 
this modified version of the patch, in addition to the wasted call if no 
advancing is done, mentioned by you:


1. Both ReplicationSlotsComputeRequiredXmin() and 
ReplicationSlotsComputeRequiredLSN() may have already been done in 
LogicalConfirmReceivedLocation() if it was a logical slot. It may be 
fine and almost costless to do it twice, but it looks untidy to me.


2. It seems that we do not need ReplicationSlotsComputeRequiredXmin() at 
all if it was a physical slot, since we do not modify xmin in 
pg_physical_replication_slot_advance(), do we?


That's why I wanted (somewhere around v5 of the patch in this thread) to 
move all dirtying and recomputing calls to the places, where xmin / lsn 
slot modifications are actually done — 
pg_physical_replication_slot_advance() and 
LogicalConfirmReceivedLocation(). LogicalConfirmReceivedLocation() 
already does this, so we only needed to teach 
pg_physical_replication_slot_advance() to do the same.


However, I just noted that LogicalConfirmReceivedLocation() only does 
ReplicationSlotsComputeRequiredLSN() if the updated_xmin flag was set, 
which looks wrong from my perspective, since the updated_xmin and 
updated_restart flags are set separately.


That way, I would solve this all as per the attached, which works well 
for me, but it definitely deserves better testing.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 61902be3b0..2ceb078418 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1081,8 +1081,14 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 			SpinLockRelease(&MyReplicationSlot->mutex);
 
 			ReplicationSlotsComputeRequiredXmin(false);
-			ReplicationSlotsComputeRequiredLSN();
 		}
+
+		/*
+		 * Now recompute the minimum required LSN across all slots if
+		 * restart_lsn of the slot has been updated.
+		 */
+		if (updated_restart)
+			ReplicationSlotsComputeRequiredLSN();
 	}
 	else
 	{
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 06e4955de7..d9c19c07e5 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -419,6 +419,12 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 		 * crash, but this makes the data consistent after a clean shutdown.
 		 */
 		ReplicationSlotMarkDirty();
+
+		/*
+		 * Recompute the minimum required LSN across all slots to adjust with
+		 * the advancing done.
+		 */
+		ReplicationSlotsComputeRequiredLSN();
 	}
 
 	return retlsn;
@@ -621,13 +627,6 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	values[0] = NameGetDatum(&MyReplicationSlot->data.name);
 	nulls[0] = false;
 
-	/*
-	 * Recompute the minimum LSN and xmin across all slots to adjust with the
-	 * advancing potentially done.
-	 */
-	ReplicationSlotsComputeRequiredXmin(false);
-	ReplicationSlotsComputeRequiredLSN();
-
 	ReplicationSlotRelease();
 
 	/* Return the reached position. */


Re: Physical replication slot advance is not persistent

2020-06-10 Thread Alexey Kondratov

On 2020-06-10 11:38, Kyotaro Horiguchi wrote:

At Wed, 10 Jun 2020 15:53:53 +0900, Michael Paquier
 wrote in

> > I find it really depressing how much obviously untested stuff gets
> > added in this area.
>
> Prior to this patch pg_replication_slot_advance was not being tested
> at all.
> Unfortunately, added tests appeared to be not enough to cover all
> cases. It
> seems that the whole machinery of WAL holding and trimming is worth
> to be
> tested more thoroughly.

I think that it would be interesting if we had a SQL representation of
the contents of XLogCtlData (wanted that a couple of times).  Now we
are actually limited to use a checkpoint and check that past segments
are getting recycled by looking at the contents of pg_wal.  Doing that
here does not cause the existing tests to be much more expensive as we
only need one extra call to pg_switch_wal(), mostly.  Please see the
attached.


The test in the patch looks fine to me and worked well for me.

Using smaller wal_segment_size (1(MB) worked for me) reduces the cost
of the check, but I'm not sure it's worth doing.



The new test reproduces this issue well. I left it running in a loop for 
a couple of hours and it seems to be stable.


Just noted that we do not need to keep $phys_restart_lsn_pre:

my $phys_restart_lsn_pre = $node_master->safe_psql('postgres',
	"SELECT restart_lsn from pg_replication_slots WHERE slot_name = 
'$phys_slot';"

);
chomp($phys_restart_lsn_pre);

we can safely use the $current_lsn used for pg_replication_slot_advance(), 
since restart_lsn is set to it as-is there. It may make the test a bit 
simpler as well.



--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Physical replication slot advance is not persistent

2020-06-09 Thread Alexey Kondratov

On 2020-06-09 20:19, Andres Freund wrote:

Hi,

On 2020-01-28 17:01:14 +0900, Michael Paquier wrote:

So attached is an updated patch which addresses the problem just by
marking a physical slot as dirty if any advancing is done.  Some
documentation is added about the fact that an advance is persistent
only at the follow-up checkpoint.  And the tests are fixed to not use
a fake LSN but instead advance to the latest LSN position produced.




-   /* Update the on disk state when lsn was updated. */
-   if (XLogRecPtrIsInvalid(endlsn))
-   {
-   ReplicationSlotMarkDirty();
-   ReplicationSlotsComputeRequiredXmin(false);
-   ReplicationSlotsComputeRequiredLSN();
-   ReplicationSlotSave();
-   }
-


I am quite confused by the wholesale removal of these lines. That 
wasn't

in previous versions of the patch. As far as I can tell not calling
ReplicationSlotsComputeRequiredLSN() for the physical slot leads to the
global minimum LSN never beeing advanced, and thus WAL reserved by the
slot not being removable.  Only if there's some independent call to
ReplicationSlotsComputeRequiredLSN() XLogSetReplicationSlotMinimumLSN()
will be called, allowing for slots to advance.

I realize this stuff has been broken since the introduction in
9c7d06d6068 (due to the above being if (XLogRecPtrIsInvalid()) rather
than if (!XLogRecPtrIsInvalid()) , but this seems to make it even 
worse?




Yes, there was a ReplicationSlotsComputeRequiredLSN() call inside 
pg_physical_replication_slot_advance() in v5 of the patch:


@@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr 
moveto)

MyReplicationSlot->data.restart_lsn = moveto;
SpinLockRelease(&MyReplicationSlot->mutex);
retlsn = moveto;
+
+   ReplicationSlotMarkDirty();
+
+   /* We moved retart_lsn, update the global value. */
+   ReplicationSlotsComputeRequiredLSN();

But later it got lost and we did not notice that.

I think that adding it back as per attached will be enough.



I find it really depressing how much obviously untested stuff gets
added in this area.



Prior to this patch pg_replication_slot_advance was not being tested at 
all. Unfortunately, the added tests appeared to be insufficient to cover 
all cases. It seems that the whole machinery of WAL holding and trimming 
is worth testing more thoroughly.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 1b929a603e..8e543e276f 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -419,6 +419,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 		 * crash, but this makes the data consistent after a clean shutdown.
 		 */
 		ReplicationSlotMarkDirty();
+
+		/*
+		 * We advanced retart_lsn, update the global minimum required value.
+		 */
+		ReplicationSlotsComputeRequiredLSN();
 	}
 
 	return retlsn;


Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-06-08 Thread Alexey Kondratov

On 2020-06-08 09:14, Michael Paquier wrote:

On Sun, Jun 07, 2020 at 10:05:03PM +0300, Alexander Korotkov wrote:

On Sat, Jun 6, 2020 at 8:49 PM Peter Eisentraut
 wrote:
Why is fe_archive.c in src/common/ and not in src/fe_utils/?  It is not
"common" to frontend and backend.


Yep, this seems wrong to me.  I can propose following file rename.

src/common/fe_archive.c => src/fe_utils/archive.c
include/common/fe_archive.h => include/fe_utils/archive.h


Is it technically a problem though?  fe_archive.c is listed as a
frontend-only file with OBJS_FRONTEND in src/common/Makefile.  Anyway,
for this one I would not mind to do the move so please see the
attached, but I am not completely sure either why src/fe_utils/ had
better be chosen than src/common/.





Perhaps we have more things that are frontend-only in src/common/ that
could be moved to src/fe_utils/?  I am looking at restricted_token.c,
fe_memutils.c, logging.c and file_utils.c here, but that would mean
breaking a couple of declarations, something never free for plugin
developers.



I noticed this before in the thread [1], but I guess it went unnoticed:


"I just went through the both patches and realized that I cannot get 
into

semantics of splitting frontend code between common and fe_utils. This
applies only to 0002, where we introduce fe_archive.c. Should it be
placed into fe_utils alongside of the recent recovery_gen.c also used by
pg_rewind? This is a frontend-only code intended to be used by frontend
applications, so fe_utils feels like the right place, doesn't it? Just
tried to do so and everything went fine, so it seems that there is no
obstacles from the build system.

BTW, most of 'common' is a really common code with only four exceptions
like logging.c, which is frontend-only. Is it there for historical
reasons only or something else?"

Personally, I would prefer that everything in 'common' was actually 
common. I am also not sure about moving the older code, because of 
possible backward compatibility breakage, but doing so for the newer 
code seems to be a good idea.




It actually defines functions that clash with functions in the backend,
so this seems doubly wrong.


Let's have frontend version of RestoreArchivedFile() renamed as well.
What about RestoreArchivedFileFrontend()?


-1 from me for the renaming, which was intentional to ease grepping
with the backend counterpart.  We have other cases like that, say
palloc(), fsync_fname(), etc.



I do not like it either. Having the same naming and a guarantee that this 
code is always compiled separately looks more convenient to me.


[1] 
https://www.postgresql.org/message-id/784fa7dc-414b-9dc9-daae-138033db298c%40postgrespro.ru



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-10 Thread Alexey Kondratov

On 2020-04-10 05:25, Fujii Masao wrote:

On 2020/04/10 3:16, Alexey Kondratov wrote:
Just another idea in case if one will still decide to go with a 
separate statement + BEGIN integration instead of a function. We could 
use parenthesized options list here. This is already implemented for 
VACUUM, REINDEX, etc. There was an idea to allow CONCURRENTLY in 
REINDEX there [1] and recently this was proposed again for new options 
[2], since it is much more extensible from the grammar perspective.


That way, the whole feature may look like:

WAIT (LSN '16/B374D848', TIMEOUT 100);

and/or

BEGIN
WAIT (LSN '16/B374D848', WHATEVER_OPTION_YOU_WANT);
...
COMMIT;

It requires only one reserved keyword 'WAIT'. The advantage of this 
approach is that it can be extended to support xid, timestamp, csn or 
anything else, that may be invented in the future, without affecting 
the grammar.


What do you think?

Personally, I find this syntax to be more convenient and 
human-readable compared with function call:


SELECT pg_wait_for_lsn('16/B374D848');
BEGIN;


I can imagine that some users want to specify the LSN to wait for,
from the result of another query, for example,
SELECT pg_wait_for_lsn(lsn) FROM xxx. If this is valid use case,
isn't the function better?



I think that the main purpose of the feature is to achieve 
read-your-writes consistency while using an async replica for reads. In 
that case the LSN of the last modification is stored inside the 
application, so there is no need to do any query for that. Moreover, you 
cannot store this LSN inside the database, since reads are distributed 
across all replicas (+ the primary).
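
To make the intended flow concrete, a rough sketch using the function 
form quoted above (pg_wait_for_lsn() is the proposed function, not an 
existing one; the application keeps the LSN between the two steps):

-- on the primary, right after the write
SELECT pg_current_wal_lsn();            -- application stores, e.g., '16/B374D848'

-- later, on whichever node serves the read
SELECT pg_wait_for_lsn('16/B374D848');  -- wait until the write is replayed
-- any subsequent read in this session now sees that write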


Thus, I could imagine that 'xxx' in your example stands for some kind of 
stored procedure that fetches the LSN from off-Postgres storage, but it 
looks like a very narrow case to count on, doesn't it?


Anyway, I am not against implementing this as a function. That was just 
another option to consider.


Just realized that the last patch I have seen does not allow usage of 
WAIT on the primary. It may be a problem if reads are spread not only 
across replicas, but over the primary as well, which should be quite 
usual, I guess. In that case the application does not know whether a 
request will be processed on a replica or on the primary. I think it 
should be allowed without any warning, or with at most a LOG/DEBUG 
message saying that no waiting was performed.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-09 Thread Alexey Kondratov

On 2020-04-09 16:33, Tom Lane wrote:

Fujii Masao  writes:

On 2020/04/09 16:11, Kyotaro Horiguchi wrote:
At Wed, 08 Apr 2020 16:35:46 -0400, Tom Lane  
wrote in

Why is this getting grafted onto BEGIN/START TRANSACTION in the
first place?


The rationale for not being a fmgr function is stated in the following
comments. [...]


This issue happens because the function is executed after BEGIN? If yes,
what about executing the function (i.e., as separate transaction) before
BEGIN? If so, the snapshot taken in the function doesn't affect the
subsequent transaction whatever its isolation level is.


I wonder whether making it a procedure, rather than a plain function,
would help any.



Just another idea in case one still decides to go with a separate 
statement + BEGIN integration instead of a function. We could use a 
parenthesized options list here. This is already implemented for VACUUM, 
REINDEX, etc. There was an idea to allow CONCURRENTLY in REINDEX there 
[1] and recently this was proposed again for new options [2], since it 
is much more extensible from the grammar perspective.


That way, the whole feature may look like:

WAIT (LSN '16/B374D848', TIMEOUT 100);

and/or

BEGIN
WAIT (LSN '16/B374D848', WHATEVER_OPTION_YOU_WANT);
...
COMMIT;

It requires only one reserved keyword, 'WAIT'. The advantage of this 
approach is that it can be extended to support xid, timestamp, csn or 
anything else that may be invented in the future, without affecting the 
grammar.
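
Purely as an illustration of that extensibility (none of these variants 
exists anywhere yet, they just follow the same options-list shape):

WAIT (XID '1234567');
WAIT (TIMESTAMP '2020-04-09 16:00:00+03', TIMEOUT 100);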


What do you think?

Personally, I find this syntax to be more convenient and human-readable 
compared with function call:


SELECT pg_wait_for_lsn('16/B374D848');
BEGIN;


[1] 
https://www.postgresql.org/message-id/aad2ec49-5142-7356-ffb2-a9b2649cdd1f%402ndquadrant.com


[2] 
https://www.postgresql.org/message-id/20200401060334.GB142683%40paquier.xyz



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-04-07 Thread Alexey Kondratov

On 2020-04-06 21:44, Justin Pryzby wrote:

On Mon, Apr 06, 2020 at 08:43:46PM +0300, Alexey Kondratov wrote:


+/* XXX: reusing reindex_option_list */
+   | CLUSTER opt_verbose '(' reindex_option_list ')' qualified_name
cluster_index_specification

Could we actually simply reuse vac_analyze_option_list? From the first
sight it does just the right thing, excepting the special handling of
spelling ANALYZE/ANALYSE, but it does not seem to be a problem.


Hm, do you mean to let cluster.c reject the other options like
"analyze" ?

I'm not sure why that would be better than reusing reindex?
I think the suggestion will probably be to just copy+paste the reindex
option list and rename it to cluster (possibly with the explanation that
they're separate and independent and so their behavior shouldn't be tied
together).




I mean to literally use vac_analyze_option_list for reindex and cluster 
as well. Please check the attached 0007. Now vacuum, reindex and cluster 
filter the options list and reject everything that is not supported, so 
it seems completely fine to just reuse vac_analyze_option_list, doesn't 
it?




ReindexRelationConcurrently is used for all cases, but it hits different
code paths in the case of database, table and index. I have not checked
yet, but are you sure it is safe removing these validations in the case
of REINDEX CONCURRENTLY?


You're right about the pg_global case, fixed.  System catalogs can't be
reindexed CONCURRENTLY, so they're already caught by that check.


> XXX: for cluster/vacuum, it might be more friendly to check before
> clustering
> the table, rather than after clustering and re-indexing.

Yes, I think it would be much more user-friendly.


I realized it's not needed or useful to check indexes in advance of
clustering, since 1) a mapped index will be on a mapped relation, which
is already checked; 2) a system index will be on a system relation.
Right ?



Yes, it seems that you are right. I have tried to create a user index on 
a system relation with allow_system_table_mods=1, but this new index 
turned out to be a system index as well. That way, we do not have to 
check indexes in advance.
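
Roughly, the experiment was like this (a sketch; the exact index 
definition does not matter):

SET allow_system_table_mods = on;
CREATE INDEX my_pg_class_idx ON pg_class (relname);
-- the new index is itself treated as a system relation, so the existing
-- system-relation check already covers it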



--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From afa132674d2b49328a5e95ecd2ff30d838f7ec6c Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Tue, 7 Apr 2020 13:42:42 +0300
Subject: [PATCH v18 7/7] Reuse vac_analyze_option_list for cluster and reindex

---
 src/backend/parser/gram.y | 38 +++---
 1 file changed, 3 insertions(+), 35 deletions(-)

diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
index 3665ee8700..7821f07c40 100644
--- a/src/backend/parser/gram.y
+++ b/src/backend/parser/gram.y
@@ -512,10 +512,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query);
 %type 	explain_option_list
 
 %type 	reindex_target_type reindex_target_multitable
-%type 		reindex_option_name
-%type 	reindex_option_arg
-%type 	reindex_option_list
-%type 	reindex_option_elem
 
 %type 	copy_generic_opt_arg copy_generic_opt_arg_list_item
 %type 	copy_generic_opt_elem
@@ -8428,7 +8424,7 @@ ReindexStmt:
 	n->params = NIL;
 	$$ = (Node *)n;
 }
-			| REINDEX '(' reindex_option_list ')' reindex_target_type opt_concurrently qualified_name
+			| REINDEX '(' vac_analyze_option_list ')' reindex_target_type opt_concurrently qualified_name
 {
 	ReindexStmt *n = makeNode(ReindexStmt);
 	n->kind = $5;
@@ -8438,7 +8434,7 @@ ReindexStmt:
 	n->params = $3;
 	$$ = (Node *)n;
 }
-			| REINDEX '(' reindex_option_list ')' reindex_target_multitable opt_concurrently name
+			| REINDEX '(' vac_analyze_option_list ')' reindex_target_multitable opt_concurrently name
 {
 	ReindexStmt *n = makeNode(ReindexStmt);
 	n->kind = $5;
@@ -8458,33 +8454,6 @@ reindex_target_multitable:
 			| SYSTEM_P{ $$ = REINDEX_OBJECT_SYSTEM; }
 			| DATABASE{ $$ = REINDEX_OBJECT_DATABASE; }
 		;
-reindex_option_list:
-			reindex_option_elem
-{
-	$$ = list_make1($1);
-}
-			| reindex_option_list ',' reindex_option_elem
-{
-	$$ = lappend($1, $3);
-}
-		;
-
-reindex_option_elem:
-			reindex_option_name reindex_option_arg
-{
-	$$ = makeDefElem($1, $2, @1);
-}
-		;
-
-reindex_option_name:
-			NonReservedWord			{ $$ = $1; }
-		;
-
-reindex_option_arg:
-			opt_boolean_or_string	{ $$ = (Node *) makeString($1); }
-			| NumericOnly			{ $$ = (Node *) $1; }
-			| /* EMPTY */		 	{ $$ = NULL; }
-		;
 
 /*
  *
@@ -10643,8 +10612,7 @@ ClusterStmt:
 	n->params = NIL;
 	$$ = (Node*)n;
 }
-/* XXX: reusing reindex_option_list */
-			| CLUSTER opt_verbose '(' reindex_option_list ')' qualified_name cluster_index_specification
+			| CLUSTER opt_verbose '(' vac_analyze_option_list ')' qualifie

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-04-06 Thread Alexey Kondratov

On 2020-04-03 21:27, Justin Pryzby wrote:

On Wed, Apr 01, 2020 at 08:08:36AM -0500, Justin Pryzby wrote:
Or maybe you'd want me to squish my changes into yours and resend 
after any

review comments ?


I didn't hear any feedback, so I've now squished all "parenthesized" 
and "fix"

commits.



Thanks for the input, but I am afraid that the patch set has become a bit 
messy now. I have eyeballed it and found some inconsistencies.


 	const char *name;			/* name of database to reindex */
-	int			options;		/* Reindex options flags */
+	List	   *rawoptions;		/* Raw options */
+	int			options;		/* Parsed options */
 	bool		concurrent;		/* reindex concurrently? */

You introduced rawoptions in 0002, but then removed it in 0003. So is it 
required or not? Probably this is a rebase artefact.


+/* XXX: reusing reindex_option_list */
+			| CLUSTER opt_verbose '(' reindex_option_list ')' qualified_name cluster_index_specification


Could we actually simply reuse vac_analyze_option_list? At first sight 
it does just the right thing, except for the special handling of the 
ANALYZE/ANALYSE spelling, but that does not seem to be a problem.
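
Both spellings are already accepted by that option list today, for 
example (the table name is made up):

VACUUM (ANALYZE, VERBOSE) my_table;
VACUUM (ANALYSE, VERBOSE) my_table;   -- alternative spelling, same thing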




0004 reduces duplicative error handling, as a separate commit so
Alexey can review it and/or integrate it.



@@ -2974,27 +2947,6 @@ ReindexRelationConcurrently(Oid relationOid, Oid tablespaceOid, int options)
 
 	/* Open relation to get its indexes */
 	heapRelation = table_open(relationOid, ShareUpdateExclusiveLock);
-	/*
-	 * We don't support moving system relations into different tablespaces,
-	 * unless allow_system_table_mods=1.
-	 */
-	if (OidIsValid(tablespaceOid) &&
-		!allowSystemTableMods && IsSystemRelation(heapRelation))
-		ereport(ERROR,
-				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
-				 errmsg("permission denied: \"%s\" is a system catalog",
-						RelationGetRelationName(heapRelation))));

ReindexRelationConcurrently is used for all cases, but it hits different 
code paths in the case of a database, table, and index. I have not 
checked yet, but are you sure it is safe to remove these validations in 
the case of REINDEX CONCURRENTLY?




The last two commits save a few
dozen lines of code, but not sure they're desirable.



Honestly, I do not think that passing raw strings down to the guts is a 
good idea. Yes, it saves us a few checks here and there now, but it may 
reduce the reusability of these internal routines in the future.




XXX: for cluster/vacuum, it might be more friendly to check before 
clustering

the table, rather than after clustering and re-indexing.



Yes, I think it would be much more user-friendly.


--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [HACKERS] make async slave to wait for lsn to be replayed

2020-04-03 Thread Alexey Kondratov

On 2020-04-01 02:26, Anna Akenteva wrote:

On 2020-03-27 04:15, Kartyshov Ivan wrote:

Anna, feel free to work on this patch.


Ivan and I worked on this patch a bit more. We fixed the bugs that we
could find and cleaned up the code. For now, we've kept both options:
WAIT as a standalone statement and WAIT as a part of BEGIN. The new
patch is attached.

The syntax looks a bit different now:

- WAIT FOR [ANY | ALL] event [, ...]
- BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ WAIT FOR
[ANY | ALL] event [, ...]]
where event is one of:
LSN value
TIMEOUT number_of_milliseconds
timestamp

Now, one event cannot contain both an LSN and a TIMEOUT.



In my understanding, the whole idea of having TIMEOUT was to express 
something like 'wait for this LSN to be replayed, but no longer than 
TIMEOUT milliseconds'. What is the point of having a plain TIMEOUT? It 
seems to be equivalent to pg_sleep, doesn't it?
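
To illustrate, using the syntax quoted above (nothing here is committed; 
the LSN value is just an example):

-- combining an LSN with a timeout is the useful case: wait for the LSN,
-- but give up after 60 seconds
WAIT FOR ANY LSN '16/B374D848', TIMEOUT 60000;

-- whereas a plain TIMEOUT event on its own is effectively just a sleep:
WAIT FOR TIMEOUT 60000;
SELECT pg_sleep(60);   -- roughly equivalent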



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-04-01 Thread Alexey Kondratov

On 2020-04-01 05:19, Michael Paquier wrote:

On Tue, Mar 31, 2020 at 03:48:21PM +0900, Michael Paquier wrote:

Thanks, committed 0001 after fixing the order of the headers.  One
patch left.


And committed now 0002, meaning that we are officially done.  Thanks
Alexey for your patience.



Thanks for doing that! And to all others who participated.


--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-03-31 Thread Alexey Kondratov

On 2020-03-30 21:34, Justin Pryzby wrote:

On Mon, Mar 30, 2020 at 09:02:22PM +0300, Alexey Kondratov wrote:
Hmm, I went through the well known to me SQL commands in Postgres and 
a bit

more. Parenthesized options list is mostly used in two common cases:


There's also ANALYZE(VERBOSE), REINDEX(VERBOSE).
There was debate a year ago [0] as to whether to make "reindex 
CONCURRENTLY" a
separate command, or to use parenthesized syntax "REINDEX 
(CONCURRENTLY)".  I

would propose to support that now (and implemented that locally).



I am fine with allowing REINDEX (CONCURRENTLY), but then we will have to 
support both syntaxes, as we already do for VACUUM. Anyway, if we agree 
to add parenthesized options to REINDEX/CLUSTER, then it should be done 
as a separate patch before the current patch set.
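
For comparison, supporting both spellings would mirror what VACUUM 
already does (the parenthesized REINDEX form is only the proposed one; 
the table name is made up):

VACUUM VERBOSE my_table;
VACUUM (VERBOSE) my_table;              -- both accepted today

REINDEX TABLE CONCURRENTLY my_table;    -- existing spelling
REINDEX (CONCURRENTLY) TABLE my_table;  -- proposed parenthesized spelling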




..and explain(...)


- In the beginning for boolean options only, e.g. VACUUM


You're right that those are currently boolean, but note that 
explain(FORMAT ..)

is not boolean.



Yep, I forgot EXPLAIN, this is a good example.
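
Indeed, EXPLAIN's parenthesized list already takes a non-boolean value:

EXPLAIN (ANALYZE, FORMAT json) SELECT 1;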



.. and create table (LIKE ..)



LIKE is used in the table definition, so it is a slightly different 
case.




Putting it into the WITH (...) options list looks like an option to 
me.
However, doing it only for VACUUM will ruin the consistency, while 
doing it

for CLUSTER and REINDEX is not necessary, so I do not like it either.


It's not necessary but I think it's a more flexible way to add new
functionality (requiring no changes to the grammar for vacuum, and for
REINDEX/CLUSTER it would allow future options to avoid changing the 
grammar).


If we use parenthesized syntax for vacuum, my proposal is to do it for
REINDEX, and
consider adding parenthesized syntax for cluster, too.


To summarize, currently I see only 2 + 1 extra options:

1) Keep everything with syntax as it is in 0001-0002
2) Implement tail syntax for VACUUM, but with limitation for VACUUM 
FULL of

the entire database + TABLESPACE change
3) Change TABLESPACE to a fully reserved word


+ 4) Use parenthesized syntax for all three.

Note, I mentioned that maybe VACUUM/CLUSTER should support not only 
"TABLESPACE
foo" but also "INDEX TABLESPACE bar" (I would use that, too).  I think 
that
would be easy to implement, and for sure it would suggest using () for 
both.
(For sure we don't want to implement "VACUUM t TABLESPACE foo" now, and 
then
later implement "INDEX TABLESPACE bar" and realize that for consistency 
we

cannot parenthesize it.

Michael ? Alvaro ? Robert ?



Yes, I would be glad to hear other opinions too, before doing this 
preliminary refactoring.



--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-03-30 Thread Alexey Kondratov

On 2020-03-28 03:11, Justin Pryzby wrote:

On Thu, Mar 26, 2020 at 11:01:06PM -0500, Justin Pryzby wrote:

> Another issue is this:
> > +VACUUM ( FULL [, ...] ) [ TABLESPACE new_tablespace 
] [ table_and_columns [, ...] ]
> As you mentioned in your v1 patch, in the other cases, "tablespace
> [tablespace]" is added at the end of the command rather than in the middle.  I
> wasn't able to make that work, maybe because "tablespace" isn't a fully
> reserved word (?).  I didn't try with "SET TABLESPACE", although I understand
> it'd be better without "SET".




SET does not change anything in my experience. The problem is that 
opt_vacuum_relation_list is... optional, and TABLESPACE is not a fully 
reserved word (why?), as you correctly noted. I have managed to put 
TABLESPACE at the end, but only with the non-optional 
vacuum_relation_list, like:


| VACUUM opt_full opt_freeze opt_verbose opt_analyze vacuum_relation_list TABLESPACE name
| VACUUM '(' vac_analyze_option_list ')' vacuum_relation_list TABLESPACE name


It means that one would not be able to do a VACUUM FULL of the entire 
database together with a TABLESPACE change. I do not think that is a 
common scenario, but this limitation would be very annoying, wouldn't it?
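
For illustration, under that draft grammar (a sketch only; the names are 
made up):

VACUUM FULL my_table TABLESPACE new_space;   -- expressible: relation list given
-- not expressible: a whole-database VACUUM FULL combined with a
-- tablespace change, since the production without a relation list has
-- no TABLESPACE clause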






I think we should use the parenthesized syntax for vacuum - it seems 
clear in

hindsight.

Possibly REINDEX should use that, too, instead of adding OptTablespace 
at the

end.  I'm not sure.


The attached mostly implements generic parenthesized options to 
REINDEX(...),
so I'm soliciting opinions: should TABLESPACE be implemented in 
parenthesized

syntax or non?


CLUSTER doesn't support parenthesized syntax, but .. maybe it should?


And this ?



Hmm, I went through the SQL commands in Postgres that are well known to 
me, and a bit more. A parenthesized options list is mostly used in two 
common cases (examples below):

- At the beginning, for boolean options only, e.g. VACUUM
- At the end, for options of various types, but introduced by WITH, 
e.g. COPY, CREATE SUBSCRIPTION
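
For example (table and file names are made up):

VACUUM (FULL, VERBOSE, ANALYZE) my_table;
COPY my_table FROM '/tmp/data.csv' WITH (FORMAT csv, HEADER true);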


Moreover, TABLESPACE is already used in CREATE TABLE/INDEX in the same 
way as I did it in 0001-0002. That way, putting the TABLESPACE option 
into the parenthesized options list does not look convenient or 
semantically correct to me, so I do not like it. Maybe others will have 
a different opinion.


Putting it into the WITH (...) options list looks like an option to me. 
However, doing it only for VACUUM would ruin consistency, while doing it 
for CLUSTER and REINDEX is not necessary, so I do not like it either.


To summarize, currently I see only 2 + 1 extra options:

1) Keep everything with the syntax as it is in 0001-0002
2) Implement the tail syntax for VACUUM, with the limitation that a 
VACUUM FULL of the entire database cannot be combined with a TABLESPACE 
change

3) Change TABLESPACE to a fully reserved word


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-30 Thread Alexey Kondratov

On 2020-03-30 04:56, Michael Paquier wrote:

On Fri, Mar 27, 2020 at 07:47:29AM +0900, Michael Paquier wrote:

On Thu, Mar 26, 2020 at 06:56:36PM -0300, Alvaro Herrera wrote:
I don't think any such cleanup should hamper the patch at hand 
anyway.


I don't think either, so I would do the split for the archive routines
at least.  It still feels strange to me to have a different routine
name for the frontend-side of RestoreArchivedFile().  That's not
really consistent with the current practice we have palloc() & co, as
well as the sync routines.


Okay.  Hearing nothing, I have rebased the patch set this morning,
improving some comments and the code format while on it.  I would like
to commit both 0001 (creation of xlogarchiver.h) and 0002 attached in
the next couple of days.  If you have an objection, please feel free.



0001:

What do you think about adding the following sanity check to xlogarchive.c?

+#ifdef FRONTEND
+#error "This file is not expected to be compiled for frontend code"
+#endif

It would prevent someone from adding typedefs and other common 
definitions into xlogarchive.h in the future, which could lead to the 
accidental inclusion of both xlogarchive.h and fe_archive.h at the same 
time.


0002:

+ */
+int
+RestoreArchivedFile(const char *path, const char *xlogfname,
+					off_t expectedSize, const char *restoreCommand)
+{

There are a couple of extra tabs here, IMO.

+extern int RestoreArchivedFile(const char *path,
+   const char *xlogfname,
+   off_t expectedSize,
+   const char *restoreCommand);

And the same here.

+ * This uses a logic based on "postgres -C" to get the value from
+ * from the cluster.

Repetitive 'from'.

 extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex,
-			   XLogRecPtr endpoint)
+			   XLogRecPtr endpoint, const char *restore_command)

Let us use camel case 'restoreCommand' here as in the header for 
tidiness.


I have left 0001 intact, but fixed all these small remarks in 0002. 
Please find it attached.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From f94990da84844841a35e80ca30e2b3d8c3bdaded Mon Sep 17 00:00:00 2001
From: Michael Paquier 
Date: Mon, 30 Mar 2020 10:51:51 +0900
Subject: [PATCH v6 2/2] Add -c/--restore-target-wal to pg_rewind

pg_rewind needs to copy from the source server to the target a set of
relation blocks changed from the previous checkpoint where WAL forked
between both up to the end of WAL on the target.  This may require WAL
segments that are not present anymore on the target's pg_wal, causing
pg_rewind to fail.  It is possible to fix that by copying manually the
WAL segments needed but this may lead to extra and useless work.

Instead, this commit introduces a new option allowing pg_rewind to use a
restore_command when running by grabbing it from the target
configuration.  This allows the rewind operation to be more reliable,
and only the necessary segments are fetched from the archives in this
case.

In order to do that, a new routine is added to src/common/ to allow
frontend tools to retrieve a WAL segment using an already-built restore
command.  This version is much more simple than the backend equivalent.

Author: Alexey Kondratov
Reviewed-by: Andrey Borodin, Andres Freund, Alvaro Herrera, Alexander
Korotkov, Michael Paquier
Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee3616895...@postgrespro.ru
---
 doc/src/sgml/ref/pg_rewind.sgml   |  28 +--
 src/bin/pg_rewind/parsexlog.c |  33 +++-
 src/bin/pg_rewind/pg_rewind.c |  87 ++--
 src/bin/pg_rewind/pg_rewind.h |   6 +-
 src/bin/pg_rewind/t/001_basic.pl  |   3 +-
 src/bin/pg_rewind/t/RewindTest.pm |  70 +++-
 src/common/Makefile   |   1 +
 src/common/exec.c |   3 +-
 src/common/fe_archive.c   | 128 ++
 src/include/common/fe_archive.h   |  21 +
 src/include/port.h|   3 +-
 src/tools/msvc/Mkvcbuild.pm   |   4 +-
 12 files changed, 360 insertions(+), 27 deletions(-)
 create mode 100644 src/common/fe_archive.c
 create mode 100644 src/include/common/fe_archive.h

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index f64d659522..07c49e4719 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -66,11 +66,11 @@ PostgreSQL documentation
can be found either on the target timeline, the source timeline, or their common
ancestor. In the typical failover scenario where the target cluster was
shut down soon after the divergence, this is not a problem, but if the
-   target cluster ran for a long time after the divergence, the old WAL
-   files might no longer 

Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-26 Thread Alexey Kondratov

On 2020-03-26 10:34, Michael Paquier wrote:

On Tue, Mar 24, 2020 at 12:22:16PM +0900, Michael Paquier wrote:

Thanks Alvaro and Alexander.  0001 has been applied as of e09ad07.
Now for 0002, let's see about it later.  Attached is a rebased
version, with no actual changes.


I was looking at this patch again today and I am rather fine with the
existing semantics.  Still I don't like much to name the frontend-side
routine FrontendRestoreArchivedFile() and use a different name than
the backend counterpart because we have to include xlog_internal.h in
fe_archive.c to be able to grab XLOGDIR.

So here is an idea: let's move the declaration of the routines part of
xlogarchive.c to a new header, called xlogarchive.h, and then let's
use the same routine name for the frontend and the backend in this
second patch.  We include xlog_internal.h already in many frontend
tools, so that would clean up things a bit.  Two extra things are the
routines for the checkpointer as well as the variables like
ArchiveRecoveryRequested.  It may be nice to move those while on it,
but I am not sure where and that's not actually required for this
patch set so that could be addressed later if need be.

Any thoughts?



The block of function declarations for xlogarchive.c inside 
xlog_internal.h looks a bit dangling already, since all the other 
functions and variables there are defined/initialized in xlog.c. So it 
looks good to me to move it out.


My only concern about using the same name is that someone may later 
introduce a new variable or typedef inside xlogarchive.h, so someone 
else would then be required to include both fe_archive.h and 
xlogarchive.h at once. But a warning against doing so in the header 
comment section should probably be enough.


Anyway, I have tried to do what you proposed, and attached is a small 
patch that introduces xlogarchive.h.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 860a996414..d683af377f 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -36,6 +36,7 @@
 
 #include "access/timeline.h"
 #include "access/xlog.h"
+#include "access/xlogarchive.h"
 #include "access/xlog_internal.h"
 #include "access/xlogdefs.h"
 #include "pgstat.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7621fc05e2..242427f815 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -32,6 +32,7 @@
 #include "access/transam.h"
 #include "access/twophase.h"
 #include "access/xact.h"
+#include "access/xlogarchive.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 914ad340ea..04104b55ea 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -20,6 +20,7 @@
 #include 
 
 #include "access/xlog.h"
+#include "access/xlogarchive.h"
 #include "access/xlog_internal.h"
 #include "common/archive.h"
 #include "miscadmin.h"
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 25e0333c9e..cc91954ac1 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -48,6 +48,7 @@
 #include "access/htup_details.h"
 #include "access/timeline.h"
 #include "access/transam.h"
+#include "access/xlogarchive.h"
 #include "access/xlog_internal.h"
 #include "catalog/pg_authid.h"
 #include "catalog/pg_type.h"
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index 27ded593ab..8e3cfcf83e 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -320,22 +320,4 @@ extern bool InArchiveRecovery;
 extern bool StandbyMode;
 extern char *recoveryRestoreCommand;
 
-/*
- * Prototypes for functions in xlogarchive.c
- */
-extern bool RestoreArchivedFile(char *path, const char *xlogfname,
-const char *recovername, off_t expectedSize,
-bool cleanupEnabled);
-extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
-   bool failOnSignal);
-extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
-extern void XLogArchiveNotify(const char *xlog);
-extern void XLogArchiveNotifySeg(XLogSegNo segno);
-extern void XLogArchiveForceDone(const char *xlog);
-extern bool XLogArchiveCheckDone(const char *xlog);
-extern bool XLogArchiveIsBusy(const char *xlog);
-extern bool XLogArchiveIsReady(const char *xlog);
-extern bool XLogArchiveIs

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-03-26 Thread Alexey Kondratov

On 2020-03-26 21:01, Justin Pryzby wrote:



@@ -262,7 +280,7 @@ cluster(ClusterStmt *stmt, bool isTopLevel)
  * and error messages should refer to the operation as VACUUM not 
CLUSTER.

  */
 void
-cluster_rel(Oid tableOid, Oid indexOid, int options)
+cluster_rel(Oid tableOid, Oid indexOid, Oid tablespaceOid, int 
options)


Add a comment here about the tablespaceOid parameter, like the other 
functions

where it's added.

The permission checking is kind of duplicative, so I'd suggest to 
factor it
out.  Ideally we'd only have one place that checks for 
pg_global/system/mapped.
It needs to check that it's not a system relation, or that 
system_table_mods
are allowed, and in any case that if it's a mapped rel, that it's not 
being
moved.  I would pass a boolean indicating if the tablespace is being 
changed.




Yes, but first I wanted to make sure that all the necessary validations 
are there, so as not to miss something as I did last time. I do not like 
repetitive code either, so I would like to introduce a more common check 
after reviewing the code as a whole.




Another issue is this:
+VACUUM ( FULL [, ...] ) [ TABLESPACE new_tablespace ] [ table_and_columns [, ...] ]

As you mentioned in your v1 patch, in the other cases, "tablespace
[tablespace]" is added at the end of the command rather than in the 
middle.  I

wasn't able to make that work, maybe because "tablespace" isn't a fully
reserved word (?).  I didn't try with "SET TABLESPACE", although I 
understand

it'd be better without "SET".



Initially I tried "SET TABLESPACE", but also failed to completely get 
rid of shift/reduce conflicts. I will try to rewrite VACUUM's part again 
with OptTableSpace. Maybe I will manage it this time.


I will take into account all your text edits as well.


Thanks
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-03-26 Thread Alexey Kondratov

On 2020-03-26 02:40, Justin Pryzby wrote:

On Thu, Mar 12, 2020 at 08:08:46PM +0300, Alexey Kondratov wrote:

On 09.03.2020 23:04, Justin Pryzby wrote:

On Sat, Feb 29, 2020 at 08:53:04AM -0600, Justin Pryzby wrote:

On Sat, Feb 29, 2020 at 03:35:27PM +0300, Alexey Kondratov wrote:
tests for that.  (I'm including your v8 untouched in hopes of not 
messing up
the cfbot).  My fixes avoid an issue if you try to REINDEX onto 
pg_default, I

think due to moving system toast indexes.
I was able to avoid this issue by adding a call to GetNewRelFileNode, 
even
though that's already called by RelationSetNewRelfilenode().  Not 
sure if
there's a better way, or if it's worth Alexey's v3 patch which added 
a

tablespace param to RelationSetNewRelfilenode.


Do you have any understanding of what exactly causes this error? I 
have
tried to debug it a little bit, but still cannot figure out why we 
need this

extra GetNewRelFileNode() call and a mechanism how it helps.


The PANIC is from smgr hashtable, which couldn't find an entry it 
expected.  My
very tentative understanding is that smgr is prepared to handle a 
*relation*
which is dropped/recreated multiple times in a transaction, but it's 
*not*
prepared to deal with a given RelFileNode(Backend) being 
dropped/recreated,

since that's used as a hash key.

I revisited it and solved it in a somewhat nicer way.



I included your new solution for this part from 0004 into 0001. It seems 
that at least part of the problem was that we tried to change the 
tablespace to pg_default while already being there.




It's still not clear to
me if there's an issue with your original way of adding a tablespace 
parameter

to RelationSetNewRelfilenode().



Yes, it is not clear to me either.



Many thanks for your review and fixups! There are some inconsistencies 
like mentions of SET TABLESPACE in error messages and so on. I am going 
to refactor and include your fixes 0003-0004 into 0001 and 0002, but 
keep 0005 separate for now, since that part requires more understanding 
IMO (and comparison with the v4 implementation).

comparison with v4 implementation).


I'd suggest to keep the CLUSTER/VACUUM FULL separate from REINDEX, in 
case
Michael or someone else wants to progress one but cannot commit to 
both.




Yes, sure, I did not plan to merge everything into a single patch.

It has taken much longer than expected to understand and rework all 
these fixes and permission validations. Attached is the updated patch set.


0001:
 — It is mostly the same, but refactored
 — I also included your most recent fix for REINDEX DATABASE with 
allow_system_table_mods=1
 — With this patch, REINDEX + TABLESPACE simply errors out when an index 
on a TOAST table is met and allow_system_table_mods=0


0002:
 — I reworked it a bit, since REINDEX CONCURRENTLY is not allowed on 
system catalogs anyway, which is checked at the higher levels of 
statement processing, so we only have to care about TOAST relations

 — Also added the same check to the plain REINDEX
 — It works fine, but I am not entirely happy that with this patch the 
errors/warnings are a bit inconsistent:


template1=# REINDEX INDEX CONCURRENTLY pg_toast.pg_toast_12773_index 
TABLESPACE pg_default;

WARNING:  skipping tablespace change of "pg_toast_12773_index"
DETAIL:  Cannot move system relation, only REINDEX CONCURRENTLY is 
performed.


template1=# REINDEX TABLE CONCURRENTLY pg_toast.pg_toast_12773 
TABLESPACE pg_default;

ERROR:  permission denied: "pg_toast_12773" is a system catalog

And REINDEX DATABASE CONCURRENTLY will generate a warning again.

Maybe we should always throw a warning and only reindex if it is not 
possible to change the tablespace?


0003:
 — I have gotten rid of some of the previous refactoring pieces, like 
check_relation_is_movable, for now. Let all these validations settle, 
and then we can think about whether we could do it better

 — Added CLUSTER to copy/equalfuncs
 — Cleaned up messages and comments

I hope that I did not forget anything from your proposals.


--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From 692bcadfacfa0c47fec7b6969525d33d0cac1f83 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Tue, 24 Mar 2020 18:16:06 +0300
Subject: [PATCH v12 3/3] Allow CLUSTER and VACUUM FULL to change tablespace

---
 doc/src/sgml/ref/cluster.sgml | 11 -
 doc/src/sgml/ref/vacuum.sgml  | 11 +
 src/backend/commands/cluster.c| 58 ---
 src/backend/commands/vacuum.c | 51 ++--
 src/backend/nodes/copyfuncs.c |  2 +
 src/backend/nodes/equalfuncs.c|  2 +
 src/backend/parser/gram.y | 45 --
 src/backend/postmaster/autovacuum.c   |  1 +
 src/include/commands/cluster.h|  2 +-
 src/include/commands/vacuum.h |  2 +
 src/include/nodes/parsenodes.h|  2 +
 src/test/regress/input/tablespace.source  | 23 -
 src/test/regress/output/ta

Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-03-12 Thread Alexey Kondratov

Hi Justin,

On 09.03.2020 23:04, Justin Pryzby wrote:

On Sat, Feb 29, 2020 at 08:53:04AM -0600, Justin Pryzby wrote:

On Sat, Feb 29, 2020 at 03:35:27PM +0300, Alexey Kondratov wrote:

Anyway, new version is attached. It is rebased in order to resolve conflicts
with a recent fix of REINDEX CONCURRENTLY + temp relations, and includes
this small comment fix.

Thanks for rebasing - I actually started to do that yesterday.

I extracted the bits from your original 0001 patch which handled CLUSTER and
VACUUM FULL.  I don't think if there's any interest in combining that with
ALTER anymore.  On another thread (1), I tried to implement that, and Tom
pointed out problem with the implementation, but also didn't like the idea.

I'm including some proposed fixes, but didn't yet update the docs, errors or
tests for that.  (I'm including your v8 untouched in hopes of not messing up
the cfbot).  My fixes avoid an issue if you try to REINDEX onto pg_default, I
think due to moving system toast indexes.

I was able to avoid this issue by adding a call to GetNewRelFileNode, even
though that's already called by RelationSetNewRelfilenode().  Not sure if
there's a better way, or if it's worth Alexey's v3 patch which added a
tablespace param to RelationSetNewRelfilenode.


Do you have any understanding of what exactly causes this error? I have 
tried to debug it a little bit, but still cannot figure out why we need 
this extra GetNewRelFileNode() call, and the mechanism by which it helps.


Probably you mean the v4 patch. Yes, interestingly, if we do everything 
at once inside RelationSetNewRelfilenode(), then there is no issue at 
all with:


REINDEX DATABASE template1 TABLESPACE pg_default;

It feels like I am doing monkey coding here, so I want to understand it 
better :)



The current logic allows moving all the indexes and toast indexes, but I think
we should use IsSystemRelation() unless allow_system_table_mods, like existing
behavior of ALTER.

template1=# ALTER TABLE pg_extension_oid_index SET tablespace pg_default;
ERROR:  permission denied: "pg_extension_oid_index" is a system catalog
template1=# REINDEX INDEX pg_extension_oid_index TABLESPACE pg_default;
REINDEX


Yeah, we definitely should obey the same rules as ALTER TABLE / INDEX in 
my opinion.



Finally, I think the CLUSTER is missing permission checks.  It looks like
relation_is_movable was factored out, but I don't see how that helps ?


I did this relation_is_movable refactoring in order to share the same 
check between REINDEX + TABLESPACE and ALTER INDEX + SET TABLESPACE. 
Then I realized that REINDEX already has its own temp-table check and 
validates mapped relations in multiple places, so I just added global 
tablespace checks instead. Thus, relation_is_movable seems to be 
outdated right now. Probably we will have to do another round of 
refactoring here once all the proper validations have accumulated in 
this patch set.



Alexey, I'm hoping to hear back if you think these changes are ok or if you'll
publish a new version of the patch addressing the crash I reported.
Or if you're too busy, maybe someone else can adopt the patch (I can help).


Sorry for the late response, I was not going to abandon this patch, but 
was a bit busy last month.


Many thanks for your review and fixups! There are some inconsistencies 
like mentions of SET TABLESPACE in error messages and so on. I am going 
to refactor and include your fixes 0003-0004 into 0001 and 0002, but 
keep 0005 separate for now, since that part requires more understanding 
IMO (and comparison with the v4 implementation).


So I am going to prepare a cleaner patch set by the middle of next week. 
I will be glad to receive more feedback from you then.



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-12 Thread Alexey Kondratov

On 12.03.2020 07:39, Michael Paquier wrote:

I'd like to commit the refactoring piece in 0001 tomorrow, then let's
move on with the rest as of 0002.  If more comments and docs are
needed for archive.c, let's continue discussing that.


I just went through both patches and realized that I cannot quite grasp 
the semantics of splitting frontend code between common and fe_utils. 
This applies only to 0002, where we introduce fe_archive.c. Should it be 
placed into fe_utils alongside the recent recovery_gen.c, which is also 
used by pg_rewind? This is frontend-only code intended to be used by 
frontend applications, so fe_utils feels like the right place, doesn't 
it? I just tried to do so and everything went fine, so it seems that 
there are no obstacles from the build system.


BTW, most of 'common' is really common code, with only four exceptions 
like logging.c, which is frontend-only. Are they there for historical 
reasons only, or for something else?



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





Re: Conflict handling for COPY FROM

2020-03-10 Thread Alexey Kondratov

On 09.03.2020 15:34, Surafel Temesgen wrote:


okay attached is a rebased patch with it



+    Portal        portal = NULL;
...
+        portal = GetPortalByName("");
+        SetRemoteDestReceiverParams(dest, portal);

I think that you do not need this, since you are using a ready 
DestReceiver. The whole idea of passing the DestReceiver down to 
CopyFrom was to avoid that code. This unnamed portal is created in 
exec_simple_query [1] and has already been attached to the DestReceiver 
there [2].


Maybe I am missing something, but I have just removed this code and 
everything works just fine.


[1] 
https://github.com/postgres/postgres/blob/0a42a2e9/src/backend/tcop/postgres.c#L1178


[2] 
https://github.com/postgres/postgres/blob/0a42a2e9/src/backend/tcop/postgres.c#L1226



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-05 Thread Alexey Kondratov

On 05.03.2020 09:24, Michael Paquier wrote:

On Wed, Mar 04, 2020 at 08:14:20PM +0300, Alexey Kondratov wrote:

- I did not actually get why you don't check for a missing command
when using wait_result_is_any_signal.  In this case I'd think that it
is better to exit immediately as follow-up calls would just fail.

Believe it or not, I put 'false' there intentionally. The idea was that
if the reason is a signal, then maybe the user got tired of waiting and
killed that restore_command process themselves or something like that, so
it is better to exit immediately. If it was a missing command, then there
is no hurry, so we can go further and complain that the attempt to recover
the WAL segment has failed.

Actually, I guess there is no big difference whether we include a missing
command here or not. There is no complicated logic further on, compared to
the real recovery process in Postgres, where we cannot simply return false
in that case.

On the contrary, it seems to me that the difference is very important.
Imagine for example a frontend tool which calls RestoreArchivedWALFile
in a loop, and that this one fails because the command called is
missing.  This tool would keep looping for nothing.  So checking for a
missing command and leaving immediately would be more helpful for the
user.  Can you think about scenarios where it would make sense to be
able to loop in this case instead of failing?



OK, I still had in mind pg_rewind as the only user of this routine. Now 
it is part of common, and I can imagine a hypothetical tool that polls 
the archive and waits for a specific WAL segment to become available. In 
that case 'command not found' is definitely the end of the game, while 
the absence of a segment is an expected error, so we can continue 
looping.



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-04 Thread Alexey Kondratov

On 04.03.2020 10:45, Michael Paquier wrote:

On Mon, Mar 02, 2020 at 08:59:49PM +0300, Alexey Kondratov wrote:

All other remarks look clear for me, so I fix them in the next patch
version, thanks.

Already done as per the attached, with a new routine named
getRestoreCommand() and more done.


Many thanks for doing that. I went through the diff between v21 and v20. 
Most of the changes look good to me.


- *        Functions for finding and validating executable files
+ *        Functions for finding and validating from executables files

There is probably something missing here. Finding and validating what? 
And 'executables files' does not seem correct either.


+        # First, remove all the content in the archive directory,
+        # as RecursiveCopy::copypath does not support copying to
+        # existing directories.

I think that 'remove all the content' is not completely correct in this 
case. We are simply removing the archive directory. There is no content 
there yet, so 'First, remove the archive directory...' should be fine.



- I did not actually get why you don't check for a missing command
when using wait_result_is_any_signal.  In this case I'd think that it
is better to exit immediately as follow-up calls would just fail.


Believe it or not, I put 'false' there intentionally. The idea was that 
if the reason is a signal, then maybe the user got tired of waiting and 
killed that restore_command process themselves or something like that, 
so it is better to exit immediately. If it was a missing command, then 
there is no hurry, so we can go further and complain that the attempt to 
recover the WAL segment has failed.


Actually, I guess there is no big difference whether we include a 
missing command here or not. There is no complicated logic further on, 
compared to the real recovery process in Postgres, where we cannot 
simply return false in that case.



- The code was rather careless about error handling and
RestoreArchivedWALFile(), and it seemed to me that it is rather
pointless to report an extra message "could not restore file \"%s\"
from archive" on top of the other error.


Probably you mean the several pg_log_error calls not followed by 
'return -1;'. Yes, I did that to fall through to the end of the function 
and show this extra message, but I agree that there is not much sense in 
doing so.



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-03-02 Thread Alexey Kondratov

On 2020-03-02 07:53, Michael Paquier wrote:




+ * For fixed-size files, the caller may pass the expected size as an
+ * additional crosscheck on successful recovery.  If the file size is 
not

+ * known, set expectedSize = 0.
+ */
+int
+RestoreArchivedWALFile(const char *path, const char *xlogfname,
+  off_t expectedSize, const char 
*restoreCommand)


Actually, expectedSize is IMO a bad idea, because any caller of this
routine passing down zero could be trapped with an incorrect file
size.  So let's remove the behavior where it is possible to bypass
this sanity check.  We don't need it in pg_rewind either.



OK, sounds reasonable, but just to be clear: I will only remove the 
possibility to bypass this sanity check (with 0), but leave the 
expectedSize argument intact. We still need it, since pg_rewind takes 
WalSegSz from ControlFile and should pass it further, am I right?





+   /* Remove trailing newline */
+   if (strchr(cmd_output, '\n') != NULL)
+   *strchr(cmd_output, '\n') = '\0';


It seems to me that what you are looking here is to use
pg_strip_crlf().  Thinking harder, we have pipe_read_line() in
src/common/exec.c which does the exact same job..



pg_strip_crlf fits well, but would you mind if I also make 
pipe_read_line external in this patch?





-   /*
-* construct the command to be executed
-*/


Perhaps you meant "build" here.



Actually, the verb 'construct' has historically been applied to 
archive/restore commands (see also xlogarchive.c and pgarch.c), but it 
should be 'build' in (fe_)archive.c, since we have BuildRestoreCommand 
there now.


All other remarks are clear to me, so I will fix them in the next patch 
version, thanks.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
The Russian Postgres Company




Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-02-29 Thread Alexey Kondratov

On 2020-02-11 19:48, Justin Pryzby wrote:
For your v7 patch, which handles REINDEX to a new tablespace, I have a 
few

minor comments:

+ * the relation will be rebuilt.  If InvalidOid is used, the default

=> should say "currrent", not default ?



Yes, it keeps the current index tablespace in that case, thanks.



+++ b/doc/src/sgml/ref/reindex.sgml
+TABLESPACE
...
+class="parameter">new_tablespace


=> I saw you split the description of TABLESPACE from new_tablespace 
based on
comment earlier in the thread, but I suggest that the descriptions for 
these

should be merged, like:

+   
+TABLESPACE new_tablespace
+
+ 
+  Allow specification of a tablespace where all rebuilt indexes
+  will be created.  Cannot be used with "mapped" relations.
+  If SCHEMA, DATABASE or SYSTEM are specified, then
+  all unsuitable relations will be skipped and a single WARNING
+  will be generated.
+ 
+
+   



It sounds good to me, but here I just follow the structure that is used 
all around. The documentation of ALTER TABLE/DATABASE, REINDEX, and many 
others describes each literal/parameter in a separate entry, e.g. 
new_tablespace. So I would prefer to keep it as it is for now.




The existing patch is very natural, especially the parts in the 
original patch

handling vacuum full and cluster.  Those were removed to concentrate on
REINDEX, and based on comments that it might be nice if ALTER handled 
CLUSTER
and VACUUM FULL.  On a separate thread, I brought up the idea of ALTER 
using
clustered order.  Tom pointed out some issues with my implementation, 
but

didn't like the idea, either.

So I suggest to re-include the CLUSTER/VAC FULL parts as a separate 
0002 patch,

the same way they were originally implemented.

BTW, I think if "ALTER" were updated to support REINDEX (to allow 
multiple

operations at once), it might be either:
|ALTER INDEX i SET TABLESPACE , REINDEX -- to reindex a single index
on a given tlbspc
or
|ALTER TABLE tbl REINDEX USING INDEX TABLESPACE spc; -- to reindex all
inds on table inds moved to a given tblspc
"USING INDEX TABLESPACE" is already used for ALTER..ADD column/table 
CONSTRAINT.




Yes, I also think that allowing REINDEX/CLUSTER/VACUUM FULL to put the 
resulting relation in a different tablespace is a very natural 
operation. However, I made a couple of attempts to integrate the latter 
two with ALTER TABLE and failed, since it is already complex enough. I 
am still willing to proceed with it, but I am not sure how soon that 
will be.
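
For reference, the forms implemented by the attached v8 patch, as per 
its synopsis below (object names are made up):

REINDEX TABLE my_table TABLESPACE new_tablespace;
REINDEX (VERBOSE) DATABASE my_db TABLESPACE new_tablespace;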


Anyway, a new version is attached. It is rebased to resolve conflicts 
with a recent fix for REINDEX CONCURRENTLY + temp relations, and 
includes this small comment fix.



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
The Russian Postgres Company
From d2b7a5fa2e11601759b47af0c142a7824ef907a2 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 30 Dec 2019 20:00:37 +0300
Subject: [PATCH v8] Allow REINDEX to change tablespace

REINDEX already does full relation rewrite, this patch adds a
possibility to specify a new tablespace where new relfilenode
will be created.
---
 doc/src/sgml/ref/reindex.sgml | 24 +-
 src/backend/catalog/index.c   | 75 --
 src/backend/commands/cluster.c|  2 +-
 src/backend/commands/indexcmds.c  | 96 ---
 src/backend/commands/tablecmds.c  |  2 +-
 src/backend/nodes/copyfuncs.c |  1 +
 src/backend/nodes/equalfuncs.c|  1 +
 src/backend/parser/gram.y | 14 ++--
 src/backend/tcop/utility.c|  6 +-
 src/bin/psql/tab-complete.c   |  6 ++
 src/include/catalog/index.h   |  7 +-
 src/include/commands/defrem.h |  6 +-
 src/include/nodes/parsenodes.h|  1 +
 src/test/regress/input/tablespace.source  | 49 
 src/test/regress/output/tablespace.source | 66 
 15 files changed, 323 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index c54a7c420d4..0628c94bb1e 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name
+REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name [ TABLESPACE new_tablespace ]
 
 where option can be one of:
 
@@ -174,6 +174,28 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+TABLESPACE
+
+ 
+  This specifies a tablespace, where all rebuilt indexes will be created.
+  Cannot be used with "mapped" relations. If SCHEMA,
+  DATABASE or SYSTEM is specified, then
+  all unsuitable relations will be skipped and a single WARNING
+  will be generated.
+   

Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-02-28 Thread Alexey Kondratov

On 2020-02-28 09:43, Michael Paquier wrote:

On Thu, Feb 27, 2020 at 06:29:34PM +0300, Alexey Kondratov wrote:

On 2020-02-27 16:41, Alexey Kondratov wrote:
>
> New version of the patch is attached. Thanks again for your review.
>

Last patch (v18) got a conflict with one of today commits 
(05d8449e73).

Rebased version is attached.


The shape of the patch is getting better.  I have found some issues
when reading through the patch, but nothing huge.

+   printf(_("  -c, --restore-target-wal   use restore_command in
target config\n"));
+   printf(_(" to retrieve WAL files
from archive\n"));
[...]
{"progress", no_argument, NULL, 'P'},
+   {"restore-target-wal", no_argument, NULL, 'c'},
It may be better to reorder that alphabetically.



Sure, I put it in order. However, the recent -R option is out of order 
too.




+	if (rc != 0)
+		/* Sanity check, should never happen. */
+		elog(ERROR, "failed to build restore_command due to missing parameters");
No point in having this comment IMO.



I would prefer to keep it, since there are plenty of similar comments 
near Asserts and elogs all over Postgres. Otherwise it may look like a 
valid error state. It may be obvious now, but for someone who is not 
aware of the BuildRestoreCommand refactoring it may not be. So from my 
perspective there is nothing wrong with this extra one-line comment.




+/* logging support */
+#define pg_fatal(...) do { pg_log_fatal(__VA_ARGS__); exit(1); } while(0)

Actually, I don't think that this is a good idea to name this
pg_fatal() as we have the same think in pg_rewind so it could be
confusing.



I have added explicit exit(1) calls, since pg_fatal was used only twice 
in archive.c. Probably pg_log_fatal from common/logging should follow 
the same logic as the FATAL log level in the backend and exit the 
process, but for now including pg_rewind.h inside archive.c, or vice 
versa, does not look like a solution.




-	while ((c = getopt_long(argc, argv, "D:nNPR", long_options, &option_index)) != -1)
+	while ((c = getopt_long(argc, argv, "D:nNPRc", long_options, &option_index)) != -1)
Alphabetical order here.



Done.



+   rmdir($node_master->archive_dir);
rmtree() is used in all our other tests.



Done. There was some non-obvious logic there: rmdir only deletes empty 
directories, which is true in the case of archive_dir in that test, but 
I have unified it for consistency.




+   pg_log_error("archive file \"%s\" has wrong size: %lu
instead of %lu, %s",
+xlogfname, (unsigned long) 
stat_buf.st_size,
+(unsigned long) expectedSize, 
strerror(errno));

I think that the error message should be reworded: "unexpected WAL
file size for \"%s\": %lu instead of %lu".  Please note that there is
no need for strerror() here at all, as errno should be 0.

+	if (xlogfd < 0)
+		pg_log_error("could not open file \"%s\" restored from archive: %s\n",
+					 xlogpath, strerror(errno));
[...]
+		pg_log_error("could not stat file \"%s\" restored from archive: %s",
+					 xlogpath, strerror(errno));
No need for strerror() as you can just use %m.  And no need for the
extra newline at the end as pg_log_* routines do that by themselves.

+   pg_log_error("could not restore file \"%s\" from archive\n",
+xlogfname);
No need for a newline here.



Thanks, I have cleaned up these log statements.


--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
The Russian Postgres Company

From ba20808ffddf3fe2eefe96d3385697fb6583ce9a Mon Sep 17 00:00:00 2001
From: Alexander Korotkov 
Date: Tue, 25 Feb 2020 02:22:45 +0300
Subject: [PATCH v20] pg_rewind: Add options to restore WAL files from archive

Currently, pg_rewind fails when it could not find required WAL files in the
target data directory.  One have to manually figure out which WAL files are
required and copy them back from archive.

This commit implements new pg_rewind options, which allow pg_rewind to
automatically retrieve missing WAL files from archival storage. The
restore_command option is read from postgresql.conf.

Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1%40postgrespro.ru
Author: Alexey Kondratov
Reviewed-by: Michael Paquier, Andrey Borodin, Alvaro Herrera
Reviewed-by: Andres Freund, Alexander Korotkov
---
 doc/src/sgml/ref/pg_rewind.sgml  |  28 --
 src/backend/access/transam/xlogarchive.c |  60 ++--
 src/bin/pg_rewind/parsexlog.c|  33 ++-
 src/bin/pg_rewind/pg_rewind.c|  77 ++-
 src/bin/pg_rewind/pg_rewind.h|   6 +-
 src/bin/pg_rewind/t/001_basic.pl |   3 +-
 src/bi

Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-02-27 Thread Alexey Kondratov

On 2020-02-27 16:41, Alexey Kondratov wrote:


New version of the patch is attached. Thanks again for your review.



The last patch (v18) got a conflict with one of today's commits 
(05d8449e73). A rebased version is attached.


--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
The Russian Postgres Company
From ea93b52b298d80aac547735c5917386b37667595 Mon Sep 17 00:00:00 2001
From: Alexander Korotkov 
Date: Tue, 25 Feb 2020 02:22:45 +0300
Subject: [PATCH v19] pg_rewind: Add options to restore WAL files from archive

Currently, pg_rewind fails when it could not find required WAL files in the
target data directory.  One have to manually figure out which WAL files are
required and copy them back from archive.

This commit implements new pg_rewind options, which allow pg_rewind to
automatically retrieve missing WAL files from archival storage. The
restore_command option is read from postgresql.conf.

Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1%40postgrespro.ru
Author: Alexey Kondratov
Reviewed-by: Michael Paquier, Andrey Borodin, Alvaro Herrera
Reviewed-by: Andres Freund, Alexander Korotkov
---
 doc/src/sgml/ref/pg_rewind.sgml  |  28 --
 src/backend/access/transam/xlogarchive.c |  60 ++--
 src/bin/pg_rewind/parsexlog.c|  33 ++-
 src/bin/pg_rewind/pg_rewind.c|  77 +++-
 src/bin/pg_rewind/pg_rewind.h|   6 +-
 src/bin/pg_rewind/t/001_basic.pl |   3 +-
 src/bin/pg_rewind/t/RewindTest.pm|  66 +-
 src/common/Makefile  |   2 +
 src/common/archive.c | 102 +
 src/common/fe_archive.c  | 111 +++
 src/include/common/archive.h |  22 +
 src/include/common/fe_archive.h  |  19 
 src/tools/msvc/Mkvcbuild.pm  |   8 +-
 13 files changed, 457 insertions(+), 80 deletions(-)
 create mode 100644 src/common/archive.c
 create mode 100644 src/common/fe_archive.c
 create mode 100644 src/include/common/archive.h
 create mode 100644 src/include/common/fe_archive.h

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 42d29edd4e..64a6942031 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -66,11 +66,11 @@ PostgreSQL documentation
can be found either on the target timeline, the source timeline, or their common
ancestor. In the typical failover scenario where the target cluster was
shut down soon after the divergence, this is not a problem, but if the
-   target cluster ran for a long time after the divergence, the old WAL
-   files might no longer be present. In that case, they can be manually
-   copied from the WAL archive to the pg_wal directory, or
-   fetched on startup by configuring  or
-   .  The use of
+   target cluster ran for a long time after the divergence, its old WAL
+   files might no longer be present. In this case, you can manually copy them
+   from the WAL archive to the pg_wal directory, or run
+   pg_rewind with the -c option to
+   automatically retrieve them from the WAL archive. The use of
pg_rewind is not limited to failover, e.g.  a standby
server can be promoted, run some write transactions, and then rewinded
to become a standby again.
@@ -232,6 +232,19 @@ PostgreSQL documentation
   
  
 
+ 
+  -c
+  --restore-target-wal
+  
+   
+Use the restore_command defined in
+the target cluster configuration to retrieve WAL files from
+the WAL archive if these files are no longer available in the
+pg_wal directory.
+   
+  
+ 
+
  
   --debug
   
@@ -318,7 +331,10 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b
   history forked off from the target cluster. For each WAL record,
   record each data block that was touched. This yields a list of all
   the data blocks that were changed in the target cluster, after the
-  source cluster forked off.
+  source cluster forked off. If some of the WAL files are no longer
+  available, try re-running pg_rewind with
+  the -c option to search for the missing files in
+  the WAL archive.
  
 
 
diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c
index 188b73e752..f78a7e8f02 100644
--- a/src/backend/access/transam/xlogarchive.c
+++ b/src/backend/access/transam/xlogarchive.c
@@ -21,6 +21,7 @@
 
 #include "access/xlog.h"
 #include "access/xlog_internal.h"
+#include "common/archive.h"
 #include "miscadmin.h"
 #include "postmaster/startup.h"
 #include "replication/walsender.h"
@@ -55,9 +56,6 @@ RestoreArchivedFile(char *path, const char *xlogfname,
 	char		xlogpath[MAXPGPATH];
 	char		xlogRestoreCmd[MAXPGPATH];
 	c

Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-02-27 Thread Alexey Kondratov

On 2020-02-27 04:52, Michael Paquier wrote:

On Thu, Feb 27, 2020 at 12:43:55AM +0300, Alexander Korotkov wrote:

Regarding text split change, it was made by pgindent.  I didn't notice
it belongs to unchanged part of code.  Sure, we shouldn't include this
into the patch.


I have read through v17 (not tested, sorry), and spotted a couple of
issues that need to be addressed.

+   "--source-pgdata=$standby_pgdata",
+   "--target-pgdata=$master_pgdata",
+   "--no-sync", "--no-ensure-shutdown",
FWIW, I think that perl indenting would reshape this part.  I would
recommend to run src/tools/pgindent/pgperltidy and
./src/tools/perlcheck/pgperlcritic before commit.



Thanks, I formatted this part with perltidy. It also modified 
RecursiveCopy's indents. Pgperlcritic has no complaints about this file. 
BTW, when executed on the whole project, pgperltidy modifies dozens of 
perl files and even pgindent itself.




+ * Copyright (c) 2020, PostgreSQL Global Development Group
Wouldn't it be better to just use the full copyright here?  I mean the
following:
Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group
Portions Copyright (c) 1994, The Regents of the University of 
California




I think so; it contains some older code parts, so it is better to use 
the unified copyright.




+++ b/src/common/archive.c
[...]
+#include "postgres.h"
+
+#include "common/archive.h"
This is incorrect.  All files shared between the backend and the
frontend in src/common/ have to include the following set of headers:
#ifndef FRONTEND
#include "postgres.h"
#else
#include "postgres_fe.h"
#endif

+++ b/src/common/fe_archive.c
[...]
+#include "postgres_fe.h"
This is incomplete.  The following piece should be added:
#ifndef FRONTEND
#error "This file is not expected to be compiled for backend code"
#endif



Fixed both.



+   snprintf(postgres_cmd, sizeof(postgres_cmd), "%s -D %s
-C restore_command",
+postgres_exec_path, datadir_target);
+
I think that this is missing proper quoting.



Yep, added the same quoting as in pg_upgrade/options.
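
For illustration, here is a minimal standalone sketch of the kind of quoting
this is about; the helper name and paths are made up, and this is not the
pg_upgrade or pg_rewind code:

#include <stdio.h>
#include <string.h>

/*
 * Illustration only: wrap a path in single quotes for the shell, escaping
 * any embedded single quote as '\'' so that
 *   postgres -D <datadir> -C restore_command
 * survives spaces and quotes in the data directory path.
 */
static void
append_shell_quoted(char *dst, size_t dstsize, const char *src)
{
	size_t		len = strlen(dst);

	if (len < dstsize - 1)
		dst[len++] = '\'';
	for (const char *p = src; *p && len < dstsize - 5; p++)
	{
		if (*p == '\'')
		{
			memcpy(dst + len, "'\\''", 4);
			len += 4;
		}
		else
			dst[len++] = *p;
	}
	if (len < dstsize - 1)
		dst[len++] = '\'';
	dst[len] = '\0';
}

int
main(void)
{
	char		cmd[1024];
	const char *postgres_exec_path = "/usr/lib/postgresql/bin/postgres";	/* hypothetical */
	const char *datadir_target = "/var/lib/pg data/main";					/* note the space */

	snprintf(cmd, sizeof(cmd), "%s -D ", postgres_exec_path);
	append_shell_quoted(cmd, sizeof(cmd), datadir_target);
	strncat(cmd, " -C restore_command", sizeof(cmd) - strlen(cmd) - 1);

	puts(cmd);
	return 0;
}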



I would rename ConstructRestoreCommand() to BuildRestoreCommand()
while on it..



OK, shorter is better.



I think that it would be saner to check the return status of
ConstructRestoreCommand() in xlogarchive.c as a sanity check, with an
elog(ERROR) if not 0, as that should never happen.



Added.

New version of the patch is attached. Thanks again for your review.


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
The Russian Postgres Company
From c775a2e40e405474f6ecef35843d276d43fb462f Mon Sep 17 00:00:00 2001
From: Alexander Korotkov 
Date: Tue, 25 Feb 2020 02:22:45 +0300
Subject: [PATCH v18] pg_rewind: Add options to restore WAL files from archive

Currently, pg_rewind fails when it cannot find the required WAL files in the
target data directory.  One has to manually figure out which WAL files are
required and copy them back from the archive.

This commit implements new pg_rewind options, which allow pg_rewind to
automatically retrieve missing WAL files from archival storage. The
restore_command option is read from postgresql.conf.

Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1%40postgrespro.ru
Author: Alexey Kondratov
Reviewed-by: Michael Paquier, Andrey Borodin, Alvaro Herrera
Reviewed-by: Andres Freund, Alexander Korotkov
---
 doc/src/sgml/ref/pg_rewind.sgml  |  28 --
 src/backend/access/transam/xlogarchive.c |  60 ++--
 src/bin/pg_rewind/parsexlog.c|  33 ++-
 src/bin/pg_rewind/pg_rewind.c|  77 +++-
 src/bin/pg_rewind/pg_rewind.h|   6 +-
 src/bin/pg_rewind/t/001_basic.pl |   3 +-
 src/bin/pg_rewind/t/RewindTest.pm|  66 +-
 src/common/Makefile  |   2 +
 src/common/archive.c | 102 +
 src/common/fe_archive.c  | 111 +++
 src/include/common/archive.h |  22 +
 src/include/common/fe_archive.h  |  19 
 src/tools/msvc/Mkvcbuild.pm  |   8 +-
 13 files changed, 457 insertions(+), 80 deletions(-)
 create mode 100644 src/common/archive.c
 create mode 100644 src/common/fe_archive.c
 create mode 100644 src/include/common/archive.h
 create mode 100644 src/include/common/fe_archive.h

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 42d29edd4e..64a6942031 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -66,11 +66,11 @@ PostgreSQL documentation
can be found either on the target timeline, the source timeline, or their common
ancestor. In the typical

Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-02-26 Thread Alexey Kondratov

On 2020-02-26 22:03, Alexander Korotkov wrote:

On Tue, Feb 25, 2020 at 1:48 PM Alexander Korotkov
 wrote:


I think usage of chmod() deserves comment.  As I get default
permissions are sufficient for work, but we need to set them to
satisfy 'check PGDATA permissions' test.



I've added this comment myself.



Thanks for doing it yourself; I was going to answer tonight, but that 
would obviously have been too late.




I've also fixes some indentation.
Patch now looks good to me.  I'm going to push it if no objections.



I think that the docs should be corrected. Previously, Michael was against 
the phrase 'restore_command defined in the postgresql.conf', since it 
could also be defined in any config file included there. We corrected it 
in the pg_rewind --help output, but now the docs say:


+Use the restore_command defined in
+postgresql.conf to retrieve WAL files from
+the WAL archive if these files are no longer available in the
+pg_wal directory of the target cluster.

Probably it should be something like:

+Use the restore_command defined in
+the target cluster configuration to retrieve WAL files from
+the WAL archive if these files are no longer available in the
+pg_wal directory.

Here only the text split changed:

-	 * Ignore restore_command when not in archive recovery (meaning
-	 * we are in crash recovery).
+	 * Ignore restore_command when not in archive recovery (meaning we are in
+	 * crash recovery).

Should we do so in this patch?

I think that the extra period at the end is not necessary here:

+		pg_log_debug("using config variable restore_command=\'%s\'.", 
restore_command);


If you agree, then attached is a patch with all the corrections above. It 
is made with the default git format-patch format, but yours were in a 
slightly different format, so I was only able to apply them with git am 
--patch-format=stgit.



--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
The Russian Postgres Company

From fa2fc359dd9852afc608663fa32733e800652ffa Mon Sep 17 00:00:00 2001
From: Alexander Korotkov 
Date: Tue, 25 Feb 2020 02:22:45 +0300
Subject: [PATCH v17] pg_rewind: Add options to restore WAL files from archive

Currently, pg_rewind fails when it cannot find the required WAL files in the
target data directory.  One has to manually figure out which WAL files are
required and copy them back from the archive.

This commit implements new pg_rewind options, which allow pg_rewind to
automatically retrieve missing WAL files from archival storage. The
restore_command option is read from postgresql.conf.

Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1%40postgrespro.ru
    Author: Alexey Kondratov
Reviewed-by: Michael Paquier, Andrey Borodin, Alvaro Herrera
Reviewed-by: Andres Freund, Alexander Korotkov
---
 doc/src/sgml/ref/pg_rewind.sgml  |  28 --
 src/backend/access/transam/xlogarchive.c |  58 +
 src/bin/pg_rewind/parsexlog.c|  33 ++-
 src/bin/pg_rewind/pg_rewind.c|  77 ++--
 src/bin/pg_rewind/pg_rewind.h|   6 +-
 src/bin/pg_rewind/t/001_basic.pl |   3 +-
 src/bin/pg_rewind/t/RewindTest.pm|  67 +-
 src/common/Makefile  |   2 +
 src/common/archive.c |  97 +
 src/common/fe_archive.c  | 106 +++
 src/include/common/archive.h |  21 +
 src/include/common/fe_archive.h  |  18 
 src/tools/msvc/Mkvcbuild.pm  |   8 +-
 13 files changed, 443 insertions(+), 81 deletions(-)
 create mode 100644 src/common/archive.c
 create mode 100644 src/common/fe_archive.c
 create mode 100644 src/include/common/archive.h
 create mode 100644 src/include/common/fe_archive.h

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 42d29edd4e..64a6942031 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -66,11 +66,11 @@ PostgreSQL documentation
can be found either on the target timeline, the source timeline, or their common
ancestor. In the typical failover scenario where the target cluster was
shut down soon after the divergence, this is not a problem, but if the
-   target cluster ran for a long time after the divergence, the old WAL
-   files might no longer be present. In that case, they can be manually
-   copied from the WAL archive to the pg_wal directory, or
-   fetched on startup by configuring  or
-   .  The use of
+   target cluster ran for a long time after the divergence, its old WAL
+   files might no longer be present. In this case, you can manually copy them
+   from the WAL archive to the pg_wal directory, or run
+   pg_rewind with the -c option to
+   automatically retrieve them from the WAL archive. The use of
pg_rewind is not limited to failover, e.g. 

Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2020-01-31 Thread Alexey Kondratov

On 2020-01-24 08:50, Michael Paquier wrote:

On Wed, Jan 22, 2020 at 12:55:30AM +0300, Alexander Korotkov wrote:
On Sun, Jan 19, 2020 at 1:24 PM Michael Paquier  
wrote:

+static int
+RestoreArchivedWALFile(const char *path, const char *xlogfname,
+  off_t expectedSize, const char 
*restoreCommand)
Not a fan of putting that to pg_rewind/parsexlog.c.  It has nothing 
to
do with WAL parsing, and it seems to me that we could have an 
argument

for making this available as a frontend-only API in src/common/.


I've put it into src/common/fe_archive.c.


This split makes sense.  You have forgotten to update
@pgcommonfrontendfiles in Mkvcbuild.pm for the MSVC build.  Still, I
can see that we have a lot of duplication between the code of the
frontend and the backend, which is annoying..  The execution part is
tricky to split because the backend has pre- and post- callbacks, the
interruption handling is not the same and there is the problem of
elog() calls, with elevel that can be changed depending on the backend
configuration.  However, I can see one portion of the code which is
fully duplicated and could be refactored out: the construction of the
command to execute, by having in input the restore_command string and
the arguments that we expect to replace in the '%' markers which are
part of the command.



OK, I agree that duplicating this part directly is annoying. I did a 
refactoring and moved this restore_command building code into 
common/archive. A separate file was needed, since fe_archive is 
frontend-only, so I cannot put shared code there. I do not know whether 
there is a better place for it.
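
For reference, here is a very rough standalone sketch of the %-marker
expansion such a shared routine performs (%f for the WAL file name, %p for
the target path, %r for the last restart point, %% for a literal '%').
This is an illustration of the idea only, with made-up inputs, not the code
that ended up in src/common/archive.c:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Expand %-markers in a restore_command template into dst; illustration only. */
static bool
build_restore_command(char *dst, size_t dstsize, const char *restore_command,
					  const char *xlogpath, const char *xlogfname,
					  const char *last_restart_point)
{
	size_t		len = 0;

	for (const char *sp = restore_command; *sp; sp++)
	{
		const char *sub;
		char		one[2] = {*sp, '\0'};

		if (*sp == '%')
		{
			switch (*++sp)
			{
				case 'p': sub = xlogpath; break;
				case 'f': sub = xlogfname; break;
				case 'r': sub = last_restart_point; break;
				case '%': sub = "%"; break;
				default:  return false;		/* unknown marker */
			}
			if (sub == NULL)
				return false;				/* marker used, but no value given */
		}
		else
			sub = one;

		if (len + strlen(sub) + 1 > dstsize)
			return false;					/* result would not fit */
		strcpy(dst + len, sub);
		len += strlen(sub);
	}
	dst[len] = '\0';
	return true;
}

int
main(void)
{
	char		cmd[1024];

	/* hypothetical restore_command and file names, for illustration */
	if (build_restore_command(cmd, sizeof(cmd), "cp /mnt/archive/%f %p",
							  "pg_wal/RECOVERYXLOG",
							  "00000001000000000000002A", NULL))
		puts(cmd);
	return 0;
}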




If NULL is specified for one of those values,
then the construction routine returns NULL, synonym of failure.  And
the result of the routine is the built command.  I think that you
would need an extra argument to give space for the error message
generated in the event of an error when building the command.



I did it in a slightly different way, but the overall logic is exactly as you 
proposed, I hope.
Also, I have checked the Windows/Linux builds in a CI setup similar to 
cfbot's, and now everything runs well.




+++ b/src/include/common/fe_archive.h
+#ifndef ARCHIVE_H
+#define ARCHIVE_H
This should be FE_ARCHIVE_H.



Changed.



-   while ((c = getopt_long(argc, argv, "D:nNPR", long_options,
_index)) != -1)
+   while ((c = getopt_long(argc, argv, "D:nNPRC:c", long_options,
_index)) != -1)
This still includes 'C', but that should not be the case.

+   pg_rewind with the
-c or
+   -C option to automatically retrieve them from
the WAL
[...]
+  -C restore_command
+  --target-restore-command=restore_command
[...]
+  available, try re-running pg_rewind 
with

+  the -c or -C option to search
+  for the missing files in the WAL archive.
Three places in the docs need to be cleaned up.



I have fixed all the above, thanks.



Do we really need to test the new "archive" mode as well for
003_extrafiles and 002_databases?  I would imagine that only 001_basic
is enough.



We do not check any unique code paths with it, so I do it only for 
001_basic now. I have left it as a test mode, since it will be easy to 
turn it on for any other test in the future if needed, and it fits the 
RewindTest.pm module well.




+# Moves WAL files to the temporary location and returns 
restore_command

+# to get them back.
+sub move_wal
+{
+   my ($tmp_dir, $master_pgdata) = @_;
+   my $wal_archive_path = "$tmp_dir/master_wal_archive";
+   my $wal_path = "$master_pgdata/pg_wal";
+   my $wal_dir;
+   my $restore_command;
I think that this routine is wrongly designed.  First, in order to
copy the contents from one path to another, we have
RecursiveCopy::copy_path, and there is a long history about making
it safe for the TAP tests.  Second, there is in
PostgresNode::enable_restoring already a code path doing the
decision-making on how building restore_command.  We should keep this
code path unique.



Yeah, thank you for pointing me to RecursiveCopy::copypath and 
especially PostgresNode::enable_restoring; I have modified the test to use 
them instead. The copypath routine does not work with existing destination 
directories and does not preserve the initial permissions, so I had to do 
some dirty work to achieve what we need in the test while keeping it simple 
at the same time. However, I think that using these unified routines is 
much better for future maintenance.


New version is attached. Do you have any other comments or objections?


Regards
--
Alexey

From 7af0b3642f6218c764eee361e013f24bfb43ffbe Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Fri, 31 Jan 2020 20:08:13 +0300
Subject: [PATCH v14] pg_rewind: Add options to restore WAL files from archive

Currently, pg_rewind fails when it cannot find the required WAL files in the
target data directory.  One has to manually figure out which WAL files are
required and copy them back from the archive.

  

Re: Physical replication slot advance is not persistent

2020-01-31 Thread Alexey Kondratov

On 30.01.2020 05:19, Michael Paquier wrote:

On Wed, Jan 29, 2020 at 05:10:20PM +0900, Kyotaro Horiguchi wrote:

Looks perfect.

Thanks Horiguchi-san and others.  Applied and back-patched down to
11.


Great! Thanks for getting this done.

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





Re: Physical replication slot advance is not persistent

2020-01-28 Thread Alexey Kondratov

On 28.01.2020 15:14, Kyotaro Horiguchi wrote:

At Tue, 28 Jan 2020 17:45:47 +0800, Craig Ringer  wrote 
in

On Tue, 28 Jan 2020 at 16:01, Michael Paquier  wrote:

So attached is an updated patch which addresses the problem just by
marking a physical slot as dirty if any advancing is done.  Some
documentation is added about the fact that an advance is persistent
only at the follow-up checkpoint.  And the tests are fixed to not use
a fake LSN but instead advance to the latest LSN position produced.

Any objections?

LGTM. Thankyou.

I agree not to save slots immediately. The code is written as described
above. The TAP test is correct.


+1, removing this broken saving code path from 
pg_replication_slot_advance and marking the slot as dirty looks good to me. 
It solves the issue and does not add any unnecessary complexity.




But the doc part looks a bit too detailed to me. Couldn't we explain
that without the word 'dirty'?

-and it will not be moved beyond the current insert location.  Returns
-name of the slot and real position to which it was advanced to.
+and it will not be moved beyond the current insert location. Returns
+name of the slot and real position to which it was advanced to. The
+updated slot is marked as dirty if any advancing is done, with its
+information being written out at the follow-up checkpoint. In the
+event of a crash, the slot may return to an earlier position.

and it will not be moved beyond the current insert location. Returns
name of the slot and real position to which it was advanced to. The
information of the updated slot is scheduled to be written out at the
follow-up checkpoint if any advancing is done. In the event of a
crash, the slot may return to an earlier position.


I just searched through the *.sgml files; we already use the terms 'dirty' and 
'flush' for writing out pages during checkpoints. Here we are 
trying to describe a very similar process, but in relation to 
replication slots, so it looks fine to me. At the same time, the term 
'schedule' is used for VACUUM, constraint checks, or the checkpoint itself.



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





Re: Physical replication slot advance is not persistent

2020-01-16 Thread Alexey Kondratov

On 09.01.2020 09:36, Kyotaro Horiguchi wrote:

Hello.

At Sun, 29 Dec 2019 15:12:16 +0300, Alexey Kondratov 
 wrote in

On 2019-12-26 16:35, Alexey Kondratov wrote:

Another concern is that ReplicationSlotIsDirty is added with the only
one user. It also cannot be used by SaveSlotToPath due to the
simultaneous usage of both flags dirty and just_dirtied there.
In that way, I hope that we should call ReplicationSlotSave
unconditionally in the pg_replication_slot_advance, so slot will be
saved or not automatically based on the slot->dirty flag. In the same
time, ReplicationSlotsComputeRequiredXmin and
ReplicationSlotsComputeRequiredLSN should be called by anyone, who
modifies xmin and LSN fields in the slot. Otherwise, currently we are
getting some leaky abstractions.

Sounds reasonable.


Great, so it seems that we have reached some agreement about who should 
mark the slot as dirty, at least for now.





If someone will utilise old WAL and after that crash will happen
between steps 2) and 3), then we start with old value of restart_lsn,
but without required WAL. I do not know how to properly reproduce it
without gdb and power off, so the chance is pretty low, but still it
could be a case.

In the first place we advance required LSN for every reply message but
save slot data only at checkpoint on physical repliation.  Such a
strict guarantee seems too much.

...

I think we shouldn't touch the paths used by replication protocol. And
don't we focus on how we make a change of a replication slot from SQL
interface persistent?  It seems to me that generaly we don't need to
save dirty slots other than checkpoint, but the SQL function seems
wanting the change to be saved immediately.

As the result, please find the attached, which is following only the
first paragraph cited above.


OK, I had definitely overthought that, thanks. This looks like a 
minimal subset of changes that actually solves the bug. I would only 
prefer to keep some additional comments (something like the attached); 
otherwise, after half a year it will be unclear again why we save the slot 
unconditionally here.



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index bb69683e2a..084e0c2960 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 		MyReplicationSlot->data.restart_lsn = moveto;
 		SpinLockRelease(>mutex);
 		retlsn = moveto;
+
+		ReplicationSlotMarkDirty();
+
+		/* We moved restart_lsn, update the global value. */
+		ReplicationSlotsComputeRequiredLSN();
 	}
 
 	return retlsn;
@@ -564,7 +569,10 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 		(uint32) (moveto >> 32), (uint32) moveto,
 		(uint32) (minlsn >> 32), (uint32) minlsn)));
 
-	/* Do the actual slot update, depending on the slot type */
+	/*
+	 * Do the actual slot update, depending on the slot type.  Slot will be
+	 * marked as dirty by pg_*_replication_slot_advance if changed.
+	 */
 	if (OidIsValid(MyReplicationSlot->data.database))
 		endlsn = pg_logical_replication_slot_advance(moveto);
 	else
@@ -573,14 +581,11 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	values[0] = NameGetDatum(>data.name);
 	nulls[0] = false;
 
-	/* Update the on disk state when lsn was updated. */
-	if (XLogRecPtrIsInvalid(endlsn))
-	{
-		ReplicationSlotMarkDirty();
-		ReplicationSlotsComputeRequiredXmin(false);
-		ReplicationSlotsComputeRequiredLSN();
-		ReplicationSlotSave();
-	}
+	/*
+	 * Update the on disk state.  No work here if
+	 * pg_*_replication_slot_advance call was a no-op.
+	 */
+	ReplicationSlotSave();
 
 	ReplicationSlotRelease();
 


Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly

2020-01-04 Thread Alexey Kondratov

On 2019-12-02 11:21, Michael Paquier wrote:

On Wed, Nov 27, 2019 at 08:47:06PM +0300, Alexey Kondratov wrote:

Thus, I cannot get your point correctly here. Can you, please, 
elaborate a

little bit more your concerns?


The case of REINDEX CONCURRENTLY is pretty simple, because a new
relation which is a copy of the old relation is created before doing
the reindex, so you simply need to set the tablespace OID correctly
in index_concurrently_create_copy().  And actually, I think that the
computation is incorrect because we need to check after
MyDatabaseTableSpace as well, no?

The case of REINDEX is more tricky, because you are working on a
relation that already exists, hence I think that what you need to do a
different thing before the actual REINDEX:
1) Update the existing relation's pg_class tuple to point to the new
tablespace.
2) Do a CommandCounterIncrement.
So I think that the order of the operations you are doing is incorrect,
and that you have a risk of breaking the existing tablespace assignment
logic done when first flushing a new relfilenode.

This actually brings an extra thing: when doing a plain REINDEX you
need to make sure that the past relfilenode of the relation gets away
properly.  The attached POC patch does that before doing the CCI which
is a bit ugly, but that's enough to show my point, and there is no
need to touch RelationSetNewRelfilenode() this way.



OK, I hope that now I understand your concerns better. Another thing I 
just realised is that RelationSetNewRelfilenode is also used for mapped 
relations, which are not movable at all, so adding a tablespace option 
there does not seem semantically correct either. However, I still 
have not found a way to reproduce how to actually break anything with my 
previous version of the patch.


As for doing RelationDropStorage before the CCI, I do not think that there 
is anything wrong with it; this is exactly what 
RelationSetNewRelfilenode does. Compared to your proposal, I have only moved 
RelationDropStorage before CatalogTupleUpdate to match the order 
inside RelationSetNewRelfilenode.
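
For illustration, here is a rough sketch of that sequence as I read the
discussion (drop the old storage, point the pg_class row at the new
tablespace, then make the change visible with a CommandCounterIncrement
before the new relfilenode is created). This is not the committed code and
the header list is approximate:

#include "postgres.h"

#include "access/htup_details.h"
#include "access/table.h"
#include "access/xact.h"
#include "catalog/indexing.h"
#include "catalog/pg_class.h"
#include "catalog/storage.h"
#include "storage/lockdefs.h"
#include "utils/rel.h"
#include "utils/syscache.h"

/* Sketch only: move an index's catalog entry to a new tablespace. */
void
set_index_tablespace(Relation indexRel, Oid newTableSpaceId)
{
	Relation	pg_class;
	HeapTuple	tuple;
	Form_pg_class rd_rel;

	pg_class = table_open(RelationRelationId, RowExclusiveLock);

	tuple = SearchSysCacheCopy1(RELOID,
								ObjectIdGetDatum(RelationGetRelid(indexRel)));
	if (!HeapTupleIsValid(tuple))
		elog(ERROR, "cache lookup failed for relation %u",
			 RelationGetRelid(indexRel));
	rd_rel = (Form_pg_class) GETSTRUCT(tuple);

	/* Schedule the old storage for removal at commit. */
	RelationDropStorage(indexRel);

	/* Point the catalog entry at the new tablespace. */
	rd_rel->reltablespace = newTableSpaceId;
	CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);

	heap_freetuple(tuple);
	table_close(pg_class, RowExclusiveLock);

	/* Make the catalog change visible before the reindex proper. */
	CommandCounterIncrement();
}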




Your patch has forgotten to update copyfuncs.c and equalfuncs.c with
the new tablespace string field.

It would be nice to add tab completion for this new clause in psql.
This is not ready for committer yet in my opinion, and more work is
done, so I am marking it as returned with feedback for now.



Finally, I have also merged and unified all of your and Masahiko's 
proposals with my recent changes: ereport corrections, tab completion, 
docs update, copy/equalfuncs update, etc. A new version is attached. Has 
it come any closer to a committable state now?



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From 81a9df46db91b6ba47e11f53e6e2379f835b576f Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Mon, 30 Dec 2019 20:00:37 +0300
Subject: [PATCH v7] Allow REINDEX to change tablespace

REINDEX already does full relation rewrite, this patch adds a
possibility to specify a new tablespace where new relfilenode
will be created.
---
 doc/src/sgml/ref/reindex.sgml | 24 +-
 src/backend/catalog/index.c   | 75 --
 src/backend/commands/cluster.c|  2 +-
 src/backend/commands/indexcmds.c  | 96 ---
 src/backend/commands/tablecmds.c  |  2 +-
 src/backend/nodes/copyfuncs.c |  1 +
 src/backend/nodes/equalfuncs.c|  1 +
 src/backend/parser/gram.y | 14 ++--
 src/backend/tcop/utility.c|  6 +-
 src/bin/psql/tab-complete.c   |  6 ++
 src/include/catalog/index.h   |  7 +-
 src/include/commands/defrem.h |  6 +-
 src/include/nodes/parsenodes.h|  1 +
 src/test/regress/input/tablespace.source  | 49 
 src/test/regress/output/tablespace.source | 66 
 15 files changed, 323 insertions(+), 33 deletions(-)

diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml
index 5aa59d3b751..cf6bd88c128 100644
--- a/doc/src/sgml/ref/reindex.sgml
+++ b/doc/src/sgml/ref/reindex.sgml
@@ -21,7 +21,7 @@ PostgreSQL documentation
 
  
 
-REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name
+REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name [ TABLESPACE new_tablespace ]
 
 where option can be one of:
 
@@ -169,6 +169,28 @@ REINDEX [ ( option [, ...] ) ] { IN
 

 
+   
+TABLESPACE
+
+ 
+  This specifies the tablespace where all rebuilt indexes will be created.
+  It cannot be used with "mapped" relations. If SCHEMA,
+  DATABASE or SYSTEM is specified, then
+  all unsuitable relations will be skipped and a single WARNING
+  will be generated.
+ 
+
+   
+
+   
+new_tablespace
+
+ 
+  The name of the specific tablespace to store rebui

Re: Physical replication slot advance is not persistent

2019-12-29 Thread Alexey Kondratov

On 2019-12-26 16:35, Alexey Kondratov wrote:


Another concern is that ReplicationSlotIsDirty is added with the only
one user. It also cannot be used by SaveSlotToPath due to the
simultaneous usage of both flags dirty and just_dirtied there.

In that way, I hope that we should call ReplicationSlotSave
unconditionally in the pg_replication_slot_advance, so slot will be
saved or not automatically based on the slot->dirty flag. In the same
time, ReplicationSlotsComputeRequiredXmin and
ReplicationSlotsComputeRequiredLSN should be called by anyone, who
modifies xmin and LSN fields in the slot. Otherwise, currently we are
getting some leaky abstractions.



It seems that there was even a race in the order of actions inside 
pg_replication_slot_advance; it did the following:


- ReplicationSlotMarkDirty();
- ReplicationSlotsComputeRequiredXmin(false);
- ReplicationSlotsComputeRequiredLSN();
- ReplicationSlotSave();

1) Mark the slot as dirty, which does nothing immediately apart from 
setting the dirty flag;

2) Compute the new global required LSN;
3) Flush the slot state to disk.

If someone recycles the old WAL and then a crash happens between 
steps 2) and 3), we restart with the old value of restart_lsn, but 
without the required WAL. I do not know how to reproduce it properly without 
gdb and a power-off, so the chance is pretty low, but it could still be 
a problem.


Logical slots were not affected, since LogicalConfirmReceivedLocation 
already performs the operations in the proper order (with comments) and 
has the slot-flushing routines.


Thus, in the attached patch I have decided not to perform slot 
flushing in pg_replication_slot_advance at all and to do it in the 
pg_physical_replication_slot_advance instead, as is done in 
LogicalConfirmReceivedLocation.


Since this bugfix has not moved forward during the week, I will put it 
on the 2020-01 commitfest. Kyotaro, if you do not object, I will add you 
as a reviewer, as you have already given a lot of feedback; thank you for 
that!



Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company
From c9446bfc60ae125adea476bcc5b7d50f4bafa38b Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Sun, 29 Dec 2019 14:38:13 +0300
Subject: [PATCH v3] Make physical slot advance to be persistent

Starting from v11 pg_replication_slot_advance result was not persistent
after restart. This patch fixes that bug by proper slot flushing if it
was modified.
---
 src/backend/replication/logical/logical.c | 10 ++--
 src/backend/replication/slotfuncs.c   | 31 +++
 2 files changed, 29 insertions(+), 12 deletions(-)

diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c
index 7e06615864..bdb82da39b 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -1049,7 +1049,10 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 
 		SpinLockRelease(>mutex);
 
-		/* first write new xmin to disk, so we know what's up after a crash */
+		/*
+		 * First write new xmin and/or LSN to disk, so we know what's up
+		 * after a crash.
+		 */
 		if (updated_xmin || updated_restart)
 		{
 			ReplicationSlotMarkDirty();
@@ -1057,6 +1060,10 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 			elog(DEBUG1, "updated xmin: %u restart: %u", updated_xmin, updated_restart);
 		}
 
+		/* Compute global required LSN if restart_lsn was changed */
+		if (updated_restart)
+			ReplicationSlotsComputeRequiredLSN();
+
 		/*
 		 * Now the new xmin is safely on disk, we can let the global value
 		 * advance. We do not take ProcArrayLock or similar since we only
@@ -1070,7 +1077,6 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn)
 			SpinLockRelease(>mutex);
 
 			ReplicationSlotsComputeRequiredXmin(false);
-			ReplicationSlotsComputeRequiredLSN();
 		}
 	}
 	else
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index ba08ad405f..9d0e99afb8 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -370,6 +370,19 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 		MyReplicationSlot->data.restart_lsn = moveto;
 		SpinLockRelease(>mutex);
 		retlsn = moveto;
+
+		/*
+		 * Dirty the slot as we updated data that is meant to be persistent
+		 * between restarts, flush it and re-compute global required LSN.
+		 *
+		 * If we change the order of operations and do not flush slot before
+		 * required LSN computing, then there will be a race.  Least recent WAL
+		 * segments may be already utilized, so if we crash and start again with
+		 * old restart_lsn there will be no WAL to proceed properly.
+		 */
+		ReplicationSlotMarkDirty();
+		ReplicationSlotSave();
+		ReplicationSlotsComputeRequiredLSN();
 	}
 
 	return retlsn;
@@ -463,6 +476,10 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 
 		if (ctx->reader->EndRecPtr != InvalidXLogRecPtr)
 		{
+

Re: Physical replication slot advance is not persistent

2019-12-26 Thread Alexey Kondratov

On 26.12.2019 11:33, Kyotaro Horiguchi wrote:

At Wed, 25 Dec 2019 20:28:04 +0300, Alexey Kondratov 
 wrote in

Yep, it helps with physical replication slot persistence after
advance, but the whole validation (moveto <= endlsn) does not make
sense for me. The value of moveto should be >= than minlsn ==
confirmed_flush / restart_lsn, while endlsn == retlsn is also always
initialized with confirmed_flush / restart_lsn. Thus, your condition
seems to be true in any case, even if it was no-op one, which we were
intended to catch.

...

If I get it correctly, then we already keep previous slot position in
the minlsn, so we just have to compare endlsn with minlsn and treat
endlsn <= minlsn as a no-op without slot state flushing.

I think you're right about the condition. (endlsn cannot be less than
minlsn, though) But I came to think that we shouldn't use locations in
that decision.


Attached is a patch that does this, so it fixes the bug without
affecting any user-facing behavior. Detailed comment section and DEBUG
output are also added. What do you think now?

I have also forgotten to mention that all versions down to 11.0 should
be affected with this bug.

pg_replication_slot_advance is the only caller of
pg_logical/physical_replication_slot_advacne so there's no apparent
determinant on who-does-what about dirtying and other housekeeping
calculation like *ComputeRequired*() functions, but the current shape
seems a kind of inconsistent between logical and physical.

I think pg_logaical/physical_replication_slot_advance should dirty the
slot if they actually changed anything. And
pg_replication_slot_advance should do the housekeeping if the slots
are dirtied.  (Otherwise both the caller function should dirty the
slot in lieu of the two.)

The attached does that.


Both approaches look fine to me: my last patch with as minimal an 
intervention as possible, and your refactoring. I think that it is the 
right direction to let everyone who modifies slot->data also mark the slot 
as dirty.


I find one comment section in your code rather misleading:

+        /*
+         * We don't need to dirty the slot only for the above change, but dirty
+         * this slot for the same reason with
+         * pg_logical_replication_slot_advance.
+         */

We just modified MyReplicationSlot->data, which is "On-Disk data of a 
replication slot, preserved across restarts", so it definitely should 
be marked as dirty in its own right, not merely because 
pg_logical_replication_slot_advance does the same.


Also, I think that using this transient variable in 
ReplicationSlotIsDirty is not necessary; MyReplicationSlot is already a 
pointer to the slot in shared memory.


+    ReplicationSlot *slot = MyReplicationSlot;
+
+    Assert(MyReplicationSlot != NULL);
+
+    SpinLockAcquire(>mutex);

Otherwise it looks fine to me, so attached is the same diff, but with 
the corrections proposed above.


Another concern is that ReplicationSlotIsDirty is added with only 
one user. It also cannot be used by SaveSlotToPath due to the 
simultaneous usage of both the dirty and just_dirtied flags there.


Given that, I think we should call ReplicationSlotSave 
unconditionally in pg_replication_slot_advance, so the slot will or will 
not be saved automatically based on the slot->dirty flag. At the same 
time, ReplicationSlotsComputeRequiredXmin and 
ReplicationSlotsComputeRequiredLSN should be called by anyone who 
modifies the xmin and LSN fields in the slot. Otherwise, we end up with 
some leaky abstractions.



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c
index 21ae8531b3..edf661521a 100644
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -672,6 +672,23 @@ ReplicationSlotMarkDirty(void)
 	SpinLockRelease(>mutex);
 }
 
+/*
+ * Verify whether currently acquired slot is dirty.
+ */
+bool
+ReplicationSlotIsDirty(void)
+{
+	bool dirty;
+
+	Assert(MyReplicationSlot != NULL);
+
+	SpinLockAcquire(>mutex);
+	dirty = MyReplicationSlot->dirty;
+	SpinLockRelease(>mutex);
+
+	return dirty;
+}
+
 /*
  * Convert a slot that's marked as RS_EPHEMERAL to a RS_PERSISTENT slot,
  * guaranteeing it will be there after an eventual crash.
diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6683fc3f9b..d7a16a9071 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -370,6 +370,12 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
 		MyReplicationSlot->data.restart_lsn = moveto;
 		SpinLockRelease(>mutex);
 		retlsn = moveto;
+
+		/*
+		 * Dirty the slot as we updated data that is meant to be
+		 * persistent on disk.
+		 */
+		ReplicationSlotMarkDirty();
 	}
 
 	return retlsn;
@@ -574,9 +580,8 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	nulls[0] = false;
 
 

Re: Physical replication slot advance is not persistent

2019-12-25 Thread Alexey Kondratov

On 25.12.2019 16:51, Alexey Kondratov wrote:

On 25.12.2019 07:03, Kyotaro Horiguchi wrote:

As the result, I think what is needed there is just checking if the
returned lsn is equal to or larger than moveto. Doesn't the following
change work?

-    if (XLogRecPtrIsInvalid(endlsn))
+    if (moveto <= endlsn)


Yep, it helps with physical replication slot persistence after 
advance, but the whole validation (moveto <= endlsn) does not make 
sense for me. The value of moveto should be >= than minlsn == 
confirmed_flush / restart_lsn, while endlsn == retlsn is also always 
initialized with confirmed_flush / restart_lsn. Thus, your condition 
seems to be true in any case, even if it was no-op one, which we were 
intended to catch.


I will recheck everything again and try to come up with something 
during this week.


If I get it correctly, we already keep the previous slot position in 
minlsn, so we just have to compare endlsn with minlsn and treat 
endlsn <= minlsn as a no-op without flushing the slot state.


Attached is a patch that does this, so it fixes the bug without 
affecting any user-facing behavior. A detailed comment and DEBUG 
output are also added. What do you think now?


I have also forgotten to mention that all versions down to 11.0 should 
be affected by this bug.



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

>From e08299ddf92abc3fb4e802e8b475097fa746c458 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Wed, 25 Dec 2019 20:12:42 +0300
Subject: [PATCH v2] Make physical replslot advance persistent

---
 src/backend/replication/slotfuncs.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 6683fc3f9b..bc5c93b089 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -573,9 +573,17 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	values[0] = NameGetDatum(>data.name);
 	nulls[0] = false;
 
-	/* Update the on disk state when lsn was updated. */
-	if (XLogRecPtrIsInvalid(endlsn))
+	/*
+	 * Update the on disk state when LSN was updated.  Here we rely on the facts
+	 * that: 1) minlsn is initialized with restart_lsn and confirmed_flush LSN for
+	 * physical and logical replication slot respectively, and 2) endlsn is set in
+	 * the same way by pg_*_replication_slot_advance, but after advance.  Thus,
+	 * endlsn <= minlsn is treated as a no-op.
+	 */
+	if (endlsn > minlsn)
 	{
+		elog(DEBUG1, "flushing replication slot '%s' state",
+			NameStr(MyReplicationSlot->data.name));
 		ReplicationSlotMarkDirty();
 		ReplicationSlotsComputeRequiredXmin(false);
 		ReplicationSlotsComputeRequiredLSN();

base-commit: 8ce3aa9b5914d1ac45ed3f9bc484f66b3c4850c7
-- 
2.17.1



Re: Physical replication slot advance is not persistent

2019-12-25 Thread Alexey Kondratov

On 25.12.2019 07:03, Kyotaro Horiguchi wrote:

At Tue, 24 Dec 2019 20:12:32 +0300, Alexey Kondratov 
 wrote in

I dig into the code and it happens because of this if statement:

     /* Update the on disk state when lsn was updated. */
     if (XLogRecPtrIsInvalid(endlsn))
     {
         ReplicationSlotMarkDirty();
         ReplicationSlotsComputeRequiredXmin(false);
         ReplicationSlotsComputeRequiredLSN();
         ReplicationSlotSave();
     }

Yes, it seems just broken.


Attached is a small patch, which fixes this bug. I have tried to
stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))'
and now pg_logical_replication_slot_advance and
pg_physical_replication_slot_advance return InvalidXLogRecPtr if
no-op.

What do you think?

I think we shouldn't change the definition of
pg_*_replication_slot_advance since the result is user-facing.


Yes, that was my main concern too. OK.


The functions return an invalid value only when the slot had the
invalid value and failed to move the position. I think that happens
only for uninitialized slots.

Anyway what we should do there is dirtying the slot when the operation
can be assumed to have been succeeded.

As the result, I think what is needed there is just checking if the
returned lsn is equal to or larger than moveto. Doesn't the following
change work?

-   if (XLogRecPtrIsInvalid(endlsn))
+   if (moveto <= endlsn)


Yep, it helps with physical replication slot persistence after advance, 
but the whole validation (moveto <= endlsn) does not make sense to me. 
The value of moveto should be >= minlsn == confirmed_flush / 
restart_lsn, while endlsn == retlsn is also always initialized with 
confirmed_flush / restart_lsn. Thus, your condition seems to be true in 
any case, even for a no-op, which is exactly what we were trying to catch.


Actually, if we do not want to change pg_*_replication_slot_advance, we 
can just add a straightforward check that either confirmed_flush or 
restart_lsn changed after the slot advance guts have run. It will be a 
little bit bulky, but much clearer, and it will never be affected by 
changes to the pg_*_replication_slot_advance logic.
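
For illustration only, such a check inside pg_replication_slot_advance could 
look roughly like the fragment below (following the structure of the 
function as it appears in the attached patches; it was never implemented 
this way):

	/* Illustration: flush the slot state only if the advance guts actually
	 * moved one of the persistent positions. */
	XLogRecPtr	prev_restart = MyReplicationSlot->data.restart_lsn;
	XLogRecPtr	prev_confirmed = MyReplicationSlot->data.confirmed_flush;

	if (OidIsValid(MyReplicationSlot->data.database))
		endlsn = pg_logical_replication_slot_advance(moveto);
	else
		endlsn = pg_physical_replication_slot_advance(moveto);

	if (MyReplicationSlot->data.restart_lsn != prev_restart ||
		MyReplicationSlot->data.confirmed_flush != prev_confirmed)
	{
		ReplicationSlotMarkDirty();
		ReplicationSlotsComputeRequiredXmin(false);
		ReplicationSlotsComputeRequiredLSN();
		ReplicationSlotSave();
	}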



Another weird part I have found is this assignment inside 
pg_logical_replication_slot_advance:


/* Initialize our return value in case we don't do anything */
retlsn = MyReplicationSlot->data.confirmed_flush;

It looks redundant, since later we do the same assignment, which should 
be reachable in any case.


I will recheck everything again and try to come up with something during 
this week.



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





Physical replication slot advance is not persistent

2019-12-24 Thread Alexey Kondratov

Hi Hackers,

I have accidentally noticed that pg_replication_slot_advance only 
changes the in-memory state of the slot when its type is physical. Its new 
value does not survive a restart.


Reproduction steps:

1) Create new slot and remember its restart_lsn

SELECT pg_create_physical_replication_slot('slot1', true);
SELECT * from pg_replication_slots;

2) Generate some dummy WAL

CHECKPOINT;
SELECT pg_switch_wal();
CHECKPOINT;
SELECT pg_switch_wal();

3) Advance slot to the value of pg_current_wal_insert_lsn()

SELECT pg_replication_slot_advance('slot1', '0/160001A0');

4) Check that restart_lsn has been updated

SELECT * from pg_replication_slots;

5) Restart the server and check restart_lsn again. It will be the same as 
in step 1, i.e. the advance was lost.



I dug into the code, and it happens because of this if statement:

    /* Update the on disk state when lsn was updated. */
    if (XLogRecPtrIsInvalid(endlsn))
    {
        ReplicationSlotMarkDirty();
        ReplicationSlotsComputeRequiredXmin(false);
        ReplicationSlotsComputeRequiredLSN();
        ReplicationSlotSave();
    }

Actually, endlsn is always a valid LSN after the slot advance guts have 
run. It works for logical slots only by chance, since there is an 
implicit ReplicationSlotMarkDirty() call inside 
LogicalConfirmReceivedLocation.


Attached is a small patch which fixes this bug. I have tried to
stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))'
check, so now pg_logical_replication_slot_advance and
pg_physical_replication_slot_advance return InvalidXLogRecPtr on a
no-op.

What do you think?


Regards
--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

P.S. I have CCed Simon and Michael, as they are the last ones who seriously 
touched the pg_replication_slot_advance code.


>From 36d1fa2a89b3fb354a813354496df475ee11b62e Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Tue, 24 Dec 2019 18:21:50 +0300
Subject: [PATCH v1] Make physical replslot advance persistent

---
 src/backend/replication/slotfuncs.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c
index 46e6dd4d12..826708d3f6 100644
--- a/src/backend/replication/slotfuncs.c
+++ b/src/backend/replication/slotfuncs.c
@@ -358,12 +358,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
  * The LSN position to move to is compared simply to the slot's restart_lsn,
  * knowing that any position older than that would be removed by successive
  * checkpoints.
+ *
+ * Returns InvalidXLogRecPtr if no-op.
  */
 static XLogRecPtr
 pg_physical_replication_slot_advance(XLogRecPtr moveto)
 {
 	XLogRecPtr	startlsn = MyReplicationSlot->data.restart_lsn;
-	XLogRecPtr	retlsn = startlsn;
+	XLogRecPtr	retlsn = InvalidXLogRecPtr;
 
 	if (startlsn < moveto)
 	{
@@ -386,6 +388,8 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto)
  * because we need to digest WAL to advance restart_lsn allowing to recycle
  * WAL and removal of old catalog tuples.  As decoding is done in fast_forward
  * mode, no changes are generated anyway.
+ *
+ * Returns InvalidXLogRecPtr if no-op.
  */
 static XLogRecPtr
 pg_logical_replication_slot_advance(XLogRecPtr moveto)
@@ -393,7 +397,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 	LogicalDecodingContext *ctx;
 	ResourceOwner old_resowner = CurrentResourceOwner;
 	XLogRecPtr	startlsn;
-	XLogRecPtr	retlsn;
+	XLogRecPtr	retlsn = InvalidXLogRecPtr;
 
 	PG_TRY();
 	{
@@ -414,9 +418,6 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 		 */
 		startlsn = MyReplicationSlot->data.restart_lsn;
 
-		/* Initialize our return value in case we don't do anything */
-		retlsn = MyReplicationSlot->data.confirmed_flush;
-
 		/* invalidate non-timetravel entries */
 		InvalidateSystemCaches();
 
@@ -480,9 +481,9 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto)
 			 * better than always losing the position even on clean restart.
 			 */
 			ReplicationSlotMarkDirty();
-		}
 
-		retlsn = MyReplicationSlot->data.confirmed_flush;
+			retlsn = MyReplicationSlot->data.confirmed_flush;
+		}
 
 		/* free context, call shutdown callback */
 		FreeDecodingContext(ctx);
@@ -575,7 +576,7 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS)
 	nulls[0] = false;
 
 	/* Update the on disk state when lsn was updated. */
-	if (XLogRecPtrIsInvalid(endlsn))
+	if (!XLogRecPtrIsInvalid(endlsn))
 	{
 		ReplicationSlotMarkDirty();
 		ReplicationSlotsComputeRequiredXmin(false);
-- 
2.17.1



Re: [PATCH] Increase the maximum value track_activity_query_size

2019-12-20 Thread Alexey Kondratov

On 19.12.2019 20:52, Robert Haas wrote:

On Thu, Dec 19, 2019 at 10:59 AM Tom Lane  wrote:

Bruce Momjian  writes:

Good question.  I am in favor of allowing a larger value if no one
objects.  I don't think adding the min/max is helpful.


The original poster.



And probably anyone else who debugs stuck queries from yet another crazy 
ORM. Yes, one could use log_min_duration_statement, but being able to get 
the full query text directly from pg_stat_activity without eyeballing 
the logs is nice. Also, IIRC, log_min_duration_statement applies only to 
completed statements.



I think there are pretty obvious performance and memory-consumption
penalties to very large track_activity_query_size values.  Who exactly
are we really helping if we let them set it to huge values?

(wanders away wondering if we have suitable integer-overflow checks
in relevant code paths...)



The value of pgstat_track_activity_query_size is in bytes, so setting it 
to any value below INT_MAX seems to be safe from that perspective. 
However, since it is multiplied by NumBackendStatSlots, its reasonable 
value should be far below INT_MAX (~2 GB).
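
As a back-of-the-envelope illustration with made-up numbers (1 MB per slot 
and 500 backend slots are hypothetical, not recommendations):

#include <stdio.h>

int
main(void)
{
	/* hypothetical values, just to show the multiplication involved */
	long	query_size_bytes = 1024L * 1024;	/* track_activity_query_size = 1 MB */
	long	backend_stat_slots = 500;			/* roughly max_connections plus aux processes */

	printf("extra shared memory: %ld MB\n",
		   query_size_bytes * backend_stat_slots / (1024 * 1024));
	return 0;
}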


Honestly, it does not look to me like something badly needed, but 
still. We already have hundreds of GUCs and it is easy for a user to 
build a sub-optimal configuration, so does this overprotection make sense?



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company





Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line

2019-12-03 Thread Alexey Kondratov

On 01.12.2019 5:57, Michael Paquier wrote:

On Thu, Sep 26, 2019 at 03:08:22PM +0300, Alexey Kondratov wrote:

As Alvaro correctly pointed in the nearby thread [1], we've got an
interference regarding -R command line argument. I agree that it's a good
idea to reserve -R for recovery configuration write to be consistent with
pg_basebackup, so I've updated my patch to use another letters:

The patch has rotten and does not apply anymore.  Could you please
send a rebased version?  I have moved the patch to next CF, waiting on
author for now.


Rebased and updated patch is attached.

There was a problem with testing the new restore_command options together 
with the recent ensureCleanShutdown. My test simply moves all WAL out of 
pg_wal and generates a restore_command for testing the new options, but this 
prevents the startup recovery required by ensureCleanShutdown. To test both 
options at the same time, we would have to leave some recent WAL segments in 
pg_wal and be sure that they are enough for startup recovery, but not 
enough for a successful pg_rewind run. I have manually figured out the 
amount of inserted records (and generated WAL) required to achieve this. 
However, I think that this approach is not good for a test, since tests 
may be modified in the future (the amount of writes to the DB may change), 
or even the volume of WAL written by Postgres may change. That would lead 
to tests that falsely always fail or always pass.


Moreover, testing both ensureCleanShutdown and the new options at the same 
time doesn't hit any new code paths, so I decided to test the new options 
with --no-ensure-shutdown for simplicity and stability of the tests.



Regards

--
Alexey Kondratov

Postgres Professional https://www.postgrespro.com
Russian Postgres Company

>From a05c3343e0bd6fe339c944f6b0cde64ceb46a0b3 Mon Sep 17 00:00:00 2001
From: Alexey Kondratov 
Date: Tue, 19 Feb 2019 19:14:53 +0300
Subject: [PATCH v11] pg_rewind: options to use restore_command from command
 line or cluster config

Previously, when pg_rewind could not find the required WAL files in the
target data directory, the rewind process would fail. One had to
manually figure out which of the required WAL files had already been moved
to archival storage and copy them back.

This patch adds the possibility to specify restore_command via a command
line option or to use the one specified inside postgresql.conf. The
specified restore_command will be used for automatic retrieval of missing
WAL files from archival storage.
---
 doc/src/sgml/ref/pg_rewind.sgml   |  49 +++-
 src/bin/pg_rewind/parsexlog.c | 164 +-
 src/bin/pg_rewind/pg_rewind.c | 118 +++---
 src/bin/pg_rewind/pg_rewind.h |   6 +-
 src/bin/pg_rewind/t/001_basic.pl  |   4 +-
 src/bin/pg_rewind/t/002_databases.pl  |   4 +-
 src/bin/pg_rewind/t/003_extrafiles.pl |   4 +-
 src/bin/pg_rewind/t/RewindTest.pm | 105 -
 8 files changed, 416 insertions(+), 38 deletions(-)

diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml
index 42d29edd4e..b601a5c7e4 100644
--- a/doc/src/sgml/ref/pg_rewind.sgml
+++ b/doc/src/sgml/ref/pg_rewind.sgml
@@ -66,11 +66,12 @@ PostgreSQL documentation
can be found either on the target timeline, the source timeline, or their common
ancestor. In the typical failover scenario where the target cluster was
shut down soon after the divergence, this is not a problem, but if the
-   target cluster ran for a long time after the divergence, the old WAL
-   files might no longer be present. In that case, they can be manually
-   copied from the WAL archive to the pg_wal directory, or
-   fetched on startup by configuring  or
-   .  The use of
+   target cluster ran for a long time after the divergence, its old WAL
+   files might no longer be present. In this case, you can manually copy them
+   from the WAL archive to the pg_wal directory, or run
+   pg_rewind with the -c or
+   -C option to automatically retrieve them from the WAL
+   archive. The use of
pg_rewind is not limited to failover, e.g.  a standby
server can be promoted, run some write transactions, and then rewinded
to become a standby again.
@@ -232,6 +233,39 @@ PostgreSQL documentation
   
  
 
+ 
+  -c
+  --restore-target-wal
+  
+   
+Use the restore_command defined in
+postgresql.conf to retrieve WAL files from
+the WAL archive if these files are no longer available in the
+pg_wal directory of the target cluster.
+   
+   
+This option cannot be used together with --target-restore-command.
+   
+  
+ 
+
+ 
+  -C restore_command
+  --target-restore-command=restore_command
+  
+   
+Specifies the restore_command to use for retrieving
+WAL files from the WAL archive if these files are no longer available
+in the pg_wal directory of the target cluster.
+   
+   
+If restore_command is already 
