Re: Supply restore_command to pg_rewind via CLI argument
Hi, On Tue, Mar 22, 2022 at 3:32 AM Andres Freund wrote: > > Doesn't apply once more: http://cfbot.cputube.org/patch_37_3213.log > Thanks for the reminder, a rebased version is attached. Regards -- Alexey Kondratov From df56b5c7b882e781fdc0b92e7a83331f0baab094 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 29 Jun 2021 17:17:47 +0300 Subject: [PATCH v4] Allow providing restore_command as a command line option to pg_rewind This could be useful when postgres is usually run with -c config_file=..., so the actual configuration and restore_command is not inside $PGDATA/postgresql.conf. --- doc/src/sgml/ref/pg_rewind.sgml | 19 + src/bin/pg_rewind/pg_rewind.c| 45 ++- src/bin/pg_rewind/t/001_basic.pl | 1 + src/bin/pg_rewind/t/RewindTest.pm| 95 ++-- src/test/perl/PostgreSQL/Test/Cluster.pm | 5 +- 5 files changed, 106 insertions(+), 59 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 33e6bb64ad..af75f35867 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -241,6 +241,25 @@ PostgreSQL documentation + + -C restore_command + --target-restore-command=restore_command + + +Specifies the restore_command to use for retrieving +WAL files from the WAL archive if these files are no longer available +in the pg_wal directory of the target cluster. + + +If restore_command is already set in +postgresql.conf, you can provide the +--restore-target-wal option instead. If both options +are provided, then --target-restore-command +will be used. + + + + --debug diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c index b39b5c1aac..9aca041425 100644 --- a/src/bin/pg_rewind/pg_rewind.c +++ b/src/bin/pg_rewind/pg_rewind.c @@ -85,21 +85,22 @@ usage(const char *progname) printf(_("%s resynchronizes a PostgreSQL cluster with another copy of the cluster.\n\n"), progname); printf(_("Usage:\n %s [OPTION]...\n\n"), progname); printf(_("Options:\n")); - printf(_(" -c, --restore-target-wal use restore_command in target configuration to\n" - " retrieve WAL files from archives\n")); - printf(_(" -D, --target-pgdata=DIRECTORY existing data directory to modify\n")); - printf(_(" --source-pgdata=DIRECTORY source data directory to synchronize with\n")); - printf(_(" --source-server=CONNSTRsource server to synchronize with\n")); - printf(_(" -n, --dry-run stop before modifying anything\n")); - printf(_(" -N, --no-sync do not wait for changes to be written\n" - " safely to disk\n")); - printf(_(" -P, --progress write progress messages\n")); - printf(_(" -R, --write-recovery-conf write configuration for replication\n" - " (requires --source-server)\n")); - printf(_(" --debugwrite a lot of debug messages\n")); - printf(_(" --no-ensure-shutdown do not automatically fix unclean shutdown\n")); - printf(_(" -V, --version output version information, then exit\n")); - printf(_(" -?, --help show this help, then exit\n")); + printf(_(" -c, --restore-target-wal use restore_command in target configuration to\n" + "retrieve WAL files from archives\n")); + printf(_(" -C, --target-restore-command=COMMAND target WAL restore_command\n")); + printf(_(" -D, --target-pgdata=DIRECTORY existing data directory to modify\n")); + printf(_(" --source-pgdata=DIRECTORY source data directory to synchronize with\n")); + printf(_(" --source-server=CONNSTR source server to synchronize with\n")); + printf(_(" -n, --dry-run stop before modifying anything\n")); + printf(_(" -N, --no-sync do not wait for changes to be written\n" + "safely to disk\n")); + 
printf(_(" -P, --progresswrite progress messages\n")); + printf(_(" -R, --write-recovery-conf write configuration for replication\n" + "(requires --source-server)\n")); + printf(_(" --debug write a lot of debug messages\n")); + printf(_(" --no-ensure-shutdown do not automatically fix
Re: Supply restore_command to pg_rewind via CLI argument
On Fri, Aug 27, 2021 at 10:05 AM Andrey Borodin wrote:
>
> There is a small bug
>
> +	/*
> +	 * Take restore_command from the postgresql.conf only if it is not already
> +	 * provided as a command line option.
> +	 */
> +	if (!restore_wal && restore_command == NULL)
> 		return;
>
> I think we should use condition (!restore_wal || restore_command != NULL).
>

Yes, you are right, thanks. Attached is a fixed version. Tests were passing since PostgresNode->enable_restoring is adding restore_command to the postgresql.conf anyway.

>
> Besides this patch looks good and is ready for committer IMV.
>

--
Alexey Kondratov

v2-0001-Allow-providing-restore_command-as-a-command-line.patch
Description: Binary data
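For illustration, the corrected early exit amounts to the following sketch in pg_rewind.c (variable and function names follow the quoted diff; the rest of the routine is elided, so this is a sketch rather than the committed code):

static void
getRestoreCommand(const char *argv0)
{
	/*
	 * Consult postgresql.conf only when -c/--restore-target-wal was given
	 * and no restore_command was already supplied on the command line.
	 */
	if (!restore_wal || restore_command != NULL)
		return;

	/* ... unchanged code that runs "postgres -C restore_command" ... */
}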
Re: Supply restore_command to pg_rewind via CLI argument
On Fri, Jun 18, 2021 at 10:06 PM Alexey Kondratov wrote: > On Fri, Jun 18, 2021 at 5:42 PM Andrey Borodin wrote: > > > > If we run 'pg_rewind --restore-target-wal' there must be restore_command in > > config of target installation. But if the config is not within > > $PGDATA\postgresql.conf pg_rewind cannot use it. > > If we run postmaster with `-c > > config_file=/etc/postgresql/10/data/postgresql.conf`, we simply cannot use > > the feature. We solved the problem by putting config into PGDATA only > > during pg_rewind, but this does not seem like a very robust solution. > > > > Yeah, Michael was against it, while we had no good arguments, so > Alexander removed it, IIRC. This example sounds reasonable to me. I > also recall some complaints from PostgresPro support folks, that it is > sad to not have a cli option to pass restore_command. However, I just > thought about another recent feature --- ensure clean shutdown, which > is turned on by default. So you usually run Postgres with one config, > but pg_rewind may start it with another, although in single-user mode. > Is it fine for you? > > > > > Maybe we could add "-C, --target-restore-command=COMMAND target WAL > > restore_command\n" as was proposed within earlier versions of the patch[0]? > > Or instruct pg_rewind to pass config to 'postgres -C restore_command' run? > > Hm, adding --target-restore-command is the simplest way, sure, but > forwarding something like '-c config_file=...' to postgres sounds > interesting too. Could it have any use case beside providing a > restore_command? I cannot imagine anything right now, but if any > exist, then it could be a more universal approach. > > > > > From my POV adding --target-restore-command is simplest way, I can extract > > corresponding portions from original patch. > > > > I will have a look, maybe I even already have this patch separately. I > remember that we were considering adding this option to PostgresPro, > when we did a backport of this feature. > Here it is. I have slightly adapted the previous patch to the recent pg_rewind changes. In this version -C does not conflict with -c, it just overrides it. -- Alexey Kondratov v1-0001-Allow-providing-restore_command-as-a-command-line.patch Description: Binary data
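The v1 behaviour described above ("-C does not conflict with -c, it just overrides it") can be pictured as a small fragment of pg_rewind's option loop; the option letters and variable names here are assumptions based on the thread, not the attached patch verbatim:

	case 'c':
		/* use restore_command from the target configuration */
		restore_wal = true;
		break;

	case 'C':
		/* explicit restore_command; takes precedence over the config value */
		restore_command = pg_strdup(optarg);
		break;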
Re: Supply restore_command to pg_rewind via CLI argument
Hi, On Fri, Jun 18, 2021 at 5:42 PM Andrey Borodin wrote: > > If we run 'pg_rewind --restore-target-wal' there must be restore_command in > config of target installation. But if the config is not within > $PGDATA\postgresql.conf pg_rewind cannot use it. > If we run postmaster with `-c > config_file=/etc/postgresql/10/data/postgresql.conf`, we simply cannot use > the feature. We solved the problem by putting config into PGDATA only during > pg_rewind, but this does not seem like a very robust solution. > Yeah, Michael was against it, while we had no good arguments, so Alexander removed it, IIRC. This example sounds reasonable to me. I also recall some complaints from PostgresPro support folks, that it is sad to not have a cli option to pass restore_command. However, I just thought about another recent feature --- ensure clean shutdown, which is turned on by default. So you usually run Postgres with one config, but pg_rewind may start it with another, although in single-user mode. Is it fine for you? > > Maybe we could add "-C, --target-restore-command=COMMAND target WAL > restore_command\n" as was proposed within earlier versions of the patch[0]? > Or instruct pg_rewind to pass config to 'postgres -C restore_command' run? Hm, adding --target-restore-command is the simplest way, sure, but forwarding something like '-c config_file=...' to postgres sounds interesting too. Could it have any use case beside providing a restore_command? I cannot imagine anything right now, but if any exist, then it could be a more universal approach. > > From my POV adding --target-restore-command is simplest way, I can extract > corresponding portions from original patch. > I will have a look, maybe I even already have this patch separately. I remember that we were considering adding this option to PostgresPro, when we did a backport of this feature. -- Alexey Kondratov
Re: Free port choosing freezes when PostgresNode::use_tcp is used on BSD systems
On 2021-04-20 18:03, Tom Lane wrote: Andrew Dunstan writes: On 4/19/21 7:22 PM, Tom Lane wrote: I wonder whether we could get away with just replacing the $use_tcp test with $TestLib::windows_os. It's not really apparent to me why we should care about 127.0.0.not-1 on Unix-oid systems. Yeah The comment is a bit strange anyway - Cygwin is actually going to use Unix sockets, not TCP. I think I would just change the test to this: $use_tcp && $TestLib::windows_os. Works for me, but we need to revise the comment to match. Then it could be somewhat like that, I guess. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Companydiff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index db47a97d196..f7b488ed464 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -1191,19 +1191,19 @@ sub get_free_port # Check to see if anything else is listening on this TCP port. # Seek a port available for all possible listen_addresses values, # so callers can harness this port for the widest range of purposes. - # The 0.0.0.0 test achieves that for post-2006 Cygwin, which - # automatically sets SO_EXCLUSIVEADDRUSE. The same holds for MSYS (a - # Cygwin fork). Testing 0.0.0.0 is insufficient for Windows native - # Perl (https://stackoverflow.com/a/14388707), so we also test - # individual addresses. + # The 0.0.0.0 test achieves that for MSYS, which automatically sets + # SO_EXCLUSIVEADDRUSE. Testing 0.0.0.0 is insufficient for Windows + # native Perl (https://stackoverflow.com/a/14388707), so we also + # have to test individual addresses. Doing that for 127.0.0/24 + # addresses other than 127.0.0.1 might fail with EADDRNOTAVAIL on + # non-Linux, non-Windows kernels. # - # On non-Linux, non-Windows kernels, binding to 127.0.0/24 addresses - # other than 127.0.0.1 might fail with EADDRNOTAVAIL. Binding to - # 0.0.0.0 is unnecessary on non-Windows systems. + # That way, 0.0.0.0 and individual 127.0.0/24 addresses are tested + # only on Windows when TCP usage is requested. if ($found == 1) { foreach my $addr (qw(127.0.0.1), -$use_tcp ? qw(127.0.0.2 127.0.0.3 0.0.0.0) : ()) +$use_tcp && $TestLib::windows_os ? qw(127.0.0.2 127.0.0.3 0.0.0.0) : ()) { if (!can_bind($addr, $port)) {
Free port choosing freezes when PostgresNode::use_tcp is used on BSD systems
Hi Hackers, Inside PostgresNode.pm there is a free port choosing routine --- get_free_port(). The comment section there says: # On non-Linux, non-Windows kernels, binding to 127.0.0/24 addresses # other than 127.0.0.1 might fail with EADDRNOTAVAIL. And this is an absolute true, on BSD-like systems (macOS and FreeBSD tested) it hangs on looping through the entire ports range over and over when $PostgresNode::use_tcp = 1 is set, since bind fails with: # Checking port 52208 # bind: 127.0.0.1 52208 # bind: 127.0.0.2 52208 bind: Can't assign requested address To reproduce just apply reproduce.diff and try to run 'make -C src/bin/pg_ctl check'. This is not a case with standard Postgres tests, since TestLib.pm chooses unix sockets automatically everywhere outside Windows. However, we got into this problem when tried to run a custom tap test that required TCP for stable running. That way, if it really could happen why not to just skip binding to 127.0.0/24 addresses other than 127.0.0.1 outside of Linux/Windows as per attached patch_PostgresNode.diff? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Companydiff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index db47a97d196..9add9bde2a4 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -1203,7 +1203,7 @@ sub get_free_port if ($found == 1) { foreach my $addr (qw(127.0.0.1), -$use_tcp ? qw(127.0.0.2 127.0.0.3 0.0.0.0) : ()) +$use_tcp && ($^O eq "linux" || $TestLib::windows_os) ? qw(127.0.0.2 127.0.0.3 0.0.0.0) : ()) { if (!can_bind($addr, $port)) { diff --git a/src/bin/pg_ctl/t/001_start_stop.pl b/src/bin/pg_ctl/t/001_start_stop.pl index b1e419f02e9..c25c0793537 100644 --- a/src/bin/pg_ctl/t/001_start_stop.pl +++ b/src/bin/pg_ctl/t/001_start_stop.pl @@ -11,6 +11,8 @@ use Test::More tests => 24; my $tempdir = TestLib::tempdir; my $tempdir_short = TestLib::tempdir_short; +$PostgresNode::use_tcp = 1; + program_help_ok('pg_ctl'); program_version_ok('pg_ctl'); program_options_handling_ok('pg_ctl');
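The kernel behaviour behind the hang is easy to reproduce outside the test framework. A small standalone C program (not part of any patch in this thread) that attempts the same bind typically reports EADDRNOTAVAIL on macOS/FreeBSD for loopback aliases other than 127.0.0.1, while Linux accepts it:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int
main(void)
{
	int			sock = socket(AF_INET, SOCK_STREAM, 0);
	struct sockaddr_in addr;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(54321);
	inet_pton(AF_INET, "127.0.0.2", &addr.sin_addr);

	/* On BSD-like kernels this fails with EADDRNOTAVAIL. */
	if (bind(sock, (struct sockaddr *) &addr, sizeof(addr)) < 0)
		printf("bind 127.0.0.2: %s\n", strerror(errno));
	else
		printf("bind 127.0.0.2: ok\n");

	close(sock);
	return 0;
}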
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-02-03 09:37, Michael Paquier wrote: On Tue, Feb 02, 2021 at 10:32:19AM +0900, Michael Paquier wrote: On Mon, Feb 01, 2021 at 06:28:57PM +0300, Alexey Kondratov wrote: > Hm, IIUC, REINDEX CONCURRENTLY doesn't use either of them. It directly uses > index_create() with a proper tablespaceOid instead of > SetRelationTableSpace(). And its checks structure is more restrictive even > without tablespace change, so it doesn't use CheckRelationTableSpaceMove(). Sure. I have not checked the patch in details, but even with that it would be much safer to me if we apply the same sanity checks everywhere. That's less potential holes to worry about. Thanks Alexey for the new patch. I have been looking at the main patch in details. /* -* Don't allow reindex on temp tables of other backends ... their local -* buffer manager is not going to cope. +* We don't support moving system relations into different tablespaces +* unless allow_system_table_mods=1. */ If you remove the check on RELATION_IS_OTHER_TEMP() in reindex_index(), you would allow the reindex of a temp relation owned by a different session if its tablespace is not changed, so this cannot be removed. +!allowSystemTableMods && IsSystemRelation(iRel)) ereport(ERROR, -(errcode(ERRCODE_FEATURE_NOT_SUPPORTED), - errmsg("cannot reindex temporary tables of other sessions"))); +(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("permission denied: \"%s\" is a system catalog", +RelationGetRelationName(iRel; Indeed, a system relation with a relfilenode should be allowed to move under allow_system_table_mods. I think that we had better move this check into CheckRelationTableSpaceMove() instead of reindex_index() to centralize the logic. ALTER TABLE does this business in RangeVarCallbackForAlterRelation(), but our code path opening the relation is different for the non-concurrent case. + if (OidIsValid(params->tablespaceOid) && + IsSystemClass(relid, classtuple)) + { + if (!allowSystemTableMods) + { + /* Skip all system relations, if not allowSystemTableMods * I don't see the need for having two warnings here to say the same thing if a relation is mapped or not mapped, so let's keep it simple. Yeah, I just wanted to separate mapped and system relations, but probably it is too complicated. I have found that the test suite was rather messy in its organization. Table creations were done first with a set of tests not really ordered, so that was really hard to follow. This has also led to a set of tests that were duplicated, while other tests have been missed, mainly some cross checks for the concurrent and non-concurrent behaviors. I have reordered the whole so as tests on catalogs, normal tables and partitions are done separately with relations created and dropped for each set. Partitions use a global check for tablespaces and relfilenodes after one concurrent reindex (didn't see the point in doubling with the non-concurrent case as the same code path to select the relations from the partition tree is taken). An ACL test has been added at the end. The case of partitioned indexes was kind of interesting and I thought about that a couple of days, and I took the decision to ignore relations that have no storage as you did, documenting that ALTER TABLE can be used to update the references of the partitioned relations. The command is still useful with this behavior, and the tests I have added track that. 
Finally, I have reworked the docs, separating the limitations related to system catalogs and partitioned relations, to be more consistent with the notes at the end of the page. Thanks for working on this. + if (tablespacename != NULL) + { + params.tablespaceOid = get_tablespace_oid(tablespacename, false); + + /* Check permissions except when moving to database's default */ + if (OidIsValid(params.tablespaceOid) && This check for OidIsValid() seems to be excessive, since you moved the whole ACL check under 'if (tablespacename != NULL)' here. + params.tablespaceOid != MyDatabaseTableSpace) + { + AclResult aclresult; +CREATE INDEX regress_tblspace_test_tbl_idx ON regress_tblspace_test_tbl (num1); +-- move to global tablespace move fails Maybe 'move to global tablespace, fail', just to match a style of the previous comments. +REINDEX (TABLESPACE pg_global) INDEX regress_tblspace_test_tbl_idx; +SELECT relid, parentrelid, level FROM pg_partition_tree('tbspace_reindex_part_index') + ORDER BY relid, level; +SELECT relid, parentrelid, level FROM pg_partition_tree('tbspace_reindex_part_index') + ORDER BY relid, level; Why do you do the same twic
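The consolidation discussed above reduces, for a single relation, to the following caller pattern (a simplified sketch: the lock level and error handling depend on the command, and the real callers in tablecmds.c and indexcmds.c differ in detail):

	Relation	rel = relation_open(reloid, AccessExclusiveLock);

	if (CheckRelationTableSpaceMove(rel, newTableSpaceId))
		SetRelationTableSpace(rel, newTableSpaceId, InvalidOid);

	relation_close(rel, NoLock);

	/* Make the pg_class change visible to later steps of the same command. */
	CommandCounterIncrement();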
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-30 05:23, Michael Paquier wrote: On Fri, Jan 29, 2021 at 08:56:47PM +0300, Alexey Kondratov wrote: On 2021-01-28 14:42, Alexey Kondratov wrote: No, you are right, we are doing REINDEX with AccessExclusiveLock as it was before. This part is a more specific one. It only applies to partitioned indexes, which do not hold any data, so we do not reindex them directly, only their leafs. However, if we are doing a TABLESPACE change, we have to record it in their pg_class entry, so all future leaf partitions were created in the proper tablespace. That way, we open partitioned index relation only for a reference, i.e. read-only, but modify pg_class entry under a proper lock (RowExclusiveLock). That's why I thought that ShareLock will be enough. IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even for relations with no storage, since AlterTableGetLockLevel() chooses it if AT_SetTableSpace is met. This is very similar to our case, so probably we should do the same? Actually it is not completely clear for me why ShareUpdateExclusiveLock is sufficient for newly added SetRelationTableSpace() as Michael wrote in the comment. Nay, it was not fine. That's something Alvaro has mentioned, leading to 2484329. This also means that the main patch of this thread should refresh the comments at the top of CheckRelationTableSpaceMove() and SetRelationTableSpace() to mention that this is used by REINDEX CONCURRENTLY with a lower lock. Hm, IIUC, REINDEX CONCURRENTLY doesn't use either of them. It directly uses index_create() with a proper tablespaceOid instead of SetRelationTableSpace(). And its checks structure is more restrictive even without tablespace change, so it doesn't use CheckRelationTableSpaceMove(). Changed patch to use AccessExclusiveLock in this part for now. This is what 'ALTER TABLE/INDEX ... SET TABLESPACE' and 'REINDEX' usually do. Anyway, all real leaf partitions are processed in the independent transactions later. + if (partkind == RELKIND_PARTITIONED_INDEX) + { + Relation iRel = index_open(partoid, AccessExclusiveLock); + + if (CheckRelationTableSpaceMove(iRel, params->tablespaceOid)) + SetRelationTableSpace(iRel, + params->tablespaceOid, + InvalidOid); + index_close(iRel, NoLock); Are you sure that this does not represent a risk of deadlocks as EAL is not taken consistently across all the partitions? A second issue here is that this breaks the assumption of REINDEX CONCURRENTLY kicked on partitioned relations that should use ShareUpdateExclusiveLock for all its steps. This would make the first transaction invasive for the user, but we don't want that. This makes me really wonder if we would not be better to restrict this operation for partitioned relation as part of REINDEX as a first step. Another thing, mentioned upthread, is that we could do this part of the switch at the last transaction, or we could silently *not* do the switch for partitioned indexes in the flow of REINDEX, letting users handle that with an extra ALTER TABLE SET TABLESPACE once REINDEX has finished on all the partitions, cascading the command only on the partitioned relation of a tree. It may be interesting to look as well at if we could lower the lock used for partitioned relations with ALTER TABLE SET TABLESPACE from AEL to SUEL, choosing AEL only if at least one partition with storage is involved in the command, CheckRelationTableSpaceMove() discarding anything that has no need to change. 
I am not sure right now, so I split previous patch into two parts: 0001: Adds TABLESPACE into REINDEX with tests, doc and all the stuff we did before with the only exception that it doesn't move partitioned indexes into the new tablespace. Basically, it implements this option "we could silently *not* do the switch for partitioned indexes in the flow of REINDEX, letting users handle that with an extra ALTER TABLE SET TABLESPACE once REINDEX has finished". It probably makes sense, since we are doing tablespace change altogether with index relation rewrite and don't touch relations without storage. Doing ALTER INDEX ... SET TABLESPACE will be almost cost-less on them, since they do not hold any data. 0002: Implements the remaining part where pg_class entry is also changed for partitioned indexes. I think that we should think more about it, maybe it is not so dangerous and proper locking strategy could be achieved. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom 6322032b472e6b1a76e0ca9326974e5774371fb9 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 1 Feb 2021 15:20:29 +0300 Subject: [PATCH v10 2/2] Change tablespace of partitioned indexes during REINDEX. There are some doubts about proper locking of partitions here. AccessExclusiveLock is surely enough, but
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-28 14:42, Alexey Kondratov wrote: On 2021-01-28 00:36, Alvaro Herrera wrote: I didn't look at the patch closely enough to understand why you're trying to do something like CLUSTER, VACUUM FULL or REINDEX without holding full AccessExclusiveLock on the relation. But do keep in mind that once you hold a lock on a relation, trying to grab a weaker lock afterwards is pretty pointless. No, you are right, we are doing REINDEX with AccessExclusiveLock as it was before. This part is a more specific one. It only applies to partitioned indexes, which do not hold any data, so we do not reindex them directly, only their leafs. However, if we are doing a TABLESPACE change, we have to record it in their pg_class entry, so all future leaf partitions were created in the proper tablespace. That way, we open partitioned index relation only for a reference, i.e. read-only, but modify pg_class entry under a proper lock (RowExclusiveLock). That's why I thought that ShareLock will be enough. IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even for relations with no storage, since AlterTableGetLockLevel() chooses it if AT_SetTableSpace is met. This is very similar to our case, so probably we should do the same? Actually it is not completely clear for me why ShareUpdateExclusiveLock is sufficient for newly added SetRelationTableSpace() as Michael wrote in the comment. Changed patch to use AccessExclusiveLock in this part for now. This is what 'ALTER TABLE/INDEX ... SET TABLESPACE' and 'REINDEX' usually do. Anyway, all real leaf partitions are processed in the independent transactions later. Also changed some doc/comment parts Justin pointed me to. + then all "mapped" and system relations will be skipped and a single + WARNING will be generated. Indexes on TOAST tables + are reindexed, but not moved the new tablespace. moved *to* the new tablespace. Fixed. I don't know if that needs to be said at all. We talked about it a lot to arrive at the current behavior, but I think that's only due to the difficulty of correcting the initial mistake. I do not think that it will be a big deal to move indexes on TOAST tables as well. I just thought that since 'ALTER TABLE/INDEX ... SET TABLESPACE' only moves them together with host table, we also should not do that. Yet, I am ready to change this logic if requested. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom 6e9db8d362e794edf421733bc7cade38c917bff4 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 27 Jan 2021 00:46:17 +0300 Subject: [PATCH v9] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created. --- doc/src/sgml/ref/reindex.sgml | 31 +++- src/backend/catalog/index.c | 47 +- src/backend/commands/indexcmds.c | 112 - src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 9 +- src/test/regress/input/tablespace.source | 106 + src/test/regress/output/tablespace.source | 181 ++ 7 files changed, 478 insertions(+), 12 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..2b39699d42 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,21 @@ REINDEX [ ( option [, ...] 
) ] { IN + +TABLESPACE + + + Specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" or (unless allow_system_table_mods) + system relations. If SCHEMA, + DATABASE or SYSTEM are specified, + then all "mapped" and system relations will be skipped and a single + WARNING will be generated. Indexes on TOAST tables + are reindexed, but not moved to the new tablespace. + + + + VERBOSE @@ -210,6 +226,14 @@ REINDEX [ ( option [, ...] ) ] { IN + +new_tablespace + + + The tablespace where indexes will be rebuilt. + + + @@ -292,7 +316,12 @@ REINDEX [ ( option [, ...] ) ] { IN with REINDEX INDEX or REINDEX TABLE, respectively. Each partition of the specified partitioned relation is reindexed in a separate transaction. Those commands cannot be used inside - a transaction block when working on a partitioned table or index. + a transaction block when working on a partitioned table or index. If + a REINDEX command fails when run on a partitioned + relation, and TABLESPACE was specified, then it may have + moved indexes on some partitions to the new tablespace. Re-running th
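For reference, the ALTER TABLE behaviour mentioned above comes down to a single entry in AlterTableGetLockLevel(); roughly (paraphrased from memory rather than quoted verbatim from tablecmds.c):

	case AT_SetTableSpace:	/* must rewrite heap */
		cmd_lockmode = AccessExclusiveLock;
		break;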
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-28 00:36, Alvaro Herrera wrote: On 2021-Jan-28, Alexey Kondratov wrote: I have read more about lock levels and ShareLock should prevent any kind of physical modification of indexes. We already hold ShareLock doing find_all_inheritors(), which is higher than ShareUpdateExclusiveLock, so using ShareLock seems to be safe here, but I will look on it closer. You can look at lock.c where LockConflicts[] is; that would tell you that ShareLock indeed conflicts with ShareUpdateExclusiveLock ... but it does not conflict with itself! So it would be possible to have more than one process doing this thing at the same time, which surely makes no sense. Thanks for the explanation and pointing me to the LockConflicts[]. This is a good reference. I didn't look at the patch closely enough to understand why you're trying to do something like CLUSTER, VACUUM FULL or REINDEX without holding full AccessExclusiveLock on the relation. But do keep in mind that once you hold a lock on a relation, trying to grab a weaker lock afterwards is pretty pointless. No, you are right, we are doing REINDEX with AccessExclusiveLock as it was before. This part is a more specific one. It only applies to partitioned indexes, which do not hold any data, so we do not reindex them directly, only their leafs. However, if we are doing a TABLESPACE change, we have to record it in their pg_class entry, so all future leaf partitions were created in the proper tablespace. That way, we open partitioned index relation only for a reference, i.e. read-only, but modify pg_class entry under a proper lock (RowExclusiveLock). That's why I thought that ShareLock will be enough. IIUC, 'ALTER TABLE ... SET TABLESPACE' uses AccessExclusiveLock even for relations with no storage, since AlterTableGetLockLevel() chooses it if AT_SetTableSpace is met. This is very similar to our case, so probably we should do the same? Actually it is not completely clear for me why ShareUpdateExclusiveLock is sufficient for newly added SetRelationTableSpace() as Michael wrote in the comment. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
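The conflict table entry that settles this lives in src/backend/storage/lmgr/lock.c; approximately (paraphrased, not a verbatim quote), the ShareLock row of LockConflicts[] is:

	/*
	 * ShareLock conflicts with writers and with both exclusive lock levels,
	 * but not with another ShareLock.
	 */
	LOCKBIT_ON(RowExclusiveLock) | LOCKBIT_ON(ShareUpdateExclusiveLock) |
	LOCKBIT_ON(ShareRowExclusiveLock) | LOCKBIT_ON(ExclusiveLock) |
	LOCKBIT_ON(AccessExclusiveLock),

which is why two sessions could both reach this code path at the same time if only ShareLock were taken.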
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-27 06:14, Michael Paquier wrote: On Wed, Jan 27, 2021 at 01:00:50AM +0300, Alexey Kondratov wrote: In the new 0002 I moved ACL check to the upper level, i.e. ExecReindex(), and removed expensive text generation in test. Not touched yet some of your previously raised concerns. Also, you made SetRelationTableSpace() to accept Relation instead of Oid, so now we have to open/close indexes in the ReindexPartitions(), I am not sure that I use proper locking there, but it works. Passing down Relation to the new routines makes the most sense to me because we force the callers to think about the level of locking that's required when doing any tablespace moves. + Relation iRel = index_open(partoid, ShareLock); + + if (CheckRelationTableSpaceMove(iRel, params->tablespaceOid)) + SetRelationTableSpace(iRel, + params->tablespaceOid, + InvalidOid); Speaking of which, this breaks the locking assumptions of SetRelationTableSpace(). I feel that we should think harder about this part for partitioned indexes and tables because this looks rather unsafe in terms of locking assumptions with partition trees. If we cannot come up with a safe solution, I would be fine with disallowing TABLESPACE in this case, as a first step. Not all problems have to be solved at once, and even without this part the feature is still useful. I have read more about lock levels and ShareLock should prevent any kind of physical modification of indexes. We already hold ShareLock doing find_all_inheritors(), which is higher than ShareUpdateExclusiveLock, so using ShareLock seems to be safe here, but I will look on it closer. + /* It's not a shared catalog, so refuse to move it to shared tablespace */ + if (params->tablespaceOid == GLOBALTABLESPACE_OID) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), +errmsg("cannot move non-shared relation to tablespace \"%s\"", +get_tablespace_name(params->tablespaceOid; Why is that needed if CheckRelationTableSpaceMove() is used? This is from ReindexRelationConcurrently() where we do not use CheckRelationTableSpaceMove(). For me it makes sense to add only this GLOBALTABLESPACE_OID check there, since before we already check for system catalogs and after for temp relations, so adding CheckRelationTableSpaceMove() will be a double-check. - indexRelation->rd_rel->reltablespace, + OidIsValid(tablespaceOid) ? + tablespaceOid : indexRelation->rd_rel->reltablespace, Let's remove this logic from index_concurrently_create_copy() and let the caller directly decide the tablespace to use, without a dependency on InvalidOid in the inner routine. A share update exclusive lock is already hold on the old index when creating the concurrent copy, so there won't be concurrent schema changes. Changed. Also added tests for ACL checks, relfilenode changes. Added ACL recheck for multi-transactional case. Added info about TOAST index reindexing. Changed some comments. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom f176a6e5a81ab133fee849f72e4edb8b287d6062 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 27 Jan 2021 00:46:17 +0300 Subject: [PATCH v8] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created. 
--- doc/src/sgml/ref/reindex.sgml | 31 +++- src/backend/catalog/index.c | 50 +- src/backend/commands/indexcmds.c | 112 - src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 9 +- src/test/regress/input/tablespace.source | 106 + src/test/regress/output/tablespace.source | 181 ++ 7 files changed, 481 insertions(+), 12 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..e610a0f52c 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,21 @@ REINDEX [ ( option [, ...] ) ] { IN + +TABLESPACE + + + Specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" and system (unless allow_system_table_mods + is set to TRUE) relations. If SCHEMA, + DATABASE or SYSTEM are specified, + then all "mapped" and system relations will be skipped and a single + WARNING will be generated. Indexes on TOA
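The "let the caller decide" suggestion for index_concurrently_create_copy() can be sketched as resolving the tablespace at the call site instead of passing InvalidOid down (the names here follow the quoted patch and are assumptions, not the final code):

	/* Resolve the target tablespace in the caller, per the review comment. */
	Oid			tablespaceid = OidIsValid(params->tablespaceOid) ?
		params->tablespaceOid : indexRel->rd_rel->reltablespace;

	newIndexId = index_concurrently_create_copy(heapRel,
												oldIndexId,
												tablespaceid,
												concurrentName);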
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-26 09:58, Michael Paquier wrote: On Mon, Jan 25, 2021 at 11:11:38PM +0300, Alexey Kondratov wrote: I updated comment with CCI info, did pgindent run and renamed new function to SetRelationTableSpace(). New patch is attached. [...] Yeah, all these checks we complicated from the beginning. I will try to find a better place tomorrow or put more info into the comments at least. I was reviewing that, and I think that we can do a better consolidation on several points that will also help the features discussed on this thread for VACUUM, CLUSTER and REINDEX. If you look closely, ATExecSetTableSpace() uses the same logic as the code modified here to check if a relation can be moved to a new tablespace, with extra checks for mapped relations, GLOBALTABLESPACE_OID or if attempting to manipulate a temp relation from another session. There are two differences though: - Custom actions are taken between the phase where we check if a relation can be moved to a new tablespace, and the update of pg_class. - ATExecSetTableSpace() needs to be able to set a given relation relfilenode on top of reltablespace, the newly-created one. So I think that the heart of the problem is made of two things here: - We should have one common routine for the existing code paths and the new code paths able to check if a tablespace move can be done or not. The case of a cluster, reindex or vacuum on a list of relations extracted from pg_class would still require a different handling as incorrect relations have to be skipped, but the case of individual relations can reuse the refactoring pieces done here (see CheckRelationTableSpaceMove() in the attached). - We need to have a second routine able to update reltablespace and optionally relfilenode for a given relation's pg_class entry, once the caller has made sure that CheckRelationTableSpaceMove() validates a tablespace move. I think that I got your idea. One comment: +bool +CheckRelationTableSpaceMove(Relation rel, Oid newTableSpaceId) +{ + Oid oldTableSpaceId; + Oid reloid = RelationGetRelid(rel); + + /* +* No work if no change in tablespace. Note that MyDatabaseTableSpace +* is stored as 0. +*/ + oldTableSpaceId = rel->rd_rel->reltablespace; + if (newTableSpaceId == oldTableSpaceId || + (newTableSpaceId == MyDatabaseTableSpace && oldTableSpaceId == 0)) + { + InvokeObjectPostAlterHook(RelationRelationId, reloid, 0); + return false; + } CheckRelationTableSpaceMove() does not feel like a right place for invoking post alter hooks. It is intended only to check for tablespace change possibility. Anyway, ATExecSetTableSpace() and ATExecSetTableSpaceNoStorage() already do that in the no-op case. Please note that was a bug in your previous patch 0002: shared dependencies need to be registered if reltablespace is updated of course, but also iff the relation has no physical storage. So changeDependencyOnTablespace() requires a check based on RELKIND_HAS_STORAGE(), or REINDEX would have registered shared dependencies even for relations with storage, something we don't want per the recent work done by Alvaro in ebfe2db. Yes, thanks. I have removed this InvokeObjectPostAlterHook() from your 0001 and made 0002 to work on top of it. I think that now it should look closer to what you described above. In the new 0002 I moved ACL check to the upper level, i.e. ExecReindex(), and removed expensive text generation in test. Not touched yet some of your previously raised concerns. 
Also, you made SetRelationTableSpace() to accept Relation instead of Oid, so now we have to open/close indexes in the ReindexPartitions(), I am not sure that I use proper locking there, but it works. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom 96a37399a9cf9ae08d62e28496e73b36087e5a19 Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Tue, 26 Jan 2021 15:53:06 +0900 Subject: [PATCH v7 1/2] Refactor code to detect and process tablespace moves --- src/backend/commands/tablecmds.c | 218 +-- src/include/commands/tablecmds.h | 4 + 2 files changed, 127 insertions(+), 95 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 8687e9a97c..c08eedf995 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -3037,6 +3037,116 @@ SetRelationHasSubclass(Oid relationId, bool relhassubclass) table_close(relationRelation, RowExclusiveLock); } +/* + * CheckRelationTableSpaceMove + * Check if relation can be moved to new tablespace. + * + * NOTE: caller must be holding an appropriate lock on the relation. + * ShareUpdateExclusiveLock is sufficient to prevent concurrent schema + * changes. + * + * Returns true if the relation can be moved to the new tablesp
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-25 11:07, Michael Paquier wrote: On Fri, Jan 22, 2021 at 05:07:02PM +0300, Alexey Kondratov wrote: I have updated patches accordingly and also simplified tablespaceOid checks and assignment in the newly added SetRelTableSpace(). Result is attached as two separate patches for an ease of review, but no objections to merge them and apply at once if everything is fine. extern void SetRelationHasSubclass(Oid relationId, bool relhassubclass); +extern bool SetRelTableSpace(Oid reloid, Oid tablespaceOid); Seeing SetRelationHasSubclass(), wouldn't it be more consistent to use SetRelationTableSpace() as routine name? I think that we should document that the caller of this routine had better do a CCI once done to make the tablespace chage visible. Except for those two nits, the patch needs an indentation run and some style tweaks but its logic looks fine. So I'll apply that first piece. I updated comment with CCI info, did pgindent run and renamed new function to SetRelationTableSpace(). New patch is attached. +INSERT INTO regress_tblspace_test_tbl (num1, num2, t) + SELECT round(random()*100), random(), repeat('text', 100) + FROM generate_series(1, 10) s(i); Repeating 1M times a text value is too costly for such a test. And as even for empty tables there is one page created for toast indexes, there is no need for that? Yes, TOAST relation is created anyway. I just wanted to put some data into a TOAST index, so REINDEX did some meaningful work there, not only a new relfilenode creation. However you are right and this query increases tablespace tests execution for more for more than 2 times on my machine. I think that it is not really required. This patch is introducing three new checks for system catalogs: - don't use tablespace for mapped relations. - don't use tablespace for system relations, except if allowSystemTableMods. - don't move non-shared relation to global tablespace. For the non-concurrent case, all three checks are in reindex_index(). For the concurrent case, the two first checks are in ReindexMultipleTables() and the third one is in ReindexRelationConcurrently(). That's rather tricky to follow because CONCURRENTLY is not allowed on system relations. I am wondering if it would be worth an extra comment effort, or if there is a way to consolidate that better. Yeah, all these checks we complicated from the beginning. I will try to find a better place tomorrow or put more info into the comments at least. I am also going to check/fix the remaining points regarding 002 tomorrow. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom 39880842d7af31dcbfcffe7219250b31102955d5 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 20 Jan 2021 20:21:12 +0300 Subject: [PATCH v5 1/2] Extract common part from ATExecSetTableSpaceNoStorage for a future usage --- src/backend/commands/tablecmds.c | 95 +++- src/include/commands/tablecmds.h | 2 + 2 files changed, 58 insertions(+), 39 deletions(-) diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 8687e9a97c..ec9c440e4e 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -13291,6 +13291,59 @@ ATExecSetTableSpace(Oid tableOid, Oid newTableSpace, LOCKMODE lockmode) list_free(reltoastidxids); } +/* + * SetRelationTableSpace - modify relation tablespace in the pg_class entry. + * + * 'reloid' is an Oid of relation to be modified. + * 'tablespaceOid' is an Oid of new tablespace. 
+ * + * Catalog modification is done only if tablespaceOid is different from + * the currently set. Returned bool value is indicating whether any changes + * were made or not. Note that caller is responsible for doing + * CommandCounterIncrement() to make tablespace changes visible. + */ +bool +SetRelationTableSpace(Oid reloid, Oid tablespaceOid) +{ + Relation pg_class; + HeapTuple tuple; + Form_pg_class rd_rel; + bool changed = false; + + /* Get a modifiable copy of the relation's pg_class row. */ + pg_class = table_open(RelationRelationId, RowExclusiveLock); + + tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(reloid)); + if (!HeapTupleIsValid(tuple)) + elog(ERROR, "cache lookup failed for relation %u", reloid); + rd_rel = (Form_pg_class) GETSTRUCT(tuple); + + /* MyDatabaseTableSpace is stored as InvalidOid. */ + if (tablespaceOid == MyDatabaseTableSpace) + tablespaceOid = InvalidOid; + + /* No work if no change in tablespace. */ + if (tablespaceOid != rd_rel->reltablespace) + { + /* Update the pg_class row. */ + rd_rel->reltablespace = tablespaceOid; + CatalogTupleUpdate(pg_class, >t_self, tuple); + + /* Record dependency on tablespace. */ + changeDependencyOnTablespace(RelationRelationId, + reloid, rd_rel->reltablespace); + + changed = true; + } + + InvokeObjectPostAlterHook(RelationRelationId, reloid, 0); + + heap_freetuple
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-22 00:26, Justin Pryzby wrote: On Thu, Jan 21, 2021 at 11:48:08PM +0300, Alexey Kondratov wrote: Attached is a new patch set of first two patches, that should resolve all the issues raised before (ACL, docs, tests) excepting TOAST. Double thanks for suggestion to add more tests with nested partitioning. I have found and squashed a huge bug related to the returning back to the default tablespace using newly added tests. Regarding TOAST. Now we skip moving toast indexes or throw error if someone wants to move TOAST index directly. I had a look on ALTER TABLE SET TABLESPACE and it has a bit complicated logic: 1) You cannot move TOAST table directly. 2) But if you move basic relation that TOAST table belongs to, then they are moved altogether. 3) Same logic as 2) happens if one does ALTER TABLE ALL IN TABLESPACE ... That way, ALTER TABLE allows moving TOAST tables (with indexes) implicitly, but does not allow doing that explicitly. In the same time I found docs to be vague about such behavior it only says: All tables in the current database in a tablespace can be moved by using the ALL IN TABLESPACE ... Note that system catalogs are not moved by this command Changing any part of a system catalog table is not permitted. So actually ALTER TABLE treats TOAST relations as system sometimes, but sometimes not. From the end user perspective it makes sense to move TOAST with main table when doing ALTER TABLE SET TABLESPACE. But should we touch indexes on TOAST table with REINDEX? We cannot move TOAST relation itself, since we are doing only a reindex, so we end up in the state when TOAST table and its index are placed in the different tablespaces. This state is not reachable with ALTER TABLE/INDEX, so it seem we should not allow it with REINDEX as well, should we? + * Even if a table's indexes were moved to a new tablespace, the index +* on its toast table is not normally moved. */ ReindexParams newparams = *params; newparams.options &= ~(REINDEXOPT_MISSING_OK); + if (!allowSystemTableMods) + newparams.tablespaceOid = InvalidOid; I think you're right. So actually TOAST should never move, even if allowSystemTableMods, right ? I think so. I would prefer to do not move TOAST indexes implicitly at all during reindex. @@ -292,7 +315,11 @@ REINDEX [ ( class="parameter">option [, ...] ) ] { IN with REINDEX INDEX or REINDEX TABLE, respectively. Each partition of the specified partitioned relation is reindexed in a separate transaction. Those commands cannot be used inside - a transaction block when working on a partitioned table or index. + a transaction block when working on a partitioned table or index. If + REINDEX with TABLESPACE executed + on partitioned relation fails it may have moved some partitions to the new + tablespace. Repeated command will still reindex all partitions even if they + are already in the new tablespace. Minor corrections here: If a REINDEX command fails when run on a partitioned relation, and TABLESPACE was specified, then it may have moved indexes on some partitions to the new tablespace. Re-running the command will reindex all partitions and move previously-unprocessed indexes to the new tablespace. Sounds good to me. I have updated patches accordingly and also simplified tablespaceOid checks and assignment in the newly added SetRelTableSpace(). Result is attached as two separate patches for an ease of review, but no objections to merge them and apply at once if everything is fine. 
Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom 87e47e9b5b3d6b49230045e5db8f844b14b34ba0 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 23 Mar 2020 21:10:29 +0300 Subject: [PATCH v4 2/2] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created. --- doc/src/sgml/ref/reindex.sgml | 30 +++- src/backend/catalog/index.c | 81 ++- src/backend/commands/indexcmds.c | 81 ++- src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 2 + src/test/regress/input/tablespace.source | 85 src/test/regress/output/tablespace.source | 159 ++ 7 files changed, 436 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..a1c7736aec 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,20 @@ R
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-21 17:06, Alexey Kondratov wrote: On 2021-01-21 04:41, Michael Paquier wrote: There are no tests for partitioned tables, aka we'd want to make sure that the new partitioned index is on the correct tablespace, as well as all its leaves. It may be better to have at least two levels of partitioned tables, as well as a partitioned table with no leaves in the cases dealt with. Yes, sure, it makes sense. +* +* Even if a table's indexes were moved to a new tablespace, the index +* on its toast table is not normally moved. */ Still, REINDEX (TABLESPACE) TABLE should move all of them to be consistent with ALTER TABLE SET TABLESPACE, but that's not the case with this code, no? This requires proper test coverage, but there is nothing of the kind in this patch. You are right, we do not move TOAST indexes now, since IsSystemRelation() is true for TOAST indexes, so I thought that we should not allow moving them without allow_system_table_mods=true. Now I wonder why ALTER TABLE does that. I am going to attach the new version of patch set today or tomorrow. Attached is a new patch set of first two patches, that should resolve all the issues raised before (ACL, docs, tests) excepting TOAST. Double thanks for suggestion to add more tests with nested partitioning. I have found and squashed a huge bug related to the returning back to the default tablespace using newly added tests. Regarding TOAST. Now we skip moving toast indexes or throw error if someone wants to move TOAST index directly. I had a look on ALTER TABLE SET TABLESPACE and it has a bit complicated logic: 1) You cannot move TOAST table directly. 2) But if you move basic relation that TOAST table belongs to, then they are moved altogether. 3) Same logic as 2) happens if one does ALTER TABLE ALL IN TABLESPACE ... That way, ALTER TABLE allows moving TOAST tables (with indexes) implicitly, but does not allow doing that explicitly. In the same time I found docs to be vague about such behavior it only says: All tables in the current database in a tablespace can be moved by using the ALL IN TABLESPACE ... Note that system catalogs are not moved by this command Changing any part of a system catalog table is not permitted. So actually ALTER TABLE treats TOAST relations as system sometimes, but sometimes not. From the end user perspective it makes sense to move TOAST with main table when doing ALTER TABLE SET TABLESPACE. But should we touch indexes on TOAST table with REINDEX? We cannot move TOAST relation itself, since we are doing only a reindex, so we end up in the state when TOAST table and its index are placed in the different tablespaces. This state is not reachable with ALTER TABLE/INDEX, so it seem we should not allow it with REINDEX as well, should we? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom bcd690da6bc3db16a96305b45546d3c9e400f769 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 23 Mar 2020 21:10:29 +0300 Subject: [PATCH v3 2/2] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created. 
--- doc/src/sgml/ref/reindex.sgml | 29 +++- src/backend/catalog/index.c | 82 +++- src/backend/commands/indexcmds.c | 81 +++- src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 2 + src/test/regress/input/tablespace.source | 79 +++ src/test/regress/output/tablespace.source | 154 ++ 7 files changed, 425 insertions(+), 6 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..90fdad0b4c 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,20 @@ REINDEX [ ( option [, ...] ) ] { IN + +TABLESPACE + + + Specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" and system (unless allow_system_table_mods + is set to TRUE) relations. If SCHEMA, + DATABASE or SYSTEM are specified, then + all "mapped" and system relations will be skipped and a single + WARNING will be generated. + + + + VERBOSE @@ -210,6 +225,14 @@ REINDEX [ ( option [, ...] ) ] { IN + +new_tablespace + + + The tablespace where indexes will be rebuilt. + + + @@ -292,7 +315,11 @@ REINDEX [ ( option [, ...] ) ] { IN with REINDEX INDEX or REINDEX TABLE, respectively. Each partition of the specified partitioned relation is reindexed in a separat
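Whatever the exact shape of the "returning back to the default tablespace" bug was, the relevant catalog convention is that pg_class.reltablespace stores 0 for the database's default tablespace, so an explicit request for the default must be normalized before comparing and storing. A minimal sketch of that normalization (the same idea appears in the SetRelTableSpace/SetRelationTableSpace code elsewhere in this thread):

	/* MyDatabaseTableSpace is stored as InvalidOid in pg_class. */
	if (tablespaceOid == MyDatabaseTableSpace)
		tablespaceOid = InvalidOid;

	if (tablespaceOid != rd_rel->reltablespace)
	{
		rd_rel->reltablespace = tablespaceOid;
		CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);
	}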
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-21 04:41, Michael Paquier wrote: On Wed, Jan 20, 2021 at 03:34:39PM -0300, Alvaro Herrera wrote: On 2021-Jan-20, Alexey Kondratov wrote: Ugh, forgot to attach the patches. Here they are. Yeah, looks reasonable. + + if (changed) + /* Record dependency on tablespace */ + changeDependencyOnTablespace(RelationRelationId, +reloid, rd_rel->reltablespace); Why have a separate "if (changed)" block here instead of merging with the above? Yep. Sure, this is a refactoring artifact. + if (SetRelTablespace(reloid, newTableSpace)) + /* Make sure the reltablespace change is visible */ + CommandCounterIncrement(); At quick glance, I am wondering why you just don't do a CCI within SetRelTablespace(). I did it that way for a better readability at first, since it looks more natural when you do some change (SetRelTablespace) and then make them visible with CCI. Second argument was that in the case of reindex_index() we have to also call RelationAssumeNewRelfilenode() and RelationDropStorage() before doing CCI and making the new tablespace visible. And this part is critical, I guess. + This specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" relations. If SCHEMA, + DATABASE or SYSTEM is specified, then + all unsuitable relations will be skipped and a single WARNING + will be generated. What is an unsuitable relation? How can the end user know that? This was referring to mapped relations mentioned in the previous sentence. I have tried to rewrite this part and make it more specific in my current version. Also added Justin's changes to the docs and comment. This is missing ACL checks when moving the index into a new location, so this requires some pg_tablespace_aclcheck() calls, and the other patches share the same issue. I added proper pg_tablespace_aclcheck()'s into the reindex_index() and ReindexPartitions(). + else if (partkind == RELKIND_PARTITIONED_TABLE) + { + Relation rel = table_open(partoid, ShareLock); + List*indexIds = RelationGetIndexList(rel); + ListCell *lc; + + table_close(rel, NoLock); + foreach (lc, indexIds) + { + Oid indexid = lfirst_oid(lc); + (void) set_rel_tablespace(indexid, params->tablespaceOid); + } + } This is really a good question. ReindexPartitions() would trigger one transaction per leaf to work on. Changing the tablespace of the partitioned table(s) before doing any work has the advantage to tell any new partition to use the new tablespace. Now, I see a struggling point here: what should we do if the processing fails in the middle of the move, leaving a portion of the leaves in the previous tablespace? On a follow-up reindex with the same command, should the command force a reindex even on the partitions that have been moved? Or could there be a point in skipping the partitions that are already on the new tablespace and only process the ones on the previous tablespace? It seems to me that the first scenario makes the most sense as currently a REINDEX works on all the relations defined, though there could be use cases for the second case. This should be documented, I think. I agree that follow-up REINDEX should also reindex moved partitions, since REINDEX (TABLESPACE ...) is still reindex at first. I will try to put something about this part into the docs. Also I think that we cannot be sure that nothing happened with already reindexed partitions between two consequent REINDEX calls. 
There are no tests for partitioned tables, aka we'd want to make sure that the new partitioned index is on the correct tablespace, as well as all its leaves. It may be better to have at least two levels of partitioned tables, as well as a partitioned table with no leaves in the cases dealt with. Yes, sure, it makes sense. +* +* Even if a table's indexes were moved to a new tablespace, the index +* on its toast table is not normally moved. */ Still, REINDEX (TABLESPACE) TABLE should move all of them to be consistent with ALTER TABLE SET TABLESPACE, but that's not the case with this code, no? This requires proper test coverage, but there is nothing of the kind in this patch. You are right, we do not move TOAST indexes now, since IsSystemRelation() is true for TOAST indexes, so I thought that we should not allow moving them without allow_system_table_mods=true. Now I wonder why ALTER TABLE does that. I am going to attach the new version of patch set today or tomorrow. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
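A minimal sketch of the tablespace ACL check being added here (the exact placement differs between reindex_index() and ReindexPartitions(), and the parameter naming is an assumption):

	if (OidIsValid(params->tablespaceOid))
	{
		AclResult	aclresult;

		/* Moving an index requires CREATE permission on the new tablespace. */
		aclresult = pg_tablespace_aclcheck(params->tablespaceOid, GetUserId(),
										   ACL_CREATE);
		if (aclresult != ACLCHECK_OK)
			aclcheck_error(aclresult, OBJECT_TABLESPACE,
						   get_tablespace_name(params->tablespaceOid));
	}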
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-20 21:08, Alexey Kondratov wrote: On 2021-01-20 18:54, Alvaro Herrera wrote: On 2021-Jan-20, Alvaro Herrera wrote: On 2021-Jan-20, Michael Paquier wrote: > +/* > + * This is mostly duplicating ATExecSetTableSpaceNoStorage, > + * which should maybe be factored out to a library function. > + */ > Wouldn't it be better to do first the refactoring of 0002 and then > 0001 so as REINDEX can use the new routine, instead of putting that > into a comment? I think merging 0001 and 0002 into a single commit is a reasonable approach. ... except it doesn't make a lot of sense to have set_rel_tablespace in either indexcmds.c or index.c. I think tablecmds.c is a better place for it. (I would have thought catalog/storage.c, but that one's not the right abstraction level it seems.) I did a refactoring of ATExecSetTableSpaceNoStorage() in the 0001. New function SetRelTablesapce() is placed into the tablecmds.c. Following 0002 gets use of it. Is it close to what you and Michael suggested? Ugh, forgot to attach the patches. Here they are. -- AlexeyFrom 2c3876f99bc8ebdd07c532619992e7ec3093e50a Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 23 Mar 2020 21:10:29 +0300 Subject: [PATCH v2 2/2] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created. --- doc/src/sgml/ref/reindex.sgml | 22 + src/backend/catalog/index.c | 72 ++- src/backend/commands/indexcmds.c | 68 ++- src/bin/psql/tab-complete.c | 4 +- src/include/catalog/index.h | 2 + src/test/regress/input/tablespace.source | 53 +++ src/test/regress/output/tablespace.source | 102 ++ 7 files changed, 318 insertions(+), 5 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 627b36300c..4f84060c4d 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -27,6 +27,7 @@ REINDEX [ ( option [, ...] ) ] { IN CONCURRENTLY [ boolean ] VERBOSE [ boolean ] +TABLESPACE new_tablespace @@ -187,6 +188,19 @@ REINDEX [ ( option [, ...] ) ] { IN + +TABLESPACE + + + This specifies that indexes will be rebuilt on a new tablespace. + Cannot be used with "mapped" relations. If SCHEMA, + DATABASE or SYSTEM is specified, then + all unsuitable relations will be skipped and a single WARNING + will be generated. + + + + VERBOSE @@ -210,6 +224,14 @@ REINDEX [ ( option [, ...] ) ] { IN + +new_tablespace + + + The tablespace where indexes will be rebuilt. + + + diff --git a/src/backend/catalog/index.c b/src/backend/catalog/index.c index b8cd35e995..ed98b17483 100644 --- a/src/backend/catalog/index.c +++ b/src/backend/catalog/index.c @@ -57,6 +57,7 @@ #include "commands/event_trigger.h" #include "commands/progress.h" #include "commands/tablecmds.h" +#include "commands/tablespace.h" #include "commands/trigger.h" #include "executor/executor.h" #include "miscadmin.h" @@ -1394,9 +1395,13 @@ index_update_collation_versions(Oid relid, Oid coll) * Create concurrently an index based on the definition of the one provided by * caller. The index is inserted into catalogs and needs to be built later * on. This is called during concurrent reindex processing. + * + * "tablespaceOid" is the new tablespace to use for this index. If + * InvalidOid, use the tablespace in-use instead. 
*/ Oid -index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char *newName) +index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, + Oid tablespaceOid, const char *newName) { Relation indexRelation; IndexInfo *oldInfo, @@ -1526,7 +1531,8 @@ index_concurrently_create_copy(Relation heapRelation, Oid oldIndexId, const char newInfo, indexColNames, indexRelation->rd_rel->relam, - indexRelation->rd_rel->reltablespace, + OidIsValid(tablespaceOid) ? +tablespaceOid : indexRelation->rd_rel->reltablespace, indexRelation->rd_indcollation, indclass->values, indcoloptions->values, @@ -3591,6 +3597,8 @@ IndexGetRelation(Oid indexId, bool missing_ok) /* * reindex_index - This routine is used to recreate a single index + * + * See comments of reindex_relation() for details about "tablespaceOid". */ void reindex_index(Oid indexId, bool skip_constraint_checks, char persistence, @@ -3603,6 +3611,7 @@ reindex_index(Oid indexId, bool skip_constraint_checks, char persistence, volatile bool skipped_constraint = false; PGRUsage ru0; bo
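A quick usage sketch of the option documented above (tablespace and relation names are hypothetical):

REINDEX (TABLESPACE regress_tblspace) INDEX my_table_idx;
REINDEX (TABLESPACE regress_tblspace) TABLE my_table;
-- with SCHEMA, DATABASE or SYSTEM, relations that cannot be moved
-- (e.g. mapped ones) are skipped with a single WARNING, per the docs above
REINDEX (TABLESPACE regress_tblspace) SCHEMA public;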
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-20 18:54, Alvaro Herrera wrote:

On 2021-Jan-20, Alvaro Herrera wrote:

On 2021-Jan-20, Michael Paquier wrote:

> +/*
> + * This is mostly duplicating ATExecSetTableSpaceNoStorage,
> + * which should maybe be factored out to a library function.
> + */
> Wouldn't it be better to do first the refactoring of 0002 and then
> 0001 so as REINDEX can use the new routine, instead of putting that
> into a comment?

I think merging 0001 and 0002 into a single commit is a reasonable approach.

... except it doesn't make a lot of sense to have set_rel_tablespace in either indexcmds.c or index.c. I think tablecmds.c is a better place for it. (I would have thought catalog/storage.c, but that one's not the right abstraction level, it seems.)

I did a refactoring of ATExecSetTableSpaceNoStorage() in 0001. The new function SetRelTablespace() is placed into tablecmds.c. The following 0002 makes use of it. Is it close to what you and Michael suggested?

But surely ATExecSetTableSpaceNoStorage should be using this new routine. (I first thought 0002 was doing that, since that commit is calling itself a "refactoring", but now that I look closer, it's not.)

Yeah, this 'refactoring' was initially referring to the refactoring of what Justin added to one of the previous versions of 0001. It was meant to be merged into 0001 once agreed upon, but we got distracted by other stuff.

I have not yet addressed Michael's concerns regarding reindex of partitions. I am going to look closer at it tomorrow.

Regards

--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2021-01-13 14:34, Michael Paquier wrote: On Wed, Jan 13, 2021 at 05:22:49PM +0900, Michael Paquier wrote: Yeah, that makes sense. I'll send an updated patch based on that. And here you go as per the attached. I don't think that there was anything remaining on my radar. This version still needs to be indented properly though. Thoughts? Thanks. + bits32 options;/* bitmask of CLUSTEROPT_* */ This should say '/* bitmask of CLUOPT_* */', I guess, since there are only CLUOPT's defined. Otherwise, everything looks as per discussed upthread. By the way, something went wrong with the last email subject, so I have changed it back to the original in this response. I also attached your patch (with only this CLUOPT_* correction) to keep it in the thread for sure. Although, postgresql.org's web archive is clever enough to link your email to the same thread even with different subject. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Companydiff --git a/src/include/catalog/index.h b/src/include/catalog/index.h index 9904a76387..43cfdeaa6b 100644 --- a/src/include/catalog/index.h +++ b/src/include/catalog/index.h @@ -30,13 +30,16 @@ typedef enum } IndexStateFlagsAction; /* options for REINDEX */ -typedef enum ReindexOption +typedef struct ReindexParams { - REINDEXOPT_VERBOSE = 1 << 0, /* print progress info */ - REINDEXOPT_REPORT_PROGRESS = 1 << 1, /* report pgstat progress */ - REINDEXOPT_MISSING_OK = 1 << 2, /* skip missing relations */ - REINDEXOPT_CONCURRENTLY = 1 << 3 /* concurrent mode */ -} ReindexOption; + bits32 options; /* bitmask of REINDEXOPT_* */ +} ReindexParams; + +/* flag bits for ReindexParams->flags */ +#define REINDEXOPT_VERBOSE 0x01 /* print progress info */ +#define REINDEXOPT_REPORT_PROGRESS 0x02 /* report pgstat progress */ +#define REINDEXOPT_MISSING_OK 0x04 /* skip missing relations */ +#define REINDEXOPT_CONCURRENTLY 0x08 /* concurrent mode */ /* state info for validate_index bulkdelete callback */ typedef struct ValidateIndexState @@ -146,7 +149,7 @@ extern void index_set_state_flags(Oid indexId, IndexStateFlagsAction action); extern Oid IndexGetRelation(Oid indexId, bool missing_ok); extern void reindex_index(Oid indexId, bool skip_constraint_checks, - char relpersistence, int options); + char relpersistence, ReindexParams *params); /* Flag bits for reindex_relation(): */ #define REINDEX_REL_PROCESS_TOAST 0x01 @@ -155,7 +158,7 @@ extern void reindex_index(Oid indexId, bool skip_constraint_checks, #define REINDEX_REL_FORCE_INDEXES_UNLOGGED 0x08 #define REINDEX_REL_FORCE_INDEXES_PERMANENT 0x10 -extern bool reindex_relation(Oid relid, int flags, int options); +extern bool reindex_relation(Oid relid, int flags, ReindexParams *params); extern bool ReindexIsProcessingHeap(Oid heapOid); extern bool ReindexIsProcessingIndex(Oid indexOid); diff --git a/src/include/commands/cluster.h b/src/include/commands/cluster.h index 401a0827ae..1245d944dc 100644 --- a/src/include/commands/cluster.h +++ b/src/include/commands/cluster.h @@ -18,16 +18,17 @@ #include "storage/lock.h" #include "utils/relcache.h" - /* options for CLUSTER */ -typedef enum ClusterOption +#define CLUOPT_RECHECK 0x01 /* recheck relation state */ +#define CLUOPT_VERBOSE 0x02 /* print progress info */ + +typedef struct ClusterParams { - CLUOPT_RECHECK = 1 << 0, /* recheck relation state */ - CLUOPT_VERBOSE = 1 << 1 /* print progress info */ -} ClusterOption; + bits32 options; /* bitmask of CLUOPT_* */ +} ClusterParams; extern void cluster(ParseState *pstate, ClusterStmt *stmt, 
bool isTopLevel); -extern void cluster_rel(Oid tableOid, Oid indexOid, int options); +extern void cluster_rel(Oid tableOid, Oid indexOid, ClusterParams *params); extern void check_index_is_clusterable(Relation OldHeap, Oid indexOid, bool recheck, LOCKMODE lockmode); extern void mark_index_clustered(Relation rel, Oid indexOid, bool is_internal); diff --git a/src/include/commands/defrem.h b/src/include/commands/defrem.h index e2d2a77ca4..91281d6f8e 100644 --- a/src/include/commands/defrem.h +++ b/src/include/commands/defrem.h @@ -14,6 +14,7 @@ #ifndef DEFREM_H #define DEFREM_H +#include "catalog/index.h" #include "catalog/objectaddress.h" #include "nodes/params.h" #include "parser/parse_node.h" @@ -34,11 +35,7 @@ extern ObjectAddress DefineIndex(Oid relationId, bool check_not_in_use, bool skip_build, bool quiet); -extern int ReindexParseOptions(ParseState *pstate, ReindexStmt *stmt); -extern void ReindexIndex(RangeVar *indexRelation, int options, bool isTopLevel); -extern Oid ReindexTable(RangeVar *relation, int options, bool isTopLevel); -extern void ReindexMultipleTables(const char *objectName, ReindexObjectType objectKind, - int optio
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2020-12-30 17:59, Bharath Rupireddy wrote: On Wed, Dec 30, 2020 at 5:20 PM Alexey Kondratov wrote: On 2020-12-30 09:10, Bharath Rupireddy wrote: I still have some doubts that it is worth of allowing to call postgres_fdw_disconnect() in the explicit transaction block, since it adds a lot of things to care and check for. But otherwise current logic looks solid. +errdetail("Such connections get closed either in the next use or at the end of the current transaction.") +: errdetail("Such connection gets closed either in the next use or at the end of the current transaction."))); Does it really have a chance to get closed on the next use? If foreign server is dropped then user mapping should be dropped as well (either with CASCADE or manually), but we do need user mapping for a local cache lookup. That way, if I understand all the discussion up-thread correctly, we can only close such connections at the end of xact, do we? The next use of such a connection in the following query whose foreign server would have been dropped fails because of the cascading that can happen to drop the user mapping and the foreign table as well. During the start of the next query such connection will be marked as invalidated because xact_depth of that connection is > 1 and when the fail happens, txn gets aborted due to which pgfdw_xact_callback gets called and in that the connection gets closed. To make it more clear, please have a look at the scenarios [1]. In my understanding 'connection gets closed either in the next use' means that connection will be closed next time someone will try to use it, i.e. GetConnection() will be called and it closes this connection because of a bad state. However, if foreign server is dropped GetConnection() cannot lookup the connection because it needs a user mapping oid as a key. I had a look on your scenarios. IIUC, under **next use** you mean a select attempt from a table belonging to the same foreign server, which leads to a transaction abort and connection gets closed in the xact callback. Sorry, maybe I am missing something, but this just confirms that such connections only get closed in the xact callback (taking into account your recently committed patch [1]), so 'next use' looks misleading. [1] https://www.postgresql.org/message-id/8b2aa1aa-c638-12a8-cb56-ea0f0a5019cf%40oss.nttdata.com Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2020-12-30 09:10, Bharath Rupireddy wrote: Hi, I'm posting a v4-0001 patch for the new functions postgres_fdw_get_connections() and postgres_fdw_disconnect(). In this patch, I tried to address the review comments provided upthread. Thoughts? I still have some doubts that it is worth of allowing to call postgres_fdw_disconnect() in the explicit transaction block, since it adds a lot of things to care and check for. But otherwise current logic looks solid. + errdetail("Such connections get closed either in the next use or at the end of the current transaction.") + : errdetail("Such connection gets closed either in the next use or at the end of the current transaction."))); Does it really have a chance to get closed on the next use? If foreign server is dropped then user mapping should be dropped as well (either with CASCADE or manually), but we do need user mapping for a local cache lookup. That way, if I understand all the discussion up-thread correctly, we can only close such connections at the end of xact, do we? + * This function returns false if the cache doesn't exist. + * When the cache exists: I think that this will be corrected later by pg_indent, but still. In this comment section following points 1) and 2) have a different combination of tabs/spaces. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom 82b82b901e8f0ca84e1b21454a3b68f4fd8f0016 Mon Sep 17 00:00:00 2001 From: Bharath Rupireddy Date: Wed, 30 Dec 2020 11:07:35 +0530 Subject: [PATCH v4] postgres_fdw function to discard cached connections This patch introduces new function postgres_fdw_disconnect() when called with a foreign server name discards the associated connections with the server name. When called without any argument, discards all the existing cached connections. This patch also adds another function postgres_fdw_get_connections() to get the list of all cached connections by corresponding foreign server names. --- contrib/postgres_fdw/connection.c | 319 +- .../postgres_fdw/expected/postgres_fdw.out| 415 +- contrib/postgres_fdw/postgres_fdw--1.0.sql| 15 + contrib/postgres_fdw/sql/postgres_fdw.sql | 166 +++ doc/src/sgml/postgres-fdw.sgml| 57 ++- 5 files changed, 953 insertions(+), 19 deletions(-) diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index d841cec39b..496ef2bae2 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -22,6 +22,7 @@ #include "postgres_fdw.h" #include "storage/fd.h" #include "storage/latch.h" +#include "utils/builtins.h" #include "utils/datetime.h" #include "utils/hsearch.h" #include "utils/inval.h" @@ -57,6 +58,8 @@ typedef struct ConnCacheEntry bool have_error; /* have any subxacts aborted in this xact? */ bool changing_xact_state; /* xact state change in process */ bool invalidated; /* true if reconnect is pending */ + /* Server oid to get the associated foreign server name. 
*/ + Oid serverid; uint32 server_hashvalue; /* hash value of foreign server OID */ uint32 mapping_hashvalue; /* hash value of user mapping OID */ } ConnCacheEntry; @@ -73,6 +76,12 @@ static unsigned int prep_stmt_number = 0; /* tracks whether any work is needed in callback functions */ static bool xact_got_connection = false; +/* + * SQL functions + */ +PG_FUNCTION_INFO_V1(postgres_fdw_get_connections); +PG_FUNCTION_INFO_V1(postgres_fdw_disconnect); + /* prototypes of private functions */ static void make_new_connection(ConnCacheEntry *entry, UserMapping *user); static PGconn *connect_pg_server(ForeignServer *server, UserMapping *user); @@ -94,6 +103,8 @@ static bool pgfdw_exec_cleanup_query(PGconn *conn, const char *query, static bool pgfdw_get_cleanup_result(PGconn *conn, TimestampTz endtime, PGresult **result); static bool UserMappingPasswordRequired(UserMapping *user); +static bool disconnect_cached_connections(uint32 hashvalue, bool all, + bool *is_in_use); /* * Get a PGconn which can be used to execute queries on the remote PostgreSQL @@ -273,6 +284,7 @@ make_new_connection(ConnCacheEntry *entry, UserMapping *user) entry->have_error = false; entry->changing_xact_state = false; entry->invalidated = false; + entry->serverid = server->serverid; entry->server_hashvalue = GetSysCacheHashValue1(FOREIGNSERVEROID, ObjectIdGetDatum(server->serverid)); @@ -789,6 +801,13 @@ pgfdw_xact_callback(XactEvent event, void *arg) if (!xact_got_connection) return; + /* + * Quick exit if the cache has been destroyed in + * disconnect_cached_connections. + */ + if (!ConnectionHash) + return; + /* * Scan all connection cache entries to find open remote transactions, and * close them. @@ -98
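For reference, usage of the two functions added by this patch would look roughly as follows (the foreign server name is hypothetical; the behaviour is as described in the commit message above):

-- list foreign servers that have a cached connection in this session
SELECT * FROM postgres_fdw_get_connections();

-- drop the cached connection for one server ...
SELECT postgres_fdw_disconnect('loopback');

-- ... or drop all cached connections at once
SELECT postgres_fdw_disconnect();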
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-12-23 10:38, Michael Paquier wrote: On Tue, Dec 22, 2020 at 03:15:37PM -0600, Justin Pryzby wrote: Now, I really think utility.c ought to pass in a pointer to a local ReindexOptions variable to avoid all the memory context, which is unnecessary and prone to error. Yeah, it sounds right to me to just bite the bullet and do this refactoring, limiting the manipulations of the options as much as possible across contexts. So +1 from me to merge 0001 and 0002 together. I have adjusted a couple of comments and simplified a bit more the code in utility.c. I think that this is commitable, but let's wait more than a couple of days for Alvaro and Peter first. This is a period of vacations for a lot of people, and there is no point to apply something that would need more work at the end. Using hexas for the flags with bitmasks is the right conclusion IMO, but we are not alone. After eyeballing the patch I can add that we should alter this comment: int options;/* bitmask of VacuumOption */ as you are going to replace VacuumOption with VACOPT_* defs. So this should say: /* bitmask of VACOPT_* */ Also I have found naming to be a bit inconsistent: * we have ReindexOptions, but VacuumParams * and ReindexOptions->flags, but VacuumParams->options And the last one, you have used bits32 for Cluster/ReindexOptions, but left VacuumParams->options as int. Maybe we should also change it to bits32 for consistency? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-12-23 08:22, Justin Pryzby wrote:

On Tue, Dec 22, 2020 at 03:22:19PM -0800, Zhihong Yu wrote:

Justin:

For reindex_index():

+ if (options->tablespaceOid == MyDatabaseTableSpace)
+ options->tablespaceOid = InvalidOid;
...
+ oldTablespaceOid = iRel->rd_rel->reltablespace;
+ if (set_tablespace &&
+ (options->tablespaceOid != oldTablespaceOid ||
+ (options->tablespaceOid == MyDatabaseTableSpace && OidIsValid(oldTablespaceOid

I wonder why the options->tablespaceOid == MyDatabaseTableSpace clause appears again in the second if statement, since the first if statement would assign InvalidOid to options->tablespaceOid when the first if condition is satisfied.

Good question. Alexey mentioned on Sept 23 that he added the first stanza to avoid storing the DB's tablespace OID (rather than InvalidOid).

I think the 2nd half of the "or" is unnecessary, since that case is already covered by setting options->tablespaceOid = InvalidOid. If requesting to move to the DB's default tablespace, it'll now hit the first part of the OR:

+ (options->tablespaceOid != oldTablespaceOid ||

Without the first stanza setting, it would've hit the 2nd condition:

+ (options->tablespaceOid == MyDatabaseTableSpace && OidIsValid(oldTablespaceOid

which means: "the user requested to move a table to the DB's default tablespace, and it was previously on a nondefault one". So I think we can drop the 2nd half of the OR.

Thanks for noticing. Yes, I had not noticed that we would have already assigned tablespaceOid to InvalidOid in this case. Back in v7 we were doing this assignment a bit later, so this could make sense, but now it seems to be redundant. For some reason I have mixed up these refactorings separated by a dozen versions...

Thanks

--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2020-12-17 18:02, Bharath Rupireddy wrote:

On Thu, Dec 17, 2020 at 5:38 PM Hou, Zhijie wrote:

I took a look into the patch, and have a little issue:

+bool disconnect_cached_connections(uint32 hashvalue, bool all)

+ if (all)
+ {
+ hash_destroy(ConnectionHash);
+ ConnectionHash = NULL;
+ result = true;
+ }

If disconnect_cached_connections is called to disconnect all the connections, should we reset the 'xact_got_connection' flag?

I think we must allow postgres_fdw_disconnect() to disconnect the particular/all connections only when the corresponding entries have no open txns or the connections are not being used in that txn, otherwise not. We may end up closing/disconnecting a connection that is still in use, because entry->xact_depth can even go above 1 for sub txns. See use case [1].

+if ((all || entry->server_hashvalue == hashvalue) && entry->xact_depth == 0 &&
+entry->conn)
+{
+disconnect_pg_server(entry);
+result = true;
+}

Thoughts?

I think that you are right. Actually, I was thinking about a much simpler solution to this problem --- just restrict postgres_fdw_disconnect() to run only *outside* of an explicit transaction block. This should protect everyone from having their in-use connections closed, but it seems to be a bit more restrictive than what you propose.

I just thought that if we start closing fdw connections in an open xact block we could:

1) Close a couple of them.
2) Find one with xact_depth > 0 and error out.
3) End up in a mixed state: some of the connections were closed, but some of them were not, and this cannot be rolled back with the xact.

In other words, I have some doubts about allowing a function that is non-transactional by nature to be called inside a transaction block.

Regards

--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
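To make the mixed-state concern in points 1)-3) concrete, here is a sketch (the foreign table and server names are hypothetical, and the failure mode shown is the one being debated, not settled behaviour):

BEGIN;
SELECT count(*) FROM ftab_a;       -- the connection to server_a now has xact_depth > 0
SELECT postgres_fdw_disconnect();  -- suppose it closes idle connections to other servers
                                   -- first and then errors out on server_a
ROLLBACK;                          -- the disconnects already performed cannot be undone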
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-12-15 03:14, Justin Pryzby wrote: On Sat, Dec 12, 2020 at 01:45:26PM -0600, Justin Pryzby wrote: On Sat, Dec 12, 2020 at 09:20:35AM +0100, Peter Eisentraut wrote: > On 2020-12-11 21:27, Alvaro Herrera wrote: > > By the way-- What did you think of the idea of explictly marking the > > types used for bitmasks using types bits32 and friends, instead of plain > > int, which is harder to spot? > > If we want to make it clearer, why not turn the thing into a struct, as in > the attached patch, and avoid the bit fiddling altogether. I like this. It's a lot like what I wrote as [PATCH v31 1/5] ExecReindex and ReindexParams In my v31 patch, I moved ReindexOptions to a private structure in indexcmds.c, with an "int options" bitmask which is passed to reindex_index() et al. Your patch keeps/puts ReindexOptions index.h, so it also applies to reindex_index, which I think is good. So I've rebased this branch on your patch. Some thoughts: - what about removing the REINDEXOPT_* prefix ? - You created local vars with initialization like "={}". But I thought it's needed to include at least one struct member like "={false}", or else they're not guaranteed to be zerod ? - You passed the structure across function calls. The usual convention is to pass a pointer. I think maybe Michael missed this message (?) I had applied some changes on top of Peter's patch. I squished those commits now, and also handled ClusterOption and VacuumOption in the same style. Some more thoughts: - should the structures be named in plural ? "ReindexOptions" etc. Since they define *all* the options, not just a single bit. - For vacuum, do we even need a separate structure, or should the members be directly within VacuumParams ? It's a bit odd to write params.options.verbose. Especially since there's also ternary options which are directly within params. This is exactly what I have thought after looking on Peter's patch. Writing 'params.options.some_option' looks just too verbose. I even started moving all vacuum options into VacuumParams on my own and was in the middle of the process when realized that there are some places that do not fit well, like: if (!vacuum_is_relation_owner(RelationGetRelid(onerel), onerel->rd_rel, params->options & VACOPT_ANALYZE)) Here we do not want to set option permanently, but rather to trigger some additional code path in the vacuum_is_relation_owner(), IIUC. With unified VacuumParams we should do: bool opt_analyze = params->analyze; ... params->analyze = true; if (!vacuum_is_relation_owner(RelationGetRelid(onerel), onerel->rd_rel, params)) ... params->analyze = opt_analyze; to achieve the same, but it does not look good to me, or change the whole interface. I have found at least one other place like that so far --- vacuum_open_relation() in the analyze_rel(). Not sure how we could better cope with such logic. - Then, for cluster, I think it should be called ClusterParams, and eventually include the tablespaceOid, like what we're doing for Reindex. I am awaiting feedback on these before going further since I've done too much rebasing with these ideas going back and forth and back. Yes, we have moved a bit from my original patch set in the thread with all this refactoring. However, all the consequent patches are heavily depend on this interface, so let us decide first on the proposed refactoring. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-12-04 04:25, Justin Pryzby wrote: On Thu, Dec 03, 2020 at 04:12:53PM +0900, Michael Paquier wrote: > +typedef struct ReindexParams { > + bool concurrently; > + bool verbose; > + bool missingok; > + > + int options;/* bitmask of lowlevel REINDEXOPT_* */ > +} ReindexParams; > + By moving everything into indexcmds.c, keeping ReindexParams within it makes sense to me. Now, there is no need for the three booleans because options stores the same information, no? I liked the bools, but dropped them so the patch is smaller. I had a look on 0001 and it looks mostly fine to me except some strange mixture of tabs/spaces in the ExecReindex(). There is also a couple of meaningful comments: - options = - (verbose ? REINDEXOPT_VERBOSE : 0) | - (concurrently ? REINDEXOPT_CONCURRENTLY : 0); + if (verbose) + params.options |= REINDEXOPT_VERBOSE; Why do we need this intermediate 'verbose' variable here? We only use it once to set a bitmask. Maybe we can do it like this: params.options |= defGetBoolean(opt) ? REINDEXOPT_VERBOSE : 0; See also attached txt file with diff (I wonder can I trick cfbot this way, so it does not apply the diff). + int options;/* bitmask of lowlevel REINDEXOPT_* */ I would prefer if the comment says '/* bitmask of ReindexOption */' as in the VacuumOptions, since citing the exact enum type make it easier to navigate source code. Regarding the REINDEX patch, I think this comment is misleading: |+* Even if table was moved to new tablespace, normally toast cannot move. | */ |+ Oid toasttablespaceOid = allowSystemTableMods ? tablespaceOid : InvalidOid; |result |= reindex_relation(toast_relid, flags, I think it ought to say "Even if a table's indexes were moved to a new tablespace, its toast table's index is not normally moved" Right ? Yes, I think so, we are dealing only with index tablespace changing here. Thanks for noticing. Also, I don't know whether we should check for GLOBALTABLESPACE_OID after calling get_tablespace_oid(), or in the lowlevel routines. Note that reindex_relation is called during cluster/vacuum, and in the later patches, I moved the test from from cluster() and ExecVacuum() to rebuild_relation(). IIRC, I wanted to do GLOBALTABLESPACE_OID check as early as possible (just after getting Oid), since it does not make sense to proceed further if tablespace is set to that value. So initially there were a lot of duplicative GLOBALTABLESPACE_OID checks, since there were a lot of reindex entry-points (index, relation, concurrently, etc.). Now we are going to have ExecReindex(), so there are much less entry-points and in my opinion it is fine to keep this validation just after get_tablespace_oid(). However, this is mostly a sanity check. I can hardly imagine a lot of users trying to constantly move indexes to the global tablespace, so it is also OK to put this check deeper into guts. 
Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Companydiff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c index a27f8f9d83..0b1884815c 100644 --- a/src/backend/commands/indexcmds.c +++ b/src/backend/commands/indexcmds.c @@ -2472,8 +2472,6 @@ void ExecReindex(ParseState *pstate, ReindexStmt *stmt, bool isTopLevel) { ReindexParams params = {0}; - bool verbose = false, -concurrently = false; ListCell *lc; char *tablespace = NULL; @@ -2483,9 +2481,11 @@ ExecReindex(ParseState *pstate, ReindexStmt *stmt, bool isTopLevel) DefElem*opt = (DefElem *) lfirst(lc); if (strcmp(opt->defname, "verbose") == 0) - verbose = defGetBoolean(opt); + params.options |= defGetBoolean(opt) ? +REINDEXOPT_VERBOSE : 0; else if (strcmp(opt->defname, "concurrently") == 0) - concurrently = defGetBoolean(opt); + params.options |= defGetBoolean(opt) ? +REINDEXOPT_CONCURRENTLY : 0; else if (strcmp(opt->defname, "tablespace") == 0) tablespace = defGetString(opt); else @@ -2496,18 +2496,12 @@ ExecReindex(ParseState *pstate, ReindexStmt *stmt, bool isTopLevel) parser_errposition(pstate, opt->location))); } - if (verbose) - params.options |= REINDEXOPT_VERBOSE; + params.tablespaceOid = tablespace ? + get_tablespace_oid(tablespace, false) : InvalidOid; - if (concurrently) - { - params.options |= REINDEXOPT_CONCURRENTLY; + if (params.options & REINDEXOPT_CONCURRENTLY) PreventInTransactionBlock(isTopLevel, "REINDEX CONCURRENTLY"); - } - - params.tablespaceOid = tablespace ? - get_tablespace_oid(tablespace, false) : InvalidOid; switch (stmt->kind) {
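For illustration, the sanity check mentioned above amounts to rejecting something like the following right after the tablespace name is resolved (the table name is hypothetical and the exact error wording is up to the patch):

-- pg_global is reserved for shared catalogs, so this should fail early
REINDEX (TABLESPACE pg_global) TABLE my_table;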
Re: Notes on physical replica failover with logical publisher or subscriber
Hi Craig,

On 2020-11-30 06:59, Craig Ringer wrote:

> https://wiki.postgresql.org/wiki/Logical_replication_and_physical_standby_failover

Thank you for sharing these notes. I have not dealt a lot with physical/logical replication interoperability, so most of these problems were new to me.

One point from the wiki page seems clear enough to me:

```
Logical slots can fill pg_wal and can't benefit from archiving. Teach the logical decoding page read callback how to use the restore_command to retrieve WAL segs temporarily if they're not found in pg_wal...
```

It does not look like a big deal to teach the logical decoding process to use restore_command, but I have some doubts about how everything will perform once we have started getting WAL from the archive for decoding purposes. If we have started using restore_command, then the subscriber has already lagged long enough to exceed max_slot_wal_keep_size. Taking into account that getting WAL files from the archive has an additional overhead and that the primary continues generating (and archiving) new segments, there is a possibility that the primary ends up doing this double duty forever --- first archiving a WAL file and then fetching it back for decoding when requested. Another problem is that there may be several active decoders, IIRC, so they had better communicate in order to avoid fetching the same segment twice.

> I tried to address many of these issues with failover slots, but I am not trying to beat that dead horse now. I know that at least some people here are of the opinion that effort shouldn't go into logical/physical replication interoperation anyway - that we should instead address the remaining limitations in logical replication so that it can provide complete HA capabilities without use of physical replication. So for now I'm just trying to save others who go looking into these issues some time and warn them about some of the less obvious booby-traps.

Another point to add regarding the logical replication capabilities needed to build a logical-only HA system --- a logical equivalent of pg_rewind. At least I have not noticed anything about it after a brief reading of the wiki page. IIUC, currently there is no way to quickly return an ex-primary (ex-logical publisher) to the HA cluster without doing a pg_basebackup, is there? It seems that we would have the same problem here as with physical replication --- the ex-primary may accept some xacts after the promotion of the new primary, so their histories diverge and the old primary has to be rewound before being returned as a standby (subscriber).

Regards

--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-11-30 14:33, Michael Paquier wrote: On Tue, Nov 24, 2020 at 09:31:23AM -0600, Justin Pryzby wrote: @cfbot: rebased Catching up with the activity here, I can see four different things in the patch set attached: 1) Refactoring of the grammar of CLUSTER, VACUUM, ANALYZE and REINDEX to support values in parameters. 2) Tablespace change for REINDEX. 3) Tablespace change for VACUUM FULL/CLUSTER. 4) Tablespace change for indexes with VACUUM FULL/CLUSTER. I am not sure yet about the last three points, so let's begin with 1) that is dealt with in 0001 and 0002. I have spent some time on 0001, renaming the rule names to be less generic than "common", and applied it. 0002 looks to be in rather good shape, still there are a few things that have caught my eyes. I'll look at that more closely tomorrow. Thanks. I have rebased the remaining patches on top of 873ea9ee to use 'utility_option_list' instead of 'common_option_list'. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom ac3b77aec26a40016784ada9dab8b9059f424fa4 Mon Sep 17 00:00:00 2001 From: Justin Pryzby Date: Tue, 31 Mar 2020 20:35:41 -0500 Subject: [PATCH v31 5/5] Implement vacuum full/cluster (INDEX_TABLESPACE ) --- doc/src/sgml/ref/cluster.sgml | 12 - doc/src/sgml/ref/vacuum.sgml | 12 - src/backend/commands/cluster.c| 64 ++- src/backend/commands/matview.c| 3 +- src/backend/commands/tablecmds.c | 2 +- src/backend/commands/vacuum.c | 46 +++- src/backend/postmaster/autovacuum.c | 1 + src/include/commands/cluster.h| 6 ++- src/include/commands/vacuum.h | 5 +- src/test/regress/input/tablespace.source | 13 + src/test/regress/output/tablespace.source | 20 +++ 11 files changed, 123 insertions(+), 61 deletions(-) diff --git a/doc/src/sgml/ref/cluster.sgml b/doc/src/sgml/ref/cluster.sgml index cbfc0582be..6781e3a025 100644 --- a/doc/src/sgml/ref/cluster.sgml +++ b/doc/src/sgml/ref/cluster.sgml @@ -28,6 +28,7 @@ CLUSTER [VERBOSE] [ ( option [, ... VERBOSE [ boolean ] TABLESPACE new_tablespace +INDEX_TABLESPACE new_tablespace @@ -105,6 +106,15 @@ CLUSTER [VERBOSE] [ ( option [, ... + +INDEX_TABLESPACE + + + Specifies that the table's indexes will be rebuilt on a new tablespace. + + + + table_name @@ -141,7 +151,7 @@ CLUSTER [VERBOSE] [ ( option [, ... new_tablespace - The tablespace where the table will be rebuilt. + The tablespace where the table or its indexes will be rebuilt. diff --git a/doc/src/sgml/ref/vacuum.sgml b/doc/src/sgml/ref/vacuum.sgml index 5261a7c727..28cab119b6 100644 --- a/doc/src/sgml/ref/vacuum.sgml +++ b/doc/src/sgml/ref/vacuum.sgml @@ -36,6 +36,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ boolean ] PARALLEL integer TABLESPACE new_tablespace +INDEX_TABLESPACE new_tablespace and table_and_columns is: @@ -265,6 +266,15 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ boolean @@ -314,7 +324,7 @@ VACUUM [ FULL ] [ FREEZE ] [ VERBOSE ] [ ANALYZE ] [ new_tablespace - The tablespace where the relation will be rebuilt. + The tablespace where the relation or its indexes will be rebuilt. 
diff --git a/src/backend/commands/cluster.c b/src/backend/commands/cluster.c index b289a76d58..0f9f09a15a 100644 --- a/src/backend/commands/cluster.c +++ b/src/backend/commands/cluster.c @@ -71,7 +71,7 @@ typedef struct static void rebuild_relation(Relation OldHeap, Oid indexOid, bool verbose, - Oid NewTableSpaceOid); + Oid NewTableSpaceOid, Oid NewIdxTableSpaceOid); static void copy_table_data(Oid OIDNewHeap, Oid OIDOldHeap, Oid OIDOldIndex, bool verbose, bool *pSwapToastByContent, TransactionId *pFreezeXid, MultiXactId *pCutoffMulti); @@ -107,9 +107,11 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel) { ListCell *lc; int options = 0; - /* Name and Oid of tablespace to use for clustered relation. */ - char *tablespaceName = NULL; - Oid tablespaceOid = InvalidOid; + /* Name and Oid of tablespaces to use for clustered relations. */ + char *tablespaceName = NULL, +*idxtablespaceName = NULL; + Oid tablespaceOid, +idxtablespaceOid; /* Parse list of generic parameters not handled by the parser */ foreach(lc, stmt->params) @@ -123,6 +125,8 @@ cluster(ParseState *pstate, ClusterStmt *stmt, bool isTopLevel) options &= ~CLUOPT_VERBOSE; else if (strcmp(opt->defname, "tablespace") == 0) tablespaceName = defGetString(opt); + else if (strcmp(opt->defname, "index_tablespace") == 0) + idxtablespaceName = defGetString(opt); else ereport(ERROR, (errcode(ERRCODE_SYNTAX_ERROR), @@ -1
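For reference, the 5/5 patch above makes the following kind of invocation possible (tablespace and table names are hypothetical):

-- rewrite the table in one tablespace and rebuild its indexes in another
VACUUM (FULL, TABLESPACE regress_tblspace, INDEX_TABLESPACE regress_idx_tblspace) my_table;

Per the documentation hunks above, CLUSTER accepts the same TABLESPACE and INDEX_TABLESPACE options in its parenthesized option list.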
Re: Printing LSN made easy
Hi,

On 2020-11-27 13:40, Ashutosh Bapat wrote:

Off list, Peter Eisentraut pointed out that we can not use these macros in elog/ereport since it creates problems for translations. He suggested adding functions which return strings and using %s when doing so. The patch has two functions: pg_lsn_out_internal(), which takes an LSN as input and returns a palloc'ed string containing the string representation of the LSN. This may not be suitable in performance-critical paths and also may leak memory if not freed. So there's another function, pg_lsn_out_buffer(), which takes an LSN and a char array as input, fills the char array with the string representation and returns the pointer to the char array. This allows the function to be used as an argument in printf/elog etc. Macro MAXPG_LSNLEN has been externalized for this purpose.

If usage of macros in elog/ereport can cause problems for translation, then even with this patch life does not get significantly simpler. For example, instead of just doing something like:

elog(WARNING,
- "xlog min recovery request %X/%X is past current point %X/%X",
- (uint32) (lsn >> 32), (uint32) lsn,
- (uint32) (newMinRecoveryPoint >> 32),
- (uint32) newMinRecoveryPoint);
+ "xlog min recovery request " LSN_FORMAT " is past current point " LSN_FORMAT,
+ LSN_FORMAT_ARG(lsn),
+ LSN_FORMAT_ARG(newMinRecoveryPoint));

we have to either declare two additional local buffers, which is verbose, or use pg_lsn_out_internal() and rely on memory contexts (or do pfree() manually, which is verbose again) to prevent memory leaks.

Off list, Craig Ringer suggested introducing a new format specifier similar to %m for LSN, but I did not get time to take a look at the relevant code. AFAIU it's available only to elog/ereport, so it may not be useful generally. But teaching printf variants about the new format would be the best solution. However, I didn't find any way to do that.

It seems that this topic has been extensively discussed off-list, but still a strong +1 for the patch. I always wanted LSN printing to be more concise. I have just tried the new printing utilities in a couple of new places and it looks good to me.

+char *
+pg_lsn_out_internal(XLogRecPtr lsn)
+{
+ char buf[MAXPG_LSNLEN + 1];
+
+ snprintf(buf, sizeof(buf), LSN_FORMAT, LSN_FORMAT_ARG(lsn));
+
+ return pstrdup(buf);
+}

Would it be a bit more straightforward if we palloc'd buf initially and just returned a pointer instead of doing pstrdup()?

Regards

--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company

From 698e481f5f55b967b5c60dba4bc577f8baa20ff4 Mon Sep 17 00:00:00 2001
From: Ashutosh Bapat
Date: Fri, 16 Oct 2020 17:09:29 +0530
Subject: [PATCH] Make it easy to print LSN

The commit introduces the following macros and functions to make it easy to use LSNs in printf variants, elog, ereport and appendStringInfo variants.

LSN_FORMAT - macro representing the format in which LSN is printed
LSN_FORMAT_ARG - macro to pass LSN as an argument to the above format
pg_lsn_out_internal - a function which returns a palloc'ed char array containing the string representation of a given LSN
pg_lsn_out_buffer - similar to the above, but accepts and returns a char array of size (MAXPG_LSNLEN + 1)

The commit also has some example usages of these.
Ashutosh Bapat --- contrib/pageinspect/rawpage.c| 3 +- src/backend/access/rmgrdesc/replorigindesc.c | 5 +- src/backend/access/rmgrdesc/xlogdesc.c | 3 +- src/backend/access/transam/xlog.c| 8 ++-- src/backend/utils/adt/pg_lsn.c | 49 ++-- src/include/access/xlogdefs.h| 7 +++ src/include/utils/pg_lsn.h | 3 ++ 7 files changed, 55 insertions(+), 23 deletions(-) diff --git a/contrib/pageinspect/rawpage.c b/contrib/pageinspect/rawpage.c index c0181506a5..2cd055a5f0 100644 --- a/contrib/pageinspect/rawpage.c +++ b/contrib/pageinspect/rawpage.c @@ -261,8 +261,7 @@ page_header(PG_FUNCTION_ARGS) { char lsnchar[64]; - snprintf(lsnchar, sizeof(lsnchar), "%X/%X", - (uint32) (lsn >> 32), (uint32) lsn); + snprintf(lsnchar, sizeof(lsnchar), LSN_FORMAT, LSN_FORMAT_ARG(lsn)); values[0] = CStringGetTextDatum(lsnchar); } else diff --git a/src/backend/access/rmgrdesc/replorigindesc.c b/src/backend/access/rmgrdesc/replorigindesc.c index 19e14f910b..a3f49b5750 100644 --- a/src/backend/access/rmgrdesc/replorigindesc.c +++ b/src/backend/access/rmgrdesc/replorigindesc.c @@ -29,10 +29,9 @@ replorigin_desc(StringInfo buf, XLogReaderState *record) xlrec = (xl_replorigin_set *) rec; -appendStringInfo(buf, "set %u; lsn %X/%X; force: %d", +appendStringInfo(buf, "set %u; lsn " LSN_FORMAT "; force: %d", xlrec->
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2020-11-25 06:17, Bharath Rupireddy wrote:

On Wed, Nov 25, 2020 at 7:24 AM Craig Ringer wrote:

A quick thought here. Would it make sense to add a hook in the DISCARD ALL implementation that postgres_fdw can register for? There's precedent here, since DISCARD ALL already has the same effect as SELECT pg_advisory_unlock_all(); amongst other things.

IIUC, then it is like a core (server) function doing some work for the postgres_fdw module. Earlier in the discussion, one point raised was that it's better not to have core handling something related to postgres_fdw. This is the reason we have come up with a postgres_fdw-specific function and a GUC, which get defined when the extension is created. Similarly, dblink also has its own set of functions, one of which is dblink_disconnect().

If I have understood Craig correctly, he proposed that we already have a DISCARD ALL statement, which is processed by DiscardAll(), and it releases internal resources known from the core perspective. That way, we could introduce a general-purpose hook, DiscardAll_hook(), so postgres_fdw could make use of it to clean up its own resources (connections in our context) if needed. In other words, it is not like a core function doing some work for the postgres_fdw module, but rather a callback/hook that postgres_fdw is able to register for in order to do some additional work.

It could be a good replacement for 0001, but wouldn't it already be overkill to drop all local caches along with the remote connections? I mean that it would be a nice-to-have hook from the extensibility perspective, but postgres_fdw_disconnect() still makes sense, since it does a very narrow and specific job.

Regards

--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2020-11-24 06:52, Bharath Rupireddy wrote:

Thanks for the review comments.

On Mon, Nov 23, 2020 at 9:57 PM Alexey Kondratov wrote:

> v1-0001-postgres_fdw-function-to-discard-cached-connections.patch

This patch looks pretty straightforward to me, but there are some things to be addressed IMO:

+ server = GetForeignServerByName(servername, true);
+
+ if (server != NULL)
+ {

Yes, you return false if no server was found, but to me it is worth throwing an error in this case, as, for example, dblink does in dblink_disconnect().

dblink_disconnect() "Returns status, which is always OK (since any error causes the function to throw an error instead of returning)." This behaviour doesn't seem okay to me. Since we return true/false, I would prefer throwing a warning (with a reason) while returning false over throwing an error.

I thought about something a bit more sophisticated:

1) Return 'true' if there were open connections and we successfully closed them.
2) Return 'false' in the no-op case, i.e. there were no open connections.
3) Raise an error if something went wrong. And the non-existing server case belongs to this last category, IMO.

That looks like a semantically correct behavior, but let us wait for any other opinion.

+ result = disconnect_cached_connections(FOREIGNSERVEROID,
+hashvalue,
+false);

+ if (all || (!all && cacheid == FOREIGNSERVEROID &&
+ entry->server_hashvalue == hashvalue))
+ {
+ if (entry->conn != NULL &&
+ !all && cacheid == FOREIGNSERVEROID &&
+ entry->server_hashvalue == hashvalue)

These conditions look bulky to me. First, you pass FOREIGNSERVEROID to disconnect_cached_connections(), but it actually just duplicates the 'all' flag, since when it is 'FOREIGNSERVEROID', then 'all == false'; when it is '-1', then 'all == true'. And that is all, there are only two calls of disconnect_cached_connections(). That way, it seems that we should keep only the 'all' flag, at least for now, shouldn't we?

I added cacheid as an argument to disconnect_cached_connections() for reusability. Say someone wants to use it with a user mapping: then they can pass cacheid USERMAPPINGOID and the hash value of the user mapping. The check cacheid == USERMAPPINGOID && entry->mapping_hashvalue == hashvalue can then be added to disconnect_cached_connections().

Yeah, I have got your point and the motivation to add this argument, but how can we use it? To disconnect all connections belonging to some specific user mapping? But any user mapping is hard-bound to some foreign server, AFAIK, so we can pass a serverid-based hash in this case. In the case of pgfdw_inval_callback() this argument makes sense, since syscache callbacks work that way, but here I can hardly imagine a case where we could use it. Thus, it still looks like a premature complication to me, since we do not have plans to use it, do we? Anyway, everything seems to be working fine, so it is up to you whether to keep this additional argument.

v1-0003-postgres_fdw-server-level-option-keep_connection.patch

This patch adds a new server-level option, keep_connection, default being on; when set to off, the local session doesn't cache the connections associated with the foreign server.

This patch looks good to me, except one note:

(entry->used_in_current_xact &&
- !keep_connections))
+ (!keep_connections || !entry->keep_connection)))
 {

Following this logic:

1) If keep_connections == true, then per-server keep_connection has a *higher* priority, so one can disable caching of a single foreign server.
2) But if keep_connections == false, then it works like a global switch off indifferently of per-server keep_connection's, i.e. they have a *lower* priority. It looks fine for me, at least I cannot propose anything better, but maybe it should be documented in 0004? v1-0004-postgres_fdw-connection-cache-discard-tests-and-documentation.patch This patch adds the tests and documentation related to this feature. I have not read all texts thoroughly, but what caught my eye: + A GUC, postgres_fdw.keep_connections, default being + on, when set to off, the local session I think that GUC acronym is used widely only in the source code and Postgres docs tend to do not use it at all, except from acronyms list and a couple of 'GUC parameters' collocation usage. And it never used in a singular form there, so I think that it should be rather: A configuration parameter, postgres_fdw.keep_connections, default being... + + Note that when postgres_fdw.keep_connections is set to + off, postgres_fdw discards either the connections + that are made previously and will be used by th
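Putting the two knobs discussed above together, usage would look roughly like this (the server name is hypothetical):

-- disable connection caching for one particular foreign server
ALTER SERVER loopback OPTIONS (ADD keep_connection 'false');

-- or switch caching off for the whole session; per point 2) above this
-- overrides any per-server keep_connection setting
SET postgres_fdw.keep_connections = off;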
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Hi, On 2020-11-23 09:48, Bharath Rupireddy wrote: Here is how I'm making 4 separate patches: 1. new function and it's documentation. 2. GUC and it's documentation. 3. server level option and it's documentation. 4. test cases for all of the above patches. Hi, I'm attaching the patches here. Note that, though the code changes for this feature are small, I divided them up as separate patches to make review easy. v1-0001-postgres_fdw-function-to-discard-cached-connections.patch This patch looks pretty straightforward for me, but there are some things to be addressed IMO: + server = GetForeignServerByName(servername, true); + + if (server != NULL) + { Yes, you return a false if no server was found, but for me it worth throwing an error in this case as, for example, dblink does in the dblink_disconnect(). + result = disconnect_cached_connections(FOREIGNSERVEROID, +hashvalue, +false); + if (all || (!all && cacheid == FOREIGNSERVEROID && + entry->server_hashvalue == hashvalue)) + { + if (entry->conn != NULL && + !all && cacheid == FOREIGNSERVEROID && + entry->server_hashvalue == hashvalue) These conditions look bulky for me. First, you pass FOREIGNSERVEROID to disconnect_cached_connections(), but actually it just duplicates 'all' flag, since when it is 'FOREIGNSERVEROID', then 'all == false'; when it is '-1', then 'all == true'. That is all, there are only two calls of disconnect_cached_connections(). That way, it seems that we should keep only 'all' flag at least for now, doesn't it? Second, I think that we should just rewrite this if statement in order to simplify it and make more readable, e.g.: if ((all || entry->server_hashvalue == hashvalue) && entry->conn != NULL) { disconnect_pg_server(entry); result = true; } + if (all) + { + hash_destroy(ConnectionHash); + ConnectionHash = NULL; + result = true; + } Also, I am still not sure that it is a good idea to destroy the whole cache even in 'all' case, but maybe others will have a different opinion. v1-0002-postgres_fdw-add-keep_connections-GUC-to-not-cache-connections.patch + entry->changing_xact_state) || + (entry->used_in_current_xact && + !keep_connections)) I am not sure, but I think, that instead of adding this additional flag into ConnCacheEntry structure we can look on entry->xact_depth and use local: bool used_in_current_xact = entry->xact_depth > 0; for exactly the same purpose. Since we set entry->xact_depth to zero at the end of xact, then it was used if it is not zero. It is set to 1 by begin_remote_xact() called by GetConnection(), so everything seems to be fine. Otherwise, both patches seem to be working as expected. I am going to have a look on the last two patches a bit later. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2020-11-19 07:11, Bharath Rupireddy wrote:

On Wed, Nov 18, 2020 at 10:32 PM Alexey Kondratov wrote:

Thanks! I will make separate patches and post them soon.

>> Attached is a small POC patch, which implements this contrib-level
>> postgres_fdw.keep_connections GUC. What do you think?
>
> I see two problems with your patch: 1) It just disconnects the remote
> connection at the end of txn if the GUC is set to false, but it
> doesn't remove the connection cache entry from ConnectionHash.

Yes, and this looks like a valid state for postgres_fdw; it can get into the same state even without my patch. Next time GetConnection() will find this cache entry, figure out that entry->conn is NULL and establish a fresh connection. It is not clear to me right now what benefits we would get from also clearing this cache entry, except just doing it for sanity.

By clearing the cache entry we will have 2 advantages: 1) we could save a (small) bit of memory 2) we could allow new connections to be cached, since currently ConnectionHash can have only 8 entries. IMHO, along with disconnecting, we can also clear off the cache entry. Thoughts?

IIUC, 8 is not a hard limit, it is just a starting size. ConnectionHash is not a shared-memory hash table, so dynahash can expand it on the fly; see, for example, the comment before hash_create():

* Note: for a shared-memory hashtable, nelem needs to be a pretty good
* estimate, since we can't expand the table on the fly. But an unshared
* hashtable can be expanded on-the-fly, so it's better for nelem to be
* on the small side and let the table grow if it's exceeded. An overly
* large nelem will penalize hash_seq_search speed without buying much.

Also, I am not sure that by doing just a HASH_REMOVE you will free any memory, since the hash table is already allocated (or expanded) to some size. So HASH_REMOVE will only add the removed entry to the freeList, I guess.

Anyway, I can hardly imagine bloating of ConnectionHash being a problem even in the case when one has thousands of foreign servers all being accessed during a single backend's life span.

Regards

--
Alexey Kondratov
Postgres Professional https://www.postgrespro.com
Russian Postgres Company
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
On 2020-11-18 16:39, Bharath Rupireddy wrote: Thanks for the interest shown! On Wed, Nov 18, 2020 at 1:07 AM Alexey Kondratov wrote: Regarding the initial issue I prefer point #3, i.e. foreign server option. It has a couple of benefits IMO: 1) it may be set separately on per foreign server basis, 2) it will live only in the postgres_fdw contrib without any need to touch core. I would only supplement this postgres_fdw foreign server option with a GUC, e.g. postgres_fdw.keep_connections, so one could easily define such behavior for all foreign servers at once or override server-level option by setting this GUC on per session basis. Below is what I have in my mind, mostly inline with yours: a) Have a server level option (keep_connetion true/false, with the default being true), when set to false the connection that's made with this foreign server is closed and cached entry from the connection cache is deleted at the end of txn in pgfdw_xact_callback. b) Have postgres_fdw level GUC postgres_fdw.keep_connections default being true. When set to false by the user, the connections, that are used after this, are closed and removed from the cache at the end of respective txns. If we don't use a connection that was cached prior to the user setting the GUC as false, then we may not be able to clear it. We can avoid this problem by recommending users either to set the GUC to false right after the CREATE EXTENSION postgres_fdw; or else use the function specified in (c). c) Have a new function that gets defined as part of CREATE EXTENSION postgres_fdw;, say postgres_fdw_discard_connections(), similar to dblink's dblink_disconnect(), which discards all the remote connections and clears connection cache. And we can also have server name as input to postgres_fdw_discard_connections() to discard selectively. Thoughts? If okay with the approach, I will start working on the patch. This approach looks solid enough from my perspective to give it a try. I would only make it as three separate patches for an ease of further review. Attached is a small POC patch, which implements this contrib-level postgres_fdw.keep_connections GUC. What do you think? I see two problems with your patch: 1) It just disconnects the remote connection at the end of txn if the GUC is set to false, but it doesn't remove the connection cache entry from ConnectionHash. Yes, and this looks like a valid state for postgres_fdw and it can get into the same state even without my patch. Next time GetConnection() will find this cache entry, figure out that entry->conn is NULL and establish a fresh connection. It is not clear for me right now, what benefits we will get from clearing also this cache entry, except just doing this for sanity. 2) What happens if there are some cached connections, user set the GUC to false and not run any foreign queries or not use those connections thereafter, so only the new connections will not be cached? Will the existing unused connections still remain in the connection cache? See (b) above for a solution. Yes, they will. This could be solved with that additional disconnect function as you proposed in c). Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
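For example, the recommendation in point (b) above boils down to:

CREATE EXTENSION postgres_fdw;
-- set the GUC before any foreign table is accessed, so that no connection
-- made later in this session is kept cached past the end of its transaction
SET postgres_fdw.keep_connections TO off;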
Re: [PATCH] postgres_fdw connection caching - cause remote sessions linger till the local session exit
Hi, On 2020-11-06 18:56, Anastasia Lubennikova wrote: Status update for a commitfest entry. This thread was inactive for a while and from the latest messages, I see that the patch needs some further work. So I move it to "Waiting on Author". The new status of this patch is: Waiting on Author I had a look on the initial patch and discussed options [1] to proceed with this issue. I agree with Bruce about idle_session_timeout, it would be a nice to have in-core feature on its own. However, this should be a cluster-wide option and it will start dropping all idle connection not only foreign ones. So it may be not an option for some cases, when the same foreign server is used for another load as well. Regarding the initial issue I prefer point #3, i.e. foreign server option. It has a couple of benefits IMO: 1) it may be set separately on per foreign server basis, 2) it will live only in the postgres_fdw contrib without any need to touch core. I would only supplement this postgres_fdw foreign server option with a GUC, e.g. postgres_fdw.keep_connections, so one could easily define such behavior for all foreign servers at once or override server-level option by setting this GUC on per session basis. Attached is a small POC patch, which implements this contrib-level postgres_fdw.keep_connections GUC. What do you think? [1] https://www.postgresql.org/message-id/CALj2ACUFNydy0uo0JL9A1isHQ9pFe1Fgqa_HVanfG6F8g21nSQ%40mail.gmail.com Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Companydiff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index ab3226287d..64f0e96635 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -28,6 +28,8 @@ #include "utils/memutils.h" #include "utils/syscache.h" +#include "postgres_fdw.h" + /* * Connection cache hash table entry * @@ -948,6 +950,7 @@ pgfdw_xact_callback(XactEvent event, void *arg) */ if (PQstatus(entry->conn) != CONNECTION_OK || PQtransactionStatus(entry->conn) != PQTRANS_IDLE || + !keep_connections || entry->changing_xact_state) { elog(DEBUG3, "discarding connection %p", entry->conn); diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c index 9c5aaacc51..4cd5f71223 100644 --- a/contrib/postgres_fdw/postgres_fdw.c +++ b/contrib/postgres_fdw/postgres_fdw.c @@ -45,6 +45,8 @@ #include "utils/sampling.h" #include "utils/selfuncs.h" +#include "postgres_fdw.h" + PG_MODULE_MAGIC; /* Default CPU cost to start up a foreign query. 
*/ @@ -301,6 +303,8 @@ typedef struct List *already_used; /* expressions already dealt with */ } ec_member_foreign_arg; +bool keep_connections = true; + /* * SQL functions */ @@ -505,6 +509,15 @@ static void merge_fdw_options(PgFdwRelationInfo *fpinfo, const PgFdwRelationInfo *fpinfo_o, const PgFdwRelationInfo *fpinfo_i); +void +_PG_init(void) +{ + DefineCustomBoolVariable("postgres_fdw.keep_connections", + "Enables postgres_fdw connection caching.", + "When off postgres_fdw will close connections at the end of transaction.", + &keep_connections, true, PGC_USERSET, 0, NULL, + NULL, NULL); +} /* * Foreign-data wrapper handler function: return a struct with pointers diff --git a/contrib/postgres_fdw/postgres_fdw.h b/contrib/postgres_fdw/postgres_fdw.h index eef410db39..7f1bdb96d6 100644 --- a/contrib/postgres_fdw/postgres_fdw.h +++ b/contrib/postgres_fdw/postgres_fdw.h @@ -124,9 +124,12 @@ typedef struct PgFdwRelationInfo int relation_index; } PgFdwRelationInfo; +extern bool keep_connections; + /* in postgres_fdw.c */ extern int set_transmission_modes(void); extern void reset_transmission_modes(int nestlevel); +extern void _PG_init(void); /* in connection.c */ extern PGconn *GetConnection(UserMapping *user, bool will_prep_stmt);
Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm
On 2020-11-11 06:59, Tom Lane wrote: Alexey Kondratov writes: After looking on the autoprewarm code more closely I have realised that this 'double dump' issues was not an issues at all. I have just misplaced a debug elog(), so its second output in the log was only indicating that we calculated delay_in_ms one more time. Ah --- that explains why I couldn't see a problem. I've pushed 0001+0002 plus some followup work to fix other places that could usefully use TimestampDifferenceMilliseconds(). I have not done anything with 0003 (the TAP test for pg_prewarm), and will leave that to the judgment of somebody who's worked with pg_prewarm before. To me it looks like it's not really testing things very carefully at all; on the other hand, we have exactly zero test coverage of that module today, so maybe something is better than nothing. Great, thank you for generalisation of the issue and working on it. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
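As an illustration of the call-site cleanups mentioned above, the before/after shape of such a conversion looks roughly like this (a sketch based on the autoprewarm code discussed in this thread, not an exact copy of the committed changes):

```c
/* Before: split the difference into sec/usec, then reassemble msec. */
long		secs;
int			usecs;
long		delay_in_ms;

TimestampDifference(GetCurrentTimestamp(), next_dump_time, &secs, &usecs);
delay_in_ms = secs * 1000 + usecs / 1000;

/* After: let the new helper return milliseconds directly. */
delay_in_ms = TimestampDifferenceMilliseconds(GetCurrentTimestamp(),
											  next_dump_time);
```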
Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm
On 2020-11-09 23:25, Tom Lane wrote:

Alexey Kondratov writes: On 2020-11-09 21:53, Tom Lane wrote: 0002 seems like a pretty clear bug fix, though I wonder if this is exactly what we want to do going forward. It seems like a very large fraction of the callers of TimestampDifference would like to have the value in msec, which means we're doing a whole lot of expensive and error-prone arithmetic to break down the difference to sec/usec and then put it back together again. Let's get rid of that by inventing, say TimestampDifferenceMilliseconds(...).

Yeah, I got into this problem after a bug in another extension — pg_wait_sampling. I have attached 0002, which implements TimestampDifferenceMilliseconds(), so 0003 just uses this new function to solve the initial issues. If it looks good to you, then we can switch all similar callers to it.

Yeah, let's move forward with that --- in fact, I'm inclined to back-patch it. (Not till the current release cycle is done, though. I don't find this important enough to justify a last-moment patch.) BTW, I wonder if we shouldn't make TimestampDifferenceMilliseconds round any fractional millisecond up rather than down. Rounding down seems to create a hazard of uselessly waking just before the delay is completed. Better to wake just after.

Yes, it makes sense. I have changed TimestampDifferenceMilliseconds() to round the result up if there is a remainder.

After looking at the autoprewarm code more closely I have realised that this 'double dump' issue was not an issue at all. I have just misplaced a debug elog(), so its second output in the log was only indicating that we calculated delay_in_ms one more time. Actually, even with the wrong calculation of delay_in_ms the only problem was that we were busy looping with a ~1 second interval instead of waiting on the latch. It is still buggy behaviour, but much less harmful than I had originally thought.

Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company

From ce09103d9d58b611728b66366cd24e8a4069f7ac Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:04:10 +0300 Subject: [PATCH v3 3/3] pg_prewarm: add tap test for autoprewarm feature --- contrib/pg_prewarm/Makefile | 2 + contrib/pg_prewarm/t/001_autoprewarm.pl | 59 + 2 files changed, 61 insertions(+) create mode 100644 contrib/pg_prewarm/t/001_autoprewarm.pl diff --git a/contrib/pg_prewarm/Makefile b/contrib/pg_prewarm/Makefile index b13ac3c813..9cfde8c4e4 100644 --- a/contrib/pg_prewarm/Makefile +++ b/contrib/pg_prewarm/Makefile @@ -10,6 +10,8 @@ EXTENSION = pg_prewarm DATA = pg_prewarm--1.1--1.2.sql pg_prewarm--1.1.sql pg_prewarm--1.0--1.1.sql PGFILEDESC = "pg_prewarm - preload relation data into system buffer cache" +TAP_TESTS = 1 + ifdef USE_PGXS PG_CONFIG = pg_config PGXS := $(shell $(PG_CONFIG) --pgxs) diff --git a/contrib/pg_prewarm/t/001_autoprewarm.pl b/contrib/pg_prewarm/t/001_autoprewarm.pl new file mode 100644 index 00..f55b2a5352 --- /dev/null +++ b/contrib/pg_prewarm/t/001_autoprewarm.pl @@ -0,0 +1,59 @@ +# +# Check that pg_prewarm can dump blocks from shared buffers +# to PGDATA/autoprewarm.blocks. +# + +use strict; +use Test::More; +use TestLib; +use Time::HiRes qw(usleep); +use warnings; + +use PostgresNode; + +plan tests => 3; + +# Wait up to 180s for pg_prewarm to dump blocks. +sub wait_for_dump +{ + my $path = shift; + + foreach my $i (0 ..
1800) + { + last if -e $path; + usleep(100_000); + } +} + +my $node = get_new_node("node"); +$node->init; +$node->append_conf( + 'postgresql.conf', qq( +shared_preload_libraries = 'pg_prewarm' +pg_prewarm.autoprewarm = 'on' +pg_prewarm.autoprewarm_interval = 1 +)); +$node->start; + +my $blocks_path = $node->data_dir . '/autoprewarm.blocks'; + +# Check that we can dump blocks on timeout. +wait_for_dump($blocks_path); +ok(-e $blocks_path, 'file autoprewarm.blocks should be present in the PGDATA'); + +# Check that we can dump blocks on shutdown. +$node->stop; +$node->append_conf( + 'postgresql.conf', qq( +pg_prewarm.autoprewarm_interval = 0 +)); + +# Remove autoprewarm.blocks +unlink($blocks_path) || die "$blocks_path: $!"; +ok(!-e $blocks_path, 'sanity check, dump on timeout is turned off'); + +$node->start; +$node->stop; + +wait_for_dump($blocks_path); +ok(-e $blocks_path, 'file autoprewarm.blocks should be present in the PGDATA after clean shutdown'); -- 2.19.1 From fba212ed765c8c411db1ca19c2ac991662109d99 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:12:00 +0300 Subject: [PATCH v3 2/3] pg_prewarm: fix autoprewarm_interval behaviour Previously it misused seconds from TimestampDifference() as milliseconds, so it was busy looping with ~1 second interval instead of waiting on latch with default autoprewarm_interval = 300s. --- contrib/pg_prewarm/autoprewarm.c | 8 +++- 1 file chang
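For reference, a minimal sketch of the round-up variant of TimestampDifferenceMilliseconds() discussed in the message above; PostgreSQL timestamps are int64 microsecond counts, and the real function may differ in details such as overflow clamping:

```c
/*
 * Sketch: difference between two timestamps in milliseconds, rounding any
 * fractional millisecond up, so that a caller sleeping for the returned
 * amount wakes just after the target time rather than just before it.
 */
long
TimestampDifferenceMilliseconds(TimestampTz start_time, TimestampTz stop_time)
{
	TimestampTz diff = stop_time - start_time;

	if (diff <= 0)
		return 0;

	return (long) ((diff + 999) / 1000);
}
```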
Re: Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm
On 2020-11-09 21:53, Tom Lane wrote:

Alexey Kondratov writes: After fixing this issue I have noticed that it still dumps blocks twice at each timeout (here I set autoprewarm_interval to 15s): ... This happens because at timeout time we were using continue, but actually we still have to wait the entire autoprewarm_interval after successful dumping.

I don't think your 0001 is correct. It would be okay if apw_dump_now() could be counted on to take negligible time, but we shouldn't assume that, should we?

Yes, it seems so, if I understand you correctly. I had a doubt about the possibility of pg_ctl exiting earlier than the dumping process. Now I have added an explicit wait for the dump file to the test.

I agree that the "continue" seems a bit bogus, because it's skipping the ResetLatch call at the bottom of the loop; it's not quite clear to me whether that's a good thing or not. But the general idea of the existing code seems to be to loop around and make a fresh calculation of how-long-to-wait, and that doesn't seem wrong.

I have left the last patch intact, since it resolves the 'double dump' issue, but I agree with your point about the existing logic of the code, although it is a bit broken. So I have to think more about how to fix it in a better way.

0002 seems like a pretty clear bug fix, though I wonder if this is exactly what we want to do going forward. It seems like a very large fraction of the callers of TimestampDifference would like to have the value in msec, which means we're doing a whole lot of expensive and error-prone arithmetic to break down the difference to sec/usec and then put it back together again. Let's get rid of that by inventing, say TimestampDifferenceMilliseconds(...).

Yeah, I got into this problem after a bug in another extension — pg_wait_sampling. I have attached 0002, which implements TimestampDifferenceMilliseconds(), so 0003 just uses this new function to solve the initial issues. If it looks good to you, then we can switch all similar callers to it.

BTW, I see another bug of a related ilk. Look what postgres_fdw/connection.c is doing:

TimestampDifference(now, endtime, &secs, &microsecs);

/* To protect against clock skew, limit sleep to one minute. */
cur_timeout = Min(60000, secs * USECS_PER_SEC + microsecs);

/* Sleep until there's something to do */
wc = WaitLatchOrSocket(MyLatch, WL_LATCH_SET | WL_SOCKET_READABLE | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH, PQsocket(conn), cur_timeout, PG_WAIT_EXTENSION);

WaitLatchOrSocket's timeout is measured in msec not usec. I think the comment about "clock skew" is complete BS, and the Min() calculation was put in as a workaround by somebody observing that the sleep waited too long, but not understanding why.

I wonder how much trouble one can get into with all these unit conversions.

Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company

From c79de17014753b311858b4570ca475f713328c62 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:24:55 +0300 Subject: [PATCH v2 4/4] pg_prewarm: refactor autoprewarm waits Previously it was dumping twice at every timeout time.
--- contrib/pg_prewarm/autoprewarm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index e5bd130bc8..872c7d51b1 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -236,7 +236,9 @@ autoprewarm_main(Datum main_arg) { last_dump_time = GetCurrentTimestamp(); apw_dump_now(true, false); -continue; + +/* We have to sleep even after a successful dump */ +delay_in_ms = autoprewarm_interval * 1000; } /* Sleep until the next dump time. */ -- 2.19.1 From c38c07708d57d6dec5a8a1697ca9c9810ad4d7ce Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:12:00 +0300 Subject: [PATCH v2 3/4] pg_prewarm: fix autoprewarm_interval behaviour. Previously it misused seconds from TimestampDifference() as milliseconds, so it was dumping autoprewarm.blocks ~every second event with default autoprewarm_interval = 300s. --- contrib/pg_prewarm/autoprewarm.c | 8 +++- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index d3dec6e3ec..e5bd130bc8 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -222,16 +222,14 @@ autoprewarm_main(Datum main_arg) { long delay_in_ms = 0; TimestampTz next_dump_time = 0; - long secs = 0; - int usecs = 0; /* Compute the next dump time. */ next_dump_time = TimestampTzPlusMilliseconds(last_dump_time,
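As a side note to the postgres_fdw wait loop quoted earlier in this message, one way to make the units line up is sketched below; this assumes the new millisecond helper is available and keeps the existing one-minute cap only to mirror the current code, so it is not necessarily the fix that was eventually committed:

```c
/*
 * WaitLatchOrSocket() takes its timeout in milliseconds, so compute the
 * remaining time in msec directly instead of mixing usec into it.
 */
long		cur_timeout;
int			wc;

cur_timeout = Min(60000, TimestampDifferenceMilliseconds(now, endtime));

wc = WaitLatchOrSocket(MyLatch,
					   WL_LATCH_SET | WL_SOCKET_READABLE |
					   WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
					   PQsocket(conn),
					   cur_timeout, PG_WAIT_EXTENSION);
```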
Misuse of TimestampDifference() in the autoprewarm feature of pg_prewarm
Hi Hackers, Today I have accidentally noticed that autoprewarm feature of pg_prewarm used TimestampDifference()'s results in a wrong way. First, it used *seconds* result from it as a *milliseconds*. It was causing it to make dump file autoprewarm.blocks ~every second with default setting of autoprewarm_interval = 300s. Here is a log part with debug output in this case: ``` 2020-11-09 19:09:00.162 MSK [85328] LOG: dumping autoprewarm.blocks 2020-11-09 19:09:01.161 MSK [85328] LOG: dumping autoprewarm.blocks 2020-11-09 19:09:02.160 MSK [85328] LOG: dumping autoprewarm.blocks 2020-11-09 19:09:03.159 MSK [85328] LOG: dumping autoprewarm.blocks ``` After fixing this issue I have noticed that it still dumps blocks twice at each timeout (here I set autoprewarm_interval to 15s): ``` 2020-11-09 19:18:59.692 MSK [85662] LOG: dumping autoprewarm.blocks 2020-11-09 19:18:59.700 MSK [85662] LOG: dumping autoprewarm.blocks 2020-11-09 19:19:14.694 MSK [85662] LOG: dumping autoprewarm.blocks 2020-11-09 19:19:14.704 MSK [85662] LOG: dumping autoprewarm.blocks ``` This happens because at timeout time we were using continue, but actually we still have to wait the entire autoprewarm_interval after successful dumping. I have fixed both issues in the attached patches and also added a minimalistic tap test as a first one to verify that this automatic damping still works after refactoring. I put Robert into CC, since he is an author of this feature. What do you think? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom 6d4bab7f21c3661dd4dd5a0de7e097b1de3f642c Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:24:55 +0300 Subject: [PATCH v1 3/3] pg_prewarm: refactor autoprewarm waits Previously it was dumping twice at every timeout time. --- contrib/pg_prewarm/autoprewarm.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index b18a065ed5..f52c83de1e 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -238,7 +238,9 @@ autoprewarm_main(Datum main_arg) { last_dump_time = GetCurrentTimestamp(); apw_dump_now(true, false); -continue; + +/* We have to sleep even after a successfull dump */ +delay_in_ms = autoprewarm_interval * 1000; } /* Sleep until the next dump time. */ -- 2.19.1 From 8793b8beb6a5c1ae730f1fffb09dff64c83bc631 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:12:00 +0300 Subject: [PATCH v1 2/3] pg_prewarm: fix autoprewarm_interval behaviour. Previously it misused seconds from TimestampDifference() as milliseconds, so it was dumping autoprewarm.blocks ~every second event with default autoprewarm_interval = 300s. --- contrib/pg_prewarm/autoprewarm.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/contrib/pg_prewarm/autoprewarm.c b/contrib/pg_prewarm/autoprewarm.c index d3dec6e3ec..b18a065ed5 100644 --- a/contrib/pg_prewarm/autoprewarm.c +++ b/contrib/pg_prewarm/autoprewarm.c @@ -231,7 +231,7 @@ autoprewarm_main(Datum main_arg) autoprewarm_interval * 1000); TimestampDifference(GetCurrentTimestamp(), next_dump_time, , ); - delay_in_ms = secs + (usecs / 1000); + delay_in_ms = secs * 1000 + (usecs / 1000); /* Perform a dump if it's time. 
*/ if (delay_in_ms <= 0) -- 2.19.1 From 31dc30c97861afae9c34852afc5a5b1c91bbeadc Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 9 Nov 2020 19:04:10 +0300 Subject: [PATCH v1 1/3] pg_prewarm: add tap test for autoprewarm feature --- contrib/pg_prewarm/Makefile | 2 + contrib/pg_prewarm/t/001_autoprewarm.pl | 51 + 2 files changed, 53 insertions(+) create mode 100644 contrib/pg_prewarm/t/001_autoprewarm.pl diff --git a/contrib/pg_prewarm/Makefile b/contrib/pg_prewarm/Makefile index b13ac3c813..9cfde8c4e4 100644 --- a/contrib/pg_prewarm/Makefile +++ b/contrib/pg_prewarm/Makefile @@ -10,6 +10,8 @@ EXTENSION = pg_prewarm DATA = pg_prewarm--1.1--1.2.sql pg_prewarm--1.1.sql pg_prewarm--1.0--1.1.sql PGFILEDESC = "pg_prewarm - preload relation data into system buffer cache" +TAP_TESTS = 1 + ifdef USE_PGXS PG_CONFIG = pg_config PGXS := $(shell $(PG_CONFIG) --pgxs) diff --git a/contrib/pg_prewarm/t/001_autoprewarm.pl b/contrib/pg_prewarm/t/001_autoprewarm.pl new file mode 100644 index 00..b564c29931 --- /dev/null +++ b/contrib/pg_prewarm/t/001_autoprewarm.pl @@ -0,0 +1,51 @@ +# +# Check that pg_prewarm can dump blocks from shared buffers +# to PGDATA/autoprewarm.blocks. +# + +use strict; +use Test::More; +use TestLib; +use Time::HiRes qw(usleep); +use warnings; + +use PostgresNode; + +plan tests => 3; + +my $node = get_new_node("node"); +$node->init; +$node->append_conf( +'postgresql.conf', qq( +shared_preload_libraries = 'pg_
Re: Global snapshots
On 2020-09-18 00:54, Bruce Momjian wrote:

On Tue, Sep 8, 2020 at 01:36:16PM +0300, Alexey Kondratov wrote: Thank you for the link! After a quick look at Sawada-san's patch set I think that there are two major differences:

1. There is a built-in foreign xacts resolver in [1], which should be much more convenient from the end-user perspective. It involves huge in-core changes and additional complexity that is of course worth it. However, it's still not clear to me that it is possible to resolve all foreign prepared xacts on Postgres' own side with a 100% guarantee. Imagine a situation when the coordinator node is actually an HA cluster group (primary + sync + async replica) and it failed just after the PREPARE stage or after the local COMMIT. In that case all foreign xacts will be left in the prepared state. After the failover process completes, the synchronous replica will become the new primary. Would it have all the required info to properly resolve orphan prepared xacts? Probably, this situation is handled properly in [1], but I've not yet finished a thorough reading of the patch set, though it has a great doc! On the other hand, previous 0003 and my proposed patch rely on either manual resolution of hung prepared xacts or usage of an external monitor/resolver. This approach is much simpler from the in-core perspective, but doesn't look as complete as [1] though.

Have we considered how someone would clean up foreign transactions if the coordinating server dies? Could it be done manually? Would an external resolver, rather than an internal one, make this easier?

Both Sawada-san's patch [1] and the patches in this thread (e.g. mine [2]) use 2PC with a special gid format that includes the xid + server identification info. Thus, one can select from pg_prepared_xacts, get the xid and coordinator info, then use txid_status() on the coordinator (or ex-coordinator) to get the transaction status and finally either commit or abort these stale prepared xacts. Of course this could be wrapped into some user-level support routines, as is done in [1].

As for the benefits of using an external resolver, I think that there are some of them from the whole system perspective:

1) If one follows the logic above, then this resolver could be stateless: it takes all the required info from the Postgres nodes themselves.

2) Then you can easily put it into a container, which makes it easier to deploy to all this 'cloud' stuff like Kubernetes.

3) Also you can scale resolvers independently of Postgres nodes.

I do not think that either of these points is a game changer, but we use a very simple external resolver together with [2] in our sharding prototype and it works just fine so far.

[1] https://www.postgresql.org/message-id/CA%2Bfd4k4HOVqqC5QR4H984qvD0Ca9g%3D1oLYdrJT_18zP9t%2BUsJg%40mail.gmail.com [2] https://www.postgresql.org/message-id/3ef7877bfed0582019eab3d462a43275%40postgrespro.ru

-- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
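For illustration only, the kind of gid described above could be built roughly as follows; the exact format used by the patches differs, and coordinator_name here is a hypothetical string identifying the coordinator:

```c
/*
 * Hypothetical gid: embeds the local (coordinator) xid plus enough
 * coordinator identification so that an entry found later in
 * pg_prepared_xacts on the foreign server can be checked with
 * txid_status() on the coordinator.
 */
char		gid[256];

snprintf(gid, sizeof(gid), "pgfdw_%s_%u_%d",
		 coordinator_name,				/* hypothetical coordinator id */
		 GetCurrentTransactionId(),		/* local xid */
		 MyProcPid);
```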
Re: Concurrency issue in pg_rewind
On 2020-09-17 15:27, Alexander Kukushkin wrote: On Thu, 17 Sep 2020 at 14:04, Alexey Kondratov wrote: With --restore-target-wal pg_rewind is trying to call restore_command on its own and it can happen at two stages: 1) When pg_rewind is trying to find the last checkpoint preceding a divergence point. In that case file map is not even yet initialized. Thus, all fetched WAL segments at this stage will be present in the file map created later. Nope, it will fetch files you requested, and in addition to that it will leave a child process running in the background which is doing the prefetch (manipulating with pg_wal/.wal-g/...) 2) When it creates a data pages map. It should traverse WAL from the last common checkpoint till the final shutdown point in order to find all modified pages on the target. At this stage pg_rewind only updates info about data segments in the file map. That way, I see a minor problem that WAL segments fetched at this stage would not be deleted, since they are absent in the file map. Anyway, pg_rewind does not delete neither WAL segments, not any other files in the middle of the file map creation, so I cannot imagine, how it can get into the same trouble on its own. When pg_rewind was creating the map, some temporary files where there, because the forked child process of wal-g was still running. When the wal-g child process exits, it removes some of these files. Specifically, it was trying to prefetch 00840A760024 into the pg_wal/.wal-g/prefetch/running/00840A760024, but apparently the file wasn't available on S3 and prefetch failed, therefore the empty file was removed. I do understand how you got into this problem with wal-g. This part of my answer was about bare postgres and pg_rewind. And my point was that from my perspective pg_rewind with --restore-target-wal cannot get into the same trouble on its own, without 'help' of some side tools like wal-g. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Concurrency issue in pg_rewind
On 2020-09-16 15:55, Alexander Kukushkin wrote: Hello, Today I bumped into an issue with pg_rewind which is not 100% clear where should be better fixed. The first call of pg_rewind failed with the following message: servers diverged at WAL location A76/39E55338 on timeline 132 could not open file "/home/postgres/pgdata/pgroot/data/pg_wal/00840A76001E": No such file or directory could not find previous WAL record at A76/1EFFE620 Failure, exiting In order to avoid rebuilding the replica from scratch, we restored the missing file by calling restore_command (wal-g) and repeated the call of pg_rewind. The second time pg_rewind also failed, but the error looked differently: servers diverged at WAL location A76/39E55338 on timeline 132 rewinding from last common checkpoint at A76/1EF254B8 on timeline 132 could not remove file "/home/postgres/pgdata/pgroot/data/pg_wal/.wal-g/prefetch/running/00840A760024": No such file or directory Failure, exiting The second call left PGDATA in an inconsistent state (empty pg_control). A few words about where the pg_wal/.wal-g/prefetch/running/ is coming from: wal-g by default when fetching the WAL file is also trying to do a prefetch of a few next WAL files. For that it forks and the child process doing prefetch while the parent process exits. In order to avoid multiple parallel prefetches of the same file, wal-g keeps its state in the pg_wal/.wal-g directory. It also keeps prefetched files there. Hm, I cannot understand why wal-g (or any other tool) is trying to run pg_rewind, while WAL copying (and prefetch) is still in progress? Why do not just wait until it is finished? It is also not clear for me why it does not put prefetched WAL files directly into the pg_wal? The issue occurred on 10.14, but I believe very similar might happen with postgres 13 when pg_rewind is called with --restore-target-wal option. With --restore-target-wal pg_rewind is trying to call restore_command on its own and it can happen at two stages: 1) When pg_rewind is trying to find the last checkpoint preceding a divergence point. In that case file map is not even yet initialized. Thus, all fetched WAL segments at this stage will be present in the file map created later. 2) When it creates a data pages map. It should traverse WAL from the last common checkpoint till the final shutdown point in order to find all modified pages on the target. At this stage pg_rewind only updates info about data segments in the file map. That way, I see a minor problem that WAL segments fetched at this stage would not be deleted, since they are absent in the file map. Anyway, pg_rewind does not delete neither WAL segments, not any other files in the middle of the file map creation, so I cannot imagine, how it can get into the same trouble on its own. That made me think about how it could be improved in the pg_rewind. The thing is, that we want to have a specific file to be removed, and it is already not there. Should it be a fatal error? traverse_datadir()/recurse_dir() already ignoring all failed lstat() calls with errno == ENOENT. recurse_dir() is also called on the source, which should be running when --source-server is used, so it can remove files. There is actually a TODO there: /* * File doesn't exist anymore. This is ok, if the new master * is running and the file was just removed. If it was a data * file, there should be a WAL record of the removal. If it * was something else, it couldn't have been anyway. * * TODO: But complain if we're processing the target dir! */ Basically I have to options: 1. 
Modify the remove_target_file(const char *path, bool missing_ok) function and remove the missing_ok option; that would be consistent with recurse_dir(). 2. Change the logic of remove_target_file(), so it doesn't exit with a fatal error if the file is missing, but only shows a warning.

In addition to the aforementioned options the remove_target_dir() also should be improved, i.e. it should check errno and behave similarly to remove_target_file() if errno == ENOENT. What do you think?

Although keeping arbitrary files inside PGDATA does not look like a good idea to me, I do not see anything criminal in skipping a non-existing file when executing the file map in pg_rewind.

Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
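A sketch of option 2 above, based on the existing shape of remove_target_file() in pg_rewind's file_ops.c (details are illustrative, not a finished patch):

```c
static void
remove_target_file(const char *path, bool missing_ok)
{
	char		dstpath[MAXPGPATH];

	if (dry_run)
		return;

	snprintf(dstpath, sizeof(dstpath), "%s/%s", datadir_target, path);
	if (unlink(dstpath) != 0)
	{
		/* Option 2: a vanished file is only worth a warning, not a failure */
		if (errno == ENOENT)
		{
			pg_log_warning("file \"%s\" has already been removed", dstpath);
			return;
		}

		pg_fatal("could not remove file \"%s\": %m", dstpath);
	}
}
```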
Re: Global snapshots
On 2020-09-09 20:29, Fujii Masao wrote: On 2020/09/09 2:00, Alexey Kondratov wrote: According to the Sawada-san's v25 0002 the logic is pretty much the same there: +2. Pre-Commit phase (1st phase of two-phase commit) +3. Commit locally +Once we've prepared all of them, commit the transaction locally. +4. Post-Commit Phase (2nd phase of two-phase commit) Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact / FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction() in the CommitTransaction(). Thus, I don't see many difference between these approach and CallXactCallbacks() usage regarding this point. IIUC the commit logic in Sawada-san's patch looks like 1. PreCommit_FdwXact() PREPARE TRANSACTION command is issued 2. RecordTransactionCommit() 2-1. WAL-log the commit record 2-2. Update CLOG 2-3. Wait for sync rep 2-4. FdwXactWaitForResolution() Wait until COMMIT PREPARED commands are issued to the remote servers and completed. 3. ProcArrayEndTransaction() 4. AtEOXact_FdwXact(true) So ISTM that the timing of when COMMIT PREPARED is issued to the remote server is different between the patches. Am I missing something? No, you are right, sorry. At a first glance I thought that AtEOXact_FdwXact is responsible for COMMIT PREPARED as well, but it is only calling FdwXactParticipantEndTransaction in the abort case. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-09-09 15:22, Michael Paquier wrote:

On Tue, Sep 08, 2020 at 07:17:58PM -0500, Justin Pryzby wrote: Initially I added List *params, and Michael suggested to retire ReindexStmt->concurrent. I provided a patch to do so, initially by leaving int options and then, after this, removing it to "complete the thought", and get rid of the remnants of the "old way" of doing it. This is also how vacuum and explain are done. https://www.postgresql.org/message-id/20200902022410.GA20149%40telsasoft.com

Defining a set of DefElem when parsing and then using the int "options" with bitmasks where necessary at the beginning of the execution looks like a good balance to me. This way, you can extend the grammar to use things like (verbose = true), etc.

By the way, skimming through the patch set, I was wondering if we could do the refactoring of patch 0005 as a first step

Yes, I did it with the intention of putting it as a first patch, but wanted to get some feedback. It's easier to refactor the last patch without rebasing the others.

until I noticed this part: +common_option_name: NonReservedWord { $$ = $1; } | analyze_keyword { $$ = "analyze"; } This is not a good idea as you make ANALYZE an option available for all the commands involved in the refactoring. A portion of that could be considered though, like the use of common_option_arg.

From the grammar perspective ANY option is available for any command that uses a parenthesized option list. All the checks and validations are performed in the corresponding command code. This analyze_keyword rule is actually only doing normalization of the ANALYZE word when it is used as an option. Why could it be harmful?

Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
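For readers less familiar with that pattern, the "DefElems at parse time, bitmask at execution time" approach looks roughly like the sketch below; stmt->params is the List *params mentioned above, and the exact option and flag names are illustrative:

```c
/*
 * Sketch: fold the parsed (option [= value], ...) list into an int
 * bitmask once, at the beginning of execution.
 */
ListCell   *lc;
int			options = 0;

foreach(lc, stmt->params)
{
	DefElem    *opt = (DefElem *) lfirst(lc);

	if (strcmp(opt->defname, "verbose") == 0)
	{
		if (defGetBoolean(opt))
			options |= REINDEXOPT_VERBOSE;
	}
	else if (strcmp(opt->defname, "concurrently") == 0)
	{
		if (defGetBoolean(opt))
			options |= REINDEXOPT_CONCURRENTLY;
	}
	else
		ereport(ERROR,
				(errcode(ERRCODE_SYNTAX_ERROR),
				 errmsg("unrecognized REINDEX option \"%s\"", opt->defname)));
}
```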
Re: Global snapshots
On 2020-09-09 08:35, Masahiko Sawada wrote:

On Wed, 9 Sep 2020 at 02:00, Alexey Kondratov wrote: On 2020-09-08 14:48, Fujii Masao wrote:
> > IIUC, yes, the information required for automatic resolution is > WAL-logged and the standby tries to resolve those orphan transactions > from WAL after the failover. But Sawada-san's patch provides > the special function for manual resolution, so there may be some cases > where manual resolution is necessary. >

I've found a note about manual resolution in the v25 0002: +After that we prepare all foreign transactions by calling +PrepareForeignTransaction() API. If we failed on any of them we change to +rollback, therefore at this time some participants might be prepared whereas +some are not prepared. The former foreign transactions need to be resolved +using pg_resolve_foreign_xact() manually and the latter ends transaction +in one-phase by calling RollbackForeignTransaction() API. but it's not yet clear to me.

Sorry, the above description in README is out of date. In the v25 patch, it's true that if a backend fails to prepare a transaction on a foreign server, it’s possible that some foreign transactions are prepared whereas others are not. But at the end of the transaction after changing to rollback, the process does rollback (or rollback prepared) all of them. So the use case of pg_resolve_foreign_xact() is to resolve orphaned foreign prepared transactions or to resolve a foreign transaction that is not resolved for some reason, bugs etc.

OK, thank you for the explanation!

Once the transaction is committed locally any ERROR (or higher level message) will be escalated to PANIC. I think this is true only inside the critical section and it's not necessarily true for all errors happening after the local commit, right?

It's not actually related to critical-section error escalation. Any error in the backend after the local commit and ProcArrayEndTransaction() will try to abort the current transaction and do RecordTransactionAbort(), but it's too late to do so and a PANIC will be raised:

/* * Check that we haven't aborted halfway through RecordTransactionCommit. */ if (TransactionIdDidCommit(xid)) elog(PANIC, "cannot abort transaction %u, it was already committed", xid);

At least that's how I understand it. And I do see possible ERROR level messages in the postgresCommitForeignTransaction() for example: + if (PQresultStatus(res) != PGRES_COMMAND_OK) + ereport(ERROR, (errmsg("could not commit transaction on server %s", + frstate->server->servername))); I don't think that it's very convenient to get a PANIC every time we fail to commit one of the prepared foreign xacts, since it could be not so rare in a distributed system. That's why I tried to get rid of possible ERRORs as far as possible in my proposed patch.

In my patch, the second phase of 2PC is executed only by the resolver process. Therefore, even if an error would happen during committing a foreign prepared transaction, we just need to relaunch the resolver process and try again. During that, the backend process will be just waiting. If a backend process raises an error after the local commit, the client will see a transaction failure despite the local transaction having been committed. An error could happen even by palloc. So the patch uses a background worker to commit prepared foreign transactions, rather than the backend itself.

Yes, if it's a background process, then it seems to be safe.
BTW, it seems that I've chosen a wrong thread for posting my patch and starting a discussion :) Activity from this thread has moved to [1] and your solution with a built-in resolver is discussed in [2]. I'll try to take a closer look at v25 and write to [2] instead.

[1] https://www.postgresql.org/message-id/2020081009525213277261%40highgo.ca [2] https://www.postgresql.org/message-id/CAExHW5uBy9QwjdSO4j82WC4aeW-Q4n2ouoZ1z70o%3D8Vb0skqYQ%40mail.gmail.com

Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 2020-09-09 11:45, Andrey V. Lepikhov wrote: On 9/8/20 8:34 PM, Alexey Kondratov wrote: On 2020-09-08 17:00, Amit Langote wrote: wrote: On 2020-09-08 10:34, Amit Langote wrote: Another ambiguous part of the refactoring was in changing InitResultRelInfo() arguments: @@ -1278,6 +1280,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo, Relation resultRelationDesc, Index resultRelationIndex, Relation partition_root, + bool use_multi_insert, int instrument_options) Why do we need to pass this use_multi_insert flag here? Would it be better to set resultRelInfo->ri_usesMultiInsert in the InitResultRelInfo() unconditionally like it is done for ri_usesFdwDirectModify? And after that it will be up to the caller whether to use multi-insert or not based on their own circumstances. Otherwise now we have a flag to indicate that we want to check for another flag, while this check doesn't look costly. Hmm, I think having two flags seems confusing and bug prone, especially if you consider partitions. For example, if a partition's ri_usesMultiInsert is true, but CopyFrom()'s local flag is false, then execPartition.c: ExecInitPartitionInfo() would wrongly perform BeginForeignCopy() based on only ri_usesMultiInsert, because it wouldn't know CopyFrom()'s local flag. Am I missing something? No, you're right. If someone want to share a state and use ResultRelInfo (RRI) for that purpose, then it's fine, but CopyFrom() may simply override RRI->ri_usesMultiInsert if needed and pass this RRI further. This is how it's done for RRI->ri_usesFdwDirectModify. InitResultRelInfo() initializes it to false and then ExecInitModifyTable() changes the flag if needed. Probably this is just a matter of personal choice, but for me the current implementation with additional argument in InitResultRelInfo() doesn't look completely right. Maybe because a caller now should pass an additional argument (as false) even if it doesn't care about ri_usesMultiInsert at all. It also adds additional complexity and feels like abstractions leaking. I didn't feel what the problem was and prepared a patch version according to Alexey's suggestion (see Alternate.patch). Yes, that's very close to what I've meant. + leaf_part_rri->ri_usesMultiInsert = (leaf_part_rri->ri_usesMultiInsert && + rootResultRelInfo->ri_usesMultiInsert) ? true : false; This could be just: + leaf_part_rri->ri_usesMultiInsert = (leaf_part_rri->ri_usesMultiInsert && + rootResultRelInfo->ri_usesMultiInsert); This does not seem very convenient and will lead to errors in the future. So, I agree with Amit. And InitResultRelInfo() may set ri_usesMultiInsert to false by default, since it's used only by COPY now. Then you won't need this in several places: + resultRelInfo->ri_usesMultiInsert = false; While the logic of turning multi-insert on with all the validations required could be factored out of InitResultRelInfo() to a separate routine. Anyway, I don't insist at all and think it's fine to stick to the original v7's logic. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
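To spell out the alternative being discussed, a sketch of how the two pieces could be split; ExecRelationAllowsMultiInsert is a hypothetical name for the factored-out validation logic:

```c
/* In InitResultRelInfo(): always start pessimistically. */
resultRelInfo->ri_usesMultiInsert = false;

/*
 * Later, in a caller such as CopyFrom() or ExecInitPartitionInfo(),
 * flip the flag only when multi-insert is actually usable there.
 */
if (ExecRelationAllowsMultiInsert(resultRelInfo, rootResultRelInfo))
	resultRelInfo->ri_usesMultiInsert = true;
```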
Re: Global snapshots
On 2020-09-08 14:48, Fujii Masao wrote: On 2020/09/08 19:36, Alexey Kondratov wrote: On 2020-09-08 05:49, Fujii Masao wrote: On 2020/09/05 3:31, Alexey Kondratov wrote: Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these errors handling issues above and tries to add proper comments everywhere. I think, that 0003 should be rebased on the top of it, or it could be a first patch in the set, since it may be used independently. What do you think? Thanks for the patch! Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts about pros and cons between your patch and Sawada-san's? [1] https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com Thank you for the link! After a quick look on the Sawada-san's patch set I think that there are two major differences: Thanks for sharing your thought! As far as I read your patch quickly, I basically agree with your this view. 1. There is a built-in foreign xacts resolver in the [1], which should be much more convenient from the end-user perspective. It involves huge in-core changes and additional complexity that is of course worth of. However, it's still not clear for me that it is possible to resolve all foreign prepared xacts on the Postgres' own side with a 100% guarantee. Imagine a situation when the coordinator node is actually a HA cluster group (primary + sync + async replica) and it failed just after PREPARE stage of after local COMMIT. In that case all foreign xacts will be left in the prepared state. After failover process complete synchronous replica will become a new primary. Would it have all required info to properly resolve orphan prepared xacts? IIUC, yes, the information required for automatic resolution is WAL-logged and the standby tries to resolve those orphan transactions from WAL after the failover. But Sawada-san's patch provides the special function for manual resolution, so there may be some cases where manual resolution is necessary. I've found a note about manual resolution in the v25 0002: +After that we prepare all foreign transactions by calling +PrepareForeignTransaction() API. If we failed on any of them we change to +rollback, therefore at this time some participants might be prepared whereas +some are not prepared. The former foreign transactions need to be resolved +using pg_resolve_foreign_xact() manually and the latter ends transaction +in one-phase by calling RollbackForeignTransaction() API. but it's not yet clear for me. Implementing 2PC feature only inside postgres_fdw seems to cause another issue; COMMIT PREPARED is issued to the remote servers after marking the local transaction as committed (i.e., ProcArrayEndTransaction()). According to the Sawada-san's v25 0002 the logic is pretty much the same there: +2. Pre-Commit phase (1st phase of two-phase commit) +3. Commit locally +Once we've prepared all of them, commit the transaction locally. +4. Post-Commit Phase (2nd phase of two-phase commit) Brief look at the code confirms this scheme. IIUC, AtEOXact_FdwXact / FdwXactParticipantEndTransaction happens after ProcArrayEndTransaction() in the CommitTransaction(). Thus, I don't see many difference between these approach and CallXactCallbacks() usage regarding this point. Is this safe? This issue happens because COMMIT PREPARED is issued via CallXactCallbacks(XACT_EVENT_COMMIT) and that CallXactCallbacks() is called after ProcArrayEndTransaction(). 
Once the transaction is committed locally any ERROR (or higher level message) will be escalated to PANIC. And I do see possible ERROR level messages in the postgresCommitForeignTransaction() for example: + if (PQresultStatus(res) != PGRES_COMMAND_OK) + ereport(ERROR, (errmsg("could not commit transaction on server %s", + frstate->server->servername))); I don't think that it's very convenient to get a PANIC every time we failed to commit one of the prepared foreign xacts, since it could be not so rare in the distributed system. That's why I tried to get rid of possible ERRORs as far as possible in my proposed patch. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 2020-09-08 17:00, Amit Langote wrote: Hi Alexey, On Tue, Sep 8, 2020 at 6:29 PM Alexey Kondratov wrote: On 2020-09-08 10:34, Amit Langote wrote: > Any thoughts on the taking out the refactoring changes out of the main > patch as I suggested? > +1 for splitting the patch. It was rather difficult for me to distinguish changes required by COPY via postgres_fdw from this refactoring. Another ambiguous part of the refactoring was in changing InitResultRelInfo() arguments: @@ -1278,6 +1280,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo, Relation resultRelationDesc, Index resultRelationIndex, Relation partition_root, + bool use_multi_insert, int instrument_options) Why do we need to pass this use_multi_insert flag here? Would it be better to set resultRelInfo->ri_usesMultiInsert in the InitResultRelInfo() unconditionally like it is done for ri_usesFdwDirectModify? And after that it will be up to the caller whether to use multi-insert or not based on their own circumstances. Otherwise now we have a flag to indicate that we want to check for another flag, while this check doesn't look costly. Hmm, I think having two flags seems confusing and bug prone, especially if you consider partitions. For example, if a partition's ri_usesMultiInsert is true, but CopyFrom()'s local flag is false, then execPartition.c: ExecInitPartitionInfo() would wrongly perform BeginForeignCopy() based on only ri_usesMultiInsert, because it wouldn't know CopyFrom()'s local flag. Am I missing something? No, you're right. If someone want to share a state and use ResultRelInfo (RRI) for that purpose, then it's fine, but CopyFrom() may simply override RRI->ri_usesMultiInsert if needed and pass this RRI further. This is how it's done for RRI->ri_usesFdwDirectModify. InitResultRelInfo() initializes it to false and then ExecInitModifyTable() changes the flag if needed. Probably this is just a matter of personal choice, but for me the current implementation with additional argument in InitResultRelInfo() doesn't look completely right. Maybe because a caller now should pass an additional argument (as false) even if it doesn't care about ri_usesMultiInsert at all. It also adds additional complexity and feels like abstractions leaking. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Global snapshots
On 2020-09-08 05:49, Fujii Masao wrote:

On 2020/09/05 3:31, Alexey Kondratov wrote: Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these error handling issues above and tries to add proper comments everywhere. I think that 0003 should be rebased on top of it, or it could be a first patch in the set, since it may be used independently. What do you think?

Thanks for the patch! Sawada-san was proposing another 2PC patch at [1]. Do you have any thoughts about pros and cons between your patch and Sawada-san's? [1] https://www.postgresql.org/message-id/ca+fd4k4z6_b1etevqamwqhu4rx7xsrn5orl7ohj4b5b6sw-...@mail.gmail.com

Thank you for the link! After a quick look at Sawada-san's patch set I think that there are two major differences:

1. There is a built-in foreign xacts resolver in [1], which should be much more convenient from the end-user perspective. It involves huge in-core changes and additional complexity that is of course worth it. However, it's still not clear to me that it is possible to resolve all foreign prepared xacts on Postgres' own side with a 100% guarantee. Imagine a situation when the coordinator node is actually an HA cluster group (primary + sync + async replica) and it failed just after the PREPARE stage or after the local COMMIT. In that case all foreign xacts will be left in the prepared state. After the failover process completes, the synchronous replica will become the new primary. Would it have all the required info to properly resolve orphan prepared xacts? Probably, this situation is handled properly in [1], but I've not yet finished a thorough reading of the patch set, though it has a great doc!

On the other hand, previous 0003 and my proposed patch rely on either manual resolution of hung prepared xacts or usage of an external monitor/resolver. This approach is much simpler from the in-core perspective, but doesn't look as complete as [1] though.

2. In the patch from this thread all the 2PC logic sits in postgres_fdw, while [1] tries to put it into the generic fdw core, which also feels like a more general and architecturally correct way. However, how many of the currently available dozens of various FDWs are capable of performing 2PC? And how many of them are maintained well enough to adopt this new API? This is not an argument against [1] actually, since postgres_fdw is known to be the most advanced FDW and an early adopter of new features, just a little doubt about the usefulness of this preliminary generalisation.

Anyway, I think that [1] is great work and I really hope to find more time to investigate it deeper later this year.

Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [POC] Fast COPY FROM command for the table with foreign partitions
Hi, I've started doing a review of v7 yesterday. On 2020-09-08 10:34, Amit Langote wrote: On Mon, Sep 7, 2020 at 7:31 PM Andrey V. Lepikhov wrote: > v.7 (in attachment) fixes this problem. I also accepted Amit's suggestion to rename all fdwapi routines such as ForeignCopyIn to *ForeignCopy. It seems that naming is quite inconsistent now: + /* COPY a bulk of tuples into a foreign relation */ + BeginForeignCopyIn_function BeginForeignCopy; + EndForeignCopyIn_function EndForeignCopy; + ExecForeignCopyIn_function ExecForeignCopy; You get rid of this 'In' in the function names, but the types are still with it: +typedef void (*BeginForeignCopyIn_function) (ModifyTableState *mtstate, + ResultRelInfo *rinfo); + +typedef void (*EndForeignCopyIn_function) (EState *estate, + ResultRelInfo *rinfo); + +typedef void (*ExecForeignCopyIn_function) (ResultRelInfo *rinfo, + TupleTableSlot **slots, + int nslots); Also docs refer to old function names: +void +BeginForeignCopyIn(ModifyTableState *mtstate, + ResultRelInfo *rinfo); I think that it'd be better to choose either of these two naming schemes and use it everywhere for consistency. Any thoughts on the taking out the refactoring changes out of the main patch as I suggested? +1 for splitting the patch. It was rather difficult for me to distinguish changes required by COPY via postgres_fdw from this refactoring. Another ambiguous part of the refactoring was in changing InitResultRelInfo() arguments: @@ -1278,6 +1280,7 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo, Relation resultRelationDesc, Index resultRelationIndex, Relation partition_root, + bool use_multi_insert, int instrument_options) Why do we need to pass this use_multi_insert flag here? Would it be better to set resultRelInfo->ri_usesMultiInsert in the InitResultRelInfo() unconditionally like it is done for ri_usesFdwDirectModify? And after that it will be up to the caller whether to use multi-insert or not based on their own circumstances. Otherwise now we have a flag to indicate that we want to check for another flag, while this check doesn't look costly. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Global snapshots
Hi, On 2020-07-27 09:44, Andrey V. Lepikhov wrote: On 7/27/20 11:22 AM, tsunakawa.ta...@fujitsu.com wrote: US8356007B2 - Distributed transaction management for database systems with multiversioning - Google Patents https://patents.google.com/patent/US8356007 If it is, can we circumvent this patent? Thank you for the research (and previous links too). I haven't seen this patent before. This should be carefully studied. I had a look on the patch set, although it is quite outdated, especially on 0003. Two thoughts about 0003: First, IIUC atomicity of the distributed transaction in the postgres_fdw is achieved by the usage of 2PC. I think that this postgres_fdw 2PC support should be separated from global snapshots. It could be useful to have such atomic distributed transactions even without a proper visibility, which is guaranteed by the global snapshot. Especially taking into account the doubts about Clock-SI and general questions about algorithm choosing criteria above in the thread. Thus, I propose to split 0003 into two parts and add a separate GUC 'postgres_fdw.use_twophase', which could be turned on independently from 'postgres_fdw.use_global_snapshots'. Of course if the latter is enabled, then 2PC should be forcedly turned on as well. Second, there are some problems with errors handling in the 0003 (thanks to Arseny Sher for review). +error: + if (!res) + { + sql = psprintf("ABORT PREPARED '%s'", fdwTransState->gid); + BroadcastCmd(sql); + elog(ERROR, "Failed to PREPARE transaction on remote node"); + } It seems that we should never reach this point, just because BroadcastStmt will throw an ERROR if it fails to prepare transaction on the foreign server: + if (PQresultStatus(result) != expectedStatus || + (handler && !handler(result, arg))) + { +elog(WARNING, "Failed command %s: status=%d, expected status=%d", sql, PQresultStatus(result), expectedStatus); + pgfdw_report_error(ERROR, result, entry->conn, true, sql); + allOk = false; + } Moreover, It doesn't make much sense to try to abort prepared xacts, since if we failed to prepare it somewhere, then some foreign servers may become unavailable already and this doesn't provide us a 100% guarantee of clean up. + /* COMMIT open transaction of we were doing 2PC */ + if (fdwTransState->two_phase_commit && + (event == XACT_EVENT_PARALLEL_COMMIT || event == XACT_EVENT_COMMIT)) + { + BroadcastCmd(psprintf("COMMIT PREPARED '%s'", fdwTransState->gid)); + } At this point, the host (local) transaction is already committed and there is no way to abort it gracefully. However, BroadcastCmd may rise an ERROR that will cause a PANIC, since it is non-recoverable state: PANIC: cannot abort transaction 487, it was already committed Attached is a patch, which implements a plain 2PC in the postgres_fdw and adds a GUC 'postgres_fdw.use_twophase'. Also it solves these errors handling issues above and tries to add proper comments everywhere. I think, that 0003 should be rebased on the top of it, or it could be a first patch in the set, since it may be used independently. What do you think? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom debdffade7abcdbf29031bda6c8359a89776ad36 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Fri, 7 Aug 2020 16:50:57 +0300 Subject: [PATCH] Add postgres_fdw.use_twophase GUC to use 2PC for transactions involving several servers. 
--- contrib/postgres_fdw/connection.c | 234 +--- contrib/postgres_fdw/postgres_fdw.c | 17 ++ contrib/postgres_fdw/postgres_fdw.h | 2 + 3 files changed, 228 insertions(+), 25 deletions(-) diff --git a/contrib/postgres_fdw/connection.c b/contrib/postgres_fdw/connection.c index 08daf26fdf0..d18fdd1f94e 100644 --- a/contrib/postgres_fdw/connection.c +++ b/contrib/postgres_fdw/connection.c @@ -66,6 +66,20 @@ typedef struct ConnCacheEntry */ static HTAB *ConnectionHash = NULL; +/* + * FdwTransactionState + * + * Holds number of open remote transactions and shared state + * needed for all connection entries. + */ +typedef struct FdwTransactionState +{ + char *gid; + int nparticipants; + bool two_phase_commit; +} FdwTransactionState; +static FdwTransactionState *fdwTransState; + /* for assigning cursor numbers and prepared statement numbers */ static unsigned int cursor_number = 0; static unsigned int prep_stmt_number = 0; @@ -73,6 +87,9 @@ static unsigned int prep_stmt_number = 0; /* tracks whether any work is need
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-09-01 13:12, Justin Pryzby wrote: This patch seems to be missing a call to RelationAssumeNewRelfilenode() in reindex_index(). That's maybe related to the cause of the crashes I pointed out earlier this year. Alexey's v4 patch changed RelationSetNewRelfilenode() to accept a tablespace parameter, but Michael seemed to object to that. However that seems cleaner and ~30 lines shorter. Michael, would you comment on that? The v4 patch and your comments are here. https://www.postgresql.org/message-id/attachment/105574/v4-0001-Allow-REINDEX-and-REINDEX-CONCURRENTLY-to-change-tablespace.patch https://www.postgresql.org/message-id/20191127035416.GG5435%40paquier.xyz

Actually, the last time we discussed this point I only got the gut feeling that this is a subtle place and it is very easy to break things with these changes. However, it isn't clear to me how exactly. That way, I'd be glad if Michael could reword his explanation, so it'd be more clear to me as well.

BTW, I've started doing a review of the last patch set yesterday and will try to post some comments later.

Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [POC] Fast COPY FROM command for the table with foreign partitions
On 2020-07-28 03:33, Andrey Lepikhov wrote: On 27.07.2020 21:34, Alexey Kondratov wrote: Hi Andrey, On 2020-07-23 09:23, Andrey V. Lepikhov wrote: On 7/16/20 2:14 PM, Amit Langote wrote: Amit Langote EnterpriseDB: http://www.enterprisedb.com Version 5 of the patch. With changes caused by Amit's comments.

Just got a segfault with your v5 patch by deleting from a foreign table. It seems that ShutdownForeignScan inside node->fdwroutine doesn't have a correct pointer to the required function. I haven't had a chance to look closer at the code, but you can easily reproduce this error with the attached script (patched Postgres binaries should be available in the PATH). It works well with master and fails with your patch applied.

I used master a3ab7a707d and the v5 version of the patch with your script. No errors found. Can you check your test case?

Yes, my bad. I forgot to re-install the postgres_fdw extension and only did it for the postgres core, sorry for the disturbance.

Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [POC] Fast COPY FROM command for the table with foreign partitions
Hi Andrey, On 2020-07-23 09:23, Andrey V. Lepikhov wrote: On 7/16/20 2:14 PM, Amit Langote wrote: Amit Langote EnterpriseDB: http://www.enterprisedb.com Version 5 of the patch. With changes caused by Amit's comments. Just got a segfault with your v5 patch by deleting from a foreign table. Here is a part of backtrace: * frame #0: 0x0001029069ec postgres`ExecShutdownForeignScan(node=0x7ff28c8909b0) at nodeForeignscan.c:385:3 frame #1: 0x0001028e7b06 postgres`ExecShutdownNode(node=0x7ff28c8909b0) at execProcnode.c:779:4 frame #2: 0x00010299b3fa postgres`planstate_walk_members(planstates=0x7ff28c8906d8, nplans=1, walker=(postgres`ExecShutdownNode at execProcnode.c:752), context=0x) at nodeFuncs.c:3998:7 frame #3: 0x00010299b010 postgres`planstate_tree_walker(planstate=0x7ff28c8904c0, walker=(postgres`ExecShutdownNode at execProcnode.c:752), context=0x) at nodeFuncs.c:3914:8 frame #4: 0x0001028e7ab7 postgres`ExecShutdownNode(node=0x7ff28c8904c0) at execProcnode.c:771:2 (lldb) f 0 frame #0: 0x0001029069ec postgres`ExecShutdownForeignScan(node=0x7ff28c8909b0) at nodeForeignscan.c:385:3 382 FdwRoutine *fdwroutine = node->fdwroutine; 383 384 if (fdwroutine->ShutdownForeignScan) -> 385 fdwroutine->ShutdownForeignScan(node); 386 } (lldb) p node->fdwroutine->ShutdownForeignScan (ShutdownForeignScan_function) $1 = 0x7f7f7f7f7f7f7f7f It seems that ShutdownForeignScan inside node->fdwroutine doesn't have a correct pointer to the required function. I haven't had a chance to look closer on the code, but you can easily reproduce this error with the attached script (patched Postgres binaries should be available in the PATH). It works well with master and fails with your patch applied. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company#!/usr/bin/env sh pg_ctl -D node1 stop > /dev/null pg_ctl -D node2 stop > /dev/null rm -rf node1 node2 rm node1.log node2.log initdb -D node1 initdb -D node2 echo "port = 5433" >> node2/postgresql.conf pg_ctl -D node1 -l node1.log start pg_ctl -D node2 -l node2.log start createdb createdb -p5433 psql -p5433 -c "CREATE TABLE test (id INT) PARTITION BY HASH (id)" psql -c "CREATE EXTENSION IF NOT EXISTS postgres_fdw" psql -c "CREATE SERVER node2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (port '5433'); CREATE USER MAPPING FOR current_user SERVER node2" psql -c "CREATE FOREIGN TABLE test(id INT) SERVER node2" psql -c "DELETE FROM test"
Re: Partitioning and postgres_fdw optimisations for multi-tenancy
On 2020-07-16 19:35, Etsuro Fujita wrote: On Thu, Jul 16, 2020 at 8:56 PM Andrey Lepikhov wrote: On 7/16/20 9:55 AM, Etsuro Fujita wrote: On Tue, Jul 14, 2020 at 12:48 AM Alexey Kondratov wrote: Some real-life test queries show, that all single-node queries aren't pushed-down to the required node. For example: SELECT * FROM documents INNER JOIN users ON documents.user_id = users.id WHERE documents.company_id = 5 AND users.company_id = 5;

PWJ cannot be applied to the join due to the limitation of the PWJ matching logic. See the discussion started in [1]. I think the patch in [2] would address this issue as well, though the patch is under review.

I think discussion [1] is of little relevance to the current task. Here we do not join on the partition attribute, so PWJ can't be used at all.

The main point of the discussion is to determine whether PWJ can be used for a join between partitioned tables, based on EquivalenceClasses, not just join clauses created by build_joinrel_restrictlist(). For the above join, for example, the patch in [2] would derive a join clause "documents.company_id = users.company_id" from an EquivalenceClass that recorded the knowledge "documents.company_id = 5" and "users.company_id = 5", and then the planner would consider from it that PWJ can be used for the join.

Yes, it really worked well. Thank you for the explanation, it wasn't so obvious to me either. That way, I think that the patch from [1] covers many cases of joins targeting a single partition / foreign server.

However, there is an issue with aggregates as well. For a query like: SELECT count(*) FROM documents WHERE company_id = 5; it would be great to teach the planner to understand that it's a partition-wise aggregate as well, even without GROUP BY company_id, which doesn't always help either. I'll try to look closer at this problem, but if you have any thoughts about it, then I'd be glad to know.

Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Partitioning and postgres_fdw optimisations for multi-tenancy
On 2020-07-16 14:56, Andrey Lepikhov wrote: On 7/16/20 9:55 AM, Etsuro Fujita wrote: On Wed, Jul 15, 2020 at 9:02 PM Etsuro Fujita wrote: On Wed, Jul 15, 2020 at 12:12 AM Alexey Kondratov wrote: On 2020-07-14 15:27, Ashutosh Bapat wrote: On Tue, Jul 14, 2020 at 12:48 AM Alexey Kondratov wrote: Some real-life test queries show, that all single-node queries aren't pushed-down to the required node. For example: SELECT * FROM documents INNER JOIN users ON documents.user_id = users.id WHERE documents.company_id = 5 AND users.company_id = 5; There are a couple of things happening here 1. the clauses on company_id in WHERE clause are causing partition pruning. Partition-wise join is disabled with partition pruning before PG13. More precisely, PWJ cannot be applied when there are no matched partitions on the nullable side due to partition pruning before PG13. On reflection, I think I was wrong: the limitation applies to PG13, even with advanced PWJ. But the join is an inner join, so I think PWJ can still be applied for the join. I think I was wrong in this point as well :-(. PWJ cannot be applied to the join due to the limitation of the PWJ matching logic. See the discussion started in [1]. I think the patch in [2] would address this issue as well, though the patch is under review. Thanks for sharing the links, Fujita-san. I think, discussion [1] is little relevant to the current task. Here we join not on partition attribute and PWJ can't be used at all. Here we can use push-down join of two foreign relations. We can analyze baserestrictinfo's of outer and inner RelOptInfo's and may detect that only one partition from outer and inner need to be joined. Next, we will create joinrel from RelOptInfo's of these partitions and replace joinrel of partitioned tables. But it is only rough outline of a possible solution... I was a bit skeptical after eyeballing the thread [1], but still tried v3 patch with the current master and my test setup. Surprisingly, it just worked, though it isn't clear for me how. With this patch aforementioned simple join is completely pushed down to the foreign server. And speedup is approximately the same (~3 times) as when required partitions are explicitly used in the query. As a side-effected it also affected join + aggregate queries like: SELECT user_id, count(*) AS documents_count FROM documents INNER JOIN users ON documents.user_id = users.id WHERE documents.company_id = 5 AND users.company_id = 5 GROUP BY user_id; With patch it is executed as: GroupAggregate Group Key: documents.user_id -> Sort Sort Key: documents.user_id -> Foreign Scan Relations: (documents_node2 documents) INNER JOIN (users_node2 users) Without patch its plan was: GroupAggregate Group Key: documents.user_id -> Sort Sort Key: documents.user_id -> Hash Join Hash Cond: (documents.user_id = users.id) -> Foreign Scan on documents_node2 documents -> Hash -> Foreign Scan on users_node2 users I cannot say that it is most efficient plan in that case, since the entire query could be pushed down to the foreign server, but still it gives a 5-10% speedup on my setup. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Partitioning and postgres_fdw optimisations for multi-tenancy
On 2020-07-14 15:27, Ashutosh Bapat wrote: On Tue, Jul 14, 2020 at 12:48 AM Alexey Kondratov wrote: I built a simple two node multi-tenant schema for tests, which can be easily set up with attached scripts. It creates three tables (companies, users, documents) distributed over two nodes. Everything can be found in this Gist [2] as well. Some real-life test queries show, that all single-node queries aren't pushed-down to the required node. For example: SELECT * FROM documents INNER JOIN users ON documents.user_id = users.id WHERE documents.company_id = 5 AND users.company_id = 5; There are a couple of things happening here 1. the clauses on company_id in WHERE clause are causing partition pruning. Partition-wise join is disabled with partition pruning before PG13. In PG13 we have added advanced partition matching algorithm which will allow partition-wise join with partition pruning. I forgot to mention that I use a recent master (991c444e7a) for tests with enable_partitionwise_join = 'on' enable_partitionwise_aggregate = 'on' of course. I've also tried postgres_fdw.use_remote_estimate = true followed by ANALYSE on both nodes (it is still used in setup.sh script). BTW, can you, please, share a link to commit / thread about allowing partition-wise join and partition pruning to work together in PG13? 2. the query has no equality condition on the partition key of the tables being joined. Partitionwise join is possible only when there's an equality condition on the partition keys (company_id) of the joining tables. PostgreSQL's optimizer is not smart enough to convert the equality conditions in WHERE clause into equality conditions on partition keys. So having those conditions just in WHERE clause does not help. Instead please add equality conditions on partition keys in JOIN .. ON clause or WHERE clause (only for INNER join). With adding documents.company_id = users.company_id SELECT * FROM documents INNER JOIN users ON (documents.company_id = users.company_id AND documents.user_id = users.id) WHERE documents.company_id = 5 AND users.company_id = 5; query plan remains the same. executed as following QUERY PLAN --- Nested Loop Join Filter: (documents.user_id = users.id) -> Foreign Scan on users_node2 users -> Materialize -> Foreign Scan on documents_node2 documents i.e. it uses two foreign scans and does the final join locally. However, once I specify target partitions explicitly, then the entire query is pushed down to the foreign node: QUERY PLAN - Foreign Scan Relations: (documents_node2) INNER JOIN (users_node2) Execution time is dropped significantly as well — by more than 3 times even for this small test database. Situation for simple queries with aggregates or joins and aggregates followed by the sharding key filter is the same. Something similar was briefly discussed in this thread [3]. IIUC, it means that push-down of queries through the postgres_fdw works perfectly well, the problem is with partition-wise operation detection at the planning time. Currently, partition-wise aggregate routines, e.g., looks for a GROUP BY and checks whether sharding key exists there or not. After that PARTITIONWISE_AGGREGATE_* flag is set. However, it doesn't look for a content of WHERE clause, so frankly speaking it isn't a problem, this functionality is not yet implemented. 
Actually, sometimes I was able to push down queries with aggregate simply by adding an additional GROUP BY with sharding key, like this: SELECT count(*) FROM documents WHERE company_id = 5 GROUP BY company_id; This gets pushed down since GROUP BY clause is on the partition key.

Sure, but it only works *sometimes*. I've never seen most of such simple queries with aggregates being pushed down, e.g.: SELECT sum(id) FROM documents_node2 WHERE company_id = 5 GROUP BY company_id; no matter whether 'GROUP BY company_id' is used or not.

Although it seems that it will be easier to start with aggregates, probably we should initially plan a more general solution? For example, check that all involved tables are filtered by partitioning key and push down the entire query if all of them target the same foreign server. Any thoughts?

I think adding just equality conditions on the partition key will be enough. No need for any code change.

So, it hasn't helped. Maybe I could modify some costs to verify that push-down of such joins is ever possible?

Anyway, what about aggregates? Partition-wise aggregates work fine for queries like SELECT count(*) FROM documents GROUP BY company_id; but once I narrow it to a single partition with 'WHERE company_id = 5', then it is being executed in a very inefficient way — takes
Partitioning and postgres_fdw optimisations for multi-tenancy
Hi Hackers, The idea of achieving Postgres scaling via sharding using postgres_fdw + partitioning got a lot of attention last years. Many optimisations have been done in this direction: partition pruning, partition-wise aggregates / joins, postgres_fdw push-down of LIMIT, GROUP BY, etc. In many cases they work really nice. However, still there is a vast case, where postgres_fdw + native partitioning doesn't perform so good — Multi-tenant architecture. From the database perspective it is presented well in this Citus tutorial [1]. The main idea is that there is a number of tables and all of them are sharded / partitioned by the same key, e.g. company_id. That way, if every company mostly works within its own data, then every query may be effectively executed on a single node without a need for an internode communication. I built a simple two node multi-tenant schema for tests, which can be easily set up with attached scripts. It creates three tables (companies, users, documents) distributed over two nodes. Everything can be found in this Gist [2] as well. Some real-life test queries show, that all single-node queries aren't pushed-down to the required node. For example: SELECT * FROM documents INNER JOIN users ON documents.user_id = users.id WHERE documents.company_id = 5 AND users.company_id = 5; executed as following QUERY PLAN --- Nested Loop Join Filter: (documents.user_id = users.id) -> Foreign Scan on users_node2 users -> Materialize -> Foreign Scan on documents_node2 documents i.e. it uses two foreign scans and does the final join locally. However, once I specify target partitions explicitly, then the entire query is pushed down to the foreign node: QUERY PLAN - Foreign Scan Relations: (documents_node2) INNER JOIN (users_node2) Execution time is dropped significantly as well — by more than 3 times even for this small test database. Situation for simple queries with aggregates or joins and aggregates followed by the sharding key filter is the same. Something similar was briefly discussed in this thread [3]. IIUC, it means that push-down of queries through the postgres_fdw works perfectly well, the problem is with partition-wise operation detection at the planning time. Currently, partition-wise aggregate routines, e.g., looks for a GROUP BY and checks whether sharding key exists there or not. After that PARTITIONWISE_AGGREGATE_* flag is set. However, it doesn't look for a content of WHERE clause, so frankly speaking it isn't a problem, this functionality is not yet implemented. Actually, sometimes I was able to push down queries with aggregate simply by adding an additional GROUP BY with sharding key, like this: SELECT count(*) FROM documents WHERE company_id = 5 GROUP BY company_id; where this GROUP BY obviously doesn't change a results, it just allows planner to choose from more possible paths. Also, I have tried to hack it a bit and forcedly set PARTITIONWISE_AGGREGATE_FULL for this particular query. Everything executed fine and returned result was correct, which means that all underlying machinery is ready. That way, I propose a change to the planner, which will check whether partitioning key exist in the WHERE clause and will set PARTITIONWISE_AGGREGATE_* flags if appropriate. The whole logic may look like: 1. If the only one condition by partitioning key is used (like above), then it is PARTITIONWISE_AGGREGATE_FULL. 2. If several conditions are used, then it should be PARTITIONWISE_AGGREGATE_PARTIAL. 
I'm aware that WHERE clause may be extremely complex in general, but we could narrow this possible optimisation to the same restrictions as postgres_fdw push-down "only WHERE clauses using built-in operators and functions will be considered for execution on the remote server". Although it seems that it will be easier to start with aggregates, probably we should initially plan a more general solution? For example, check that all involved tables are filtered by partitioning key and push down the entire query if all of them target the same foreign server. Any thoughts? [1] https://docs.citusdata.com/en/v9.3/get_started/tutorial_multi_tenant.html [2] https://gist.github.com/ololobus/8fba33241f68be2e3765d27bf04882a3 [3] https://www.postgresql.org/message-id/flat/CAFT%2BaqL1Tt0qfYqjHH%2BshwPoW8qdFjpJ8vBR5ABoXJDUcHyN1w%40mail.gmail.com Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyDROP TABLE IF EXISTS companies CASCADE; DROP TABLE IF EXISTS users CASCADE; DROP TABLE IF EXISTS documents CASCADE; DROP SERVER IF EXISTS node2 CASCADE; CREATE EXTENSION IF NOT EXISTS postgres_fdw; CREATE SERVER node2 FOREIGN DATA WRAPPER postgres_fdw OPTIONS (p
Re: [PATCH] Allow to specify restart_lsn in pg_create_physical_replication_slot()
On 2020-06-23 04:18, Michael Paquier wrote: On Mon, Jun 22, 2020 at 08:18:58PM +0300, Alexey Kondratov wrote: Things get worse when we allow specifying an older LSN, since it has a higher chance to be at the horizon of deletion by the checkpointer. Anyway, if I get it correctly, with the current patch the slot will be created successfully, but will be obsolete and should be invalidated by the next checkpoint. Is that a behavior acceptable for the end user?

For example, a physical slot that is created to immediately reserve WAL may get invalidated, causing it to actually not keep WAL around contrary to what the user has wanted the command to do.

I can imagine that it could be acceptable in the initially proposed scenario for someone, since creation of a slot with a historical restart_lsn is already unpredictable — the required segment may or may not exist. However, adding an undefined behaviour even after the slot creation does not look good to me anyway.

I have looked closely at the checkpointer code and another problem is that it decides once which WAL segments to delete based on the replicationSlotMinLSN, and does not check anything before the actual file deletion. That way the gap for a possible race is even wider. I do not know how to completely get rid of this race without introducing some locking mechanism, which may be costly.

Thanks for the feedback -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [PATCH] Allow to specify restart_lsn in pg_create_physical_replication_slot()
On 2020-06-19 21:57, Fujii Masao wrote: On 2020/06/19 23:20, Alexey Kondratov wrote: On 2020-06-19 03:59, Michael Paquier wrote: On Thu, Jun 18, 2020 at 03:39:09PM +0300, Vyacheslav Makarov wrote: If the WAL segment for the specified restart_lsn (STOP_LSN of the backup) exists, then the function will create a physical replication slot and will keep all the WAL segments required by the replica to catch up with the primary. Otherwise, it returns error, which means that the required WAL segments have been already utilised, so we do need to take a new backup. Without passing this newly added parameter pg_create_physical_replication_slot() works as before. What do you think about this? Currently pg_create_physical_replication_slot() and CREATE_REPLICATION_SLOT replication command seem to be "idential". So if we add new option into one, we should add it also into another? I wonder how it could be used via the replication protocol, but probably this option should be added there as well for consistency. What happen if future LSN is specified in restart_lsn? With the patch, in this case, if the segment at that LSN exists (e.g., because it's recycled one), the slot seems to be successfully created. However if the LSN is far future and the segment doesn't exist, the creation of slot seems to fail. This behavior looks fragile and confusing. We should accept future LSN whether its segment currently exists or not? But what about a possible timeline switch? If we allow specifying it as further in the future as one wanted, then appropriate segment with specified LSN may be created in the different timeline if it would be switched, so it may be misleading. I am not even sure about allowing future LSN for existing segments, since PITR / timeline switch may occur just after the slot creation, so the pointer may never be valid. Would it be better to completely disallow future LSN? And here I noticed another moment in the patch. TimeLineID of the last restart/checkpoint is used to detect whether WAL segment file exists or not. It means that if we try to create a slot just after a timeline switch, then we could not specify the oldest LSN actually available on the disk, since it may be from the previous timeline. One can use LSN only within the current timeline. It seems to be fine, but should be covered in the docs. + if (!RecoveryInProgress() && !SlotIsLogical(MyReplicationSlot)) With the patch, the given restart_lsn seems to be ignored during recovery. Why? I have the same question, not sure that this is needed here. It looks more like a forgotten copy-paste from ReplicationSlotReserveWal(). I think that this was discussed in the past (perhaps one of the threads related to WAL advancing actually?), I have searched through the archives a bit and found one thread related to slots advancing [1]. It was dedicated to a problem of advancing slots which do not reserve WAL yet, if I get it correctly. Although it is somehow related to the topic, it was a slightly different issue, IMO. and this stuff is full of holes when it comes to think about error handling with checkpoints running in parallel, potentially doing recycling of segments you would expect to be around based on your input value for restart_lsn *while* pg_create_physical_replication_slot() is still running and manipulating the on-disk slot information. ... These are the right concerns, but all of them should be applicable to the pg_create_physical_replication_slot() + immediately_reserve == true in the same way, doesn't it? 
I think so, since in that case we are doing a pretty similar thing — trying to reserve some WAL segment that may be concurrently deleted. And this is exactly the reason why ReplicationSlotReserveWal() does it in several steps in a loop:

1. Creates a slot with some restart_lsn.
2. Does ReplicationSlotsComputeRequiredLSN() to prevent removal of the WAL segment with this restart_lsn.
3. Checks that the required WAL segment is still there.
4. Repeats if this attempt to prevent WAL removal has failed.

What happens if concurrent checkpoint decides to remove the segment at restart_lsn before #2 and then actually removes it after #3? The replication slot is successfully created with the given restart_lsn, but the reserved segment has already been removed?

I thought about it a bit more and it seems that yes, there is a race even for the current pg_create_physical_replication_slot() + immediately_reserve == true, i.e. ReplicationSlotReserveWal(). However, the chance is very slim since we take a current GetRedoRecPtr() there. Probably one could reproduce it with wal_keep_segments = 1 by holding / releasing the backend doing the slot creation and the checkpointer with gdb, but I am not sure that it is an issue anywhere in the real world. Maybe I am wrong, but it is not clear to me why the current ReplicationSlotReserveWal() routine does not have tha
Re: [PATCH] Allow to specify restart_lsn in pg_create_physical_replication_slot()
On 2020-06-19 03:59, Michael Paquier wrote: On Thu, Jun 18, 2020 at 03:39:09PM +0300, Vyacheslav Makarov wrote: If the WAL segment for the specified restart_lsn (STOP_LSN of the backup) exists, then the function will create a physical replication slot and will keep all the WAL segments required by the replica to catch up with the primary. Otherwise, it returns error, which means that the required WAL segments have been already utilised, so we do need to take a new backup. Without passing this newly added parameter pg_create_physical_replication_slot() works as before. What do you think about this? I think that this was discussed in the past (perhaps one of the threads related to WAL advancing actually?), I have searched through the archives a bit and found one thread related to slots advancing [1]. It was dedicated to a problem of advancing slots which do not reserve WAL yet, if I get it correctly. Although it is somehow related to the topic, it was a slightly different issue, IMO. and this stuff is full of holes when it comes to think about error handling with checkpoints running in parallel, potentially doing recycling of segments you would expect to be around based on your input value for restart_lsn *while* pg_create_physical_replication_slot() is still running and manipulating the on-disk slot information. I suspect that this also breaks a couple of assumptions behind concurrent calls of the minimum LSN calculated across slots when a caller sees fit to recompute the thresholds (WAL senders mainly here, depending on the replication activity). These are the right concerns, but all of them should be applicable to the pg_create_physical_replication_slot() + immediately_reserve == true in the same way, doesn't it? I think so, since in that case we are doing a pretty similar thing — trying to reserve some WAL segment that may be concurrently deleted. And this is exactly the reason why ReplicationSlotReserveWal() does it in several steps in a loop: 1. Creates a slot with some restart_lsn. 2. Does ReplicationSlotsComputeRequiredLSN() to prevent removal of the WAL segment with this restart_lsn. 3. Checks that required WAL segment is still there. 4. Repeat if this attempt to prevent WAL removal has failed. I guess that the only difference in the case of proposed scenario is that we do not have a chance for step 4, since we do need some specific restart_lsn, not any recent restart_lsn, i.e. in this case we have to: 1. Create a slot with restart_lsn specified by user. 2. Do ReplicationSlotsComputeRequiredLSN() to prevent WAL removal. 3. Check that required WAL segment is still there and report ERROR to the user if it is not. I have eyeballed the attached patch and it looks like doing exactly the same, so issues with concurrent deletion are not obvious for me. Or, there are should be the same issues for pg_create_physical_replication_slot() + immediately_reserve == true with current master implementation. [1] https://www.postgresql.org/message-id/flat/20180626071305.GH31353%40paquier.xyz Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
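From the SQL level the proposal boils down to something like the following (assuming the new restart_lsn ends up as an extra optional argument of pg_create_physical_replication_slot(); the LSN would be the STOP_LSN reported by an existing base backup):

-- try to reserve WAL starting exactly at the backup's STOP_LSN; the call
-- errors out if the segment holding that LSN has already been removed,
-- which means a fresh base backup is needed anyway
SELECT * FROM pg_create_physical_replication_slot('backup_slot', true, false, '0/3000028');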
Re: Physical replication slot advance is not persistent
On 2020-06-16 10:27, Michael Paquier wrote: On Wed, Jun 10, 2020 at 08:57:17PM +0300, Alexey Kondratov wrote: New test reproduces this issue well. Left it running for a couple of hours in repeat and it seems to be stable. Thanks for testing. I have been thinking about the minimum xmin and LSN computations on advancing, and actually I have switched the recomputing to be called at the end of pg_replication_slot_advance(). This may be a waste if no advancing is done, but it could also be an advantage to enforce a recalculation of the thresholds for each function call. And that's more consistent with the slot copy, drop and creation. Sorry for a bit late response, but I see a couple of issues with this modified version of the patch in addition to the waste call if no advancing is done, mentioned by you: 1. Both ReplicationSlotsComputeRequiredXmin() and ReplicationSlotsComputeRequiredLSN() may have already been done in the LogicalConfirmReceivedLocation() if it was a logical slot. It may be fine and almost costless to do it twice, but it looks untidy for me. 2. It seems that we do not need ReplicationSlotsComputeRequiredXmin() at all if it was a physical slot, since we do not modify xmin in the pg_physical_replication_slot_advance(), doesn't it? That's why I wanted (somewhere around v5 of the patch in this thread) to move all dirtying and recomputing calls to the places, where xmin / lsn slot modifications are actually done — pg_physical_replication_slot_advance() and LogicalConfirmReceivedLocation(). LogicalConfirmReceivedLocation() already does this, so we only needed to teach pg_physical_replication_slot_advance() to do the same. However, just noted that LogicalConfirmReceivedLocation() only does ReplicationSlotsComputeRequiredLSN() if updated_xmin flag was set, which looks wrong from my perspective, since updated_xmin and updated_restart flags are set separately. That way, I would solve this all as per attached, which works well for me, but definitely worth of a better testing. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Companydiff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c index 61902be3b0..2ceb078418 100644 --- a/src/backend/replication/logical/logical.c +++ b/src/backend/replication/logical/logical.c @@ -1081,8 +1081,14 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn) SpinLockRelease(>mutex); ReplicationSlotsComputeRequiredXmin(false); - ReplicationSlotsComputeRequiredLSN(); } + + /* + * Now recompute the minimum required LSN across all slots if + * restart_lsn of the slot has been updated. + */ + if (updated_restart) + ReplicationSlotsComputeRequiredLSN(); } else { diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 06e4955de7..d9c19c07e5 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -419,6 +419,12 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) * crash, but this makes the data consistent after a clean shutdown. */ ReplicationSlotMarkDirty(); + + /* + * Recompute the minimum required LSN across all slots to adjust with + * the advancing done. + */ + ReplicationSlotsComputeRequiredLSN(); } return retlsn; @@ -621,13 +627,6 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) values[0] = NameGetDatum(>data.name); nulls[0] = false; - /* - * Recompute the minimum LSN and xmin across all slots to adjust with the - * advancing potentially done. 
- */ - ReplicationSlotsComputeRequiredXmin(false); - ReplicationSlotsComputeRequiredLSN(); - ReplicationSlotRelease(); /* Return the reached position. */
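A rough SQL-level illustration of what the recomputation is for (this is essentially what the TAP test in this thread automates): after a physical slot has been advanced, the next checkpoint should be able to recycle the segments older than the new restart_lsn, which only happens once the required LSN has been recomputed across slots.

SELECT pg_create_physical_replication_slot('physical_slot', true, false);
-- generate some WAL here
SELECT pg_replication_slot_advance('physical_slot', pg_current_wal_lsn());
SELECT pg_switch_wal();
CHECKPOINT;
-- restart_lsn now reflects the advance and older segments in pg_wal
-- are allowed to be recycled
SELECT slot_name, restart_lsn FROM pg_replication_slots;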
Re: Physical replication slot advance is not persistent
On 2020-06-10 11:38, Kyotaro Horiguchi wrote: At Wed, 10 Jun 2020 15:53:53 +0900, Michael Paquier wrote in: I find it really depressing how much obviously untested stuff gets added in this area. Prior to this patch pg_replication_slot_advance was not being tested at all. Unfortunately, added tests appeared to be not enough to cover all cases. It seems that the whole machinery of WAL holding and trimming is worth to be tested more thoroughly.

I think that it would be interesting if we had a SQL representation of the contents of XLogCtlData (wanted that a couple of times). Now we are actually limited to use a checkpoint and check that past segments are getting recycled by looking at the contents of pg_wal. Doing that here does not cause the existing tests to be much more expensive as we only need one extra call to pg_switch_wal(), mostly. Please see the attached.

The test in the patch looks fine to me and worked well for me. Using a smaller wal_segment_size (1MB worked for me) reduces the cost of the check, but I'm not sure it's worth doing.

The new test reproduces this issue well. I left it running on repeat for a couple of hours and it seems to be stable.

Just noted that we do not need to keep $phys_restart_lsn_pre: my $phys_restart_lsn_pre = $node_master->safe_psql('postgres', "SELECT restart_lsn from pg_replication_slots WHERE slot_name = '$phys_slot';" ); chomp($phys_restart_lsn_pre); we can safely use the $current_lsn passed to pg_replication_slot_advance(), since restart_lsn is set to it as is there. It may make the test a bit simpler as well.

-- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Physical replication slot advance is not persistent
On 2020-06-09 20:19, Andres Freund wrote: Hi, On 2020-01-28 17:01:14 +0900, Michael Paquier wrote: So attached is an updated patch which addresses the problem just by marking a physical slot as dirty if any advancing is done. Some documentation is added about the fact that an advance is persistent only at the follow-up checkpoint. And the tests are fixed to not use a fake LSN but instead advance to the latest LSN position produced. - /* Update the on disk state when lsn was updated. */ - if (XLogRecPtrIsInvalid(endlsn)) - { - ReplicationSlotMarkDirty(); - ReplicationSlotsComputeRequiredXmin(false); - ReplicationSlotsComputeRequiredLSN(); - ReplicationSlotSave(); - } - I am quite confused by the wholesale removal of these lines. That wasn't in previous versions of the patch. As far as I can tell not calling ReplicationSlotsComputeRequiredLSN() for the physical slot leads to the global minimum LSN never beeing advanced, and thus WAL reserved by the slot not being removable. Only if there's some independent call to ReplicationSlotsComputeRequiredLSN() XLogSetReplicationSlotMinimumLSN() will be called, allowing for slots to advance. I realize this stuff has been broken since the introduction in 9c7d06d6068 (due to the above being if (XLogRecPtrIsInvalid()) rather than if (!XLogRecPtrIsInvalid()) , but this seems to make it even worse? Yes, there was a ReplicationSlotsComputeRequiredLSN() call inside pg_physical_replication_slot_advance() in the v5 of the patch: @@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) MyReplicationSlot->data.restart_lsn = moveto; SpinLockRelease(>mutex); retlsn = moveto; + + ReplicationSlotMarkDirty(); + + /* We moved retart_lsn, update the global value. */ + ReplicationSlotsComputeRequiredLSN(); But later it has been missed and we have not noticed that. I think that adding it back as per attached will be enough. I find it really depressing how much obviously untested stuff gets added in this area. Prior to this patch pg_replication_slot_advance was not being tested at all. Unfortunately, added tests appeared to be not enough to cover all cases. It seems that the whole machinery of WAL holding and trimming is worth to be tested more thoroughly. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Companydiff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 1b929a603e..8e543e276f 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -419,6 +419,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) * crash, but this makes the data consistent after a clean shutdown. */ ReplicationSlotMarkDirty(); + + /* + * We advanced retart_lsn, update the global minimum required value. + */ + ReplicationSlotsComputeRequiredLSN(); } return retlsn;
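For completeness, the symptom that started this thread can be seen from psql alone (a sketch, not the exact TAP test):

SELECT pg_create_physical_replication_slot('physical_slot', true, false);
SELECT pg_replication_slot_advance('physical_slot', pg_current_wal_lsn());
SELECT restart_lsn FROM pg_replication_slots WHERE slot_name = 'physical_slot';
-- do a clean shutdown and start the server again, then compare:
SELECT restart_lsn FROM pg_replication_slots WHERE slot_name = 'physical_slot';
-- before the fix the advanced slot was not marked dirty, so the new
-- restart_lsn was silently lost across the restart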
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-06-08 09:14, Michael Paquier wrote: On Sun, Jun 07, 2020 at 10:05:03PM +0300, Alexander Korotkov wrote: On Sat, Jun 6, 2020 at 8:49 PM Peter Eisentraut wrote: Why is fe_archive.c in src/common/ and not in src/fe_utils/? It is not "common" to frontend and backend. Yep, this seems wrong to me. I can propose following file rename. src/common/fe_archive.c => src/fe_utils/archive.c include/common/fe_archive.h => include/fe_utils/archive.h Is it technically a problem though? fe_archive.c is listed as a frontend-only file with OBJS_FRONTEND in src/common/Makefile. Anyway, for this one I would not mind to do the move so please see the attached, but I am not completely sure either why src/fe_utils/ had better be chosen than src/common/. Perhaps we have more things that are frontend-only in src/common/ that could be moved to src/fe_utils/? I am looking at restricted_token.c, fe_memutils.c, logging.c and file_utils.c here, but that would mean breaking a couple of declarations, something never free for plugin developers. I noticed this before in the thread [1], but it has been left unnoticed I guess: "I just went through the both patches and realized that I cannot get into semantics of splitting frontend code between common and fe_utils. This applies only to 0002, where we introduce fe_archive.c. Should it be placed into fe_utils alongside of the recent recovery_gen.c also used by pg_rewind? This is a frontend-only code intended to be used by frontend applications, so fe_utils feels like the right place, doesn't it? Just tried to do so and everything went fine, so it seems that there is no obstacles from the build system. BTW, most of 'common' is a really common code with only four exceptions like logging.c, which is frontend-only. Is it there for historical reasons only or something else?" Personally, I would prefer that everything in the 'common' was actually common. I also do not sure about moving an older code, because of possible backward compatibility breakage, but doing so for a newer one seems to be a good idea. It actually defines functions that clash with functions in the backend, so this seems doubly wrong. Let's have frontend version of RestoreArchivedFile() renamed as well. What about RestoreArchivedFileFrontend()? -1 from me for the renaming, which was intentional to ease grepping with the backend counterpart. We have other cases like that, say palloc(), fsync_fname(), etc. Do not like it either. Having the same naming and a guarantee that this code is always compiled separately looks more convenient for me. [1] https://www.postgresql.org/message-id/784fa7dc-414b-9dc9-daae-138033db298c%40postgrespro.ru Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [HACKERS] make async slave to wait for lsn to be replayed
On 2020-04-10 05:25, Fujii Masao wrote: On 2020/04/10 3:16, Alexey Kondratov wrote: Just another idea in case if one will still decide to go with a separate statement + BEGIN integration instead of a function. We could use parenthesized options list here. This is already implemented for VACUUM, REINDEX, etc. There was an idea to allow CONCURRENTLY in REINDEX there [1] and recently this was proposed again for new options [2], since it is much more extensible from the grammar perspective. That way, the whole feature may look like: WAIT (LSN '16/B374D848', TIMEOUT 100); and/or BEGIN WAIT (LSN '16/B374D848', WHATEVER_OPTION_YOU_WANT); ... COMMIT; It requires only one reserved keyword 'WAIT'. The advantage of this approach is that it can be extended to support xid, timestamp, csn or anything else, that may be invented in the future, without affecting the grammar. What do you think? Personally, I find this syntax to be more convenient and human-readable compared with function call: SELECT pg_wait_for_lsn('16/B374D848'); BEGIN; I can imagine that some users want to specify the LSN to wait for, from the result of another query, for example, SELECT pg_wait_for_lsn(lsn) FROM xxx. If this is valid use case, isn't the function better? I think that the main purpose of the feature is to achieve read-your-writes-consistency, while using async replica for reads. In that case lsn of last modification is stored inside application, so there is no need to do any query for that. Moreover, you cannot store this lsn inside database, since reads are distributed across all replicas (+ primary). Thus, I could imagine that 'xxx' in your example states for some kind of stored procedure, that fetches lsn from the off-postgres storage, but it looks like very narrow case to count on it, doesn't it? Anyway, I am not against implementing this as a function. That was just another option to consider. Just realized that the last patch I have seen does not allow usage of wait on primary. It may be a problem if reads are pooled not only across replicas, but on primary as well, which should be quite usual I guess. In that case application does not know either request will be processed on replica, or on primary. I think it should be allowed without any warning, or just saying some LOG/DEBUG at most, that there was no waiting performed. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
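To spell out the read-your-writes scenario I have in mind (the WAIT syntax below is only the proposal from this thread, and the orders table is hypothetical):

-- on the primary, right after the application's write:
INSERT INTO orders (id, status) VALUES (42, 'paid');
SELECT pg_current_wal_insert_lsn();  -- the application remembers this LSN

-- later, on whichever node happens to serve the read (async replica or primary):
BEGIN WAIT FOR LSN '0/306EE20';
SELECT status FROM orders WHERE id = 42;  -- guaranteed to see the write above
COMMIT;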
Re: [HACKERS] make async slave to wait for lsn to be replayed
On 2020-04-09 16:33, Tom Lane wrote: Fujii Masao writes: On 2020/04/09 16:11, Kyotaro Horiguchi wrote: At Wed, 08 Apr 2020 16:35:46 -0400, Tom Lane wrote in Why is this getting grafted onto BEGIN/START TRANSACTION in the first place? The rationale for not being a fmgr function is stated in the following comments. [...] This issue happens because the function is executed after BEGIN? If yes, what about executing the function (i.e., as separate transaction) before BEGIN? If so, the snapshot taken in the function doesn't affect the subsequent transaction whatever its isolation level is. I wonder whether making it a procedure, rather than a plain function, would help any. Just another idea in case if one will still decide to go with a separate statement + BEGIN integration instead of a function. We could use parenthesized options list here. This is already implemented for VACUUM, REINDEX, etc. There was an idea to allow CONCURRENTLY in REINDEX there [1] and recently this was proposed again for new options [2], since it is much more extensible from the grammar perspective. That way, the whole feature may look like: WAIT (LSN '16/B374D848', TIMEOUT 100); and/or BEGIN WAIT (LSN '16/B374D848', WHATEVER_OPTION_YOU_WANT); ... COMMIT; It requires only one reserved keyword 'WAIT'. The advantage of this approach is that it can be extended to support xid, timestamp, csn or anything else, that may be invented in the future, without affecting the grammar. What do you think? Personally, I find this syntax to be more convenient and human-readable compared with function call: SELECT pg_wait_for_lsn('16/B374D848'); BEGIN; [1] https://www.postgresql.org/message-id/aad2ec49-5142-7356-ffb2-a9b2649cdd1f%402ndquadrant.com [2] https://www.postgresql.org/message-id/20200401060334.GB142683%40paquier.xyz Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-04-06 21:44, Justin Pryzby wrote: On Mon, Apr 06, 2020 at 08:43:46PM +0300, Alexey Kondratov wrote: +/* XXX: reusing reindex_option_list */ + | CLUSTER opt_verbose '(' reindex_option_list ')' qualified_name cluster_index_specification Could we actually simply reuse vac_analyze_option_list? From the first sight it does just the right thing, excepting the special handling of spelling ANALYZE/ANALYSE, but it does not seem to be a problem. Hm, do you mean to let cluster.c reject the other options like "analyze" ? I'm not sure why that would be better than reusing reindex? I think the suggestion will probably be to just copy+paste the reindex option list and rename it to cluster (possibly with the explanation that they're separate and independant and so their behavior shouldn't be tied together). I mean to literally use vac_analyze_option_list for reindex and cluster as well. Please, check attached 0007. Now, vacuum, reindex and cluster filter options list and reject everything that is not supported, so it seems completely fine to just reuse vac_analyze_option_list, doesn't it? ReindexRelationConcurrently is used for all cases, but it hits different code paths in the case of database, table and index. I have not checked yet, but are you sure it is safe removing these validations in the case of REINDEX CONCURRENTLY? You're right about the pg_global case, fixed. System catalogs can't be reindexed CONCURRENTLY, so they're already caught by that check. > XXX: for cluster/vacuum, it might be more friendly to check before > clustering > the table, rather than after clustering and re-indexing. Yes, I think it would be much more user-friendly. I realized it's not needed or useful to check indexes in advance of clustering, since 1) a mapped index will be on a mapped relation, which is already checked; 2) a system index will be on a system relation. Right ? Yes, it seems that you are right. I have tried to create user index on system relation with allow_system_table_mods=1, but this new index appeared to become system as well. That way, we do not have to check indexes in advance. 
-- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres CompanyFrom afa132674d2b49328a5e95ecd2ff30d838f7ec6c Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 7 Apr 2020 13:42:42 +0300 Subject: [PATCH v18 7/7] Reuse vac_analyze_option_list for cluster and reindex --- src/backend/parser/gram.y | 38 +++--- 1 file changed, 3 insertions(+), 35 deletions(-) diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y index 3665ee8700..7821f07c40 100644 --- a/src/backend/parser/gram.y +++ b/src/backend/parser/gram.y @@ -512,10 +512,6 @@ static Node *makeRecursiveViewSelect(char *relname, List *aliases, Node *query); %type explain_option_list %type reindex_target_type reindex_target_multitable -%type reindex_option_name -%type reindex_option_arg -%type reindex_option_list -%type reindex_option_elem %type copy_generic_opt_arg copy_generic_opt_arg_list_item %type copy_generic_opt_elem @@ -8428,7 +8424,7 @@ ReindexStmt: n->params = NIL; $$ = (Node *)n; } - | REINDEX '(' reindex_option_list ')' reindex_target_type opt_concurrently qualified_name + | REINDEX '(' vac_analyze_option_list ')' reindex_target_type opt_concurrently qualified_name { ReindexStmt *n = makeNode(ReindexStmt); n->kind = $5; @@ -8438,7 +8434,7 @@ ReindexStmt: n->params = $3; $$ = (Node *)n; } - | REINDEX '(' reindex_option_list ')' reindex_target_multitable opt_concurrently name + | REINDEX '(' vac_analyze_option_list ')' reindex_target_multitable opt_concurrently name { ReindexStmt *n = makeNode(ReindexStmt); n->kind = $5; @@ -8458,33 +8454,6 @@ reindex_target_multitable: | SYSTEM_P{ $$ = REINDEX_OBJECT_SYSTEM; } | DATABASE{ $$ = REINDEX_OBJECT_DATABASE; } ; -reindex_option_list: - reindex_option_elem -{ - $$ = list_make1($1); -} - | reindex_option_list ',' reindex_option_elem -{ - $$ = lappend($1, $3); -} - ; - -reindex_option_elem: - reindex_option_name reindex_option_arg -{ - $$ = makeDefElem($1, $2, @1); -} - ; - -reindex_option_name: - NonReservedWord { $$ = $1; } - ; - -reindex_option_arg: - opt_boolean_or_string { $$ = (Node *) makeString($1); } - | NumericOnly { $$ = (Node *) $1; } - | /* EMPTY */ { $$ = NULL; } - ; /* * @@ -10643,8 +10612,7 @@ ClusterStmt: n->params = NIL; $$ = (Node*)n; } -/* XXX: reusing reindex_option_list */ - | CLUSTER opt_verbose '(' reindex_option_list ')' qualified_name cluster_index_specification + | CLUSTER opt_verbose '(' vac_analyze_option_list ')' qualifie
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-04-03 21:27, Justin Pryzby wrote: On Wed, Apr 01, 2020 at 08:08:36AM -0500, Justin Pryzby wrote: Or maybe you'd want me to squish my changes into yours and resend after any review comments ? I didn't hear any feedback, so I've now squished all "parenthesized" and "fix" commits. Thanks for the input, but I am afraid that the patch set became a bit messy now. I have eyeballed it and found some inconsistencies. const char *name; /* name of database to reindex */ - int options;/* Reindex options flags */ + List*rawoptions;/* Raw options */ + int options;/* Parsed options */ boolconcurrent; /* reindex concurrently? */ You introduced rawoptions in the 0002, but then removed it in 0003. So is it required or not? Probably this is a rebase artefact. +/* XXX: reusing reindex_option_list */ + | CLUSTER opt_verbose '(' reindex_option_list ')' qualified_name cluster_index_specification Could we actually simply reuse vac_analyze_option_list? From the first sight it does just the right thing, excepting the special handling of spelling ANALYZE/ANALYSE, but it does not seem to be a problem. 0004 reduces duplicative error handling, as a separate commit so Alexey can review it and/or integrate it. @@ -2974,27 +2947,6 @@ ReindexRelationConcurrently(Oid relationOid, Oid tablespaceOid, int options) /* Open relation to get its indexes */ heapRelation = table_open(relationOid, ShareUpdateExclusiveLock); - /* - * We don't support moving system relations into different tablespaces, -* unless allow_system_table_mods=1. -*/ - if (OidIsValid(tablespaceOid) && - !allowSystemTableMods && IsSystemRelation(heapRelation)) - ereport(ERROR, - (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), - errmsg("permission denied: \"%s\" is a system catalog", - RelationGetRelationName(heapRelation; ReindexRelationConcurrently is used for all cases, but it hits different code paths in the case of database, table and index. I have not checked yet, but are you sure it is safe removing these validations in the case of REINDEX CONCURRENTLY? The last two commits save a few dozen lines of code, but not sure they're desirable. Sincerely, I do not think that passing raw strings down to the guts is a good idea. Yes, it saves us a few checks here and there now, but it may reduce a further reusability of these internal routines in the future. XXX: for cluster/vacuum, it might be more friendly to check before clustering the table, rather than after clustering and re-indexing. Yes, I think it would be much more user-friendly. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
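To be explicit about what reusing vac_analyze_option_list would mean at the SQL level, all three commands would then share the same parenthesized option style, e.g.:

VACUUM (FULL, VERBOSE) documents;
REINDEX (VERBOSE) TABLE documents;
CLUSTER (VERBOSE) documents USING documents_pkey;

Each command is still expected to filter the list and reject options it does not support, so sharing the grammar rule does not tie their behaviour together.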
Re: [HACKERS] make async slave to wait for lsn to be replayed
On 2020-04-01 02:26, Anna Akenteva wrote: On 2020-03-27 04:15, Kartyshov Ivan wrote: Anna, feel free to work on this patch. Ivan and I worked on this patch a bit more. We fixed the bugs that we could find and cleaned up the code. For now, we've kept both options: WAIT as a standalone statement and WAIT as a part of BEGIN. The new patch is attached. The syntax looks a bit different now: - WAIT FOR [ANY | ALL] event [, ...] - BEGIN [ WORK | TRANSACTION ] [ transaction_mode [, ...] ] [ WAIT FOR [ANY | ALL] event [, ...]] where event is one of: LSN value TIMEOUT number_of_milliseconds timestamp Now, one event cannot contain both an LSN and a TIMEOUT. In my understanding the whole idea of having TIMEOUT was to do something like 'Do wait for this LSN to be replicated, but no longer than TIMEOUT milliseconds'. What is the point of having plain TIMEOUT? It seems to be equivalent to pg_sleep, doesn't it? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
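In other words, if I read the proposed grammar correctly, the useful combination and the questionable standalone form would look like this (proposed syntax only, nothing committed):

WAIT FOR ANY LSN '16/B374D848', TIMEOUT 100;  -- wait for the LSN, but give up after 100 ms
WAIT FOR TIMEOUT 100;                         -- on its own this is just a sleep, effectively SELECT pg_sleep(0.1);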
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-04-01 05:19, Michael Paquier wrote: On Tue, Mar 31, 2020 at 03:48:21PM +0900, Michael Paquier wrote: Thanks, committed 0001 after fixing the order of the headers. One patch left. And committed now 0002, meaning that we are officially done. Thanks Alexey for your patience. Thanks for doing that! And to all others who participated. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-03-30 21:34, Justin Pryzby wrote: On Mon, Mar 30, 2020 at 09:02:22PM +0300, Alexey Kondratov wrote: Hmm, I went through the well known to me SQL commands in Postgres and a bit more. Parenthesized options list is mostly used in two common cases: There's also ANALYZE(VERBOSE), REINDEX(VERBOSE). There was debate a year ago [0] as to whether to make "reindex CONCURRENTLY" a separate command, or to use parenthesized syntax "REINDEX (CONCURRENTLY)". I would propose to support that now (and implemented that locally). I am fine with allowing REINDEX (CONCURRENTLY), but then we will have to support both syntaxes as we already do for VACUUM. Anyway, if we agree to add parenthesized options to REINDEX/CLUSTER, then it should be done as a separated patch before the current patch set. ..and explain(...) - In the beginning for boolean options only, e.g. VACUUM You're right that those are currently boolean, but note that explain(FORMAT ..) is not boolean. Yep, I forgot EXPLAIN, this is a good example. .. and create table (LIKE ..) LIKE is used in the table definition, so it is a slightly different case. Putting it into the WITH (...) options list looks like an option to me. However, doing it only for VACUUM will ruin the consistency, while doing it for CLUSTER and REINDEX is not necessary, so I do not like it either. It's not necessary but I think it's a more flexible way to add new functionality (requiring no changes to the grammar for vacuum, and for REINDEX/CLUSTER it would allow future options to avoid changing the grammar). If we use parenthesized syntax for vacuum, my proposal is to do it for REINDEX, and consider adding parenthesized syntax for cluster, too. To summarize, currently I see only 2 + 1 extra options: 1) Keep everything with syntax as it is in 0001-0002 2) Implement tail syntax for VACUUM, but with limitation for VACUUM FULL of the entire database + TABLESPACE change 3) Change TABLESPACE to a fully reserved word + 4) Use parenthesized syntax for all three. Note, I mentioned that maybe VACUUM/CLUSTER should support not only "TABLESPACE foo" but also "INDEX TABLESPACE bar" (I would use that, too). I think that would be easy to implement, and for sure it would suggest using () for both. (For sure we don't want to implement "VACUUM t TABLESPACE foo" now, and then later implement "INDEX TABLESPACE bar" and realize that for consistency we cannot parenthesize it. Michael ? Alvaro ? Robert ? Yes, I would be glad to hear other opinions too, before doing this preliminary refactoring. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
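For reference, the "INDEX TABLESPACE" idea mentioned above would presumably extend the tail syntax to something like the following (purely a sketch of one possible spelling, it is not part of the current patch set):

VACUUM (FULL) documents TABLESPACE new_tblspc INDEX TABLESPACE new_idx_tblspc;
CLUSTER documents USING documents_pkey TABLESPACE new_tblspc INDEX TABLESPACE new_idx_tblspc;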
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-03-28 03:11, Justin Pryzby wrote: On Thu, Mar 26, 2020 at 11:01:06PM -0500, Justin Pryzby wrote: > Another issue is this: > > +VACUUM ( FULL [, ...] ) [ TABLESPACE new_tablespace ] [ table_and_columns [, ...] ] > As you mentioned in your v1 patch, in the other cases, "tablespace > [tablespace]" is added at the end of the command rather than in the middle. I > wasn't able to make that work, maybe because "tablespace" isn't a fully > reserved word (?). I didn't try with "SET TABLESPACE", although I understand > it'd be better without "SET". SET does not change anything in my experience. The problem is that opt_vacuum_relation_list is... optional and TABLESPACE is not a fully reserved word (why?) as you correctly noted. I have managed to put TABLESPACE to the end, but with vacuum_relation_list, like: | VACUUM opt_full opt_freeze opt_verbose opt_analyze vacuum_relation_list TABLESPACE name | VACUUM '(' vac_analyze_option_list ')' vacuum_relation_list TABLESPACE name It means that one would not be able to do VACUUM FULL of the entire database + TABLESPACE change. I do not think that it is a common scenario, but this limitation would be very annoying, wouldn't it? I think we should use the parenthesized syntax for vacuum - it seems clear in hindsight. Possibly REINDEX should use that, too, instead of adding OptTablespace at the end. I'm not sure. The attached mostly implements generic parenthesized options to REINDEX(...), so I'm soliciting opinions: should TABLESPACE be implemented in parenthesized syntax or non? CLUSTER doesn't support parenthesized syntax, but .. maybe it should? And this ? Hmm, I went through the well known to me SQL commands in Postgres and a bit more. Parenthesized options list is mostly used in two common cases: - In the beginning for boolean options only, e.g. VACUUM - In the end for options of a various type, but accompanied by WITH, e.g. COPY, CREATE SUBSCRIPTION Moreover, TABLESPACE is already used in CREATE TABLE/INDEX in the same way I did in 0001-0002. That way, putting TABLESPACE option into the parenthesized options list does not look to be convenient and semantically correct, so I do not like it. Maybe others will have a different opinion. Putting it into the WITH (...) options list looks like an option to me. However, doing it only for VACUUM will ruin the consistency, while doing it for CLUSTER and REINDEX is not necessary, so I do not like it either. To summarize, currently I see only 2 + 1 extra options: 1) Keep everything with syntax as it is in 0001-0002 2) Implement tail syntax for VACUUM, but with limitation for VACUUM FULL of the entire database + TABLESPACE change 3) Change TABLESPACE to a fully reserved word Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
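To illustrate the limitation with that grammar (made-up names, nothing here is final):

-- parses fine: a relation list is present and TABLESPACE goes at the end
VACUUM FULL documents, users TABLESPACE new_tblspc;

-- not expressible this way: whole-database VACUUM FULL plus a tablespace
-- change, because the relation list cannot be empty in this production
VACUUM FULL TABLESPACE new_tblspc;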
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-03-30 04:56, Michael Paquier wrote: On Fri, Mar 27, 2020 at 07:47:29AM +0900, Michael Paquier wrote: On Thu, Mar 26, 2020 at 06:56:36PM -0300, Alvaro Herrera wrote: I don't think any such cleanup should hamper the patch at hand anyway. I don't think either, so I would do the split for the archive routines at least. It still feels strange to me to have a different routine name for the frontend-side of RestoreArchivedFile(). That's not really consistent with the current practice we have palloc() & co, as well as the sync routines. Okay. Hearing nothing, I have rebased the patch set this morning, improving some comments and the code format while on it. I would like to commit both 0001 (creation of xlogarchiver.h) and 0002 attached in the next couple of days. If you have an objection, please feel free. 0001: What do think about adding following sanity check into xlogarchive.c? +#ifdef FRONTEND +#error "This file is not expected to be compiled for frontend code" +#endif It would prevent someone from adding typedefs and any other common definitions into xlogarchive.h in the future, leading to the accidental inclusion of both xlogarchive.h and fe_archive.h in the same time. 0002: + */ +int +RestoreArchivedFile(const char *path, const char *xlogfname, + off_t expectedSize, const char *restoreCommand) +{ There is a couple of extra tabs IMO. +extern int RestoreArchivedFile(const char *path, + const char *xlogfname, + off_t expectedSize, + const char *restoreCommand); And the same here. + * This uses a logic based on "postgres -C" to get the value from + * from the cluster. Repetitive 'from'. extractPageMap(const char *datadir, XLogRecPtr startpoint, int tliIndex, - XLogRecPtr endpoint) + XLogRecPtr endpoint, const char *restore_command) Let us use camel case 'restoreCommand' here as in the header for tidiness. I have left 0001 intact, but fixed all these small remarks in the 0002. Please, find it attached. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company From f94990da84844841a35e80ca30e2b3d8c3bdaded Mon Sep 17 00:00:00 2001 From: Michael Paquier Date: Mon, 30 Mar 2020 10:51:51 +0900 Subject: [PATCH v6 2/2] Add -c/--restore-target-wal to pg_rewind pg_rewind needs to copy from the source server to the target a set of relation blocks changed from the previous checkpoint where WAL forked between both up to the end of WAL on the target. This may require WAL segments that are not present anymore on the target's pg_wal, causing pg_rewind to fail. It is possible to fix that by copying manually the WAL segments needed but this may lead to extra and useless work. Instead, this commit introduces a new option allowing pg_rewind to use a restore_command when running by grabbing it from the target configuration. This allows the rewind operation to be more reliable, and only the necessary segments are fetched from the archives in this case. In order to do that, a new routine is added to src/common/ to allow frontend tools to retrieve a WAL segment using an already-built restore command. This version is much more simple than the backend equivalent. 
Author: Alexey Kondratov Reviewed-by: Andrey Borodin, Andres Freund, Alvaro Herrera, Alexander Korotkov, Michael Paquier Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee3616895...@postgrespro.ru --- doc/src/sgml/ref/pg_rewind.sgml | 28 +-- src/bin/pg_rewind/parsexlog.c | 33 +++- src/bin/pg_rewind/pg_rewind.c | 87 ++-- src/bin/pg_rewind/pg_rewind.h | 6 +- src/bin/pg_rewind/t/001_basic.pl | 3 +- src/bin/pg_rewind/t/RewindTest.pm | 70 +++- src/common/Makefile | 1 + src/common/exec.c | 3 +- src/common/fe_archive.c | 128 ++ src/include/common/fe_archive.h | 21 + src/include/port.h| 3 +- src/tools/msvc/Mkvcbuild.pm | 4 +- 12 files changed, 360 insertions(+), 27 deletions(-) create mode 100644 src/common/fe_archive.c create mode 100644 src/include/common/fe_archive.h diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index f64d659522..07c49e4719 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,11 +66,11 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the - target cluster ran for a long time after the divergence, the old WAL - files might no longer
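As a rough illustration of how the frontend-side routine added by this patch is meant to be used from pg_rewind's WAL reader (the surrounding variable names here are illustrative, only the RestoreArchivedFile() signature is taken from the discussion above):

	/* Sketch: fall back to the archive when a segment is missing from pg_wal. */
	snprintf(xlogfpath, MAXPGPATH, "%s/" XLOGDIR "/%s", datadir, xlogfname);

	xlogfd = open(xlogfpath, O_RDONLY | PG_BINARY, 0);
	if (xlogfd < 0 && restoreCommand != NULL)
	{
		/*
		 * RestoreArchivedFile() runs restore_command, verifies that the
		 * restored file has the expected size (the WAL segment size taken
		 * from pg_control), and returns an open file descriptor on success,
		 * or -1 on failure.
		 */
		xlogfd = RestoreArchivedFile(xlogfpath, xlogfname,
									 WalSegSz, restoreCommand);
	}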
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-03-26 10:34, Michael Paquier wrote: On Tue, Mar 24, 2020 at 12:22:16PM +0900, Michael Paquier wrote: Thanks Alvaro and Alexander. 0001 has been applied as of e09ad07. Now for 0002, let's see about it later. Attached is a rebased version, with no actual changes. I was looking at this patch again today and I am rather fine with the existing semantics. Still I don't like much to name the frontend-side routine FrontendRestoreArchivedFile() and use a different name than the backend counterpart because we have to include xlog_internal.h in fe_archive.c to be able to grab XLOGDIR. So here is an idea: let's move the declaration of the routines part of xlogarchive.c to a new header, called xlogarchive.h, and then let's use the same routine name for the frontend and the backend in this second patch. We include xlog_internal.h already in many frontend tools, so that would clean up things a bit. Two extra things are the routines for the checkpointer as well as the variables like ArchiveRecoveryRequested. It may be nice to move those while on it, but I am not sure where and that's not actually required for this patch set so that could be addressed later if need be. Any thoughts? The block of function declarations for xlogarchive.c inside xlog_internal.h looks a bit dangling already, since all other functions and variables defined/initialized in xlog.c. That way, it looks good to me to move it outside. The only one concern about using the same name I have is that later someone may introduce a new variable or typedef inside xlogarchive.h. So someone else would be required to include both fe_archive.h and xlogarchive.h at once. But probably there should be a warning in the header comment section against doing so. Anyway, I have tried to do what you proposed and attached is a small patch, that introduces xlogarchive.h. 
Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Companydiff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c index 860a996414..d683af377f 100644 --- a/src/backend/access/transam/timeline.c +++ b/src/backend/access/transam/timeline.c @@ -36,6 +36,7 @@ #include "access/timeline.h" #include "access/xlog.h" +#include "access/xlogarchive.h" #include "access/xlog_internal.h" #include "access/xlogdefs.h" #include "pgstat.h" diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 7621fc05e2..242427f815 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -32,6 +32,7 @@ #include "access/transam.h" #include "access/twophase.h" #include "access/xact.h" +#include "access/xlogarchive.h" #include "access/xlog_internal.h" #include "access/xloginsert.h" #include "access/xlogreader.h" diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 914ad340ea..04104b55ea 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -20,6 +20,7 @@ #include #include "access/xlog.h" +#include "access/xlogarchive.h" #include "access/xlog_internal.h" #include "common/archive.h" #include "miscadmin.h" diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c index 25e0333c9e..cc91954ac1 100644 --- a/src/backend/replication/walreceiver.c +++ b/src/backend/replication/walreceiver.c @@ -48,6 +48,7 @@ #include "access/htup_details.h" #include "access/timeline.h" #include "access/transam.h" +#include "access/xlogarchive.h" #include "access/xlog_internal.h" #include "catalog/pg_authid.h" #include "catalog/pg_type.h" diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h index 27ded593ab..8e3cfcf83e 100644 --- a/src/include/access/xlog_internal.h +++ b/src/include/access/xlog_internal.h @@ -320,22 +320,4 @@ extern bool InArchiveRecovery; extern bool StandbyMode; extern char *recoveryRestoreCommand; -/* - * Prototypes for functions in xlogarchive.c - */ -extern bool RestoreArchivedFile(char *path, const char *xlogfname, -const char *recovername, off_t expectedSize, -bool cleanupEnabled); -extern void ExecuteRecoveryCommand(const char *command, const char *commandName, - bool failOnSignal); -extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname); -extern void XLogArchiveNotify(const char *xlog); -extern void XLogArchiveNotifySeg(XLogSegNo segno); -extern void XLogArchiveForceDone(const char *xlog); -extern bool XLogArchiveCheckDone(const char *xlog); -extern bool XLogArchiveIsBusy(const char *xlog); -extern bool XLogArchiveIsReady(const char *xlog); -extern bool XLogArchiveIs
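To make the shape of the proposed split concrete: the new header would essentially carry the declarations removed from xlog_internal.h in the hunk above, for example (abridged sketch, guard name illustrative):

	/* src/include/access/xlogarchive.h (sketch, abridged) */
	#ifndef XLOG_ARCHIVE_H
	#define XLOG_ARCHIVE_H

	#include "access/xlogdefs.h"

	extern bool RestoreArchivedFile(char *path, const char *xlogfname,
									const char *recovername, off_t expectedSize,
									bool cleanupEnabled);
	extern void ExecuteRecoveryCommand(const char *command, const char *commandName,
									   bool failOnSignal);
	extern void KeepFileRestoredFromArchive(const char *path, const char *xlogfname);
	extern void XLogArchiveNotify(const char *xlog);
	extern bool XLogArchiveCheckDone(const char *xlog);
	/* ... remaining XLogArchive* declarations ... */

	#endif							/* XLOG_ARCHIVE_H */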
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-03-26 21:01, Justin Pryzby wrote: @@ -262,7 +280,7 @@ cluster(ClusterStmt *stmt, bool isTopLevel) * and error messages should refer to the operation as VACUUM not CLUSTER. */ void -cluster_rel(Oid tableOid, Oid indexOid, int options) +cluster_rel(Oid tableOid, Oid indexOid, Oid tablespaceOid, int options) Add a comment here about the tablespaceOid parameter, like the other functions where it's added. The permission checking is kind of duplicitive, so I'd suggest to factor it out. Ideally we'd only have one place that checks for pg_global/system/mapped. It needs to check that it's not a system relation, or that system_table_mods are allowed, and in any case that if it's a mapped rel, that it's not being moved. I would pass a boolean indicating if the tablespace is being changed. Yes, but I wanted to make sure first that all necessary validations are there to do not miss something as I did last time. I do not like repetitive code either, so I would like to introduce more common check after reviewing the code as a whole. Another issue is this: +VACUUM ( FULL [, ...] ) [ TABLESPACE class="parameter">new_tablespace ] [ class="parameter">table_and_columns [, ...] ] As you mentioned in your v1 patch, in the other cases, "tablespace [tablespace]" is added at the end of the command rather than in the middle. I wasn't able to make that work, maybe because "tablespace" isn't a fully reserved word (?). I didn't try with "SET TABLESPACE", although I understand it'd be better without "SET". Initially I tried "SET TABLESPACE", but also failed to completely get rid of shift/reduce conflicts. I will try to rewrite VACUUM's part again with OptTableSpace. Maybe I will manage it this time. I will take into account all your text edits as well. Thanks -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
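To illustrate the factoring-out idea discussed above, the shared sanity check could look roughly like the following. The helper name is made up here; the individual checks mirror what ALTER ... SET TABLESPACE enforces today (no system relations without allow_system_table_mods, no mapped relations, nothing into pg_global).

	/*
	 * Sketch of a common "may this relation be moved to a new tablespace?"
	 * check for REINDEX/CLUSTER/VACUUM FULL callers.  Hypothetical helper.
	 */
	static void
	CheckRelationTablespaceMove(Relation rel, Oid newTablespaceOid)
	{
		if (!OidIsValid(newTablespaceOid))
			return;				/* no tablespace change requested */

		if (IsSystemRelation(rel) && !allowSystemTableMods)
			ereport(ERROR,
					(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
					 errmsg("permission denied: \"%s\" is a system catalog",
							RelationGetRelationName(rel))));

		if (RelationIsMapped(rel))
			ereport(ERROR,
					(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
					 errmsg("cannot change the tablespace of mapped relation \"%s\"",
							RelationGetRelationName(rel))));

		if (newTablespaceOid == GLOBALTABLESPACE_OID)
			ereport(ERROR,
					(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
					 errmsg("only shared relations can be placed in pg_global tablespace")));
	}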
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-03-26 02:40, Justin Pryzby wrote:

On Thu, Mar 12, 2020 at 08:08:46PM +0300, Alexey Kondratov wrote: On 09.03.2020 23:04, Justin Pryzby wrote: On Sat, Feb 29, 2020 at 08:53:04AM -0600, Justin Pryzby wrote: On Sat, Feb 29, 2020 at 03:35:27PM +0300, Alexey Kondratov wrote:

tests for that. (I'm including your v8 untouched in hopes of not messing up the cfbot). My fixes avoid an issue if you try to REINDEX onto pg_default, I think due to moving system toast indexes. I was able to avoid this issue by adding a call to GetNewRelFileNode, even though that's already called by RelationSetNewRelfilenode(). Not sure if there's a better way, or if it's worth Alexey's v3 patch which added a tablespace param to RelationSetNewRelfilenode.

Do you have any understanding of what exactly causes this error? I have tried to debug it a little bit, but still cannot figure out why we need this extra GetNewRelFileNode() call, or the mechanism by which it helps.

The PANIC is from the smgr hashtable, which couldn't find an entry it expected. My very tentative understanding is that smgr is prepared to handle a *relation* which is dropped/recreated multiple times in a transaction, but it's *not* prepared to deal with a given RelFileNode(Backend) being dropped/recreated, since that's used as a hash key. I revisited it and solved it in a somewhat nicer way.

I included your new solution regarding this part from 0004 into 0001. It seems that at least part of the problem was that we tried to change the tablespace to pg_default while already being there.

It's still not clear to me if there's an issue with your original way of adding a tablespace parameter to RelationSetNewRelfilenode().

Yes, it is not clear to me either.

Many thanks for your review and fixups! There are some inconsistencies like mentions of SET TABLESPACE in error messages and so on. I am going to refactor and include your fixes 0003-0004 into 0001 and 0002, but keep 0005 separate for now, since this part requires more understanding IMO (and comparison with the v4 implementation).

I'd suggest to keep the CLUSTER/VACUUM FULL parts separate from REINDEX, in case Michael or someone else wants to progress one but cannot commit to both.

Yes, sure, I did not have plans to merge everything into a single patch.

So, it has taken much longer to understand and rework all these fixes and permission validations. Attached is the updated patch set.

0001:
— It is mostly the same, but refactored
— I also included your most recent fix for REINDEX DATABASE with allow_system_table_mods=1
— With this patch REINDEX + TABLESPACE simply errors out when an index on a TOAST table is met and allow_system_table_mods=0

0002:
— I reworked it a bit, since REINDEX CONCURRENTLY is not allowed on system catalogs anyway; that is checked at the higher levels of statement processing. So we only have to care about TOAST relations
— Also added the same check into the plain REINDEX
— It works fine, but I am not entirely happy that with this patch errors/warnings are a bit inconsistent:

template1=# REINDEX INDEX CONCURRENTLY pg_toast.pg_toast_12773_index TABLESPACE pg_default;
WARNING:  skipping tablespace change of "pg_toast_12773_index"
DETAIL:  Cannot move system relation, only REINDEX CONCURRENTLY is performed.
template1=# REINDEX TABLE CONCURRENTLY pg_toast.pg_toast_12773 TABLESPACE pg_default;
ERROR:  permission denied: "pg_toast_12773" is a system catalog

And REINDEX DATABASE CONCURRENTLY will generate a warning again. Maybe we should always throw a warning and only do the reindex if it is not possible to change the tablespace?
0003: — I have get rid of some of previous refactoring pieces like check_relation_is_movable for now. Let all these validations to settle and then think whether we could do it better — Added CLUSTER to copy/equalfuncs — Cleaned up messages and comments I hope that I did not forget anything from your proposals. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company From 692bcadfacfa0c47fec7b6969525d33d0cac1f83 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 24 Mar 2020 18:16:06 +0300 Subject: [PATCH v12 3/3] Allow CLUSTER and VACUUM FULL to change tablespace --- doc/src/sgml/ref/cluster.sgml | 11 - doc/src/sgml/ref/vacuum.sgml | 11 + src/backend/commands/cluster.c| 58 --- src/backend/commands/vacuum.c | 51 ++-- src/backend/nodes/copyfuncs.c | 2 + src/backend/nodes/equalfuncs.c| 2 + src/backend/parser/gram.y | 45 -- src/backend/postmaster/autovacuum.c | 1 + src/include/commands/cluster.h| 2 +- src/include/commands/vacuum.h | 2 + src/include/nodes/parsenodes.h| 2 + src/test/regress/input/tablespace.source | 23 - src/test/regress/output/ta
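Regarding the warning-versus-error inconsistency raised above, one way to make the behaviour uniform would be to always downgrade to a warning and simply drop the tablespace part of the request when it cannot be honoured, for example (sketch only; the message wording is adapted from the output shown above, and the variable names are illustrative):

	/* Sketch: skip only the tablespace change, but still do the reindex. */
	if (OidIsValid(tablespaceOid) &&
		IsSystemRelation(iRel) && !allowSystemTableMods)
	{
		ereport(WARNING,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("skipping tablespace change of \"%s\"",
						RelationGetRelationName(iRel)),
				 errdetail("Cannot move system relation, only REINDEX is performed.")));
		tablespaceOid = InvalidOid;	/* fall through to a plain reindex */
	}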
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
Hi Justin, On 09.03.2020 23:04, Justin Pryzby wrote: On Sat, Feb 29, 2020 at 08:53:04AM -0600, Justin Pryzby wrote: On Sat, Feb 29, 2020 at 03:35:27PM +0300, Alexey Kondratov wrote: Anyway, new version is attached. It is rebased in order to resolve conflicts with a recent fix of REINDEX CONCURRENTLY + temp relations, and includes this small comment fix. Thanks for rebasing - I actually started to do that yesterday. I extracted the bits from your original 0001 patch which handled CLUSTER and VACUUM FULL. I don't think if there's any interest in combining that with ALTER anymore. On another thread (1), I tried to implement that, and Tom pointed out problem with the implementation, but also didn't like the idea. I'm including some proposed fixes, but didn't yet update the docs, errors or tests for that. (I'm including your v8 untouched in hopes of not messing up the cfbot). My fixes avoid an issue if you try to REINDEX onto pg_default, I think due to moving system toast indexes. I was able to avoid this issue by adding a call to GetNewRelFileNode, even though that's already called by RelationSetNewRelfilenode(). Not sure if there's a better way, or if it's worth Alexey's v3 patch which added a tablespace param to RelationSetNewRelfilenode. Do you have any understanding of what exactly causes this error? I have tried to debug it a little bit, but still cannot figure out why we need this extra GetNewRelFileNode() call and a mechanism how it helps. Probably you mean v4 patch. Yes, interestingly, if we do everything at once inside RelationSetNewRelfilenode(), then there is no issue at all with: REINDEX DATABASE template1 TABLESPACE pg_default; It feels like I am doing a monkey coding here, so I want to understand it better :) The current logic allows moving all the indexes and toast indexes, but I think we should use IsSystemRelation() unless allow_system_table_mods, like existing behavior of ALTER. template1=# ALTER TABLE pg_extension_oid_index SET tablespace pg_default; ERROR: permission denied: "pg_extension_oid_index" is a system catalog template1=# REINDEX INDEX pg_extension_oid_index TABLESPACE pg_default; REINDEX Yeah, we definitely should obey the same rules as ALTER TABLE / INDEX in my opinion. Finally, I think the CLUSTER is missing permission checks. It looks like relation_is_movable was factored out, but I don't see how that helps ? I did this relation_is_movable refactoring in order to share the same check between REINDEX + TABLESPACE and ALTER INDEX + SET TABLESPACE. Then I realized that REINDEX already has its own temp tables check and does mapped relations validation in multiple places, so I just added global tablespace checks instead. Thus, relation_is_movable seems to be outdated right now. Probably, we have to do another refactoring here once all proper validations will be accumulated in this patch set. Alexey, I'm hoping to hear back if you think these changes are ok or if you'll publish a new version of the patch addressing the crash I reported. Or if you're too busy, maybe someone else can adopt the patch (I can help). Sorry for the late response, I was not going to abandon this patch, but was a bit busy last month. Many thanks for you review and fixups! There are some inconsistencies like mentions of SET TABLESPACE in error messages and so on. I am going to refactor and include your fixes 0003-0004 into 0001 and 0002, but keep 0005 separated for now, since this part requires more understanding IMO (and comparison with v4 implementation). 
That way, I am going to prepare a more clear patch set till the middle of the next week. I will be glad to receive more feedback from you then. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 12.03.2020 07:39, Michael Paquier wrote:
> I'd like to commit the refactoring piece in 0001 tomorrow, then let's move
> on with the rest as of 0002. If more comments and docs are needed for
> archive.c, let's continue discussing that.

I just went through both patches and realized that I cannot quite grasp the semantics of splitting frontend code between common and fe_utils. This applies only to 0002, where we introduce fe_archive.c. Should it be placed into fe_utils alongside the recent recovery_gen.c, which is also used by pg_rewind? This is frontend-only code intended to be used by frontend applications, so fe_utils feels like the right place, doesn't it?

I just tried to do so and everything went fine, so it seems that there are no obstacles from the build system. BTW, most of 'common' is really common code, with only four exceptions like logging.c, which is frontend-only. Is it there for historical reasons only, or something else?

Regards

--
Alexey Kondratov
Postgres Professional
https://www.postgrespro.com
Russian Postgres Company
Re: Conflict handling for COPY FROM
On 09.03.2020 15:34, Surafel Temesgen wrote:
> okay attached is a rebased patch with it

+	Portal		portal = NULL;
...
+	portal = GetPortalByName("");
+	SetRemoteDestReceiverParams(dest, portal);

I think that you do not need this, since you are using a ready-made DestReceiver. The whole idea of passing the DestReceiver down to CopyFrom was to avoid that code. This unnamed portal is created in exec_simple_query [1] and has already been attached to the DestReceiver there [2]. Maybe I am missing something, but I have just removed this code and everything works just fine.

[1] https://github.com/postgres/postgres/blob/0a42a2e9/src/backend/tcop/postgres.c#L1178
[2] https://github.com/postgres/postgres/blob/0a42a2e9/src/backend/tcop/postgres.c#L1226

Regards

--
Alexey Kondratov
Postgres Professional
https://www.postgrespro.com
Russian Postgres Company
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 05.03.2020 09:24, Michael Paquier wrote:
> On Wed, Mar 04, 2020 at 08:14:20PM +0300, Alexey Kondratov wrote:
>>> - I did not actually get why you don't check for a missing command when
>>>   using wait_result_is_any_signal. In this case I'd think that it is
>>>   better to exit immediately as follow-up calls would just fail.
>>
>> Believe it or not, but I put 'false' there intentionally. The idea was
>> that if the reason is a signal, then maybe the user got tired of waiting
>> and killed that restore_command process themselves or something like
>> that, so it is better to exit immediately. If it was a missing command,
>> then there is no hurry, so we can go further and complain that the
>> attempt to recover the WAL segment has failed.
>>
>> Actually, I guess that there is no big difference whether we include a
>> missing command here or not. There is no complicated logic afterwards,
>> compared to the real recovery process in Postgres, where we cannot simply
>> return false in that case.
>
> On the contrary, it seems to me that the difference is very important.
> Imagine for example a frontend tool which calls RestoreArchivedWALFile in
> a loop, and that this one fails because the command called is missing.
> This tool would keep looping for nothing. So checking for a missing
> command and leaving immediately would be more helpful for the user. Can
> you think about scenarios where it would make sense to be able to loop in
> this case instead of failing?

OK, I was still thinking of pg_rewind as the only user of this routine. Now it is part of src/common, and I can imagine a hypothetical tool that polls the archive, waiting for a specific WAL segment to become available. In that case 'command not found' is definitely the end of the game, while the absence of a segment is an expected error, so we can continue looping.

Regards

--
Alexey Kondratov
Postgres Professional
https://www.postgrespro.com
Russian Postgres Company
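For reference, the distinction debated in the message above comes down to the second argument of wait_result_is_any_signal(). A minimal sketch of the calling code (variable names are illustrative, not copied from the patch):

	/*
	 * Sketch of the error handling around system(restoreCommand).  With
	 * 'false' as the second argument, "command not found" is treated like
	 * any other non-zero exit: we just report that the segment could not be
	 * restored and let the caller decide.  With 'true', a missing command
	 * would also be a hard failure.
	 */
	rc = system(restoreCommand);
	if (rc != 0)
	{
		if (wait_result_is_any_signal(rc, false))
			pg_fatal("restore_command failed: %s",
					 wait_result_to_str(rc));

		pg_log_error("could not restore file \"%s\" from archive", xlogfname);
		return -1;
	}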
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 04.03.2020 10:45, Michael Paquier wrote:
> On Mon, Mar 02, 2020 at 08:59:49PM +0300, Alexey Kondratov wrote:
>> All other remarks are clear to me, so I will fix them in the next patch
>> version, thanks.
>
> Already done as per the attached, with a new routine named
> getRestoreCommand() and more done.

Many thanks for doing that. I went through the diff between v21 and v20. Most of the changes look good to me.

- * Functions for finding and validating executable files
+ * Functions for finding and validating from executables files

There is probably something missing here. Finding and validating what? And 'executables files' does not seem correct either.

+	# First, remove all the content in the archive directory,
+	# as RecursiveCopy::copypath does not support copying to
+	# existing directories.

I think that 'remove all the content' is not completely correct in this case. We are simply removing the archive directory. There is no content there yet, so 'First, remove the archive directory...' should be fine.

> - I did not actually get why you don't check for a missing command when
>   using wait_result_is_any_signal. In this case I'd think that it is
>   better to exit immediately as follow-up calls would just fail.

Believe it or not, but I put 'false' there intentionally. The idea was that if the reason is a signal, then maybe the user got tired of waiting and killed that restore_command process themselves or something like that, so it is better to exit immediately. If it was a missing command, then there is no hurry, so we can go further and complain that the attempt to recover the WAL segment has failed.

Actually, I guess that there is no big difference whether we include a missing command here or not. There is no complicated logic afterwards, compared to the real recovery process in Postgres, where we cannot simply return false in that case.

> - The code was rather careless about error handling and
>   RestoreArchivedWALFile(), and it seemed to me that it is rather
>   pointless to report an extra message "could not restore file \"%s\"
>   from archive" on top of the other error.

Probably you mean the several pg_log_error calls not followed by 'return -1;'. Yes, I did that to fall through to the end of the function and show this extra message, but I agree that there is not much sense in doing so.

Regards

--
Alexey Kondratov
Postgres Professional
https://www.postgrespro.com
Russian Postgres Company
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-03-02 07:53, Michael Paquier wrote:
> + * For fixed-size files, the caller may pass the expected size as an
> + * additional crosscheck on successful recovery. If the file size is not
> + * known, set expectedSize = 0.
> + */
> +int
> +RestoreArchivedWALFile(const char *path, const char *xlogfname,
> +                       off_t expectedSize, const char *restoreCommand)
>
> Actually, expectedSize is IMO a bad idea, because any caller of this
> routine passing down zero could be trapped with an incorrect file size.
> So let's remove the behavior where it is possible to bypass this sanity
> check. We don't need it in pg_rewind either.

OK, sounds reasonable, but just to be clear: I will only remove the possibility of bypassing this sanity check (with 0), but leave the expectedSize argument intact. We still need it, since pg_rewind takes WalSegSz from the ControlFile and should pass it further, am I right?

> + /* Remove trailing newline */
> + if (strchr(cmd_output, '\n') != NULL)
> +     *strchr(cmd_output, '\n') = '\0';
>
> It seems to me that what you are looking for here is to use
> pg_strip_crlf(). Thinking harder, we have pipe_read_line() in
> src/common/exec.c which does the exact same job..

pg_strip_crlf() fits well, but would you mind if I also make pipe_read_line() external in this patch?

> - /*
> -  * construct the command to be executed
> -  */
>
> Perhaps you meant "build" here.

Actually, the verb 'construct' has historically been applied to archive/restore commands (see also xlogarchive.c and pgarch.c), but it should be 'build' in (fe_)archive.c, since we have BuildRestoreCommand there now.

All other remarks are clear to me, so I will fix them in the next patch version, thanks.

Regards

--
Alexey Kondratov
Postgres Professional
https://www.postgrespro.com
The Russian Postgres Company
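For completeness, the helper mentioned above turns the quoted snippet into a one-liner; roughly:

	#include "common/string.h"		/* pg_strip_crlf() */

	/* Strip any trailing CR/LF in place, instead of the manual strchr() dance. */
	(void) pg_strip_crlf(cmd_output);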
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2020-02-11 19:48, Justin Pryzby wrote: For your v7 patch, which handles REINDEX to a new tablespace, I have a few minor comments: + * the relation will be rebuilt. If InvalidOid is used, the default => should say "currrent", not default ? Yes, it keeps current index tablespace in that case, thanks. +++ b/doc/src/sgml/ref/reindex.sgml +TABLESPACE ... +class="parameter">new_tablespace => I saw you split the description of TABLESPACE from new_tablespace based on comment earlier in the thread, but I suggest that the descriptions for these should be merged, like: + +TABLESPACEnew_tablespace + + + Allow specification of a tablespace where all rebuilt indexes will be created. + Cannot be used with "mapped" relations. If SCHEMA, + DATABASE or SYSTEM are specified, then + all unsuitable relations will be skipped and a single WARNING + will be generated. + + + It sounds good to me, but here I just obey the structure, which is used all around. Documentation of ALTER TABLE/DATABASE, REINDEX and many others describes each literal/parameter in a separate entry, e.g. new_tablespace. So I would prefer to keep it as it is for now. The existing patch is very natural, especially the parts in the original patch handling vacuum full and cluster. Those were removed to concentrate on REINDEX, and based on comments that it might be nice if ALTER handled CLUSTER and VACUUM FULL. On a separate thread, I brought up the idea of ALTER using clustered order. Tom pointed out some issues with my implementation, but didn't like the idea, either. So I suggest to re-include the CLUSTER/VAC FULL parts as a separate 0002 patch, the same way they were originally implemented. BTW, I think if "ALTER" were updated to support REINDEX (to allow multiple operations at once), it might be either: |ALTER INDEX i SET TABLESPACE , REINDEX -- to reindex a single index on a given tlbspc or |ALTER TABLE tbl REINDEX USING INDEX TABLESPACE spc; -- to reindex all inds on table inds moved to a given tblspc "USING INDEX TABLESPACE" is already used for ALTER..ADD column/table CONSTRAINT. Yes, I also think that allowing REINDEX/CLUSTER/VACUUM FULL to put resulting relation in a different tablespace is a very natural operation. However, I did a couple of attempts to integrate latter two with ALTER TABLE and failed with it, since it is already complex enough. I am still willing to proceed with it, but not sure how soon it will be. Anyway, new version is attached. It is rebased in order to resolve conflicts with a recent fix of REINDEX CONCURRENTLY + temp relations, and includes this small comment fix. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com The Russian Postgres Company From d2b7a5fa2e11601759b47af0c142a7824ef907a2 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 30 Dec 2019 20:00:37 +0300 Subject: [PATCH v8] Allow REINDEX to change tablespace REINDEX already does full relation rewrite, this patch adds a possibility to specify a new tablespace where new relfilenode will be created. 
--- doc/src/sgml/ref/reindex.sgml | 24 +- src/backend/catalog/index.c | 75 -- src/backend/commands/cluster.c| 2 +- src/backend/commands/indexcmds.c | 96 --- src/backend/commands/tablecmds.c | 2 +- src/backend/nodes/copyfuncs.c | 1 + src/backend/nodes/equalfuncs.c| 1 + src/backend/parser/gram.y | 14 ++-- src/backend/tcop/utility.c| 6 +- src/bin/psql/tab-complete.c | 6 ++ src/include/catalog/index.h | 7 +- src/include/commands/defrem.h | 6 +- src/include/nodes/parsenodes.h| 1 + src/test/regress/input/tablespace.source | 49 src/test/regress/output/tablespace.source | 66 15 files changed, 323 insertions(+), 33 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index c54a7c420d4..0628c94bb1e 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -21,7 +21,7 @@ PostgreSQL documentation -REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name +REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name [ TABLESPACE new_tablespace ] where option can be one of: @@ -174,6 +174,28 @@ REINDEX [ ( option [, ...] ) ] { IN + +TABLESPACE + + + This specifies a tablespace, where all rebuilt indexes will be created. + Cannot be used with "mapped" relations. If SCHEMA, + DATABASE or SYSTEM is specified, then + all unsuitable relations will be skipped and a single WARNING + will be generated. +
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-02-28 09:43, Michael Paquier wrote: On Thu, Feb 27, 2020 at 06:29:34PM +0300, Alexey Kondratov wrote: On 2020-02-27 16:41, Alexey Kondratov wrote: > > New version of the patch is attached. Thanks again for your review. > Last patch (v18) got a conflict with one of today commits (05d8449e73). Rebased version is attached. The shape of the patch is getting better. I have found some issues when reading through the patch, but nothing huge. + printf(_(" -c, --restore-target-wal use restore_command in target config\n")); + printf(_(" to retrieve WAL files from archive\n")); [...] {"progress", no_argument, NULL, 'P'}, + {"restore-target-wal", no_argument, NULL, 'c'}, It may be better to reorder that alphabetically. Sure, I put it in order. However, the recent -R option is out of order too. + if (rc != 0) + /* Sanity check, should never happen. */ + elog(ERROR, "failed to build restore_command due to missing parameters"); No point in having this comment IMO. I would prefer to keep it, since there are plenty of similar comments near Asserts and elogs all over the Postgres. Otherwise it may look like a valid error state. It may be obvious now, but for someone who is not aware of BuildRestoreCommand refactoring it may be not. So from my perspective there is nothing bad in this extra one line comment. +/* logging support */ +#define pg_fatal(...) do { pg_log_fatal(__VA_ARGS__); exit(1); } while(0) Actually, I don't think that this is a good idea to name this pg_fatal() as we have the same think in pg_rewind so it could be confusing. I have added explicit exit(1) calls, since pg_fatal was used only twice in the archive.c. Probably, pg_log_fatal from common/logging should obey the same logic as FATAL log-level in the backend and do exit the process, but for now including pg_rewind.h inside archive.c or vice versa does not look like a solution. - while ((c = getopt_long(argc, argv, "D:nNPR", long_options, _index)) != -1) + while ((c = getopt_long(argc, argv, "D:nNPRc", long_options, _index)) != -1) Alphabetical order here. Done. + rmdir($node_master->archive_dir); rmtree() is used in all our other tests. Done. There was an unobvious logic that rmdir only deletes empty directories, which is true in the case of archive_dir in that test, but I have unified it for consistency. + pg_log_error("archive file \"%s\" has wrong size: %lu instead of %lu, %s", +xlogfname, (unsigned long) stat_buf.st_size, +(unsigned long) expectedSize, strerror(errno)); I think that the error message should be reworded: "unexpected WAL file size for \"%s\": %lu instead of %lu". Please note that there is no need for strerror() here at all, as errno should be 0. +if (xlogfd < 0) +pg_log_error("could not open file \"%s\" restored from archive: %s\n", + xlogpath, strerror(errno)); [...] +pg_log_error("could not stat file \"%s\" restored from archive: %s", +xlogpath, strerror(errno)); No need for strerror() as you can just use %m. And no need for the extra newline at the end as pg_log_* routines do that by themselves. + pg_log_error("could not restore file \"%s\" from archive\n", +xlogfname); No need for a newline here. Thanks, I have cleaned up these log statements. 
-- Alexey Kondratov Postgres Professional https://www.postgrespro.com The Russian Postgres Company From ba20808ffddf3fe2eefe96d3385697fb6583ce9a Mon Sep 17 00:00:00 2001 From: Alexander Korotkov Date: Tue, 25 Feb 2020 02:22:45 +0300 Subject: [PATCH v20] pg_rewind: Add options to restore WAL files from archive Currently, pg_rewind fails when it could not find required WAL files in the target data directory. One have to manually figure out which WAL files are required and copy them back from archive. This commit implements new pg_rewind options, which allow pg_rewind to automatically retrieve missing WAL files from archival storage. The restore_command option is read from postgresql.conf. Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1%40postgrespro.ru Author: Alexey Kondratov Reviewed-by: Michael Paquier, Andrey Borodin, Alvaro Herrera Reviewed-by: Andres Freund, Alexander Korotkov --- doc/src/sgml/ref/pg_rewind.sgml | 28 -- src/backend/access/transam/xlogarchive.c | 60 ++-- src/bin/pg_rewind/parsexlog.c| 33 ++- src/bin/pg_rewind/pg_rewind.c| 77 ++- src/bin/pg_rewind/pg_rewind.h| 6 +- src/bin/pg_rewind/t/001_basic.pl | 3 +- src/bi
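To illustrate the log-message cleanups agreed on above: with the common logging framework, errno reporting and the trailing newline are handled by the framework itself, so the calls end up looking roughly like this (messages follow the wording suggested in the review):

	/* %m expands to strerror(errno); pg_log_* appends the newline itself. */
	pg_log_error("could not open file \"%s\" restored from archive: %m", xlogpath);

	pg_log_error("unexpected WAL file size for \"%s\": %lu instead of %lu",
				 xlogfname, (unsigned long) stat_buf.st_size,
				 (unsigned long) expectedSize);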
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-02-27 16:41, Alexey Kondratov wrote: New version of the patch is attached. Thanks again for your review. Last patch (v18) got a conflict with one of today commits (05d8449e73). Rebased version is attached. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com The Russian Postgres Company From ea93b52b298d80aac547735c5917386b37667595 Mon Sep 17 00:00:00 2001 From: Alexander Korotkov Date: Tue, 25 Feb 2020 02:22:45 +0300 Subject: [PATCH v19] pg_rewind: Add options to restore WAL files from archive Currently, pg_rewind fails when it could not find required WAL files in the target data directory. One have to manually figure out which WAL files are required and copy them back from archive. This commit implements new pg_rewind options, which allow pg_rewind to automatically retrieve missing WAL files from archival storage. The restore_command option is read from postgresql.conf. Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1%40postgrespro.ru Author: Alexey Kondratov Reviewed-by: Michael Paquier, Andrey Borodin, Alvaro Herrera Reviewed-by: Andres Freund, Alexander Korotkov --- doc/src/sgml/ref/pg_rewind.sgml | 28 -- src/backend/access/transam/xlogarchive.c | 60 ++-- src/bin/pg_rewind/parsexlog.c| 33 ++- src/bin/pg_rewind/pg_rewind.c| 77 +++- src/bin/pg_rewind/pg_rewind.h| 6 +- src/bin/pg_rewind/t/001_basic.pl | 3 +- src/bin/pg_rewind/t/RewindTest.pm| 66 +- src/common/Makefile | 2 + src/common/archive.c | 102 + src/common/fe_archive.c | 111 +++ src/include/common/archive.h | 22 + src/include/common/fe_archive.h | 19 src/tools/msvc/Mkvcbuild.pm | 8 +- 13 files changed, 457 insertions(+), 80 deletions(-) create mode 100644 src/common/archive.c create mode 100644 src/common/fe_archive.c create mode 100644 src/include/common/archive.h create mode 100644 src/include/common/fe_archive.h diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 42d29edd4e..64a6942031 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,11 +66,11 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the - target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or - fetched on startup by configuring or - . The use of + target cluster ran for a long time after the divergence, its old WAL + files might no longer be present. In this case, you can manually copy them + from the WAL archive to the pg_wal directory, or run + pg_rewind with the -c option to + automatically retrieve them from the WAL archive. The use of pg_rewind is not limited to failover, e.g. a standby server can be promoted, run some write transactions, and then rewinded to become a standby again. @@ -232,6 +232,19 @@ PostgreSQL documentation + + -c + --restore-target-wal + + +Use the restore_command defined in +the target cluster configuration to retrieve WAL files from +the WAL archive if these files are no longer available in the +pg_wal directory. + + + + --debug @@ -318,7 +331,10 @@ GRANT EXECUTE ON function pg_catalog.pg_read_binary_file(text, bigint, bigint, b history forked off from the target cluster. For each WAL record, record each data block that was touched. 
This yields a list of all the data blocks that were changed in the target cluster, after the - source cluster forked off. + source cluster forked off. If some of the WAL files are no longer + available, try re-running pg_rewind with + the -c option to search for the missing files in + the WAL archive. diff --git a/src/backend/access/transam/xlogarchive.c b/src/backend/access/transam/xlogarchive.c index 188b73e752..f78a7e8f02 100644 --- a/src/backend/access/transam/xlogarchive.c +++ b/src/backend/access/transam/xlogarchive.c @@ -21,6 +21,7 @@ #include "access/xlog.h" #include "access/xlog_internal.h" +#include "common/archive.h" #include "miscadmin.h" #include "postmaster/startup.h" #include "replication/walsender.h" @@ -55,9 +56,6 @@ RestoreArchivedFile(char *path, const char *xlogfname, char xlogpath[MAXPGPATH]; char xlogRestoreCmd[MAXPGPATH]; c
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-02-27 04:52, Michael Paquier wrote: On Thu, Feb 27, 2020 at 12:43:55AM +0300, Alexander Korotkov wrote: Regarding text split change, it was made by pgindent. I didn't notice it belongs to unchanged part of code. Sure, we shouldn't include this into the patch. I have read through v17 (not tested, sorry), and spotted a couple of issues that need to be addressed. + "--source-pgdata=$standby_pgdata", + "--target-pgdata=$master_pgdata", + "--no-sync", "--no-ensure-shutdown", FWIW, I think that perl indenting would reshape this part. I would recommend to run src/tools/pgindent/pgperltidy and ./src/tools/perlcheck/pgperlcritic before commit. Thanks, formatted this part with perltidy. It also has modified RecursiveCopy's indents. Pgperlcritic has no complains about this file. BTW, being executed on the whole project pgperltidy modifies dozens of perl files an even pgindent itself. + * Copyright (c) 2020, PostgreSQL Global Development Group Wouldn't it be better to just use the full copyright here? I mean the following: Portions Copyright (c) 1996-2020, PostgreSQL Global Development Group Portions Copyright (c) 1994, The Regents of the University of California I think so, it contains some older code parts, so it is better to use unified copyrights. +++ b/src/common/archive.c [...] +#include "postgres.h" + +#include "common/archive.h" This is incorrect. All files shared between the backend and the frontend in src/common/ have to include the following set of headers: #ifndef FRONTEND #include "postgres.h" #else #include "postgres_fe.h" #endif +++ b/src/common/fe_archive.c [...] +#include "postgres_fe.h" This is incomplete. The following piece should be added: #ifndef FRONTEND #error "This file is not expected to be compiled for backend code" #endif Fixed both. + snprintf(postgres_cmd, sizeof(postgres_cmd), "%s -D %s -C restore_command", +postgres_exec_path, datadir_target); + I think that this is missing proper quoting. Yep, added the same quoting as in pg_upgrade/options. I would rename ConstructRestoreCommand() to BuildRestoreCommand() while on it.. OK, shorter is better. I think that it would be saner to check the return status of ConstructRestoreCommand() in xlogarchive.c as a sanity check, with an elog(ERROR) if not 0, as that should never happen. Added. New version of the patch is attached. Thanks again for your review. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com The Russian Postgres Company From c775a2e40e405474f6ecef35843d276d43fb462f Mon Sep 17 00:00:00 2001 From: Alexander Korotkov Date: Tue, 25 Feb 2020 02:22:45 +0300 Subject: [PATCH v18] pg_rewind: Add options to restore WAL files from archive Currently, pg_rewind fails when it could not find required WAL files in the target data directory. One have to manually figure out which WAL files are required and copy them back from archive. This commit implements new pg_rewind options, which allow pg_rewind to automatically retrieve missing WAL files from archival storage. The restore_command option is read from postgresql.conf. 
Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1%40postgrespro.ru Author: Alexey Kondratov Reviewed-by: Michael Paquier, Andrey Borodin, Alvaro Herrera Reviewed-by: Andres Freund, Alexander Korotkov --- doc/src/sgml/ref/pg_rewind.sgml | 28 -- src/backend/access/transam/xlogarchive.c | 60 ++-- src/bin/pg_rewind/parsexlog.c| 33 ++- src/bin/pg_rewind/pg_rewind.c| 77 +++- src/bin/pg_rewind/pg_rewind.h| 6 +- src/bin/pg_rewind/t/001_basic.pl | 3 +- src/bin/pg_rewind/t/RewindTest.pm| 66 +- src/common/Makefile | 2 + src/common/archive.c | 102 + src/common/fe_archive.c | 111 +++ src/include/common/archive.h | 22 + src/include/common/fe_archive.h | 19 src/tools/msvc/Mkvcbuild.pm | 8 +- 13 files changed, 457 insertions(+), 80 deletions(-) create mode 100644 src/common/archive.c create mode 100644 src/common/fe_archive.c create mode 100644 src/include/common/archive.h create mode 100644 src/include/common/fe_archive.h diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 42d29edd4e..64a6942031 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,11 +66,11 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. In the typical
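The quoting fix discussed above amounts to wrapping both the executable path and the data directory in double quotes before handing the string to the shell, along these lines (a sketch only; cf. the similar handling in pg_upgrade/option.c):

	/* Build a shell command asking the target postgres binary for the GUC value. */
	snprintf(postgres_cmd, sizeof(postgres_cmd),
			 "\"%s\" -D \"%s\" -C restore_command",
			 postgres_exec_path, datadir_target);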
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-02-26 22:03, Alexander Korotkov wrote: On Tue, Feb 25, 2020 at 1:48 PM Alexander Korotkov wrote: I think usage of chmod() deserves comment. As I get default permissions are sufficient for work, but we need to set them to satisfy 'check PGDATA permissions' test. I've added this comment myself. Thanks for doing it yourself, I was going to answer tonight, but it would be obviously too late. I've also fixes some indentation. Patch now looks good to me. I'm going to push it if no objections. I think that docs should be corrected. Previously Michael was against the phrase 'restore_command defined in the postgresql.conf', since it also could be defined in any config file included there. We corrected it in the pg_rewind --help output, but now docs say: +Use the restore_command defined in +postgresql.conf to retrieve WAL files from +the WAL archive if these files are no longer available in the +pg_wal directory of the target cluster. Probably it should be something like: +Use the restore_command defined in +the target cluster configuration to retrieve WAL files from +the WAL archive if these files are no longer available in the +pg_wal directory. Here the only text split changed: -* Ignore restore_command when not in archive recovery (meaning -* we are in crash recovery). + * Ignore restore_command when not in archive recovery (meaning we are in +* crash recovery). Should we do so in this patch? I think that this extra dot at the end is not necessary here: + pg_log_debug("using config variable restore_command=\'%s\'.", restore_command); If you agree then attached is a patch with all the corrections above. It is made with default git format-patch format, but yours were in a slightly different format, so I only was able to apply them with git am --patch-format=stgit. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com The Russian Postgres Company From fa2fc359dd9852afc608663fa32733e800652ffa Mon Sep 17 00:00:00 2001 From: Alexander Korotkov Date: Tue, 25 Feb 2020 02:22:45 +0300 Subject: [PATCH v17] pg_rewind: Add options to restore WAL files from archive Currently, pg_rewind fails when it could not find required WAL files in the target data directory. One have to manually figure out which WAL files are required and copy them back from archive. This commit implements new pg_rewind options, which allow pg_rewind to automatically retrieve missing WAL files from archival storage. The restore_command option is read from postgresql.conf. 
Discussion: https://postgr.es/m/a3acff50-5a0d-9a2c-b3b2-ee36168955c1%40postgrespro.ru Author: Alexey Kondratov Reviewed-by: Michael Paquier, Andrey Borodin, Alvaro Herrera Reviewed-by: Andres Freund, Alexander Korotkov --- doc/src/sgml/ref/pg_rewind.sgml | 28 -- src/backend/access/transam/xlogarchive.c | 58 + src/bin/pg_rewind/parsexlog.c| 33 ++- src/bin/pg_rewind/pg_rewind.c| 77 ++-- src/bin/pg_rewind/pg_rewind.h| 6 +- src/bin/pg_rewind/t/001_basic.pl | 3 +- src/bin/pg_rewind/t/RewindTest.pm| 67 +- src/common/Makefile | 2 + src/common/archive.c | 97 + src/common/fe_archive.c | 106 +++ src/include/common/archive.h | 21 + src/include/common/fe_archive.h | 18 src/tools/msvc/Mkvcbuild.pm | 8 +- 13 files changed, 443 insertions(+), 81 deletions(-) create mode 100644 src/common/archive.c create mode 100644 src/common/fe_archive.c create mode 100644 src/include/common/archive.h create mode 100644 src/include/common/fe_archive.h diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 42d29edd4e..64a6942031 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,11 +66,11 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the - target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or - fetched on startup by configuring or - . The use of + target cluster ran for a long time after the divergence, its old WAL + files might no longer be present. In this case, you can manually copy them + from the WAL archive to the pg_wal directory, or run + pg_rewind with the -c option to + automatically retrieve them from the WAL archive. The use of pg_rewind is not limited to failover, e.g.
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 2020-01-24 08:50, Michael Paquier wrote: On Wed, Jan 22, 2020 at 12:55:30AM +0300, Alexander Korotkov wrote: On Sun, Jan 19, 2020 at 1:24 PM Michael Paquier wrote: +static int +RestoreArchivedWALFile(const char *path, const char *xlogfname, + off_t expectedSize, const char *restoreCommand) Not a fan of putting that to pg_rewind/parsexlog.c. It has nothing to do with WAL parsing, and it seems to me that we could have an argument for making this available as a frontend-only API in src/common/. I've put it into src/common/fe_archive.c. This split makes sense. You have forgotten to update @pgcommonfrontendfiles in Mkvcbuild.pm for the MSVC build. Still, I can see that we have a lot of duplication between the code of the frontend and the backend, which is annoying.. The execution part is tricky to split because the backend has pre- and post- callbacks, the interruption handling is not the same and there is the problem of elog() calls, with elevel that can be changed depending on the backend configuration. However, I can see one portion of the code which is fully duplicated and could be refactored out: the construction of the command to execute, by having in input the restore_command string and the arguments that we expect to replace in the '%' markers which are part of the command. OK, I agree that duplication of this part directly is annoying. I did a refactoring and moved this restore_command building code into common/archive. Separate file was needed, since fe_archive is frontend-only, so I cannot put universal code there. I do not know is there any better place for it. If NULL is specified for one of those values, then the construction routine returns NULL, synonym of failure. And the result of the routine is the built command. I think that you would need an extra argument to give space for the error message generated in the event of an error when building the command. I did it in a bit different way, but the overall logic is exactly as you proposed, I hope. Also I have checked win/linux build in the similar to cfbot CI setup and now everything runs well. +++ b/src/include/common/fe_archive.h +#ifndef ARCHIVE_H +#define ARCHIVE_H This should be FE_ARCHIVE_H. Changed. - while ((c = getopt_long(argc, argv, "D:nNPR", long_options, _index)) != -1) + while ((c = getopt_long(argc, argv, "D:nNPRC:c", long_options, _index)) != -1) This still includes 'C', but that should not be the case. + pg_rewind with the -c or + -C option to automatically retrieve them from the WAL [...] + -C restore_command + --target-restore-command=restore_command [...] + available, try re-running pg_rewind with + the -c or -C option to search + for the missing files in the WAL archive. Three places in the docs need to be cleaned up. I have fixed all the above, thanks. Do we really need to test the new "archive" mode as well for 003_extrafiles and 002_databases? I would imagine that only 001_basic is enough. We do not check any unique code paths with it, so I do it only for 001_basic now. I have left it as a test mode, since it will be easy to turn this on for any other test in the future if needed and it fits well RewindTest.pm module. +# Moves WAL files to the temporary location and returns restore_command +# to get them back. +sub move_wal +{ + my ($tmp_dir, $master_pgdata) = @_; + my $wal_archive_path = "$tmp_dir/master_wal_archive"; + my $wal_path = "$master_pgdata/pg_wal"; + my $wal_dir; + my $restore_command; I think that this routine is wrongly designed. 
First, in order to copy the contents from one path to another, we have RecursiveCopy::copy_path, and there is a long history about making it safe for the TAP tests. Second, there is in PostgresNode::enable_restoring already a code path doing the decision-making on how building restore_command. We should keep this code path unique. Yeah, thank you for pointing me to the RecursiveCopy::copypath and especially PostgresNode::enable_restoring, I have modified test to use them instead. The copypath does not work with existing destination directories and does not preserve initial permissions, so I had to do some dirty work to achieve what we need in the test and keep it simple in the same time. However, I think that using these unified routines is much better for a future support. New version is attached. Do you have any other comments or objections? Regards -- Alexey From 7af0b3642f6218c764eee361e013f24bfb43ffbe Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Fri, 31 Jan 2020 20:08:13 +0300 Subject: [PATCH v14] pg_rewind: Add options to restore WAL files from archive Currently, pg_rewind fails when it could not find required WAL files in the target data directory. One have to manually figure out which WAL files are required and copy them back from archive.
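For readers not familiar with restore_command, the command-building step that was moved into src/common essentially performs the usual %p/%f/%r substitution. The following is a simplified sketch of that idea, not the exact routine from the patch (the function name and truncation handling are illustrative):

	/*
	 * Replace %p with the path to restore the segment to, %f with the WAL
	 * file name, and %r with the last restart point file name; %% yields a
	 * literal %.  Returns false if the result does not fit into 'size'.
	 */
	static bool
	build_restore_command(char *result, size_t size, const char *restoreCommand,
						  const char *xlogpath, const char *xlogfname,
						  const char *lastRestartPointFname)
	{
		char	   *dp = result;
		char	   *endp = result + size - 1;
		const char *sp;

		for (sp = restoreCommand; *sp && dp < endp; sp++)
		{
			if (sp[0] == '%' && (sp[1] == 'p' || sp[1] == 'f' || sp[1] == 'r'))
			{
				const char *subst = (sp[1] == 'p') ? xlogpath :
					(sp[1] == 'f') ? xlogfname : lastRestartPointFname;

				strlcpy(dp, subst, endp - dp + 1);
				dp += strlen(dp);
				sp++;			/* skip the marker character */
			}
			else if (sp[0] == '%' && sp[1] == '%')
			{
				*dp++ = '%';	/* %% produces a literal % */
				sp++;
			}
			else
				*dp++ = *sp;
		}
		*dp = '\0';

		return *sp == '\0';		/* false if we ran out of space */
	}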
Re: Physical replication slot advance is not persistent
On 30.01.2020 05:19, Michael Paquier wrote: On Wed, Jan 29, 2020 at 05:10:20PM +0900, Kyotaro Horiguchi wrote: Looks perfect. Thanks Horiguchi-san and others. Applied and back-patched down to 11. Great! Thanks for getting this done. -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Physical replication slot advance is not persistent
On 28.01.2020 15:14, Kyotaro Horiguchi wrote: At Tue, 28 Jan 2020 17:45:47 +0800, Craig Ringer wrote in On Tue, 28 Jan 2020 at 16:01, Michael Paquier wrote: So attached is an updated patch which addresses the problem just by marking a physical slot as dirty if any advancing is done. Some documentation is added about the fact that an advance is persistent only at the follow-up checkpoint. And the tests are fixed to not use a fake LSN but instead advance to the latest LSN position produced. Any objections? LGTM. Thankyou. I agree not to save slots immediately. The code is wrtten as described above. The TAP test is correct. +1, removing this broken saving code path from pg_replication_slot_advance and marking slot as dirty looks good to me. It solves the issue and does not add any unnecessary complexity. But the doc part looks a bit too detailed to me. Couldn't we explain that without the word 'dirty'? -and it will not be moved beyond the current insert location. Returns -name of the slot and real position to which it was advanced to. +and it will not be moved beyond the current insert location. Returns +name of the slot and real position to which it was advanced to. The +updated slot is marked as dirty if any advancing is done, with its +information being written out at the follow-up checkpoint. In the +event of a crash, the slot may return to an earlier position. and it will not be moved beyond the current insert location. Returns name of the slot and real position to which it was advanced to. The information of the updated slot is scheduled to be written out at the follow-up checkpoint if any advancing is done. In the event of a crash, the slot may return to an earlier position. Just searched through the *.sgml files, we already use terms 'dirty' and 'flush' applied to writing out pages during checkpoints. Here we are trying to describe the very similar process, but in relation to replication slots, so it looks fine for me. In the same time, the term 'schedule' is used for VACUUM, constraint check or checkpoint itself. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Re: Physical replication slot advance is not persistent
On 09.01.2020 09:36, Kyotaro Horiguchi wrote: Hello. At Sun, 29 Dec 2019 15:12:16 +0300, Alexey Kondratov wrote in On 2019-12-26 16:35, Alexey Kondratov wrote: Another concern is that ReplicationSlotIsDirty is added with only one user. It also cannot be used by SaveSlotToPath due to the simultaneous usage of both flags dirty and just_dirtied there. Therefore, I think that we should call ReplicationSlotSave unconditionally in pg_replication_slot_advance, so the slot will be saved or not automatically based on the slot->dirty flag. At the same time, ReplicationSlotsComputeRequiredXmin and ReplicationSlotsComputeRequiredLSN should be called by anyone who modifies the xmin and LSN fields in the slot. Otherwise, we are currently getting some leaky abstractions. Sounds reasonable. Great, so it seems that we have reached some agreement about who should mark the slot as dirty, at least for now. If someone utilizes old WAL and after that a crash happens between steps 2) and 3), then we restart with the old value of restart_lsn, but without the required WAL. I do not know how to properly reproduce it without gdb and a power-off, so the chance is pretty low, but it could still be the case. In the first place we advance required LSN for every reply message but save slot data only at checkpoint on physical replication. Such a strict guarantee seems too much. ... I think we shouldn't touch the paths used by the replication protocol. And shouldn't we focus on how we make a change to a replication slot from the SQL interface persistent? It seems to me that generally we don't need to save dirty slots other than at checkpoint, but the SQL function seems to want the change to be saved immediately. As a result, please find the attached, which follows only the first paragraph cited above. OK, I have definitely overthought that, thanks. This looks like a minimal subset of changes that actually solves the bug. I would only prefer to keep some additional comments (something like the attached), otherwise after half a year it will again be unclear why we save the slot unconditionally here. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index bb69683e2a..084e0c2960 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -370,6 +370,11 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) MyReplicationSlot->data.restart_lsn = moveto; SpinLockRelease(&MyReplicationSlot->mutex); retlsn = moveto; + + ReplicationSlotMarkDirty(); + + /* We moved restart_lsn, update the global value. */ + ReplicationSlotsComputeRequiredLSN(); } return retlsn; @@ -564,7 +569,10 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) (uint32) (moveto >> 32), (uint32) moveto, (uint32) (minlsn >> 32), (uint32) minlsn))); - /* Do the actual slot update, depending on the slot type */ + /* + * Do the actual slot update, depending on the slot type. Slot will be + * marked as dirty by pg_*_replication_slot_advance if changed. + */ if (OidIsValid(MyReplicationSlot->data.database)) endlsn = pg_logical_replication_slot_advance(moveto); else @@ -573,14 +581,11 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) values[0] = NameGetDatum(&MyReplicationSlot->data.name); nulls[0] = false; - /* Update the on disk state when lsn was updated.
*/ - if (XLogRecPtrIsInvalid(endlsn)) - { - ReplicationSlotMarkDirty(); - ReplicationSlotsComputeRequiredXmin(false); - ReplicationSlotsComputeRequiredLSN(); - ReplicationSlotSave(); - } + /* + * Update the on disk state. No work here if + * pg_*_replication_slot_advance call was a no-op. + */ + ReplicationSlotSave(); ReplicationSlotRelease();
Re: Allow CLUSTER, VACUUM FULL and REINDEX to change tablespace on the fly
On 2019-12-02 11:21, Michael Paquier wrote: On Wed, Nov 27, 2019 at 08:47:06PM +0300, Alexey Kondratov wrote: Thus, I cannot quite get your point here. Can you please elaborate a little bit more on your concerns? The case of REINDEX CONCURRENTLY is pretty simple, because a new relation which is a copy of the old relation is created before doing the reindex, so you simply need to set the tablespace OID correctly in index_concurrently_create_copy(). And actually, I think that the computation is incorrect because we need to check for MyDatabaseTableSpace as well, no? The case of REINDEX is more tricky, because you are working on a relation that already exists, hence I think that you need to do a different thing before the actual REINDEX: 1) Update the existing relation's pg_class tuple to point to the new tablespace. 2) Do a CommandCounterIncrement. So I think that the order of the operations you are doing is incorrect, and that you have a risk of breaking the existing tablespace assignment logic done when first flushing a new relfilenode. This actually brings an extra thing: when doing a plain REINDEX you need to make sure that the past relfilenode of the relation goes away properly. The attached POC patch does that before doing the CCI which is a bit ugly, but that's enough to show my point, and there is no need to touch RelationSetNewRelfilenode() this way. OK, I hope that now I understand your concerns better. Another thing I just realised is that RelationSetNewRelfilenode is also used for mapped relations, which are not movable at all, so adding a tablespace option there does not seem semantically correct either. However, I still have not found a way to actually break anything with my previous version of the patch. As for doing RelationDropStorage before CCI, I do not think that there is anything wrong with it; this is exactly what RelationSetNewRelfilenode does. I have only moved RelationDropStorage before CatalogTupleUpdate compared to your proposal to match the order inside RelationSetNewRelfilenode. Your patch has forgotten to update copyfuncs.c and equalfuncs.c with the new tablespace string field. It would be nice to add tab completion for this new clause in psql. This is not ready for committer yet in my opinion, and more work is needed, so I am marking it as returned with feedback for now. Finally, I have also merged and unified all your and Masahiko's proposals with my recent changes: ereport corrections, tab-completion, docs update, copy/equalfuncs update, etc. New version is attached. Has it come any closer to a committable state now? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company From 81a9df46db91b6ba47e11f53e6e2379f835b576f Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Mon, 30 Dec 2019 20:00:37 +0300 Subject: [PATCH v7] Allow REINDEX to change tablespace REINDEX already does a full relation rewrite; this patch adds the possibility to specify a new tablespace where the new relfilenode will be created.
--- doc/src/sgml/ref/reindex.sgml | 24 +- src/backend/catalog/index.c | 75 -- src/backend/commands/cluster.c| 2 +- src/backend/commands/indexcmds.c | 96 --- src/backend/commands/tablecmds.c | 2 +- src/backend/nodes/copyfuncs.c | 1 + src/backend/nodes/equalfuncs.c| 1 + src/backend/parser/gram.y | 14 ++-- src/backend/tcop/utility.c| 6 +- src/bin/psql/tab-complete.c | 6 ++ src/include/catalog/index.h | 7 +- src/include/commands/defrem.h | 6 +- src/include/nodes/parsenodes.h| 1 + src/test/regress/input/tablespace.source | 49 src/test/regress/output/tablespace.source | 66 15 files changed, 323 insertions(+), 33 deletions(-) diff --git a/doc/src/sgml/ref/reindex.sgml b/doc/src/sgml/ref/reindex.sgml index 5aa59d3b751..cf6bd88c128 100644 --- a/doc/src/sgml/ref/reindex.sgml +++ b/doc/src/sgml/ref/reindex.sgml @@ -21,7 +21,7 @@ PostgreSQL documentation -REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name +REINDEX [ ( option [, ...] ) ] { INDEX | TABLE | SCHEMA | DATABASE | SYSTEM } [ CONCURRENTLY ] name [ TABLESPACE new_tablespace ] where option can be one of: @@ -169,6 +169,28 @@ REINDEX [ ( option [, ...] ) ] { IN + +TABLESPACE + + + This specifies a tablespace, where all rebuilt indexes will be created. + Cannot be used with "mapped" relations. If SCHEMA, + DATABASE or SYSTEM is specified, then + all unsuitable relations will be skipped and a single WARNING + will be generated. + + + + + +new_tablespace + + + The name of the specific tablespace to store rebui
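[Editor's note: a minimal, hypothetical C sketch of the ordering Michael describes in the message above for a plain REINDEX that moves an index to a new tablespace. It is illustrative only, not the submitted patch; the variables iRel (the index relation) and newTablespaceOid are assumed to be supplied by the caller.]

    /*
     * Illustrative sketch only: drop the old storage, repoint pg_class at the
     * new tablespace, then make the catalog change visible before the new
     * relfilenode is created.  Assumes iRel and newTablespaceOid exist.
     */
    Relation	pg_class;
    HeapTuple	tuple;
    Form_pg_class rd_rel;

    /* 1. Get rid of the old relfilenode's storage, as RelationSetNewRelfilenode() does. */
    RelationDropStorage(iRel);

    /* 2. Update the relation's pg_class tuple to point at the new tablespace. */
    pg_class = table_open(RelationRelationId, RowExclusiveLock);
    tuple = SearchSysCacheCopy1(RELOID, ObjectIdGetDatum(RelationGetRelid(iRel)));
    rd_rel = (Form_pg_class) GETSTRUCT(tuple);
    rd_rel->reltablespace = (newTablespaceOid == MyDatabaseTableSpace) ?
        InvalidOid : newTablespaceOid;
    CatalogTupleUpdate(pg_class, &tuple->t_self, tuple);
    table_close(pg_class, RowExclusiveLock);

    /* 3. Make the catalog change visible before the new relfilenode is flushed. */
    CommandCounterIncrement();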
Re: Physical replication slot advance is not persistent
On 2019-12-26 16:35, Alexey Kondratov wrote: Another concern is that ReplicationSlotIsDirty is added with only one user. It also cannot be used by SaveSlotToPath due to the simultaneous usage of both flags dirty and just_dirtied there. Therefore, I think that we should call ReplicationSlotSave unconditionally in pg_replication_slot_advance, so the slot will be saved or not automatically based on the slot->dirty flag. At the same time, ReplicationSlotsComputeRequiredXmin and ReplicationSlotsComputeRequiredLSN should be called by anyone who modifies the xmin and LSN fields in the slot. Otherwise, we are currently getting some leaky abstractions. It seems that there was even a race in the order of actions inside pg_replication_slot_advance; it did the following: - ReplicationSlotMarkDirty(); - ReplicationSlotsComputeRequiredXmin(false); - ReplicationSlotsComputeRequiredLSN(); - ReplicationSlotSave(); 1) Mark the slot as dirty, which actually does nothing immediately other than setting the dirty flag; 2) Compute the new global required LSN; 3) Flush the slot state to disk. If someone utilizes old WAL and after that a crash happens between steps 2) and 3), then we restart with the old value of restart_lsn, but without the required WAL. I do not know how to properly reproduce it without gdb and a power-off, so the chance is pretty low, but it could still be the case. Logical slots were not affected, again because there was a proper order of operations (with comments) and slot flushing routines inside LogicalConfirmReceivedLocation. Thus, in the attached patch I have decided not to perform slot flushing in pg_replication_slot_advance at all and to do it in pg_physical_replication_slot_advance instead, as it is done in LogicalConfirmReceivedLocation. Since this bugfix has not moved forward during the week, I will put it into the 01.2020 commitfest. Kyotaro, if you do not object I will add you as a reviewer, as you have already given a lot of feedback, thank you for that! Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company From c9446bfc60ae125adea476bcc5b7d50f4bafa38b Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Sun, 29 Dec 2019 14:38:13 +0300 Subject: [PATCH v3] Make physical slot advance to be persistent Starting from v11, the pg_replication_slot_advance result was not persistent after restart. This patch fixes that bug by properly flushing the slot if it was modified. --- src/backend/replication/logical/logical.c | 10 ++-- src/backend/replication/slotfuncs.c | 31 +++ 2 files changed, 29 insertions(+), 12 deletions(-) diff --git a/src/backend/replication/logical/logical.c b/src/backend/replication/logical/logical.c index 7e06615864..bdb82da39b 100644 --- a/src/backend/replication/logical/logical.c +++ b/src/backend/replication/logical/logical.c @@ -1049,7 +1049,10 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn) SpinLockRelease(&MyReplicationSlot->mutex); - /* first write new xmin to disk, so we know what's up after a crash */ + /* + * First write new xmin and/or LSN to disk, so we know what's up + * after a crash. + */ if (updated_xmin || updated_restart) { ReplicationSlotMarkDirty(); @@ -1057,6 +1060,10 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn) elog(DEBUG1, "updated xmin: %u restart: %u", updated_xmin, updated_restart); } + /* Compute global required LSN if restart_lsn was changed */ + if (updated_restart) + ReplicationSlotsComputeRequiredLSN(); + /* * Now the new xmin is safely on disk, we can let the global value * advance.
We do not take ProcArrayLock or similar since we only @@ -1070,7 +1077,6 @@ LogicalConfirmReceivedLocation(XLogRecPtr lsn) SpinLockRelease(&MyReplicationSlot->mutex); ReplicationSlotsComputeRequiredXmin(false); - ReplicationSlotsComputeRequiredLSN(); } } else diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index ba08ad405f..9d0e99afb8 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -370,6 +370,19 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) MyReplicationSlot->data.restart_lsn = moveto; SpinLockRelease(&MyReplicationSlot->mutex); retlsn = moveto; + + /* + * Dirty the slot as we updated data that is meant to be persistent + * between restarts, flush it and re-compute global required LSN. + * + * If we change the order of operations and do not flush slot before + * required LSN computing, then there will be a race. Least recent WAL + * segments may be already utilized, so if we crash and start again with + * old restart_lsn there will be no WAL to proceed properly. + */ + ReplicationSlotMarkDirty(); + ReplicationSlotSave(); + ReplicationSlotsComputeRequiredLSN(); } return retlsn; @@ -463,6 +476,10 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto) if (ctx->reader->EndRecPtr != InvalidXLogRecPtr) { +
Re: Physical replication slot advance is not persistent
On 26.12.2019 11:33, Kyotaro Horiguchi wrote: At Wed, 25 Dec 2019 20:28:04 +0300, Alexey Kondratov wrote in Yep, it helps with physical replication slot persistence after advance, but the whole validation (moveto <= endlsn) does not make sense to me. The value of moveto should be >= minlsn == confirmed_flush / restart_lsn, while endlsn == retlsn is also always initialized with confirmed_flush / restart_lsn. Thus, your condition seems to be true in any case, even if it was a no-op, which is what we intended to catch. ... If I get it correctly, then we already keep the previous slot position in minlsn, so we just have to compare endlsn with minlsn and treat endlsn <= minlsn as a no-op without slot state flushing. I think you're right about the condition. (endlsn cannot be less than minlsn, though) But I came to think that we shouldn't use locations in that decision. Attached is a patch that does this, so it fixes the bug without affecting any user-facing behavior. A detailed comment section and DEBUG output are also added. What do you think now? I have also forgotten to mention that all versions down to 11.0 should be affected by this bug. pg_replication_slot_advance is the only caller of pg_logical/physical_replication_slot_advance, so there's no apparent determinant on who-does-what about dirtying and other housekeeping calculation like the *ComputeRequired*() functions, but the current shape seems kind of inconsistent between logical and physical. I think pg_logical/physical_replication_slot_advance should dirty the slot if they actually changed anything. And pg_replication_slot_advance should do the housekeeping if the slots are dirtied. (Otherwise the caller function should dirty the slot in lieu of the two.) The attached does that. Both approaches look fine to me: my last patch with as minimal an intervention as possible, and your refactoring. I think that it is the right direction to let everyone who modifies slot->data also mark the slot as dirty. I found one comment section in your code rather misleading: + /* + * We don't need to dirty the slot only for the above change, but dirty + * this slot for the same reason with + * pg_logical_replication_slot_advance. + */ We just modified MyReplicationSlot->data, which is "On-Disk data of a replication slot, preserved across restarts.", so it definitely should be marked as dirty, not because pg_logical_replication_slot_advance does the same. Also, I think that using this transient variable in ReplicationSlotIsDirty is not necessary. MyReplicationSlot is already a pointer to the slot in shared memory. + ReplicationSlot *slot = MyReplicationSlot; + + Assert(MyReplicationSlot != NULL); + + SpinLockAcquire(&slot->mutex); Otherwise it looks fine to me, so attached is the same diff, but with these proposed corrections. Another concern is that ReplicationSlotIsDirty is added with only one user. It also cannot be used by SaveSlotToPath due to the simultaneous usage of both flags dirty and just_dirtied there. Therefore, I think that we should call ReplicationSlotSave unconditionally in pg_replication_slot_advance, so the slot will be saved or not automatically based on the slot->dirty flag. At the same time, ReplicationSlotsComputeRequiredXmin and ReplicationSlotsComputeRequiredLSN should be called by anyone who modifies the xmin and LSN fields in the slot. Otherwise, we are currently getting some leaky abstractions.
Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company diff --git a/src/backend/replication/slot.c b/src/backend/replication/slot.c index 21ae8531b3..edf661521a 100644 --- a/src/backend/replication/slot.c +++ b/src/backend/replication/slot.c @@ -672,6 +672,23 @@ ReplicationSlotMarkDirty(void) SpinLockRelease(&slot->mutex); } +/* + * Verify whether currently acquired slot is dirty. + */ +bool +ReplicationSlotIsDirty(void) +{ + bool dirty; + + Assert(MyReplicationSlot != NULL); + + SpinLockAcquire(&MyReplicationSlot->mutex); + dirty = MyReplicationSlot->dirty; + SpinLockRelease(&MyReplicationSlot->mutex); + + return dirty; +} + /* * Convert a slot that's marked as RS_EPHEMERAL to a RS_PERSISTENT slot, * guaranteeing it will be there after an eventual crash. diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 6683fc3f9b..d7a16a9071 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -370,6 +370,12 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) MyReplicationSlot->data.restart_lsn = moveto; SpinLockRelease(&MyReplicationSlot->mutex); retlsn = moveto; + + /* + * Dirty the slot as we updated data that is meant to be + * persistent on disk. + */ + ReplicationSlotMarkDirty(); } return retlsn; @@ -574,9 +580,8 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) nulls[0] = false;
Re: Physical replication slot advance is not persistent
On 25.12.2019 16:51, Alexey Kondratov wrote: On 25.12.2019 07:03, Kyotaro Horiguchi wrote: As a result, I think what is needed there is just checking whether the returned lsn is equal to or larger than moveto. Doesn't the following change work? - if (XLogRecPtrIsInvalid(endlsn)) + if (moveto <= endlsn) Yep, it helps with physical replication slot persistence after advance, but the whole validation (moveto <= endlsn) does not make sense to me. The value of moveto should be >= minlsn == confirmed_flush / restart_lsn, while endlsn == retlsn is also always initialized with confirmed_flush / restart_lsn. Thus, your condition seems to be true in any case, even if it was a no-op, which is what we intended to catch. I will recheck everything again and try to come up with something during this week. If I get it correctly, then we already keep the previous slot position in minlsn, so we just have to compare endlsn with minlsn and treat endlsn <= minlsn as a no-op without slot state flushing. Attached is a patch that does this, so it fixes the bug without affecting any user-facing behavior. A detailed comment section and DEBUG output are also added. What do you think now? I have also forgotten to mention that all versions down to 11.0 should be affected by this bug. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company From e08299ddf92abc3fb4e802e8b475097fa746c458 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Wed, 25 Dec 2019 20:12:42 +0300 Subject: [PATCH v2] Make physical replslot advance persistent --- src/backend/replication/slotfuncs.c | 12 ++-- 1 file changed, 10 insertions(+), 2 deletions(-) diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 6683fc3f9b..bc5c93b089 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -573,9 +573,17 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) values[0] = NameGetDatum(&MyReplicationSlot->data.name); nulls[0] = false; - /* Update the on disk state when lsn was updated. */ - if (XLogRecPtrIsInvalid(endlsn)) + /* + * Update the on disk state when LSN was updated. Here we rely on the facts + * that: 1) minlsn is initialized with restart_lsn and confirmed_flush LSN for + * physical and logical replication slot respectively, and 2) endlsn is set in + * the same way by pg_*_replication_slot_advance, but after advance. Thus, + * endlsn <= minlsn is treated as a no-op. + */ + if (endlsn > minlsn) { + elog(DEBUG1, "flushing replication slot '%s' state", + NameStr(MyReplicationSlot->data.name)); ReplicationSlotMarkDirty(); ReplicationSlotsComputeRequiredXmin(false); ReplicationSlotsComputeRequiredLSN(); base-commit: 8ce3aa9b5914d1ac45ed3f9bc484f66b3c4850c7 -- 2.17.1
Re: Physical replication slot advance is not persistent
On 25.12.2019 07:03, Kyotaro Horiguchi wrote: At Tue, 24 Dec 2019 20:12:32 +0300, Alexey Kondratov wrote in I dug into the code and it happens because of this if statement: /* Update the on disk state when lsn was updated. */ if (XLogRecPtrIsInvalid(endlsn)) { ReplicationSlotMarkDirty(); ReplicationSlotsComputeRequiredXmin(false); ReplicationSlotsComputeRequiredLSN(); ReplicationSlotSave(); } Yes, it seems just broken. Attached is a small patch, which fixes this bug. I have tried to stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))' and now pg_logical_replication_slot_advance and pg_physical_replication_slot_advance return InvalidXLogRecPtr if no-op. What do you think? I think we shouldn't change the definition of pg_*_replication_slot_advance since the result is user-facing. Yes, that was my main concern too. OK. The functions return an invalid value only when the slot had the invalid value and failed to move the position. I think that happens only for uninitialized slots. Anyway, what we should do there is dirty the slot when the operation can be assumed to have succeeded. As a result, I think what is needed there is just checking whether the returned lsn is equal to or larger than moveto. Doesn't the following change work? - if (XLogRecPtrIsInvalid(endlsn)) + if (moveto <= endlsn) Yep, it helps with physical replication slot persistence after advance, but the whole validation (moveto <= endlsn) does not make sense to me. The value of moveto should be >= minlsn == confirmed_flush / restart_lsn, while endlsn == retlsn is also always initialized with confirmed_flush / restart_lsn. Thus, your condition seems to be true in any case, even if it was a no-op, which is what we intended to catch. Actually, if we do not want to change pg_*_replication_slot_advance, we can just add a straightforward validation that either confirmed_flush or restart_lsn changed after the slot advance guts have executed. It will be a little bit bulky, but much clearer, and it will never be affected by changes to the pg_*_replication_slot_advance logic. Another weird part I have found is this assignment inside pg_logical_replication_slot_advance: /* Initialize our return value in case we don't do anything */ retlsn = MyReplicationSlot->data.confirmed_flush; It looks redundant, since later we do the same assignment, which should be reachable in any case. I will recheck everything again and try to come up with something during this week. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
Physical replication slot advance is not persistent
Hi Hackers, I have accidentally noticed that pg_replication_slot_advance only changes the in-memory state of the slot when its type is physical. Its new value does not survive restart. Reproduction steps: 1) Create a new slot and remember its restart_lsn SELECT pg_create_physical_replication_slot('slot1', true); SELECT * from pg_replication_slots; 2) Generate some dummy WAL CHECKPOINT; SELECT pg_switch_wal(); CHECKPOINT; SELECT pg_switch_wal(); 3) Advance the slot to the value of pg_current_wal_insert_lsn() SELECT pg_replication_slot_advance('slot1', '0/160001A0'); 4) Check that restart_lsn has been updated SELECT * from pg_replication_slots; 5) Restart the server and check restart_lsn again. It should be the same as in step 1. I dug into the code and it happens because of this if statement: /* Update the on disk state when lsn was updated. */ if (XLogRecPtrIsInvalid(endlsn)) { ReplicationSlotMarkDirty(); ReplicationSlotsComputeRequiredXmin(false); ReplicationSlotsComputeRequiredLSN(); ReplicationSlotSave(); } Actually, endlsn is always a valid LSN after the execution of the replication slot advance guts. It works for logical slots only by chance, since there is an implicit ReplicationSlotMarkDirty() call inside LogicalConfirmReceivedLocation. Attached is a small patch, which fixes this bug. I have tried to stick to the same logic in this 'if (XLogRecPtrIsInvalid(endlsn))' and now pg_logical_replication_slot_advance and pg_physical_replication_slot_advance return InvalidXLogRecPtr if no-op. What do you think? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company P.S. CCed Simon and Michael as they are the last ones who seriously touched the pg_replication_slot_advance code. From 36d1fa2a89b3fb354a813354496df475ee11b62e Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 24 Dec 2019 18:21:50 +0300 Subject: [PATCH v1] Make physical replslot advance persistent --- src/backend/replication/slotfuncs.c | 17 + 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/src/backend/replication/slotfuncs.c b/src/backend/replication/slotfuncs.c index 46e6dd4d12..826708d3f6 100644 --- a/src/backend/replication/slotfuncs.c +++ b/src/backend/replication/slotfuncs.c @@ -358,12 +358,14 @@ pg_get_replication_slots(PG_FUNCTION_ARGS) * The LSN position to move to is compared simply to the slot's restart_lsn, * knowing that any position older than that would be removed by successive * checkpoints. + * + * Returns InvalidXLogRecPtr if no-op. */ static XLogRecPtr pg_physical_replication_slot_advance(XLogRecPtr moveto) { XLogRecPtr startlsn = MyReplicationSlot->data.restart_lsn; - XLogRecPtr retlsn = startlsn; + XLogRecPtr retlsn = InvalidXLogRecPtr; if (startlsn < moveto) { @@ -386,6 +388,8 @@ pg_physical_replication_slot_advance(XLogRecPtr moveto) * because we need to digest WAL to advance restart_lsn allowing to recycle * WAL and removal of old catalog tuples. As decoding is done in fast_forward * mode, no changes are generated anyway. + * + * Returns InvalidXLogRecPtr if no-op.
*/ static XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto) @@ -393,7 +397,7 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto) LogicalDecodingContext *ctx; ResourceOwner old_resowner = CurrentResourceOwner; XLogRecPtr startlsn; - XLogRecPtr retlsn; + XLogRecPtr retlsn = InvalidXLogRecPtr; PG_TRY(); { @@ -414,9 +418,6 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto) */ startlsn = MyReplicationSlot->data.restart_lsn; - /* Initialize our return value in case we don't do anything */ - retlsn = MyReplicationSlot->data.confirmed_flush; - /* invalidate non-timetravel entries */ InvalidateSystemCaches(); @@ -480,9 +481,9 @@ pg_logical_replication_slot_advance(XLogRecPtr moveto) * better than always losing the position even on clean restart. */ ReplicationSlotMarkDirty(); - } - retlsn = MyReplicationSlot->data.confirmed_flush; + retlsn = MyReplicationSlot->data.confirmed_flush; + } /* free context, call shutdown callback */ FreeDecodingContext(ctx); @@ -575,7 +576,7 @@ pg_replication_slot_advance(PG_FUNCTION_ARGS) nulls[0] = false; /* Update the on disk state when lsn was updated. */ - if (XLogRecPtrIsInvalid(endlsn)) + if (!XLogRecPtrIsInvalid(endlsn)) { ReplicationSlotMarkDirty(); ReplicationSlotsComputeRequiredXmin(false); -- 2.17.1
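[Editor's note: a self-contained toy model in C (not server code) of why the pre-fix check discussed above never fires: endlsn is initialized from the slot's already-valid restart_lsn/confirmed_flush, so XLogRecPtrIsInvalid(endlsn) is false whether or not the slot actually moved, and a physical slot is never flushed. The LSN values are arbitrary examples.]

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t XLogRecPtr;
    #define InvalidXLogRecPtr       ((XLogRecPtr) 0)
    #define XLogRecPtrIsInvalid(r)  ((r) == InvalidXLogRecPtr)

    int
    main(void)
    {
        /* Toy values: the slot already has a valid restart_lsn. */
        XLogRecPtr restart_lsn = 0x16000028;
        XLogRecPtr moveto = 0x160001A0;

        /* Model of the advance guts: return the position we ended at,
         * which is a valid LSN whether or not the slot actually moved. */
        XLogRecPtr endlsn = (restart_lsn < moveto) ? moveto : restart_lsn;

        /* The pre-fix condition: never true here, so the slot is never saved. */
        if (XLogRecPtrIsInvalid(endlsn))
            printf("slot state would be flushed\n");
        else
            printf("slot state is NOT flushed (endlsn %llX is valid)\n",
                   (unsigned long long) endlsn);

        return 0;
    }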
Re: [PATCH] Increase the maximum value track_activity_query_size
On 19.12.2019 20:52, Robert Haas wrote: On Thu, Dec 19, 2019 at 10:59 AM Tom Lane wrote: Bruce Momjian writes: Good question. I am in favor of allowing a larger value if no one objects. I don't think adding the min/max is helpful. The original poster. And probably anyone else who debugs stuck queries of yet another crazy ORM. Yes, one could use log_min_duration_statement, but having the possibility to get it directly from pg_stat_activity without eyeballing the logs is nice. Also, IIRC log_min_duration_statement applies only to completed statements. I think there are pretty obvious performance and memory-consumption penalties to very large track_activity_query_size values. Who exactly are we really helping if we let them set it to huge values? (wanders away wondering if we have suitable integer-overflow checks in relevant code paths...) The value of pgstat_track_activity_query_size is in bytes, so setting it to any value below INT_MAX seems to be safe from that perspective. However, being multiplied by NumBackendStatSlots, its reasonable value should be far below INT_MAX (~2 GB). It does not look to me like something badly needed, but still. We already have hundreds of GUCs and it is easy for a user to build a sub-optimal configuration, so does this overprotection make sense? Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company
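[Editor's note: to make the memory argument above concrete, here is a small self-contained C calculation. The slot count and query size are assumed example values, not PostgreSQL defaults: the shared-memory cost of the activity strings is roughly track_activity_query_size multiplied by the number of backend status slots.]

    #include <stdio.h>

    int
    main(void)
    {
        /* Assumed example values, not real configuration defaults. */
        long track_activity_query_size = 1024L * 1024L;  /* 1 MB per backend */
        long backend_stat_slots = 500 + 50;              /* max_connections plus aux processes (assumed) */

        long total_bytes = track_activity_query_size * backend_stat_slots;

        printf("activity text area: %ld bytes (~%ld MB)\n",
               total_bytes, total_bytes / (1024 * 1024));
        return 0;
    }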
Re: [Patch] pg_rewind: options to use restore_command from recovery.conf or command line
On 01.12.2019 5:57, Michael Paquier wrote: On Thu, Sep 26, 2019 at 03:08:22PM +0300, Alexey Kondratov wrote: As Alvaro correctly pointed out in the nearby thread [1], we've got an interference regarding the -R command line argument. I agree that it's a good idea to reserve -R for the recovery configuration write to be consistent with pg_basebackup, so I've updated my patch to use other letters: The patch has rotten and does not apply anymore. Could you please send a rebased version? I have moved the patch to the next CF, waiting on author for now. A rebased and updated patch is attached. There was a problem with testing the new restore_command options together with the recent ensureCleanShutdown. My test simply moves all WAL from pg_wal and generates a restore_command for testing the new options, but this prevents the startup recovery required by ensureCleanShutdown. To test both at the same time we have to leave some recent WAL segments in pg_wal and be sure that they are enough for startup recovery, but not enough for a successful pg_rewind run. I have manually figured out the required amount of inserted records (and generated WAL) to achieve this. However, I think that this approach is not good for a test, since tests may be modified in the future (the amount of writes to the DB changed) or even the volume of WAL written by Postgres may change. That would lead to tests that falsely always fail or always pass. Moreover, testing both ensureCleanShutdown and the new options at the same time doesn't hit the new code paths, so I decided to test the new options with --no-ensure-shutdown for simplicity and stability of the tests. Regards -- Alexey Kondratov Postgres Professional https://www.postgrespro.com Russian Postgres Company From a05c3343e0bd6fe339c944f6b0cde64ceb46a0b3 Mon Sep 17 00:00:00 2001 From: Alexey Kondratov Date: Tue, 19 Feb 2019 19:14:53 +0300 Subject: [PATCH v11] pg_rewind: options to use restore_command from command line or cluster config Previously, when pg_rewind could not find required WAL files in the target data directory, the rewind process would fail. One had to manually figure out which of the required WAL files had already been moved to archival storage and copy them back. This patch adds the possibility to specify restore_command via a command line option or to use the one specified inside postgresql.conf. The specified restore_command will be used for automatic retrieval of missing WAL files from archival storage. --- doc/src/sgml/ref/pg_rewind.sgml | 49 +++- src/bin/pg_rewind/parsexlog.c | 164 +- src/bin/pg_rewind/pg_rewind.c | 118 +++--- src/bin/pg_rewind/pg_rewind.h | 6 +- src/bin/pg_rewind/t/001_basic.pl | 4 +- src/bin/pg_rewind/t/002_databases.pl | 4 +- src/bin/pg_rewind/t/003_extrafiles.pl | 4 +- src/bin/pg_rewind/t/RewindTest.pm | 105 - 8 files changed, 416 insertions(+), 38 deletions(-) diff --git a/doc/src/sgml/ref/pg_rewind.sgml b/doc/src/sgml/ref/pg_rewind.sgml index 42d29edd4e..b601a5c7e4 100644 --- a/doc/src/sgml/ref/pg_rewind.sgml +++ b/doc/src/sgml/ref/pg_rewind.sgml @@ -66,11 +66,12 @@ PostgreSQL documentation can be found either on the target timeline, the source timeline, or their common ancestor. In the typical failover scenario where the target cluster was shut down soon after the divergence, this is not a problem, but if the - target cluster ran for a long time after the divergence, the old WAL - files might no longer be present. In that case, they can be manually - copied from the WAL archive to the pg_wal directory, or - fetched on startup by configuring or - .
The use of + target cluster ran for a long time after the divergence, its old WAL + files might no longer be present. In this case, you can manually copy them + from the WAL archive to the pg_wal directory, or run + pg_rewind with the -c or + -C option to automatically retrieve them from the WAL + archive. The use of pg_rewind is not limited to failover, e.g. a standby server can be promoted, run some write transactions, and then rewinded to become a standby again. @@ -232,6 +233,39 @@ PostgreSQL documentation + + -c + --restore-target-wal + + +Use the restore_command defined in +postgresql.conf to retrieve WAL files from +the WAL archive if these files are no longer available in the +pg_wal directory of the target cluster. + + +This option cannot be used together with --target-restore-command. + + + + + + -C restore_command + --target-restore-command=restore_command + + +Specifies the restore_command to use for retrieving +WAL files from the WAL archive if these files are no longer available +in the pg_wal directory of the target cluster. + + +If restore_command is already