RE: State of pg_createsubscriber
Dear Amit, Robert,

> So, we have the following options: (a) by default drop the
> pre-existing subscriptions, (b) by default disable the pre-existing
> subscriptions, and add a Note in the docs that users can take
> necessary actions to enable or drop them. Now, we can even think of
> providing a switch to retain the pre-existing subscriptions or
> publications as the user may have some use case where it can be
> helpful for her. For example, retaining publications can help in
> creating a bi-directional setup.

Another point we should consider is replication slots. If the standby server had slots and they are forgotten, WAL files won't be discarded, so the disk will eventually fill up. v2-0004 proposed in [1] drops replication slots when their failover option is true. This partially solves the issue, but what should be done about the other slots?

[1]: https://www.postgresql.org/message-id/CANhcyEV6q1Vhd37i1axUeScLi0UAGVxta1LDa0BV0Eh--TcPMg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Pgoutput not capturing the generated columns
Dear Shubham,

Thanks for updating the patch! I checked your patches briefly. Here are my comments.

01. API
Since the option for test_decoding is enabled by default, I think it should be renamed, e.g., to "skip-generated-columns" or something similar.

02. ddl.sql
```
+-- check include-generated-columns option with generated column
+CREATE TABLE gencoltable (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a * 2) STORED);
+INSERT INTO gencoltable (a) VALUES (1), (2), (3);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-generated-columns', '1');
+data
+-
+ BEGIN
+ table public.gencoltable: INSERT: a[integer]:1 b[integer]:2
+ table public.gencoltable: INSERT: a[integer]:2 b[integer]:4
+ table public.gencoltable: INSERT: a[integer]:3 b[integer]:6
+ COMMIT
+(5 rows)
```
We should also test the non-default case, in which generated columns are not emitted.

03. ddl.sql
I am not sure the new tests are in the correct place. Should we add a new file and move the tests there? Thoughts?

04. protocol.sgml
Please keep the formatting conventions of the sgml file.

05. protocol.sgml
The option is implemented as a streaming option of the pgoutput plugin, so it should be located under the "Logical Streaming Replication Parameters" section.

06. AlterSubscription
```
+ if (IsSet(opts.specified_opts, SUBOPT_GENERATED_COLUMN))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("toggling generated_column option is not allowed.")));
+ }
```
If you don't want to support the option, you can remove the SUBOPT_GENERATED_COLUMN macro from the function. But can you clarify why you do not want to support it?

07. logicalrep_write_tuple
```
- if (!column_in_column_list(att->attnum, columns))
+ if (!column_in_column_list(att->attnum, columns) && !att->attgenerated)
+ continue;
+
+ if (att->attgenerated && !publish_generated_column)
  continue;
```
I think the changes from v2 were reverted or wrongly merged.

08. test code
Can you add tests confirming that generated columns are replicated by logical replication?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Dear Peter,

> You wrote "Fixed" for that patch v9-0004 suggestion but I don't think
> anything was changed at all. Accidentally missed?

Sorry, I forgot to run `git add` after the revision. The change is included in the new patch [1].

[1]: https://www.postgresql.org/message-id/OSBPR01MB25522052F9F3E3AAD3BA2A8CF5ED2%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled
Dear Amit, Ian,

> I guess it could be more work if we want to enhance the test for
> ERRORs other than the primary key violation. One simple fix is to
> update the log_offset to the location of the LOG after successful
> replication of un-conflicted data. For example, the Log location after
> we execute the below line in the test:
>
> # Check replicated data
> my $res =
>     $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tbl");
> is($res, $expected, $msg);

I made a patch for confirmation purposes. It works well in my environment. Ian, how about you?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

fix_029.diff
Description: fix_029.diff
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Dear Peter,

Thanks for reviewing! A new patch is available in [1].

> I'm having second thoughts about how these patches mention the option
> values "on|off". These are used in the ALTER SUBSCRIPTION document
> page for 'two_phase' and 'failover' parameters, and then those
> "on|off" get propagated to the code comments, error messages, and
> tests...
>
> Now I see that on the CREATE SUBSCRIPTION page [1], every boolean
> parameter (even including 'two_phase' and 'failover') is described in
> terms of "true|false" (not "on|off"). Hmm.

But I could find sentences like "The default value is off, ...". Also, the "on|off" notation is already used in alter_subscription.sgml. I'm not sure, but I feel there are no firm rules here.

> In hindsight, it is probably better to refer only to true|false
> everywhere for these boolean parameters, instead of sometimes using
> different values like on|off.
>
> What do you think?

It's OK for me to make the messages and code comments consistent. I'm not sure about the documentation, but I changed only my part.

[1]: https://www.postgresql.org/message-id/OSBPR01MB2552F66463EFCFD654E87C09F5E32%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
> ied as
> true./

Fixed.

> 7.
> +# Verify the prepared transaction still exists
> +$result = $node_subscriber->safe_psql('postgres',
> +    "SELECT count(*) FROM pg_prepared_xacts;");
> +is($result, q(1), "prepared transaction still exits");
> +
>
> TYPO: /exits/exists/

Fixed.

> ~~~
>
> 8.
> +# Alter the two_phase with the force_alter option. Apart from the above, the
> +# command will abort the prepared transaction and succeed.
> +$node_subscriber->safe_psql('postgres',
> +    "ALTER SUBSCRIPTION regress_sub SET (two_phase = off, force_alter = true);");
> +$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION regress_sub ENABLE;");
> +
>
> What does "Apart from the above" mean? Be more explicit.

Clarified to "Apart from the last ALTER SUBSCRIPTION command...".

> 9.
> +# Verify the prepared transaction are aborted
> $result = $node_subscriber->safe_psql('postgres',
>     "SELECT count(*) FROM pg_prepared_xacts;");
> is($result, q(0), "prepared transaction done by worker is aborted");
>
> /transaction are aborted/transaction was aborted/

Fixed.

[1]: https://www.postgresql.org/message-id/OSBPR01MB2552FEA48D265EA278AA9F7AF5E22%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Dear Peter,

Thanks for reviewing! The patch is available in [1].

> ==
> src/sgml/ref/alter_subscription.sgml
>
> 1.
> +
> + The two_phase parameter can only be altered when the
> + subscription is disabled. When altering the parameter from on
> + to off, the backend process checks prepared
> + transactions done by the logical replication worker and aborts them.
> +
>
> The text may be OK as-is, but I was wondering if it might be better to
> give a more verbose explanation.
>
> BEFORE
> ... the backend process checks prepared transactions done by the
> logical replication worker and aborts them.
>
> SUGGESTION
> ... the backend process checks for any incomplete prepared
> transactions done by the logical replication worker (from when
> two_phase parameter was still "on") and, if any are found, those are
> aborted.

Fixed.

> ==
> src/backend/commands/subscriptioncmds.c
>
> 2. AlterSubscription
>
> - /*
> - * Since the altering two_phase option of subscriptions
> - * also leads to the change of slot option, this command
> - * cannot be rolled back. So prevent we are in the
> - * transaction block.
> + * If two_phase was enabled, there is a possibility the
> + * transactions has already been PREPARE'd. They must be
> + * checked and rolled back.
> */
>
> BEFORE
> ... there is a possibility the transactions has already been PREPARE'd.
>
> SUGGESTION
> ... there is a possibility that transactions have already been PREPARE'd.

Fixed.

> 3. AlterSubscription
> + /*
> + * Since the altering two_phase option of subscriptions
> + * (especially on->off case) also leads to the
> + * change of slot option, this command cannot be rolled
> + * back. So prevent we are in the transaction block.
> + */
> PreventInTransactionBlock(isTopLevel,
>     "ALTER SUBSCRIPTION ... SET (two_phase = off)");
>
> This comment is a bit vague and includes some typos, but IIUC these
> problems will already be addressed by the 0002 patch changes. AFAIK
> patch 0003 is only moving the 0002 comment.

Yeah, the comment was updated accordingly.

> ==
> src/test/subscription/t/099_twophase_added.pl
>
> 4.
> +# Check the case that prepared transactions exist on the subscriber node
> +#
> +# If the two_phase is altering from "on" to "off" and there are prepared
> +# transactions on the subscriber, they must be aborted. This test checks it.
>
> Similar to the comment that I gave for v8-0002. I think there should
> be a comment for the major test comment to
> distinguish it from comments for the sub-steps.

Added.

> 5.
> +# Verify the prepared transaction are aborted because two_phase is changed to
> +# "off".
> +$result = $node_subscriber->safe_psql('postgres',
> +    "SELECT count(*) FROM pg_prepared_xacts;");
> +is($result, q(0), "prepared transaction done by worker is aborted");
>
> /the prepared transaction are aborted/any prepared transactions are aborted/

Fixed.

[1]: https://www.postgresql.org/message-id/OSBPR01MB2552FEA48D265EA278AA9F7AF5E22%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Improving the latch handling between logical replication launcher and worker processes.
Dear Vignesh,

Thanks for raising the idea!

> a) Introduce a new latch to handle worker attach and exit.

Just to confirm: there are three wait events for the launcher, so I feel we could create a latch per wait event. Is there a reason to introduce only one latch?

> b) Add a new GUC launcher_retry_time which gives more flexibility to
> users as suggested by Amit at [1]. Before 5a3a953, the
> wal_retrieve_retry_interval plays a similar role as the suggested new
> GUC launcher_retry_time, e.g. even if a worker is launched, the
> launcher only wait wal_retrieve_retry_interval time before next round.

Hmm. My concern is how users would estimate the value. The default would presumably be 3min, but should users change it? If so, to what? Even if it becomes tunable, I think they cannot tune it well.

> c) Don't reset the latch at worker attach and allow launcher main to
> identify and handle it. For this there is a patch v6-0002 available at
> [2].

Does this mean you want to remove ResetLatch() from WaitForReplicationWorkerAttach()? If so, what about this scenario: 1) the launcher is waiting in WaitForReplicationWorkerAttach() for the worker to attach, and 2) a subscription is created before the worker attaches. In this case, the launcher becomes unable to sleep because the latch is set but never reset, which may waste CPU time.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Dear Peter,

> ==
> Commit message
>
> 1.
> A detailed commit message is needed to describe the purpose and
> details of this patch.

Added.

> ==
> doc/src/sgml/ref/alter_subscription.sgml
>
> 2. CREATE SUBSCRIPTION
>
> Shouldn't there be an entry for "force_alter" parameter in the CREATE
> SUBSCRIPTION "parameters" section, instead of just vaguely mentioning
> it in passing when describing the "two_phase" in ALTER SUBSCRIPTION?
>
> 3. ALTER SUBSCRIPTION - alterable parameters
>
> And shouldn't this new option also be named in the ALTER SUBSCRIPTION
> list: "The parameters that can be altered are..."

Hmm, but the parameter is not accepted by CREATE SUBSCRIPTION. Should we change it to be accepted there and add the description to that page?

> ==
> src/backend/commands/subscriptioncmds.c
>
> 4.
> XLogRecPtr lsn;
> + bool twophase_force;
> } SubOpts;
>
> IMO this field ought to be called 'force_alter' to be the same as the
> option name. Sure, now it is only relevant for 'two_phase', but that
> might not always be the case in the future.

Modified.

> 5. AlterSubscription
>
> + /*
> + * Abort prepared transactions if force option is also
> + * specified. Otherwise raise an ERROR.
> + */
> + if (!opts.twophase_force)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot alter %s when there are prepared transactions",
> + "two_phase = false")));
>
> 5a.
> /if force option is also specified/only if the 'force_alter' option is true/

Modified.

> 5b.
> "two_phase = false" -- IMO that should say "two_phase = off"

Modified.

> 5c.
> IMO this ereport should include a errhint to tell the user they can
> use 'force_alter = true' to avoid getting this error.

A hint was added.

> 6.
> + /* force_alter cannot be used standalone */
> + if (IsSet(opts.specified_opts, SUBOPT_FORCE_ALTER) &&
> + !IsSet(opts.specified_opts, SUBOPT_TWOPHASE_COMMIT))
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("%s must be specified with %s",
> + "force_alter", "two_phase")));
>
> IMO this rule is not necessary so the code should be removed. I think
> using 'force_alter' standalone doesn't do anything at all (certainly,
> it does no harm) so why add more complications (more rules, more code,
> more tests) just for the sake of it?

Removed, so a standalone 'force_alter' is now a no-op.

> src/test/subscription/t/099_twophase_added.pl
>
> 7.
> +$node_subscriber->safe_psql('postgres',
> +    "ALTER SUBSCRIPTION sub SET (two_phase = off, force_alter = on);");
>
> "force" is a verb, so it is better to say 'force_alter = true' instead
> of 'force_alter = on'.

Fixed, though I am not sure it reads better since I am not a native speaker.

> 8.
> $result = $node_subscriber->safe_psql('postgres',
>     "SELECT count(*) FROM pg_prepared_xacts;");
> is($result, q(0), "prepared transaction done by worker is aborted");
>
> +$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION sub ENABLE;");
>
> I felt the ENABLE statement should be above the SELECT statement so
> that the code is more like it was before applying the patch.

Fixed.

Please see the attached patch set.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

v8-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description: v8-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
v8-0002-Alter-slot-option-two_phase-only-when-altering-on.patch
Description: v8-0002-Alter-slot-option-two_phase-only-when-altering-on.patch
v8-0003-Abort-prepared-transactions-while-altering-two_ph.patch
Description: v8-0003-Abort-prepared-transactions-while-altering-two_ph.patch
v8-0004-Add-force_alter-option-for-ALTER-SUBSCIRPTION-.-S.patch
Description: v8-0004-Add-force_alter-option-for-ALTER-SUBSCIRPTION-.-S.patch
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Dear Peter,

> Commit Message
>
> 1.
> The patch needs a commit message to describe the purpose and highlight
> any limitations and other details.

Added.

> ==
> doc/src/sgml/ref/alter_subscription.sgml
>
> 2.
> +
> + The two_phase parameter can only be altered when the
> + subscription is disabled. When altering the parameter from true
> + to false, the backend process checks prepared
> + transactions done by the logical replication worker and aborts them.
> +
>
> Here, the para is referring to "true" and "false" but earlier on this
> page it talks about "twophase = off". IMO it is better to use a
> consistent terminology like "on|off" everywhere instead of randomly
> changing the way it is described each time.

I checked the contents and changed them to "on|off".

> ==
> src/backend/commands/subscriptioncmds.c
>
> 3. AlterSubscription
>
> if (IsSet(opts.specified_opts, SUBOPT_TWOPHASE_COMMIT))
> {
> + List *prepared_xacts = NIL;
>
> This 'prepared_xacts' can be declared at a lower scrope because it is
> only used if (!opts.twophase).
>
> Furthermore, IIUC you don't need to assign NIL in the declaration
> because there is no chance for it to be unassigned anyway.

Narrowed the scope and removed the initialization.

> ~~~
>
> 4. AlterSubscription
>
> + /*
> + * The changed two_phase option (true->false) of the
> + * slot can't be rolled back.
> + */
> PreventInTransactionBlock(isTopLevel,
>     "ALTER SUBSCRIPTION ... SET (two_phase = off)");
>
> Here is another example of inconsistent mixing of the terminology
> where the comment says "true"/"false" but the message says "off".
> Let's keep everything consistent. (I prefer on|off).

Modified.

> ~~~
>
> 5.
> + if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED &&
> + (prepared_xacts = GetGidListBySubid(subid)) != NIL)
> + {
> + ListCell *cell;
> +
> + /* Abort all listed transactions */
> + foreach(cell, prepared_xacts)
> + FinishPreparedTransaction((char *) lfirst(cell),
> + false);
> +
> + list_free(prepared_xacts);
> + }
>
> 5A.
> IIRC there is a cleaner way to write this loop without needing
> ListCell variable -- e.g. foreach_ptr() macro?

Changed.

> 5B.
> Shouldn't this be using list_free_deep() so the pstrdup gid gets freed too?

Yeah, fixed.

> ==
> src/test/subscription/t/099_twophase_added.pl
>
> 6.
> +##
> +# Check the case that prepared transactions exist on subscriber node
> +##
>
> Give some more detailed comments here similar to the review comment of
> patch v7-0002 for the other part of this TAP test.
>
> ~~~
>
> 7. TAP test - comments
>
> Same as for my v7-0002 review comments, I think this test case also
> needs a few more one-line comments to describe the sub-steps. e.g.:
>
> # prepare a transaction to insert some rows to the table
>
> # verify the prepared tx is replicated to the subscriber (because
> 'two_phase = on')
>
> # toggle the two_phase to 'off' *before* the COMMIT PREPARED
>
> # verify the prepared tx got aborted
>
> # do the COMMIT PREPARED (note that now two_phase is 'off')
>
> # verify the inserted rows got replicated ok

They were fixed based on your previous comments.

> 8. TAP test - subscription name
>
> It's better to rename the SUBSCRIPTION in this TAP test so you can
> avoid getting log warnings like:
>
> psql::4: WARNING: subscriptions created by regression test
> cases should have names starting with "regress_"
> psql::4: NOTICE: created replication slot "sub" on publisher

Modified, but the change is included in 0001.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
> + appendStringInfo(, "FAILOVER %s ",
> + (*failover) ? "true" : "false");
> +
> + if (two_phase)
> + appendStringInfo(, "TWO_PHASE %s%s ",
> + (*two_phase) ? "true" : "false",
> + failover ? ", " : "");
> +
> + appendStringInfoString(, ");");
>
> 4a.
> IIUC the comma logic here was broken in v7 when you swapped the order.
> Anyway, IMO it will be better NOT to try combining that comma logic
> with the existing appendStringInfo. Doing it separately is both easier
> and less error-prone.
>
> Furthermore, the parentheses like "(*two_phase)" instead of just
> "*two_phase" seemed a bit overkill.
>
> SUGGESTION:
> + if (failover)
> + appendStringInfo(, "FAILOVER %s",
> + *failover ? "true" : "false");
> +
> + if (failover && two_phase)
> + appendStringInfo(, ", ");
> +
> + if (two_phase)
> + appendStringInfo(, "TWO_PHASE %s",
> + *two_phase ? "true" : "false");
> +
> + appendStringInfoString(, " );");

Fixed.

> 4b.
> Like I said above, IMO the current separator logic in v7 is broken. So
> it is a bit concerning the tests all passed anyway. How did that
> happen? I think this indicates that there needs to be an additional
> test scenario where both 'failover' and 'two_phase' get altered at the
> same time so this code gets exercised properly.

Right, such a test was added.

> ==
> src/test/subscription/t/099_twophase_added.pl
>
> 5.
> +# Define pre-existing tables on both nodes
>
> Why say they are "pre-existing"? They are not pre-existing because you
> are creating them right here!

Removed the word.

> 6.
> +##
> +# Check the case that prepared transactions exist on publisher node
> +##
>
> I think this needs a slightly more detailed comment.
>
> SUGGESTION (this is just an example, but you can surely improve it)
>
> # Check the case that prepared transactions exist on the publisher node.
> #
> # Since two_phase is "off", then normally this PREPARE will do nothing until
> # the COMMIT PREPARED, but in this test, we toggle the two_phase to "on" again
> # before the COMMIT PREPARED happens.

Changed with adjustments.

> 7.
> Maybe this test case needs a few more one-line comments for each of
> the sub-steps. e.g.:
>
> # prepare a transaction to insert some rows to the table
>
> # verify the prepared tx is not yet replicated to the subscriber
> (because 'two_phase = off')
>
> # toggle the two_phase to 'on' *before* the COMMIT PREPARED
>
> # verify the inserted rows got replicated ok

Modified along those lines, with adjustments based on suggestions from Grammarly.

> 8.
> IIUC this test will behave the same even if you DON'T do the toggle
> 'two_phase = on'. So I wonder is there something more you can do to
> test this scenario more convincingly?

I found an indicator. When the apply worker starts, it logs the current status of the two_phase option. I added wait_for_log() to ensure the message below appeared. Thoughts?

```
ereport(DEBUG1,
        (errmsg_internal("logical replication apply worker for subscription \"%s\" two_phase is %s",
                         MySubscription->name,
                         MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
                         MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
                         MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
                         "?")));
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Dear Peter,

Thanks for reviewing! The patch will be posted in an upcoming message.

> ==
> src/backend/access/transam/twophase.c
>
> 1. IsTwoPhaseTransactionGidForSubid
>
> +/*
> + * IsTwoPhaseTransactionGidForSubid
> + *     Check whether the given GID is formed by TwoPhaseTransactionGid.
> + */
> +static bool
> +IsTwoPhaseTransactionGidForSubid(Oid subid, char *gid)
>
> I think the function comment should mention something about 'subid'.
>
> SUGGESTION
> Check whether the given GID (as formed by TwoPhaseTransactionGid) is
> for the specified 'subid'.

Fixed.

> src/backend/commands/subscriptioncmds.c
>
> 2. AlterSubscription
>
> + if (!opts.twophase &&
> + form->subtwophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED &&
> + LookupGXactBySubid(subid))
> + /* Add error message */
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot disable two_phase when uncommitted prepared transactions present"),
> + errhint("Resolve these transactions and try again")));
>
> The comment "/* Add error message */" seems unnecessary.

Yeah, this was an internal marker. Removed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Pgoutput not capturing the generated columns
Dear Shubham,

Thanks for creating the patch! Here are high-level comments.

1. Please document the feature. If it is hard to describe, we should change the API.

2. Currently, the option is implemented as a streaming option. Is there a reason for this choice? Another approach is to implement it as a slot option, like failover and temporary.

3. You said that the subscription option is not supported for now. Does this mean that logical replication cannot be used for generated columns? If so, that restriction would not be acceptable. If the combination of this option and initial sync is problematic, can't we reject it in CreateSubscription and AlterSubscription? E.g., the create_slot option cannot be set if slot_name is NONE.

4. Regarding the test_decoding plugin, it is already able to decode generated columns. So, is the proposed option really needed for that plugin in the first place? Why do you include it? If you do want to add the option, its default value should be on, which keeps the current behavior.

5. Assuming the feature becomes usable for logical replication: should we bump the protocol version at that point? Nodes older than PG17 may not want to receive values for generated columns. Can we control this by the option alone?

6. logicalrep_write_tuple()
```
-if (!column_in_column_list(att->attnum, columns))
+if (!column_in_column_list(att->attnum, columns) && !att->attgenerated)
+continue;
```
Hmm, does the above mean that generated columns are decoded even if they are not in the column list? If so, why? I think such columns should not be sent.

7. Some functions refer to data->publish_generated_column many times. Can we store the value in a local variable?

The comments below are for the test_decoding part; they may not be needed.
=
a. pg_decode_startup()
```
+else if (strcmp(elem->defname, "include_generated_columns") == 0)
```
Other options for test_decoding do not use underscores. It should be "include-generated-columns".

b. pg_decode_change()
data->include_generated_columns is referred to four times in the function. Can you store the value in a variable?

c. pg_decode_change()
```
-true);
+true, data->include_generated_columns );
```
Please remove the blank.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
> ession
> instead?

Actually, it cannot be done in the main regression test, because altering two_phase requires a connection between the publisher and subscriber, which is not established in the subscription.sql file. The successful case for altering failover has not been tested either, and I think for the same reason.

> src/test/subscription/t/021_twophase.pl
>
> 17.
> +# Disable the subscription and alter it to two_phase = false,
> +# verify that the altered subscription reflects the two_phase option.
>
> /verify/then verify/

Fixed.

> 18.
> +# Now do a prepare on publisher and make sure that it is not replicated.
> +$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
> +$node_publisher->safe_psql(
> + 'postgres', qq{
> +BEGIN;
> +INSERT INTO tab_copy VALUES (100);
> +PREPARE TRANSACTION 'newgid';
> + });
>
> 18a.
> /on publisher/on the publisher/

Fixed.

> 18b.
> What is that "DROP SUBSCRIPTION tap_sub" doing here? It seems
> misplaced under this comment.

The subscription must be dropped because it also prepares a transaction. I moved it before the test case and added comments.

> 19.
> +# Make sure that there is 0 prepared transaction on the subscriber
> +$result = $node_subscriber->safe_psql('postgres',
> + "SELECT count(*) FROM pg_prepared_xacts;");
> +is($result, qq(0), 'transaction is prepared on subscriber');
>
> 19a.
> SUGGESTION
> Make sure there are no prepared transactions on the subscriber

Fixed.

> 19b.
> /'transaction is prepared on subscriber'/'should be no prepared
> transactions on subscriber'/

Replaced.

> 20.
> +# Made sure that the commited transaction is replicated.
>
> /Made sure/Make sure/
>
> /commited/committed/

Fixed.

> 21.
> +# Make sure that the two-phase is enabled on the subscriber
> +$result = $node_subscriber->safe_psql('postgres',
> + "SELECT subtwophasestate FROM pg_subscription WHERE subname = 'tap_sub_copy';"
> +);
> +is($result, qq(e), 'two-phase is disabled');
>
> The 'two-phase is disabled' is the identical message used in the
> opposite case earlier, so something is amiss. Maybe this one should
> say 'two-phase should be enabled' and the earlier counterpart should
> say 'two-phase should be disabled'.

Both of them were fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

v7-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description: v7-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
v7-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch
Description: v7-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch
v7-0003-Abort-prepared-transactions-while-altering-two_ph.patch
Description: v7-0003-Abort-prepared-transactions-while-altering-two_ph.patch
v7-0004-Add-force_alter-option.patch
Description: v7-0004-Add-force_alter-option.patch
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Dear hackers,

Per the recent commit (b29cbd3da), our patch needed to be rebased. Here is an updated version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

v6-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description: v6-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
v6-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch
Description: v6-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch
v6-0003-Abort-prepared-transactions-while-altering-two_ph.patch
Description: v6-0003-Abort-prepared-transactions-while-altering-two_ph.patch
v6-0004-Add-force_alter-option.patch
Description: v6-0004-Add-force_alter-option.patch
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
> Dear Vitaly,
>
> > I looked at the patch and realized that I can't try it easily in the near future
> > because the solution I'm working on is based on PG16 or earlier. This patch is
> > not easily applicable to the older releases. I have to port my solution to the
> > master, which is not done yet.
>
> We also tried to port our patch to PG16, but the largest barrier was that the
> replication command ALTER_SLOT is not supported. Since the slot option two_phase
> can't be modified, it is difficult to skip decoding the PREPARE command even when
> altering the option from true to false.
> IIUC, adding a new feature (e.g., a replication command) in minor releases is
> generally prohibited.
>
> We must consider another approach for backpatching, but we do not have a solution
> for now.

The attached patch set is a version ported to PG16, which breaks ABI. It can be used for testing purposes, but it won't be pushed to REL_16_STABLE. At least, this patch set passes my GitHub CI. Can you apply it and check whether your issue is solved?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

REL_16_0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPTION.patch
Description: REL_16_0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPTION.patch
REL_16_0002-Alter-slot-option-two_phase-only-when-altering-true-.patch
Description: REL_16_0002-Alter-slot-option-two_phase-only-when-altering-true-.patch
REL_16_0003-Abort-prepared-transactions-while-altering-two_phase.patch
Description: REL_16_0003-Abort-prepared-transactions-while-altering-two_phase.patch
REL_16_0004-Add-force_alter-option.patch
Description: REL_16_0004-Add-force_alter-option.patch
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Dear hackers, > Looking at this a bit more, maybe rolling back all prepared transactions on > the > subscriber when toggling two_phase from true to false might not be desirable > for the customer. Maybe we should have an option for customers to control > whether transactions should be rolled back or not. Maybe transactions should > only be rolled back if a "force" option is also set. What do people think? And here is a patch for adds new option "force_alter" (better name is very welcome). It could be used only when altering two_phase option. Let me share examples. Assuming that there are logical replication system with two_phase = on, and there are prepared transactions: ``` subscriber=# SELECT * FROM pg_prepared_xacts ; transaction | gid| prepared| owner | database -+--+---+--+-- 741 | pg_gid_16390_741 | 2024-04-22 08:02:34.727913+00 | postgres | postgres 742 | pg_gid_16390_742 | 2024-04-22 08:02:34.729486+00 | postgres | postgres (2 rows) ``` At that time, altering two_phase to false alone will be failed: ``` subscriber=# ALTER SUBSCRIPTION sub DISABLE ; ALTER SUBSCRIPTION subscriber=# ALTER SUBSCRIPTION sub SET (two_phase = off); ERROR: cannot alter two_phase = false when there are prepared transactions ``` It succeeds if force_alter is also expressly set. Prepared transactions will be aborted at that time. 
```
subscriber=# ALTER SUBSCRIPTION sub SET (two_phase = off, force_alter = on);
ALTER SUBSCRIPTION
subscriber=# SELECT * FROM pg_prepared_xacts;
 transaction | gid | prepared | owner | database
-------------+-----+----------+-------+----------
(0 rows)
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/

v5-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
v5-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch
v5-0003-Abort-prepared-transactions-while-altering-two_ph.patch
v5-0004-Add-force_alter-option.patch
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Dear Vitaly,

> I looked at the patch and realized that I can't try it easily in the near future
> because the solution I'm working on is based on PG16 or earlier. This patch is
> not easily applicable to the older releases. I have to port my solution to the
> master, which is not done yet.

We also tried to port our patch to PG16, but the largest barrier was that the replication command ALTER_SLOT is not supported there. Since the slot option two_phase can't be modified, it is difficult to skip decoding the PREPARE command even when altering the option from true to false. IIUC, adding a new feature (e.g., a replication command) in minor updates is generally prohibited.

We must consider another approach for backpatching, but we do not have a solution for now.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/
RE: Disallow changing slot's failover option in transaction block
Dear Hou,

Thanks for updating the patch! Let me confirm the content.

In your patch, pg_dump.c was updated. IIUC the main reason is that pg_restore executes some queries inside single transactions, so that ALTER SUBSCRIPTION cannot be done, right? Also, failover was previously synchronized only when we were in upgrade mode, but your patch seems to remove that condition. Can you clarify the reason?

Other than that, the patch LGTM.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Disallow changing slot's failover option in transaction block
Dear Shveta,

> When I said that option 2 aligns with ALTER-SUB documented behaviour,
> I meant the doc described in [1] stating "When altering the slot_name,
> the failover and two_phase property values of the named slot may
> differ from the counterpart failover and two_phase parameters
> specified in the subscription"
>
> [1]: https://www.postgresql.org/docs/devel/sql-altersubscription.html#SQL-ALTERSUBSCRIPTION-PARAMS-SET

I see, thanks for the clarification. Agreed that the description does not conflict with option 2.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Disallow changing slot's failover option in transaction block
Dear Shveta,

Sorry for the delayed response; I missed your post.

> +1.
>
> Similar to ALTER SUB, CREATE SUB also needs to be fixed. This is
> because we call alter_replication_slot in CREATE SUB as well, for the
> case when slot_name is provided and create_slot=false. But the tricky
> part is we call alter_replication_slot() when creating subscription
> for both failover=true and false. That means if we want to fix it on
> the similar line of ALTER SUB, we have to disallow user from executing
> the CREATE SUBSCRIPTION (slot_name = xx) in a txn block, which seems
> to break some existing use cases. (previously, user can execute such a
> command inside a txn block).
>
> So, we need to think if there are better ways to fix it. After
> discussion with Hou-San offlist, here are some ideas:
>
> 1. do not alter replication slot's failover option when CREATE
> SUBSCRIPTION WITH failover=false. This means we alter replication
> slot only when failover is set to true. And thus we can disallow
> CREATE SUB WITH (slot_name =xx, failover=true, create_slot=false)
> inside a txn block.
>
> This option allows user to run CREATE-SUB(create_slot=false) with
> failover=false in txn block like earlier. But on the downside, it
> makes the behavior inconsistent for otherwise simpler option like
> failover, i.e. with failover=true, CREATE SUB is not allowed in txn
> block while with failover=false, it is allowed. It makes it slightly
> complex to be understood by user.
>
> 2. let's not disallow CREATE SUB in txn block as earlier, just don't
> perform internal alter-failover during CREATE SUB for existing
> slots(create_slot=false, slot_name=xx) i.e. when create_slot is
> false, we will ignore failover parameter of CREATE SUB and it is
> user's responsibility to set it appropriately using ALTER SUB cmd. For
> create_slot=true, the behaviour of CREATE-SUB remains same as earlier.
>
> This option does not add new restriction for CREATE SUB wrt txn block.
> In context of failover with create_slot=false, we already have a
> similar restriction (documented one) for ALTER SUB, i.e. with 'ALTER
> SUBSCRIPTION SET(slot_name = new)', user needs to alter the new slot's
> failover by himself. CREAT SUB can also be documented in similar way.
> This seems simpler to be understood considering existing ALTER SUB's
> behavior as well. Plus, this will make CREATE-SUB code slightly
> simpler and thus easily manageable.
>
> 3. add a alter_slot option for CREATE SUBSCRIPTION, we can only alter
> the slot's failover if alter_slot=true. And so we can disallow CREATE
> SUB WITH (slot_name =xx, alter_slot=true) inside a txn block.
>
> This seems a clean way, as everything will be as per user's consent
> based on alter_slot parameter. But on the downside, this will need
> introducing additional parameter and also adding new restriction of
> running CREATE-sub in txn block for a specific case.
>
> 4. Don't alter replication in subscription DDLs. Instead, try to alter
> replication slot's failover in the apply worker. This means we need to
> execute alter_replication_slot each time before starting streaming in
> the apply worker.
>
> This does not seem appealing to execute alter_replication_slot
> everytime the apply worker starts. But if others think it as a better
> option, it can be further analyzed.

Thanks for describing them. I also prefer option 2, because it seems a bit strange that a CREATE statement leads to an ALTER.

> Currently, our preference is option 2, as that looks a clean solution
> and also aligns with ALTER-SUB behavior which is already documented.
> Thoughts?
>
> Note that we could not refer to the design of two_phase here, because
> two_phase can be considered as a streaming option, so it's fine to
> change the two_phase along with START_REPLICATION command. (the
> two_phase is not changed in subscription DDLs, but get changed in
> START_REPLICATION command). But the failover is closely related to a
> replication slot itself.
Sorry, I cannot find such statements. Which part did you refer to?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
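For reference, option 2 discussed above could look like the following sketch (the subscription, publication, slot names, and connection string are hypothetical; this illustrates the proposed behavior, not committed code). CREATE SUBSCRIPTION with a pre-existing slot would leave the slot's failover property untouched, and the user would adjust it afterwards outside a transaction block:

```sql
-- Sketch of option 2: CREATE SUBSCRIPTION ignores the failover property
-- of a pre-existing slot (create_slot = false), so it can still run
-- inside a transaction block as before.
CREATE SUBSCRIPTION mysub
    CONNECTION 'dbname=postgres host=publisher'
    PUBLICATION mypub
    WITH (slot_name = 'myslot', create_slot = false);

-- The user then sets failover explicitly, outside any transaction block:
ALTER SUBSCRIPTION mysub DISABLE;
ALTER SUBSCRIPTION mysub SET (failover = true);
ALTER SUBSCRIPTION mysub ENABLE;
```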
RE: Disallow changing slot's failover option in transaction block
Dear Hou,

> Kuroda-San reported an issue off-list that:
>
> If user execute ALTER SUBSCRIPTION SET (failover) command inside a txn block
> and rollback, only the subscription option change can be rolled back, while the
> replication slot's failover change is preserved.
>
> This is because ALTER SUBSCRIPTION SET (failover) command internally executes
> the replication command ALTER_REPLICATION_SLOT to change the replication slot's
> failover property, but this replication command execution cannot be rollback.
>
> To fix it, I think we can prevent user from executing ALTER SUBSCRIPTION set
> (failover) inside a txn block, which is also consistent to the ALTER
> SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach a small
> patch to address this.

Thanks for posting the patch; the fix matches my expectation. Also, should we add the restriction to the docs? I feel [1] can be updated.

[1]: https://www.postgresql.org/docs/devel/sql-altersubscription.html#SQL-ALTERSUBSCRIPTION-PARAMS-SET

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
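For reference, the reported inconsistency can be sketched as below (hypothetical subscription name; run on a cluster without the proposed restriction). The subscription catalog change is rolled back, but the slot's failover change on the publisher is not:

```sql
-- On the subscriber:
BEGIN;
ALTER SUBSCRIPTION mysub SET (failover = true);
-- The above internally issues ALTER_REPLICATION_SLOT on the publisher.
ROLLBACK;

-- The catalog change was undone:
SELECT subfailover FROM pg_subscription WHERE subname = 'mysub';

-- But on the publisher, the slot keeps the new value:
SELECT failover FROM pg_replication_slots WHERE slot_name = 'mysub';
```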
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Dear Amit,

> > FYI - We also considered the idea which walsender waits until all prepared transactions
> > are resolved before decoding and sending changes, but it did not work well
> > - the restarted walsender sent only COMMIT PREPARED record for transactions which
> > have been prepared before disabling the subscription. This happened because
> > 1) if the two_phase option of slots is false, the confirmed_flush can be ahead of
> > PREPARE record, and
> > 2) after the altering and restarting, start_decoding_at becomes same as
> > confirmed_flush and records behind this won't be decoded.
>
> I don't understand the exact problem you are facing. IIUC, if the
> commit is after start_decoding_at point and prepare was before it, we
> expect to send the entire transaction followed by a commit record. The
> restart_lsn should be before the start of such a transaction and we
> should have recorded the changes in the reorder buffer.

This behavior is right for the two_phase = false case. But if the parameter is altered between PREPARE and COMMIT PREPARED, there is a possibility that only COMMIT PREPARED is sent. To begin with, the executed workload was:

1. created a subscription with (two_phase = false)
2. prepared a transaction on the publisher
3. disabled the subscription once
4. altered the subscription to two_phase = true
5. enabled the subscription again
6. did COMMIT PREPARED on the publisher

-> The apply worker would raise an ERROR while applying the COMMIT PREPARED record:
ERROR: prepared transaction with identifier "pg_gid_XXX_YYY" does not exist

The part below describes why the ERROR occurred.

### Regarding 1) the confirmed_flush can be ahead of the PREPARE record

If two_phase is off, as you might know, confirmed_flush can get ahead of the PREPARE record via the keepalive mechanism. The walsender sometimes sends a keepalive message in WalSndKeepalive(). The LSN written there is that of the lastly decoded record.
Since the PREPARE record is skipped (it is just handled by ReorderBufferProcessXid()), the LSN written in the message can sometimes be ahead of the PREPARE record. If the WAL records are aligned like below, the LSN can point at CHECKPOINT_ONLINE:

... INSERT, PREPARE txn1, CHECKPOINT_ONLINE ...

On the worker side, when it receives the keepalive, it compares the LSN in the message with the lastly received LSN and advances last_received. Then the worker replies to the walsender, and at that time it reports that the last_received record has been flushed on the subscriber. See send_feedback().

On the publisher, when the walsender receives the message from the subscriber, it reads the message and advances confirmed_flush to the written value. If the walsender sent an LSN located ahead of PREPARE, the confirmed_flush is updated accordingly.

### Regarding 2) after the altering, records behind the confirmed_flush are not decoded

Then, in the decoding phase, the snapshot builder determines the point where decoding is resumed, start_decoding_at. After the restart, the value is the same as the slot's confirmed_flush. Since confirmed_flush is ahead of PREPARE, start_decoding_at is ahead as well, so the whole prepared transaction is not decoded.

The attached zip file contains the PoC and the script used; you can see what I actually did.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
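For reference, the six-step workload described above can be sketched in SQL as follows (the object names and connection string are hypothetical; the GID in the final error is illustrative):

```sql
-- 1. On the subscriber, create a subscription with two_phase disabled:
CREATE SUBSCRIPTION mysub
    CONNECTION 'dbname=postgres host=publisher'
    PUBLICATION mypub WITH (two_phase = false);

-- 2. On the publisher, prepare a transaction:
BEGIN;
INSERT INTO tbl VALUES (1);
PREPARE TRANSACTION 'gid1';

-- 3-5. On the subscriber, toggle two_phase while disabled:
ALTER SUBSCRIPTION mysub DISABLE;
ALTER SUBSCRIPTION mysub SET (two_phase = true);
ALTER SUBSCRIPTION mysub ENABLE;

-- 6. On the publisher:
COMMIT PREPARED 'gid1';
-- The apply worker then fails because only COMMIT PREPARED was sent:
-- ERROR: prepared transaction with identifier "pg_gid_XXX_YYY" does not exist
```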
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Dear Amit,

> Vitaly, does the minimal solution provided by the proposed patch
> (Allow to alter two_phase option of a subscriber provided no uncommitted
> prepared transactions are pending on that subscription.) address your use case?

I think we do not have to handle cases where there are prepared transactions on the publisher/subscriber, as a first step. It introduces additional complexity, and we do not have smarter solutions, especially for problem 2. IIUC it meets Vitaly's condition, right?

> > 1. While toggling two_phase from true to false, we could probably get a list of
> > prepared transactions for this subscriber id and rollback/abort the prepared
> > transactions. This will allow the transactions to be re-applied like a normal
> > transaction when the commit comes. Alternatively, if this isn't appropriate doing it
> > in the ALTER SUBSCRIPTION context, we could store the xids of all prepared
> > transactions of this subscription in a list and when the corresponding xid is being
> > committed by the apply worker, prior to commit, we make sure the previously
> > prepared transaction is rolled back. But this would add the overhead of checking
> > this list every time a transaction is committed by the apply worker.
>
> In the second solution, if you check at the time of commit whether
> there exists a prior prepared transaction then won't we end up
> applying the changes twice? I think we can first try to achieve it at
> the time of Alter Subscription because the other solution can add
> overhead at each commit?

Yeah, at least the second solution might be problematic. I prototyped the first one and it worked well. However, to make the feature more consistent, prepared transactions on the subscriber are prohibited for now. We can relax this based on requirements.

> > 2. No solution yet.
> One naive idea is that on the publisher we can remember whether the
> prepare has been sent and if so then only send commit_prepared,
> otherwise send the entire transaction. On the subscriber-side, we
> somehow, need to ensure before applying the first change whether the
> corresponding transaction is already prepared and if so then skip the
> changes and just perform the commit prepared. One drawback of this
> approach is that after restart, the prepare flag wouldn't be saved in
> the memory and we end up sending the entire transaction again. One way
> to avoid this overhead is that the publisher before sending the entire
> transaction checks with subscriber whether it has a prepared
> transaction corresponding to the current commit. I understand that
> this is not a good idea even if it works but I don't have any better
> ideas. What do you think?

I considered it, but I am not sure it is good to add such a mechanism. Your idea requires an additional wait-loop, which might lead to bugs and unexpected behavior, and it may degrade performance depending on the network environment. As for the other solution (the worker sends a list of prepared transactions), it is also not so good because the list of prepared transactions may be huge. Based on the above, I think we can reject the case for now.

FYI - we also considered the idea in which the walsender waits until all prepared transactions are resolved before decoding and sending changes, but it did not work well: the restarted walsender sent only the COMMIT PREPARED record for transactions which had been prepared before disabling the subscription. This happened because 1) if the two_phase option of the slot is false, the confirmed_flush can be ahead of the PREPARE record, and 2) after the altering and restarting, start_decoding_at becomes the same as confirmed_flush and records behind this won't be decoded.

> > 3.
> We could mandate that the altering of two_phase state only be done after
> disabling the subscription, just like how it is handled for failover option.
>
> makes sense.

OK, this spec was added. According to the above, I updated the patch with Ajin:

0001 - extends the ALTER SUBSCRIPTION statement. A tab-completion was added.
0002 - mandates that the subscription has been disabled. Since there is no need to change AtEOXact_ApplyLauncher(), that change is reverted. If there are no objections, this can be merged into 0001.
0003 - checks whether there are transactions prepared by the worker. If found, the ALTER SUBSCRIPTION command is rejected.
0004 - checks whether there are transactions prepared on the publisher. The backend connects to the publisher and confirms it. If found, the ALTER SUBSCRIPTION command is rejected.
0005 - adds a TAP test for it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

v3-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
v3-0002-Mandate-the-subscription-has-been-disabled.patch Descrip
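The subscriber-side check (0003 above) can be illustrated as below. This is a hedged sketch, not the actual patch implementation: it relies on the fact, visible in the earlier examples, that apply workers use GIDs of the form pg_gid_<subid>_<xid>, and the subscription name is hypothetical:

```sql
-- Hypothetical query to spot transactions prepared by the apply worker
-- of subscription 'mysub', by matching the pg_gid_<subid>_<xid> prefix.
-- (In LIKE, '_' matches any single character, which is harmless here.)
SELECT gid
FROM pg_prepared_xacts
WHERE gid LIKE 'pg_gid_' ||
      (SELECT oid FROM pg_subscription WHERE subname = 'mysub') || '_%';
```

If this returns any rows, the ALTER SUBSCRIPTION command would be rejected under the proposed behavior.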
RE: Slow catchup of 2PC (twophase) transactions on replica in LR
Dear Amit,

> One naive idea is that on the publisher we can remember whether the
> prepare has been sent and if so then only send commit_prepared,
> otherwise send the entire transaction. On the subscriber-side, we
> somehow, need to ensure before applying the first change whether the
> corresponding transaction is already prepared and if so then skip the
> changes and just perform the commit prepared. One drawback of this
> approach is that after restart, the prepare flag wouldn't be saved in
> the memory and we end up sending the entire transaction again. One way
> to avoid this overhead is that the publisher before sending the entire
> transaction checks with subscriber whether it has a prepared
> transaction corresponding to the current commit. I understand that
> this is not a good idea even if it works but I don't have any better
> ideas. What do you think?

An alternative idea is that the worker passes a list of prepared transactions as a new option of START_REPLICATION. This can reduce the number of inter-node communications, but the list may sometimes be huge.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Improve eviction algorithm in ReorderBuffer
Dear Heikki,

I also prototyped the idea, which has almost the same shape. I attached it just in case, but you may not have to look at it. A few comments based on the experiment.

```
+	/* txn_heap is ordered by transaction size */
+	buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);
```

I think the pairing heap should be in the same MemoryContext as the buffer. Can we add a MemoryContextSwitchTo()?

```
+	/* Update the max-heap */
+	if (oldsize != 0)
+		pairingheap_remove(rb->txn_heap, &txn->txn_node);
+	pairingheap_add(rb->txn_heap, &txn->txn_node);
...
+	/* Update the max-heap */
+	pairingheap_remove(rb->txn_heap, &txn->txn_node);
+	if (txn->size != 0)
+		pairingheap_add(rb->txn_heap, &txn->txn_node);
```

Since the number of stored transactions does not affect the insert operation, we may be able to add the node while creating the ReorderBufferTXN and remove it while cleaning up. This can reduce branches in ReorderBufferChangeMemoryUpdate().

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Is this a problem in GenericXLogFinish()?
Dear Michael,

> On Fri, Apr 05, 2024 at 06:22:58AM +0000, Hayato Kuroda (Fujitsu) wrote:
> > Thanks for pointing out. Yes, we fixed a behavior by the commit aa5edbe37,
> > but we missed the redo case. I made a fix patch based on the suggestion [1].
>
> +	bool	mod_buf = false;
>
> Perhaps you could name that mod_wbuf, to be consistent with the WAL
> insert path.

Right, fixed.

> I'm slightly annoyed by the fact that there is no check on
> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
> show the symmetry between the insert and replay paths. Shouldn't
> there be at least an assert for that in the branch when there are no
> tuples (imagine an else to cover xldata->ntups == 0)? I mean between
> just before updating the hash page's opaque area when
> is_prev_bucket_same_wrt.

Indeed; I added an Assert() in the else part. Is it what you expected?

> I've been thinking about ways to make such cases detectable in an
> automated fashion. The best choice would be
> verifyBackupPageConsistency(), just after RestoreBlockImage() on the
> copy of the block image stored in WAL before applying the page masking
> which would mask the LSN. A problem, unfortunately, is that we would
> need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
> flag so we would be able to track if the block rebuilt from the record
> has the *same* LSN as the copy used for the consistency check. So
> this edge consistency case would come at a cost, I am afraid, and the
> 8 bits of bimg_info are precious :/

I could not conclude that the change's benefit outweighs its cost, so I did nothing here.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

v2_add_modbuf_flag.diff
RE: Is this a problem in GenericXLogFinish()?
Dear Michael,

> There is still some divergence between the code path of
> _hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
> squeezing a page: we do not set the LSN of the write buffer if
> (xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
> !xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
> but at replay we call PageSetLSN() on the write buffer and mark it
> dirty in this case. Isn't that incorrect? It seems to me that we
> should be able to make the conditional depending on the state of the
> xl_hash_squeeze_page record, no?

Thanks for pointing it out. Yes, we fixed the behavior in commit aa5edbe37, but we missed the redo case. I made a fix patch based on the suggestion [1]. The part below contains my analysis and how I reproduced the issue; I post it to clarify the issue for others.

## Pointed issue

Assume the case where:
- xlrec.ntups is 0,
- xlrec.is_prim_bucket_same_wrt is true, and
- xlrec.is_prev_bucket_same_wrt is false.

This means that there are several overflow pages and the tail dead page is being removed. In this case, the primary page does not have to be modified.

During normal operation, the removal is done in _hash_freeovflpage(). If the above condition is met, mod_wbuf is not set to true, so PageSetLSN() is skipped. During recovery, the squeeze and removal are done in hash_xlog_squeeze_page(), where PageSetLSN() is called unconditionally. Michael's point is that in this case the PageSetLSN() should be avoided as well.

## Analysis

IIUC this is the same issue as pointed out in [2]: PageSetLSN() should be called only when the page is really modified. In the case under discussion, the wbuf is not modified (just the tail entry is removed), so there is no need to assign an LSN to it. However, we missed updating the redo case accordingly.

## How to reproduce

I could reproduce the case with the steps below:
1. Added a debug log like [3]
2. Constructed physical replication.
3. Ran hash_index.sql
4. Found the added debug log.
[1]: https://www.postgresql.org/message-id/CAA4eK1%2BayneM-8mSRC0iWpDMnm39EwDoqgiOCSqrrMLcdnUnAA%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/ZbyVVG_7eW3YD5-A%40paquier.xyz
[3]:
```
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -713,6 +713,11 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 		writeopaque->hasho_nextblkno = xldata->nextblkno;
 	}
 
+	if (xldata->ntups == 0 &&
+		xldata->is_prim_bucket_same_wrt &&
+		!xldata->is_prev_bucket_same_wrt)
+		elog(LOG, "XXX no need to set PageSetLSN");
+
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

add_modbuf_flag.diff
RE: Improve eviction algorithm in ReorderBuffer
Dear Sawada-san,

> Agreed.
>
> I think the patch is in good shape. I'll push the patch with the
> suggestion next week, barring any objections.

Thanks for working on this. Agreed it is committable. A few minor comments:

```
+ * Either txn or change must be non-NULL at least. We update the memory
+ * counter of txn if it's non-NULL, otherwise change->txn.
```

IIUC no one checks this restriction. Should we add an Assert() for it, e.g., Assert(txn || change)?

```
+	/* make sure enough space for a new node */
...
+	/* make sure enough space for a new node */
```

Should these start with upper case?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Synchronizing slots from primary to standby
Dear Hou,

Thanks for updating the patch! Here is a comment for it.

```
+	/*
+	 * By advancing the restart_lsn, confirmed_lsn, and xmin using
+	 * fast-forward logical decoding, we can verify whether a consistent
+	 * snapshot can be built. This process also involves saving necessary
+	 * snapshots to disk during decoding, ensuring that logical decoding
+	 * efficiently reaches a consistent point at the restart_lsn without
+	 * the potential loss of data during snapshot creation.
+	 */
+	pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
+										&found_consistent_point);
+	ReplicationSlotsComputeRequiredLSN();
+	updated_lsn = true;
```

You added these like pg_replication_slot_advance(), but that function also calls ReplicationSlotsComputeRequiredXmin(false) at that point. According to the related commit b48df81 and discussions [1], I know it is needed only for physical slots, but it would be more consistent to call ReplicationSlotsComputeRequiredXmin() as well, per [2]:

```
This may be a waste if no advancing is done, but it could also be an
advantage to enforce a recalculation of the thresholds for each function
call. And that's more consistent with the slot copy, drop and creation.
```

What do you think?

[1]: https://www.postgresql.org/message-id/20200609171904.kpltxxvjzislidks%40alap3.anarazel.de
[2]: https://www.postgresql.org/message-id/20200616072727.GA2361%40paquier.xyz

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: pg_upgrade and logical replication
Dear Vignesh,

> Recently there was a failure in 004_subscription tap test at [1].
> In this failure, the tab_upgraded1 table was expected to have 51
> records but has only 50 records. Before the upgrade both publisher and
> subscriber have 50 records.

Good catch!

> After the upgrade we have inserted one record in the publisher, now
> tab_upgraded1 will have 51 records in the publisher. Then we start the
> subscriber after changing max_logical_replication_workers so that
> apply workers get started and apply the changes received. After
> starting we enable regress_sub5, wait for sync of regress_sub5
> subscription and check for tab_upgraded1 and tab_upgraded2 table data.
> In a few random cases the one record that was inserted into
> tab_upgraded1 table will not get replicated as we have not waited for
> regress_sub4 subscription to apply the changes from the publisher.
> The attached patch has changes to wait for regress_sub4 subscription
> to apply the changes from the publisher before verifying the data.
>
> [1] - https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba=2024-03-26%2004%3A23%3A13

Yeah, I think it is an oversight in f17529. Previously, subscriptions receiving changes were confirmed to have caught up; I missed adding that line while restructuring the script. +1 for your fix.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Amit, Euler,

> This only drops the publications created by this tool, not the
> pre-existing ones that we discussed in the link provided.

Another concern here is the case where the primary itself subscribes to changes from other nodes. After the conversion, the new subscriber also tries to connect to that other publisher - this may lead to conflicts. This happens because both the launcher and the apply workers start after recovery finishes. So, based on Ashutosh's point, should we remove such replication objects?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Bharath, Peter,

> Looks like BF animals aren't happy, please check -
> https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
>
> Looks like sanitizer failures. There were a few messages about that
> recently, but those were all just about freeing memory after use, which
> we don't necessarily require for client programs. So maybe something else.

It seems there are several types of failures, [1] and [2].

## Analysis for failure 1

The failure is caused by a time lag between the walreceiver finishing and pg_is_in_recovery() returning false. According to the output [1], it seems that the tool failed at wait_for_end_recovery() with the message "standby server disconnected from the primary". Also, the lines "redo done at..." and "terminating walreceiver process due to administrator command" mean that the walreceiver was requested to shut down by XLogShutdownWalRcv().

According to the source, the walreceiver is shut down in StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also, SharedRecoveryState is changed to RECOVERY_STATE_DONE (which makes pg_is_in_recovery() return false) in the latter part of StartupXLOG(). So, if there is a delay between FinishWalRecovery() and the state change, the check in wait_for_end_recovery() fails during that window. Since we allow the walreceiver to be missing up to 10 times, checked once per second, the failure occurs if the time lag is longer than 10 seconds.

I do not have a good way to fix it. One approach is to make NUM_CONN_ATTEMPTS larger, but that is not a fundamental solution.

## Analysis for failure 2

According to [2], the physical replication slot specified as primary_slot_name was not used by any walsender process; at that time no walsender existed yet.

```
...
pg_createsubscriber: publisher: current wal senders: 0
pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
pg_createsubscriber: error: could not obtain replication slot information: got 0 rows, expected 1 row
...
```

Currently the standby must be stopped before the command runs, and the current code does not wait for the replication to start. So there is a possibility that the check is run before the walsender is launched. One possible approach is to wait until the replication starts; an alternative is to relax the condition. What do you think?

[1]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus=2024-03-25%2013%3A03%3A07
[2]: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae=2024-03-25%2013%3A53%3A58

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
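For reference, the "wait until the replication starts" approach could poll the publisher like the sketch below (a hedged illustration, not the tool's code; the slot name is taken from the log above):

```sql
-- Run on the publisher, retried with a short sleep until it returns true
-- (with a timeout), before executing the existing one-row check:
SELECT active
FROM pg_catalog.pg_replication_slots
WHERE slot_name = 'physical_slot';
```

Once the row reports active = true, the walsender holding the slot exists and the current "got 0 rows, expected 1 row" failure can no longer occur.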
RE: speed up a logical replica setup
> IIUC, added options were inspired by Tomas. And he said the limitation (pub/sub/slot
> name cannot be specified) was acceptable for the first version. I agree with him.
> (To be honest, I feel that options should be fewer for the first release)

Just to confirm - I don't think they are strictly needed. If we can agree on a specification at some point, it's OK for me to add them. If you ask me, I still prefer Tomas's approach.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Amit, Euler, I also want to share my opinion, just in case. > On Thu, Mar 21, 2024 at 8:00 PM Euler Taveira wrote: > > > > On Thu, Mar 21, 2024, at 10:33 AM, vignesh C wrote: > > > > If we commit this we might not be able to change the way the option > > behaves once the customers starts using it. How about removing these > > options in the first version and adding it in the next version after > > more discussion. > > > > > > We don't need to redesign this one if we want to add a format string in a > > next > > version. A long time ago, pg_dump started to accept pattern for tables > > without > > breaking or deprecating the -t option. If you have 100 databases and you > > don't > > want to specify the options or use a script to generate it for you, you also > > have the option to let pg_createsubscriber generate the object names for > > you. > > Per my experience, it will be a rare case. > > > > But, why go with some UI in the first place which we don't think is a > good one, or at least don't have a broader agreement that it is a good > one? The problem with self-generated names for users could be that > they won't be able to make much out of it. If one has to always use > those internally then probably that would be acceptable. I would > prefer what Tomas proposed a few emails ago as compared to this one as > that has fewer options to be provided by users but still, they can > later identify objects. But surely, we should discuss if you or others > have better alternatives. IIUC, added options were inspired by Tomas. And he said the limitation (pub/sub/slot name cannot be specified) was acceptable for the first version. I agree with him. (To be honest, I feel that options should be fewer for the first release) > The user choosing to convert a physical standby to a subscriber would > in some cases be interested in converting it for all the databases > (say for the case of upgrade [1]) and giving options for each database > would be cumbersome for her. 
+1 for the primary use case. > > Currently dry-run will do the check and might fail on identifying a > > few failures like after checking subscriber configurations. Then the > > user will have to correct the configuration and re-run then fix the > > next set of failures. Whereas the suggest-config will display all the > > optimal configuration for both the primary and the standby in a single > > shot. This is not a must in the first version, it can be done as a > > subsequent enhancement. > > > > > > Do you meant publisher, right? Per order, check_subscriber is done before > > check_publisher and it checks all settings on the subscriber before > > exiting. In > > v30, I changed the way it provides the required settings. In a previous > > version, > > it fails when it found a wrong setting; the current version, check all > > settings > > from that server before providing a suitable error. > > > > pg_createsubscriber: checking settings on publisher > > pg_createsubscriber: primary has replication slot "physical_slot" > > pg_createsubscriber: error: publisher requires wal_level >= logical > > pg_createsubscriber: error: publisher requires 2 replication slots, but > > only 0 > remain > > pg_createsubscriber: hint: Consider increasing max_replication_slots to at > > least > 3. > > pg_createsubscriber: error: publisher requires 2 wal sender processes, but > > only > 0 remain > > pg_createsubscriber: hint: Consider increasing max_wal_senders to at least > > 3. > > > > If you have such an error, you will fix them all and rerun using dry run > > mode > > again to verify everything is ok. I don't have a strong preference about > > it. It > > can be changed easily (unifying the check functions or providing a return > > for > > each of the check functions). > > > > We can unify the checks but not sure if it is worth it. I am fine > either way. 
> It would have been better if we provided a way for a user > to know the tool's requirement rather than letting him know via errors > but I think it should be okay to extend it later as well. Both ways are OK, but I prefer unifying the checks a bit: the number of working modes in the same executable should be reduced as much as possible. That said, the current behavior is acceptable to me. pg_upgrade checks one item at a time and exits early on failure; if both the old and new clusters have issues, the first run reports only the old cluster's issues, and after the DBA fixes them and reruns, the new cluster's issues are reported. [1]: https://www.postgresql.org/message-id/8d52c226-7e34-44f7-a919-759bf8d81541%40enterprisedb.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
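The "check everything, then fail once" behavior described above can be sketched as follows. This is only a minimal illustration of the pattern, not the actual pg_createsubscriber code; the struct and function names are invented, and the error strings mirror the quoted output.

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical snapshot of publisher settings; field names mirror the GUCs
 * mentioned in the quoted output, but this is not the real implementation. */
struct publisher_settings
{
	bool		wal_level_is_logical;
	int			max_replication_slots;
	int			max_wal_senders;
};

/*
 * Check every requirement and report all failures before returning, instead
 * of exiting at the first problem.  The caller exits once if the result is
 * nonzero, so a single dry run surfaces every misconfiguration.
 */
static int
check_publisher_settings(const struct publisher_settings *s, int required)
{
	int			nerrors = 0;

	if (!s->wal_level_is_logical)
	{
		fprintf(stderr, "error: publisher requires wal_level >= logical\n");
		nerrors++;
	}
	if (s->max_replication_slots < required)
	{
		fprintf(stderr, "error: publisher requires %d replication slots\n",
				required);
		nerrors++;
	}
	if (s->max_wal_senders < required)
	{
		fprintf(stderr, "error: publisher requires %d wal sender processes\n",
				required);
		nerrors++;
	}
	return nerrors;
}
```

With this shape, unifying checks for the subscriber side would just mean summing the counters from both check functions before the single exit.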
RE: Have pg_basebackup write "dbname" in "primary_conninfo"?
Dear Sawada-san, Thanks for the comments! > This behavior makes sense to me. But do we want to handle the case of > using environment variables too? Yeah, v5 does not consider libpq parameters specified via environment variables. Such a variable should be used when the dbname is not explicitly written in the connection string. This path was added in the v6 patch: if the dbname is not determined after parsing the connection string, we call PQconndefaults() to get settings from environment variables and service files [1], then search for the dbname again. An example: ``` PGPORT=5431 PGUSER=kuroda PGDATABASE=postgres pg_basebackup -D data_N2 -R -v -> primary_conninfo = 'user=kuroda ... port=5431 ... dbname=postgres ... ' ``` > IIUC, > > pg_basebackup -D tmp -d "user=masahiko dbname=test_db" > > is equivalent to: > > PGDATABASE="user=masahiko dbname=test_db" pg_basebackup -D tmp That case won't work. I think you assumed that the expand_dbname behavior of PQconnectdbParams() [2] also applies to environment variables, but it does not - the value is not parsed as a connection string again. In the libpq layer, connection parameters are parsed in PQconnectStartParams()->conninfo_array_parse(). When expand_dbname is specified, the "dbname" entry is checked first and its value is parsed; this happens at fe-connect.c:5846. The environment variables are checked and parsed in conninfo_add_defaults(), which is called from conninfo_array_parse(), but that happens at fe-connect.c:5956 - the expand_dbname expansion has already been done by then. This means there is no chance for PGDATABASE to be parsed in the expanded style. For example, if pg_basebackup runs like below: PGDATABASE="user=kuroda dbname=postgres" pg_basebackup -D data_N2 -R -v The primary_conninfo written to the file will be: primary_conninfo = 'user=hayato ... 
dbname=''user=kuroda dbname=postgres''' [1]: https://www.postgresql.org/docs/devel/libpq-pgservice.html [2]: https://www.postgresql.org/docs/devel/libpq-connect.html#LIBPQ-PQCONNECTDBPARAMS Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ Attachment: pg_basebackup-write-dbname.v0006.patch
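To make the ordering argument above concrete, here is a toy model of the two steps in conninfo_array_parse(). This is not libpq code; the function name and the "expanded" stand-in are invented. The point it illustrates is that expand_dbname applies only to an explicitly passed dbname parameter (step 1), while values merged later from the environment are taken verbatim (step 2).

```c
#include <stddef.h>
#include <string.h>

/*
 * Toy model of libpq's ordering.  Step 1: if the caller passed a dbname and
 * expand_dbname is set, a value that looks like a connection string would be
 * re-parsed (represented here by returning "expanded").  Step 2: otherwise
 * the environment default (PGDATABASE) is merged in as-is and is never
 * re-parsed, which is why PGDATABASE="user=... dbname=..." ends up as a
 * literal database name.
 */
static const char *
effective_dbname(const char *explicit_dbname, int expand_dbname,
				 const char *env_pgdatabase)
{
	/* step 1: explicit parameter, possibly expanded (fe-connect.c:5846) */
	if (explicit_dbname != NULL)
	{
		if (expand_dbname && strchr(explicit_dbname, '=') != NULL)
			return "expanded";	/* stand-in for recursive parsing */
		return explicit_dbname;
	}
	/* step 2: environment default, taken verbatim (fe-connect.c:5956) */
	return env_pgdatabase;
}
```

So `-d "user=masahiko dbname=test_db"` goes through step 1 and gets expanded, while the same string in PGDATABASE only reaches step 2 and stays literal.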
RE: speed up a logical replica setup
Dear Peter, Thanks for the comments. I want to reply to some of them. > > 17. > > > > "specifies promote" > > > > We can do double-quote for the word promote. > > The v30 patch has promote, which I think is adequate. Oops. Actually, I was looking at the v29 patch during my first review. Sorry for the noise. > > > 20. > > > > New options must be also documented as well. This helps not only users but > also > > reviewers. > > (Sometimes we cannot identify that the implementation is intentinal or not.) > > Which ones are missing? In v29, the newly added options (publication/subscription/replication-slot) were not documented. Since they have since been added, please ignore this. > > 21. > > > > Also, not sure the specification is good. I preferred to specify them by > > format > > string. Because it can reduce the number of arguments and I cannot find use > cases > > which user want to control the name of objects. > > > > However, your approach has a benefit which users can easily identify the > generated > > objects by pg_createsubscriber. How do other think? > > I think listing them explicitly is better for the first version. It's > simpler to implement and more flexible. OK. > > 22. > > > > ``` > > #define BASE_OUTPUT_DIR > "pg_createsubscriber_output.d" > > ``` > > > > No one refers the define. > > This is gone in v30. I wrote that for the same reason as above (I was looking at v29). Please ignore it. > > > 31. > > > > ``` > > /* Create replication slot on publisher */ > > if (lsn) > > pg_free(lsn); > > ``` > > > > I think allocating/freeing memory is not so efficient. > > Can we add a flag to create_logical_replication_slot() for controlling the > > returning value (NULL or duplicated string)? We can use the condition (i == > num_dbs-1) > > as flag. > > Nothing is even using the return value of > create_logical_replication_slot(). I think this can be removed altogether. > > 37. 
> > > > ``` > > /* Register a function to clean up objects in case of failure */ > > atexit(cleanup_objects_atexit); > > ``` > > > > Sorry if we have already discussed. I think the registration can be moved > > just > > before the boot of the standby. Before that, the callback will be no-op. > > But it can also stay where it is. What is the advantage of moving it later? I thought we could reduce the risk of bugs. Previously, some bugs were reported because the registration was done too early. However, this is not a strong opinion. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: PostgreSQL 17 Release Management Team & Feature Freeze
Dear Michael, > If you think that this is OK, and as far as I can see this looks OK on > the thread, then this open item should be moved under "resolved before > 17beta1", mentioning the commit involved in the fix. Please see [1] > for examples. OK, I understand that I can wait for your check. Thanks. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Another rule? 23. ``` static int num_dbs = 0; static int num_pubs = 0; static int num_subs = 0; static int num_replslots = 0; ``` I think the names are a bit confusing. The number of generated publications/subscriptions/replication slots is always the same as the number of databases; these counters just indicate how many were explicitly specified. My idea is num_custom_pubs or something. Thoughts? 24. ``` /* standby / subscriber data directory */ static char *subscriber_dir = NULL; ``` It is a bit strange that only subscriber_dir is a global variable. Callers require CreateSubscriberOptions as an argument, except cleanup_objects_atexit() and main(). So, how about making `CreateSubscriberOptions opt` a global too? 25. ``` * Replication slots, publications and subscriptions are created. Depending on * the step it failed, it should remove the already created objects if it is * possible (sometimes it won't work due to a connection issue). ``` I think it should be stated here that subscriptions won't be removed, along with the reason. 26. ``` /* * If the server is promoted, there is no way to use the current setup * again. Warn the user that a new replication setup should be done before * trying again. */ ``` Per comment 25, we can add a reference like "See comments atop the function". 27. usage() was not updated based on recent changes. 28. ``` if (strcmp(conn_opt->keyword, "dbname") == 0 && conn_opt->val != NULL) { if (dbname) *dbname = pg_strdup(conn_opt->val); continue; } ``` There is a memory leak if multiple dbname values are specified in the conninfo. 29. ``` pg_prng_seed(&prng_state, (uint64) (getpid() ^ time(NULL))); ``` No need to initialize the seed every time. Can you reuse the pg_prng_state? 30. ``` if (num_replslots == 0) dbinfo[i].replslotname = pg_strdup(genname); ``` I think the straightforward way is to use the name of the subscription if no name is specified. This follows the rule for CREATE SUBSCRIPTION. 31. 
``` /* Create replication slot on publisher */ if (lsn) pg_free(lsn); ``` I think repeatedly allocating/freeing memory here is not efficient. Can we add a flag to create_logical_replication_slot() for controlling the return value (NULL or a duplicated string)? We can use the condition (i == num_dbs-1) as the flag. 32. ``` /* * Close the connection. If exit_on_error is true, it has an undesired * condition and it should exit immediately. */ static void disconnect_database(PGconn *conn, bool exit_on_error) ``` In the case of disconnect_database(), the second argument should have a different name: if it is true, the process exits unconditionally. Also, the comment atop the function must be fixed. 33. ``` wal_level = strdup(PQgetvalue(res, 0, 0)); ``` pg_strdup() should be used here. 34. ``` {"config-file", required_argument, NULL, 1}, {"publication", required_argument, NULL, 2}, {"replication-slot", required_argument, NULL, 3}, {"subscription", required_argument, NULL, 4}, ``` The ordering looks strange to me. Following pg_upgrade and pg_basebackup, options which do not have a short form are listed last. 35. ``` opt.sub_port = palloc(16); ``` Per other lines, pg_malloc() should be used. 36. ``` pg_free(opt.sub_port); ``` You said that the leak is not a concern here. If so, why does only 'p' have pg_free()? 37. ``` /* Register a function to clean up objects in case of failure */ atexit(cleanup_objects_atexit); ``` Sorry if we have already discussed this. I think the registration can be moved to just before the boot of the standby; before that point, the callback is a no-op. 38. ``` /* Subscriber PID file */ snprintf(pidfile, MAXPGPATH, "%s/postmaster.pid", subscriber_dir); /* * If the standby server is running, stop it. Some parameters (that can * only be set at server start) are informed by command-line options. */ if (stat(pidfile, ) == 0) ``` Hmm. pidfile is used only here, but it is declared in main(). Can this be separated into another function like is_standby_started()? 39. 
Or, we may be able to introduce "restart_standby_if_needed" or something. 40. ``` * XXX this code was extracted from BootStrapXLOG(). ``` So, can we extract the common part to somewhere shared? Since the system identifier is related to the controldata file, I think it could be located in controldata_util.c. 41. You said the below in [1], but I could not find the related fix. Can you clarify? > That's a good point. We should state in the documentation that GUCs specified > in > the command-line options are ignored during the execution. [1]: https://www.postgresql.org/message-id/40595e73-c7e1-463a-b8be-49792e870007%40app.fastmail.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/
RE: PostgreSQL 17 Release Management Team & Feature Freeze
Dear Michael, > We are pleased to announce the Release Management Team (RMT) (cc'd) > for the PostgreSQL 17 release: > - Robert Haas > - Heikki Linnakangas > - Michael Paquier Thanks for managing the PostgreSQL release and steering it in the right direction! > You can track open items for the PostgreSQL 17 release here: > https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items I think this entry can be closed: ``` pg_upgrade with --link mode failing upgrade of publishers Commit: 29d0a77fa660 Owner: Amit Kapila ``` The reported failure was related only to the test script, not the feature itself. The issue has already been analyzed and the fix was pushed as f17529b710. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Vignesh, Thanks for updating the patch, but cfbot is still unhappy [1]. Note that the two containers (autoconf and meson) failed at different places, so I think it is a genuine failure rather than a flaky one. It seems there may be a bug related to the 32-bit build. We should look into it and fix it as soon as possible. [1]: http://cfbot.cputube.org/highlights/all.html#4637 Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Tomas, Euler, Thanks for starting to read the thread! Since I'm not the original author, I can reply only partially. > I decided to take a quick look on this patch today, to see how it works > and do some simple tests. I've only started to get familiar with it, so > I have only some comments / questions regarding usage, not on the code. > It's quite possible I didn't understand some finer points, or maybe it > was already discussed earlier in this very long thread, so please feel > free to push back or point me to the past discussion. > > Also, some of this is rather opinionated, but considering I didn't see > this patch before, my opinions may easily be wrong ... I felt your comments were quite valuable. > 1) SGML docs > > It seems the SGML docs are more about explaining how this works on the > inside, rather than how to use the tool. Maybe that's intentional, but > as someone who didn't work with pg_createsubscriber before I found it > confusing and not very helpful. > > For example, the first half of the page is prerequisities+warning, and > sure those are useful details, but prerequisities are checked by the > tool (so I can't really miss this) and warnings go into a lot of details > about different places where things may go wrong. Sure, worth knowing > and including in the docs, but maybe not right at the beginning, before > I learn how to even run the tool? Hmm, right. I considered the below improvements. Tomas and Euler, what do you think? * Add more description to the "Description" section. * Move the prerequisites and warnings to the "Notes" section. * Add a "Usage" section which walks through the setup starting from a single node. > I'm not sure FOR ALL TABLES is a good idea. Or said differently, I'm > sure it won't work for a number of use cases. I know large databases > it's common to create "work tables" (not necessarily temporary) as part > of a batch job, but there's no need to replicate those tables. 
Indeed, the documentation does not describe that all tables in the database would be included in the publication. > I do understand that FOR ALL TABLES is the simplest approach, and for v1 > it may be an acceptable limitation, but maybe it'd be good to also > support restricting which tables should be replicated (e.g. blacklist or > whitelist based on table/schema name?). This may not be directly related, but we considered accepting such options as a next step [1]. > Note: I now realize this might fall under the warning about DDL, which > says this: > > Executing DDL commands on the source server while running > pg_createsubscriber is not recommended. If the target server has > already been converted to logical replica, the DDL commands must > not be replicated so an error would occur. Yeah, they would not be replicated, but that does not lead to an ERROR. So should we say something like "Creating tables on the source server..."? > 5) slot / publication / subscription name > > I find it somewhat annoying it's not possible to specify names for > objects created by the tool - replication slots, publication and > subscriptions. If this is meant to be a replica running for a while, > after a while I'll have no idea what pg_createsubscriber_569853 or > pg_createsubscriber_459548_2348239 was meant for. > > This is particularly annoying because renaming these objects later is > either not supported at all (e.g. for replication slots), or may be > quite difficult (e.g. publications). > > I do realize there are challenges with custom names (say, if there are > multiple databases to replicate), but can't we support some simple > formatting with basic placeholders? So we could specify > > --slot-name "myslot_%d_%p" > > or something like that? Not sure we can do it in the first version, but it looks nice. One concern is that I cannot find other applications which accept escape strings like log_line_prefix. (That may just be because we do not have the use case.) Do you know of examples? 
> BTW what will happen if we convert multiple standbys? Can't they all get > the same slot name (they all have the same database OID, and I'm not > sure how much entropy the PID has)? I tested this, and the second try did not work. The primary reason was the publication name - pg_createsubscriber_%u (oid). FYI - previously we could reuse the same publications, but the feature was removed based on my comment [2]. That comment might have been too optimistic. [1]: https://www.postgresql.org/message-id/TY3PR01MB9889CCBD4D9DAF8BD2F18541F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com [2]: https://www.postgresql.org/message-id/TYCPR01MB12077756323B79042F29DDAEDF54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
system name of the user * running it. */ ``` Unnecessary blank. 08. main ``` char *errstr = NULL; ``` This declaration can be moved into the else branch. 09. main Also, in the first place, do we have to get the username if it is not specified? I think libpq can handle that case if we simply skip passing the info. 10. main ``` appendPQExpBuffer(sub_conninfo_str, "host=%s port=%u user=%s fallback_application_name=%s", opt.socket_dir, opt.sub_port, opt.sub_username, progname); sub_base_conninfo = get_base_conninfo(sub_conninfo_str->data, NULL); ``` Is it really necessary to call get_base_conninfo()? I think there is no need to define sub_base_conninfo. 11. main ``` /* * In dry run mode, the server is restarted with the provided command-line * options so validation can be applied in the target server. In order to * preserve the initial state of the server (running), start it without * the command-line options. */ if (dry_run) start_standby_server(, pg_ctl_path, NULL, false); ``` I think the initial state of the server may be stopped; currently both states are allowed. Also, I think it is not good not to specify the logfile. 12. others As Peter E pointed out [1], the main function is still huge - it has more than 400 lines. I think all functions should be fewer than 100 lines to keep readability. I considered a separation like below; note that this may require changing the ordering. What do you think? * add parse_command_options() which accepts user options and verifies them * add verification_phase() or something which checks the system identifier and calls check_XXX * add catchup_phase() or something which creates a temporary slot, writes recovery parameters, and waits until the end of recovery * add cleanup_phase() or something which removes primary_slot and modifies the system identifier * stop/start server calls can be combined into one wrapper. The attached txt file is a proof of concept. 13. others PQresultStatus(res) is called 17 times in this source file, which may be redundant. 
I think we can introduce a function like executeQueryOrDie() and gather the calls in one place. 14. others I found that pg_createsubscriber does not use functions declared in other files. Is there a possibility to reuse some, e.g., from streamutil.h? 15. others While reading the old discussions [2], Amit suggested keeping the comment and avoiding creating a temporary slot. You said "Got it", but the temp slot still exists. Is there a reason? Can you clarify your opinion? 16. others While reading [2] and [3], I was confused by the decision. You and Amit discussed the combination of pg_createsubscriber and slot sync, and how to handle slots on the physical standby. You seemed to agree to remove such a slot, and Amit also suggested raising an ERROR. However, you said in [8] that such handling is not mandatory, so a WARNING should be raised in dry_run mode. I was quite confused. Am I missing something? 17. others Per the discussion around [4], we might have to consider the case where options like data_directory and config_file were initially specified for the standby server. Another easy approach is to allow users to specify options like -o in pg_upgrade [5], which is similar to your idea. Thoughts? 18. others How do you handle the reported failure [6]? 19. main ``` char *pub_base_conninfo = NULL; char *sub_base_conninfo = NULL; char *dbname_conninfo = NULL; ``` No need to initialize pub_base_conninfo and sub_base_conninfo. These variables would not be free'd. 20. others IIUC, slot creation would not finish if there are prepared transactions. Should we detect this in the verification phase and raise an ERROR? 21. others As I said in [7], the catch-up would not finish if a long recovery_min_apply_delay is used. Should we overwrite it during the catch-up? 22. pg_createsubscriber.sgml ``` Check Write recovery parameters into the target data... ``` Not sure, but "Check" seems not needed. 
[1]: https://www.postgresql.org/message-id/b9aa614c-84ba-a869-582f-8d5e3ab57424%40enterprisedb.com [2]: https://www.postgresql.org/message-id/9fd3018d-0e5f-4507-aee6-efabfb5a4440%40app.fastmail.com [3]: https://www.postgresql.org/message-id/CAA4eK1L%2BE-bdKaOMSw-yWizcuprKMyeejyOwWjq_57%3DUqh-f%2Bg%40mail.gmail.com [4]: https://www.postgresql.org/message-id/TYCPR01MB12077B63D81B49E9DFD323661F55A2%40TYCPR01MB12077.jpnprd01.prod.outlook.com [5]: https://www.postgresql.org/docs/devel/pgupgrade.html#:~:text=options%20to%20be%20passed%20directly%20to%20the%20old%20postgres%20command%3B%20multiple%20option%20invocations%20are%20appended [6]: https://www.postgresql.org/message-id/CAHv8Rj%2B5mzK9Jt%2B7ECogJzfm5czvDCCd5jO1_rCx0bTEYpBE5g%40mail.gmail.com [7]: https://www.postgresql.org/message-id/OS3PR01MB98828B15DD9502C91E0C50D7F57D2%40OS3PR01MB9882.jpnpr
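The decomposition proposed in item 12 can be sketched as below. The phase names are the ones suggested in that item, but the bodies are stubs (each just records that it ran); this only illustrates how main() would shrink to a short sequence of calls, not the real control flow of pg_createsubscriber.

```c
#include <string.h>

/* Each stub appends its tag so the ordering can be observed; the real
 * functions would carry out the work described in item 12. */
static char trace[128];

static void parse_command_options(void) { strcat(trace, "parse;"); }
static void verification_phase(void) { strcat(trace, "verify;"); }	/* system id + check_XXX */
static void catchup_phase(void) { strcat(trace, "catchup;"); }	/* temp slot, recovery params, wait */
static void cleanup_phase(void) { strcat(trace, "cleanup;"); }	/* drop primary_slot, new system id */

/* A main() reduced to phase calls stays well under the 100-line guideline. */
static const char *
run_pg_createsubscriber(void)
{
	trace[0] = '\0';
	parse_command_options();
	verification_phase();
	catchup_phase();
	cleanup_phase();
	return trace;
}
```

The benefit is that each phase can be read, tested, and reordered independently, which is exactly the concern raised about the 400-line main().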
RE: Some shared memory chunks are allocated even if related processes won't start
Dear Alvaro, Thanks for giving comments! > > I agreed it sounds good, but I don't think it can be implemented by > > current interface. An interface for dynamically allocating memory is > > GetNamedDSMSegment(), and it returns the same shared memory region if > > input names are the same. Therefore, there is no way to re-alloc the > > shared memory. > > Yeah, I was imagining something like this: the workitem-array becomes a > struct, which has a name and a "next" pointer and a variable number of > workitem slots; the AutoVacuumShmem struct has a pointer to the first > workitem-struct and the last one; when a workitem is requested by > brininsert, we initially allocate via GetNamedDSMSegment("workitem-0") a > workitem-struct with a smallish number of elements; if we request > another workitem and the array is full, we allocate another array via > GetNamedDSMSegment("workitem-1") and store a pointer to it in workitem-0 > (so that the list can be followed by an autovacuum worker that's > processing the database), and it's also set as the tail of the list in > AutoVacuumShmem (so that we know where to store further work items). > When all items in a workitem-struct are processed, we can free it > (I guess via dsm_unpin_segment), and make AutoVacuumShmem->av_workitems > point to the next one in the list. > > This way, the "array" can grow arbitrarily. > Basically this sounds good. My concerns are: * GetNamedDSMSegment() does not return a raw pointer to the dsm_segment. This means it may be difficult to call dsm_unpin_segment() on the caller side. * Dynamic shared memory is recorded in a dshash (dsm_registry_table) and the entry is never deleted, so a reference to the chunk might remain. Overall, the dsm_registry may also need to be extended. I have not started working on this yet; I will share results after trying it. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
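Alvaro's linked-list-of-arrays idea can be sketched in plain C as below. This uses calloc instead of GetNamedDSMSegment() (so it runs standalone) and invented type names; the point is only the growth strategy: a fixed-size chunk plus a next pointer, with a new chunk appended whenever the tail is full.

```c
#include <stdlib.h>

#define WORKITEMS_PER_CHUNK 4	/* the "smallish number of elements" */

/* Stand-in for the proposed workitem-struct: fixed array + next pointer. */
struct workitem_chunk
{
	int			nused;
	int			items[WORKITEMS_PER_CHUNK];
	struct workitem_chunk *next;
};

struct workitem_list
{
	struct workitem_chunk *head;	/* followed by the autovacuum worker */
	struct workitem_chunk *tail;	/* where new items are stored */
};

/*
 * Append one work item, allocating a new chunk when the tail is full.
 * In the real proposal the allocation would be GetNamedDSMSegment("workitem-N")
 * and the freeing of a drained chunk would be dsm_unpin_segment().
 */
static void
add_workitem(struct workitem_list *list, int item)
{
	if (list->tail == NULL || list->tail->nused == WORKITEMS_PER_CHUNK)
	{
		struct workitem_chunk *c = calloc(1, sizeof(*c));

		if (list->tail)
			list->tail->next = c;
		else
			list->head = c;
		list->tail = c;
	}
	list->tail->items[list->tail->nused++] = item;
}

static int
count_chunks(const struct workitem_list *list)
{
	int			n = 0;

	for (const struct workitem_chunk *c = list->head; c; c = c->next)
		n++;
	return n;
}
```

With this shape, requests are never discarded because "the array is full"; the array simply grows, which is the behavior the current fixed av_workItems array cannot provide.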
RE: Some shared memory chunks are allocated even if related processes won't start
Dear Alvaro, Thanks for discussing! > > I think it would be worth allocating AutoVacuumShmem->av_workItems using > dynamic shmem allocation, particularly to prevent workitems from being > discarded just because the array is full¹; but other than that, the > struct is just 64 bytes long so I doubt it's useful to allocate it > dynamically. > > ¹ I mean, if the array is full, just allocate another array, point to it > from the original one, and keep going. OK, I understand that my initial proposal is not so valuable, so I can withdraw it. About the suggestion: you had AutoVacuumRequestWork() and brininsert() in mind, right? I agree it sounds good, but I don't think it can be implemented with the current interface. The interface for dynamically allocating memory is GetNamedDSMSegment(), and it returns the same shared memory region if the input names are the same. Therefore, there is no way to re-allocate the shared memory. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: Some shared memory chunks are allocated even if related processes won't start
Dear Alvaro, Thanks for giving comments! > > While reading codes, I found that ApplyLauncherShmemInit() and > > AutoVacuumShmemInit() are always called even if they would not be > > launched. > > Note that there are situations where the autovacuum launcher is started > even though autovacuum is nominally turned off, and I suspect your > proposal would break that. IIRC this occurs when the Xid or multixact > counters cross the max_freeze_age threshold. Right. In GetNewTransactionId(), SetTransactionIdLimit(), and some other places, PMSIGNAL_START_AUTOVAC_LAUNCHER is sent to the postmaster when the xid exceeds autovacuum_freeze_max_age. This behavior is already documented [1]: ``` To ensure that this does not happen, autovacuum is invoked on any table that might contain unfrozen rows with XIDs older than the age specified by the configuration parameter autovacuum_freeze_max_age. (This will happen even if autovacuum is disabled.) ``` This means that my first idea won't work well: even if the postmaster does not initially allocate the shared memory, backends may request starting autovacuum, which would then use the region. However, the second idea - allocating the shared memory dynamically - is still valid. It would be somewhat more efficient for systems whose tuples never need freezing. Thoughts? [1]: https://www.postgresql.org/docs/devel/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
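The wraparound behavior quoted from the docs boils down to a condition like the following. This is a deliberate simplification (an invented function with a made-up signature, not code from GetNewTransactionId() or SetTransactionIdLimit()), meant only to show why the launcher's shared memory cannot be skipped merely because autovacuum = off at startup.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified decision: the autovacuum launcher must be started either
 * because autovacuum is enabled, or because some table's XID age crossed
 * autovacuum_freeze_max_age - in the latter case even when autovacuum is
 * nominally off.  Hence the shared memory must exist either way.
 */
static bool
must_start_autovacuum(bool autovacuum_enabled,
					  uint32_t oldest_xid_age,
					  uint32_t freeze_max_age)
{
	if (autovacuum_enabled)
		return true;
	/* wraparound protection overrides autovacuum = off */
	return oldest_xid_age > freeze_max_age;
}
```

Because the second branch can become true at any time after startup, only a dynamically allocated region (idea 2) can safely defer the allocation.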
RE: Some shared memory chunks are allocated even if related processes won't start
Dear Tom, Thanks for replying! > "Hayato Kuroda (Fujitsu)" writes: > > While reading codes, I found that ApplyLauncherShmemInit() and > AutoVacuumShmemInit() > > are always called even if they would not be launched. > > It may be able to reduce the start time to avoid the unnecessary allocation. > > Why would this be a good idea? It would require preventing the > decision not to launch them from being changed later, except via > postmaster restart. Right. It is important to relax their GucContext. > We've generally been trying to move away > from unchangeable-without-restart decisions. In your example, > > > +if (max_logical_replication_workers == 0 || IsBinaryUpgrade) > > +return; > > max_logical_replication_workers is already PGC_POSTMASTER so there's > not any immediate loss of flexibility, but I don't think it's a great > idea to introduce another reason why it has to be PGC_POSTMASTER. You are right. The first example implies that max_logical_replication_workers can never be changed, so it is not appropriate. So ... what about the second one? That approach allows allocating the memory after startup, which means users may be able to change the parameter from 0 to a natural number in the future. (Of course, such an operation is prohibited for now.) Could it be an initial step toward relaxing the condition? Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
Some shared memory chunks are allocated even if related processes won't start
Dear hackers, While reading the code, I found that ApplyLauncherShmemInit() and AutoVacuumShmemInit() are always called even if the related processes would not be launched. Avoiding the unnecessary allocation might reduce the startup time, though I know the improvement would be quite small because the allocated chunks are small. Anyway, there are several ways to fix this: 1) Skip calling ShmemInitStruct() if the related process would not be launched. I think this is the easiest approach. E.g., ``` --- a/src/backend/replication/logical/launcher.c +++ b/src/backend/replication/logical/launcher.c @@ -962,6 +962,9 @@ ApplyLauncherShmemInit(void) { bool found; +if (max_logical_replication_workers == 0 || IsBinaryUpgrade) +return; + ``` 2) Dynamically allocate the shared memory. This has been possible since a recent commit [1]. I made a small PoC only for the logical launcher to show what I mean; please see the attached diff. Since several processes (backend, apply worker, parallel apply worker, and tablesync worker) refer to the chunk, attachment code must be added in several places. If you agree this should be fixed, I will create a patch. Thoughts? [1]: https://github.com/postgres/postgres/commit/8b2bcf3f287c79eaebf724cba57e5ff664b01e06 Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ Attachment: dynamic_allocation.diff
RE: speed up a logical replica setup
Dear Euler, > Sorry, which pg_rewind option did you mention? I cannot find. > IIUC, -l is an only option which can accept the path, but it is not related > with us. Sorry, I found it [1]. I had confused it with pg_resetwal. But my opinion has not really changed. The reason --config-file exists is that pg_rewind requires the target to be stopped [2], which is different from the current pg_createsubscriber. So I still do not want to add such options. [1]: https://www.postgresql.org/docs/current/app-pgrewind.html#:~:text=%2D%2Dconfig%2Dfile%3Dfilename [2]: ``` The target server must be shut down cleanly before running pg_rewind ``` Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: Have pg_basebackup write "dbname" in "primary_conninfo"?
> PSA the patch for implementing it. It is basically same as Ian's one. > However, this patch still cannot satisfy the condition 3). > > `pg_basebackup -D data_N2 -d "user=postgres" -R` > -> dbname would not be appeared in primary_conninfo. > > This is because `if (connection_string)` case in GetConnection() explicy > override > a dbname to "replication". I've tried to add a dummy entry {"dbname", NULL} > pair > before the overriding, but it is no-op. Because The replacement of the dbname > in > pqConnectOptions2() would be done only for the valid (=lastly specified) > connection options. Oh, this patch missed the straightforward case: pg_basebackup -D data_N2 -d "user=postgres dbname=replication" -R -> dbname would not appear in primary_conninfo. So I think it cannot be applied as-is. Sorry for sharing a broken patch. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: Have pg_basebackup write "dbname" in "primary_conninfo"?
Dear Amit,

> We do append dbname=replication even in libpqrcv_connect for .pgpass
> lookup as mentioned in comments. See below:
> libpqrcv_connect()
> {
>
> else
> {
> /*
> * The database name is ignored by the server in replication mode,
> * but specify "replication" for .pgpass lookup.
> */
> keys[++i] = "dbname";
> vals[i] = "replication";
> }
> ...
> }

OK. So we must add the value for authentication (the .pgpass lookup), right? I think this should be described in GetConnection() as well.

> I think as part of this effort we should just add dbname to
> primary_conninfo written in postgresql.auto.conf file. As above, the
> question is valid whether we should do it just for 17 or for prior
> versions. Let's discuss more on that. I am not sure of the use case
> for versions before 17 but commit cca97ce6a665 mentioned that some
> middleware or proxies might however need to know the dbname to make
> the correct routing decision for the connection. Does that apply here
> as well? If so, we can do it and update the docs, otherwise, let's do
> it just for 17.

Hmm, I may have misunderstood your requirements again. So, we must satisfy all of the following, right?

1) Add the {"dbname", "replication"} key-value pair so that the .pgpass file is looked up correctly.
2) If -R is given, output a dbname=xxx value that can be used by the slotsync worker.
3) If the dbname is not given in the connection string, the username must be used, following the libpq connection rules.
4) No need to add the dbname=replication pair.

PSA the patch for implementing this. It is basically the same as Ian's. However, this patch still cannot satisfy condition 3):

`pg_basebackup -D data_N2 -d "user=postgres" -R`
-> dbname does not appear in primary_conninfo.

This is because the `if (connection_string)` case in GetConnection() explicitly overrides the dbname to "replication". I've tried to add a dummy {"dbname", NULL} pair before the overriding, but it is a no-op.
This is because the replacement of the dbname in pqConnectOptions2() is done only for the valid (i.e., lastly specified) connection options.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

pg_basebackup-write-dbname.v0002.patch
Description: pg_basebackup-write-dbname.v0002.patch
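To make the merge behavior described above concrete, here is a toy Python model of "last value wins, NULL means unspecified" option merging. The function name and the exact rule are my assumptions for illustration only; this is not the real pqConnectOptions2() code.

```python
# Toy model: why a dummy {"dbname", NULL} entry placed *before* the
# "replication" override has no effect. Later entries win, and entries
# whose value is None (NULL) are treated as "not specified" and skipped.

def merge_conn_options(*option_lists):
    """Merge (key, value) lists; later lists win; None values are skipped."""
    merged = {}
    for options in option_lists:
        for key, value in options:
            if value is not None:      # NULL entries never take effect
                merged[key] = value
    return merged

# GetConnection() appends dbname=replication after the dummy entry,
# so the dummy pair is discarded and "replication" always wins.
result = merge_conn_options(
    [("user", "postgres"), ("dbname", None)],   # dummy entry (no-op)
    [("dbname", "replication")],                # explicit override
)
assert result["dbname"] == "replication"
```

This matches the observation in the message above: only the lastly specified, valid value survives the merge.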
RE: Have pg_basebackup write "dbname" in "primary_conninfo"?
Dear Amit,

> When dbname is NULL or not given, it defaults to username. This
> follows the specs of the connection string. See (dbname #
> The database name. Defaults to be the same as the user name...) [1].
> Your patch breaks that specs, so I don't think it is correct.

I proposed that because I thought pg_basebackup basically intends a physical replication connection. But if the general libpq rule takes precedence, we should not apply my add_NULL_check.txt. Let's forget it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
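For reference, the libpq rule quoted above ("dbname defaults to be the same as the user name") can be sketched like this. The helper name is hypothetical; the real fallback lives inside libpq's pqConnectOptions2().

```python
# Sketch of the libpq default: when no dbname is given (NULL or empty),
# it falls back to the user name. Illustration only, not libpq code.

def effective_dbname(dbname, user):
    # "Defaults to be the same as the user name" (libpq connection specs)
    return dbname if dbname else user

assert effective_dbname(None, "hayato") == "hayato"
assert effective_dbname("", "hayato") == "hayato"   # empty counts as unset
assert effective_dbname("replication", "hayato") == "replication"
```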
RE: speed up a logical replica setup
Dear Euler,

>> The possible solution would be
>> 1) allow to run pg_createsubscriber if standby is initially stopped .
>> I observed that pg_logical_createsubscriber also uses this approach.
>> 2) read GUCs via SHOW command and restore them when server restarts
>
>3. add a config-file option. That's similar to what pg_rewind does.

Sorry, which pg_rewind option did you mention? I cannot find it. IIUC, -l is the only option that accepts a path, but it is not relevant here.

Also, I'm not sure of the benefit of adding new options; basically, there should be as few as possible. Is there any benefit over reading the values via SHOW? Even if pg_resetwal had such an option, the reason would be that the target postmaster must be stopped for pg_resetwal.

>I expect
>that Debian-based installations will have this issue.

I'm not familiar with the Debian environment, so could you explain the reason?

>It was not a good idea if you want to keep the postgresql.conf outside PGDATA.
>I mean you need extra steps that can be error prone (different settings between
>files).

Yeah, with my approach, users who specify such GUCs may not be happy.

So, based on the above discussion, we should choose one of the following. Thoughts?

a) Enforce that the standby must be *stopped*; options like config_file can then be specified via a command-line option.
b) Enforce that the standby must be *running*; options like config_file are then read via the SHOW command.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/
RE: speed up a logical replica setup
Dear Vignesh,

I forgot to reply to one of your suggestions.

> 2) There is a CheckDataVersion function which does exactly this, will
> we be able to use this:
> + /* Check standby server version */
> + if ((ver_fd = fopen(versionfile, "r")) == NULL)
> + pg_fatal("could not open file \"%s\" for reading: %m",
> versionfile);
> +
> + /* Version number has to be the first line read */
> + if (!fgets(rawline, sizeof(rawline), ver_fd))
> + {
> + if (!ferror(ver_fd))
> + pg_fatal("unexpected empty file \"%s\"", versionfile);
> + else
> + pg_fatal("could not read file \"%s\": %m",
> versionfile);
> + }
> +
> + /* Strip trailing newline and carriage return */
> + (void) pg_strip_crlf(rawline);
> +
> + if (strcmp(rawline, PG_MAJORVERSION) != 0)
> + {
> + pg_log_error("standby server is of wrong version");
> + pg_log_error_detail("File \"%s\" contains \"%s\",
> which is not compatible with this program's version \"%s\".",
> + versionfile,
> rawline, PG_MAJORVERSION);
> + exit(1);
> + }
> +
> + fclose(ver_fd);

I did not adopt this suggestion because I was not sure where the declaration and the implementation should live. A function that can be called from client programs must be declared under the src/include/{common|fe_utils|utils} directory. I looked through the files located there, but I could not find an appropriate place for CheckDataVersion. Also, I did not think a new file should be created only for this function.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Vignesh, > Few comments on the tests: > 1) If the dry run was successful because of some issue then the server > will be stopped so we can check for "pg_ctl status" if the server is > running otherwise the connection will fail in this case. Another way > would be to check if it does not have "postmaster was stopped" > messages in the stdout. > + > +# Check if node S is still a standby > +is($node_s->safe_psql('postgres', 'SELECT pg_catalog.pg_is_in_recovery()'), > + 't', 'standby is in recovery'); Just to confirm - your suggestion is to add `pg_ctl status`, right? Added. > 2) Can we add verification of "postmaster was stopped" messages in > the stdout for dry run without --databases testcase > +# pg_createsubscriber can run without --databases option > +command_ok( > + [ > + 'pg_createsubscriber', '--verbose', > + '--dry-run', '--pgdata', > + $node_s->data_dir, '--publisher-server', > + $node_p->connstr('pg1'), '--subscriber-server', > + $node_s->connstr('pg1') > + ], > + 'run pg_createsubscriber without --databases'); > + Hmm, in case of --dry-run, the server would be never shut down. See below part. ``` if (!dry_run) stop_standby_server(pg_ctl_path, opt.subscriber_dir); ``` > 3) This message "target server must be running" seems to be wrong, > should it be cannot specify cascading replicating standby as standby > node(this is for v22-0011 patch : > + 'pg_createsubscriber', '--verbose', '--pgdata', > $node_c->data_dir, > + '--publisher-server', $node_s->connstr('postgres'), > + '--port', $node_c->port, '--socketdir', $node_c->host, > + '--database', 'postgres' > ], > - 'primary server is in recovery'); > + 1, > + [qr//], > + [qr/primary server cannot be in recovery/], > + 'target server must be running'); Fixed. > 4) Should this be "Wait for subscriber to catch up" > +# Wait subscriber to catch up > +$node_s->wait_for_subscription_sync($node_p, $subnames[0]); > +$node_s->wait_for_subscription_sync($node_p, $subnames[1]); Fixed. 
> 5) Should this be 'Subscriptions has been created on all the specified > databases' > +); > +is($result, qq(2), > + 'Subscriptions has been created to all the specified databases' > +); Fixed, but "has" should be "have". > 6) Add test to verify current_user is not a member of > ROLE_PG_CREATE_SUBSCRIPTION, has no create permissions, has no > permissions to execution replication origin advance functions > > 7) Add tests to verify insufficient max_logical_replication_workers, > max_replication_slots and max_worker_processes for the subscription > node > > 8) Add tests to verify invalid configuration of wal_level, > max_replication_slots and max_wal_senders for the publisher node Hmm, but adding these checks may increase the test time. we should measure the time and then decide. > 9) We can use the same node name in comment and for the variable > +# Set up node P as primary > +$node_p = PostgreSQL::Test::Cluster->new('node_p'); > +$node_p->init(allows_streaming => 'logical'); > +$node_p->start; Fixed. > 10) Similarly we can use node_f instead of F in the comments. > +# Set up node F as about-to-fail node > +# Force it to initialize a new cluster instead of copying a > +# previously initdb'd cluster. > +{ > + local $ENV{'INITDB_TEMPLATE'} = undef; > + > + $node_f = PostgreSQL::Test::Cluster->new('node_f'); > + $node_f->init(allows_streaming => 'logical'); > + $node_f->start; > Fixed. Also, recent commit [1] allows to run the initdb forcibly. So followed. New patch can be available in [2]. [1]: https://github.com/postgres/postgres/commit/ff9e1e764fcce9a34467d614611a34d4d2a91b50 [2]: https://www.postgresql.org/message-id/TYCPR01MB12077CD76B53F9CAE7AC0F5562%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Vignesh, > Few comments: > 1) The below code can lead to assertion failure if the publisher is > stopped while dropping the replication slot: > + if (primary_slot_name != NULL) > + { > + conn = connect_database(dbinfo[0].pubconninfo); > + if (conn != NULL) > + { > + drop_replication_slot(conn, [0], > primary_slot_name); > + } > + else > + { > + pg_log_warning("could not drop replication > slot \"%s\" on primary", > + primary_slot_name); > + pg_log_warning_hint("Drop this replication > slot soon to avoid retention of WAL files."); > + } > + disconnect_database(conn); > + } > > pg_createsubscriber: error: connection to database failed: connection > to server on socket "/tmp/.s.PGSQL.5432" failed: No such file or > directory > Is the server running locally and accepting connections on that socket? > pg_createsubscriber: warning: could not drop replication slot > "standby_1" on primary > pg_createsubscriber: hint: Drop this replication slot soon to avoid > retention of WAL files. > pg_createsubscriber: pg_createsubscriber.c:432: disconnect_database: > Assertion `conn != ((void *)0)' failed. > Aborted (core dumped) > > This is happening because we are calling disconnect_database in case > of connection failure case too which has the following assert: > +static void > +disconnect_database(PGconn *conn) > +{ > + Assert(conn != NULL); > + > + PQfinish(conn); > +} Right. disconnect_database() was moved to if (conn != NULL) block. 
> 2) There is a CheckDataVersion function which does exactly this, will > we be able to use this: > + /* Check standby server version */ > + if ((ver_fd = fopen(versionfile, "r")) == NULL) > + pg_fatal("could not open file \"%s\" for reading: %m", > versionfile); > + > + /* Version number has to be the first line read */ > + if (!fgets(rawline, sizeof(rawline), ver_fd)) > + { > + if (!ferror(ver_fd)) > + pg_fatal("unexpected empty file \"%s\"", versionfile); > + else > + pg_fatal("could not read file \"%s\": %m", > versionfile); > + } > + > + /* Strip trailing newline and carriage return */ > + (void) pg_strip_crlf(rawline); > + > + if (strcmp(rawline, PG_MAJORVERSION) != 0) > + { > + pg_log_error("standby server is of wrong version"); > + pg_log_error_detail("File \"%s\" contains \"%s\", > which is not compatible with this program's version \"%s\".", > + versionfile, > rawline, PG_MAJORVERSION); > + exit(1); > + } > + > + fclose(ver_fd); > 3) Should this be added to typedefs.list: > +enum WaitPMResult > +{ > + POSTMASTER_READY, > + POSTMASTER_STILL_STARTING > +}; But the comment from Peter E. [1] was opposite. I did not handle this. > 4) pgCreateSubscriber should be mentioned after pg_controldata to keep > the ordering consistency: > diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml > index aa94f6adf6..c5edd244ef 100644 > --- a/doc/src/sgml/reference.sgml > +++ b/doc/src/sgml/reference.sgml > @@ -285,6 +285,7 @@ > > > > + > This has been already pointed out by Peter E. I did not handle this. > 5) Here pg_replication_slots should be pg_catalog.pg_replication_slots: > + if (primary_slot_name) > + { > + appendPQExpBuffer(str, > + "SELECT 1 FROM > pg_replication_slots " > + "WHERE active AND > slot_name = '%s'", > + primary_slot_name); Fixed. 
> 6) Here pg_settings should be pg_catalog.pg_settings: > +* - max_worker_processes >= 1 + number of dbs to be converted > + > * > +*/ > + res = PQexec(conn, > +"SELECT setting FROM pg_settings > WHERE name IN (" > +"'max_logical_replication_workers', " > +"'max_replication_slots', " > +"'max_worker_processes', " > +"'primary_slot_name') " > +"ORDER BY name"); Fixed. New version can be available in [2] [1]: https://www.postgresql.org/message-id/3ee79f2c-e8b3-4342-857c-a31b87e1afda%40eisentraut.org [2]: https://www.postgresql.org/message-id/TYCPR01MB12077CD76B53F9CAE7AC0F5562%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Vignesh, > Few comments regarding the documentation: > 1) max_replication_slots information seems to be present couple of times: > > + > + The target instance must have > + linkend="guc-max-replication-slots">max_replication_slots me> > + and linkend="guc-max-logical-replication-workers">max_logical_replica > tion_workers > + configured to a value greater than or equal to the number of target > + databases. > + > > + > + > + The target instance must have > + linkend="guc-max-replication-slots">max_replication_slots me> > + configured to a value greater than or equal to the number of target > + databases and replication slots. > + > + Fixed. > 2) Can we add an id to prerequisites and use it instead of referring > to r1-app-pg_createsubscriber-1: > - pg_createsubscriber checks if the > given target data > - directory has the same system identifier than the source data directory. > - Since it uses the recovery process as one of the steps, it starts the > - target server as a replica from the source server. If the system > - identifier is not the same, > pg_createsubscriber will > - terminate with an error. > + Checks the target can be converted. In particular, things listed in > + above section > would be > + checked. If these are not met > pg_createsubscriber > + will terminate with an error. > Changed. > 3) The code also checks the following: > Verify if a PostgreSQL binary (progname) is available in the same > directory as pg_createsubscriber. > > But this is not present in the pre-requisites of documentation. I think it is quite trivial so that I did not add. > 4) Here we mention that the target server should be stopped, but the > same is not mentioned in prerequisites: > + Here is an example of using > pg_createsubscriber. > + Before running the command, please make sure target server is stopped. > + > +$ pg_ctl -D /usr/local/pgsql/data > stop > + > + Oh, it is opposite, it should NOT be stopped. Fixed. 
> 5) If there is an error during any of the pg_createsubscriber
> operation like if create subscription fails, it might not be possible
> to rollback to the earlier state which had physical-standby
> replication. I felt we should document this and also add it to the
> console message like how we do in case of pg_upgrade.

Added. The new version is available in [1].

[1]: https://www.postgresql.org/message-id/TYCPR01MB12077CD76B53F9CAE7AC0F5562%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Alvaro,

> > 15.
> >
> > You said in case of failure, cleanups is not needed if the process exits
> > soon [1].
> > But some functions call PQfinish() then exit(1) or pg_fatal(). Should we
> > follow?
>
> Hmm, but doesn't this mean that the server will log an ugly message that
> "client closed connection unexpectedly"? I think it's nicer to close
> the connection before terminating the process (especially since the
> code for that is already written).

OK. So we should disconnect properly even when the process is about to exit; I added the function calls back. Note that PQclear() was not added, because it only affects the application side.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Shlok, > Hi, > > I have reviewed the v21 patch. And found an issue. > > Initially I started the standby server with a new postgresql.conf file > (not the default postgresql.conf that is present in the instance). > pg_ctl -D ../standby start -o "-c config_file=/new_path/postgresql.conf" > > And I have made 'max_replication_slots = 1' in new postgresql.conf and > made 'max_replication_slots = 0' in the default postgresql.conf file. > Now when we run pg_createsubscriber on standby we get error: > pg_createsubscriber: error: could not set replication progress for the > subscription "pg_createsubscriber_5_242843": ERROR: cannot query or > manipulate replication origin when max_replication_slots = 0 > NOTICE: dropped replication slot "pg_createsubscriber_5_242843" on publisher > pg_createsubscriber: error: could not drop publication > "pg_createsubscriber_5" on database "postgres": ERROR: publication > "pg_createsubscriber_5" does not exist > pg_createsubscriber: error: could not drop replication slot > "pg_createsubscriber_5_242843" on database "postgres": ERROR: > replication slot "pg_createsubscriber_5_242843" does not exist > > I observed that when we run the pg_createsubscriber command, it will > stop the standby instance (the non-default postgres configuration) and > restart the standby instance which will now be started with default > postgresql.conf, where the 'max_replication_slot = 0' and > pg_createsubscriber will now fail with the error given above. > I have added the script file with which we can reproduce this issue. > Also similar issues can happen with other configurations such as port, etc. Possible. So the issue is that GUC settings might be changed after the restart so that the verification phase may not be enough. There are similar GUCs in [1] and they may have similar issues. E.g., if "hba_file" is set when the server started, the file cannot be seen after the restart, so pg_createsubscriber may not connect to it anymore. 
> The possible solution would be
> 1) allow to run pg_createsubscriber if standby is initially stopped .
> I observed that pg_logical_createsubscriber also uses this approach.
> 2) read GUCs via SHOW command and restore them when server restarts
>

I also prefer the first solution. Another reason why the standby should be stopped is backups: the standby instance should basically be saved before running pg_createsubscriber. The easiest way is a plain file-system copy, and the postmaster must be stopped at that time. I feel it is better if users can run the command immediately after taking the copy. Thoughts?

[1]: https://www.postgresql.org/docs/current/storage-file-layout.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Have pg_basebackup write "dbname" in "primary_conninfo"?
Dear Horiguchi-san,

> GetConnection()@streamutil.c wants to ensure conninfo has a fallback
> database name ("replication"). However, the function seems to be
> ignoring the case where neither dbname nor connection string is given,
> which is the first case Kuroda-san raised. The second case is the
> intended behavior of the function.
>
> > /* pg_recvlogical uses dbname only; others use connection_string only.
> */
> > Assert(dbname == NULL || connection_string == NULL);
>
> And the function incorrectly assumes that the connection string
> requires "dbname=replication".
>
> > * Merge the connection info inputs given in form of connection string,
> > * options and default values (dbname=replication, replication=true,
> etc.)
>
> But the name is a pseudo database name only used by pg_hba.conf
> (maybe) , which cannot be used as an actual database name (without
> quotation marks, or unless it is actually created). The function
> should not add the fallback database name because the connection
> string for physical replication connection doesn't require the dbname
> parameter. (attached patch)

I had also missed it, but the requirement was that dbname should be included only when the dbname option was explicitly specified [1]. Neither my patch nor yours can handle that. The libpq call chain PQconnectdbParams()->pqConnectOptions2() fills in all the parameters of the PGconn, and at that point the information about whether a value was intentionally specified or not is discarded. GenerateRecoveryConfig() then just writes down all the connection parameters from the PGconn, so it cannot distinguish them.

One approach, based on Horiguchi-san's patch and the initial one, is to avoid writing dbname when it is the same as the username.

[1]: https://www.postgresql.org/message-id/CAA4eK1KH1d1J5giPMZVOtMe0iqncf1CpNwkBKoYAmXdC-kEGZg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
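The workaround sketched at the end of the message above ("avoid writing dbname when it is the same as the username") could look roughly like this. This is a hypothetical helper for illustration, not the actual GenerateRecoveryConfig() code, and it inherits the known limitation: a dbname that was explicitly specified but happens to equal the username would also be dropped.

```python
# Rough sketch: when generating primary_conninfo, drop the dbname
# parameter if it merely equals the user name, since libpq would supply
# that default anyway. Hypothetical function and parameter names.

def conninfo_params_to_write(params):
    out = {}
    for key, value in params.items():
        if key == "dbname" and value == params.get("user"):
            continue  # probably just the libpq default, not user-specified
        out[key] = value
    return out

# dbname equal to user -> omitted from the written conninfo
assert "dbname" not in conninfo_params_to_write(
    {"user": "hayato", "dbname": "hayato", "host": "/tmp"})

# a genuinely different dbname is kept
assert conninfo_params_to_write(
    {"user": "hayato", "dbname": "postgres"})["dbname"] == "postgres"
```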
RE: Have pg_basebackup write "dbname" in "primary_conninfo"?
Dear Robert,

> > Just FYI - here is an extreme case. And note that I have applied proposed
> > patch.
> >
> > When `pg_basebackup -D data_N2 -R` is used:
> > ```
> > primary_conninfo = 'user=hayato ... dbname=hayato ...
> > ```
> >
> > But when `pg_basebackup -d "" -D data_N2 -R` is used:
> > ```
> > primary_conninfo = 'user=hayato ... dbname=replication
> > ```
>
> It seems like maybe somebody should look into why this is happening,
> and perhaps fix it.

I think this is caused by the part [1] below in GetConnection(). If both dbname and connection_string are NULL, we enter the else branch and NULL is substituted - the {"dbname", NULL} key-value pair is generated only here. Then, in PQconnectdbParams()->PQconnectStartParams()->pqConnectOptions2(), that odd entry is found and replaced with the username [2].

I think that if both the connection string and the dbname are NULL, the connection should be treated as a physical replication connection. Here is a patch to fix it. After applying it, both examples below output "dbname=replication"; you can confirm it as well.

```
pg_basebackup -D data_N2 -U postgres
pg_basebackup -D data_N2 -R -v
-> primary_conninfo = 'user=postgres ... dbname=replication ...
```

[1]
```
else
{
	keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
	values = pg_malloc0((argcount + 1) * sizeof(*values));
	keywords[i] = "dbname";
	values[i] = dbname;
	i++;
}
```

[2]
```
/*
 * If database name was not given, default it to equal user name
 */
if (conn->dbName == NULL || conn->dbName[0] == '\0')
{
	free(conn->dbName);
	conn->dbName = strdup(conn->pguser);
	if (!conn->dbName)
		goto oom_error;
}
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 56d1b15951..aea1ccba36 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -119,7 +119,7 @@ GetConnection(void)
 	keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
 	values = pg_malloc0((argcount + 1) * sizeof(*values));
 	keywords[i] = "dbname";
-	values[i] = dbname;
+	values[i] = dbname == NULL ? "replication" : dbname;
 	i++;
 }
RE: Have pg_basebackup write "dbname" in "primary_conninfo"?
Dear Robert,

> Seems weird to me. You don't use dbname=replication to ask for a
> replication connection, so why would we ever end up with that
> anywhere? And especially in only one of two such closely related
> cases?

Just FYI - here is an extreme case. Note that I have applied the proposed patch.

When `pg_basebackup -D data_N2 -R` is used:
```
primary_conninfo = 'user=hayato ... dbname=hayato ...
```

But when `pg_basebackup -d "" -D data_N2 -R` is used:
```
primary_conninfo = 'user=hayato ... dbname=replication
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
RE: Have pg_basebackup write "dbname" in "primary_conninfo"?
Dear Ian,

Thanks for making the patch.

> With the addition of "pg_sync_replication_slots()", there is now a use-case for
> including "dbname" in "primary_conninfo" and the docs have changed from
> stating [1]:
>
> Do not specify a database name in the primary_conninfo string.
>
> to [2]:
>
> For replication slot synchronization (see Section 48.2.3), it is also
> necessary to specify a valid dbname in the primary_conninfo string. This will
> only be used for slot synchronization. It is ignored for streaming.
>
> [1] https://www.postgresql.org/docs/16/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY
> [2] https://www.postgresql.org/docs/devel/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-STANDBY
>
> However, when setting up a standby (with the intent of creating a logical
> standby) with pg_basebackup, providing the -R/-write-recovery-conf option
> results in a "primary_conninfo" string being generated without a "dbname"
> parameter, which requires a manual change should one intend to use
> "pg_sync_replication_slots()".
>
> I can't see any reason for continuing to omit "dbname", so suggest it should
> only continue to be omitted for 16 and earlier; see attached patch.

Hmm, I also cannot find a reason, but we can document the change.

> Note that this does mean that if the conninfo string provided to pg_basebackup
> does not contain "dbname", the generated "primary_conninfo" GUC will default to
> "dbname=replication". It would be easy enough to suppress this, but AFAICS
> there's no way to tell if it was explicitly supplied by the user, in which case
> it would be surprising if it were omitted from the final "primary_conninfo"
> string.

I found an inconsistency. When I ran `pg_basebackup -D data_N2 -U postgres -R`, dbname was set to the username:

```
primary_conninfo = 'user=postgres ... dbname=postgres
```

However, when I ran `pg_basebackup -D data_N2 -d "user=postgres" -R`, dbname was set to "replication". Is this intentional?

```
primary_conninfo = 'user=postgres ... dbname=replication...
```

> The only other place where GenerateRecoveryConfig() is called is pg_rewind;
> I can't see any adverse affects from the proposed change. But it's perfectly
> possible there's something blindingly obvious I'm overlooking.

The in-progress pg_createsubscriber feature [1] also uses GenerateRecoveryConfig(), but I cannot see any bad effects there. The output is used to create a consistent standby from the primary, and even if dbname is set in primary_conninfo, it is ignored.

[1]: https://commitfest.postgresql.org/47/4637/

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
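The inconsistency reported above can be summarized as a toy model of the two code paths in GetConnection(): only the connection-string (-d) path merges the dbname=replication fallback, while the keyword (-U) path leaves dbname unset, so libpq defaults it to the username. This is a reconstruction of the *observed* behavior only; the function name is hypothetical.

```python
# Toy reconstruction of the asymmetry: which dbname ends up in the
# generated primary_conninfo, depending on how pg_basebackup was invoked.
# Illustration of observed behavior, not the real GetConnection() code.

def resulting_dbname(connection_string_given, user):
    if connection_string_given:
        # -d "...": the dbname=replication fallback is merged in
        return "replication"
    # -U user only: no dbname specified, so libpq defaults it to the user
    return user

assert resulting_dbname(False, "postgres") == "postgres"     # -U postgres -R
assert resulting_dbname(True, "postgres") == "replication"   # -d "user=postgres" -R
```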
RE: speed up a logical replica setup
Dear Vignesh,

Thanks for giving comments!

> Few comments for v22-0001 patch:
> 1) The second "if (strcmp(PQgetvalue(res, 0, 1), "t") != 0)"" should
> be if (strcmp(PQgetvalue(res, 0, 2), "t") != 0):
> + if (strcmp(PQgetvalue(res, 0, 1), "t") != 0)
> + {
> + pg_log_error("permission denied for database %s",
> dbinfo[0].dbname);
> + return false;
> + }
> + if (strcmp(PQgetvalue(res, 0, 1), "t") != 0)
> + {
> + pg_log_error("permission denied for function \"%s\"",
> +
> "pg_catalog.pg_replication_origin_advance(text, pg_lsn)");
> + return false;
> + }

I have already pointed this out as comment #8 [1], and it is fixed in v22-0005.

> 2) pg_createsubscriber fails if a table is parallely created in the
> primary node:
> 2024-02-20 14:38:49.005 IST [277261] LOG: database system is ready to
> accept connections
> 2024-02-20 14:38:54.346 IST [277270] ERROR: relation "public.tbl5"
> does not exist
> 2024-02-20 14:38:54.346 IST [277270] STATEMENT: CREATE SUBSCRIPTION
> pg_createsubscriber_5_277236 CONNECTION ' dbname=postgres' PUBLICATION
> pg_createsubscriber_5 WITH (create_slot = false, copy_data = false,
> enabled = false)
>
> If we are not planning to fix this, at least it should be documented

The error occurs when tables are created after the promotion, right? I do not think it can be fixed until DDL logical replication is implemented, so +1 for adding such a description.
> 3) Error conditions is verbose mode has invalid error message like > "out of memory" messages like in below: > pg_createsubscriber: waiting the postmaster to reach the consistent state > pg_createsubscriber: postmaster reached the consistent state > pg_createsubscriber: dropping publication "pg_createsubscriber_5" on > database "postgres" > pg_createsubscriber: creating subscription > "pg_createsubscriber_5_278343" on database "postgres" > pg_createsubscriber: error: could not create subscription > "pg_createsubscriber_5_278343" on database "postgres": out of memory Because some places use PQerrorMessage() wrongly. It should be PQresultErrorMessage(). Fixed in v22-0005. > 4) In error cases we try to drop this publication again resulting in error: > + /* > +* Since the publication was created before the > consistent LSN, it is > +* available on the subscriber when the physical > replica is promoted. > +* Remove publications from the subscriber because it > has no use. > +*/ > + drop_publication(conn, [i]); > > Which throws these errors(because of drop publication multiple times): > pg_createsubscriber: dropping publication "pg_createsubscriber_5" on > database "postgres" > pg_createsubscriber: error: could not drop publication > "pg_createsubscriber_5" on database "postgres": ERROR: publication > "pg_createsubscriber_5" does not exist > pg_createsubscriber: dropping publication "pg_createsubscriber_5" on > database "postgres" > pg_createsubscriber: dropping the replication slot > "pg_createsubscriber_5_278343" on database "postgres" Right. One approach is to use DROP PUBLICATION IF EXISTS statement. Thought? > 5) In error cases, wait_for_end_recovery waits even though it has > identified that the replication between primary and standby is > stopped: > +/* > + * Is recovery still in progress? > + * If the answer is yes, it returns 1, otherwise, returns 0. If an error > occurs > + * while executing the query, it returns -1. 
> + */
> +static int
> +server_is_in_recovery(PGconn *conn)
> +{
> + PGresult *res;
> + int ret;
> +
> + res = PQexec(conn, "SELECT pg_catalog.pg_is_in_recovery()");
> +
> + if (PQresultStatus(res) != PGRES_TUPLES_OK)
> + {
> + PQclear(res);
> + pg_log_error("could not obtain recovery progress");
> + return -1;
> + }
> +
>
> You can simulate this by stopping the primary just before
> wait_for_end_recovery and you will see these error messages, but
> pg_createsubscriber will continue to wait:
> pg_createsubscriber: error: could not obtain recovery progress
> pg_createsubscriber: error: could not obtain recovery progress
> pg_createsubscriber: error: could not obtain recovery progress
> pg_createsubscriber: error: could not obtain recovery progress

Yeah, v22-0001 cannot detect a disconnection from the primary or the standby. v22-0007 can detect a standby crash, but the v22 patch set still cannot detect a primary crash. Euler came up with an approach [2] for this, but it has not been implemented yet.

[1]: https://www.postgresql.org/message-id/TYCPR01MB12077756323B79042F29DDAEDF54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com
[2]: https://www.postgresql.org/message-id/2231a04b-f2d4-4a4e-b5cd-56be8b002427%40app.fastmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/
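One possible shape for the missing crash detection discussed above is to stop retrying after several consecutive query failures instead of waiting forever. The following Python sketch is purely illustrative (hypothetical names and thresholds); it is not the pg_createsubscriber patch itself.

```python
# Sketch: a wait loop that bails out after N consecutive failures to
# query pg_is_in_recovery(), rather than retrying indefinitely.
# check_recovery() mimics server_is_in_recovery():
#   1 = still in recovery, 0 = recovery finished, -1 = query failed.

def wait_for_end_recovery(check_recovery, max_failures=5, attempts=100):
    failures = 0
    for _ in range(attempts):
        status = check_recovery()
        if status == 0:
            return True            # recovery finished
        if status == -1:
            failures += 1
            if failures >= max_failures:
                return False       # primary/standby is probably gone
        else:
            failures = 0           # reset counter on a successful query
    return False                   # gave up after too many attempts

# Simulated primary crash: the query keeps failing, so we give up.
assert wait_for_end_recovery(lambda: -1, max_failures=3) is False
# Normal case: recovery is already finished.
assert wait_for_end_recovery(lambda: 0) is True
```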
RE: pg_upgrade and logical replication
Dear Vignesh, Thanks for reviewing! PSA new version. > > Thanks for the updated patch, Few suggestions: > 1) This can be moved to keep it similar to other tests: > +# Setup a disabled subscription. The upcoming test will check the > +# pg_createsubscriber won't work, so it is sufficient. > +$publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1"); > +$old_sub->safe_psql('postgres', > + "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' > PUBLICATION regress_pub1 WITH (enabled = false)" > +); > + > +$old_sub->stop; > + > +# -- > +# Check that pg_upgrade fails when max_replication_slots configured in the > new > +# cluster is less than the number of subscriptions in the old cluster. > +# -- > +$new_sub->append_conf('postgresql.conf', "max_replication_slots = 0"); > + > +# pg_upgrade will fail because the new cluster has insufficient > +# max_replication_slots. > +command_checks_all( > + [ > + 'pg_upgrade', '--no-sync', '-d', $old_sub->data_dir, > + '-D', $new_sub->data_dir, '-b', $oldbindir, > + '-B', $newbindir, '-s', $new_sub->host, > + '-p', $old_sub->port, '-P', $new_sub->port, > + $mode, '--check', > + ], > > like below and the extra comment can be removed: > +# -- > +# Check that pg_upgrade fails when max_replication_slots configured in the > new > +# cluster is less than the number of subscriptions in the old cluster. > +# -- > +# Create a disabled subscription. > +$publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1"); > +$old_sub->safe_psql('postgres', > + "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' > PUBLICATION regress_pub1 WITH (enabled = false)" > +); > + > +$old_sub->stop; > + > +$new_sub->append_conf('postgresql.conf', "max_replication_slots = 0"); > + > +# pg_upgrade will fail because the new cluster has insufficient > +# max_replication_slots. 
> +command_checks_all( > + [ > + 'pg_upgrade', '--no-sync', '-d', $old_sub->data_dir, > + '-D', $new_sub->data_dir, '-b', $oldbindir, > + '-B', $newbindir, '-s', $new_sub->host, > + '-p', $old_sub->port, '-P', $new_sub->port, > + $mode, '--check', > + ], Partially fixed. I moved the creation part below, but the comments were kept. > 2) This comment can be slightly changed: > +# Change configuration as well not to start the initial sync automatically > +$new_sub->append_conf('postgresql.conf', > + "max_logical_replication_workers = 0"); > > to: > Change configuration so that initial table sync sync does not get > started automatically Fixed. > 3) The old comments were slightly better: > # Resume the initial sync and wait until all tables of subscription > # 'regress_sub5' are synchronized > $new_sub->append_conf('postgresql.conf', > "max_logical_replication_workers = 10"); > $new_sub->restart; > $new_sub->safe_psql('postgres', "ALTER SUBSCRIPTION regress_sub5 > ENABLE"); > $new_sub->wait_for_subscription_sync($publisher, 'regress_sub5'); > > Like: > # Enable the subscription > $new_sub->safe_psql('postgres', "ALTER SUBSCRIPTION regress_sub5 > ENABLE"); > > # Wait until all tables of subscription 'regress_sub5' are synchronized > $new_sub->wait_for_subscription_sync($publisher, 'regress_sub5'); Per the comments from Amit [1], I did not change this. [1]: https://www.postgresql.org/message-id/CAA4eK1Ls%2BRmJtTvOgaRXd%2BeHSY3x-KUE%3DsfEGQoU-JF_UzA62A%40mail.gmail.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ v5-0001-Fix-testcase.patch Description: v5-0001-Fix-testcase.patch
RE: speed up a logical replica setup
g2; )); $node_p->safe_psql('pg1', 'CREATE TABLE tbl1 (a text)'); $node_p->safe_psql('pg1', "INSERT INTO tbl1 VALUES('first row')"); $node_p->safe_psql('pg2', 'CREATE TABLE tbl2 (a text)'); my $slotname = 'physical_slot'; $node_p->safe_psql('pg2', "SELECT pg_create_physical_replication_slot('$slotname')"); ``` I think settings for the same node can be gathered into one place. Also, any settings and definitions should be done just before they are used. 21. ``` $node_s->append_conf( 'postgresql.conf', qq[ log_min_messages = debug2 primary_slot_name = '$slotname' ]); ``` I could not find a reason why we set it to debug2. 22. ``` command_fails ``` command_checks_all() can check both the returned value and the outputs. Should we use it? 23. Can you add a header for each test case? E.g., ``` # -- # Check pg_createsubscriber fails when the target server is not a # standby of the source. ... # -- # Check pg_createsubscriber fails when the target server is not running ... # -- # Check pg_createsubscriber fails when the target server is a member of # the cascading standby. ... # -- # Check successful dry-run ... # -- # Check successful conversion ``` 24. ``` # Stop node C $node_c->teardown_node; ... $node_p->stop; $node_s->stop; $node_f->stop; ``` Why did you choose teardown here? 25. The creation of subscriptions is not directly tested. @subnames contains the names of the subscriptions, but the test just assumes there are more than two of them. Since it may be useful, I will post a top-up patch on Monday if there is no update. [1]: https://www.postgresql.org/message-id/89ccf72b-59f9-4317-b9fd-1e6d20a0c3b1%40app.fastmail.com [2]: https://www.postgresql.org/message-id/TYCPR01MB1207713BEC5C379A05D65E342F54B2%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/
RE: speed up a logical replica setup
Dear Shubham, Thanks for testing. It seems you ran with the v20 patch, but I confirmed it can be reproduced with v21. > > I found a couple of issues, while verifying the cascaded replication > with the following scenarios: > Scenario 1) Create cascade replication like node1->node2->node3 > without using replication slots (attached > cascade_3node_setup_without_slots has the script for this): > Then I ran pg_createsubscriber by specifying primary as node1 and > standby as node3, this scenario runs successfully. I was not sure if > this should be supported or not? Hmm. After the script, the cascading would be broken. The replication would be: ``` Node1 -> node2 | Node3 ``` Also, the operation is a bit strange: the consistent LSN is obtained from node1, but node3 waits until it receives the record from node2. Can this always succeed? > Scenario 2) Create cascade replication like node1->node2->node3 using > replication slots (attached cascade_3node_setup_with_slots has the > script for this): > Here, slot name was used as slot1 for node1 to node2 and slot2 for > node2 to node3. Then I ran pg_createsubscriber by specifying primary > as node1 and standby as node3. In this case pg_createsubscriber fails > with the following error: > pg_createsubscriber: error: could not obtain replication slot > information: got 0 rows, expected 1 row > [Inferior 1 (process 2623483) exited with code 01] > > This is failing because slot name slot2 is used between node2->node3 > but pg_createsubscriber is checked for slot1, the slot which is used > for replication between node1->node2. > Thoughts? Right. The inconsistency is quite strange. Overall, I feel such a case must be rejected. How should we detect it during the checking phase? Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Euler, Thanks for updating the patch! Before reviewing deeply, here are replies to your comments. > > > Points raised by me [1] are not solved yet. > > > > * What if the target version is PG16-? pg_ctl and pg_resetwal won't work. $ pg_ctl start -D /tmp/blah waiting for server to start 2024-02-15 23:50:03.448 -03 [364610] FATAL: database files are incompatible with server 2024-02-15 23:50:03.448 -03 [364610] DETAIL: The data directory was initialized by PostgreSQL version 16, which is not compatible with this version 17devel. stopped waiting pg_ctl: could not start server Examine the log output. $ pg_resetwal -D /tmp/blah pg_resetwal: error: data directory is of wrong version pg_resetwal: detail: File "PG_VERSION" contains "16", which is not compatible with this program's version "17". > > * What if the found executables have diffent version with > > pg_createsubscriber? The new code take care of it. > I would have preferred to have a common path and test them one by one, but I agree this works well. Let's keep it and hear opinions from others. > > > * What if the target is sending WAL to another server? > >I.e., there are clusters like `node1->node2-.node3`, and the target is > > node2. The new code detects if the server is in recovery and aborts as you suggested. A new option can be added to ignore the fact there are servers receiving WAL from it. > I confirmed it can detect this. > > > * Can we really cleanup the standby in case of failure? > >Shouldn't we suggest to remove the target once? If it finishes the promotion, no. I adjusted the cleanup routine a bit to avoid it. However, we should provide instructions to inform the user that it should create a fresh standby and try again. > I think the cleanup function is still not sufficient. In v21, recovery_ended is kept true even after drop_publication() is done in setup_subscriber(). I think made_subscription is not needed, and the function should output some messages when recovery_ended is set.
Besides, in the case of pg_upgrade, it always reports the message below before starting the migration. I think this is more helpful for users. ``` pg_log(PG_REPORT, "\n" "If pg_upgrade fails after this point, you must re-initdb the\n" "new cluster before continuing."); ``` > > > * Can we move outputs to stdout? Are you suggesting to use another logging framework? It is not a good idea because each client program is already using common/logging.c. > Hmm, indeed. Other programs in pg_basebackup seem to use the framework. > v20-0011: Do we really want to interrupt the recovery if the network was momentarily interrupted or if the OS killed walsender? Recovery is critical for the process. I think we should do our best to be resilient and deliver all changes required by the new subscriber. > It might be too strict to raise an ERROR as soon as we meet a disconnection. And at least 0011 was wrong - it should be PQgetvalue(res, 0, 1) for still_alive. > The proposal is not correct because the query return no tuples if it is disconnected so you cannot PQgetvalue(). > Sorry for the misunderstanding, but you might be confused. pg_createsubscriber sends a query to the target, and we are discussing the disconnection between the target and source instances. I think the connection which pg_createsubscriber has is still alive, so PQgetvalue() can get a value. (BTW, callers of server_is_in_recovery() have not considered a disconnection from the target...) > The retry interval (the time that ServerLoop() will create another walreceiver) is determined by DetermineSleepTime() and it is a maximum of 5 seconds (SIGKILL_CHILDREN_AFTER_SECS). One idea is to retry 2 or 3 times before give up using the pg_stat_wal_receiver query. Do you have a better plan? > It's good to determine a threshold. It can define the acceptable number of walreceiver losses during the loop. I think we should retry at least until the postmaster revives the walreceiver.
The check would run once per second, so more than 5 (or 10) retries may be better. Thoughts? Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/
RE: pg_upgrade and logical replication
Dear Vignesh, Thanks for reviewing! PSA new version. > > Thanks for the updated patch, few suggestions: > 1) Can we use a new publication for this subscription too so that the > publication and subscription naming will become consistent throughout > the test case: > +# Table will be in 'd' (data is being copied) state as table sync will fail > +# because of primary key constraint error. > +my $started_query = > + "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'd'"; > +$old_sub->poll_query_until('postgres', $started_query) > + or die > + "Timed out while waiting for the table state to become 'd' (datasync)"; > + > +# Create another subscription and drop the subscription's replication origin > +$old_sub->safe_psql('postgres', > + "CREATE SUBSCRIPTION regress_sub3 CONNECTION '$connstr' > PUBLICATION regress_pub2 WITH (enabled = false)" > +); > > So after the change it will become like subscription regress_sub3 for > publication regress_pub3, subscription regress_sub4 for publication > regress_pub4 and subscription regress_sub5 for publication > regress_pub5. A new publication was defined. > 2) The tab_upgraded1 table can be created along with create > publication and create subscription itself: > $publisher->safe_psql('postgres', > "CREATE PUBLICATION regress_pub3 FOR TABLE tab_upgraded1"); > $old_sub->safe_psql('postgres', > "CREATE SUBSCRIPTION regress_sub4 CONNECTION '$connstr' PUBLICATION > regress_pub3 WITH (failover = true)" > ); The definition of tab_upgraded1 was moved to the place you pointed. 
> 3) The tab_upgraded2 table can be created along with create > publication and create subscription itself to keep it consistent: > $publisher->safe_psql('postgres', > - "ALTER PUBLICATION regress_pub2 ADD TABLE tab_upgraded2"); > + "CREATE PUBLICATION regress_pub4 FOR TABLE tab_upgraded2"); > $old_sub->safe_psql('postgres', > - "ALTER SUBSCRIPTION regress_sub2 REFRESH PUBLICATION"); > + "CREATE SUBSCRIPTION regress_sub5 CONNECTION '$connstr' > PUBLICATION regress_pub4" > +); Ditto. > With above fixes, the following can be removed: > # Initial setup > $publisher->safe_psql( > 'postgres', qq[ > CREATE TABLE tab_upgraded1(id int); > CREATE TABLE tab_upgraded2(id int); > ]); > $old_sub->safe_psql( > 'postgres', qq[ > CREATE TABLE tab_upgraded1(id int); > CREATE TABLE tab_upgraded2(id int); > ]); Yes, earlier definitions were removed instead. Also, some comments were adjusted based on these fixes. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ v4-0001-Fix-testcase.patch Description: v4-0001-Fix-testcase.patch
RE: pg_upgrade and logical replication
Dear Justin, Thanks for replying! > What optimizations? I can't see them, and since the patch is described > as rearranging test cases (and therefore already difficult to read), I > guess they should be a separate patch, or the optimizations described. The basic idea was to reduce the number of CREATE/DROP statements, but it has been changed for now: publications and subscriptions are created and dropped per test case. E.g., in the successful-upgrade case, the steps below are done: 1. create two publications 2. create a subscription with failover = true 3. avoid further initial sync by setting max_logical_replication_workers = 0 4. create another subscription 5. confirm the statuses of the tables are either 'i' or 'r' 6. run pg_upgrade 7. confirm the table statuses are preserved 8. confirm the replication origins are preserved. A new patch is available in [1]. [1]: https://www.postgresql.org/message-id/TYCPR01MB12077B16EEDA360BA645B96F8F54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: pg_upgrade and logical replication
Dear Amit, > This sounds like a reasonable way to address the reported problem. OK, thanks! > Justin, do let me know if you think otherwise? > > Comment: > === > * > -# Setup an enabled subscription to verify that the running status and > failover > -# option are retained after the upgrade. > +# Setup a subscription to verify that the failover option are retained after > +# the upgrade. > $publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1"); > $old_sub->safe_psql('postgres', > - "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION > regress_pub1 WITH (failover = true)" > + "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' > PUBLICATION > regress_pub1 WITH (failover = true, enabled = false)" > ); > > I think it is better not to create a subscription in the early stage > which we wanted to use for the success case. Let's have separate > subscriptions for failure and success cases. I think that will avoid > the newly added ALTER statements in the patch. I had made a patch that avoided creating objects as much as possible, but it may lead to some confusion. I recreated the patch so that publications/subscriptions are created, and dropped at cleanup, for every test case. PSA a new version. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ v3-0001-Fix-testcase.patch Description: v3-0001-Fix-testcase.patch
RE: speed up a logical replica setup
Dear Euler, > If someone modifies data after promotion, fine; she has to deal with conflicts, if any. IMO it is solved adding one or two sentences in the documentation. > OK. I could not find any issues, for now. > > > > Regarding > GUCs, almost all of them is PGC_POSTMASTER (so it cannot be modified unless > the > server is restarted). The ones that are not PGC_POSTMASTER, does not affect > the > pg_createsubscriber execution [1]. > > > > IIUC, primary_conninfo and primary_slot_name is PGC_SIGHUP. Ditto. > Just to confirm: even if primary_slot_name were changed during the conversion, the slot that was initially set would be dropped. Currently we have not found any issues. > > > > I'm just pointing out that this case is a different from pg_upgrade (from > which > this idea was taken). I'm not saying that's a bad idea. I'm just arguing that > you might be preventing some access read only access (monitoring) when it is > perfectly fine to connect to the database and execute queries. As I said > before, the current UI allows anyone to setup the standby to accept only local > connections. Of course, it is an extra step but it is possible. However, once > you apply v16-0007, there is no option but use only local connection during > the > transformation. Is it an acceptable limitation? > > > > My remained concern is written above. If they do not problematic we may not > have > to restrict them for now. At that time, changes > > 1) overwriting a port number, > 2) setting listen_addresses = '' It can be implemented later if people are excited by it. > are not needed, right? IIUC inconsistency of -P may be still problematic. I still think we shouldn't have only the transformed primary_conninfo as option. > Hmm, OK. So let me summarize the current status and the discussions. Policy) Basically, we do not prohibit connecting to the primary/standby. primary_slot_name may be changed during the conversion, and tuples may be inserted on the target just after the promotion, but these seem to cause no issues.
API) -D (data directory) and -d (databases) are definitely needed. Regarding -P (a connection string for the source), we can require it for now. But note that it may cause an inconsistency if the node pointed to by -P is different from the node pointed to by primary_conninfo. As for the connection string for the target server, we can choose between two ways: a) accept a native connection string as -S. This can reuse the same parsing mechanism as -P, but there is room for a non-local server to be specified. b) accept a username/port as -U/-p (since the policy is as above, listen_addresses would not be overwritten; also, the port just specifies the listening port). This can avoid connecting to a non-local server, but more options may be needed. (E.g., is a socket directory needed? What about a password?) Other discussion points and reported issues) Points raised by me [1] are not solved yet. * What if the target version is PG16-? * What if the found executables have a different version from pg_createsubscriber? * What if the target is sending WAL to another server? I.e., there are clusters like `node1->node2->node3`, and the target is node2. * Can we really cleanup the standby in case of failure? Shouldn't we suggest to remove the target once? * Can we move outputs to stdout? [1]: https://www.postgresql.org/message-id/TYCPR01MB1207713BEC5C379A05D65E342F54B2%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/
RE: pg_upgrade and logical replication
Dear Vignesh, Thanks for verifying the fix! > Your proposal to change the tests in the following order: a) failure > due to the insufficient max_replication_slot b) failure because the > pg_subscription_rel has 'd' state c) successful upgrade. looks good to > me. Right. > I have also verified that your changes fixes the issue as the > successful upgrade is moved to the end and the old cluster is no > longer used after upgrade. Yeah, it is the same as my expectation. > One minor suggestion: > There is an extra line break here, this can be removed: > @@ -181,139 +310,5 @@ is($result, qq(1), > "check the data is synced after enabling the subscription for > the table that was in init state" > ); > > -# cleanup > Removed. PSA a new version of the patch. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ v2-0001-Fix-testcase.patch Description: v2-0001-Fix-testcase.patch
RE: pg_upgrade and logical replication
Dear Justin, > pg_upgrade/t/004_subscription.pl says > > |my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy'; > > ..but I think maybe it should not. > > When you try to use --link, it fails: > https://cirrus-ci.com/task/4669494061170688 > > |Adding ".old" suffix to old global/pg_control ok > | > |If you want to start the old cluster, you will need to remove > |the ".old" suffix from > /tmp/cirrus-ci-build/build/testrun/pg_upgrade/004_subscription/data/t_004_su > bscription_old_sub_data/pgdata/global/pg_control.old. > |Because "link" mode was used, the old cluster cannot be safely > |started once the new cluster has been started. > |... > | > |postgres: could not find the database system > |Expected to find it in the directory > "/tmp/cirrus-ci-build/build/testrun/pg_upgrade/004_subscription/data/t_004_s > ubscription_old_sub_data/pgdata", > |but could not open file > "/tmp/cirrus-ci-build/build/testrun/pg_upgrade/004_subscription/data/t_004_s > ubscription_old_sub_data/pgdata/global/pg_control": No such file or directory > |# No postmaster PID for node "old_sub" > |[19:36:01.396](0.250s) Bail out! pg_ctl start failed > Good catch! The root cause of the failure is that the old cluster is reused even after the successful upgrade. The documentation says [1]: > If you use link mode, the upgrade will be much faster (no file copying) and use less disk space, but you will not be able to access your old cluster once you start the new cluster after the upgrade. > > You could rename pg_control.old to avoid that immediate error, but that > doesn't > address the essential issue that "the old cluster cannot be safely started > once > the new cluster has been started." Yeah, I agree that accessing the old cluster after the upgrade should be avoided. IIUC, pg_upgrade is run three times in 004_subscription: 1. successful upgrade 2. failure due to insufficient max_replication_slots 3.
failure because pg_subscription_rel has a 'd' state. And the old instance is reused in all of the runs. Therefore, the most reasonable fix is to change the ordering of the tests, i.e., "successful upgrade" should be done last. The attached patch modifies the test accordingly. Also, it contains some optimizations. The test passes on my env: ``` pg_upgrade]$ PG_TEST_PG_UPGRADE_MODE='--link' PG_TEST_TIMEOUT_DEFAULT=10 make check PROVE_TESTS='t/004_subscription.pl' ... # +++ tap check in src/bin/pg_upgrade +++ t/004_subscription.pl .. ok All tests successful. Files=1, Tests=14, 9 wallclock secs ( 0.03 usr 0.00 sys + 0.55 cusr 1.08 csys = 1.66 CPU) Result: PASS ``` What do you think? [1]: https://www.postgresql.org/docs/devel/pgupgrade.html Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ 0001-Fix-testcase.patch Description: 0001-Fix-testcase.patch
RE: speed up a logical replica setup
Dear Shubham, Thanks for testing the patch! > > I tried verifying few scenarios by using 5 databases and came across > the following errors: > > ./pg_createsubscriber -D ../new_standby -P "host=localhost port=5432 > dbname=postgres" -S "host=localhost port=9000 dbname=postgres" -d db1 > -d db2 -d db3 -d db4 -d db5 > > pg_createsubscriber: error: publisher requires 6 wal sender > processes, but only 5 remain > pg_createsubscriber: hint: Consider increasing max_wal_senders to at least 7. > > It is successful only with 7 wal senders, so we can change error > messages accordingly. > > > pg_createsubscriber: error: publisher requires 6 replication slots, > but only 5 remain > pg_createsubscriber: hint: Consider increasing max_replication_slots > to at least 7. > > It is successful only with 7 replication slots, so we can change error > messages accordingly. I'm not the original author, but I don't think that is needed. The hint message already suggests changing the setting to 7. According to the doc [1], the primary message should be factual, and the hint message should be used for suggestions. I feel the current code follows that style. Thoughts? A new patch is available in [2]. [1]: https://www.postgresql.org/docs/devel/error-style-guide.html [2]: https://www.postgresql.org/message-id/TYCPR01MB12077A6BB424A025F04A8243DF54F2%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear hackers, I'm replying for trackability. > Further comments for v17. > > 01. > This program assumes that the target server has same major version with this. > Because the target server would be restarted by same version's pg_ctl command. > I felt it should be ensured by reading the PG_VERSION. Still investigating. > 02. > pg_upgrade checked the version of using executables, like pg_ctl, postgres, > and > pg_resetwal. I felt it should be as well. Still investigating. > 03. get_bin_directory > ``` > if (find_my_exec(path, full_path) < 0) > { > pg_log_error("The program \"%s\" is needed by %s but was not > found in the\n" >"same directory as \"%s\".\n", >"pg_ctl", progname, full_path); > ``` > > s/"pg_ctl"/progname The message was updated. > 04. > Missing canonicalize_path()? I found that find_my_exec() calls canonicalize_path(). No need to do so. > 05. > Assuming that the target server is a cascade standby, i.e., it has a role as > another primary. In this case, I thought the child node would not work. > Because > pg_createsubcriber runs pg_resetwal and all WAL files would be discarded at > that > time. I have not tested, but should the program detect it and exit earlier? Still investigating. > 06. > wait_for_end_recovery() waits forever even if the standby has been > disconnected > from the primary, right? should we check the status of the replication via > pg_stat_wal_receiver? Still investigating. > 07. > The cleanup function has couple of bugs. > > * If subscriptions have been created on the database, the function also tries > to > drop a publication. But it leads an ERROR because it has been already > dropped. > See setup_subscriber(). > * If the subscription has been created, drop_replication_slot() leads an > ERROR. > Because the subscriber tried to drop the subscription while executing DROP > SUBSCRIPTION. Only drop_publication() was removed. > 08. > I found that all messages (ERROR, WARNING, INFO, etc...)
would output to > stderr, > but I felt it should be on stdout. Is there a reason? pg_dump outputs > messages to > stderr, but the motivation might be to avoid confusion with dumps. Still investigating. > 09. > I'm not sure the cleanup for subscriber is really needed. Assuming that there > are two databases, e.g., pg1 pg2 , and we fail to create a subscription on > pg2. > This can happen when the subscription which has the same name has been > already > created on the primary server. > In this case a subscirption pn pg1 would be removed. But what is a next step? > Since a timelineID on the standby server is larger than the primary (note that > the standby has been promoted once), we cannot resume the physical replication > as-is. IIUC the easiest method to retry is removing a cluster once and > restarting > from pg_basebackup. If so, no need to cleanup the standby because it is > corrupted. > We just say "Please remove the cluster and recreate again". I still think it should be done, but it is not yet. A new patch is available in [1]. [1]: https://www.postgresql.org/message-id/TYCPR01MB12077A6BB424A025F04A8243DF54F2%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
" > +#include "common/restricted_token.h" > +#include "fe_utils/recovery_gen.h" > +#include "fe_utils/simple_list.h" > +#include "getopt_long.h" > +#include "utils/pidfile.h" > > 8) POSTMASTER_STANDBY and POSTMASTER_FAILED are not being used, is it > required or kept for future purpose: > +enum WaitPMResult > +{ > + POSTMASTER_READY, > + POSTMASTER_STANDBY, > + POSTMASTER_STILL_STARTING, > + POSTMASTER_FAILED > +}; I think they are here because WaitPMResult was just ported from pg_ctl.c. I renamed the enumeration and removed the unnecessary values. > 9) pg_createsubscriber should be kept after pg_basebackup to maintain > the consistent order: > diff --git a/src/bin/pg_basebackup/.gitignore > b/src/bin/pg_basebackup/.gitignore > index 26048bdbd8..b3a6f5a2fe 100644 > --- a/src/bin/pg_basebackup/.gitignore > +++ b/src/bin/pg_basebackup/.gitignore > @@ -1,5 +1,6 @@ > /pg_basebackup > /pg_receivewal > /pg_recvlogical > +/pg_createsubscriber Addressed. > 10) dry-run help message is not very clear, how about something > similar to pg_upgrade's message like "check clusters only, don't > change any data": > + printf(_(" -d, --database=DBNAME database to > create a subscription\n")); > + printf(_(" -n, --dry-run stop before > modifying anything\n")); > + printf(_(" -t, --recovery-timeout=SECS seconds to wait > for recovery to end\n")); > + printf(_(" -r, --retainretain log file > after success\n")); > + printf(_(" -v, --verbose output verbose > messages\n")); > + printf(_(" -V, --version output version > information, then exit\n")); Changed. New patch is available in [2]. [1]: https://www.postgresql.org/message-id/TYCPR01MB1207713BEC5C379A05D65E342F54B2%40TYCPR01MB12077.jpnprd01.prod.outlook.com [2]: https://www.postgresql.org/message-id/TYCPR01MB12077A6BB424A025F04A8243DF54F2%40TYCPR01MB12077.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Euler, Further comments for v17. 01. This program assumes that the target server has the same major version as this program, because the target server would be restarted by the same version's pg_ctl command. I felt this should be ensured by reading PG_VERSION. 02. pg_upgrade checks the versions of the executables it uses, like pg_ctl, postgres, and pg_resetwal. I felt this program should do so as well. 03. get_bin_directory ``` if (find_my_exec(path, full_path) < 0) { pg_log_error("The program \"%s\" is needed by %s but was not found in the\n" "same directory as \"%s\".\n", "pg_ctl", progname, full_path); ``` s/"pg_ctl"/progname 04. Missing canonicalize_path()? 05. Assume that the target server is a cascade standby, i.e., it also acts as a primary for another node. In this case, I think the child node would not work, because pg_createsubscriber runs pg_resetwal and all WAL files would be discarded at that time. I have not tested this, but should the program detect the case and exit earlier? 06. wait_for_end_recovery() waits forever even if the standby has been disconnected from the primary, right? Should we check the status of the replication via pg_stat_wal_receiver? 07. The cleanup function has a couple of bugs. * If subscriptions have been created on the database, the function also tries to drop a publication, but that leads to an ERROR because the publication has already been dropped. See setup_subscriber(). * If the subscription has been created, drop_replication_slot() leads to an ERROR, because the slot has already been dropped while executing DROP SUBSCRIPTION. 08. I found that all messages (ERROR, WARNING, INFO, etc.) are output to stderr, but I felt they should be on stdout. Is there a reason? pg_dump outputs messages to stderr, but its motivation might be to avoid confusion with the dump output. 09. I'm not sure the cleanup on the subscriber is really needed. Assume that there are two databases, e.g., pg1 and pg2, and we fail to create a subscription on pg2.
This can happen when a subscription with the same name has already been created on the primary server. In this case the subscription on pg1 would be removed. But what is the next step? Since the timeline ID on the standby server is larger than the primary's (note that the standby has been promoted once), we cannot resume the physical replication as-is. IIUC the easiest way to retry is to remove the cluster and restart from pg_basebackup. If so, there is no need to clean up the standby because it is corrupted; we just say "Please remove the cluster and recreate it". Here is a reproducer. 1. apply the txt patch atop the 0001 patch. 2. run test_corruption.sh. 3. when you see the output below [1], connect to testdb from another terminal and run CREATE SUBSCRIPTION with the same subscription name on the primary. 4. Finally, pg_createsubscriber fails the creation. I also attached the server logs of both nodes and the output. Note again that this is a real issue. I used a trick to guarantee an overlapping name, but it can happen naturally. 10. While investigating #09, I found that we cannot properly report the reason why the subscription cannot be created.
The output said: ``` pg_createsubscriber: error: could not create subscription "pg_createsubscriber_16389_3884" on database "testdb": out of memory ``` But the standby serverlog said: ``` ERROR: subscription "pg_createsubscriber_16389_3884" already exists STATEMENT: CREATE SUBSCRIPTION pg_createsubscriber_16389_3884 CONNECTION 'user=postgres port=5431 dbname=testdb' PUBLICATION pg_createsubscriber_16389 WITH (create_slot = false, copy_data = false, enabled = false) ``` [1] ``` pg_createsubscriber: creating the replication slot "pg_createsubscriber_16389_3884" on database "testdb" pg_createsubscriber: XXX: sleep 20s ``` Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/ N1.log Description: N1.log result.dat Description: result.dat server_start_20240209T115112.963.log Description: server_start_20240209T115112.963.log test_corruption.sh Description: test_corruption.sh diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c b/src/bin/pg_basebackup/pg_createsubscriber.c index 9628f32a3e..fdb3e92aa3 100644 --- a/src/bin/pg_basebackup/pg_createsubscriber.c +++ b/src/bin/pg_basebackup/pg_createsubscriber.c @@ -919,6 +919,9 @@ create_logical_replication_slot(PGconn *conn, LogicalRepInfo *dbinfo, pg_log_info("creating the replication slot \"%s\" on database \"%s\"", slot_name, dbinfo->dbname); + pg_log_info("XXX: sleep 20s"); + sleep(20); + appendPQExpBuffer(str, "SELECT lsn FROM pg_create_logical_replication_slot('%s', '%s', %s, false, false)", slot_name, "pgoutput", temporary ? "true" : "false");
RE: Improve eviction algorithm in ReorderBuffer
Dear Sawada-san, Thanks for making the v3 patchset. I have also benchmarked the case [1]. The results below are the average of 5 runs; the results are almost the same even when the median is used for the comparison. On my env, the regression cannot be seen.

HEAD (1e285a5):    10910.722 ms
HEAD + v3 patches: 10714.540 ms
difference:        around 1.8%

Also, here are minor comments for the v3 set. 01. bh_nodeidx_entry and ReorderBufferMemTrackState are missing in typedefs.list. 02. ReorderBufferTXNSizeCompare Should we assert that {ta, tb} are not NULL? [1]: https://www.postgresql.org/message-id/CAD21AoB-7mPpKnLmBNfzfavG8AiTwEgAdVMuv%3DjzmAp9ex7eyQ%40mail.gmail.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Euler, > Remember the target server was a standby (read only access). I don't expect an application trying to modify it; unless it is a buggy application. > What if the client modifies the data just after the promotion? Naively considered, all the changes could be accepted, but are there any issues? > Regarding GUCs, almost all of them is PGC_POSTMASTER (so it cannot be modified unless the server is restarted). The ones that are not PGC_POSTMASTER, does not affect the pg_createsubscriber execution [1]. > IIUC, primary_conninfo and primary_slot_name are PGC_SIGHUP. > I'm just pointing out that this case is a different from pg_upgrade (from which this idea was taken). I'm not saying that's a bad idea. I'm just arguing that you might be preventing some access read only access (monitoring) when it is perfectly fine to connect to the database and execute queries. As I said before, the current UI allows anyone to setup the standby to accept only local connections. Of course, it is an extra step but it is possible. However, once you apply v16-0007, there is no option but use only local connection during the transformation. Is it an acceptable limitation? > My remaining concern is written above. If they are not problematic, we may not have to restrict them for now. At that time, the changes 1) overwriting the port number and 2) setting listen_addresses = '' are not needed, right? IIUC the inconsistency of -P may still be problematic. > pglogical_create_subscriber does nothing [2][3]. > Oh, thanks. Just to confirm - pglogical sets shared_preload_libraries to ''; should we follow or not? Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/
RE: speed up a logical replica setup
Dear Euler, I have not finished reviewing, but I can reply to you first. > I didn't complete this task yet so I didn't include it in this patch. > Just to confirm - there is no need to forcibly include my refactoring patch: you can modify it based on your own idea. > That's correct. We made a decision to use non physical replication connections (besides the one used to connect primary <-> standby). Hence, my concern about a HBA rule falls apart. I'm not convinced that using a modified primary_conninfo is the only/right answer to fill the subscription connection string. Physical vs logical replication has different requirements when we talk about users. The physical replication requires only the REPLICATION privilege. On the other hand, to create a subscription you must have the privileges of pg_create_subscription role and also CREATE privilege on the specified database. Unless, you are always recommending the superuser for this tool, I'm afraid there will be cases that reusing primary_conninfo will be an issue. The more I think about this UI, the more I think that, if it is not hundreds of lines of code, it should use the primary_conninfo if -P is not specified. > Valid point. It is one of the best practices that users set minimal privileges for each account. However... > Ugh. An error will occur the first time (get_primary_sysid) it tries to connect to primary. > I'm not sure that is correct. I think there are several patterns. a) The -P option specified a ghost server, i.e., pg_createsubscriber cannot connect to anything. In this case, get_primary_sysid() would fail because connect_database() would fail. b) The -P option specified an existing server, but it is not my upstream. get_primary_sysid() would succeed. In this case, there are two further branches: b-1) the nodes have different system_identifiers. pg_createsubscriber would raise an ERROR and stop. b-2) the nodes have the same system_identifier (e.g., the nodes were generated from the same master). 
In this case, the current checks would pass but pg_createsubscriber would get stuck or reach the timeout. PSA the reproducer. Can pg_createsubscriber check whether the two nodes are connected via the replication protocol? Or can we assume that all nodes surely have different system_identifiers? > It is. However, I keep the assignments for 2 reasons: (a) there might be parameters whose default value is not zero, (b) the standard does not say that a null pointer must be represented by zero and (c) there is no harm in being paranoid during initial assignment. > Hmn, so there is no need to use `= {0};`, but up to you. Also, according to the C99 standard [1], NULL seems to be defined as 0: ``` An integer constant expression with the value 0, or such an expression cast to type void *, is called a null pointer constant. If a null pointer constant is converted to a pointer type, the resulting pointer, called a null pointer, is guaranteed to compare unequal to a pointer to any object or function. ``` > > WriteRecoveryConfig() writes GUC parameters to postgresql.auto.conf, but not > sure it is good. These settings would remain on the new subscriber even after > pg_createsubscriber. Can we avoid it? I came up with passing these parameters > via the pg_ctl -o option, but it requires parsing the output from > GenerateRecoveryConfig() > (all GUCs must be aligned like "-c XXX -c XXX -c XXX..."). I applied a modified version of v16-0006. > Sorry for the confusion; it is not a solution. This approach writes parameter settings to postgresql.auto.conf, and they are never removed. The overwritten settings would remain. > I don't understand why v16-0002 is required. In a previous version, this patch was using connections in logical replication mode. After some discussion we decided to change it to regular connections and use SQL functions (instead of replication commands). Is it a requirement for v16-0003? > No, it is not required by 0003. 
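[Editor's note] The system-identifier comparison discussed in cases b-1/b-2 above can be sketched with an existing SQL function (this is a sketch, not the patch's actual code, which reads the value via libpq):

```sql
-- Run on both the primary and the candidate standby.  Clusters cloned from
-- the same base backup report the same value, so equality alone cannot
-- prove a real primary/standby relationship (case b-2 above).
SELECT system_identifier FROM pg_control_system();
```

If both nodes return the same identifier, an additional check (e.g., inspecting pg_stat_wal_receiver on the standby) would be needed to confirm the replication relationship.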
I forgot the decision that we require DBAs to allow normal connection establishment for pg_createsubscriber. Please ignore... [1]: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/ test_0207.sh Description: test_0207.sh
RE: speed up a logical replica setup
Dear Euler, Sorry for our e-mails crossing each other. I will read your patch, but I want to reply to one of your points first. > I started reviewing v16-0007 but didn't finish yet. The general idea is ok. However, I'm still worried about preventing some use cases if it provides only the local connection option. What if you want to keep monitoring this instance while the transformation is happening? Let's say it has a backlog that will take some time to apply. Unless, you have a local agent, you have no data about this server until pg_createsubscriber terminates. Even the local agent might not connect to the server unless you use the current port. > (IIUC, 0007 could not overwrite the port number - refactoring is needed) Ah, actually I did not have such a point of view. Assuming that a changed port number can prevent connection establishment, there are four options:

a) Do not overwrite port or listen_addresses. This allows monitoring by external agents, but someone can modify GUCs and data during the operation.
b) Overwrite port but not listen_addresses. Not sure this is useful...
c) Overwrite listen_addresses but not port. This allows monitoring by local agents, and we can partially protect the database. But there is still some room for interference.
d) Overwrite both port and listen_addresses. This protects the database perfectly, but no one can monitor it.

Hmm, which one should be chosen? I prefer c) or d). Do you know what pglogical_create_subscriber does? Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/
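[Editor's note] Option (c) above amounts to a pair of temporary GUC overrides while the tool runs; a postgresql.conf-style sketch (the socket directory value is only an example):

```
# Temporary overrides during the transformation, sketching option (c):
listen_addresses = ''              # refuse all TCP connections
unix_socket_directories = '/tmp'   # local agents can still use the socket
```

In practice a tool would pass these via `pg_ctl -o "-c listen_addresses='' ..."` rather than editing the file, so nothing persists after the run.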
RE: speed up a logical replica setup
Dear Shlok, Thanks for updating the patch! > I have created a topup patch 0007 on top of v15-0006. > > I revived the patch which removes -S option and adds some options > instead. The patch add option for --port, --username and --socketdir. > This patch also ensures that anyone cannot connect to the standby > during the pg_createsubscriber, by setting listen_addresses, > unix_socket_permissions, and unix_socket_directories. IIUC, there are two reasons why removing -S may be good: * it forces users to specify a local connection, and * it avoids connection establishment on the standby during pg_createsubscriber. The first bullet is still valid, but we should describe it like pg_upgrade does: > pg_upgrade will connect to the old and new servers several times, so you might want to set authentication to peer in pg_hba.conf or use a ~/.pgpass file (see Section 33.16). > Regarding the second bullet, this patch cannot ensure it. pg_createsubscriber just accepts a port number on which the standby is already listening; it does not change the port number. So any local applications on the standby server can connect to the server and may change primary_conninfo, primary_slot_name, data, etc. So... what should we do? What do others think? Besides, 0007 does not follow the recent changes in 0001. E.g., options should be recorded in CreateSubscriberOptions. Also, should we check the privilege of the socket directory? [1]: https://www.postgresql.org/message-id/TY3PR01MB988902B992A4F2E99E1385EDF56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Shubham, Thanks for testing our code! > While reviewing the v15 patches I discovered that the subscription > connection string has added a lot of options which are not required > now:
> v15-0001
> postgres=# select subname, subconninfo from pg_subscription;
>            subname             |               subconninfo
> -------------------------------+-------------------------------------------
>  pg_createsubscriber_5_1867633 | host=localhost port=5432 dbname=postgres
> (1 row)
>
> v15-0001+0002+0003
> postgres=# select subname, subconninfo from pg_subscription;
>            subname             |               subconninfo
> -------------------------------+-------------------------------------------
>  pg_createsubscriber_5_1895366 | user=shubham passfile='/home/shubham/.pgpass' channel_binding=prefer host=127.0.0.1 port=5432 sslmode=prefer sslcompression=0 sslcertmode=allow sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0 target_session_attrs=any load_balance_hosts=disable dbname=postgres
> (1 row)
>
> Here, we can see that channel_binding, sslmode, sslcertmode, sslsni, > gssencmode, krbsrvname, etc. are getting included. This does not look > intentional; we should keep the subscription connection the same as in > v15-0001. You should attach the reproducer script. I suspect you used the pg_basebackup -R command for setting up the standby. In this case, it is intentional. These settings are not caused by pg_createsubscriber; they are added by pg_basebackup. If you set primary_conninfo manually as in the attached script, these settings would not appear. In the first place, the listed options (e.g., passfile, channel_binding, sslmode, sslcompression, sslcertmode, etc.) are set whenever you connect to the database via libpq functions. PQconninfoOptions in fe-connect.c lists the parameters and their default values. v15-0003 reuses primary_conninfo for the subconninfo attribute. primary_conninfo is set by pg_basebackup when the '-R' option is specified. The content is built in GenerateRecoveryConfig(), which passes through the parameters from PQconninfo(). 
This function returns all the libpq connection parameters even when they are set to their default values, so primary_conninfo looks longer. I don't think this works wrongly. Users can still set an arbitrary connection string as primary_conninfo. You just end up with a longer string unintentionally. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ shorter_subconninfo.sh Description: shorter_subconninfo.sh
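[Editor's note] The "set primary_conninfo manually" workaround mentioned above can be sketched in two statements on the standby (the connection values are examples; shorter_subconninfo.sh presumably does something similar):

```sql
-- On the standby: a minimal primary_conninfo set by hand instead of
-- relying on pg_basebackup -R, so the string reused for subconninfo
-- stays short.  primary_conninfo is PGC_SIGHUP, so a reload suffices.
ALTER SYSTEM SET primary_conninfo = 'host=localhost port=5432 user=postgres';
SELECT pg_reload_conf();
```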
RE: speed up a logical replica setup
Dear Euler, > 03. > ``` > /* > * Is the standby server ready for logical replication? > */ > static bool > check_subscriber(LogicalRepInfo *dbinfo) > ``` > > You said "target server must be a standby" in [1], but I cannot find checks > for it. > IIUC, there are two approaches: > a) check the existence "standby.signal" in the data directory > b) call an SQL function "pg_is_in_recovery" I found that the current code (HEAD+v14-0001) leads to a stuck state or a timeout because the recovery would never finish. The attached script emulates the case in which the standby.signal file is missing. This script always fails with the output below: ``` ... pg_createsubscriber: waiting the postmaster to reach the consistent state pg_createsubscriber: postmaster was stopped pg_createsubscriber: error: recovery timed out ... ``` I also attached the server log taken during pg_createsubscriber, and we can see that the walreceiver exits before receiving all the required changes. I cannot identify the exact code path yet, but the inconsistency leads to a situation in which the walreceiver exits but the startup process remains. This also suggests that the recovery status must be checked in check_subscriber(). Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ server_start_20240206T080003.670.log Description: server_start_20240206T080003.670.log test_0206.sh Description: test_0206.sh
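[Editor's note] Approach (b) above reduces to a single query that check_subscriber() could issue (a sketch, not the actual patch code):

```sql
-- The target must still be in recovery (i.e., be a standby).  If this
-- returns false, the tool should bail out early instead of waiting for
-- a recovery that will never finish.
SELECT pg_is_in_recovery();
```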
RE: Improve eviction algorithm in ReorderBuffer
Dear Sawada-san, > Thank you for the review comments! > > > > > A comment for 0001: > > > > 01. > > ``` > > +static void > > +bh_enlarge_node_array(binaryheap *heap) > > +{ > > +if (heap->bh_size < heap->bh_space) > > +return; > > + > > +heap->bh_space *= 2; > > +heap->bh_nodes = repalloc(heap->bh_nodes, > > + sizeof(bh_node_type) * heap->bh_space); > > +} > > ``` > > > > I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data > > structure public one and arbitrary codes and extensions can directly refer > > bh_nodes. But if the array is repalloc()'d, the pointer would be updated so > > that > > their referring would be a dangling pointer. > > Hmm I'm not sure this is the case that we need to really worry about, > and cannot come up with a good use case where extensions refer to > bh_nodes directly rather than binaryheap. In PostgreSQL codes, many > Nodes already have pointers and are exposed. Actually, me neither. I could not come up with a use case - I just mentioned the possibility. If it is not a real issue, we can ignore it. > > > > > 04. > > ``` > > extern binaryheap *binaryheap_allocate(int capacity, > > binaryheap_comparator compare, > > - void *arg); > > + bool indexed, void *arg); > > ``` > > > > I felt pre-existing API should not be changed. How about adding > > binaryheap_allocate_extended() or something which can specify the `bool > indexed`? > > binaryheap_allocate() sets heap->bh_indexed to false. > > I'm really not sure it's worth inventing a > binaryheap_allocate_extended() function just for preserving API > compatibility. I think it's generally a good idea to have > xxx_extended() function to increase readability and usability, for > example, for the case where the same (kind of default) arguments are > passed in most cases and the function is called from many places. > However, we have a handful binaryheap_allocate() callers, and I > believe that it would not hurt the existing callers. 
I had (external) extensions that use the binaryheap APIs in mind. I thought we could avoid raising the cost of updating their code. But I understand the change is small, so ... up to you. > > 05. > > ``` > > +extern void binaryheap_update_up(binaryheap *heap, bh_node_type d); > > +extern void binaryheap_update_down(binaryheap *heap, bh_node_type d); > > ``` > > > > IIUC, callers must consider whether the node should be shift up/down and use > > appropriate function, right? I felt it may not be user-friendly. > > Right, I couldn't come up with a better interface. > > Another idea I've considered was that the caller provides a callback > function where it can compare the old and new keys. For example, in > reorderbuffer case, we call like: > > binaryheap_update(rb->txn_heap, PointerGetDatum(txn), > ReorderBufferTXNUpdateCompare, (void *) _size); > > Then in ReorderBufferTXNUpdateCompare(), > ReorderBufferTXN *txn = (ReorderBufferTXN *) a;Size old_size = *(Size *) b; > (compare txn->size to "b" ...) > > However it seems complicated... > I considered a similar approach: accepting the old node as an argument of the compare function. But it requires further memory allocation. Does someone have a better idea? > > > > > 08. > > IIUC, if more than 1024 transactions are running but they have small amount > > of > > changes, the performance may be degraded, right? Do you have a result in > sucha > > a case? > > I've run a benchmark test that I shared before[1]. Here are results of > decoding a transaction that has 1M subtransaction each of which has 1 > INSERT: > > HEAD: > 1810.192 ms > > HEAD w/ patch: > 2001.094 ms > > I set a large enough value to logical_decoding_work_mem not to evict > any transactions. I can see about about 10% performance regression in > this case. Thanks for running it. I think this workload is the worst and an extreme case which would not occur on a real system (such a system should be fixed), so we can say that the regression is at most about 10%. 
I felt it could be negligible, but what do others think? Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
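[Editor's note] The update_up/update_down question in comment 05 above can be illustrated with a plain-C toy max-heap (this is NOT the PostgreSQL binaryheap API; every name below is made up for illustration). A unified update function can derive the sift direction itself by comparing the new key with the old one:

```c
#include <assert.h>

#define HEAP_MAX 64

/* Toy int max-heap, array-backed: children of i are 2i+1 and 2i+2. */
typedef struct { int n; int v[HEAP_MAX]; } MaxHeap;

static void heap_swap(MaxHeap *h, int i, int j)
{
    int t = h->v[i]; h->v[i] = h->v[j]; h->v[j] = t;
}

static void sift_up(MaxHeap *h, int i)
{
    while (i > 0 && h->v[(i - 1) / 2] < h->v[i])
    {
        heap_swap(h, i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

static void sift_down(MaxHeap *h, int i)
{
    for (;;)
    {
        int l = 2 * i + 1, r = 2 * i + 2, m = i;

        if (l < h->n && h->v[l] > h->v[m]) m = l;
        if (r < h->n && h->v[r] > h->v[m]) m = r;
        if (m == i)
            break;
        heap_swap(h, i, m);
        i = m;
    }
}

static void heap_add(MaxHeap *h, int key)
{
    h->v[h->n++] = key;
    sift_up(h, h->n - 1);
}

/*
 * Unified update: the sift direction is derived by comparing the new
 * key with the old one, so the caller need not choose up vs. down.
 */
static void heap_update(MaxHeap *h, int i, int newkey)
{
    int oldkey = h->v[i];

    h->v[i] = newkey;
    if (newkey > oldkey)
        sift_up(h, i);
    else
        sift_down(h, i);
}
```

The trade-off discussed in the thread is visible here: deriving the direction requires knowing the old key, which the real API would have to read back from the heap or receive via a caller-supplied callback.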
RE: Is this a problem in GenericXLogFinish()?
Dear Amit, > > @@ -692,6 +697,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, > Buffer ovflbuf, > if (!xlrec.is_prev_bucket_same_wrt) > wbuf_flags |= REGBUF_NO_CHANGE; > XLogRegisterBuffer(1, wbuf, wbuf_flags); > + > + /* Track the registration status for later use */ > + wbuf_registered = true; > } > > XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD); > @@ -719,7 +727,12 @@ _hash_freeovflpage(Relation rel, Buffer > bucketbuf, Buffer ovflbuf, > > recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE); > > - PageSetLSN(BufferGetPage(wbuf), recptr); > + /* Set LSN to wbuf page buffer only when it is being registered */ > + if (wbuf_registered) > + PageSetLSN(BufferGetPage(wbuf), recptr); > > Why set LSN when the page is not modified (say when we use the flag > REGBUF_NO_CHANGE)? I think you need to use a flag mod_wbuf and set it > in appropriate places during register buffer calls. You are right. Based on the previous discussions, PageSetLSN() must be called after MarkBufferDirty(). REGBUF_NO_CHANGE was introduced to allow skipping these requirements. Definitely, no-change buffers must not have PageSetLSN() called on them. Other pages, e.g., metabuf, already follow this rule. I updated the patch based on the requirement. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ v2_avoid_registration.patch Description: v2_avoid_registration.patch
RE: Is this a problem in GenericXLogFinish()?
Dear Michael, Amit, > > Amit, this has been applied as of 861f86beea1c, and I got pinged about > the fact this triggers inconsistencies because we always set the LSN > of the write buffer (wbuf in _hash_freeovflpage) but > XLogRegisterBuffer() would *not* be called when the two following > conditions happen: > - When xlrec.ntups <= 0. > - When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt > > And it seems to me that there is still a bug here: there should be no > point in setting the LSN on the write buffer if we don't register it > in WAL at all, no? Thanks for pointing it out; I agree with you. PSA the patch for diagnosing the issue. This patch avoids the inconsistency due to the LSN setting and outputs a debug LOG when we meet such a case. I executed hash_index.sql and confirmed the log was output [1]. This means that the current tests already have a workload which meets the conditions below: - the overflow page has no tuples (xlrec.ntups is 0), - the to-be-written page - wbuf - is not the primary bucket page (xlrec.is_prim_bucket_same_wrt is false), and - the to-be-written buffer is not next to the overflow page (xlrec.is_prev_bucket_same_wrt is false) So, I think my patch (after removing the elog(...) part) can fix the issue. Thoughts? [1]: ``` LOG: XXX: is_wbuf_registered: false CONTEXT: while vacuuming index "hash_cleanup_index" of relation "public.hash_cleanup_heap" STATEMENT: VACUUM hash_cleanup_heap; ``` Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ avoid_registration.patch Description: avoid_registration.patch
RE: speed up a logical replica setup
Dear Fabrízio, Euler, I think you set primary_slot_name on the standby server, right? While reading the code, I found the line below in v11-0001. ``` if (primary_slot_name != NULL) { conn = connect_database(dbinfo[0].pubconninfo); if (conn != NULL) { drop_replication_slot(conn, &dbinfo[0], temp_replslot); } ``` Now temp_replslot is a temporary slot, so it would be removed automatically, and this function will cause the error: replication slot "pg_createsubscriber_%u_startpoint" does not exist. Also, the physical slot still remains on the primary. In short, "temp_replslot" should be "primary_slot_name". PSA a script file for reproducing. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/ test_0201.sh Description: test_0201.sh
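[Editor's note] The intended cleanup - with the fix suggested above - amounts to dropping, on the primary, the physical slot named by the standby's primary_slot_name (the slot name below is only an example):

```sql
-- On the primary: drop the leftover physical replication slot that the
-- standby used (its primary_slot_name), not the temporary startpoint slot,
-- which disappears by itself when its session ends.
SELECT pg_drop_replication_slot('standby_1');
```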
RE: speed up a logical replica setup
Dear Euler, Thanks for the comments! I want to reply to some of them. > I didn't provide the whole explanation. I'm envisioning the use case that pg_ctl doesn't reach the consistent state and the timeout is reached (the consequence is that pg_createsubscriber aborts the execution). It might occur on a busy server. The probability that it occurs with the current code is low (LSN gap for recovery is small). Maybe I'm anticipating issues when the base backup support is added but better to raise concerns during development. > Hmm, actually I didn't know that case. Thanks for the explanation. I want to see how you describe it in the docs. > pg_upgrade doesn't but others do like pg_rewind, pg_resetwal, pg_controldata, pg_checksums. It seems newer tools tend to provide short and long options. > Oh, you are right. > Nothing? If you interrupt the execution, there will be objects left behind and you, as someone that decided to do it, have to clean things up. What do you expect this tool to do? The documentation will provide some guidance informing the object name patterns this tool uses and you can check for leftover objects. Someone can argue that is a valid feature request but IMO it is not one in the top of the list. > OK, so let's keep the current style. > Why? Are you suggesting that the dry run mode covers just the verification part? If so, it is not a dry run mode. I would expect it to run until the end (or until it accomplish its goal) but *does not* modify data. For pg_resetwal, the modification is one of the last steps and the other ones (KillFoo functions) that are skipped modify data. It ends the dry run mode when it accomplish its goal (obtain the new control data values). If we stop earlier, some of the additional steps won't be covered by the dry run mode and a failure can happen but could be detected if you run a few more steps. > Yes, that was my expectation. I'm still not sure which operations can be detected by dry_run, but we can keep it for now. > Why? See [1]. 
I prefer the kind mode (always wait until the recovery ends) but you and Amit are proposing a more aggressive mode. The proposal (-t 60) seems ok right now, however, if the goal is to provide base backup support in the future, you certainly should have to add the --recovery-timeout in big clusters or those with high workloads because base backup is run between replication slot creation and consistent LSN. Of course, we can change the default when base backup support is added. > Sorry, I missed your previous post. Let's keep yours. > Good point. I included a check for pg_create_subscription role and CREATE privilege on the specified database. > Not sure, but can we execute the replication origin functions with these privileges? According to the docs [1], they seem unrelated. [1]: https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-REPLICATION Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/
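[Editor's note] The two privilege checks mentioned above can be sketched as simple queries run as the user pg_createsubscriber connects with (a sketch only; the actual patch performs the check in C):

```sql
-- Both should return true for CREATE SUBSCRIPTION to succeed:
-- membership in the pg_create_subscription role (PostgreSQL 16+) ...
SELECT pg_has_role(current_user, 'pg_create_subscription', 'MEMBER');
-- ... and CREATE privilege on the target database.
SELECT has_database_privilege(current_user, current_database(), 'CREATE');
```

As noted, advancing a replication origin (pg_replication_origin_advance etc.) is a separate, superuser-restricted capability that these checks do not cover.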
RE: speed up a logical replica setup
Dear Fabrízio, Thanks for reporting. I understand that the issue occurs on v11 and v12. I will try to reproduce it and check the reason. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/
RE: Documentation to upgrade logical replication cluster
Dear Vignesh, Thanks for updating the patch! Here are my comments for v6. 01. ``` + Logical replication cluster + + + A set of publisher and subscriber instance with publisher instance + replicating changes to the subscriber instance. + + ``` Should we say that a 1:N relationship is allowed? 02. ``` @@ -70,6 +70,7 @@ PostgreSQL documentation pg_upgrade supports upgrades from 9.2.X and later to the current major release of PostgreSQL, including snapshot and beta releases. + ``` Unnecessary blank line. 03. ``` - These are the steps to perform an upgrade - with pg_upgrade: + Below are the steps to perform an upgrade + with pg_upgrade. ``` I'm not sure this should be included in this patch. 04. ``` + If the old primary is prior to version 17.0, then no slots on the primary + are copied to the new standby, so all the slots on the old standby must + be recreated manually. ``` I think "all the slots on the old standby" must be recreated manually in any case. Therefore, the connective ", so" seems incorrect. 05. ``` If the old primary is version 17.0 or later, then + only logical slots on the primary are copied to the new standby, but + other slots on the old standby are not copied, so must be recreated + manually. ``` How about replacing this paragraph with the one below? ``` All the slots on the old standby must be recreated manually. If the old primary is version 17.0 or later, then only logical slots on the primary are copied to the new standby. ``` Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: Improve eviction algorithm in ReorderBuffer
Dear Sawada-san, I have started to read your patches. Here are my initial comments. At least, all subscription tests passed on my env. A comment for 0001: 01. ``` +static void +bh_enlarge_node_array(binaryheap *heap) +{ +    if (heap->bh_size < heap->bh_space) +        return; + +    heap->bh_space *= 2; +    heap->bh_nodes = repalloc(heap->bh_nodes, +                              sizeof(bh_node_type) * heap->bh_space); +} ``` I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data structure is a public one, and arbitrary code and extensions can directly refer to bh_nodes. But if the array is repalloc()'d, the pointer would be updated, so their references would become dangling pointers. I think the internals of the structure should be private in this case. Comments for 0002: 02. ``` +#include "utils/palloc.h" ``` Is it really needed? I'm not sure who refers to it. 03. ``` typedef struct bh_nodeidx_entry { bh_node_type key; char status; int idx; } bh_nodeidx_entry; ``` Sorry if it is a stupid question. Can you tell me how "status" is used? Neither the binaryheap nor the reorderbuffer components refer to it. 04. ``` extern binaryheap *binaryheap_allocate(int capacity, binaryheap_comparator compare, - void *arg); + bool indexed, void *arg); ``` I felt the pre-existing API should not be changed. How about adding binaryheap_allocate_extended() or something which can specify the `bool indexed`? binaryheap_allocate() would set heap->bh_indexed to false. 05. ``` +extern void binaryheap_update_up(binaryheap *heap, bh_node_type d); +extern void binaryheap_update_down(binaryheap *heap, bh_node_type d); ``` IIUC, callers must consider whether the node should be shifted up or down and use the appropriate function, right? I felt it may not be user-friendly. Comments for 0003: 06. ``` This commit changes the eviction algorithm in ReorderBuffer to use max-heap with transaction size,a nd use two strategies depending on the number of transactions being decoded. ``` s/a nd/ and/ 07. 
``` It could be too expensive to pudate max-heap while preserving the heap property each time the transaction's memory counter is updated, as it could happen very frquently. So when the number of transactions being decoded is small, we add the transactions to max-heap but don't preserve the heap property, which is O(1). We heapify the max-heap just before picking the largest transaction, which is O(n). This strategy minimizes the overheads of updating the transaction's memory counter. ``` s/pudate/update/ s/frquently/frequently/ 08. IIUC, if more than 1024 transactions are running but they have a small amount of changes, the performance may be degraded, right? Do you have results for such a case? Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
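[Editor's note] The dangling-pointer concern in comment 01 above can be demonstrated with plain realloc(), the standard-C counterpart of repalloc() (a toy analogue; the struct and function names below are made up). An enlarging realloc() may move the array, so any cached copy of the old pointer becomes invalid; the safe pattern is to always re-read the pointer from the owning struct:

```c
#include <assert.h>
#include <stdlib.h>

/* Toy analogue of binaryheap's growable node array. */
typedef struct
{
    int    *nodes;   /* may MOVE whenever the array is enlarged */
    size_t  size;    /* elements in use */
    size_t  space;   /* allocated capacity */
} IntHeap;

static void heap_enlarge(IntHeap *h)
{
    if (h->size < h->space)
        return;
    h->space *= 2;
    /* realloc() may return a different address; any pointer into the
     * old array - cached by an extension, say - is now dangling.
     * (Error handling omitted in this sketch.) */
    h->nodes = realloc(h->nodes, sizeof(int) * h->space);
}

static void heap_append(IntHeap *h, int v)
{
    heap_enlarge(h);
    h->nodes[h->size++] = v;  /* always go through h->nodes, never a copy */
}
```

Hiding `nodes` behind accessor functions (making the internals private, as suggested in the comment) removes the temptation to cache the raw pointer at all.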
RE: speed up a logical replica setup
The manpage says that callback functions would not be called when the process terminates due to signals: > Functions registered using atexit() (and on_exit(3)) are not called if a > process terminates abnormally because of the delivery of a signal. Do you have a good way to handle this case? One solution is to output the created objects at some log level, but that consideration may be too much. Thoughts? 13. main ``` /* * Create a temporary logical replication slot to get a consistent LSN. ``` Just to clarify - I still think the process should exit before this point in the case of a dry run. In the case of pg_resetwal, the process exits before doing actual work like RewriteControlFile(). 14. main ``` * XXX we might not fail here. Instead, we provide a warning so the user * eventually drops this replication slot later. ``` But there are possibilities of exit(1) in drop_replication_slot(). Is that acceptable? 15. wait_for_end_recovery ``` /* * Bail out after recovery_timeout seconds if this option is set. */ if (recovery_timeout > 0 && timer >= recovery_timeout) ``` Hmm, IIUC, it should be enabled by default [3]. Do you have anything in mind? 16. main ``` /* * Create the subscription for each database on subscriber. It does not * enable it immediately because it needs to adjust the logical * replication start point to the LSN reported by consistent_lsn (see * set_replication_progress). It also cleans up publications created by * this tool and replication to the standby. */ if (!setup_subscriber(dbinfo, consistent_lsn)) ``` Subscriptions would be created and the replication origin would be moved forward here, but the latter can be done only by a superuser. I felt that this should be checked in check_subscriber(). 17. main ``` /* * Change system identifier. */ modify_sysid(pg_resetwal_path, subscriber_dir); ``` Even if I executed without the -v option, output from the pg_resetwal command appears. It seems a bit strange. 
``` $ pg_createsubscriber -D data_N2/ -P "port=5431 user=postgres" -S "port=5432 user=postgres" -d postgres Write-ahead log reset $ ``` [1]: https://www.postgresql.org/message-id/TY3PR01MB9889C362FF76102C88FA1C29F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com [2]: https://www.postgresql.org/message-id/TY3PR01MB9889C362FF76102C88FA1C29F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com [3]: https://www.postgresql.org/message-id/CAA4eK1JRgnhv_ySzuFjN7UaX9qxz5Hqcwew7Fk%3D%2BAbJbu0Kd9w%40mail.gmail.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/
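[Editor's note] The atexit()-vs-signals behavior quoted above can be demonstrated with a small fork-based POSIX sketch (toy code, not from the patch): the child registers an atexit() callback that reports through a pipe; the callback runs on exit(0) but is skipped when the child dies from a signal.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1;

/* The cleanup callback: reports that it ran by writing one byte. */
static void cleanup(void)
{
    if (report_fd >= 0)
        (void) write(report_fd, "C", 1);
}

/*
 * Forks a child that registers cleanup() with atexit() and then either
 * exits normally or dies from SIGKILL.  Returns 1 if the callback ran.
 */
static int child_cleanup_ran(int die_by_signal)
{
    int   fds[2];
    char  buf = 0;
    int   ran;
    pid_t pid;

    if (pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid == 0)
    {
        close(fds[0]);
        report_fd = fds[1];
        atexit(cleanup);
        if (die_by_signal)
            raise(SIGKILL);   /* abnormal termination: atexit is skipped */
        exit(0);              /* normal termination: atexit runs */
    }
    close(fds[1]);
    waitpid(pid, NULL, 0);
    ran = (read(fds[0], &buf, 1) == 1);  /* EOF (0) if cleanup never ran */
    close(fds[0]);
    return ran;
}
```

This is why a tool relying solely on atexit() for dropping created objects leaves them behind on SIGKILL; logging the object names as they are created (the suggestion above) is one mitigation.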
RE: Documentation to upgrade logical replication cluster
Dear Vignesh, Thanks for updating the patch! For now, the v4 patch LGTM. Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/
RE: speed up a logical replica setup
Dear Euler, Again, thanks for updating the patch! Here are my random comments for v9. 01. I cannot find your replies to my comments #7 [1], but you reverted the related changes. I'm not sure whether you are still considering it or decided not to include the changes. Can you clarify your opinion? (This is needed because the changes are huge, so they quite affect other developments...) 02. ``` + -t seconds + --timeout=seconds ``` But the source code requires `--recovery-timeout`. Please update one of them. 03. ``` + *Create a new logical replica from a standby server ``` Junwang pointed out that this should be changed, but the change was reverted [2]. Can you clarify your opinion as well? 04. ``` +/* + * Is the source server ready for logical replication? If so, create the + * publications and replication slots in preparation for logical replication. + */ +static bool +setup_publisher(LogicalRepInfo *dbinfo) ``` But this function also verifies the source server. I felt these should be in different functions. 05. ``` +/* + * Is the target server ready for logical replication? + */ +static bool +setup_subscriber(LogicalRepInfo *dbinfo) ``` Actually, this function does not set up the subscriber. It just verifies whether the target can become a subscriber, right? It should be renamed. 06. ``` +atexit(cleanup_objects_atexit); ``` The registration of the cleanup function is too early. This sometimes triggers a core dump. E.g., ``` $ pg_subscriber --publisher-conninfo --subscriber-conninfo 'user=postgres port=5432' --verbose --database 'postgres' --pgdata data_N2/ pg_subscriber: error: too many command-line arguments (first is "user=postgres port=5432") pg_subscriber: hint: Try "pg_subscriber --help" for more information. Segmentation fault (core dumped) $ gdb ... 
(gdb) bt #0 cleanup_objects_atexit () at pg_subscriber.c:131 #1 0x7fb982cffce9 in __run_exit_handlers () from /lib64/libc.so.6 #2 0x7fb982cffd37 in exit () from /lib64/libc.so.6 #3 0x004054e6 in main (argc=9, argv=0x7ffc59074158) at pg_subscriber.c:1500 (gdb) f 3 #3 0x004054e6 in main (argc=9, argv=0x7ffc59074158) at pg_subscriber.c:1500 1500exit(1); (gdb) list 1495if (optind < argc) 1496{ 1497pg_log_error("too many command-line arguments (first is \"%s\")", 1498 argv[optind]); 1499pg_log_error_hint("Try \"%s --help\" for more information.", progname); 1500exit(1); 1501} 1502 1503/* 1504 * Required arguments ``` I still think it should be done just before the creation of objects [3]. 07. Missing a removal of publications on the standby. 08. Missing registration of LogicalRepInfo in the typedefs.list. 09 ``` + + + pg_subscriber + option + + ``` Can you reply my comment#2 [4]? I think mandatory options should be written. 10. Just to confirm - will you implement start_standby/stop_standby functions in next version? [1]: https://www.postgresql.org/message-id/TY3PR01MB9889C362FF76102C88FA1C29F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com [2]: https://www.postgresql.org/message-id/CAEG8a3%2BwL_2R8n12BmRz7yBP3EBNdHDhmdgxQFA9vS%2BzPR%2B3Kw%40mail.gmail.com [3]: https://www.postgresql.org/message-id/TY3PR01MB9889678E47B918F4D83A6FD8F57B2%40TY3PR01MB9889.jpnprd01.prod.outlook.com [4]: https://www.postgresql.org/message-id/TY3PR01MB9889C362FF76102C88FA1C29F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com Best Regards, Hayato Kuroda FUJITSU LIMITED https://www.fujitsu.com/global/
RE: speed up a logical replica setup
Dear Euler,

Thanks for updating the patch! Before reading yours, I wanted to reply to some of the comments.

> I'm still thinking about replacing --subscriber-conninfo with separate items
> (username, port, password?, host = socket dir). Maybe it is an overengineering.
> The user can always prepare the environment to avoid unwanted and/or external
> connections.

For me, the required amount of fixes is not so different from the current one. How about others?

> It is extremely useful because (a) you have a physical replication setup and
> don't know if it is prepared for logical replication, (b) check GUCs (is
> max_wal_senders sufficient for this pg_subscriber command? Or is
> max_replication_slots sufficient to set up the logical replication even though
> I already have some used replication slots?), (c) connectivity and (d)
> credentials.

Yeah, it is useful for verification purposes, so let's keep this option. But I still think the naming should be "--check". Also, there are many `if (!dry_run)` blocks, but most of them can be removed if the process exits earlier. Thoughts?

> > 05.
> > I felt we should accept some settings from environment variables, like
> > pg_upgrade. Currently, the below items should be accepted.
> >
> > - data directory
> > - username
> > - port
> > - timeout
>
> Maybe PGDATA.

Sorry, I cannot follow this. Did you mean that the target data directory should be able to be specified by PGDATA? If so, +1.

> > 06.
> > ```
> > pg_logging_set_level(PG_LOG_WARNING);
> > ```
> >
> > If the default log level is warning, there is no way to output debug logs.
> > (The -v option only raises the level by one, so INFO would be output.)
> > I think it should be PG_LOG_INFO.
>
> You need to specify multiple -v options.

Hmm. I felt the specification was a bit strange... but at least it must be described in the documentation. pg_dump.sgml has similar lines.

> > 08.
> > Not sure, but if we want to record outputs by pg_subscriber, a sub-directory
> > should be created. The name should contain the timestamp.
>
> The log file already contains the timestamp. Why?

This comment assumed that outputs by pg_subscriber were also recorded to a file. In that case, and if the file also has the same timestamp, I think they can be gathered in the same place. No need if outputs are not recorded.

> > 09.
> > Not sure, but should we check max_slot_wal_keep_size of the primary server?
> > It can avoid a failure to start logical replication.
>
> A broken physical replication *before* running this tool is its
> responsibility? Hmm. We might add another check that can be noticed during
> dry run mode.

I thought that we should not generate any broken objects, but indeed, I'm not sure it is in our scope. What do others think?

> > 11.
> > ```
> > /*
> >  * Stop the subscriber if it is a standby server. Before executing the
> >  * transformation steps, make sure the subscriber is not running because
> >  * one of the steps is to modify some recovery parameters that require a
> >  * restart.
> >  */
> > if (stat(pidfile, ) == 0)
> > ```
> >
> > I kept it just in case, but I'm not sure it is still needed. What do you
> > think? Removing it can reduce an inclusion of pidfile.h.
>
> Are you suggesting another way to check if the standby is up and running?

Running `pg_ctl stop` itself can detect whether the process is still alive. It would exit with 1 when the process is not there.

> > I didn't notice the comment, but I'm still not sure of the reason. Why must
> > we reserve the slot until pg_subscriber finishes? IIUC, the slot would never
> > be used; it is created only for getting a consistent_lsn, so we do not have
> > to keep it. Also, just before that, logical replication slots for each
> > database are created, so WAL records are surely reserved.

I want to confirm the conclusion - will you remove the creation of the transient slot?

Also, though not tested, I'm now considering that we can reuse the primary_conninfo value. We are assuming that the target server is a standby and the current upstream one will be converted to a publisher. In this case, the connection string is already specified as primary_conninfo, so --publisher-conninfo may not be needed. The parameter does not contain a database name, so --databases is still needed. I imagine something like:

1. Parse options
2. Turn on the standby
3. Verify the standby
4. Turn off the standby
5. Get primary_conninfo from the standby
6. Connect to the primary (a concatenation of primary_conninfo and an option is used)
7. Verify the primary
...

What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/
RE: speed up a logical replica setup
Dear Peter,

Thanks for giving your idea!

> > Subscriber has a different meaning of subscription. Subscription is an SQL
> > object. Subscriber is the server (node in replication terminology) where the
> > subscription resides. Having said that pg_createsubscriber doesn't seem a bad
> > name because you are creating a new subscriber. (Indeed, you are
> > transforming / converting but "create" seems closer and users can infer that
> > it is a tool to build a new logical replica.)
>
> That makes sense.
>
> (Also, the problem with "convert" etc. is that "convertsubscriber" would
> imply that you are converting an existing subscriber to something else.
> It would need to be something like "convertbackup" then, which doesn't
> seem helpful.)
>
> > I think "convert" and "transform" fit for this case. However, "create",
> > "convert" and "transform" have 6, 7 and 9 characters, respectively. I
> > suggest that we avoid long names (subscriber already has 10 characters).
> > My preference is pg_createsubscriber.
>
> That seems best to me.

Just FYI - I'm OK to change the name to pg_createsubscriber.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
RE: speed up a logical replica setup
```
							PQntuples(res), 1);
		return NULL;
	}
```

The error message should be changed. I think this error means the standby has a wrong primary_slot_name, right?

18. misc
Sometimes the pid of pg_subscriber is referred to. It can be stored as a global variable.

19.
C99 style is now allowed, so loop variables like "i" can be declared in the for statement, like

```
for (int i = 0; i < MAX; i++)
```

20.
Some comments, docs, and outputs must be fixed when the name is changed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
RE: speed up a logical replica setup
Dear hackers,

Based on the requirement, I have profiled the performance test. It showed that the bottlenecks in the small-data cases are mainly two: starting a server and waiting until the recovery is done.

# Tested source code

The v7 patch set was applied atop HEAD (0eb23285). No configure options were specified when it was built.

# Tested workload

I focused on only the 100MB/1GB cases because bigger ones have already shown good performance. (The number of inserted tuples was the same as in previous tests.) I used a bash script instead of the TAP test framework; see attached. The executed SQLs and operations were almost the same. Note that I tested only the one-db case; results may change if the number of databases changes.

# Measurement

Some debug logs which output the current time were added (please see the diff file). I picked some events and measured the time before/after them. The bullets below show the measured ones:

* Starting a server
* Stopping a server
* Creating replication slots
* Creating publications
* Waiting until the recovery ended
* Creating subscriptions

# Result

The table below shows the elapsed time for these events. Raw data is also available in the attached Excel file.

|Event category                  |100MB case [sec]|1GB case [sec]|
|--------------------------------|----------------|--------------|
|Starting a server               |1.414           |1.417         |
|Stopping a server               |0.506           |0.506         |
|Creating replication slots      |0.005           |0.007         |
|Creating publications           |0.001           |0.002         |
|Waiting until the recovery ended|1.603           |14.529        |
|Creating subscriptions          |0.012           |0.012         |
|Total                           |3.541           |16.473        |
|Actual time                     |4.37            |17.271        |

As you can see, starting servers and waiting for recovery seem slow. We cannot omit these, but setting a smaller shared_buffers would reduce the start time. One approach is to overwrite the GUC to a smaller value, but I think we cannot determine the appropriate value.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED

Attachments:
run.sh
perf_result.xlsx
add_debug_log.diff