RE: State of pg_createsubscriber

2024-05-23 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Robert,

> So, we have the following options: (a) by default drop the
> pre-existing subscriptions, (b) by default disable the pre-existing
> subscriptions, and add a Note in the docs that users can take
> necessary actions to enable or drop them. Now, we can even think of
> providing a switch to retain the pre-existing subscriptions or
> publications as the user may have some use case where it can be
> helpful for her. For example, retaining publications can help in
> creating a bi-directional setup.

Another point we should consider is replication slots. If the standby server had
slots and they were forgotten, WAL files won't be discarded, so a disk-full
failure will happen. v2-0004 proposed in [1] drops replication slots whose
failover option is true. This can partially solve the issue, but what should be
done for the other slots?

[1]: 
https://www.postgresql.org/message-id/CANhcyEV6q1Vhd37i1axUeScLi0UAGVxta1LDa0BV0Eh--TcPMg%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Pgoutput not capturing the generated columns

2024-05-22 Thread Hayato Kuroda (Fujitsu)
Dear Shubham,

Thanks for updating the patch! I checked your patches briefly. Here are my 
comments.

01. API

Since the option for test_decoding is enabled by default, I think it should be
renamed, e.g., to "skip-generated-columns" or similar.

02. ddl.sql

```
+-- check include-generated-columns option with generated column
+CREATE TABLE gencoltable (a int PRIMARY KEY, b int GENERATED ALWAYS AS (a * 2) STORED);
+INSERT INTO gencoltable (a) VALUES (1), (2), (3);
+SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL, 'include-xids', '0', 'skip-empty-xacts', '1', 'include-generated-columns', '1');
+                            data
+-------------------------------------------------------------
+ BEGIN
+ table public.gencoltable: INSERT: a[integer]:1 b[integer]:2
+ table public.gencoltable: INSERT: a[integer]:2 b[integer]:4
+ table public.gencoltable: INSERT: a[integer]:3 b[integer]:6
+ COMMIT
+(5 rows)
```

We should also test the non-default case, in which the generated columns are not sent.
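
A sketch of what such a test might look like; the option name follows the patch under discussion, and the expected output is my assumption, not something I have verified:

```
-- With the option disabled, the generated column b should not appear
-- in the decoded change.
SELECT data FROM pg_logical_slot_get_changes('regression_slot', NULL, NULL,
    'include-xids', '0', 'skip-empty-xacts', '1',
    'include-generated-columns', '0');
-- Presumably only the plain column is emitted, e.g.:
--  table public.gencoltable: INSERT: a[integer]:1
```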

03. ddl.sql

I'm not sure the new tests are in the correct place. Do we have to add a new file
and move the tests to it?
Thoughts?

04. protocol.sgml

Please keep the format of the sgml file.

05. protocol.sgml

The option is implemented as a streaming option of the pgoutput plugin, so it
should be located under the "Logical Streaming Replication Parameters" section.

06. AlterSubscription

```
+   if (IsSet(opts.specified_opts, SUBOPT_GENERATED_COLUMN))
+   {
+       ereport(ERROR,
+               (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+                errmsg("toggling generated_column option is not allowed.")));
+   }
```

If you don't want to support the option, you can remove the SUBOPT_GENERATED_COLUMN
macro from the function. But can you clarify why you do not want to support it?

07. logicalrep_write_tuple

```
-   if (!column_in_column_list(att->attnum, columns))
+   if (!column_in_column_list(att->attnum, columns) && !att->attgenerated)
+       continue;
+
+   if (att->attgenerated && !publish_generated_column)
        continue;
```

I think the changes in v2 were reverted or wrongly merged.

08. test code

Can you add tests confirming that generated columns are replicated by logical
replication?
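
For example, such a test could replicate a table whose generated column on the publisher maps to a plain column on the subscriber. This is only a sketch with made-up object names and connection string, not a tested script:

```
-- Publisher:
CREATE TABLE regress_gen (a int PRIMARY KEY,
                          b int GENERATED ALWAYS AS (a * 2) STORED);
CREATE PUBLICATION regress_pub FOR TABLE regress_gen;
INSERT INTO regress_gen (a) VALUES (1), (2);

-- Subscriber: b is a plain column so the replicated values can be stored.
CREATE TABLE regress_gen (a int PRIMARY KEY, b int);
CREATE SUBSCRIPTION regress_sub CONNECTION 'dbname=postgres'
    PUBLICATION regress_pub;

-- After the apply worker catches up, b should hold the computed values.
SELECT a, b FROM regress_gen ORDER BY a;
```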

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-15 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

> You wrote "Fixed" for that patch v9-0004 suggestion but I don't think
> anything was changed at all. Accidentally missed?

Sorry, I missed doing `git add` after the revision.
The change is included in the new patch [1].

[1]: 
https://www.postgresql.org/message-id/OSBPR01MB25522052F9F3E3AAD3BA2A8CF5ED2%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

2024-05-15 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Ian,

> I guess it could be more work if we want to enhance the test for
> ERRORs other than the primary key violation. One simple fix is to
> update the log_offset to the location of the LOG after successful
> replication of un-conflicted data. For example, the Log location after
> we execute the below line in the test:
> 
> # Check replicated data
>my $res =
>  $node_subscriber->safe_psql('postgres', "SELECT
> count(*) FROM tbl");
>is($res, $expected, $msg);

I made a patch for confirmation purposes. This worked well in my environment.
Ian, how about you?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



fix_029.diff
Description: fix_029.diff


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-14 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thanks for reviewing! A new patch is available in [1].

> I'm having second thoughts about how these patches mention the option
> values "on|off". These are used in the ALTER SUBSCRIPTION document
> page for 'two_phase' and 'failover' parameters, and then those
> "on|off" get propagated to the code comments, error messages, and
> tests...
> 
> Now I see that on the CREATE SUBSCRIPTION page [1], every boolean
> parameter (even including 'two_phase' and 'failover') is described in
> terms of "true|false" (not "on|off").

Hmm. But I could find sentences like "The default value is off, ...". Also, in
alter_subscription.sgml, the "on|off" notation has already been used. I'm not
sure, but I feel there are no firm rules here.

> In hindsight, it is probably better to refer only to true|false
> everywhere for these boolean parameters, instead of sometimes using
> different values like on|off.
> 
> What do you think?

It's OK for me to make messages and code comments consistent. I'm not sure about
the documentation, but I followed this only in my part.

[1]: 
https://www.postgresql.org/message-id/OSBPR01MB2552F66463EFCFD654E87C09F5E32%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-13 Thread Hayato Kuroda (Fujitsu)
ied as 
> true./

Fixed.

> 
> 7.
> +# Verify the prepared transaction still exists
> +$result = $node_subscriber->safe_psql('postgres',
> +"SELECT count(*) FROM pg_prepared_xacts;");
> +is($result, q(1), "prepared transaction still exits");
> +
> 
> TYPO: /exits/exists/

Fixed.

> 
> ~~~
> 
> 8.
> +# Alter the two_phase with the force_alter option. Apart from the above, the
> +# command will abort the prepared transaction and succeed.
> +$node_subscriber->safe_psql('postgres',
> +"ALTER SUBSCRIPTION regress_sub SET (two_phase = off, force_alter
> = true);");
> +$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION
> regress_sub ENABLE;");
> +
> 
> What does "Apart from the above" mean? Be more explicit.

Clarified like "Apart from the last ALTER SUBSCRIPTION command...".

> 9.
> +# Verify the prepared transaction are aborted
>  $result = $node_subscriber->safe_psql('postgres',
>  "SELECT count(*) FROM pg_prepared_xacts;");
>  is($result, q(0), "prepared transaction done by worker is aborted");
> 
> /transaction are aborted/transaction was aborted/

Fixed.

[1]: 
https://www.postgresql.org/message-id/OSBPR01MB2552FEA48D265EA278AA9F7AF5E22%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-13 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thanks for reviewing! The patch is available in [1].

> ==
> src/sgml/ref/alter_subscription.sgml
> 
> 1.
> + 
> +  The two_phase parameter can only be altered when
> the
> +  subscription is disabled. When altering the parameter from
> on
> +  to off, the backend process checks prepared
> +  transactions done by the logical replication worker and aborts them.
> + 
> 
> The text may be OK as-is, but I was wondering if it might be better to
> give a more verbose explanation.
> 
> BEFORE
> ... the backend process checks prepared transactions done by the
> logical replication worker and aborts them.
> 
> SUGGESTION
> ... the backend process checks for any incomplete prepared
> transactions done by the logical replication worker (from when
> two_phase parameter was still "on") and, if any are found, those are
> aborted.
>

Fixed.

> ==
> src/backend/commands/subscriptioncmds.c
> 
> 2. AlterSubscription
> 
> - /*
> - * Since the altering two_phase option of subscriptions
> - * also leads to the change of slot option, this command
> - * cannot be rolled back. So prevent we are in the
> - * transaction block.
> + * If two_phase was enabled, there is a possibility the
> + * transactions has already been PREPARE'd. They must be
> + * checked and rolled back.
>   */
> 
> BEFORE
> ... there is a possibility the transactions has already been PREPARE'd.
> 
> SUGGESTION
> ... there is a possibility that transactions have already been PREPARE'd.

Fixed.

> 3. AlterSubscription
> + /*
> + * Since the altering two_phase option of subscriptions
> + * (especially on->off case) also leads to the
> + * change of slot option, this command cannot be rolled
> + * back. So prevent we are in the transaction block.
> + */
>   PreventInTransactionBlock(isTopLevel,
> "ALTER SUBSCRIPTION ... SET (two_phase = off)");
> 
> 
> This comment is a bit vague and includes some typos, but IIUC these
> problems will already be addressed by the 0002 patch changes. AFAIK
> patch 0003 is only moving the 0002 comment.

Yeah, the comment was updated accordingly.

> 
> ==
> src/test/subscription/t/099_twophase_added.pl
> 
> 4.
> +# Check the case that prepared transactions exist on the subscriber node
> +#
> +# If the two_phase is altering from "on" to "off" and there are prepared
> +# transactions on the subscriber, they must be aborted. This test checks it.
> +
> 
> Similar to the comment that I gave for v8-0002. I think there should
> be a comment for the major test comment to
> distinguish it from comments for the sub-steps.

Added.

> 5.
> +# Verify the prepared transaction are aborted because two_phase is changed to
> +# "off".
> +$result = $node_subscriber->safe_psql('postgres',
> +"SELECT count(*) FROM pg_prepared_xacts;");
> +is($result, q(0), "prepared transaction done by worker is aborted");
> +
> 
> /the prepared transaction are aborted/any prepared transactions are aborted/

Fixed.

[1]: 
https://www.postgresql.org/message-id/OSBPR01MB2552FEA48D265EA278AA9F7AF5E22%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Improving the latch handling between logical replication launcher and worker processes.

2024-05-09 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for raising these ideas!

> a) Introduce a new latch to handle worker attach and exit.

Just to confirm - there are three wait events for the launcher, so I feel we may
be able to create one latch per wait event. Is there a reason to introduce just
"a" (single) latch?

> b) Add a new GUC launcher_retry_time which gives more flexibility to
> users as suggested by Amit at [1]. Before 5a3a953, the
> wal_retrieve_retry_interval plays a similar role as the suggested new
> GUC launcher_retry_time, e.g. even if a worker is launched, the
> launcher only wait wal_retrieve_retry_interval time before next round.

Hmm. My concern is how users would estimate the value. Maybe the default will be
3min, but should users change it? If so, to what? I think even if it becomes
tunable, users cannot tune it well.

> c) Don't reset the latch at worker attach and allow launcher main to
> identify and handle it. For this there is a patch v6-0002 available at
> [2].

Do you mean that you want to remove ResetLatch() from
WaitForReplicationWorkerAttach()? If so, what about the following scenario?

1) The launcher is waiting in WaitForReplicationWorkerAttach() for the worker to
attach, and 2) a subscription is created before the attach completes. In this
case, the launcher becomes unable to sleep because the latch is set but never
reset. This may waste CPU time.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-09 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

> 
> ==
> Commit message
> 
> 1.
> A detailed commit message is needed to describe the purpose and
> details of this patch.

Added.

> ==
> doc/src/sgml/ref/alter_subscription.sgml
> 
> 2. CREATE SUBSCRIPTION
> 
> Shouldn't there be an entry for "force_alter" parameter in the CREATE
> SUBSCRIPTION "parameters" section, instead of just vaguely mentioning
> it in passing when describing the "two_phase" in ALTER SUBSCRIPTION?
>
> 3. ALTER SUBSCRIPTION - alterable parameters
> 
> And shouldn't this new option also be named in the ALTER SUBSCRIPTION
> list: "The parameters that can be altered are..."

Hmm, but the parameter cannot be used for CREATE SUBSCRIPTION. Should we modify
it to accept the option and add the description to the doc? This was not accepted for now.

> ==
> src/backend/commands/subscriptioncmds.c
> 
> 4.
>   XLogRecPtr lsn;
> + bool twophase_force;
>  } SubOpts;
> 
> IMO this field ought to be called 'force_alter' to be the same as the
> option name. Sure, now it is only relevant for 'two_phase', but that
> might not always be the case in the future.

Modified.

> 5. AlterSubscription
> 
> + /*
> + * Abort prepared transactions if force option is also
> + * specified. Otherwise raise an ERROR.
> + */
> + if (!opts.twophase_force)
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot alter %s when there are prepared transactions",
> + "two_phase = false")));
> +
> 
> 5a.
> /if force option is also specified/only if the 'force_alter' option is true/

Modified.

> 
> 5b.
> "two_phase = false" -- IMO that should say "two_phase = off"

Modified.

> 5c.
> IMO this ereport should include a errhint to tell the user they can
> use 'force_alter = true' to avoid getting this error.

Hint was added.

> 6.
> 
> + /* force_alter cannot be used standalone */
> + if (IsSet(opts.specified_opts, SUBOPT_FORCE_ALTER) &&
> + !IsSet(opts.specified_opts, SUBOPT_TWOPHASE_COMMIT))
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("%s must be specified with %s",
> + "force_alter", "two_phase")));
> +
> 
> IMO this rule is not necessary so the code should be removed. I think
> using 'force_alter' standalone doesn't do anything at all (certainly,
> it does no harm) so why add more complications (more rules, more code,
> more tests) just for the sake of it?

Removed. So a standalone 'force_alter' is now a no-op.

> src/test/subscription/t/099_twophase_added.pl
> 
> 7.
> +$node_subscriber->safe_psql('postgres',
> +"ALTER SUBSCRIPTION sub SET (two_phase = off, force_alter = on);");
> 
> "force" is a verb, so it is better to say 'force_alter = true' instead
> of 'force_alter = on'.

Fixed. Actually, I'm not sure it is better because I'm not a native speaker.

> 8.
>  $result = $node_subscriber->safe_psql('postgres',
>  "SELECT count(*) FROM pg_prepared_xacts;");
>  is($result, q(0), "prepared transaction done by worker is aborted");
> 
> +$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION sub
> ENABLE;");
> +
> 
> I felt the ENABLE statement should be above the SELECT statement so
> that the code is more like it was before applying the patch.

Fixed.

Please see attached patch set.


Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v8-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description:  v8-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch


v8-0002-Alter-slot-option-two_phase-only-when-altering-on.patch
Description:  v8-0002-Alter-slot-option-two_phase-only-when-altering-on.patch


v8-0003-Abort-prepared-transactions-while-altering-two_ph.patch
Description:  v8-0003-Abort-prepared-transactions-while-altering-two_ph.patch


v8-0004-Add-force_alter-option-for-ALTER-SUBSCIRPTION-.-S.patch
Description:  v8-0004-Add-force_alter-option-for-ALTER-SUBSCIRPTION-.-S.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-09 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

> Commit Message
> 
> 1.
> The patch needs a commit message to describe the purpose and highlight
> any limitations and other details.

Added.

> ==
> doc/src/sgml/ref/alter_subscription.sgml
> 
> 2.
> +
> + 
> +  The two_phase parameter can only be altered when
> the
> +  subscription is disabled. When altering the parameter from
> true
> +  to false, the backend process checks prepared
> +  transactions done by the logical replication worker and aborts them.
> + 
> 
> Here, the para is referring to "true" and "false" but earlier on this
> page it talks about "twophase = off". IMO it is better to use a
> consistent terminology like "on|off" everywhere instead of randomly
> changing the way it is described each time.

I checked the contents and changed them to "on|off".

> ==
> src/backend/commands/subscriptioncmds.c
> 
> 3. AlterSubscription
> 
>   if (IsSet(opts.specified_opts, SUBOPT_TWOPHASE_COMMIT))
>   {
> + List *prepared_xacts = NIL;
> 
> This 'prepared_xacts' can be declared at a lower scope because it is
> only used if (!opts.twophase).
> 
> Furthermore, IIUC you don't need to assign NIL in the declaration
> because there is no chance for it to be unassigned anyway.

Made the scope narrower and removed the initialization.

> ~~~
> 
> 4. AlterSubscription
> 
> + /*
> + * The changed two_phase option (true->false) of the
> + * slot can't be rolled back.
> + */
>   PreventInTransactionBlock(isTopLevel,
> "ALTER SUBSCRIPTION ... SET (two_phase = off)");
> 
> Here is another example of inconsistent mixing of the terminology
> where the comment says "true"/"false" but the message says "off".
> Let's keep everything consistent. (I prefer on|off).

Modified.

> ~~~
> 
> 5.
> + if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED &&
> + (prepared_xacts = GetGidListBySubid(subid)) != NIL)
> + {
> + ListCell *cell;
> +
> + /* Abort all listed transactions */
> + foreach(cell, prepared_xacts)
> + FinishPreparedTransaction((char *) lfirst(cell),
> +   false);
> +
> + list_free(prepared_xacts);
> + }
> 
> 5A.
> IIRC there is a cleaner way to write this loop without needing
> ListCell variable -- e.g. foreach_ptr() macro?

Changed.

> 5B.
> Shouldn't this be using list_free_deep() so the pstrdup gid gets freed too?

Yeah, fixed.

> ==
> src/test/subscription/t/099_twophase_added.pl
> 
> 6.
> +##
> +# Check the case that prepared transactions exist on subscriber node
> +##
> +
> 
> Give some more detailed comments here similar to the review comment of
> patch v7-0002 for the other part of this TAP test.
> 
> ~~~
> 
> 7. TAP test - comments
> 
> Same as for my v7-0002 review comments, I think this test case also
> needs a few more one-line comments to describe the sub-steps. e.g.:
> 
> # prepare a transaction to insert some rows to the table
> 
> # verify the prepared tx is replicated to the subscriber (because
> 'two_phase = on')
> 
> # toggle the two_phase to 'off' *before* the COMMIT PREPARED
> 
> # verify the prepared tx got aborted
> 
> # do the COMMIT PREPARED (note that now two_phase is 'off')
> 
> # verify the inserted rows got replicated ok

They were fixed based on your previous comments.

> 
> 8. TAP test - subscription name
> 
> It's better to rename the SUBSCRIPTION in this TAP test so you can
> avoid getting log warnings like:
> 
> psql::4: WARNING:  subscriptions created by regression test
> cases should have names starting with "regress_"
> psql::4: NOTICE:  created replication slot "sub" on publisher

Modified, but it was included in 0001.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 




RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-09 Thread Hayato Kuroda (Fujitsu)
 + appendStringInfo(, "FAILOVER %s ",
> + (*failover) ? "true" : "false");
> +
> + if (two_phase)
> + appendStringInfo(, "TWO_PHASE %s%s ",
> + (*two_phase) ? "true" : "false",
> + failover ? ", " : "");
> +
> + appendStringInfoString(, ");");
> 
> 4a.
> IIUC the comma logic here was broken in v7 when you swapped the order.
> Anyway, IMO it will be better NOT to try combining that comma logic
> with the existing appendStringInfo. Doing it separately is both easier
> and less error-prone.
> 
> Furthermore, the parentheses like "(*two_phase)" instead of just
> "*two_phase" seemed a bit overkill.
> 
> SUGGESTION:
> + if (failover)
> + appendStringInfo(, "FAILOVER %s",
> + *failover ? "true" : "false");
> +
> +   if (failover && two_phase)
> +   appendStringInfo(, ", ");
> +
> + if (two_phase)
> + appendStringInfo(, "TWO_PHASE %s",
> + *two_phase ? "true" : "false");
> +
> + appendStringInfoString(, " );");

Fixed.

> 4b.
> Like I said above, IMO the current separator logic in v7 is broken. So
> it is a bit concerning the tests all passed anyway. How did that
> happen? I think this indicates that there needs to be an additional
> test scenario where both 'failover' and 'two_phase' get altered at the
> same time so this code gets exercised properly.

Right, it was added.

> ==
> src/test/subscription/t/099_twophase_added.pl
> 
> 5.
> +# Define pre-existing tables on both nodes
> 
> Why say they are "pre-existing"? They are not pre-existing because you
> are creating them right here!

Removed the word.

> 6.
> +##
> +# Check the case that prepared transactions exist on publisher node
> +##
> 
> I think this needs a slightly more detailed comment.
> 
> SUGGESTION (this is just an example, but you can surely improve it)
> 
> # Check the case that prepared transactions exist on the publisher node.
> #
> # Since two_phase is "off", then normally this PREPARE will do nothing until
> # the COMMIT PREPARED, but in this test, we toggle the two_phase to "on" again
> # before the COMMIT PREPARED happens.

Changed with adjustments.

> 7.
> Maybe this test case needs a few more one-line comments for each of
> the sub-steps. e.g.:
> 
> # prepare a transaction to insert some rows to the table
> 
> # verify the prepared tx is not yet replicated to the subscriber
> (because 'two_phase = off')
> 
> # toggle the two_phase to 'on' *before* the COMMIT PREPARED
> 
> # verify the inserted rows got replicated ok

Modified like yours, but changed based on the suggestion by Grammarly.

> 8.
> IIUC this test will behave the same even if you DON'T do the toggle
> 'two_phase = on'. So I wonder is there something more you can do to
> test this scenario more convincingly?

I found an indicator. When the apply worker starts, it outputs the current status
of the two_phase option. I added wait_for_log() to ensure the message below appeared. Thoughts?

```
ereport(DEBUG1,
        (errmsg_internal("logical replication apply worker for subscription \"%s\" two_phase is %s",
                         MySubscription->name,
                         MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_DISABLED ? "DISABLED" :
                         MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_PENDING ? "PENDING" :
                         MySubscription->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED ? "ENABLED" :
                         "?")));
```
Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-09 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thanks for reviewing! The patch will be posted in an upcoming message.

> ==
> src/backend/access/transam/twophase.c
> 
> 1. IsTwoPhaseTransactionGidForSubid
> 
> +/*
> + * IsTwoPhaseTransactionGidForSubid
> + * Check whether the given GID is formed by TwoPhaseTransactionGid.
> + */
> +static bool
> +IsTwoPhaseTransactionGidForSubid(Oid subid, char *gid)
> 
> I think the function comment should mention something about 'subid'.
> 
> SUGGESTION
> Check whether the given GID (as formed by TwoPhaseTransactionGid) is
> for the specified 'subid'.

Fixed.

> src/backend/commands/subscriptioncmds.c
> 
> 2. AlterSubscription
> 
> + if (!opts.twophase &&
> + form->subtwophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED
> &&
> + LookupGXactBySubid(subid))
> + /* Add error message */
> + ereport(ERROR,
> + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("cannot disable two_phase when uncommitted prepared
> transactions present"),
> + errhint("Resolve these transactions and try again")));
> 
> The comment "/* Add error message */" seems unnecessary.

Yeah, this was an internal flag. Removed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Pgoutput not capturing the generated columns

2024-05-08 Thread Hayato Kuroda (Fujitsu)
Dear Shubham,

Thanks for creating a patch! Here are high-level comments.

1.
Please document the feature. If it is hard to describe, we should change the 
API.

2.
Currently, the option is implemented as a streaming option. Are there any reasons
for choosing that approach? Another approach is to implement it as a slot option,
like failover and temporary.
and temporary.

3.
You said that the subscription option is not supported for now. I'm not sure -
does that mean the logical replication feature cannot be used for generated
columns? If so, the restriction won't be acceptable. If the combination of this
and the initial sync is problematic, can't we reject it in CreateSubscription and
AlterSubscription?
E.g., the create_slot option cannot be set if slot_name is NONE.
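
For reference, the existing precedent looks roughly like this (a sketch with placeholder names; the exact error wording may differ):

```
-- slot_name = NONE already rejects an incompatible combination up front,
-- so an analogous check could reject the new option during initial sync.
CREATE SUBSCRIPTION regress_sub CONNECTION 'dbname=postgres'
    PUBLICATION regress_pub
    WITH (slot_name = NONE, create_slot = true);
-- fails: slot_name = NONE requires create_slot = false (and enabled = false)
```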

4.
Regarding the test_decoding plugin, it has already been able to decode generated
columns. So, in the first place, is the proposed option really needed for the
plugin? Why do you include it?
If you want to add the option anyway, the default value should be on, which keeps
the current behavior.

5.
Assuming that the feature become usable used for logical replicaiton. Not sure,
should we change the protocol version at that time? Nodes prior than PG17 may
not want receive values for generated columns. Can we control only by the 
option?

6. logicalrep_write_tuple()

```
-    if (!column_in_column_list(att->attnum, columns))
+    if (!column_in_column_list(att->attnum, columns) && !att->attgenerated)
+        continue;
```

Hmm, does the above mean that generated columns are decoded even if they are not in
the column list? If so, why? I think such columns should not be sent.
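
To illustrate the concern (object names are hypothetical): with a column list that names only the plain column, one would expect the generated column to stay out of the stream regardless of the new option:

```
CREATE TABLE t (a int PRIMARY KEY,
                b int GENERATED ALWAYS AS (a * 2) STORED);
-- The column list contains only "a"; the expectation stated above is
-- that the generated column "b" is then never sent.
CREATE PUBLICATION pub_a FOR TABLE t (a);
```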

7.

Some functions refer to data->publish_generated_column many times. Can we store
the value in a local variable?

The comments below are for the test_decoding part; they may not be needed.

=

a. pg_decode_startup()

```
+else if (strcmp(elem->defname, "include_generated_columns") == 0)
```

Other options for test_decoding do not contain underscores. It should be
"include-generated-columns".

b. pg_decode_change()

data->include_generated_columns is referred to four times in the function.
Can you store the value in a variable?


c. pg_decode_change()

```
-true);
+true, data->include_generated_columns );
```

Please remove the blank before the closing parenthesis.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-08 Thread Hayato Kuroda (Fujitsu)
ession
> instead?

Actually, it cannot be done in the main regression test, because altering two_phase
requires a connection between the publisher and subscriber, which is not
established in the subscription.sql file. The success case for altering failover
has not been tested either, and I think for the same reason.

> src/test/subscription/t/021_twophase.pl
> 
> 17.
> +# Disable the subscription and alter it to two_phase = false,
> +# verify that the altered subscription reflects the two_phase option.
> 
> /verify/then verify/

Fixed.

> 18.
> +# Now do a prepare on publisher and make sure that it is not replicated.
> +$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
> +$node_publisher->safe_psql(
> +   'postgres', qq{
> +BEGIN;
> +INSERT INTO tab_copy VALUES (100);
> +PREPARE TRANSACTION 'newgid';
> + });
> +
> 
> 18a.
> /on publisher/on the publisher/

Fixed.

> 18b.
> What is that "DROP SUBSCRIPTION tap_sub" doing here? It seems
> misplaced under this comment.

The subscription must be dropped because it also prepares a transaction.
I moved it before the test case and added comments.

> 19.
> +# Make sure that there is 0 prepared transaction on the subscriber
> +$result = $node_subscriber->safe_psql('postgres',
> + "SELECT count(*) FROM pg_prepared_xacts;");
> +is($result, qq(0), 'transaction is prepared on subscriber');
> 
> 19a.
> SUGGESTION
> Make sure there are no prepared transactions on the subscriber

Fixed.

> 19b.
> /'transaction is prepared on subscriber'/'should be no prepared
> transactions on subscriber'/

Replaced.

> 20.
> +# Made sure that the commited transaction is replicated.
> 
> /Made sure/Make sure/
> 
> /commited/committed/

Fixed.

> 21.
> +# Make sure that the two-phase is enabled on the subscriber
> +$result = $node_subscriber->safe_psql('postgres',
> + "SELECT subtwophasestate FROM pg_subscription WHERE subname =
> 'tap_sub_copy';"
> +);
> +is($result, qq(e), 'two-phase is disabled');
> 
> The 'two-phase is disabled' is the identical message used in the
> opposite case earlier, so something is amiss. Maybe this one should
> say 'two-phase should be enabled' and the earlier counterpart should
> say 'two-phase should be disabled'.

Both of them were fixed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 




v7-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description:  v7-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch


v7-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch
Description:  v7-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch


v7-0003-Abort-prepared-transactions-while-altering-two_ph.patch
Description:  v7-0003-Abort-prepared-transactions-while-altering-two_ph.patch


v7-0004-Add-force_alter-option.patch
Description: v7-0004-Add-force_alter-option.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-23 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

Per the recent commit (b29cbd3da), our patch needed to be rebased.
Here is an updated version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 
 


v6-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description:  v6-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch


v6-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch
Description:  v6-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch


v6-0003-Abort-prepared-transactions-while-altering-two_ph.patch
Description:  v6-0003-Abort-prepared-transactions-while-altering-two_ph.patch


v6-0004-Add-force_alter-option.patch
Description: v6-0004-Add-force_alter-option.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-22 Thread Hayato Kuroda (Fujitsu)
> Dear Vitaly,
> 
> > I looked at the patch and realized that I can't try it easily in the near 
> > future
> > because the solution I'm working on is based on PG16 or earlier. This patch 
> > is
> > not easily applicable to the older releases. I have to port my solution to 
> > the
> > master, which is not done yet.
> 
> We also tried to port our patch for PG16, but the largest barrier was that a
> replication command ALTER_SLOT is not supported. Since the slot option
> two_phase
> can't be modified, it is difficult to skip decoding PREPARE command even when
> altering the option from true to false.
> IIUC, Adding a new feature (e.g., replication command) for minor updates is
> generally
> prohibited
> 
> We must consider another approach for backpatching, but we do not have
> solutions
> for now.

The attached patch set is a version ported to PG16, which breaks ABI. It can be
used for testing purposes, but it won't be pushed to REL_16_STABLE.
At least, this patch set passes my GitHub CI.

Can you apply it and check whether your issue is solved?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 
 


REL_16_0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPTION.patch
Description:  REL_16_0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPTION.patch


REL_16_0002-Alter-slot-option-two_phase-only-when-altering-true-.patch
Description:  REL_16_0002-Alter-slot-option-two_phase-only-when-altering-true-.patch


REL_16_0003-Abort-prepared-transactions-while-altering-two_phase.patch
Description:  REL_16_0003-Abort-prepared-transactions-while-altering-two_phase.patch


REL_16_0004-Add-force_alter-option.patch
Description: REL_16_0004-Add-force_alter-option.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-22 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

> Looking at this a bit more, maybe rolling back all prepared transactions on 
> the
> subscriber when toggling two_phase from true to false might not be desirable
> for the customer. Maybe we should have an option for customers to control
> whether transactions should be rolled back or not. Maybe transactions should
> only be rolled back if a "force" option is also set. What do people think?

And here is a patch that adds a new option "force_alter" (a better name is very
welcome).
It can be used only when altering the two_phase option. Let me share examples.

Assume a logical replication setup with two_phase = on, where there are
prepared transactions:

```
subscriber=# SELECT * FROM pg_prepared_xacts ;
 transaction |       gid        |           prepared            |  owner   | database 
-------------+------------------+-------------------------------+----------+----------
         741 | pg_gid_16390_741 | 2024-04-22 08:02:34.727913+00 | postgres | postgres
         742 | pg_gid_16390_742 | 2024-04-22 08:02:34.729486+00 | postgres | postgres
(2 rows)
```

At that time, altering two_phase to false alone will fail:

```
subscriber=# ALTER SUBSCRIPTION sub DISABLE ;
ALTER SUBSCRIPTION
subscriber=# ALTER SUBSCRIPTION sub SET (two_phase = off);
ERROR:  cannot alter two_phase = false when there are prepared transactions
```

It succeeds if force_alter is also explicitly set. Prepared transactions will be
aborted at that time.

```
subscriber=# ALTER SUBSCRIPTION sub SET (two_phase = off, force_alter = on);
ALTER SUBSCRIPTION
subscriber=# SELECT * FROM pg_prepared_xacts ;
 transaction | gid | prepared | owner | database 
-------------+-----+----------+-------+----------
(0 rows)
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



v5-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description:  v5-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch


v5-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch
Description:  v5-0002-Alter-slot-option-two_phase-only-when-altering-tr.patch


v5-0003-Abort-prepared-transactions-while-altering-two_ph.patch
Description:  v5-0003-Abort-prepared-transactions-while-altering-two_ph.patch


v5-0004-Add-force_alter-option.patch
Description: v5-0004-Add-force_alter-option.patch


RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-22 Thread Hayato Kuroda (Fujitsu)
Dear Vitaly,

> I looked at the patch and realized that I can't try it easily in the near 
> future
> because the solution I'm working on is based on PG16 or earlier. This patch is
> not easily applicable to the older releases. I have to port my solution to the
> master, which is not done yet.

We also tried to port our patch to PG16, but the largest barrier was that the
replication command ALTER_SLOT is not supported there. Since the slot option two_phase
can't be modified, it is difficult to skip decoding the PREPARE command even when
altering the option from true to false.
IIUC, adding a new feature (e.g., a replication command) in minor updates is
generally prohibited.

We must consider another approach for backpatching, but we do not have a solution
for now.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 

 


RE: Disallow changing slot's failover option in transaction block

2024-04-18 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Thanks for updating the patch! Let me confirm the content.

In your patch, pg_dump.c was updated. IIUC the main reason was that
pg_restore executes some queries as single transactions, so ALTER
SUBSCRIPTION cannot be done, right?
Also, failover was synchronized only when we were in the upgrade mode, but
your patch seems to remove the condition. Can you clarify the reason?

Other than that, the patch LGTM.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Disallow changing slot's failover option in transaction block

2024-04-18 Thread Hayato Kuroda (Fujitsu)
Dear Shveta,

> When I said that option 2 aligns with ALTER-SUB documented behaviour,
> I meant the doc described in [1] stating "When altering the slot_name,
> the failover and two_phase property values of the named slot may
> differ from the counterpart failover and two_phase parameters
> specified in the subscription"
> 
> [1]:
> https://www.postgresql.org/docs/devel/sql-altersubscription.html#SQL-ALTER
> SUBSCRIPTION-PARAMS-SET

I see, thanks for the clarification. Agreed that the description does not conflict
with option 2.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Disallow changing slot's failover option in transaction block

2024-04-18 Thread Hayato Kuroda (Fujitsu)
Dear Shveta,

Sorry for the delayed response. I missed your post.

> +1.
> 
> Similar to ALTER SUB, CREATE SUB also needs to be fixed. This is
> because we call alter_replication_slot in CREATE SUB as well, for the
> case when slot_name is provided and create_slot=false. But the tricky
> part is we call alter_replication_slot() when creating subscription
> for both failover=true and false. That means if we want to fix it on
> the similar line of ALTER SUB, we have to disallow user from executing
> the CREATE SUBSCRIPTION (slot_name = xx) in a txn block, which seems
> to break some existing use cases. (previously, user can execute such a
> command inside a txn block).
> 
> So, we need to think if there are better ways to fix it.  After
> discussion with Hou-San offlist, here are some ideas:
> 1. do not alter replication slot's failover option when CREATE
> SUBSCRIPTION   WITH failover=false. This means we alter replication
> slot only when failover is set to true. And thus we can disallow
> CREATE SUB WITH (slot_name =xx, failover=true, create_slot=false)
> inside a txn block.
> 
> This option allows user to run CREATE-SUB(create_slot=false) with
> failover=false in txn block like earlier. But on the downside, it
> makes the behavior inconsistent for otherwise simpler option like
> failover,  i.e. with failover=true, CREATE SUB is not allowed in txn
> block while with failover=false, it is allowed. It makes it slightly
> complex to be understood by user.
> 
> 2. let's not disallow CREATE SUB in txn block as earlier, just don't
> perform internal alter-failover during CREATE SUB for existing
> slots(create_slot=false, slot_name=xx)  i.e. when create_slot is
> false, we will ignore failover parameter of CREATE SUB and it is
> user's responsibility to set it appropriately using ALTER SUB cmd. For
> create_slot=true, the behaviour of CREATE-SUB remains same as earlier.
> 
> This option does not add new restriction for CREATE SUB wrt txn block.
> In context of failover with create_slot=false, we already have a
> similar restriction (documented one) for ALTER SUB, i.e. with 'ALTER
> SUBSCRIPTION SET(slot_name = new)', user needs to alter the new slot's
> failover by himself. CREAT SUB can also be documented in similar way.
> This seems simpler to be understood considering existing ALTER SUB's
> behavior as well. Plus, this will make CREATE-SUB code slightly
> simpler and thus easily manageable.
> 
> 3. add a alter_slot option for CREATE SUBSCRIPTION, we can only alter
> the  slot's failover if alter_slot=true. And so we can disallow CREATE
> SUB WITH (slot_name =xx, alter_slot=true) inside a txn block.
> 
> This seems a clean way, as everything will be as per user's consent
> based on alter_slot parameter. But on the downside, this will need
> introducing additional parameter and also adding new restriction of
> running CREATE-sub in txn  block for a specific case.
> 
> 4. Don't alter replication in subscription DDLs. Instead, try to alter
> replication slot's failover in the apply worker. This means we need to
> execute alter_replication_slot each time before starting streaming in
> the apply worker.
> 
> This does not seem appealing to execute alter_replication_slot
> everytime the apply worker starts. But if others think it as a better
> option, it can be further analyzed.

Thanks for describing. I also prefer 2, because it seems a bit strange that a
CREATE statement leads to an ALTER.

> Currently, our preference is option 2, as that looks a clean solution
> and also aligns with ALTER-SUB behavior which is already documented.
> Thoughts?
> 
> 
> Note that we could not refer to the design of two_phase here, because
> two_phase can be considered as a streaming option, so it's fine to
> change the two_phase along with START_REPLICATION command. (the
> two_phase is not changed in subscription DDLs, but get changed in
> START_REPLICATION command). But the failover is closely related to a
> replication slot itself.
> 

Sorry, I cannot find those statements. What were you referring to?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Disallow changing slot's failover option in transaction block

2024-04-16 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

> Kuroda-San reported an issue off-list that:
> 
> If user execute ALTER SUBSCRIPTION SET (failover) command inside a txn block
> and rollback, only the subscription option change can be rolled back, while 
> the
> replication slot's failover change is preserved.
> 
> This is because ALTER SUBSCRIPTION SET (failover) command internally
> executes
> the replication command ALTER_REPLICATION_SLOT to change the replication
> slot's
> failover property, but this replication command execution cannot be
> rollback.
> 
> To fix it, I think we can prevent user from executing ALTER SUBSCRIPTION set
> (failover) inside a txn block, which is also consistent to the ALTER
> SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach a small
> patch to address this.

Thanks for posting the patch; the fix is the same as I expected.
Also, should we add the restriction to the doc? I feel [1] can be updated.


[1]:https://www.postgresql.org/docs/devel/sql-altersubscription.html#SQL-ALTERSUBSCRIPTION-PARAMS-SET

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 





RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-15 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> > FYI - We also considered the idea which walsender waits until all prepared
> transactions
> > are resolved before decoding and sending changes, but it did not work well
> > - the restarted walsender sent only COMMIT PREPARED record for
> transactions which
> > have been prepared before disabling the subscription. This happened because
> > 1) if the two_phase option of slots is false, the confirmed_flush can be 
> > ahead of
> >PREPARE record, and
> > 2) after the altering and restarting, start_decoding_at becomes same as
> >confirmed_flush and records behind this won't be decoded.
> >
> 
> I don't understand the exact problem you are facing. IIUC, if the
> commit is after start_decoding_at point and prepare was before it, we
> expect to send the entire transaction followed by a commit record. The
> restart_lsn should be before the start of such a transaction and we
> should have recorded the changes in the reorder buffer.

This behavior is right for the two_phase = false case. But if the parameter is
altered between PREPARE and COMMIT PREPARED, there is a possibility that only
COMMIT PREPARED is sent. To begin with, the executed workload was as below.

1. created a subscription with (two_phase = false)
2. prepared a transaction on publisher
3. disabled the subscription once
4. altered the subscription to two_phase = true
5. enabled the subscription again
6. did COMMIT PREPARED on the publisher

-> Apply worker would raise an ERROR while applying COMMIT PREPARED record:
ERROR:  prepared transaction with identifier "pg_gid_XXX_YYY" does not exist

Below part describes why the ERROR occurred.

==

### Regarding 1) the confirmed_flush can be ahead of PREPARE record,

If two_phase is off, as you might know, confirmed_flush can get ahead of the PREPARE
record via the keepalive mechanism.

The walsender sometimes sends a keepalive message in WalSndKeepalive(). The message
contains an LSN, which points at the record decoded last. Since the PREPARE record is
skipped (it is just handled by ReorderBufferProcessXid()), the LSN written in the
message can sometimes be ahead of the PREPARE record. If the WAL records are aligned
as below, the LSN can point at CHECKPOINT_ONLINE.

...
INSERT
PREPARE txn1
CHECKPOINT_ONLINE
...

On the worker side, when it receives the keepalive, it compares the LSN in the
message with the last received LSN, and advances last_received. Then, the worker
replies to the walsender, reporting that the last_received record has been
flushed on the subscriber. See send_feedback().
 
On the publisher, when the walsender receives the message from the subscriber, it
reads the message and advances confirmed_flush to the written value. If the walsender
sent an LSN located ahead of PREPARE, confirmed_flush is updated accordingly.

### Regarding 2) after the altering, records behind the confirmed_flush are not 
decoded

Then, in the decoding phase, the snapshot builder determines the point where decoding
is resumed, as start_decoding_at. After the restart, the value is the same as the
slot's confirmed_flush. Since confirmed_flush is ahead of PREPARE,
start_decoding_at is ahead as well, so the whole prepared transaction is not decoded.

==
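To make the two conditions above concrete, here is a toy model (plain integers stand in for LSNs, and all names are invented for illustration; this is not PostgreSQL code) of why the restarted decoder sends only COMMIT PREPARED:

```python
# Toy model: LSNs are plain integers; record names are illustrative only.
INSERT, PREPARE, CHECKPOINT_ONLINE, COMMIT_PREPARED = 100, 200, 300, 400

def decoded_after_restart(record_lsn, confirmed_flush):
    """Decoding resumes at start_decoding_at = confirmed_flush, so any
    record behind it is skipped after the restart."""
    return record_lsn >= confirmed_flush

# While two_phase was false, PREPARE was skipped, so a keepalive could
# report the last decoded record (here CHECKPOINT_ONLINE) and the
# subscriber's feedback advanced confirmed_flush past the PREPARE record.
confirmed_flush = CHECKPOINT_ONLINE

assert not decoded_after_restart(PREPARE, confirmed_flush)       # PREPARE is lost
assert decoded_after_restart(COMMIT_PREPARED, confirmed_flush)   # only this is sent
```

This is why the apply worker later fails with "prepared transaction ... does not exist": only the COMMIT PREPARED record is ahead of start_decoding_at.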

The attached zip file contains the PoC and the script I used. You can refer to what
I actually did.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-15 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> Vitaly, does the minimal solution provided by the proposed patch
> (Allow to alter two_phase option of a subscriber provided no
> uncommitted
> prepared transactions are pending on that subscription.) address your use 
> case?

I think we do not have to handle cases where there are prepared transactions on the
publisher/subscriber, as a first step. That leads to additional complexity, and we
do not have smarter solutions, especially for problem 2.
IIUC it meets Vitaly's condition, right?

> > 1. While toggling two_phase from true to false, we could probably get a 
> > list of
> prepared transactions for this subscriber id and rollback/abort the prepared
> transactions. This will allow the transactions to be re-applied like a normal
> transaction when the commit comes. Alternatively, if this isn't appropriate 
> doing it
> in the ALTER SUBSCRIPTION context, we could store the xids of all prepared
> transactions of this subscription in a list and when the corresponding xid is 
> being
> committed by the apply worker, prior to commit, we make sure the previously
> prepared transaction is rolled back. But this would add the overhead of 
> checking
> this list every time a transaction is committed by the apply worker.
> >
> 
> In the second solution, if you check at the time of commit whether
> there exists a prior prepared transaction then won't we end up
> applying the changes twice? I think we can first try to achieve it at
> the time of Alter Subscription because the other solution can add
> overhead at each commit?

Yeah, at least the second solution might be problematic. I prototyped
the first one and it worked well. However, to make the feature more consistent,
having prepared transactions on the subscriber is prohibited for now.
We can ease this restriction based on requirements.

> > 2. No solution yet.
> >
> 
> One naive idea is that on the publisher we can remember whether the
> prepare has been sent and if so then only send commit_prepared,
> otherwise send the entire transaction. On the subscriber-side, we
> somehow, need to ensure before applying the first change whether the
> corresponding transaction is already prepared and if so then skip the
> changes and just perform the commit prepared. One drawback of this
> approach is that after restart, the prepare flag wouldn't be saved in
> the memory and we end up sending the entire transaction again. One way
> to avoid this overhead is that the publisher before sending the entire
> transaction checks with subscriber whether it has a prepared
> transaction corresponding to the current commit. I understand that
> this is not a good idea even if it works but I don't have any better
> ideas. What do you think?

I considered it, but I am not sure it is good to add such a mechanism. Your idea
requires an additional wait-loop, which might lead to bugs and unexpected behavior.
And it may degrade performance depending on the network environment.
As for the other solution (the worker sends a list of prepared transactions), it
is also not so good because the list of prepared transactions may be huge.

Based on the above, I think we can reject the case for now.

FYI - We also considered the idea which walsender waits until all prepared 
transactions
are resolved before decoding and sending changes, but it did not work well
- the restarted walsender sent only COMMIT PREPARED record for transactions 
which
have been prepared before disabling the subscription. This happened because
1) if the two_phase option of slots is false, the confirmed_flush can be ahead 
of
   PREPARE record, and
2) after the altering and restarting, start_decoding_at becomes same as
   confirmed_flush and records behind this won't be decoded.

> > 3. We could mandate that the altering of two_phase state only be done after
> disabling the subscription, just like how it is handled for failover option.
> >
> 
> makes sense.

OK, this spec was added.

According to above, I updated the patch with Ajin.
0001 - extends ALTER SUBSCRIPTION statement. A tab-completion was added.
0002 - mandates that the subscription has been disabled. Since there is no need to
       change AtEOXact_ApplyLauncher(), that change is reverted.
       If there are no objections, this can be merged into 0001.
0003 - checks whether there are transactions prepared by the worker. If found,
   rejects the ALTER SUBSCRIPTION command.
0004 - checks whether there are transactions prepared on publisher. The backend
   connects to the publisher and confirms it. If found, rejects the ALTER
   SUBSCRIPTION command.
0005 - adds TAP test for it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v3-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch
Description:  v3-0001-Allow-altering-of-two_phase-option-of-a-SUBSCRIPT.patch


v3-0002-Mandate-the-subscription-has-been-disabled.patch
Descrip

RE: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-10 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> One naive idea is that on the publisher we can remember whether the
> prepare has been sent and if so then only send commit_prepared,
> otherwise send the entire transaction. On the subscriber-side, we
> somehow, need to ensure before applying the first change whether the
> corresponding transaction is already prepared and if so then skip the
> changes and just perform the commit prepared. One drawback of this
> approach is that after restart, the prepare flag wouldn't be saved in
> the memory and we end up sending the entire transaction again. One way
> to avoid this overhead is that the publisher before sending the entire
> transaction checks with subscriber whether it has a prepared
> transaction corresponding to the current commit. I understand that
> this is not a good idea even if it works but I don't have any better
> ideas. What do you think?

An alternative idea is that the worker passes a list of prepared transactions as a
new option in START_REPLICATION. This can reduce the number of inter-node
communications, but sometimes the list may be huge.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


RE: Improve eviction algorithm in ReorderBuffer

2024-04-10 Thread Hayato Kuroda (Fujitsu)
Dear Heikki,

I also prototyped the idea, which has almost the same shape.
I attached just in case, but we may not have to see.

Few comments based on the experiment.

```
+   /* txn_heap is ordered by transaction size */
+   buffer->txn_heap = pairingheap_allocate(ReorderBufferTXNSizeCompare, NULL);
```

I think the pairing heap should be in the same MemoryContext as the buffer.
Can we add MemoryContextSwitchTo()?

```
+   /* Update the max-heap */
+   if (oldsize != 0)
+       pairingheap_remove(rb->txn_heap, &txn->txn_node);
+   pairingheap_add(rb->txn_heap, &txn->txn_node);
...
+   /* Update the max-heap */
+   pairingheap_remove(rb->txn_heap, &txn->txn_node);
+   if (txn->size != 0)
+       pairingheap_add(rb->txn_heap, &txn->txn_node);
```

Since the number of stored transactions does not affect the insert operation, we may
be able to add the node while creating the ReorderBufferTXN and remove it while
cleaning it up. This can reduce branches in ReorderBufferChangeMemoryUpdate().
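For illustration, the size-ordered eviction structure under discussion can be sketched like this (using Python's heapq with lazy deletion in place of the real in-place pairingheap; all names here are invented and this is not the ReorderBuffer code):

```python
import heapq

class TxnHeap:
    """Toy max-heap of transactions keyed by size, so the largest
    transaction can be picked for eviction. heapq is a min-heap, so
    sizes are negated; stale entries are skipped lazily on pop."""
    def __init__(self):
        self._heap = []    # entries of (-size, xid)
        self._size = {}    # current size per xid, for stale detection

    def update(self, xid, size):
        self._size[xid] = size
        if size > 0:
            heapq.heappush(self._heap, (-size, xid))

    def pop_largest(self):
        while self._heap:
            neg_size, xid = heapq.heappop(self._heap)
            if self._size.get(xid) == -neg_size:   # skip outdated entries
                del self._size[xid]
                return xid
        return None

h = TxnHeap()
h.update(1, 100)
h.update(2, 500)
h.update(2, 300)   # txn 2 shrank, e.g. after a partial spill
h.update(3, 400)
assert h.pop_largest() == 3   # 3 is now the largest (400 > 300 > 100)
```

The real patch avoids the lazy-deletion trick by removing and re-adding the pairingheap node in place, which is exactly the branch the comment above suggests moving to ReorderBufferTXN creation/cleanup.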

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Is this a problem in GenericXLogFinish()?

2024-04-09 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> On Fri, Apr 05, 2024 at 06:22:58AM +0000, Hayato Kuroda (Fujitsu) wrote:
> > Thanks for pointing out. Yes, we fixed a behavior by the commit aa5edbe37,
> > but we missed the redo case. I made a fix patch based on the suggestion [1].
> 
> +   boolmod_buf = false;
> 
> Perhaps you could name that mod_wbuf, to be consistent with the WAL
> insert path.

Right, fixed.

> I'm slightly annoyed by the fact that there is no check on
> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
> show the symmetry between the insert and replay paths.  Shouldn't
> there be at least an assert for that in the branch when there are no
> tuples (imagine an else to cover xldata->ntups == 0)?  I mean between
> just before updating the hash page's opaque area when
> is_prev_bucket_same_wrt.

Indeed, I added an Assert() in the else part. Is it the same as your expectation?

> I've been thinking about ways to make such cases detectable in an
> automated fashion.  The best choice would be
> verifyBackupPageConsistency(), just after RestoreBlockImage() on the
> copy of the block image stored in WAL before applying the page masking
> which would mask the LSN.  A problem, unfortunately, is that we would
> need to transfer the knowledge of REGBUF_NO_CHANGE as a new BKPIMAGE_*
> flag so we would be able to track if the block rebuilt from the record
> has the *same* LSN as the copy used for the consistency check.  So
> this edge consistency case would come at a cost, I am afraid, and the
> 8 bits of bimg_info are precious :/

I could not convince myself that the change has more benefit than cost, so I did
nothing for it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v2_add_modbuf_flag.diff
Description: v2_add_modbuf_flag.diff


RE: Is this a problem in GenericXLogFinish()?

2024-04-05 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> There is still some divergence between the code path of
> _hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
> squeezing a page: we do not set the LSN of the write buffer if
> (xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
> !xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
> but at replay we call PageSetLSN() on the write buffer and mark it
> dirty in this case.  Isn't that incorrect?  It seems to me that we
> should be able to make the conditional depending on the state of the
> xl_hash_squeeze_page record, no?

Thanks for pointing this out. Yes, we fixed the behavior in commit aa5edbe37,
but we missed the redo case. I made a fix patch based on the suggestion [1].

The part below contains my analysis and how I reproduced the issue.
I posted it to clarify the issue for others.


=

## Pointed issue

Assume the case where
- xlrec.ntups is 0,
- xlrec.is_prim_bucket_same_wrt is true, and
- xlrec.is_prev_bucket_same_wrt is false.
This means that there are several overflow pages and the tail dead page is being
removed. In this case, the primary page does not have to be modified.

During normal operation, the removal is done in _hash_freeovflpage().
If the above condition is met, mod_wbuf is not set to true, so PageSetLSN() is
skipped.

During recovery, the squeeze and removal are done in hash_xlog_squeeze_page().
In this function, PageSetLSN() is called unconditionally.
He said that in this case the PageSetLSN() should be avoided as well.

## Analysis

IIUC this is the same issue as pointed out in [2].
He said that PageSetLSN() should be called only when the page is really modified.
In the case being discussed, wbuf is not modified (just the tail entry is removed),
so there is no need to assign an LSN to it. However, we missed updating the redo
case as well.

## How to reproduce

I could reproduce the case with the steps below.
1. Added the debug log like [3]
2. Constructed a physical replication.
3. Ran hash_index.sql
4. Found the added debug log.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1%2BayneM-8mSRC0iWpDMnm39EwDoqgiOCSqrrMLcdnUnAA%40mail.gmail.com
[2]: https://www.postgresql.org/message-id/ZbyVVG_7eW3YD5-A%40paquier.xyz
[3]:
```
--- a/src/backend/access/hash/hash_xlog.c
+++ b/src/backend/access/hash/hash_xlog.c
@@ -713,6 +713,11 @@ hash_xlog_squeeze_page(XLogReaderState *record)
 			writeopaque->hasho_nextblkno = xldata->nextblkno;
 		}
 
+		if (xldata->ntups == 0 &&
+			xldata->is_prim_bucket_same_wrt &&
+			!xldata->is_prev_bucket_same_wrt)
+			elog(LOG, "XXX no need to set PageSetLSN");
+
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



add_modbuf_flag.diff
Description: add_modbuf_flag.diff


RE: Improve eviction algorithm in ReorderBuffer

2024-03-29 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> Agreed.
> 
> I think the patch is in good shape. I'll push the patch with the
> suggestion next week, barring any objections.

Thanks for working on this. Agreed it is committable.
Few minor comments:

```
+ * Either txn or change must be non-NULL at least. We update the memory
+ * counter of txn if it's non-NULL, otherwise change->txn.
```

IIUC no one checks this restriction. Should we add an Assert() for it, e.g.:
Assert(txn || change)?

```
+/* make sure enough space for a new node */
...
+/* make sure enough space for a new node */
```

Should these start with an upper-case letter?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Synchronizing slots from primary to standby

2024-03-28 Thread Hayato Kuroda (Fujitsu)
Dear Hou,

Thanks for updating the patch! Here is a comment for it.

```
+/*
+ * By advancing the restart_lsn, confirmed_lsn, and xmin using
+ * fast-forward logical decoding, we can verify whether a consistent
+ * snapshot can be built. This process also involves saving necessary
+ * snapshots to disk during decoding, ensuring that logical decoding
+ * efficiently reaches a consistent point at the restart_lsn without
+ * the potential loss of data during snapshot creation.
+ */
+    pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
+                                        &found_consistent_point);
+ReplicationSlotsComputeRequiredLSN();
+updated_lsn = true;
```

You added these like pg_replication_slot_advance(), but that function also calls
ReplicationSlotsComputeRequiredXmin(false) at that time. According to the related
commit b48df81 and discussions [1], I know it is needed only for physical slots,
but it would be more consistent to call ReplicationSlotsComputeRequiredXmin() as
well, per [2]:

```
This may be a waste if no advancing is done, but it could also be an
advantage to enforce a recalculation of the thresholds for each
function call.  And that's more consistent with the slot copy, drop
and creation.
```

What do you think?

[1]: 
https://www.postgresql.org/message-id/20200609171904.kpltxxvjzislidks%40alap3.anarazel.de
[2]: https://www.postgresql.org/message-id/20200616072727.GA2361%40paquier.xyz

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: pg_upgrade and logical replication

2024-03-27 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

> 
> Recently there was a failure in 004_subscription tap test at [1].
> In this failure, the tab_upgraded1 table was expected to have 51
> records but has only 50 records. Before the upgrade both publisher and
> subscriber have 50 records.

Good catch!

> After the upgrade we have inserted one record in the publisher, now
> tab_upgraded1 will have 51 records in the publisher. Then we start the
> subscriber after changing max_logical_replication_workers so that
> apply workers get started and apply the changes received. After
> starting we enable regress_sub5, wait for sync of regress_sub5
> subscription and check for tab_upgraded1 and tab_upgraded2 table data.
> In a few random cases the one record that was inserted into
> tab_upgraded1 table will not get replicated as we have not waited for
> regress_sub4 subscription to apply the changes from the publisher.
> The attached patch has changes to wait for regress_sub4 subscription
> to apply the changes from the publisher before verifying the data.
> 
> [1] -
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2024-03-26%2004%3A23%3A13

Yeah, I think it is an oversight in f17529. Previously, subscriptions that were
receiving changes were confirmed to be caught up; I missed adding that line while
restructuring the script. +1 for your fix.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Euler,

> 
> This only drops the publications created by this tool, not the
> pre-existing ones that we discussed in the link provided.

Another concern here is the case where the primary subscribes to changes from other
nodes. After the conversion, the new subscriber also tries to connect to the other
publisher - this may lead to conflicts. This happens because both the launcher and
workers start after recovery finishes. So, based on Ashutosh's point, should we
remove such replication objects?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
Dear Bharath, Peter,

> Looks like BF animals aren't happy, please check -
> > https://buildfarm.postgresql.org/cgi-bin/show_failures.pl.
> 
> Looks like sanitizer failures.  There were a few messages about that
> recently, but those were all just about freeing memory after use, which
> we don't necessarily require for client programs.  So maybe something else.

It seems that there are several kinds of failures, [1] and [2].

## Analysis for failure 1

The failure is caused by a time lag between the walreceiver finishing and
pg_is_in_recovery() returning false.

According to the output [1], it seems that the tool failed in wait_for_end_recovery()
with the message "standby server disconnected from the primary". Also, the lines
"redo done at..." and "terminating walreceiver process due to administrator command"
mean that the walreceiver was requested to shut down by XLogShutdownWalRcv().

According to the source, the walreceiver is shut down in
StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). Also, SharedRecoveryState
is changed to RECOVERY_STATE_DONE (which makes pg_is_in_recovery() return false)
in the latter part of StartupXLOG().

So, if there is a delay between FinishWalRecovery() and the state change, the check
in wait_for_end_recovery() can fail during that window. Since we allow the
walreceiver to be missing 10 times and it is checked once per second, the failure
occurs if the time lag is longer than 10 seconds.
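The timing window can be sketched as follows (a toy version of the wait loop; only NUM_CONN_ATTEMPTS and the once-per-second cadence are taken from the description above, everything else is invented and not the actual pg_createsubscriber code):

```python
import time

NUM_CONN_ATTEMPTS = 10   # allowed consecutive checks with no walreceiver

def wait_for_end_recovery(in_recovery, walreceiver_alive, sleep=time.sleep):
    """Poll once per second; give up if the walreceiver stays gone for
    NUM_CONN_ATTEMPTS consecutive checks while the server still reports
    it is in recovery (toy model of the failure window)."""
    misses = 0
    while in_recovery():
        if not walreceiver_alive():
            misses += 1
            if misses >= NUM_CONN_ATTEMPTS:
                return "standby server disconnected from the primary"
        else:
            misses = 0
        sleep(1)
    return "recovery finished"

# If the RECOVERY_STATE_DONE transition lags more than 10 seconds behind
# the walreceiver shutdown, the loop errors out even though recovery is
# actually finishing.
states = iter([True] * 12 + [False])
result = wait_for_end_recovery(lambda: next(states), lambda: False,
                               sleep=lambda s: None)
assert result == "standby server disconnected from the primary"
```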

I do not have a good way to fix it. One approach is to make NUM_CONN_ATTEMPTS larger,
but that is not a fundamental solution.

## Analysis for failure 2

According to [2], the physical replication slot specified as primary_slot_name
was not in use by any walsender process; at that point the walsender did not
exist yet.

```
...
pg_createsubscriber: publisher: current wal senders: 0
pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots 
WHERE active AND slot_name = 'physical_slot'
pg_createsubscriber: error: could not obtain replication slot information: got 
0 rows, expected 1 row
...
```

Currently the standby must be stopped before the command runs, and the current
code does not wait for replication to be established. So there is a possibility
that the check runs before the walsender is launched.

One possible approach is to wait until replication starts; an alternative is
to relax the condition.

What do you think?

[1]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus=2024-03-25%2013%3A03%3A07
[2]: 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=culicidae=2024-03-25%2013%3A53%3A58

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
> IIUC, added options were inspired by Tomas. And he said the limitation
> (pub/sub/slot
> name cannot be specified) was acceptable for the first version. I agree with 
> him.
> (To be honest, I feel that options should be fewer for the first release)

Just to confirm - I don't think they are strictly needed. If we can agree on a
specification at some point, it's OK for me to add them. If you ask me, I still
prefer Tomas's approach.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-03-25 Thread Hayato Kuroda (Fujitsu)
Dear Amit, Euler,

I also want to share my opinion, just in case.

> On Thu, Mar 21, 2024 at 8:00 PM Euler Taveira  wrote:
> >
> > On Thu, Mar 21, 2024, at 10:33 AM, vignesh C wrote:
> >
> > If we commit this we might not be able to change the way the option
> > behaves once the customers starts using it. How about removing these
> > options in the first version and adding it in the next version after
> > more discussion.
> >
> >
> > We don't need to redesign this one if we want to add a format string in a 
> > next
> > version. A long time ago, pg_dump started to accept pattern for tables 
> > without
> > breaking or deprecating the -t option. If you have 100 databases and you 
> > don't
> > want to specify the options or use a script to generate it for you, you also
> > have the option to let pg_createsubscriber generate the object names for 
> > you.
> > Per my experience, it will be a rare case.
> >
> 
> But, why go with some UI in the first place which we don't think is a
> good one, or at least don't have a broader agreement that it is a good
> one? The problem with self-generated names for users could be that
> they won't be able to make much out of it. If one has to always use
> those internally then probably that would be acceptable. I would
> prefer what Tomas proposed a few emails ago as compared to this one as
> that has fewer options to be provided by users but still, they can
> later identify objects. But surely, we should discuss if you or others
> have better alternatives.

IIUC, the added options were inspired by Tomas, and he said the limitation
(pub/sub/slot names cannot be specified) was acceptable for the first version.
I agree with him.
(To be honest, I feel the first release should have fewer options.)

> The user choosing to convert a physical standby to a subscriber would
> in some cases be interested in converting it for all the databases
> (say for the case of upgrade [1]) and giving options for each database
> would be cumbersome for her.

+1 for the primary use case.

> > Currently dry-run will do the check and might fail on identifying a
> > few failures like after checking subscriber configurations. Then the
> > user will have to correct the configuration and re-run then fix the
> > next set of failures. Whereas the suggest-config will display all the
> > optimal configuration for both the primary and the standby in a single
> > shot. This is not a must in the first version, it can be done as a
> > subsequent enhancement.
> >
> >
> > Do you meant publisher, right? Per order, check_subscriber is done before
> > check_publisher and it checks all settings on the subscriber before 
> > exiting. In
> > v30, I changed the way it provides the required settings. In a previous 
> > version,
> > it fails when it found a wrong setting; the current version, check all 
> > settings
> > from that server before providing a suitable error.
> >
> > pg_createsubscriber: checking settings on publisher
> > pg_createsubscriber: primary has replication slot "physical_slot"
> > pg_createsubscriber: error: publisher requires wal_level >= logical
> > pg_createsubscriber: error: publisher requires 2 replication slots, but 
> > only 0
> remain
> > pg_createsubscriber: hint: Consider increasing max_replication_slots to at 
> > least
> 3.
> > pg_createsubscriber: error: publisher requires 2 wal sender processes, but 
> > only
> 0 remain
> > pg_createsubscriber: hint: Consider increasing max_wal_senders to at least 
> > 3.
> >
> > If you have such an error, you will fix them all and rerun using dry run 
> > mode
> > again to verify everything is ok. I don't have a strong preference about 
> > it. It
> > can be changed easily (unifying the check functions or providing a return 
> > for
> > each of the check functions).
> >
> 
> We can unify the checks but not sure if it is worth it. I am fine
> either way. It would have been better if we provided a way for a user
> to know the tool's requirement rather than letting him know via errors
> but I think it should be okay to extend it later as well.

Both ways are OK, but I slightly prefer to unify the checks; the number of
working modes in the same executable should be reduced as much as possible.

Also, I feel the current specification is acceptable. pg_upgrade checks one
thing at a time and exits early in bad cases: if both the old and new clusters
have issues, the first run reports only the old cluster's issues, and after the
DBA fixes them and runs again, the new cluster's issues are output.

[1]: 
https://www.postgresql.org/message-id/8d52c226-7e34-44f7-a919-759bf8d81541%40enterprisedb.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-03-19 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

Thanks for giving comments!

> This behavior makes sense to me. But do we want to handle the case of
> using environment variables too? 

Yeah, v5 does not consider libpq parameters specified via environment
variables. Such a variable should be used when dbname is not explicitly written
in the connection string.
That path was added in the v6 patch: if dbname is not determined after parsing
the connection string, we call PQconndefaults() to get settings from
environment variables and service files [1], then search for dbname again.
Below is an example.

```
PGPORT=5431 PGUSER=kuroda PGDATABASE=postgres pg_basebackup -D data_N2 -R -v
->
primary_conninfo = 'user=kuroda ... port=5431 ... dbname=postgres ... '
```
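The lookup order described above can be sketched as follows. This is a simplified, self-contained simulation: ConnOpt and resolve_dbname() are hypothetical stand-ins for libpq's PQconninfoOption array and the v6 fallback logic, and a plain getenv("PGDATABASE") stands in for the full PQconndefaults() environment/service-file resolution.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* stand-in for one entry of libpq's parsed PQconninfoOption list */
typedef struct
{
    const char *keyword;
    const char *val;            /* NULL when not present in the conninfo */
} ConnOpt;

/*
 * Take dbname from the parsed connection string if present; otherwise fall
 * back to the environment, the way PQconndefaults() surfaces PGDATABASE.
 */
static const char *
resolve_dbname(const ConnOpt *opts, int nopts)
{
    for (int i = 0; i < nopts; i++)
    {
        if (strcmp(opts[i].keyword, "dbname") == 0 && opts[i].val != NULL)
            return opts[i].val;         /* explicit in the conninfo wins */
    }
    return getenv("PGDATABASE");        /* environment fallback */
}
```

An explicit `dbname=...` in the connection string always takes precedence; the environment is consulted only when parsing leaves dbname undetermined.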

> IIUC,
>
> pg_basebackup -D tmp -d "user=masahiko dbname=test_db"
>
> is equivalent to:
>
> PGDATABASE="user=masahiko dbname=test_db" pg_basebackup -D tmp

That case won't work. I think you assumed that expand_dbname handling, as in
PQconnectdbParams() [2], also applies to environment variables, but that is not
correct

In the libpq layer, connection parameters are parsed in
PQconnectStartParams()->conninfo_array_parse().
When expand_dbname is specified, the "dbname" entry is checked first and its
value is parsed as a connection string; this happens at fe-connect.c:5846.

The environment variables are checked and parsed in conninfo_add_defaults(),
which is called from conninfo_array_parse(). However, that happens at
fe-connect.c:5956, after the expand_dbname handling has already been done.
This means there is no chance that PGDATABASE is parsed in the expanded style.

For example, if pg_basebackup is run like below:

PGDATABASE="user=kuroda dbname=postgres" pg_basebackup -D data_N2 -R -v

The primary_conninfo written in the file will be:

primary_conninfo = 'user=hayato ... dbname=''user=kuroda dbname=postgres'''

[1]: https://www.postgresql.org/docs/devel/libpq-pgservice.html
[2]: 
https://www.postgresql.org/docs/devel/libpq-connect.html#LIBPQ-PQCONNECTDBPARAMS

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



pg_basebackup-write-dbname.v0006.patch
Description: pg_basebackup-write-dbname.v0006.patch


RE: speed up a logical replica setup

2024-03-18 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thanks for giving comments. I want to reply some of them.

> > 17.
> >
> > "specifies promote"
> >
> > We can do double-quote for the word promote.
> 
> The v30 patch has promote, which I think is adequate.

Oops. Actually I looked at the v29 patch during my first review. Sorry for the noise.

> 
> > 20.
> >
> > New options must be also documented as well. This helps not only users but
> also
> > reviewers.
> > (Sometimes we cannot identify that the implementation is intentinal or not.)
> 
> Which ones are missing?

In v29, the newly added options (publication/subscription/replication-slot)
were not documented.
Since they have now been documented, please ignore.

> > 21.
> >
> > Also, not sure the specification is good. I preferred to specify them by 
> > format
> > string. Because it can reduce the number of arguments and I cannot find use
> cases
> > which user want to control the name of objects.
> >
> > However, your approach has a benefit which users can easily identify the
> generated
> > objects by pg_createsubscriber. How do other think?
> 
> I think listing them explicitly is better for the first version.  It's
> simpler to implement and more flexible.

OK.

> > 22.
> >
> > ```
> > #define BASE_OUTPUT_DIR
>   "pg_createsubscriber_output.d"
> > ```
> >
> > No one refers the define.
> 
> This is gone in v30.

I wrote that for the same reason as above. Please ignore...

> 
> > 31.
> >
> > ```
> > /* Create replication slot on publisher */
> > if (lsn)
> > pg_free(lsn);
> > ```
> >
> > I think allocating/freeing memory is not so efficient.
> > Can we add a flag to create_logical_replication_slot() for controlling the
> > returning value (NULL or duplicated string)? We can use the condition (i ==
> num_dbs-1)
> > as flag.
> 
> Nothing is even using the return value of
> create_logical_replication_slot().  I think this can be removed altogether.

> > 37.
> >
> > ```
> > /* Register a function to clean up objects in case of failure */
> > atexit(cleanup_objects_atexit);
> > ```
> >
> > Sorry if we have already discussed. I think the registration can be moved 
> > just
> > before the boot of the standby. Before that, the callback will be no-op.
> 
> But it can also stay where it is.  What is the advantage of moving it later?

I thought we could reduce the risk of bugs. Previously some bugs were reported
because the registration was done too early. However, this is not a strong opinion.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: PostgreSQL 17 Release Management Team & Feature Freeze

2024-03-18 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> If you think that this is OK, and as far as I can see this looks OK on
> the thread, then this open item should be moved under "resolved before
> 17beta1", mentioning the commit involved in the fix.  Please see [1]
> for examples.

OK, I understand that I should wait for your check. Thanks.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 





RE: speed up a logical replica setup

2024-03-17 Thread Hayato Kuroda (Fujitsu)
nother rule?

23.

```
static int  num_dbs = 0;
static int  num_pubs = 0;
static int  num_subs = 0;
static int  num_replslots = 0;
```

I think the names are a bit confusing. The number of
publications/subscriptions/replication slots to be generated is always the same
as the number of databases; these variables just indicate how many were
explicitly specified.

My idea is num_custom_pubs or something. Thoughts?

24.

```
/* standby / subscriber data directory */
static char *subscriber_dir = NULL;
```

It is a bit strange that only subscriber_dir is a global variable. Callers other
than cleanup_objects_atexit() and main() take CreateSubscriberOptions as an
argument, so how about making `CreateSubscriberOptions opt` global as well?

25.

```
 * Replication slots, publications and subscriptions are created. Depending on
 * the step it failed, it should remove the already created objects if it is
 * possible (sometimes it won't work due to a connection issue).
```

I think it should be stated here that subscriptions won't be removed, along
with the reason.

26.

```

/*
 * If the server is promoted, there is no way to use the current setup
 * again. Warn the user that a new replication setup should be done 
before
 * trying again.
 */
```

Per comment 25, we can add a reference like "See comments atop the function"

27.

usage() was not updated based on recent changes.

28.

```
if (strcmp(conn_opt->keyword, "dbname") == 0 && conn_opt->val 
!= NULL)
{
if (dbname)
*dbname = pg_strdup(conn_opt->val);
continue;
}
```

There is a memory leak if dbname is specified multiple times in the conninfo.

29.

```
pg_prng_seed(_state, (uint64) (getpid() ^ time(NULL)));
```

No need to initialize the seed every time. Can you reuse pg_prng_state?
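The idea can be sketched as below: keep one PRNG state, seed it once, and reuse it for every generated name. A simple xorshift64 generator stands in here for pg_prng_state/pg_prng_seed; all names are illustrative.

```c
#include <assert.h>
#include <stdint.h>

/* stand-in for pg_prng_state */
typedef struct { uint64_t s; } prng_state;

static void
prng_seed(prng_state *st, uint64_t seed)
{
    st->s = seed ? seed : 0x9E3779B97F4A7C15u;  /* avoid the all-zero state */
}

/* xorshift64: cheap, full-period for any nonzero state */
static uint64_t
prng_next(prng_state *st)
{
    uint64_t x = st->s;

    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return st->s = x;
}
```

In pg_createsubscriber this would mean calling pg_prng_seed() once, e.g. in main() with `getpid() ^ time(NULL)`, and then drawing from the same state for each generated object name instead of reseeding per call.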

30.

```
if (num_replslots == 0)
dbinfo[i].replslotname = pg_strdup(genname);
```

I think the straightforward way is to use the name of subscription if no name
is specified. This follows the rule for CREATE SUBSCRIPTION.

31.

```
/* Create replication slot on publisher */
if (lsn)
pg_free(lsn);
```

I think allocating/freeing this memory is not very efficient.
Can we add a flag to create_logical_replication_slot() for controlling the
return value (NULL or a duplicated string)? We can use the condition
(i == num_dbs - 1) as the flag.

32.

```
/*
 * Close the connection. If exit_on_error is true, it has an undesired
 * condition and it should exit immediately.
 */
static void
disconnect_database(PGconn *conn, bool exit_on_error)
```

In disconnect_database(), the second argument should have a different name:
if it is true, the process exits unconditionally.
The comments atop the function must also be fixed.


33.

```
wal_level = strdup(PQgetvalue(res, 0, 0));
```

pg_strdup should be used here.

34.

```
{"config-file", required_argument, NULL, 1},
{"publication", required_argument, NULL, 2},
{"replication-slot", required_argument, NULL, 3},
{"subscription", required_argument, NULL, 4},
```

The ordering looks strange to me. Following pg_upgrade and pg_basebackup,
options that have no short form are listed last.

35.

```
opt.sub_port = palloc(16);
```

Per other lines, pg_malloc() should be used.

36.

```
pg_free(opt.sub_port);
```

You said that leaks are not a concern here. If so, why does only 'p' have
pg_free()?

37.

```
/* Register a function to clean up objects in case of failure */
atexit(cleanup_objects_atexit);
```

Sorry if we have already discussed this. I think the registration can be moved
to just before the boot of the standby. Before that, the callback is a no-op.

38.

```
/* Subscriber PID file */
snprintf(pidfile, MAXPGPATH, "%s/postmaster.pid", subscriber_dir);

/*
 * If the standby server is running, stop it. Some parameters (that can
 * only be set at server start) are informed by command-line options.
 */
if (stat(pidfile, ) == 0)
```

Hmm. pidfile is used only here, but it is declared in main(). Could this be
separated into another function like is_standby_started()?

39.

Or, we may be able to introduce "restart_standby_if_needed" or something.
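A sketch of such a helper, assuming the postmaster.pid presence check is all that is needed (is_standby_started() is a hypothetical name; a robust version would also validate the PID inside the file rather than just stat() it):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

#define MAXPGPATH 1024

/*
 * Factors out the inline check from main(): the standby is considered
 * started when postmaster.pid exists in its data directory.
 */
static bool
is_standby_started(const char *datadir)
{
    char        pidfile[MAXPGPATH];
    struct stat st;

    snprintf(pidfile, sizeof(pidfile), "%s/postmaster.pid", datadir);
    return stat(pidfile, &st) == 0;
}
```

main() would then read `if (is_standby_started(subscriber_dir)) stop_standby_server(...)`, keeping the pidfile buffer out of main()'s scope.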

40.

```
 * XXX this code was extracted from BootStrapXLOG().
```

So, can we extract the common part somewhere? Since the system identifier is
related to the control file, I think it could live in controldata_utils.c.

41.

You said the following in [1], but I could not find the related fix. Could you
clarify?

> That's a good point. We should state in the documentation that GUCs specified 
> in
> the command-line options are ignored during the execution.

[1]: 
https://www.postgresql.org/message-id/40595e73-c7e1-463a-b8be-49792e870007%40app.fastmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



RE: PostgreSQL 17 Release Management Team & Feature Freeze

2024-03-17 Thread Hayato Kuroda (Fujitsu)
Dear Michael,

> We are pleased to announce the Release Management Team (RMT) (cc'd)
> for the PostgreSQL 17 release:
> - Robert Haas
> - Heikki Linnakangas
> - Michael Paquier

Thanks for managing the release of PostgreSQL to proceed the right direction!

> You can track open items for the PostgreSQL 17 release here:
> https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items

I think the entry can be closed:

```
pg_upgrade with --link mode failing upgrade of publishers
Commit: 29d0a77fa660
Owner: Amit Kapila
```

The reported failure was only related to the test script, not the feature.
The issue has already been analyzed and the fix patch was pushed as f17529b710.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 





RE: speed up a logical replica setup

2024-03-10 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for updating the patch, but cfbot is still unhappy [1].
Note that the two containers (autoconf and meson) failed at different places,
so I think this is a genuine failure. It seems there may be a bug related to
the 32-bit build.

We should see and fix as soon as possible.

[1]: http://cfbot.cputube.org/highlights/all.html#4637

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-03-08 Thread Hayato Kuroda (Fujitsu)
Dear Tomas, Euler,

Thanks for starting to read the thread! Since I'm not the original author,
let me reply to part of it.

> I decided to take a quick look on this patch today, to see how it works
> and do some simple tests. I've only started to get familiar with it, so
> I have only some comments / questions regarding usage, not on the code.
> It's quite possible I didn't understand some finer points, or maybe it
> was already discussed earlier in this very long thread, so please feel
> free to push back or point me to the past discussion.
> 
> Also, some of this is rather opinionated, but considering I didn't see
> this patch before, my opinions may easily be wrong ...

I felt your comments were quite valuable.

> 1) SGML docs
> 
> It seems the SGML docs are more about explaining how this works on the
> inside, rather than how to use the tool. Maybe that's intentional, but
> as someone who didn't work with pg_createsubscriber before I found it
> confusing and not very helpful.
>
> For example, the first half of the page is prerequisities+warning, and
> sure those are useful details, but prerequisities are checked by the
> tool (so I can't really miss this) and warnings go into a lot of details
> about different places where things may go wrong. Sure, worth knowing
> and including in the docs, but maybe not right at the beginning, before
> I learn how to even run the tool?

Hmm, right. I considered the improvements below. Tomas and Euler, what do you think?

* Add more descriptions to the "Description" section.
* Move the prerequisites and warnings to the "Notes" section.
* Add a "Usage" section which describes the procedure starting from a single node.

> I'm not sure FOR ALL TABLES is a good idea. Or said differently, I'm
> sure it won't work for a number of use cases. I know large databases
> it's common to create "work tables" (not necessarily temporary) as part
> of a batch job, but there's no need to replicate those tables.

Indeed, the documentation does not describe that all tables in the database
would be included in the publication.

> I do understand that FOR ALL TABLES is the simplest approach, and for v1
> it may be an acceptable limitation, but maybe it'd be good to also
> support restricting which tables should be replicated (e.g. blacklist or
> whitelist based on table/schema name?).

This may not be directly related, but we considered accepting such options as a
next step [1].

> Note: I now realize this might fall under the warning about DDL, which
> says this:
> 
> Executing DDL commands on the source server while running
> pg_createsubscriber is not recommended. If the target server has
> already been converted to logical replica, the DDL commands must
> not be replicated so an error would occur.

Yeah, they would not be replicated, but that would not lead to an ERROR.
So should we say something like "Creating tables on the source server..."?

> 5) slot / publication / subscription name
> 
> I find it somewhat annoying it's not possible to specify names for
> objects created by the tool - replication slots, publication and
> subscriptions. If this is meant to be a replica running for a while,
> after a while I'll have no idea what pg_createsubscriber_569853 or
> pg_createsubscriber_459548_2348239 was meant for.
> 
> This is particularly annoying because renaming these objects later is
> either not supported at all (e.g. for replication slots), or may be
> quite difficult (e.g. publications).
> 
> I do realize there are challenges with custom names (say, if there are
> multiple databases to replicate), but can't we support some simple
> formatting with basic placeholders? So we could specify
> 
> --slot-name "myslot_%d_%p"
> 
> or something like that?

Not sure we can do this in the first version, but it looks nice. One concern is
that I cannot find client applications that accept escape strings like
log_line_prefix. (That may just be because we have not had a use case.) Do you
know of examples?

> BTW what will happen if we convert multiple standbys? Can't they all get
> the same slot name (they all have the same database OID, and I'm not
> sure how much entropy the PID has)?

I tested it, and the second try did not work. The primary reason was the
publication name
- pg_createsubscriber_%u (oid).
FYI - previously we could reuse the same publications, but based on my comment
[2] that feature was removed. That removal might have been too optimistic.

[1]: 
https://www.postgresql.org/message-id/TY3PR01MB9889CCBD4D9DAF8BD2F18541F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[2]: 
https://www.postgresql.org/message-id/TYCPR01MB12077756323B79042F29DDAEDF54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-03-06 Thread Hayato Kuroda (Fujitsu)
system name of the 
user
 * running it.
 */
```
Unnecessary blank.

08. main
```
char   *errstr = NULL;
```
 
This declaration can be moved into the else branch.

09. main.

Also, in the first place, do we have to get the username if it is not specified?
I feel libpq can handle that case if we simply skip passing the info.

10. main
```
appendPQExpBuffer(sub_conninfo_str, "host=%s port=%u user=%s 
fallback_application_name=%s",
  opt.socket_dir, opt.sub_port, 
opt.sub_username, progname);
sub_base_conninfo = get_base_conninfo(sub_conninfo_str->data, NULL);
```

Is it really necessary to call get_base_conninfo()? I think there is no need to
define sub_base_conninfo.

11. main

```
/*
 * In dry run mode, the server is restarted with the provided 
command-line
 * options so validation can be applied in the target server. In order 
to
 * preserve the initial state of the server (running), start it without
 * the command-line options.
 */
if (dry_run)
start_standby_server(, pg_ctl_path, NULL, false);
```

I think the initial state of the server may also be stopped; both conditions
are currently allowed. And I think it is not good to leave the logfile
unspecified.

12. others

As Peter E pointed out [1], the main function is still huge - more than
400 lines.
I think every function should have fewer than 100 lines to keep it readable.

I considered a separation like the one below. Note that this may require
changing the ordering of some steps. What do you think?

* add parse_command_options() which accepts user options and verifies them
* add verification_phase() or something which checks system identifier and 
calls check_XXX
* add catchup_phase() or something which creates a temporary slot, writes 
recovery parameters,
  and wait until the end of recovery
* add cleanup_phase() or something which removes primary_slot and modifies the
  system identifier
* stop/start server can be combined into one wrapper.

The attached txt file is a proof of concept.

13. others

PQresultStatus(res) is called 17 times in this source file, which seems
redundant. I think we can introduce a function like executeQueryOrDie() and
gather the checks in one place.
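The pattern could look like the sketch below. The libpq types are stubbed out (FakeResult, fake_exec()) so the example is self-contained; executeQueryOrDie() here only illustrates centralizing the status check and error path, not the real pg_createsubscriber code.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* stand-ins for PGresult / PQresultStatus */
typedef enum { RES_TUPLES_OK, RES_FATAL_ERROR } ResStatus;
typedef struct { ResStatus status; const char *error; } FakeResult;

/* pretend any query starting with "SELECT" succeeds */
static FakeResult
fake_exec(const char *query)
{
    if (strncmp(query, "SELECT", 6) == 0)
        return (FakeResult) { RES_TUPLES_OK, NULL };
    return (FakeResult) { RES_FATAL_ERROR, "syntax error" };
}

/*
 * Every call site gets identical error handling instead of repeating the
 * status-check boilerplate 17 times; a real version would also disconnect
 * and run cleanup before exiting.
 */
static FakeResult
execute_query_or_die(const char *query)
{
    FakeResult res = fake_exec(query);

    if (res.status != RES_TUPLES_OK)
    {
        fprintf(stderr, "query failed: %s\n", res.error);
        exit(1);
    }
    return res;
}
```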

14. others

I found that pg_createsubscriber does not use functions declared in other
files. Could it reuse some of them, e.g., those in streamutil.h?

15. others 

In the old discussions [2], Amit suggested keeping the comment and avoiding the
creation of a temporary slot. You said "Got it", but the temporary slot still
exists. Is there a reason? Could you clarify your position?

16. others

While reading [2] and [3], I was confused by the decision. You and Amit
discussed the combination of pg_createsubscriber and slot sync, and how slots
on the physical standby should be handled. You seemed to agree to remove such
slots, and Amit also suggested raising an ERROR. However, you said in [8] that
such handling is not mandatory, so a WARNING should be raised in dry_run mode.
I was quite confused - am I missing something?

17. others

Per the discussion around [4], we might have to consider the case where options
like data_directory and config_file were initially specified for the standby
server. Another easy approach is to let users pass options through, like -o in
pg_upgrade [5], which is similar to your idea. Thoughts?


18. others

How do you handle the reported failure [6]?

19. main
```
char   *pub_base_conninfo = NULL;
char   *sub_base_conninfo = NULL;
char   *dbname_conninfo = NULL;
```

No need to initialize pub_base_conninfo and sub_base_conninfo.
These variables would not be free'd.

20. others

IIUC, slot creation would not finish if there are prepared transactions.
Should we detect this in the verification phase and raise an ERROR?

21. others

As I said in [7], catch-up would not finish if a long
recovery_min_apply_delay is used. Should we override it during catch-up?

22. pg_createsubscriber.sgml
```

 Check
 Write recovery parameters into the target data...
```

Not sure, but "Check" seems not needed.

[1]: 
https://www.postgresql.org/message-id/b9aa614c-84ba-a869-582f-8d5e3ab57424%40enterprisedb.com
[2]: 
https://www.postgresql.org/message-id/9fd3018d-0e5f-4507-aee6-efabfb5a4440%40app.fastmail.com
[3]: 
https://www.postgresql.org/message-id/CAA4eK1L%2BE-bdKaOMSw-yWizcuprKMyeejyOwWjq_57%3DUqh-f%2Bg%40mail.gmail.com
[4]: 
https://www.postgresql.org/message-id/TYCPR01MB12077B63D81B49E9DFD323661F55A2%40TYCPR01MB12077.jpnprd01.prod.outlook.com
[5]: 
https://www.postgresql.org/docs/devel/pgupgrade.html#:~:text=options%20to%20be%20passed%20directly%20to%20the%20old%20postgres%20command%3B%20multiple%20option%20invocations%20are%20appended
[6]: 
https://www.postgresql.org/message-id/CAHv8Rj%2B5mzK9Jt%2B7ECogJzfm5czvDCCd5jO1_rCx0bTEYpBE5g%40mail.gmail.com
[7]: 
https://www.postgresql.org/message-id/OS3PR01MB98828B15DD9502C91E0C50D7F57D2%40OS3PR01MB9882.jpnpr

RE: Some shared memory chunks are allocated even if related processes won't start

2024-03-04 Thread Hayato Kuroda (Fujitsu)
Dear Alvaro,

Thanks for giving comments!

> > I agreed it sounds good, but I don't think it can be implemented by
> > current interface. An interface for dynamically allocating memory is
> > GetNamedDSMSegment(), and it returns the same shared memory region if
> > input names are the same.  Therefore, there is no way to re-alloc the
> > shared memory.
> 
> Yeah, I was imagining something like this: the workitem-array becomes a
> struct, which has a name and a "next" pointer and a variable number of
> workitem slots; the AutoVacuumShmem struct has a pointer to the first
> workitem-struct and the last one; when a workitem is requested by
> brininsert, we initially allocate via GetNamedDSMSegment("workitem-0") a
> workitem-struct with a smallish number of elements; if we request
> another workitem and the array is full, we allocate another array via
> GetNamedDSMSegment("workitem-1") and store a pointer to it in workitem-0
> (so that the list can be followed by an autovacuum worker that's
> processing the database), and it's also set as the tail of the list in
> AutoVacuumShmem (so that we know where to store further work items).
> When all items in a workitem-struct are processed, we can free it
> (I guess via dsm_unpin_segment), and make AutoVacuumShmem->av_workitems
> point to the next one in the list.
> 
> This way, the "array" can grow arbitrarily.
>

That basically sounds good. My concerns are:

* GetNamedDSMSegment() does not return a raw pointer to the dsm_segment, so it
  may be difficult to call dsm_unpin_segment() on the caller side.
* Dynamic shared memory is recorded in a dshash table (dsm_registry_table) and
  the entry is never deleted, so a stale reference to the chunk might remain.

Overall, dsm_registry itself may also need to be extended. I have not started
working on this yet, so I will share results after trying.
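For reference, the growable-list idea can be sketched with plain malloc standing in for GetNamedDSMSegment(): fixed-size workitem chunks are linked together, and a new chunk is allocated only when the current tail is full, so the "array" can grow arbitrarily. All names and the chunk size are illustrative, and freeing/unpinning is omitted.

```c
#include <assert.h>
#include <stdlib.h>

#define WORKITEMS_PER_CHUNK 4   /* the "smallish number of elements" */

/* one fixed-size chunk; chunks are chained via "next" */
typedef struct WorkItemChunk
{
    int                     nused;
    int                     items[WORKITEMS_PER_CHUNK];
    struct WorkItemChunk   *next;
} WorkItemChunk;

/* head/tail pointers, as AutoVacuumShmem would hold them */
typedef struct
{
    WorkItemChunk *head;
    WorkItemChunk *tail;
} WorkItemList;

static void
add_workitem(WorkItemList *list, int item)
{
    /* allocate a new chunk only when the tail is full (or list is empty) */
    if (list->tail == NULL || list->tail->nused == WORKITEMS_PER_CHUNK)
    {
        WorkItemChunk *c = calloc(1, sizeof(WorkItemChunk));

        if (list->tail)
            list->tail->next = c;   /* link from the previous chunk */
        else
            list->head = c;
        list->tail = c;
    }
    list->tail->items[list->tail->nused++] = item;
}
```

A worker processing the database would walk the list from head; once every item in a chunk is done, the chunk could be unpinned and head advanced.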

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Some shared memory chunks are allocated even if related processes won't start

2024-03-04 Thread Hayato Kuroda (Fujitsu)
Dear Alvaro,

Thanks for discussing!

> 
> I think it would be worth allocating AutoVacuumShmem->av_workItems using
> dynamic shmem allocation, particularly to prevent workitems from being
> discarded just because the array is full¹; but other than that, the
> struct is just 64 bytes long so I doubt it's useful to allocate it
> dynamically.
> 
> ¹ I mean, if the array is full, just allocate another array, point to it
> from the original one, and keep going.

OK, I understood that my initial proposal is not so valuable, so I can withdraw 
it.

Regarding the suggestion, you had AutoVacuumRequestWork() and brininsert() in
mind, right? I agree it sounds good, but I don't think it can be implemented
with the current interface. The interface for dynamically allocating memory is
GetNamedDSMSegment(), and it returns the same shared memory region when the
input names are the same; therefore, there is no way to re-allocate the shared
memory.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Some shared memory chunks are allocated even if related processes won't start

2024-03-04 Thread Hayato Kuroda (Fujitsu)
Dear Alvaro,

Thanks for giving comments!

> > While reading codes, I found that ApplyLauncherShmemInit() and
> > AutoVacuumShmemInit() are always called even if they would not be
> > launched.
> 
> Note that there are situations where the autovacuum launcher is started
> even though autovacuum is nominally turned off, and I suspect your
> proposal would break that.  IIRC this occurs when the Xid or multixact
> counters cross the max_freeze_age threshold.

Right. In GetNewTransactionId(), SetTransactionIdLimit(), and some other
places, PMSIGNAL_START_AUTOVAC_LAUNCHER is sent to the postmaster when the XID
age exceeds autovacuum_freeze_max_age. This behavior is already documented [1]:

```
To ensure that this does not happen, autovacuum is invoked on any table that
might contain unfrozen rows with XIDs older than the age specified by the
configuration parameter autovacuum_freeze_max_age. (This will happen even
if autovacuum is disabled.)
```

This means my first idea won't work well: even if the postmaster does not
allocate the shared memory initially, backends may later request an autovacuum
start and use the region. However, the second idea - allocating the shared
memory dynamically - is still valid. This is a bit more efficient for systems
whose tuples never need freezing. Thoughts?

[1]: 
https://www.postgresql.org/docs/devel/routine-vacuuming.html#VACUUM-FOR-WRAPAROUND

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Some shared memory chunks are allocated even if related processes won't start

2024-03-03 Thread Hayato Kuroda (Fujitsu)
Dear Tom,

Thanks for replying!

> "Hayato Kuroda (Fujitsu)"  writes:
> > While reading codes, I found that ApplyLauncherShmemInit() and
> AutoVacuumShmemInit()
> > are always called even if they would not be launched.
> > It may be able to reduce the start time to avoid the unnecessary allocation.
> 
> Why would this be a good idea?  It would require preventing the
> decision not to launch them from being changed later, except via
> postmaster restart.

Right. It is important to relax their GucContext.

> We've generally been trying to move away
> from unchangeable-without-restart decisions.  In your example,
> 
> > +if (max_logical_replication_workers == 0 || IsBinaryUpgrade)
> > +return;
> 
> max_logical_replication_workers is already PGC_POSTMASTER so there's
> not any immediate loss of flexibility, but I don't think it's a great
> idea to introduce another reason why it has to be PGC_POSTMASTER.

You are right. The first example implied that max_logical_replication_workers
cannot be changed, so it is not appropriate.
What about the second one, then? That approach allows the memory to be allocated
after startup, which means that users may be able to change the parameter from 0 to
a natural number in the future. (Of course, such an operation is prohibited for now.)
Could it be an initial step toward easing the restriction?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/





Some shared memory chunks are allocated even if related processes won't start

2024-03-03 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

While reading the code, I found that ApplyLauncherShmemInit() and
AutoVacuumShmemInit() are always called even if the related processes would not be
launched. Avoiding the unnecessary allocation might reduce the startup time,
although I know this improvement would be quite small because the allocated chunks
are quite small.

Anyway, there are several ways to fix this:

1)
Skip calling ShmemInitStruct() if the related process would not be launched.
I think this approach is the easiest way. E.g.,

```
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -962,6 +962,9 @@ ApplyLauncherShmemInit(void)
 {
 boolfound;

+if (max_logical_replication_workers == 0 || IsBinaryUpgrade)
+return;
+
```

2)
Dynamically allocate the shared memory. This has been allowed by a recent commit [1].
I made a small PoC, only for the logical launcher, to show what I mean. PSA the diff
file. Since several processes (backend, apply worker, parallel apply worker, and
tablesync worker) refer to the chunk, attachment code must be added in several
places.

If you agree this should be fixed, I will create a patch. Thoughts?

[1]: 
https://github.com/postgres/postgres/commit/8b2bcf3f287c79eaebf724cba57e5ff664b01e06

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



dynamic_allocation.diff
Description: dynamic_allocation.diff


RE: speed up a logical replica setup

2024-02-27 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

> Sorry, which pg_rewind option did you mention? I cannot find.
> IIUC, -l is an only option which can accept the path, but it is not related 
> with us.

Sorry, I found [1]. I was confusing it with pg_resetwal, but my opinion has not
really changed. The reason --config-file exists is that pg_rewind requires the
target to be stopped, which is different from the current pg_createsubscriber. So I
still do not like adding such options.

[1]: 
https://www.postgresql.org/docs/current/app-pgrewind.html#:~:text=%2D%2Dconfig%2Dfile%3Dfilename
[2]:
```
The target server must be shut down cleanly before running pg_rewind
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-27 Thread Hayato Kuroda (Fujitsu)
> PSA the patch for implementing it. It is basically same as Ian's one.
> However, this patch still cannot satisfy the condition 3).
> 
> `pg_basebackup -D data_N2 -d "user=postgres" -R`
> -> dbname does not appear in primary_conninfo.
> 
> This is because the `if (connection_string)` case in GetConnection() explicitly 
> overrides
> a dbname to "replication". I've tried to add a dummy entry {"dbname", NULL} 
> pair
> before the overriding, but it is no-op. Because The replacement of the dbname 
> in
> pqConnectOptions2() would be done only for the valid (=lastly specified)
> connection options.

Oh, this patch missed the straightforward case:

pg_basebackup -D data_N2 -d "user=postgres dbname=replication" -R
-> dbname does not appear in primary_conninfo.

So I think it cannot be applied as-is. Sorry for sharing a bad patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-27 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> We do append dbname=replication even in libpqrcv_connect for .pgpass
> lookup as mentioned in comments. See below:
> libpqrcv_connect()
> {
> 
> else
> {
> /*
> * The database name is ignored by the server in replication mode,
> * but specify "replication" for .pgpass lookup.
> */
> keys[++i] = "dbname";
> vals[i] = "replication";
> }
> ...
> }

OK. So we must add the value for authentication, right?
I think it should be described in GetConnection() as well.

> I think as part of this effort we should just add dbname to
> primary_conninfo written in postgresql.auto.conf file. As above, the
> question is valid whether we should do it just for 17 or for prior
> versions. Let's discuss more on that. I am not sure of the use case
> for versions before 17 but commit cca97ce6a665 mentioned that some
> middleware or proxies might however need to know the dbname to make
> the correct routing decision for the connection. Does that apply here
> as well? If so, we can do it and update the docs, otherwise, let's do
> it just for 17.

Hmm, I might have misunderstood your requirements again. So, we must satisfy all of
the below, right?
1) Add the {"dbname", "replication"} key-value pair to look up the .pgpass file
correctly.
2) If -R is given, output the dbname=xxx value to be used by the slotsync worker.
3) If the dbname is not given in the connection string, the same string as the
   username must be used to keep the libpq connection rule.
4) No need to add dbname=replication pair 

PSA the patch for implementing it. It is basically the same as Ian's.
However, this patch still cannot satisfy condition 3).

`pg_basebackup -D data_N2 -d "user=postgres" -R`
-> dbname does not appear in primary_conninfo.

This is because the `if (connection_string)` case in GetConnection() explicitly
overrides the dbname to "replication". I've tried to add a dummy {"dbname", NULL}
pair before the overriding, but it is a no-op, because the replacement of the
dbname in pqConnectOptions2() is done only for the valid (= lastly specified)
connection options.
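
To illustrate why the dummy pair is a no-op, here is a toy Python model (not the
actual libpq code; the function name is mine) of last-value-wins connection-option
merging:

```python
def merge_conn_options(pairs):
    """Model of libpq-style option merging: the last-specified value wins,
    and a NULL (None) value leaves the option unset."""
    merged = {}
    for key, val in pairs:
        merged[key] = val  # later entries override earlier ones
    return {k: v for k, v in merged.items() if v is not None}

# A dummy ("dbname", None) entry placed before the real one changes nothing:
opts = merge_conn_options([("dbname", None), ("user", "postgres"),
                           ("dbname", "replication")])
assert opts["dbname"] == "replication"
```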

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 


pg_basebackup-write-dbname.v0002.patch
Description: pg_basebackup-write-dbname.v0002.patch


RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-27 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> When dbname is NULL or not given, it defaults to username. This
> follows the specs of the connection string. See (dbname #
> The database name. Defaults to be the same as the user name...) [1].
> Your patch breaks that specs, so I don't think it is correct.

I proposed the point because I thought pg_basebackup basically wanted to do
physical replication. But if the general libpq rule takes precedence, we should not
apply my add_NULL_check.txt. Let's forget it.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-26 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

>> The possible solution would be
>> 1) allow to run pg_createsubscriber if standby is initially stopped .
>> I observed that pg_logical_createsubscriber also uses this approach.
>> 2) read GUCs via SHOW command and restore them when server restarts
>>
>3. add a config-file option. That's similar to what pg_rewind does.

Sorry, which pg_rewind option did you mean? I cannot find it.
IIUC, -l is the only option which can accept a path, but it is not related to us.

Also, I'm not sure of the benefit of adding new options; basically, there should be
as few of them as possible. Is there a benefit over reading via SHOW? Even if
pg_resetwal had such an option, the reason would be that the target postmaster for
pg_resetwal must be stopped.

>I expect
>that Debian-based installations will have this issue.

I'm not familiar with the Debian environment, so could you explain the reason?

>It was not a good idea if you want to keep the postgresql.conf outside PGDATA.
>I mean you need extra steps that can be error prone (different settings between
>files).

Yeah, if we use my approach, users who specify such GUCs may not be happy.
So, based on the above discussion, we should choose one of the items below. Thoughts?

a)
enforce that the standby must be *stopped*, and options like config_file can be
specified via an option.
b)
enforce that the standby must be *running*, and options like config_file would be
read via the SHOW command.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



RE: speed up a logical replica setup

2024-02-25 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

I forgot to reply one of the suggestion.

> 2) There is a CheckDataVersion function which does exactly this, will
> we be able to use this:
> +   /* Check standby server version */
> +   if ((ver_fd = fopen(versionfile, "r")) == NULL)
> +   pg_fatal("could not open file \"%s\" for reading: %m",
> versionfile);
> +
> +   /* Version number has to be the first line read */
> +   if (!fgets(rawline, sizeof(rawline), ver_fd))
> +   {
> +   if (!ferror(ver_fd))
> +   pg_fatal("unexpected empty file \"%s\"", versionfile);
> +   else
> +   pg_fatal("could not read file \"%s\": %m", 
> versionfile);
> +   }
> +
> +   /* Strip trailing newline and carriage return */
> +   (void) pg_strip_crlf(rawline);
> +
> +   if (strcmp(rawline, PG_MAJORVERSION) != 0)
> +   {
> +   pg_log_error("standby server is of wrong version");
> +   pg_log_error_detail("File \"%s\" contains \"%s\",
> which is not compatible with this program's version \"%s\".",
> +   versionfile,
> rawline, PG_MAJORVERSION);
> +   exit(1);
> +   }
> +
> +   fclose(ver_fd);


This suggestion was rejected because I was not sure where to put the declaration
and the implementation. Functions which can be called from clients must be declared
in the src/include/{common|fe_utils|utils} directory. I looked at the files located
there but could not find an appropriate location for CheckDataVersion. Also, I did
not think a new file should be created only for this function.
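
For reference, the quoted check boils down to reading the first line of PG_VERSION
and comparing it against the program's major version. A rough standalone sketch of
that control flow (in Python, just for illustration; the names are mine):

```python
import os
import tempfile

def check_data_version(datadir, expected_major):
    """Read the first line of PG_VERSION and compare the major version."""
    versionfile = os.path.join(datadir, "PG_VERSION")
    with open(versionfile) as f:
        rawline = f.readline()
    if not rawline:
        raise RuntimeError(f'unexpected empty file "{versionfile}"')
    # Strip trailing newline and carriage return (pg_strip_crlf equivalent)
    return rawline.rstrip("\r\n") == expected_major

# Try it against a fake data directory containing a version file.
datadir = tempfile.mkdtemp()
with open(os.path.join(datadir, "PG_VERSION"), "w") as f:
    f.write("17\n")
assert check_data_version(datadir, "17")
assert not check_data_version(datadir, "16")
```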

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-22 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

> Few comments on the tests:
> 1) If the dry run was successful because of some issue then the server
> will be stopped so we can check for "pg_ctl status" if the server is
> running otherwise the connection will fail in this case. Another way
> would be to check if it does not have "postmaster was stopped"
> messages in the stdout.
> +
> +# Check if node S is still a standby
> +is($node_s->safe_psql('postgres', 'SELECT pg_catalog.pg_is_in_recovery()'),
> +   't', 'standby is in recovery');

Just to confirm - your suggestion is to add `pg_ctl status`, right? Added.

> 2) Can we add verification of  "postmaster was stopped" messages in
> the stdout for dry run without --databases testcase
> +# pg_createsubscriber can run without --databases option
> +command_ok(
> +   [
> +   'pg_createsubscriber', '--verbose',
> +   '--dry-run', '--pgdata',
> +   $node_s->data_dir, '--publisher-server',
> +   $node_p->connstr('pg1'), '--subscriber-server',
> +   $node_s->connstr('pg1')
> +   ],
> +   'run pg_createsubscriber without --databases');
> +

Hmm, in the case of --dry-run, the server would never be shut down.
See the part below.

```
if (!dry_run)
stop_standby_server(pg_ctl_path, opt.subscriber_dir);
```

> 3) This message "target server must be running" seems to be wrong,
> should it be cannot specify cascading replicating standby as standby
> node(this is for v22-0011 patch :
> +   'pg_createsubscriber', '--verbose', '--pgdata',
> $node_c->data_dir,
> +   '--publisher-server', $node_s->connstr('postgres'),
> +   '--port', $node_c->port, '--socketdir', $node_c->host,
> +   '--database', 'postgres'
> ],
> -   'primary server is in recovery');
> +   1,
> +   [qr//],
> +   [qr/primary server cannot be in recovery/],
> +   'target server must be running');

Fixed.

> 4) Should this be "Wait for subscriber to catch up"
> +# Wait subscriber to catch up
> +$node_s->wait_for_subscription_sync($node_p, $subnames[0]);
> +$node_s->wait_for_subscription_sync($node_p, $subnames[1]);

Fixed.

> 5) Should this be 'Subscriptions has been created on all the specified
> databases'
> +);
> +is($result, qq(2),
> +   'Subscriptions has been created to all the specified databases'
> +);

Fixed, but "has" should be "have".

> 6) Add test to verify current_user is not a member of
> ROLE_PG_CREATE_SUBSCRIPTION, has no create permissions, has no
> permissions to execution replication origin advance functions
> 
> 7) Add tests to verify insufficient max_logical_replication_workers,
> max_replication_slots and max_worker_processes for the subscription
> node
> 
> 8) Add tests to verify invalid configuration of  wal_level,
> max_replication_slots and max_wal_senders for the publisher node

Hmm, adding these checks may increase the test time. We should
measure the time and then decide.

> 9) We can use the same node name in comment and for the variable
> +# Set up node P as primary
> +$node_p = PostgreSQL::Test::Cluster->new('node_p');
> +$node_p->init(allows_streaming => 'logical');
> +$node_p->start;

Fixed.

> 10) Similarly we can use node_f instead of F in the comments.
> +# Set up node F as about-to-fail node
> +# Force it to initialize a new cluster instead of copying a
> +# previously initdb'd cluster.
> +{
> +   local $ENV{'INITDB_TEMPLATE'} = undef;
> +
> +   $node_f = PostgreSQL::Test::Cluster->new('node_f');
> +   $node_f->init(allows_streaming => 'logical');
> +   $node_f->start;
>

Fixed. Also, a recent commit [1] allows initdb to be run forcibly, so I followed that.

New patch can be available in [2].

[1]: 
https://github.com/postgres/postgres/commit/ff9e1e764fcce9a34467d614611a34d4d2a91b50
[2]: 
https://www.postgresql.org/message-id/TYCPR01MB12077CD76B53F9CAE7AC0F5562%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-22 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

> Few comments:
> 1) The below code can lead to assertion failure if the publisher is
> stopped while dropping the replication slot:
> +   if (primary_slot_name != NULL)
> +   {
> +   conn = connect_database(dbinfo[0].pubconninfo);
> +   if (conn != NULL)
> +   {
> +   drop_replication_slot(conn, [0],
> primary_slot_name);
> +   }
> +   else
> +   {
> +   pg_log_warning("could not drop replication
> slot \"%s\" on primary",
> +  primary_slot_name);
> +   pg_log_warning_hint("Drop this replication
> slot soon to avoid retention of WAL files.");
> +   }
> +   disconnect_database(conn);
> +   }
> 
> pg_createsubscriber: error: connection to database failed: connection
> to server on socket "/tmp/.s.PGSQL.5432" failed: No such file or
> directory
> Is the server running locally and accepting connections on that socket?
> pg_createsubscriber: warning: could not drop replication slot
> "standby_1" on primary
> pg_createsubscriber: hint: Drop this replication slot soon to avoid
> retention of WAL files.
> pg_createsubscriber: pg_createsubscriber.c:432: disconnect_database:
> Assertion `conn != ((void *)0)' failed.
> Aborted (core dumped)
> 
> This is happening because we are calling disconnect_database in case
> of connection failure case too which has the following assert:
> +static void
> +disconnect_database(PGconn *conn)
> +{
> +   Assert(conn != NULL);
> +
> +   PQfinish(conn);
> +}

Right. disconnect_database() was moved into the if (conn != NULL) block.

> 2) There is a CheckDataVersion function which does exactly this, will
> we be able to use this:
> +   /* Check standby server version */
> +   if ((ver_fd = fopen(versionfile, "r")) == NULL)
> +   pg_fatal("could not open file \"%s\" for reading: %m",
> versionfile);
> +
> +   /* Version number has to be the first line read */
> +   if (!fgets(rawline, sizeof(rawline), ver_fd))
> +   {
> +   if (!ferror(ver_fd))
> +   pg_fatal("unexpected empty file \"%s\"", versionfile);
> +   else
> +   pg_fatal("could not read file \"%s\": %m", 
> versionfile);
> +   }
> +
> +   /* Strip trailing newline and carriage return */
> +   (void) pg_strip_crlf(rawline);
> +
> +   if (strcmp(rawline, PG_MAJORVERSION) != 0)
> +   {
> +   pg_log_error("standby server is of wrong version");
> +   pg_log_error_detail("File \"%s\" contains \"%s\",
> which is not compatible with this program's version \"%s\".",
> +   versionfile,
> rawline, PG_MAJORVERSION);
> +   exit(1);
> +   }
> +
> +   fclose(ver_fd);


> 3) Should this be added to typedefs.list:
> +enum WaitPMResult
> +{
> +   POSTMASTER_READY,
> +   POSTMASTER_STILL_STARTING
> +};

But the comment from Peter E. [1] was the opposite, so I did not handle this.

> 4) pgCreateSubscriber should be mentioned after pg_controldata to keep
> the ordering consistency:
> diff --git a/doc/src/sgml/reference.sgml b/doc/src/sgml/reference.sgml
> index aa94f6adf6..c5edd244ef 100644
> --- a/doc/src/sgml/reference.sgml
> +++ b/doc/src/sgml/reference.sgml
> @@ -285,6 +285,7 @@
> 
> 
> 
> +   
> 

This has already been pointed out by Peter E. I did not handle this.

> 5) Here pg_replication_slots should be pg_catalog.pg_replication_slots:
> +   if (primary_slot_name)
> +   {
> +   appendPQExpBuffer(str,
> + "SELECT 1 FROM
> pg_replication_slots "
> + "WHERE active AND
> slot_name = '%s'",
> + primary_slot_name);

Fixed.

> 6) Here pg_settings should be pg_catalog.pg_settings:
> +* - max_worker_processes >= 1 + number of dbs to be converted
> +
> *
> +*/
> +   res = PQexec(conn,
> +"SELECT setting FROM pg_settings
> WHERE name IN ("
> +"'max_logical_replication_workers', "
> +"'max_replication_slots', "
> +"'max_worker_processes', "
> +"'primary_slot_name') "
> +"ORDER BY name");

Fixed.

New version can be available in [2]

[1]: 
https://www.postgresql.org/message-id/3ee79f2c-e8b3-4342-857c-a31b87e1afda%40eisentraut.org
[2]: 
https://www.postgresql.org/message-id/TYCPR01MB12077CD76B53F9CAE7AC0F5562%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-22 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

> Few comments regarding the documentation:
> 1) max_replication_slots information seems to be present couple of times:
> 
> +
> + The target instance must have
> +  linkend="guc-max-replication-slots">max_replication_slots me>
> + and  linkend="guc-max-logical-replication-workers">max_logical_replica
> tion_workers
> + configured to a value greater than or equal to the number of target
> + databases.
> +
>
> +   
> +
> + The target instance must have
> +  linkend="guc-max-replication-slots">max_replication_slots me>
> + configured to a value greater than or equal to the number of target
> + databases and replication slots.
> +
> +   

Fixed.

> 2) Can we add an id to prerequisites and use it instead of referring
> to r1-app-pg_createsubscriber-1:
> - pg_createsubscriber checks if the
> given target data
> - directory has the same system identifier than the source data directory.
> - Since it uses the recovery process as one of the steps, it starts the
> - target server as a replica from the source server. If the system
> - identifier is not the same,
> pg_createsubscriber will
> - terminate with an error.
> + Checks the target can be converted.  In particular, things listed in
> + above section
> would be
> + checked.  If these are not met
> pg_createsubscriber
> + will terminate with an error.
>  

Changed.

> 3) The code also checks the following:
>  Verify if a PostgreSQL binary (progname) is available in the same
> directory as pg_createsubscriber.
> 
> But this is not present in the pre-requisites of documentation.

I think it is quite trivial, so I did not add it.

> 4) Here we mention that the target server should be stopped, but the
> same is not mentioned in prerequisites:
> +   Here is an example of using
> pg_createsubscriber.
> +   Before running the command, please make sure target server is stopped.
> +
> +$ pg_ctl -D /usr/local/pgsql/data
> stop
> +
> +

Oh, it is the opposite: it should NOT be stopped. Fixed.

> 5) If there is an error during any of the pg_createsubscriber
> operation like if create subscription fails, it might not be possible
> to rollback to the earlier state which had physical-standby
> replication. I felt we should document this and also add it to the
> console message like how we do in case of pg_upgrade.

Added.

New version can be available in [1]

[1]: 
https://www.postgresql.org/message-id/TYCPR01MB12077CD76B53F9CAE7AC0F5562%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-22 Thread Hayato Kuroda (Fujitsu)
Dear Alvaro,

> > 15.
> >
> > You said in case of failure, cleanups is not needed if the process exits 
> > soon [1].
> > But some functions call PQfinish() then exit(1) or pg_fatal(). Should we 
> > follow?
> 
> Hmm, but doesn't this mean that the server will log an ugly message that
> "client closed connection unexpectedly"?  I think it's nicer to close
> the connection before terminating the process (especially since the
> code for that is already written).

OK. So we should disconnect properly even if the process exits. I added the
function call again. Note that PQclear() was not added because it only affects the
application side.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-22 Thread Hayato Kuroda (Fujitsu)
Dear Shlok,

> Hi,
> 
> I have reviewed the v21 patch. And found an issue.
> 
> Initially I started the standby server with a new postgresql.conf file
> (not the default postgresql.conf that is present in the instance).
> pg_ctl -D ../standby start -o "-c config_file=/new_path/postgresql.conf"
> 
> And I have made 'max_replication_slots = 1' in new postgresql.conf and
> made  'max_replication_slots = 0' in the default postgresql.conf file.
> Now when we run pg_createsubscriber on standby we get error:
> pg_createsubscriber: error: could not set replication progress for the
> subscription "pg_createsubscriber_5_242843": ERROR:  cannot query or
> manipulate replication origin when max_replication_slots = 0
> NOTICE:  dropped replication slot "pg_createsubscriber_5_242843" on publisher
> pg_createsubscriber: error: could not drop publication
> "pg_createsubscriber_5" on database "postgres": ERROR:  publication
> "pg_createsubscriber_5" does not exist
> pg_createsubscriber: error: could not drop replication slot
> "pg_createsubscriber_5_242843" on database "postgres": ERROR:
> replication slot "pg_createsubscriber_5_242843" does not exist
> 
> I observed that when we run the pg_createsubscriber command, it will
> stop the standby instance (the non-default postgres configuration) and
> restart the standby instance which will now be started with default
> postgresql.conf, where the 'max_replication_slot = 0' and
> pg_createsubscriber will now fail with the error given above.
> I have added the script file with which we can reproduce this issue.
> Also similar issues can happen with other configurations such as port, etc.

Possible. So the issue is that GUC settings might change after the restart, so the
verification phase may not be enough. There are similar GUCs in [1] and they may
have similar issues. E.g., if "hba_file" is set when the server is started, the
file cannot be seen after the restart, so pg_createsubscriber may not be able to
connect anymore.

> The possible solution would be
> 1) allow to run pg_createsubscriber if standby is initially stopped .
> I observed that pg_logical_createsubscriber also uses this approach.
> 2) read GUCs via SHOW command and restore them when server restarts
>

I also prefer the first solution.
Another reason why the standby should be stopped is for backup purposes.
Basically, the standby instance should be saved before running pg_createsubscriber.
The easiest way is a hard copy, and the postmaster must be stopped at that time.
I feel it is better if users can run the command immediately after the copying.
Thoughts?

[1]: https://www.postgresql.org/docs/current/storage-file-layout.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Hayato Kuroda (Fujitsu)
Dear Horiguchi-san,

> GetConnection()@streamutil.c wants to ensure conninfo has a fallback
> database name ("replication"). However, the function seems to be
> ignoring the case where neither dbname nor connection string is given,
> which is the first case Kuroda-san raised. The second case is the
> intended behavior of the function.
> 
> > /* pg_recvlogical uses dbname only; others use connection_string only.
> */
> > Assert(dbname == NULL || connection_string == NULL);
> 
> And the function incorrectly assumes that the connection string
> requires "dbname=replication".
> 
> >  * Merge the connection info inputs given in form of connection string,
> >  * options and default values (dbname=replication, replication=true,
> etc.)
> 
> But the name is a pseudo database name only used by pg_hba.conf
> (maybe) , which cannot be used as an actual database name (without
> quotation marks, or unless it is actually created). The function
> should not add the fallback database name because the connection
> string for physical replication connection doesn't require the dbname
> parameter. (attached patch)

I had also missed it, but the requirement was that the dbname should be included
only when the dbname option was explicitly specified [1]. Neither my patch nor
yours can handle that. The libpq path PQconnectdbParams()->pqConnectOptions2()
fills all the parameters into the PGconn, and at that point the information about
whether a parameter was intentionally specified is discarded. Then,
GenerateRecoveryConfig() just writes down all the connection parameters from the
PGconn; it cannot distinguish them.

One approach, based on Horiguchi-san's patch and the initial one, is to avoid
writing the dbname when it is the same as the username.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1KH1d1J5giPMZVOtMe0iqncf1CpNwkBKoYAmXdC-kEGZg%40mail.gmail.com
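
A sketch of that approach (illustrative Python, not the actual
GenerateRecoveryConfig() code; the function name is mine): when writing
primary_conninfo, drop the dbname keyword whenever it merely equals the user name,
since libpq would apply that default anyway.

```python
def filter_conninfo(params):
    """Drop dbname from conninfo params when it equals the user name,
    since libpq defaults dbname to the user name anyway."""
    out = dict(params)
    if out.get("dbname") is not None and out["dbname"] == out.get("user"):
        del out["dbname"]
    return out

# dbname equal to the user name is omitted; an explicit different dbname survives.
assert "dbname" not in filter_conninfo({"user": "postgres", "dbname": "postgres"})
assert filter_conninfo({"user": "postgres", "dbname": "mydb"})["dbname"] == "mydb"
```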

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 





RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Hayato Kuroda (Fujitsu)
Dear Robert,

> > Just FYI - here is an extreme case. And note that I have applied proposed 
> > patch.
> >
> > When `pg_basebackup -D data_N2 -R` is used:
> > ```
> > primary_conninfo = 'user=hayato ... dbname=hayato ...
> > ```
> >
> > But when `pg_basebackup -d "" -D data_N2 -R` is used:
> > ```
> > primary_conninfo = 'user=hayato ... dbname=replication
> > ```
> 
> It seems like maybe somebody should look into why this is happening,
> and perhaps fix it.

I think this is caused by the part below [1] in GetConnection().

If both dbname and connection_string are NULL, we enter the else branch and NULL is
substituted - the {"dbname", NULL} key-value pair is generated only here.

Then, in PQconnectdbParams()->PQconnectStartParams()->pqConnectOptions2(),
the strange entry is found and replaced with the username [2].

I think that if both the connection string and the dbname are NULL, it should be
considered a physical replication connection. Here is a patch to fix it.
After applying it, both examples below output "dbname=replication".
You can confirm this as well.

```
pg_basebackup -D data_N2 -U postgres
pg_basebackup -D data_N2 -R -v

-> primary_conninfo = 'user=postgres ... dbname=replication ...
```

[1]
```
else
{
keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
values = pg_malloc0((argcount + 1) * sizeof(*values));
keywords[i] = "dbname";
values[i] = dbname;
i++;
}
```

[2]
```
/*
 * If database name was not given, default it to equal user name
 */
if (conn->dbName == NULL || conn->dbName[0] == '\0')
{
free(conn->dbName);
conn->dbName = strdup(conn->pguser);
if (!conn->dbName)
goto oom_error;
}
```
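
Putting [1] and [2] together, the behavior can be modeled as two defaulting stages
(a simplified Python model, not the actual C code; the function name is mine):
GetConnection() leaves dbname unset (or, with the patch above, sets it to
"replication") when none was given, and pqConnectOptions2() later defaults an unset
dbname to the user name.

```python
def effective_dbname(dbname_opt, user, patched=False):
    """Model of GetConnection() + pqConnectOptions2() dbname resolution."""
    # Stage 1: GetConnection() builds the keyword list.
    if dbname_opt is None:
        dbname = "replication" if patched else None
    else:
        dbname = dbname_opt
    # Stage 2: pqConnectOptions2() defaults an unset dbname to the user name.
    if dbname is None or dbname == "":
        dbname = user
    return dbname

# Without the patch, `pg_basebackup -U hayato` ends up with dbname=hayato ...
assert effective_dbname(None, "hayato") == "hayato"
# ... while with the patch it becomes dbname=replication.
assert effective_dbname(None, "hayato", patched=True) == "replication"
```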


Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/

diff --git a/src/bin/pg_basebackup/streamutil.c 
b/src/bin/pg_basebackup/streamutil.c
index 56d1b15951..aea1ccba36 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -119,7 +119,7 @@ GetConnection(void)
keywords = pg_malloc0((argcount + 1) * sizeof(*keywords));
values = pg_malloc0((argcount + 1) * sizeof(*values));
keywords[i] = "dbname";
-   values[i] = dbname;
+   values[i] = dbname == NULL ? "replication" : dbname;
i++;
}
 


RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Hayato Kuroda (Fujitsu)
Dear Robert,

> Seems weird to me. You don't use dbname=replication to ask for a
> replication connection, so why would we ever end up with that
> anywhere? And especially in only one of two such closely related
> cases?

Just FYI - here is an extreme case. Note that I have applied the proposed patch.

When `pg_basebackup -D data_N2 -R` is used:
```
primary_conninfo = 'user=hayato ... dbname=hayato ...
```

But when `pg_basebackup -d "" -D data_N2 -R` is used:
```
primary_conninfo = 'user=hayato ... dbname=replication
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Have pg_basebackup write "dbname" in "primary_conninfo"?

2024-02-20 Thread Hayato Kuroda (Fujitsu)
Dear Ian,

Thanks for making the patch.

> With the addition of "pg_sync_replication_slots()", there is now a use-case 
> for
> including "dbname" in "primary_conninfo" and the docs have changed from
> stating [1]:
> 
>   Do not specify a database name in the primary_conninfo string.
> 
> to [2]:
> 
>   For replication slot synchronization (see Section 48.2.3), it is also
>   necessary to specify a valid dbname in the primary_conninfo string. This 
> will
>   only be used for slot synchronization. It is ignored for streaming.
> 
> [1]
> https://www.postgresql.org/docs/16/runtime-config-replication.html#RUNTIME
> -CONFIG-REPLICATION-STANDBY
> [2]
> https://www.postgresql.org/docs/devel/runtime-config-replication.html#RUNTI
> ME-CONFIG-REPLICATION-STANDBY
> 
> However, when setting up a standby (with the intent of creating a logical
> standby) with pg_basebackup, providing the -R/-write-recovery-conf option
> results in a "primary_conninfo" string being generated without a "dbname"
> parameter, which requires a manual change should one intend to use
> "pg_sync_replication_slots()".
> 
> I can't see any reason for continuing to omit "dbname", so suggest it should
> only continue to be omitted for 16 and earlier; see attached patch.

Hmm, I also cannot find a reason, but we can document the change.

> Note that this does mean that if the conninfo string provided to pg_basebackup
> does not contain "dbname", the generated "primary_conninfo" GUC will default 
> to
> "dbname=replication". It would be easy enough to suppress this, but AFAICS
> there's no way to tell if it was explicitly supplied by the user, in which 
> case
> it would be surprising if it were omitted from the final "primary_conninfo"
> string.

I found an inconsistency. When I ran `pg_basebackup -D data_N2 -U postgres -R`,
dbname was set to the username.

```
primary_conninfo = 'user=postgres ... dbname=postgres
```

However, when I ran `pg_basebackup -D data_N2 -d "user=postgres" -R`,
dbname was set to "replication". Is this intentional?

```
primary_conninfo = 'user=postgres ... dbname=replication...
```

> The only other place where GenerateRecoveryConfig() is called is pg_rewind;
> I can't see any adverse affects from the proposed change. But it's perfectly
> possible there's something blindingly obvious I'm overlooking.

The on-going pg_createsubscriber feature [1] also uses GenerateRecoveryConfig(), but
I can't see any bad effects. The output is used to make a consistent standby
from the primary. Even if dbname is set in primary_conninfo, it would be
ignored.

[1]: https://commitfest.postgresql.org/47/4637/

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-20 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for giving comments!

> Few comments for v22-0001 patch:
> 1) The second "if (strcmp(PQgetvalue(res, 0, 1), "t") != 0)"" should
> be if (strcmp(PQgetvalue(res, 0, 2), "t") != 0):
> +   if (strcmp(PQgetvalue(res, 0, 1), "t") != 0)
> +   {
> +   pg_log_error("permission denied for database %s",
> dbinfo[0].dbname);
> +   return false;
> +   }
> +   if (strcmp(PQgetvalue(res, 0, 1), "t") != 0)
> +   {
> +   pg_log_error("permission denied for function \"%s\"",
> +
> "pg_catalog.pg_replication_origin_advance(text, pg_lsn)");
> +   return false;
> +   }

I have already pointed out as comment #8 [1] and fixed in v22-0005.

> 2) pg_createsubscriber fails if a table is parallely created in the
> primary node:
> 2024-02-20 14:38:49.005 IST [277261] LOG:  database system is ready to
> accept connections
> 2024-02-20 14:38:54.346 IST [277270] ERROR:  relation "public.tbl5"
> does not exist
> 2024-02-20 14:38:54.346 IST [277270] STATEMENT:  CREATE SUBSCRIPTION
> pg_createsubscriber_5_277236 CONNECTION ' dbname=postgres' PUBLICATION
> pg_createsubscriber_5 WITH (create_slot = false, copy_data = false,
> enabled = false)
> 
> If we are not planning to fix this, at least it should be documented

The error occurs when tables are created after the promotion, right?
I think it cannot be fixed until DDL logical replication is implemented.
So, +1 to add descriptions.

> 3) Error conditions is verbose mode has invalid error message like
> "out of memory" messages like in below:
> pg_createsubscriber: waiting the postmaster to reach the consistent state
> pg_createsubscriber: postmaster reached the consistent state
> pg_createsubscriber: dropping publication "pg_createsubscriber_5" on
> database "postgres"
> pg_createsubscriber: creating subscription
> "pg_createsubscriber_5_278343" on database "postgres"
> pg_createsubscriber: error: could not create subscription
> "pg_createsubscriber_5_278343" on database "postgres": out of memory

Because some places wrongly use PQerrorMessage(); they should use
PQresultErrorMessage(). Fixed in v22-0005.

> 4) In error cases we try to drop this publication again resulting in error:
> +   /*
> +* Since the publication was created before the
> consistent LSN, it is
> +* available on the subscriber when the physical
> replica is promoted.
> +* Remove publications from the subscriber because it
> has no use.
> +*/
> +   drop_publication(conn, [i]);
> 
> Which throws these errors(because of drop publication multiple times):
> pg_createsubscriber: dropping publication "pg_createsubscriber_5" on
> database "postgres"
> pg_createsubscriber: error: could not drop publication
> "pg_createsubscriber_5" on database "postgres": ERROR:  publication
> "pg_createsubscriber_5" does not exist
> pg_createsubscriber: dropping publication "pg_createsubscriber_5" on
> database "postgres"
> pg_createsubscriber: dropping the replication slot
> "pg_createsubscriber_5_278343" on database "postgres"

Right. One approach is to use a DROP PUBLICATION IF EXISTS statement.
Thoughts?

> 5) In error cases, wait_for_end_recovery waits even though it has
> identified that the replication between primary and standby is
> stopped:
> +/*
> + * Is recovery still in progress?
> + * If the answer is yes, it returns 1, otherwise, returns 0. If an error 
> occurs
> + * while executing the query, it returns -1.
> + */
> +static int
> +server_is_in_recovery(PGconn *conn)
> +{
> +   PGresult   *res;
> +   int ret;
> +
> +   res = PQexec(conn, "SELECT pg_catalog.pg_is_in_recovery()");
> +
> +   if (PQresultStatus(res) != PGRES_TUPLES_OK)
> +   {
> +   PQclear(res);
> +   pg_log_error("could not obtain recovery progress");
> +   return -1;
> +   }
> +
> 
> You can simulate this by stopping the primary just before
> wait_for_end_recovery and you will see these error messages, but
> pg_createsubscriber will continue to wait:
> pg_createsubscriber: error: could not obtain recovery progress
> pg_createsubscriber: error: could not obtain recovery progress
> pg_createsubscriber: error: could not obtain recovery progress
> pg_createsubscriber: error: could not obtain recovery progress

Yeah, v22-0001 cannot detect a disconnection between the primary and standby.
v22-0007 can detect a standby crash, but the v22 patch set cannot detect a
primary crash. Euler came up with an approach [2] for it, but it has not been
implemented yet.


[1]: 
https://www.postgresql.org/message-id/TYCPR01MB12077756323B79042F29DDAEDF54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com
[2]: 
https://www.postgresql.org/message-id/2231a04b-f2d4-4a4e-b5cd-56be8b002427%40app.fastmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: pg_upgrade and logical replication

2024-02-18 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for reviewing! PSA new version.

> 
> Thanks for the updated patch, Few suggestions:
> 1)  This can be moved to keep it similar to other tests:
> +# Setup a disabled subscription. The upcoming test will check the
> +# pg_createsubscriber won't work, so it is sufficient.
> +$publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");
> +$old_sub->safe_psql('postgres',
> +   "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr'
> PUBLICATION regress_pub1 WITH (enabled = false)"
> +);
> +
> +$old_sub->stop;
> +
> +# --
> +# Check that pg_upgrade fails when max_replication_slots configured in the 
> new
> +# cluster is less than the number of subscriptions in the old cluster.
> +# --
> +$new_sub->append_conf('postgresql.conf', "max_replication_slots = 0");
> +
> +# pg_upgrade will fail because the new cluster has insufficient
> +# max_replication_slots.
> +command_checks_all(
> +   [
> +   'pg_upgrade', '--no-sync', '-d', $old_sub->data_dir,
> +   '-D', $new_sub->data_dir, '-b', $oldbindir,
> +   '-B', $newbindir, '-s', $new_sub->host,
> +   '-p', $old_sub->port, '-P', $new_sub->port,
> +   $mode, '--check',
> +   ],
> 
> like below and the extra comment can be removed:
> +# --
> +# Check that pg_upgrade fails when max_replication_slots configured in the 
> new
> +# cluster is less than the number of subscriptions in the old cluster.
> +# --
> +# Create a disabled subscription.
> +$publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");
> +$old_sub->safe_psql('postgres',
> +   "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr'
> PUBLICATION regress_pub1 WITH (enabled = false)"
> +);
> +
> +$old_sub->stop;
> +
> +$new_sub->append_conf('postgresql.conf', "max_replication_slots = 0");
> +
> +# pg_upgrade will fail because the new cluster has insufficient
> +# max_replication_slots.
> +command_checks_all(
> +   [
> +   'pg_upgrade', '--no-sync', '-d', $old_sub->data_dir,
> +   '-D', $new_sub->data_dir, '-b', $oldbindir,
> +   '-B', $newbindir, '-s', $new_sub->host,
> +   '-p', $old_sub->port, '-P', $new_sub->port,
> +   $mode, '--check',
> +   ],

Partially fixed. I moved the creation part below, but the comments were kept.

> 2) This comment can be slightly changed:
> +# Change configuration as well not to start the initial sync automatically
> +$new_sub->append_conf('postgresql.conf',
> +   "max_logical_replication_workers = 0");
> 
> to:
> Change configuration so that initial table sync sync does not get
> started automatically

Fixed.

> 3) The old comments were slightly better:
> # Resume the initial sync and wait until all tables of subscription
> # 'regress_sub5' are synchronized
> $new_sub->append_conf('postgresql.conf',
> "max_logical_replication_workers = 10");
> $new_sub->restart;
> $new_sub->safe_psql('postgres', "ALTER SUBSCRIPTION regress_sub5
> ENABLE");
> $new_sub->wait_for_subscription_sync($publisher, 'regress_sub5');
> 
> Like:
> # Enable the subscription
> $new_sub->safe_psql('postgres', "ALTER SUBSCRIPTION regress_sub5
> ENABLE");
> 
> # Wait until all tables of subscription 'regress_sub5' are synchronized
> $new_sub->wait_for_subscription_sync($publisher, 'regress_sub5');

Per comments from Amit [1], I did not change.

[1]: 
https://www.postgresql.org/message-id/CAA4eK1Ls%2BRmJtTvOgaRXd%2BeHSY3x-KUE%3DsfEGQoU-JF_UzA62A%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v5-0001-Fix-testcase.patch
Description: v5-0001-Fix-testcase.patch


RE: speed up a logical replica setup

2024-02-16 Thread Hayato Kuroda (Fujitsu)
g2;
));
$node_p->safe_psql('pg1', 'CREATE TABLE tbl1 (a text)');
$node_p->safe_psql('pg1', "INSERT INTO tbl1 VALUES('first row')");
$node_p->safe_psql('pg2', 'CREATE TABLE tbl2 (a text)');
my $slotname = 'physical_slot';
$node_p->safe_psql('pg2',
"SELECT pg_create_physical_replication_slot('$slotname')");
```

I think the settings for the same node can be gathered into one place.
Also, any settings and definitions should be done just before they are used.

21.
```
$node_s->append_conf(
'postgresql.conf', qq[
log_min_messages = debug2
primary_slot_name = '$slotname'
]);
```

I could not find a reason why we set it to debug2.

22.
```
command_fails
```

command_checks_all() can check the returned value and outputs.
Should we use it?

23.

Can you add headers for each testcase? E.g., 

```
# --
# Check pg_createsubscriber fails when the target server is not a
# standby of the source.
...
# --
# Check pg_createsubscriber fails when the target server is not running
...
# --
# Check pg_createsubscriber fails when the target server is a member of
# the cascading standby.
...
# --
# Check successful dry-run 
...
# --
# Check successful conversion
```

24.
```
# Stop node C
$node_c->teardown_node;
...
$node_p->stop;
$node_s->stop;
$node_f->stop;
```
Why did you choose teardown_node here?

25.

The creation of subscriptions is not directly tested. @subnames contains the
names of the subscriptions, but the test just assumes there are more than two
of them.

Since it may be useful, I will post a top-up patch on Monday if there are no
updates.

[1]: 
https://www.postgresql.org/message-id/89ccf72b-59f9-4317-b9fd-1e6d20a0c3b1%40app.fastmail.com
[2]: 
https://www.postgresql.org/message-id/TYCPR01MB1207713BEC5C379A05D65E342F54B2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



RE: speed up a logical replica setup

2024-02-16 Thread Hayato Kuroda (Fujitsu)
Dear Shubham,

Thanks for testing. It seems you ran with the v20 patch, but I confirmed
it can be reproduced with v21.

> 
> I found a couple of issues, while verifying the cascaded replication
> with the following scenarios:
> Scenario 1) Create cascade replication like node1->node2->node3
> without using replication slots (attached
> cascade_3node_setup_without_slots has the script for this):
> Then I ran pg_createsubscriber by specifying primary as node1 and
> standby as node3, this scenario runs successfully. I was not sure if
> this should be supported or not?

Hmm. After the script, the cascade would be broken. The replication topology would be:

```
Node1 -> node2
  |
Node3
```

And the operation is a bit strange. The consistent LSN is obtained from node1,
but node3 waits until it receives the record from node2.
Can this always succeed?

> Scenario 2) Create cascade replication like node1->node2->node3 using
> replication slots (attached cascade_3node_setup_with_slots has the
> script for this):
> Here, slot name was used as slot1 for node1 to node2 and slot2 for
> node2 to node3. Then I ran pg_createsubscriber by specifying primary
> as node1 and standby as node3. In this case pg_createsubscriber fails
> with the following error:
> pg_createsubscriber: error: could not obtain replication slot
> information: got 0 rows, expected 1 row
> [Inferior 1 (process 2623483) exited with code 01]
> 
> This is failing because slot name slot2 is used between node2->node3
> but pg_createsubscriber is checked for slot1, the slot which is used
> for replication between node1->node2.
> Thoughts?

Right. The inconsistency is quite strange.

Overall, I felt such a case must be rejected. How should we detect it in the
checking phase?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-15 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Thanks for updating the patch!
Before reviewing deeply, here are replies for your comments.

>
> > Points raised by me [1] are not solved yet.
> > 
> > * What if the target version is PG16-?
pg_ctl and pg_resetwal won't work.
$ pg_ctl start -D /tmp/blah
waiting for server to start
2024-02-15 23:50:03.448 -03 [364610] FATAL:  database files are incompatible 
with server
2024-02-15 23:50:03.448 -03 [364610] DETAIL:  The data directory was 
initialized by PostgreSQL version 16, which is not compatible with this version 
17devel.
stopped waiting
pg_ctl: could not start server
Examine the log output.

$ pg_resetwal -D /tmp/blah
pg_resetwal: error: data directory is of wrong version
pg_resetwal: detail: File "PG_VERSION" contains "16", which is not compatible 
with this program's version "17".

> > * What if the found executables have diffent version with 
> > pg_createsubscriber?

The new code take care of it.
>

I preferred to have a common path and test them one by one, but I agree this
works well. Let's keep it and hear opinions from others.

>
> > * What if the target is sending WAL to another server?
> >I.e., there are clusters like `node1->node2-.node3`, and the target is 
> > node2.
The new code detects if the server is in recovery and aborts as you suggested.
A new option can be added to ignore the fact there are servers receiving WAL
from it.
>

Confirmed it can detect.

>
> > * Can we really cleanup the standby in case of failure?
> >Shouldn't we suggest to remove the target once?

If it finishes the promotion, no. I adjusted the cleanup routine a bit to avoid
it. However, we should provide instructions to inform the user that it should
create a fresh standby and try again.
>

I think the cleanup function does not look sufficient. In v21, recovery_ended is
kept true even after drop_publication() is done in setup_subscriber(). I think
that made_subscription is not needed, and the function should output some
messages when recovery_ended is on.
Besides, in the case of pg_upgrade, the message below is always reported before
starting the migration. I think this is more helpful for users.

```
pg_log(PG_REPORT, "\n"
   "If pg_upgrade fails after this point, you must re-initdb 
the\n"
   "new cluster before continuing.");
```

>
> > * Can we move outputs to stdout?

Are you suggesting to use another logging framework? It is not a good idea
because each client program is already using common/logging.c.
>

Hmm, indeed. Other programs in pg_basebackup seem to use the framework.

>
v20-0011: Do we really want to interrupt the recovery if the network was
momentarily interrupted or if the OS killed walsender? Recovery is critical for
the process. I think we should do our best to be resilient and deliver all
changes required by the new subscriber.
>

It might be too strict to raise an ERROR as soon as we hit a disconnection.
And at least 0011 was wrong - it should be PQgetvalue(res, 0, 1) for
still_alive.

>
The proposal is not correct because the
query return no tuples if it is disconnected so you cannot PQgetvalue().
>

Sorry for the misunderstanding, but you might be confused. pg_createsubscriber
sends a query to the target, and we are discussing the disconnection between the
target and source instances. I think the connection which pg_createsubscriber
holds is still alive, so PQgetvalue() can get a value.

(BTW, callers of server_is_in_recovery() have not considered a disconnection from
the target...)

>
The
retry interval (the time that ServerLoop() will create another walreceiver) is
determined by DetermineSleepTime() and it is a maximum of 5 seconds
(SIGKILL_CHILDREN_AFTER_SECS). One idea is to retry 2 or 3 times before give up
using the pg_stat_wal_receiver query. Do you have a better plan?
>

It's good to determine a threshold. It can define the acceptable number of
walreceiver losses during the loop.
I think we should keep retrying at least until the postmaster revives the
walreceiver. The check runs once per second, so more than 5 (or 10) may be
better. Thoughts?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



RE: pg_upgrade and logical replication

2024-02-15 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for reviewing! PSA new version.

> 
> Thanks for the updated patch, few suggestions:
> 1) Can we use a new publication for this subscription too so that the
> publication and subscription naming will become consistent throughout
> the test case:
> +# Table will be in 'd' (data is being copied) state as table sync will fail
> +# because of primary key constraint error.
> +my $started_query =
> +  "SELECT count(1) = 1 FROM pg_subscription_rel WHERE srsubstate = 'd'";
> +$old_sub->poll_query_until('postgres', $started_query)
> +  or die
> +  "Timed out while waiting for the table state to become 'd' (datasync)";
> +
> +# Create another subscription and drop the subscription's replication origin
> +$old_sub->safe_psql('postgres',
> +   "CREATE SUBSCRIPTION regress_sub3 CONNECTION '$connstr'
> PUBLICATION regress_pub2 WITH (enabled = false)"
> +);
>
> So after the change it will become like subscription regress_sub3 for
> publication regress_pub3, subscription regress_sub4 for publication
> regress_pub4 and subscription regress_sub5 for publication
> regress_pub5.

A new publication was defined.

> 2) The tab_upgraded1 table can be created along with create
> publication and create subscription itself:
> $publisher->safe_psql('postgres',
> "CREATE PUBLICATION regress_pub3 FOR TABLE tab_upgraded1");
> $old_sub->safe_psql('postgres',
> "CREATE SUBSCRIPTION regress_sub4 CONNECTION '$connstr' PUBLICATION
> regress_pub3 WITH (failover = true)"
> );

The definition of tab_upgraded1 was moved to the place you pointed.

> 3) The tab_upgraded2 table can be created along with create
> publication and create subscription itself to keep it consistent:
>  $publisher->safe_psql('postgres',
> -   "ALTER PUBLICATION regress_pub2 ADD TABLE tab_upgraded2");
> +   "CREATE PUBLICATION regress_pub4 FOR TABLE tab_upgraded2");
>  $old_sub->safe_psql('postgres',
> -   "ALTER SUBSCRIPTION regress_sub2 REFRESH PUBLICATION");
> +   "CREATE SUBSCRIPTION regress_sub5 CONNECTION '$connstr'
> PUBLICATION regress_pub4"
> +);

Ditto.

> With above fixes, the following can be removed:
> # Initial setup
> $publisher->safe_psql(
> 'postgres', qq[
> CREATE TABLE tab_upgraded1(id int);
> CREATE TABLE tab_upgraded2(id int);
> ]);
> $old_sub->safe_psql(
> 'postgres', qq[
> CREATE TABLE tab_upgraded1(id int);
> CREATE TABLE tab_upgraded2(id int);
> ]);

Yes, earlier definitions were removed instead.
Also, some comments were adjusted based on these fixes.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v4-0001-Fix-testcase.patch
Description: v4-0001-Fix-testcase.patch


RE: pg_upgrade and logical replication

2024-02-15 Thread Hayato Kuroda (Fujitsu)
Dear Justin,

Thanks for replying!

> What optimizations?  I can't see them, and since the patch is described
> as rearranging test cases (and therefore already difficult to read), I
> guess they should be a separate patch, or the optimizations described.

The basic idea was to reduce the number of CREATE/DROP statements,
but it has been changed for now - publications and subscriptions are created and
dropped per testcase. 

E.g., in the case of a successful upgrade, the steps below were done:

1. create two publications
2. create a subscription with failover = true
3. avoid further initial sync by setting max_logical_replication_workers = 0
4. create another subscription
5. confirm statuses of tables are either of 'i' or 'r'
6. run pg_upgrade
7. confirm table statuses are preserved
8. confirm replication origins are preserved.

New patch is available in [1].

[1]: 
https://www.postgresql.org/message-id/TYCPR01MB12077B16EEDA360BA645B96F8F54C2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/





RE: pg_upgrade and logical replication

2024-02-15 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> This sounds like a reasonable way to address the reported problem.

OK, thanks!

> Justin, do let me know if you think otherwise?
> 
> Comment:
> ===
> *
> -# Setup an enabled subscription to verify that the running status and 
> failover
> -# option are retained after the upgrade.
> +# Setup a subscription to verify that the failover option are retained after
> +# the upgrade.
>  $publisher->safe_psql('postgres', "CREATE PUBLICATION regress_pub1");
>  $old_sub->safe_psql('postgres',
> - "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr' PUBLICATION
> regress_pub1 WITH (failover = true)"
> + "CREATE SUBSCRIPTION regress_sub1 CONNECTION '$connstr'
> PUBLICATION
> regress_pub1 WITH (failover = true, enabled = false)"
>  );
> 
> I think it is better not to create a subscription in the early stage
> which we wanted to use for the success case. Let's have separate
> subscriptions for failure and success cases. I think that will avoid
> the newly added ALTER statements in the patch.

I made a patch that avoids creating objects as much as possible, but it
may lead to some confusion. I recreated the patch so that pub/sub pairs are
created, and dropped at cleanup, for every test case.

PSA a new version.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v3-0001-Fix-testcase.patch
Description: v3-0001-Fix-testcase.patch


RE: speed up a logical replica setup

2024-02-14 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

>
If someone modifies data after promotion, fine; she has to deal with conflicts,
if any. IMO it is solved adding one or two sentences in the documentation.
>

OK. I could not find any issues, for now.

>
> >
> Regarding
> GUCs, almost all of them is PGC_POSTMASTER (so it cannot be modified unless 
> the
> server is restarted). The ones that are not PGC_POSTMASTER, does not affect 
> the
> pg_createsubscriber execution [1].
> >
> 
> IIUC,  primary_conninfo and primary_slot_name is PGC_SIGHUP.

Ditto.
>

Just to confirm: even if primary_slot_name is changed during
the conversion, the slot initially set would be dropped. Currently we have not
found any issues.

>
> >
> I'm just pointing out that this case is a different from pg_upgrade (from 
> which
> this idea was taken). I'm not saying that's a bad idea. I'm just arguing that
> you might be preventing some access read only access (monitoring) when it is
> perfectly fine to connect to the database and execute queries. As I said
> before, the current UI allows anyone to setup the standby to accept only local
> connections. Of course, it is an extra step but it is possible. However, once
> you apply v16-0007, there is no option but use only local connection during 
> the
> transformation. Is it an acceptable limitation?
> >
> 
> My remaining concern is written above. If they are not problematic, we may not
> have to restrict them for now. In that case, the changes 
> 
> 1) overwriting a port number,
> 2) setting listen_addresses = ''

It can be implemented later if people are excited by it.

> are not needed, right? IIUC inconsistency of -P may be still problematic.

I still think we shouldn't have only the transformed primary_conninfo as
option.
>


Hmm, OK. So let me summarize the current status and discussion points. 

Policy)

Basically, we do not prohibit connections to the primary/standby.
primary_slot_name may be changed during the conversion, and
tuples may be inserted on the target just after the promotion, but these seem to
cause no issues.

API)

-D (data directory) and -d (databases) are definitely needed.

Regarding -P (a connection string for the source), we can require it for now.
But note that it may cause an inconsistency if the node pointed to by -P is
different from the node pointed to by primary_conninfo.

As for the connection string for the target server, we can choose two ways:
a)
accept a native connection string as -S. This can reuse the same parsing
mechanism as -P,
but there is room for a non-local server to be specified.

b)
accept username/port as -U/-p
(Since the policy is as above, listen_addresses would not be overwritten.
Also, the port just specifies the listening port.)
This can avoid connecting to a non-local server, but more options may be needed.
(E.g., is a socket directory needed? What about a password?)

Other discussion points and reported issues)

Points raised by me [1] are not solved yet.

* What if the target version is PG16-?
* What if the found executables have a different version from pg_createsubscriber?
* What if the target is sending WAL to another server?
   I.e., there are clusters like `node1->node2->node3`, and the target is node2.
* Can we really clean up the standby in case of failure?
   Shouldn't we suggest removing the target first?
* Can we move outputs to stdout?

[1]: 
https://www.postgresql.org/message-id/TYCPR01MB1207713BEC5C379A05D65E342F54B2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



RE: pg_upgrade and logical replication

2024-02-13 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for verifying the fix!

> Your proposal to change the tests in the following order: a) failure
> due to the insufficient max_replication_slot b) failure because the
> pg_subscription_rel has 'd' state c) successful upgrade. looks good to
> me.

Right.

> I have also verified that your changes fixes the issue as the
> successful upgrade is moved to the end and the old cluster is no
> longer used after upgrade.

Yeah, it is the same as what I expected.

> One minor suggestion:
> There is an extra line break here, this can be removed:
> @@ -181,139 +310,5 @@ is($result, qq(1),
> "check the data is synced after enabling the subscription for
> the table that was in init state"
>  );
> 
> -# cleanup
>

Removed.

PSA a new version patch.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v2-0001-Fix-testcase.patch
Description: v2-0001-Fix-testcase.patch


RE: pg_upgrade and logical replication

2024-02-13 Thread Hayato Kuroda (Fujitsu)
Dear Justin,

> pg_upgrade/t/004_subscription.pl says
> 
> |my $mode = $ENV{PG_TEST_PG_UPGRADE_MODE} || '--copy';
> 
> ..but I think maybe it should not.
> 
> When you try to use --link, it fails:
> https://cirrus-ci.com/task/4669494061170688
> 
> |Adding ".old" suffix to old global/pg_control ok
> |
> |If you want to start the old cluster, you will need to remove
> |the ".old" suffix from
> /tmp/cirrus-ci-build/build/testrun/pg_upgrade/004_subscription/data/t_004_su
> bscription_old_sub_data/pgdata/global/pg_control.old.
> |Because "link" mode was used, the old cluster cannot be safely
> |started once the new cluster has been started.
> |...
> |
> |postgres: could not find the database system
> |Expected to find it in the directory
> "/tmp/cirrus-ci-build/build/testrun/pg_upgrade/004_subscription/data/t_004_s
> ubscription_old_sub_data/pgdata",
> |but could not open file
> "/tmp/cirrus-ci-build/build/testrun/pg_upgrade/004_subscription/data/t_004_s
> ubscription_old_sub_data/pgdata/global/pg_control": No such file or directory
> |# No postmaster PID for node "old_sub"
> |[19:36:01.396](0.250s) Bail out!  pg_ctl start failed
> 

Good catch! The primary reason for the failure is that the old cluster is
reused, even after a successful upgrade. The documentation says [1]:

>
If you use link mode, the upgrade will be much faster (no file copying) and use 
less
disk space, but you will not be able to access your old cluster once you start 
the new
cluster after the upgrade.
>

> You could rename pg_control.old to avoid that immediate error, but that 
> doesn't
> address the essential issue that "the old cluster cannot be safely started 
> once
> the new cluster has been started."

Yeah, I agree that accessing the old cluster after the upgrade should be
avoided.
IIUC, pg_upgrade is run three times in 004_subscription:

1. successful upgrade
2. failure due to the insufficient max_replication_slot
3. failure because the pg_subscription_rel has 'd' state

And the old instance is reused in all runs. Therefore, the most reasonable fix
is to change the ordering of the tests, i.e., "successful upgrade" should be
done last.

The attached patch modifies the test accordingly. It also contains some
optimizations.
This passes the test in my environment:

```
pg_upgrade]$ PG_TEST_PG_UPGRADE_MODE='--link' PG_TEST_TIMEOUT_DEFAULT=10 make 
check PROVE_TESTS='t/004_subscription.pl'
...
# +++ tap check in src/bin/pg_upgrade +++
t/004_subscription.pl .. ok
All tests successful.
Files=1, Tests=14,  9 wallclock secs ( 0.03 usr  0.00 sys +  0.55 cusr  1.08 
csys =  1.66 CPU)
Result: PASS
```

What do you think?

[1]: https://www.postgresql.org/docs/devel/pgupgrade.html

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 




0001-Fix-testcase.patch
Description: 0001-Fix-testcase.patch


RE: speed up a logical replica setup

2024-02-13 Thread Hayato Kuroda (Fujitsu)
Dear Shubham,

Thanks for testing the patch!

> 
> I tried verifying few scenarios by using 5 databases and came across
> the following errors:
> 
> ./pg_createsubscriber -D ../new_standby -P "host=localhost port=5432
> dbname=postgres" -S "host=localhost port=9000 dbname=postgres"  -d db1
> -d db2 -d db3 -d db4 -d db5
> 
> pg_createsubscriber: error: publisher requires 6 wal sender
> processes, but only 5 remain
> pg_createsubscriber: hint: Consider increasing max_wal_senders to at least 7.
> 
> It is successful only with 7 wal senders, so we can change error
> messages accordingly.
> 
> 
> pg_createsubscriber: error: publisher requires 6 replication slots,
> but only 5 remain
> pg_createsubscriber: hint: Consider increasing max_replication_slots
> to at least 7.
> 
> It is successful only with 7 replication slots, so we can change error
> messages accordingly.

I'm not the original author, but I don't think that is needed. The hint message
already suggests changing it to 7. According to the docs [1], the primary
message should be factual and the hint message should be used for suggestions. I
feel the current code follows that style. Thoughts?

New patch is available in [2].

[1]: https://www.postgresql.org/docs/devel/error-style-guide.html
[2]: 
https://www.postgresql.org/message-id/TYCPR01MB12077A6BB424A025F04A8243DF54F2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-13 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

I've replied for trackability.

> Further comments for v17.
> 
> 01.
> This program assumes that the target server has the same major version as this
> program, because the target server would be restarted by the same version's
> pg_ctl command. I felt this should be ensured by reading PG_VERSION.

Still investigating.

> 02.
> pg_upgrade checks the versions of the executables it uses, like pg_ctl,
> postgres, and pg_resetwal. I felt this tool should do so as well.

Still investigating.

> 03. get_bin_directory
> ```
>   if (find_my_exec(path, full_path) < 0)
>   {
>   pg_log_error("The program \"%s\" is needed by %s but was not
> found in the\n"
>"same directory as \"%s\".\n",
>"pg_ctl", progname, full_path);
> ```
> 
> s/"pg_ctl"/progname

The message was updated.

> 04.
> Missing canonicalize_path()?

I found that find_my_exec() calls canonicalize_path(), so there is no need to do it again.

> 05.
> Assuming that the target server is a cascade standby, i.e., it also has a role
> as a primary for another node. In this case, I thought the child node would not
> work, because pg_createsubscriber runs pg_resetwal and all WAL files would be
> discarded at that time. I have not tested, but should the program detect this
> and exit earlier?

Still investigating.

> 06.
> wait_for_end_recovery() waits forever even if the standby has been disconnected
> from the primary, right? Should we check the status of the replication via
> pg_stat_wal_receiver?

Still investigating.

> 07.
> The cleanup function has a couple of bugs.
> 
> * If subscriptions have been created on the database, the function also tries
>   to drop a publication. But this leads to an ERROR because it has already been
>   dropped. See setup_subscriber().
> * If the subscription has been created, drop_replication_slot() leads to an
>   ERROR, because the subscriber tries to drop the subscription while executing
>   DROP SUBSCRIPTION.

Only drop_publication() was removed.

> 08.
> I found that all messages (ERROR, WARNING, INFO, etc.) are output to stderr,
> but I felt they should be on stdout. Is there a reason? pg_dump outputs
> messages to stderr, but the motivation there might be to avoid confusion with
> dumps.

Still investigating.

> 09.
> I'm not sure the cleanup for the subscriber is really needed. Assume that there
> are two databases, e.g., pg1 and pg2, and we fail to create a subscription on
> pg2. This can happen when a subscription with the same name has already been
> created on the primary server.
> In this case the subscription on pg1 would be removed. But what is the next
> step? Since the timelineID on the standby server is larger than the primary's
> (note that the standby has been promoted once), we cannot resume the physical
> replication as-is. IIUC the easiest way to retry is removing the cluster and
> restarting from pg_basebackup. If so, there is no need to clean up the standby
> because it is corrupted.
> We could just say "Please remove the cluster and recreate it again".

I still think it should be done, but it is not yet.

New patch can be available in [1].

[1]: 
https://www.postgresql.org/message-id/TYCPR01MB12077A6BB424A025F04A8243DF54F2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-13 Thread Hayato Kuroda (Fujitsu)
"
> +#include "common/restricted_token.h"
> +#include "fe_utils/recovery_gen.h"
> +#include "fe_utils/simple_list.h"
> +#include "getopt_long.h"
> +#include "utils/pidfile.h"
> 
> 8) POSTMASTER_STANDBY and POSTMASTER_FAILED are not being used, is it
> required or kept for future purpose:
> +enum WaitPMResult
> +{
> +   POSTMASTER_READY,
> +   POSTMASTER_STANDBY,
> +   POSTMASTER_STILL_STARTING,
> +   POSTMASTER_FAILED
> +};

I think they are here because WaitPMResult was just ported from pg_ctl.c.
I renamed the enumeration and removed the unnecessary values.

> 9) pg_createsubscriber should be kept after pg_basebackup to maintain
> the consistent order:
> diff --git a/src/bin/pg_basebackup/.gitignore
> b/src/bin/pg_basebackup/.gitignore
> index 26048bdbd8..b3a6f5a2fe 100644
> --- a/src/bin/pg_basebackup/.gitignore
> +++ b/src/bin/pg_basebackup/.gitignore
> @@ -1,5 +1,6 @@
>  /pg_basebackup
>  /pg_receivewal
>  /pg_recvlogical
> +/pg_createsubscriber

Addressed.

> 10) dry-run help message is not very clear, how about something
> similar to pg_upgrade's message like "check clusters only, don't
> change any data":
> +   printf(_(" -d, --database=DBNAME       database to
> create a subscription\n"));
> +   printf(_(" -n, --dry-run   stop before
> modifying anything\n"));
> +   printf(_(" -t, --recovery-timeout=SECS seconds to wait
> for recovery to end\n"));
> +   printf(_(" -r, --retainretain log file
> after success\n"));
> +   printf(_(" -v, --verbose   output verbose
> messages\n"));
> +   printf(_(" -V, --version   output version
> information, then exit\n"));

Changed.

New patch is available in [2].

[1]: 
https://www.postgresql.org/message-id/TYCPR01MB1207713BEC5C379A05D65E342F54B2%40TYCPR01MB12077.jpnprd01.prod.outlook.com
[2]: 
https://www.postgresql.org/message-id/TYCPR01MB12077A6BB424A025F04A8243DF54F2%40TYCPR01MB12077.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-09 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Further comments for v17.

01.
This program assumes that the target server has the same major version as this
program, because the target server would be restarted by the same version's
pg_ctl command. I felt this should be ensured by reading PG_VERSION.

02.
pg_upgrade checks the versions of the executables it uses, like pg_ctl, postgres,
and pg_resetwal. I felt this tool should do so as well.

03. get_bin_directory
```
if (find_my_exec(path, full_path) < 0)
{
pg_log_error("The program \"%s\" is needed by %s but was not 
found in the\n"
 "same directory as \"%s\".\n",
 "pg_ctl", progname, full_path);
```

s/"pg_ctl"/progname

04.
Missing canonicalize_path()?

05.
Assuming that the target server is a cascade standby, i.e., it also has a role as
a primary for another node. In this case, I thought the child node would not
work, because pg_createsubscriber runs pg_resetwal and all WAL files would be
discarded at that time. I have not tested, but should the program detect this and
exit earlier?

06.
wait_for_end_recovery() waits forever even if the standby has been disconnected
from the primary, right? Should we check the status of the replication via
pg_stat_wal_receiver?
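
For reference, a check along these lines could be issued against the standby
while waiting (a sketch; which rows and columns appear depends on the server
version and state):

```sql
-- Sketch: an empty result, or a status other than 'streaming', could be
-- treated as a reason to stop waiting and report an error.
SELECT status, last_msg_receipt_time
FROM pg_stat_wal_receiver;
```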

07.
The cleanup function has a couple of bugs.

* If subscriptions have been created on the database, the function also tries to
  drop a publication. But this leads to an ERROR because it has already been
  dropped. See setup_subscriber().
* If the subscription has been created, drop_replication_slot() leads to an
  ERROR, because the subscriber tries to drop the subscription while executing
  DROP SUBSCRIPTION.

08.
I found that all messages (ERROR, WARNING, INFO, etc.) are output to stderr,
but I felt they should be on stdout. Is there a reason? pg_dump outputs messages
to stderr, but the motivation there might be to avoid confusion with dumps.

09.
I'm not sure the cleanup for the subscriber is really needed. Assume that there
are two databases, e.g., pg1 and pg2, and we fail to create a subscription on pg2.
This can happen when a subscription with the same name has already been created
on the primary server.
In this case the subscription on pg1 would be removed. But what is the next step?
Since the timelineID on the standby server is larger than the primary's (note that
the standby has been promoted once), we cannot resume the physical replication
as-is. IIUC the easiest way to retry is removing the cluster and restarting from
pg_basebackup. If so, there is no need to clean up the standby because it is
corrupted.
We could just say "Please remove the cluster and recreate it again".

Here is a reproducer.

1. apply the txt patch atop 0001 patch.
2. run test_corruption.sh.
3. when you find the output below [1], connect to testdb from another terminal
   and run CREATE SUBSCRIPTION for the same subscription on the primary
4. Finally, pg_createsubscriber fails the creation.

I also attached server logs of both nodes and the output.
Note again that this is a real issue. I used a tricky way to make the names
surely overlap, but this can happen randomly.

10.
While investigating #09, I found that we cannot report properly a reason why the
subscription cannot be created. The output said:

```
pg_createsubscriber: error: could not create subscription 
"pg_createsubscriber_16389_3884" on database "testdb": out of memory
```

But the standby serverlog said:

```
ERROR:  subscription "pg_createsubscriber_16389_3884" already exists
STATEMENT:  CREATE SUBSCRIPTION pg_createsubscriber_16389_3884 CONNECTION 
'user=postgres port=5431 dbname=testdb' PUBLICATION pg_createsubscriber_16389 
WITH (create_slot = false, copy_data = false, enabled = false)
```

[1]
```
pg_createsubscriber: creating the replication slot 
"pg_createsubscriber_16389_3884" on database "testdb"
pg_createsubscriber: XXX: sleep 20s
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



N1.log
Description: N1.log


result.dat
Description: result.dat


server_start_20240209T115112.963.log
Description: server_start_20240209T115112.963.log


test_corruption.sh
Description: test_corruption.sh
diff --git a/src/bin/pg_basebackup/pg_createsubscriber.c 
b/src/bin/pg_basebackup/pg_createsubscriber.c
index 9628f32a3e..fdb3e92aa3 100644
--- a/src/bin/pg_basebackup/pg_createsubscriber.c
+++ b/src/bin/pg_basebackup/pg_createsubscriber.c
@@ -919,6 +919,9 @@ create_logical_replication_slot(PGconn *conn, 
LogicalRepInfo *dbinfo,
 
pg_log_info("creating the replication slot \"%s\" on database \"%s\"", 
slot_name, dbinfo->dbname);
 
+   pg_log_info("XXX: sleep 20s");
+   sleep(20);
+
appendPQExpBuffer(str, "SELECT lsn FROM 
pg_create_logical_replication_slot('%s', '%s', %s, false, false)",
  slot_name, "pgoutput", temporary ? 
"true" : "false");
 


RE: Improve eviction algorithm in ReorderBuffer

2024-02-08 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

Thanks for making the v3 patchset. I have also benchmarked the case [1].
The results below are the average of 5 runs; the result is almost the same even
when the median is used for the comparison. On my environment, no regression can
be seen.

HEAD (1e285a5)   HEAD + v3 patches   difference
10910.722 ms     10714.540 ms        around 1.8%

Also, here are minor comments for the v3 set.

01.
bh_nodeidx_entry and ReorderBufferMemTrackState are missing in typedefs.list.

02. ReorderBufferTXNSizeCompare
Should we assert {ta, tb} are not NULL?

[1]: 
https://www.postgresql.org/message-id/CAD21AoB-7mPpKnLmBNfzfavG8AiTwEgAdVMuv%3DjzmAp9ex7eyQ%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 




RE: speed up a logical replica setup

2024-02-07 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

>
Remember the target server was a standby (read only access). I don't expect an
application trying to modify it; unless it is a buggy application.
>

What if a client modifies the data just after the promotion?
Naively thinking, all the changes would be accepted, but are there any issues?

>
Regarding
GUCs, almost all of them is PGC_POSTMASTER (so it cannot be modified unless the
server is restarted). The ones that are not PGC_POSTMASTER, does not affect the
pg_createsubscriber execution [1].
>

IIUC, primary_conninfo and primary_slot_name are PGC_SIGHUP.

>
I'm just pointing out that this case is a different from pg_upgrade (from which
this idea was taken). I'm not saying that's a bad idea. I'm just arguing that
you might be preventing some access read only access (monitoring) when it is
perfectly fine to connect to the database and execute queries. As I said
before, the current UI allows anyone to setup the standby to accept only local
connections. Of course, it is an extra step but it is possible. However, once
you apply v16-0007, there is no option but use only local connection during the
transformation. Is it an acceptable limitation?
>

My remaining concern is written above. If these are not problematic, we may not
have to restrict them for now. In that case, the changes

1) overwriting the port number,
2) setting listen_addresses = ''

are not needed, right? IIUC an inconsistency of -P may still be problematic.

>
pglogical_create_subscriber does nothing [2][3].
>

Oh, thanks.
Just to confirm - pglogical sets shared_preload_libraries to ''; should we
follow that or not?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



RE: speed up a logical replica setup

2024-02-07 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

I have not finished reviewing, but I can reply to you first.

>
I didn't complete this task yet so I didn't include it in this patch.
>

Just to confirm - No need to forcibly include my refactoring patch: you can
modify based on your idea.

>
That's correct. We made a decision to use non physical replication connections
(besides the one used to connect primary <-> standby). Hence, my concern about
a HBA rule falls apart. I'm not convinced that using a modified
primary_conninfo is the only/right answer to fill the subscription connection
string. Physical vs logical replication has different requirements when we talk
about users. The physical replication requires only the REPLICATION privilege.
On the other hand, to create a subscription you must have the privileges of
pg_create_subscription role and also CREATE privilege on the specified
database. Unless, you are always recommending the superuser for this tool, I'm
afraid there will be cases that reusing primary_conninfo will be an issue. The
more I think about this UI, the more I think that, if it is not hundreds of
lines of code, it uses the primary_conninfo the -P is not specified.
>

Valid point. It is a best practice that users set minimal privileges for each
account. However...

>
Ugh. An error will occur the first time (get_primary_sysid) it tries to connect
to primary.
>

I'm not sure that is correct. I think there are several patterns.

a) -P specifies a ghost server, i.e., pg_createsubscriber cannot connect to
   anything. In this case, get_primary_sysid() would fail because
   connect_database() would fail.

b) -P specifies an existing server, but it is not my upstream.
   get_primary_sysid() would succeed.
   In this case, there are two further branches:
   
   b-1) the nodes have different system_identifiers. pg_createsubscriber would
raise an ERROR and stop.
   b-2) the nodes have the same system_identifier (e.g., the nodes were generated
from the same master). In this case, the current checks would pass but
pg_createsubscriber would get stuck or reach the timeout. PSA the reproducer.

Can pg_createsubscriber check that the two nodes are actually connected via the
replication protocol? Or can we assume that all nodes surely have different
system_identifiers?
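
For what it's worth, the identifier in question can be read on each node with a
query like the following (a sketch; note that matching values only prove the two
clusters share an initdb/basebackup lineage, not that they are connected by
replication):

```sql
-- Run on both the primary and the standby and compare the results.
SELECT system_identifier FROM pg_control_system();
```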

>
It is. However, I keep the assignments for 2 reasons: (a) there might be
parameters whose default value is not zero, (b) the standard does not say that
a null pointer must be represented by zero and (c) there is no harm in being
paranoid during initial assignment.
>

Hmm, so there is no need to use `= {0};`, but it's up to you. Also, according to
the C99 standard [1], a null pointer constant is defined as the integer constant
0 (or such an expression cast to void *):
```
An integer constant expression with the value 0, or such an expression cast to 
type
void *, is called a null pointer constant.
If a null pointer constant is converted to a
pointer type, the resulting pointer, called a null pointer, is guaranteed to 
compare unequal
to a pointer to any object or function.
```

>
> WriteRecoveryConfig() writes GUC parameters to postgresql.auto.conf, but not
> sure it is good. These settings would remain on new subscriber even after the
> pg_createsubscriber. Can we avoid it? I come up with passing these parameters
> via pg_ctl -o option, but it requires parsing output from 
> GenerateRecoveryConfig()
> (all GUCs must be allign like "-c XXX -c XXX -c XXX...").

I applied a modified version of v16-0006.
>

Sorry for the confusion; it is not a solution. This approach writes parameter
settings to postgresql.auto.conf, and they are never removed. The overwritten
settings would remain.

>
I don't understand why v16-0002 is required. In a previous version, this patch
was using connections in logical replication mode. After some discussion we
decided to change it to regular connections and use SQL functions (instead of
replication commands). Is it a requirement for v16-0003?
>

No, it is not required by 0003. I forgot the decision that we require DBAs to
allow normal connection establishment for pg_createsubscriber. Please ignore...

[1]: https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf


Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 


test_0207.sh
Description: test_0207.sh


RE: speed up a logical replica setup

2024-02-06 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Sorry, our e-mails crossed. I will read your patch,
but I want to reply to one of your points first.

>
I started reviewing v16-0007 but didn't finish yet. The general idea is ok.
However, I'm still worried about preventing some use cases if it provides only
the local connection option. What if you want to keep monitoring this instance
while the transformation is happening? Let's say it has a backlog that will
take some time to apply. Unless, you have a local agent, you have no data about
this server until pg_createsubscriber terminates. Even the local agent might
not connect to the server unless you use the current port.
>

(IIUC, 0007 could not overwrite a port number - refactoring is needed)

Ah, I actually did not have such a point of view. Assuming that a changed port
number can prevent connection establishment, there are four options:
a) Overwrite neither port nor listen_addresses. This allows us to monitor with
   external agents, but someone can modify GUCs and data during the operation.
b) Overwrite port but not listen_addresses. Not sure this is useful...
c) Overwrite listen_addresses but not port. This allows us to monitor with
   local agents, and we can partially protect the database. But there is still
   a gap.
d) Overwrite both port and listen_addresses. This can protect the database
   perfectly, but no one can monitor it.

Hmm, which one should be chosen? I prefer c) or d).
Do you know what pglogical_create_subscriber does?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 


RE: speed up a logical replica setup

2024-02-06 Thread Hayato Kuroda (Fujitsu)
Dear Shlok,

Thanks for updating the patch!

> I have created a topup patch 0007 on top of v15-0006.
> 
> I revived the patch which removes -S option and adds some options
> instead. The patch add option for --port, --username and --socketdir.
> This patch also ensures that anyone cannot connect to the standby
> during the pg_createsubscriber, by setting listen_addresses,
> unix_socket_permissions, and unix_socket_directories.

IIUC, there are two reasons why removing -S may be good:

* forcing users to specify a local connection, and
* avoiding connection establishment on the standby during pg_createsubscriber.

The first bullet is still valid, but we should describe it like pg_upgrade does:

>
pg_upgrade will connect to the old and new servers several times, so you might
want to set authentication to peer in pg_hba.conf or use a ~/.pgpass file
(see Section 33.16).
>

Regarding the second bullet, this patch cannot ensure it. pg_createsubscriber
just accepts a port number on which the standby is already listening; it does not
change the port number. So any local application on the standby server can
connect to the server and may change primary_conninfo, primary_slot_name, data,
etc. So what should be done? What do others think?

Besides, 0007 does not follow the recent changes in 0001. E.g., options should be
recorded in CreateSubscriberOptions. Also, should we check the privileges of the
socket directory?

[1]: 
https://www.postgresql.org/message-id/TY3PR01MB988902B992A4F2E99E1385EDF56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-02-06 Thread Hayato Kuroda (Fujitsu)
Dear Shubham,

Thanks for testing our codes!

> While reviewing the v15 patches I discovered that subscription
> connection string has added a lot of options which are not required
> now:
> v15-0001
> postgres=# select subname, subconninfo from pg_subscription;
> subname|   subconninfo
> ---+--
> pg_createsubscriber_5_1867633 | host=localhost port=5432 dbname=postgres
> (1 row)
> 
> 
> 
> v15-0001+0002+0003
> postgres=# select subname, subconninfo from pg_subscription;
> subname|
> 
>   subconninfo
> 
> 
> 
> ---+--
> --
> --
> --
> --
> --
> --
> pg_createsubscriber_5_1895366 | user=shubham
> passfile='/home/shubham/.pgpass' channel_binding=prefer ho
> st=127.0.0.1 port=5432 sslmode=prefer sslcompression=0
> sslcertmode=allow sslsni=1 ssl_min_protocol_versi
> on=TLSv1.2 gssencmode=disable krbsrvname=postgres gssdelegation=0
> target_session_attrs=any load_balance_
> hosts=disable dbname=postgres
> (1 row)
> 
> 
> Here, we can see that channel_binding, sslmode, sslcertmode, sslsni,
> gssencmode, krbsrvname, etc are getting included. This does not look
> intentional, we should keep the subscription connection same as in
> v15-0001.

You should attach the reproducer script. I suspect you used the pg_basebackup
-R command to set up the standby. In this case, it is intentional.
These settings are not caused by pg_createsubscriber but by pg_basebackup.
If you set primary_conninfo manually, as in the attached script, these settings
do not appear.

In the first place, the listed options (e.g., passfile, channel_binding, sslmode,
sslcompression, sslcertmode, etc.) are set whenever you connect to the database
via libpq functions. PQconninfoOptions in fe-connect.c lists the parameters and
their default values.

v15-0003 reuses primary_conninfo for the subconninfo attribute. primary_conninfo
is set by pg_basebackup when the '-R' option is specified.
The content is built in GenerateRecoveryConfig(), which passes through parameters
from PQconninfo(). That function returns all the libpq connection parameters even
when they are set to their defaults, so primary_conninfo looks longer.

I don't think this works incorrectly. Users can still set an arbitrary connection
string as primary_conninfo; you just used a longer string unintentionally.
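
To illustrate the point, a minimal sketch of setting primary_conninfo by hand
(the directory is a throwaway placeholder here; a real standby needs a full data
directory and more configuration):

```shell
# Writing primary_conninfo manually keeps only the parameters you list,
# unlike pg_basebackup -R, which records every libpq parameter.
STANDBY_DIR="$(mktemp -d)"
cat >> "$STANDBY_DIR/postgresql.conf" <<'EOF'
primary_conninfo = 'host=localhost port=5432 dbname=postgres'
EOF
touch "$STANDBY_DIR/standby.signal"
grep primary_conninfo "$STANDBY_DIR/postgresql.conf"
```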

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



shorter_subconninfo.sh
Description: shorter_subconninfo.sh


RE: speed up a logical replica setup

2024-02-06 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

> 03.
> ```
> /*
>  * Is the standby server ready for logical replication?
>  */
> static bool
> check_subscriber(LogicalRepInfo *dbinfo)
> ```
> 
> You said "target server must be a standby" in [1], but I cannot find checks 
> for it.
> IIUC, there are two approaches:
>  a) check the existence "standby.signal" in the data directory
>  b) call an SQL function "pg_is_in_recovery"

I found that the current code (HEAD + v14-0001) can get stuck or time out because
the recovery never finishes. The attached script emulates the case where the
standby.signal file is missing.
This script always fails with below output:

```
...
pg_createsubscriber: waiting the postmaster to reach the consistent state
pg_createsubscriber: postmaster was stopped
pg_createsubscriber: error: recovery timed out
...
```

I also attached the server log taken during pg_createsubscriber, and we can see
that the walreceiver exits before receiving all the required changes. I cannot
find the exact code path yet, but the inconsistency leads to a situation where
the walreceiver has exited but the startup process remains. This also suggests
that the recovery status must be checked in check_subscriber().

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



server_start_20240206T080003.670.log
Description: server_start_20240206T080003.670.log


test_0206.sh
Description: test_0206.sh


RE: Improve eviction algorithm in ReorderBuffer

2024-02-05 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

> Thank you for the review comments!
> 
> >
> > A comment for 0001:
> >
> > 01.
> > ```
> > +static void
> > +bh_enlarge_node_array(binaryheap *heap)
> > +{
> > +if (heap->bh_size < heap->bh_space)
> > +return;
> > +
> > +heap->bh_space *= 2;
> > +heap->bh_nodes = repalloc(heap->bh_nodes,
> > +  sizeof(bh_node_type) * heap->bh_space);
> > +}
> > ```
> >
> > I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data
> > structure public one and arbitrary codes and extensions can directly refer
> > bh_nodes. But if the array is repalloc()'d, the pointer would be updated so 
> > that
> > their referring would be a dangling pointer.
> 
> Hmm I'm not sure this is the case that we need to really worry about,
> and cannot come up with a good use case where extensions refer to
> bh_nodes directly rather than binaryheap. In PostgreSQL codes, many
> Nodes already have pointers and are exposed.

Actually, me neither. I could not come up with a use case - I just mentioned the
possibility. If it is not a real issue, we can ignore it.

> 
> >
> > 04.
> > ```
> >  extern binaryheap *binaryheap_allocate(int capacity,
> > binaryheap_comparator compare,
> > -   void *arg);
> > +   bool indexed, void *arg);
> > ```
> >
> > I felt pre-existing API should not be changed. How about adding
> > binaryheap_allocate_extended() or something which can specify the `bool
> indexed`?
> > binaryheap_allocate() sets heap->bh_indexed to false.
> 
> I'm really not sure it's worth inventing a
> binaryheap_allocate_extended() function just for preserving API
> compatibility. I think it's generally a good idea to have
> xxx_extended() function to increase readability and usability, for
> example, for the case where the same (kind of default) arguments are
> passed in most cases and the function is called from many places.
> However, we have a handful binaryheap_allocate() callers, and I
> believe that it would not hurt the existing callers.

I had (external) extensions which use the binaryheap APIs in mind.
I thought we could avoid raising the cost of updating their code. But I
understand the change is small, so ... up to you.

> > 05.
> > ```
> > +extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
> > +extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
> > ```
> >
> > IIUC, callers must consider whether the node should be shift up/down and use
> > appropriate function, right? I felt it may not be user-friendly.
> 
> Right, I couldn't come up with a better interface.
> 
> Another idea I've considered was that the caller provides a callback
> function where it can compare the old and new keys. For example, in
> reorderbuffer case, we call like:
> 
> binaryheap_update(rb->txn_heap, PointerGetDatum(txn),
> ReorderBufferTXNUpdateCompare, (void *) _size);
> 
> Then in ReorderBufferTXNUpdateCompare(),
> ReorderBufferTXN *txn = (ReorderBufferTXN *) a;
> Size old_size = *(Size *) b;
> (compare txn->size to "b" ...)
> 
> However it seems complicated...
> 

I considered a similar approach: accept the old node as an argument of the
compare function. But it requires further memory allocation. Does someone have a
better idea?

> 
> >
> > 08.
> > IIUC, if more than 1024 transactions are running but they have a small amount
> > of changes, the performance may be degraded, right? Do you have a result in
> > such a case?
> 
> I've run a benchmark test that I shared before[1]. Here are results of
> decoding a transaction that has 1M subtransaction each of which has 1
> INSERT:
> 
> HEAD:
> 1810.192 ms
> 
> HEAD w/ patch:
> 2001.094 ms
> 
> I set a large enough value to logical_decoding_work_mem not to evict
> any transactions. I can see about about 10% performance regression in
> this case.

Thanks for running it. I think this workload is a worst-case, extreme scenario
that would not occur on a real system (such a system should be fixed), so we can
say that the regression is at most about 10%. I felt it could be negligible, but
what do others think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Is this a problem in GenericXLogFinish()?

2024-02-04 Thread Hayato Kuroda (Fujitsu)
Dear Amit,

> 
> @@ -692,6 +697,9 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf,
> Buffer ovflbuf,
>   if (!xlrec.is_prev_bucket_same_wrt)
>   wbuf_flags |= REGBUF_NO_CHANGE;
>   XLogRegisterBuffer(1, wbuf, wbuf_flags);
> +
> + /* Track the registration status for later use */
> + wbuf_registered = true;
>   }
> 
>   XLogRegisterBuffer(2, ovflbuf, REGBUF_STANDARD);
> @@ -719,7 +727,12 @@ _hash_freeovflpage(Relation rel, Buffer
> bucketbuf, Buffer ovflbuf,
> 
>   recptr = XLogInsert(RM_HASH_ID, XLOG_HASH_SQUEEZE_PAGE);
> 
> - PageSetLSN(BufferGetPage(wbuf), recptr);
> + /* Set LSN to wbuf page buffer only when it is being registered */
> + if (wbuf_registered)
> + PageSetLSN(BufferGetPage(wbuf), recptr);
> 
> Why set LSN when the page is not modified (say when we use the flag
> REGBUF_NO_CHANGE)?  I think you need to use a flag mod_wbuf and set it
> in appropriate places during register buffer calls.

You are right. Based on the previous discussions, PageSetLSN() must be called
after MarkBufferDirty(). REGBUF_NO_CHANGE was introduced to skip these
requirements. Definitely, no-change buffers must not have PageSetLSN() called on
them. Other pages, e.g., metabuf, already follow this rule.

I updated the patch based on this requirement.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



v2_avoid_registration.patch
Description: v2_avoid_registration.patch


RE: Is this a problem in GenericXLogFinish()?

2024-02-04 Thread Hayato Kuroda (Fujitsu)
Dear Michael, Amit,

> 
> Amit, this has been applied as of 861f86beea1c, and I got pinged about
> the fact this triggers inconsistencies because we always set the LSN
> of the write buffer (wbuf in _hash_freeovflpage) but
> XLogRegisterBuffer() would *not* be called when the two following
> conditions happen:
> - When xlrec.ntups <= 0.
> - When !xlrec.is_prim_bucket_same_wrt && !xlrec.is_prev_bucket_same_wrt
> 
> And it seems to me that there is still a bug here: there should be no
> point in setting the LSN on the write buffer if we don't register it
> in WAL at all, no?

Thanks for pointing that out; I agree. PSA a patch for diagnosing the issue.

This patch avoids the inconsistency due to the LSN setting and outputs a debug
LOG when we hit such a case. I executed hash_index.sql and confirmed the log was
output [1]. This means the current tests already contain a workload which meets
the conditions below:

 - the overflow page has no tuples (xlrec.ntups is 0),
 - to-be-written page - wbuf - is not the primary (xlrec.is_prim_bucket_same_wrt
   is false), and
 - to-be-written buffer is not next to the overflow page
   (xlrec.is_prev_bucket_same_wrt is false)

So, I think my patch (after removing the elog(...) part) can fix the issue. Thoughts?

[1]:
```
LOG:  XXX: is_wbuf_registered: false
CONTEXT:  while vacuuming index "hash_cleanup_index" of relation 
"public.hash_cleanup_heap"
STATEMENT:  VACUUM hash_cleanup_heap;
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



avoid_registration.patch
Description: avoid_registration.patch


RE: speed up a logical replica setup

2024-01-31 Thread Hayato Kuroda (Fujitsu)
Dear Fabrízio, Euler,

I think you set primary_slot_name on the standby server, right?
While reading the code, I found the lines below in v11-0001.
```
if (primary_slot_name != NULL)
{
conn = connect_database(dbinfo[0].pubconninfo);
if (conn != NULL)
{
drop_replication_slot(conn, &dbinfo[0], temp_replslot);
}
```

Now, temp_replslot is a temporary slot, so it would be removed automatically.
This function will cause the error: replication slot
"pg_createsubscriber_%u_startpoint" does not exist.
Also, the physical slot still remains on the primary.
In short, "temp_replslot" should be "primary_slot_name".

PSA a script file for reproducing.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



test_0201.sh
Description: test_0201.sh


RE: speed up a logical replica setup

2024-01-31 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Thanks for giving comments! I want to reply some of them.

>
I didn't provide the whole explanation. I'm envisioning the use case that pg_ctl
doesn't reach the consistent state and the timeout is reached (the consequence
is that pg_createsubscriber aborts the execution). It might occur on a busy
server. The probability that it occurs with the current code is low (LSN gap
for recovery is small). Maybe I'm anticipating issues when the base backup
support is added but better to raise concerns during development.
>

Hmm, actually I didn't know about that case. Thanks for the explanation. I want
to see how you describe it in the docs.

>
pg_upgrade doesn't but others do like pg_rewind, pg_resetwal, pg_controldata,
pg_checksums. It seems newer tools tend to provide short and long options.
>

Oh, you are right.

>
Nothing? If you interrupt the execution, there will be objects left behind and
you, as someone that decided to do it, have to clean things up. What do you
expect this tool to do? The documentation will provide some guidance informing
the object name patterns this tool uses and you can check for leftover objects.
Someone can argue that is a valid feature request but IMO it is not one in the
top of the list.
>

OK, so let's keep current style.

>
Why? Are you suggesting that the dry run mode covers just the verification
part? If so, it is not a dry run mode. I would expect it to run until the end
(or until it accomplish its goal) but *does not* modify data. For pg_resetwal,
the modification is one of the last steps and the other ones (KillFoo
functions) that are skipped modify data. It ends the dry run mode when it
accomplish its goal (obtain the new control data values). If we stop earlier,
some of the additional steps won't be covered by the dry run mode and a failure
can happen but could be detected if you run a few more steps.
>

Yes, that was my expectation. I'm still not sure which operations can be
detected by the dry run, but we can keep it for now.

>
Why? See [1]. I prefer the kind mode (always wait until the recovery ends) but
you and Amit are proposing a more aggressive mode. The proposal (-t 60) seems
ok right now, however, if the goal is to provide base backup support in the
future, you certainly should have to add the --recovery-timeout in big clusters
or those with high workloads because base backup is run between replication slot
creation and consistent LSN. Of course, we can change the default when base
backup support is added.
>

Sorry, I missed your previous post. Let's keep yours.

>
Good point. I included a check for pg_create_subscription role and CREATE
privilege on the specified database.
>

Not sure, but can the replication origin functions be executed with these
privileges? According to the docs [1], they do not seem to be related.

[1]: 
https://www.postgresql.org/docs/devel/functions-admin.html#FUNCTIONS-REPLICATION

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 





RE: speed up a logical replica setup

2024-01-31 Thread Hayato Kuroda (Fujitsu)
Dear Fabrízio,

Thanks for reporting. I understood that the issue occurred on v11 and v12.
I will try to reproduce and check the reason.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 



RE: Documentation to upgrade logical replication cluster

2024-01-30 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for updating the patch! Here are my comments for v6.

01.
```
+   Logical replication cluster
+   
+
+ A set of publisher and subscriber instance with publisher instance
+ replicating changes to the subscriber instance.
+
+   
```

Should we say 1:N relationship is allowed?

02.
```
@@ -70,6 +70,7 @@ PostgreSQL documentation
pg_upgrade supports upgrades from 9.2.X and later to the current
major release of PostgreSQL, including snapshot 
and beta releases.
   
+
  
```

Unnecessary blank.

03.
```
   
-   These are the steps to perform an upgrade
-   with pg_upgrade:
+   Below are the steps to perform an upgrade
+   with pg_upgrade.
   
```

I'm not sure it should be included in this patch.

04.
```
+   If the old primary is prior to version 17.0, then no slots on the primary
+   are copied to the new standby, so all the slots on the old standby must
+   be recreated manually.
```

I think that "all the slots on the old standby" must be recreated manually in
any case. Therefore, the conjunction ", so" seems incorrect.

05.
```
If the old primary is version 17.0 or later, then
+   only logical slots on the primary are copied to the new standby, but
+   other slots on the old standby are not copied, so must be recreated
+   manually.
```

How about replacing this paragraph with the one below?

```
All the slots on the old standby must be recreated manually. If the old primary
is version 17.0 or later, then only logical slots on the primary are copied to
the new standby.
```

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: Improve eviction algorithm in ReorderBuffer

2024-01-30 Thread Hayato Kuroda (Fujitsu)
Dear Sawada-san,

I have started to read your patches. Here are my initial comments.
At least, all subscription tests were passed on my env.

A comment for 0001:

01.
```
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+if (heap->bh_size < heap->bh_space)
+return;
+
+heap->bh_space *= 2;
+heap->bh_nodes = repalloc(heap->bh_nodes,
+  sizeof(bh_node_type) * heap->bh_space);
+}
```

I'm not sure it is OK to use repalloc() for enlarging bh_nodes. This data
structure is a public one, and arbitrary code and extensions can directly
reference bh_nodes. But if the array is repalloc()'d, the pointer would be
updated, so their references would become dangling pointers.
I think the internals of the structure should be private in this case.

Comments for 0002:

02.
```
+#include "utils/palloc.h"
```

Is it really needed? I'm not sure what refers to it.

03.
```
typedef struct bh_nodeidx_entry
{
bh_node_typekey;
charstatus;
int idx;
} bh_nodeidx_entry;
```

Sorry if it is a stupid question, but can you tell me how "status" is used?
Neither the binaryheap nor the reorderbuffer components refer to it.

04.
```
 extern binaryheap *binaryheap_allocate(int capacity,
binaryheap_comparator compare,
-   void *arg);
+   bool indexed, void *arg);
```

I felt the pre-existing API should not be changed. How about adding
binaryheap_allocate_extended() or something similar which can specify
`bool indexed`? binaryheap_allocate() would then set heap->bh_indexed to false.

05.
```
+extern void binaryheap_update_up(binaryheap *heap, bh_node_type d);
+extern void binaryheap_update_down(binaryheap *heap, bh_node_type d);
```

IIUC, callers must consider whether the node should be shifted up or down and
use the appropriate function, right? I felt that may not be user-friendly.

Comments for 0003:

06.
```
This commit changes the eviction algorithm in ReorderBuffer to use
max-heap with transaction size,a nd use two strategies depending on
the number of transactions being decoded.
```

s/a nd/ and/

07.
```
It could be too expensive to pudate max-heap while preserving the heap
property each time the transaction's memory counter is updated, as it
could happen very frquently. So when the number of transactions being
decoded is small, we add the transactions to max-heap but don't
preserve the heap property, which is O(1). We heapify the max-heap
just before picking the largest transaction, which is O(n). This
strategy minimizes the overheads of updating the transaction's memory
counter.
```

s/pudate/update/

08.
IIUC, if more than 1024 transactions are running but they have small amounts of
changes, the performance may be degraded, right? Do you have a result for such
a case?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-01-30 Thread Hayato Kuroda (Fujitsu)
the manpage, callback functions would not be called when the process exits
due to signals:
due to signals:

> Functions  registered  using atexit() (and on_exit(3)) are not called if a
> process terminates abnormally because of the delivery of a signal.

Do you have a good way to handle this case? One solution is to output created
objects at every log level, but that may be overkill. Thoughts?

13, main
```
/*
 * Create a temporary logical replication slot to get a consistent LSN.
```

Just to clarify - I still think the process should exit before this point in
the dry-run case. In the case of pg_resetwal, the process exits before doing
actual work like RewriteControlFile().

14. main
```
 * XXX we might not fail here. Instead, we provide a warning so the user
 * eventually drops this replication slot later.
```

But there is a possibility of exit(1) in drop_replication_slot(). Is that
acceptable?

15. wait_for_end_recovery
```
/*
 * Bail out after recovery_timeout seconds if this option is 
set.
 */
if (recovery_timeout > 0 && timer >= recovery_timeout)
```

Hmm, IIUC, it should be enabled by default [3]. Do you have anything in mind?

16. main
```
/*
 * Create the subscription for each database on subscriber. It does not
 * enable it immediately because it needs to adjust the logical
 * replication start point to the LSN reported by consistent_lsn (see
 * set_replication_progress). It also cleans up publications created by
 * this tool and replication to the standby.
 */
if (!setup_subscriber(dbinfo, consistent_lsn))
```

Subscriptions would be created and the replication origin would be moved
forward here, but the latter can be done only by a superuser. I felt that this
should be checked in check_subscriber().

17. main
```
/*
 * Change system identifier.
 */
modify_sysid(pg_resetwal_path, subscriber_dir);
```

Even if I execute without the -v option, output from the pg_resetwal command
appears. It seems a bit strange.

```
$ pg_createsubscriber -D data_N2/ -P "port=5431 user=postgres" -S "port=5432 user=postgres" -d postgres
Write-ahead log reset
$
```

[1]: 
https://www.postgresql.org/message-id/TY3PR01MB9889C362FF76102C88FA1C29F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[2]: 
https://www.postgresql.org/message-id/TY3PR01MB9889C362FF76102C88FA1C29F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[3]: 
https://www.postgresql.org/message-id/CAA4eK1JRgnhv_ySzuFjN7UaX9qxz5Hqcwew7Fk%3D%2BAbJbu0Kd9w%40mail.gmail.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 





RE: Documentation to upgrade logical replication cluster

2024-01-28 Thread Hayato Kuroda (Fujitsu)
Dear Vignesh,

Thanks for updating the patch! For now, v4 patch LGTM.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/ 



RE: speed up a logical replica setup

2024-01-25 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Again, thanks for updating the patch! Here are my random comments for v9.

01.
I cannot find your reply to my comment #7 [1], but you reverted the related
changes. I'm not sure whether you are still considering it or decided not to
include the changes. Can you clarify your opinion?
(This is needed because the changes are huge, so they quite affect other
development...)

02.
```
+   -t seconds
+   --timeout=seconds
```

But the source code requires `--recovery-timeout`. Please update one of them.

03.
```
+ *Create a new logical replica from a standby server
```

Junwang pointed out a change here, but it was reverted [2].
Can you clarify your opinion as well?

04.
```
+/*
+ * Is the source server ready for logical replication? If so, create the
+ * publications and replication slots in preparation for logical replication.
+ */
+static bool
+setup_publisher(LogicalRepInfo *dbinfo)
```

But this function also verifies the source server. I felt these should be in
separate functions.

05.
```
+/*
+ * Is the target server ready for logical replication?
+ */
+static bool
+setup_subscriber(LogicalRepInfo *dbinfo)
```

Actually, this function does not set up the subscriber. It just verifies whether
the target can become a subscriber, right? If so, it should be renamed.

06.
```
+atexit(cleanup_objects_atexit);
```

The registration of the cleanup function happens too early. This sometimes
triggers a core dump. E.g.,

```
$ pg_subscriber --publisher-conninfo --subscriber-conninfo 'user=postgres port=5432' --verbose --database 'postgres' --pgdata data_N2/
pg_subscriber: error: too many command-line arguments (first is "user=postgres port=5432")
pg_subscriber: hint: Try "pg_subscriber --help" for more information.
Segmentation fault (core dumped)

$ gdb ...
(gdb) bt
#0  cleanup_objects_atexit () at pg_subscriber.c:131
#1  0x7fb982cffce9 in __run_exit_handlers () from /lib64/libc.so.6
#2  0x7fb982cffd37 in exit () from /lib64/libc.so.6
#3  0x004054e6 in main (argc=9, argv=0x7ffc59074158) at 
pg_subscriber.c:1500
(gdb) f 3
#3  0x004054e6 in main (argc=9, argv=0x7ffc59074158) at 
pg_subscriber.c:1500
1500exit(1);
(gdb) list
1495if (optind < argc)
1496{
1497pg_log_error("too many command-line arguments (first is 
\"%s\")",
1498 argv[optind]);
1499pg_log_error_hint("Try \"%s --help\" for more 
information.", progname);
1500exit(1);
1501}
1502 
1503/*
1504 * Required arguments
```

I still think it should be done just before the creation of objects [3].

07.
Missing a removal of publications on the standby.

08.
Missing registration of LogicalRepInfo in the typedefs.list.

09
```
+ 
+  
+   pg_subscriber
+   option
+  
+ 
```

Can you reply to my comment #2 [4]? I think the mandatory options should be written.

10.
Just to confirm - will you implement the start_standby/stop_standby functions
in the next version?

[1]: 
https://www.postgresql.org/message-id/TY3PR01MB9889C362FF76102C88FA1C29F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[2]: 
https://www.postgresql.org/message-id/CAEG8a3%2BwL_2R8n12BmRz7yBP3EBNdHDhmdgxQFA9vS%2BzPR%2B3Kw%40mail.gmail.com
[3]: 
https://www.postgresql.org/message-id/TY3PR01MB9889678E47B918F4D83A6FD8F57B2%40TY3PR01MB9889.jpnprd01.prod.outlook.com
[4]: 
https://www.postgresql.org/message-id/TY3PR01MB9889C362FF76102C88FA1C29F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 





RE: speed up a logical replica setup

2024-01-25 Thread Hayato Kuroda (Fujitsu)
Dear Euler,

Thanks for updating the patch! Before reading it, I wanted to reply to some of
the comments.

>
I'm still thinking about replacing --subscriber-conninfo with separate items
(username, port, password?, host = socket dir). Maybe it is an overengineering.
The user can always prepare the environment to avoid unwanted and/or external
connections.
>

For me, the required amount of fixes is not so different from the current one.
What do others think?

>
It is extremely useful because (a) you have a physical replication setup and
don't know if it is prepared for logical replication, (b) check GUCs (is
max_wal_senders sufficient for this pg_subscriber command? Or is
max_replication_slots sufficient to setup the logical replication even though I
already have some used replication slots?), (c) connectivity and (d)
credentials.
>

Yeah, it is useful for verification purposes, so let's keep this option.
But I still think the naming should be "--check". Also, there are many
`if (!dry_run)` branches, but most of them could be removed if the process
exits earlier. Thoughts?


>
> 05.
> I felt we should accept some settings from enviroment variables, like 
> pg_upgrade.
> Currently, below items should be acceted.
> 
> - data directory
> - username
> - port
> - timeout

Maybe PGDATA.
>

Sorry, I cannot follow this. Did you mean that the target data directory should
be specifiable via PGDATA? If so, +1.

>
> 06.
> ```
> pg_logging_set_level(PG_LOG_WARNING);
> ```
> 
> If the default log level is warning, there are no ways to output debug logs.
> (-v option only raises one, so INFO would be output)
> I think it should be PG_LOG_INFO.

You need to specify multiple -v options.
>

Hmm. I felt the specification was a bit strange... but at the least it must be
described in the documentation. pg_dump.sgml has similar lines.

>
> 08.
> Not sure, but if we want to record outputs by pg_subscriber, the sub-directory
> should be created. The name should contain the timestamp.

The log file already contains the timestamp. Why?
>

This comment assumed that outputs from pg_subscriber were also recorded to a
file. In that case, and if the file also has the same timestamp, I think they
could be gathered in the same place. No need if outputs are not recorded.


>
> 09.
> Not sure, but should we check max_slot_wal_keep_size of primary server? It can
> avoid to fail starting of logical replicaiton.

A broken physical replication *before* running this tool is its responsibility?
Hmm. We might add another check that can be noticed during dry run mode.
>

I thought that we should not generate any broken objects, but indeed, I'm not
sure it is in our scope. What do others think?


>
> 11.
> ```
> /*
> * Stop the subscriber if it is a standby server. Before executing the
> * transformation steps, make sure the subscriber is not running because
> * one of the steps is to modify some recovery parameters that require a
> * restart.
> */
> if (stat(pidfile, ) == 0)
> ```
> 
> I kept just in case, but I'm not sure it is still needed. How do you think?
> Removing it can reduce an inclusion of pidfile.h.

Are you suggesting another way to check if the standby is up and running?
>

Running `pg_ctl stop` itself can detect whether the process is still alive;
it exits with 1 when the process is not there.

>
> I didn't notice the comment, but still not sure the reason. Why we must 
> reserve
> the slot until pg_subscriber finishes? IIUC, the slot would be never used, it
> is created only for getting a consistent_lsn. So we do not have to keep.
> Also, just before, logical replication slots for each databases are created, 
> so
> WAL records are surely reserved.
>

I want to confirm the conclusion - will you remove the creation of a transient 
slot?

Also, though not yet tested, I'm now considering that we can reuse the
primary_conninfo value. We are assuming that the target server is a standby and
that its current upstream will be converted to a publisher. In this case, the
connection string is already specified as primary_conninfo, so
--publisher-conninfo may not be needed. The parameter does not contain a
database name, so --databases is still needed. I imagine something like:

1. Parse options
2. Turn on standby
3. Verify the standby
4. Turn off standby
5. Get primary_conninfo from standby
6. Connect to primary (concat of primary_conninfo and an option is used)
7. Verify the primary
...

What do you think?

Best Regards,
Hayato Kuroda
FUJITSU LIMITED
https://www.fujitsu.com/global/ 





RE: speed up a logical replica setup

2024-01-25 Thread Hayato Kuroda (Fujitsu)
Dear Peter,

Thanks for giving your idea!

> > Subscriber has a different meaning of subscription. Subscription is an SQL
> > object. Subscriber is the server (node in replication terminology) where the
> > subscription resides. Having said that pg_createsubscriber doesn't seem
> > a bad
> > name because you are creating a new subscriber. (Indeed, you are
> > transforming /
> > converting but "create" seems closer and users can infer that it is a
> > tool to
> > build a new logical replica.
> 
> That makes sense.
> 
> (Also, the problem with "convert" etc. is that "convertsubscriber" would
> imply that you are converting an existing subscriber to something else.
> It would need to be something like "convertbackup" then, which doesn't
> seem helpful.)
> 
> > I think "convert" and "transform" fit for this case. However, "create",
> > "convert" and "transform" have 6, 7 and 9 characters,  respectively. I
> > suggest
> > that we avoid long names (subscriber already has 10 characters). My
> > preference
> > is pg_createsubscriber.
> 
> That seems best to me.

Just FYI - I'm ok to change the name to pg_createsubscriber.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: speed up a logical replica setup

2024-01-25 Thread Hayato Kuroda (Fujitsu)
   PQntuples(res), 1);
return NULL;
}
```

The error message should be changed. I think this error means the standby has a
wrong primary_slot_name, right?

18. misc
Sometimes the pid of pg_subscriber is referred. It can be stored as global 
variable.

19.
C99 style is now allowed, so loop variables like "i" can be declared in the
for statement, like

```
for (int i = 0; i < MAX; i++)
```

20.
Some comments, docs, and outputs must be fixed when the name is changed.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



RE: speed up a logical replica setup

2024-01-24 Thread Hayato Kuroda (Fujitsu)
Dear hackers,

Based on the requirement, I have profiled the performance test. It showed that
in the small-data case there are mainly two bottlenecks: starting a server and
waiting until the recovery is done.

# Tested source code

V7 patch set was applied atop HEAD(0eb23285). No configure options were 
specified
when it was built.

# Tested workload

I focused only on the 100MB/1GB cases because the bigger ones already had good
performance. (The number of inserted tuples was the same as in previous tests.)
I used a bash script instead of the TAP test framework (see attached). The
executed SQL and operations were almost the same.

Note that I tested only the one-db case. Results may change if the number of
databases changes.

# Measurement
Some debug logs which output the current time were added (please see the diff
file). I logged just before and after some chosen events. The bullets below
show the measured ones:

* Starting a server
* Stopping a server
* Creating replication slots
* Creating publications
* Waiting until the recovery ended
* Creating subscriptions

# Result 
The table below shows the elapsed time for these events. Raw data is also
available in the attached Excel file.

|Event category                  |100MB case [sec]|1GB case [sec]|
|Starting a server               |1.414           |1.417         |
|Stopping a server               |0.506           |0.506         |
|Creating replication slots      |0.005           |0.007         |
|Creating publications           |0.001           |0.002         |
|Waiting until the recovery ended|1.603           |14.529        |
|Creating subscriptions          |0.012           |0.012         |
|Total                           |3.541           |16.473        |
|Actual time                     |4.37            |17.271        |


As you can see, starting the server and waiting for recovery are slow. We
cannot omit these, but setting a smaller shared_buffers would reduce the start
time. One approach is to overwrite the GUC with a smaller value, but I think we
cannot determine an appropriate value.

Best Regards,
Hayato Kuroda
FUJITSU LIMITED



run.sh
Description: run.sh


perf_result.xlsx
Description: perf_result.xlsx


add_debug_log.diff
Description: add_debug_log.diff

