from:"Peter Smith"

Re: Pgoutput not capturing the generated columns

2024-05-23 Thread Peter Smith

Hi,

Here are some review comments for the patch v3-0001.

I don't think v3 addressed any of my previous review comments for
patches v1 and v2. [1][2]

So the comments below are limited only to the new code (i.e. the v3
versus v2 differences). Meanwhile, all my previous review comments may
still apply.

==
GENERAL

The patch applied gives whitespace warnings:

[postgres@CentOS7-x64 oss_postgres_misc]$ git apply
../patches_misc/v3-0001-Support-generated-column-capturing-generated-colu.patch
../patches_misc/v3-0001-Support-generated-column-capturing-generated-colu.patch:150:
trailing whitespace.

../patches_misc/v3-0001-Support-generated-column-capturing-generated-colu.patch:202:
trailing whitespace.

../patches_misc/v3-0001-Support-generated-column-capturing-generated-colu.patch:730:
trailing whitespace.
warning: 3 lines add whitespace errors.

==
contrib/test_decoding/test_decoding.c

1. pg_decode_change

  MemoryContext old;
+ bool include_generated_columns;
+

I'm not really convinced this variable saves any code.

==
doc/src/sgml/protocol.sgml

2.
+
+ include-generated-columns
+ 
+
+The include-generated-columns option controls whether
generated columns should be included in the string representation of
tuples during logical decoding in PostgreSQL. This allows users to
customize the output format based on whether they want to include
these columns or not.
+ 
+ 
+ 

2a.
Something is not correct when this name has hyphens and all the nearby
parameter names do not. Shouldn't it be all uppercase like the other
boolean parameter?

~

2b.
Text in the SGML file should be wrapped properly.

~

2c.
IMO the comment can be more terse and it also needs to specify that it
is a boolean type, and what is the default value if not passed.

SUGGESTION

INCLUDE_GENERATED_COLUMNS [ boolean ]

If true, then generated columns should be included in the string
representation of tuples during logical decoding in PostgreSQL. The
default is false.

==
src/backend/replication/logical/proto.c

3. logicalrep_write_tuple

- if (!column_in_column_list(att->attnum, columns))
+ if (!column_in_column_list(att->attnum, columns) && !att->attgenerated)
+ continue;
+
+ if (att->attgenerated && !publish_generated_column)
  continue;

3a.
This code seems overcomplicated checking the same flag multiple times.

SUGGESTION
if (att->attgenerated)
{
  if (!publish_generated_column)
continue;
}
else
{
  if (!column_in_column_list(att->attnum, columns))
continue;
}

~

3b.
The same logic occurs several times in logicalrep_write_tuple

~~~

4. logicalrep_write_attrs

  if (!column_in_column_list(att->attnum, columns))
  continue;

+ if (att->attgenerated && !publish_generated_column)
+ continue;
+

Shouldn't these code fragments (2x in this function) look the same as
in logicalrep_write_tuple? See the above review comments.

==
src/backend/replication/pgoutput/pgoutput.c

5. maybe_send_schema

  TransactionId topxid = InvalidTransactionId;
+ bool publish_generated_column = data->publish_generated_column;

I'm not convinced this saves any code, and anyway, it is not
consistent with other fields in this function that are not extracted
to another variable (e.g. data->streaming).

~~~

6. pgoutput_change
-
+ bool publish_generated_column = data->publish_generated_column;
+

I'm not convinced this saves any code, and anyway, it is not
consistent with other fields in this function that are not extracted
to another variable (e.g. data->binary).

==
[1] My v1 review -
https://www.postgresql.org/message-id/CAHut+PsuJfcaeg6zst=6pe5uyjv_uxvrhu3ck7w2ahb1uqy...@mail.gmail.com
[2] My v2 review -
https://www.postgresql.org/message-id/CAHut%2BPv4RpOsUgkEaXDX%3DW2rhHAsJLiMWdUrUGZOcoRHuWj5%2BQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-05-22 Thread Peter Smith

ibers
+ from the primary server.
+

7a.
I felt that the step should just say "Confirm that the standby server
is not lagging behind the subscribers.". So the text "This step can be
skipped..." should be a separate paragraph.

~

7b.
The 2nd standby_slot_names should use a varname font.

~

7c.
/may vary at different points in time due to/can vary due to/



8.
+   
+Firstly, on the subscriber node check the last replayed WAL.
+This step needs to be run on the database(s) that includes the failover
+enabled subscription(s), to find the last replayed WAL on
each database.

8a.
Don't need to say "Firstly,"

~

8b.
The text "This step..." can be simplified as below:

BEFORE
This step needs to be run on the database(s) that includes the
failover enabled subscription(s), to find the last replayed WAL on
each database.

SUGGESTION
This step needs to be run on any database that includes
failover-enabled subscriptions.

~~~

9.
+ 
+  Next, on the standby server check that the last-received WAL location
+  is ahead of the replayed WAL location(s) on the subscriber identified
+  above. If the above SQL result was NULL, it means the subscriber has not
+  yet replayed any WAL, so the standby server must be ahead of the
+  subscriber, and this step can be skipped.


Don't need to say "Next,"

~~~

10.
+  
+   If the result (failover_ready) of both above steps is
+   true, existing subscriptions will be able to continue without
losing any data
+   that has been flushed to the new primary server.
+  

Let's word this more like the same sentence top of the page. See
review comment #3c

SUGGESTION
If the result (failover_ready) of both steps above
is true, then existing subscriptions can continue subscribing to
publications now on the new primary server without any loss of data.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Pgoutput not capturing the generated columns

2024-05-21 Thread Peter Smith

mp; !publish_generated_column)
+ continue;

ditto #6

~~

8. logicalrep_write_attrs

- if (att->attisdropped || att->attgenerated)
+ if (att->attisdropped)
  continue;

  if (!column_in_column_list(att->attnum, columns))
  continue;

+ if (att->attgenerated && !publish_generated_column)
+ continue;
+

ditto #6

~~

9.
- if (att->attisdropped || att->attgenerated)
+ if (att->attisdropped)
  continue;

  if (!column_in_column_list(att->attnum, columns))
  continue;

+ if (att->attgenerated && !publish_generated_column)
+ continue;

ditto #6

==
src/include/catalog/pg_subscription.h


10. CATALOG

+ bool subgeneratedcolumn; /* True if generated colums must be published */

/colums/columns/

==
src/test/regress/sql/publication.sql

11.
--- error: generated column "d" can't be in list
+-- ok


Maybe change "ok" to say like "ok: generated cols can be in the list too"

==

12.
GENERAL - Missing CREATE SUBSCRIPTION test?
GENERAL - Missing ALTER SUBSCRIPTION test?

How come this patch adds a new CREATE SUBSCRIPTION option but does not
seem to include any test case for that option in either the CREATE
SUBSCRIPTION or ALTER SUBSCRIPTION regression tests?

==
[1] My v1 review -
https://www.postgresql.org/message-id/CAHut+PsuJfcaeg6zst=6pe5uyjv_uxvrhu3ck7w2ahb1uqy...@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: inconsistent quoting in error messages

2024-05-20 Thread Peter Smith

On Tue, May 21, 2024 at 2:56 PM Kyotaro Horiguchi
 wrote:
>
> I noticed that there are slightly inconsistent messages regarding
> quoting policies.
>
> > This happens if you temporarily set "wal_level=minimal" on the server.
> > WAL generated with "full_page_writes=off" was replayed during online backup
>
> > pg_log_standby_snapshot() can only be used if "wal_level" >= "replica"
>
> > WAL streaming ("max_wal_senders" > 0) requires "wal_level" to be "replica" 
> > or "logical"
>
> I think it's best to quote variable names and values separately, like
> "wal_level" = "minimal" (but not use quotes for numeric values), as it
> seems to be the most common practice. Anyway, we might want to unify
> them.
>
>
> Likewise, I saw two different versions of values with units.
>
> > "max_stack_depth" must not exceed %ldkB.
> > "vacuum_buffer_usage_limit" must be 0 or between %d kB and %d kB
>
> I'm not sure, but it seems like the latter version is more common.
>
> regards.
>

Hi,

I think it might be better to keep all the discussions about GUC
quoting and related topics like this confined to the main thread here
[1]. Otherwise, we might end up with a bunch of competing patches.

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPv-kSN8SkxSdoHano_wPubqcg5789ejhCDZAcLFceBR-w%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: GUC names in messages

2024-05-19 Thread Peter Smith

On Fri, May 17, 2024 at 9:57 PM Peter Eisentraut  wrote:
>
> On 17.05.24 05:31, Peter Smith wrote:
> >> I think we should accept your two patches
> >>
> >> v6-0001-GUC-names-docs.patch
> >> v6-0002-GUC-names-add-quotes.patch
> >>
> >> which effectively everyone was in favor of and which seem to be the most
> >> robust and sustainable solution.
> >>
> >> (The remaining three patches from the v6 set would be PG18 material at
> >> this point.)
> > Thanks very much for taking an interest in resurrecting this thread.
> >
> > It was always my intention to come back to this when the dust had
> > settled on PG17. But it would be even better if the docs for the rule
> > "just quote everything", and anything else you deem acceptable, can be
> > pushed sooner.
> >
> > Of course, there will still be plenty more to do for PG18, including
> > locating examples in newly pushed code for messages that have slipped
> > through the cracks during the last few months using different formats,
> > and other improvements, but those tasks should become easier if we can
> > get some of these v6 patches out of the way first.
>
> I committed your 0001 and 0002 now, with some small fixes.
>
> There has also been quite a bit of new code, of course, since you posted
> your patches, so we'll probably find a few more things that could use
> adjustment.
>
> I'd be happy to consider the rest of your patch set after beta1 and/or
> for PG18.
>

Thanks for pushing!

I'll try to dedicate more time to this sometime soon to go through all
the code again to track down those loose ends.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-16 Thread Peter Smith

Hi Kuroda-san,

I did not apply these v12* patches, but I have diff'ed all of them
with the previous v11* patches and confirmed that all of my previous
review comments now seem to be addressed.

I don't have any more comments to make at this time. Thanks!

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: GUC names in messages

2024-05-16 Thread Peter Smith

On Thu, May 16, 2024 at 9:35 PM Peter Eisentraut  wrote:
>
> On 04.01.24 07:53, Peter Smith wrote:
> >> Now that I read this again, I think this is wrong.
> >>
> >> We should decide the quoting for a category, not the actual content.
> >> Like, quote all file names; do not quote keywords.
> >>
> >> This led to the attempted patch to decide the quoting of GUC parameter
> >> names dynamically based on the actual content, which no one really
> >> liked.  But then, to preserve consistency, we also need to be uniform in
> >> quoting GUC parameter names where the name is hardcoded.
> >>
> >
> > I agree. By attempting to define when to and when not to use quotes it
> > has become overcomplicated.
> >
> > Earlier in the thread, I counted how quotes were used in the existing
> > messages [5]; there were ~39 quoted and 164 not quoted. Based on that
> > we chose to stay with the majority, and leave all the unquoted ones so
> > only adding quotes "when necessary". In hindsight, that was probably
> > the wrong choice because it opened a can of worms about what "when
> > necessary" even means (e.g. what about underscores, mixed case etc).
> >
> > Certainly one simple rule "just quote everything" is easiest to follow.
>
> I've been going through the translation updates for PG17 these days and
> was led back around to this issue.  It seems we left it in an
> intermediate state that no one was really happy with and which is
> arguably as inconsistent or more so than before.
>
> I think we should accept your two patches
>
> v6-0001-GUC-names-docs.patch
> v6-0002-GUC-names-add-quotes.patch
>
> which effectively everyone was in favor of and which seem to be the most
> robust and sustainable solution.
>
> (The remaining three patches from the v6 set would be PG18 material at
> this point.)

Thanks very much for taking an interest in resurrecting this thread.

It was always my intention to come back to this when the dust had
settled on PG17. But it would be even better if the docs for the rule
"just quote everything", and anything else you deem acceptable, can be
pushed sooner.

Of course, there will still be plenty more to do for PG18, including
locating examples in newly pushed code for messages that have slipped
through the cracks during the last few months using different formats,
and other improvements, but those tasks should become easier if we can
get some of these v6 patches out of the way first.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-16 Thread Peter Smith

Hi, Here are my remaining review comments for the latest v11* patches.

//
v11-0001
//

No changes. No comments.

//
v11-0002
//

==
doc/src/sgml/ref/alter_subscription.sgml

2.1.
ALTER SUBSCRIPTION ... SET (failover = on|off) and
-   ALTER SUBSCRIPTION ... SET (two_phase = on|off)
+   ALTER SUBSCRIPTION ... SET (two_phase = off)

My other thread patch has already been pushed [1], so now you can
modify this to say "true|false" as previously suggested.


//
v11-0003
//

==
src/backend/commands/subscriptioncmds.c

3.1. AlterSubscription

+ subtwophase = LOGICALREP_TWOPHASE_STATE_DISABLED;
+ }
+ else
+ subtwophase = LOGICALREP_TWOPHASE_STATE_PENDING;
+
+
  /* Change system catalog acoordingly */
  values[Anum_pg_subscription_subtwophasestate - 1] =
- CharGetDatum(opts.twophase ?
- LOGICALREP_TWOPHASE_STATE_PENDING :
- LOGICALREP_TWOPHASE_STATE_DISABLED);
+ CharGetDatum(subtwophase);
  replaces[Anum_pg_subscription_subtwophasestate - 1] = true;

Sorry, I don't think that 'subtwophase' change is an improvement --
IMO the existing ternary code was fine as-is.

I only reported the excessive flag checking in the previous v10-0003
review because of some extra "if (!opts.twophase)" code but that was
caused by what you called "wrong git operations." and is already fixed
now.

//
v11-0004
//

==
src/sgml/catalogs.sgml

4.1.
+ 
+  
+   subforcealter bool
+  
+  
+   If true, then the ALTER
SUBSCRIPTION
+   can disable two_phase option, even if there are
+   uncommitted prepared transactions from when two_phase
+   was enabled
+  
+ 
+

I think this description should be changed to say what it *really*
does. IMO, the stuff about 'two_phase' is more like a side-effect.

SUGGESTION (this is just pseudo-code. You can add the CREATE
SUBSCRIPTION 'force_alter' link appropriately)

If true, then the ALTER SUBSCRIPTION command can
sometimes be forced to proceed instead of giving an error. See
force_alter parameter for details about when this might
be useful.

==
[1] 
https://github.com/postgres/postgres/commit/fa65a022db26bc446fb67ce1d7ac543fa4bb72e4

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Docs: Always use true|false instead of sometimes on|off for the subscription options

2024-05-16 Thread Peter Smith

On Thu, May 16, 2024 at 10:42 PM David Rowley  wrote:
>
> On Thu, 16 May 2024 at 19:05, Peter Smith  wrote:
> >
> > On Thu, May 16, 2024 at 3:11 PM David Rowley  wrote:
> > > If you want to do this, what's the reason to limit it to just this one
> > > page of the docs?
> >
> > Yeah, I had a vested interest in this one place because I've been
> > reviewing the other thread [1] that was mentioned above. If that other
> > thread chooses "true|false" then putting "true|false" adjacent to
> > another "on|off" will look silly.
>
> OK, looking a bit further I see this option is new to v17.  After a
> bit more study of the sgml, I agree that it's worth changing this.
>
> I've pushed your patch.
>

Thanks very much for pushing. It was just a one-time thing -- I won't
go looking for more examples like it.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Docs: Always use true|false instead of sometimes on|off for the subscription options

2024-05-16 Thread Peter Smith

On Thu, May 16, 2024 at 3:11 PM David Rowley  wrote:
>
> On Thu, 16 May 2024 at 12:29, Peter Smith  wrote:
> > There are lots of subscription options listed on the CREATE
> > SUBSCRIPTION page [1].
> >
> > Although these boolean options are capable of accepting different
> > values like "1|0", "on|off", "true|false", here they are all described
> > only using values "true|false".
>
> If you want to do this, what's the reason to limit it to just this one
> page of the docs?

Yeah, I had a vested interest in this one place because I've been
reviewing the other thread [1] that was mentioned above. If that other
thread chooses "true|false" then putting "true|false" adjacent to
another "on|off" will look silly. But if that other thread adopts the
existing 'failover=on|off' values then it will diverge even further
from being consistent with the CREATE SUBSCRIPTION page.
Unfortunately, that other thread cannot take it upon itself to make
this change because it has nothing to do with the 'failover' option,
So I saw no choice but to post an independent patch for this.

I checked all the PUBLICATION/SUBSCRIPTION reference pages and this
was the only inconsistent value that I found. But I might be mistaken.

>
> If the following is anything to go by, it doesn't seem we're very
> consistent about this over the entire documentation.
>
> doc$ git grep "on" | wc -l
> 122
>
> doc$ git grep "true" | wc -l
> 222
>
> And:
>
> doc$ git grep "off" | wc -l
> 102
>
> doc$ git grep "false" | wc -l
> 162
>

Hmm. I'm not entirely sure if those stats are meaningful because I'm
not saying option values should avoid "on|off" -- the point was I just
suggesting they should be used consistent with where they are
described. For example, the CREATE SUBSCRIPTION page describes every
boolean option value as "true|false", so let's use "true|false" in
every other docs place where those are mentioned. OTOH, other options
on other pages may be described as "on|off" which is fine by me, but
then those should similarly be using "on|off" again wherever they are
mentioned.

> I think unless we're going to standardise on something then there's
> not much point in adjusting individual cases. IMO, there could be an
> endless stream of follow-on patches as a result of accepting this.
>

Standardisation might be ideal, but certainly, I'm not going to
attempt to make a giant patch that impacts the entire documentation
just for this one small improvement.

It seems a shame if "perfect" becomes the enemy of "good"; How else do
you suggest I can make the ALTER SUBSCRIPTION page better? If this
one-line change is rejected then the most likely outcome is nothing
will ever happen to change it.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Pgoutput not capturing the generated columns

2024-05-16 Thread Peter Smith

uldn't they both look like this:

if (att->attisdropped)
  continue;

if (att->attgenerated && !publish_generated_column)
  continue;

if (!column_in_column_list(att->attnum, columns))
  continue;
==
src/backend/replication/pgoutput/pgoutput.c

5.
 static void send_relation_and_attrs(Relation relation, TransactionId xid,
  LogicalDecodingContext *ctx,
- Bitmapset *columns);
+ Bitmapset *columns,
+ bool publish_generated_column);

Use plural. /publish_generated_column/publish_generated_columns/

~~~

6. parse_output_parameters

  bool origin_option_given = false;
+ bool generate_column_option_given = false;

  data->binary = false;
  data->streaming = LOGICALREP_STREAM_OFF;
  data->messages = false;
  data->two_phase = false;
+ data->publish_generated_column = false;

I think the 1st var should be 'include_generated_columns_option_given'
for consistency with the name of the actual option that was given.

==
src/include/replication/logicalproto.h

7.
(Same as a previous review comment)

For all the API changes the new parameter name should be plural.

/publish_generated_column/publish_generated_columns/

==
src/include/replication/pgoutput.h

8.
  bool publish_no_origin;
+ bool publish_generated_column;
 } PGOutputData;

/publish_generated_column/publish_generated_columns/

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Docs: Always use true|false instead of sometimes on|off for the subscription options

2024-05-15 Thread Peter Smith

Hi hackers,

There are lots of subscription options listed on the CREATE
SUBSCRIPTION page [1].

Although these boolean options are capable of accepting different
values like "1|0", "on|off", "true|false", here they are all described
only using values "true|false".

~

IMO this consistent way of describing the boolean option values ought
to be followed also on the ALTER SUBSCRIPTION page [2]. Specifically,
the ALTER SUBSCRIPTION page has one mention of "SET (failover =
on|off)" which I think should be changed to say "SET (failover =
true|false)"

Now this little change hardly seems important, but actually, it is
motivated by another thread [3] under development where this ALTER
SUBSCRIPTION "(failover = on|off)" was copied again, thereby making
the consistency between the CREATE SUBSCRIPTION and ALTER SUBSCRIPTION
pages grow further apart, so I think it is best to nip that in the bud
and simply use "true|false" values everywhere.

PSA a patch to make this proposed change.

==
[1] https://www.postgresql.org/docs/devel/sql-createsubscription.html
[2] https://www.postgresql.org/docs/devel/sql-altersubscription.html
[3] 
https://www.postgresql.org/message-id/flat/8fab8-65d74c80-1-2f28e880%4039088166

Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Always-use-true-false-instead-of-on-off-for-boole.patch
Description: Binary data

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-14 Thread Peter Smith

On Tue, May 14, 2024 at 10:03 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Peter,
>
...
> > 4.11.
> >
> > +# Alter the two_phase with the force_alter option. Apart from the the last
> > +# ALTER SUBSCRIPTION command, the command will abort the prepared
> > transaction
> > +# and succeed.
> >
> > There is typo "the the" and the wording is a bit strange. Why not just say:
> >
> > SUGGESTION
> > Alter the two_phase true to false with the force_alter option enabled.
> > This command will succeed after aborting the prepared transaction.
>
> Fixed.
>

You wrote "Fixed" for that patch v9-0004 suggestion but I don't think
anything was changed at all. Accidentally missed?

==
Kind Regards,
Peter Smith.
Futjisu Australia

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-14 Thread Peter Smith

istBySubid(subid)) != NIL)
{
/* Abort all listed transactions */
foreach_ptr(char, gid, prepared_xacts)
FinishPreparedTransaction(gid, false);

list_free_deep(prepared_xacts);
}
}

/* Change system catalog acoordingly */
values[Anum_pg_subscription_subtwophasestate - 1] =
CharGetDatum(opts.twophase ?
LOGICALREP_TWOPHASE_STATE_PENDING :
LOGICALREP_TWOPHASE_STATE_DISABLED);
replaces[Anum_pg_subscription_subtwophasestate - 1] = true;
}

~

Why is "if (!opts.twophase)" being checked at the top and then
immediately being checed again here:
+ if (!opts.twophase)
+ PreventInTransactionBlock(isTopLevel,
+ "ALTER SUBSCRIPTION ... SET (two_phase = false)");

And then again here:
CharGetDatum(opts.twophase ?
LOGICALREP_TWOPHASE_STATE_PENDING :
LOGICALREP_TWOPHASE_STATE_DISABLED);

There is no need to re-check a flag that was already checked, so
clearly some of this logic/code is either wrong or redundant.

==
src/test/subscription/t/099_twophase_added.pl

(Let's change these on|off to true|false to match what you did already
in patch 0002).

3.3.
+#
+# Check the case that prepared transactions exist on the subscriber node
+#
+# If the two_phase is altering from "on" to "off" and there are prepared
+# transactions on the subscriber, they must be aborted. This test checks it.


/off/false/

/on/true/

~~~

3.4.
+# Verify the prepared transaction has been replicated to the subscriber because
+# two_phase is set to "on".

/on/true/

~~~

3.5.
+# Toggle the two_phase to "off" before the COMMIT PREPARED
+$node_subscriber->safe_psql(
+'postgres', "
+ALTER SUBSCRIPTION regress_sub DISABLE;
+ALTER SUBSCRIPTION regress_sub SET (two_phase = off);
+ALTER SUBSCRIPTION regress_sub ENABLE;");

/off/false/

/two_phase = off/two_phase = false/

~~~

3.6.
+# Verify any prepared transactions are aborted because two_phase is changed to
+# "off".

/off/false/

//
patch v10-0004
//

==
4.g1. GENERAL - document rendering fails

FYI - The document failed to build after I apply patch 0003. Did you try it?

In my environment it reported some unbalanced tags:

ref/create_subscription.sgml:448: parser error : Opening and ending
tag mismatch: link line 436 and para
 
^
ref/create_subscription.sgml:449: parser error : Opening and ending
tag mismatch: para line 435 and listitem


etc.

==
doc/src/sgml/ref/alter_subscription.sgml

4.1.
  
   The two_phase parameter can only be altered when the
-  subscription is disabled. When altering the parameter from
true
-  to false, the backend process checks for any
incomplete
-  prepared transactions done by the logical replication worker (from when
-  two_phase parameter was still true)
-  and, if any are found, those are aborted.
+  subscription is disabled. Altering the parameter from
true
+  to false will give an error when when there are
+  prepared transactions done by the logical replication worker. If you want
+  to alter the parameter forcibly in this case,
+  force_alter
+  option must be set to true at the same time.
  

TYPO: "when when"

Why is necessary to say "at the same time"?

==
doc/src/sgml/ref/create_subscription.sgml

4.2.
+   
+force_alter (boolean)
+
+ 
+  Specifies if the ALTER SUBSCRIPTION
+  can be forced to proceed instead of giving an error. There is
+  currently only one scenario where this parameter has any effect: When
+  altering two_phase option from
true
+  to false it is possible for there to be incomplete
+  prepared transactions done by the logical replication worker (from
+  when two_phase parameter was still
true).
+  If force_alter is false, then
+  this will give an error; if force_alter is
+  true, then the incomplete prepared transactions
+  are aborted and the alter will proceed.
+  The default is false.
+ 
+
+   

IMO this will be better broken into multiple paragraphs.

1. Specifies...
2. There is...
3. The default is...

==
src/test/subscription/t/099_twophase_added.pl

(Let's change all the on|off to true|false like you already did in patch 0002.

4.3.
+# Try altering the two_phase option to "off." The command will fail since there
+# is a prepared transaction and the 'force_alter' option is not specified as
+# true.
+my $stdout;
+my $stderr;

/off./false/

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-13 Thread Peter Smith

-+-+-++---+--+--++---+---+--++-+--
- regress_testsub4 | regress_subscription_user | f   | {testpub}
| f  | off   | d| f| none   |
t | f | f| off|
dbname=regress_doesnotexist | 0/0
+
  List of
subscriptions
+   Name   |   Owner   | Enabled | Publication
| Binary | Streaming | Two-phase commit | Disable on error | Origin |
Password required | Run as owner? | Failover | Force_alter |
Synchronous commit |  Conninfo   | Skip LSN
+--+---+-+-++---+--+--++---+---+--+-++-+--

The column heading should be "Force alter", as already mentioned by an
earlier review comment (#4.7)

==
src/test/subscription/t/099_twophase_added.pl

4.11.

+# Alter the two_phase with the force_alter option. Apart from the the last
+# ALTER SUBSCRIPTION command, the command will abort the prepared transaction
+# and succeed.

There is typo "the the" and the wording is a bit strange. Why not just say:

SUGGESTION
Alter the two_phase true to false with the force_alter option enabled.
This command will succeed after aborting the prepared transaction.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-13 Thread Peter Smith

Hi Kuroda-san,

I'm having second thoughts about how these patches mention the option
values "on|off". These are used in the ALTER SUBSCRIPTION document
page for 'two_phase' and 'failover' parameters, and then those
"on|off" get propagated to the code comments, error messages, and
tests...

Now I see that on the CREATE SUBSCRIPTION page [1], every boolean
parameter (even including 'two_phase' and 'failover') is described in
terms of "true|false" (not "on|off").

In hindsight, it is probably better to refer only to true|false
everywhere for these boolean parameters, instead of sometimes using
different values like on|off.

What do you think?

==
[1] https://www.postgresql.org/docs/devel/sql-createsubscription.html

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-13 Thread Peter Smith

Hi, Here are some review comments for v8-0002.

==
Commit message

1.
Regarding the off->on case, the logical replication already has a mechanism for
it, so there is no need to do anything special for the on->off case; however,
we must connect to the publisher and expressly change the parameter. The
operation cannot be rolled back, and altering the parameter from "on" to "off"
within a transaction is prohibited.

~

I think the difference between "off"-to"on" and "on"-to"off" needs to
be explained in more detail. Specifically "already has a mechanism for
it" seems very vague.

==
src/backend/commands/subscriptioncmds.c

2.
  /*
- * The changed two_phase option of the slot can't be rolled
- * back.
+ * Since the altering two_phase option of subscriptions
+ * also leads to the change of slot option, this command
+ * cannot be rolled back. So prevent we are in the
+ * transaction block.
  */
- PreventInTransactionBlock(isTopLevel, "ALTER SUBSCRIPTION ... SET
(two_phase)");
+ if (!opts.twophase)
+ PreventInTransactionBlock(isTopLevel,
+

2a.
There is a typo: "So prevent we are".

SUGGESTION (minor adjustment and typo fix)
Since altering the two_phase option of subscriptions also leads to
changing the slot option, this command cannot be rolled back. So
prevent this if we are in a transaction block.

~

2b.
But, in my previous review [v7-0002#3] I asked if the comment could
explain why this check is only needed for two_phase "on"-to-"off" but
not for "off"-to-"on". That explanation/reason is still not yet given
in the latest comment.

~~~

3.
  /*
- * Try to acquire the connection necessary for altering slot.
+ * Check the need to alter the replication slot. Failover and two_phase
+ * options are controlled by both the publisher (as a slot option) and the
+ * subscriber (as a subscription option).
+ */
+ update_failover = replaces[Anum_pg_subscription_subfailover - 1];
+ update_two_phase = (replaces[Anum_pg_subscription_subtwophasestate - 1] &&
+ !opts.twophase);


(This is similar to the previous comment). In my previous review
[v7-0002#3a] I asked why update_two_phase is TRUE only if 'two-phase'
is being updated "on"-to-"off", but not when it is being updated
"off"-to-"on". That explanation/reason is still not yet given in the
latest comment.

==
src/backend/replication/libpqwalreceiver/libpqwalreceiver.c

4.
- appendStringInfo(, "ALTER_REPLICATION_SLOT %s ( FAILOVER %s,
TWO_PHASE %s )",
- quote_identifier(slotname),
- failover ? "true" : "false",
- two_phase ? "true" : "false");
+ appendStringInfo(, "ALTER_REPLICATION_SLOT %s ( ",
+ quote_identifier(slotname));
+
+ if (failover)
+ appendStringInfo(, "FAILOVER %s",
+ *failover ? "true" : "false");
+
+ if (failover && two_phase)
+ appendStringInfo(, ", ");
+
+ if (two_phase)
+ appendStringInfo(, "TWO_PHASE %s",
+ *two_phase ? "true" : "false");
+
+ appendStringInfoString(, ");");

It will be better if that last line includes the extra space like I
had suggested in [v7-0002#4a] so the result will have the same spacing
as in the original code. e.g.

+ appendStringInfoString(, " );");

==
src/test/subscription/t/099_twophase_added.pl

5.
+# Check the case that prepared transactions exist on the publisher node.
+#
+# Since the two_phase is "off", then normally, this PREPARE will do nothing
+# until the COMMIT PREPARED, but in this test, we toggle the two_phase to "on"
+# again before the COMMIT PREPARED happens.

This is a major test part so IMO this comment should have
# like it had before, to distinguish it from all
the sub-step comments.

==

My v7-0002 review -
https://www.postgresql.org/message-id/CAHut%2BPtu_w_UCGR-5DbenA%2By7wRiA8QPi_ZP%3DCCJ3SGdTn-%3D%3Dg%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-13 Thread Peter Smith

on done by worker is aborted");

/transaction are aborted/transaction was aborted/

==
Response to my v7-0004 review --
https://www.postgresql.org/message-id/OSBPR01MB2552F738ACF1DA6838025C4FF5E62%40OSBPR01MB2552.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-13 Thread Peter Smith

Hi, Here are some review comments for v8-0003.

==
src/sgml/ref/alter_subscription.sgml

1.
+ 
+  The two_phase parameter can only be altered when the
+  subscription is disabled. When altering the parameter from
on
+  to off, the backend process checks prepared
+  transactions done by the logical replication worker and aborts them.
+ 

The text may be OK as-is, but I was wondering if it might be better to
give a more verbose explanation.

BEFORE
... the backend process checks prepared transactions done by the
logical replication worker and aborts them.

SUGGESTION
... the backend process checks for any incomplete prepared
transactions done by the logical replication worker (from when
two_phase parameter was still "on") and, if any are found, those are
aborted.

==
src/backend/commands/subscriptioncmds.c

2. AlterSubscription

- /*
- * Since the altering two_phase option of subscriptions
- * also leads to the change of slot option, this command
- * cannot be rolled back. So prevent we are in the
- * transaction block.
+ * If two_phase was enabled, there is a possibility the
+ * transactions has already been PREPARE'd. They must be
+ * checked and rolled back.
  */

BEFORE
... there is a possibility the transactions has already been PREPARE'd.

SUGGESTION
... there is a possibility that transactions have already been PREPARE'd.

~~~

3. AlterSubscription
+ /*
+ * Since the altering two_phase option of subscriptions
+ * (especially on->off case) also leads to the
+ * change of slot option, this command cannot be rolled
+ * back. So prevent we are in the transaction block.
+ */
  PreventInTransactionBlock(isTopLevel,
"ALTER SUBSCRIPTION ... SET (two_phase = off)");


This comment is a bit vague and includes some typos, but IIUC these
problems will already be addressed by the 0002 patch changes.AFAIK
patch 0003 is only moving the 0002 comment.

==
src/test/subscription/t/099_twophase_added.pl

4.
+# Check the case that prepared transactions exist on the subscriber node
+#
+# If the two_phase is altering from "on" to "off" and there are prepared
+# transactions on the subscriber, they must be aborted. This test checks it.
+

Similar to the comment that I gave for v8-0002. I think there should
be  comment for the major test comment to
distinguish it from comments for the sub-steps.

~~~

5.
+# Verify the prepared transaction are aborted because two_phase is changed to
+# "off".
+$result = $node_subscriber->safe_psql('postgres',
+"SELECT count(*) FROM pg_prepared_xacts;");
+is($result, q(0), "prepared transaction done by worker is aborted");
+

/the prepared transaction are aborted/any prepared transactions are aborted/

==
Kind Regards,
Peter Smith
Fujitsu Australia

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-09 Thread Peter Smith

Hi, Here are some review comments for patch v7-0004

==
Commit message

1.
A detailed commit message is needed to describe the purpose and
details of this patch.

==
doc/src/sgml/ref/alter_subscription.sgml

2. CREATE SUBSCRIPTION

Shouldn't there be an entry for "force_alter" parameter in the CREATE
SUBSCRIPTION "parameters" section, instead of just vaguely mentioning
it in passing when describing the "two_phase" in ALTER SUBSCRIPTION?

~

3. ALTER SUBSCRIPTION - alterable parameters

And shouldn't this new option also be named in the ALTER SUBSCRIPTION
list: "The parameters that can be altered are..."

==
src/backend/commands/subscriptioncmds.c

4.
  XLogRecPtr lsn;
+ bool twophase_force;
 } SubOpts;

IMO this field ought to be called 'force_alter' to be the same as the
option name. Sure, now it is only relevant for 'two_phase', but that
might not always be the case in the future.

~~~

5. AlterSubscription

+ /*
+ * Abort prepared transactions if force option is also
+ * specified. Otherwise raise an ERROR.
+ */
+ if (!opts.twophase_force)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot alter %s when there are prepared transactions",
+ "two_phase = false")));
+

5a.
/if force option is also specified/only if the 'force_alter' option is true/

~

5b.
"two_phase = false" -- IMO that should say "two_phase = off"

~

5c.
IMO this ereport should include a errhint to tell the user they can
use 'force_alter = true' to avoid getting this error.

~~~

6.

+ /* force_alter cannot be used standalone */
+ if (IsSet(opts.specified_opts, SUBOPT_FORCE_ALTER) &&
+ !IsSet(opts.specified_opts, SUBOPT_TWOPHASE_COMMIT))
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("%s must be specified with %s",
+ "force_alter", "two_phase")));
+

IMO this rule is not necessary so the code should be removed. I think
using 'force_alter' standalone doesn't do anything at all (certainly,
it does no harm) so why add more complications (more rules, more code,
more tests) just for the sake of it?

==
src/test/subscription/t/099_twophase_added.pl

7.
+$node_subscriber->safe_psql('postgres',
+"ALTER SUBSCRIPTION sub SET (two_phase = off, force_alter = on);");

"force" is a verb, so it is better to say 'force_alter = true' instead
of 'force_alter = on'.

~~~

8.
 $result = $node_subscriber->safe_psql('postgres',
 "SELECT count(*) FROM pg_prepared_xacts;");
 is($result, q(0), "prepared transaction done by worker is aborted");

+$node_subscriber->safe_psql('postgres', "ALTER SUBSCRIPTION sub ENABLE;");
+

I felt the ENABLE statement should be above the SELECT statement so
that the code is more like it was before applying the patch.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-09 Thread Peter Smith

Hi, Here are some review comments for the patch v7-0003.

==
Commit Message

1.
The patch needs a commit message to describe the purpose and highlight
any limitations and other details.

==
doc/src/sgml/ref/alter_subscription.sgml

2.
+
+ 
+  The two_phase parameter can only be altered when the
+  subscription is disabled. When altering the parameter from
true
+  to false, the backend process checks prepared
+  transactions done by the logical replication worker and aborts them.
+ 

Here, the para is referring to "true" and "false" but earlier on this
page it talks about "twophase = off". IMO it is better to use a
consistent terminology like "on|off" everywhere instead of randomly
changing the way it is described each time.

==
src/backend/commands/subscriptioncmds.c

3. AlterSubscription

  if (IsSet(opts.specified_opts, SUBOPT_TWOPHASE_COMMIT))
  {
+ List *prepared_xacts = NIL;

This 'prepared_xacts' can be declared at a lower scrope because it is
only used if (!opts.twophase).

Furthermore, IIUC you don't need to assign NIL in the declaration
because there is no chance for it to be unassigned anyway.

~~~

4. AlterSubscription

+ /*
+ * The changed two_phase option (true->false) of the
+ * slot can't be rolled back.
+ */
  PreventInTransactionBlock(isTopLevel,
"ALTER SUBSCRIPTION ... SET (two_phase = off)");

Here is another example of inconsistent mixing of the terminology
where the comment says "true"/"false" but the message says "off".
Let's keep everything consistent. (I prefer on|off).

~~~

5.
+ if (sub->twophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED &&
+ (prepared_xacts = GetGidListBySubid(subid)) != NIL)
+ {
+ ListCell *cell;
+
+ /* Abort all listed transactions */
+ foreach(cell, prepared_xacts)
+ FinishPreparedTransaction((char *) lfirst(cell),
+   false);
+
+ list_free(prepared_xacts);
+ }

5A.
IIRC there is a cleaner way to write this loop without needing
ListCell variable -- e.g. foreach_ptr() macro?

~

5B.
Shouldn't this be using list_free_deep() so the pstrdup gid gets freed too?

==
src/test/subscription/t/099_twophase_added.pl

6.
+##
+# Check the case that prepared transactions exist on subscriber node
+##
+

Give some more detailed comments here similar to the review comment of
patch v7-0002 for the other part of this TAP test.

~~~

7. TAP test - comments

Same as for my v7-0002 review comments, I think this test case also
needs a few more one-line comments to describe the sub-steps. e.g.:

# prepare a transaction to insert some rows to the table

# verify the prepared tx is replicated to the subscriber (because
'two_phase = on')

# toggle the two_phase to 'off' *before* the COMMIT PREPARED

# verify the prepared tx got aborted

# do the COMMIT PREPARED (note that now two_phase is 'off')

# verify the inserted rows got replicated ok

~~~

8. TAP test - subscription name

It's better to rename the SUBSCRIPTION in this TAP test so you can
avoid getting log warnings like:

psql::4: WARNING:  subscriptions created by regression test
cases should have names starting with "regress_"
psql::4: NOTICE:  created replication slot "sub" on publisher

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-08 Thread Peter Smith

over ? "true" : "false");
+
+   if (failover && two_phase)
+   appendStringInfo(, ", ");
+
+ if (two_phase)
+ appendStringInfo(, "TWO_PHASE %s",
+ *two_phase ? "true" : "false");
+
+ appendStringInfoString(, " );");

~~

4b.
Like I said above, IMO the current separator logic in v7 is broken. So
it is a bit concerning the tests all passed anyway. How did that
happen? I think this indicates that there needs to be an additional
test scenario where both 'failover' and 'two_phase' get altered at the
same time so this code gets exercised properly.

==
src/test/subscription/t/099_twophase_added.pl

5.
+# Define pre-existing tables on both nodes

Why say they are "pre-existing"? They are not pre-existing because you
are creating them right here!

~~~

6.
+##
+# Check the case that prepared transactions exist on publisher node
+##

I think this needs a slightly more detailed comment.

SUGGESTION (this is just an example, but you can surely improve it)

# Check the case that prepared transactions exist on the publisher node.
#
# Since two_phase is "off", then normally this PREPARE will do nothing until
# the COMMIT PREPARED, but in this test, we toggle the two_phase to "on" again
# before the COMMIT PREPARED happens.

~~~

7.
Maybe this test case needs a few more one-line comments for each of
the sub-steps. e.g.:

# prepare a transaction to insert some rows to the table

# verify the prepared tx is not yet replicated to the subscriber
(because 'two_phase = off')

# toggle the two_phase to 'on' *before* the COMMIT PREPARED

# verify the inserted rows got replicated ok

~~~

8.
IIUC this test will behave the same even if you DON'T do the toggle
'two_phase = on'. So I wonder is there something more you can do to
test this scenario more convincingly?

==
Kind Regards,
Peter Smith
Fujitsu Australia

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-08 Thread Peter Smith

Hi Kuroda-san,

Thanks for addressing most of my v6-0001 review comments.

Below are some minor follow-up comments for v7-0001.

==
src/backend/access/transam/twophase.c

1. IsTwoPhaseTransactionGidForSubid

+/*
+ * IsTwoPhaseTransactionGidForSubid
+ * Check whether the given GID is formed by TwoPhaseTransactionGid.
+ */
+static bool
+IsTwoPhaseTransactionGidForSubid(Oid subid, char *gid)

I think the function comment should mention something about 'subid'.

SUGGESTION
Check whether the given GID (as formed by TwoPhaseTransactionGid) is
for the specified 'subid'.

==
src/backend/commands/subscriptioncmds.c

2. AlterSubscription

+ if (!opts.twophase &&
+ form->subtwophasestate == LOGICALREP_TWOPHASE_STATE_ENABLED &&
+ LookupGXactBySubid(subid))
+ /* Add error message */
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot disable two_phase when uncommitted prepared
transactions present"),
+ errhint("Resolve these transactions and try again")));

The comment "/* Add error message */" seems unnecessary.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-05-07 Thread Peter Smith

 see
that this code was extracted from where it was previously inlined in
subscriptioncmds.c, so maybe the 'false' is necessary for another
reason? At least maybe some explanatory comment is needed for why you
are passing this flag as false?

==
src/backend/replication/logical/worker.c

11.
- /* two-phase should not be altered */
+ /* two-phase should not be altered while the worker exists */
  Assert(newsub->twophasestate == MySubscription->twophasestate);
/should not/cannot/

==
src/backend/replication/slot.c

12.
 void
-ReplicationSlotAlter(const char *name, bool failover)
+ReplicationSlotAlter(const char *name, bool two_phase, bool failover)

Same comment as mentioned elsewhere (#15), IMO the new 'two_phase'
parameter should be last.

~~~

13.
+ if (MyReplicationSlot->data.two_phase != two_phase)
+ {
+ SpinLockAcquire(>mutex);
+ MyReplicationSlot->data.two_phase = two_phase;
+ SpinLockRelease(>mutex);
+
+ update_slot = true;
+ }
+
+
  if (MyReplicationSlot->data.failover != failover)
  {
  SpinLockAcquire(>mutex);
  MyReplicationSlot->data.failover = failover;
  SpinLockRelease(>mutex);

+ update_slot = true;
+ }

13a.
Doesn't it make more sense for the whole check/set to be "atomic",
i.e. put the mutex also around the check?

SUGGEST
SpinLockAcquire(>mutex);
if (MyReplicationSlot->data.two_phase != two_phase)
{
  MyReplicationSlot->data.two_phase = two_phase;
  update_slot = true;
}
SpinLockRelease(>mutex);

~

Also, (if you agree with the above) why not include both checks
(two_phase and failover) within the same mutex instead of
acquiring/releasing it twice:

SUGGEST
SpinLockAcquire(>mutex);
if (MyReplicationSlot->data.two_phase != two_phase)
{
  MyReplicationSlot->data.two_phase = two_phase;
  update_slot = true;
}
if (MyReplicationSlot->data.failover != failover)
{
  MyReplicationSlot->data.failover = failover;
  update_slot = true;
}
SpinLockAcquire(>mutex);

~~~

13b.
There are double blank lines after the first if-block.

==
src/backend/replication/walsender.c

14.
 static void
-ParseAlterReplSlotOptions(AlterReplicationSlotCmd *cmd, bool *failover)
+ParseAlterReplSlotOptions(AlterReplicationSlotCmd *cmd,
+   bool *two_phase, bool *failover)

Same comment as mentioned elsewhere (#15), IMO the new 'two_phase'
parameter should be last.

==
src/include/replication/walreceiver.h

15.
 typedef void (*walrcv_alter_slot_fn) (WalReceiverConn *conn,
const char *slotname,
+   bool two_phase,
bool failover);

Somehow, I feel it is more normal to add the new code (the 'two_phase'
parameter) at the END, instead of into the middle of the existing
parameters. It also keeps it alphabetical which makes it consistent
with other places like the tab-completion code.

This comment about swapping the order (putting new stuff last) will
propagate changes to lots of other related places. I refer to this
comment in a few other places in this post but there are probably more
the same that I missed.

==
src/test/regress/sql/subscription.sql

16.
I know you do this already in the TAP test, but doesn't the test case
to demonstrate that 'two-phase' option can be altered when the
subscription is disabled actually belong here in the regression
instead?

==
src/test/subscription/t/021_twophase.pl

17.
+# Disable the subscription and alter it to two_phase = false,
+# verify that the altered subscription reflects the two_phase option.

/verify/then verify/

~~~

18.
+# Now do a prepare on publisher and make sure that it is not replicated.
+$node_subscriber->safe_psql('postgres', "DROP SUBSCRIPTION tap_sub");
+$node_publisher->safe_psql(
+   'postgres', qq{
+BEGIN;
+INSERT INTO tab_copy VALUES (100);
+PREPARE TRANSACTION 'newgid';
+ });
+

18a.
/on publisher/on the publisher/

18b.
What is that "DROP SUBSCRIPTION tap_sub" doing here? It seems
misplaced under this comment.

~~~

19.
+# Make sure that there is 0 prepared transaction on the subscriber
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT count(*) FROM pg_prepared_xacts;");
+is($result, qq(0), 'transaction is prepared on subscriber');

19a.
SUGGESTION
Make sure there are no prepared transactions on the subscriber

~~~

19b.
/'transaction is prepared on subscriber'/'should be no prepared
transactions on subscriber'/

~~~

20.
+# Made sure that the commited transaction is replicated.

/Made sure/Make sure/

/commited/committed/

~~~

21.
+# Make sure that the two-phase is enabled on the subscriber
+$result = $node_subscriber->safe_psql('postgres',
+ "SELECT subtwophasestate FROM pg_subscription WHERE subname = 'tap_sub_copy';"
+);
+is($result, qq(e), 'two-phase is disabled');

The 'two-phase is disabled' is the identical message used in the
opposite case earlier, so something is amiss. Maybe this one should
say 'two-phase should be enabled' and the earlier counterpart should
say 'two-phase should be disabled'.

==
Kind Regards,
Peter Smith
Fujitsu Australia

Re: Improve the connection failure error messages

2024-04-25 Thread Peter Smith

Hi, just by visual inspection of the v3/v4 patch diffs, the latest v4 LGTM.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: logicalrep_worker_launch -- counting/checking the worker limits

2024-04-22 Thread Peter Smith

On Sun, Mar 31, 2024 at 8:12 PM Andrey M. Borodin  wrote:
>
>
>
> > On 15 Aug 2023, at 07:38, Peter Smith  wrote:
> >
> > A rebase was needed due to a recent push [1].
> >
> > PSA v3.
>
>
> > On 14 Jan 2024, at 10:43, vignesh C  wrote:
> >
> > I have changed the status of the patch to "Waiting on Author" as
> > Amit's queries at [1] have not been verified and concluded. Please
> > feel free to address them and change the status back again.
>
> Hi Peter!
>
> Are you still interested in this thread? If so - please post an answer to 
> Amit's question.
> If you are not interested - please Withdraw a CF entry [0].
>
> Thanks!

Yeah, sorry for the long period of inactivity on this thread. Although
I still have some interest in it, I don't know when I'll get back to
it again so meantime I've withdrawn this from the CF as requested.

Kind regards,
Peter Smith
Fujitsu Australia

Re: Improve the connection failure error messages

2024-03-12 Thread Peter Smith

FYI -- some more code has been pushed since this patch was last
updated. AFAICT perhaps you'll want to update this patch again for the
following new connection messages on HEAD:

- slotfuncs.c [1]
- slotsync.c [2]

--
[1] 
https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/slotfuncs.c#L989
[2] 
https://github.com/postgres/postgres/blob/0b84f5c419a300dc1b1a70cf63b9907208e52643/src/backend/replication/logical/slotsync.c#L1258

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Improve eviction algorithm in ReorderBuffer

2024-03-12 Thread Peter Smith

On Wed, Mar 13, 2024 at 12:48 PM Masahiko Sawada  wrote:
>
> On Wed, Mar 13, 2024 at 10:15 AM Peter Smith  wrote:
> >
> > On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada  
> > wrote:
> > >
> > > On Fri, Mar 8, 2024 at 12:58 PM Peter Smith  wrote:
> > > >
> > ...
> > > > > > 5.
> > > > > > + *
> > > > > > + * If 'indexed' is true, we create a hash table to track of each 
> > > > > > node's
> > > > > > + * index in the heap, enabling to perform some operations such as 
> > > > > > removing
> > > > > > + * the node from the heap.
> > > > > >   */
> > > > > >  binaryheap *
> > > > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, 
> > > > > > void *arg)
> > > > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > > > > > + bool indexed, void *arg)
> > > > > >
> > > > > > BEFORE
> > > > > > ... enabling to perform some operations such as removing the node 
> > > > > > from the heap.
> > > > > >
> > > > > > SUGGESTION
> > > > > > ... to help make operations such as removing nodes more efficient.
> > > > > >
> > > > >
> > > > > But these operations literally require the indexed binary heap as we
> > > > > have an assertion:
> > > > >
> > > > > void
> > > > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> > > > > {
> > > > > bh_nodeidx_entry *ent;
> > > > >
> > > > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > > > > Assert(heap->bh_indexed);
> > > > >
> > > >
> > > > I didn’t quite understand -- the operations mentioned are "operations
> > > > such as removing the node", but binaryheap_remove_node() also removes
> > > > a node from the heap. So I still felt the comment wording of the patch
> > > > is not quite correct.
> > >
> > > Now I understand your point. That's a valid point.
> > >
> > > >
> > > > Now, if the removal of a node from an indexed heap can *only* be done
> > > > using binaryheap_remove_node_ptr() then:
> > > > - the other removal functions (binaryheap_remove_*) probably need some
> > > > comments to make sure nobody is tempted to call them directly for an
> > > > indexed heap.
> > > > - maybe some refactoring and assertions are needed to ensure those
> > > > *cannot* be called directly for an indexed heap.
> > > >
> > >
> > > If the 'index' is true, the caller can not only use the existing
> > > functions but also newly added functions such as
> > > binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
> > > something like below?
> > >
> >
> > You said: "can not only use the existing functions but also..."
> >
> > Hmm. Is that right? IIUC those existing "remove" functions should NOT
> > be called directly if the heap was "indexed" because they'll delete
> > the node from the heap OK, but any corresponding index for that
> > deleted node will be left lying around -- i.e. everything gets out of
> > sync. This was the reason for my original concern.
> >
>
> All existing binaryheap functions should be available even if the
> binaryheap is 'indexed'. For instance, with the patch,
> binaryheap_remote_node() is:
>
> void
> binaryheap_remove_node(binaryheap *heap, int n)
> {
> int cmp;
>
> Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> Assert(n >= 0 && n < heap->bh_size);
>
> /* compare last node to the one that is being removed */
> cmp = heap->bh_compare(heap->bh_nodes[--heap->bh_size],
>heap->bh_nodes[n],
>heap->bh_arg);
>
> /* remove the last node, placing it in the vacated entry */
> replace_node(heap, n, heap->bh_nodes[heap->bh_size]);
>
> /* sift as needed to preserve the heap property */
> if (cmp > 0)
> sift_up(heap, n);
> else if (cmp < 0)
> sift_down(heap, n);
> }
>
> The replace_node(), sift_up() and sift_down() update node's index as
> well if the bin

Re: Improve eviction algorithm in ReorderBuffer

2024-03-12 Thread Peter Smith

On Tue, Mar 12, 2024 at 4:23 PM Masahiko Sawada  wrote:
>
> On Fri, Mar 8, 2024 at 12:58 PM Peter Smith  wrote:
> >
...
> > > > 5.
> > > > + *
> > > > + * If 'indexed' is true, we create a hash table to track of each node's
> > > > + * index in the heap, enabling to perform some operations such as 
> > > > removing
> > > > + * the node from the heap.
> > > >   */
> > > >  binaryheap *
> > > > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void 
> > > > *arg)
> > > > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > > > + bool indexed, void *arg)
> > > >
> > > > BEFORE
> > > > ... enabling to perform some operations such as removing the node from 
> > > > the heap.
> > > >
> > > > SUGGESTION
> > > > ... to help make operations such as removing nodes more efficient.
> > > >
> > >
> > > But these operations literally require the indexed binary heap as we
> > > have an assertion:
> > >
> > > void
> > > binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> > > {
> > > bh_nodeidx_entry *ent;
> > >
> > > Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> > > Assert(heap->bh_indexed);
> > >
> >
> > I didn’t quite understand -- the operations mentioned are "operations
> > such as removing the node", but binaryheap_remove_node() also removes
> > a node from the heap. So I still felt the comment wording of the patch
> > is not quite correct.
>
> Now I understand your point. That's a valid point.
>
> >
> > Now, if the removal of a node from an indexed heap can *only* be done
> > using binaryheap_remove_node_ptr() then:
> > - the other removal functions (binaryheap_remove_*) probably need some
> > comments to make sure nobody is tempted to call them directly for an
> > indexed heap.
> > - maybe some refactoring and assertions are needed to ensure those
> > *cannot* be called directly for an indexed heap.
> >
>
> If the 'index' is true, the caller can not only use the existing
> functions but also newly added functions such as
> binaryheap_remove_node_ptr() and binaryheap_update_up() etc. How about
> something like below?
>

You said: "can not only use the existing functions but also..."

Hmm. Is that right? IIUC those existing "remove" functions should NOT
be called directly if the heap was "indexed" because they'll delete
the node from the heap OK, but any corresponding index for that
deleted node will be left lying around -- i.e. everything gets out of
sync. This was the reason for my original concern.

>  * If 'indexed' is true, we create a hash table to track each node's
>  * index in the heap, enabling to perform some operations such as
>  * binaryheap_remove_node_ptr() etc.
>

Yeah, something like that... I'll wait for the next patch version
before commenting further.

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Improve eviction algorithm in ReorderBuffer

2024-03-11 Thread Peter Smith

ing:

Assert(!binaryheap_empty(rb->txn_heap));

~~~

10.
+
+/*
+ * Compare between sizes of two transactions. This is for a binary heap
+ * comparison function.
+ */
+static int
+ReorderBufferTXNSizeCompare(Datum a, Datum b, void *arg)

10a.
/Compare between sizes of two transactions./Compare two transactions by size./

~~~

10b.
IMO this comparator function belongs just before the
ReorderBufferAllocate() function since that is the only place where it
is used.

==
src/include/replication/reorderbuffer.h

11.
+/* State of how to track the memory usage of each transaction being decoded */
+typedef enum ReorderBufferMemTrackState
+{
+ /*
+ * We don't update max-heap while updating the memory counter. The
+ * max-heap is built before use.
+ */
+ REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP,
+
+ /*
+ * We also update the max-heap when updating the memory counter so the
+ * heap property is always preserved.
+ */
+ REORDER_BUFFER_MEM_TRACK_MAINTAIN_MAXHEAP,
+} ReorderBufferMemTrackState;
+

In my GENERAL review comment #0, I suggested the removal of this
entire enum. e.g. It could be replaced with a boolean field
'track_txn_sizes'

TBH, I think there is a better way to handle this "state". IIUC
- the txn_heap is always allocated up-front.
- you only "build" it when > threshold and
- when it drops < 0.9 x threshold you reset it.

Therefore, AFAICT you do not need to maintain any “switch states” at
all; you simply need to check binaryheap_empty(txn_heap), right?
* If the heap is empty…. It means you are NOT tracking, so don’t use it
* If the heap is NOT empty …. It means you ARE tracking, so use it.

~

Using my idea to remove the state flag will have the side effect of
simplifying many other parts of this patch. For example

BEFORE
+static void
+ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
+{
+ if (rb->memtrack_state == REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP)
+ return;
+
...
+ if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
+ {
+ rb->memtrack_state = REORDER_BUFFER_MEM_TRACK_NO_MAXHEAP;
+ binaryheap_reset(rb->txn_heap);
+ }
+}

AFTER
+static void
+ReorderBufferMaybeChangeNoMaxHeap(ReorderBuffer *rb)
+{
+ if (binaryheap_empty(rb->txn_heap))
+ return;
+
...
+ if (binaryheap_size(rb->txn_heap) < REORDER_BUFFER_MEM_TRACK_THRESHOLD * 0.9)
+ binaryheap_reset(rb->txn_heap);
+}

~~~

12. struct ReorderBuffer

+ /* Max-heap for sizes of all top-level and sub transactions */
+ ReorderBufferMemTrackState memtrack_state;
+ binaryheap *txn_heap;
+

12a.
Why is this being referred to in the commit message and code comments
as "max-heap" when the field is not called by that same name? Won't it
be better to give the field a better name -- e.g. "txn_maxheap" or
similar?

~

12b.
This comment should also say that the heap is ordered by tx size --
(e.g. the comparator is ReorderBufferTXNSizeCompare)

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Improve eviction algorithm in ReorderBuffer

2024-03-07 Thread Peter Smith

On Thu, Mar 7, 2024 at 2:16 PM Masahiko Sawada  wrote:
>
> On Tue, Mar 5, 2024 at 3:28 PM Peter Smith  wrote:
> >

> > 4a.
> > The comment in simplehash.h says
> >  *   The following parameters are only relevant when SH_DEFINE is defined:
> >  *   - SH_KEY - ...
> >  *   - SH_EQUAL(table, a, b) - ...
> >  *   - SH_HASH_KEY(table, key) - ...
> >  *   - SH_STORE_HASH - ...
> >  *   - SH_GET_HASH(tb, a) - ...
> >
> > So maybe it is nicer to reorder the #defines in that same order?
> >
> > SUGGESTION:
> > +#define SH_PREFIX bh_nodeidx
> > +#define SH_ELEMENT_TYPE bh_nodeidx_entry
> > +#define SH_KEY_TYPE bh_node_type
> > +#define SH_SCOPE extern
> > +#ifdef FRONTEND
> > +#define SH_RAW_ALLOCATOR pg_malloc0
> > +#endif
> >
> > +#define SH_DEFINE
> > +#define SH_KEY key
> > +#define SH_EQUAL(tb, a, b) (memcmp(, , sizeof(bh_node_type)) == 0)
> > +#define SH_HASH_KEY(tb, key) \
> > + hash_bytes((const unsigned char *) , sizeof(bh_node_type))
> > +#include "lib/simplehash.h"
>
> I'm really not sure it helps increase readability. For instance, for
> me it's readable if SH_DEFINE and SH_DECLARE come to the last before
> #include since it's more obvious whether we want to declare, define or
> both. Other simplehash.h users also do so.
>

OK.

> > 5.
> > + *
> > + * If 'indexed' is true, we create a hash table to track of each node's
> > + * index in the heap, enabling to perform some operations such as removing
> > + * the node from the heap.
> >   */
> >  binaryheap *
> > -binaryheap_allocate(int capacity, binaryheap_comparator compare, void *arg)
> > +binaryheap_allocate(int capacity, binaryheap_comparator compare,
> > + bool indexed, void *arg)
> >
> > BEFORE
> > ... enabling to perform some operations such as removing the node from the 
> > heap.
> >
> > SUGGESTION
> > ... to help make operations such as removing nodes more efficient.
> >
>
> But these operations literally require the indexed binary heap as we
> have an assertion:
>
> void
> binaryheap_remove_node_ptr(binaryheap *heap, bh_node_type d)
> {
> bh_nodeidx_entry *ent;
>
> Assert(!binaryheap_empty(heap) && heap->bh_has_heap_property);
> Assert(heap->bh_indexed);
>

I didn’t quite understand -- the operations mentioned are "operations
such as removing the node", but binaryheap_remove_node() also removes
a node from the heap. So I still felt the comment wording of the patch
is not quite correct.

Now, if the removal of a node from an indexed heap can *only* be done
using binaryheap_remove_node_ptr() then:
- the other removal functions (binaryheap_remove_*) probably need some
comments to make sure nobody is tempted to call them directly for an
indexed heap.
- maybe some refactoring and assertions are needed to ensure those
*cannot* be called directly for an indexed heap.

> > 7b.
> > IMO the parameters would be better the other way around (e.g. 'index'
> > before the 'node') because that's what the assignments look like:
> >
> >
> > heap->bh_nodes[heap->bh_size] = d;
> >
> > becomes:
> > bh_set_node(heap, heap->bh_size, d);
> >
>
> I think it assumes heap->bh_nodes is an array. But if we change it in
> the future, it will no longer make sense. I think it would make more
> sense if we define the parameters in an order like "we set the 'node'
> at 'index'". What do you think?

YMMV. The patch code is also OK by me if you prefer it.

//

And, here are some review comments for v8-0002.

==
1. delete_nodeidx

+/*
+ * Remove the node's index from the hash table if the heap is indexed.
+ */
+static bool
+delete_nodeidx(binaryheap *heap, bh_node_type node)
+{
+ if (!binaryheap_indexed(heap))
+ return false;
+
+ return bh_nodeidx_delete(heap->bh_nodeidx, node);
+}

1a.
In v8 this function was changed to now return bool, so, I think the
function comment should explain the meaning of that return value.

~

1b.
I felt the function body is better expressed positively: "If this then
do that", instead of "If not this then do nothing otherwise do that"

SUGGESTION
if (binaryheap_indexed(heap))
  return bh_nodeidx_delete(heap->bh_nodeidx, node);

return false;

~~~

2.
+static void
+replace_node(binaryheap *heap, int index, bh_node_type new_node)
+{
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ /* Quick return if not necessary to move */
+ if (heap->bh_nodes[index] == new_node)
+ return;
+
+ /*
+ * Remove overwritten node's index. The overwritten node's position must
+ * have been tracked, if enabled.
+ */
+ found = delete_nodeidx(heap, heap->bh_nodes[index]);

Re: Synchronizing slots from primary to standby

2024-03-06 Thread Peter Smith

Here are some review comments for v107-0001

==
src/backend/replication/slot.c

1.
+/*
+ * Struct for the configuration of standby_slot_names.
+ *
+ * Note: this must be a flat representation that can be held in a single chunk
+ * of guc_malloc'd memory, so that it can be stored as the "extra" data for the
+ * standby_slot_names GUC.
+ */
+typedef struct
+{
+ int slot_num;
+
+ /* slot_names contains nmembers consecutive nul-terminated C strings */
+ char slot_names[FLEXIBLE_ARRAY_MEMBER];
+} StandbySlotConfigData;
+

1a.
To avoid any ambiguity this 1st field is somehow a slot ID number, I
felt a better name would be 'nslotnames' or even just 'n' or 'count',

~

1b.
(fix typo)

SUGGESTION for the 2nd field comment
slot_names is a chunk of 'n' X consecutive null-terminated C strings

~

1c.
A more explanatory name for this typedef maybe is 'StandbySlotNamesConfigData' ?

~~~


2.
+/* This is parsed and cached configuration for standby_slot_names */
+static StandbySlotConfigData *standby_slot_config;


2a.
/This is parsed and cached configuration for .../This is the parsed
and cached configuration for .../

~

2b.
Similar to above -- since this only has name information maybe it is
more correct to call it 'standby_slot_names_config'?

~~~

3.
+/*
+ * A helper function to validate slots specified in GUC standby_slot_names.
+ *
+ * The rawname will be parsed, and the parsed result will be saved into
+ * *elemlist.
+ */
+static bool
+validate_standby_slots(char *rawname, List **elemlist)

/and the parsed result/and the result/

~~~

4. check_standby_slot_names

+ /* Need a modifiable copy of string */
+ rawname = pstrdup(*newval);

/copy of string/copy of the GUC string/

~~~

5.
+assign_standby_slot_names(const char *newval, void *extra)
+{
+ /*
+ * The standby slots may have changed, so we must recompute the oldest
+ * LSN.
+ */
+ ss_oldest_flush_lsn = InvalidXLogRecPtr;
+
+ standby_slot_config = (StandbySlotConfigData *) extra;
+}

To avoid leaking don't we need to somewhere take care to free any
memory used by a previous value (if any) of this
'standby_slot_config'?

~~~

6. AcquiredStandbySlot

+/*
+ * Return true if the currently acquired slot is specified in
+ * standby_slot_names GUC; otherwise, return false.
+ */
+bool
+AcquiredStandbySlot(void)
+{
+ const char *name;
+
+ /* Return false if there is no value in standby_slot_names */
+ if (standby_slot_config == NULL)
+ return false;
+
+ name = standby_slot_config->slot_names;
+ for (int i = 0; i < standby_slot_config->slot_num; i++)
+ {
+ if (strcmp(name, NameStr(MyReplicationSlot->data.name)) == 0)
+ return true;
+
+ name += strlen(name) + 1;
+ }
+
+ return false;
+}

6a.
Just checking "(standby_slot_config == NULL)" doesn't seem enough to
me, because IIUIC it is possible when 'standby_slot_names' has no
value then maybe standby_slot_config is not NULL but
standby_slot_config->slot_num is 0.

~

6b.
IMO this function would be tidier written such that the
MyReplicationSlot->data.name is passed as a parameter. Then you can
name the function more naturally like:

IsSlotInStandbySlotNames(const char *slot_name)

~

6c.
IMO the body of the function will be tidier if written so there are
only 2 returns instead of 3 like

SUGGESTION:
if (...)
{
  for (...)
  {
 ...
return true;
  }
}
return false;

~~~

7.
+ /*
+ * Don't need to wait for the standbys to catch up if there is no value in
+ * standby_slot_names.
+ */
+ if (standby_slot_config == NULL)
+ return true;

(similar to a previous review comment)

This check doesn't seem enough because IIUIC it is possible when
'standby_slot_names' has no value then maybe standby_slot_config is
not NULL but standby_slot_config->slot_num is 0.

~~~

8. WaitForStandbyConfirmation

+ /*
+ * Don't need to wait for the standby to catch up if the current acquired
+ * slot is not a logical failover slot, or there is no value in
+ * standby_slot_names.
+ */
+ if (!MyReplicationSlot->data.failover || !standby_slot_config)
+ return;

(similar to a previous review comment)

IIUIC it is possible that when 'standby_slot_names' has no value, then
standby_slot_config is not NULL but standby_slot_config->slot_num is
0. So shouldn't that be checked too?

Perhaps it is convenient to encapsulate this check using some macro:
#define StandbySlotNamesHasNoValue() (standby_slot_config = NULL ||
standby_slot_config->slot_num == 0)

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Improve eviction algorithm in ReorderBuffer

2024-03-04 Thread Peter Smith

h_indexed)
  (void) bh_nodeidx_delete(heap->bh_nodeidx, node);

~~~

10.
+/*
+ * Replace the node at 'idx' with the given node 'replaced_by'. Also
+ * update their positions accordingly.
+ */
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)

10a.
Would 'node' or 'new_node' or 'replacement' be a better name than 'replaced_by'?

~

10b.
I noticed that the index param is called 'idx' here but in other
functions, it is called 'index'. I think either is good (I prefer
'idx') but at least everywhere should use the same name for
consistency.

~~~

11.
+static void
+bh_replace_node(binaryheap *heap, int idx, bh_node_type replaced_by)
+{
+ /* Remove overwritten node's index */
+ bh_delete_nodeidx(heap, heap->bh_nodes[idx]);
+
+ /* Replace it with the given new node */
+ if (idx < heap->bh_size)
+ {
+ bool found PG_USED_FOR_ASSERTS_ONLY;
+
+ found = bh_set_node(heap, replaced_by, idx);
+
+ /* The overwritten node's index must already be tracked */
+ Assert(!heap->bh_indexed || found);
+ }
+}

I did not understand the condition.
e.g. Can you explain when is idx NOT less than heap->bh_size?
e.g. If this condition failed then nothing gets replaced (??)

~~~

==
src/include/lib/binaryheap.h

12.
+/*
+ * Struct for A hash table element to store the node's index in the bh_nodes
+ * array.
+ */
+typedef struct bh_nodeidx_entry

/for A hash table/for a hash table/

~~~

13.
+/* define parameters necessary to generate the hash table interface */

Suggest uppercase "Define" and add a period.

~~~

14.
+
+ /*
+ * If bh_indexed is true, the bh_nodeidx is used to track of each node's
+ * index in bh_nodes. This enables the caller to perform
+ * binaryheap_remove_node_ptr(), binaryheap_update_up/down in O(log n).
+ */
+ bool bh_indexed;
+ bh_nodeidx_hash *bh_nodeidx;
 } binaryheap;

I'm wondering why the separate 'bh_indexed' is necessary at all. Can't
you just use the bh_nodeidx value? E.g. If bh_nodeidx == NULL then it
means there is no index tracking, otherwise there is.

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Improve eviction algorithm in ReorderBuffer

2024-03-04 Thread Peter Smith

Hi, Here are some review comments for v7-0001

1.
/*
 * binaryheap_free
 *
 * Releases memory used by the given binaryheap.
 */
void
binaryheap_free(binaryheap *heap)
{
pfree(heap);
}


Shouldn't the above function (not modified by the patch) also firstly
free the memory allocated for the heap->bh_nodes?

~~~

2.
+/*
+ * Make sure there is enough space for nodes.
+ */
+static void
+bh_enlarge_node_array(binaryheap *heap)
+{
+ heap->bh_space *= 2;
+ heap->bh_nodes = repalloc(heap->bh_nodes,
+   sizeof(bh_node_type) * heap->bh_space);
+}

Strictly speaking, this function doesn't really "Make sure" of
anything because the caller does the check whether we need more space.
All that happens here is allocating more space. Maybe this function
comment should say something like "Double the space allocated for
nodes."

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-03-04 Thread Peter Smith

Here are some review comments for v105-0001

==
doc/src/sgml/config.sgml

1.
+   
+The standbys corresponding to the physical replication slots in
+standby_slot_names must configure
+sync_replication_slots = true so they can receive
+logical failover slots changes from the primary.
+   

/slots changes/slot changes/

==
doc/src/sgml/func.sgml

2.
+The function may be waiting if the specified slot is a logical failover
+slot and standby_slot_names
+is configured.

I know this has been through multiple versions already, but this
latest wording "may be waiting..." doesn't seem very good to me.

How about one of these?

* The function may not be able to return immediately if the specified
slot is a logical failover slot and standby_slot_names is configured.

* The function return might be blocked if the specified slot is a
logical failover slot and standby_slot_names is configured.

* If the specified slot is a logical failover slot then the function
will block until all physical slots specified in standby_slot_names
have confirmed WAL receipt.

* If the specified slot is a logical failover slot then the function
will not return until all physical slots specified in
standby_slot_names have confirmed WAL receipt.

~~~

3.
+slot may return to an earlier position. The function may be waiting if
+the specified slot is a logical failover slot and
+standby_slot_names


Same as previous review comment #2

==
src/backend/replication/slot.c

4. WaitForStandbyConfirmation

+ * Used by logical decoding SQL functions that acquired logical failover slot.

IIUC it doesn't work like that. pg_logical_slot_get_changes_guts()
calls here unconditionally (i.e. the SQL functions don't even check if
they are failover slots before calling this) so the comment seems
misleading/redundant.

==
src/backend/replication/walsender.c

5. NeedToWaitForWal

+ /*
+ * Check if the standby slots have caught up to the flushed position. It
+ * is good to wait up to the flushed position and then let the WalSender
+ * send the changes to logical subscribers one by one which are already
+ * covered by the flushed position without needing to wait on every change
+ * for standby confirmation.
+ */
+ if (NeedToWaitForStandbys(flushed_lsn, wait_event))
+ return true;
+
+ *wait_event = 0;
+ return false;
+}
+

5a.
The comment (or part of it?) seems misplaced because it is talking
WalSender sending changes, but that is not happening in this function.

Also, isn't what this is saying already described by the other comment
in the caller? e.g.:

+ /*
+ * Within the loop, we wait for the necessary WALs to be flushed to
+ * disk first, followed by waiting for standbys to catch up if there
+ * are enough WALs or upon receiving the shutdown signal. To avoid the
+ * scenario where standbys need to catch up to a newer WAL location in
+ * each iteration, we update our idea of the currently flushed
+ * position only if we are not waiting for standbys to catch up.
+ */

~

5b.
Most of the code is unnecessary. AFAICT all this is exactly same as just 1 line:

return NeedToWaitForStandbys(flushed_lsn, wait_event);

~~~

6. WalSndWaitForWal

+ /*
+ * Within the loop, we wait for the necessary WALs to be flushed to
+ * disk first, followed by waiting for standbys to catch up if there
+ * are enough WALs or upon receiving the shutdown signal. To avoid the
+ * scenario where standbys need to catch up to a newer WAL location in
+ * each iteration, we update our idea of the currently flushed
+ * position only if we are not waiting for standbys to catch up.
+ */

Regarding that 1st sentence: maybe this logic used to be done
explicitly "within the loop" but IIUC this logic is now hidden inside
NeedToWaitForWal() so the comment should mention that.

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-03-03 Thread Peter Smith

On Mon, Mar 4, 2024 at 2:38 PM Amit Kapila  wrote:
>
> On Mon, Mar 4, 2024 at 6:57 AM Peter Smith  wrote:
> >
> > On Sun, Mar 3, 2024 at 2:56 PM Amit Kapila  wrote:
> > >
> > > On Sun, Mar 3, 2024 at 5:17 AM Peter Smith  wrote:
> > > >
> > ...
> > > > 9. NeedToWaitForWal
> > > >
> > > > + /*
> > > > + * Check if the standby slots have caught up to the flushed position. 
> > > > It
> > > > + * is good to wait up to flushed position and then let it send the 
> > > > changes
> > > > + * to logical subscribers one by one which are already covered in 
> > > > flushed
> > > > + * position without needing to wait on every change for standby
> > > > + * confirmation. Note that after receiving the shutdown signal, an 
> > > > ERROR
> > > > + * is reported if any slots are dropped, invalidated, or inactive. This
> > > > + * measure is taken to prevent the walsender from waiting indefinitely.
> > > > + */
> > > > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event))
> > > > + return true;
> > > >
> > > > I felt it was confusing things for this function to also call to the
> > > > other one -- it seems an overlapping/muddling of the purpose of these.
> > > > I think it will be easier to understand if the calling code just calls
> > > > to one or both of these functions as required.
> > > >
> > >
> > > I felt otherwise because the caller has to call these functions at
> > > more than one place which makes the caller's code difficult to follow.
> > > It is better to encapsulate the computation of wait_event.
> > >
> >
> > You may have misinterpreted my review comment because I didn't say
> > anything about changing the encapsulation of the computation of the
> > wait_event.
> >
>
> No, I have understood it in the same way as you have outlined in this
> email and liked the way the current patch has it.
>

OK, if the code will remain as-is wouldn't it be better to anyway
change the function name to indicate what it really does?

e.g.  NeedToWaitForWal --> NeedToWaitForWalFlushOrStandbys

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-03-03 Thread Peter Smith

On Mon, Mar 4, 2024 at 2:49 PM Amit Kapila  wrote:
>
> On Mon, Mar 4, 2024 at 7:25 AM Peter Smith  wrote:
> >
> >
> > ==
> > doc/src/sgml/config.sgml
> >
> > 2.
> > +pg_logical_slot_peek_changes,
> > +when used with failover enabled logical slots, will block until all
> > +physical slots specified in standby_slot_names 
> > have
> > +confirmed WAL receipt.
> >
> > /failover enabled logical slots/failover-enabled logical slots/
> >
>
> How about just saying logical failover slots at this and other places?
>

Yes, that wording works too.

--
Kind Regards,
Peter Smith.
Fujitsu Australia.

Re: Synchronizing slots from primary to standby

2024-03-03 Thread Peter Smith

On Sun, Mar 3, 2024 at 6:51 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Sunday, March 3, 2024 7:47 AM Peter Smith  wrote:
> >

> > 3.
> > +   
> > +
> > + Value * is not accepted as it is inappropriate 
> > to
> > + block logical replication for physical slots that either lack
> > + associated standbys or have standbys associated that are not
> > enabled
> > + for replication slot synchronization. (see
> > +  > linkend="logicaldecoding-replication-slots-synchronization"/>).
> > +
> > +   
> >
> > Why does the document need to provide an excuse/reason for the rule?
> > You could just say something like:
> >
> > SUGGESTION
> > The slots must be named explicitly. For example, specifying wildcard values 
> > like
> > * is not permitted.
>
> As suggested by Amit, I moved this to code comments.

Was the total removal of this note deliberate? I only suggested
removing the *reason* for the rule, not the entire rule. Otherwise,
the user won't know to avoid doing this until they try it and get an
error.


> >
> > 9. NeedToWaitForWal
> >
> > + /*
> > + * Check if the standby slots have caught up to the flushed position.
> > + It
> > + * is good to wait up to flushed position and then let it send the
> > + changes
> > + * to logical subscribers one by one which are already covered in
> > + flushed
> > + * position without needing to wait on every change for standby
> > + * confirmation. Note that after receiving the shutdown signal, an
> > + ERROR
> > + * is reported if any slots are dropped, invalidated, or inactive. This
> > + * measure is taken to prevent the walsender from waiting indefinitely.
> > + */
> > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event)) return
> > + true;
> >
> > I felt it was confusing things for this function to also call to the other 
> > one -- it
> > seems an overlapping/muddling of the purpose of these.
> > I think it will be easier to understand if the calling code just calls to 
> > one or both
> > of these functions as required.
>
> Same as Amit, I didn't change this.

AFAICT my previous review comment was misinterpreted. Please see [1]
for more details.



Here are some more review comments for v104-1

==
Commit message

1.
Additionally, The SQL functions pg_logical_slot_get_changes,
pg_logical_slot_peek_changes and pg_replication_slot_advance are modified
to wait for the replication slots specified in 'standby_slot_names' to
catch up before returning.

~

Maybe that should be expressed using similar wording as the docs...

SUGGESTION
Additionally, The SQL functions ... are modified. Now, when used with
failover-enabled logical slots, these functions will block until all
physical slots specified in 'standby_slot_names' have confirmed WAL
receipt.

==
doc/src/sgml/config.sgml

2.
+pg_logical_slot_peek_changes,
+when used with failover enabled logical slots, will block until all
+physical slots specified in standby_slot_names have
+confirmed WAL receipt.

/failover enabled logical slots/failover-enabled logical slots/

==
doc/src/sgml/func.sgml

3.
+The function may be blocked if the specified slot is a failover enabled
+slot and standby_slot_names
+is configured.


/a failover enabled slot//a failover-enabled slot/

~~~

4.
+slot may return to an earlier position. The function may be blocked if
+the specified slot is a failover enabled slot and
+standby_slot_names
+is configured.

/a failover enabled slot//a failover-enabled slot/

==
src/backend/replication/slot.c

5.
+/*
+ * Wait for physical standbys to confirm receiving the given lsn.
+ *
+ * Used by logical decoding SQL functions that acquired failover enabled slot.
+ * It waits for physical standbys corresponding to the physical slots specified
+ * in the standby_slot_names GUC.
+ */
+void
+WaitForStandbyConfirmation(XLogRecPtr wait_for_lsn)

/failover enabled slot/failover-enabled slot/

~~~

6.
+ /*
+ * Don't need to wait for the standby to catch up if the current acquired
+ * slot is not a failover enabled slot, or there is no value in
+ * standby_slot_names.
+ */

/failover enabled slot/failover-enabled slot/

==
src/backend/replication/slotfuncs.c

7.
+
+ /*
+ * Wake up logical walsenders holding failover enabled slots after
+ * updating the restart_lsn of the physical slot.
+ */
+ PhysicalWakeupLogicalWalSnd();

/failover enabled slots/failover-enabled slots/

==
src/backend/replication/walsender.c

8.
+/*
+ * Wake up the logical walsender processes with failover enabled slots if the
+ * currently acquired physical s

Re: Synchronizing slots from primary to standby

2024-03-03 Thread Peter Smith

On Sun, Mar 3, 2024 at 2:56 PM Amit Kapila  wrote:
>
> On Sun, Mar 3, 2024 at 5:17 AM Peter Smith  wrote:
> >
...
> > 9. NeedToWaitForWal
> >
> > + /*
> > + * Check if the standby slots have caught up to the flushed position. It
> > + * is good to wait up to flushed position and then let it send the changes
> > + * to logical subscribers one by one which are already covered in flushed
> > + * position without needing to wait on every change for standby
> > + * confirmation. Note that after receiving the shutdown signal, an ERROR
> > + * is reported if any slots are dropped, invalidated, or inactive. This
> > + * measure is taken to prevent the walsender from waiting indefinitely.
> > + */
> > + if (NeedToWaitForStandby(target_lsn, flushed_lsn, wait_event))
> > + return true;
> >
> > I felt it was confusing things for this function to also call to the
> > other one -- it seems an overlapping/muddling of the purpose of these.
> > I think it will be easier to understand if the calling code just calls
> > to one or both of these functions as required.
> >
>
> I felt otherwise because the caller has to call these functions at
> more than one place which makes the caller's code difficult to follow.
> It is better to encapsulate the computation of wait_event.
>

You may have misinterpreted my review comment because I didn't say
anything about changing the encapsulation of the computation of the
wait_event.

I only wrote it is better IMO for the functions to stick to just one
job each according to their function name. E.g.:
- NeedToWaitForStandby should *only* check if we need to wait for standbys
- NeedToWaitForWal should *only* check if we need to wait for WAL
flush; i.e. this shouldn't be also calling NeedToWaitForStandby().

Also, AFAICT the caller changes should not be difficult. Indeed, these
changes will make the code aligned properly with what the comments are
saying:

BEFORE
/*
 * Fast path to avoid acquiring the spinlock in case we already know we
 * have enough WAL available and all the standby servers have confirmed
 * receipt of WAL up to RecentFlushPtr. This is particularly interesting
 * if we're far behind.
 */
if (!XLogRecPtrIsInvalid(RecentFlushPtr) &&
  !NeedToWaitForWal(loc, RecentFlushPtr, _event))
  return RecentFlushPtr;

SUGGESTED
...
if (!XLogRecPtrIsInvalid(RecentFlushPtr) &&
  !NeedToWaitForWal(loc, RecentFlushPtr, _event) &&
  !NeedToWaitForStandby(loc, RecentFlushPtr, _event)
  return RecentFlushPtr;

~~~

BEFORE
/*
 * Exit the loop if already caught up and doesn't need to wait for
 * standby slots.
 */
if (!wait_for_standby_at_stop &&
  !NeedToWaitForWal(loc, RecentFlushPtr, _event))
  break;

SUGGESTED
...
if (!wait_for_standby_at_stop &&
  !NeedToWaitForWal(loc, RecentFlushPtr, _event) &&
  !NeedToWaitForStandby(loc, RecentFlushPtr, _event))
  break;

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: pub/sub - specifying optional parameters without values.

2024-03-02 Thread Peter Smith

On Fri, Jan 12, 2024 at 4:07 PM Peter Smith  wrote:
>
> On Mon, Jan 30, 2023 at 8:36 AM Tom Lane  wrote:
> >
> > Zheng Li  writes:
> > > The behavior is due to the following code
> > > https://github.com/postgres/postgres/blob/master/src/backend/commands/define.c#L113
> >
> > Yeah, so you can grep for places that have this behavior by looking
> > for defGetBoolean calls ... and there are quite a few.  That leads
> > me to the conclusion that we'd better invent a fairly stylized
> > documentation solution that we can plug into a lot of places,
> > rather than thinking of slightly different ways to say it and
> > places to say it.  I'm not necessarily opposed to Peter's desire
> > to fix replication-related commands first, but we have more to do
> > later.
> >
> > I'm also not that thrilled with putting the addition up at the top
> > of the relevant text.  This behavior is at least two decades old,
> > so if we've escaped documenting it at all up to now, it can't be
> > that important to most people.
> >
> > I also notice that ALTER SUBSCRIPTION has fully three different
> > sub-sections with about equal claims on this note, if we're going
> > to stick it directly into the affected option lists.
> >
> > That all leads me to propose that we add the new text at the end of
> > the Parameters  in the affected man pages.  So about
> > like the attached.  (I left out alter_publication.sgml, as I'm not
> > sure it needs its own copy of this text --- it doesn't describe
> > individual parameters at all, just refer to CREATE PUBLICATION.)
> >
> > regards, tom lane
> >
>
> Hi,
>
> Here is a similar update for another page: "55.4 Streaming Replication
> Protocol" [0]. This patch was prompted by a review comment reply at
> [1] (#2).
>
> I've used text almost the same as the boilerplate text added by the
> previous commit [2]
>
> ~
>
> PSA patch v4.
>
> ==
> [0] https://www.postgresql.org/docs/devel/protocol-replication.html
> [1] 
> https://www.postgresql.org/message-id/OS0PR01MB571663BCE8B28597D462FADE946A2%40OS0PR01MB5716.jpnprd01.prod.outlook.com
> [2] 
> https://github.com/postgres/postgres/commit/ec7e053a98f39a9e3c7e6d35f0d2e83933882399
>
> Kind Regards,
> Peter Smith.
> Fujitsu Australia

Bump.

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-03-02 Thread Peter Smith

On Sat, Mar 2, 2024 at 2:51 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Friday, March 1, 2024 12:23 PM Peter Smith  wrote:
> >
...
> > ==
> > src/backend/replication/slot.c
> >
> > 2. validate_standby_slots
> >
> > + else if (!ReplicationSlotCtl)
> > + {
> > + /*
> > + * We cannot validate the replication slot if the replication slots'
> > + * data has not been initialized. This is ok as we will validate the
> > + * specified slot when waiting for them to catch up. See
> > + * StandbySlotsHaveCaughtup for details.
> > + */
> > + }
> > + else
> > + {
> > + /*
> > + * If the replication slots' data have been initialized, verify if the
> > + * specified slots exist and are logical slots.
> > + */
> > + LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
> >
> > IMO that 2nd comment does not need to say "If the replication slots'
> > data have been initialized," because that is implicit from the if/else.
>
> Removed.

I only meant to suggest removing the 1st part, not the entire comment.
I thought it is still useful to have a comment like:

/* Check that the specified slots exist and are logical slots.*/

==

And, here are some review comments for v103-0001.

==
Commit message

1.
Additionally, The SQL functions pg_logical_slot_get_changes,
pg_logical_slot_peek_changes and pg_replication_slot_advance are modified
to wait for the replication slots mentioned in 'standby_slot_names' to
catch up before returning.

~

(use the same wording as previous in this message)

/mentioned in/specified in/

==
doc/src/sgml/config.sgml

2.
+Additionally, when using the replication management functions
+
+pg_replication_slot_advance,
+
+pg_logical_slot_get_changes, and
+
+pg_logical_slot_peek_changes,
+with failover enabled logical slots, the functions will wait for the
+physical slots specified in standby_slot_names to
+confirm WAL receipt before proceeding.
+   

It says "the ... functions" twice so maybe reword it slightly.

BEFORE
Additionally, when using the replication management functions
pg_replication_slot_advance, pg_logical_slot_get_changes, and
pg_logical_slot_peek_changes, with failover enabled logical slots, the
functions will wait for the physical slots specified in
standby_slot_names to confirm WAL receipt before proceeding.

SUGGESTION
Additionally, the replication management functions
pg_replication_slot_advance, pg_logical_slot_get_changes, and
pg_logical_slot_peek_changes, when used with failover enabled logical
slots, will wait for the physical slots specified in
standby_slot_names to confirm WAL receipt before proceeding.

(Actually the "will wait ... before proceeding" is also a bit tricky,
so below is another possible rewording)

SUGGESTION #2
Additionally, the replication management functions
pg_replication_slot_advance, pg_logical_slot_get_changes, and
pg_logical_slot_peek_changes, when used with failover enabled logical
slots, will block until all physical slots specified in
standby_slot_names have confirmed WAL receipt.

~~~

3.
+   
+
+ Value * is not accepted as it is inappropriate to
+ block logical replication for physical slots that either lack
+ associated standbys or have standbys associated that are not enabled
+ for replication slot synchronization. (see
+ ).
+
+   

Why does the document need to provide an excuse/reason for the rule?
You could just say something like:

SUGGESTION
The slots must be named explicitly. For example, specifying wildcard
values like * is not permitted.

==
doc/src/sgml/func.sgml

4.
@@ -28150,7 +28150,7 @@ postgres=# SELECT '0/0'::pg_lsn +
pd.segment_number * ps.setting::int + :offset

-   
+   

  pg_logical_slot_get_changes

@@ -28177,7 +28177,7 @@ postgres=# SELECT '0/0'::pg_lsn +
pd.segment_number * ps.setting::int + :offset

-   
+   

  pg_logical_slot_peek_changes

@@ -28232,7 +28232,7 @@ postgres=# SELECT '0/0'::pg_lsn +
pd.segment_number * ps.setting::int + :offset

-   
+   

  pg_replication_slot_advance

Should these 3 functions say something about how their behaviour is
affected by 'standby_slot_names' and give a link back to the GUC
'standby_slot_names' docs?

==
src/backend/replication/slot.c

5. StandbySlotsHaveCaughtup

+ if (!slot)
+ {
+ /*
+ * If the provided slot does not exist, report a message and exit
+ * the loop. It is possible for a user to specify a slot name in
+ * standby_slot_names that does not exist just before the server
+ * startup. The GUC check_hook(validate_standby_slots) cannot
+ * validate such a slot during startup

Re: Synchronizing slots from primary to standby

2024-02-29 Thread Peter Smith

On Fri, Mar 1, 2024 at 5:11 PM Masahiko Sawada  wrote:
>
...
> +/*
> + * "*" is not accepted as in that case primary will not be able to 
> know
> + * for which all standbys to wait for. Even if we have physical slots
> + * info, there is no way to confirm whether there is any standby
> + * configured for the known physical slots.
> + */
> +if (strcmp(*newval, "*") == 0)
> +{
> +GUC_check_errdetail("\"*\" is not accepted for
> standby_slot_names");
> +return false;
> +}
>
> Why only '*' is checked aside from validate_standby_slots()? I think
> that the doc doesn't mention anything about '*' and '*' cannot be used
> as a replication slot name. So even if we don't have this check, it
> might be no problem.
>

Hi, a while ago I asked this same question. See [1 #28] for the response..

--
[1] 
https://www.postgresql.org/message-id/OS0PR01MB571646B8186F6A06404BD50194BDA%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-29 Thread Peter Smith

 the shutdown signal
+ * is received.
+ *
+ * Set failover_slot to true if the current acquired slot is a failover enabled
+ * slot and we are streaming.
+ *
+ * If returning true, the function sets the appropriate wait event in
+ * wait_event; otherwise, wait_event is set to 0.
+ */
+static bool
+NeedToWaitForWal(XLogRecPtr target_lsn, XLogRecPtr flushed_lsn,
+ bool prioritize_stop, bool failover_slot,
+ uint32 *wait_event)

8a.
/Set prioritize_stop to true/Specify prioritize_stop=true/

/Set failover_slot to true/Specify failover_slot=true/

~

8b.
Aren't the static function names typically snake_case?

~~~

9.
+ /*
+ * Check if we need to wait for WALs to be flushed to disk. We don't need
+ * to wait for WALs after receiving the shutdown signal unless the
+ * wait_for_wal_on_stop is true.
+ */
+ if (target_lsn > flushed_lsn && !(prioritize_stop && got_STOPPING))
+ *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;

The comment says 'wait_for_wal_on_stop' but the code says 'prioritize_stop' (??)

~~~

10.
+ /*
+ * Check if we need to wait for WALs to be flushed to disk. We don't need
+ * to wait for WALs after receiving the shutdown signal unless the
+ * wait_for_wal_on_stop is true.
+ */
+ if (target_lsn > flushed_lsn && !(prioritize_stop && got_STOPPING))
+ *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;
+
+ /*
+ * Check if the standby slots have caught up to the flushed position. It is
+ * good to wait up to RecentFlushPtr and then let it send the changes to
+ * logical subscribers one by one which are already covered in
+ * RecentFlushPtr without needing to wait on every change for standby
+ * confirmation. Note that after receiving the shutdown signal, an ERROR is
+ * reported if any slots are dropped, invalidated, or inactive. This
+ * measure is taken to prevent the walsender from waiting indefinitely.
+ */
+ else if (failover_slot && !StandbySlotsHaveCaughtup(flushed_lsn, elevel))
+ *wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION;
+ else
+ return false;
+
+ return true;

This if/else/else seems overly difficult to read. IMO it will be
easier if written like:

SUGGESTION

if (target_lsn > flushed_lsn && !(prioritize_stop && got_STOPPING))
{
  *wait_event = WAIT_EVENT_WAL_SENDER_WAIT_FOR_WAL;
  return true;
}


if (failover_slot && !StandbySlotsHaveCaughtup(flushed_lsn, elevel))
{
  *wait_event = WAIT_EVENT_WAIT_FOR_STANDBY_CONFIRMATION;
  return true;
}

return false;

--
Kind Regards,
Peter Smith.
Fujitsu Australia

DOCS: Avoid using abbreviation "aka"

2024-02-28 Thread Peter Smith

While checking some recently pushed changes [1] I noticed
documentation [2] that includes the abbreviation "aka".

IMO it is preferable to avoid informal abbreviations like "aka" in the
documents, because not everyone will understand the meaning.
Furthermore, I think this is reinforced by the fact this was the
*only* example of "aka" that I could find in all of the .sgml. Indeed,
assuming that "aka" is short for "also known as" then the sentence
still doesn't seem correct even after those words are substituted.

HEAD
For the synchronization to work, it is mandatory to have a physical
replication slot between the primary and the standby aka
primary_slot_name should be configured on the standby, and
hot_standby_feedback must be enabled on the standby.

SUGGESTION
For the synchronization to work, it is mandatory to have a physical
replication slot between the primary and the standby (i.e.,
primary_slot_name should be configured on the standby), and
hot_standby_feedback must be enabled on the standby.

I found that the "aka" was introduced in v86-0001 [3]. So my
replacement text above restores to something similar to how it was in
v85-0001.

PSA a patch for the same.

--
[1]
https://github.com/postgres/postgres/commit/ddd5f4f54a026db6a6692876d0d44aef902ab686#diff-29c2d2e0480177b04f9c3d82c1454f8c00a11b8e761a9c9f5f4f6d61e6f19252
[2]
https://www.postgresql.org/docs/devel/logicaldecoding-explanation.html#LOGICALDECODING-REPLICATION-SLOTS-SYNCHRONIZATION
[3] [1]
https://www.postgresql.org/message-id/OS0PR01MB5716E581B4227DDEB4DE6C30944F2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

Kind Regards,
Peter Smith.
Fujitsu Australia

v1-0001-Replace-aka-in-docs.patch
Description: Binary data

Re: Synchronizing slots from primary to standby

2024-02-28 Thread Peter Smith

On Tue, Feb 27, 2024 at 11:35 PM Amit Kapila  wrote:
>
> On Tue, Feb 27, 2024 at 12:48 PM Peter Smith  wrote:
> >
> > Here are some review comments for v99-0001
> >
> > ==
> > 0. GENERAL.
> >
> > +#standby_slot_names = '' # streaming replication standby server slot names 
> > that
> > + # logical walsender processes will wait for
> >
> > IMO the GUC name is too generic. There is nothing in this name to
> > suggest it has anything to do with logical slot synchronization; that
> > meaning is only found in the accompanying comment -- it would be
> > better if the GUC name itself were more self-explanatory.
> >
> > e.g. Maybe like 'wal_sender_sync_standby_slot_names' or
> > 'wal_sender_standby_slot_names', 'wal_sender_wait_for_standby_slots',
> > or ...
> >
>
> It would be wrong and or misleading to append wal_sender to this GUC
> name as this is used during SQL APIs as well.

Fair enough, but the fact that some SQL functions might wait is also
not mentioned in the config file comment, nor in the documentation for
GUC  'standby_slot_names'. Seems like a docs omission?

> Also, adding wait sounds
> more like a boolean. So, I don't see the proposed names any better
> than the current one.
>

Anyway, the point is that the current GUC name 'standby_slot_names' is
not ideal IMO because it doesn't have enough meaning by itself -- e.g.
you have to read the accompanying comment or documentation to have any
idea of its purpose.

My suggested GUC names were mostly just to get people thinking about
it. Maybe others can come up with better names.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-28 Thread Peter Smith

On Wed, Feb 28, 2024 at 1:23 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Tuesday, February 27, 2024 3:18 PM Peter Smith  
> wrote:
...
> > 20.
> > +#
> > +# | > standby1 (primary_slot_name = sb1_slot) # | > standby2
> > +(primary_slot_name = sb2_slot) # primary - | # | > subscriber1
> > +(failover = true) # | > subscriber2 (failover = false)
> >
> > In the diagram, the "--->" means a mixture of physical standbys and logical
> > pub/sub replication. Maybe it can be a bit clearer?
> >
> > SUGGESTION
> >
> > # primary (publisher)
> > #
> > # (physical standbys)
> > # | > standby1 (primary_slot_name = sb1_slot)
> > # | > standby2 (primary_slot_name = sb2_slot)
> > #
> > # (logical replication)
> > # | > subscriber1 (failover = true, slot_name = lsub1_slot)
> > # | > subscriber2 (failover = false, slot_name = lsub2_slot)
> >
>
> I think one can distinguish it based on the 'standby' and 'subscriber' as 
> well, because
> 'standby' normally refer to physical standby while the other refer to logical.
>

Ok, but shouldn't it at least include info about the logical slot
names associated with the subscribers (slot_name = lsub1_slot,
slot_name = lsub2_slot) like suggested above?

==

Here are some more review comments for v100-0001

==
doc/src/sgml/config.sgml

1.
+   
+Lists the streaming replication standby server slot names that logical
+WAL sender processes will wait for. Logical WAL sender processes will
+send decoded changes to plugins only after the specified replication
+slots confirm receiving WAL. This guarantees that logical replication
+slots with failover enabled do not consume changes until those changes
+are received and flushed to corresponding physical standbys. If a
+logical replication connection is meant to switch to a physical standby
+after the standby is promoted, the physical replication slot for the
+standby should be listed here. Note that logical replication will not
+proceed if the slots specified in the standby_slot_names do
not exist or
+are invalidated.
+   

Is that note ("Note that logical replication will not proceed if the
slots specified in the standby_slot_names do not exist or are
invalidated") meant only for subscriptions marked for 'failover' or
any subscription? Maybe wording can be modified to help clarify it?

==
src/backend/replication/slot.c

2.
+/*
+ * A helper function to validate slots specified in GUC standby_slot_names.
+ */
+static bool
+validate_standby_slots(char **newval)
+{
+ char*rawname;
+ List*elemlist;
+ bool ok;
+
+ /* Need a modifiable copy of string */
+ rawname = pstrdup(*newval);
+
+ /* Verify syntax and parse string into a list of identifiers */
+ ok = SplitIdentifierString(rawname, ',', );
+
+ if (!ok)
+ {
+ GUC_check_errdetail("List syntax is invalid.");
+ }
+
+ /*
+ * If the replication slots' data have been initialized, verify if the
+ * specified slots exist and are logical slots.
+ */
+ else if (ReplicationSlotCtl)
+ {
+ foreach_ptr(char, name, elemlist)
+ {
+ ReplicationSlot *slot;
+
+ slot = SearchNamedReplicationSlot(name, true);
+
+ if (!slot)
+ {
+ GUC_check_errdetail("replication slot \"%s\" does not exist",
+ name);
+ ok = false;
+ break;
+ }
+
+ if (!SlotIsPhysical(slot))
+ {
+ GUC_check_errdetail("\"%s\" is not a physical replication slot",
+ name);
+ ok = false;
+ break;
+ }
+ }
+ }
+
+ pfree(rawname);
+ list_free(elemlist);
+ return ok;
+}

2a.
I didn't mention this previously because I thought this function was
not going to change anymore, but since Bertrand suggested some changes
[1], I will say IMO the { } are fine here for the single statement,
but I think it will be better to rearrange this code to be like below.
Having a 2nd NOP 'else' gives a much better place where you can put
your ReplicationSlotCtl comment.

if (!ok)
{
  GUC_check_errdetail("List syntax is invalid.");
}
else if (!ReplicationSlotCtl)
{
  
}
else
{
  foreach_ptr ...
}

~

2b.
In any case even if don't refactor anything I still think you need to
extend that comment to explain how skipping ReplicationSlotCtl is NULL
is only OK because there will be some later checking also done in the
FilterStandbySlots() function to catch any config problems.

--
[1] 
https://www.postgresql.org/message-id/Zd8ahZXw82ieFxX/%40ip-10-97-1-34.eu-west-3.compute.internal

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-28 Thread Peter Smith

On Wed, Feb 28, 2024 at 10:21 PM Amit Kapila  wrote:
>
> On Wed, Feb 28, 2024 at 3:26 PM shveta malik  wrote:
> >
> >
> > Here is the patch which addresses the above comments. Also optimized
> > the test a little bit. Now we use pg_sync_replication_slots() function
> > instead of worker to test the operator-redirection using search-patch.
> > This has been done to simplify the test case and reduce the added
> > time.
> >
>
> I have slightly adjusted the comments in the attached, otherwise, LGTM.
>
> --

- if (logical)
+ /*
+ * Set always-secure search path for the cases where the connection is
+ * used to run SQL queries, so malicious users can't get control.
+ */
+ if (logical || !replication)
  {
  PGresult   *res;

I found this condition a bit confusing. According to the
libpqrcv_connect function comment:

 * This function can be used for both replication and regular connections.
 * If it is a replication connection, it could be either logical or physical
 * based on input argument 'logical'.

IIUC that comment is saying the 'replication' flag is like the main
categorization and the 'logical' flag is like a subcategory (for when
'replication' is true). Therefore, won't the modified check be better
to be written the other way around? This will also be consistent with
the way the Assert was written.

SUGGESTION
if (!replication || logical)
{
  ...

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-26 Thread Peter Smith

29.
+# Stop the standby associated with the specified physical replication slot so
+# that the logical replication slot won't receive changes until the standby
+# slot's restart_lsn is advanced or the slot is removed from the
+# standby_slot_names list.
+$primary->safe_psql('postgres', "TRUNCATE tab_int;");
+$primary->wait_for_catchup('regress_mysub1');
+$standby1->stop;

Isn't this fragment more like the first step of the *next* TEST
instead of the last step of this one?

~~~

30.
+##
+# Verify that when using pg_logical_slot_get_changes to consume changes from a
+# logical slot with failover enabled, it will also wait for the slots specified
+# in standby_slot_names to catch up.
+##

AFAICT this test is checking only that the function cannot return
while waiting for the stopped standby, but it doesn't seem to check
that it *does* return when the stopped standby comes alive again.

~~~

31.
+$result =
+  $subscriber1->safe_psql('postgres', "SELECT count(*) = 0 FROM tab_int;");
+is($result, 't',
+ "subscriber1 doesn't get data as the sb1_slot doesn't catch up");

Do you think this fragment should have a comment?

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Add publisher and subscriber to glossary documentation.

2024-02-25 Thread Peter Smith

Hi, the patch v4 LGTM.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Add publisher and subscriber to glossary documentation.

2024-02-25 Thread Peter Smith

Here are some comments for patch v3:

1.
+  
+   Publication node
+   
+
+ A node where a
+ publication is defined
+ for logical
replication.
+
+   
+  
+

I felt the word "node" here should link to the glossary term "Node",
instead of directly to the term "Instance".

~~

2.
+  
+   Subscription node
+   
+
+ A node where a
+ subscription is defined
+ for logical
replication.
+
+   
+  
+

(same comment as above)

I felt the word "node" here should link to the glossary term "Node",
instead of directly to the term "Instance".

~~

Apart from those links, it looks good to me. Let's see what others think.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Logging parallel worker draught

2024-02-25 Thread Peter Smith

On Mon, Feb 26, 2024 at 6:13 AM Tomas Vondra
 wrote:
>

> 1) name of the GUC
>
> I find the "log_parallel_worker_draught" to be rather unclear :-( Maybe
> it's just me and everyone else just immediately understands what this
> does / what will happen after it's set to "on", but I find it rather
> non-intuitive.
>

Also, I don't understand how the word "draught" (aka "draft") makes
sense here -- I assume the intended word was "drought" (???).

==
Kind Regards,
Peter Smith.
Fujitsu Australia.

Re: About a recently-added message

2024-02-22 Thread Peter Smith

On Thu, Feb 22, 2024 at 11:36 AM Kyotaro Horiguchi
 wrote:
>
> At Wed, 21 Feb 2024 14:57:42 +0530, Amit Kapila  
> wrote in
> > On Tue, Feb 20, 2024 at 3:21 PM shveta malik  wrote:
> > >
> > > okay, attached v2 patch with changed error msgs and double quotes
> > > around logical.
> > >
> >
> > Horiguchi-San, does this address all your concerns related to
> > translation with these new messages?
>
> Yes, I'm happy with all of the changes.  The proposed patch appears to
> cover all instances related to slotsync.c, and it looks fine to
> me. Thanks!
>
> I found that logica.c is also using the policy that I complained
> about, but it is a separate issue.
>
> ./logical.c 122:errmsg("logical decoding requires wal_level >= 
> logical")));
>

Hmm. I have a currently stalled patch-set to simplify the quoting of
all the GUC names by using one rule. The consensus is to *always*
quote them. See [1]. And those patches already are addressing that
logical.c code mentioned above.

IMO it would be good if we could try to get that patch set approved to
fix everything in one go, instead of ad-hoc hunting for and fixing
cases one at a time.

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPsf3NewXbsFKY88Qn1ON1_dMD6343MuWdMiiM2Ds9a_wA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Add publisher and subscriber to glossary documentation.

2024-02-22 Thread Peter Smith

Here are some comments for patch v2.

==

1. There are whitespace problems

[postgres@CentOS7-x64 oss_postgres_misc]$ git apply
../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch
../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:43:
trailing whitespace.
  A node where publication is defined for
../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:45:
trailing whitespace.
  It replicates a set of changes from a table or a group of tables in
../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:66:
trailing whitespace.
 A node where subscription is defined for
../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:67:
trailing whitespace.
 logical replication.
../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:68:
trailing whitespace.
 It subscribes to one or more publications on a publisher node and pulls
warning: squelched 2 whitespace errors
warning: 7 lines add whitespace errors.

~~~

2. Publisher node

+  
+   Publisher node
+   
+
+  A node where publication is defined for
+  logical
replication.
+  It replicates a set of changes from a table or a group of tables in
+  publication to the subscriber node.
+
+
+ For more information, see
+ .
+
+   
+  
+


2a.
I felt the term here should be "Publication node".

Indeed, there should also be additional glossary terms like
"Publisher" (i.e. it is the same as "Publication node") and
"Subscriber" (i.e. it is the same as "Subscription node"). These
definitions will then be consistent with node descriptions already in
sections 30.1 and 30.2


~

2b.
"node" should include a link to the glossary term. Same for any other
terms mentioned

~

2c.
/A node where publication is defined for/A node where a publication is
defined for/

~

2d.
"It replicates" is misleading because it is the PUBLICATION doing the
replicating, not the node.

IMO it will be better to include 2 more glossary terms "Publication"
and "Subscription" where you can say this kind of information. Then
the link "logical-replication-publication" also belongs under the
"Publication" term.


~~~

3.
+  
+   Subscriber node
+   
+
+ A node where subscription is defined for
+ logical replication.
+ It subscribes to one or more publications on a publisher node and pulls
+ a set of changes from a table or a group of tables in publications it
+ subscribes to.
+
+
+ For more information, see
+ .
+
+   
+  

All comments are similar to those above.

==

In summary, IMO it should be a bit more like below:

SUGGESTION (include all the necessary links etc)

Publisher
See "Publication node"

Publication
A publication replicates the changes of one or more tables to a
subscription. For more information, see Section 30.1

Publication node
A node where a publication is defined for logical replication.

Subscriber
See "Subscription node"

Subscription
A subscription receives the changes of one or more tables from the
publications it subscribes to. For more information, see Section 30.2

Subscription node
A node where a subscription is defined for logical replication.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Documentation to upgrade logical replication cluster

2024-02-22 Thread Peter Smith

Hi Vignesh, I have no further comments. Patch v9 LGTM.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Add lookup table for replication slot invalidation causes

2024-02-22 Thread Peter Smith

On Thu, Feb 22, 2024 at 5:56 PM Michael Paquier  wrote:
>
> On Thu, Feb 22, 2024 at 05:30:08PM +1100, Peter Smith wrote:
> > Oops. Perhaps I meant more like below -- in any case, the point was
> > the same -- to ensure RS_INVAL_NONE is what returns if something
> > unexpected happens.
>
> You are right that this could be a bit confusing, even if we should
> never reach this state.  How about avoiding to return the index of the
> loop as result, as of the attached?  Would you find that cleaner?
> --

Hi, yes, it should never happen, but thanks for making the changes.

I would've just removed every local variable instead of adding more of
them. I also felt the iteration starting from RS_INVAL_NONE instead of
0 is asserting RS_INVAL_NONE must always be the first enum and can't
be rearranged. Probably it will never happen, but why require it?

--
ReplicationSlotInvalidationCause
GetSlotInvalidationCause(const char *conflict_reason)
{
for (ReplicationSlotInvalidationCause cause = 0; cause <=
RS_INVAL_MAX_CAUSES; cause++)
if (strcmp(SlotInvalidationCauses[cause], conflict_reason) == 0)
return cause;

Assert(0);
return RS_INVAL_NONE;
}
--

But maybe those nits are a matter of personal choice. Your patch code
addressed my main concern, so it LGTM.

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Add lookup table for replication slot invalidation causes

2024-02-21 Thread Peter Smith

On Thu, Feb 22, 2024 at 5:19 PM Peter Smith  wrote:
>
> Hi, Sorry for the late comment but isn't the pushed logic now
> different to what it was there before?
>
> IIUC previously (in a non-debug build) if the specified
> conflict_reason was not found, it returned RS_INVAL_NONE -- now it
> seems to return whatever enum happens to be last.
>
> How about something more like below:
>
> --
> ReplicationSlotInvalidationCause
> GetSlotInvalidationCause(const char *conflict_reason)
> {
>   ReplicationSlotInvalidationCause cause;
>   boolfound  = false;
>
>   for (cause = 0; !found && cause <= RS_INVAL_MAX_CAUSES; cause++)
> found = strcmp(SlotInvalidationCauses[cause], conflict_reason) == 0;
>
>   Assert(found);
>   return found ? cause : RS_INVAL_NONE;
> }
> --
>

Oops. Perhaps I meant more like below -- in any case, the point was
the same -- to ensure RS_INVAL_NONE is what returns if something
unexpected happens.

ReplicationSlotInvalidationCause
GetSlotInvalidationCause(const char *conflict_reason)
{
ReplicationSlotInvalidationCause cause;

for (cause = 0; cause <= RS_INVAL_MAX_CAUSES; cause++)
{
if (strcmp(SlotInvalidationCauses[cause], conflict_reason) == 0)
return cause;
}

Assert(0);
return RS_INVAL_NONE;
}

--
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Add lookup table for replication slot invalidation causes

2024-02-21 Thread Peter Smith

Hi, Sorry for the late comment but isn't the pushed logic now
different to what it was there before?

IIUC previously (in a non-debug build) if the specified
conflict_reason was not found, it returned RS_INVAL_NONE -- now it
seems to return whatever enum happens to be last.

How about something more like below:

--
ReplicationSlotInvalidationCause
GetSlotInvalidationCause(const char *conflict_reason)
{
  ReplicationSlotInvalidationCause cause;
  boolfound  = false;

  for (cause = 0; !found && cause <= RS_INVAL_MAX_CAUSES; cause++)
found = strcmp(SlotInvalidationCauses[cause], conflict_reason) == 0;

  Assert(found);
  return found ? cause : RS_INVAL_NONE;
}
--

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Documentation to upgrade logical replication cluster

2024-02-21 Thread Peter Smith

Here are some minor comments for patch v8-0001.

==
doc/src/sgml/glossary.sgml

1.
+   
+
+ A set of publisher and subscriber instances with publisher instance
+ replicating changes to the subscriber instance.
+
+   

/with publisher instance/with the publisher instance/

~~~

2.
There are 2 SQL fragments but they are wrapped differently (see
below). e.g. it is not clear why is the 2nd fragment wrapped since it
is shorter than the 1st. OTOH, maybe you want the 1st fragment to
wrap. Either way, consistency wrapping would be better.


postgres=# SELECT slot_name FROM pg_replication_slots WHERE slot_type
= 'logical' AND conflict_reason IS NOT NULL;
 slot_name
---
(0 rows)

versus

SELECT count(*) FROM pg_replication_slots WHERE slot_type = 'logical'
AND temporary IS false;

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-09 Thread Peter Smith

FYI -- I checked patch v81-0001 to find which of the #includes are
strictly needed.

==
src/backend/replication/logical/slotsync.c

1.
+#include "postgres.h"
+
+#include 
+
+#include "access/genam.h"
+#include "access/table.h"
+#include "access/xlog_internal.h"
+#include "access/xlogrecovery.h"
+#include "catalog/pg_database.h"
+#include "commands/dbcommands.h"
+#include "libpq/pqsignal.h"
+#include "pgstat.h"
+#include "postmaster/bgworker.h"
+#include "postmaster/fork_process.h"
+#include "postmaster/interrupt.h"
+#include "postmaster/postmaster.h"
+#include "replication/logical.h"
+#include "replication/logicallauncher.h"
+#include "replication/walreceiver.h"
+#include "replication/slotsync.h"
+#include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/procarray.h"
+#include "tcop/tcopprot.h"
+#include "utils/builtins.h"
+#include "utils/fmgroids.h"
+#include "utils/guc_hooks.h"
+#include "utils/pg_lsn.h"
+#include "utils/ps_status.h"
+#include "utils/timeout.h"
+#include "utils/varlena.h"

Many of these #includes seem unnecessary. e.g. I was able to remove
all those that are commented-out below, and the file still compiles OK
with no warnings:

#include "postgres.h"

//#include 

//#include "access/genam.h"
//#include "access/table.h"
#include "access/xlog_internal.h"
#include "access/xlogrecovery.h"
#include "catalog/pg_database.h"
#include "commands/dbcommands.h"
//#include "libpq/pqsignal.h"
//#include "pgstat.h"
//#include "postmaster/bgworker.h"
//#include "postmaster/fork_process.h"
//#include "postmaster/interrupt.h"
//#include "postmaster/postmaster.h"
#include "replication/logical.h"
//#include "replication/logicallauncher.h"
//#include "replication/walreceiver.h"
#include "replication/slotsync.h"
#include "storage/ipc.h"
#include "storage/lmgr.h"
#include "storage/procarray.h"
//#include "tcop/tcopprot.h"
#include "utils/builtins.h"
//#include "utils/fmgroids.h"
//#include "utils/guc_hooks.h"
#include "utils/pg_lsn.h"
//#include "utils/ps_status.h"
//#include "utils/timeout.h"
//#include "utils/varlena.h"

==
src/backend/replication/slot.c

2.
 #include "pgstat.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"
+#include "replication/walsender.h"
 #include "storage/fd.h"

The #include "replication/walsender.h" seems to be unnecessary.

==
src/backend/replication/walsender.c

3.
 #include "replication/logical.h"
+#include "replication/slotsync.h"
 #include "replication/slot.h"

The #include "replication/slotsync.h" is needed, but only for Assert code:
Assert(am_cascading_walsender || IsSyncingReplicationSlots());

So you could #ifdef around that #include if you wish to.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Documentation to upgrade logical replication cluster

2024-02-08 Thread Peter Smith

Here are some review comments for patch v7-0001.

==
doc/src/sgml/glossary.sgml

1.
+  
+   Logical replication cluster
+   
+
+ A set of publisher and subscriber instance with publisher instance
+ replicating changes to the subscriber instance.
+
+   
+  

1a.
/instance with/instances with/

~~~

1b.
The description then made me want to look up the glossary definition
of a "publisher instance" and "subscriber instance", but then I was
quite surprised that even "Publisher" and "Subscriber" terms are not
described in the glossary. Should this patch add those, or should we
start another thread for adding them?

==
doc/src/sgml/logical-replication.sgml

2.
+  
+   Migration of logical replication clusters is possible only when all the
+   members of the old logical replication clusters are version 17.0 or later.
+  

Here is where "logical replication clusters" is mentioned. Shouldn't
this first reference be linked to that new to the glossary entry --
e.g. logical replication clusters

~~~

3.
+   
+Following are the prerequisites for pg_upgrade
+to be able to upgrade the logical slots. If these are not met an error
+will be reported.
+   

SUGGESTION
The following prerequisites are required for ...

~~~

4.
+ 
+  All slots on the old cluster must be usable, i.e., there are no slots
+  whose
+  pg_replication_slots.conflict_reason
+  is not NULL.
+ 

The double-negative is too tricky "no slots whose ... not NULL", needs
rewording. Maybe it is better to instead use an example as the next
bullet point does.

~~~

5.
+
+   
+Following are the prerequisites for
pg_upgrade to
+be able to upgrade the subscriptions. If these are not met an error
+will be reported.
+   

SUGGESTION
The following prerequisites are required for ...

==
doc/src/sgml/ref/pgupgrade.sgml

6.
+   
+
+ The steps to upgrade logical replication clusters are not covered here;
+ refer to  for details.
+
+   

Maybe here too there should be a link to the glossary term "logical
replication clusters".

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-08 Thread Peter Smith

irmed_lsn or catalog_xmin is invalid but slot
+ * is not invalidated, that means we have fetched the remote_slot in
+ * its RS_EPHEMERAL state itself. In such a case, avoid syncing it
+ * yet. We can always sync it in the next sync cycle when the
+ * remote_slot is persisted and has valid lsn(s) and xmin values.
+ *
+ * XXX: In future, if we plan to expose 'slot->data.persistency' in
+ * pg_replication_slots view, then we can avoid fetching RS_EPHEMERAL
+ * slots in the first place.
+ */

SUGGESTION (1st para)
If restart_lsn, confirmed_lsn or catalog_xmin is invalid but the slot
is valid, that means we have fetched the remote_slot in its
RS_EPHEMERAL state. In such a case, don't sync it; we can always sync
it in the next ...

~~~

7.
+ /*
+ * Use shared lock to prevent a conflict with
+ * ReplicationSlotsDropDBSlots(), trying to drop the same slot during
+ * a drop-database operation.
+ */
+ LockSharedObject(DatabaseRelationId, remote_dbid, 0, AccessShareLock);
+
+ synchronize_one_slot(remote_slot, remote_dbid);
+
+ UnlockSharedObject(DatabaseRelationId, remote_dbid, 0, AccessShareLock);

IMO remove the blank lines (e.g., you don't use this kind of
formatting for spin locks)

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Make documentation builds reproducible

2024-02-08 Thread Peter Smith

On Thu, Feb 8, 2024 at 9:47 PM Peter Eisentraut  wrote:
>
> On 23.01.24 02:06, Peter Smith wrote:
> > This has been working forever, but seems to have broken due to commit
> > [1] having an undeclared variable.
>
> > runtime error: file stylesheet-html-common.xsl line 452 element if
> > Variable 'autolink.index.see' has not been declared.
> > make: *** [html-stamp] Error 10
>
> I have committed a fix for this.  I have successfully tested docbook-xsl
> 1.77.1 through 1.79.*.
>

Yes, the latest is working for me now. Thanks!

==
Kind Regards,
Peter Smith.
Fujitsu Australia.

Re: Synchronizing slots from primary to standby

2024-02-07 Thread Peter Smith

press this comment?

~~~

10.
+/*
+ * Validates if all necessary GUCs for slot synchronization are set
+ * appropriately, otherwise raise ERROR.
+ */

/Validates if all/Check all/

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Question about behavior of deletes with REPLICA IDENTITY NOTHING

2024-02-07 Thread Peter Smith

On Thu, Feb 8, 2024 at 11:12 AM James Coleman  wrote:
>
> On Wed, Feb 7, 2024 at 6:04 PM Peter Smith  wrote:
> >
> > On Thu, Feb 8, 2024 at 9:04 AM James Coleman  wrote:
> > >
> > > On Wed, Feb 7, 2024 at 3:22 PM Laurenz Albe  
> > > wrote:
> > > >
> > > > On Wed, 2024-02-07 at 15:12 -0500, James Coleman wrote:
> > > > > We recently noticed some behavior that seems reasonable but also
> > > > > surprised our engineers based on the docs.
> > > > >
> > > > > If we have this setup:
> > > > > create table items(i int);
> > > > > insert into items(i) values (1);
> > > > > create publication test_pub for all tables;
> > > > >
> > > > > Then when we:
> > > > > delete from items where i = 1;
> > > > >
> > > > > we get:
> > > > > ERROR:  cannot delete from table "items" because it does not have a
> > > > > replica identity and publishes deletes
> > > > > HINT:  To enable deleting from the table, set REPLICA IDENTITY using
> > > > > ALTER TABLE.
> > > > >
> > > > > Fair enough. But if we do this:
> > > > > alter table items replica identity nothing;
> > > > >
> > > > > because the docs [1] say that NOTHING means "Records no information
> > > > > about the old row." We still get the same error when we try the DELETE
> > > > > again.
> > > >
> > > > Well, "REPLICA IDENTITY NOTHING" is the same as "has no replica 
> > > > identity".
> > > > So is "REPLICA IDENTITY DEFAULT" if there is no primary key, or
> > > > "REPLICA IDENTITY USING INDEX ..." if the index is dropped.
> > > >
> > > > See "pg_class": the column "relreplident" is not nullable.
> > >
> > > Right, I think the confusing point for us is that the docs for NOTHING
> > > ("Records no information about the old row") imply you can decide you
> > > don't have to record anything if you don't want to do so, but the
> > > publication feature is effectively overriding that and asserting that
> > > you can't make that choice.
> > >
> >
> > Hi, I can see how the current docs could be interpreted in a way that
> > was not intended.
> >
> > ~~~
> >
> > To emphasise the DEFAULT behaviour that Laurenze described, I felt
> > there could be another sentence about DEFAULT, the same as there is
> > already for the USING INDEX case.
> >
> > BEFORE [1]
> > Records the old values of the columns of the primary key, if any. This
> > is the default for non-system tables.
> >
> > SUGGESTION
> > Records the old values of the columns of the primary key, if any. This
> > is the default for non-system tables. If there is no primary key, the
> > behavior is the same as NOTHING.
> >
> > ~~~
> >
> > If that is done, then would a publication docs tweak like the one
> > below clarify things sufficiently?
> >
> > BEFORE [2]
> > If a table without a replica identity is added to a publication that
> > replicates UPDATE or DELETE operations then subsequent UPDATE or
> > DELETE operations will cause an error on the publisher.
> >
> > SUGGESTION
> > If a table without a replica identity (or with replica identity
> > behavior equivalent to NOTHING) is added to a publication that
> > replicates UPDATE or DELETE operations then subsequent UPDATE or
> > DELETE operations will cause an error on the publisher.
> >
> > ==
> > [1] 
> > https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-REPLICA-IDENTITY
> > [2] 
> > https://www.postgresql.org/docs/current/logical-replication-publication.html
> >
> > Kind Regards,
> > Peter Smith.
> > Fujitsu Australia
>
> Thanks for looking at this!
>
> Yes, both of those changes together would make this unambiguous (and,
> I think, easier to mentally parse).
>

OK, here then is a patch to do like that.

==
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-replica-identity-clarifications.patch
Description: Binary data

Re: Question about behavior of deletes with REPLICA IDENTITY NOTHING

2024-02-07 Thread Peter Smith

On Thu, Feb 8, 2024 at 9:04 AM James Coleman  wrote:
>
> On Wed, Feb 7, 2024 at 3:22 PM Laurenz Albe  wrote:
> >
> > On Wed, 2024-02-07 at 15:12 -0500, James Coleman wrote:
> > > We recently noticed some behavior that seems reasonable but also
> > > surprised our engineers based on the docs.
> > >
> > > If we have this setup:
> > > create table items(i int);
> > > insert into items(i) values (1);
> > > create publication test_pub for all tables;
> > >
> > > Then when we:
> > > delete from items where i = 1;
> > >
> > > we get:
> > > ERROR:  cannot delete from table "items" because it does not have a
> > > replica identity and publishes deletes
> > > HINT:  To enable deleting from the table, set REPLICA IDENTITY using
> > > ALTER TABLE.
> > >
> > > Fair enough. But if we do this:
> > > alter table items replica identity nothing;
> > >
> > > because the docs [1] say that NOTHING means "Records no information
> > > about the old row." We still get the same error when we try the DELETE
> > > again.
> >
> > Well, "REPLICA IDENTITY NOTHING" is the same as "has no replica identity".
> > So is "REPLICA IDENTITY DEFAULT" if there is no primary key, or
> > "REPLICA IDENTITY USING INDEX ..." if the index is dropped.
> >
> > See "pg_class": the column "relreplident" is not nullable.
>
> Right, I think the confusing point for us is that the docs for NOTHING
> ("Records no information about the old row") imply you can decide you
> don't have to record anything if you don't want to do so, but the
> publication feature is effectively overriding that and asserting that
> you can't make that choice.
>

Hi, I can see how the current docs could be interpreted in a way that
was not intended.

~~~

To emphasise the DEFAULT behaviour that Laurenze described, I felt
there could be another sentence about DEFAULT, the same as there is
already for the USING INDEX case.

BEFORE [1]
Records the old values of the columns of the primary key, if any. This
is the default for non-system tables.

SUGGESTION
Records the old values of the columns of the primary key, if any. This
is the default for non-system tables. If there is no primary key, the
behavior is the same as NOTHING.

~~~

If that is done, then would a publication docs tweak like the one
below clarify things sufficiently?

BEFORE [2]
If a table without a replica identity is added to a publication that
replicates UPDATE or DELETE operations then subsequent UPDATE or
DELETE operations will cause an error on the publisher.

SUGGESTION
If a table without a replica identity (or with replica identity
behavior equivalent to NOTHING) is added to a publication that
replicates UPDATE or DELETE operations then subsequent UPDATE or
DELETE operations will cause an error on the publisher.

==
[1] 
https://www.postgresql.org/docs/current/sql-altertable.html#SQL-ALTERTABLE-REPLICA-IDENTITY
[2] https://www.postgresql.org/docs/current/logical-replication-publication.html

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-06 Thread Peter Smith

EL_LOGICAL)
+ {
+ ereport(ERROR,
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("bad configuration for slot synchronization"),
+ errhint("\"wal_level\" must be >= logical."));
+ return false;
+ }
+
+ /*
+ * The primary_conninfo is required to make connection to primary for
+ * getting slots information.
+ */
+ if (PrimaryConnInfo == NULL || *PrimaryConnInfo == '\0')
+ {
+ ereport(ERROR,
+ /* translator: %s is a GUC variable name */
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("bad configuration for slot synchronization"),
+ errhint("\"%s\" must be defined.", "primary_conninfo"));
+ return false;
+ }
+
+ /*
+ * The slot synchronization needs a database connection for walrcv_exec to
+ * work.
+ */
+ dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo);
+ if (dbname == NULL)
+ {
+ ereport(ERROR,
+
+ /*
+ * translator: 'dbname' is a specific option; %s is a GUC variable
+ * name
+ */
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("bad configuration for slot synchronization"),
+ errhint("'dbname' must be specified in \"%s\".", "primary_conninfo"));
+ return false;
+ }
+
+ return true;
+}

The code of this function has been flip-flopping between versions.
Now, it is always giving an ERROR when something is wrong, so all of
the "return false" are unreachable. It also means the function comment
is wrong, and the boolean return is unused/unnecessary.

~~~

6. SlotSyncShmemInit

+/*
+ * Allocate and initialize slot sync shared memory.
+ */

This comment should use the same style wording as the other nearby
shmem function comments.

SUGGESTION
Allocate and initialize the shared memory of slot synchronization.

~~~

7.
+/*
+ * Cleanup the shared memory of slot synchronization.
+ */
+static void
+SlotSyncShmemExit(int code, Datum arg)

Since this is static, should it use the snake case naming convention?
-- e.g. slot_sync_shmem_exit.

~~~

8.
+/*
+ * Register the callback function to clean up the shared memory of slot
+ * synchronization.
+ */
+void
+SlotSyncInitialize(void)
+{
+ before_shmem_exit(SlotSyncShmemExit, 0);
+}

This is only doing registration for cleanup of shmem stuff. So, does
it really need it to be a separate function, or can this be registered
within SlotSyncShmemInit() itself?

~~~

9. SyncReplicationSlots

+ PG_TRY();
+ {
+ validate_primary_slot_name(wrconn);
+
+ (void) synchronize_slots(wrconn);
+ }
+ PG_FINALLY();
+ {
+ if (syncing_slots)
+ {
+ SpinLockAcquire(>mutex);
+ SlotSyncCtx->syncing = false;
+ SpinLockRelease(>mutex);
+
+ syncing_slots = false;
+ }
+
+ walrcv_disconnect(wrconn);
+ }
+ PG_END_TRY();

IIUC, the "if (syncing_slots)" part is not really for normal
operation, but it is a safe-guard for cleaning up if some unexpected
ERROR happens. Maybe there should be a comment to say that.

==
src/test/recovery/t/040_standby_failover_slots_sync.pl

10.
+# Confirm that the logical failover slot is created on the standby and is
+# flagged as 'synced'
+is($standby1->safe_psql('postgres',
+ q{SELECT count(*) = 2 FROM pg_replication_slots WHERE slot_name IN
('lsub1_slot', 'lsub2_slot') AND synced;}),
+ "t",
+ 'logical slots have synced as true on standby');

/slot is created/slots are created/

/and is flagged/and are flagged/

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-05 Thread Peter Smith

Hi, I took another high-level look at all the funtion names of the
slotsync.c file.

==
src/backend/replication/logical/slotsync.c

+static bool
+local_slot_update(RemoteSlot * remote_slot, Oid remote_dbid)

+static List *
+get_local_synced_slots(void)

+static bool
+check_sync_slot_on_remote(ReplicationSlot *local_slot, List *remote_slots,

+static void
+drop_obsolete_slots(List *remote_slot_list)

+static void
+reserve_wal_for_slot(XLogRecPtr restart_lsn)

+static bool
+update_and_persist_slot(RemoteSlot * remote_slot, Oid remote_dbid)

+static bool
+synchronize_one_slot(RemoteSlot * remote_slot, Oid remote_dbid)
+get_slot_invalidation_cause(char *conflict_reason)

+static bool
+synchronize_slots(WalReceiverConn *wrconn)

+static void
+validate_primary_slot(WalReceiverConn *wrconn, int slot_invalid_elevel)

+static bool
+validate_slotsync_params(int elevel)

+bool
+IsSyncingReplicationSlots(void)

+Datum
+pg_sync_replication_slots(PG_FUNCTION_ARGS)

~~~

There seems some muddling of names here:
- "local" versus ? and "remote" versus "primary"; or sometimes the
function does not give an indication.
- "sync_slot" versus "synced_slot" versus nothing
- "check" versus "validate"
- etc.

Below are some suggestions (some are unchanged); probably there are
better ideas for names but my point is that the current names could be
improved:

CURRENT SUGGESTION
get_local_synced_slots get_local_synced_slots
check_sync_slot_on_remote check_local_synced_slot_exists_on_remote
drop_obsolete_slots  drop_local_synced_slots
reserve_wal_for_slot reserve_wal_for_local_slot
local_slot_update  update_local_synced_slot
update_and_persist_slot   update_and_persist_local_synced_slot
get_slot_invalidation_cause  get_slot_conflict_reason
synchronize_slots  synchronize_remote_slots_to_local
synchronize_one_slotsynchronize_remote_slot_to_local
validate_primary_slotcheck_remote_synced_slot_exists
validate_slotsync_params check_local_config
IsSyncingReplicationSlots IsSyncingReplicationSlots
pg_sync_replication_slots pg_sync_replication_slots

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-05 Thread Peter Smith

Hi,

Previously ([1] #19 and #22) I had suggested that some conflict_reason
code could be split off and pushed first as a prerequisite/
preparatory/ independent patch.

At the time, my suggestion was premature because there was still a lot
of code under development.

But now the affected code is in the first patch 1, and there are
already other precedents of slot-sync preparatory patches getting
pushed.

So I thought to resurrect this splitting suggestion again, as perhaps
is the right time to do it.

Details are below:

==

IMO the new #defines for the conflict_reason stuff plus where they get
used can be pushed as an independent patch.

Specifically, this stuff:

~~~

>From src/include/replication/slot.h:

+/*
+ * The possible values for 'conflict_reason' returned in
+ * pg_get_replication_slots.
+ */
+#define SLOT_INVAL_WAL_REMOVED_TEXT "wal_removed"
+#define SLOT_INVAL_HORIZON_TEXT "rows_removed"
+#define SLOT_INVAL_WAL_LEVEL_TEXT   "wal_level_insufficient"
+

~~~

>From src/backend/replication/logical/slotsync.c:

Also, IMO this function should live in slot.c; Although slotsync.c
might be the only caller, this is not really a slot-sync specific
function.

+/*
+ * Maps the pg_replication_slots.conflict_reason text value to
+ * ReplicationSlotInvalidationCause enum value
+ */
+static ReplicationSlotInvalidationCause
+get_slot_invalidation_cause(char *conflict_reason)
+{
+ Assert(conflict_reason);
+
+ if (strcmp(conflict_reason, SLOT_INVAL_WAL_REMOVED_TEXT) == 0)
+ return RS_INVAL_WAL_REMOVED;
+ else if (strcmp(conflict_reason, SLOT_INVAL_HORIZON_TEXT) == 0)
+ return RS_INVAL_HORIZON;
+ else if (strcmp(conflict_reason, SLOT_INVAL_WAL_LEVEL_TEXT) == 0)
+ return RS_INVAL_WAL_LEVEL;
+ else
+ Assert(0);
+
+ /* Keep compiler quiet */
+ return RS_INVAL_NONE;
+}

~~~

>From src/backend/replication/slotfuncs.c:


  case RS_INVAL_WAL_REMOVED:
- values[i++] = CStringGetTextDatum("wal_removed");
+ values[i++] = CStringGetTextDatum(SLOT_INVAL_WAL_REMOVED_TEXT);
  break;

  case RS_INVAL_HORIZON:
- values[i++] = CStringGetTextDatum("rows_removed");
+ values[i++] = CStringGetTextDatum(SLOT_INVAL_HORIZON_TEXT);
  break;

  case RS_INVAL_WAL_LEVEL:
- values[i++] = CStringGetTextDatum("wal_level_insufficient");
+ values[i++] = CStringGetTextDatum(SLOT_INVAL_WAL_LEVEL_TEXT);
  break;

~~~

Thoughts?

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPtJAAPghc4GPt0k%3DjeMz1qu4H7mnaDifOHsVsMqi-qOLA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-05 Thread Peter Smith

ndStringInfo(_name, "%s_", cluster_name);
appendStringInfoString(_name, "slotsync");

~~~

22.
+ /*
+ * Establish the connection to the primary server for slots
+ * synchronization.
+ */
+ wrconn = walrcv_connect(PrimaryConnInfo, false, false, false,
+ app_name.data, );

Unnecessarily verbose?

SUGGESTION
Connect to the primary server.

~~~

23.
+ syncing_slots = true;
+
+ PG_TRY();
+ {
+ /*
+ * Using the specified primary server connection, validates the slot
+ * in primary_slot_name.
+ */
+ validate_primary_slot(wrconn, ERROR);
+
+ (void) synchronize_slots(wrconn);
+ }
+ PG_FINALLY();
+ {
+ syncing_slots = false;
+ walrcv_disconnect(wrconn);
+ }
+ PG_END_TRY();

23a.
IMO the "syncing_slots = true;" can be deferred until immediately
before call to synchronize_slots();

~

23b.
I felt the comment seems backwards, so can be worded as suggested
elsewhere in this post.

SUGGESTION
Validate the 'primary_slot_name' using the specified primary server connection.

OTOH, if you can change the function name to
validate_primary_slot_name() then no comment is needed because then it
becomes self-explanatory.

==
src/backend/replication/slot.c

24.
+ /*
+ * Do not allow users to create the slots with failover enabled on the
+ * standby as we do not support sync to the cascading standby.
+ *
+ * Slots with failover enabled can still be created when doing slot
+ * synchronization, as it needs to maintain this value in sync with the
+ * remote slots.
+ */
+ if (failover && RecoveryInProgress() && !IsSyncingReplicationSlots())
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot enable failover for a replication slot"
+" created on the standby"));

I felt it started to become confusing using "synchronization" and
"sync" in the same sentence.

SUGGESTION
However, slots with failover enabled can be created during slot
synchronization because we need to retain the same values as the
remote slot.


==
.../t/040_standby_failover_slots_sync.pl

25.
+
+$standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();");

Since this is where we use the function added by this patch, it
deserves to have a comment.

SUGGESTION
# Synchronize the primary server slots to the standby.

==
src/tools/pgindent/typedefs.list

26.
It looks like 'RemoteSlot' should be included in the typedefs.list
file. Probably this is the explanation for the space problems I
reported earlier in this post.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-04 Thread Peter Smith

On Fri, Feb 2, 2024 at 11:18 PM shveta malik  wrote:
>
> On Fri, Feb 2, 2024 at 12:25 PM Peter Smith  wrote:
> >
> > Here are some review comments for v750002.
>
> Thanks for the feedback Peter. Addressed all in v76 except one.
>
> > (this is a WIP but this is what I found so far...)
>
> > I wonder if it is better to log all the problems in one go instead of
> > making users stumble onto them one at a time after fixing one and then
> > hitting the next problem. e.g. just set some variable "all_ok =
> > false;" each time instead of all the "return false;"
> >
> > Then at the end of the function just "return all_ok;"
>
> If we do this way, then we need to find a way to combine the msgs as
> well, otherwise the same msg will be repeated multiple times. For the
> concerned functionality (which needs one time config effort by user),
> I feel the existing way looks okay. We may consider optimizing it if
> we get more comments here.
>

I don't think combining messages is necessary;   I considered these
all as different (not the same msg repeated multiple times) since they
all have different errhints.

I felt a user would only know to make a configuration correction when
they are informed something is wrong, so my review point was we could
tell them all the wrong things up-front so then those can all be fixed
with a "one time config effort by user".

Otherwise, if multiple settings (e.g. from the list below) have wrong
values, I imagined the user will fix the first reported one, then the
next bad config will be reported, then the user will fix that one,
then the next bad config will be reported, then the user will fix that
one, and so on. It just seemed potentially/unnecessarilly painful.

- errhint("\"%s\" must be defined.", "primary_slot_name"));
- errhint("\"%s\" must be enabled.", "hot_standby_feedback"));
- errhint("\"wal_level\" must be >= logical."));
- errhint("\"%s\" must be defined.", "primary_conninfo"));
- errhint("'dbname' must be specified in \"%s\".", "primary_conninfo"));

~

Anyway, I just wanted to explain my review comment some more because
maybe my reason wasn't clear the first time. Whatever your decision
is, it is fine by me.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: src/bin/pg_upgrade/t/004_subscription.pl test comment fix

2024-02-04 Thread Peter Smith

On Sat, Feb 3, 2024 at 5:28 PM Amit Kapila  wrote:
>
> On Thu, Feb 1, 2024 at 5:58 AM Peter Smith  wrote:
> >
> > OK. Now using your suggested 2nd sentence:
> >
> > +# The subscription's running status and failover option should be preserved
> > +# in the upgraded instance. So regress_sub1 should still have
> > subenabled,subfailover
> > +# set to true, while regress_sub2 should have both set to false.
> >
> > ~
> >
> > I also tweaked some other nearby comments/messages which did not
> > mention the 'failover' preservation.
> >
>
> Looks mostly good to me. One minor nitpick:
> *
> along with retaining the replication origin's remote lsn
> -# and subscription's running status.
> +# and subscription's running status and failover option.
>
> I don't think we need to use 'and' twice in the above sentence. We
> should use ',' between different properties. I can change this on
> Monday and push it unless you think otherwise.
>

+1

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-01 Thread Peter Smith

Info == '\0')
+ {
+ ereport(LOG,
+ /* translator: %s is a GUC variable name */
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("bad configuration for slot synchronization"),
+ errhint("\"%s\" must be defined.", "primary_conninfo"));
+ return false;
+ }
+
+ /*
+ * The slot sync worker needs a database connection for walrcv_exec to
+ * work.
+ */
+ *dbname = walrcv_get_dbname_from_conninfo(PrimaryConnInfo);
+ if (*dbname == NULL)
+ {
+ ereport(LOG,
+
+ /*
+ * translator: 'dbname' is a specific option; %s is a GUC variable
+ * name
+ */
+ errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("bad configuration for slot synchronization"),
+ errhint("'dbname' must be specified in \"%s\".", "primary_conninfo"));
+ return false;
+ }
+
+ return true;
+}

I wonder if it is better to log all the problems in one go instead of
making users stumble onto them one at a time after fixing one and then
hitting the next problem. e.g. just set some variable "all_ok =
false;" each time instead of all the "return false;"

Then at the end of the function just "return all_ok;"

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-02-01 Thread Peter Smith

Here are some review comments for v750001.

==
Commit message

1.
This patch provides support for non-replication connection
in libpqrcv_connect().

~

1a.
/connection/connections/

~

1b.
Maybe there needs to be a few more sentences just to describe what you
mean by "non-replication connection".

~

1c.
IIUC although the 'replication' parameter is added, in this patch
AFAICT every call to the connect function is still passing that
argument as true. If that's correct, probably this patch comment
should emphasise that this patch doesn't change any functionality at
all but is just preparation for later patches which *will* pass false
for the replication arg.

~~~

2.
This patch also implements a new API libpqrcv_get_dbname_from_conninfo()
to extract database name from the given connection-info

~

/extract database name/the extract database name/

==
.../libpqwalreceiver/libpqwalreceiver.c

3.
+ * Apart from walreceiver, the libpq-specific routines here are now being used
+ * by logical replication worker as well.

/worker/workers/

~~~

4. libpqrcv_connect

 /*
- * Establish the connection to the primary server for XLOG streaming
+ * Establish the connection to the primary server.
+ *
+ * The connection established could be either a replication one or
+ * a non-replication one based on input argument 'replication'. And further
+ * if it is a replication connection, it could be either logical or physical
+ * based on input argument 'logical'.

That first comment ("could be either a replication one or...") seemed
a bit meaningless (e.g. it like saying "this boolean argument can be
true or false") because it doesn't describe what is the meaning of a
"replication connection" versus what is a "non-replication
connection".

~~~

5.
/* We can not have logical without replication */
Assert(replication || !logical);

if (replication)
{
keys[++i] = "replication";
vals[i] = logical ? "database" : "true";

if (!logical)
{
/*
* The database name is ignored by the server in replication mode,
* but specify "replication" for .pgpass lookup.
*/
keys[++i] = "dbname";
vals[i] = "replication";
}
}

keys[++i] = "fallback_application_name";
vals[i] = appname;
if (logical)
{
...
}

~

The Assert already says we cannot be 'logical' if not 'replication',
therefore IMO it seemed strange that the code was not refactored to
bring that 2nd "if (logical)" code to within the scope of the "if
(replication)".

e.g. Can't you do something like this:

Assert(replication || !logical);

if (replication)
{
  ...
  if (logical)
  {
...
  }
  else
  {
...
  }
}
keys[++i] = "fallback_application_name";
vals[i] = appname;

~~~

6. libpqrcv_get_dbname_from_conninfo

+ for (PQconninfoOption *opt = opts; opt->keyword != NULL; ++opt)
+ {
+ /*
+ * If multiple dbnames are specified, then the last one will be
+ * returned
+ */
+ if (strcmp(opt->keyword, "dbname") == 0 && opt->val &&
+ opt->val[0] != '\0')
+ dbname = pstrdup(opt->val);
+ }

Should you also pfree the old dbname instead of gathering a bunch of
strdups if there happened to be multiple dbnames specified ?

SUGGESTION
if (strcmp(opt->keyword, "dbname") == 0 && opt->val && *opt->val)
{
   if (dbname)
pfree(dbname);
   dbname = pstrdup(opt->val);
}

==
src/include/replication/walreceiver.h

7.
/*
 * walrcv_connect_fn
 *
 * Establish connection to a cluster.  'logical' is true if the
 * connection is logical, and false if the connection is physical.
 * 'appname' is a name associated to the connection, to use for example
 * with fallback_application_name or application_name.  Returns the
 * details about the connection established, as defined by
 * WalReceiverConn for each WAL receiver module.  On error, NULL is
 * returned with 'err' including the error generated.
 */
typedef WalReceiverConn *(*walrcv_connect_fn) (const char *conninfo,
   bool replication,
   bool logical,
   bool must_use_password,
   const char *appname,
   char **err);

~

The comment is missing any description of the new parameter 'replication'.

~~~

8.
+/*
+ * walrcv_get_dbname_from_conninfo_fn
+ *
+ * Returns the dbid from the primary_conninfo
+ */
+typedef char *(*walrcv_get_dbname_from_conninfo_fn) (const char *conninfo);
+

/dbid/database name/

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-01-31 Thread Peter Smith

Here are some review comments for v740001.

==
src/sgml/logicaldecoding.sgml

1.
+   
+Replication Slot Synchronization
+
+ A logical replication slot on the primary can be synchronized to the hot
+ standby by enabling the failover option during slot
+ creation and setting
+ enable_syncslot
+ on the standby. For the synchronization
+ to work, it is mandatory to have a physical replication slot between the
+ primary and the standby, and
+ hot_standby_feedback
+ must be enabled on the standby. It is also necessary to specify a valid
+ dbname in the
+ primary_conninfo
+ string, which is used for slot synchronization and is ignored
for streaming.
+

IMO we don't need to repeat that last part ", which is used for slot
synchronization and is ignored for streaming." because that is a
detail about the primary_conninfo GUC, and the same information is
already described in that GUC section.

==

2. ALTER_REPLICATION_SLOT slot_name ( option [, ...] ) #

  
-  If true, the slot is enabled to be synced to the standbys.
+  If true, the slot is enabled to be synced to the standbys
+  so that logical replication can be resumed after failover.
  

This also should have the sentence "The default is false.", e.g. the
same as the same option in CREATE_REPLICATION_SLOT says.

==
synchronize_one_slot

3.
+ /*
+ * Make sure that concerned WAL is received and flushed before syncing
+ * slot to target lsn received from the primary server.
+ *
+ * This check will never pass if on the primary server, user has
+ * configured standby_slot_names GUC correctly, otherwise this can hit
+ * frequently.
+ */
+ latestFlushPtr = GetStandbyFlushRecPtr(NULL);
+ if (remote_slot->confirmed_lsn > latestFlushPtr)

BEFORE
This check will never pass if on the primary server, user has
configured standby_slot_names GUC correctly, otherwise this can hit
frequently.

SUGGESTION (simpler way to say the same thing?)
This will always be the case unless the standby_slot_names GUC is not
correctly configured on the primary server.

~~~

4.
+ /* User created slot with the same name exists, raise ERROR. */

/User created/User-created/

~~~

5. synchronize_slots, and also drop_obsolete_slots

+ /*
+ * Use shared lock to prevent a conflict with
+ * ReplicationSlotsDropDBSlots(), trying to drop the same slot while
+ * drop-database operation.
+ */

(same code comment is in a couple of places)

SUGGESTION (while -> during, etc.)

Use a shared lock to prevent conflicting with
ReplicationSlotsDropDBSlots() trying to drop the same slot during a
drop-database operation.

~~~

6. validate_parameters_and_get_dbname

strcmp() just for the empty string "" might be overkill.

6a.
+ if (PrimarySlotName == NULL || strcmp(PrimarySlotName, "") == 0)

SUGGESTION
if (PrimarySlotName == NULL || *PrimarySlotName == '\0')

~~

6b.
+ if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0)

SUGGESTION
if (PrimaryConnInfo == NULL || *PrimaryConnInfo == '\0')

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: src/bin/pg_upgrade/t/004_subscription.pl test comment fix

2024-01-31 Thread Peter Smith

On Wed, Jan 31, 2024 at 7:48 PM Alvaro Herrera  wrote:
>
> How about rewording it more extensively?  It doesn't read great IMO.
> I would use something like
>
> # In the upgraded instance, the running status and failover option of the
> # subscription with the failover option should have been preserved; the other
> # should not.
> # So regress_sub1 should still have subenabled,subfailover set to true,
> # while regress_sub2 should have both set to false.
>

IIUC this suggested comment is implying that the running status is
*only* preserved when the failover option is true. But AFAIK that is
not correct. e.g. I hacked the test to keep subscription regress_sub2
as ENABLED but the result after the upgrade was subenabled/subfailover
= t/f, not f/f.

> I think the symmetry between the two lines confuses more than helps.
> It's not a huge thing but since we're editing anyway, why not?
>

OK. Now using your suggested 2nd sentence:

+# The subscription's running status and failover option should be preserved
+# in the upgraded instance. So regress_sub1 should still have
subenabled,subfailover
+# set to true, while regress_sub2 should have both set to false.

~

I also tweaked some other nearby comments/messages which did not
mention the 'failover' preservation.

PSA v2.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

v2-0001-Fix-pg_upgrade-test-comment.patch
Description: Binary data

Re: Documentation: warn about two_phase when altering a subscription

2024-01-31 Thread Peter Smith

On Wed, Jan 31, 2024 at 4:55 PM Bertrand Drouvot
 wrote:
...
> As a non native English speaker somehow I have to rely on you for those
> suggestions ;-)
>
> They make sense to me so applied both in v2 attached.
>

Patch v2 looks OK to me, but probably there is still room for
improvement. Let's see what others think.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Improve the connection failure error messages

2024-01-31 Thread Peter Smith

On Wed, Jan 31, 2024 at 9:58 PM Nisha Moond  wrote:
>
>
>> AFAIK some recent commits patches (e,g [1]  for the "slot sync"
>> development) have created some more cases of "could not connect..."
>> messages. So, you might need to enhance your patch to deal with any
>> new ones in the latest HEAD.
>>
>> ==
>> [1] 
>> https://github.com/postgres/postgres/commit/776621a5e4796fa214b6b29a7ca134f6c138572a
>>
> Thank you for the update.
> The v3 patch has the changes needed as per the latest HEAD.
>

Hi, just going by visual inspection of the v2/v3 patch diffs, the
latest v3 LGTM.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: src/bin/pg_upgrade/t/004_subscription.pl test comment fix

2024-01-30 Thread Peter Smith

On Wed, Jan 31, 2024 at 4:51 PM vignesh C  wrote:
>
> On Wed, 31 Jan 2024 at 10:27, Peter Smith  wrote:
> >
> > Hi,
> >
> > PSA a small fix for a misleading comment found in the pg_upgrade test code.
>
> Is the double # intentional here, as I did not see this style of
> commenting used elsewhere:
> +# # Upgraded regress_sub1 should still have enabled and failover as true.
> +# # Upgraded regress_sub2 should still have enabled and failover as false.
>

Unintentional caused by copy/paste. Thanks for reporting. I will post
a fixed patch tomorrow.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Improve the connection failure error messages

2024-01-30 Thread Peter Smith

AFAIK some recent commits patches (e,g [1]  for the "slot sync"
development) have created some more cases of "could not connect..."
messages. So, you might need to enhance your patch to deal with any
new ones in the latest HEAD.

==
[1] 
https://github.com/postgres/postgres/commit/776621a5e4796fa214b6b29a7ca134f6c138572a

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-01-30 Thread Peter Smith

On Wed, Jan 31, 2024 at 2:18 PM Amit Kapila  wrote:
>
> On Wed, Jan 31, 2024 at 7:27 AM Peter Smith  wrote:
> >
> > I saw that v73-0001 was pushed, but it included some last-minute
> > changes that I did not get a chance to check yesterday.
> >
> > Here are some review comments for the new parts of that patch.
> >
> > ==
> > doc/src/sgml/ref/create_subscription.sgml
> >
> > 1.
> > connect (boolean)
> >
> > Specifies whether the CREATE SUBSCRIPTION command should connect
> > to the publisher at all. The default is true. Setting this to false
> > will force the values of create_slot, enabled, copy_data, and failover
> > to false. (You cannot combine setting connect to false with setting
> > create_slot, enabled, copy_data, or failover to true.)
> >
> > ~
> >
> > I don't think the first part "Setting this to false will force the
> > values ... failover to false." is strictly correct.
> >
> > I think is correct to say all those *other* properties (create_slot,
> > enabled, copy_data) are forced to false because those otherwise have
> > default true values.
> >
>
> So, won't when connect=false, the user has to explicitly provide such
> values (create_slot, enabled, etc.) as false? If so, is using 'force'
> strictly correct?

Perhaps the original docs text could be worded differently; I think
the word "force" here just meant setting connection=false
forces/causes/makes those other options behave "as if" they had been
set to false without the user explicitly doing anything to them. The
point is they are made to behave *differently* to their normal
defaults.

So,
connect=false ==> this actually sets enabled=false (you can see this
with \dRs+), which is different to the default setting of 'enabled'
connect=false ==> will not create a slot (because there is no
connection), which is different to the default behaviour for
'create-slot'
connect=false ==> will not copy tables (because there is no
connection), which is different to the default behaviour for
'copy_data;'

OTOH, failover is different
connect=false ==> failover is not possible (because there is no
connection), but the 'failover' default is false anyway, so no change
to the default behaviour for this one.

>
> > ~~~
> >
> > 2.
> >   
> >Since no connection is made when this option is
> >false, no tables are subscribed. To initiate
> >replication, you must manually create the replication slot, 
> > enable
> > -  the subscription, and refresh the subscription. See
> > +  the failover if required, enable the subscription, and refresh 
> > the
> > +  subscription. See
> > > linkend="logical-replication-subscription-examples-deferred-slot"/>
> >for examples.
> >   
> >
> > IMO "see the failover if required" is very vague. See what failover?
> >
>
> AFAICS, the committed docs says: "To initiate replication, you must
> manually create the replication slot, enable the failover if required,
> ...". I am not sure what you are referring to.
>

My mistake. I was misreading the patch code without applying the
patch. Sorry for the noise.

> >
> > ==
> > src/bin/pg_upgrade/t/004_subscription.pl
> >
> > 5.
> > -# The subscription's running status should be preserved. Old subscription
> > -# regress_sub1 should be enabled and old subscription regress_sub2 should 
> > be
> > -# disabled.
> > +# The subscription's running status and failover option should be 
> > preserved.
> > +# Old subscription regress_sub1 should have enabled and failover as true 
> > while
> > +# old subscription regress_sub2 should have enabled and failover as false.
> >  $result =
> >$new_sub->safe_psql('postgres',
> > - "SELECT subname, subenabled FROM pg_subscription ORDER BY subname");
> > -is( $result, qq(regress_sub1|t
> > -regress_sub2|f),
> > + "SELECT subname, subenabled, subfailover FROM pg_subscription ORDER
> > BY subname");
> > +is( $result, qq(regress_sub1|t|t
> > +regress_sub2|f|f),
> >   "check that the subscription's running status are preserved");
> >
> > ~
> >
> > Calling those "old subscriptions" seems misleading. Aren't these the
> > new/upgraded subscriptions being checked here?
> >
>
> Again the quoted wording is not introduced by this patch. But, I see
> your point and it is better if you can start a separate thread for it.
>

OK. I created a separate thread for this [1]

==
[1] 
https://www.postgresql.org/message-id/cahut+pu1uslphrysptacy1k_q-ddsrxnfhmj_2u1nfqbc1y...@mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia.

src/bin/pg_upgrade/t/004_subscription.pl test comment fix

2024-01-30 Thread Peter Smith

Hi,

PSA a small fix for a misleading comment found in the pg_upgrade test code.

==
Kind Regards,
Peter Smith.
Fujitsu Australia


v1-0001-Fix-test-comment.patch
Description: Binary data

Re: Documentation: warn about two_phase when altering a subscription

2024-01-30 Thread Peter Smith

Hi, thanks for the patch. Here are some review comments for v1

==

(below is not showing the links and other sgml rendering -- all those LGTM)

BEFORE
When altering the slot_name, the failover and two_phase properties
values of the named slot may differ from their counterparts failover
and two_phase parameters specified in the subscription. When creating
the slot, ensure the slot failover and two_phase properties match
their counterparts parameters values of the subscription.

SUGGESTION
When altering the slot_name, the failover and two_phase property
values of the named slot may differ from the counterpart failover and
two_phase parameters specified by the subscription. When creating the
slot, ensure the slot properties failover and two_phase match their
counterpart parameters of the subscription.

~

BEFORE
Otherwise, the slot on the publisher may behave differently from what
subscription's failover and two_phase options say: for example, the
slot on the publisher could ...

SUGGESTION:
Otherwise, the slot on the publisher may behave differently from what
these subscription options say: for example, the slot on the publisher
could ...

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-01-30 Thread Peter Smith

Hi,

I saw that v73-0001 was pushed, but it included some last-minute
changes that I did not get a chance to check yesterday.

Here are some review comments for the new parts of that patch.

==
doc/src/sgml/ref/create_subscription.sgml

1.
connect (boolean)

Specifies whether the CREATE SUBSCRIPTION command should connect
to the publisher at all. The default is true. Setting this to false
will force the values of create_slot, enabled, copy_data, and failover
to false. (You cannot combine setting connect to false with setting
create_slot, enabled, copy_data, or failover to true.)

~

I don't think the first part "Setting this to false will force the
values ... failover to false." is strictly correct.

I think is correct to say all those *other* properties (create_slot,
enabled, copy_data) are forced to false because those otherwise have
default true values. But the 'failover' has default false, so it
cannot get force-changed at all because you can't set connect to false
when failover is true as the second part ("You cannot combine...")
explains.

IMO remove 'failover' from that first sentence.

~~~

2.
  
   Since no connection is made when this option is
   false, no tables are subscribed. To initiate
   replication, you must manually create the replication slot, enable
-  the subscription, and refresh the subscription. See
+  the failover if required, enable the subscription, and refresh the
+  subscription. See
   
   for examples.
  

IMO "see the failover if required" is very vague. See what failover?
The slot property failover, or the subscription option failover? And
"see" it for what purpose?

I think the intention was probably to say something like "ensure the
manually created slot has the same matching failover property value as
the subscriber failover option", but that is not clear from the
current text.

==
doc/src/sgml/ref/pg_dump.sgml

3.
dump can be restored without requiring network access to the remote
servers.  It is then up to the user to reactivate the subscriptions in a
suitable way.  If the involved hosts have changed, the connection
-   information might have to be changed.  It might also be appropriate to
+   information might have to be changed.  If the subscription needs to
+   be enabled for
+   failover,
+   then same needs to be done by executing
+   
+   ALTER SUBSCRIPTION ... SET(failover = true)
+   after the slot has been created.  It might also be appropriate to

"then same needs to be done" (English?)

BEFORE
If the subscription needs to be enabled for failover, then same needs
to be done by executing ALTER SUBSCRIPTION ... SET(failover = true)
after the slot has been created.

SUGGESTION
If the subscription needs to be enabled for failover, execute ALTER
SUBSCRIPTION ... SET(failover = true) after the slot has been created.

==
src/backend/commands/subscriptioncmds.c

4.
 #define SUBOPT_RUN_AS_OWNER 0x1000
-#define SUBOPT_LSN 0x2000
-#define SUBOPT_ORIGIN 0x4000
+#define SUBOPT_FAILOVER 0x2000
+#define SUBOPT_LSN 0x4000
+#define SUBOPT_ORIGIN 0x8000
+

A spurious blank line was added.

==
src/bin/pg_upgrade/t/004_subscription.pl

5.
-# The subscription's running status should be preserved. Old subscription
-# regress_sub1 should be enabled and old subscription regress_sub2 should be
-# disabled.
+# The subscription's running status and failover option should be preserved.
+# Old subscription regress_sub1 should have enabled and failover as true while
+# old subscription regress_sub2 should have enabled and failover as false.
 $result =
   $new_sub->safe_psql('postgres',
- "SELECT subname, subenabled FROM pg_subscription ORDER BY subname");
-is( $result, qq(regress_sub1|t
-regress_sub2|f),
+ "SELECT subname, subenabled, subfailover FROM pg_subscription ORDER
BY subname");
+is( $result, qq(regress_sub1|t|t
+regress_sub2|f|f),
  "check that the subscription's running status are preserved");

~

Calling those "old subscriptions" seems misleading. Aren't these the
new/upgraded subscriptions being checked here?

Should the comment be more like:

# The subscription's running status and failover option should be preserved.
# Upgraded regress_sub1 should still have enabled and failover as true.
# Upgraded regress_sub2 should still have enabled and failover as false.

==
Kind Regards,
Peter Smith.
Fujitsu Australia.

Re: subscription disable_on_error not working after ALTER SUBSCRIPTION set bad conninfo

2024-01-29 Thread Peter Smith

On Fri, Jan 19, 2024 at 7:21 PM Masahiko Sawada  wrote:
>
> On Thu, Jan 18, 2024 at 6:54 PM Amit Kapila  wrote:
> >
> > On Thu, Jan 18, 2024 at 11:15 AM Peter Smith  wrote:
> > >
> > > On Thu, Jan 18, 2024 at 12:55 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > ...
> > > >
> > > > Although we can improve it to handle this case too, I'm not sure it's
> > > > a bug. The doc says[1]:
> > > >
> > > > Specifies whether the subscription should be automatically disabled if
> > > > any errors are detected by subscription workers during data
> > > > replication from the publisher.
> > > >
> > > > When an apply worker is trying to establish a connection, it's not
> > > > replicating data from the publisher.
> > > >
> > > > Regards,
> > > >
> > > > [1] 
> > > > https://www.postgresql.org/docs/devel/sql-createsubscription.html#SQL-CREATESUBSCRIPTION-PARAMS-WITH-DISABLE-ON-ERROR
> > > >
> > > > --
> > > > Masahiko Sawada
> > > > Amazon Web Services: https://aws.amazon.com
> > >
> > > Yeah, I had also seen that wording of those docs. And I agree it
> > > leaves open some room for doubts because strictly from that wording it
> > > can be interpreted that establishing the connection is not actually
> > > "data replication from the publisher" in which case maybe there is no
> > > bug.
> > >
> >
> > As far as I remember that was the intention. The idea was if there is
> > any conflict during apply that users manually need to fix, they have
> > the provision to stop repeating the error. If we wish to extend the
> > purpose of this option for another valid use case and there is a good
> > way to achieve the same then we can discuss but I don't think we need
> > to change it in back-branches.
>
> I agree not to change it in back-branches.
>
> What is the use case of extending disable_on_error?
>

The use-case is that with my user-hat on I had assumed
disable_on_error behaviour was as per the name implied so the
subscription would disable on getting (any) error. OTOH I agree, my
expectation is not exactly what the current docs say.

Also, I had thought the motivation for that option was to avoid having
infinite repeating errors that might be caused by the user or data.
e.g. a simple typo in the conninfo can cause this error and AFAIK the
ALTER will appear successful so the user won't know anything about it
unless they also check the logs. OTOH something like a connection
error may only be temporary (caused by a network issue?) and not
caused by a user typo at all, so I can see perhaps that is why
disable_on_error is OK to be excluding connection errors.

 TBH I think there are pros and cons to doing nothing and leaving the
existing behaviour as-is or extending it -- I'm happy to go either
way.

Another idea is to leave behaviour unchanged, but add a note in the
docs like "Note: connection errors (e.g. specifying a bad conninfo
using ALTER SUBSCRIPTION) will not cause the subscription to become
disabled"

Thoughts?

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-01-29 Thread Peter Smith

Here are some review comments for v72-0001

==
doc/src/sgml/ref/alter_subscription.sgml

1.
+  parameter value of the subscription. Otherwise, the slot on the
+  publisher may behave differently from what subscription's
+  failover
+  option says. The slot on the publisher could either be
+  synced to the standbys even when the subscription's
+  failover
+  option is disabled or could be disabled for sync
+  even when the subscription's
+  failover
+  option is enabled.
+ 

It is a bit wordy to keep saying "disabled/enabled"

BEFORE
The slot on the publisher could either be synced to the standbys even
when the subscription's failover option is disabled or could be
disabled for sync even when the subscription's failover option is
enabled.

SUGGESTION
The slot on the publisher could be synced to the standbys even when
the subscription's failover = false or may not be syncing even when
the subscription's failover = true.

==
.../t/040_standby_failover_slots_sync.pl

2.
+# Enable subscription
+$subscriber1->safe_psql('postgres',
+ "ALTER SUBSCRIPTION regress_mysub1 ENABLE");
+
+# Disable failover for enabled subscription
+my ($result, $stdout, $stderr) = $subscriber1->psql('postgres',
+ "ALTER SUBSCRIPTION regress_mysub1 SET (failover = false)");
+ok( $stderr =~ /ERROR:  cannot set failover for enabled subscription/,
+ "altering failover is not allowed for enabled subscription");
+

Currently, those tests are under scope the big comment:

+##
+# Test that changing the failover property of a subscription updates the
+# corresponding failover property of the slot.
+##

But that comment is not quite relevant to these tests. So, add another
one just these:

SUGGESTION:
##
# Test that cannot modify the failover option for enabled subscriptions.
######

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Make documentation builds reproducible

2024-01-29 Thread Peter Smith

On Thu, Jan 25, 2024 at 9:12 AM Peter Smith  wrote:
>
> On Tue, Jan 23, 2024 at 12:32 PM Peter Smith  wrote:
> >
> > On Tue, Jan 23, 2024 at 12:13 PM Tom Lane  wrote:
> > >
> > > Peter Smith  writes:
> > > > I usually the HTML documentation locally using command:
> > > > make STYLE=website html
> > > > This has been working forever, but seems to have broken due to commit
> > > > [1] having an undeclared variable.
> > >
> > > Interestingly, that still works fine for me, on RHEL8 with
> > >
> > > docbook-dtds-1.0-69.el8.noarch
> > > docbook-style-xsl-1.79.2-9.el8.noarch
> > > docbook-style-dsssl-1.79-25.el8.noarch
> > >
> > > What docbook version are you using?
> > >
> >
> > [postgres@CentOS7-x64 sgml]$ sudo yum list installed | grep docbook
> > docbook-dtds.noarch   1.0-60.el7 @anaconda
> > docbook-style-dsssl.noarch1.79-18.el7@base
> > docbook-style-xsl.noarch  1.78.1-3.el7   @anaconda
> >
>
> IIUC these releases notes [1] say autolink.index.see existed since
> v1.79.1, but unfortunately, that is more recent than my ancient
> installed v1.78.1
>
> From the release notes:
> --
> Robert Stayton: autolink.index.see.xml
>
> New param to control automatic links in index from see and
> seealso to indexterm primary.
> --
>
> ==
> [1] https://docbook.sourceforge.net/release/xsl/1.79.1/RELEASE-NOTES.html
>

Is anything going to be changed for this? Since the recent commit [1]
when building the docs now each time I need to first hack (e.g. either
the Makefile or stylesheet-html-common.xml) to declare the missing
‘autolink.index.see’ variable. I know that my old OS is approaching
EOL but I thought my docbook installation was still valid.

==
[1] 
https://github.com/postgres/postgres/commit/b0f0a9432d0b6f53634a96715f2666f6d4ea25a1

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-01-28 Thread Peter Smith

Here are some review comments for v70-0001.

==
doc/src/sgml/protocol.sgml

1.
Related to this, please also review my other patch to the same docs
page protocol.sgml [1].

==
src/backend/replication/logical/tablesync.c

2.
  walrcv_create_slot(LogRepWorkerWalRcvConn,
 slotname, false /* permanent */ , false /* two_phase */ ,
+false,
 CRS_USE_SNAPSHOT, origin_startpos);

I know it was previously mentioned in this thread that inline
parameter comments are unnecessary, but here they are already in the
existing code so shouldn't we do the same?

==
src/backend/replication/repl_gram.y

3.
+/* ALTER_REPLICATION_SLOT slot options */
+alter_replication_slot:
+ K_ALTER_REPLICATION_SLOT IDENT '(' generic_option_list ')'
+ {
+ AlterReplicationSlotCmd *cmd;
+ cmd = makeNode(AlterReplicationSlotCmd);
+ cmd->slotname = $2;
+ cmd->options = $4;
+ $$ = (Node *) cmd;
+ }
+ ;
+

IMO write that comment with parentheses, so it matches the code.

SUGGESTION
ALTER_REPLICATION_SLOT slot ( options )

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPtDWSmW8uiRJF1LfGQJikmo7V2jdysLuRmtsanNZc7fNw%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Documentation to upgrade logical replication cluster

2024-01-28 Thread Peter Smith

Hi Vignesh,

Here are some review comments for patch v4.

These are cosmetic only; otherwise v4 LGTM.

==
doc/src/sgml/ref/pgupgrade.sgml

1.
Configure the servers for log shipping.  (You do not need to run
pg_backup_start() and
pg_backup_stop()
or take a file system backup as the standbys are still synchronized
-   with the primary.)  Only logical slots on the primary are copied to the
-   new standby, but other slots on the old standby are not copied so must
-   be recreated manually.
+   with the primary.)  If the old cluster is prior to 17.0, then no slots
+   on the primary are copied to the new standby, so all the slots must be
+   recreated manually. If the old cluster is 17.0 or later, then only
+   logical slots on the primary are copied to the new standby, but other
+   slots on the old standby are not copied so must be recreated manually.
   

Perhaps the part from "If the old cluster is prior..." should be in a
new paragraph.

==
doc/src/sgml/logical-replication.sgml

2.
+   
+Setup the 
+subscriber configurations in the new subscriber.
+pg_upgrade attempts to migrate subscription
+dependencies which includes the subscription's table information present in
+pg_subscription_rel
+system catalog and also the subscription's replication origin. This allows
+logical replication on the new subscriber to continue from where the
+old subscriber was up to. Migration of subscription dependencies is only
+supported when the old cluster is version 17.0 or later. Subscription
+dependencies on clusters before version 17.0 will silently be ignored.
+   

Perhaps the part from "pg_upgrade attempts..." should be in a new paragraph.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-01-24 Thread Peter Smith

Here are some review comments for v67-0002.

==
src/backend/replication/logical/slotsync.c

1.
+/* The sleep time (ms) between slot-sync cycles varies dynamically
+ * (within a MIN/MAX range) according to slot activity. See
+ * wait_for_slot_activity() for details.
+ */
+#define MIN_WORKER_NAPTIME_MS  200
+#define MAX_WORKER_NAPTIME_MS  3 /* 30s */
+
+static long sleep_ms = MIN_WORKER_NAPTIME_MS;

In my previous review for this, I meant for there to be no whitespace
between the #defines and the static long sleep_ms so the prior comment
then clearly belongs to all 3 lines

~~~

2. synchronize_one_slot

+ /*
+ * Sanity check: Make sure that concerned WAL is received and flushed
+ * before syncing slot to target lsn received from the primary server.
+ *
+ * This check should never pass as on the primary server, we have waited
+ * for the standby's confirmation before updating the logical slot.
+ */
+ latestFlushPtr = GetStandbyFlushRecPtr(NULL);
+ if (remote_slot->confirmed_lsn > latestFlushPtr)
+ {
+ ereport(LOG,
+ errmsg("skipping slot synchronization as the received slot sync"
+" LSN %X/%X for slot \"%s\" is ahead of the standby position %X/%X",
+LSN_FORMAT_ARGS(remote_slot->confirmed_lsn),
+remote_slot->name,
+LSN_FORMAT_ARGS(latestFlushPtr)));
+
+ return false;
+ }

Previously in v65 this was an elog, but now it is an ereport. But
since this is a sanity check condition that "should never pass" wasn't
the elog the more appropriate choice?

~~~

3. synchronize_one_slot

+ /*
+ * We don't want any complicated code while holding a spinlock, so do
+ * namestrcpy() and get_database_oid() outside.
+ */
+ namestrcpy(_name, remote_slot->plugin);
+ dbid = get_database_oid(remote_slot->database, false);

IMO just simplify the whole comment, here and for the other similar
comment in local_slot_update().

SUGGESTION
/* Avoid expensive operations while holding a spinlock. */

~~~

4. synchronize_slots

+ /* Construct the remote_slot tuple and synchronize each slot locally */
+ tupslot = MakeSingleTupleTableSlot(res->tupledesc, );
+ while (tuplestore_gettupleslot(res->tuplestore, true, false, tupslot))
+ {
+ bool isnull;
+ RemoteSlot *remote_slot = palloc0(sizeof(RemoteSlot));
+ Datum d;
+
+ remote_slot->name = TextDatumGetCString(slot_getattr(tupslot, 1, ));
+ Assert(!isnull);
+
+ remote_slot->plugin = TextDatumGetCString(slot_getattr(tupslot, 2, ));
+ Assert(!isnull);
+
+ /*
+ * It is possible to get null values for LSN and Xmin if slot is
+ * invalidated on the primary server, so handle accordingly.
+ */
+ d = slot_getattr(tupslot, 3, );
+ remote_slot->confirmed_lsn = isnull ? InvalidXLogRecPtr :
+ DatumGetLSN(d);
+
+ d = slot_getattr(tupslot, 4, );
+ remote_slot->restart_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d);
+
+ d = slot_getattr(tupslot, 5, );
+ remote_slot->catalog_xmin = isnull ? InvalidTransactionId :
+ DatumGetTransactionId(d);
+
+ remote_slot->two_phase = DatumGetBool(slot_getattr(tupslot, 6, ));
+ Assert(!isnull);
+
+ remote_slot->database = TextDatumGetCString(slot_getattr(tupslot,
+ 7, ));
+ Assert(!isnull);
+
+ d = slot_getattr(tupslot, 8, );
+ remote_slot->invalidated = isnull ? RS_INVAL_NONE :
+ get_slot_invalidation_cause(TextDatumGetCString(d));

Would it be better to get rid of the hardwired column numbers and then
be able to use the SLOTSYNC_COLUMN_COUNT already defined as a sanity
check?

SUGGESTION
int col = 0;
...
remote_slot->name = TextDatumGetCString(slot_getattr(tupslot, ++col, ));
...
remote_slot->plugin = TextDatumGetCString(slot_getattr(tupslot, ++col,
));
...
d = slot_getattr(tupslot, ++col, );
remote_slot->confirmed_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d);
...
d = slot_getattr(tupslot, ++col, );
remote_slot->restart_lsn = isnull ? InvalidXLogRecPtr : DatumGetLSN(d);
...
d = slot_getattr(tupslot, ++col, );
remote_slot->catalog_xmin = isnull ? InvalidTransactionId :
DatumGetTransactionId(d);
...
remote_slot->two_phase = DatumGetBool(slot_getattr(tupslot, ++col, ));
...
remote_slot->database = TextDatumGetCString(slot_getattr(tupslot,
++col, ));
...
d = slot_getattr(tupslot, ++col, );
remote_slot->invalidated = isnull ? RS_INVAL_NONE :
get_slot_invalidation_cause(TextDatumGetCString(d));

/* Sanity check */
Asert(col == SLOTSYNC_COLUMN_COUNT);

~~~

5.
+static char *
+validate_parameters_and_get_dbname(void)
+{
+ char*dbname;

These are configuration issues, so probably all these ereports could
also set errcode(ERRCODE_INVALID_PARAMETER_VALUE).

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-01-24 Thread Peter Smith

Here are some review comments for the patch v67-0001.

==
1.
There are a couple of places checking for failover usage on a standby.

+ if (RecoveryInProgress() && failover)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot enable failover for a replication slot"
+" created on the standby"));

and

+ if (RecoveryInProgress() && failover)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot enable failover for a replication slot"
+" on the standby"));

IMO the conditions should be written the other way around (failover &&
RecoveryInProgress()) to avoid the unnecessary function calls when
'failover' flag is probably mostly default false anyway.

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Documentation to upgrade logical replication cluster

2024-01-24 Thread Peter Smith

Here are some review comments for patch v3.

==
doc/src/sgml/ref/pgupgrade.sgml

1.
+
+
+ This page does not cover steps to upgrade logical replication
clusters, refer
+for details on upgrading
+ logical replication clusters.
+
+

I felt that maybe this note was misplaced. Won't it be better to put
this down in the "Usage" section of this page?

BEFORE
These are the steps to perform an upgrade with pg_upgrade:

SUGGESTION (or something like this)
Below are the steps to perform an upgrade with pg_upgrade.

Note, the steps to upgrade logical replication clusters are not
covered here; refer to
for details.

~~~

2.
Configure the servers for log shipping. (You do not need to run
pg_backup_start() and
pg_backup_stop()
or take a file system backup as the standbys are still synchronized
- with the primary.) Only logical slots on the primary are copied to the
- new standby, but other slots on the old standby are not copied so must
- be recreated manually.
+ with the primary.) In version 17.0 or later, only logical slots on the
+ primary are copied to the new standby, but other slots on the
old standby
+ are not copied so must be recreated manually.

This para was still unclear to me. What is version 17.0 referring to
-- the old_cluster version? Do you mean something like:
If the old cluster is < v17 then logical slots are not copied. If the
old_cluster is >= v17 then...

==
doc/src/sgml/logical-replication.sgml

3.
+
+While upgrading a subscriber, write operations can be performed in the
+publisher, these changes will be replicated to the subscriber once the
+subscriber upgradation is completed.
+

3a.
/publisher, these changes/publisher. These changes/

3b.
"upgradation" ??. See [1]

maybe just /upgradation/upgrade/

~~~

4. GENERAL - prompts/paths

I noticed in v3 you removed all the cmd prompts like:
dba@node1:/opt/PostgreSQL/postgres/17/bin$
dba@node2:/opt/PostgreSQL/postgres/18/bin$
etc.

I thought those were helpful to disambiguate which server/version was
being operated on. I wonder if there is some way to keep information
still but not make it look like a real current directory that
Kuroda-san did not like:

e.g. Maybe something like the below is possible?

(dba@node1: v17) pg_upgrade...
(dba@node2: v18) pg_upgrade...

==
[1]
https://english.stackexchange.com/questions/192187/upgradation-not-universally-accepted#:~:text=Not%20all%20dictionaries%20(or%20native,by%20most%20non%2DIE%20speakers.

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Make documentation builds reproducible

2024-01-24 Thread Peter Smith

On Tue, Jan 23, 2024 at 12:32 PM Peter Smith  wrote:
>
> On Tue, Jan 23, 2024 at 12:13 PM Tom Lane  wrote:
> >
> > Peter Smith  writes:
> > > I usually the HTML documentation locally using command:
> > > make STYLE=website html
> > > This has been working forever, but seems to have broken due to commit
> > > [1] having an undeclared variable.
> >
> > Interestingly, that still works fine for me, on RHEL8 with
> >
> > docbook-dtds-1.0-69.el8.noarch
> > docbook-style-xsl-1.79.2-9.el8.noarch
> > docbook-style-dsssl-1.79-25.el8.noarch
> >
> > What docbook version are you using?
> >
>
> [postgres@CentOS7-x64 sgml]$ sudo yum list installed | grep docbook
> docbook-dtds.noarch   1.0-60.el7 @anaconda
> docbook-style-dsssl.noarch1.79-18.el7@base
> docbook-style-xsl.noarch  1.78.1-3.el7   @anaconda
>

IIUC these releases notes [1] say autolink.index.see existed since
v1.79.1, but unfortunately, that is more recent than my ancient
installed v1.78.1

>From the release notes:
--
Robert Stayton: autolink.index.see.xml

New param to control automatic links in index from see and
seealso to indexterm primary.
--

==
[1] https://docbook.sourceforge.net/release/xsl/1.79.1/RELEASE-NOTES.html

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-01-23 Thread Peter Smith

as we do not support sync to the cascading standby.
+ */
+ if (RecoveryInProgress() && failover)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot alter replication slot to have failover"
+" enabled on the standby"));

I felt the errmsg could be expressed with less ambiguity:

SUGGESTION:
cannot enable failover for a replication slot on the standby

==
src/backend/replication/slotfuncs.c

11. create_physical_replication_slot

  /* acquire replication slot, this will check for conflicting names */
  ReplicationSlotCreate(name, false,
-   temporary ? RS_TEMPORARY : RS_PERSISTENT, false);
+   temporary ? RS_TEMPORARY : RS_PERSISTENT, false,
+   false);

Having an inline comment might be helpful here instead of passing "false,false"

SUGGESTION
ReplicationSlotCreate(name, false,
 temporary ? RS_TEMPORARY : RS_PERSISTENT, false,
 false /* failover */);

~~~

12. create_logical_replication_slot

+ /*
+ * Do not allow users to create the slots with failover enabled on the
+ * standby as we do not support sync to the cascading standby.
+ */
+ if (RecoveryInProgress() && failover)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot create replication slot with failover"
+" enabled on the standby"));

(similar to previous comment)

SUGGESTION:
cannot enable failover for a replication slot created on the standby

~~~

13. copy_replication_slot

  * hence pass find_startpoint false.  confirmed_flush will be set
  * below, by copying from the source slot.
+ *
+ * To avoid potential issues with the slotsync worker when the
+ * restart_lsn of a replication slot goes backwards, we set the
+ * failover option to false here. This situation occurs when a slot on
+ * the primary server is dropped and immediately replaced with a new
+ * slot of the same name, created by copying from another existing
+ * slot. However, the slotsync worker will only observe the restart_lsn
+ * of the same slot going backwards.
  */
  create_logical_replication_slot(NameStr(*dst_name),
  plugin,
  temporary,
  false,
+ false,
  src_restart_lsn,
  false);
(similar to an earlier comment)

Having an inline comment might be helpful here.

e.g. false /* failover */,

==
src/backend/replication/walreceiver.c

14.
- walrcv_create_slot(wrconn, slotname, true, false, 0, NULL);
+ walrcv_create_slot(wrconn, slotname, true, false, false, 0, NULL);

(similar to an earlier comment)

Having an inline comment might be helpful here:

SUGGESTION
walrcv_create_slot(wrconn, slotname, true, false, false /* failover
*/, 0, NULL);

==
src/backend/replication/walsender.c

15. CreateReplicationSlot

  ReplicationSlotCreate(cmd->slotname, false,
cmd->temporary ? RS_TEMPORARY : RS_PERSISTENT,
-   false);
+   false, false);

(similar to an earlier comment)

Having an inline comment might be helpful here.

e.g. false /* failover */,

~~~

16. CreateReplicationSlot

+ /*
+ * Do not allow users to create the slots with failover enabled on the
+ * standby as we do not support sync to the cascading standby.
+ */
+ if (RecoveryInProgress() && failover)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot create replication slot with failover"
+" enabled on the standby"));
+
  /*
  * Initially create persistent slot as ephemeral - that allows us to
  * nicely handle errors during initialization because it'll get
@@ -1243,7 +1265,7 @@ CreateReplicationSlot(CreateReplicationSlotCmd *cmd)
  */
  ReplicationSlotCreate(cmd->slotname, true,
cmd->temporary ? RS_TEMPORARY : RS_EPHEMERAL,
-   two_phase);
+   two_phase, failover);

This errmsg seems to be repeated in a few places, so I wondered if
this code can be refactored to call direct to
create_logical_replication_slot() so the errmsg can be just once in a
common place.

OTOH, if it cannot be refactored, then needs to be using same errmsg
as suggested by earlier review comments (see above).

==
src/include/catalog/pg_subscription.h

17.
+ bool subfailover; /* True if the associated replication slots
+ * (i.e. the main slot and the table sync
+ * slots) in the upstream database are enabled
+ * to be synchronized to the physical
+ * standbys. */

(same as earlier)

/physical standbys/physical standby/

~~~

18.
+ bool failover; /* True if the associated replication slots
+ * (i.e. the main slot and the table sync
+ * slots) in the upstream database are enabled
+ * to be synchronized to the physical
+ * standbys. */

(same as earlier)

/physical standbys/physical standby/

==
src/include/replication/slot.h

19.
+
+ /*
+ * Is this a failover slot (sync candidate for physical standbys)? Only
+ * relevant for logical slots on the primary server.
+ */
+ bool failover;

(same as earlier)

/physical standbys/physical standby/

==
[1] Nisha errmsg -
https://www.postgresql.org/message-id/CABdArM5-VR4Akt_AHap_0Ofne0cTcsdnN6FcNe%2BMU8eXsa_ERQ%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Synchronizing slots from primary to standby

2024-01-22 Thread Peter Smith

overy", should
there also be a check for that (e.g. RecoveryInProgress) in the added
Assert?

==
src/include/replication/walreceiver.h

5.
 typedef char *(*walrcv_identify_system_fn) (WalReceiverConn *conn,
  TimeLineID *primary_tli);
+/*
+ * walrcv_get_dbname_from_conninfo_fn
+ *
+ * Returns the dbid from the primary_conninfo
+ */
+typedef char *(*walrcv_get_dbname_from_conninfo_fn) (const char *conninfo);

It looks like a blank line that previously existed has been lost.

==
[1] 
https://www.postgresql.org/message-id/CAHut%2BPsf3NewXbsFKY88Qn1ON1_dMD6343MuWdMiiM2Ds9a_wA%40mail.gmail.com

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Make documentation builds reproducible

2024-01-22 Thread Peter Smith

On Tue, Jan 23, 2024 at 12:13 PM Tom Lane  wrote:
>
> Peter Smith  writes:
> > I usually the HTML documentation locally using command:
> > make STYLE=website html
> > This has been working forever, but seems to have broken due to commit
> > [1] having an undeclared variable.
>
> Interestingly, that still works fine for me, on RHEL8 with
>
> docbook-dtds-1.0-69.el8.noarch
> docbook-style-xsl-1.79.2-9.el8.noarch
> docbook-style-dsssl-1.79-25.el8.noarch
>
> What docbook version are you using?
>

[postgres@CentOS7-x64 sgml]$ sudo yum list installed | grep docbook
docbook-dtds.noarch   1.0-60.el7 @anaconda
docbook-style-dsssl.noarch1.79-18.el7@base
docbook-style-xsl.noarch  1.78.1-3.el7   @anaconda

==
Kind Regards,
Peter Smith.
Fujitsu Australia

Re: Make documentation builds reproducible

2024-01-22 Thread Peter Smith

Hi,

I usually the HTML documentation locally using command:

make STYLE=website html

~

This has been working forever, but seems to have broken due to commit
[1] having an undeclared variable.

e.g.
[postgres@CentOS7-x64 sgml]$ make STYLE=website html
{ \
  echo ""; \
  echo ""; \
} > version.sgml
'/usr/bin/perl' ./mk_feature_tables.pl YES
../../../src/backend/catalog/sql_feature_packages.txt
../../../src/backend/catalog/sql_features.txt >
features-supported.sgml
'/usr/bin/perl' ./mk_feature_tables.pl NO
../../../src/backend/catalog/sql_feature_packages.txt
../../../src/backend/catalog/sql_features.txt >
features-unsupported.sgml
'/usr/bin/perl' ./generate-errcodes-table.pl
../../../src/backend/utils/errcodes.txt > errcodes-table.sgml
'/usr/bin/perl' ./generate-keywords-table.pl . > keywords-table.sgml
'/usr/bin/perl' ./generate-targets-meson.pl targets-meson.txt
generate-targets-meson.pl > targets-meson.sgml
'/usr/bin/perl'
../../../src/backend/utils/activity/generate-wait_event_types.pl
--docs ../../../src/backend/utils/activity/wait_event_names.txt
/usr/bin/xmllint --nonet --path . --path . --output postgres-full.xml
--noent --valid postgres.sgml
/usr/bin/xsltproc --nonet --path . --path . --stringparam pg.version
'17devel' --param website.stylesheet 1 stylesheet.xsl
postgres-full.xml
runtime error: file stylesheet-html-common.xsl line 452 element if
Variable 'autolink.index.see' has not been declared.
make: *** [html-stamp] Error 10

==
[1] 
https://github.com/postgres/postgres/commit/b0f0a9432d0b6f53634a96715f2666f6d4ea25a1

Kind Regards,
Peter Smith.
Fujitsu Australia

Re: POC: Extension for adding distributed tracing - pg_tracing

2024-01-21 Thread Peter Smith

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

==
[1] https://commitfest.postgresql.org/46/4456/
[2] https://cirrus-ci.com/task/5581154296791040

Kind Regards,
Peter Smith.

Re: Shared detoast Datum proposal

2024-01-21 Thread Peter Smith

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

==
[1] https://commitfest.postgresql.org/46/4759/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4759

Kind Regards,
Peter Smith.

Re: Things I don't like about \du's "Attributes" column

2024-01-21 Thread Peter Smith

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there were CFbot test failures last time it was run [2]. Please have a
look and post an updated version if necessary.

==
[1] https://commitfest.postgresql.org/46/4738/
[2] https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest/46/4738

Kind Regards,
Peter Smith.

Re: pg_stat_statements and "IN" conditions

2024-01-21 Thread Peter Smith

2024-01 Commitfest.

Hi, This patch has a CF status of "Needs Review" [1], but it seems
there was a CFbot test failure last time it was run [2]. Please have a
look and post an updated version if necessary.

==
[1] https://commitfest.postgresql.org/46/2837/
[2] https://cirrus-ci.com/task/6688578378399744

Kind Regards,
Peter Smith.

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1451 matches

Mail list logo