Re: speed up a logical replica setup

2024-05-24 Thread Shlok Kyal
On Wed, 22 May 2024 at 16:50, Amit Kapila  wrote:
>
> On Mon, May 20, 2024 at 4:30 PM Shlok Kyal  wrote:
> > >
> > > I was trying to test this utility when 'sync_replication_slots' is on
> > > and it gets in an ERROR loop [1] and never finishes. Please find the
> > > postgresql.auto used on the standby attached. I think if the standby
> > > has enabled sync_slots, you need to pass dbname in
> > > GenerateRecoveryConfig(). I couldn't test it further but I wonder if
> > > there are already synced slots on the standby (either due to
> > > 'sync_replication_slots' or users have used
> > > pg_sync_replication_slots() before invoking pg_createsubscriber),
> > > those would be retained as it is on new subscriber and lead to
> > > unnecessary WAL retention and dead rows.
> > >
> > > [1]
> > > 2024-04-30 11:50:43.239 IST [12536] LOG:  slot sync worker started
> > > 2024-04-30 11:50:43.247 IST [12536] ERROR:  slot synchronization
> > > requires dbname to be specified in primary_conninfo
> >
> > Hi,
> >
> > I tested the scenario posted by Amit in [1], in which retaining synced
> > slots leads to unnecessary WAL retention and an ERROR. This is raised as
> > the second open point in [2].
> > The steps to reproduce the issue:
> > (1) Set up physical replication with the sync slot feature turned on by
> > setting sync_replication_slots = 'true' or by using
> > pg_sync_replication_slots() on the standby node.
> > For the physical replication setup, run pg_basebackup with the -R and -d options.
> > (2) Create a logical replication slot on the primary node with the
> > failover option set to true. A corresponding slot is created on the
> > standby as part of the sync slot feature.
> > (3) Run pg_createsubscriber on the standby node.
> > (4) On checking the replication slots on the standby node, I noticed
> > that the logical slots created in step 2 are retained.
> >  I have attached the script to reproduce the issue.
> >
> > Kuroda-san and I worked to resolve the open points. Here are patches to
> > solve the second and third points in [2].
> > Patches proposed by Euler are also attached just in case, but they
> > were not modified.
> >
> > v2-0001: not changed
> >
>
> Shouldn't we modify it as per the suggestion given in the email [1]? I
> am wondering if we can entirely get rid of checking the primary
> business and simply rely on recovery_timeout and keep checking
> server_is_in_recovery(). If so, we can modify the test to use
> non-default recovery_timeout (say 180s or something similar if we have
> used it at any other place). As an additional check we can ensure that
> consistent_lsn is present on the standby.
>

I have made changes as per your suggestion. I have used
pg_last_wal_receive_lsn to get the LSN of the last WAL received on the
standby and to check whether the consistent_lsn is already present there.
I have only made changes in v3-0001; v3-0002, v3-0003, v3-0004 and
v3-0005 are not modified.
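
For reference, the standby-side check boils down to something like the
following (a sketch only; <consistent_lsn> stands for the LSN returned
when the replication slot was created, and port 9000 matches the
scripts used elsewhere in this thread):

# returns t once the standby has received WAL up to the consistent point
./psql -p 9000 -d postgres -At \
  -c "SELECT pg_last_wal_receive_lsn() >= '<consistent_lsn>'::pg_lsn;"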

Thanks and Regards,
Shlok Kyal


v3-0005-Avoid-outputing-some-messages-in-dry_run-mode.patch
Description: Binary data


v3-0003-Disable-slot-sync-during-the-convertion.patch
Description: Binary data


v3-0001-Improve-the-code-that-checks-if-the-recovery-is-f.patch
Description: Binary data


v3-0004-Drop-replication-slots-which-had-been-synchronize.patch
Description: Binary data


v3-0002-Improve-the-code-that-checks-if-the-primary-slot-.patch
Description: Binary data


Re: State of pg_createsubscriber

2024-05-22 Thread Shlok Kyal
> Just to summarize, apart from BF failures for which we had some
> discussion, I could recall the following open points:
>
> 1. After promotion, the pre-existing replication objects should be
> removed (either optionally or always), otherwise, it can lead to a new
> subscriber not being able to restart or getting some unwarranted data.
> [1][2].
>
I tried to reproduce the case and found a scenario where pre-existing
replication objects can cause unwanted behavior:

Suppose we have a setup of nodes N1, N2 and N3.
N1 and N2 are in streaming replication, where N1 is the primary and N2 is the standby.
N3 and N1 are in logical replication, where N3 is the publisher and N1 is the subscriber.
The subscription created on N1 is replicated to N2 due to streaming replication.

Now, after we run pg_createsubscriber on N2 and start the N2 server,
we get the following logs repetitively:
2024-05-22 11:37:18.619 IST [27344] ERROR:  could not start WAL
streaming: ERROR:  replication slot "test1" is active for PID 27202
2024-05-22 11:37:18.622 IST [27317] LOG:  background worker "logical
replication apply worker" (PID 27344) exited with exit code 1
2024-05-22 11:37:23.610 IST [27349] LOG:  logical replication apply
worker for subscription "test1" has started
2024-05-22 11:37:23.624 IST [27349] ERROR:  could not start WAL
streaming: ERROR:  replication slot "test1" is active for PID 27202
2024-05-22 11:37:23.627 IST [27317] LOG:  background worker "logical
replication apply worker" (PID 27349) exited with exit code 1
2024-05-22 11:37:28.616 IST [27382] LOG:  logical replication apply
worker for subscription "test1" has started

Note: 'test1' is the name of the subscription initially created on N1,
and by default the slot name is the same as the subscription name.

Once the N2 server is started after running pg_createsubscriber, the
subscription that was earlier replicated by streaming replication will
now try to connect to the publisher. Since the subscription name on N2
is the same as the subscription created on N1, it will not be able to
start streaming from the replication slot, as the slot with the same
name is already active for the logical replication between N3 and N1.

There is also a case where N1 goes down for some time. In that case,
the subscription on N2 will connect to the publication on N3, and data
from N3 will be replicated to N2 instead of N1. Once N1 is up again,
the subscription on N1 will not be able to connect to the publication
on N3, as it is already connected to N2. This can lead to data
inconsistency.

This error did not happen before running pg_createsubscriber on the
standby node N2, because there is no 'logical replication launcher'
process on a standby node.
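
Until pg_createsubscriber removes such pre-existing objects itself, a
manual workaround is to detach the copied subscription on N2 without
dropping the remote slot that N1 is still using, roughly as follows (a
sketch, assuming N2 listens on port 9000):

# disable the leftover subscription, dissociate it from its slot, then drop it;
# SET (slot_name = NONE) keeps the remote slot intact for N1
./psql -p 9000 -d postgres -c "ALTER SUBSCRIPTION test1 DISABLE;"
./psql -p 9000 -d postgres -c "ALTER SUBSCRIPTION test1 SET (slot_name = NONE);"
./psql -p 9000 -d postgres -c "DROP SUBSCRIPTION test1;"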

Thanks and Regards,
Shlok Kyal




Re: speed up a logical replica setup

2024-05-20 Thread Shlok Kyal
Hi,
>
> I was trying to test this utility when 'sync_replication_slots' is on
> and it gets in an ERROR loop [1] and never finishes. Please find the
> postgresql.auto used on the standby attached. I think if the standby
> has enabled sync_slots, you need to pass dbname in
> GenerateRecoveryConfig(). I couldn't test it further but I wonder if
> there are already synced slots on the standby (either due to
> 'sync_replication_slots' or users have used
> pg_sync_replication_slots() before invoking pg_createsubscriber),
> those would be retained as it is on new subscriber and lead to
> unnecessary WAL retention and dead rows.
>
> [1]
> 2024-04-30 11:50:43.239 IST [12536] LOG:  slot sync worker started
> 2024-04-30 11:50:43.247 IST [12536] ERROR:  slot synchronization
> requires dbname to be specified in primary_conninfo

Hi,

I tested the scenario posted by Amit in [1], in which retaining synced
slots leads to unnecessary WAL retention and an ERROR. This is raised as
the second open point in [2].
The steps to reproduce the issue:
(1) Set up physical replication with the sync slot feature turned on by
setting sync_replication_slots = 'true' or by using
pg_sync_replication_slots() on the standby node.
For the physical replication setup, run pg_basebackup with the -R and -d options.
(2) Create a logical replication slot on the primary node with the
failover option set to true. A corresponding slot is created on the
standby as part of the sync slot feature.
(3) Run pg_createsubscriber on the standby node.
(4) On checking the replication slots on the standby node, I noticed
that the logical slots created in step 2 are retained.
I have attached the script to reproduce the issue.
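
In step (4), the retained slots can be spotted with a more targeted
query than the one in the script (a sketch; it assumes a PG17-era build
where pg_replication_slots has the 'failover' and 'synced' columns, and
port 9000 as in the attached script):

./psql -p 9000 -d postgres \
  -c "SELECT slot_name, slot_type, failover, synced FROM pg_replication_slots;"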

Kuroda-san and I worked to resolve the open points. Here are patches to
solve the second and third points in [2].
The patches proposed by Euler are also attached just in case, but they
were not modified.

v2-0001: not changed
v2-0002: not changed
v2-0003: ensures the slot sync is disabled during the conversion. This
resolves the second point.
v2-0004: drops sync slots which may be retained after running. This
resolves the second point.
v2-0005: removes misleading output messages in dry-run. This resolves
the third point.
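
(For context on the third point: the misleading messages appear in
dry-run mode, which can be exercised with something like the command
below, reusing the setup from the attached script; -n is the dry-run
switch.)

./pg_createsubscriber -D ../standby/ -n -p 9000 -P "host=localhost port=5432 dbname=postgres" -d postgres -v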

[1]: 
https://www.postgresql.org/message-id/CAA4eK1KdCb%2B5sjYu6qCMXXdCX1y_ihr8kFzMozq0%3DP%3DauYxgog%40mail.gmail.com
[2]: 
https://www.postgresql.org/message-id/CAA4eK1J22UEfrqx222h5j9DQ7nxGrTbAa_BC%2B%3DmQXdXs-RCsew%40mail.gmail.com

Thanks and Regards,
Shlok Kyal
# cleanup
rm -rf ../primary ../standby primary.log standby.log

# setup primary node
./initdb -D ../primary
 
cat << EOF >> ../primary/postgresql.conf
wal_level = logical
EOF

./pg_ctl -D ../primary -l primary.log start
./psql -d postgres -c "CREATE table t1 (c1 int);"
./psql -d postgres -c "Insert into t1 values(1);"
./psql -d postgres -c "Insert into t1 values(2);"
./psql -d postgres -c "INSERT into t1 values(3);"
./psql -d postgres -c "INSERT into t1 values(4);"

# setup standby node with sync slot feature
./pg_basebackup -h localhost -X stream -v -W -R -S 'sb1' -C -D ../standby/ -d 'dbname=postgres' 

cat << EOF >> ../standby/postgresql.conf
sync_replication_slots = 'true'
hot_standby_feedback = 'on'
port = 9000
EOF

./pg_ctl -D ../standby -l standby.log start

# create a replication slot on primary with failover option
./psql -d postgres -c "SELECT 'init' FROM pg_create_logical_replication_slot('failover_true_slot_2', 'test_decoding', false, false, true);"
sleep 2

# check if slots are synced on standby
./psql -d postgres -p 9000 -c "select * from pg_replication_slots;" 
./pg_ctl -D ../standby -l standby.log stop

# run pg_createsubscriber on standby
./pg_createsubscriber -D ../standby/ -p 9000 -P "host=localhost port=5432 dbname=postgres" -d postgres -v

# check if replication slots are retained after running pg_createsubscriber
./pg_ctl -D ../standby -l standby.log start
./psql -d postgres -p 9000 -c "select * from pg_replication_slots;" 

./pg_ctl -D ../standby -l standby.log stop
./pg_ctl -D ../primary -l primary.log stop


v2-0001-Improve-the-code-that-checks-if-the-recovery-is-f.patch
Description: Binary data


v2-0002-Improve-the-code-that-checks-if-the-primary-slot-.patch
Description: Binary data


v2-0003-Disable-slot-sync-during-the-convertion.patch
Description: Binary data


v2-0004-Drop-replication-slots-which-had-been-synchronize.patch
Description: Binary data


v2-0005-Avoid-outputing-some-messages-in-dry_run-mode.patch
Description: Binary data


Re: Pgoutput not capturing the generated columns

2024-05-20 Thread Shlok Kyal
Hi Kuroda-san,

Thanks for reviewing the patch. I have addressed some of the comments.
> 2.
> Currently, the option is implemented as streaming option. Are there any 
> reasons
> to choose the way? Another approach is to implement as slot option, like 
> failover
> and temporary.
I think the current approach is appropriate. Options such as failover
and temporary are properties of a slot, and I think decoding of
generated columns should not be slot-specific. Also, adding a new slot
option may add overhead.

> 3.
> You said that subscription option is not supported for now. Not sure, is it 
> mean
> that logical replication feature cannot be used for generated columns? If so,
> the restriction won't be acceptable. If the combination between this and 
> initial
> sync is problematic, can't we exclude them in CreateSubscrition and 
> AlterSubscription?
> E.g., create_slot option cannot be set if slot_name is NONE.
Added a 'generated_column' option for CREATE SUBSCRIPTION. Currently it
allows setting the 'generated_column' option to true only if
'copy_data' is set to false.
Also, we don't allow users to alter the 'generated_column' option.
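
With the patch, the intended usage is along these lines (a sketch;
'sub1', 'pub1' and the connection string are hypothetical, and the
exact syntax is whatever the patch implements):

# allowed: generated columns are decoded, initial copy is disabled
./psql -d postgres -c "CREATE SUBSCRIPTION sub1 CONNECTION 'host=localhost port=5432 dbname=postgres' PUBLICATION pub1 WITH (generated_column = true, copy_data = false);"
# rejected by the patch: generated_column = true together with copy_data = true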

> 6. logicalrep_write_tuple()
>
> ```
> -if (!column_in_column_list(att->attnum, columns))
> +if (!column_in_column_list(att->attnum, columns) && 
> !att->attgenerated)
> +continue;
> ```
>
> Hmm, does above mean that generated columns are decoded even if they are not 
> in
> the column list? If so, why? I think such columns should not be sent.
Fixed

Thanks and Regards,
Shlok Kyal


v2-0001-Support-generated-column-capturing-generated-colu.patch
Description: Binary data


Re: speed up a logical replica setup

2024-03-20 Thread Shlok Kyal
> Hi,
>
> > I'm attaching a new version (v30) that adds:
> >
> > * 3 new options (--publication, --subscription, --replication-slot) to 
> > assign
> >   names to the objects. The --database option used to ignore duplicate 
> > names,
> >   however, since these new options rely on the number of database options to
> >   match the number of object name options, it is forbidden from now on. The
> >   duplication is also forbidden for the object names to avoid errors 
> > earlier.
> > * rewrite the paragraph related to unusable target server after
> >   pg_createsubscriber fails.
> > * Vignesh reported an issue [1] related to reaching the recovery stop point
> >   before the consistent state is reached. I proposed a simple patch that 
> > fixes
> >   the issue.
> >
> > [1] 
> > https://www.postgresql.org/message-id/CALDaNm3VMOi0GugGvhk3motghaFRKSWMCSE2t3YX1e%2BMttToxg%40mail.gmail.com
> >
>
> I have added a top-up patch v30-0003. The issue in [1] still exists in
> the v30 patch. I was not able to come up with an approach to handle it
> in the code, so I have added it to the documentation in the warning
> section. Thoughts?
> I am not changing the version as I have not made any changes in
> v30-0001 and v30-0002.
>
> [1]: 
> https://www.postgresql.org/message-id/cahv8rj+5mzk9jt+7ecogjzfm5czvdccd5jo1_rcx0bteypb...@mail.gmail.com

There were some whitespace errors in the v30-0003 patch. I have updated the patch.



Thanks and regards,
Shlok Kyal
From d6717bcaf94302c0d41eabd448802b1fae6167d9 Mon Sep 17 00:00:00 2001
From: Euler Taveira 
Date: Mon, 5 Jun 2023 14:39:40 -0400
Subject: [PATCH v31 1/3] pg_createsubscriber: creates a new logical replica
 from a standby server

It must be run on the target server and should be able to connect to the
source server (publisher) and the target server (subscriber). All tables
in the specified database(s) are included in the logical replication
setup. A pair of publication and subscription objects are created for
each database.

The main advantage of pg_createsubscriber over the common logical
replication setup is the initial data copy. It also reduces the catchup
phase.

Some prerequisites must be met to successfully run it. It is basically
the logical replication requirements. It starts creating a publication
using FOR ALL TABLES and a replication slot for each specified database.
Write recovery parameters into the target data directory and start the
target server. It specifies the LSN of the last replication slot
(replication start point) up to which the recovery will proceed. Wait
until the target server is promoted. Create one subscription per
specified database (using publication and replication slot created in a
previous step) on the target server. Set the replication progress to the
replication start point for each subscription. Enable the subscription
for each specified database on the target server. And finally, change
the system identifier on the target server.
---
 doc/src/sgml/ref/allfiles.sgml|1 +
 doc/src/sgml/ref/pg_createsubscriber.sgml |  525 
 doc/src/sgml/reference.sgml   |1 +
 src/bin/pg_basebackup/.gitignore  |1 +
 src/bin/pg_basebackup/Makefile|   11 +-
 src/bin/pg_basebackup/meson.build |   19 +
 src/bin/pg_basebackup/pg_createsubscriber.c   | 2131 +
 .../t/040_pg_createsubscriber.pl  |  326 +++
 8 files changed, 3012 insertions(+), 3 deletions(-)
 create mode 100644 doc/src/sgml/ref/pg_createsubscriber.sgml
 create mode 100644 src/bin/pg_basebackup/pg_createsubscriber.c
 create mode 100644 src/bin/pg_basebackup/t/040_pg_createsubscriber.pl

diff --git a/doc/src/sgml/ref/allfiles.sgml b/doc/src/sgml/ref/allfiles.sgml
index 4a42999b18..f5be638867 100644
--- a/doc/src/sgml/ref/allfiles.sgml
+++ b/doc/src/sgml/ref/allfiles.sgml
@@ -205,6 +205,7 @@ Complete list of usable sgml source files in this directory.
 
 
 
+
 
 
 
diff --git a/doc/src/sgml/ref/pg_createsubscriber.sgml b/doc/src/sgml/ref/pg_createsubscriber.sgml
new file mode 100644
index 00..642213b5a4
--- /dev/null
+++ b/doc/src/sgml/ref/pg_createsubscriber.sgml
@@ -0,0 +1,525 @@
+
+
+
+ 
+  pg_createsubscriber
+ 
+
+ 
+  pg_createsubscriber
+  1
+  Application
+ 
+
+ 
+  pg_createsubscriber
+  convert a physical replica into a new logical replica
+ 
+
+ 
+  
+   pg_createsubscriber
+   option
+   
+
+ -d
+ --database
+
+dbname
+
+ -D 
+ --pgdata
+
+datadir
+
+ -P
+ --publisher-server
+
+connstr
+   
+  
+ 
+
+ 
+  Description
+  
+pg_createsubscriber creates a new logical
+replica from a physical standby server. All tables in the specified
+database are included in the logical replication setup. A pair of
+publication and subscription objects

Re: speed up a logical replica setup

2024-03-19 Thread Shlok Kyal
Hi,

> I'm attaching a new version (v30) that adds:
>
> * 3 new options (--publication, --subscription, --replication-slot) to assign
>   names to the objects. The --database option used to ignore duplicate names,
>   however, since these new options rely on the number of database options to
>   match the number of object name options, it is forbidden from now on. The
>   duplication is also forbidden for the object names to avoid errors earlier.
> * rewrite the paragraph related to unusable target server after
>   pg_createsubscriber fails.
> * Vignesh reported an issue [1] related to reaching the recovery stop point
>   before the consistent state is reached. I proposed a simple patch that fixes
>   the issue.
>
> [1] 
> https://www.postgresql.org/message-id/CALDaNm3VMOi0GugGvhk3motghaFRKSWMCSE2t3YX1e%2BMttToxg%40mail.gmail.com
>

I have added a top-up patch v30-0003. The issue in [1] still exists in
the v30 patch. I was not able to come up with an approach to handle it
in the code, so I have added it to the documentation in the warning
section. Thoughts?
I am not changing the version as I have not made any changes in
v30-0001 and v30-0002.

[1]: 
https://www.postgresql.org/message-id/cahv8rj+5mzk9jt+7ecogjzfm5czvdccd5jo1_rcx0bteypb...@mail.gmail.com



Thanks and regards,
Shlok Kyal


v30-0001-pg_createsubscriber-creates-a-new-logical-replic.patch
Description: Binary data


v30-0002-Stop-the-target-server-earlier.patch
Description: Binary data


v30-0003-Document-a-limitation-of-pg_createsubscriber.patch
Description: Binary data


Re: Add publisher and subscriber to glossary documentation.

2024-03-14 Thread Shlok Kyal
Hi Andrew,

> If there's a movement towards "node" to refer to the database which has the 
> Subscription object, then perhaps the documentation for
>
> 31.2. Subscription, Chapter 31. Logical Replication should be updated as 
> well, since it uses both the "database" and "node" terms on the same page, 
> and to me referring to the same thing (I could be missing a subtlety).
>
>
> See:
>
>
> "The subscriber database..."
>
>
> "A subscriber node may..."
>
>
> Also, the word "database" in this sentence: "A subscription defines the 
> connection to another database" to me works, but I think using "node" there 
> could be more consistent if it’s referring to the server instance running the 
> database that holds the PUBLICATION. The connection string information 
> example later on the page shows "host" and "dbname" configured in the 
> CONNECTION value for the SUBSCRIPTION. This sentence seems like the use of 
> "database" in casual style to mean the "server instance" (or "node").
>
>
> Also, the "The node where a subscription is defined". That one actually feels 
> to me like "The database where a subscription is defined", but then that 
> contradicts what I just said, and "node" is fine here but I think "node" 
> should be on the preceding sentence too.
>
>
> Anyway, hopefully these examples show “node” and “database” are mixed and 
> perhaps others agree using one consistently might help the goals of the docs.

The existing content looks good to me; I feel we should keep it as it
is unless others think differently.

Thanks and regards,
Shlok Kyal




Re: speed up a logical replica setup

2024-03-04 Thread Shlok Kyal
Hi,
> I applied the v25 patch and ran the 'pg_createsubscriber' command.
> I was unable to execute it and experienced the following error:
>
> $ ./pg_createsubscriber -D node2/ -P "host=localhost port=5432
> dbname=postgres"  -d postgres -d db1 -p 9000 -r
> ./pg_createsubscriber: invalid option -- 'p'
> pg_createsubscriber: hint: Try "pg_createsubscriber --help" for more
> information.
>
> Here, the -p option is not accepting the desired port number. Thoughts?

I investigated it and found that:

+ while ((c = getopt_long(argc, argv, "d:D:nP:rS:t:v",
+ long_options, &option_index)) != -1)
+ {

Here the 'p', 's' and 'U' options are missing, so we get the error.
Also, I think the 'S' option should be removed from here.

I also see that specifying the long options like --subscriber-port,
--subscriber-username and --socket-directory works fine.
Thoughts?
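
Until the short options are added, the long forms are a usable
stopgap, e.g. the failing command from above rewritten (a sketch):

./pg_createsubscriber -D node2/ -P "host=localhost port=5432 dbname=postgres" \
  -d postgres --subscriber-port=9000 -r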

Thanks and regards,
Shlok Kyal




Re: Add publisher and subscriber to glossary documentation.

2024-02-25 Thread Shlok Kyal
> 1.
> +  
> +   Publication node
> +   
> +
> + A node where a
> +  linkend="glossary-publication">publication is defined
> + for logical
> replication.
> +
> +   
> +  
> +
>
> I felt the word "node" here should link to the glossary term "Node",
> instead of directly to the term "Instance".
>
> ~~
>
> 2.
> +  
> +   Subscription node
> +   
> +
> + A node where a
> +  linkend="glossary-subscription">subscription is defined
> + for logical
> replication.
> +
> +   
> +  
> +
>
> (same comment as above)
>
> I felt the word "node" here should link to the glossary term "Node",
> instead of directly to the term "Instance".

I have addressed the comments and have attached the updated version.

Thanks and Regards,
Shlok Kyal


v4-0001-Add-publisher-and-subscriber-to-glossary-document.patch
Description: Binary data


Re: Add publisher and subscriber to glossary documentation.

2024-02-23 Thread Shlok Kyal
> Here are some comments for patch v2.
>
> ==
>
> 1. There are whitespace problems
>
> [postgres@CentOS7-x64 oss_postgres_misc]$ git apply
> ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch
> ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:43:
> trailing whitespace.
>   A node where publication is defined for
> ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:45:
> trailing whitespace.
>   It replicates a set of changes from a table or a group of tables in
> ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:66:
> trailing whitespace.
>  A node where subscription is defined for
> ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:67:
> trailing whitespace.
>  logical 
> replication.
> ../patches_misc/v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch:68:
> trailing whitespace.
>  It subscribes to one or more publications on a publisher node and pulls
> warning: squelched 2 whitespace errors
> warning: 7 lines add whitespace errors.
>
> ~~~
> 2. Publisher node
>
> +  
> +   Publisher node
> +   
> +
> +  A node where publication is defined for
> +  logical
> replication.
> +  It replicates a set of changes from a table or a group of tables in
> +  publication to the subscriber node.
> +
> +
> + For more information, see
> + .
> +
> +   
> +  
> +
>
>
> 2a.
> I felt the term here should be "Publication node".
>
> Indeed, there should also be additional glossary terms like
> "Publisher" (i.e. it is the same as "Publication node") and
> "Subscriber" (i.e. it is the same as "Subscription node"). These
> definitions will then be consistent with node descriptions already in
> sections 30.1 and 30.2
>
>
> ~
>
> 2b.
> "node" should include a link to the glossary term. Same for any other
> terms mentioned
>
> ~
>
> 2c.
> /A node where publication is defined for/A node where a publication is
> defined for/
>
> ~
>
> 2d.
> "It replicates" is misleading because it is the PUBLICATION doing the
> replicating, not the node.
>
> IMO it will be better to include 2 more glossary terms "Publication"
> and "Subscription" where you can say this kind of information. Then
> the link "logical-replication-publication" also belongs under the
> "Publication" term.
>
>
> ~~~
>
> 3.
> +  
> +   Subscriber node
> +   
> +
> + A node where subscription is defined for
> + logical 
> replication.
> + It subscribes to one or more publications on a publisher node and pulls
> + a set of changes from a table or a group of tables in publications it
> + subscribes to.
> +
> +
> + For more information, see
> + .
> +
> +   
> +  
>
> All comments are similar to those above.
>
> ==
>
> In summary, IMO it should be a bit more like below:
>
> SUGGESTION (include all the necessary links etc)
>
> Publisher
> See "Publication node"
>
> Publication
> A publication replicates the changes of one or more tables to a
> subscription. For more information, see Section 30.1
>
> Publication node
> A node where a publication is defined for logical replication.
>
> Subscriber
> See "Subscription node"
>
> Subscription
> A subscription receives the changes of one or more tables from the
> publications it subscribes to. For more information, see Section 30.2
>
> Subscription node
> A node where a subscription is defined for logical replication.

I have addressed the comments and added an updated patch.

Thanks and Regards,
Shlok Kyal


v3-0001-Add-publisher-and-subscriber-to-glossary-document.patch
Description: Binary data


Re: speed up a logical replica setup

2024-02-19 Thread Shlok Kyal
Hi,

On Tue, 20 Feb 2024 at 06:59, Euler Taveira  wrote:
>
> On Mon, Feb 19, 2024, at 7:22 AM, Shlok Kyal wrote:
>
> I have reviewed the v21 patch. And found an issue.
>
> Initially I started the standby server with a new postgresql.conf file
> (not the default postgresql.conf that is present in the instance).
> pg_ctl -D ../standby start -o "-c config_file=/new_path/postgresql.conf"
>
> And I have made 'max_replication_slots = 1' in new postgresql.conf and
> made  'max_replication_slots = 0' in the default postgresql.conf file.
> Now when we run pg_createsubscriber on standby we get error:
> pg_createsubscriber: error: could not set replication progress for the
> subscription "pg_createsubscriber_5_242843": ERROR:  cannot query or
> manipulate replication origin when max_replication_slots = 0
>
>
> That's by design. See [1]. The max_replication_slots parameter is used as the
> maximum number of subscriptions on the server.
>
> NOTICE:  dropped replication slot "pg_createsubscriber_5_242843" on publisher
> pg_createsubscriber: error: could not drop publication
> "pg_createsubscriber_5" on database "postgres": ERROR:  publication
> "pg_createsubscriber_5" does not exist
> pg_createsubscriber: error: could not drop replication slot
> "pg_createsubscriber_5_242843" on database "postgres": ERROR:
> replication slot "pg_createsubscriber_5_242843" does not exist
>
>
> That's a bug and should be fixed.
>
> [1] 
> https://www.postgresql.org/docs/current/runtime-config-replication.html#RUNTIME-CONFIG-REPLICATION-SUBSCRIBER
>
I think you misunderstood the issue I reported.

My main concern is that the standby server is using a different
'postgresql.conf' file (the default one) after:
+   /* Start subscriber and wait until accepting connections */
+   pg_log_info("starting the subscriber");
+   if (!dry_run)
+   start_standby_server(pg_ctl_path, opt.subscriber_dir, server_start_log);

But we initially started the standby server (before running
pg_createsubscriber) with a new postgresql.conf file (different from
the default file; for this example, created inside the 'new_path'
folder):
pg_ctl -D ../standby -l standby.log start -o "-c
config_file=../new_path/postgresql.conf"

So, when we run pg_createsubscriber, all the initial checks run on the
standby server started with the new postgresql.conf file. But during
its run, pg_createsubscriber restarts the standby server using the
default postgresql.conf file, and this can create issues.
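
One way to see the switch is to ask the server which configuration
file it is actually using before and after pg_createsubscriber
restarts it (a sketch; it assumes both files set port = 9000, as in my
reproduction script):

# before: reports ../new_path/postgresql.conf (the file passed via -o)
# after the restart done by pg_createsubscriber: reports the default file
./psql -p 9000 -At -d postgres -c "SHOW config_file;"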

Thanks and Regards,
Shlok Kyal




Re: speed up a logical replica setup

2024-02-19 Thread Shlok Kyal
Hi,

I have reviewed the v21 patch and found an issue.

Initially I started the standby server with a new postgresql.conf file
(not the default postgresql.conf that is present in the instance).
pg_ctl -D ../standby start -o "-c config_file=/new_path/postgresql.conf"

I set 'max_replication_slots = 1' in the new postgresql.conf and
'max_replication_slots = 0' in the default postgresql.conf file.
Now when we run pg_createsubscriber on the standby, we get an error:
pg_createsubscriber: error: could not set replication progress for the
subscription "pg_createsubscriber_5_242843": ERROR:  cannot query or
manipulate replication origin when max_replication_slots = 0
NOTICE:  dropped replication slot "pg_createsubscriber_5_242843" on publisher
pg_createsubscriber: error: could not drop publication
"pg_createsubscriber_5" on database "postgres": ERROR:  publication
"pg_createsubscriber_5" does not exist
pg_createsubscriber: error: could not drop replication slot
"pg_createsubscriber_5_242843" on database "postgres": ERROR:
replication slot "pg_createsubscriber_5_242843" does not exist

I observed that when we run the pg_createsubscriber command, it stops
the standby instance (running with the non-default configuration) and
restarts it, now with the default postgresql.conf where
'max_replication_slots = 0', and pg_createsubscriber then fails with
the error given above.
I have attached the script file with which we can reproduce this issue.
Similar issues can happen with other configurations, such as port, etc.

The possible solutions would be:
1) Allow running pg_createsubscriber only if the standby is initially
stopped. I observed that pg_logical_createsubscriber also uses this
approach.
2) Read GUCs via the SHOW command and restore them when the server
restarts.

I would prefer the 1st solution.
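
(For reference, the 2nd solution would amount to something like the
sketch below, run against the standby before the restart; this is not
what any patch does today.)

# capture the GUCs of interest from the running standby so they could be
# re-applied when pg_createsubscriber restarts the server
for guc in config_file port max_replication_slots; do
  ./psql -p 9000 -At -d postgres -c "SHOW $guc;"
done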

Thanks and Regards,
Shlok Kyal
sudo pkill -9 postgres

rm -rf ../primary ../standby ../new_path
rm -rf primary.log standby.log

./initdb -D ../primary

cat << EOF >> ../primary/postgresql.conf
wal_level = 'logical'
EOF

./pg_ctl -D ../primary -l primary.log start
./psql -d postgres -c "CREATE table t1 (c1 int);"
./psql -d postgres -c "Insert into t1 values(1);"
./psql -d postgres -c "Insert into t1 values(2);"
./psql -d postgres -c "INSERT into t1 values(3);"
./psql -d postgres -c "INSERT into t1 values(4);"

./pg_basebackup -h localhost -X stream -v -W -R -D ../standby/

# postgresql.conf file in new location
mkdir  ../new_path
cat << EOF >> ../new_path/postgresql.conf
max_replication_slots = 1
port = 9000
EOF

# postgresql.conf file in default location
cat << EOF >> ../standby/postgresql.conf
max_replication_slots = 0
port = 9000
EOF

./pg_ctl -D ../standby -l standby.log  start -o "-c config_file=../new_path/postgresql.conf"
./pg_createsubscriber -D ../standby -S 'host=localhost port=9000 dbname=postgres' -P 'host=localhost port=5432 dbname=postgres' -d postgres -r


Re: Add publisher and subscriber to glossary documentation.

2024-02-13 Thread Shlok Kyal
Hi,
I addressed the comments and updated the patch.

> Should these be "publisher node" and "subscriber node" instead?  Do we
> want to define the term "node"?  I think in everyday conversations we
> use "node" quite a lot, so maybe we do need an entry for it.  (Maybe
> just  suffices, plus add under instance
> "also called a node".)

Modified

> +   Publisher
> +   
> +
> +  A node where publication is defined.
> +  It replicates the publication changes to the subscriber node.
>
> Apart from deciding what to do with "node", what are "changes"?  This
> doesn't seem very specific.

Modified

> +   Subscriber
> +   
> +
> + A node where subscription is defined.
> + It subscribe to one or more publications on a publisher node and pull 
> the data
> + from the publications they subscribe to.
>
> Same issues as above, plus there are some grammar issues.

Modified

> I think these definitions should use the term "logical replication",
> which we don't currently define.  We do have "replication" where we
> provide an overview of "logical replication".  Maybe that's enough, but
> we should consider whether we want a separate definition of logical
> replication (I'm leaning towards not having one, but it's worth asking.)

Modified. Added the term "logical replication" in the definitions.
Used reference to "replication".

Thanks and Regards,
Shlok Kyal


v2-0001-Add-publisher-and-subscriber-to-glossary-document.patch
Description: Binary data


Add publisher and subscriber to glossary documentation.

2024-02-12 Thread Shlok Kyal
Hi,

There are several places where the publisher and subscriber terms are
used across the documentation, but these terms were not defined
anywhere. I felt they should be added to the glossary.
I have created a patch for the same.

Thanks and Regards
Shlok Kyal


v1-0001-Add-publisher-and-subscriber-to-glossary-document.patch
Description: Binary data


Re: speed up a logical replica setup

2024-01-17 Thread Shlok Kyal
Hi,

I have some comments for the v5 patch:

1.
```
+ base_dir = (char *) pg_malloc0(MAXPGPATH);
+ len = snprintf(base_dir, MAXPGPATH, "%s/%s", subscriber_dir, PGS_OUTPUT_DIR);
```
Before these lines, I think we should use
'canonicalize_path(subscriber_dir)' to remove extra unnecessary
characters. This function is used in many places like initdb.c,
pg_ctl.c, pg_basebackup.c, etc.

2.
I also feel that there are many global variables, and they could be
arranged into structures as suggested by Kuroda-san in [1].

[1] 
https://www.postgresql.org/message-id/TY3PR01MB9889C362FF76102C88FA1C29F56F2%40TY3PR01MB9889.jpnprd01.prod.outlook.com

Thanks and Regards
Shlok Kyal


On Fri, 12 Jan 2024 at 03:46, Euler Taveira  wrote:
>
> On Thu, Jan 11, 2024, at 9:18 AM, Hayato Kuroda (Fujitsu) wrote:
>
> I have been concerned that the patch has not been tested by cfbot due to the
> application error. Also, some comments were raised. Therefore, I created a 
> patch
> to move forward.
>
>
> Let me send an updated patch to hopefully keep the CF bot happy. The following
> items are included in this patch:
>
> * drop physical replication slot if standby is using one [1].
> * cleanup small changes (copyright, .gitignore) [2][3]
> * fix getopt_long() options [2]
> * fix format specifier for some messages
> * move doc to Server Application section [4]
> * fix assert failure
> * ignore duplicate database names [2]
> * store subscriber server log into a separate file
> * remove MSVC support
>
> I'm still addressing other reviews and I'll post another version that includes
> it soon.
>
> [1] 
> https://www.postgresql.org/message-id/e02a2c17-22e5-4ba6-b788-de696ab74f1e%40app.fastmail.com
> [2] 
> https://www.postgresql.org/message-id/CALDaNm1joke42n68LdegN5wCpaeoOMex2EHcdZrVZnGD3UhfNQ%40mail.gmail.com
> [3] 
> https://www.postgresql.org/message-id/TY3PR01MB98895BA6C1D72CB8582CACC4F5682%40TY3PR01MB9889.jpnprd01.prod.outlook.com
> [4] 
> https://www.postgresql.org/message-id/TY3PR01MB988978C7362A101927070D29F56A2%40TY3PR01MB9889.jpnprd01.prod.outlook.com
>
>
> --
> Euler Taveira
> EDB   https://www.enterprisedb.com/
>




Re: Extend pgbench partitioning to pgbench_history

2024-01-10 Thread Shlok Kyal
Sorry, I did not intend to send this message to this thread; I sent it
by mistake. Please ignore this mail.

On Wed, 10 Jan 2024 at 15:27, Shlok Kyal  wrote:
>
> Hi,
>
> There are some test failures reported by Cfbot at [1]:
>
> [09:15:01.794] 192/276 postgresql:pgbench /
> pgbench/001_pgbench_with_server ERROR 7.48s exit status 3
> [09:15:01.794] >>>
> INITDB_TEMPLATE=/tmp/cirrus-ci-build/build/tmp_install/initdb-template
> LD_LIBRARY_PATH=/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/lib
> REGRESS_SHLIB=/tmp/cirrus-ci-build/build/src/test/regress/regress.so
> PATH=/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin:/tmp/cirrus-ci-build/build/src/bin/pgbench:/tmp/cirrus-ci-build/build/src/bin/pgbench/test:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
> PG_REGRESS=/tmp/cirrus-ci-build/build/src/test/regress/pg_regress
> MALLOC_PERTURB_=67 /usr/local/bin/python3
> /tmp/cirrus-ci-build/build/../src/tools/testwrap --basedir
> /tmp/cirrus-ci-build/build --srcdir
> /tmp/cirrus-ci-build/src/bin/pgbench --testgroup pgbench --testname
> 001_pgbench_with_server -- /usr/local/bin/perl -I
> /tmp/cirrus-ci-build/src/test/perl -I
> /tmp/cirrus-ci-build/src/bin/pgbench
> /tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl
> [09:15:01.794] ― ✀
> ―
> [09:15:01.794] stderr:
> [09:15:01.794] # Failed test 'transaction format for 001_pgbench_log_2'
> [09:15:01.794] # at
> /tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl line
> 1247.
> [09:15:01.794] # Failed test 'transaction count for
> /tmp/cirrus-ci-build/build/testrun/pgbench/001_pgbench_with_server/data/t_001_pgbench_with_server_main_data/001_pgbench_log_3.25193
> (11)'
> [09:15:01.794] # at
> /tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl line
> 1257.
> [09:15:01.794] # Failed test 'transaction format for 001_pgbench_log_3'
> [09:15:01.794] # at
> /tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl line
> 1257.
> [09:15:01.794] # Looks like you failed 3 tests of 439.
> [09:15:01.794]
> [09:15:01.794] (test program exited with status code 3)
>
> [1] - https://cirrus-ci.com/task/5139049757802496
>
> Thanks and regards
> Shlok Kyal




Re: Extend pgbench partitioning to pgbench_history

2024-01-10 Thread Shlok Kyal
Hi,

There are some test failures reported by Cfbot at [1]:

[09:15:01.794] 192/276 postgresql:pgbench /
pgbench/001_pgbench_with_server ERROR 7.48s exit status 3
[09:15:01.794] >>>
INITDB_TEMPLATE=/tmp/cirrus-ci-build/build/tmp_install/initdb-template
LD_LIBRARY_PATH=/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/lib
REGRESS_SHLIB=/tmp/cirrus-ci-build/build/src/test/regress/regress.so
PATH=/tmp/cirrus-ci-build/build/tmp_install//usr/local/pgsql/bin:/tmp/cirrus-ci-build/build/src/bin/pgbench:/tmp/cirrus-ci-build/build/src/bin/pgbench/test:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
PG_REGRESS=/tmp/cirrus-ci-build/build/src/test/regress/pg_regress
MALLOC_PERTURB_=67 /usr/local/bin/python3
/tmp/cirrus-ci-build/build/../src/tools/testwrap --basedir
/tmp/cirrus-ci-build/build --srcdir
/tmp/cirrus-ci-build/src/bin/pgbench --testgroup pgbench --testname
001_pgbench_with_server -- /usr/local/bin/perl -I
/tmp/cirrus-ci-build/src/test/perl -I
/tmp/cirrus-ci-build/src/bin/pgbench
/tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl
[09:15:01.794] ― ✀
―
[09:15:01.794] stderr:
[09:15:01.794] # Failed test 'transaction format for 001_pgbench_log_2'
[09:15:01.794] # at
/tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl line
1247.
[09:15:01.794] # Failed test 'transaction count for
/tmp/cirrus-ci-build/build/testrun/pgbench/001_pgbench_with_server/data/t_001_pgbench_with_server_main_data/001_pgbench_log_3.25193
(11)'
[09:15:01.794] # at
/tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl line
1257.
[09:15:01.794] # Failed test 'transaction format for 001_pgbench_log_3'
[09:15:01.794] # at
/tmp/cirrus-ci-build/src/bin/pgbench/t/001_pgbench_with_server.pl line
1257.
[09:15:01.794] # Looks like you failed 3 tests of 439.
[09:15:01.794]
[09:15:01.794] (test program exited with status code 3)

[1] - https://cirrus-ci.com/task/5139049757802496

Thanks and regards
Shlok Kyal




Re: [PATCH] Reuse Workers and Replication Slots during Logical Replication

2024-01-10 Thread Shlok Kyal
Hi,

This patch does not apply on HEAD. Please rebase and share an
updated patch.

Thanks and Regards
Shlok Kyal

On Wed, 10 Jan 2024 at 14:55, Peter Smith  wrote:
>
> Oops - now with attachments
>
> On Mon, Aug 21, 2023 at 5:56 PM Peter Smith  wrote:
>>
>> Hi Melih,
>>
>> Last week we revisited your implementation of design#2. Vignesh rebased it, 
>> and then made a few other changes.
>>
>> PSA v28*
>>
>> The patch changes include:
>> * changed the logic slightly by setting recv_immediately(new variable), if 
>> this variable is set the main apply worker loop will not wait in this case.
>> * setting the relation state to ready immediately if there are no more 
>> incremental changes to be synced.
>> * receive the incremental changes if applicable and set the relation state 
>> to ready without waiting.
>> * reuse the worker if the worker is free before trying to start a new table 
>> sync worker
>> * restarting the tablesync worker only after wal_retrieve_retry_interval
>>
>> ~
>>
>> FWIW, we just wanted to share with you the performance measurements seen 
>> using this design#2 patch set:
>>
>> ==
>>
>> RESULTS (not busy tests)
>>
>> --
>> 10 empty tables
>> 2w  4w  8w  16w
>> HEAD:   125 119 140 133
>> HEAD+v28*:  92  93  123 134
>> %improvement:   27% 22% 12% -1%
>> --
>> 100 empty tables
>> 2w  4w  8w  16w
>> HEAD:   1037    843     1109    1155
>> HEAD+v28*:  591     625     2616    2569
>> %improvement:   43% 26% -136%   -122%
>> --
>> 1000 empty tables
>> 2w  4w  8w  16w
>> HEAD:   15874   10047   9919    10338
>> HEAD+v28*:  33673   12199   9094    9896
>> %improvement:   -112%   -21%8%  4%
>> --
>> 2000 empty tables
>> 2w  4w  8w  16w
>> HEAD:   45266   24216   19395   19820
>> HEAD+v28*:  88043   21550   21668   22607
>> %improvement:   -95%    11%     -12%    -14%
>>
>> ~~~
>>
>> Note - the results were varying quite a lot in comparison to the HEAD
>> e.g. HEAD results are very consistent, but the v28* results observed are not
>> HEAD 1000 (2w): 15861, 15777, 16007, 15950, 15886, 15740, 15846, 15740, 
>> 15908, 15940
>> v28* 1000 (2w):  34214, 13679, 8792, 33289, 31976, 56071, 57042, 56163, 
>> 34058, 11969
>>
>> --
>> Kind Regards,
>> Peter Smith.
>> Fujitsu Australia




Re: speed up a logical replica setup

2024-01-09 Thread Shlok Kyal
On Fri, 5 Jan 2024 at 12:19, Shlok Kyal  wrote:
>
> On Thu, 4 Jan 2024 at 16:46, Amit Kapila  wrote:
> >
> > On Thu, Jan 4, 2024 at 12:22 PM Shlok Kyal  wrote:
> > >
> > > Hi,
> > > I was testing the patch with following test cases:
> > >
> > > Test 1 :
> > > - Create a 'primary' node
> > > - Setup physical replica using pg_basebackup "./pg_basebackup -h
> > > localhost -X stream -v -R -W -D ../standby"
> > > - Insert data before and after pg_basebackup
> > > - Run pg_subscriber and then insert some data to check logical
> > > replication "./pg_subscriber -D ../standby -S "host=localhost
> > > port=9000 dbname=postgres" -P "host=localhost port=9000
> > > dbname=postgres" -d postgres"
> > > - Also check pg_publication, pg_subscriber and pg_replication_slots 
> > > tables.
> > >
> > > Observation:
> > > Data is not lost. Replication is happening correctly. Pg_subscriber is
> > > working as expected.
> > >
> > > Test 2:
> > > - Create a 'primary' node
> > > - Use normal pg_basebackup but don't set up physical replication
> > > "./pg_basebackup -h localhost -v -W -D ../standby"
> > > - Insert data before and after pg_basebackup
> > > - Run pg_subscriber
> > >
> > > Observation:
> > > Pg_subscriber command is not completing and is stuck with following
> > > log repeating:
> > > LOG: waiting for WAL to become available at 0/3000168
> > > LOG: invalid record length at 0/3000150: expected at least 24, got 0
> > >
> >
> > I think probably the required WAL is not copied. Can you use the -X
> > option to stream WAL as well and then test? But I feel in this case
> > also, we should wait for some threshold time and then exit with
> > failure, removing new objects created, if any.
>
> I have tested with the -X stream option in pg_basebackup as well. In
> this case too, the pg_subscriber command gets stuck.
> logs:
> 2024-01-05 11:49:34.436 IST [61948] LOG:  invalid resource manager ID
> 102 at 0/3000118
> 2024-01-05 11:49:34.436 IST [61948] LOG:  waiting for WAL to become
> available at 0/3000130
>
> >
> > > Test 3:
> > > - Create a 'primary' node
> > > - Use normal pg_basebackup but don't set up physical replication
> > > "./pg_basebackup -h localhost -v -W -D ../standby"
> > > -Insert data before pg_basebackup but not after pg_basebackup
> > > -Run pg_subscriber
> > >
> > > Observation:
> > > Pg_subscriber command is not completing and is stuck with following
> > > log repeating:
> > > LOG: waiting for WAL to become available at 0/3000168
> > > LOG: invalid record length at 0/3000150: expected at least 24, got 0
> > >
> >
> > This is similar to the previous test and you can try the same option
> > here as well.
> For this test as well, I tried with the -X stream option in
> pg_basebackup. It gets stuck here too, with a similar log.
>
> I will investigate the issue further.

I noticed that pg_subscriber gets stuck when we run it on a node which
is not a standby. It is because of the following code:
+   conn = connect_database(dbinfo[0].pubconninfo);
+   if (conn == NULL)
+   exit(1);
+   consistent_lsn = create_logical_replication_slot(conn, &dbinfo[0],
+temp_replslot);
+
.
+else
+   {
+   appendPQExpBuffer(recoveryconfcontents, "recovery_target_lsn = '%s'\n",
+ consistent_lsn);
+   WriteRecoveryConfig(conn, subscriber_dir, recoveryconfcontents);
+   }

Here the standby node would be waiting for the WAL at 'consistent_lsn'
during recovery, but this WAL will not be present on the standby if no
physical replication is set up. Hence the command waits indefinitely
for the WAL.
To solve this, I added a timeout of 60s for the recovery process and
also added a check so that pg_subscriber gives an error when it is
called on a node that is not in physical replication.
I have attached the patch for the same. It is a top-up patch on the
patch shared by Euler at [1].
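
The added check essentially asks the target server whether it is in
recovery, erroring out otherwise (a sketch of the condition; on a
proper standby this returns t, and port 9000 matches the -S connection
string used in this thread):

./psql -p 9000 -At -d postgres -c "SELECT pg_is_in_recovery();"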

Please review the changes and merge them if they look OK.

[1] - 
https://www.postgresql.org/message-id/e02a2c17-22e5-4ba6-b788-de696ab74f1e%40app.fastmail.com

Thanks and regards
Shlok Kyal


v1-0001-Restrict-pg_subscriber-to-standby-node.patch
Description: Binary data


Re: speed up a logical replica setup

2024-01-04 Thread Shlok Kyal
On Thu, 4 Jan 2024 at 16:46, Amit Kapila  wrote:
>
> On Thu, Jan 4, 2024 at 12:22 PM Shlok Kyal  wrote:
> >
> > Hi,
> > I was testing the patch with following test cases:
> >
> > Test 1 :
> > - Create a 'primary' node
> > - Setup physical replica using pg_basebackup "./pg_basebackup -h
> > localhost -X stream -v -R -W -D ../standby"
> > - Insert data before and after pg_basebackup
> > - Run pg_subscriber and then insert some data to check logical
> > replication "./pg_subscriber -D ../standby -S "host=localhost
> > port=9000 dbname=postgres" -P "host=localhost port=9000
> > dbname=postgres" -d postgres"
> > - Also check pg_publication, pg_subscriber and pg_replication_slots tables.
> >
> > Observation:
> > Data is not lost. Replication is happening correctly. Pg_subscriber is
> > working as expected.
> >
> > Test 2:
> > - Create a 'primary' node
> > - Use normal pg_basebackup but don't set up physical replication
> > "./pg_basebackup -h localhost -v -W -D ../standby"
> > - Insert data before and after pg_basebackup
> > - Run pg_subscriber
> >
> > Observation:
> > Pg_subscriber command is not completing and is stuck with following
> > log repeating:
> > LOG: waiting for WAL to become available at 0/3000168
> > LOG: invalid record length at 0/3000150: expected at least 24, got 0
> >
>
> I think probably the required WAL is not copied. Can you use the -X
> option to stream WAL as well and then test? But I feel in this case
> also, we should wait for some threshold time and then exit with
> failure, removing new objects created, if any.

I have tested with the -X stream option in pg_basebackup as well. In
this case too, the pg_subscriber command gets stuck.
logs:
2024-01-05 11:49:34.436 IST [61948] LOG:  invalid resource manager ID
102 at 0/3000118
2024-01-05 11:49:34.436 IST [61948] LOG:  waiting for WAL to become
available at 0/3000130

>
> > Test 3:
> > - Create a 'primary' node
> > - Use normal pg_basebackup but don't set up physical replication
> > "./pg_basebackup -h localhost -v -W -D ../standby"
> > -Insert data before pg_basebackup but not after pg_basebackup
> > -Run pg_subscriber
> >
> > Observation:
> > Pg_subscriber command is not completing and is stuck with following
> > log repeating:
> > LOG: waiting for WAL to become available at 0/3000168
> > LOG: invalid record length at 0/3000150: expected at least 24, got 0
> >
>
> This is similar to the previous test and you can try the same option
> here as well.
For this test as well, I tried with the -X stream option in
pg_basebackup. It gets stuck here too, with a similar log.

I will investigate the issue further.


Thanks and regards
Shlok Kyal




Re: speed up a logical replica setup

2024-01-03 Thread Shlok Kyal
Hi,
I was testing the patch with the following test cases:

Test 1 :
- Create a 'primary' node
- Setup physical replica using pg_basebackup "./pg_basebackup -h
localhost -X stream -v -R -W -D ../standby"
- Insert data before and after pg_basebackup
- Run pg_subscriber and then insert some data to check logical
replication "./pg_subscriber -D ../standby -S "host=localhost
port=9000 dbname=postgres" -P "host=localhost port=9000
dbname=postgres" -d postgres"
- Also check the pg_publication, pg_subscription and pg_replication_slots tables.

Observation:
Data is not lost. Replication is happening correctly. Pg_subscriber is
working as expected.

Test 2:
- Create a 'primary' node
- Use normal pg_basebackup but don't set up physical replication
"./pg_basebackup -h localhost -v -W -D ../standby"
- Insert data before and after pg_basebackup
- Run pg_subscriber

Observation:
Pg_subscriber command is not completing and is stuck with following
log repeating:
LOG: waiting for WAL to become available at 0/3000168
LOG: invalid record length at 0/3000150: expected at least 24, got 0

Test 3:
- Create a 'primary' node
- Use normal pg_basebackup but don't set up physical replication
"./pg_basebackup -h localhost -v -W -D ../standby"
-Insert data before pg_basebackup but not after pg_basebackup
-Run pg_subscriber

Observation:
Pg_subscriber command is not completing and is stuck with following
log repeating:
LOG: waiting for WAL to become available at 0/3000168
LOG: invalid record length at 0/3000150: expected at least 24, got 0

I was not clear about how to use pg_basebackup in this case; can you
let me know if any changes need to be made for tests 2 and 3?

Thanks and regards
Shlok Kyal




Re: Intermittent failure with t/003_logical_slots.pl test on windows

2023-12-27 Thread Shlok Kyal
Hi,
Apart from these, I am getting some intermittent failures as below:

131/272 postgresql:pg_basebackup / pg_basebackup/010_pg_basebackup
ERROR            30.51s   (exit status 255 or signal 127 SIGinvalid)
114/272 postgresql:libpq / libpq/001_uri
ERROR             9.66s   exit status 8
 34/272 postgresql:pg_upgrade / pg_upgrade/003_logical_slots
ERROR            99.14s   exit status 1
186/272 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
ERROR           306.22s   exit status 1
 29/272 postgresql:recovery / recovery/002_archiving
ERROR            89.62s   (exit status 255 or signal 127 SIGinvalid)
138/272 postgresql:pg_resetwal / pg_resetwal/001_basic
ERROR             3.93s   (exit status 255 or signal 127 SIGinvalid)

I have attached the regress logs for these as well.

Thanks and Regards
Shlok Kyal

On Tue, 26 Dec 2023 at 17:39, Shlok Kyal  wrote:
>
> Hi,
> The same intermittent failure is reproducible on my system.
> For the intermittent issues I found that many issues are due to errors
> where commands like 'psql -V' are not returning any output.
> To reproduce it in an easy way, I wrote a script (.bat file) with
> '--version' option for different binaries. And found out that it was
> not giving any output for some command (varies for each run).
> Then I tried to run the same script after adding 'fflush(stdout)' in
> the function called with  '--version' option and it started to give
> output for each command.
> I noticed the same for '--help' option and did the changes for the same.
>
> I have attached the test script(changes the extension to .txt as gmail
> is blocking it), output of test before the changes.
> I have also attached the patch with changes which resolved the above issue.
>
> This change has resolved most of the intermittent issues for me. I am
> facing some more intermittent issues. Will analyse and share it as
> well.
>
> Thanks and regards
> Shlok Kyal
>
> On Tue, 7 Nov 2023 at 11:05, Kyotaro Horiguchi  
> wrote:
> >
> > At Mon, 6 Nov 2023 19:42:21 +0530, Nisha Moond  
> > wrote in
> > > > Appending '2>&1' test:
> > > > The command still results in NULL and ends up failing as no data is
> > > > returned. Which means even no error message is returned. The error log
> >
> > Thanks for confirmation. So, at least the child process was launched
> > successfully in the cmd.exe's view.
> >
> > Upon a quick check on my end with Windows' _popen, I have observed the
> > following:
> >
> > - Once a child process is started, it seems to go undetected as an
> >   error by _popen or subsequent fgets calls if the process ends
> >   abnormally, with a non-zero exit status or even with a SEGV.
> >
> > - After the child process has flushed data to stdout, it is possible
> >   to read from the pipe even if the child process crashes or ends
> >   thereafter.
> >
> > - Even if fgets is called before the program starts, it will correctly
> >   block until the program outputs something. Specifically, when I used
> >   popen("sleep 5 & target.exe") and immediately performed fgets on the
> >   pipe, I was able to read the output of target.exe as the first line.
> >
> > Therefore, based on the information available, it is conceivable that
> > the child process was killed by something right after it started, or
> > the program terminated on its own without any error messages.
> >
> > By the way, in the case of aforementioned SEGV, Application Errors
> > corresponding to it were identifiable in the Event
> > Viewer. Additionally, regarding the exit statuses, they can be
> > captured by using a wrapper batch file (.bat) that records
> > %ERRORLEVEL% after running the target program. This may yield
> > insights, although its effectiveness is not guaranteed.
> >
> > regards.
> >
> > --
> > Kyotaro Horiguchi
> > NTT Open Source Software Center
> >
> >


regress_log_001_basic
Description: Binary data


regress_log_003_logical_slots
Description: Binary data


regress_log_002_archiving
Description: Binary data


regress_log_002_pg_upgrade
Description: Binary data


regress_log_001_uri
Description: Binary data


regress_log_010_pg_basebackup
Description: Binary data


Re: Intermittent failure with t/003_logical_slots.pl test on windows

2023-12-26 Thread Shlok Kyal
Hi,
The same intermittent failure is reproducible on my system.
For the intermittent issues, I found that many of them are due to
commands like 'psql -V' not returning any output.
To reproduce it in an easy way, I wrote a script (.bat file) invoking
different binaries with the '--version' option, and found out that it
was not giving any output for some commands (varying on each run).
Then I tried to run the same script after adding 'fflush(stdout)' in
the function called for the '--version' option, and it started to give
output for each command.
I noticed the same for the '--help' option and made the same change there.
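
For reference, a rough POSIX-shell equivalent of that .bat loop (the
attached script is the actual Windows batch file):

# repeatedly ask each binary for its version; with the missing stdout
# flush, some iterations print nothing for one of the commands
for i in $(seq 1 50); do
  pg_ctl -V
  pg_controldata -V
  pg_dump -V
  pg_dumpall -V
done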

I have attached the test script (with the extension changed to .txt as
Gmail blocks it) and the test output from before the changes.
I have also attached the patch with the changes that resolved the above issue.

This change has resolved most of the intermittent issues for me. I am
still facing some other intermittent issues; I will analyse and share
those as well.

Thanks and regards
Shlok Kyal

On Tue, 7 Nov 2023 at 11:05, Kyotaro Horiguchi  wrote:
>
> At Mon, 6 Nov 2023 19:42:21 +0530, Nisha Moond  
> wrote in
> > > Appending '2>&1' test:
> > > The command still results in NULL and ends up failing as no data is
> > > returned. Which means even no error message is returned. The error log
>
> Thanks for confirmation. So, at least the child process was launched
> successfully in the cmd.exe's view.
>
> Upon a quick check on my end with Windows' _popen, I have observed the
> following:
>
> - Once a child process is started, it seems to go undetected as an
>   error by _popen or subsequent fgets calls if the process ends
>   abnormally, with a non-zero exit status or even with a SEGV.
>
> - After the child process has flushed data to stdout, it is possible
>   to read from the pipe even if the child process crashes or ends
>   thereafter.
>
> - Even if fgets is called before the program starts, it will correctly
>   block until the program outputs something. Specifically, when I used
>   popen("sleep 5 & target.exe") and immediately performed fgets on the
>   pipe, I was able to read the output of target.exe as the first line.
>
> Therefore, based on the information available, it is conceivable that
> the child process was killed by something right after it started, or
> the program terminated on its own without any error messages.
>
> By the way, in the case of aforementioned SEGV, Application Errors
> corresponding to it were identifiable in the Event
> Viewer. Additionally, regarding the exit statuses, they can be
> captured by using a wrapper batch file (.bat) that records
> %ERRORLEVEL% after running the target program. This may yield
> insights, although its effectiveness is not guaranteed.
>
> regards.
>
> --
> Kyotaro Horiguchi
> NTT Open Source Software Center
>
>
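For reference, the _popen behaviour described above can be reproduced
with a small standalone program along the following lines (my own
illustration, not code from the thread; "target.exe" stands in for any
child program):

    #include <stdio.h>

    int
    main(void)
    {
        char    line[1024];
        FILE   *fp = _popen("target.exe", "r");

        if (fp == NULL)
            return 1;

        /*
         * fgets blocks until the child writes something or the pipe is
         * closed; an abnormal child exit is not reported as an error
         * by these reads.
         */
        while (fgets(line, sizeof(line), fp) != NULL)
            fputs(line, stdout);

        /* the child's exit status only surfaces here, via _pclose */
        printf("exit status: %d\n", _pclose(fp));
        return 0;
    }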

D:\project\pg_meson_64\bin>pg_ctl -V   
pg_ctl (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_controldata -V  
pg_controldata (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_dump -V  
pg_dump (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_dumpall -V  
pg_dumpall (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_ctl -V  
pg_ctl (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_controldata -V  
pg_controldata (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_dump -V  
pg_dump (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_dumpall -V  
pg_dumpall (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_ctl -V  
pg_ctl (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_controldata -V  
pg_controldata (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_dump -V  
pg_dump (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_dumpall -V  
pg_dumpall (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_ctl -V  
pg_ctl (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_controldata -V  
pg_controldata (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_dump -V  
pg_dump (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_dumpall -V  
pg_dumpall (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_ctl -V  
pg_ctl (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_controldata -V  
pg_controldata (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_dump -V  
pg_dump (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_dumpall -V  
pg_dumpall (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_ctl -V  
pg_ctl (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_controldata -V  

D:\project\pg_meson_64\bin>pg_dump -V  
pg_dump (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_dumpall -V  
pg_dumpall (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_ctl -V  
pg_ctl (PostgreSQL) 17devel

D:\project\pg_meson_64\bin>pg_controldata -V  
pg_controldata (PostgreSQL) 17devel

D:\project\pg_meson_64\bin

Re: speed up a logical replica setup

2023-12-20 Thread Shlok Kyal
Hi,

On Wed, 6 Dec 2023 at 12:53, Euler Taveira  wrote:
>
> On Thu, Nov 9, 2023, at 8:12 PM, Michael Paquier wrote:
>
> On Thu, Nov 09, 2023 at 03:41:53PM +0100, Peter Eisentraut wrote:
> > On 08.11.23 00:12, Michael Paquier wrote:
> >> - Should the subdirectory pg_basebackup be renamed into something more
> >> generic at this point?  All these things are frontend tools that deal
> >> in some way with the replication protocol to do their work.  Say
> >> a replication_tools?
> >
> > Seems like unnecessary churn.  Nobody has complained about any of the other
> > tools in there.
>
> Not sure.  We rename things across releases in the tree from time to
> time, and here that's straight-forward.
>
>
> Based on this discussion it seems we have a consensus that this tool should be
> in the pg_basebackup directory. (If/when we agree with the directory renaming,
> it could be done in a separate patch.) Besides this move, the v3 provides a
> dry run mode. It basically executes every routine but skips the steps that
> would make modifications. It is a useful option to check if you will be able
> to run it without having issues with connectivity, permissions, and existing
> objects (replication slots, publications, subscriptions). Tests were
> slightly improved.
> Messages were changed to *not* provide INFO messages by default; --verbose
> provides INFO messages and --verbose --verbose also provides DEBUG messages.
> I also refactored the connect_database() function so that the connection
> will always use the logical replication mode. A bug was fixed in the
> transient replication slot name. Ashutosh's review [1] was included. The
> code was also indented.
>
> There are a few suggestions from Ashutosh [2] that I will reply in another
> email.
>
> I'm still planning to work on the following points:
>
> 1. improve the cleanup routine to point out leftover objects if there is any
>connection issue.
> 2. remove the physical replication slot if the standby is using one
>(primary_slot_name).
> 3. provide instructions to promote the logical replica into primary, I mean,
>stop the replication between the nodes and remove the replication setup
>(publications, subscriptions, replication slots). Or even include another
>action to do it. We could add both too.
>
> Point 1 should be done. Points 2 and 3 aren't essential but will provide a 
> nice
> UI for users that would like to use it.
>
>
> [1] 
> https://postgr.es/m/CAExHW5sCAU3NvPKd7msScQKvrBN-x_AdDQD-ZYAwOxuWG%3Doz1w%40mail.gmail.com
> [2] 
> https://postgr.es/m/caexhw5vhfemfvtuhe+7xwphvzjxrexz5h3dd4uqi7cwmdmj...@mail.gmail.com
>

The changes in the file 'src/tools/msvc/Mkvcbuild.pm' seem
unnecessary, as the folder 'msvc' was removed by commit [1].


To review the changes, I did 'git reset --hard' to the commit just
before commit [1].
I tried to build postgres on my Windows machine using two methods:
i. building with Visual Studio
ii. building with Meson

When I built the code with Visual Studio, the pg_subscriber binary was
not created on installing postgres.
But when I built the code with Meson, the pg_subscriber binary was
created on installing postgres.
Is this behaviour intentional?

[1] 
https://github.com/postgres/postgres/commit/1301c80b2167feb658a738fa4ceb1c23d0991e23

Thanks and Regards,
Shlok Kyal




Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-12-08 Thread Shlok Kyal
Hi,

> Then let's go with the original patch only. BTW, it took almost the
> same time (105 wallclock secs) in my environment (CentOs VM) to run
> tests in src/test/subscription both with and without the patch. I took
> a median of five runs. I have slightly adjusted the comments and
> commit message in the attached. If you are fine with this, we can
> commit and backpatch this.

I have tested the patch on all branches from PG 17 to PG 12.
The same patch applies cleanly on all branches and resolves the issue
on each of them.
I ran all the tests, and they passed on every branch.

I also reviewed the patch and it looks good to me.

Thanks and Regards,
Shlok Kyal




Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-12-06 Thread Shlok Kyal
Hi,

> I mean to commit the open transaction at the below place in
> wait_for_relation_state_change()
>
> wait_for_relation_state_change()
> {
> ...
> -- commit the xact
> WaitLatch();
> ...
> }
>
> Then start after the wait is over. This is just to test whether it
> improves the difference in regression test timing.

I tried the above approach and observed that its performance is nearly
the same as that of the previous approach.

For Linux VM:
Summary            | Subscription | 100 tables in pub | 1000 tables in pub
                   | Test (sec)   | and Sub (sec)     | and Sub (sec)
-------------------+--------------+-------------------+--------------------
old patch          | 107.4545     | 6.911             | 77.918
alternate approach | 108.3985     | 6.9835            | 78.111

For Performance Machine:
Summary            | Subscription | 100 tables in pub | 1000 tables in pub
                   | Test (sec)   | and Sub (sec)     | and Sub (sec)
-------------------+--------------+-------------------+--------------------
old patch          | 115.922      | 6.7305            | 81.1525
alternate approach | 115.8215     | 6.7685            | 81.2335

I have attached the patch for this approach as 'alternate_approach.patch'.

Since the performance is the same, I think the previous approach is
better, as the alternate approach calls CommitTransactionCommand()
and StartTransactionCommand() inside a 'for' loop.

I also fixed the comment in the previous approach and attached it here as
'v2-0001-Deadlock-when-apply-worker-tablesync-worker-and-c.patch'

Thanks and Regards


Shlok Kyal
From 11072d138d900227b963b54d1a3626cf256db721 Mon Sep 17 00:00:00 2001
From: Shlok Kyal 
Date: Fri, 24 Nov 2023 16:37:14 +0530
Subject: [PATCH v2] Deadlock when apply worker, tablesync worker and client
 backend run in parallel.

Apply worker holds a lock on the table pg_subscription_rel and waits for
notification from the tablesync workers where the relation is synced, which
happens through updating the pg_subscription_rel row. And that wait happens in
wait_for_relation_state_change, which simply checks the row in a loop, with a
sleep by WaitLatch(). Unfortunately, the tablesync workers can't update the row
because the client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
sneaked in, and waits for an AccessExclusiveLock. So the tablesync workers are
stuck in the queue and can't proceed.

The client backend is waiting for a lock held by the apply worker. The tablesync
workers can't proceed because their lock request is stuck behind the
AccessExclusiveLock request. And the apply worker is waiting for status update
from the tablesync workers. And the deadlock is undetected, because the apply
worker is not waiting on a lock, but sleeping on a latch.

We have resolved the issue by committing the transaction, and thereby
releasing the lock, before the apply worker goes into its wait state.
---
 src/backend/replication/logical/tablesync.c | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/src/backend/replication/logical/tablesync.c b/src/backend/replication/logical/tablesync.c
index df3c42eb5d..b71234d998 100644
--- a/src/backend/replication/logical/tablesync.c
+++ b/src/backend/replication/logical/tablesync.c
@@ -542,15 +542,26 @@ process_syncing_tables_for_apply(XLogRecPtr current_lsn)
 	LWLockRelease(LogicalRepWorkerLock);
 
 	/*
-	 * Enter busy loop and wait for synchronization worker to
-	 * reach expected state (or die trying).
+	 * If we have a transaction, we must commit it to release
+	 * any locks we have. Then start a new transaction so we 
+	 * can examine catalog state.
 	 */
-	if (!started_tx)
+	if (started_tx)
+	{
+		CommitTransactionCommand();
+		pgstat_report_stat(false);
+		StartTransactionCommand();
+	}
+	else
 	{
 		StartTransactionCommand();
 		started_tx = true;
 	}
 
+	/*
+	 * Enter busy loop and wait for synchronization worker to
+	 * reach expected state (or die trying).
+	 */
 	wait_for_relation_state_change(rstate->relid,
    SUBREL_STATE_SYNCDONE);
 }
-- 
2.34.1

From ce261399d22a7fd9ee7ccdd3b9d1cc7f4f72ce4b Mon Sep 17 00:00:00 2001
From: Shlok Kyal 
Date: Thu, 7 Dec 2023 10:55:26 +0530
Subject: [PATCH] Deadlock when apply worker, tablesync worker and client
 backend run in parallel.

Apply worker holds a lock on the table pg_subscription_rel and waits for
notification from the tablesync workers where the relation is synced, which
happens through updating the pg_subscription_rel row. And that wait happens in
wait_for_relation_state_change, which simply checks the row in a loop, with a
sleep by WaitLatch(). Unfortunately, the tablesync workers can't update the row
because the client backend executing ALTER SUBSCRIPTION ... REFRESH PUBLICATION
sneaked in, and waits for an AccessExclusiveLock. So the tablesync workers are
stuck in the queue and can't proceed.

Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-12-05 Thread Shlok Kyal
On Tue, 5 Dec 2023 at 17:18, Tomas Vondra  wrote:
>
> On 12/5/23 08:14, Shlok Kyal wrote:
> > Hi,
> >
> >> As for the test results, I very much doubt the differences are not
> >> caused simply by random timing variations, or something like that. And I
> >> don't understand what "Performance Machine Linux" is, considering those
> >> timings are slower than the other two machines.
> >
> > The machine has Total Memory of 755.536 GB, 120 CPUs and RHEL 7 Operating 
> > System
> > Also find the detailed info of the performance machine attached.
> >
>
> Thanks for the info. I don't think the tests really benefit from this
> much resources, I would be rather surprised if it was faster beyond 8
> cores or so. The CPU frequency likely matters much more. Which probably
> explains why this machine was the slowest.
>
> Also, I wonder how much the results vary between the runs. I suppose you
> only did a single run for each, right?

I did 10 runs for each of the cases and reported the median result
earlier in the thread.
I have documented the results of those runs and attached them here.

Thanks and Regards,
Shlok Kyal


Performance_Test.xlsx
Description: MS-Excel 2007 spreadsheet


Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-12-04 Thread Shlok Kyal
Hi,

> As for the test results, I very much doubt the differences are not
> caused simply by random timing variations, or something like that. And I
> don't understand what "Performance Machine Linux" is, considering those
> timings are slower than the other two machines.

The machine has Total Memory of 755.536 GB, 120 CPUs and RHEL 7 Operating System
Also find the detailed info of the performance machine attached.

Thanks and Regards,
Shlok Kyal
MemTotal: 755.536 GB
MemFree: 748.281 GB
MemAvailable: 748.581 GB
Buffers: 0.002 GB
Cached: 1.366 GB
SwapCached: 0.000 GB
Active: 0.834 GB
Inactive: 0.745 GB
Active(anon): 0.221 GB
Inactive(anon): 0.028 GB
Active(file): 0.614 GB
Inactive(file): 0.717 GB
Unevictable: 0.000 GB
Mlocked: 0.000 GB
SwapTotal: 4.000 GB
SwapFree: 4.000 GB
Dirty: 0.000 GB
Writeback: 0.000 GB
AnonPages: 0.212 GB
Mapped: 0.142 GB
Shmem: 0.038 GB
Slab: 0.468 GB
SReclaimable: 0.139 GB
SUnreclaim: 0.329 GB
KernelStack: 0.020 GB
PageTables: 0.023 GB
NFS_Unstable: 0.000 GB
Bounce: 0.000 GB
WritebackTmp: 0.000 GB
CommitLimit: 381.768 GB
Committed_AS: 0.681 GB
VmallocTotal: 32768.000 GB
VmallocUsed: 1.852 GB
VmallocChunk: 32189.748 GB
Percpu: 0.035 GB
HardwareCorrupted: 0.000 GB
AnonHugePages: 0.025 GB
CmaTotal: 0.000 GB
CmaFree: 0.000 GB
HugePages_Total: 0.000 GB
HugePages_Free: 0.000 GB
HugePages_Rsvd: 0.000 GB
HugePages_Surp: 0.000 GB
Hugepagesize: 0.002 GB
DirectMap4k: 0.314 GB
DirectMap2M: 6.523 GB
DirectMap1G: 763.000 GB
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 62
model name  : Intel(R) Xeon(R) CPU E7-4890 v2 @ 2.80GHz
stepping: 7
microcode   : 0x715
cpu MHz : 1762.646
cache size  : 38400 KB
physical id : 0
siblings: 30
core id : 0
cpu cores   : 15
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 13
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb 
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology 
nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est 
tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt 
tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm epb intel_ppin ssbd ibrs 
ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt 
dtherm ida arat pln pts md_clear spec_ctrl intel_stibp flush_l1d
bogomips: 5586.07
clflush size: 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:NAME="Red Hat Enterprise Linux Server"
VERSION="7.8 (Maipo)"
ID="rhel"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="7.8"
PRETTY_NAME="Red Hat Enterprise Linux Server 7.8 (Maipo)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:7.8:GA:server"
HOME_URL="https://www.redhat.com/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
REDHAT_BUGZILLA_PRODUCT_VERSION=7.8
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="7.8"


Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-12-02 Thread Shlok Kyal
Hi,

> thread. I think you can compare the timing of regression tests in
> subscription, with and without the patch to show there is no
> regression. And probably some tests with a large number of tables for
> sync with very little data.

I have tested the regression test timings for subscription with and
without the patch. I also timed the sync of a subscription with the
publisher for 100 and 1000 tables respectively.
I have attached the test script; the results of the timing test are as
follows:

Time taken for test to run in Linux VM
Summary               | Subscription | 100 tables in pub | 1000 tables in pub
                      | Test (sec)   | and Sub (sec)     | and Sub (sec)
----------------------+--------------+-------------------+--------------------
Without patch Release | 95.564       | 7.877             | 58.919
With patch Release    | 96.513       | 6.533             | 45.807

Time taken for test to run in another Linux VM
Summary               | Subscription | 100 tables in pub | 1000 tables in pub
                      | Test (sec)   | and Sub (sec)     | and Sub (sec)
----------------------+--------------+-------------------+--------------------
Without patch Release | 109.8145     | 6.4675            | 83.001
With patch Release    | 113.162      | 7.947             | 87.113

Time taken for test to run in Performance Machine Linux
Summary               | Subscription | 100 tables in pub | 1000 tables in pub
                      | Test (sec)   | and Sub (sec)     | and Sub (sec)
----------------------+--------------+-------------------+--------------------
Without patch Release | 115.871      | 6.656             | 81.157
With patch Release    | 115.922      | 6.7305            | 81.1525

thoughts?

Thanks and Regards,
Shlok Kyal


034_tmp.pl
Description: Binary data


Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-11-24 Thread Shlok Kyal
Hi,

> I tried to reproduce the issue and was able to reproduce it with
> scripts shared by Tomas.
> I tried testing it from PG17 to PG 11. This issue is reproducible for
> each version.
>
> Next I would try to test with the patch in the thread shared by Amit.

I have created the v1 patch to resolve the issue and have tested it on
HEAD down to PG12.
The same patch applies cleanly to all these versions. The changes are
similar to the ones posted in the thread
https://www.postgresql.org/message-id/1412708.1674417574%40sss.pgh.pa.us

Thanks and Regards,
Shlok Kyal


v1-0001-Deadlock-when-apply-worker-tablesync-worker-and-c.patch
Description: Binary data


Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

2023-11-23 Thread Shlok Kyal
Hi,

I tried to reproduce the issue and was able to do so with the scripts
shared by Tomas.
I tested it from PG17 down to PG11; the issue is reproducible on each
version.

Next, I will test with the patch in the thread shared by Amit.

Thanks,
Shlok Kumar Kyal




Re: pg_upgrade and logical replication

2023-11-22 Thread Shlok Kyal
On Wed, 22 Nov 2023 at 06:48, Peter Smith  wrote:
> ==
> doc/src/sgml/ref/pgupgrade.sgml
>
> 1.
> +  
> +   Create all the new tables that were created in the publication during
> +   upgrade and refresh the publication by executing
> +   ALTER
> SUBSCRIPTION ... REFRESH PUBLICATION.
> +  
>
> "Create all ... that were created" sounds a bit strange.
>
> SUGGESTION (maybe like this or similar?)
> Create equivalent subscriber tables for anything that became newly
> part of the publication during the upgrade and

Modified

> ==
> src/bin/pg_dump/pg_dump.c
>
> 2. getSubscriptionTables
>
> +/*
> + * getSubscriptionTables
> + *   Get information about subscription membership for dumpable tables. This
> + *will be used only in binary-upgrade mode.
> + */
> +void
> +getSubscriptionTables(Archive *fout)
> +{
> + DumpOptions *dopt = fout->dopt;
> + SubscriptionInfo *subinfo = NULL;
> + SubRelInfo *subrinfo;
> + PQExpBuffer query;
> + PGresult   *res;
> + int i_srsubid;
> + int i_srrelid;
> + int i_srsubstate;
> + int i_srsublsn;
> + int ntups;
> + Oid last_srsubid = InvalidOid;
> +
> + if (dopt->no_subscriptions || !dopt->binary_upgrade ||
> + fout->remoteVersion < 17)
> + return;
>
> This function comment says "used only in binary-upgrade mode." and the
> Assert says the same. But, is this compatible with the other function
> dumpSubscriptionTable() where it says "used only in binary-upgrade
> mode and for PG17 or later versions"?
>
Modified

> ==
> src/bin/pg_upgrade/check.c
>
> 3. check_new_cluster_subscription_configuration
>
> +static void
> +check_new_cluster_subscription_configuration(void)
> +{
> + PGresult   *res;
> + PGconn*conn;
> + int nsubs_on_old;
> + int max_replication_slots;
> +
> + /* Logical slots can be migrated since PG17. */
> + if (GET_MAJOR_VERSION(old_cluster.major_version) <= 1600)
> + return;
>
> IMO it is better to say < 1700 in this check, instead of <= 1600.
>
Modified

> ~~~
>
> 4.
> + /* Quick return if there are no subscriptions to be migrated */
> + if (nsubs_on_old == 0)
> + return;
>
> Missing period in comment.
>
Modified

> ~~~
>
> 5.
> +/*
> + * check_old_cluster_subscription_state()
> + *
> + * Verify that each of the subscriptions has all their corresponding tables 
> in
> + * i (initialize), r (ready) or s (synchronized) state.
> + */
> +static void
> +check_old_cluster_subscription_state(ClusterInfo *cluster)
>
> This function is only for the old cluster (hint: the function name) so
> there is no need to pass the 'cluster' parameter here. Just directly
> use old_cluster in the function body.
>
Modified

> ==
> src/bin/pg_upgrade/t/004_subscription.pl
>
> 6.
> +# Add tab_not_upgraded1 to the publication. Now publication has tab_upgraded1
> +# and tab_upgraded2 tables.
> +$publisher->safe_psql('postgres',
> + "ALTER PUBLICATION regress_pub ADD TABLE tab_upgraded2");
>
> Typo in comment. You added tab_not_upgraded2, not tab_not_upgraded1
>
Modified

> ~~
>
> 7.
> +# Subscription relations should be preserved. The upgraded won't know
> +# about 'tab_not_upgraded1' because the subscription is not yet refreshed.
>
> Typo or missing word in comment?
>
> "The upgraded" ??
>
Modified

Attached is the v17 patch, which has the same changes.

Thanks,
Shlok Kumar Kyal


v17-0001-Preserve-the-full-subscription-s-state-during-pg.patch
Description: Binary data


Failure during Building Postgres in Windows with Meson

2023-11-09 Thread Shlok Kyal
Hi,
I am trying to build postgres with meson on Windows, and I am stuck
partway through the process.

Steps I followed:

1. Cloned the postgres repo

2. Installed meson and ninja:
pip install meson ninja

3. Ran the following command:
meson setup build --buildtype debug

4. Then ran:
cd build
ninja

Got the following error:
D:\project\repo\pg_meson\postgres\build>C:\Users\kyals\AppData\Roaming\Python\Python311\Scripts\ninja
ninja: error: 'src/backend/postgres_lib.a.p/meson_pch-c.obj', needed
by 'src/backend/postgres.exe', missing and no known rule to make it.

Any thoughts on how to resolve this error?

Thanks
Shlok Kumar Kyal




Re: [PoC] Implementation of distinct in Window Aggregates: take two

2023-11-07 Thread Shlok Kyal
Hi,

I went through the Cfbot, and some of the test cases are failing for
this patch. It seems like some tests are crashing:
https://api.cirrus-ci.com/v1/artifact/task/6291153444667392/crashlog/crashlog-postgres.exe_03b0_2023-11-07_10-41-39-624.txt

[10:46:56.546] Summary of Failures:
[10:46:56.546]
[10:46:56.546] 87/270 postgresql:postgres_fdw / postgres_fdw/regress
ERROR 11.10s exit status 1
[10:46:56.546] 82/270 postgresql:regress / regress/regress ERROR
248.55s exit status 1
[10:46:56.546] 99/270 postgresql:recovery /
recovery/027_stream_regress ERROR 161.40s exit status 29
[10:46:56.546] 98/270 postgresql:pg_upgrade /
pg_upgrade/002_pg_upgrade ERROR 253.31s exit status 29

Links to the failing tests:
https://cirrus-ci.com/task/664299716712
https://cirrus-ci.com/task/4602303584403456
https://cirrus-ci.com/task/5728203491246080
https://cirrus-ci.com/task/5165253537824768?logs=test_world#L511
https://cirrus-ci.com/task/6291153444667392

Thanks
Shlok Kumar Kyal




Re: [ psql - review request ] review request for \d+ tablename, \d+ indexname indenting

2023-11-06 Thread Shlok Kyal
Hi,

On Mon, 6 Nov 2023 at 13:47, 쿼리트릭스  wrote:
>
> The error was corrected and a new diff file was created.
> The diff file was created based on 16 RC1.
> We confirmed that 5 places where errors occurred when performing make check 
> were changed to ok.
>

I went through Cfbot and still see that some tests are failing.
Links:
https://cirrus-ci.com/task/6408253983162368
https://cirrus-ci.com/task/5000879099609088
https://cirrus-ci.com/task/6126779006451712
https://cirrus-ci.com/task/5563829053030400
https://cirrus-ci.com/task/6689728959873024

Failure:
[16:42:37.674] Summary of Failures:
[16:42:37.674]
[16:42:37.674] 5/270 postgresql:regress / regress/regress ERROR 28.88s
exit status 1
[16:42:37.674] 7/270 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
ERROR 46.73s exit status 1
[16:42:37.674] 56/270 postgresql:recovery /
recovery/027_stream_regress ERROR 38.51s exit status 1

Thanks
Shlok Kumar Kyal




Re: POC: Extension for adding distributed tracing - pg_tracing

2023-11-05 Thread Shlok Kyal
Hi,

On Thu, 12 Oct 2023 at 14:32, Anthonin Bonnefoy
 wrote:
>
> Hi!
>
> I've made a new batch of changes and improvements.
> New features:
> - Triggers are now correctly supported. They were not correctly
> attached to the ExecutorFinish Span before.
> - Additional configuration: exporting parameter values in span
> metadata can be disabled. Deparsing can also be disabled now.
> - Dbid and userid are now exported in span's metadata
> - New Notify channel and threshold parameters. This channel will
> receive a notification when the span buffer usage crosses the
> threshold.
> - Explicit transaction with extended protocol is now correctly
> handled. This is done by keeping 2 different trace contexts, one for
> parser/planner trace context at the root level and the other for
> nested queries and executors. The spans now correctly show the parse
> of the next statement happening before the ExecutorEnd of the previous
> statement (see screenshot).
>
> Changes:
> - Parse span is now outside of the top span. When multiple statements
> are processed, they are parsed together so it didn't make sense to
> attach Parse to a specific statement.
> - Deparse info is now exported in its own column. This will leave the
> possibility to the trace forwarder to either use it as metadata or put
> it in the span name.
> - Renaming: Spans now have a span_type (Select, Executor, Parser...)
> and a span_operation (ExecutorRun, Select $1...)
> - For spans created without propagated trace context, the same traceid
> will be used for statements within the same transaction.
> - pg_tracing_consume_spans and pg_tracing_peek_spans are now
> restricted to users with pg_read_all_stats role
> - In instrument.h, instr->firsttime is only set if timer was requested
>
> The code should be more stable from now on. Most of the features I had
> planned are implemented.
> I've also started writing the module's documentation. It's still a bit
> bare but I will be working on completing this.

In CFbot, I see that the build is failing for this patch (link:
https://cirrus-ci.com/task/5378223450619904?logs=build#L1532) with the
following error:
[07:58:05.037] In file included from ../src/include/executor/instrument.h:16,
[07:58:05.037] from ../src/include/jit/jit.h:14,
[07:58:05.037] from ../contrib/pg_tracing/span.h:16,
[07:58:05.037] from ../contrib/pg_tracing/pg_tracing_explain.h:4,
[07:58:05.037] from ../contrib/pg_tracing/pg_tracing.c:43:
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c: In function
‘add_node_counters’:
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c:2330:70: error:
‘BufferUsage’ has no member named ‘blk_read_time’; did you mean
‘temp_blk_read_time’?
[07:58:05.037] 2330 | blk_read_time =
INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.blk_read_time);
[07:58:05.037] | ^
[07:58:05.037] ../src/include/portability/instr_time.h:126:12: note:
in definition of macro ‘INSTR_TIME_GET_NANOSEC’
[07:58:05.037] 126 | ((int64) (t).ticks)
[07:58:05.037] | ^
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c:2330:18: note: in
expansion of macro ‘INSTR_TIME_GET_MILLISEC’
[07:58:05.037] 2330 | blk_read_time =
INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.blk_read_time);
[07:58:05.037] | ^~~
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c:2331:71: error:
‘BufferUsage’ has no member named ‘blk_write_time’; did you mean
‘temp_blk_write_time’?
[07:58:05.037] 2331 | blk_write_time =
INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.blk_write_time);
[07:58:05.037] | ^~
[07:58:05.037] ../src/include/portability/instr_time.h:126:12: note:
in definition of macro ‘INSTR_TIME_GET_NANOSEC’
[07:58:05.037] 126 | ((int64) (t).ticks)
[07:58:05.037] | ^
[07:58:05.037] ../contrib/pg_tracing/pg_tracing.c:2331:19: note: in
expansion of macro ‘INSTR_TIME_GET_MILLISEC’
[07:58:05.037] 2331 | blk_write_time =
INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.blk_write_time);
[07:58:05.037] | ^~~

I also tried to build this patch locally and the build is failing with
the same error.
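Judging from the compiler hints ("did you mean 'temp_blk_read_time'?"),
it looks like BufferUsage's blk_read_time/blk_write_time fields have
been split into shared and local counters on master. Assuming the patch
wants to keep reporting a single combined figure, the failing lines in
add_node_counters() might become something like the following (a sketch
only; the field names are my assumption about the current instrument.h,
and the real fix is up to the patch author):

    blk_read_time =
        INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.shared_blk_read_time) +
        INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.local_blk_read_time);
    blk_write_time =
        INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.shared_blk_write_time) +
        INSTR_TIME_GET_MILLISEC(node_counters->buffer_usage.local_blk_write_time);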

Thanks
Shlok Kumar Kyal




Re: [PoC] Federated Authn/z with OAUTHBEARER

2023-11-03 Thread Shlok Kyal
Hi,

On Fri, 3 Nov 2023 at 17:14, Jacob Champion  wrote:
>
> v12 implements a first draft of a client hook, so applications can
> replace either the device prompt or the entire OAuth flow. (Andrey and
> Mahendrakar: hopefully this is close to what you need.) It also cleans
> up some of the JSON tech debt.

I went through CFbot and found that the build is failing; links:

https://cirrus-ci.com/task/6061898244816896
https://cirrus-ci.com/task/6624848198238208
https://cirrus-ci.com/task/5217473314684928
https://cirrus-ci.com/task/6343373221527552

Just want to make sure you are aware of these failures.

Thanks,
Shlok Kumar Kyal




Re: Force the old transactions logs cleanup even if checkpoint is skipped

2023-11-02 Thread Shlok Kyal
Hi,

I went through the Cfbot and saw that some tests are failing for it
(link: https://cirrus-ci.com/task/4631357628874752):

test: postgresql:recovery / recovery/019_replslot_limit

# test failed
--- stderr ---
# poll_query_until timed out executing this query:
# SELECT '0/15000D8' <= replay_lsn AND state = 'streaming'
#  FROM pg_catalog.pg_stat_replication
#  WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
# with stderr:
# Tests were run but no plan was declared and done_testing() was not seen.
# Looks like your test exited with 29 just after 7.

I tried to test it locally and this test is timing out in my local
machine as well.

Thanks
Shlok Kumar Kyal




Re: [HACKERS] Allow INSTEAD OF DELETE triggers to modify the tuple for RETURNING

2023-11-02 Thread Shlok Kyal
Hi,

On Thu, 19 Oct 2023 at 16:35, Marko Tiikkaja  wrote:
>
> Hi,
>
> Thank you for the feedback.
>
> Apparently it took me six years, but I've attached the latest version
> of the patch which I believe addresses all issues.  I'll add it to the
> open commitfest.
>
>
> .m

I went through the CFbot and found that the docs build run is giving
errors (link: https://cirrus-ci.com/task/5712582359646208):
[07:37:19.769] trigger.sgml:266: parser error : Opening and ending tag
mismatch: command line 266 and COMMAND
[07:37:19.769] DELETE operations, INSTEAD
OF
[07:37:19.769] ^
[07:37:19.769] trigger.sgml:408: parser error : Opening and ending tag
mismatch: para line 266 and sect1
[07:37:19.769] 
[07:37:19.769] ^
[07:37:19.769] trigger.sgml:1034: parser error : Opening and ending
tag mismatch: sect1 line 266 and chapter
[07:37:19.769] 
[07:37:19.769] ^
[07:37:19.769] trigger.sgml:1035: parser error : chunk is not well balanced
[07:37:19.769]
[07:37:19.769] ^
[07:37:19.769] postgres.sgml:222: parser error : Failure to process
entity trigger
[07:37:19.769] 
[07:37:19.769] ^
[07:37:19.769] postgres.sgml:222: parser error : Entity 'trigger' not defined
[07:37:19.769] 
[07:37:19.769] ^
[07:37:19.956] make[2]: *** [Makefile:73: postgres-full.xml] Error 1
[07:37:19.957] make[1]: *** [Makefile:8: all] Error 2
[07:37:19.957] make: *** [Makefile:16: all] Error 2
[07:37:19.957]
[07:37:19.957] real 0m0.529s
[07:37:19.957] user 0m0.493s
[07:37:19.957] sys 0m0.053s
[07:37:19.957]
[07:37:19.957] Exit status: 2

Just wanted to make sure you are aware of the issue.

Thanks
Shlok Kumar Kyal




Re: psql not responding to SIGINT upon db reconnection

2023-11-02 Thread Shlok Kyal
Hi,

> That sounds like a much better solution. Attached you will find a v4
> that implements your suggestion. Please let me know if there is
> something that I missed. I can confirm that the patch works.
>
> $ ./build/src/bin/psql/psql -h pg.neon.tech
> NOTICE:  Welcome to Neon!
> Authenticate by visiting:
> https://console.neon.tech/psql_session/xxx
>
>
> NOTICE:  Connecting to database.
> psql (17devel, server 15.3)
> SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, 
> compression: off)
> Type "help" for help.
>
> tristan957=> \c
> NOTICE:  Welcome to Neon!
> Authenticate by visiting:
> https://console.neon.tech/psql_session/yyy
>
>
> ^Cconnection to server at "pg.neon.tech" (3.18.6.96), port 5432 
> failed:
> Previous connection kept
> tristan957=>

I went through CFbot and found out that some tests are timing out.
Links:
https://cirrus-ci.com/task/6735437444677632
https://cirrus-ci.com/task/4536414189125632
https://cirrus-ci.com/task/5046587584413696
https://cirrus-ci.com/task/6172487491256320
https://cirrus-ci.com/task/5609537537835008

Some tests which are timing out are as follows:
[00:48:49.243] Summary of Failures:
[00:48:49.243]
[00:48:49.243] 4/270 postgresql:regress / regress/regress TIMEOUT 1000.01s
[00:48:49.243] 6/270 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade
TIMEOUT 1000.02s
[00:48:49.243] 34/270 postgresql:recovery /
recovery/027_stream_regress TIMEOUT 1000.02s
[00:48:49.243] 48/270 postgresql:plpgsql / plpgsql/regress TIMEOUT 1000.02s
[00:48:49.243] 49/270 postgresql:plperl / plperl/regress TIMEOUT 1000.01s
[00:48:49.243] 61/270 postgresql:dblink / dblink/regress TIMEOUT 1000.03s
[00:48:49.243] 89/270 postgresql:postgres_fdw / postgres_fdw/regress
TIMEOUT 1000.01s
[00:48:49.243] 93/270 postgresql:test_decoding / test_decoding/regress
TIMEOUT 1000.02s
[00:48:49.243] 110/270 postgresql:test_extensions /
test_extensions/regress TIMEOUT 1000.03s
[00:48:49.243] 123/270 postgresql:unsafe_tests / unsafe_tests/regress
TIMEOUT 1000.02s
[00:48:49.243] 152/270 postgresql:pg_dump / pg_dump/010_dump_connstr
TIMEOUT 1000.03s

Just want to make sure you are aware of the issue.

Thanks
Shlok Kumar Kyal




Re: [patch] pg_basebackup: mention that spread checkpoints are the default in --help

2023-10-31 Thread Shlok Kyal
Hi,

On Thu, 26 Oct 2023 at 13:58, Michael Banck  wrote:
>
> Hi,
>
> On Wed, Oct 25, 2023 at 04:36:41PM +0200, Peter Eisentraut wrote:
> > On 19.10.23 11:39, Michael Banck wrote:
> > > Hi,
> > >
> > > I believed that spread (not fast) checkpoints are the default in
> > > pg_basebackup, but noticed that --help does not specify which is which -
> > > contrary to the reference documentation.
> > >
> > > So I propose the small attached patch to clarify that.
> >
> > >  printf(_("  -c, --checkpoint=fast|spread\n"
> > >-  " set fast or spread checkpointing\n"));
> > >+  " set fast or spread (default) checkpointing\n"));
> >
> > Could we do like
> >
> >   -c, --checkpoint=fast|spread
> >  set fast or spread checkpointing
> >  (default: spread)
> >
> > This seems to be easier to read.
>
> Yeah, we could do that. But then again the question pops up what to do
> about the other option that mentions defaults (-F) and the others which
> have a default but it is not spelt out yet (-X, -Z at least) (output is
> still from v15, additional options have been added since):
>
>   -F, --format=p|t   output format (plain (default), tar)
>   -X, --wal-method=none|fetch|stream
>  include required WAL files with specified method
>   -Z, --compress=0-9 compress tar output with given compression level
>
> So, my personal opinion is that we should really document -c because it
> is quite user-noticable compared to the others.
>
> So attached is a new version with just your proposed change for now.
>
>
> Michael

I went through the Cfbot for this patch and found out that the build
is failing with the following error (Link:
https://cirrus-ci.com/task/4648506929971200?logs=build#L1217):

[08:34:47.625] FAILED: src/bin/pg_basebackup/pg_basebackup.p/pg_basebackup.c.o
[08:34:47.625] ccache cc -Isrc/bin/pg_basebackup/pg_basebackup.p
-Isrc/include -I../src/include -Isrc/interfaces/libpq
-I../src/interfaces/libpq -Isrc/include/catalog -Isrc/include/nodes
-Isrc/include/utils -fdiagnostics-color=always -pipe
-D_FILE_OFFSET_BITS=64 -Wall -Winvalid-pch -g -fno-strict-aliasing
-fwrapv -fexcess-precision=standard -D_GNU_SOURCE -Wmissing-prototypes
-Wpointer-arith -Werror=vla -Wendif-labels -Wmissing-format-attribute
-Wimplicit-fallthrough=3 -Wcast-function-type
-Wshadow=compatible-local -Wformat-security
-Wdeclaration-after-statement -Wno-format-truncation
-Wno-stringop-truncation -pthread -MD -MQ
src/bin/pg_basebackup/pg_basebackup.p/pg_basebackup.c.o -MF
src/bin/pg_basebackup/pg_basebackup.p/pg_basebackup.c.o.d -o
src/bin/pg_basebackup/pg_basebackup.p/pg_basebackup.c.o -c
../src/bin/pg_basebackup/pg_basebackup.c
[08:34:47.625] ../src/bin/pg_basebackup/pg_basebackup.c: In function ‘usage’:
[08:34:47.625] ../src/bin/pg_basebackup/pg_basebackup.c:411:5:
warning: statement with no effect [-Wunused-value]
[08:34:47.625] 411 | " (default: spread)\n"));
[08:34:47.625] | ^~
[08:34:47.625] ../src/bin/pg_basebackup/pg_basebackup.c:411:51: error:
expected ‘;’ before ‘)’ token
[08:34:47.625] 411 | " (default: spread)\n"));
[08:34:47.625] | ^
[08:34:47.625] | ;
[08:34:47.625] ../src/bin/pg_basebackup/pg_basebackup.c:411:51: error:
expected statement before ‘)’ token
[08:34:47.625] ../src/bin/pg_basebackup/pg_basebackup.c:411:52: error:
expected statement before ‘)’ token
[08:34:47.625] 411 | " (default: spread)\n"));
[08:34:47.625] | ^
[08:34:47.629] [1210/1832] Compiling C object
src/bin/pg_dump/libpgdump_common.a.p/parallel.c.o
[08:34:47.639] [1211/1832] Compiling C object
src/bin/pg_basebackup/pg_recvlogical.p/pg_recvlogical.c.o
[08:34:47.641] [1212/1832] Linking static target
src/bin/pg_basebackup/libpg_basebackup_common.a
[08:34:47.658] [1213/1832] Compiling C object
src/bin/pg_dump/libpgdump_common.a.p/compress_io.c.o
[08:34:47.669] [1214/1832] Compiling C object
src/bin/pg_dump/libpgdump_common.a.p/compress_lz4.c.o
[08:34:47.678] [1215/1832] Compiling C object
src/bin/pg_dump/libpgdump_common.a.p/compress_zstd.c.o
[08:34:47.692] [1216/1832] Compiling C object
src/bin/pg_dump/libpgdump_common.a.p/dumputils.c.o
[08:34:47.692] ninja: build stopped: subcommand failed.
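For what it's worth, these errors are consistent with the new string
literal having ended up outside the closing parenthesis of the printf
call. A sketch of how the intended help text from the discussion above
would normally be written, relying on the compiler concatenating
adjacent string literals (an illustration, not the author's actual
patch):

    printf(_("  -c, --checkpoint=fast|spread\n"
             "                         set fast or spread checkpointing\n"
             "                         (default: spread)\n"));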

I also see that the patch is marked 'Ready for Committer' in the
commitfest.

Just wanted to make sure you are aware of this error.

Thanks,
Shlok Kumar Kyal




Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-19 Thread Shlok Kyal
I tested the following scenario:
I started a new publisher with the 'max_replication_slots' parameter
set to '1' and created streaming replication with the new publisher as
the primary node.
Then I ran pg_upgrade from the old publisher to the new publisher. The
upgrade failed with the following error:

Restoring logical replication slots in the new cluster
SQL command failed
SELECT * FROM pg_catalog.pg_create_logical_replication_slot('test1',
'pgoutput', false, false);
ERROR:  all replication slots are in use
HINT:  Free one or increase max_replication_slots.

Failure, exiting

Should we document that existing replication slots are taken into
consideration when setting the 'max_replication_slots' value on the
new publisher?

Thanks
Shlok Kumar Kyal

On Wed, 18 Oct 2023 at 15:01, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Peter,
>
> Thank you for reviewing! PSA new version.
> Note that 0001 and 0002 are combined into one patch.
>
> > Here are some review comments for v51-0001
> >
> > ==
> > src/bin/pg_upgrade/check.c
> >
> > 0.
> > +check_old_cluster_for_valid_slots(bool live_check)
> > +{
> > + char output_path[MAXPGPATH];
> > + FILE*script = NULL;
> > +
> > + prep_status("Checking for valid logical replication slots");
> > +
> > + snprintf(output_path, sizeof(output_path), "%s/%s",
> > + log_opts.basedir,
> > + "invalid_logical_relication_slots.txt");
> >
> > 0a
> > typo /invalid_logical_relication_slots/invalid_logical_replication_slots/
>
> Fixed.
>
> > 0b.
> > Since the non-upgradable slots are not strictly "invalid", is this an
> > appropriate filename for the bad ones?
> >
> > But I don't have very good alternatives. Maybe:
> > - non_upgradable_logical_replication_slots.txt
> > - problem_logical_replication_slots.txt
>
> Per discussion [1], I kept current style.
>
> > src/bin/pg_upgrade/t/003_upgrade_logical_replication_slots.pl
> >
> > 1.
> > +# --
> > +# TEST: Confirm pg_upgrade fails when wrong GUC is set on new cluster
> > +#
> > +# There are two requirements for GUCs - wal_level and 
> > max_replication_slots,
> > +# but only max_replication_slots will be tested here. This is because to
> > +# reduce the execution time of the test.
> >
> >
> > SUGGESTION
> > # TEST: Confirm pg_upgrade fails when the new cluster has wrong GUC values.
> > #
> > # Two GUCs are required - 'wal_level' and 'max_replication_slots' - but to
> > # reduce the test execution time, only 'max_replication_slots' is tested 
> > here.
>
> First part was fixed. Second part was removed per [1].
>
> > 2.
> > +# Preparations for the subsequent test:
> > +# 1. Create two slots on the old cluster
> > +$old_publisher->start;
> > +$old_publisher->safe_psql('postgres',
> > + "SELECT pg_create_logical_replication_slot('test_slot1',
> > 'test_decoding', false, true);"
> > +);
> > +$old_publisher->safe_psql('postgres',
> > + "SELECT pg_create_logical_replication_slot('test_slot2',
> > 'test_decoding', false, true);"
> > +);
> >
> >
> > Can't you combine those SQL in the same $old_publisher->safe_psql.
>
> Combined.
>
> > 3.
> > +# Clean up
> > +rmtree($new_publisher->data_dir . "/pg_upgrade_output.d");
> > +# Set max_replication_slots to the same value as the number of slots. Both 
> > of
> > +# slots will be used for subsequent tests.
> > +$new_publisher->append_conf('postgresql.conf', "max_replication_slots = 
> > 1");
> >
> > The code doesn't seem to match the comment - is this correct? The
> > old_publisher created 2 slots, so why are you setting new_publisher
> > "max_replication_slots = 1" again?
>
> Fixed to "max_replication_slots = 2" Note that previous test worked well 
> because
> GUC checking on new cluster is done after checking the status of slots.
>
> > 4.
> > +# Preparations for the subsequent test:
> > +# 1. Generate extra WAL records. Because these WAL records do not get
> > consumed
> > +# it will cause the upcoming pg_upgrade test to fail.
> > +$old_publisher->start;
> > +$old_publisher->safe_psql('postgres',
> > + "CREATE TABLE tbl AS SELECT generate_series(1, 10) AS a;");
> > +
> > +# 2. Advance the slot test_slot2 up to the current WAL location
> > +$old_publisher->safe_psql('postgres',
> > + "SELECT pg_replication_slot_advance('test_slot2', NULL);");
> > +
> > +# 3. Emit a non-transactional message. test_slot2 detects the message so 
> > that
> > +# this slot will be also reported by upcoming pg_upgrade.
> > +$old_publisher->safe_psql('postgres',
> > + "SELECT count(*) FROM pg_logical_emit_message('false', 'prefix',
> > 'This is a non-transactional message');"
> > +);
> >
> >
> > I felt this test would be clearer if you emphasised the state of the
> > test_slot1 also. e.g.
> >
> > 4a.
> > BEFORE
> > +# 1. Generate extra WAL records. Because these WAL records do not get
> > consumed
> > +# it will cause the upcoming pg_upgrade test to fail.
> >
> > SUGGESTION
> > # 1. Generate extra WAL records. At this point neither test_slot1 nor 
> > test_slot2
> > #has consumed them.
>
> Fixed.
>
> > 4b.
> > BEFORE

Re: [PoC] pg_upgrade: allow to upgrade publisher node

2023-10-19 Thread Shlok Kyal
> Few comments:
> 1) We will be able to override the value of max_slot_wal_keep_size by
> using --new-options like '--new-options  "-c
> max_slot_wal_keep_size=val"':
> +   /*
> +* Use max_slot_wal_keep_size as -1 to prevent the WAL removal by the
> +* checkpointer process.  If WALs required by logical replication 
> slots
> +* are removed, the slots are unusable.  This setting prevents the
> +* invalidation of slots during the upgrade. We set this option when
> +* cluster is PG17 or later because logical replication slots
> can only be
> +* migrated since then. Besides, max_slot_wal_keep_size is
> added in PG13.
> +*/
> +   if (GET_MAJOR_VERSION(cluster->major_version) >= 1700)
> +   appendPQExpBufferStr(, " -c
> max_slot_wal_keep_size=-1");
>
> Should there be a check to throw an error if this option is specified
> or do we need some documentation that this option should not be
> specified?

I have tested the above scenario. We are able to override
max_slot_wal_keep_size by using '--new-options  "-c
max_slot_wal_keep_size=val"'. Also, with some insert statements during
pg_upgrade, old WAL files were deleted and the logical replication
slots were invalidated. Since the slots were invalidated, replication
did not happen after the upgrade.

Thanks,
Shlok Kumar Kyal




Re: pg_ctl start may return 0 even if the postmaster has been already started on Windows

2023-10-06 Thread Shlok Kyal
On Fri, 6 Oct 2023 at 11:38, Hayato Kuroda (Fujitsu)
 wrote:
>
> Dear Horiguchi-san,
>
> Thank you for making a patch! They can pass ci.
> I'm still not sure what should be, but I can respond a part.
>
> > Another issue is.. that I haven't been able to cause the false
> > positive of pg_ctl start..  Do you have a concise reproducer of the
> > issue?
>
> I found a short sleep in pg_ctl/t/001_start_stop.pl. This was introduced in
> 6bcce2580 to ensure waiting more than 2 seconds. I've tested on my CI and
> found that removing the sleep can trigger the failure. Also, I confirmed your 
> patch
> fixes the problem. PSA the small patch for cfbot. 0001 and 0002 were not 
> changed.

I have tested the patches on my Windows setup.
I am trying to start the same postgres server twice, with an interval
of 5 secs between the attempts.

with HEAD (when same server is started after an interval of 5 secs):
D:\project\pg\bin>pg_ctl -D ../data -l data2.log start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start stopped waiting
pg_ctl: could not start server
Examine the log output.

with Patch:(when same server is started after an interval of 5 secs)
D:\project\pg_dev\bin>pg_ctl -D ../data -l data2.log start
pg_ctl: another server might be running; trying to start server anyway
waiting for server to startpg_ctl: launcher shell died

The output message with the patch is different from HEAD. I feel that
with the patch we should also get the message "pg_ctl: could not start
server".
Is this message change intentional?

Thanks,
Shlok Kumar Kyal




Re: Invalidate the subscription worker in cases where a user loses their superuser status

2023-10-04 Thread Shlok Kyal
On Wed, 4 Oct 2023 at 16:56, Peter Smith  wrote:
>
> On Tue, Oct 3, 2023 at 5:42 PM vignesh C  wrote:
> >
> > Thanks for the comments, the attached v6 version patch has the changes
> > for the same.
> >
>
> v6 LGTM.
>
I have verified the patch and it is working fine for me.