Re: speed up a logical replica setup

2024-05-23 Thread Amit Kapila
On Thu, May 23, 2024 at 8:43 PM Euler Taveira  wrote:
>
> On Thu, May 23, 2024, at 5:54 AM, Amit Kapila wrote:
>
>
> Why in the first place do we need to ensure that primary_slot_name is
> active on the primary? You mentioned something related to WAL
> retention but I don't know how that is related to this tool's
> functionality. If at all, we are bothered about WAL retention on the
> primary that should be the WAL corresponding to consistent_lsn
> computed by setup_publisher() but this check doesn't seem to ensure
> that.
>
> Maybe it is a lot of checks. I'm afraid there isn't a simple way to get and
> make sure the replication slot is used by the physical replication. I mean if
> there is primary_slot_name = 'foo' on standby, there is no guarantee that the
> replication slot 'foo' exists on primary. The idea is to get the exact
> replication slot name used by physical replication to drop it. Once I posted a
> patch it should be clear. (Another idea is to relax this check and rely only 
> on
> primary_slot_name to drop this replication slot on primary. The replication 
> slot
> might not exist and it shouldn't return an error in this case.)
>

I think your other idea is better than what we are doing currently.
Let's ignore the ERROR even if the primary_slot_name doesn't exist on
the primary.
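
For illustration, one hedged way the tool could drop the slot named in
primary_slot_name without erroring when it is absent (the slot name 'foo' is
just the placeholder used above) is to filter through pg_replication_slots:

  -- drops the slot only if it exists; returns zero rows (and no error) otherwise
  SELECT pg_drop_replication_slot(slot_name)
    FROM pg_catalog.pg_replication_slots
   WHERE slot_name = 'foo';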

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-05-23 Thread Amit Kapila
On Wed, May 22, 2024 at 8:46 PM Euler Taveira  wrote:
>
> On Wed, May 22, 2024, at 8:19 AM, Amit Kapila wrote:
>
> > v2-0002: not changed
> >
>
> We have added more tries to see if the primary_slot_name becomes
> active but I think it is still fragile because it is possible on slow
> machines that the required slot didn't become active even after more
> retries. I have raised the same comment previously [2] and asked an
> additional question but didn't get any response.
>
>
> Following the same line that simplifies the code, we can: (a) add a loop in
> check_subscriber() that waits until walreceiver is available on subscriber or
> (b) use a timeout. The main advantage of (a) is that the primary slot is 
> already
> available but I'm afraid we need an escape mechanism for the loop (timeout?).
>

Sorry, it is not clear to me why we need any additional loop in
check_subscriber(); aren't we talking about the problem in the
check_publisher() function?

Why in the first place do we need to ensure that primary_slot_name is
active on the primary? You mentioned something related to WAL
retention but I don't know how that is related to this tool's
functionality. If at all, we are bothered about WAL retention on the
primary that should be the WAL corresponding to consistent_lsn
computed by setup_publisher() but this check doesn't seem to ensure
that.

-- 
With Regards,
Amit Kapila.




Re: State of pg_createsubscriber

2024-05-23 Thread Amit Kapila
On Wed, May 22, 2024 at 8:02 PM Robert Haas  wrote:
>
> On Mon, May 20, 2024 at 2:42 AM Amit Kapila  wrote:
> > Just to summarize, apart from BF failures for which we had some
> > discussion, I could recall the following open points:
> >
> > 1. After promotion, the pre-existing replication objects should be
> > removed (either optionally or always), otherwise, it can lead to a new
> > subscriber not being able to restart or getting some unwarranted data.
> > [1][2].
> >
> > 2. Retaining synced slots on new subscribers can lead to unnecessary
> > WAL retention and dead rows [3].
> >
> > 3. We need to consider whether some of the messages displayed in
> > --dry-run mode are useful or not [4].
>
> Amit, thanks for summarizing your understanding of the situation. Tom,
> is this list complete, to your knowledge? The original thread is quite
> complex and it's hard to pick out what the open items actually are.
> :-(
>
> I would like to see this open item broken up into multiple open items,
> one per issue.
>
> Link [4] goes to a message that doesn't seem to relate to --dry-run.
>

Sorry, for the wrong link. See [1] for the correct link for --dry-run
related suggestion:

[1] 
https://www.postgresql.org/message-id/CAA4eK1J2fAvsJ2HihbWJ_GxETd6sdqSMrZdCVJEutRZRpm1MEQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: State of pg_createsubscriber

2024-05-23 Thread Amit Kapila
On Wed, May 22, 2024 at 8:00 PM Robert Haas  wrote:
>
> On Wed, May 22, 2024 at 9:52 AM Robert Haas  wrote:
> > Another option that we should at least consider is "do nothing". In a
> > case like the one Shlok describes, how are we supposed to know what
> > the right thing to do is? Is it unreasonable to say that if the user
> > doesn't want those publications or subscriptions to exist, the user
> > should drop them?
> >

This tool's intended purpose is to speed up the creation of a
subscriber node, and for that one won't need the subscriptions already
present on the existing primary/publisher node from which the new
subscriber node is going to get the data. Additionally, the ERRORs
shown by Shlok will occur even while the command (run by
pg_createsubscriber) is in progress, which will probably be confusing.

> > Maybe it is unreasonable to say that, but it seems to me we should at
> > least talk about that.
>
> As another option, maybe we could disable subscriptions, so that
> nothing happens when the server is first started, and then the user
> could decide after that what they want to do.
>

Yeah, this would be worth considering. Note that even if the user
wants to retain such pre-existing subscriptions and enable them, they
need more steps than simply enabling them, to avoid duplicate data
issues or ERRORs as shown in Shlok's test.

So, we have the following options: (a) by default drop the
pre-existing subscriptions, or (b) by default disable the pre-existing
subscriptions and add a note in the docs that users can take the
necessary actions to enable or drop them. Now, we can even think of
providing a switch to retain the pre-existing subscriptions or
publications as the user may have some use case where it can be
helpful for her. For example, retaining publications can help in
creating a bi-directional setup.
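
To make options (a) and (b) concrete, a rough sketch of what the cleanup on
the new subscriber could look like for a leftover subscription named "test1"
(the name from Shlok's example):

  -- option (b): keep the subscription but stop its apply worker
  ALTER SUBSCRIPTION test1 DISABLE;

  -- option (a): drop it without touching the remote slot, which still
  -- belongs to the old N3->N1 replication
  ALTER SUBSCRIPTION test1 DISABLE;
  ALTER SUBSCRIPTION test1 SET (slot_name = NONE);
  DROP SUBSCRIPTION test1;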

-- 
With Regards,
Amit Kapila.




Re: State of pg_createsubscriber

2024-05-22 Thread Amit Kapila
On Wed, May 22, 2024 at 2:45 PM Shlok Kyal  wrote:
>
> > Just to summarize, apart from BF failures for which we had some
> > discussion, I could recall the following open points:
> >
> > 1. After promotion, the pre-existing replication objects should be
> > removed (either optionally or always), otherwise, it can lead to a new
> > subscriber not being able to restart or getting some unwarranted data.
> > [1][2].
> >
> I tried to reproduce the case and found a case where pre-existing
> replication objects can cause unwanted scenario:
>
> Suppose we have a setup of nodes N1, N2 and N3.
> N1 and N2 are in streaming replication where N1 is primary and N2 is standby.
> N3 and N1 are in logical replication where N3 is publisher and N1 is 
> subscriber.
> The subscription created on N1 is replicated to N2 due to streaming 
> replication.
>
> Now, after we run pg_createsubscriber on N2 and start the N2 server,
> we get the following logs repetitively:
> 2024-05-22 11:37:18.619 IST [27344] ERROR:  could not start WAL
> streaming: ERROR:  replication slot "test1" is active for PID 27202
> 2024-05-22 11:37:18.622 IST [27317] LOG:  background worker "logical
> replication apply worker" (PID 27344) exited with exit code 1
> 2024-05-22 11:37:23.610 IST [27349] LOG:  logical replication apply
> worker for subscription "test1" has started
> 2024-05-22 11:37:23.624 IST [27349] ERROR:  could not start WAL
> streaming: ERROR:  replication slot "test1" is active for PID 27202
> 2024-05-22 11:37:23.627 IST [27317] LOG:  background worker "logical
> replication apply worker" (PID 27349) exited with exit code 1
> 2024-05-22 11:37:28.616 IST [27382] LOG:  logical replication apply
> worker for subscription "test1" has started
>
> Note: 'test1' is the name of the subscription created on N1 initially
> and by default, slot name is the same as the subscription name.
>
> Once the N2 server is started after running pg_createsubscriber, the
> subscription that was earlier replicated by streaming replication will
> now try to connect to the publisher. Since the subscription name in N2
> is the same as the subscription created in N1, it will not be able to
> start a replication slot as the slot with the same name is active for
> logical replication between N3 and N1.
>
> Also, there would be a case where N1 becomes down for some time. Then
> in that case subscription on N2 will connect to the publication on N3
> and now data from N3 will be replicated to N2 instead of N1. And once
> N1 is up again, subscription on N1 will not be able to connect to
> publication on N3 as it is already connected to N2. This can lead to
> data inconsistency.
>

So, what shall we do about such cases?  I think by default we can
remove all pre-existing subscriptions and publications on the promoted
standby, or instead we can remove them based on some switch. If we want
to go with this idea then we might need to distinguish between the
pre-existing subscriptions and the ones created by this tool.

The other case where I remember suggesting an option in this tool was
to avoid specifying slots, pubs, etc. for each database. See [1]. We
can probably leave it out if it is not important, but we never reached
a conclusion on it.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1%2Br96SyHYHx7BaTtGX0eviqpbbkSu01MEzwV5b2VFXP6g%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-05-22 Thread Amit Kapila
On Mon, May 20, 2024 at 4:30 PM Shlok Kyal  wrote:
> >
> > I was trying to test this utility when 'sync_replication_slots' is on
> > and it gets in an ERROR loop [1] and never finishes. Please find the
> > postgresql.auto used on the standby attached. I think if the standby
> > has enabled sync_slots, you need to pass dbname in
> > GenerateRecoveryConfig(). I couldn't test it further but I wonder if
> > there are already synced slots on the standby (either due to
> > 'sync_replication_slots' or users have used
> > pg_sync_replication_slots() before invoking pg_createsubscriber),
> > those would be retained as it is on new subscriber and lead to
> > unnecessary WAL retention and dead rows.
> >
> > [1]
> > 2024-04-30 11:50:43.239 IST [12536] LOG:  slot sync worker started
> > 2024-04-30 11:50:43.247 IST [12536] ERROR:  slot synchronization
> > requires dbname to be specified in primary_conninfo
>
> Hi,
>
> I tested the scenario posted by Amit in [1], in which retaining synced
> slots lead to unnecessary WAL retention and ERROR. This is raised as
> the second open point in [2].
> The steps to reproduce the issue:
> (1) Setup physical replication with sync slot feature turned on by
> setting sync_replication_slots = 'true' or using
> pg_sync_replication_slots() on the standby node.
> For physical replication setup, run pg_basebackup with -R  and -d option.
> (2) Create a logical replication slot on primary node with failover
> option as true. A corresponding slot is created on standby as part of
> sync slot feature.
> (3) Run pg_createsubscriber on standby node.
> (4) On Checking for the replication slot on standby node, I noticed
> that the logical slots created in step 2 are retained.
>  I have attached the script to reproduce the issue.
>
> I and Kuroda-san worked to resolve open points. Here are patches to
> solve the second and third point in [2].
> Patches proposed by Euler are also attached just in case, but they
> were not modified.
>
> v2-0001: not changed
>

Shouldn't we modify it as per the suggestion given in the email [1]? I
am wondering if we can entirely get rid of the check on the primary
and simply rely on recovery_timeout while repeatedly checking
server_is_in_recovery(). If so, we can modify the test to use a
non-default recovery_timeout (say 180s or something similar if we have
used it at any other place). As an additional check we can ensure that
consistent_lsn is present on the standby.
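
A minimal sketch of that check, run against the standby, where '0/3000148'
is only a placeholder for the consistent_lsn computed by setup_publisher():

  -- still in recovery?
  SELECT pg_is_in_recovery();

  -- has the standby already replayed WAL up to the consistent point?
  SELECT pg_last_wal_replay_lsn() >= '0/3000148'::pg_lsn AS consistent_lsn_reached;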

> v2-0002: not changed
>

We have added more tries to see if the primary_slot_name becomes
active but I think it is still fragile because it is possible on slow
machines that the required slot didn't become active even after more
retries. I have raised the same comment previously [2] and asked an
additional question but didn't get any response.

[1] - 
https://www.postgresql.org/message-id/CAA4eK1JJq_ER6Kq_H%3DjKHR75QPRd8y9_D%3DRtYw%3DaPYKMfqLi9A%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/CAA4eK1LT3Z13Dg6p4Z%2B4adO_EY-ow5CmWfikEmBfL%3DeVrm8CPw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

2024-05-20 Thread Amit Kapila
On Fri, May 17, 2024 at 5:25 AM Michael Paquier  wrote:
>
> On Thu, May 16, 2024 at 09:00:47AM +0530, Amit Kapila wrote:
> > This can only be a problem if the apply worker generates more LOG
> > entries with the required context but it won't do that unless there is
> > an operation on the publisher to replicate. If we see any such danger
> > then I agree this can break on some slow machines but as of now, I
> > don't see any such race condition.
>
> Perhaps, still I'm not completely sure if this assumption is going to
> always stand for all the possible configurations we support.
>

As per my understanding, whether we see additional LOG entries will
primarily depend on the apply worker design, not on the other
configurations.

>  So,
> rather than coming back to this test again, I would choose to make the
> test as stable as possible from the start.  That's what mapping with
> the error message would offer when grabbing the LSN from the CONTEXT
> part of it, because there's a one-one mapping between the expected
> ERROR and its CONTEXT from which the information used by the test is
> retrieved.
>

I was slightly hesitant to do an ERROR string-based check because the
error string can change and it doesn't seem to bring additional
stability for this particular test. But if you and others are still
not convinced with the simple fix suggested by me then feel free to
proceed with an error-string based check.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-05-20 Thread Amit Kapila
On Sun, May 19, 2024 at 4:25 AM Thomas Munro  wrote:
>
> 040_pg_createsubscriber.pl seems to be failing occasionally on
> culicidae near "--dry-run on node S".  I couldn't immediately see why.
> That animal is using EXEC_BACKEND and I also saw a one-off failure a
> bit like that on my own local Linux + EXEC_BACKEND test run
> (sorry I didn't keep the details around).  Coincidence?
>

AFAICS, this is the same as one of the two BF failures being discussed
in this thread.

-- 
With Regards,
Amit Kapila.




Re: State of pg_createsubscriber

2024-05-20 Thread Amit Kapila
On Sun, May 19, 2024 at 11:20 PM Euler Taveira  wrote:
>
> On Sun, May 19, 2024, at 2:30 PM, Tom Lane wrote:
>
> I'm fairly disturbed about the readiness of pg_createsubscriber.
> The 040_pg_createsubscriber.pl TAP test is moderately unstable
> in the buildfarm [1], and there are various unaddressed complaints
> at the end of the patch thread (pretty much everything below [2]).
> I guess this is good enough to start beta with, but it's far from
> being good enough to ship, IMO.  If there were active work going
> on to fix these things, I'd feel better, but neither the C code
> nor the test script have been touched since 1 April.
>
>
> My bad. :( I'll post patches soon to address all of the points.
>

Just to summarize, apart from BF failures for which we had some
discussion, I could recall the following open points:

1. After promotion, the pre-existing replication objects should be
removed (either optionally or always), otherwise, it can lead to a new
subscriber not being able to restart or getting some unwarranted data.
[1][2].

2. Retaining synced slots on new subscribers can lead to unnecessary
WAL retention and dead rows [3].

3. We need to consider whether some of the messages displayed in
--dry-run mode are useful or not [4].

[1] - 
https://www.postgresql.org/message-id/CAA4eK1L6HOhT1qifTyuestXkPpkRwY9bOqFd4wydKsN6C3hePA%40mail.gmail.com
[2] - 
https://www.postgresql.org/message-id/flat/3fa9ef0f-b277-4c13-850a-8ccc04de1406%40eisentraut.org#152dacecfc8f0cf08cbd8ecb79d6d38f
[3] - 
https://www.postgresql.org/message-id/CAA4eK1KdCb%2B5sjYu6qCMXXdCX1y_ihr8kFzMozq0%3DP%3DauYxgog%40mail.gmail.com
[4] - 
https://www.postgresql.org/message-id/flat/3fa9ef0f-b277-4c13-850a-8ccc04de1406%40eisentraut.org#152dacecfc8f0cf08cbd8ecb79d6d38f

-- 
With Regards,
Amit Kapila.




Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

2024-05-15 Thread Amit Kapila
On Thu, May 16, 2024 at 3:43 AM Michael Paquier  wrote:
>
> On Wed, May 15, 2024 at 05:58:18PM +0530, Amit Kapila wrote:
> > I guess it could be more work if we want to enhance the test for
> > ERRORs other than the primary key violation.
>
> And?  You could pass the ERROR string expected as argument of the
> function if more flexibility is wanted at some point, no?
>

Normally, we consider error_codes for comparison as they are less
prone to change but here it may be okay to use error_string as we can
change it, if required. But let's discuss a bit more on the other
solution being discussed below.

>  It happens
> that you don't require that now, which is fine for the three scenarios
> calling test_skip_lsn.
>
> > One simple fix is to
> > update the log_offset to the location of the LOG after successful
> > replication of un-conflicted data. For example, the Log location after
> > we execute the below line in the test:
> >
> > # Check replicated data
> >    my $res =
> >      $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tbl");
> >    is($res, $expected, $msg);
>
> That still looks like a shortcut to me, weak to race conditions on
> slow machines where more log entries could be generated in-between.
> So it seems to me that you'd still want to make sure that the CONTEXT
> from which the LSN is retrieved maps to the sole expected error.  This
> is not going to be stable unless there are stronger checks to avoid
> log entries that can parasite the output, and a stronger matching
> ensures that.
>

This can only be a problem if the apply worker generates more LOG
entries with the required context but it won't do that unless there is
an operation on the publisher to replicate. If we see any such danger
then I agree this can break on some slow machines but as of now, I
don't see any such race condition.

-- 
With Regards,
Amit Kapila.




Re: First draft of PG 17 release notes

2024-05-15 Thread Amit Kapila
On Wed, May 15, 2024 at 7:36 AM Bruce Momjian  wrote:
>
> On Wed, May 15, 2024 at 02:03:32PM +1200, David Rowley wrote:
> > On Wed, 15 May 2024 at 13:00, Bruce Momjian  wrote:
> > >
> > > On Tue, May 14, 2024 at 03:39:26PM -0400, Melanie Plageman wrote:
> > > > "Reduce system calls by automatically merging reads up to 
> > > > io_combine_limit"
> > >
> > > Uh, as I understand it, the reduced number of system calls is not the
> > > value of the feature, but rather the ability to request a larger block
> > > from the I/O subsystem.  Without it, you have to make a request and wait
> > > for each request to finish.  I am open to new wording, but I am not sure
> > > your new wording is accurate.
> >
> > I think you have the cause and effect backwards. There's no advantage
> > to reading 128KB if you only need 8KB.  It's the fact that doing
> > *larger* reads allows *fewer* reads that allows it to be more
> > efficient.  There are also the efficiency gains from fadvise
> > POSIX_FADV_WILLNEED. I'm unsure how to jam that into a short sentence.
> > Maybe; "Optimize reading of tables by allowing pages to be prefetched
> > and read in chunks up to io_combine_limit", or a bit more buzzy;
> > "Optimize reading of tables by allowing pages to be prefetched and
> > performing vectored reads in chunks up to io_combine_limit".
>
> Yes, my point is that it is not the number of system calls or system
> call overhead that is the advantage of this patch, but the ability to
> request more of the I/O system in one call, which is not the same as
> system calls.
>
> I can use your wording, but how much prefetching to we enable now?
>

Shouldn't we include commit
b5a9b18cd0bc6f0124664999b31a00a264d16913 with this item?

-- 
With Regards,
Amit Kapila.




Re: Fix src/test/subscription/t/029_on_error.pl test when wal_debug is enabled

2024-05-15 Thread Amit Kapila
On Wed, May 15, 2024 at 9:26 AM Michael Paquier  wrote:
>
> On Tue, May 14, 2024 at 10:22:29AM +, Ilyasov Ian wrote:
> > Hello, hackers!
> >
> > Recently I've been building postgres with different cflags and cppflags.
> > And suddenly on REL_15_STABLE, REL_16_STABLE and master
> > I faced a failure of a src/test/subscription/t/029_on_error.pl test when
> > CPPFLAGS="-DWAL_DEBUG"
> > and
> > printf "wal_debug = on\n" >> "${TEMP_CONFIG}"
> > (or when both publisher and subscriber or only subscriber are run with 
> > wal_debug=on)
> >
> > So I propose a little fix to the test.
>
> Rather than assuming that the last line is the one to check, wouldn't
> it be better to grab the data from the CONTEXT line located just after
> the ERROR reporting the primary key violation?
>

I guess it could be more work if we want to enhance the test for
ERRORs other than the primary key violation. One simple fix is to
update the log_offset to the location of the LOG after successful
replication of un-conflicted data. For example, the Log location after
we execute the below line in the test:

# Check replicated data
my $res =
  $node_subscriber->safe_psql('postgres', "SELECT count(*) FROM tbl");
is($res, $expected, $msg);

-- 
With Regards,
Amit Kapila.




Re: Control flow in logical replication walsender

2024-05-07 Thread Amit Kapila
On Tue, May 7, 2024 at 9:51 AM Ashutosh Bapat
 wrote:
>
> On Tue, May 7, 2024 at 12:00 AM Christophe Pettus  wrote:
>>
>> Thank you for the reply!
>>
>> > On May 1, 2024, at 02:18, Ashutosh Bapat  
>> > wrote:
>> > Is there a large transaction which is failing to be replicated repeatedly 
>> > - timeouts, crashes on upstream or downstream?
>>
>> AFAIK, no, although I am doing this somewhat by remote control (I don't have 
>> direct access to the failing system).  This did bring up one other question, 
>> though:
>>
>> Are subtransactions written to their own individual reorder buffers (and 
>> thus potentially spill files), or are they appended to the topmost 
>> transaction's reorder buffer?
>
>
> IIRC, they have their own RB,
>

Right.

>
>  but once they commit, they are transferred to topmost transaction's RB.
>

I don't think they are transferred to topmost transaction's RB. We
perform a k-way merge between transactions/subtransactions to retrieve
the changes. See comments: "Support for efficiently iterating over a
transaction's and its subtransactions' changes..." in reorderbuffer.c

-- 
With Regards,
Amit Kapila.




Re: Control flow in logical replication walsender

2024-05-07 Thread Amit Kapila
On Tue, Apr 30, 2024 at 11:28 PM Christophe Pettus  wrote:
>
> I wanted to check my understanding of how control flows in a walsender doing 
> logical replication.  My understanding is that the (single) thread in each 
> walsender process, in the simplest case, loops on:
>
> 1. Pull a record out of the WAL.
> 2. Pass it to the reorder buffer code, which,
> 3. Sorts it out into the appropriate in-memory structure for that transaction 
> (spilling to disk as required), and then continues with #1, or,
> 4. If it's a commit record, it iteratively passes the transaction data one 
> change at a time to,
> 5. The logical decoding plugin, which returns the output format of that 
> plugin, and then,
> 6. The walsender sends the output from the plugin to the client. It cycles on 
> passing the data to the plugin and sending it to the client until it runs out 
> of changes in that transaction, and then resumes reading the WAL in #1.
>
> In particular, I wanted to confirm that while it is pulling the reordered 
> transaction and sending it to the plugin (and thence to the client), that 
> particular walsender is *not* reading new WAL records or putting them in the 
> reorder buffer.
>
> The specific issue I'm trying to track down is an enormous pileup of spill 
> files.  This is in a non-supported version of PostgreSQL (v11), so an upgrade 
> may fix it, but at the moment, I'm trying to find a cause and a mitigation.
>

In PG-14, we have added a feature in logical replication to stream
long in-progress transactions which should reduce spilling to a good
extent. You might want to try that.
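
For reference, streaming of in-progress transactions is controlled per
subscription; a minimal example (connection string and names are
placeholders):

  CREATE SUBSCRIPTION sub1
      CONNECTION 'host=primary dbname=postgres'
      PUBLICATION pub1
      WITH (streaming = on);

  -- or, for an existing subscription:
  ALTER SUBSCRIPTION sub1 SET (streaming = on);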

-- 
With Regards,
Amit Kapila.




Re: TerminateOtherDBBackends code comments inconsistency.

2024-05-06 Thread Amit Kapila
On Tue, Apr 30, 2024 at 10:36 PM Noah Misch  wrote:
>
> >
> > >  One could argue the function should also check
> > > isBackgroundWorker and ignore even bgworkers that set proc->roleId, but 
> > > I've
> > > not done that.
> >
> > What is the argument for ignoring such workers?
>
> One of the proposed code comments says, "For bgworker authors, it's convenient
> to be able to recommend FORCE if a worker is blocking DROP DATABASE
> unexpectedly."  That argument is debatable, but I do think it applies equally
> to bgworkers whether or not they set proc->roleId.
>

Agreed.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-04-30 Thread Amit Kapila
On Tue, Apr 30, 2024 at 12:04 PM Amit Kapila  wrote:
>
> On Mon, Apr 29, 2024 at 5:28 PM Amit Kapila  wrote:
> >
> > On Mon, Apr 29, 2024 at 5:23 PM Euler Taveira  wrote:
> >
>
> I was trying to test this utility when 'sync_replication_slots' is on
> and it gets in an ERROR loop [1] and never finishes. Please find the
> postgresql.auto used on the standby attached. I think if the standby
> has enabled sync_slots, you need to pass dbname in
> GenerateRecoveryConfig().

The other possibility is that when we start standby from
pg_createsubscriber, we specifically set 'sync_replication_slots' as
false.

>
> I couldn't test it further but I wonder if
> there are already synced slots on the standby (either due to
> 'sync_replication_slots' or users have used
> pg_sync_replication_slots() before invoking pg_createsubscriber),
> those would be retained as it is on new subscriber and lead to
> unnecessary WAL retention and dead rows.
>

This still needs some handling.
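
For illustration, a query like the following (using the synced column added
for the slot-sync feature) could be used on the standby or the new subscriber
to spot such leftover synced slots:

  SELECT slot_name, slot_type, synced, temporary, active
    FROM pg_replication_slots
   WHERE synced;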

BTW, I don't see the use of following messages in --dry-run mode or
rather they could be misleading:
pg_createsubscriber: hint: If pg_createsubscriber fails after this
point, you must recreate the physical replica before continuing.
...
...
pg_createsubscriber: setting the replication progress (node name
"pg_0" ; LSN 0/0) on database "postgres"

Similarly, we should think if below messages are useful in --dry-run mode:
pg_createsubscriber: dropping publication
"pg_createsubscriber_5_887f7991" on database "postgres"
pg_createsubscriber: creating subscription
"pg_createsubscriber_5_887f7991" on database "postgres"
...
pg_createsubscriber: enabling subscription
"pg_createsubscriber_5_887f7991" on database "postgres"

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-04-30 Thread Amit Kapila
On Mon, Apr 29, 2024 at 5:28 PM Amit Kapila  wrote:
>
> On Mon, Apr 29, 2024 at 5:23 PM Euler Taveira  wrote:
>

I was trying to test this utility when 'sync_replication_slots' is on
and it gets in an ERROR loop [1] and never finishes. Please find the
postgresql.auto used on the standby attached. I think if the standby
has enabled sync_slots, you need to pass dbname in
GenerateRecoveryConfig(). I couldn't test it further but I wonder if
there are already synced slots on the standby (either due to
'sync_replication_slots' or users have used
pg_sync_replication_slots() before invoking pg_createsubscriber),
those would be retained as it is on new subscriber and lead to
unnecessary WAL retention and dead rows.

[1]
2024-04-30 11:50:43.239 IST [12536] LOG:  slot sync worker started
2024-04-30 11:50:43.247 IST [12536] ERROR:  slot synchronization
requires dbname to be specified in primary_conninfo

-- 
With Regards,
Amit Kapila.


postgresql.auto.standby.conf
Description: Binary data


Re: TerminateOtherDBBackends code comments inconsistency.

2024-04-29 Thread Amit Kapila
On Tue, Apr 30, 2024 at 2:58 AM Noah Misch  wrote:
>
> On Mon, Apr 29, 2024 at 10:18:35AM +0530, Amit Kapila wrote:
> > On Mon, Apr 22, 2024 at 9:56 PM Noah Misch  wrote:
>
> > 3a9b18b309 didn't change the docs of pg_terminate_backend and whatever
> > is mentioned w.r.t permissions in the doc of that function sounds
> > valid for drop database force to me. Do you have any specific proposal
> > in your mind?
>
> Something like the attached.
>

LGTM.

>  One could argue the function should also check
> isBackgroundWorker and ignore even bgworkers that set proc->roleId, but I've
> not done that.

What is the argument for ignoring such workers?

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-04-29 Thread Amit Kapila
On Mon, Apr 29, 2024 at 5:23 PM Euler Taveira  wrote:
>
> On Mon, Apr 29, 2024, at 6:56 AM, Amit Kapila wrote:
>
> On Wed, Mar 27, 2024 at 1:47 AM Euler Taveira  wrote:
> >
> > On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:
> >
> > Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
> > Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
> > waited long enough?
> >
> >
> > It was an attempt to decouple a connection failure (that keeps streaming 
> > the
> > WAL) from recovery timeout. The NUM_CONN_ATTEMPTS guarantees that if the 
> > primary
> > is gone during the standby recovery process, there is a way to bail out.
> >
>
> I think we don't need to check primary if the WAL corresponding to
> consistent_lsn is already present on the standby. Shouldn't we first
> check that? Once we ensure that the required WAL is copied, just
> checking server_is_in_recovery() should be sufficient. I feel that
> will be a direct way of ensuring what is required rather than
> indirectly verifying the same (by checking pg_stat_wal_receiver) as we
> are doing currently.
>
>
> How would you check it? WAL file? During recovery, you are not allowed to use
> pg_current_wal_lsn.
>

How about pg_last_wal_receive_lsn()?

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-04-29 Thread Amit Kapila
On Tue, Mar 26, 2024 at 8:24 AM Euler Taveira  wrote:
>
> On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
>
> The first patch implements a combination of (1) and (2).
>
> ## Analysis for failure 2
>
> According to [2], the physical replication slot which is specified as 
> primary_slot_name
> was not used by the walsender process. At that time walsender has not existed.
>
> ```
> ...
> pg_createsubscriber: publisher: current wal senders: 0
> pg_createsubscriber: command is: SELECT 1 FROM 
> pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
> pg_createsubscriber: error: could not obtain replication slot information: 
> got 0 rows, expected 1 row
> ...
> ```
>
> Currently standby must be stopped before the command and current code does not
> block the flow to ensure the replication is started. So there is a possibility
> that the checking is run before walsender is launched.
>
> One possible approach is to wait until the replication starts. Alternative 
> one is
> to ease the condition.
>
>
> That's my suggestion too. I reused NUM_CONN_ATTEMPTS (that was renamed to
> NUM_ATTEMPTS in the first patch). See second patch.
>

How can we guarantee that the slot can become active after these many
attempts? It still could be possible that on some slow machines it
didn't get activated even after NUM_ATTEMPTS. BTW, in the first place,
why do we need to ensure that the 'primary_slot_name' is active on the
primary?
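
If the requirement is only that the slot exists (rather than that a walsender
is already using it), the check quoted above could be relaxed to something
like:

  SELECT 1
    FROM pg_catalog.pg_replication_slots
   WHERE slot_name = 'physical_slot';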

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-04-29 Thread Amit Kapila
On Wed, Mar 27, 2024 at 1:47 AM Euler Taveira  wrote:
>
> On Tue, Mar 26, 2024, at 4:12 PM, Tomas Vondra wrote:
>
> Perhaps I'm missing something, but why is NUM_CONN_ATTEMPTS even needed?
> Why isn't recovery_timeout enough to decide if wait_for_end_recovery()
> waited long enough?
>
>
> It was an attempt to decouple a connection failure (that keeps streaming the
> WAL) from recovery timeout. The NUM_CONN_ATTEMPTS guarantees that if the 
> primary
> is gone during the standby recovery process, there is a way to bail out.
>

I think we don't need to check primary if the WAL corresponding to
consistent_lsn is already present on the standby. Shouldn't we first
check that? Once we ensure that the required WAL is copied, just
checking server_is_in_recovery() should be sufficient. I feel that
will be a direct way of ensuring what is required rather than
indirectly verifying the same (by checking pg_stat_wal_receiver) as we
are doing currently.

-- 
With Regards,
Amit Kapila.




Re: TerminateOtherDBBackends code comments inconsistency.

2024-04-28 Thread Amit Kapila
On Mon, Apr 22, 2024 at 9:56 PM Noah Misch  wrote:
>
> On Mon, Apr 15, 2024 at 11:17:54AM +0530, Amit Kapila wrote:
> > On Thu, Apr 11, 2024 at 6:58 PM Kirill Reshke  
> > wrote:
> > >
> > > While working on [0] i have noticed this comment in
> > > TerminateOtherDBBackends function:
> > >
> > > /*
> > > * Check whether we have the necessary rights to terminate other
> > > * sessions. We don't terminate any session until we ensure that we
> > > * have rights on all the sessions to be terminated. These checks are
> > > * the same as we do in pg_terminate_backend.
> > > *
> > > * In this case we don't raise some warnings - like "PID %d is not a
> > > * PostgreSQL server process", because for us already finished session
> > > * is not a problem.
> > > */
> > >
> > > This statement is not true after 3a9b18b.
> > > "These checks are the same as we do in pg_terminate_backend."
>
> The comment mismatch is a problem.  Thanks for reporting it.  The DROP
> DATABASE doc mimics the comment, using, "permissions are the same as with
> pg_terminate_backend".
>

How about updating the comments as in the attached? I see that commit
3a9b18b309 didn't change the docs of pg_terminate_backend and whatever
is mentioned w.r.t permissions in the doc of that function sounds
valid for drop database force to me. Do you have any specific proposal
in your mind?

> > > But the code is still correct, I assume... or not? In fact, we are
> > > killing autovacuum workers which are working with a given database
> > > (proc->roleId == 0), which is OK in that case. Are there any other
> > > cases when proc->roleId == 0 but we should not be able to kill such a
> > > process?
> > >
> >
> > Good question. I am not aware of such cases but I wonder if we should
> > add a check similar to 3a9b18b [1] for the reason given in the commit
> > message. I have added Noah to see if he has any suggestions on this
> > matter.
> >
> > [1] -
> > commit 3a9b18b3095366cd0c4305441d426d04572d88c1
> > Author: Noah Misch 
> > Date:   Mon Nov 6 06:14:13 2023 -0800
> >
> > Ban role pg_signal_backend from more superuser backend types.
>
> That commit distinguished several scenarios.  Here's how they apply in
> TerminateOtherDBBackends():
>
> - logical replication launcher, autovacuum launcher: the proc->databaseId test
>   already skips them, since they don't connect to a database.  Vignesh said
>   this.
>
> - autovacuum worker: should terminate, since CountOtherDBBackends() would
>   terminate them in DROP DATABASE without FORCE.
>
> - background workers that connect to a database: the right thing is less clear
>   on these, but I lean toward allowing termination and changing the DROP
>   DATABASE doc.  As a bgworker author, I would value being able to recommend
>   DROP DATABASE FORCE if a worker is sticking around unexpectedly.  There's
>   relatively little chance of a bgworker actually wanting to block DROP
>   DATABASE FORCE or having an exploitable termination-time bug.
>

Agreed with this analysis and I don't see the need for the code change.
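
For reference, the command whose termination rights are being discussed here
is the forced variant of DROP DATABASE (database name is a placeholder):

  DROP DATABASE appdb WITH (FORCE);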

-- 
With Regards,
Amit Kapila.


fix_comments_1.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-28 Thread Amit Kapila
On Thu, Apr 25, 2024 at 11:11 AM Bharath Rupireddy
 wrote:
>
> On Mon, Apr 22, 2024 at 7:21 PM Masahiko Sawada  wrote:
> >
> > > Please find the attached v35 patch.
> >
> > The documentation says about both 'active' and 'inactive_since'
> > columns of pg_replication_slots say:
> >
> > ---
> > active bool
> > True if this slot is currently actively being used
> >
> > inactive_since timestamptz
> > The time since the slot has become inactive. NULL if the slot is
> > currently being used. Note that for slots on the standby that are
> > being synced from a primary server (whose synced field is true), the
> > inactive_since indicates the last synchronization (see Section 47.2.3)
> > time.
> > ---
> >
> > When reading the description I thought if 'active' is true,
> > 'inactive_since' is NULL, but it doesn't seem to apply for temporary
> > slots.
>
> Right.
>
> > Since we don't reset the active_pid field of temporary slots
> > when the release, the 'active' is still true in the view but
> > 'inactive_since' is not NULL.
>
> Right. inactive_since is reset whenever the temporary slot is acquired
> again within the same backend that created the temporary slot.
>
> > Do you think we need to mention it in
> > the documentation?
>
> I think that's the reason we dropped "active" from the statement. It
> was earlier "NULL if the slot is currently actively being used.". But,
> per Bertrand's comment
> https://www.postgresql.org/message-id/ZehE2IJcsetSJMHC%40ip-10-97-1-34.eu-west-3.compute.internal
> changed it to "NULL if the slot is currently being used.".
>
> Temporary slots retain the active = true and active_pid = <PID of the backend that created it> even when the slot is not being used until
> the lifetime of the backend process. We haven't tied active or
> active_pid flags to inactive_since, doing so now to represent the
> temporary slot behaviour for active and active_pid will confuse users
> more.
>

This is true and it's probably easy for us to understand as we
developed this feature but the same may not be true for others. I
wonder if we can be explicit about the difference of
active/inactive_since by adding something like the following for
inactive_since: Note that this field is not related to the active flag
as temporary slots can remain active till the session ends even when
they are not being used.
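
For illustration, the combination being described (active = true while
inactive_since is still set) can be observed with a query like:

  SELECT slot_name, temporary, active, active_pid, inactive_since
    FROM pg_replication_slots
   WHERE temporary;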

Sawada-San, do you have any suggestions on the wording?

>
>  As far as the inactive_since of a slot is concerned, it is set
> to 0 when the slot is being used (acquired) and set to current
> timestamp when the slot is not being used (released).
>
> > As for the timeout-based slot invalidation feature, we could end up
> > invalidating the temporary slots even if they are shown as active,
> > which could confuse users. Do we want to somehow deal with it?
>
> Yes. As long as the temporary slot is lying unused holding up
> resources for more than the specified
> replication_slot_inactive_timeout, it is bound to get invalidated.
> This keeps behaviour consistent and less-confusing to the users.
>

Agreed. We may want to add something in the docs for this to avoid
confusion with the active flag.

-- 
With Regards,
Amit Kapila.




Re: promotion related handling in pg_sync_replication_slots()

2024-04-25 Thread Amit Kapila
On Wed, Apr 24, 2024 at 10:28 AM shveta malik  wrote:
>
> Modified the logic to remove only synced temporary slots during
> SQL-function exit.
>
> Please find v11 with above changes.
>

LGTM, so pushed!

-- 
With Regards,
Amit Kapila.




Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-25 Thread Amit Kapila
On Thu, Apr 25, 2024 at 7:01 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Wednesday, April 24, 2024 6:29 PM vignesh C  wrote:
> >
> >
> > The attached patch
> > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr.patch
> > applies for master and PG16 branch,
> > v7-0001-Table-sync-missed-due-to-race-condition-in-subscr_PG15.patch
> > applies for PG15 branch.
>
> Thanks, I have verified that the patches can be applied cleanly and fix the
> issue on each branch. The regression test can also pass after applying the 
> patch
> on my machine.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-24 Thread Amit Kapila
On Tue, Apr 23, 2024 at 4:53 PM Amit Kapila  wrote:
>
> On Wed, Mar 13, 2024 at 11:59 AM vignesh C  wrote:
> >
> > On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
> >  wrote:
> > >
> > >
> > > For 0002, instead of avoid resetting the latch, is it possible to let the
> > > logical rep worker wake up the launcher once after attaching ?
> >
> > Waking up of the launch process uses the same latch that is used for
> > subscription creation/modification and apply worker process exit. As
> > the handling of this latch for subscription creation/modification and
> > worker process exit can be done only by ApplyLauncherMain, we will not
> > be able to reset the latch in WaitForReplicationWorkerAttach. I feel
> > waking up the launcher process might not help in this case as
> > currently we will not be able to differentiate between worker
> > attached, subscription creation/modification and apply worker process
> > exit.
> >
>
> IIUC, even if we set the latch once the worker attaches, the other
> set_latch by subscription creation/modification or apply_worker_exit
> could also be consumed due to reset of latch in
> WaitForReplicationWorkerAttach(). Is that understanding correct? If
> so, can we use some other way to wake up
> WaitForReplicationWorkerAttach() say condition variable?
>

The other possibility is to have a GUC launcher_retry_time or
something like that instead of using a DEFAULT_NAPTIME_PER_CYCLE. This
still may not resolve the issue if launcher_retry_time is longer but
at least users would be able to configure it. I am not sure if this is
a good idea or not but just trying to brainstorm different ideas to
solve this problem.

BTW, as far as I understand, this is an improvement in the existing
code, so should be done only for HEAD (probably PG18) and should be
discussed in a separate thread.

-- 
With Regards,
Amit Kapila.




Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-24 Thread Amit Kapila
On Wed, Mar 13, 2024 at 9:19 AM vignesh C  wrote:
>
> On Tue, 12 Mar 2024 at 09:34, Ajin Cherian  wrote:
> >
> >
> >
> > On Tue, Mar 12, 2024 at 2:59 PM vignesh C  wrote:
> >>
> >>
> >> Thanks, I have created the following Commitfest entry for this:
> >> https://commitfest.postgresql.org/47/4816/
> >>
> >> Regards,
> >> Vignesh
> >
> >
> > Thanks for the patch, I have verified that the fix works well by following 
> > the steps mentioned to reproduce the problem.
> > Reviewing the patch, it seems good and is well documented. Just one minor 
> > comment I had was probably to change the name of the variable 
> > table_states_valid to table_states_validity. The current name made sense 
> > when it was a bool, but now that it is a tri-state enum, it doesn't fit 
> > well.
>
> Thanks for reviewing the patch, the attached v6 patch has the changes
> for the same.
>

v6_0001* looks good to me. This should be backpatched unless you or
others think otherwise.

-- 
With Regards,
Amit Kapila.




Re: Disallow changing slot's failover option in transaction block

2024-04-23 Thread Amit Kapila
On Mon, Apr 22, 2024 at 2:31 PM Bertrand Drouvot
 wrote:
>
> On Mon, Apr 22, 2024 at 11:32:14AM +0530, shveta malik wrote:
> > On Mon, Apr 22, 2024 at 5:57 AM Zhijie Hou (Fujitsu)
> >  wrote:
> > > Attach the V3 patch which also addressed Shveta[1] and Bertrand[2]'s 
> > > comments.
>
> Thanks!
>
> >  Tested the patch, works well.
>
> Same here.
>

Pushed!

-- 
With Regards,
Amit Kapila.




Re: subscription/026_stats test is intermittently slow?

2024-04-23 Thread Amit Kapila
On Tue, Apr 23, 2024 at 11:49 AM vignesh C  wrote:
>
> On Sat, 20 Apr 2024 at 10:30, Alexander Lakhin  wrote:
> >
> > Hello Michael and Robert,
> >
> > 20.04.2024 05:57, Michael Paquier wrote:
> > > On Fri, Apr 19, 2024 at 01:57:41PM -0400, Robert Haas wrote:
> > >> It looks to me like in the first run it took 3 minutes for the
> > >> replay_lsn to catch up to the desired value, and in the second run,
> > >> two seconds. I think I have seen previous instances where something
> > >> similar happened, although in those cases I did not stop to record any
> > >> details. Have others seen this? Is there something we can/should do
> > >> about it?
> > > FWIW, I've also seen delays as well with this test on a few occasions.
> > > Thanks for mentioning it.
> >
> > It reminds me of
> > https://www.postgresql.org/message-id/858a7622-2c81-1687-d1df-1322dfcb2e72%40gmail.com
>
> Thanks Alexander for the test, I was able to reproduce the issue with
> the test you shared and also verify that the patch at [1] fixes the
> same.
>

One of the issues reported in the thread you referred to has the same
symptoms [1]. I'll review and analyze your proposal.

[1] - 
https://www.postgresql.org/message-id/858a7622-2c81-1687-d1df-1322dfcb2e72%40gmail.com

-- 
With Regards,
Amit Kapila.




Re: Race condition in FetchTableStates() breaks synchronization of subscription tables

2024-04-23 Thread Amit Kapila
On Wed, Mar 13, 2024 at 11:59 AM vignesh C  wrote:
>
> On Wed, 13 Mar 2024 at 10:12, Zhijie Hou (Fujitsu)
>  wrote:
> >
> >
> > For 0002, instead of avoid resetting the latch, is it possible to let the
> > logical rep worker wake up the launcher once after attaching ?
>
> Waking up of the launch process uses the same latch that is used for
> subscription creation/modification and apply worker process exit. As
> the handling of this latch for subscription creation/modification and
> worker process exit can be done only by ApplyLauncherMain, we will not
> be able to reset the latch in WaitForReplicationWorkerAttach. I feel
> waking up the launcher process might not help in this case as
> currently we will not be able to differentiate between worker
> attached, subscription creation/modification and apply worker process
> exit.
>

IIUC, even if we set the latch once the worker attaches, the other
set_latch by subscription creation/modification or apply_worker_exit
could also be consumed due to reset of latch in
WaitForReplicationWorkerAttach(). Is that understanding correct? If
so, can we use some other way to wake up
WaitForReplicationWorkerAttach() say condition variable? The current
proposal can fix the issue but looks bit adhoc.

-- 
With Regards,
Amit Kapila.




Re: promotion related handling in pg_sync_replication_slots()

2024-04-22 Thread Amit Kapila
On Mon, Apr 22, 2024 at 7:04 PM Masahiko Sawada  wrote:
>
> On Mon, Apr 22, 2024 at 9:02 PM shveta malik  wrote:
> >
> > Thanks for the patch, the changes look good Amit. Please find the merged 
> > patch.
> >
>
> I've reviewed the patch and have some comments:
>
> ---
> /*
> -* Early initialization.
> +* Register slotsync_worker_onexit() before we register
> +* ReplicationSlotShmemExit() in BaseInit(), to ensure that during the
> +* exit of the slot sync worker, ReplicationSlotShmemExit() is called
> +* first, followed by slotsync_worker_onexit(). The startup process during
> +* promotion invokes ShutDownSlotSync() which waits for slot sync to
> +* finish and it does that by checking the 'syncing' flag. Thus worker
> +* must be done with the slots' release and cleanup before it marks itself
> +* as finished syncing.
>  */
>
> I'm slightly worried that we register the slotsync_worker_onexit()
> callback before BaseInit(), because it could be a blocker when we want
> to add more work in the callback, for example sending the stats.
>

The other possibility is that we do slot release/clean up in the
slotsync_worker_onexit() call itself and then we can do it after
BaseInit(). Do you have any other/better idea for this?

> ---
> synchronize_slots(wrconn);
> +
> +   /* Cleanup the temporary slots */
> +   ReplicationSlotCleanup();
> +
> +   /* We are done with sync, so reset sync flag */
> +   reset_syncing_flag();
>
> I think it ends up removing other temp slots that are created by the
> same backend process using
> pg_create_{physical,logical_replication_slots() function, which could
> be a large side effect of this function for users.
>

True, I think here we should remove only the temporary and
synced-marked slots. The other possibility is to create slots as
RS_EPHEMERAL initially when called from the SQL function, but that
doesn't sound like a neat approach.

>
>  Also, if users want
> to have a process periodically calling pg_sync_replication_slots()
> instead of the slotsync worker, it doesn't support a case where we
> create a temp not-ready slot and turn it into a persistent slot if
> it's ready for sync.
>

True, but eventually the API should be able to directly create the
persistent slots; anyway, this can happen only for the first time
(till the slots are created and marked persistent), and one who wants
to use this function periodically should be able to see regular syncs.
OTOH, temp slots created via this API could remain as-is after
promotion, and we would need to document that users should remove such
slots. Now, we can do that if we want but I think it is better to clean
up such slots rather than putting the onus on users to remove them
after promotion.

-- 
With Regards,
Amit Kapila.




Re: promotion related handling in pg_sync_replication_slots()

2024-04-22 Thread Amit Kapila
On Fri, Apr 19, 2024 at 1:52 PM shveta malik  wrote:
>
> Please find v9 with the above comments addressed.
>

I have made minor modifications in the comments and a function name.
Please see the attached top-up patch. Apart from this, the patch looks
good to me.

-- 
With Regards,
Amit Kapila.
diff --git a/src/backend/replication/logical/slotsync.c 
b/src/backend/replication/logical/slotsync.c
index 66fb46ac27..72a775b09c 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -79,10 +79,11 @@
  * and also sets stopSignaled=true to handle the race condition when the
  * postmaster has not noticed the promotion yet and thus may end up restarting
  * the slot sync worker. If stopSignaled is set, the worker will exit in such a
- * case. Note that we don't need to reset this variable as after promotion the
- * slot sync worker won't be restarted because the pmState changes to PM_RUN 
from
- * PM_HOT_STANDBY and we don't support demoting primary without restarting the
- * server. See MaybeStartSlotSyncWorker.
+ * case. The SQL function pg_sync_replication_slots() will also error out if
+ * this flag is set. Note that we don't need to reset this variable as after
+ * promotion the slot sync worker won't be restarted because the pmState
+ * changes to PM_RUN from PM_HOT_STANDBY and we don't support demoting
+ * primary without restarting the server. See MaybeStartSlotSyncWorker.
  *
  * The 'syncing' flag is needed to prevent concurrent slot syncs to avoid slot
  * overwrites.
@@ -1246,12 +1247,11 @@ wait_for_slot_activity(bool some_slot_updated)
 }
 
 /*
- * Check stopSignaled and syncing flags. Emit error if promotion has
- * already set stopSignaled or if it is concurrent sync call. Otherwise,
- * set 'syncing' flag and pid info.
+ * Emit an error if a promotion or a concurrent sync call is in progress.
+ * Otherwise, advertise that a sync is in progress.
  */
 static void
-check_flags_and_set_sync_info(pid_t worker_pid)
+check_and_set_sync_info(pid_t worker_pid)
 {
SpinLockAcquire(>mutex);
 
@@ -1259,9 +1259,8 @@ check_flags_and_set_sync_info(pid_t worker_pid)
Assert(worker_pid == InvalidPid || SlotSyncCtx->pid == InvalidPid);
 
/*
-* Startup process signaled the slot sync machinery to stop, so if
-* meanwhile postmaster ended up starting the worker again or user has
-* invoked pg_sync_replication_slots(), error out.
+* Emit an error if startup process signaled the slot sync machinery to
+* stop. See comments atop SlotSyncCtxStruct.
 */
if (SlotSyncCtx->stopSignaled)
{
@@ -1281,7 +1280,10 @@ check_flags_and_set_sync_info(pid_t worker_pid)
 
SlotSyncCtx->syncing = true;
 
-   /* Advertise our PID so that the startup process can kill us on 
promotion */
+   /*
+* Advertise the required PID so that the startup process can kill the 
slot
+* sync worker on promotion.
+*/
SlotSyncCtx->pid = worker_pid;
 
SpinLockRelease(>mutex);
@@ -1333,13 +1335,13 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t 
startup_data_len)
 
/*
 * Register slotsync_worker_onexit() before we register
-* ReplicationSlotShmemExit() in BaseInit(), to ensure that during exit 
of
-* slot sync worker, ReplicationSlotShmemExit() is called first, 
followed
-* by slotsync_worker_onexit(). Startup process during promotion calls
-* ShutDownSlotSync() which waits for slot sync to finish and it does 
that
-* by checking the 'syncing' flag. Thus it is important that worker 
should
-* be done with slots' release and cleanup before it actually marks 
itself
-* as finished syncing.
+* ReplicationSlotShmemExit() in BaseInit(), to ensure that during the 
exit
+* of the slot sync worker, ReplicationSlotShmemExit() is called first,
+* followed by slotsync_worker_onexit(). The startup process during
+* promotion invokes ShutDownSlotSync() which waits for slot sync to 
finish
+* and it does that by checking the 'syncing' flag. Thus worker must be
+* done with the slots' release and cleanup before it marks itself as
+* finished syncing.
 */
before_shmem_exit(slotsync_worker_onexit, (Datum) 0);
 
@@ -1391,7 +1393,7 @@ ReplSlotSyncWorkerMain(char *startup_data, size_t 
startup_data_len)
pqsignal(SIGPIPE, SIG_IGN);
pqsignal(SIGCHLD, SIG_DFL);
 
-   check_flags_and_set_sync_info(MyProcPid);
+   check_and_set_sync_info(MyProcPid);
 
ereport(LOG, errmsg("slot sync worker started"));
 
@@ -1544,9 +1546,9 @@ update_synced_slots_inactive_since(void)
 /*
  * Shut down the slot sync worker.
  *
- * It sends signal to shutdown slot sync worker. It also waits till
- * the slot sync worker has exited and pg_sync_replication_slots()
- * has finished.
+ * This function send

Re: Disallow changing slot's failover option in transaction block

2024-04-17 Thread Amit Kapila
On Tue, Apr 16, 2024 at 5:06 PM shveta malik  wrote:
>
> On Tue, Apr 16, 2024 at 1:45 PM Hayato Kuroda (Fujitsu)
>  wrote:
> >
> > Dear Hou,
> >
> > > Kuroda-San reported an issue off-list that:
> > >
> > > If user execute ALTER SUBSCRIPTION SET (failover) command inside a txn 
> > > block
> > > and rollback, only the subscription option change can be rolled back, 
> > > while the
> > > replication slot's failover change is preserved.
> > >
> > > This is because ALTER SUBSCRIPTION SET (failover) command internally
> > > executes
> > > the replication command ALTER_REPLICATION_SLOT to change the replication
> > > slot's
> > > failover property, but this replication command execution cannot be
> > > rollback.
> > >
> > > To fix it, I think we can prevent user from executing ALTER SUBSCRIPTION 
> > > set
> > > (failover) inside a txn block, which is also consistent to the ALTER
> > > SUBSCRIPTION REFRESH/DROP SUBSCRIPTION command. Attach a small
> > > patch to address this.
> >
> > Thanks for posting the patch, the fix is same as my expectation.
> > Also, should we add the restriction to the doc? I feel [1] can be updated.
>
> +1.
>
> Similar to ALTER SUB, CREATE SUB also needs to be fixed. This is
> because we call alter_replication_slot in CREATE SUB as well, for the
> case when slot_name is provided and create_slot=false. But the tricky
> part is we call alter_replication_slot() when creating subscription
> for both failover=true and false. That means if we want to fix it on
> the similar line of ALTER SUB, we have to disallow user from executing
> the CREATE SUBSCRIPTION (slot_name = xx) in a txn block, which seems
> to break some existing use cases. (previously, user can execute such a
> command inside a txn block).
>
> So, we need to think if there are better ways to fix it.  After
> discussion with Hou-San offlist, here are some ideas:
>
> 1. do not alter replication slot's failover option when CREATE
> SUBSCRIPTION   WITH failover=false. This means we alter replication
> slot only when failover is set to true. And thus we can disallow
> CREATE SUB WITH (slot_name =xx, failover=true, create_slot=false)
> inside a txn block.
>
> This option allows user to run CREATE-SUB(create_slot=false) with
> failover=false in txn block like earlier. But on the downside, it
> makes the behavior inconsistent for otherwise simpler option like
> failover,  i.e. with failover=true, CREATE SUB is not allowed in txn
> block while with failover=false, it is allowed. It makes it slightly
> complex to be understood by user.
>
> 2. let's not disallow CREATE SUB in txn block as earlier, just don't
> perform internal alter-failover during CREATE SUB for existing
> slots(create_slot=false, slot_name=xx)  i.e. when create_slot is
> false, we will ignore failover parameter of CREATE SUB and it is
> user's responsibility to set it appropriately using ALTER SUB cmd. For
> create_slot=true, the behaviour of CREATE-SUB remains same as earlier.
>
> This option does not add new restriction for CREATE SUB wrt txn block.
> In context of failover with create_slot=false, we already have a
> similar restriction (documented one) for ALTER SUB, i.e. with 'ALTER
> SUBSCRIPTION SET(slot_name = new)', user needs to alter the new slot's
> failover by himself. CREAT SUB can also be documented in similar way.
> This seems simpler to be understood considering existing ALTER SUB's
> behavior as well. Plus, this will make CREATE-SUB code slightly
> simpler and thus easily manageable.
>

+1 for option 2 as it sounds logical to me and consistent with ALTER
SUBSCRIPTION. BTW, IIUC, you are referring to: "When altering the
slot_name, the failover and two_phase property values of the named
slot may differ from the counterpart failover and two_phase parameters
specified in the subscription. When creating the slot, ensure the slot
properties failover and two_phase match their counterpart parameters
of the subscription." in docs [1], right?

-- 
With Regards,
Amit Kapila.




Re: promotion related handling in pg_sync_replication_slots()

2024-04-16 Thread Amit Kapila
On Tue, Apr 16, 2024 at 12:03 PM Bertrand Drouvot
 wrote:
>
> On Tue, Apr 16, 2024 at 08:21:04AM +0530, Amit Kapila wrote:
>
> > There is no clear use case for allowing them in parallel and I feel it
> > would add more confusion when it can work sometimes but not other
> > times. However, if we receive some report from the field where there
> > is a real demand for such a thing, it should be easy to achieve. For
> > example, I can imagine that we can have sync_state that has values
> > 'started', 'in_progress' , and 'finished'. This should allow us to
> > achieve what the current proposed patch is doing along with allowing
> > the API to work in parallel when the sync_state is not 'in_progress'.
> >
> > I think for now let's restrict their usage in parallel and make the
> > promotion behavior consistent both for worker and API.
>
> Okay, let's do it that way. Is it worth to add a few words in the doc related 
> to
> pg_sync_replication_slots() though? (to mention it can not be used if the sync
> slot worker is running).
>

Yes, this makes sense to me.

-- 
With Regards,
Amit Kapila.




Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-16 Thread Amit Kapila
On Tue, Apr 16, 2024 at 7:48 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> > > FYI - We also considered the idea which walsender waits until all prepared
> > transactions
> > > are resolved before decoding and sending changes, but it did not work well
> > > - the restarted walsender sent only COMMIT PREPARED record for
> > transactions which
> > > have been prepared before disabling the subscription. This happened 
> > > because
> > > 1) if the two_phase option of slots is false, the confirmed_flush can be 
> > > ahead of
> > >PREPARE record, and
> > > 2) after the altering and restarting, start_decoding_at becomes same as
> > >confirmed_flush and records behind this won't be decoded.
> > >
> >
> > I don't understand the exact problem you are facing. IIUC, if the
> > commit is after start_decoding_at point and prepare was before it, we
> > expect to send the entire transaction followed by a commit record. The
> > restart_lsn should be before the start of such a transaction and we
> > should have recorded the changes in the reorder buffer.
>
> This behavior is right for two_phase = false case. But if the parameter is
> altered between PREPARE and COMMIT PREPARED, there is a possibility that only
> COMMIT PREPARED is sent.
>

Can you please once consider the idea shared by me at [1] (One naive
idea is that on the publisher .) to solve this problem?

[1] - 
https://www.postgresql.org/message-id/CAA4eK1K1fSkeK%3Dkc26G5cq87vQG4%3D1qs_b%2Bno4%2Bep654SeBy1w%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: promotion related handling in pg_sync_replication_slots()

2024-04-15 Thread Amit Kapila
On Mon, Apr 15, 2024 at 7:47 PM Bertrand Drouvot
 wrote:
>
> On Mon, Apr 15, 2024 at 06:29:49PM +0530, Amit Kapila wrote:
> > On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
> >  wrote:
> > >
> > > On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
> > > > Now both worker and SQL
> > > > function set 'syncing' when they start and reset it when they exit.
> > >
> > > It means that it's not possible anymore to trigger a manual sync if
> > > sync_replication_slots is on. Indeed that would trigger:
> > >
> > > postgres=# select pg_sync_replication_slots();
> > > ERROR:  cannot synchronize replication slots concurrently
> > >
> > > That looks like an issue to me, thoughts?
> > >
> >
> > This is intentional as of now for the sake of keeping
> > implementation/code simple. It is not difficult to allow them but I am
> > not sure whether we want to add another set of conditions allowing
> > them in parallel.
>
> I think that the ability to launch a manual sync before a switchover would be
> missed. Except for this case I don't think that's an issue to prevent them to
> run in parallel.
>

I think if the slotsync worker is available, it can do that as well.
There is no clear use case for allowing them in parallel and I feel it
would add more confusion when it can work sometimes but not other
times. However, if we receive some report from the field where there
is a real demand for such a thing, it should be easy to achieve. For
example, I can imagine that we can have sync_state that has values
'started', 'in_progress' , and 'finished'. This should allow us to
achieve what the current proposed patch is doing along with allowing
the API to work in parallel when the sync_state is not 'in_progress'.
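
To illustrate (only an illustration of the idea, not a concrete
proposal), the shared state could carry something like:

typedef enum SlotSyncState
{
    SLOT_SYNC_STARTED,
    SLOT_SYNC_IN_PROGRESS,
    SLOT_SYNC_FINISHED,
} SlotSyncState;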

I think for now let's restrict their usage in parallel and make the
promotion behavior consistent both for worker and API.

-- 
With Regards,
Amit Kapila.




Re: promotion related handling in pg_sync_replication_slots()

2024-04-15 Thread Amit Kapila
On Mon, Apr 15, 2024 at 6:21 PM Bertrand Drouvot
 wrote:
>
> On Mon, Apr 15, 2024 at 03:38:53PM +0530, shveta malik wrote:
> > On Mon, Apr 15, 2024 at 2:29 PM Amit Kapila  wrote:
> > >
> > > On Fri, Apr 12, 2024 at 5:25 PM shveta malik  
> > > wrote:
> > > >
> > > > Please find v3 addressing race-condition and one other comment.
> > > >
> > > > Up-thread it was suggested that,  probably, checking
> > > > SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
> > > > re-thinking, it might not be. Slot sync worker sets and resets
> > > > 'syncing' with each sync-cycle, and thus we need to rely on worker's
> > > > pid in ShutDownSlotSync(), as there could be a window where promotion
> > > > is triggered and 'syncing' is not set for worker, while the worker is
> > > > still running. This implementation of setting and resetting syncing
> > > > with each sync-cycle looks better as compared to setting syncing
> > > > during the entire life-cycle of the worker. So, I did not change it.
> > > >
> > >
> > > To retain this we need to have different handling for 'syncing' for
> > > workers and function which seems like more maintenance burden than the
> > > value it provides. Moreover, in SyncReplicationSlots(), we are calling
> > > a function after acquiring spinlock which is not our usual coding
> > > practice.
> >
> > Okay.  Changed it to consistent handling.
>
> Thanks for the patch!
>
> > Now both worker and SQL
> > function set 'syncing' when they start and reset it when they exit.
>
> It means that it's not possible anymore to trigger a manual sync if
> sync_replication_slots is on. Indeed that would trigger:
>
> postgres=# select pg_sync_replication_slots();
> ERROR:  cannot synchronize replication slots concurrently
>
> That looks like an issue to me, thoughts?
>

This is intentional as of now for the sake of keeping
implementation/code simple. It is not difficult to allow them but I am
not sure whether we want to add another set of conditions allowing
them in parallel. And that too in an unpredictable way, as the API
would work only while the slot sync worker is not performing a sync.

-- 
With Regards,
Amit Kapila.




Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-15 Thread Amit Kapila
On Mon, Apr 15, 2024 at 1:28 PM Hayato Kuroda (Fujitsu)
 wrote:
>
> > Vitaly, does the minimal solution provided by the proposed patch
> > (Allow to alter two_phase option of a subscriber provided no
> > uncommitted
> > prepared transactions are pending on that subscription.) address your use 
> > case?
>
> I think we do not have to handle cases which there are prepared transactions 
> on
> publisher/subscriber, as the first step. It leads additional complexity and we
> do not have smarter solutions, especially for problem 2.
> IIUC it meets the Vitaly's condition, right?
>
> > > 1. While toggling two_phase from true to false, we could probably get a 
> > > list of
> > prepared transactions for this subscriber id and rollback/abort the prepared
> > transactions. This will allow the transactions to be re-applied like a 
> > normal
> > transaction when the commit comes. Alternatively, if this isn't appropriate 
> > doing it
> > in the ALTER SUBSCRIPTION context, we could store the xids of all prepared
> > transactions of this subscription in a list and when the corresponding xid 
> > is being
> > committed by the apply worker, prior to commit, we make sure the previously
> > prepared transaction is rolled back. But this would add the overhead of 
> > checking
> > this list every time a transaction is committed by the apply worker.
> > >
> >
> > In the second solution, if you check at the time of commit whether
> > there exists a prior prepared transaction then won't we end up
> > applying the changes twice? I think we can first try to achieve it at
> > the time of Alter Subscription because the other solution can add
> > overhead at each commit?
>
> Yeah, at least the second solution might be problematic. I prototyped
> the first one and worked well. However, to make the feature more consistent,
> it is prohibit to exist prepared transactions on subscriber for now.
> We can ease based on the requirement.
>
> > > 2. No solution yet.
> > >
> >
> > One naive idea is that on the publisher we can remember whether the
> > prepare has been sent and if so then only send commit_prepared,
> > otherwise send the entire transaction. On the subscriber-side, we
> > somehow, need to ensure before applying the first change whether the
> > corresponding transaction is already prepared and if so then skip the
> > changes and just perform the commit prepared. One drawback of this
> > approach is that after restart, the prepare flag wouldn't be saved in
> > the memory and we end up sending the entire transaction again. One way
> > to avoid this overhead is that the publisher before sending the entire
> > transaction checks with subscriber whether it has a prepared
> > transaction corresponding to the current commit. I understand that
> > this is not a good idea even if it works but I don't have any better
> > ideas. What do you think?
>
> I considered but not sure it is good to add such mechanism. Your idea requires
> additional wait-loop, which might lead bugs and unexpected behavior. And it 
> may
> degrade the performance based on the network environment.
> As for the another solution (worker sends a list of prepared transactions), it
> is also not so good because list of prepared transactions may be huge.
>
> Based on above, I think we can reject the case for now.
>
> FYI - We also considered the idea which walsender waits until all prepared 
> transactions
> are resolved before decoding and sending changes, but it did not work well
> - the restarted walsender sent only COMMIT PREPARED record for transactions 
> which
> have been prepared before disabling the subscription. This happened because
> 1) if the two_phase option of slots is false, the confirmed_flush can be 
> ahead of
>PREPARE record, and
> 2) after the altering and restarting, start_decoding_at becomes same as
>confirmed_flush and records behind this won't be decoded.
>

I don't understand the exact problem you are facing. IIUC, if the
commit is after start_decoding_at point and prepare was before it, we
expect to send the entire transaction followed by a commit record. The
restart_lsn should be before the start of such a transaction and we
should have recorded the changes in the reorder buffer.

-- 
With Regards,
Amit Kapila.




Re: promotion related handling in pg_sync_replication_slots()

2024-04-15 Thread Amit Kapila
On Fri, Apr 12, 2024 at 5:25 PM shveta malik  wrote:
>
> Please find v3 addressing race-condition and one other comment.
>
> Up-thread it was suggested that,  probably, checking
> SlotSyncCtx->syncing should be sufficient in ShutDownSlotSync(). On
> re-thinking, it might not be. Slot sync worker sets and resets
> 'syncing' with each sync-cycle, and thus we need to rely on worker's
> pid in ShutDownSlotSync(), as there could be a window where promotion
> is triggered and 'syncing' is not set for worker, while the worker is
> still running. This implementation of setting and resetting syncing
> with each sync-cycle looks better as compared to setting syncing
> during the entire life-cycle of the worker. So, I did not change it.
>

To retain this we need to have different handling for 'syncing' for
workers and function which seems like more maintenance burden than the
value it provides. Moreover, in SyncReplicationSlots(), we are calling
a function after acquiring spinlock which is not our usual coding
practice.

One minor comment:
 * All the fields except 'syncing' are used only by slotsync worker.
 * 'syncing' is used both by worker and SQL function pg_sync_replication_slots.
 */
typedef struct SlotSyncCtxStruct
{
pid_t pid;
bool stopSignaled;
bool syncing;
time_t last_start_time;
slock_t mutex;
} SlotSyncCtxStruct;

I feel the above comment is no longer valid after this patch. We can
probably remove this altogether.

-- 
With Regards,
Amit Kapila.




Re: Fix possible dereference null pointer (src/backend/replication/logical/reorderbuffer.c)

2024-04-15 Thread Amit Kapila
On Sat, Apr 13, 2024 at 12:46 PM Heikki Linnakangas  wrote:
>
> I don't much like adding extra runtime checks for "can't happen"
> scenarios either. Assertions would be more clear, but in this case the
> code would just segfault trying to dereference the NULL pointer, which
> is fine for a "can't happen" scenario.
>

Isn't the existing assertion (Assert(!create || txn != NULL);) in
ReorderBufferTXNByXid() sufficient to handle this case?

> Looking closer, when we identify an XID as a subtransaction, we:
> - assign toplevel_xid,
> - set RBTXN_IS_SUBXACT, and
> - assign toptxn pointer.
>
> ISTM the 'toplevel_xid' and RBTXN_IS_SUBXACT are redundant.
> 'toplevel_xid' is only used in those two calls that do
> ReorderBufferTXNByXid(rb, txn->toplevel_xid,...), and you could replace
> those by following the 'toptxn' pointer directly. And RBTXN_IS_SUBXACT
> is redundant with (toptxn != NULL). So here's a patch to remove
> 'toplevel_xid' and RBTXN_IS_SUBXACT altogether.
>

Good idea. I don't see a problem with it.

@@ -1135,8 +1133,6 @@ static void
 ReorderBufferTransferSnapToParent(ReorderBufferTXN *txn,
ReorderBufferTXN *subtxn)
 {
- Assert(subtxn->toplevel_xid == txn->xid);

Is there a benefit in converting this assertion to use toptxn?
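
For example, assuming the subtransaction's toptxn pointer is already
set at this point (just a thought, not tested):

	Assert(subtxn->toptxn == txn);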

-- 
With Regards,
Amit Kapila.




Re: TerminateOtherDBBackends code comments inconsistency.

2024-04-14 Thread Amit Kapila
On Thu, Apr 11, 2024 at 6:58 PM Kirill Reshke  wrote:
>
> While working on [0] i have noticed this comment in
> TerminateOtherDBBackends function:
>
> /*
> * Check whether we have the necessary rights to terminate other
> * sessions. We don't terminate any session until we ensure that we
> * have rights on all the sessions to be terminated. These checks are
> * the same as we do in pg_terminate_backend.
> *
> * In this case we don't raise some warnings - like "PID %d is not a
> * PostgreSQL server process", because for us already finished session
> * is not a problem.
> */
>
> This statement is not true after 3a9b18b.
> "These checks are the same as we do in pg_terminate_backend."
>
> But the code is still correct, I assume... or not? In fact, we are
> killing autovacuum workers which are working with a given database
> (proc->roleId == 0), which is OK in that case. Are there any other
> cases when proc->roleId == 0 but we should not be able to kill such a
> process?
>

Good question. I am not aware of such cases but I wonder if we should
add a check similar to 3a9b18b [1] for the reason given in the commit
message. I have added Noah to see if he has any suggestions on this
matter.

[1] -
commit 3a9b18b3095366cd0c4305441d426d04572d88c1
Author: Noah Misch 
Date:   Mon Nov 6 06:14:13 2023 -0800

Ban role pg_signal_backend from more superuser backend types.

-- 
With Regards,
Amit Kapila.




Re: promotion related handling in pg_sync_replication_slots()

2024-04-12 Thread Amit Kapila
On Fri, Apr 12, 2024 at 7:57 AM shveta malik  wrote:
>
> On Sat, Apr 6, 2024 at 12:25 PM Bharath Rupireddy
>  wrote:
> >
> > On Fri, Apr 5, 2024 at 10:31 AM shveta malik  wrote:
> > >
> > > Please find v2. Changes are:
> >
> > Thanks for the patch. Here are some comments.
>
> Thanks for reviewing.
> >
> > 1. Can we have a clear saying in the shmem variable who's syncing at
> > the moment? Is it a slot sync worker or a backend via SQL function?
> > Perhaps turn "bool syncing;" to "SlotSyncSource sync_source;"
> >
> > typedef enum SlotSyncSource
> > {
> > SLOT_SYNC_NONE,
> > SLOT_SYNC_WORKER,
> > SLOT_SYNC_BACKEND,
> > } SlotSyncSource;
> >
> > Then, the check in ShutDownSlotSync can be:
> >
> > +   /*
> > +* Return if neither the slot sync worker is running nor the 
> > function
> > +* pg_sync_replication_slots() is executing.
> > +*/
> > +   if ((SlotSyncCtx->pid == InvalidPid) &&
> > SlotSyncCtx->sync_source != SLOT_SYNC_BACKEND)
> > {
> >

I don't know if this will help, especially after fixing the race
condition I mentioned. But even otherwise, at this stage it doesn't
seem helpful to add the source of sync explicitly.

> > 2.
> > SyncReplicationSlots(WalReceiverConn *wrconn)
> >  {
> > +/*
> > + * Startup process signaled the slot sync to stop, so if meanwhile user
> > + * has invoked slot sync SQL function, simply return.
> > + */
> > +SpinLockAcquire(&SlotSyncCtx->mutex);
> > +if (SlotSyncCtx->stopSignaled)
> > +{
> > +ereport(LOG,
> > +errmsg("skipping slot synchronization as slot sync
> > shutdown is signaled during promotion"));
> > +
> >
> > Unless I'm missing something, I think this can't detect if the backend
> > via SQL function is already half-way through syncing in
> > synchronize_one_slot. So, better move this check to (or also have it
> > there) slot sync loop that calls synchronize_one_slot. To avoid
> > spinlock acquisitions, we can perhaps do this check in when we acquire
> > the spinlock for synced flag.
>
> If the sync via SQL function is already half-way, then promotion
> should wait for it to finish. I don't think it is a good idea to move
> the check to synchronize_one_slot().  The sync-call should either not
> start (if it noticed the promotion) or finish the sync and then let
> promotion proceed. But I would like to know others' opinion on this.
>

Agreed.

-- 
With Regards,
Amit Kapila.




Re: promotion related handling in pg_sync_replication_slots()

2024-04-12 Thread Amit Kapila
On Fri, Apr 12, 2024 at 7:47 AM shveta malik  wrote:
>
> On Sat, Apr 6, 2024 at 11:49 AM Amit Kapila  wrote:
> >
> >
> > Few comments:
> > ==
> > 1.
> >  void
> >  SyncReplicationSlots(WalReceiverConn *wrconn)
> >  {
> > + /*
> > + * Startup process signaled the slot sync to stop, so if meanwhile user
> > + * has invoked slot sync SQL function, simply return.
> > + */
> > + SpinLockAcquire(&SlotSyncCtx->mutex);
> > + if (SlotSyncCtx->stopSignaled)
> > + {
> > + ereport(LOG,
> > + errmsg("skipping slot synchronization as slot sync shutdown is
> > signaled during promotion"));
> > +
> > + SpinLockRelease(&SlotSyncCtx->mutex);
> > + return;
> > + }
> > + SpinLockRelease(&SlotSyncCtx->mutex);
> >
> > There is a race condition with this code. Say during promotion
> > ShutDownSlotSync() is just before setting this flag and the user has
> > invoked pg_sync_replication_slots() and passed this check but still
> > didn't set the SlotSyncCtx->syncing flag. So, now, the promotion would
> > recognize that there is slot sync going on in parallel, and slot sync
> > wouldn't know that the promotion is in progress.
>
> Did you mean that now, the promotion *would not* recognize...
>

Right.

> I see, I will fix this.
>

Thanks.

-- 
With Regards,
Amit Kapila.




Re: pg_upgrde failed : logical replication : alter_subscription_add_log

2024-04-12 Thread Amit Kapila
Note - Please keep pgsql-hackers in CC while responding.

On Fri, Apr 12, 2024 at 10:44 AM Perumal Raj  wrote:

> Thanks Amit for the update,
>
> Documentation says : https://www.postgresql.org/docs/15/upgrading.html
>
> 19.6.3. Upgrading Data via Replication
>
> It is also possible to use logical replication methods to create a standby
> server with the updated version of PostgreSQL. This is possible because
> logical replication supports replication between different major versions
> of PostgreSQL. The standby can be on the same computer or a different
> computer. Once it has synced up with the primary server (running the older
> version of PostgreSQL), you can switch primaries and make the standby the
> primary and shut down the older database instance. Such a switch-over
> results in only several seconds of downtime for an upgrade.
>
> This method of upgrading can be performed using the built-in logical
> replication facilities as well as using external logical replication
> systems such as pglogical, Slony, Londiste, and Bucardo.
>
> What is "built-in logical replication" ?
>

See docs at [1].

[1] - https://www.postgresql.org/docs/devel/logical-replication.html
--
With Regards,
Amit Kapila


Re: Synchronizing slots from primary to standby

2024-04-11 Thread Amit Kapila
On Thu, Apr 11, 2024 at 5:04 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Thursday, April 11, 2024 12:11 PM Amit Kapila  
> wrote:
>
> >
> > 2.
> > - if (remote_slot->restart_lsn < slot->data.restart_lsn)
> > + if (remote_slot->confirmed_lsn < slot->data.confirmed_flush)
> >   elog(ERROR,
> >   "cannot synchronize local slot \"%s\" LSN(%X/%X)"
> >
> > Can we be more specific in this message? How about splitting it into
> > error_message as "cannot synchronize local slot \"%s\"" and then errdetail 
> > as
> > "Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's
> > LSN(%X/%X)"?
>
> Your version looks better. Since the above two messages all have errdetail, I
> used the style of ereport(ERROR, errmsg_internal(), errdetail_internal()... in
> the patch which is equal to the elog(ERROR but has an additional detail 
> message.
>

makes sense.

> Here is V5 patch set.
>

I think we should move the check that prevents advancing the slot when
either the remote_slot's restart_lsn or catalog_xmin is less than the
local slot's inside update_local_synced_slot(), as we want to prevent
updating the slot in those cases even during slot synchronization.
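
A rough sketch of what I have in mind, using the field names from your
patch (untested):

	/* In update_local_synced_slot(), before overwriting the local slot */
	if (remote_slot->restart_lsn < slot->data.restart_lsn ||
		TransactionIdPrecedes(remote_slot->catalog_xmin,
							  slot->data.catalog_xmin))
		elog(ERROR,
			 "cannot synchronize local slot \"%s\": remote slot is behind the local slot",
			 remote_slot->name);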

-- 
With Regards,
Amit Kapila.




Re: pg_upgrde failed : logical replication : alter_subscription_add_log

2024-04-11 Thread Amit Kapila
On Fri, Apr 12, 2024 at 6:18 AM Perumal Raj  wrote:
>
> I am trying to upgrade PostgreSQL (RHEL 7) from version 13.7 to 15.6 using 
> pglogical.
> My Standby(destination) machine has following rpms,
>
> postgresql13-pglogical-3.7.16-1.el7.x86_64
> pglogical_15-2.4.3-1.rhel7.x86_64
>
> And Primary(Source) has ,
>
> postgresql13-pglogical-3.7.16-1.el7.x86_64
>
> pg_upgrade check mode went fine , but it failed while running real mode.
>
> pg_restore: while PROCESSING TOC:
> pg_restore: from TOC entry 13027; 1255 3375648004 FUNCTION 
> alter_subscription_add_log("text", "text", boolean, "regclass", "text"[], 
> "text"[]) postgres
> pg_restore: error: could not execute query: ERROR:  could not find function 
> "pglogical_alter_subscription_add_log" in file 
> "/usr/pgsql-15/lib/pglogical.so"
> Command was: CREATE FUNCTION 
> "pglogical"."alter_subscription_add_log"("sub_name" "text", "log_name" 
> "text", "log_to_file" boolean DEFAULT true, "log_to_table" "regclass" DEFAULT 
> NULL::"regclass", "conflict_type" "text"[] DEFAULT NULL::"text"[], 
> "conflict_resolution" "text"[] DEFAULT NULL::"text"[]) RETURNS boolean
> LANGUAGE "c"
> AS '$libdir/pglogical', 'pglogical_alter_subscription_add_log';
>
> -- For binary upgrade, handle extension membership the hard way
> ALTER EXTENSION "pglogical" ADD FUNCTION 
> "pglogical"."alter_subscription_add_log"("sub_name" "text", "log_name" 
> "text", "log_to_file" boolean, "log_to_table" "regclass", "conflict_type" 
> "text"[], "conflict_resolution" "text"[]);
>
> Am I missing any packages?
>

We don't maintain pglogical, so it is difficult to answer, but looking
at the error (ERROR:  could not find function
"pglogical_alter_subscription_add_log" in file
"/usr/pgsql-15/lib/pglogical.so"), it seems that the required function
is not present in pglogical.so. It is possible that the arguments have
changed in a newer version of pglogical or something like that. You
need to check with the maintainers of pglogical.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-10 Thread Amit Kapila
On Wed, Apr 10, 2024 at 5:28 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Thursday, April 4, 2024 5:37 PM Amit Kapila  
> wrote:
> >
> > BTW, while thinking on this one, I
> > noticed that in the function LogicalConfirmReceivedLocation(), we first 
> > update
> > the disk copy, see comment [1] and then in-memory whereas the same is not
> > true in
> > update_local_synced_slot() for the case when snapshot exists. Now, do we 
> > have
> > the same risk here in case of standby? Because I think we will use these 
> > xmins
> > while sending the feedback message (in XLogWalRcvSendHSFeedback()).
> >
> > * We have to write the changed xmin to disk *before* we change
> > * the in-memory value, otherwise after a crash we wouldn't know
> > * that some catalog tuples might have been removed already.
>
> Yes, I think we have the risk on the standby, I can reproduce the case that if
> the server crashes after updating the in-memory value and before saving them 
> to
> disk, the synced slot could be invalidated after restarting from crash, 
> because
> the necessary rows have been removed on the primary. The steps can be found in
> [1].
>
> I think we'd better fix the order in update_local_synced_slot() as well. I
> tried to make the fix in 0002, 0001 is Shveta's patch to fix another issue in 
> this thread. Since
> they are touching the same function, so attach them together for review.
>

Few comments:
===
1.
+
+ /* Sanity check */
+ if (slot->data.confirmed_flush != remote_slot->confirmed_lsn)
+ ereport(LOG,
+ errmsg("synchronized confirmed_flush for slot \"%s\" differs from
remote slot",
+remote_slot->name),

Is there a reason to use elevel as LOG instead of ERROR? I think it
should be elog(ERROR, .. as this is an unexpected case.

2.
- if (remote_slot->restart_lsn < slot->data.restart_lsn)
+ if (remote_slot->confirmed_lsn < slot->data.confirmed_flush)
  elog(ERROR,
  "cannot synchronize local slot \"%s\" LSN(%X/%X)"

Can we be more specific in this message? How about splitting it into
error_message as "cannot synchronize local slot \"%s\"" and then
errdetail as "Local slot's start streaming location LSN(%X/%X) is
ahead of remote slot's LSN(%X/%X)"?
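
In other words, something along these lines (sketch only):

	ereport(ERROR,
			errmsg("cannot synchronize local slot \"%s\"",
				   remote_slot->name),
			errdetail("Local slot's start streaming location LSN(%X/%X) is ahead of remote slot's LSN(%X/%X).",
					  LSN_FORMAT_ARGS(slot->data.confirmed_flush),
					  LSN_FORMAT_ARGS(remote_slot->confirmed_lsn)));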

-- 
With Regards,
Amit Kapila.




Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-10 Thread Amit Kapila
On Fri, Apr 5, 2024 at 4:59 PM Ajin Cherian  wrote:
>
> On Thu, Apr 4, 2024 at 4:38 PM Amit Kapila  wrote:
>>
>>
>> I think this would probably be better than the current situation but
>> can we think of a solution to allow toggling the value of two_phase
>> even when prepared transactions are present? Can you please summarize
>> the reason for the problems in doing that and the solutions, if any?
>>
>
>
> Updated the patch, as it wasn't addressing updating of two-phase in the 
> remote slot.
>

Vitaly, does the minimal solution provided by the proposed patch
(Allow to alter two_phase option of a subscriber provided no
uncommitted
prepared transactions are pending on that subscription.) address your use case?

>  Currently the main issue that needs to be handled is the handling of pending 
> prepared transactions while the two_phase is altered. I see 3 issues with the 
> current approach.
>
> 1. Uncommitted prepared transactions when toggling two_phase from true to 
> false
>   When two_phase was true, prepared transactions were decoded at PREPARE time 
> and send to the subscriber, which is then prepared on the subscriber with a 
> new gid. Once the two_phase is toggled to false, then the COMMIT PREPARED on 
> the publisher is converted to commit and the entire transaction is decoded 
> and sent to the subscriber. This will   leave the previously prepared 
> transaction pending.
>
> 2. Uncommitted prepared transactions when toggling two_phase form false to 
> true
>   When two_phase was false, prepared transactions were ignored and not 
> decoded at PREPARE time on the publisher. Once the two_phase is toggled to 
> true, the apply worker and the walsender are restarted and a replication is 
> restarted from a new "start_decoding_at" LSN. Now, this new 
> "start_decoding_at" could be past the LSN of the PREPARE record and if so, 
> the PREPARE record is skipped and not send to the subscriber. Look at 
> comments in DecodeTXNNeedSkip() for detail.  Later when the user issues 
> COMMIT PREPARED, this is decoded and sent to the subscriber. but there is no 
> prepared transaction on the subscriber, and this fails because the  
> corresponding gid of the transaction couldn't be found.
>
> 3. While altering the two_phase of the subscription, it is required to also 
> alter the two_phase field of the slot on the primary. The subscription cannot 
> remotely alter the two_phase option of the slot when the subscription is  
> enabled, as the slot is owned by the walsender on the publisher side.
>

Thanks for summarizing the reasons for not allowing altering the
two_pc property for a subscription.

> Possible solutions for the 3 problems:
>
> 1. While toggling two_phase from true to false, we could probably get a list 
> of prepared transactions for this subscriber id and rollback/abort the 
> prepared transactions. This will allow the transactions to be re-applied like 
> a normal transaction when the commit comes. Alternatively, if this isn't 
> appropriate doing it in the ALTER SUBSCRIPTION context, we could store the 
> xids of all prepared transactions of this subscription in a list and when the 
> corresponding xid is being committed by the apply worker, prior to commit, we 
> make sure the previously prepared transaction is rolled back. But this would 
> add the overhead of checking this list every time a transaction is committed 
> by the apply worker.
>

In the second solution, if you check at the time of commit whether
there exists a prior prepared transaction then won't we end up
applying the changes twice? I think we can first try to achieve it at
the time of Alter Subscription because the other solution can add
overhead at each commit?

> 2. No solution yet.
>

One naive idea is that on the publisher we can remember whether the
prepare has been sent and if so then only send commit_prepared,
otherwise send the entire transaction. On the subscriber-side, we
somehow, need to ensure before applying the first change whether the
corresponding transaction is already prepared and if so then skip the
changes and just perform the commit prepared. One drawback of this
approach is that after restart, the prepare flag wouldn't be saved in
the memory and we end up sending the entire transaction again. One way
to avoid this overhead is that the publisher before sending the entire
transaction checks with subscriber whether it has a prepared
transaction corresponding to the current commit. I understand that
this is not a good idea even if it works but I don't have any better
ideas. What do you think?

> 3. We could mandate that the altering of two_phase state only be done after 
> disabling the subscription, just like how it is handled for failover option.
>

makes sense.

-- 
With Regards,
Amit Kapila.




Re: Is this a problem in GenericXLogFinish()?

2024-04-10 Thread Amit Kapila
On Wed, Apr 10, 2024 at 1:27 PM Michael Paquier  wrote:
>
> On Tue, Apr 09, 2024 at 10:25:36AM +, Hayato Kuroda (Fujitsu) wrote:
> >> On Fri, Apr 05, 2024 at 06:22:58AM +, Hayato Kuroda (Fujitsu) wrote:
> >> I'm slightly annoyed by the fact that there is no check on
> >> is_prim_bucket_same_wrt for buffer 1 in the BLK_NEEDS_REDO case to
> >> show the symmetry between the insert and replay paths.  Shouldn't
> >> there be at least an assert for that in the branch when there are no
> >> tuples (imagine an else to cover xldata->ntups == 0)?  I mean between
> >> just before updating the hash page's opaque area when
> >> is_prev_bucket_same_wrt.
> >
> > Indeed, added an Assert() in else part. Was it same as your expectation?
>
> Yep, close enough.  Thanks to the insert path we know that there will
> be no tuples if (is_prim_bucket_same_wrt || is_prev_bucket_same_wrt),
> and the replay path where the assertion is added.
>

It is fine to have an assertion for this path.

+ else
+ {
+ /*
+ * See _hash_freeovflpage() which has a similar assertion when
+ * there are no tuples.
+ */
+ Assert(xldata->is_prim_bucket_same_wrt ||
+xldata->is_prev_bucket_same_wrt);

I can understand this comment as I am aware of this code, but I am not
sure it would be equally easy for people looking at this code for the
first time. One may try to find the equivalent assertion in
_hash_freeovflpage(). The alternative could be: "Ensure that the
required flags are set when there are no tuples. See
_hash_freeovflpage().". I am also fine if you prefer to go with your
proposed comment.
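
Putting the two together, the else branch would then read roughly as:

	else
	{
		/*
		 * Ensure that the required flags are set when there are no
		 * tuples.  See _hash_freeovflpage().
		 */
		Assert(xldata->is_prim_bucket_same_wrt ||
			   xldata->is_prev_bucket_same_wrt);
	}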

Otherwise, the patch looks good to me.

-- 
With Regards,
Amit Kapila.




Re: Improve eviction algorithm in ReorderBuffer

2024-04-09 Thread Amit Kapila
On Wed, Apr 10, 2024 at 11:00 AM Heikki Linnakangas  wrote:
>
> On 10/04/2024 07:45, Michael Paquier wrote:
> > On Tue, Apr 09, 2024 at 09:16:53PM -0700, Jeff Davis wrote:
> >> On Wed, 2024-04-10 at 12:13 +0900, Michael Paquier wrote:
> >>> Wouldn't the best way forward be to revert
> >>> 5bec1d6bc5e3 and revisit the whole in v18?
> >>
> >> Also consider commits b840508644 and bcb14f4abc.
> >
> > Indeed.  These are also linked.
>
> I don't feel the urge to revert this:
>
> - It's not broken as such, we're just discussing better ways to
> implement it. We could also do nothing, and revisit this in v18. The
> only must-fix issue is some compiler warnings IIUC.
>
> - It's a pretty localized change in reorderbuffer.c, so it's not in the
> way of other patches or reverts. Nothing else depends on the binaryheap
> changes yet either.
>
> - It seems straightforward to repeat the performance tests with whatever
> alternative implementations we want to consider.
>
> My #1 choice would be to write a patch to switch the pairing heap,
> performance test that, and revert the binary heap changes.
>

+1.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-08 Thread Amit Kapila
On Mon, Apr 8, 2024 at 7:01 PM Zhijie Hou (Fujitsu)
 wrote:
>
> Thanks for pushing.
>
> I checked the BF status, and noticed one BF failure, which I think is related 
> to
> a miss in the test code.
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-08%2012%3A04%3A27
>
> From the following log, I can see the sync failed because the standby is
> lagging behind of the failover slot.
>
> -
> # No postmaster PID for node "cascading_standby"
> error running SQL: 'psql:<stdin>:1: ERROR:  skipping slot synchronization as 
> the received slot sync LSN 0/4000148 for slot "snap_test_slot" is ahead of 
> the standby position 0/4000114'
> while running 'psql -XAtq -d port=50074 host=/tmp/t4HQFlrDmI 
> dbname='postgres' -f - -v ON_ERROR_STOP=1' with sql 'SELECT 
> pg_sync_replication_slots();' at 
> /home/bf/bf-build/adder/HEAD/pgsql/src/test/perl/PostgreSQL/Test/Cluster.pm 
> line 2042.
> # Postmaster PID for node "publisher" is 3715298
> -
>
> I think it's because we missed to call wait_for_replay_catchup before syncing
> slots.
>
> -
> $primary->safe_psql('postgres',
> "SELECT pg_create_logical_replication_slot('snap_test_slot', 
> 'test_decoding', false, false, true);"
> );
> # ? missed to wait here
> $standby1->safe_psql('postgres', "SELECT pg_sync_replication_slots();");
> -
>
> While testing, I noticed another place where we were calling
> wait_for_replay_catchup before doing pg_replication_slot_advance, which also 
> has
> a small possibility to cause the failover slot to be ahead of the standby if
> some logs are written in between these two steps. So, I adjusted them 
> together.
>
> Here is a small patch to improve the test.
>

LGTM. I'll push this tomorrow morning unless there are any more
comments or suggestions.

-- 
With Regards,
Amit Kapila.




Re: pgsql: Fix the intermittent buildfarm failures in 040_standby_failover_

2024-04-08 Thread Amit Kapila
On Mon, Apr 8, 2024 at 9:24 PM Robert Haas  wrote:
>
> Hi,
>
> I'm concerned that the failover slots feature may not be in
> sufficiently good shape for us to ship it. Since this test file was
> introduced at the end of January, it's been touched by a total of 16
> commits, most of which seem to be trying to get it to pass reliably:
>

Among the 16 commits, there are 6 feature commits (3 of which are for
another feature that interacts with this one), 1 code improvement
commit, 2 bug-fix commits, and 7 test stabilization commits. See [1]
for the categorization of commits. Now, among these 7 test
stabilization commits (which seem to be the main source of your
concern), 4 are due to the tests expecting the slots to be synced in a
single call to pg_sync_replication_slots(), which sometimes didn't
happen when there was unexpected WAL generation, say by bgwriter
(XLOG_RUNNING_XACTS), or an extra XID generation by autoanalyze. This
shouldn't be a problem in practice, where users are expected to use
the slotsync worker, which keeps syncing slots at regular intervals.
All these required stabilizations are in the two tests that use the
function pg_sync_replication_slots() to sync slots. We could think of
getting rid of this function and relying only on the slotsync worker
functionality, but I find this function quite convenient for debugging
and, in some cases, for writing targeted tests (though it caused
instability in tests). We can provide more information in the docs
about the use of this API.

The other stabilization fixes are as follows: one is a Perl scripting
issue in checking the LOGs, one increases the DEBUG level to capture
more information on failures, and one adds a test setup step that was
missed but is already done in other similar tests.

Having said that, I have kept an eye on the reports (-hackers, -bugs,
etc.) related to this feature and if we find that this feature is
inconvenient to use then we should consider either improving it, if
possible, or reverting it.

[1]:
New features:
6f132ed693 Allow synced slots to have their inactive_since.
6ae701b437 Track invalidation_reason in pg_replication_slots.
bf279ddd1c Introduce a new GUC 'standby_slot_names'.
93db6cbda0 Add a new slot sync worker to synchronize logical slots.
ddd5f4f54a Add a slot synchronization function.
776621a5e4 Add a failover option to subscriptions.

Code improvement
801792e528 Improve ERROR/LOG messages added by commits ddd5f4f54a and
7a424ece48.

Bug fixes:
2ec005b4e2 Ensure that the sync slots reach a consistent state after
promotion without losing data.
b3f6b14cf4 Fixups for commit 93db6cbda0.

Stabilize test cases:
def0ce3370 Fix BF failure introduced by commit b3f6b14cf4.
b7bdade6a4 Disable autovacuum on primary in
040_standby_failover_slots_sync test.
d9e225f275 Change the LOG level in 040_standby_failover_slots_sync.pl to DEBUG2.
9bc1eee988 Another try to fix BF failure introduced in commit ddd5f4f54a.
bd8fc1677b Fix BF introduced in commit ddd5f4f54a.
d13ff82319 Fix BF failure in commit 93db6cbda0.
6f3d8d5e7c Fix the intermittent buildfarm failures in
040_standby_failover_slots_sync.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-08 Thread Amit Kapila
On Mon, Apr 8, 2024 at 9:49 PM Andres Freund  wrote:
>
> On 2024-04-08 16:01:41 +0530, Amit Kapila wrote:
> > Pushed.
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-08%2012%3A04%3A27
>
> This unfortunately is a commit after
>

Right, and thanks for the report. Hou-San has analyzed and shared the
patch [1] for this yesterday. I'll review it today.

[1] - 
https://www.postgresql.org/message-id/OS0PR01MB571665359F2F5DCD3ADABC9F94002%40OS0PR01MB5716.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-08 Thread Amit Kapila
On Mon, Apr 8, 2024 at 12:19 PM Zhijie Hou (Fujitsu)
 wrote:
>
> On Saturday, April 6, 2024 12:43 PM Amit Kapila  
> wrote:
> > On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot
> >  wrote:
> >
> > Yeah, that could be the first step. We can probably add an injection point 
> > to
> > control the bgwrite behavior and then add tests involving walsender
> > performing the decoding. But I think it is important to have sufficient 
> > tests in
> > this area as I see they are quite helpful in uncovering the issues.
>
> Here is the patch to drop the subscription in the beginning so that the
> restart_lsn of the lsub1_slot won't be advanced due to concurrent
> xl_running_xacts from bgwriter. The subscription will be re-created after all
> the slots are sync-ready. I think maybe we can use this to stabilize the test
> as a first step and then think about how to make use of injection point to add
> more tests if it's worth it.
>

Pushed.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-07 Thread Amit Kapila
On Sun, Apr 7, 2024 at 3:06 AM Andres Freund  wrote:
>
> On 2024-04-06 10:58:32 +0530, Amit Kapila wrote:
> > On Sat, Apr 6, 2024 at 10:13 AM Amit Kapila  wrote:
> > >
> >
> > There are still a few pending issues to be fixed in this feature but
> > otherwise, we have committed all the main patches, so I marked the CF
> > entry corresponding to this work as committed.
>
> There are a a fair number of failures of 040_standby_failover_slots_sync in
> the buildfarm.  It'd be nice to get those fixed soon-ish.
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=flaviventris&dt=2024-04-06%2020%3A58%3A50
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=skink&dt=2024-04-06%2015%3A18%3A08
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-04-06%2010%3A13%3A58
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-04-05%2016%3A04%3A10
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=olingo&dt=2024-04-05%2014%3A59%3A40
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=adder&dt=2024-04-05%2014%3A59%3A07
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=grassquit&dt=2024-04-05%2014%3A18%3A07
>
> The symptoms are similar, but not entirely identical across all of them, I 
> think.
>

I have analyzed these failures and there are two different tests that
are failing but the underlying reason is the same as being discussed
with Bertrand. We are working on the fix.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-06 Thread Amit Kapila
On Sat, Apr 6, 2024 at 11:55 AM Bharath Rupireddy
 wrote:
>

Why is the handling w.r.t. active_pid in
InvalidatePossiblyInactiveSlot() not similar to that in
InvalidatePossiblyObsoleteSlot()? Won't we need to ensure that there
is no other active user of the slot? Is it sufficient to check
inactive_since for that? If so, we need some comments to explain it.

Can we avoid introducing new functions like
SaveGivenReplicationSlot() and MarkGivenReplicationSlotDirty() by
doing the required work in the caller?

-- 
With Regards,
Amit Kapila.




Re: promotion related handling in pg_sync_replication_slots()

2024-04-06 Thread Amit Kapila
On Fri, Apr 5, 2024 at 10:31 AM shveta malik  wrote:
>
> Please find v2. Changes are:
> 1) Rebased the patch as there was conflict due to recent commit 6f132ed.
> 2) Added an Assert in update_synced_slots_inactive_since() to ensure
> that the slot does not have active_pid.
> 3) Improved commit msg and comments.
>

Few comments:
==
1.
 void
 SyncReplicationSlots(WalReceiverConn *wrconn)
 {
+ /*
+ * Startup process signaled the slot sync to stop, so if meanwhile user
+ * has invoked slot sync SQL function, simply return.
+ */
+ SpinLockAcquire(&SlotSyncCtx->mutex);
+ if (SlotSyncCtx->stopSignaled)
+ {
+ ereport(LOG,
+ errmsg("skipping slot synchronization as slot sync shutdown is
signaled during promotion"));
+
+ SpinLockRelease(&SlotSyncCtx->mutex);
+ return;
+ }
+ SpinLockRelease(&SlotSyncCtx->mutex);

There is a race condition with this code. Say during promotion
ShutDownSlotSync() is just before setting this flag and the user has
invoked pg_sync_replication_slots() and passed this check but still
didn't set the SlotSyncCtx->syncing flag. So, now, the promotion would
recognize that there is slot sync going on in parallel, and slot sync
wouldn't know that the promotion is in progress.

The current coding for slot sync worker ensures there is no such race
by checking the stopSignaled and setting the pid together under
spinlock. I think we need to move the setting of the syncing flag
similarly. Once we do that probably checking SlotSyncCtx->syncing
should be sufficient in ShutDownSlotSync(). If we change the location
of setting the 'syncing' flag then please ensure its cleanup as we
currently do in slotsync_failure_callback().
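
To spell it out, the SQL-function path could then do roughly the
following (a sketch using the SlotSyncCtx fields from your patch), so
that promotion cannot slip in between the stopSignaled check and the
setting of the 'syncing' flag:

	SpinLockAcquire(&SlotSyncCtx->mutex);
	if (SlotSyncCtx->stopSignaled)
	{
		SpinLockRelease(&SlotSyncCtx->mutex);
		ereport(LOG,
				errmsg("skipping slot synchronization as slot sync shutdown is signaled during promotion"));
		return;
	}
	SlotSyncCtx->syncing = true;
	SpinLockRelease(&SlotSyncCtx->mutex);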

2.
@@ -1395,6 +1395,7 @@ update_synced_slots_inactive_since(void)
  if (s->in_use && s->data.synced)
  {
  Assert(SlotIsLogical(s));
+ Assert(s->active_pid == 0);

We can add a comment like: "The slot must not be acquired by any process"

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-05 Thread Amit Kapila
On Sat, Apr 6, 2024 at 10:13 AM Amit Kapila  wrote:
>

There are still a few pending issues to be fixed in this feature but
otherwise, we have committed all the main patches, so I marked the CF
entry corresponding to this work as committed.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-05 Thread Amit Kapila
On Fri, Apr 5, 2024 at 8:05 PM Bertrand Drouvot
 wrote:
>
> On Fri, Apr 05, 2024 at 06:23:10PM +0530, Amit Kapila wrote:
> > On Fri, Apr 5, 2024 at 5:17 PM Amit Kapila  wrote:
> > Thinking more on this, it doesn't seem related to
> > c9920a9068eac2e6c8fb34988d18c0b42b9bf811 as that commit doesn't change
> > any locking or something like that which impacts write positions.
>
> Agree.
>
> > I think what has happened here is that running_xact record written by
> > the background writer [1] is not written to the kernel or disk (see
> > LogStandbySnapshot()), before pg_current_wal_lsn() checks the
> > current_lsn to be compared with replayed LSN.
>
> Agree, I think it's not visible through pg_current_wal_lsn() yet.
>
> Also I think that the DEBUG message in LogCurrentRunningXacts()
>
> "
> elog(DEBUG2,
>  "snapshot of %d+%d running transaction ids (lsn %X/%X oldest xid 
> %u latest complete %u next xid %u)",
>  CurrRunningXacts->xcnt, CurrRunningXacts->subxcnt,
>  LSN_FORMAT_ARGS(recptr),
>  CurrRunningXacts->oldestRunningXid,
>  CurrRunningXacts->latestCompletedXid,
>  CurrRunningXacts->nextXid);
> "
>
> should be located after the XLogSetAsyncXactLSN() call. Indeed, the new LSN is
> visible after the spinlock (XLogCtl->info_lck) in XLogSetAsyncXactLSN() is
> released,
>

I think the new LSN can be visible only when the corresponding WAL is
written by XLogWrite(). I don't know what in XLogSetAsyncXactLSN() can
make it visible. In your experiment below, isn't it possible that in
the meantime the WAL writer has written that WAL, which is why you are
seeing an updated location?

>see:
>
> \watch on Session 1 provides:
>
>  pg_current_wal_lsn
> 
>  0/87D110
> (1 row)
>
> Until:
>
> Breakpoint 2, XLogSetAsyncXactLSN (asyncXactLSN=8900936) at xlog.c:2579
> 2579            XLogRecPtr      WriteRqstPtr = asyncXactLSN;
> (gdb) n
> 2581            bool            wakeup = false;
> (gdb)
> 2584            SpinLockAcquire(&XLogCtl->info_lck);
> (gdb)
> 2585            RefreshXLogWriteResult(LogwrtResult);
> (gdb)
> 2586            sleeping = XLogCtl->WalWriterSleeping;
> (gdb)
> 2587            prevAsyncXactLSN = XLogCtl->asyncXactLSN;
> (gdb)
> 2588            if (XLogCtl->asyncXactLSN < asyncXactLSN)
> (gdb)
> 2589                    XLogCtl->asyncXactLSN = asyncXactLSN;
> (gdb)
> 2590            SpinLockRelease(&XLogCtl->info_lck);
> (gdb) p p/x (uint32) XLogCtl->asyncXactLSN
> $1 = 0x87d148
>
> Then session 1 provides:
>
>  pg_current_wal_lsn
> 
>  0/87D148
> (1 row)
>
> So, when we see in the log:
>
> 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG:  snapshot 
> of 0+0 running transaction ids (lsn 0/398 oldest xid 740 latest complete 
> 739 next xid 740)
> 2024-04-05 04:37:05.197 UTC [3866475][client backend][2/4:0] LOG:  statement: 
> SELECT '0/360' <= replay_lsn AND state = 'streaming'
>
> It's indeed possible that the new LSN was not visible yet (spinlock not 
> released?)
> before the query began (because we can not rely on the time the DEBUG message 
> has
> been emitted).
>
> > Note that the reason why
> > walsender has picked the running_xact written by background writer is
> > because it has checked after pg_current_wal_lsn() query, see LOGs [2].
> > I think we can probably try to reproduce manually via debugger.
> >
> > If this theory is correct
>
> It think it is.
>
> > then I think we will need to use injection
> > points to control the behavior of bgwriter or use the slots created
> > via SQL API for syncing in tests.
> >
> > Thoughts?
>
> I think that maybe as a first step we should move the "elog(DEBUG2," message 
> as
> proposed above to help debugging (that could help to confirm the above 
> theory).
>

I think I am missing how exactly moving the DEBUG2 message can confirm the above theory.

> If the theory is proven then I'm not sure we need the extra complexity of
> injection point here, maybe just relying on the slots created via SQL API 
> could
> be enough.
>

Yeah, that could be the first step. We can probably add an injection
point to control the bgwrite behavior and then add tests involving
walsender performing the decoding. But I think it is important to have
sufficient tests in this area as I see they are quite helpful in
uncovering the issues.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-05 Thread Amit Kapila
On Fri, Apr 5, 2024 at 5:17 PM Amit Kapila  wrote:
>
> On Thu, Apr 4, 2024 at 2:59 PM shveta malik  wrote:
> >
> > There is an intermittent BF failure observed at [1] after this commit 
> > (2ec005b).
> >
>
> Thanks for analyzing and providing the patch. I'll look into it. There
> is another BF failure [1] which I have analyzed. The main reason for
> failure is the following test:
>
> #   Failed test 'logical slots have synced as true on standby'
> #   at 
> /home/bf/bf-build/serinus/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl
> line 198.
> #  got: 'f'
> # expected: 't'
>
> Here, we are expecting that the two logical slots (lsub1_slot, and
> lsub2_slot), one created via subscription and another one via API
> pg_create_logical_replication_slot() are synced. The standby LOGs
> which are as follows show that the one created by API 'lsub2_slot' is
> synced but the other one 'lsub1_slot':
>
> LOG for lsub1_slot:
> 
> 2024-04-05 04:37:07.421 UTC [3867682][client backend][0/2:0] DETAIL:
> Streaming transactions committing after 0/0, reading WAL from
> 0/360.
> 2024-04-05 04:37:07.421 UTC [3867682][client backend][0/2:0]
> STATEMENT:  SELECT pg_sync_replication_slots();
> 2024-04-05 04:37:07.422 UTC [3867682][client backend][0/2:0] DEBUG:
> xmin required by slots: data 0, catalog 740
> 2024-04-05 04:37:07.422 UTC [3867682][client backend][0/2:0] LOG:
> could not sync slot "lsub1_slot"
>
> LOG for lsub2_slot:
> 
> 2024-04-05 04:37:08.518 UTC [3867682][client backend][0/2:0] DEBUG:
> xmin required by slots: data 0, catalog 740
> 2024-04-05 04:37:08.769 UTC [3867682][client backend][0/2:0] LOG:
> newly created slot "lsub2_slot" is sync-ready now
> 2024-04-05 04:37:08.769 UTC [3867682][client backend][0/2:0]
> STATEMENT:  SELECT pg_sync_replication_slots();
>
> We can see from the log of lsub1_slot that its restart_lsn is
> 0/360 which means it will start reading from the WAL from that
> location. Now, if we check the publisher log, we have running_xacts
> record at that location. See following LOGs:
>
> 2024-04-05 04:36:57.830 UTC [3860839][client backend][8/2:0] LOG:
> statement: SELECT pg_create_logical_replication_slot('lsub2_slot',
> 'test_decoding', false, false, true);
> 2024-04-05 04:36:58.718 UTC [3860839][client backend][8/2:0] DEBUG:
> snapshot of 0+0 running transaction ids (lsn 0/360 oldest xid 740
> latest complete 739 next xid 740)
> 
> 
> 2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG:
> snapshot of 0+0 running transaction ids (lsn 0/398 oldest xid 740
> latest complete 739 next xid 740)
>
> The first running_xact record ends at 360 and the second one at
> 398. So, the start location of the second running_xact is 360,
> the same can be confirmed by the following LOG line of walsender:
>
> 2024-04-05 04:37:05.144 UTC [3857385][walsender][25/0:0] DEBUG:
> serializing snapshot to pg_logical/snapshots/0-360.snap
>
> This shows that while processing running_xact at location 360, we
> have serialized the snapshot. As there is no running transaction in
> WAL at 360 so ideally we should have reached a consistent state
> after processing that record on standby. But the reason standby didn't
> process that LOG is that the confirmed_flush LSN is also at the same
> location so the function LogicalSlotAdvanceAndCheckSnapState() exits
> without reading the WAL at that location. Now, this can be confirmed
> by the below walsender-specific LOG in publisher:
>
> 2024-04-05 04:36:59.155 UTC [3857385][walsender][25/0:0] DEBUG: write
> 0/360 flush 0/360 apply 0/360 reply_time 2024-04-05
> 04:36:59.155181+00
>
> We update the confirmed_flush location with the flush location after
> receiving the above feedback. You can notice that we didn't receive
> the feedback for the 398 location and hence both the
> confirmed_flush and restart_lsn are at the same location 0/360.
> Now, the test is waiting for the subscriber to send feedback of the
> last WAL write location by
> $primary->wait_for_catchup('regress_mysub1'); As noticed from the
> publisher LOGs, the query we used for wait is:
>
> SELECT '0/360' <= replay_lsn AND state = 'streaming'
> FROM pg_catalog.pg_stat_replication
> WHERE application_name IN ('regress_mysub1', 'walreceiver')
>
> Here, instead of '0/360' it should have used ''0/398' which is
> the last write location. This position we get via function
> pg_current_wal_lsn()->GetXLogWriteRecPtr()->LogwrtResult.Write. And
> this variable seems to be touched by commit
> c9920a9068eac2e6c8fb34988d18

Re: Synchronizing slots from primary to standby

2024-04-05 Thread Amit Kapila
On Thu, Apr 4, 2024 at 2:59 PM shveta malik  wrote:
>
> There is an intermittent BF failure observed at [1] after this commit 
> (2ec005b).
>

Thanks for analyzing and providing the patch. I'll look into it. There
is another BF failure [1] which I have analyzed. The main reason for
failure is the following test:

#   Failed test 'logical slots have synced as true on standby'
#   at 
/home/bf/bf-build/serinus/HEAD/pgsql/src/test/recovery/t/040_standby_failover_slots_sync.pl
line 198.
#  got: 'f'
# expected: 't'

Here, we are expecting that the two logical slots (lsub1_slot and
lsub2_slot), one created via subscription and the other via the API
pg_create_logical_replication_slot(), are synced. The standby LOGs
below show that the one created by the API, 'lsub2_slot', is synced
but the other one, 'lsub1_slot', is not:

LOG for lsub1_slot:

2024-04-05 04:37:07.421 UTC [3867682][client backend][0/2:0] DETAIL:
Streaming transactions committing after 0/0, reading WAL from
0/360.
2024-04-05 04:37:07.421 UTC [3867682][client backend][0/2:0]
STATEMENT:  SELECT pg_sync_replication_slots();
2024-04-05 04:37:07.422 UTC [3867682][client backend][0/2:0] DEBUG:
xmin required by slots: data 0, catalog 740
2024-04-05 04:37:07.422 UTC [3867682][client backend][0/2:0] LOG:
could not sync slot "lsub1_slot"

LOG for lsub2_slot:

2024-04-05 04:37:08.518 UTC [3867682][client backend][0/2:0] DEBUG:
xmin required by slots: data 0, catalog 740
2024-04-05 04:37:08.769 UTC [3867682][client backend][0/2:0] LOG:
newly created slot "lsub2_slot" is sync-ready now
2024-04-05 04:37:08.769 UTC [3867682][client backend][0/2:0]
STATEMENT:  SELECT pg_sync_replication_slots();

We can see from the log of lsub1_slot that its restart_lsn is
0/360 which means it will start reading from the WAL from that
location. Now, if we check the publisher log, we have running_xacts
record at that location. See following LOGs:

2024-04-05 04:36:57.830 UTC [3860839][client backend][8/2:0] LOG:
statement: SELECT pg_create_logical_replication_slot('lsub2_slot',
'test_decoding', false, false, true);
2024-04-05 04:36:58.718 UTC [3860839][client backend][8/2:0] DEBUG:
snapshot of 0+0 running transaction ids (lsn 0/360 oldest xid 740
latest complete 739 next xid 740)


2024-04-05 04:37:05.074 UTC [3854278][background writer][:0] DEBUG:
snapshot of 0+0 running transaction ids (lsn 0/398 oldest xid 740
latest complete 739 next xid 740)

The first running_xact record ends at 360 and the second one at
398. So, the start location of the second running_xact record is
360; the same can be confirmed by the following LOG line of the
walsender:

2024-04-05 04:37:05.144 UTC [3857385][walsender][25/0:0] DEBUG:
serializing snapshot to pg_logical/snapshots/0-360.snap

This shows that while processing the running_xact at location 360,
we have serialized the snapshot. As there is no running transaction in
the WAL at 360, ideally we should have reached a consistent state
after processing that record on the standby. But the standby didn't
process that record because the confirmed_flush LSN is also at the same
location, so the function LogicalSlotAdvanceAndCheckSnapState() exits
without reading the WAL at that location. This can be confirmed by the
below walsender-specific LOG on the publisher:

2024-04-05 04:36:59.155 UTC [3857385][walsender][25/0:0] DEBUG: write
0/360 flush 0/360 apply 0/360 reply_time 2024-04-05
04:36:59.155181+00

We update the confirmed_flush location with the flush location after
receiving the above feedback. You can notice that we didn't receive
the feedback for the 398 location and hence both the
confirmed_flush and restart_lsn are at the same location 0/360.
Now, the test is waiting for the subscriber to send feedback of the
last WAL write location by
$primary->wait_for_catchup('regress_mysub1'). As noticed from the
publisher LOGs, the query we used for the wait is:

SELECT '0/360' <= replay_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('regress_mysub1', 'walreceiver')

Here, instead of '0/360' it should have used '0/398', which is
the last write location. We get this position via
pg_current_wal_lsn()->GetXLogWriteRecPtr()->LogwrtResult.Write, and
this variable seems to be touched by commit
c9920a9068eac2e6c8fb34988d18c0b42b9bf811. Though unlikely, could
c9920a9068eac2e6c8fb34988d18c0b42b9bf811 be a reason for the failure?
At this stage, I am not sure, so I am just sharing this with others to
see if my reasoning sounds logical. I'll think more about this.


[1] - 
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus=2024-04-05%2004%3A34%3A27

-- 
With Regards,
Amit Kapila.




Re: Is this a problem in GenericXLogFinish()?

2024-04-04 Thread Amit Kapila
On Fri, Apr 5, 2024 at 9:56 AM Michael Paquier  wrote:
>
> On Wed, Feb 07, 2024 at 06:42:21PM +0900, Michael Paquier wrote:
> > On Wed, Feb 07, 2024 at 02:08:42PM +0530, Amit Kapila wrote:
> > > Thanks for the report and looking into it. Pushed!
> >
> > Thanks Amit for the commit and Kuroda-san for the patch!
>
> I have been pinged about this thread and I should have looked a bit
> closer, but wait a minute here.
>
> There is still some divergence between the code path of
> _hash_freeovflpage() and the replay in hash_xlog_squeeze_page() when
> squeezing a page: we do not set the LSN of the write buffer if
> (xlrec.ntups <= 0 && xlrec.is_prim_bucket_same_wrt &&
> !xlrec.is_prev_bucket_same_wrt) when generating the squeeze record,
> but at replay we call PageSetLSN() on the write buffer and mark it
> dirty in this case.  Isn't that incorrect?
>

Agreed. We will try to reproduce this.

>  It seems to me that we
> should be able to make the conditional depending on the state of the
> xl_hash_squeeze_page record, no?
>

I think we can have a flag like mod_buf and set it in both of the
conditions, i.e., if (xldata->ntups > 0) and if
(xldata->is_prev_bucket_same_wrt). If the flag is set, then we can set
the LSN and mark the buffer dirty.
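
To make that concrete, here is a minimal sketch of the flag-based
approach for hash_xlog_squeeze_page(); the flag and buffer variable
names and the placeholder comments are assumptions based on this
discussion, not the committed fix:

bool mod_wbuf = false;

if (xldata->ntups > 0)
{
    /* ... restore the moved tuples onto the write page ... */
    mod_wbuf = true;
}

if (xldata->is_prev_bucket_same_wrt)
{
    /* ... update the write page's next-block pointer ... */
    mod_wbuf = true;
}

/* Set the LSN and mark the write buffer dirty only if we changed it. */
if (mod_wbuf)
{
    PageSetLSN(writepage, lsn);
    MarkBufferDirty(writebuf);
}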

-- 
With Regards,
Amit Kapila.




Re: promotion related handling in pg_sync_replication_slots()

2024-04-04 Thread Amit Kapila
On Thu, Apr 4, 2024 at 5:05 PM shveta malik  wrote:
>
> Hello hackers,
>
> Currently, promotion related handling is missing in the slot sync SQL
> function pg_sync_replication_slots().   Here is the background on how
> it is done in slot sync worker:
> During promotion, the startup process in order to shut down the
> slot-sync worker, sets the 'stopSignaled' flag, sends the shut-down
> signal, and waits for slot sync worker to exit. Meanwhile if the
> postmaster has not noticed the promotion yet, it may end up restarting
> slot sync worker. In such a case, the worker exits if 'stopSignaled'
> is set.
>
> Since there is a chance that the user (or any of his scripts/tools)
> may execute SQL function pg_sync_replication_slots() in parallel to
> promotion, such handling is needed in this SQL function as well, The
> attached patch attempts to implement the same.
>
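
For illustration, here is a rough sketch of the kind of check the
above proposal describes for the SQL function path; the 'stopSignaled'
field name comes from the description, while the error wording and the
exact placement are assumptions:

SpinLockAcquire(&SlotSyncCtx->mutex);
if (SlotSyncCtx->stopSignaled)
{
    SpinLockRelease(&SlotSyncCtx->mutex);
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
             errmsg("cannot synchronize replication slots when standby promotion is ongoing")));
}
SpinLockRelease(&SlotSyncCtx->mutex);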

Thanks for the report and patch. I'll look into it.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-04 Thread Amit Kapila
On Thu, Apr 4, 2024 at 11:12 AM Bharath Rupireddy
 wrote:
>
> On Thu, Apr 4, 2024 at 10:48 AM Amit Kapila  wrote:
> >
> > The v33-0001 looks good to me. I have made minor changes in the
> > comments/commit message and removed one part of the test which was a
> > bit confusing and didn't seem to add much value. Let me know what you
> > think of the attached?
>
> Thanks for the changes. v34-0001 LGTM.
>

I was doing a final review before pushing 0001 and found that
'inactive_since' could be set twice during startup after promotion:
once while restoring slots and then again via ShutDownSlotSync(). The
reason is that ShutDownSlotSync() is invoked even during normal startup
on the primary, though it won't do anything apart from setting
inactive_since if we have synced slots. I think you need to check
'StandbyMode' in update_synced_slots_inactive_since() and return if it
is not set. We can't use the 'InRecovery' flag as that will be set even
during crash recovery.
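
A minimal sketch of the suggested guard, assuming it sits at the top
of update_synced_slots_inactive_since() (the exact placement and
comment wording are assumptions):

static void
update_synced_slots_inactive_since(void)
{
    /*
     * This is only meaningful when a standby is being promoted; skip it
     * during normal startup or crash recovery on the primary.  Checking
     * StandbyMode (rather than InRecovery) achieves that.
     */
    if (!StandbyMode)
        return;

    /* ... existing logic to set inactive_since for synced slots ... */
}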

Can you please test this once unless you don't agree with the above theory?

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-04 Thread Amit Kapila
e risk here in case of standby? Because I think we will
use these xmins while sending the feedback message (in
XLogWalRcvSendHSFeedback()).

[1]
/*
* We have to write the changed xmin to disk *before* we change
* the in-memory value, otherwise after a crash we wouldn't know
* that some catalog tuples might have been removed already.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-04 Thread Amit Kapila
On Thu, Apr 4, 2024 at 1:32 PM Masahiko Sawada  wrote:
>
> On Thu, Apr 4, 2024 at 1:34 PM Bharath Rupireddy
>  wrote:
> >
> > On Thu, Apr 4, 2024 at 9:42 AM Masahiko Sawada  
> > wrote:
> > >
> > > @@ -1368,6 +1416,7 @@ ShutDownSlotSync(void)
> > > if (SlotSyncCtx->pid == InvalidPid)
> > > {
> > > SpinLockRelease(&SlotSyncCtx->mutex);
> > > +   update_synced_slots_inactive_since();
> > > return;
> > > }
> > > SpinLockRelease(&SlotSyncCtx->mutex);
> > > @@ -1400,6 +1449,8 @@ ShutDownSlotSync(void)
> > > }
> > >
> > > SpinLockRelease(&SlotSyncCtx->mutex);
> > > +
> > > +   update_synced_slots_inactive_since();
> > >  }
> > >
> > > Why do we want to update all synced slots' inactive_since values at
> > > shutdown in spite of updating the value every time when releasing the
> > > slot? It seems to contradict the fact that inactive_since is updated
> > > when releasing or restoring the slot.
> >
> > It is to get the inactive_since right for the cases where the standby
> > is promoted without a restart similar to when a standby is promoted
> > with restart in which case the inactive_since is set to current time
> > in RestoreSlotFromDisk.
> >
> > Imagine the slot is synced last time at time t1 and then a few hours
> > passed, the standby is promoted without a restart. If we don't set
> > inactive_since to current time in this case in ShutDownSlotSync, the
> > inactive timeout invalidation mechanism can kick in immediately.
> >
>
> Thank you for the explanation! I understood the needs.
>

Do you want to review the v34_0001* further or shall I proceed with
the commit of the same?

-- 
With Regards,
Amit Kapila.




Re: Slow catchup of 2PC (twophase) transactions on replica in LR

2024-04-03 Thread Amit Kapila
On Thu, Apr 4, 2024 at 10:53 AM Ajin Cherian  wrote:
>
> On Wed, Mar 6, 2024 at 1:29 AM Давыдов Виталий  
> wrote:
>>
>> In usual work, the subscription has two_phase = on. I have to change this 
>> option at catchup stage only, but this parameter can not be altered. There 
>> was a patch proposal in past to implement altering of two_phase option, but 
>> it was rejected. I think, the recreation of the subscription with two_phase 
>> = off will not work.
>>
>>
>
> The altering of two_phase was restricted because if there was a previously 
> prepared transaction on the subscriber when the two_phase was on, and then it 
> was turned off, the apply worker on the subscriber would re-apply the 
> transaction a second time and this might result in an inconsistent replica.
> Here's a patch that allows toggling two_phase option provided that there are 
> no pending uncommitted prepared transactions on the subscriber for that 
> subscription.
>
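
For illustration, assuming the patch exposes this through the usual
ALTER SUBSCRIPTION syntax and requires the subscription to be disabled
first (both are assumptions, not confirmed in this thread), the
intended usage would look roughly like:

ALTER SUBSCRIPTION mysub DISABLE;
-- Expected to fail if the subscription still has uncommitted prepared
-- transactions; otherwise the option is toggled.
ALTER SUBSCRIPTION mysub SET (two_phase = off);
ALTER SUBSCRIPTION mysub ENABLE;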

I think this would probably be better than the current situation but
can we think of a solution to allow toggling the value of two_phase
even when prepared transactions are present? Can you please summarize
the reason for the problems in doing that and the solutions, if any?

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Amit Kapila
On Wed, Apr 3, 2024 at 8:28 PM Bharath Rupireddy
 wrote:
>
> On Wed, Apr 3, 2024 at 6:46 PM Bertrand Drouvot
>  wrote:
> >
> > Just one comment on v32-0001:
> >
> > +# Synced slot on the standby must get its own inactive_since.
> > +is( $standby1->safe_psql(
> > +   'postgres',
> > +   "SELECT '$inactive_since_on_primary'::timestamptz <= 
> > '$inactive_since_on_standby'::timestamptz AND
> > +   '$inactive_since_on_standby'::timestamptz <= 
> > '$slot_sync_time'::timestamptz;"
> > +   ),
> > +   "t",
> > +   'synchronized slot has got its own inactive_since');
> > +
> >
> > By using <= we are not testing that it must get its own inactive_since (as 
> > we
> > allow them to be equal in the test). I think we should just add some 
> > usleep()
> > where appropriate and deny equality during the tests on inactive_since.
>
> Thanks. It looks like we can ignore the equality in all of the
> inactive_since comparisons. IIUC, all the TAP tests do run with
> primary and standbys on the single BF animals. And, it looks like
> assigning the inactive_since timestamps to perl variables is giving
> the microseconds precision level
> (./tmp_check/log/regress_log_040_standby_failover_slots_sync:inactive_since
> 2024-04-03 14:30:09.691648+00). FWIW, we already have some TAP and SQL
> tests relying on stats_reset timestamps without equality. So, I've
> left the equality for the inactive_since tests.
>
> > Except for the above, v32-0001 LGTM.
>
> Thanks. Please see the attached v33-0001 patch after removing equality
> on inactive_since TAP tests.
>

The v33-0001 looks good to me. I have made minor changes in the
comments/commit message and removed one part of the test which was a
bit confusing and didn't seem to add much value. Let me know what you
think of the attached?

-- 
With Regards,
Amit Kapila.


v34-0001-Allow-synced-slots-to-have-their-inactive_since.patch
Description: Binary data


Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Amit Kapila
On Wed, Apr 3, 2024 at 2:58 PM shveta malik  wrote:
>
> On Wed, Apr 3, 2024 at 11:17 AM Bharath Rupireddy
>  wrote:
> >
> > On Wed, Apr 3, 2024 at 8:38 AM shveta malik  wrote:
> > >
> > > > > Or a simple solution is that the slotsync worker updates
> > > > > inactive_since as it does for non-synced slots, and disables
> > > > > timeout-based slot invalidation for synced slots.
> > >
> > > I like this idea better, it takes care of such a case too when the
> > > user is relying on sync-function rather than worker and does not want
> > > to get the slots invalidated in between 2 sync function calls.
> >
> > Please find the attached v31 patches implementing the above idea:
> >
>
> Thanks for the patches, please find few comments:
>
> v31-001:
>
> 1)
> system-views.sgml:
> value will get updated  after every synchronization from the
> corresponding remote slot on the primary.
>
> --This is confusing. It will be good to rephrase it.
>
> 2)
> update_synced_slots_inactive_since()
>
> --May be, we should mention in the header that this function is called
> only during promotion.
>
> 3) 040_standby_failover_slots_sync.pl:
> We capture inactive_since_on_primary when we do this for the first time at 
> #175
> ALTER SUBSCRIPTION regress_mysub1 DISABLE"
>
> But we again recreate the sub and disable it at line #280.
> Do you think we shall get inactive_since_on_primary again here, to be
> compared with inactive_since_on_new_primary later?
>

I think so.

Few additional comments on tests:
1.
+is( $standby1->safe_psql(
+ 'postgres',
+ "SELECT '$inactive_since_on_primary'::timestamptz <
'$inactive_since_on_standby'::timestamptz AND
+ '$inactive_since_on_standby'::timestamptz < '$slot_sync_time'::timestamptz;"

Shall we do a <= check, as we are doing in the main function
get_slot_inactive_since_value, since the time duration is small and the
values can be the same? Similarly, please check other tests.

2.
+=item $node->get_slot_inactive_since_value(self, slot_name, reference_time)
+
+Get inactive_since column value for a given replication slot validating it
+against optional reference time.
+
+=cut
+
+sub get_slot_inactive_since_value

I see that all callers validate against reference time. It is better
to name it validate_slot_inactive_since rather than using get_* as the
main purpose is to validate the passed value.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-03 Thread Amit Kapila
On Wed, Apr 3, 2024 at 11:13 AM Amit Kapila  wrote:
>
> On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy
>  wrote:
>
> > I quickly looked at v8, and have a nit, rest all looks good.
> >
> > +if (DecodingContextReady(ctx) && found_consistent_snapshot)
> > +*found_consistent_snapshot = true;
> >
> > Can the found_consistent_snapshot be checked first to help avoid the
> > function call DecodingContextReady() for pg_replication_slot_advance
> > callers?
> >
>
> Okay, changed. Additionally, I have updated the comments and commit
> message. I'll push this patch after some more testing.
>

Pushed!

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-04-03 Thread Amit Kapila
On Wed, Apr 3, 2024 at 12:20 PM Bertrand Drouvot
 wrote:
>
> On Wed, Apr 03, 2024 at 11:17:41AM +0530, Bharath Rupireddy wrote:
> > On Wed, Apr 3, 2024 at 8:38 AM shveta malik  wrote:
> > >
> > > > > Or a simple solution is that the slotsync worker updates
> > > > > inactive_since as it does for non-synced slots, and disables
> > > > > timeout-based slot invalidation for synced slots.
> > >
> > > I like this idea better, it takes care of such a case too when the
> > > user is relying on sync-function rather than worker and does not want
> > > to get the slots invalidated in between 2 sync function calls.
> >
> > Please find the attached v31 patches implementing the above idea:
>
> Thanks!
>
> Some comments related to v31-0001:
>
> === testing the behavior
>
> T1 ===
>
> > - synced slots get their on inactive_since just like any other slot
>
> It behaves as described.
>
> T2 ===
>
> > - synced slots inactive_since is set to current timestamp after the
> > standby gets promoted to help inactive_since interpret correctly just
> > like any other slot.
>
> It behaves as described.
>
> CR1 ===
>
> +inactive_since value will get updated
> +after every synchronization
>
> indicates the last synchronization time? (I think that after every 
> synchronization
> could lead to confusion).
>

+1.

> CR2 ===
>
> +   /*
> +* Set the time since the slot has become inactive 
> after shutting
> +* down slot sync machinery. This helps correctly 
> interpret the
> +* time if the standby gets promoted without a 
> restart.
> +*/
>
> It looks to me that this comment is not at the right place because there is
> nothing after the comment that indicates that we shutdown the "slot sync 
> machinery".
>
> Maybe a better place is before the function definition and mention that this 
> is
> currently called when we shutdown the "slot sync machinery"?
>

Won't it be better to have an assert on SlotSyncCtx->pid? IIRC, we
have some existing issues where we don't ensure that no one is running
the sync API before shutdown is complete, but I think we can deal with
that separately, and here we can still have an Assert.
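
A minimal illustration of the suggested assertion; placing it inside
update_synced_slots_inactive_since() is an assumption based on this
discussion:

/* The slot sync worker should have exited by the time we get here. */
Assert(SlotSyncCtx->pid == InvalidPid);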

> CR3 ===
>
> +* We get the current time beforehand and only once 
> to avoid
> +* system calls overhead while holding the lock.
>
> s/avoid system calls overhead while holding the lock/avoid system calls while 
> holding the spinlock/?
>

Is it valid to say that there is overhead from this call while holding
the spinlock? I don't think we expect any other concurrent slot
activity at the time of promotion. The first reason seems good enough.

One other observation:
--- a/src/backend/replication/slot.c
+++ b/src/backend/replication/slot.c
@@ -42,6 +42,7 @@
 #include "access/transam.h"
 #include "access/xlog_internal.h"
 #include "access/xlogrecovery.h"
+#include "access/xlogutils.h"

Is there a reason for this inclusion? I don't see any change which
should need this one.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-02 Thread Amit Kapila
On Wed, Apr 3, 2024 at 9:36 AM Bharath Rupireddy
 wrote:
>
> On Wed, Apr 3, 2024 at 9:04 AM Amit Kapila  wrote:
> >
> > > I'd just rename LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr
> > > moveto, bool *found_consistent_snapshot) to
> > > pg_logical_replication_slot_advance(XLogRecPtr moveto, bool
> > > *found_consistent_snapshot) and use it. If others don't like this, I'd
> > > at least turn pg_logical_replication_slot_advance(XLogRecPtr moveto) a
> > > static inline function.
> > >
> > Yeah, we can do that but it is not a performance-sensitive routine so
> > don't know if it is worth it.
>
> Okay for what the patch has right now. No more bikeshedding from me on this.
>
> > > > The slotsync worker needs to advance the slots from different databases 
> > > > in
> > > > fast_forward. So, we need to skip this check in fast_forward mode. The 
> > > > analysis can
> > > > be found in [3].
> > > -if (slot->data.database != MyDatabaseId)
> > > +/*
> > > + * We need to access the system tables during decoding to build the
> > > + * logical changes unless we are in fast_forward mode where no 
> > > changes are
> > > + * generated.
> > > + */
> > > +if (slot->data.database != MyDatabaseId && !fast_forward)
> > >  ereport(ERROR,
> > >
> > > It's not clear from the comment that we need it for a slotsync worker
> > > to advance the slots from different databases. Can this be put into
> > > the comment? Also, specify in the comment, why this is safe?
> > >
> > It is not specific to slot sync worker but specific to fast_forward
> > mode. There is already a comment "We need to access the system tables
> > during decoding to build the logical changes unless we are in
> > fast_forward mode where no changes are generated." telling why it is
> > safe. The point is we need database access to access system tables
> > while generating the logical changes and in fast-forward mode, we
> > don't generate logical changes so this check is not required. Do let
> > me if you have a different understanding or if my understanding is
> > incorrect.
>
> Understood. Thanks. Just curious, why isn't a problem for the existing
> fast_forward mode callers pg_replication_slot_advance and
> LogicalReplicationSlotHasPendingWal?
>

We call those after connecting to the database, and the slot also
belongs to that database, whereas during synchronization of slots on
the standby, the slots could be from different databases.

> I quickly looked at v8, and have a nit, rest all looks good.
>
> +if (DecodingContextReady(ctx) && found_consistent_snapshot)
> +*found_consistent_snapshot = true;
>
> Can the found_consistent_snapshot be checked first to help avoid the
> function call DecodingContextReady() for pg_replication_slot_advance
> callers?
>

Okay, changed. Additionally, I have updated the comments and commit
message. I'll push this patch after some more testing.

-- 
With Regards,
Amit Kapila.


v9-0001-Ensure-that-the-sync-slots-reach-a-consistent-sta.patch
Description: Binary data


Re: Synchronizing slots from primary to standby

2024-04-02 Thread Amit Kapila
On Tue, Apr 2, 2024 at 7:42 PM Bharath Rupireddy
 wrote:
>
> On Tue, Apr 2, 2024 at 7:25 PM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > > 1. Can we just remove pg_logical_replication_slot_advance and use
> > > LogicalSlotAdvanceAndCheckSnapState instead? If worried about the
> > > function naming, LogicalSlotAdvanceAndCheckSnapState can be renamed to
> > > pg_logical_replication_slot_advance?
> > >
> > > + * Advance our logical replication slot forward. See
> > > + * LogicalSlotAdvanceAndCheckSnapState for details.
> > >   */
> > >  static XLogRecPtr
> > >  pg_logical_replication_slot_advance(XLogRecPtr moveto)
> > >  {
> >
> > It was commented[1] that it's not appropriate for the
> > pg_logical_replication_slot_advance to have an out parameter
> > 'ready_for_decoding' which looks bit awkward as the functionality doesn't 
> > match
> > the name, and is also not consistent with the style of
> > pg_physical_replication_slot_advance(). So, we added a new function.
>
> I disagree here. A new function just for a parameter is not that great
> IMHO.
>

It is not for the parameter but primarily for the functionality it
provides. The additional functionality of reporting whether we reached
a consistent point while advancing the slot doesn't seem to suit the
current function. Also, we want to keep the signature similar to the
existing function pg_physical_replication_slot_advance().

>
> I'd just rename LogicalSlotAdvanceAndCheckSnapState(XLogRecPtr
> moveto, bool *found_consistent_snapshot) to
> pg_logical_replication_slot_advance(XLogRecPtr moveto, bool
> *found_consistent_snapshot) and use it. If others don't like this, I'd
> at least turn pg_logical_replication_slot_advance(XLogRecPtr moveto) a
> static inline function.
>

Yeah, we can do that, but it is not a performance-sensitive routine,
so I don't know if it is worth it.

> > > 5. As far as the test case for this issue is concerned, I'm fine with
> > > adding one using an INJECTION point because we seem to be having no
> > > consistent way to control postgres writing current snapshot to WAL.
> >
> > Since me and my colleagues can reproduce the issue consistently after 
> > applying
> > 0002 and it could be rare for concurrent xl_running_xacts to happen, we 
> > discussed[2] to
> > consider adding the INJECTION point after pushing the main fix.
>
> Right.
>
> > > 7.
> > > +/*
> > > + * We need to access the system tables during decoding to build the
> > > + * logical changes unless we are in fast-forward mode where no 
> > > changes
> > > are
> > > + * generated.
> > > + */
> > > +if (slot->data.database != MyDatabaseId && !fast_forward)
> > >
> > > May I know if we need this change for this fix?
> >
> > The slotsync worker needs to advance the slots from different databases in
> > fast_forward. So, we need to skip this check in fast_forward mode. The 
> > analysis can
> > be found in [3].
> -if (slot->data.database != MyDatabaseId)
> +/*
> + * We need to access the system tables during decoding to build the
> + * logical changes unless we are in fast_forward mode where no changes 
> are
> + * generated.
> + */
> +if (slot->data.database != MyDatabaseId && !fast_forward)
>  ereport(ERROR,
>
> It's not clear from the comment that we need it for a slotsync worker
> to advance the slots from different databases. Can this be put into
> the comment? Also, specify in the comment, why this is safe?
>

It is not specific to the slot sync worker but specific to fast_forward
mode. There is already a comment "We need to access the system tables
during decoding to build the logical changes unless we are in
fast_forward mode where no changes are generated." telling why it is
safe. The point is that we need database access to read system tables
while generating the logical changes, and in fast-forward mode we
don't generate logical changes, so this check is not required. Do let
me know if you have a different understanding or if my understanding is
incorrect.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-02 Thread Amit Kapila
On Tue, Apr 2, 2024 at 1:54 PM Bertrand Drouvot
 wrote:
>
> On Tue, Apr 02, 2024 at 07:20:46AM +, Zhijie Hou (Fujitsu) wrote:
> > I added one test in 040_standby_failover_slots_sync.pl in 0002 patch, which 
> > can
> > reproduce the data loss issue consistently on my machine.
>
> Thanks!
>
> > It may not reproduce
> > in some rare cases if concurrent xl_running_xacts are written by bgwriter, 
> > but
> > I think it's still valuable if it can verify the fix in most cases.
>
> What about adding a "wait" injection point in LogStandbySnapshot() to prevent
> checkpointer/bgwriter to log a standby snapshot? Something among those lines:
>
>if (AmCheckpointerProcess() || AmBackgroundWriterProcess())
>INJECTION_POINT("bgw-log-standby-snapshot");
>
> And make use of it in the test, something like:
>
>$node_primary->safe_psql('postgres',
>"SELECT injection_points_attach('bgw-log-standby-snapshot', 
> 'wait');");
>

Sometimes we want the checkpoint to log the standby snapshot as we
need it at a predictable time; maybe one can use
pg_log_standby_snapshot() instead of that. Can we add an injection
point as a separate patch/commit after a bit more discussion? I want
to discuss this in a separate thread so that later we don't get an
objection to adding an injection point at this location. One other
idea to make such tests predictable is to add a developer-specific GUC,
say debug_bg_log_standby_snapshot or something like that, but an
injection point sounds like a better idea.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-01 Thread Amit Kapila
On Mon, Apr 1, 2024 at 6:58 PM Bertrand Drouvot
 wrote:
>
> On Mon, Apr 01, 2024 at 05:04:53PM +0530, Amit Kapila wrote:
> > On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot
> >  wrote:
> > > Then there is no need to call WaitForStandbyConfirmation() as it could go 
> > > until
> > > the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing (as we 
> > > already
> > > know it).
> > >
> >
> > Won't it normally return from the first check in
> > WaitForStandbyConfirmation() because standby_slot_names_config is not
> > set on the standby?
>
> I think standby_slot_names can be set on a standby. One could want to set it 
> in
> a cascading standby env (though it won't have any real effects until the 
> standby
> is promoted).
>

Yeah, it is possible but doesn't seem worth additional checks for this
micro-optimization.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-01 Thread Amit Kapila
On Mon, Apr 1, 2024 at 2:51 PM Bertrand Drouvot
 wrote:
>
> Hi,
>
> On Mon, Apr 01, 2024 at 06:05:34AM +, Zhijie Hou (Fujitsu) wrote:
> > On Monday, April 1, 2024 8:56 AM Zhijie Hou (Fujitsu) 
> >  wrote:
> > Attach the V4 patch which includes the optimization to skip the decoding if
> > the snapshot at the syncing restart_lsn is already serialized. It can avoid 
> > most
> > of the duplicate decoding in my test, and I am doing some more tests 
> > locally.
> >
>
> Thanks!
>
> 1 ===
>
> Same comment as in [1].
>
> In LogicalSlotAdvanceAndCheckReadynessForDecoding(), if we are synchronizing 
> slots
> then I think that we can skip:
>
> +   /*
> +* Wait for specified streaming replication standby servers 
> (if any)
> +* to confirm receipt of WAL up to moveto lsn.
> +*/
> +   WaitForStandbyConfirmation(moveto);
>
> Indeed if we are dealing with synced slot then we know we're in 
> RecoveryInProgress().
>
> Then there is no need to call WaitForStandbyConfirmation() as it could go 
> until
> the RecoveryInProgress() in StandbySlotsHaveCaughtup() for nothing (as we 
> already
> know it).
>

Won't it normally return from the first check in
WaitForStandbyConfirmation() because standby_slot_names_config is not
set on the standby?

> 2 ===
>
> +   {
> +   if (SnapBuildSnapshotExists(remote_slot->restart_lsn))
> +   {
>
> That could call SnapBuildSnapshotExists() multiple times for the same
> "restart_lsn" (for example in case of multiple remote slots to sync).
>
> What if the sync worker records the last lsn it asks for serialization (and
> serialized ? Then we could check that value first before deciding to call (or 
> not)
> SnapBuildSnapshotExists() on it?
>
> It's not ideal because it would record "only the last one" but that would be
> simple enough for now (currently there is only one sync worker so that 
> scenario
> is likely to happen).
>

Yeah, we could do that but I am not sure how much it can help. I guess
we could do some tests to see if it helps.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-04-01 Thread Amit Kapila
On Mon, Apr 1, 2024 at 6:26 AM Zhijie Hou (Fujitsu)
 wrote:
>
> On Friday, March 29, 2024 2:50 PM Amit Kapila  wrote:
> >
>
> >
> >
> > 2.
> > +extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto,
> > +   bool *found_consistent_point);
> > +
> >
> > This API looks a bit awkward as the functionality doesn't match the name. 
> > How
> > about having a function with name
> > LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto,
> > ready_for_decoding) with the same functionality as your patch has for
> > pg_logical_replication_slot_advance() and then invoke it both from
> > pg_logical_replication_slot_advance and slotsync.c. The function name is too
> > big, we can think of a shorter name. Any ideas?
>
> How about LogicalSlotAdvanceAndCheckDecodingState() Or just
> LogicalSlotAdvanceAndCheckDecoding()?
>

It is about snapbuild state, so how about naming the function as
LogicalSlotAdvanceAndCheckSnapState()?

I have made quite a few cosmetic changes in comments and code. See
attached. This is atop your latest patch. Can you please review and
include these changes in the next version?

-- 
With Regards,
Amit Kapila.
diff --git a/src/backend/replication/logical/logical.c 
b/src/backend/replication/logical/logical.c
index bbc7cdaf50..7d5884789c 100644
--- a/src/backend/replication/logical/logical.c
+++ b/src/backend/replication/logical/logical.c
@@ -518,10 +518,9 @@ CreateDecodingContext(XLogRecPtr start_lsn,
 errmsg("cannot use physical replication slot 
for logical decoding")));
 
/*
-* Do not allow decoding if the replication slot belongs to a different
-* database unless we are in fast-forward mode. In fast-forward mode, we
-* ignore storage-level changes and do not need to access the database
-* object.
+* We need to access the system tables during decoding to build the 
logical
+* changes unless we are in fast-forward mode where no changes are
+* generated.
 */
if (slot->data.database != MyDatabaseId && !fast_forward)
ereport(ERROR,
@@ -530,9 +529,9 @@ CreateDecodingContext(XLogRecPtr start_lsn,
NameStr(slot->data.name))));
 
/*
-* Do not allow consumption of a "synchronized" slot until the standby 
gets
-* promoted unless we are syncing replication slots, in which case we 
need
-* to advance the LSN and xmin of the slot during decoding.
+* The slots being synced from the primary can't be used for decoding as
+* they are used after failover. However, we do allow advancing the LSNs
+* during the synchronization of slots. See update_local_synced_slot.
 */
if (RecoveryInProgress() && slot->data.synced && 
!IsSyncingReplicationSlots())
ereport(ERROR,
@@ -2054,8 +2053,8 @@ LogicalReplicationSlotHasPendingWal(XLogRecPtr end_of_wal)
  * WAL and removal of old catalog tuples.  As decoding is done in fast_forward
  * mode, no changes are generated anyway.
  *
- * *ready_for_decoding will be set to true if the logical decoding reaches
- * the consistent point; Otherwise, it will be set to false.
+ * *ready_for_decoding will be true if the initial decoding snapshot has
+ * been built; Otherwise, it will be false.
  */
 XLogRecPtr
 LogicalSlotAdvanceAndCheckReadynessForDecoding(XLogRecPtr moveto,
diff --git a/src/backend/replication/logical/slotsync.c 
b/src/backend/replication/logical/slotsync.c
index 886a9fcc7e..9f6b83d486 100644
--- a/src/backend/replication/logical/slotsync.c
+++ b/src/backend/replication/logical/slotsync.c
@@ -26,10 +26,13 @@
  * pg_sync_replication_slots() periodically to perform the syncs.
  *
  * If synchronized slots fail to build a consistent snapshot from the
- * restart_lsn, they would become unreliable after promotion due to potential
- * data loss from changes before reaching a consistent point. So, we mark such
- * slots as RS_TEMPORARY. Once they successfully reach the consistent point,
- * they will be marked to RS_PERSISTENT.
+ * restart_lsn before reaching confirmed_flush_lsn, they would become
+ * unreliable after promotion due to potential data loss from changes
+ * before reaching a consistent point. This can happen because the slots can
+ * be synced at some random time and we may not reach the consistent point
+ * at the same WAL location as the primary. So, we mark such slots as
+ * RS_TEMPORARY. Once the decoding from corresponding LSNs can reach a
+ * consistent point, they will be marked as RS_PERSISTENT.
  *
  * The slot sync worker waits for some time before the next synchronization,
  * with the duration varying based on whether any slots were updated during
@@ -155,9 +158,9 @@ static void slotsync_failure_callback(int 

Re: Synchronizing slots from primary to standby

2024-03-31 Thread Amit Kapila
On Mon, Apr 1, 2024 at 10:01 AM Bharath Rupireddy
 wrote:
>
> On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu)
>  wrote:
> >
> > [2] The steps to reproduce the data miss issue on a primary->standby setup:
>
> I'm trying to reproduce the problem with [1], but I can see the
> changes after the standby is promoted. Am I missing anything here?
>
> ubuntu:~/postgres/pg17/bin$ ./psql -d postgres -p 5433 -c "select *
> from pg_logical_slot_get_changes('lrep_sync_slot', NULL, NULL);"
> lsn| xid |data
> ---+-+-
>  0/3B0 | 738 | BEGIN 738
>  0/3017FC8 | 738 | COMMIT 738
>  0/3017FF8 | 739 | BEGIN 739
>  0/3019A38 | 739 | COMMIT 739
>  0/3019A38 | 740 | BEGIN 740
>  0/3019A38 | 740 | table public.dummy1: INSERT: a[integer]:999
>  0/3019AA8 | 740 | COMMIT 740
> (7 rows)
>
> [1]
> -#define LOG_SNAPSHOT_INTERVAL_MS 15000
> +#define LOG_SNAPSHOT_INTERVAL_MS 150
>
> ./initdb -D db17
> echo "archive_mode = on
> archive_command='cp %p /home/ubuntu/postgres/pg17/bin/archived_wal/%f'
> wal_level='logical'
> autovacuum = off
> checkpoint_timeout='1h'" | tee -a db17/postgresql.conf
>
> ./pg_ctl -D db17 -l logfile17 start
>
> rm -rf sbdata logfilesbdata
> ./pg_basebackup -D sbdata
>
> ./psql -d postgres -p 5432 -c "SELECT
> pg_create_logical_replication_slot('lrep_sync_slot', 'test_decoding',
> false, false, true);"
> ./psql -d postgres -p 5432 -c "SELECT
> pg_create_physical_replication_slot('phy_repl_slot', true, false);"
>
> echo "port=5433
> primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu'
> primary_slot_name='phy_repl_slot'
> restore_command='cp /home/ubuntu/postgres/pg17/bin/archived_wal/%f %p'
> hot_standby_feedback=on
> sync_replication_slots=on" | tee -a sbdata/postgresql.conf
>
> touch sbdata/standby.signal
>
> ./pg_ctl -D sbdata -l logfilesbdata start
> ./psql -d postgres -p 5433 -c "SELECT pg_is_in_recovery();"
>
> ./psql -d postgres
>
> SESSION1, TXN1
> BEGIN;
> create table dummy1(a int);
>
> SESSION2, TXN2
> SELECT pg_log_standby_snapshot();
>
> SESSION1, TXN1
> COMMIT;
>
> SESSION1, TXN1
> BEGIN;
> create table dummy2(a int);
>
> SESSION2, TXN2
> SELECT pg_log_standby_snapshot();
>
> SESSION1, TXN1
> COMMIT;
>
> ./psql -d postgres -p 5432 -c "SELECT
> pg_replication_slot_advance('lrep_sync_slot', pg_current_wal_lsn());"
>

After this step and before the next, did you ensure that the slot sync
has synced the latest confirmed_flush/restart LSNs? You can query:
"select slot_name,restart_lsn, confirmed_flush_lsn from
pg_replication_slots;" to ensure the same on both the primary and
standby.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-31 Thread Amit Kapila
On Fri, Mar 29, 2024 at 6:17 PM Bertrand Drouvot
 wrote:
>
> On Fri, Mar 29, 2024 at 03:03:01PM +0530, Amit Kapila wrote:
> > On Fri, Mar 29, 2024 at 11:49 AM Bertrand Drouvot
> >  wrote:
> > >
> > > On Fri, Mar 29, 2024 at 09:39:31AM +0530, Amit Kapila wrote:
> > > >
> > > > Commit message states: "why we can't just update inactive_since for
> > > > synced slots on the standby with the value received from remote slot
> > > > on the primary. This is consistent with any other slot parameter i.e.
> > > > all of them are synced from the primary."
> > > >
> > > > The inactive_since is not consistent with other slot parameters which
> > > > we copy. We don't perform anything related to those other parameters
> > > > like say two_phase phase which can change that property. However, we
> > > > do acquire the slot, advance the slot (as per recent discussion [1]),
> > > > and release it. Since these operations can impact inactive_since, it
> > > > seems to me that inactive_since is not the same as other parameters.
> > > > It can have a different value than the primary. Why would anyone want
> > > > to know the value of inactive_since from primary after the standby is
> > > > promoted?
> > >
> > > I think it can be useful "before" it is promoted and in case the primary 
> > > is down.
> > >
> >
> > It is not clear to me what is user going to do by checking the
> > inactivity time for slots when the corresponding server is down.
>
> Say a failover needs to be done, then it could be useful to know for which
> slots the activity needs to be resumed (thinking about external logical 
> decoding
> plugin, not about pub/sub here). If one see an inactive slot (since long 
> "enough")
> then he can start to reasonate about what to do with it.
>
> > I thought the idea was to check such slots and see if they need to be
> > dropped or enabled again to avoid excessive disk usage, etc.
>
> Yeah that's the case but it does not mean inactive_since can't be useful in 
> other
> ways.
>
> Also, say the slot has been invalidated on the primary (due to inactivity 
> timeout),
> primary is down and there is a failover. By keeping the inactive_since from
> the primary, one could know when the inactivity that lead to the timeout 
> started.
>

So, this means at promotion, we won't set the current_time for
inactive_since which is not what the currently proposed patch is
doing. Moreover, doing the invalidation on promoted standby based on
inactive_since of the primary node sounds debatable because the
inactive_timeout could be different on the new node (promoted
standby).

> Again, more concerned about external logical decoding plugin than pub/sub 
> here.
>
> > > I agree that tracking the activity time of a synced slot can be useful, 
> > > why
> > > not creating a dedicated field for that purpose (and keep inactive_since a
> > > perfect "copy" of the primary)?
> > >
> >
> > We can have a separate field for this but not sure if it is worth it.
>
> OTOH I'm not sure that erasing this information from the primary is useful. I
> think that 2 fields would be the best option and would be less subject of
> misinterpretation.
>
> > > > Now, the other concern is that calling GetCurrentTimestamp()
> > > > could be costly when the values for the slot are not going to be
> > > > updated but if that happens we can optimize such that before acquiring
> > > > the slot we can have some minimal pre-checks to ensure whether we need
> > > > to update the slot or not.
> > >
> > > Right, but for a very active slot it is likely that we call 
> > > GetCurrentTimestamp()
> > > during almost each sync cycle.
> > >
> >
> > True, but if we have to save a slot to disk each time to persist the
> > changes (for an active slot) then probably GetCurrentTimestamp()
> > shouldn't be costly enough to matter.
>
> Right, persisting the changes to disk would be even more costly.
>

The point I was making is that currently, after copying the remote
node's values, we always persist the slots to disk, so the cost of
getting the current time shouldn't be much. Now, if the values don't
change then there is probably some cost, but in most cases (active
slots) the values will always change. Also, if all the slots are
inactive then we will slow down the speed of sync. We also need to
consider whether we want to copy the value of inactive_since from the
primary, and if that is the only value changed, whether we should
persist the slot or not?

-- 
With Regards,
Amit Kapila.




Re: Improve eviction algorithm in ReorderBuffer

2024-03-29 Thread Amit Kapila
On Fri, Mar 29, 2024 at 12:13 PM Masahiko Sawada  wrote:
>
> On Fri, Mar 29, 2024 at 2:09 PM vignesh C  wrote:
> >
> > On Tue, 26 Mar 2024 at 10:05, Masahiko Sawada  wrote:
> > >
> > > On Thu, Mar 14, 2024 at 12:02 PM Masahiko Sawada  
> > > wrote:
> > > >
> > > >
> > > > I've attached new version patches.
> > >
> > > Since the previous patch conflicts with the current HEAD, I've
> > > attached the rebased patches.
> >
> > Thanks for the updated patch.
> > One comment:
> > I felt we can mention the improvement where we update memory
> > accounting info at transaction level instead of per change level which
> > is done in ReorderBufferCleanupTXN, ReorderBufferTruncateTXN, and
> > ReorderBufferSerializeTXN also in the commit message:
>
> Agreed.
>
> I think the patch is in good shape. I'll push the patch with the
> suggestion next week, barring any objections.
>

Few minor comments:
1.
@@ -3636,6 +3801,8 @@ ReorderBufferCheckMemoryLimit(ReorderBuffer *rb)
  Assert(txn->nentries_mem == 0);
  }

+ ReorderBufferMaybeResetMaxHeap(rb);
+

Can we write a comment about why this reset is required here?
Otherwise, the reason is not apparent.

2.
Although using max-heap to select the largest
+ * transaction is effective when there are many transactions being decoded,
+ * there is generally no need to use it as long as all transactions being
+ * decoded are top-level transactions. Therefore, we use MaxConnections as the
+ * threshold so we can prevent switching to the state unless we use
+ * subtransactions.
+ */
+#define MAX_HEAP_TXN_COUNT_THRESHOLD MaxConnections

Isn't using a max-heap equally effective in finding the largest
transaction whether there are only top-level transactions or top-level
transactions plus subtransactions? This comment indicates it is only
effective when there are subtransactions.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-29 Thread Amit Kapila
On Fri, Mar 29, 2024 at 11:49 AM Bertrand Drouvot
 wrote:
>
> On Fri, Mar 29, 2024 at 09:39:31AM +0530, Amit Kapila wrote:
> >
> > Commit message states: "why we can't just update inactive_since for
> > synced slots on the standby with the value received from remote slot
> > on the primary. This is consistent with any other slot parameter i.e.
> > all of them are synced from the primary."
> >
> > The inactive_since is not consistent with other slot parameters which
> > we copy. We don't perform anything related to those other parameters
> > like say two_phase phase which can change that property. However, we
> > do acquire the slot, advance the slot (as per recent discussion [1]),
> > and release it. Since these operations can impact inactive_since, it
> > seems to me that inactive_since is not the same as other parameters.
> > It can have a different value than the primary. Why would anyone want
> > to know the value of inactive_since from primary after the standby is
> > promoted?
>
> I think it can be useful "before" it is promoted and in case the primary is 
> down.
>

It is not clear to me what the user is going to do by checking the
inactivity time for slots when the corresponding server is down. I
thought the idea was to check such slots and see if they need to be
dropped or enabled again to avoid excessive disk usage, etc.
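
For example, an administrator could spot such candidate slots with a
query along these lines (a sketch; inactive_since is the column this
thread adds to pg_replication_slots):

SELECT slot_name, slot_type, inactive_since
FROM pg_replication_slots
WHERE NOT active
ORDER BY inactive_since;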

> I agree that tracking the activity time of a synced slot can be useful, why
> not creating a dedicated field for that purpose (and keep inactive_since a
> perfect "copy" of the primary)?
>

We can have a separate field for this but not sure if it is worth it.

> > Now, the other concern is that calling GetCurrentTimestamp()
> > could be costly when the values for the slot are not going to be
> > updated but if that happens we can optimize such that before acquiring
> > the slot we can have some minimal pre-checks to ensure whether we need
> > to update the slot or not.
>
> Right, but for a very active slot it is likely that we call 
> GetCurrentTimestamp()
> during almost each sync cycle.
>

True, but if we have to save a slot to disk each time to persist the
changes (for an active slot) then probably GetCurrentTimestamp()
shouldn't be costly enough to matter.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-03-29 Thread Amit Kapila
On Fri, Mar 29, 2024 at 1:08 PM Bertrand Drouvot
 wrote:
>
> On Fri, Mar 29, 2024 at 07:23:11AM +, Zhijie Hou (Fujitsu) wrote:
> > On Friday, March 29, 2024 2:48 PM Bertrand Drouvot 
> >  wrote:
> > >
> > > Hi,
> > >
> > > On Fri, Mar 29, 2024 at 01:06:15AM +, Zhijie Hou (Fujitsu) wrote:
> > > > Attach a new version patch which fixed an un-initialized variable
> > > > issue and added some comments. Also, temporarily enable DEBUG2 for the
> > > > 040 tap-test so that we can analyze the possible CFbot failures easily.
> > > >
> > >
> > > Thanks!
> > >
> > > +   if (remote_slot->confirmed_lsn != slot->data.confirmed_flush)
> > > +   {
> > > +   /*
> > > +* By advancing the restart_lsn, confirmed_lsn, and xmin 
> > > using
> > > +* fast-forward logical decoding, we ensure that the 
> > > required
> > > snapshots
> > > +* are saved to disk. This enables logical decoding to 
> > > quickly
> > > reach a
> > > +* consistent point at the restart_lsn, eliminating the 
> > > risk of
> > > missing
> > > +* data during snapshot creation.
> > > +*/
> > > +
> > > pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
> > > +
> > > found_consistent_point);
> > > +   ReplicationSlotsComputeRequiredLSN();
> > > +   updated_lsn = true;
> > > +   }
> > >
> > > Instead of using pg_logical_replication_slot_advance() for each synced 
> > > slot and
> > > during sync cycles what about?:
> > >
> > > - keep sync slot synchronization as it is currently (not using
> > > pg_logical_replication_slot_advance())
> > > - create "an hidden" logical slot if sync slot feature is on
> > > - at the time of promotion use pg_logical_replication_slot_advance() on 
> > > this
> > > hidden slot only to advance to the max lsn of the synced slots
> > >
> > > I'm not sure that would be enough, just asking your thoughts on this 
> > > (benefits
> > > would be to avoid calling pg_logical_replication_slot_advance() on each 
> > > sync
> > > slots and during the sync cycles).
> >
> > Thanks for the idea !
> >
> > I considered about this. I think advancing the "hidden" slot on promotion 
> > may be a
> > bit late, because if we cannot reach the consistent point after advancing 
> > the
> > "hidden" slot, then it means we may need to remove all the synced slots as 
> > we
> > are not sure if they are usable(will not loss data) after promotion.
>
> What about advancing the hidden slot during the sync cycles then?
>
> > The current approach is to mark such un-consistent slot as temp and persist
> > them once it reaches consistent point, so that user can ensure the slot can 
> > be
> > used after promotion once persisted.
>
> Right, but do we need to do so for all the sync slots? Would a single hidden
> slot be enough?
>

Even if we mark one of the synced slots as persistent without reaching
a consistent state, it could create a problem after promotion. And how
would a single hidden slot serve the purpose? Different synced slots
will have different restart/confirmed_flush LSNs, and we won't be able
to perform the advancing for all of them using a single slot. For
example, say the first synced slot has not reached a consistent state;
then how can it try for the second slot? This sounds quite tricky to
make work. We should go with something simple where the chances of
introducing bugs are lesser.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-03-29 Thread Amit Kapila
On Fri, Mar 29, 2024 at 9:34 AM Hayato Kuroda (Fujitsu)
 wrote:
>
> Thanks for updating the patch! Here is a comment for it.
>
> ```
> +/*
> + * By advancing the restart_lsn, confirmed_lsn, and xmin using
> + * fast-forward logical decoding, we can verify whether a consistent
> + * snapshot can be built. This process also involves saving necessary
> + * snapshots to disk during decoding, ensuring that logical decoding
> + * efficiently reaches a consistent point at the restart_lsn without
> + * the potential loss of data during snapshot creation.
> + */
> +pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
> +found_consistent_point);
> +ReplicationSlotsComputeRequiredLSN();
> +updated_lsn = true;
> ```
>
> You added them like pg_replication_slot_advance(), but the function also calls
> ReplicationSlotsComputeRequiredXmin(false) at that time. According to the 
> related
> commit b48df81 and discussions [1], I know it is needed only for physical 
> slots,
> but it makes more consistent to call requiredXmin() as well, per [2]:
>

Yeah, I also think it is okay to call for the sake of consistency with
pg_replication_slot_advance().
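
So the synced-slot path would end up mirroring
pg_replication_slot_advance(), roughly like this (a sketch of the
agreed change, not the exact patch):

pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
                                    found_consistent_point);
ReplicationSlotsComputeRequiredLSN();
ReplicationSlotsComputeRequiredXmin(false);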

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-03-29 Thread Amit Kapila
On Fri, Mar 29, 2024 at 6:36 AM Zhijie Hou (Fujitsu)
 wrote:
>
>
> Attach a new version patch which fixed an un-initialized variable issue and
> added some comments.
>

The other approach to fix this issue could be that the slotsync worker
gets the serialized snapshot corresponding to restart_lsn using
pg_read_binary_file() and writes it on the standby. But there are cases
when we won't have such a file: (a) when we initially create the slot
and reach the consistent_point, or (b) when, by the time the slotsync
worker starts to read the remote snapshot file, the snapshot file has
already been removed by the checkpointer on the primary (if the
restart_lsn of the remote has been advanced in this window). So, in
such cases, we anyway need to advance the slot. I think these could be
optimizations that we could do in the future.

Few comments:
=
1.
- if (slot->data.database != MyDatabaseId)
+ if (slot->data.database != MyDatabaseId && !fast_forward)
  ereport(ERROR,
  (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
  errmsg("replication slot \"%s\" was not created in this database",
@@ -526,7 +527,7 @@ CreateDecodingContext(XLogRecPtr start_lsn,
  * Do not allow consumption of a "synchronized" slot until the standby
  * gets promoted.
  */
- if (RecoveryInProgress() && slot->data.synced)
+ if (RecoveryInProgress() && slot->data.synced && !IsSyncingReplicationSlots())


Add comments at both of the above places.


2.
+extern XLogRecPtr pg_logical_replication_slot_advance(XLogRecPtr moveto,
+   bool *found_consistent_point);
+

This API looks a bit awkward as the functionality doesn't match the
name. How about having a function named
LogicalSlotAdvanceAndCheckReadynessForDecoding(moveto,
ready_for_decoding) with the same functionality as your patch has for
pg_logical_replication_slot_advance(), and then invoking it both from
pg_logical_replication_slot_advance() and slotsync.c? The function name
is too long; we can think of a shorter name. Any ideas?


-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-28 Thread Amit Kapila
On Wed, Mar 27, 2024 at 9:00 PM Bharath Rupireddy
 wrote:
>
>
> Thanks. I'm attaching v29 patches. 0001 managing inactive_since on the
> standby for sync slots.
>

Commit message states: "why we can't just update inactive_since for
synced slots on the standby with the value received from remote slot
on the primary. This is consistent with any other slot parameter i.e.
all of them are synced from the primary."

The inactive_since is not consistent with the other slot parameters
which we copy. We don't perform anything related to those other
parameters, like say two_phase, which can change that property.
However, we do acquire the slot, advance the slot (as per recent
discussion [1]),
and release it. Since these operations can impact inactive_since, it
seems to me that inactive_since is not the same as other parameters.
It can have a different value than the primary. Why would anyone want
to know the value of inactive_since from primary after the standby is
promoted? Now, the other concern is that calling GetCurrentTimestamp()
could be costly when the values for the slot are not going to be
updated, but if that happens we can optimize such that, before
acquiring the slot, we have some minimal pre-checks to ensure whether
we need to update the slot or not.
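
As a rough illustration of such a pre-check (the field comparisons and
placement are assumptions, not an actual patch):

/*
 * Skip acquiring the slot (and the GetCurrentTimestamp() call) when the
 * remote values match what we already have locally.
 */
if (remote_slot->confirmed_lsn == slot->data.confirmed_flush &&
    remote_slot->restart_lsn == slot->data.restart_lsn &&
    remote_slot->catalog_xmin == slot->data.catalog_xmin)
    return;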

[1] - 
https://www.postgresql.org/message-id/OS0PR01MB571615D35F486080616CA841943A2%40OS0PR01MB5716.jpnprd01.prod.outlook.com

-- 
With Regards,
Amit Kapila.




Re: Support logical replication of DDLs

2024-03-28 Thread Amit Kapila
On Thu, Mar 28, 2024 at 5:31 PM Andrey M. Borodin  wrote:
>
> This thread is registered on CF [0] but is not active since 2023. Is anyone 
> working on this? I understand that this is a nice feature. Should we move it 
> to next CF or withdraw CF entry?
>

At this stage, we should close it as either Returned with Feedback or
Withdrawn, as this requires a lot of work.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-03-28 Thread Amit Kapila
On Thu, Mar 28, 2024 at 3:34 PM Bertrand Drouvot
 wrote:
>
> On Thu, Mar 28, 2024 at 04:38:19AM +, Zhijie Hou (Fujitsu) wrote:
>
> > To fix this, we could use the fast forward logical decoding to advance the 
> > synced
> > slot's lsn/xmin when syncing these values instead of directly updating the
> > slot's info. This way, the snapshot will be serialized to disk when 
> > decoding.
> > If we could not reach to the consistent point at the remote restart_lsn, the
> > slot is marked as temp and will be persisted once it reaches the consistent
> > point. I am still analyzing the fix and will share once ready.
>
> Thanks! I'm wondering about the performance impact (even in fast_forward 
> mode),
> might be worth to keep an eye on it.
>

True, we can consider performance, but correctness should be the
priority. Also, can we think of a better way to fix this issue?

> Should we create a 17 open item [1]?
>
> [1]: https://wiki.postgresql.org/wiki/PostgreSQL_17_Open_Items
>

Yes, we can do that.

-- 
With Regards,
Amit Kapila.




Re: Synchronizing slots from primary to standby

2024-03-28 Thread Amit Kapila
On Thu, Mar 28, 2024 at 10:08 AM Zhijie Hou (Fujitsu)
 wrote:
>
> When analyzing one BF error[1], we find an issue of slotsync: Since we don't
> perform logical decoding for the synced slots when syncing the lsn/xmin of
> slot, no logical snapshots will be serialized to disk. So, when user starts to
> use these synced slots after promotion, it needs to re-build the consistent
> snapshot from the restart_lsn if the WAL(xl_running_xacts) at restart_lsn
> position indicates that there are running transactions. This however could
> cause the data that before the consistent point to be missed[2].
>
> This issue doesn't exist on the primary because the snapshot at restart_lsn
> should have been serialized to disk (SnapBuildProcessRunningXacts ->
> SnapBuildSerialize), so even if the logical decoding restarts, it can find
> consistent snapshot immediately at restart_lsn.
>
> To fix this, we could use the fast forward logical decoding to advance the 
> synced
> slot's lsn/xmin when syncing these values instead of directly updating the
> slot's info. This way, the snapshot will be serialized to disk when decoding.
> If we could not reach to the consistent point at the remote restart_lsn, the
> slot is marked as temp and will be persisted once it reaches the consistent
> point. I am still analyzing the fix and will share once ready.
>

Yes, we can use this, but one thing to note is that
CreateDecodingContext() expects the slot's database and the current
database to be the same. I think the reason for that is that we need
to access the system tables of the slot's database while decoding and
sending data to the output plugin, which isn't a requirement for the
fast_forward case. So, we need to skip that check in fast_forward
mode.
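
In code terms, that should be a small change along these lines (a
sketch; the existing error message stays as it is today):

	/*
	 * We need the catalogs of the slot's database to decode and send
	 * data to an output plugin, but fast-forward mode produces no
	 * output, so the check can be skipped for it.
	 */
	if (slot->data.database != MyDatabaseId && !fast_forward)
		ereport(ERROR,
				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
				 errmsg("replication slot \"%s\" was not created in this database",
						NameStr(slot->data.name))));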

Next, I was thinking about the case of updating the restart_lsn and
confirmed_flush LSN for the first time while syncing the slots. I
think we can keep the current logic as is, based on the following
analysis.

For each logical slot, cases possible on the primary:
1. The restart_lsn doesn't have a serialized snapshot and hasn't yet
reached the consistent point.
2. The restart_lsn doesn't have a serialized snapshot but has reached
a consistent point.
3. The restart_lsn has a serialized snapshot which means it has
reached a consistent point as well.

Considering we keep the logic to reserve initial WAL positions the
same as it is now (Reserve WAL for the currently active local slot
using the specified WAL location (restart_lsn). If the given WAL
location has been removed, reserve WAL using the oldest existing WAL
segment.), I can think of the following scenarios:
A. For 1, we shouldn't sync the slot as it still wouldn't have been
marked persistent on the primary.
B. For 2, we would sync the slot
   B1. If remote_restart_lsn >= local_restart_lsn, then advance the
slot by calling pg_logical_replication_slot_advance().
      B11. If we reach a consistent point, then it should be okay,
because after promotion we should reach a consistent point as well.
         B111. But again, is it possible that some xact comes before
the consistent_point on the primary and the same xact comes after the
consistent_point on the standby? This shouldn't matter, as we will
start decoding transactions only after confirmed_flush_lsn, which
would be the same on primary and standby.
      B12. If we haven't reached a consistent_point, then we won't
mark the slot as persistent, and at the next sync we will do the same
until it reaches the consistent_point. At that time, the situation
will be similar to B11.
   B2. If remote_restart_lsn < local_restart_lsn, then we will wait
for the next sync cycle and keep the slot temporary. Once, in the next
or some subsequent sync cycle, we reach the condition
remote_restart_lsn >= local_restart_lsn, we will proceed to advance
the slot and we should have the same behavior as B1.
C. For 3, we would sync the slot, but the behavior should be the same as B.
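
In code terms, the handling per slot would look roughly like this
(pseudo-C sketch only, not the actual patch; the second parameter of
the advance function is the one being added by the fix under
discussion):

	bool		found_consistent_point = false;

	if (remote_slot->restart_lsn < slot->data.restart_lsn)
	{
		/* B2: keep the local slot temporary and retry in a later cycle */
		return;
	}

	/* B1 and C: advance via fast-forward decoding */
	pg_logical_replication_slot_advance(remote_slot->confirmed_lsn,
										&found_consistent_point);

	if (found_consistent_point)
		ReplicationSlotPersist();	/* B11: safe to use after promotion */
	else
	{
		/* B12: stay temporary; a later sync cycle will retry */
	}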

Thoughts?

-- 
With Regards,
Amit Kapila.




Re: pg_upgrade and logical replication

2024-03-27 Thread Amit Kapila
On Wed, Mar 27, 2024 at 11:57 AM vignesh C  wrote:
>
> The attached patch has changes to wait for regress_sub4 subscription
> to apply the changes from the publisher before verifying the data.
>

Pushed, after changing the order of the waits, as it seems logical to
wait for regress_sub5 after enabling the subscription. Thanks!

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Amit Kapila
On Wed, Mar 27, 2024 at 10:08 AM Bharath Rupireddy
 wrote:
>
> On Tue, Mar 26, 2024 at 11:22 PM Bertrand Drouvot
>  wrote:
>
> > 3)
> > update_synced_slots_inactive_time():
> >
> > This assert is removed, is it intentional?
> > Assert(s->active_pid == 0);
>
> Yes, the slot can get acquired in the corner case when someone runs
> pg_sync_replication_slots concurrently at this time. I'm referring to
> the issue reported upthread. We don't prevent one running
> pg_sync_replication_slots in promotion/ShutDownSlotSync phase right?
> Maybe we should prevent that otherwise some of the slots are synced
> and the standby gets promoted while others are yet-to-be-synced.
>

We should do something about it but that shouldn't be done in this
patch. We can handle it separately and then add such an assert.

-- 
With Regards,
Amit Kapila.




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-26 Thread Amit Kapila
On Tue, Mar 26, 2024 at 9:10 PM Alvaro Herrera  wrote:
>
> On 2024-Mar-26, Nathan Bossart wrote:
>
> > FWIW I'd really prefer to have something like max_slot_xid_age for this.  A
> > time-based parameter would likely help with most cases, but transaction ID
> > usage will vary widely from server to server, so it'd be nice to have
> > something to protect against wraparound more directly.
>
> Yeah, I tend to agree that an XID-based limit makes more sense than a
> time-based one.
>

So, do we want the time-based parameter or just max_slot_xid_age,
considering both will be GUCs? Yes, the xid_age-based parameter sounds
directly helpful for transaction ID wraparound and dead row cleanup.
OTOH, having a lot of inactive slots (which is possible in use cases
where a tool creates a lot of slots and forgets to remove them, or
exits without cleaning up its slots, say due to a server shutdown)
also prevents removing dead rows, which is not nice either.

One example that comes to mind is pg_createsubscriber (committed for
PG17), which creates one slot per database to convert a standby into a
subscriber; say it exits due to a power failure, then there could be a
lot of dangling slots on the primary server. Also, if there is some
bug in such a tool that prevents proper cleanup of slots, the same
problem can happen. Yes, that would be a problem of the tool, but I
think there is no harm in giving a way to avoid problems at the server
end due to such slots.

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Amit Kapila
On Tue, Mar 26, 2024 at 3:12 PM Bertrand Drouvot
 wrote:
>
> On Tue, Mar 26, 2024 at 02:27:17PM +0530, Bharath Rupireddy wrote:
> > Please use the v22 patch set.
>
> Thanks!
>
> 1 ===
>
> +reset_synced_slots_info(void)
>
> I'm not sure "reset" is the right word, what about 
> slot_sync_shutdown_update()?
>

*shutdown_update() sounds generic. How about
update_synced_slots_inactive_time()? I think it is a bit longer but
conveys the meaning.

-- 
With Regards,
Amit Kapila.




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-26 Thread Amit Kapila
On Tue, Mar 26, 2024 at 2:11 PM Alvaro Herrera  wrote:
>
> On 2024-Mar-26, Amit Kapila wrote:
>
> > On Tue, Mar 26, 2024 at 1:09 PM Alvaro Herrera  
> > wrote:
> > > On 2024-Mar-26, Amit Kapila wrote:
> > > > I would also like to solicit your opinion on the other slot-level
> > > > parameter we are planning to introduce.  This new slot-level parameter
> > > > will be named as inactive_timeout.
> > >
> > > Maybe inactivity_timeout?
> > >
> > > > This will indicate that once the slot is inactive for the
> > > > inactive_timeout period, we will invalidate the slot. We are also
> > > > discussing to have this parameter (inactive_timeout) as GUC [1]. We
> > > > can have this new parameter both at the slot level and as well as a
> > > > GUC, or just one of those.
> > >
> > > replication_slot_inactivity_timeout?
> >
> > So, it seems you are okay to have this parameter both at slot level
> > and as a GUC.
>
> Well, I think a GUC is good to have regardless of the slot parameter,
> because the GUC can be used as an instance-wide protection against going
> out of disk space because of broken replication.  However, now that I
> think about it, I'm not really sure about invalidating a slot based on
> time rather on disk space, for which we already have a parameter; what's
> your rationale for that?  The passage of time is not a very good
> measure, really, because the amount of WAL being protected has wildly
> varying production rate across time.
>

An inactive slot not only blocks WAL from being removed but also
prevents vacuum from proceeding. Also, there is a risk of transaction
ID wraparound. See email [1] for more context.

> I can only see a timeout being useful as a parameter if its default
> value is not the special disable value; say, the default timeout is 3
> days (to be more precise -- the period from Friday to Monday, that is,
> between DBA leaving the office one week until discovering a problem when
> he returns early next week).  This way we have a built-in mechanism that
> invalidates slots regardless of how big the WAL partition is.
>

We can have a default value for this parameter, but it has the
potential to break replication, so I'm not sure what a good default
value would be.

>
> I'm less sure about the slot parameter; in what situation do you need to
> extend the life of one individual slot further than the life of all the
> other slots?

I was thinking of an idle slot scenario where a slot from one
particular subscriber (or output plugin) is inactive due to some
maintenance activity. But it should be okay to have a GUC for this for
now.

[1] - 
https://www.postgresql.org/message-id/20240325195443.GA2923888%40nathanxps13

-- 
With Regards,
Amit Kapila.




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-26 Thread Amit Kapila
On Tue, Mar 26, 2024 at 1:09 PM Alvaro Herrera  wrote:
>
> On 2024-Mar-26, Amit Kapila wrote:
>
> > We have a consensus on inactive_since, so I'll make that change.
>
> Sounds reasonable.  So this is a timestamptz if the slot is inactive,
> NULL if active, right?
>

Yes.

>  What value is it going to have for sync slots?
>

The behavior will be the same as for non-sync slots. In each sync
cycle, we acquire/release the sync slots, so at the time of release,
inactive_since will be updated. See email [1].

> > I would also like to solicit your opinion on the other slot-level
> > parameter we are planning to introduce.  This new slot-level parameter
> > will be named as inactive_timeout.
>
> Maybe inactivity_timeout?
>
> > This will indicate that once the slot is inactive for the
> > inactive_timeout period, we will invalidate the slot. We are also
> > discussing to have this parameter (inactive_timeout) as GUC [1]. We
> > can have this new parameter both at the slot level and as well as a
> > GUC, or just one of those.
>
> replication_slot_inactivity_timeout?
>

So, it seems you are okay to have this parameter both at slot level
and as a GUC. About names, let us see what others think.

Thanks for the input on the names.


[1] - 
https://www.postgresql.org/message-id/CAA4eK1KrPGwfZV9LYGidjxHeW%2BrxJ%3DE2ThjXvwRGLO%3DiLNuo%3DQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-26 Thread Amit Kapila
On Tue, Mar 26, 2024 at 1:15 PM Bertrand Drouvot
 wrote:
>
> 2 ===
>
> It looks like inactive_since is set to the current timestamp on the standby
> each time the sync worker does a cycle:
>
> primary:
>
> postgres=# select slot_name,inactive_since from pg_replication_slots where 
> failover = 't';
>   slot_name  |inactive_since
> -+---
>  lsub27_slot | 2024-03-26 07:39:19.745517+00
>  lsub28_slot | 2024-03-26 07:40:24.953826+00
>
> standby:
>
> postgres=# select slot_name,inactive_since from pg_replication_slots where 
> failover = 't';
>   slot_name  |inactive_since
> -+---
>  lsub27_slot | 2024-03-26 07:43:56.387324+00
>  lsub28_slot | 2024-03-26 07:43:56.387338+00
>
> I don't think that should be the case.
>

But why? This is exactly what we discussed in another thread where we
agreed to update inactive_since even for sync slots. In each sync
cycle, we acquire/release the slot, so the inactive_since gets
updated. See synchronize_one_slot().
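
For reference, the flow in each sync cycle is roughly (simplified):

	ReplicationSlotAcquire(remote_slot->name, true);
	/* ... update restart_lsn / confirmed_flush / xmins from the remote slot ... */
	ReplicationSlotRelease();	/* inactive_since gets a fresh value here */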

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread Amit Kapila
On Sun, Mar 24, 2024 at 3:05 PM Bharath Rupireddy
 wrote:
>
> I've attached the v18 patch set here. I've also addressed earlier
> review comments from Amit, Ajin Cherian. Note that I've added new
> invalidation mechanism tests in a separate TAP test file just because
> I don't want to clutter or bloat any of the existing files and spread
> tests for physical slots and logical slots into separate existing TAP
> files.
>

Review comments on v18_0002 and v18_0005
===
1.
 ReplicationSlotCreate(const char *name, bool db_specific,
ReplicationSlotPersistency persistency,
-   bool two_phase, bool failover, bool synced)
+   bool two_phase, bool failover, bool synced,
+   int inactive_timeout)
 {
  ReplicationSlot *slot = NULL;
  int i;
@@ -345,6 +348,18 @@ ReplicationSlotCreate(const char *name, bool db_specific,
  errmsg("cannot enable failover for a temporary replication slot"));
  }

+ if (inactive_timeout > 0)
+ {
+ /*
+ * Do not allow users to set inactive_timeout for temporary slots,
+ * because temporary slots will not be saved to the disk.
+ */
+ if (persistency == RS_TEMPORARY)
+ ereport(ERROR,
+ errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+ errmsg("cannot set inactive_timeout for a temporary replication slot"));
+ }

We have decided to update inactive_since for temporary slots. So,
unless there is some reason, we should allow inactive_timeout to also
be set for temporary slots.

2.
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1024,6 +1024,7 @@ CREATE VIEW pg_replication_slots AS
 L.safe_wal_size,
 L.two_phase,
 L.last_inactive_time,
+L.inactive_timeout,

Shall we keep inactive_timeout before
last_inactive_time/inactive_since? I don't have any strong reason to
propose it that way, apart from the fact that the former is provided
by the user.

3.
@@ -287,6 +288,13 @@ pg_get_replication_slots(PG_FUNCTION_ARGS)
  slot_contents = *slot;
  SpinLockRelease(&slot->mutex);

+ /*
+ * Here's an opportunity to invalidate inactive replication slots
+ * based on timeout, so let's do it.
+ */
+ if (InvalidateReplicationSlotForInactiveTimeout(slot, false, true, true))
+ invalidated = true;

I don't think we should try to invalidate the slots in
pg_get_replication_slots(). This function's purpose is to return the
current information on slots; it is not meant to perform any work on
them. Any error due to invalidation won't be what the user expects
here.

4.
+static bool
+InvalidateSlotForInactiveTimeout(ReplicationSlot *slot,
+ bool need_control_lock,
+ bool need_mutex)
{
...
...
+ if (need_control_lock)
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+ Assert(LWLockHeldByMeInMode(ReplicationSlotControlLock, LW_SHARED));
+
+ /*
+ * Check if the slot needs to be invalidated due to inactive_timeout. We
+ * do this with the spinlock held to avoid race conditions -- for example
+ * the restart_lsn could move forward, or the slot could be dropped.
+ */
+ if (need_mutex)
+ SpinLockAcquire(&slot->mutex);
...

I find this combination of parameters a bit strange, because if, say,
need_mutex is false and need_control_lock is true, the function would
acquire the LWLock while the spinlock is already held, which is
unacceptable. Now, this may not happen in practice, as the callers
won't pass such a combination, but still, this should be improved.
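
At the very least, something like the following at the top of the
function would make the contract explicit (a sketch; restructuring the
callers so that the combination cannot arise would be even better):

	/*
	 * need_mutex == false means the caller already holds the slot's
	 * spinlock, and we must never acquire an LWLock while holding a
	 * spinlock, so in that case the caller must also have taken care of
	 * the control lock.
	 */
	Assert(need_mutex || !need_control_lock);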

-- 
With Regards,
Amit Kapila.




Re: Introduce XID age and inactive timeout based replication slot invalidation

2024-03-25 Thread Amit Kapila
On Tue, Mar 26, 2024 at 1:24 AM Nathan Bossart  wrote:
>
>
> On Sun, Mar 24, 2024 at 03:05:44PM +0530, Bharath Rupireddy wrote:
> > This commit particularly lets one specify the inactive_timeout for
> > a slot via SQL functions pg_create_physical_replication_slot and
> > pg_create_logical_replication_slot.
>
> Off-list, Bharath brought to my attention that the current proposal was to
> set the timeout at the slot level.  While I think that is an entirely
> reasonable thing to support, the main use-case I have in mind for this
> feature is for an administrator that wants to prevent inactive slots from
> causing problems (e.g., transaction ID wraparound) on a server or a number
> of servers.  For that use-case, I think a GUC would be much more
> convenient.  Perhaps there could be a default inactive slot timeout GUC
> that would be used in the absence of a slot-level setting.  Thoughts?
>

Yeah, that is a valid point. One of the reasons for keeping it at the
slot level was to allow different subscribers/output plugins to have
different inactive_timeout settings for their respective slots based
on their usage. Now, having it as a GUC also has some valid use cases,
as pointed out by you, but I am not sure both a slot-level setting and
a GUC are required. I was a bit inclined to have it at the slot level
for now, and then, based on field usage reports, we can add a GUC
later as well.

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-03-25 Thread Amit Kapila
On Tue, Mar 26, 2024 at 8:27 AM Euler Taveira  wrote:
>
> On Mon, Mar 25, 2024, at 11:33 PM, Amit Kapila wrote:
>
> On Mon, Mar 25, 2024 at 5:25 PM Peter Eisentraut  wrote:
> >
> > I have committed your version v33.  I did another pass over the
> > identifier and literal quoting.  I added quoting for replication slot
> > names, for example, even though they can only contain a restricted set
> > of characters, but it felt better to be defensive there.
> >
> > I'm happy to entertain follow-up patches on some of the details like
> > option naming that were still being discussed.  I just wanted to get the
> > main functionality in in good time.  We can fine-tune the rest over the
> > next few weeks.
> >
>
> I was looking at prior discussions on this topic to see if there are
> any other open design points apart from this and noticed that the
> points raised/discussed in the email [1] are also not addressed. IIRC,
> the key point we discussed was that after promotion, the existing
> replication objects should be removed (either optionally or always),
> otherwise, it can lead to a new subscriber not being able to restart
> or getting some unwarranted data.
>
>
> See setup_subscriber.
>
> /*
>  * Since the publication was created before the consistent LSN, it is
>  * available on the subscriber when the physical replica is promoted.
>  * Remove publications from the subscriber because it has no use.
>  */
> drop_publication(conn, [I]);
>

This only drops the publications created by this tool, not the
pre-existing ones that we discussed in the link provided.
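
To make the point concrete, what I had in mind is something along
these lines in the new subscriber setup (an untested sketch; the
function name is made up, and whether this runs unconditionally or
behind an option is exactly the open question):

static void
drop_preexisting_subscriptions(PGconn *conn)
{
	PGresult   *res;

	/*
	 * Pre-existing subscriptions are copied from the primary along with
	 * the rest of the catalogs; if left behind, they would start applying
	 * changes again once the new subscriber is up.
	 */
	res = PQexec(conn,
				 "SELECT quote_ident(subname) FROM pg_catalog.pg_subscription");
	if (PQresultStatus(res) != PGRES_TUPLES_OK)
		pg_fatal("could not obtain the list of subscriptions: %s",
				 PQresultErrorMessage(res));

	for (int i = 0; i < PQntuples(res); i++)
	{
		const char *subname = PQgetvalue(res, i, 0);
		char	   *cmds[3];

		/*
		 * Disable the subscription and dissociate it from its replication
		 * slot first; otherwise DROP SUBSCRIPTION would try to connect to
		 * the old publisher and drop the slot there.
		 */
		cmds[0] = psprintf("ALTER SUBSCRIPTION %s DISABLE", subname);
		cmds[1] = psprintf("ALTER SUBSCRIPTION %s SET (slot_name = NONE)", subname);
		cmds[2] = psprintf("DROP SUBSCRIPTION %s", subname);

		for (int j = 0; j < 3; j++)
		{
			PGresult   *res2 = PQexec(conn, cmds[j]);

			if (PQresultStatus(res2) != PGRES_COMMAND_OK)
				pg_fatal("command \"%s\" failed: %s",
						 cmds[j], PQresultErrorMessage(res2));
			PQclear(res2);
			pg_free(cmds[j]);
		}
	}
	PQclear(res);
}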

-- 
With Regards,
Amit Kapila.




Re: speed up a logical replica setup

2024-03-25 Thread Amit Kapila
On Mon, Mar 25, 2024 at 5:25 PM Peter Eisentraut  wrote:
>
> I have committed your version v33.  I did another pass over the
> identifier and literal quoting.  I added quoting for replication slot
> names, for example, even though they can only contain a restricted set
> of characters, but it felt better to be defensive there.
>
> I'm happy to entertain follow-up patches on some of the details like
> option naming that were still being discussed.  I just wanted to get the
> main functionality in in good time.  We can fine-tune the rest over the
> next few weeks.
>

I was looking at prior discussions on this topic to see if there are
any other open design points apart from this and noticed that the
points raised/discussed in the email [1] are also not addressed. IIRC,
the key point we discussed was that after promotion, the existing
replication objects should be removed (either optionally or always),
otherwise, it can lead to a new subscriber not being able to restart
or getting some unwarranted data.

[1] - 
https://www.postgresql.org/message-id/CAExHW5t4ew7ZrgcDdTv7YmuG7LVQT1ZaEny_EvtngHtEBNyjcQ%40mail.gmail.com

-- 
With Regards,
Amit Kapila.




Re: pgsql: Track last_inactive_time in pg_replication_slots.

2024-03-25 Thread Amit Kapila
On Mon, Mar 25, 2024 at 9:55 PM Robert Haas  wrote:
>
> On Mon, Mar 25, 2024 at 12:12 PM Bertrand Drouvot
>  wrote:
>
> > Would "released_time" sounds better? (at the end this is exactly what it 
> > does
> > represent unless for the case where it is restored from disk for which the 
> > meaning
> > would still makes sense to me though). It seems to me that released_time 
> > does not
> > lead to any expectation then removing any confusion.
>
> Yeah, that's not bad. I mean, I don't agree that released_time doesn't
> lead to any expectation, but what it leads me to expect is that you're
> going to tell me the time at which the slot was released. So if it's
> currently active, then I see NULL, because it's not released; but if
> it's inactive, then I see the time at which it became so.
>
> In the same vein, I think deactivated_at or inactive_since might be
> good names to consider. I think they get at the same thing as
> released_time, but they avoid introducing a completely new word
> (release, as opposed to active/inactive).
>

We have a consensus on inactive_since, so I'll make that change. I
would also like to solicit your opinion on the other slot-level
parameter we are planning to introduce.  This new slot-level parameter
will be named as inactive_timeout. This will indicate that once the
slot is inactive for the inactive_timeout period, we will invalidate
the slot. We are also discussing to have this parameter
(inactive_timeout) as GUC [1]. We can have this new parameter both at
the slot level and as well as a GUC, or just one of those.

[1] - 
https://www.postgresql.org/message-id/20240325195443.GA2923888%40nathanxps13

-- 
With Regards,
Amit Kapila.




  1   2   3   4   5   6   7   8   9   10   >