Re: Race conditions in 019_replslot_limit.pl

2022-05-30 Thread Kyotaro Horiguchi
At Mon, 30 May 2022 12:01:55 -0700, Andres Freund wrote in > Hi, > > On 2022-03-27 22:37:34 -0700, Andres Freund wrote: > > On 2022-03-27 17:36:14 -0400, Tom Lane wrote: > > > Andres Freund writes: > > > > I still feel like there's something off here. But that's probably not > > > > enough >

Re: Race conditions in 019_replslot_limit.pl

2022-05-30 Thread Andres Freund
Hi, On 2022-03-27 22:37:34 -0700, Andres Freund wrote: > On 2022-03-27 17:36:14 -0400, Tom Lane wrote: > > Andres Freund writes: > > > I still feel like there's something off here. But that's probably not > > > enough > > > to keep causing failures. I'm inclined to leave the debugging in for a

Re: Race conditions in 019_replslot_limit.pl

2022-03-27 Thread Andres Freund
Hi, On 2022-03-27 17:36:14 -0400, Tom Lane wrote: > Andres Freund writes: > > I still feel like there's something off here. But that's probably not enough > > to keep causing failures. I'm inclined to leave the debugging in for a bit > > longer, but not fail the test anymore? > > WFM. I've

Re: Race conditions in 019_replslot_limit.pl

2022-03-27 Thread Tom Lane
Andres Freund writes: > I still feel like there's something off here. But that's probably not enough > to keep causing failures. I'm inclined to leave the debugging in for a bit > longer, but not fail the test anymore? WFM. > The way the temporary slot removal hangs for a while seems just off:

Re: Race conditions in 019_replslot_limit.pl

2022-03-27 Thread Andres Freund
Hi, On 2022-03-27 15:28:05 -0400, Tom Lane wrote: > Andres Freund writes: > > How about the attached variation, which retries (for 15s, with 100ms sleeps) > > if there are multiple walsenders, printing the debugging info each time? > > It'll > > still fail the test if later iterations find just

Re: Race conditions in 019_replslot_limit.pl

2022-03-27 Thread Tom Lane
Andres Freund writes: > How about the attached variation, which retries (for 15s, with 100ms sleeps) > if there are multiple walsenders, printing the debugging info each time? It'll > still fail the test if later iterations find just one walsender, which seems > the right behaviour for now. We

Re: Race conditions in 019_replslot_limit.pl

2022-03-24 Thread Tom Lane
Andres Freund writes: > How about the attached variation, which retries (for 15s, with 100ms sleeps) > if there are multiple walsenders, printing the debugging info each time? It'll > still fail the test if later iterations find just one walsender, which seems > the right behaviour for now. OK
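The retry shape discussed here, poll for up to 15 s in 100 ms steps, print debugging info on each failed attempt, and only fail if the condition never holds, can be sketched generically. This is an illustrative Python sketch, not the actual Perl TAP code; `poll_until` and the walsender stand-in are invented names:

```python
import time

def poll_until(condition, timeout=15.0, interval=0.1, on_retry=None):
    """Retry `condition` until it returns True or `timeout` elapses.

    Mirrors the retry loop described above: repeated short sleeps, with
    optional debugging output (e.g. a pg_stat_activity dump) per attempt.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if on_retry is not None:
            on_retry()                      # print debugging info each time
        if time.monotonic() >= deadline:
            return False                    # still failing at the end: fail
        time.sleep(interval)

# Stand-in for "the extra walsender eventually goes away".
attempts = {"n": 0}

def one_walsender_left():
    attempts["n"] += 1
    return attempts["n"] >= 3

assert poll_until(one_walsender_left) is True
```

The key property, matching the patch discussed in the thread, is that a transient extra walsender no longer fails the test; only a condition that stays wrong for the whole window does.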

Re: Race conditions in 019_replslot_limit.pl

2022-03-24 Thread Andres Freund
Hi, On 2022-03-24 15:05:40 -0400, Tom Lane wrote: > I just noticed something very interesting: in a couple of recent > buildfarm runs with this failure, the pg_stat_activity printout > no longer shows the extra walsender: Oh. That is interesting. Thanks for catching that. > What this suggests

Re: Race conditions in 019_replslot_limit.pl

2022-03-24 Thread Tom Lane
I just noticed something very interesting: in a couple of recent buildfarm runs with this failure, the pg_stat_activity printout no longer shows the extra walsender: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=serinus&dt=2022-03-24%2017%3A50%3A10

Re: Race conditions in 019_replslot_limit.pl

2022-02-25 Thread Andres Freund
Hi, On 2022-02-25 12:15:58 -0800, Andres Freund wrote: > On 2022-02-25 15:07:01 -0500, Tom Lane wrote: > > Andres Freund writes: > > Hmm. Maybe put a couple more debug messages into ReplicationSlotCleanup > > and/or ReplicationSlotDropPtr? It doesn't seem very clear where in that > > sequence

Re: Race conditions in 019_replslot_limit.pl

2022-02-25 Thread Andres Freund
Hi, On 2022-02-25 15:07:01 -0500, Tom Lane wrote: > Andres Freund writes: > > Seems to suggest something is holding a problematic lock in a way that I do > not understand yet: > > > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check >

Re: Race conditions in 019_replslot_limit.pl

2022-02-25 Thread Tom Lane
Andres Freund writes: > Seems to suggest something is holding a problematic lock in a way that I do > not understand yet: > https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2022-02-23%2013%3A47%3A20&stg=recovery-check > 2022-02-23 09:09:52.299 EST [2022-02-23 09:09:52 EST

Re: Race conditions in 019_replslot_limit.pl

2022-02-25 Thread Andres Freund
Hi, On 2022-02-22 18:06:24 -0800, Andres Freund wrote: > On 2022-02-18 15:14:15 -0800, Andres Freund wrote: > > I'm running out of ideas for how to try to reproduce this. I think we might > > need some additional debugging information to get more information from the > > buildfarm. > > > I'm

Re: Race conditions in 019_replslot_limit.pl

2022-02-22 Thread Andres Freund
Hi, I think I did find a bug related to the test, but afaics not the cause of the test failures we're seeing. See https://www.postgresql.org/message-id/20220223014855.4lsddr464i7mymk2%40alap3.anarazel.de I don't think it's related to the problem of this thread, because the logs of primary3

Re: Race conditions in 019_replslot_limit.pl

2022-02-18 Thread Andres Freund
Hi, On 2022-02-18 18:49:14 -0500, Tom Lane wrote: > Andres Freund writes: > > We'd also need to add pq_endmessage_noblock(), because the pq_endmessage() > > obviously tries to send (as in the backtrace upthread) if the output buffer > > is > > large enough, which it often will be in walsender.

Re: Race conditions in 019_replslot_limit.pl

2022-02-18 Thread Tom Lane
Andres Freund writes: > On 2022-02-18 18:15:21 -0500, Tom Lane wrote: >> Perhaps it'd be sensible to do this only in debugging (ie Assert) >> builds? > That seems not great, because it pretty clearly can lead to hangs, which is > problematic in tests too. What about using pq_flush_if_writable()?

Re: Race conditions in 019_replslot_limit.pl

2022-02-18 Thread Andres Freund
Hi, On 2022-02-18 18:15:21 -0500, Tom Lane wrote: > Andres Freund writes: > > On 2022-02-17 21:55:21 -0800, Andres Freund wrote: > >> Isn't it pretty bonkers that we allow error processing to get stuck behind > >> network traffic, *before* we have released resources (locks etc)? > > It's

Re: Race conditions in 019_replslot_limit.pl

2022-02-18 Thread Tom Lane
Andres Freund writes: > On 2022-02-17 21:55:21 -0800, Andres Freund wrote: >> Isn't it pretty bonkers that we allow error processing to get stuck behind >> network traffic, *before* we have released resources (locks etc)? It's more or less intentional, per elog.c: /* * This flush

Re: Race conditions in 019_replslot_limit.pl

2022-02-18 Thread Andres Freund
Hi, On 2022-02-18 14:42:48 -0800, Andres Freund wrote: > On 2022-02-17 21:55:21 -0800, Andres Freund wrote: > > Isn't it pretty bonkers that we allow error processing to get stuck behind > > network traffic, *before* we have released resources (locks etc)? > > This is particularly likely to

Re: Race conditions in 019_replslot_limit.pl

2022-02-18 Thread Andres Freund
Hi, On 2022-02-17 21:55:21 -0800, Andres Freund wrote: > Isn't it pretty bonkers that we allow error processing to get stuck behind > network traffic, *before* we have released resources (locks etc)? This is particularly likely to be a problem for walsenders, because they often have a large
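The hazard under discussion, an erroring process blocking on network I/O while still holding shared resources, can be made concrete with a toy threading illustration. This is not PostgreSQL code; the lock and the sleep are generic stand-ins for an LWLock and a `send()` blocked on a client that is not reading:

```python
import threading
import time

lock = threading.Lock()
SLOW_SEND = 0.2   # stand-in for flushing a large output buffer to a slow peer

def exit_flush_first():
    # Error path that flushes to the network *before* releasing its lock.
    with lock:
        time.sleep(SLOW_SEND)

def exit_release_first():
    # Error path that releases the lock first, then does the slow send.
    with lock:
        pass
    time.sleep(SLOW_SEND)

def contended_wait(exit_path):
    """How long a second 'backend' waits on the lock during the exit path."""
    t = threading.Thread(target=exit_path)
    t.start()
    time.sleep(0.05)                  # let the exiting process take the lock
    t0 = time.monotonic()
    with lock:                        # another process now needs the lock
        waited = time.monotonic() - t0
    t.join()
    return waited

assert contended_wait(exit_flush_first) > 0.1    # stuck behind the flush
assert contended_wait(exit_release_first) < 0.1  # lock free despite slow send
```

The point matches the walsender case above: the larger the pending output buffer and the slower the client, the longer everything queued behind the still-held resources stalls.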

Re: Race conditions in 019_replslot_limit.pl

2022-02-17 Thread Andres Freund
Hi, On 2022-02-17 19:41:40 -0800, Andres Freund wrote: > At this point the psql "walsender" is *not* terminated, because the fatal > message is waiting behind all the WAL data sent. Which also means that the > slot isn't yet released. Isn't it pretty bonkers that we allow error processing to get

Re: Race conditions in 019_replslot_limit.pl

2022-02-17 Thread Tom Lane
Andres Freund writes: > There've not been any new reports in the last 18 hours, but that's probably > just due to lower commit activity triggering fewer runs. myna just showed the same symptom: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=myna&dt=2022-02-18%2003%3A00%3A17 # Failed

Re: Race conditions in 019_replslot_limit.pl

2022-02-17 Thread Andres Freund
Hi, On 2022-02-17 16:34:34 -0800, Andres Freund wrote: > I've done this for a couple hundred iterations, under load, subsequently with > additional assertions, without being able to reproduce. Playing around with this further I did get into one interesting state: I started psql with

Re: Race conditions in 019_replslot_limit.pl

2022-02-17 Thread Andres Freund
Hi, On 2022-02-16 20:03:00 -0800, Andres Freund wrote: > I'll run the test in a loop, perhaps I can reproduce... I've done this for a couple hundred iterations, under load, subsequently with additional assertions, without being able to reproduce. There've not been any new reports in the last 18

Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Andres Freund
Hi, On 2022-02-16 22:58:19 -0500, Tom Lane wrote: > I wonder though if an exiting walsender would always take that path. You're right - the CFI() calls in PostgresMain() and several replication commands are not at all guaranteed to be in a transaction when throwing a FATAL. I don't quite see where we

Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Andres Freund
Hi, On 2022-02-16 19:46:26 -0800, Andres Freund wrote: > But right now I'm not seeing what prevents us from throwing a FATAL error > while holding an lwlock? Nothing. But we register ShutdownPostgres() / ShutdownAuxiliaryProcess() to take care of that. I checked and walsenders do have

Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Tom Lane
Andres Freund writes: > But right now I'm not seeing what prevents us from throwing a FATAL error > while holding an lwlock? If we're inside a transaction, then ShutdownPostgres -> AbortOutOfAnyTransaction would release LWLocks. I wonder though if an exiting walsender would always take that

Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Andres Freund
Hi, On 2022-02-16 22:11:30 -0500, Tom Lane wrote: > So (a) it broke around 48 hours ago, which is already a useful > bit of info Indeed. 2f6501fa3c54bbe4568e3bcccd9a60d26a46b5ee seems like the obvious commit to blame. We document before_shmem_exit hooks as /* * Call

Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Tom Lane
I wrote: > So (a) it broke around 48 hours ago, which is already a useful > bit of info, and (b) your animals seem far more susceptible than > anyone else's. Why do you suppose that is? Eyeballing the commit log for potential causes in that time frame, I can't help noticing 2 days ago

Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Tom Lane
Andres Freund writes: > On 2022-02-16 20:22:23 -0500, Tom Lane wrote: >> There's no disconnection log entry for either, which I suppose means >> that somebody didn't bother logging disconnection for walsenders ... > The thing is, we actually *do* log disconnection for walsenders: Ah, my

Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Andres Freund
Hi, On 2022-02-16 20:22:23 -0500, Tom Lane wrote: > There's no disconnection log entry for either, which I suppose means > that somebody didn't bother logging disconnection for walsenders ... The thing is, we actually *do* log disconnection for walsenders: 2022-02-16 17:53:55.780 PST

Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Tom Lane
Andres Freund writes: > I think the test is telling us that something may be broken. We shouldn't > silence that without at least some understanding what it is. I looked at the recent failure on komodoensis [1], and I think what is happening is just that the walsender for the basebackup run

Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Andres Freund
Hi, On 2022-02-16 18:04:14 +0900, Masahiko Sawada wrote: > On Wed, Feb 16, 2022 at 3:22 PM Kyotaro Horiguchi > wrote: > > > > At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada > > wrote in > > > Or it's possible that the process took a time to clean up the > > > temporary replication slot? >

Re: Race conditions in 019_replslot_limit.pl

2022-02-16 Thread Masahiko Sawada
On Wed, Feb 16, 2022 at 3:22 PM Kyotaro Horiguchi wrote: > > At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada > wrote in > > Or it's possible that the process took a time to clean up the > > temporary replication slot? > > Checkpointer may take ReplicationSlotControlLock. Dead lock between >

Re: Race conditions in 019_replslot_limit.pl

2022-02-15 Thread Kyotaro Horiguchi
At Wed, 16 Feb 2022 15:22:27 +0900 (JST), Kyotaro Horiguchi wrote in > At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada > wrote in > > Or it's possible that the process took a time to clean up the > > temporary replication slot? > > Checkpointer may take ReplicationSlotControlLock. Dead

Re: Race conditions in 019_replslot_limit.pl

2022-02-15 Thread Kyotaro Horiguchi
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada wrote in > Or it's possible that the process took a time to clean up the > temporary replication slot? Checkpointer may take ReplicationSlotControlLock. Deadlock between ReplicationSlotCleanup and InvalidateObsoleteReplicationSlots happened?
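The deadlock hypothesized here has the classic two-lock, opposite-order shape: one path takes the control lock and then a per-slot lock, while the other takes them the other way around. The Python illustration below is generic, not PostgreSQL internals (the real locks are LWLocks and the lock names are stand-ins); timeouts are used so the demonstration terminates instead of hanging:

```python
import threading
import time

control_lock = threading.Lock()   # stand-in for ReplicationSlotControlLock
slot_lock = threading.Lock()      # stand-in for a per-slot lock

results = []

def cleanup_path():
    # One side: per-slot lock first, then the control lock.
    with slot_lock:
        time.sleep(0.05)
        got = control_lock.acquire(timeout=0.15)
        if got:
            control_lock.release()
        results.append(got)

t = threading.Thread(target=cleanup_path)
t.start()
with control_lock:                # other side: control lock first, then slot
    time.sleep(0.1)
    got = slot_lock.acquire(timeout=0.4)
    if got:
        slot_lock.release()
    results.append(got)
t.join()

# One side's acquisition times out: without the timeout, both would wait
# on each other forever, which is exactly the hypothesized hang.
assert False in results
```

In the real server an LWLock wait has no such timeout, which is why this lock-order inversion, if it occurred, would look like a process silently stuck during slot cleanup.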

Re: Race conditions in 019_replslot_limit.pl

2022-02-15 Thread Kyotaro Horiguchi
At Wed, 16 Feb 2022 14:58:23 +0900, Masahiko Sawada wrote in > On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi > wrote: > > > > At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund > > wrote in > > > I guess it's conceivable that the backend was still working through > > > process > > >

Re: Race conditions in 019_replslot_limit.pl

2022-02-15 Thread Kyotaro Horiguchi
At Wed, 16 Feb 2022 14:26:37 +0900 (JST), Kyotaro Horiguchi wrote in > Agreed. Doing this at all slot creations seems fine. Done in the attached. The first slot is deliberately created unreserved so I changed the code to re-create the slot with "reserved" before taking backup. > > Even

Re: Race conditions in 019_replslot_limit.pl

2022-02-15 Thread Masahiko Sawada
On Wed, Feb 16, 2022 at 2:26 PM Kyotaro Horiguchi wrote: > > At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund wrote > in > > I think what happened is that there was no WAL to receive between the start > > of > > the primary and the $node_primary3->wait_for_catchup($node_standby3); > > > >

Re: Race conditions in 019_replslot_limit.pl

2022-02-15 Thread Kyotaro Horiguchi
At Tue, 15 Feb 2022 15:51:57 -0800, Andres Freund wrote in > I think what happened is that there was no WAL to receive between the start of > the primary and the $node_primary3->wait_for_catchup($node_standby3); > > Because the slot is created without reserving WAL that allows the primary to >
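The race described here comes down to WAL retention: a slot created without reserving WAL pins nothing until the standby connects, so a checkpoint in that window can recycle segments the standby will need. A toy model of that retention rule (assumptions only, not PostgreSQL internals; in SQL the knob is the second, `immediately_reserve`, argument of `pg_create_physical_replication_slot()`):

```python
class Slot:
    """Toy replication slot: pins a WAL position, or nothing at all."""
    def __init__(self, reserve_now, current_lsn):
        # With immediate reservation the slot pins the position at creation;
        # otherwise it pins nothing until the standby first connects.
        self.restart_lsn = current_lsn if reserve_now else None

def oldest_needed(slots, current_lsn):
    """Oldest position a checkpoint must keep WAL for."""
    pinned = [s.restart_lsn for s in slots if s.restart_lsn is not None]
    return min(pinned, default=current_lsn)

lsn = 100
reserved = Slot(reserve_now=True, current_lsn=lsn)
unreserved = Slot(reserve_now=False, current_lsn=lsn)

lsn = 200  # the primary writes WAL and checkpoints before any standby connects
assert oldest_needed([reserved], lsn) == 100    # old WAL still pinned
assert oldest_needed([unreserved], lsn) == 200  # old WAL free to recycle
```

This mirrors the test fix discussed in the adjacent messages: creating the slots reserved closes the window in which the primary may legitimately discard the WAL the test expects the standby to fetch.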

Re: Race conditions in 019_replslot_limit.pl

2022-02-15 Thread Andres Freund
Hi, On 2022-02-15 23:29:20 +0200, Heikki Linnakangas wrote: > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=desmoxytes&dt=2022-02-14%2006%3A30%3A04 > > [07:42:23] t/018_wal_optimize.pl ok 12403 ms ( 0.00 usr > 0.00 sys + 1.40 cusr 0.63 csys = 2.03 CPU) > #

Race conditions in 019_replslot_limit.pl

2022-02-15 Thread Heikki Linnakangas
While looking at recent failures in the new 028_pitr_timelines.pl recovery test, I noticed that there have been a few failures in the buildfarm in the recoveryCheck phase even before that, in the 019_replslot_limit.pl test. For example: