Re: Autovacuum worker doesn't immediately exit on postmaster death

2021-03-30 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Stephen Frost (sfr...@snowman.net) wrote:
> > * Michael Paquier (mich...@paquier.xyz) wrote:
> > > On Mon, Mar 22, 2021 at 04:07:12PM -0400, Robert Haas wrote:
> > > > On Mon, Mar 22, 2021 at 1:48 PM Stephen Frost wrote:
> > > >> Thanks for that.  Attached is just a rebased version with a commit
> > > >> message added.  If there aren't any other concerns, I'll commit this in
> > > >> the next few days and back-patch it.  When it comes to 12 and older,
> > > >> does anyone want to opine about the wait event to use?  I was thinking
> > > >> PG_WAIT_TIMEOUT or WAIT_EVENT_PG_SLEEP ...
> > > > 
> > > > I'm not sure if we should back-patch this, but I think if you do you
> > > > should just add a wait event, rather than using a generic one.
> > > 
> > > I would not back-patch that either, as this is an improvement of the
> > > current state.  I agree that this had better introduce a new wait
> > > event.  Even if this stuff gets backpatched, you won't introduce an
> > > ABI incompatibility with a new event as long as you add the new event
> > > at the end of the existing enum lists, but let's keep the wait events
> > > ordered on HEAD.
> > 
> > Adding CFI's in places that really should have them is something we
> > certainly have back-patched in the past, and that's just 'an improvement
> > of the current state' too, so I don't quite follow the argument being
> > made here that this shouldn't be back-patched.
> > 
> > I don't have any problem with adding into the older releases, at the end
> > of the existing lists, the same wait event that exists in 13+ for this
> > already.
> > 
> > Any other thoughts on this, particularly about back-patching or not..?
> 
> We seem to be at a bit of an impasse on this regarding back-patching,
> which seems unfortunate to me, but without someone else commenting it
> seems like it's stalled.
> 
> I'll go ahead and push the change to HEAD soon, as there doesn't seem to
> be any contention regarding that.

Done.

Thanks!

Stephen




Re: Autovacuum worker doesn't immediately exit on postmaster death

2021-03-28 Thread Stephen Frost
Greetings,

* Stephen Frost (sfr...@snowman.net) wrote:
> * Michael Paquier (mich...@paquier.xyz) wrote:
> > On Mon, Mar 22, 2021 at 04:07:12PM -0400, Robert Haas wrote:
> > > On Mon, Mar 22, 2021 at 1:48 PM Stephen Frost  wrote:
> > >> Thanks for that.  Attached is just a rebased version with a commit
> > >> message added.  If there aren't any other concerns, I'll commit this in
> > >> the next few days and back-patch it.  When it comes to 12 and older,
> > >> does anyone want to opine about the wait event to use?  I was thinking
> > >> PG_WAIT_TIMEOUT or WAIT_EVENT_PG_SLEEP ...
> > > 
> > > I'm not sure if we should back-patch this, but I think if you do you
> > > should just add a wait event, rather than using a generic one.
> > 
> > I would not back-patch that either, as this is an improvement of the
> > current state.  I agree that this had better introduce a new wait
> > event.  Even if this stuff gets backpatched, you won't introduce an
> > ABI incompatibility with a new event as long as you add the new event
> > at the end of the existing enum lists, but let's keep the wait events
> > ordered on HEAD.
> 
> Adding CFI's in places that really should have them is something we
> certainly have back-patched in the past, and that's just 'an improvement
> of the current state' too, so I don't quite follow the argument being
> made here that this shouldn't be back-patched.
> 
> I don't have any problem with adding into the older releases, at the end
> of the existing lists, the same wait event that exists in 13+ for this
> already.
> 
> Any other thoughts on this, particularly about back-patching or not..?

We seem to be at a bit of an impasse on this regarding back-patching,
which seems unfortunate to me, but without someone else commenting it
seems like it's stalled.

I'll go ahead and push the change to HEAD soon, as there doesn't seem to
be any contention regarding that.

Thanks,

Stephen




Re: Autovacuum worker doesn't immediately exit on postmaster death

2021-03-24 Thread Stephen Frost
Greetings,

* Michael Paquier (mich...@paquier.xyz) wrote:
> On Mon, Mar 22, 2021 at 04:07:12PM -0400, Robert Haas wrote:
> > On Mon, Mar 22, 2021 at 1:48 PM Stephen Frost  wrote:
> >> Thanks for that.  Attached is just a rebased version with a commit
> >> message added.  If there aren't any other concerns, I'll commit this in
> >> the next few days and back-patch it.  When it comes to 12 and older,
> >> does anyone want to opine about the wait event to use?  I was thinking
> >> PG_WAIT_TIMEOUT or WAIT_EVENT_PG_SLEEP ...
> > 
> > I'm not sure if we should back-patch this, but I think if you do you
> > should just add a wait event, rather than using a generic one.
> 
> I would not back-patch that either, as this is an improvement of the
> current state.  I agree that this had better introduce a new wait
> event.  Even if this stuff gets backpatched, you won't introduce an
> ABI incompatibility with a new event as long as you add the new event
> at the end of the existing enum lists, but let's keep the wait events
> ordered on HEAD.

Adding CFI's in places that really should have them is something we
certainly have back-patched in the past, and that's just 'an improvement
of the current state' too, so I don't quite follow the argument being
made here that this shouldn't be back-patched.

I don't have any problem with adding into the older releases, at the end
of the existing lists, the same wait event that exists in 13+ for this
already.

Any other thoughts on this, particularly about back-patching or not..?

Thanks,

Stephen




Re: Autovacuum worker doesn't immediately exit on postmaster death

2021-03-23 Thread Michael Paquier
On Mon, Mar 22, 2021 at 04:07:12PM -0400, Robert Haas wrote:
> On Mon, Mar 22, 2021 at 1:48 PM Stephen Frost  wrote:
>> Thanks for that.  Attached is just a rebased version with a commit
>> message added.  If there aren't any other concerns, I'll commit this in
>> the next few days and back-patch it.  When it comes to 12 and older,
>> does anyone want to opine about the wait event to use?  I was thinking
>> PG_WAIT_TIMEOUT or WAIT_EVENT_PG_SLEEP ...
> 
> I'm not sure if we should back-patch this, but I think if you do you
> should just add a wait event, rather than using a generic one.

I would not back-patch that either, as this is an improvement of the
current state.  I agree that this had better introduce a new wait
event.  Even if this stuff gets backpatched, you won't introduce an
ABI incompatibility with a new event as long as you add the new event
at the end of the existing enum lists, but let's keep the wait events
ordered on HEAD.
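
To illustrate the ABI point, here is a minimal sketch. The enum and member names are hypothetical and are not the real wait-event lists; the sketch only shows that appending a member leaves every existing value unchanged, while inserting it in sorted position renumbers the members that follow.

```c
#include <assert.h>

/* Hypothetical list as compiled into an already released branch. */
enum released_events
{
    REL_EVENT_BACKUP,           /* 0 */
    REL_EVENT_SLEEP             /* 1 */
};

/* Back-branch style: the new member is appended at the end. */
enum appended_events
{
    APP_EVENT_BACKUP,           /* still 0 */
    APP_EVENT_SLEEP,            /* still 1 */
    APP_EVENT_DELAY             /* new value 2, nothing else moved */
};

/* HEAD style: the list stays sorted, so later members are renumbered. */
enum ordered_events
{
    ORD_EVENT_BACKUP,           /* 0 */
    ORD_EVENT_DELAY,            /* new member takes 1 */
    ORD_EVENT_SLEEP             /* shifted from 1 to 2 */
};
```

Code compiled against the released list keeps seeing the same numbers with the appended layout, which is why appending is the safe choice for a back-patch, while the sorted layout is only acceptable where everything is rebuilt anyway.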
--
Michael




Re: Autovacuum worker doesn't immediately exit on postmaster death

2021-03-22 Thread Robert Haas
On Mon, Mar 22, 2021 at 1:48 PM Stephen Frost  wrote:
> Thanks for that.  Attached is just a rebased version with a commit
> message added.  If there aren't any other concerns, I'll commit this in
> the next few days and back-patch it.  When it comes to 12 and older,
> does anyone want to opine about the wait event to use?  I was thinking
> PG_WAIT_TIMEOUT or WAIT_EVENT_PG_SLEEP ...

I'm not sure if we should back-patch this, but I think if you do you
should just add a wait event, rather than using a generic one.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Autovacuum worker doesn't immediately exit on postmaster death

2021-03-22 Thread Stephen Frost
Greetings,

* Thomas Munro (thomas.mu...@gmail.com) wrote:
> On Fri, Dec 11, 2020 at 7:57 AM Stephen Frost  wrote:
> > * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > > The if-we're-going-to-delay-anyway path in vacuum_delay_point seems
> > > OK to add a touch more overhead to, though.
> >
> > Alright, for this part at least, seems like it'd be something like the
> > attached.
> >
> > Only lightly tested, but does seem to address the specific example which
> > was brought up on this thread.
> >
> > Thoughts..?
> 
> +1

Thanks for that.  Attached is just a rebased version with a commit
message added.  If there aren't any other concerns, I'll commit this in
the next few days and back-patch it.  When it comes to 12 and older,
does anyone want to opine about the wait event to use?  I was thinking
PG_WAIT_TIMEOUT or WAIT_EVENT_PG_SLEEP ...

Or do folks think this shouldn't be backpatched?  That would mean it
wouldn't help anyone for years, which would be pretty unfortunate, hence
my feeling that it's worthwhile to backpatch.

Thanks!

Stephen
From 9daf52b78d106c86e038dcefdb1d8345d22b9756 Mon Sep 17 00:00:00 2001
From: Stephen Frost 
Date: Mon, 22 Mar 2021 13:25:57 -0400
Subject: [PATCH] Use a WaitLatch for vacuum/autovacuum sleeping

Instead of using pg_usleep() in vacuum_delay_point(), use a WaitLatch.
This has the advantage that we will realize if the postmaster has been
killed since the last time we decided to sleep while vacuuming.

Discussion: https://postgr.es/m/CAFh8B=kcdk8k-Y21RfXPu5dX=bgPqJ8TC3p_qxR_ygdBS=j...@mail.gmail.com
Backpatch: 9.6-
---
 src/backend/commands/vacuum.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index c064352e23..662aff04b4 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2080,9 +2080,11 @@ vacuum_delay_point(void)
 		if (msec > VacuumCostDelay * 4)
 			msec = VacuumCostDelay * 4;
 
-		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
-		pg_usleep((long) (msec * 1000));
-		pgstat_report_wait_end();
+		(void) WaitLatch(MyLatch,
+		 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+		 msec,
+		 WAIT_EVENT_VACUUM_DELAY);
+		ResetLatch(MyLatch);
 
 		VacuumCostBalance = 0;
 
-- 
2.27.0





Re: Autovacuum worker doesn't immediately exit on postmaster death

2021-02-22 Thread Thomas Munro
On Fri, Dec 11, 2020 at 7:57 AM Stephen Frost  wrote:
> * Tom Lane (t...@sss.pgh.pa.us) wrote:
> > The if-we're-going-to-delay-anyway path in vacuum_delay_point seems
> > OK to add a touch more overhead to, though.
>
> Alright, for this part at least, seems like it'd be something like the
> attached.
>
> Only lightly tested, but does seem to address the specific example which
> was brought up on this thread.
>
> Thoughts..?

+1




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-12-10 Thread Thomas Munro
On Fri, Dec 11, 2020 at 8:34 AM Robert Haas  wrote:
> On Thu, Oct 29, 2020 at 5:36 PM Alvaro Herrera wrote:
> > Maybe instead of thinking specifically in terms of vacuum, we could
> > count buffer accesses (read from kernel) and check the latch once every
> > 1000th such, or something like that.  Then a very long query doesn't
> > have to wait until it's run to completion.  The cost is one integer
> > addition per syscall, which should be bearable.
>
> Interesting idea. One related case is where everything is fine on the
> server side but the client has disconnected and we don't notice that
> the socket has changed state until something makes us try to send a
> message to the client, which might be a really long time if the
> server's doing like a lengthy computation before generating any rows.
> It would be really nice if we could find a cheap way to check for both
> postmaster death and client disconnect every now and then, like if a
> single system call could somehow answer both questions.

For the record, an alternative approach was proposed[1] that
periodically checks for disconnected sockets using a timer, that will
then cause the next CFI() to abort.

Doing the check (a syscall) based on elapsed time rather than every
nth CFI() or buffer access or whatever seems better in some ways,
considering the difficulty of knowing what the frequency will be.  One
of the objections was that it added unacceptable setitimer() calls.
We discussed an idea to solve that problem generally, and then later I
prototyped that idea in another thread[2] about idle session timeouts
(not sure about that yet, comments welcome).

I've also wondered about checking postmaster_possibly_dead in CFI() on
platforms where we have it (and working to increase that set of
platforms), instead of just reacting to PM death when sleeping.   But
it seems like the real problem in this specific case is the use of
pg_usleep() where WaitLatch() should be used, no?

The recovery loop is at the opposite end of the spectrum: while vacuum
doesn't check for postmaster death often enough, the recovery loop
checks potentially hundreds of thousands or millions of times per
second, which sucks on systems that don't have parent-death signals
and slows down recovery quite measurably.  In the course of the
discussion about fixing that[3] we spotted other places that are using
a pg_usleep() where they ought to be using WaitLatch() (which comes
with exit-on-PM-death behaviour built-in).  By the way, the patch in
that thread does almost what Robert described, namely check for PM
death every nth time (which in this case means every nth WAL record),
except it's not in the main CFI(), it's in a special variant used just
for recovery.
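
For context on "parent-death signals", a minimal sketch of how a child can ask for one on Linux via `prctl(PR_SET_PDEATHSIG)`. The helper names are hypothetical, and on other platforms this sketch simply reports failure, which is the case where polling remains the only option.

```c
#include <assert.h>
#include <signal.h>
#ifdef __linux__
#include <sys/prctl.h>
#endif

/* Returns 1 where this sketch knows how to request the signal. */
static int
parent_death_signal_supported(void)
{
#ifdef __linux__
    return 1;
#else
    return 0;
#endif
}

/*
 * Hypothetical helper: ask the kernel to deliver signo to this process
 * when its parent exits. Returns 0 on success, -1 if unsupported or on
 * failure.
 */
static int
request_parent_death_signal(int signo)
{
#ifdef __linux__
    return prctl(PR_SET_PDEATHSIG, (unsigned long) signo);
#else
    (void) signo;
    return -1;
#endif
}
```

With such a signal in place, a handler can set a flag and the process no longer needs a syscall per check; without it, every check means reading the death pipe.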

[1] 
https://www.postgresql.org/message-id/flat/77def86b27e41f0efcba411460e929ae%40postgrespro.ru
[2] 
https://www.postgresql.org/message-id/flat/763a0689-f189-459e-946f-f0ec44589...@hotmail.com
[3] 
https://www.postgresql.org/message-id/flat/CA+hUKGK1607VmtrDUHQXrsooU=ap4g4r2yaobywooa3m8xe...@mail.gmail.com




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-12-10 Thread Robert Haas
On Thu, Oct 29, 2020 at 5:36 PM Alvaro Herrera  wrote:
> Maybe instead of thinking specifically in terms of vacuum, we could
> count buffer accesses (read from kernel) and check the latch once every
> 1000th such, or something like that.  Then a very long query doesn't
> have to wait until it's run to completion.  The cost is one integer
> addition per syscall, which should be bearable.

Interesting idea. One related case is where everything is fine on the
server side but the client has disconnected and we don't notice that
the socket has changed state until something makes us try to send a
message to the client, which might be a really long time if the
server's doing like a lengthy computation before generating any rows.
It would be really nice if we could find a cheap way to check for both
postmaster death and client disconnect every now and then, like if a
single system call could somehow answer both questions.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-12-10 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Alvaro Herrera  writes:
> > On 2020-Oct-29, Stephen Frost wrote:
> >> I do think it'd be good to find a way to check every once in a while
> >> even when we aren't going to delay though.  Not sure what the best
> >> answer there is.
> 
> > Maybe instead of thinking specifically in terms of vacuum, we could
> > count buffer accesses (read from kernel) and check the latch once every
> > 1000th such, or something like that.  Then a very long query doesn't
> > have to wait until it's run to completion.  The cost is one integer
> > addition per syscall, which should be bearable.
> 
> I'm kind of unwilling to add any syscalls at all to normal execution
> code paths for this purpose.  People shouldn't be sig-kill'ing the
> postmaster, or if they do, cleaning up the mess is their responsibility.
> I'd also suggest that adding nearly-untestable code paths for this
> purpose is a fine way to add bugs we'll never catch.
> 
> The if-we're-going-to-delay-anyway path in vacuum_delay_point seems
> OK to add a touch more overhead to, though.

Alright, for this part at least, seems like it'd be something like the
attached.

Only lightly tested, but does seem to address the specific example which
was brought up on this thread.

Thoughts..?

Thanks,

Stephen
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index 98270a1049..c90a4edb98 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -2069,9 +2069,11 @@ vacuum_delay_point(void)
 		if (msec > VacuumCostDelay * 4)
 			msec = VacuumCostDelay * 4;
 
-		pgstat_report_wait_start(WAIT_EVENT_VACUUM_DELAY);
-		pg_usleep((long) (msec * 1000));
-		pgstat_report_wait_end();
+		(void) WaitLatch(MyLatch,
+		 WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
+		 msec,
+		 WAIT_EVENT_VACUUM_DELAY);
+		ResetLatch(MyLatch);
 
 		VacuumCostBalance = 0;
 




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-30 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Alvaro Herrera  writes:
> > On 2020-Oct-29, Stephen Frost wrote:
> >> I do think it'd be good to find a way to check every once in a while
> >> even when we aren't going to delay though.  Not sure what the best
> >> answer there is.
> 
> > Maybe instead of thinking specifically in terms of vacuum, we could
> > count buffer accesses (read from kernel) and check the latch once every
> > 1000th such, or something like that.  Then a very long query doesn't
> > have to wait until it's run to completion.  The cost is one integer
> > addition per syscall, which should be bearable.
> 
> I'm kind of unwilling to add any syscalls at all to normal execution
> code paths for this purpose.  People shouldn't be sig-kill'ing the
> postmaster, or if they do, cleaning up the mess is their responsibility.
> I'd also suggest that adding nearly-untestable code paths for this
> purpose is a fine way to add bugs we'll never catch.

Not sure if either is at all viable, but I had a couple of thoughts
about other ways to possibly address this.

The first simplistic idea is this- we have lots of processes that pick
up pretty quickly on the postmaster going away due to checking if it's
still around while waiting for something else to happen anyway (like the
autovacuum launcher...), and we have CFI's in a lot of places where it's
reasonable to do a CFI but isn't alright to check for postmaster death.
While it'd be better if there were more platforms where parent death
would send a signal to the children, that doesn't seem to be coming any
time soon- so why don't we do it ourselves?  That is, when we discover
that the postmaster has died, scan through the proc array (carefully,
since it could be garbage, but all we're looking for are the PIDs of
anything that might still be around) and try sending a signal to any
processes that are left?  Those signals would hopefully get delivered
and the other backends would discover the signal through CFI and exit
reasonably quickly.
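
A minimal sketch of that first idea, assuming the PIDs have already been copied out of shared memory into a plain array. The function name is hypothetical; entries that are zero or negative are skipped because `kill()` would treat them as process-group targets, which matters if the array may contain garbage.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: send signo to every plausible PID in pids[].
 * Returns the number of processes the signal was successfully sent to.
 */
static int
signal_survivors(const pid_t *pids, int npids, int signo)
{
    int         sent = 0;
    int         i;

    for (i = 0; i < npids; i++)
    {
        if (pids[i] <= 0)
            continue;           /* garbage or unused slot */
        if (kill(pids[i], signo) == 0)
            sent++;             /* failures (e.g. ESRCH) are ignored */
    }
    return sent;
}
```

A caller could first pass signal 0 to probe which processes still exist, then a real signal such as SIGQUIT; which signal is appropriate is left open here, as it is in the message above.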

The other thought I had was around trying to check for postmaster death
when we're about to do some I/O, which would probably catch a large
number of these cases too though technically some process might stick
around for a while if it's only dealing with things that are already in
shared buffers, I suppose.  Also seems complicated and expensive to do.

> The if-we're-going-to-delay-anyway path in vacuum_delay_point seems
> OK to add a touch more overhead to, though.

Yeah, this certainly seems reasonable to do too and on a well run system
would likely be enough 90+% of the time.

Thanks,

Stephen




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Tom Lane
Alvaro Herrera  writes:
> On 2020-Oct-29, Stephen Frost wrote:
>> I do think it'd be good to find a way to check every once in a while
>> even when we aren't going to delay though.  Not sure what the best
>> answer there is.

> Maybe instead of thinking specifically in terms of vacuum, we could
> count buffer accesses (read from kernel) and check the latch once every
> 1000th such, or something like that.  Then a very long query doesn't
> have to wait until it's run to completion.  The cost is one integer
> addition per syscall, which should be bearable.

I'm kind of unwilling to add any syscalls at all to normal execution
code paths for this purpose.  People shouldn't be sig-kill'ing the
postmaster, or if they do, cleaning up the mess is their responsibility.
I'd also suggest that adding nearly-untestable code paths for this
purpose is a fine way to add bugs we'll never catch.

The if-we're-going-to-delay-anyway path in vacuum_delay_point seems
OK to add a touch more overhead to, though.

regards, tom lane




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Alvaro Herrera
On 2020-Oct-29, Stephen Frost wrote:

> I do think it'd be good to find a way to check every once in a while
> even when we aren't going to delay though.  Not sure what the best
> answer there is.

Maybe instead of thinking specifically in terms of vacuum, we could
count buffer accesses (read from kernel) and check the latch once every
1000th such, or something like that.  Then a very long query doesn't
have to wait until it's run to completion.  The cost is one integer
addition per syscall, which should be bearable.

(This doesn't help with a query that's running arbitrarily outside of
Postgres, or doing something that doesn't access disk -- but it'd help
with a majority of problem cases.)
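
A minimal sketch of the counting idea, with hypothetical names throughout. `expensive_check()` stands in for whatever the real check would be (a latch test or a read on the death pipe); here it only counts invocations so the behaviour is visible.

```c
#include <assert.h>

#define CHECK_INTERVAL 1000     /* the "every 1000th" from the proposal */

static unsigned int access_count = 0;
static unsigned int checks_done = 0;

/* Placeholder for the costly liveness check. */
static void
expensive_check(void)
{
    checks_done++;
}

/* Call once per buffer access: one increment and one compare. */
static void
count_buffer_access(void)
{
    if (++access_count >= CHECK_INTERVAL)
    {
        access_count = 0;
        expensive_check();
    }
}
```

After 2500 calls to `count_buffer_access()` the check has run twice, so the overhead per access stays at the integer addition mentioned above.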




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Stephen Frost
Greetings,

* Andres Freund (and...@anarazel.de) wrote:
> On 2020-10-29 12:27:53 -0400, Tom Lane wrote:
> > Maybe put a check into vacuum_delay_point, and poll the pipe when we're
> > about to sleep anyway?
> 
> Perhaps we should just replace the pg_usleep() with a latch wait?

I'm not sure why, but I had the thought that we already had done that,
and was a bit surprised that it wasn't that way, so +1 from my part.

I do think it'd be good to find a way to check every once in a while
even when we aren't going to delay though.  Not sure what the best
answer there is.

Thanks,

Stephen




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Andres Freund
Hi,

On 2020-10-29 12:27:53 -0400, Tom Lane wrote:
> Maybe put a check into vacuum_delay_point, and poll the pipe when we're
> about to sleep anyway?

Perhaps we should just replace the pg_usleep() with a latch wait?
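
To show the difference from a plain sleep, here is a minimal sketch that uses `poll()` on a death pipe as a stand-in for a latch wait. It is not the latch implementation: the function name is hypothetical, it does not wake on a latch being set, and it returns a status code where a real wait with exit-on-death semantics would terminate the process.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

/*
 * Hypothetical helper: sleep for up to timeout_ms, but return early if
 * death_fd (the read end of a pipe whose write end is held only by the
 * parent) reports EOF or hangup.
 * Returns 1 if the parent appears to be gone, 0 if the timeout elapsed,
 * -1 if poll() failed (including interruption by a signal).
 */
static int
sleep_unless_parent_dies(int death_fd, int timeout_ms)
{
    struct pollfd pfd;
    int         rc;

    pfd.fd = death_fd;
    pfd.events = POLLIN;
    pfd.revents = 0;

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0)
        return -1;
    if (rc == 0)
        return 0;               /* slept the full interval */
    return (pfd.revents & (POLLIN | POLLHUP | POLLERR)) ? 1 : 0;
}
```

The sleep still costs one syscall, as it did before, but that syscall now also notices the parent going away instead of sleeping through it.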

Greetings,

Andres Freund




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Alvaro Herrera
On 2020-Oct-28, Alexander Kukushkin wrote:

> Hello,
> 
> I know, nobody in their right mind should do that, but if the postmaster
> process is killed with SIGKILL, most backend processes correctly notice
> that the postmaster is gone and exit.  There is one exception, though:
> autovacuum worker processes keep running until they eventually finish
> and exit.

So, if you have a manual vacuum running on the table (with
vacuum_cost_delay=0) and kill -KILL the postmaster, that one also
lingers arbitrarily long afterwards?

(I suppose the problem is not as obvious just because the vacuum
wouldn't run as long, because of no vacuum cost delay; but it'd still be
a problem if you made the table bigger.)




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Alvaro Herrera
On 2020-Oct-29, Stephen Frost wrote:

> > It's hard to do better than that, because on most platforms there's
> > no way to get a signal on parent-process death, so the only way to
> > notice would be to poll the postmaster-death pipe constantly; which
> > would be hugely expensive in comparison to the value.
> 
> I agree that 'constantly' wouldn't be great, but with some periodicity
> that's more frequent than 'not until a few hours later when we finally
> finish vacuuming this relation' would be nice.  At least with autovacuum
> we may be periodically sleeping anyway so it doesn't seem like polling
> at that point would really be terrible, though it'd be nice to check
> every once in a while even if we aren't sleeping.

vacuum_delay_point seems an obvious candidate, as soon as we've
determined that the sleep interval is > 0; since we're going to sleep,
the cost of a syscall seems negligible.  I'm not sure what to suggest
for vacuums that don't have vacuum costing active, though.
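
A minimal sketch of the "is the sleep interval > 0" decision, under stated assumptions: the parameter names mirror the cost-delay settings, the clamp to four times the delay matches the context lines of the patch posted in this thread, and the proportional formula and the early-return conditions are assumptions of this sketch rather than a copy of vacuum.c.

```c
#include <assert.h>

/*
 * Hypothetical helper: how long to sleep at a delay point, in
 * milliseconds. Returns 0 when no sleep is due, which is the case where
 * no death check would piggyback on the sleep.
 */
static double
vacuum_sleep_msec(double cost_delay, int cost_balance, int cost_limit)
{
    double      msec;

    /* Costing disabled, or the cost budget is not used up yet. */
    if (cost_delay <= 0 || cost_limit <= 0 || cost_balance < cost_limit)
        return 0;

    /* Sleep in proportion to the overshoot, capped at 4x the delay. */
    msec = cost_delay * cost_balance / cost_limit;
    if (msec > cost_delay * 4)
        msec = cost_delay * 4;
    return msec;
}
```

With a delay of 2 ms and a limit of 200, a balance of 400 gives a 4 ms sleep and a balance of 2000 is capped at 8 ms; with the delay set to 0 the function never returns a sleep, which is exactly the manual-vacuum case left unsolved above.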




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Tom Lane
Stephen Frost  writes:
> I agree that 'constantly' wouldn't be great, but with some periodicity
> that's more frequent than 'not until a few hours later when we finally
> finish vacuuming this relation' would be nice.  At least with autovacuum
> we may be periodically sleeping anyway so it doesn't seem like polling
> at that point would really be terrible, though it'd be nice to check
> every once in a while even if we aren't sleeping.

Maybe put a check into vacuum_delay_point, and poll the pipe when we're
about to sleep anyway?  That wouldn't fix anything except autovacuum,
but if you're right that that's a primary pain point then it'd help.

regards, tom lane




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-29 Thread Stephen Frost
Greetings,

* Tom Lane (t...@sss.pgh.pa.us) wrote:
> Victor Yegorov  writes:
> > Wed, 28 Oct 2020 at 19:44, Alexander Kukushkin:
> >> I know, nobody in their right mind should do that, but if the postmaster
> >> process is killed with SIGKILL, most backend processes correctly notice
> >> that the postmaster is gone and exit.  There is one exception, though:
> >> autovacuum worker processes keep running until they eventually finish
> >> and exit.
> 
> > Do you get the same behaviour also on master?
> > As there was some work in this area for 14, see
> > https://git.postgresql.org/pg/commitdiff/44fc6e259b
> 
> That was about SIGQUIT response, which isn't really related to this
> scenario.  But I do not think Alexander has accurately characterized
> the situation.  *No* server processes will react instantly to postmaster
> death.  Typically they'll only detect it while waiting for some other
> condition, such as client input, or in some cases while iterating their
> outermost loop.  So if they're busy with calculations they might not
> notice for a long time.  I don't think autovacuum is any worse than
> a busy client backend on this score.

Considering how long an autovacuum can run, it seems like it'd be
worthwhile to find a useful place to check for postmaster-death.
Typical well-running systems are going to be waiting for the client
pretty frequently and therefore this does make autovacuum stick out in
this case.

> It's hard to do better than that, because on most platforms there's
> no way to get a signal on parent-process death, so the only way to
> notice would be to poll the postmaster-death pipe constantly; which
> would be hugely expensive in comparison to the value.

I agree that 'constantly' wouldn't be great, but with some periodicity
that's more frequent than 'not until a few hours later when we finally
finish vacuuming this relation' would be nice.  At least with autovacuum
we may be periodically sleeping anyway so it doesn't seem like polling
at that point would really be terrible, though it'd be nice to check
every once in a while even if we aren't sleeping.

Thanks,

Stephen




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-28 Thread Tom Lane
Victor Yegorov  writes:
> Wed, 28 Oct 2020 at 19:44, Alexander Kukushkin:
>> I know, nobody in their right mind should do that, but if the postmaster
>> process is killed with SIGKILL, most backend processes correctly notice
>> that the postmaster is gone and exit.  There is one exception, though:
>> autovacuum worker processes keep running until they eventually finish
>> and exit.

> Do you get the same behaviour also on master?
> As there was some work in this area for 14, see
> https://git.postgresql.org/pg/commitdiff/44fc6e259b

That was about SIGQUIT response, which isn't really related to this
scenario.  But I do not think Alexander has accurately characterized
the situation.  *No* server processes will react instantly to postmaster
death.  Typically they'll only detect it while waiting for some other
condition, such as client input, or in some cases while iterating their
outermost loop.  So if they're busy with calculations they might not
notice for a long time.  I don't think autovacuum is any worse than
a busy client backend on this score.

It's hard to do better than that, because on most platforms there's
no way to get a signal on parent-process death, so the only way to
notice would be to poll the postmaster-death pipe constantly; which
would be hugely expensive in comparison to the value.
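
For readers unfamiliar with the mechanism, a minimal sketch of the death-pipe idea: the parent keeps the write end of a pipe open, children keep the read end, and a `read()` that returns EOF means every write end is closed, so the parent is gone. The helper names are hypothetical and this is an illustration rather than the server's code; each call is one syscall, which is the cost being weighed here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: create the pipe and make the read end non-blocking. */
static int
make_death_pipe(int fds[2])
{
    int         flags;

    if (pipe(fds) != 0)
        return -1;
    flags = fcntl(fds[0], F_GETFL);
    if (flags < 0 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) != 0)
        return -1;
    return 0;
}

/*
 * Hypothetical helper: returns 1 if the write end is still open
 * somewhere (parent presumed alive), 0 on EOF or an unexpected error.
 */
static int
parent_is_alive(int watch_fd)
{
    char        c;
    ssize_t     rc = read(watch_fd, &c, 1);

    if (rc > 0)
        return 1;               /* unexpected data, but pipe is open */
    if (rc < 0 && (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR))
        return 1;               /* nothing to read: write end still open */
    return 0;                   /* EOF: all write ends are closed */
}
```

In a real parent/child setup the child would close its copy of the write end right after `fork()`, so that the parent's exit is the only thing that can produce EOF.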

On the whole I'm skeptical that this is a useful consideration to
expend effort on.  You shouldn't be killing the postmaster that way.
If you do, you'll soon learn not to, for plenty of reasons besides
this one.

regards, tom lane




Re: Autovacuum worker doesn't immediately exit on postmaster death

2020-10-28 Thread Victor Yegorov
Wed, 28 Oct 2020 at 19:44, Alexander Kukushkin:

> I know, nobody in their right mind should do that, but if the postmaster
> process is killed with SIGKILL, most backend processes correctly notice
> that the postmaster is gone and exit.  There is one exception, though:
> autovacuum worker processes keep running until they eventually finish
> and exit.
>
> …
>
> I was able to reproduce it with 13.0 and 12.4, and I believe older
> versions are also affected.
>

Do you get the same behaviour also on master?
As there was some work in this area for 14, see
https://git.postgresql.org/pg/commitdiff/44fc6e259b

-- 
Victor Yegorov