Re: background worker shudown (was Re: [HACKERS] Why does logical replication launcher exit with exit code 1?)

2018-10-10 Thread Robert Haas
On Wed, Oct 10, 2018 at 12:00 AM Thomas Munro
 wrote:
> So I suppose we should just remove it, with something like 0002.  I'm
> a bit uneasy about existing code out there that might be not calling
> CFI.  OTOH I suspect that a lot of code copied worker_spi.c and
> installed its own handler.

I agree -- I think worker_spi.c has probably been copied a lot, and
that's not a good thing.

> Maybe src/test/modules/worker_spi.c shouldn't use approach #1 (even if
> it might technically be OK for that code)?  I think I might have been
> guilty of copying that.

+1.  It's not *really* OK unless all of the SQL queries it executes
are super-short and can't possibly block ... which I guess is never
really true of any SQL queries, right?

> > Maybe we could replace this by a general-purpose hook.  So instead of
> > the various tests for process types that are there now, we would just
> > have
> >
> > if (procdie_hook != NULL)
> > (*procdie_hook)();
> >
> > And that hook can do whatever you like (but probably including dying).
>
> Ok, patch 0001 is a sketch like that, for discussion.

I was assuming that we'd replace the existing message-selection logic
with customer proc-die handlers.  Also, I think we should not indicate
in the comments that the handler is expected to proc_exit() or
ereport(FATAL), because you might want to do something else there,
then return and let the usual thing happen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: background worker shudown (was Re: [HACKERS] Why does logical replication launcher exit with exit code 1?)

2018-10-09 Thread Thomas Munro
On Wed, Oct 10, 2018 at 7:29 AM Robert Haas  wrote:
> On Thu, Oct 4, 2018 at 7:37 PM Thomas Munro
>  wrote:
> > 0.  The default SIGTERM handler for bgworkers is bgworker_die(), which
> > generates a FATAL ereport "terminating background worker \"%s\" due to
> > administrator command", directly in the signal handler (hmm, see also
> > recent discussions about the legality of similar code in quickdie()).
>
> Yeah, I think that code is bad news, for the same reasons discussed
> with regard to quickdie().

So I suppose we should just remove it, with something like 0002.  I'm
a bit uneasy about existing code out there that might be not calling
CFI.  OTOH I suspect that a lot of code copied worker_spi.c and
installed its own handler.

> > 1.  Some processes install their own custom SIGTERM handler that sets
> > a flag, and use that to break out of their main loop and exit quietly.
> > Example: autovacuum.c, or the open-source pg_cron extension, and
> > probably many other things that just want a nice quiet shutdown.
> >
> > 2.  Some processes install the standard SIGTERM handler die(), and
> > then use CHECK_FOR_INTERRUPTS() to break out of their main loop.  By
> > default this looks like "FATAL:  terminating connection due to
> > administrator command".  This approach is effectively required if the
> > main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
> > loop.  For example, a "launcher"-type process that calls
> > WaitForBackgroundWorkerStartup() could hang forever if it used
> > approach 1 (ask me how I know).
>
> My experience with background workers has been that approach #1 does
> not really work.  I mean, if you aren't doing anything complicated it
> might be OK, if for example you are executing SQL queries, and you try
> to do #1, then your SQL queries don't respond to interrupts.  And that
> sucks.  So I have generally adopted approach #2.

Maybe src/test/modules/worker_spi.c shouldn't use approach #1 (even if
it might technically be OK for that code)?  I think I might have been
guilty of copying that.

> To put that another way, nearly everything can reach
> CHECK_FOR_INTERRUPTS(), so saying that this is required whenever that
> in the case isn't much different from saying that it is required, full
> stop.

> > 3.  Some system processes generally use approach 2, but have a special
> > case in ProcessInterrupts() to suppress or alter the usual FATAL
> > message or behaviour.  This includes the logical replication launcher.
>
> Maybe we could replace this by a general-purpose hook.  So instead of
> the various tests for process types that are there now, we would just
> have
>
> if (procdie_hook != NULL)
> (*procdie_hook)();
>
> And that hook can do whatever you like (but probably including dying).

Ok, patch 0001 is a sketch like that, for discussion.

-- 
Thomas Munro
http://www.enterprisedb.com


0001-Add-proc_die_hook-to-customize-die-interrupt-handlin.patch
Description: Binary data


0002-Remove-async-signal-unsafe-bgworker_die-function.patch
Description: Binary data


Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

2018-10-09 Thread Thomas Munro
On Fri, Oct 5, 2018 at 12:36 PM Thomas Munro
 wrote:
> *  We really should get rid of that "exited with exit code 1".

Robert and I just discussed this subproblem (the original complaint of
this thread) off-list.  Our questions are: does anyone actually want
that message from the postmaster in the log, and if not, shouldn't we
just do this?

diff --git a/src/backend/postmaster/postmaster.c
b/src/backend/postmaster/postmaster.c
index 41de140ae0..b34655b4bd 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3193,8 +3193,7 @@ CleanupBackgroundWorker(int pid,
rw->rw_child_slot = 0;
ReportBackgroundWorkerExit();  /* report child death */

-   LogChildExit(EXIT_STATUS_0(exitstatus) ? DEBUG1 : LOG,
-namebuf, pid, exitstatus);
+   LogChildExit(DEBUG1, namebuf, pid, exitstatus);

return true;
}

As for the problem of the behaviour of bgworker processes themselves
on SIGTERM, let's discuss that separately in the other subthread[1]
(well, my mail client thinks it's a different thread, but the archives
think it's the same thread with a different subject).

[1] 
https://www.postgresql.org/message-id/CA%2BTgmobwExL4kNj_eXJxPah_tVQ31N0cYDbUN0FFm6uaY_%2BX%3Dw%40mail.gmail.com

-- 
Thomas Munro
http://www.enterprisedb.com



Fwd: background worker shudown (was Re: [HACKERS] Why does logical replication launcher exit with exit code 1?)

2018-10-09 Thread Robert Haas
On Thu, Oct 4, 2018 at 7:37 PM Thomas Munro
 wrote:
> I still think the current situation is non-ideal.  I don't have a
> strong view on whether some or all system-wide processes should say
> hello and goodbye explicitly in the log, but I do think we need a way
> to make that not look like an error condition, and ideally without
> special cases for known built-in processes.
>
> I looked into this a bit today, while debugging an extension that runs
> a background worker.  Here are some assorted approaches to shutdown:
>
> 0.  The default SIGTERM handler for bgworkers is bgworker_die(), which
> generates a FATAL ereport "terminating background worker \"%s\" due to
> administrator command", directly in the signal handler (hmm, see also
> recent discussions about the legality of similar code in quickdie()).

Yeah, I think that code is bad news, for the same reasons discussed
with regard to quickdie().

> 1.  Some processes install their own custom SIGTERM handler that sets
> a flag, and use that to break out of their main loop and exit quietly.
> Example: autovacuum.c, or the open-source pg_cron extension, and
> probably many other things that just want a nice quiet shutdown.
>
> 2.  Some processes install the standard SIGTERM handler die(), and
> then use CHECK_FOR_INTERRUPTS() to break out of their main loop.  By
> default this looks like "FATAL:  terminating connection due to
> administrator command".  This approach is effectively required if the
> main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
> loop.  For example, a "launcher"-type process that calls
> WaitForBackgroundWorkerStartup() could hang forever if it used
> approach 1 (ask me how I know).

My experience with background workers has been that approach #1 does
not really work.  I mean, if you aren't doing anything complicated it
might be OK, if for example you are executing SQL queries, and you try
to do #1, then your SQL queries don't respond to interrupts.  And that
sucks.  So I have generally adopted approach #2.

To put that another way, nearly everything can reach
CHECK_FOR_INTERRUPTS(), so saying that this is required whenever that
in the case isn't much different from saying that it is required, full
stop.

> 3.  Some system processes generally use approach 2, but have a special
> case in ProcessInterrupts() to suppress or alter the usual FATAL
> message or behaviour.  This includes the logical replication launcher.

Maybe we could replace this by a general-purpose hook.  So instead of
the various tests for process types that are there now, we would just
have

if (procdie_hook != NULL)
(*procdie_hook)();

And that hook can do whatever you like (but probably including dying).

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Why does logical replication launcher exit with exit code 1?

2018-10-04 Thread Thomas Munro
On Thu, Aug 3, 2017 at 1:41 PM Robert Haas  wrote:
> On Wed, Aug 2, 2017 at 9:33 PM, Peter Eisentraut
>  wrote:
> > On 8/2/17 16:52, Robert Haas wrote:
> >> I actually don't think it's that unreasonable to get notified when
> >> system-wide processes like the autovacuum launcher or the logical
> >> replication launcher start or stop.
> >
> > But we got rid of those start messages recently due to complaints.
>
> Yeah, true.  I'm just talking about what *I* think.  :-)

I still think the current situation is non-ideal.  I don't have a
strong view on whether some or all system-wide processes should say
hello and goodbye explicitly in the log, but I do think we need a way
to make that not look like an error condition, and ideally without
special cases for known built-in processes.

I looked into this a bit today, while debugging an extension that runs
a background worker.  Here are some assorted approaches to shutdown:

0.  The default SIGTERM handler for bgworkers is bgworker_die(), which
generates a FATAL ereport "terminating background worker \"%s\" due to
administrator command", directly in the signal handler (hmm, see also
recent discussions about the legality of similar code in quickdie()).

1.  Some processes install their own custom SIGTERM handler that sets
a flag, and use that to break out of their main loop and exit quietly.
Example: autovacuum.c, or the open-source pg_cron extension, and
probably many other things that just want a nice quiet shutdown.

2.  Some processes install the standard SIGTERM handler die(), and
then use CHECK_FOR_INTERRUPTS() to break out of their main loop.  By
default this looks like "FATAL:  terminating connection due to
administrator command".  This approach is effectively required if the
main loop will reach other code that has a CHECK_FOR_INTERRUPTS() wait
loop.  For example, a "launcher"-type process that calls
WaitForBackgroundWorkerStartup() could hang forever if it used
approach 1 (ask me how I know).

3.  Some system processes generally use approach 2, but have a special
case in ProcessInterrupts() to suppress or alter the usual FATAL
message or behaviour.  This includes the logical replication launcher.

A couple of thoughts:

*  Extensions that need to use die() (or something else that would be
compatible with CFI() wait loops) should ideally have a way to control
how ProcessInterrupts() reports this to the user, since shutdown is an
expected condition for long-lived processes.

*  We really should get rid of that "exited with exit code 1".

-- 
Thomas Munro
http://www.enterprisedb.com