Re: Async-unsafe functions in signal handlers

2021-08-30 Thread Denis Smirnov
Honestly, I don’t know what to do with bgworker_die(). At the moment it 
produces ereport(FATAL) with async-unsafe proc_exit_prepare() and exit() 
underhood. I can see three solutions:

1. Leave the code as is. Then SIGTERM can produce deadlocks in bgworker's 
signal handler. The locked process can terminated with an immediate shutdown 

 of the cluster. May be it is ok as we don’t expect to send SIGTERM to bgworker 
too often.

2. Use async-safe _exit() in a signal handler instead of proc_exit_prepare() 
and exit(). In this case we’ll have to go through cluster recovery as the 
bgworker doesn't properly clean its shared memory. This solution is even worth 
than immediate shutdown as we recover for every SIGTERM have been sent to 
bgworker.

3. Set a signal flag inside the handler (something like miscadmin.h 
XXX_INTERRUPTS() macros). So it becomes an extension developer's responsibility 
to properly handle this flag in the bgworker’s code. This approach breaks 
backward compatibility.

May be I've missed a good solution, do you see any?

Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia



Re: Async-unsafe functions in signal handlers

2021-08-27 Thread Denis Smirnov


> 28 авг. 2021 г., в 07:05, Andres Freund  написал(а):
> 
> However, we have a
> bandaid that deals with possible hangs, by SIGKILLing when processes don't
> shut down (at that point things have already gone quite south, so that's not
> an issue).

Thanks for the explanation. I can see that child process SIGKILL machinery was 
introduced by 82233ce7ea42d6ba519aaec63008aff49da6c7af commit to fix a malloc() 
deadlock in quickdie() signal handler. As a result, all child processes that 
die too long are killed in the ServerLoop() with SIGKILL. But bgworker_die() is 
a problem as we initialize bgworkers right before ServerLoop(). So we can face 
malloc() deadlock on postmaster startup (before ServerLoop() started). Maybe we 
should simply use write() and exit() instead of ereport() for bgworker_die()?

Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia





Re: Async-unsafe functions in signal handlers

2021-08-27 Thread Andres Freund
Hi,

On 2021-08-27 23:51:27 +0300, Denis Smirnov wrote:
> > 26 авг. 2021 г., в 23:39, Tom Lane  написал(а):
> > 
> > (BTW, I think it's pretty silly to imagine that adding backtrace()
> > calls inside ereport is making things any more dangerous.  ereport
> > has pretty much always carried a likelihood of calling malloc(),
> > for example.)
> 
> I have taken a look through the signal handlers and found out that many of 
> them use malloc() via ereport() and elog(). Here is the list:
> 
> SIGUSR1
> - procsignal_sigusr1_handler(): autoprewarm, autovacuum, bgworker, bgwriter, 
> checkpointer, pgarch, startup, walwriter, walreciever, walsender

There shouldn't be meaningful uses of elog/ereport() inside
procsignal_sigusr1_handler(). The exception I found was an elog(FATAL) for
unreachable code.


> - sigusr1_handler(): postmaster
> SIGHUP:
> - SIGHUP_handler(): postmaster
> SIGCHLD:
> - reaper(): postmaster

I think these runs in a very controlled set of circumstances because most of
postmaster runs with signals masked.


> SIGFPE:
> - FloatExceptionHandler(): autovacuum, bgworker, postgres, plperl

Yep, although as discussed this might not be a "real" problem because it
should only run during an instruction triggering an FPE.


> SIGQUIT:
> - quickdie(): postgres

Yes, this is an issue. I've previously argued for handling this via write()
and _exit(), instead of the full ereport() machinery. However, we have a
bandaid that deals with possible hangs, by SIGKILLing when processes don't
shut down (at that point things have already gone quite south, so that's not
an issue).


> SIGTERM:
> - bgworker_die(): bgworker

Bad.


> SIGALRM:
> - handle_sig_alarm(): autovacuum, bgworker, postmaster, startup, walsender, 
> postgres

I don't think there reachable elogs in there. I'm not concerned about e.g.
elog(FATAL, "timeout index %d out of range 0..%d", index,
 num_active_timeouts - 1);
because that's not something that should ever be reachable in a production
scenario. If it is, there's bigger problems.


Perhaps we ought to have a version of Assert() that's enabled in production
builds as well, and that outputs the error messages via write() and then
_exit()s?

Greetings,

Andres Freund




Re: Async-unsafe functions in signal handlers

2021-08-27 Thread Denis Smirnov


> 26 авг. 2021 г., в 23:39, Tom Lane  написал(а):
> 
> (BTW, I think it's pretty silly to imagine that adding backtrace()
> calls inside ereport is making things any more dangerous.  ereport
> has pretty much always carried a likelihood of calling malloc(),
> for example.)

I have taken a look through the signal handlers and found out that many of them 
use malloc() via ereport() and elog(). Here is the list:

SIGUSR1
- procsignal_sigusr1_handler(): autoprewarm, autovacuum, bgworker, bgwriter, 
checkpointer, pgarch, startup, walwriter, walreciever, walsender
- sigusr1_handler(): postmaster

SIGFPE:
- FloatExceptionHandler(): autovacuum, bgworker, postgres, plperl

SIGHUP:
- SIGHUP_handler(): postmaster

SIGCHLD:
- reaper(): postmaster

SIGQUIT:
- quickdie(): postgres

SIGTERM:
- bgworker_die(): bgworker

SIGALRM:
- handle_sig_alarm(): autovacuum, bgworker, postmaster, startup, walsender, 
postgres

I suspect there are lots of potential ways to lock on malloc() inside any of 
this handlers. An interesting question is why there are still no evidence of 
such locks?

Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia





Re: Async-unsafe functions in signal handlers

2021-08-26 Thread Tom Lane
Robert Haas  writes:
> That is a great question. I think bgworker_die() is extremely
> dangerous and ought to be removed. I can't see how that can ever be
> safe.

Agreed, it looks pretty dangerous from here.  The equivalent (and
far better battle-tested) signal handlers in postgres.c are a lot
more circumspect --- they will call stuff that's unsafe per POSIX,
but not from just any interrupt point.

(BTW, I think it's pretty silly to imagine that adding backtrace()
calls inside ereport is making things any more dangerous.  ereport
has pretty much always carried a likelihood of calling malloc(),
for example.)

> FloatExceptionHandler() is a bit different because in theory the
> things that could trigger SIGFPE are relatively limited, and in theory
> we know that those are safe places to ereport(). But I'm not very
> convinced that this is really true. Among other things, kill -FPE
> could be executed any time, but even aside from that, I doubt we have
> exhaustive knowledge of everything in the code that could trigger a
> floating point exception.

On the one hand, that's theoretically true, and on the other hand,
that code's been like that since the last century and I'm unaware
of any actual problems.  There are not many places in the backend
that do arithmetic that's likely to trigger SIGFPE.  Besides which,
what's the alternative?  I suppose we could SIG_IGN SIGFPE, but
that is almost certainly going to create real problems (i.e. missed
error cases) while removing only hypothetical ones.

(The "DBA randomly issues 'kill -FPE'" scenario can be dismissed,
I think --- surely that is in the category of "when you break it,
you get to keep both pieces".)

The larger subtext here is that just because it's undefined per
POSIX doesn't necessarily mean it's unsafe in our use-pattern.

regards, tom lane




Re: Async-unsafe functions in signal handlers

2021-08-26 Thread Robert Haas
On Wed, Aug 25, 2021 at 10:22 AM Denis Smirnov  wrote:
> I am going to refactor Greenplum backtraces for error messages and want to 
> make it more compatible with PostgreSQL code. Backtraces in PostgreSQL were 
> introduced by 71a8a4f6e36547bb060dbcc961ea9b57420f7190 commit (original 
> discussion 
> https://www.postgresql.org/message-id/CAMsr+YGL+yfWE=jvbubnpwtrrzney7hj07+zt4byjdvp4sz...@mail.gmail.com
>  ) and rely on backtrace() and backtrace_symbols() functions. They are used 
> inside errfinish() that is wrapped by ereport() macros. ereport() is invoked 
> inside bgworker_die() and FloatExceptionHandler() signal handlers. I am 
> confused with this fact - both backtrace functions are async-unsafe: 
> backtrace_symbols() - always, backtrace() - only for the first call due to 
> dlopen. I wonder why does PostgreSQL use async-unsafe functions in signal 
> handlers?

That is a great question. I think bgworker_die() is extremely
dangerous and ought to be removed. I can't see how that can ever be
safe.

FloatExceptionHandler() is a bit different because in theory the
things that could trigger SIGFPE are relatively limited, and in theory
we know that those are safe places to ereport(). But I'm not very
convinced that this is really true. Among other things, kill -FPE
could be executed any time, but even aside from that, I doubt we have
exhaustive knowledge of everything in the code that could trigger a
floating point exception.

-- 
Robert Haas
EDB: http://www.enterprisedb.com




Re: Async-unsafe functions in signal handlers

2021-08-26 Thread Denis Smirnov
As far as I understand, the main problem with backtrace_symbols() is the 
internal malloc() call. Backend can lock forever if malloc() was interrupted by 
a signal and then was evaluated again in a signal handler.

At the moment Greenplum uses "addr2line -s -e» (on Linux) and "atos -o" (on 
macOS) for each stack address instead of backtrace_symbols(). Both of these 
utils don’t use malloc() underhood, although there is no guarantee that this 
implementation never changes in the future. It seems to be a safer approach, 
but looks like a dirty hack.

> 26 авг. 2021 г., в 08:52, Andrey Borodin  написал(а):
> 
> 
> 
>> 25 авг. 2021 г., в 19:22, Denis Smirnov  написал(а):
>> 
>> I am going to refactor Greenplum backtraces for error messages and want to 
>> make it more compatible with PostgreSQL code. Backtraces in PostgreSQL were 
>> introduced by 71a8a4f6e36547bb060dbcc961ea9b57420f7190 commit (original 
>> discussion 
>> https://www.postgresql.org/message-id/CAMsr+YGL+yfWE=jvbubnpwtrrzney7hj07+zt4byjdvp4sz...@mail.gmail.com
>>  ) and rely on backtrace() and backtrace_symbols() functions. They are used 
>> inside errfinish() that is wrapped by ereport() macros. ereport() is invoked 
>> inside bgworker_die() and FloatExceptionHandler() signal handlers. I am 
>> confused with this fact - both backtrace functions are async-unsafe: 
>> backtrace_symbols() - always, backtrace() - only for the first call due to 
>> dlopen. I wonder why does PostgreSQL use async-unsafe functions in signal 
>> handlers?
> 
> In my view GUC backtrace_functions is expected to be used for debug purposes. 
> Not for enabling on production server for bgworker_die() or 
> FloatExceptionHandler().
> Are there any way to call backtrace_symbols() without touching 
> backtrace_functions?
> 
> Best regards, Andrey Borodin.
> 

Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia





Re: Async-unsafe functions in signal handlers

2021-08-25 Thread Andrey Borodin



> 25 авг. 2021 г., в 19:22, Denis Smirnov  написал(а):
> 
> I am going to refactor Greenplum backtraces for error messages and want to 
> make it more compatible with PostgreSQL code. Backtraces in PostgreSQL were 
> introduced by 71a8a4f6e36547bb060dbcc961ea9b57420f7190 commit (original 
> discussion 
> https://www.postgresql.org/message-id/CAMsr+YGL+yfWE=jvbubnpwtrrzney7hj07+zt4byjdvp4sz...@mail.gmail.com
>  ) and rely on backtrace() and backtrace_symbols() functions. They are used 
> inside errfinish() that is wrapped by ereport() macros. ereport() is invoked 
> inside bgworker_die() and FloatExceptionHandler() signal handlers. I am 
> confused with this fact - both backtrace functions are async-unsafe: 
> backtrace_symbols() - always, backtrace() - only for the first call due to 
> dlopen. I wonder why does PostgreSQL use async-unsafe functions in signal 
> handlers?

In my view GUC backtrace_functions is expected to be used for debug purposes. 
Not for enabling on production server for bgworker_die() or 
FloatExceptionHandler().
Are there any way to call backtrace_symbols() without touching 
backtrace_functions?

Best regards, Andrey Borodin.



Async-unsafe functions in signal handlers

2021-08-25 Thread Denis Smirnov
Hello all,

I am going to refactor Greenplum backtraces for error messages and want to make 
it more compatible with PostgreSQL code. Backtraces in PostgreSQL were 
introduced by 71a8a4f6e36547bb060dbcc961ea9b57420f7190 commit (original 
discussion 
https://www.postgresql.org/message-id/CAMsr+YGL+yfWE=jvbubnpwtrrzney7hj07+zt4byjdvp4sz...@mail.gmail.com
 ) and rely on backtrace() and backtrace_symbols() functions. They are used 
inside errfinish() that is wrapped by ereport() macros. ereport() is invoked 
inside bgworker_die() and FloatExceptionHandler() signal handlers. I am 
confused with this fact - both backtrace functions are async-unsafe: 
backtrace_symbols() - always, backtrace() - only for the first call due to 
dlopen. I wonder why does PostgreSQL use async-unsafe functions in signal 
handlers?

Best regards,
Denis Smirnov | Developer
s...@arenadata.io 
Arenadata | Godovikova 9-17, Moscow 129085 Russia