Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2017-06-22 Thread Tom Lane
Andres Freund  writes:
> Or, probably more robust: Simply _exit(2) without further ado, and rely
> on postmaster to output an appropriate error message. Arguably it's not
> actually useful to see hundreds of "WARNING: terminating connection because of
> crash of another server process" messages in the log anyway.

At that point you might as well skip the entire mechanism and go straight
to SIGKILL.  IMO the only reason quickdie() exists at all is to try to
send a helpful message to the client.  And it is helpful --- your attempt
to claim that it isn't sounds very much like wishful thinking.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2017-06-22 Thread Andres Freund
On 2017-06-22 10:41:41 -0700, Andres Freund wrote:
> On 2017-02-02 12:18:22 -0800, Jimmy Yih wrote:
> > In the above pull request, Heikki also mentions that a similar scenario can
> > happen during palloc() as well... which is similar to what we saw in
> > Greenplum a couple years back for a deadlock in a malloc() call where we
> > responded by changing exit() to _exit() in quickdie as a fix.  That could
> > possibly be applicable to latest Postgres as well.
> 
> Isn't the quickdie() issue that we palloc/malloc in the first place,
> rather than the exit call?  There's some risk for exit() too, but we
> reset our own atexit handlers before exiting, so the risk seems fairly
> small.
> 
> 
> In my opinion the fix here would be to do it right and never emit error
> messages from signal handlers via elog.c - we've progressed a lot
> towards the goal and do a lot less in signal handlers these days - but
> quickdie() is one glaring exception to that.  I think a reasonable fix
> here would be to use write() of a statically allocated message, rather
> then elog proper, and not to send the message to the client.  Libpq, and
> I presume other clients, synthethize a message upon unexpected socket
> closure anyway, and it's completely unclear whether we can send a
> message anyway.

Or, probably more robust: Simply _exit(2) without further ado, and rely
on postmaster to output an appropriate error message. Arguably it's not
actually useful to see hundreds of "WARNING: terminating connection because of
crash of another server process" messages in the log anyway.

- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2017-06-22 Thread Andres Freund
On 2017-02-02 12:18:22 -0800, Jimmy Yih wrote:
> In the above pull request, Heikki also mentions that a similar scenario can
> happen during palloc() as well... which is similar to what we saw in
> Greenplum a couple years back for a deadlock in a malloc() call where we
> responded by changing exit() to _exit() in quickdie as a fix.  That could
> possibly be applicable to latest Postgres as well.

Isn't the quickdie() issue that we palloc/malloc in the first place,
rather than the exit call?  There's some risk for exit() too, but we
reset our own atexit handlers before exiting, so the risk seems fairly
small.


In my opinion the fix here would be to do it right and never emit error
messages from signal handlers via elog.c - we've progressed a lot
towards the goal and do a lot less in signal handlers these days - but
quickdie() is one glaring exception to that.  I think a reasonable fix
here would be to use write() of a statically allocated message, rather
then elog proper, and not to send the message to the client.  Libpq, and
I presume other clients, synthethize a message upon unexpected socket
closure anyway, and it's completely unclear whether we can send a
message anyway.

Greetings,

Andres Freund


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2017-06-22 Thread Robert Haas
On Thu, Feb 2, 2017 at 3:18 PM, Jimmy Yih  wrote:
> In that pull request, we fix the issue by checking for proc_exit_inprogress.
> Is there a reason why startup_die should not check for proc_exit_inprogress?

startup_die() is just calling proc_exit(), so it seems like it might
be better to fix it by putting the check into proc_exit itself.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] possible self-deadlock window after bad ProcessStartupPacket

2017-02-02 Thread Jimmy Yih
Hello,

There may possibly be a very small window for a double exit() self-deadlock
during a forked backend process's ProcessStartupPacket returns and status
is not STATUS_OK.  The process will go into proc_exit and then a very well
timed SIGQUIT will call startup_die for another proc_exit.  If timed
correctly, the two exit() calls can deadlock since exit() is not
re-entrant.  It seems extremely hard to hit the deadlock but I believe the
opportunity is there.

Using gdb, I was able to create the window and get this stacktrace:

#0  startup_die (postgres_signal_arg=0) at postmaster.c:5090
#1  
#2  proc_exit_prepare (code=0) at ipc.c:158
#3  0x007c4135 in proc_exit (code=0) at ipc.c:102
#4  0x0076b736 in BackendInitialize (port=0x2c13740) at
postmaster.c:4207
#5  0x0076b190 in BackendStartup (port=0x2c13740) at postmaster.c:3979
#6  0x00767ad3 in ServerLoop () at postmaster.c:1722
#7  0x007671df in PostmasterMain (argc=3, argv=0x2bebad0) at
postmaster.c:1330
#8  0x006b5df6 in main (argc=3, argv=0x2bebad0) at main.c:228

I got the stacktrace by doing the following:

   1. gdb attach to postmaster and run set follow-fork child and break
   postmaster.c:4206(right after ProcessStartupPacket) and continue
   2. In another terminal, open a psql session which should trigger the gdb
   follow
   3. In the gdb session, set status=1 and step into proc_exit()
   4. In another terminal, kill -s QUIT  to send SIGQUIT to the
   child process. Or run pg_ctl stop -M immediate.
   5. In the gdb session, step to process the signal into startup_die and
   run bt

This was discovered while hacking on Greenplum Database (currently based
off of Postgres 8.3) where we recently started encountering the
self-deadlock intermittently in our testing environment.

Here's the pull request discussion:
https://github.com/greenplum-db/gpdb/pull/1662

In that pull request, we fix the issue by checking for proc_exit_inprogress.
Is there a reason why startup_die should not check for proc_exit_inprogress?

In the above pull request, Heikki also mentions that a similar scenario can
happen during palloc() as well... which is similar to what we saw in
Greenplum a couple years back for a deadlock in a malloc() call where we
responded by changing exit() to _exit() in quickdie as a fix.  That could
possibly be applicable to latest Postgres as well.

Regards,
Jimmy