Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
Stefan Kaltenbrunner writes: > On 06/22/2012 11:47 PM, Tom Lane wrote: >> Could you gdb each of these processes and get a stack trace? [ unsurprising stack traces ] OK, so they're waiting exactly where they should be. So what we know is that the shutdown failure is caused by the child processes having all blockable signals blocked. What we don't know is how they could have gotten that way. I do not see any way that the code in Postgres would ever block all signals, which makes this look a lot like a BSD libc bug. But that doesn't help much. The only way I can think of to narrow it down is to run the postmaster under "strace -f" or local equivalent, and look for sigprocmask calls that set all the bits. That could be pretty tedious though, if the occurrence of the bug is as low-probability as it seems to be. Given your earlier comment about a new threading library, I wonder whether this has something to do with confusion between process-wide and per-thread signal masks ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
On 06/22/2012 11:47 PM, Tom Lane wrote: > Stefan Kaltenbrunner writes: PID PENDING CAUGHT IGNORED BLOCKED COMMAND 12480 20004004 34084005 c942b002 fffefeff postgres: writer process 9841 20004004 34084007 c942b000 fffefeff postgres: wal writer process > >> this seems to be SIGUSR1,SIGTERM and SIGQUIT > > OK, I looked up OpenBSD's signal numbers on the web. It looks to me > like these two processes have everything blocked except KILL and STOP > (which are unblockable of course). I do not see any place in the PG > code that could possibly set such a mask (note that BlockSig should > have more holes in it than that). So I'm thinking these must be > blocked inside some system function that's installed a restrictive > signal mask, or some such function forgot to restore the mask on exit. > Could you gdb each of these processes and get a stack trace? background writer (12480): (gdb) bt #0 0x000208eb5928 in poll () from /usr/lib/libc.so.62.0 #1 0x00020a972b88 in _thread_kern_poll (wait_reqd=Variable "wait_reqd" is not available. ) at /usr/src/lib/libpthread/uthread/uthread_kern.c:784 #2 0x00020a973d04 in _thread_kern_sched (scp=0x0) at /usr/src/lib/libpthread/uthread/uthread_kern.c:384 #3 0x00020a96b35c in poll (fds=0xfffefa80, nfds=Variable "nfds" is not available. ) at /usr/src/lib/libpthread/uthread/uthread_poll.c:94 #4 0x00395538 in WaitLatchOrSocket (latch=0x212bdc97c, wakeEvents=25, sock=-1, timeout=1) at pg_latch.c:286 #5 0x00399800 in BackgroundWriterMain () at bgwriter.c:325 #6 0x00201850 in AuxiliaryProcessMain (argc=2, argv=0xfffefd98) at bootstrap.c:419 #7 0x003a1534 in StartChildProcess (type=BgWriterProcess) at postmaster.c:4518 #8 0x003a7574 in reaper (postgres_signal_arg=Variable "postgres_signal_arg" is not available. ) at postmaster.c:2385 #9 0x00020a974528 in _dispatch_signal (sig=20, scp=0x03e0) at /usr/src/lib/libpthread/uthread/uthread_sig.c:408 #10 0x00020a97462c in _dispatch_signals (scp=0x03e0) at /usr/src/lib/libpthread/uthread/uthread_sig.c:437 #11 0x00020a974e28 in _thread_sig_handler (sig=20, info=0x0420, scp=0x03e0) at /usr/src/lib/libpthread/uthread/uthread_sig.c:139 #12 #13 _thread_kern_set_timeout (timeout=0x0630) at /usr/src/lib/libpthread/uthread/uthread_kern.c:989 #14 0x00020a96bc8c in select (numfds=9, readfds=0x0730, writefds=0x0, exceptfds=0x0, timeout=Variable "timeout" is not available. ) at /usr/src/lib/libpthread/uthread/uthread_select.c:85 #15 0x003a2894 in ServerLoop () at postmaster.c:1321 #16 0x003a45ac in PostmasterMain (argc=Variable "argc" is not available. ) at postmaster.c:1121 #17 0x00326df8 in main (argc=6, argv=0x14f8) at main.c:199 wal writer(9841): #0 0x000208eb5928 in poll () from /usr/lib/libc.so.62.0 #1 0x00020a972b88 in _thread_kern_poll (wait_reqd=Variable "wait_reqd" is not available. ) at /usr/src/lib/libpthread/uthread/uthread_kern.c:784 #2 0x00020a973d04 in _thread_kern_sched (scp=0x0) at /usr/src/lib/libpthread/uthread/uthread_kern.c:384 #3 0x00020a96b35c in poll (fds=0xfffefa80, nfds=Variable "nfds" is not available. ) at /usr/src/lib/libpthread/uthread/uthread_poll.c:94 #4 0x00395538 in WaitLatchOrSocket (latch=0x212bdc69c, wakeEvents=25, sock=-1, timeout=5000) at pg_latch.c:286 #5 0x003aa794 in WalWriterMain () at walwriter.c:301 #6 0x00201878 in AuxiliaryProcessMain (argc=2, argv=0xfffefd98) at bootstrap.c:430 #7 0x003a1534 in StartChildProcess (type=WalWriterProcess) at postmaster.c:4518 #8 0x003a7564 in reaper (postgres_signal_arg=Variable "postgres_signal_arg" is not available. ) at postmaster.c:2387 #9 0x00020a974528 in _dispatch_signal (sig=20, scp=0x03e0) at /usr/src/lib/libpthread/uthread/uthread_sig.c:408 #10 0x00020a97462c in _dispatch_signals (scp=0x03e0) at /usr/src/lib/libpthread/uthread/uthread_sig.c:437 #11 0x00020a974e28 in _thread_sig_handler (sig=20, info=0x0420, scp=0x03e0) at /usr/src/lib/libpthread/uthread/uthread_sig.c:139 #12 #13 _thread_kern_set_timeout (timeout=0x0630) at /usr/src/lib/libpthread/uthread/uthread_kern.c:989 #14 0x00020a96bc8c in select (numfds=9, readfds=0x0730, writefds=0x0, exceptfds=0x0, timeout=Variable "timeout" is not available. ) at /usr/src/lib/libpthread/uthread/uthread_select.c:85 #15 0x003a2894 in ServerLoop () at postmaster.c:1321 #16 0x003a45ac in PostmasterMain (argc=Variable "argc" is not available. ) at postmaster.c:1121 #17 0x00326df8 in main (argc=6, argv=0x14f8) at main.c:199 Stefan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
On 06/22/2012 11:53 PM, Tom Lane wrote: > oh, and just for comparison's sake, what do the postmaster's signal > masks look like? # ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 18020 PID PENDING CAUGHT IGNORED BLOCKED COMMAND 180200 74084007 8972b0000 /home/pgbuild/pgbuildfarm/HEAD/pgsql.5709/src/interfaces/ecpg/test/./t... Stefan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
On Fri, Jun 22, 2012 at 2:57 PM, Andres Freund wrote: > On Friday, June 22, 2012 08:51:55 PM Robert Haas wrote: >> On Fri, Jun 22, 2012 at 2:16 PM, Stefan Kaltenbrunner >> >> wrote: >> > sending a manual kill -15 to either of them does not seem to make them >> > exit either... >> > >> > I did some further investiagations with robert on IM but I don't think >> > he has any further ideas other than that I have a weird OS :) >> > It seems worth noticing that this is OpenBSD 5.1 on Sparc64 which has a >> > new threading implementation compared to older OpenBSD versions. >> >> I remarked to Stefan that the symptoms seem consistent with the idea >> that the children have signals blocked. But I don't know how that >> could happen. > You cannot block sigkill. Obviously. The issue is: the postmaster apparently sent SIGTERM (15) to all of these children, and yet they're not dead; and a manual SIGTERM doesn't kill them either. I'm sure SIGKILL (9) would do the trick, but then it's not a clean shutdown. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
oh, and just for comparison's sake, what do the postmaster's signal masks look like? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
Stefan Kaltenbrunner writes: >>> PID PENDING CAUGHT IGNORED BLOCKED COMMAND >>> 12480 20004004 34084005 c942b002 fffefeff postgres: writer process >>> 9841 20004004 34084007 c942b000 fffefeff postgres: wal writer process > this seems to be SIGUSR1,SIGTERM and SIGQUIT OK, I looked up OpenBSD's signal numbers on the web. It looks to me like these two processes have everything blocked except KILL and STOP (which are unblockable of course). I do not see any place in the PG code that could possibly set such a mask (note that BlockSig should have more holes in it than that). So I'm thinking these must be blocked inside some system function that's installed a restrictive signal mask, or some such function forgot to restore the mask on exit. Could you gdb each of these processes and get a stack trace? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
On 06/22/2012 11:02 PM, Tom Lane wrote: > Stefan Kaltenbrunner writes: >> On 06/22/2012 09:39 PM, Tom Lane wrote: >>> (Hey Stefan, is there a way on BSD to check a process's signals-blocked >>> state from outside? If so, next time this happens you should try to >>> determine the children's signal state.) > >> with help from RhodiumToad on IRC: > >> # ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 12480 > >> PID PENDING CAUGHT IGNORED BLOCKED COMMAND >> 12480 20004004 34084005 c942b002 fffefeff postgres: writer process >> (postgres) > >> # ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 9841 >> PID PENDING CAUGHT IGNORED BLOCKED COMMAND >> 9841 20004004 34084007 c942b000 fffefeff postgres: wal writer process >> (postgres) > > Well, the nonzero PENDING masks sure look like a smoking gun, but why > are there multiple pending signals? And I'm not sure I know OpenBSD's > signal numbers by heart. Could you convert those masks into text signal > name lists for us? this seems to be SIGUSR1,SIGTERM and SIGQUIT Stefan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
Stefan Kaltenbrunner writes: > On 06/22/2012 09:39 PM, Tom Lane wrote: >> (Hey Stefan, is there a way on BSD to check a process's signals-blocked >> state from outside? If so, next time this happens you should try to >> determine the children's signal state.) > with help from RhodiumToad on IRC: > # ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 12480 > PID PENDING CAUGHT IGNORED BLOCKED COMMAND > 12480 20004004 34084005 c942b002 fffefeff postgres: writer process > (postgres) > # ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 9841 > PID PENDING CAUGHT IGNORED BLOCKED COMMAND > 9841 20004004 34084007 c942b000 fffefeff postgres: wal writer process > (postgres) Well, the nonzero PENDING masks sure look like a smoking gun, but why are there multiple pending signals? And I'm not sure I know OpenBSD's signal numbers by heart. Could you convert those masks into text signal name lists for us? regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
On 06/22/2012 09:39 PM, Tom Lane wrote: > Andres Freund writes: >> On Friday, June 22, 2012 08:51:55 PM Robert Haas wrote: >>> I remarked to Stefan that the symptoms seem consistent with the idea >>> that the children have signals blocked. But I don't know how that >>> could happen. > >> You cannot block sigkill. > > sigterm is at issue, not sigkill. But I don't care for the > signals-blocked theory either, at least not in three different children > at the same time. > > (Hey Stefan, is there a way on BSD to check a process's signals-blocked > state from outside? If so, next time this happens you should try to > determine the children's signal state.) with help from RhodiumToad on IRC: # ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 12480 PID PENDING CAUGHT IGNORED BLOCKED COMMAND 12480 20004004 34084005 c942b002 fffefeff postgres: writer process (postgres) # ps -o pid,sig,sigcatch,sigignore,sigmask,command -p 9841 PID PENDING CAUGHT IGNORED BLOCKED COMMAND 9841 20004004 34084007 c942b000 fffefeff postgres: wal writer process (postgres) Stefan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
Andres Freund writes: > On Friday, June 22, 2012 08:51:55 PM Robert Haas wrote: >> I remarked to Stefan that the symptoms seem consistent with the idea >> that the children have signals blocked. But I don't know how that >> could happen. > You cannot block sigkill. sigterm is at issue, not sigkill. But I don't care for the signals-blocked theory either, at least not in three different children at the same time. (Hey Stefan, is there a way on BSD to check a process's signals-blocked state from outside? If so, next time this happens you should try to determine the children's signal state.) regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
Stefan Kaltenbrunner writes: > On 06/22/2012 08:34 PM, Tom Lane wrote: >> Still, panther is NetBSD so there may be some general BSD flavor to >> whatever's going on here. > yeah the threading reference was mostly because all backtraces contain > references to threading libs and because the threading tests are the > last ones done by the ECPG changes... It is weird that this seems to be happening only in the context of the ecpg and isolation tests, because it's not clear why client-side activity would have anything to do with it. Your gdb investigations confirm that all the actual client-serving backends are gone, and the postmaster knows it. But the background service processes haven't shut down. AFAICS the postmaster could not have reached PM_WAIT_BACKENDS state without signaling them, so why aren't they shutting down, and why does it matter which set of tests we'd been running? My first thought about it was that maybe a signal got missed, but it's hard to credit that bgwriter, walwriter, and autovac would all have missed signals concurrently. (checkpointer and stats collector don't get signaled yet, so it's not surprising those are still around.) I wonder whether signal_child() could have failed? It logs about such failures, but only at debug3 which seems overly taciturn. I wonder if we should crank that up to LOG level. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
On 06/22/2012 08:34 PM, Tom Lane wrote: > Stefan Kaltenbrunner writes: >> It has now happened at least twice that builds on spponbill started to >> fail after it failed during ECPGcheck: >> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-06-19%2023%3A00%3A04 >> the first failure was: >> http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-05-24%2023%3A00%3A05 >> so in both cases the postmaster was not shuting down properly > > panther has been showing similar postmaster-does-not-shut-down failures > every so often, though IIRC always in the IsolationCheck step not ECPG. hmm > >> I did some further investiagations with robert on IM but I don't think >> he has any further ideas other than that I have a weird OS :) >> It seems worth noticing that this is OpenBSD 5.1 on Sparc64 which has a >> new threading implementation compared to older OpenBSD versions. > > But we don't use threading ... > > Still, panther is NetBSD so there may be some general BSD flavor to > whatever's going on here. yeah the threading reference was mostly because all backtraces contain references to threading libs and because the threading tests are the last ones done by the ECPG changes... Stefan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
On Friday, June 22, 2012 08:51:55 PM Robert Haas wrote: > On Fri, Jun 22, 2012 at 2:16 PM, Stefan Kaltenbrunner > > wrote: > > sending a manual kill -15 to either of them does not seem to make them > > exit either... > > > > I did some further investiagations with robert on IM but I don't think > > he has any further ideas other than that I have a weird OS :) > > It seems worth noticing that this is OpenBSD 5.1 on Sparc64 which has a > > new threading implementation compared to older OpenBSD versions. > > I remarked to Stefan that the symptoms seem consistent with the idea > that the children have signals blocked. But I don't know how that > could happen. You cannot block sigkill. Andres -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
On Fri, Jun 22, 2012 at 2:16 PM, Stefan Kaltenbrunner wrote: > sending a manual kill -15 to either of them does not seem to make them > exit either... > > I did some further investiagations with robert on IM but I don't think > he has any further ideas other than that I have a weird OS :) > It seems worth noticing that this is OpenBSD 5.1 on Sparc64 which has a > new threading implementation compared to older OpenBSD versions. I remarked to Stefan that the symptoms seem consistent with the idea that the children have signals blocked. But I don't know how that could happen. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
On 06/22/2012 02:34 PM, Tom Lane wrote: Stefan Kaltenbrunner writes: It has now happened at least twice that builds on spponbill started to fail after it failed during ECPGcheck: http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-06-19%2023%3A00%3A04 the first failure was: http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-05-24%2023%3A00%3A05 so in both cases the postmaster was not shuting down properly panther has been showing similar postmaster-does-not-shut-down failures every so often, though IIRC always in the IsolationCheck step not ECPG. brolga (Cygwin) has had similar sporadic failures on ecpg checks, so much so that, since I have been unable to debug it in the time I have available, for the time being I have disabled the ecpg checks. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] random failing builds on spoonbill - backends not exiting...
Stefan Kaltenbrunner writes: > It has now happened at least twice that builds on spponbill started to > fail after it failed during ECPGcheck: > http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-06-19%2023%3A00%3A04 > the first failure was: > http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-05-24%2023%3A00%3A05 > so in both cases the postmaster was not shuting down properly panther has been showing similar postmaster-does-not-shut-down failures every so often, though IIRC always in the IsolationCheck step not ECPG. > I did some further investiagations with robert on IM but I don't think > he has any further ideas other than that I have a weird OS :) > It seems worth noticing that this is OpenBSD 5.1 on Sparc64 which has a > new threading implementation compared to older OpenBSD versions. But we don't use threading ... Still, panther is NetBSD so there may be some general BSD flavor to whatever's going on here. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] random failing builds on spoonbill - backends not exiting...
It has now happened at least twice that builds on spponbill started to fail after it failed during ECPGcheck: http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-06-19%2023%3A00%3A04 the first failure was: http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=spoonbill&dt=2012-05-24%2023%3A00%3A05 so in both cases the postmaster was not shuting down properly and it was in fact still running - I have attached gdb to to the still running backend: (gdb) bt #0 0x000208eb5928 in poll () from /usr/lib/libc.so.62.0 #1 0x00020a972b88 in _thread_kern_poll (wait_reqd=Variable "wait_reqd" is not available. ) at /usr/src/lib/libpthread/uthread/uthread_kern.c:784 #2 0x00020a973d04 in _thread_kern_sched (scp=0x0) at /usr/src/lib/libpthread/uthread/uthread_kern.c:384 #3 0x00020a96c080 in select (numfds=Variable "numfds" is not available. ) at /usr/src/lib/libpthread/uthread/uthread_select.c:170 #4 0x003a2894 in ServerLoop () at postmaster.c:1321 #5 0x003a45ac in PostmasterMain (argc=Variable "argc" is not available. ) at postmaster.c:1121 #6 0x00326df8 in main (argc=6, argv=0x14f8) at main.c:199 (gdb) print Shutdown $2 = 2 (gdb) print pmState $3 = PM_WAIT_BACKENDS (gdb) p *(Backend *) (BackendList->dll_head) Cannot access memory at address 0x0 (gdb) p *BackendList $9 = {dll_head = 0x0, dll_tail = 0x0} all processes are still running: pgbuild 18020 0.0 1.2 5952 12408 ?? I Wed04AM0:03.98 /home/pgbuild/pgbuildfarm/HEAD/pgsql.5709/src/interfaces/ecpg/test/./tmp_check/install//home/pgbuild/pgbuildfarm/HEAD/inst/bin/postgres -D / pgbuild 21483 0.0 0.7 6088 7296 ?? IsWed04AM0:00.68 postgres: checkpointer process(postgres) pgbuild 12480 0.0 0.4 5952 4464 ?? SsWed04AM0:06.88 postgres: writer process(postgres) pgbuild 9841 0.0 0.5 5952 4936 ?? SsWed04AM0:06.92 postgres: wal writer process(postgres) pgbuild623 0.1 0.6 7424 6288 ?? SsWed04AM4:16.76 postgres: autovacuum launcher process(postgres) pgbuild 30949 0.0 0.4 6280 3896 ?? SsWed04AM0:40.94 postgres: stats collector process(postgres) sending a manual kill -15 to either of them does not seem to make them exit either... I did some further investiagations with robert on IM but I don't think he has any further ideas other than that I have a weird OS :) It seems worth noticing that this is OpenBSD 5.1 on Sparc64 which has a new threading implementation compared to older OpenBSD versions. Stefan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers