Re: [HACKERS] What is happening on buildfarm member crake?
Robert Haas <robertmh...@gmail.com> writes:
> On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
>> then this is really just one instance of a bug that's probably pretty
>> widespread in our signal handlers. Somebody needs to go through 'em all
>> and look for touches of shared memory.

> I haven't made a comprehensive study of every signal handler we have,
> [ but here's how to fix procsignal_sigusr1_handler ]

I've trawled all the remaining signal handlers (or at least everything
declared with SIGNAL_ARGS, which hopefully is all of them). I find the
following bugs:

1. A couple of signal handlers do

	if (MyWalSnd)
		SetLatch(&MyWalSnd->latch);

which is fine as far as it goes, but the shutdown sequence in WalSndKill
has exactly the same bug you just fixed in ProcKill: it needs to clear
MyWalSnd before disowning the latch, not after.

2. WalRcvSigUsr1Handler and worker_spi_sighup fail to preserve errno.

Will fix, but the latter bug is a tad disappointing considering that the
coding rule about saving errno in signal handlers has been there for a
*long* time.

			regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] What is happening on buildfarm member crake?
Robert Haas <robertmh...@gmail.com> writes:
> On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
>> then this is really just one instance of a bug that's probably pretty
>> widespread in our signal handlers. Somebody needs to go through 'em all
>> and look for touches of shared memory.

> I haven't made a comprehensive study of every signal handler we have,
> but looking at procsignal_sigusr1_handler, the list of functions that
> can get called from here is quite short: CheckProcSignal(),
> RecoveryConflictInterrupt(), SetLatch(), and latch_sigusr1_handler().

You forgot HandleCatchupInterrupt and HandleNotifyInterrupt, but those
are also safe because they do nothing once proc_exit_inprogress is set.

> I think maybe the best fix is to *clear* MyProc in
> ProcKill/AuxiliaryProcKill and MyProcSignalSlot in
> CleanupProcSignalState, as shown in the attached patch.

This looks good to me in principle. A couple minor beefs:

* The addition to CleanupProcSignalState could use a comment, similar
to the one you added in ProcKill.

* I think the code in ProcKill and AuxiliaryProcKill might be more
readable if the new local variable was named myproc (lower case).

> and we can easily add a NULL guard to the SetLatch() call in
> procsignal_sigusr1_handler, which the attached patch also does.

Um ... no such change actually visible in patch, but it's clearly
necessary.

> This might not be a complete fix to every problem of this type that
> exists anywhere in our code, but I think it's enough to make the world
> safe for procsignal_sigusr1_handler.

Yeah; at the least this should cut down on the buildfarm noise we are
seeing ATM.

> Assuming nobody objects too much to this basic approach, should I
> back-patch the parts of this that apply pre-9.4?

Yes, I think so. We have seen some reports of irreproducible crashes
at process exit, and maybe this explains them.

			regards, tom lane
Re: [HACKERS] What is happening on buildfarm member crake?
On Fri, Jan 31, 2014 at 6:15 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> This looks good to me in principle. A couple minor beefs:
>
> * The addition to CleanupProcSignalState could use a comment, similar
> to the one you added in ProcKill.

OK.

> * I think the code in ProcKill and AuxiliaryProcKill might be more
> readable if the new local variable was named myproc (lower case).

grep indicates that naming is less common than what I picked, so I chose
to stick with what I picked.

>> and we can easily add a NULL guard to the SetLatch() call in
>> procsignal_sigusr1_handler, which the attached patch also does.

> Um ... no such change actually visible in patch, but it's clearly
> necessary.

Fixed.

>> This might not be a complete fix to every problem of this type that
>> exists anywhere in our code, but I think it's enough to make the world
>> safe for procsignal_sigusr1_handler.

> Yeah; at the least this should cut down on the buildfarm noise we are
> seeing ATM.

>> Assuming nobody objects too much to this basic approach, should I
>> back-patch the parts of this that apply pre-9.4?

> Yes, I think so. We have seen some reports of irreproducible crashes
> at process exit, and maybe this explains them.

OK, done.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] What is happening on buildfarm member crake?
On Sat, Jan 25, 2014 at 5:04 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
> then this is really just one instance of a bug that's probably pretty
> widespread in our signal handlers. Somebody needs to go through 'em all
> and look for touches of shared memory.

I haven't made a comprehensive study of every signal handler we have,
but looking at procsignal_sigusr1_handler, the list of functions that
can get called from here is quite short: CheckProcSignal(),
RecoveryConflictInterrupt(), SetLatch(), and latch_sigusr1_handler().
Taking those in reverse order:

- latch_sigusr1_handler() is fine. Nothing down this path touches
shared memory; moreover, if we've already disowned our latch, the
waiting flag won't be set and this will do nothing at all.

- The call to SetLatch() is problematic as we already know. This is
new code in 9.4.

- RecoveryConflictInterrupt() does nothing if proc_exit_inprogress is
set. So it's fine.

- CheckProcSignal() also appears problematic. If we've already
detached shared memory, MyProcSignalSlot will be pointing to garbage,
but we'll try to dereference it anyway.

I think maybe the best fix is to *clear* MyProc in
ProcKill/AuxiliaryProcKill and MyProcSignalSlot in
CleanupProcSignalState, as shown in the attached patch. Most places
that dereference those pointers already check that they aren't null,
and we can easily add a NULL guard to the SetLatch() call in
procsignal_sigusr1_handler, which the attached patch also does.

This might not be a complete fix to every problem of this type that
exists anywhere in our code, but I think it's enough to make the world
safe for procsignal_sigusr1_handler. We also have a *large* number of
signal handlers that do little more than this:

	if (MyProc)
		SetLatch(&MyProc->procLatch);

...and this change should make all of those safe as well. So I think
this is a pretty good start.

Assuming nobody objects too much to this basic approach, should I
back-patch the parts of this that apply pre-9.4? The problem with
CleanupProcSignalState, at least, goes all the way back to 9.0, when
the signal-multiplexing infrastructure was introduced. But the
probability of an actual crash must be pretty low, or I imagine we
would have noticed this sooner.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index 6ebabce..1372a7e 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -149,6 +149,8 @@ CleanupProcSignalState(int status, Datum arg)
 	slot = &ProcSignalSlots[pss_idx - 1];

 	Assert(slot == MyProcSignalSlot);
+	MyProcSignalSlot = NULL;
+
 	/* sanity check */
 	if (slot->pss_pid != MyProcPid)
 	{
diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
index ee6c24c..f64e1c4 100644
--- a/src/backend/storage/lmgr/proc.c
+++ b/src/backend/storage/lmgr/proc.c
@@ -772,6 +772,7 @@ ProcKill(int code, Datum arg)
 {
 	/* use volatile pointer to prevent code rearrangement */
 	volatile PROC_HDR *procglobal = ProcGlobal;
+	PGPROC	   *proc;

 	Assert(MyProc != NULL);
@@ -796,31 +797,34 @@ ProcKill(int code, Datum arg)
 	 */
 	LWLockReleaseAll();

-	/* Release ownership of the process's latch, too */
-	DisownLatch(&MyProc->procLatch);
+	/*
+	 * Clear MyProc first; then disown the process latch. This is so that
+	 * signal handlers won't try to clear the process latch after it's no
+	 * longer ours.
+	 */
+	proc = MyProc;
+	MyProc = NULL;
+	DisownLatch(&proc->procLatch);

 	SpinLockAcquire(ProcStructLock);

 	/* Return PGPROC structure (and semaphore) to appropriate freelist */
 	if (IsAnyAutoVacuumProcess())
 	{
-		MyProc->links.next = (SHM_QUEUE *) procglobal->autovacFreeProcs;
-		procglobal->autovacFreeProcs = MyProc;
+		proc->links.next = (SHM_QUEUE *) procglobal->autovacFreeProcs;
+		procglobal->autovacFreeProcs = proc;
 	}
 	else if (IsBackgroundWorker)
 	{
-		MyProc->links.next = (SHM_QUEUE *) procglobal->bgworkerFreeProcs;
-		procglobal->bgworkerFreeProcs = MyProc;
+		proc->links.next = (SHM_QUEUE *) procglobal->bgworkerFreeProcs;
+		procglobal->bgworkerFreeProcs = proc;
 	}
 	else
 	{
-		MyProc->links.next = (SHM_QUEUE *) procglobal->freeProcs;
-		procglobal->freeProcs = MyProc;
+		proc->links.next = (SHM_QUEUE *) procglobal->freeProcs;
+		procglobal->freeProcs = proc;
 	}

-	/* PGPROC struct isn't mine anymore */
-	MyProc = NULL;
-
 	/* Update shared estimate of spins_per_delay */
 	procglobal->spins_per_delay = update_spins_per_delay(procglobal->spins_per_delay);
@@ -849,6 +853,7 @@ AuxiliaryProcKill(int code, Datum arg)
 {
 	int			proctype = DatumGetInt32(arg);
 	PGPROC	   *auxproc PG_USED_FOR_ASSERTS_ONLY;
+	PGPROC	   *proc;

 	Assert(proctype >= 0 && proctype < NUM_AUXILIARY_PROCS);
@@ -859,16 +864,19 @@ AuxiliaryProcKill(int code, Datum arg)
 	/* Release any LW locks I am holding (see notes above) */
Re: [HACKERS] What is happening on buildfarm member crake?
On 01/19/2014 08:22 PM, Robert Haas wrote:
> Hmm, that looks an awful lot like the SIGUSR1 signal handler is
> getting called after we've already completed shmem_exit. And indeed
> that seems like the sort of thing that would result in dying horribly
> in just this way. The obvious fix seems to be to check
> proc_exit_inprogress before doing anything that might touch shared
> memory, but there are a lot of other SIGUSR1 handlers that don't do
> that either. However, in those cases, the likely cause of a SIGUSR1
> would be a sinval catchup interrupt or a recovery conflict, which
> aren't likely to be so far delayed that they arrive after we've
> already disconnected from shared memory. But the dynamic background
> workers stuff adds a new possible cause of SIGUSR1: the postmaster
> letting us know that a child has started or died. And that could
> happen even after we've detached shared memory.

Is anything happening about this? We're still getting quite a few of these:

http://www.pgbuildfarm.org/cgi-bin/show_failures.pl?max_days=3&member=crake

cheers

andrew
Re: [HACKERS] What is happening on buildfarm member crake?
Andrew Dunstan <and...@dunslane.net> writes:
> On 01/19/2014 08:22 PM, Robert Haas wrote:
>> Hmm, that looks an awful lot like the SIGUSR1 signal handler is
>> getting called after we've already completed shmem_exit. And indeed
>> that seems like the sort of thing that would result in dying horribly
>> in just this way. The obvious fix seems to be to check
>> proc_exit_inprogress before doing anything that might touch shared
>> memory, but there are a lot of other SIGUSR1 handlers that don't do
>> that either. However, in those cases, the likely cause of a SIGUSR1
>> would be a sinval catchup interrupt or a recovery conflict, which
>> aren't likely to be so far delayed that they arrive after we've
>> already disconnected from shared memory. But the dynamic background
>> workers stuff adds a new possible cause of SIGUSR1: the postmaster
>> letting us know that a child has started or died. And that could
>> happen even after we've detached shared memory.

> Is anything happening about this? We're still getting quite a few of these:
> http://www.pgbuildfarm.org/cgi-bin/show_failures.pl?max_days=3&member=crake

Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
then this is really just one instance of a bug that's probably pretty
widespread in our signal handlers. Somebody needs to go through 'em all
and look for touches of shared memory.

I'm not sure if we can just disable signal response the moment the
proc_exit_inprogress flag goes up, though. In some cases such as lock
handling, it's likely that we need that functionality to keep working
for some part of the shutdown process. We might end up having to
disable individual signal handlers at appropriate places. Ick.

			regards, tom lane
Re: [HACKERS] What is happening on buildfarm member crake?
On 01/25/2014 05:04 PM, Tom Lane wrote:
> Andrew Dunstan <and...@dunslane.net> writes:
>> On 01/19/2014 08:22 PM, Robert Haas wrote:
>>> Hmm, that looks an awful lot like the SIGUSR1 signal handler is
>>> getting called after we've already completed shmem_exit. And indeed
>>> that seems like the sort of thing that would result in dying horribly
>>> in just this way. The obvious fix seems to be to check
>>> proc_exit_inprogress before doing anything that might touch shared
>>> memory, but there are a lot of other SIGUSR1 handlers that don't do
>>> that either. However, in those cases, the likely cause of a SIGUSR1
>>> would be a sinval catchup interrupt or a recovery conflict, which
>>> aren't likely to be so far delayed that they arrive after we've
>>> already disconnected from shared memory. But the dynamic background
>>> workers stuff adds a new possible cause of SIGUSR1: the postmaster
>>> letting us know that a child has started or died. And that could
>>> happen even after we've detached shared memory.

>> Is anything happening about this? We're still getting quite a few of these:
>> http://www.pgbuildfarm.org/cgi-bin/show_failures.pl?max_days=3&member=crake

> Yeah. If Robert's diagnosis is correct, and it sounds pretty plausible,
> then this is really just one instance of a bug that's probably pretty
> widespread in our signal handlers. Somebody needs to go through 'em all
> and look for touches of shared memory.
>
> I'm not sure if we can just disable signal response the moment the
> proc_exit_inprogress flag goes up, though. In some cases such as lock
> handling, it's likely that we need that functionality to keep working
> for some part of the shutdown process. We might end up having to
> disable individual signal handlers at appropriate places. Ick.

Yeah. Since we're now providing for user-defined backends, maybe we need
some mechanism for white-listing handlers that can run in whole or in
part during shutdown.

cheers

andrew
[HACKERS] What is happening on buildfarm member crake?
Since contrib/test_shm_mq went into the tree, crake has been crashing
intermittently (maybe one time in three) at the contribcheck step.
While there's not conclusive proof that test_shm_mq is at fault, the
circumstantial evidence is pretty strong. For instance, in the
postmaster log for the last failure,
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2014-01-19%2013%3A26%3A08
we see

LOG:  registering background worker "test_shm_mq"
LOG:  starting background worker process "test_shm_mq"
LOG:  worker process: test_shm_mq (PID 12585) exited with exit code 1
LOG:  unregistering background worker "test_shm_mq"
LOG:  registering background worker "test_shm_mq"
LOG:  registering background worker "test_shm_mq"
LOG:  registering background worker "test_shm_mq"
LOG:  starting background worker process "test_shm_mq"
LOG:  starting background worker process "test_shm_mq"
LOG:  starting background worker process "test_shm_mq"
LOG:  worker process: test_shm_mq (PID 12588) exited with exit code 1
LOG:  unregistering background worker "test_shm_mq"
LOG:  worker process: test_shm_mq (PID 12586) exited with exit code 1
LOG:  unregistering background worker "test_shm_mq"
LOG:  worker process: test_shm_mq (PID 12587) exited with exit code 1
LOG:  unregistering background worker "test_shm_mq"
LOG:  server process (PID 12584) was terminated by signal 11: Segmentation fault
LOG:  terminating any other active server processes

Comparing the PIDs makes it seem pretty likely that what's crashing is
the backend running the test_shm_mq test. Since the connected psql
doesn't notice any problem, most likely the crash occurs during backend
shutdown after psql has disconnected (which would also explain why no
current query gets shown in the crash report).

crake is running F16 x86_64, which I don't have installed here. So I
tried to reproduce this on RHEL6 and F19 x86_64 machines, with build
options matching what crake says it's using, with absolutely no
success. I don't think any other buildfarm critters are showing this
either, which makes it look like it's possibly specific to the
particular compiler version crake has got. Or maybe there's some other
factor needed to trigger it.

Anyway, I wonder whether Andrew could get a stack trace from the core
dump next time it happens.

			regards, tom lane
Re: [HACKERS] What is happening on buildfarm member crake?
I wrote:
> I don't think any other buildfarm critters are showing this either,

Scratch that --- closer inspection of the buildfarm logs shows that
kouprey and lapwing have suffered identical failures, though much less
often than crake. That lets out the theory that it's x86_64 specific,
or even 64-bit specific. Still have no idea why I can't reproduce it
here.

			regards, tom lane
Re: [HACKERS] What is happening on buildfarm member crake?
On 01/19/2014 04:14 PM, Tom Lane wrote:
> I wrote:
>> I don't think any other buildfarm critters are showing this either,

> Scratch that --- closer inspection of the buildfarm logs shows that
> kouprey and lapwing have suffered identical failures, though much less
> often than crake. That lets out the theory that it's x86_64 specific,
> or even 64-bit specific. Still have no idea why I can't reproduce it
> here.

I can give you an account on the machine if you want to play.

Also crake does produce backtraces on core dumps, and they are at the
bottom of the buildfarm log. The latest failure backtrace is reproduced
below.

== stack trace: /home/bf/bfr/root/HEAD/inst/data-C/core.12584 ==
[New LWP 12584]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: buildfarm contrib_regression_test_shm_mq'.
Program terminated with signal 11, Segmentation fault.
#0  SetLatch (latch=0x1c) at pg_latch.c:509
509		if (latch->is_set)
#0  SetLatch (latch=0x1c) at pg_latch.c:509
#1  0x000000000064c04e in procsignal_sigusr1_handler (postgres_signal_arg=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/procsignal.c:289
#2  <signal handler called>
#3  _dl_fini () at dl-fini.c:190
#4  0x00000000361ba39931 in __run_exit_handlers (status=0, listp=0x361bdb1668, run_list_atexit=true) at exit.c:78
#5  0x00000000361ba399b5 in __GI_exit (status=<optimized out>) at exit.c:100
#6  0x00000000006485a6 in proc_exit (code=0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/ipc.c:143
#7  0x0000000000663abb in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0x12b8170 "contrib_regression_test_shm_mq", username=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/tcop/postgres.c:4225
#8  0x000000000062220f in BackendRun (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:4083
#9  BackendStartup (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:3772
#10 ServerLoop () at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1583
#11 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1238
#12 0x000000000045e2e8 in main (argc=3, argv=0x12b7430) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/main/main.c:205

cheers

andrew
Re: [HACKERS] What is happening on buildfarm member crake?
On Sun, Jan 19, 2014 at 7:53 PM, Andrew Dunstan <and...@dunslane.net> wrote:
> Also crake does produce backtraces on core dumps, and they are at the
> bottom of the buildfarm log. The latest failure backtrace is reproduced
> below.
>
> == stack trace: /home/bf/bfr/root/HEAD/inst/data-C/core.12584 ==
> [New LWP 12584]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `postgres: buildfarm contrib_regression_test_shm_mq'.
> Program terminated with signal 11, Segmentation fault.
> #0  SetLatch (latch=0x1c) at pg_latch.c:509
> 509		if (latch->is_set)
> #0  SetLatch (latch=0x1c) at pg_latch.c:509
> #1  0x000000000064c04e in procsignal_sigusr1_handler (postgres_signal_arg=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/procsignal.c:289
> #2  <signal handler called>
> #3  _dl_fini () at dl-fini.c:190
> #4  0x00000000361ba39931 in __run_exit_handlers (status=0, listp=0x361bdb1668, run_list_atexit=true) at exit.c:78
> #5  0x00000000361ba399b5 in __GI_exit (status=<optimized out>) at exit.c:100
> #6  0x00000000006485a6 in proc_exit (code=0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/storage/ipc/ipc.c:143
> #7  0x0000000000663abb in PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=0x12b8170 "contrib_regression_test_shm_mq", username=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/tcop/postgres.c:4225
> #8  0x000000000062220f in BackendRun (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:4083
> #9  BackendStartup (port=0x12d6bf0) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:3772
> #10 ServerLoop () at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1583
> #11 PostmasterMain (argc=<optimized out>, argv=<optimized out>) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/postmaster/postmaster.c:1238
> #12 0x000000000045e2e8 in main (argc=3, argv=0x12b7430) at /home/bf/bfr/root/HEAD/pgsql.25562/../pgsql/src/backend/main/main.c:205

Hmm, that looks an awful lot like the SIGUSR1 signal handler is
getting called after we've already completed shmem_exit. And indeed
that seems like the sort of thing that would result in dying horribly
in just this way. The obvious fix seems to be to check
proc_exit_inprogress before doing anything that might touch shared
memory, but there are a lot of other SIGUSR1 handlers that don't do
that either. However, in those cases, the likely cause of a SIGUSR1
would be a sinval catchup interrupt or a recovery conflict, which
aren't likely to be so far delayed that they arrive after we've
already disconnected from shared memory. But the dynamic background
workers stuff adds a new possible cause of SIGUSR1: the postmaster
letting us know that a child has started or died. And that could
happen even after we've detached shared memory.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company