Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-11-12 Thread Peter Eisentraut
On Sat, 2009-09-26 at 12:19 -0400, Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: strace on the backend processes all showed them waiting at futex(0x7f1ee5e21c90, FUTEX_WAIT_PRIVATE, 2, NULL Notably, the first argument was the same for all of them. Probably means they are

Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-11-12 Thread Marko Kreen
On 11/12/09, Peter Eisentraut pete...@gmx.net wrote: On Sat, 2009-09-26 at 12:19 -0400, Tom Lane wrote: Peter Eisentraut pete...@gmx.net writes: strace on the backend processes all showed them waiting at futex(0x7f1ee5e21c90, FUTEX_WAIT_PRIVATE, 2, NULL Notably, the first argument

Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-11-12 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes: strace on the backend processes all showed them waiting at futex(0x7f1ee5e21c90, FUTEX_WAIT_PRIVATE, 2, NULL Notably, the first argument was the same for all of them. Looks like a race condition or lockup in the syslog code. Hm, why are there two

Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-11-12 Thread Marko Kreen
On 11/12/09, Tom Lane t...@sss.pgh.pa.us wrote: The other thought is that quickdie should block signals before starting to do anything. There would still be possibility of recursive syslog() calls. Shouldn't we fix that too? I'm not sure how exactly. If the recursive elog() must stay, then

Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-11-12 Thread Tom Lane
Marko Kreen mark...@gmail.com writes: On 11/12/09, Tom Lane t...@sss.pgh.pa.us wrote: The other thought is that quickdie should block signals before starting to do anything. There would still be possibility of recursive syslog() calls. Shouldn't we fix that too? That's what the signal block

Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-11-12 Thread Marko Kreen
On 11/12/09, Tom Lane t...@sss.pgh.pa.us wrote: Marko Kreen mark...@gmail.com writes: On 11/12/09, Tom Lane t...@sss.pgh.pa.us wrote: The other thought is that quickdie should block signals before starting to do anything. There would still be possibility of recursive syslog() calls.

Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-11-12 Thread Tom Lane
Marko Kreen mark...@gmail.com writes: You talked about blocking in quickdie(), but you'd need to block in elog(). I'm not really particularly worried about that case. By that logic, we could not use quickdie at all, because any part of the system might be doing something that wouldn't survive

Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-11-12 Thread Peter Eisentraut
On Thu, 2009-11-12 at 10:45 -0500, Tom Lane wrote: In practice the code path isn't sufficiently used or critical enough to be worth trying to make that bulletproof. Well, the subject line is "recovery is stuck". Not critical enough?

Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-11-12 Thread Marko Kreen
On 11/12/09, Tom Lane t...@sss.pgh.pa.us wrote: Marko Kreen mark...@gmail.com writes: You talked about blocking in quickdie(), but you'd need to block in elog(). I'm not really particularly worried about that case. By that logic, we could not use quickdie at all, because any part of

Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-09-26 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes: strace on the backend processes all showed them waiting at futex(0x7f1ee5e21c90, FUTEX_WAIT_PRIVATE, 2, NULL Notably, the first argument was the same for all of them. Probably means they are blocked on semaphores. Stack traces would be much more

Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-09-25 Thread Peter Eisentraut
On Wed, 2009-09-23 at 10:04 -0400, Tom Lane wrote: I'd prefer not to go there, at least not without a demonstration that this will solve a bug that's unsolvable otherwise. If a child is really stuck in a state that doesn't accept SIGQUIT, it probably won't accept SIGKILL either (eg,

Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-09-25 Thread Alvaro Herrera
Peter Eisentraut wrote: strace on the backend processes all showed them waiting at futex(0x7f1ee5e21c90, FUTEX_WAIT_PRIVATE, 2, NULL Notably, the first argument was the same for all of them. I gather that a futex is a Linux kernel thing, which is probably then used by glibc to

[ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-09-23 Thread Peter Eisentraut
I have observed the following situation a few times now (weeks or months apart), most recently with 8.3.7. Some postgres child process crashes. The postmaster notices and sends SIGQUIT to all other children. Once all other children have exited, it would enter recovery. But for some reason, some

Re: [ADMIN] recovery is stuck when children are not processing SIGQUIT from previous crash

2009-09-23 Thread Tom Lane
Peter Eisentraut pete...@gmx.net writes: I have observed the following situation a few times now (weeks or months apart), most recently with 8.3.7. Some postgres child process crashes. The postmaster notices and sends SIGQUIT to all other children. Once all other children have exited, it