Re: [HACKERS] checkpointer code behaving strangely on postmaster -T

2012-05-11 Thread Simon Riggs
On 10 May 2012 16:14, Tom Lane t...@sss.pgh.pa.us wrote:
 Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of jue may 10 02:27:32 -0400 2012:
 Alvaro Herrera alvhe...@alvh.no-ip.org writes:
 I noticed while doing some tests that the checkpointer process does not
 recover very nicely after a backend crashes under postmaster -T

 It seems to me that the bug is in the postmaster state machine rather
 than checkpointer itself.  After a few false starts, this seems to fix
 it:

 --- a/src/backend/postmaster/postmaster.c
 +++ b/src/backend/postmaster/postmaster.c
 @@ -2136,6 +2136,8 @@ pmdie(SIGNAL_ARGS)
                     signal_child(WalWriterPID, SIGTERM);
                 if (BgWriterPID != 0)
                     signal_child(BgWriterPID, SIGTERM);
 +               if (FatalError  CheckpointerPID != 0)
 +                   signal_child(CheckpointerPID, SIGUSR2);

 Surely we do not want the checkpointer doing a shutdown checkpoint here.
 If we need it to die immediately, SIGQUIT is the way.  If we want a
 shutdown checkpoint, that has to wait till after everything else is
 known dead.  So while I agree this may be a state machine bug, that
 doesn't look like a good fix.

Is this now fixed? You've made a few changes so I'm confused. Thanks.

-- 
 Simon Riggs   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training  Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer code behaving strangely on postmaster -T

2012-05-11 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of jue may 10 02:27:32 -0400 2012:
 Alvaro Herrera alvhe...@alvh.no-ip.org writes:
 I noticed while doing some tests that the checkpointer process does not
 recover very nicely after a backend crashes under postmaster -T (after
 all processes have been kill -CONTd, of course, and postmaster told to
 shutdown via Ctrl-C on its console).  For some reason it seems to get
 stuck on a loop doing sleep(0.5s)  In other case I caught it trying to
 do a checkpoint, but it was progressing a single page each time and then
 sleeping.  In that condition, the checkpoint took a very long time to
 finish.

 Is this still a problem as of HEAD?  I think I've fixed some issues in
 the checkpointer's outer loop logic, but not sure if what you saw is
 still there.

 Yep, it's still there as far as I can tell.  A backtrace from the
 checkpointer shows it's waiting on the latch.

I'm confused about what you did here and whether this isn't just pilot
error.  If you run with -T then the postmaster will just SIGSTOP the
remaining child processes, but then it will sit and wait for them to
die, since the state machine expects them to react as though they'd been
sent SIGQUIT.  If you SIGCONT any of them then that process will resume,
totally ignorant that it's supposed to die.  So kill -CONTd, of course
makes no sense to me.  I tried killing one child with -KILL, then
sending SIGINT to the postmaster, then killing the remaining
already-stopped children, and the postmaster did exit as expected after
the last child died.

So I don't see any bug here.  And, after closer inspection, your
previous proposed patch is quite bogus because checkpointer is not
supposed to stop yet when the other processes are being terminated
normally.

Possibly it'd be useful to teach the postmaster more thoroughly about
SIGSTOP and have a way for it to really kill the remaining children
after you've finished investigating their state.  But frankly this
is the first time I've heard of anybody using that feature at all;
I always thought it was a vestigial hangover from days when the kernel
was too stupid to write separate core dump files for each backend.
I'd rather remove SendStop than add more complexity there.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer code behaving strangely on postmaster -T

2012-05-11 Thread Alvaro Herrera

Excerpts from Tom Lane's message of vie may 11 16:50:01 -0400 2012:

  Yep, it's still there as far as I can tell.  A backtrace from the
  checkpointer shows it's waiting on the latch.
 
 I'm confused about what you did here and whether this isn't just pilot
 error.  If you run with -T then the postmaster will just SIGSTOP the
 remaining child processes, but then it will sit and wait for them to
 die, since the state machine expects them to react as though they'd been
 sent SIGQUIT.

The sequence of events is:
postmaster -T
crash a backend
SIGINT postmaster
SIGCONT all child processes

My expectation is that postmaster should exit normally after this.  What
happens instead is that all processes exit, except checkpointer.  And in
fact, postmaster is now in PM_WAIT_BACKENDS state, so sending SIGINT a
second time will not shutdown checkpointer either.

Maybe we can consider this to be just pilot error, but then why do all
other processes exit normally?  To me it just seems an oversight in
checkpointer shutdown handling in conjuction with -T.

 If you SIGCONT any of them then that process will resume,
 totally ignorant that it's supposed to die.  So kill -CONTd, of course
 makes no sense to me.  I tried killing one child with -KILL, then
 sending SIGINT to the postmaster, then killing the remaining
 already-stopped children, and the postmaster did exit as expected after
 the last child died.

Uhm, after you SIGINTd postmaster didn't it shutdown all children?  That
would be odd.

 So I don't see any bug here.  And, after closer inspection, your
 previous proposed patch is quite bogus because checkpointer is not
 supposed to stop yet when the other processes are being terminated
 normally.

Well, it does send the signal only when FatalError is set.  So it should
only affect -T behavior.

 Possibly it'd be useful to teach the postmaster more thoroughly about
 SIGSTOP and have a way for it to really kill the remaining children
 after you've finished investigating their state.  But frankly this
 is the first time I've heard of anybody using that feature at all;
 I always thought it was a vestigial hangover from days when the kernel
 was too stupid to write separate core dump files for each backend.
 I'd rather remove SendStop than add more complexity there.

Hah.  I've used it a few times, but I can see that multiple core files
are okay.  Maybe you're right and we should just remove it.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer code behaving strangely on postmaster -T

2012-05-11 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of vie may 11 16:50:01 -0400 2012:
 I'm confused about what you did here and whether this isn't just pilot
 error.

 The sequence of events is:
 postmaster -T
 crash a backend
 SIGINT postmaster
 SIGCONT all child processes

 My expectation is that postmaster should exit normally after this.

Well, my expectation is that the postmaster should wait for the children
to finish dying, and then exit rather than respawn anything.  It is not
on the postmaster's head to make them die anymore, because it already
(thinks it) sent them SIGQUIT.  Using SIGCONT here is pilot error.

 Maybe we can consider this to be just pilot error, but then why do all
 other processes exit normally?

The reason for that is that the postmaster's SIGINT interrupt handler
(lines 2163ff) sent them SIGTERM, without bothering to notice that we'd
already sent them SIGQUIT/SIGSTOP; so once you CONT them they get the
SIGTERM and drop out normally.  That handler knows it should not signal
the checkpointer yet, so the checkpointer doesn't get the memo.  But the
lack of a FatalError check here is just a simplicity of implementation
thing; it should not be necessary to send any more signals once we are
in FatalError state.  Besides, this behavior is all wrong for a crash
recovery scenario: there is no guarantee that shared memory is in good
enough condition for SIGTERM shutdown to work.  And we *definitely*
don't want the checkpointer trying to write a shutdown checkpoint.

 So I don't see any bug here.  And, after closer inspection, your
 previous proposed patch is quite bogus because checkpointer is not
 supposed to stop yet when the other processes are being terminated
 normally.

 Well, it does send the signal only when FatalError is set.  So it should
 only affect -T behavior.

If FatalError is set, it should not be necessary to send any more
signals, period, because we already tried to kill every child.  If we
need to defend against somebody using SIGSTOP/SIGCONT inappropriately,
it would take a lot more thought (and code) than this, and it would
still be extremely fragile because a SIGCONT'd backend is going to be
executing against possibly-corrupt shared memory.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer code behaving strangely on postmaster -T

2012-05-10 Thread Tom Lane
Alvaro Herrera alvhe...@alvh.no-ip.org writes:
 I noticed while doing some tests that the checkpointer process does not
 recover very nicely after a backend crashes under postmaster -T (after
 all processes have been kill -CONTd, of course, and postmaster told to
 shutdown via Ctrl-C on its console).  For some reason it seems to get
 stuck on a loop doing sleep(0.5s)  In other case I caught it trying to
 do a checkpoint, but it was progressing a single page each time and then
 sleeping.  In that condition, the checkpoint took a very long time to
 finish.

Is this still a problem as of HEAD?  I think I've fixed some issues in
the checkpointer's outer loop logic, but not sure if what you saw is
still there.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer code behaving strangely on postmaster -T

2012-05-10 Thread Alvaro Herrera

Excerpts from Tom Lane's message of jue may 10 02:27:32 -0400 2012:
 Alvaro Herrera alvhe...@alvh.no-ip.org writes:
  I noticed while doing some tests that the checkpointer process does not
  recover very nicely after a backend crashes under postmaster -T (after
  all processes have been kill -CONTd, of course, and postmaster told to
  shutdown via Ctrl-C on its console).  For some reason it seems to get
  stuck on a loop doing sleep(0.5s)  In other case I caught it trying to
  do a checkpoint, but it was progressing a single page each time and then
  sleeping.  In that condition, the checkpoint took a very long time to
  finish.
 
 Is this still a problem as of HEAD?  I think I've fixed some issues in
 the checkpointer's outer loop logic, but not sure if what you saw is
 still there.

Yep, it's still there as far as I can tell.  A backtrace from the
checkpointer shows it's waiting on the latch.

It seems to me that the bug is in the postmaster state machine rather
than checkpointer itself.  After a few false starts, this seems to fix
it:

--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -2136,6 +2136,8 @@ pmdie(SIGNAL_ARGS)
signal_child(WalWriterPID, SIGTERM);
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+   if (FatalError  CheckpointerPID != 0)
+   signal_child(CheckpointerPID, SIGUSR2);
 
/*
 * If we're in recovery, we can't kill the startup process
@@ -2178,6 +2180,8 @@ pmdie(SIGNAL_ARGS)
signal_child(WalReceiverPID, SIGTERM);
if (BgWriterPID != 0)
signal_child(BgWriterPID, SIGTERM);
+   if (FatalError  CheckpointerPID != 0)
+   signal_child(CheckpointerPID, SIGUSR2);
if (pmState == PM_RECOVERY)
{
/* only checkpointer is active in this state */


Note that since checkpointer can only be running after we enter
FatalError when the -T (send SIGSTOP instead of SIGQUIT) switch is used,
this bug doesn't seem to affect normal usage.  (I'm not sure SIGUSR2 is
the most appropriate signal to send at this time -- since we're in
FatalError, probably SIGQUIT is better suited.)

One good thing is that when I patched postmaster in a different way
(which I later realized to be bogus), I caused it to die with an
assertion while checkpointer was still running; the debug output let me
know that checkpointer went away immediately.

-- 
Álvaro Herrera alvhe...@commandprompt.com
The PostgreSQL Company - Command Prompt, Inc.
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] checkpointer code behaving strangely on postmaster -T

2012-05-10 Thread Tom Lane
Alvaro Herrera alvhe...@commandprompt.com writes:
 Excerpts from Tom Lane's message of jue may 10 02:27:32 -0400 2012:
 Alvaro Herrera alvhe...@alvh.no-ip.org writes:
 I noticed while doing some tests that the checkpointer process does not
 recover very nicely after a backend crashes under postmaster -T

 It seems to me that the bug is in the postmaster state machine rather
 than checkpointer itself.  After a few false starts, this seems to fix
 it:

 --- a/src/backend/postmaster/postmaster.c
 +++ b/src/backend/postmaster/postmaster.c
 @@ -2136,6 +2136,8 @@ pmdie(SIGNAL_ARGS)
 signal_child(WalWriterPID, SIGTERM);
 if (BgWriterPID != 0)
 signal_child(BgWriterPID, SIGTERM);
 +   if (FatalError  CheckpointerPID != 0)
 +   signal_child(CheckpointerPID, SIGUSR2);

Surely we do not want the checkpointer doing a shutdown checkpoint here.
If we need it to die immediately, SIGQUIT is the way.  If we want a
shutdown checkpoint, that has to wait till after everything else is
known dead.  So while I agree this may be a state machine bug, that
doesn't look like a good fix.

regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers