Re: [HACKERS] Something is fairly whacko about shutdown in CVS HEAD

2007-07-01 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 To resolve this I think we need a clearer definition of the autovac
 launcher's role in life.  I see that it is attached to shared memory;
 is it supposed to be able to execute transactions or otherwise do
 anything the bgwriter might have to clean up after?

 No; the launcher is attached to shared memory, but it doesn't connect to
 databases and it doesn't execute transactions either.  In fact, the
 launcher cannot do anything when the postmaster is not running (or when
 it is not in a state where it doesn't want to start processes, anyway),
 because it (the launcher) is only capable of sending signals to
 postmaster (apart from that, it ocasionally grabs lwlocks, reads the
 pg_database flat file, pgstats, and the Xid counter in shared memory).
 It doesn't write anything.  I think it is perfectly acceptable to have
 the launcher shut down in parallel with bgwriter.

[ after sleeping on it... ]  That's probably okay as far as bgwriter
goes, but I think it's a bad idea for anything connected to shared
memory to outlive the postmaster.  To do so opens risks that some
platforms might treat the postmaster's shmctl(IPC_RMID) differently
than others.  So I want the postmaster to not exit until it's seen
the launcher quit.  It's quite easy to make this happen if we treat
the launcher more like a backend, and don't start bgwriter shutdown
till it's gone.  (I have a patch I've been testing that does it that
way.)  Allowing the two to shut down in parallel seems more complex,
though if you see a reasonable way to make it work, propose a patch.

regards, tom lane

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


[HACKERS] Something is fairly whacko about shutdown in CVS HEAD

2007-06-30 Thread Tom Lane
I'm seeing two sets of shutdown messages, and apparently a second
shutdown checkpoint being forced, during a normal database stop:

2007-06-30 14:21:00 EDT 9644 LOG:  received smart shutdown request
2007-06-30 14:21:00 EDT 9647 LOG:  shutting down
2007-06-30 14:21:00 EDT 9647 LOG:  checkpoint starting: shutdown immediate
2007-06-30 14:21:00 EDT 9647 LOG:  checkpoint complete: wrote 23 buffers 
(0.6%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.003 s, 
sync=0.178 s, total=0.237 s
2007-06-30 14:21:00 EDT 9647 LOG:  database system is shut down
2007-06-30 14:21:00 EDT 9644 LOG:  background writer process (PID 9647) exited 
with exit code 0
2007-06-30 14:21:00 EDT 9644 LOG:  terminating any other active server processes
2007-06-30 14:21:00 EDT 9644 LOG:  all server processes terminated; 
reinitializing
2007-06-30 14:21:00 EDT 9654 LOG:  database system was shut down at 2007-06-30 
14:21:00 EDT
2007-06-30 14:21:01 EDT 9655 LOG:  shutting down
2007-06-30 14:21:01 EDT 9655 LOG:  checkpoint starting: shutdown immediate
2007-06-30 14:21:01 EDT 9655 LOG:  checkpoint complete: wrote 0 buffers (0.0%); 
0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 s, 
sync=0.000 s, total=0.041 s
2007-06-30 14:21:01 EDT 9655 LOG:  database system is shut down

I haven't looked for the cause yet but I suspect it's got something to
do with this patch:
http://archives.postgresql.org/pgsql-committers/2007-06/msg00285.php
Either that or somebody else broke it recently.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Something is fairly whacko about shutdown in CVS HEAD

2007-06-30 Thread Alvaro Herrera
Tom Lane wrote:
 I'm seeing two sets of shutdown messages, and apparently a second
 shutdown checkpoint being forced, during a normal database stop:
 
 2007-06-30 14:21:00 EDT 9644 LOG:  received smart shutdown request
 2007-06-30 14:21:00 EDT 9647 LOG:  shutting down
 2007-06-30 14:21:00 EDT 9647 LOG:  checkpoint starting: shutdown immediate
 2007-06-30 14:21:00 EDT 9647 LOG:  checkpoint complete: wrote 23 buffers 
 (0.6%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.003 
 s, sync=0.178 s, total=0.237 s
 2007-06-30 14:21:00 EDT 9647 LOG:  database system is shut down
 2007-06-30 14:21:00 EDT 9644 LOG:  background writer process (PID 9647) 
 exited with exit code 0
 2007-06-30 14:21:00 EDT 9644 LOG:  terminating any other active server 
 processes
 2007-06-30 14:21:00 EDT 9644 LOG:  all server processes terminated; 
 reinitializing
 2007-06-30 14:21:00 EDT 9654 LOG:  database system was shut down at 
 2007-06-30 14:21:00 EDT
 2007-06-30 14:21:01 EDT 9655 LOG:  shutting down
 2007-06-30 14:21:01 EDT 9655 LOG:  checkpoint starting: shutdown immediate
 2007-06-30 14:21:01 EDT 9655 LOG:  checkpoint complete: wrote 0 buffers 
 (0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000 
 s, sync=0.000 s, total=0.041 s
 2007-06-30 14:21:01 EDT 9655 LOG:  database system is shut down
 
 I haven't looked for the cause yet but I suspect it's got something to
 do with this patch:
 http://archives.postgresql.org/pgsql-committers/2007-06/msg00285.php
 Either that or somebody else broke it recently.

Huh, I can't reproduce it here.

2007-06-30 20:51:27 CLT 23221 LOG:  received smart shutdown request
2007-06-30 20:51:27 CLT 23223 LOG:  shutting down
2007-06-30 20:51:27 CLT 23225 LOG:  autovacuum launcher shutting down
2007-06-30 20:51:27 CLT 23223 LOG:  checkpoint starting: shutdown immediate
2007-06-30 20:51:27 CLT 23223 LOG:  checkpoint complete: wrote 474 buffers 
(15.4%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.016 s, 
sync=0.084 s, total=0.124 s
2007-06-30 20:51:27 CLT 23223 LOG:  database system is shut down

In your test output, it looks like the bgwriter is exiting early with code 0
which makes postmaster upset, so it restarts it.  Or maybe bgwriter
exits as expected but the postmaster is denying reality.

-- 
Alvaro Herrera http://www.amazon.com/gp/registry/DXLWNGRJD34J
Investigación es lo que hago cuando no sé lo que estoy haciendo
(Wernher von Braun)

---(end of broadcast)---
TIP 1: if posting/reading through Usenet, please send an appropriate
   subscribe-nomail command to [EMAIL PROTECTED] so that your
   message can get through to the mailing list cleanly


Re: [HACKERS] Something is fairly whacko about shutdown in CVS HEAD

2007-06-30 Thread Tom Lane
Alvaro Herrera [EMAIL PROTECTED] writes:
 Tom Lane wrote:
 I'm seeing two sets of shutdown messages, and apparently a second
 shutdown checkpoint being forced, during a normal database stop:

 Huh, I can't reproduce it here.

It looks to me like this is a race condition induced by the
autovacuum-launcher patches.  Observe the following chunk of
postmaster.c, which responds to exit of the bgwriter child:

/*
 * Was it the bgwriter?
 */
if (BgWriterPID != 0  pid == BgWriterPID)
{
BgWriterPID = 0;
if (EXIT_STATUS_0(exitstatus) 
Shutdown  NoShutdown  !FatalError 
!DLGetHead(BackendList)  AutoVacPID == 0)
{
/*
 * Normal postmaster exit is here: we've seen normal exit of
 * the bgwriter after it's been told to shut down. We expect
 * that it wrote a shutdown checkpoint.  (If for some reason
 * it didn't, recovery will occur on next postmaster start.)
 *
 * Note: we do not wait around for exit of the archiver or
 * stats processes.  They've been sent SIGQUIT by this point,
 * and in any case contain logic to commit hara-kiri if they
 * notice the postmaster is gone.
 */
ExitPostmaster(0);
}

/*
 * Any unexpected exit of the bgwriter (including FATAL exit)
 * is treated as a crash.
 */
HandleChildCrash(pid, exitstatus,
 _(background writer process));

If AutoVacPID is still nonzero when bgwriter exit is detected,
then we think we've seen a crash.  I'm not clear why it happens
reliably for me and not for you, but this is certainly a bug.

To resolve this I think we need a clearer definition of the autovac
launcher's role in life.  I see that it is attached to shared memory;
is it supposed to be able to execute transactions or otherwise do
anything the bgwriter might have to clean up after?  If so we need
to fix things so that we don't tell the bgwriter to exit until after
the launcher is gone.  If not, we could possibly allow these things
to happen asynchronously, though I wonder whether it wouldn't be best
to force the ordering anyway.

regards, tom lane

---(end of broadcast)---
TIP 3: Have you checked our extensive FAQ?

   http://www.postgresql.org/docs/faq


Re: [HACKERS] Something is fairly whacko about shutdown in CVS HEAD

2007-06-30 Thread Alvaro Herrera
Tom Lane wrote:
 Alvaro Herrera [EMAIL PROTECTED] writes:
  Tom Lane wrote:
  I'm seeing two sets of shutdown messages, and apparently a second
  shutdown checkpoint being forced, during a normal database stop:
 
  Huh, I can't reproduce it here.
 
 It looks to me like this is a race condition induced by the
 autovacuum-launcher patches.  Observe the following chunk of
 postmaster.c, which responds to exit of the bgwriter child:

 If AutoVacPID is still nonzero when bgwriter exit is detected,
 then we think we've seen a crash.  I'm not clear why it happens
 reliably for me and not for you, but this is certainly a bug.

Oops, you are right.  So the problem predates the latest autovac patch;
it goes all the way back to when the launcher was introduced.  I can't
imagine the reason, but I haven't ever seen the double shutdown
checkpoint (not that I've been looking too closely, but I almost always
run postmasters with stderr to a console in front of my eyes when I'm
testing autovac).

 To resolve this I think we need a clearer definition of the autovac
 launcher's role in life.  I see that it is attached to shared memory;
 is it supposed to be able to execute transactions or otherwise do
 anything the bgwriter might have to clean up after?  If so we need
 to fix things so that we don't tell the bgwriter to exit until after
 the launcher is gone.  If not, we could possibly allow these things
 to happen asynchronously, though I wonder whether it wouldn't be best
 to force the ordering anyway.

No; the launcher is attached to shared memory, but it doesn't connect to
databases and it doesn't execute transactions either.  In fact, the
launcher cannot do anything when the postmaster is not running (or when
it is not in a state where it doesn't want to start processes, anyway),
because it (the launcher) is only capable of sending signals to
postmaster (apart from that, it ocasionally grabs lwlocks, reads the
pg_database flat file, pgstats, and the Xid counter in shared memory).
It doesn't write anything.  I think it is perfectly acceptable to have
the launcher shut down in parallel with bgwriter.

-- 
Alvaro Herrera  http://www.amazon.com/gp/registry/5ZYLFMCVHXC
Amanece.   (Ignacio Reyes)
 El Cerro San Cristóbal me mira, cínicamente, con ojos de virgen

---(end of broadcast)---
TIP 6: explain analyze is your friend