Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-28 Thread Alvaro Herrera
MauMau escribió:

Hi,

 I did this.  Please find attached the revised patch.  I modified
 HandleChildCrash().  I tested the immediate shutdown, and the child
 cleanup succeeded.

Thanks, committed.

There are two matters pending here:

1. do we want postmaster to exit immediately after sending the SIGKILL,
or hang around until all children are gone?

2. how far do we want to backpatch this, if at all?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-28 Thread Robert Haas
On Fri, Jun 28, 2013 at 6:00 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 MauMau escribió:

 Hi,

 I did this.  Please find attached the revised patch.  I modified
 HandleChildCrash().  I tested the immediate shutdown, and the child
 cleanup succeeded.

 Thanks, committed.

 There are two matters pending here:

 1. do we want postmaster to exit immediately after sending the SIGKILL,
 or hang around until all children are gone?

 2. how far do we want to backpatch this, if at all?

Considering that neither Tom nor I was convinced that this was a good
idea AT ALL, I'm surprised you committed it in the first place.  I'd
be more inclined to revert it than back-patch it, but at least if we
only change it in HEAD we have some chance of finding out whether or
not it is in fact a bad idea before it's too late to change our mind.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-27 Thread MauMau

Hi, Alvaro san,

From: Alvaro Herrera alvhe...@2ndquadrant.com

MauMau escribió:
Yeah, I see that --- after removing that early exit, there are unwanted
messages.  And in fact there are some signals sent that weren't
previously sent.  Clearly we need something here: if we're in immediate
shutdown handler, don't signal anyone (because they have already been
signalled) and don't log any more messages; but the cleaning up of
postmaster's process list must still be carried out.

Would you please add that on top of the attached cleaned up version of
your patch?


I did this.  Please find attached the revised patch.  I modified 
HandleChildCrash().  I tested the immediate shutdown, and the child cleanup 
succeeded.


In addition, I added an if condition at the end of the function.  This is to 
prevent resetting AbortStartTime every time one child terminates.


Regards
MauMau



reliable_immediate_shutdown-3.patch
Description: Binary data



Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-25 Thread MauMau

From: Alvaro Herrera alvhe...@2ndquadrant.com

Yeah, I see that --- after removing that early exit, there are unwanted
messages.  And in fact there are some signals sent that weren't
previously sent.  Clearly we need something here: if we're in immediate
shutdown handler, don't signal anyone (because they have already been
signalled) and don't log any more messages; but the cleaning up of
postmaster's process list must still be carried out.

Would you please add that on top of the attached cleaned up version of
your patch?


Thanks.  I'll do that tomorrow at the earliest.


Noah Misch escribió:

On Sun, Jun 23, 2013 at 01:55:19PM +0900, MauMau wrote:



 the clients at the immediate shutdown.  The code gets much simpler.  In
 addition, it may be better that we similarly send SIGKILL in backend
 crash (FatalError) case, eliminate the use of SIGQUIT and remove
 quickdie() and other SIGQUIT handlers.

My take is that the client message has some value, and we shouldn't just
discard it to simplify the code slightly.  Finishing the shutdown quickly has
value, of course.  The relative importance of those things should guide the
choice of a timeout under method #2, but I don't see a rigorous way to draw
the line.  I feel five seconds is, if anything, too high.


There's obviously a lot of disagreement on 5 seconds being too high or
too low.  We have just followed SysV init's precedent of waiting 5
seconds by default between sending signals TERM and QUIT during a
shutdown.  I will note that during a normal shutdown, services are
entitled to do much more work than just signal all their children to
exit immediately; and yet I don't find much evidence that this period is
inordinately short.  I don't feel strongly that it couldn't be shorter,
though.

What about using deadlock_timeout?  It constitutes a rough barometer on the
system's tolerance for failure detection latency, and we already overload it
by having it guide log_lock_waits.  The default of 1s makes sense to me for
this purpose, too.  We can always add a separate immediate_shutdown_timeout
if there's demand, but I don't expect such demand.  (If we did add such a
GUC, setting it to zero could be the way to request method 1.  If some folks
strongly desire method 1, that's probably the way to offer it.)


I dunno.  Having this be configurable seems overkill to me.  But perhaps
that's the way to satisfy most people: some people could set it very
high so that they could have postmaster wait longer if they believe
their server is going to be overloaded; people wishing immediate SIGKILL
could get that too, as you describe.

I think this should be a separate patch, however.


I think so, too.  We can add a parameter later if we find it highly 
necessary after some experience in the field.



Regards
MauMau






Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-24 Thread Alvaro Herrera
MauMau escribió:
 From: Alvaro Herrera alvhe...@2ndquadrant.com

 Actually, in further testing I noticed that the fast-path you introduced
 in BackendCleanup (or was it HandleChildCrash?) in the immediate
 shutdown case caused postmaster to fail to clean up properly after
 sending the SIGKILL signal, so I had to remove that as well.  Was there
 any reason for that?
 
 You are talking about the code below at the beginning of
 HandleChildCrash(), aren't you?
 
 /* Do nothing if the child terminated due to immediate shutdown */
 if (Shutdown == ImmediateShutdown)
  return;
 
 If my memory is correct, this was necessary; without this,
 HandleChildCrash() or LogChildExit() left extra messages in the
 server log.
 
 LOG:  %s (PID %d) exited with exit code %d
 LOG:  terminating any other active server processes
 
 These messages are output because postmaster sends SIGQUIT to its
 children and wait for them to terminate.  The children exit with
 non-zero status when they receive SIGQUIT.

Yeah, I see that --- after removing that early exit, there are unwanted
messages.  And in fact there are some signals sent that weren't
previously sent.  Clearly we need something here: if we're in immediate
shutdown handler, don't signal anyone (because they have already been
signalled) and don't log any more messages; but the cleaning up of
postmaster's process list must still be carried out.

Would you please add that on top of the attached cleaned up version of
your patch?


Noah Misch escribió:
 On Sun, Jun 23, 2013 at 01:55:19PM +0900, MauMau wrote:

  the clients at the immediate shutdown.  The code gets much simpler.  In  
  addition, it may be better that we similarly send SIGKILL in backend 
  crash (FatalError) case, eliminate the use of SIGQUIT and remove 
  quickdie() and other SIGQUIT handlers.
 
 My take is that the client message has some value, and we shouldn't just
 discard it to simplify the code slightly.  Finishing the shutdown quickly has
 value, of course.  The relative importance of those things should guide the
 choice of a timeout under method #2, but I don't see a rigorous way to draw
 the line.  I feel five seconds is, if anything, too high.

There's obviously a lot of disagreement on 5 seconds being too high or
too low.  We have just followed SysV init's precedent of waiting 5
seconds by default between sending signals TERM and QUIT during a
shutdown.  I will note that during a normal shutdown, services are
entitled to do much more work than just signal all their children to
exit immediately; and yet I don't find much evidence that this period is
inordinately short.  I don't feel strongly that it couldn't be shorter,
though.

 What about using deadlock_timeout?  It constitutes a rough barometer on the
 system's tolerance for failure detection latency, and we already overload it
 by having it guide log_lock_waits.  The default of 1s makes sense to me for
 this purpose, too.  We can always add a separate immediate_shutdown_timeout if
 there's demand, but I don't expect such demand.  (If we did add such a GUC,
 setting it to zero could be the way to request method 1.  If some folks
 strongly desire method 1, that's probably the way to offer it.)

I dunno.  Having this be configurable seems overkill to me.  But perhaps
that's the way to satisfy most people: some people could set it very
high so that they could have postmaster wait longer if they believe
their server is going to be overloaded; people wishing immediate SIGKILL
could get that too, as you describe.

I think this should be a separate patch, however.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
*** a/doc/src/sgml/runtime.sgml
--- b/doc/src/sgml/runtime.sgml
***************
*** 1362,1372 ****
   <listitem>
    <para>
     This is the <firstterm>Immediate Shutdown</firstterm> mode.
!    The master <command>postgres</command> process will send a
!    <systemitem>SIGQUIT</systemitem> to all child processes and exit
!    immediately, without properly shutting itself down. The child processes
!    likewise exit immediately upon receiving
!    <systemitem>SIGQUIT</systemitem>. This will lead to recovery (by
     replaying the WAL log) upon next start-up. This is recommended
     only in emergencies.
    </para>
--- 1362,1372 ----
   <listitem>
    <para>
     This is the <firstterm>Immediate Shutdown</firstterm> mode.
!    The server will send <systemitem>SIGQUIT</systemitem> to all child
!    processes and wait for them to terminate.  Those that don't terminate
!    within 5 seconds will be sent <systemitem>SIGKILL</systemitem> by the
!    master <command>postgres</command> process, which will then terminate
!    without further waiting.  This will lead to recovery (by
     replaying the WAL log) upon next start-up. This is recommended
     only in emergencies.
    </para>

Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-22 Thread Alvaro Herrera
MauMau escribió:

 Are you suggesting simplifying the following part in ServerLoop()?
 I welcome the idea if this condition becomes simpler.  However, I
 cannot imagine how.

  if (AbortStartTime != 0 &&  /* SIGKILL only once */
      (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
      now - AbortStartTime >= 10)
  {
      SignalAllChildren(SIGKILL);
      AbortStartTime = 0;
  }

Yes, that's what I'm suggesting.  I haven't tried coding it yet.

 I thought of adding some new state of pmState for some reason (that
 might be the same as your idea).
 But I refrained from doing that, because pmState has already many
 states.  I was afraid adding a new pmState value for this bug fix
 would complicate the state management (e.g. state transition in
 PostmasterStateMachine()).  In addition, I felt PM_WAIT_BACKENDS was
 appropriate because postmaster is actually waiting for backends to
 terminate after sending SIGQUIT.  The state name is natural.

Well, a natural state name is of no use if we're using it to indicate
two different states, which I think is what would be happening here.
Basically, with your patch, PM_WAIT_BACKENDS means two different things
depending on AbortStartTime.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-22 Thread Robert Haas
On Fri, Jun 21, 2013 at 10:02 PM, MauMau maumau...@gmail.com wrote:
 I'm comfortable with 5 seconds.  We are talking about the interval between
 sending SIGQUIT to the children and then sending SIGKILL to them.  In most
 situations, the backends should terminate immediately.  However, as I said a
 few months ago, the ereport() call in quickdie() can deadlock indefinitely. This
 is a PostgreSQL bug.

So let's fix that bug.  Then we don't need this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-22 Thread MauMau

From: Alvaro Herrera alvhe...@2ndquadrant.com

MauMau escribió:

I thought of adding some new state of pmState for some reason (that
might be the same as your idea).
But I refrained from doing that, because pmState has already many
states.  I was afraid adding a new pmState value for this bug fix
would complicate the state management (e.g. state transition in
PostmasterStateMachine()).  In addition, I felt PM_WAIT_BACKENDS was
appropriate because postmaster is actually waiting for backends to
terminate after sending SIGQUIT.  The state name is natural.


Well, a natural state name is of no use if we're using it to indicate
two different states, which I think is what would be happening here.
Basically, with your patch, PM_WAIT_BACKENDS means two different things
depending on AbortStartTime.


I probably got your feeling.  You aren't feeling comfortable about using the 
time variable AbortStartTime as a state variable to change postmaster's 
behavior, are you?


That may be right, but I'm not sure, well... because in PM_WAIT_BACKENDS, as 
the name indicates, postmaster is indeed waiting for the backends to 
terminate regardless of AbortStartTime.  Apart from this, postmaster seems 
to change its behavior in the same pmState depending on other variables such 
as Shutdown and FatalError.  I'm not confident in which is better, so I 
won't object if the reviewer or committer modifies the code.


Regards
MauMau





Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-22 Thread MauMau

From: Robert Haas robertmh...@gmail.com

On Fri, Jun 21, 2013 at 10:02 PM, MauMau maumau...@gmail.com wrote:
I'm comfortable with 5 seconds.  We are talking about the interval between
sending SIGQUIT to the children and then sending SIGKILL to them.  In most
situations, the backends should terminate immediately.  However, as I said a
few months ago, the ereport() call in quickdie() can deadlock indefinitely.
This is a PostgreSQL bug.


So let's fix that bug.  Then we don't need this.


There are two ways to fix that bug.  You are suggesting one of the following 
two methods, aren't you?


1. (Robert-san's idea)
Upon receipt of an immediate shutdown request, postmaster sends SIGKILL to its 
children.


2. (Tom-san's idea)
Upon receipt of an immediate shutdown request, postmaster first sends SIGQUIT 
to its children, waits a while for them to terminate, then sends SIGKILL to 
them.



At first I proposed 1.  Then Tom-san suggested 2 so that the server is as 
friendly to the clients as now by notifying them of the server shutdown.  I 
was convinced by that idea and chose 2.


Basically, I think both ideas are right.  They can solve the original 
problem.


However, re-considering the meaning of immediate shutdown, I feel 1 is a 
bit better, because trying to do something in quickdie() or somewhere does 
not match the idea of "immediate".  We may not have to be friendly to the 
clients at the immediate shutdown.  The code gets much simpler.  In 
addition, it may be better that we similarly send SIGKILL in the backend crash 
(FatalError) case, eliminate the use of SIGQUIT and remove quickdie() and 
other SIGQUIT handlers.


What do you think?  How should we make consensus and proceed?

Regards
MauMau






Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-21 Thread Hitoshi Harada
On Thu, Jun 20, 2013 at 3:40 PM, MauMau maumau...@gmail.com wrote:

  Here, reliable means that the database server is certainly shut
 down when pg_ctl returns, not telling a lie that I shut down the
 server processes for you, so you do not have to be worried that some
 postgres process might still remain and write to disk.  I suppose
 reliable shutdown is crucial especially in HA cluster.  If pg_ctl
 stop -mi gets stuck forever when there is an unkillable process (in
 what situations does this happen? OS bug, or NFS hard mount?), I
 think the DBA has to notice this situation from the unfinished
 pg_ctl, investigate the cause, and take corrective action.


 So you're suggesting that keeping postmaster up is a useful sign that
 the shutdown is not going well?  I'm not really sure about this.  What
 do others think?


 I think you are right, and there is no harm in leaving postgres processes
 in unkillable state.  I'd like to leave the decision to you and/or others.


+1 for leaving processes, not waiting for long (or possibly forever;
remember, not all people are running postgres on such cluster ware).  I'm
sure some users rely on the current behavior of immediate shutdown.  If
someone wants to ensure processes are finished when pg_ctl returns, isn't
that fast shutdown rather than immediate shutdown?  To me the current
immediate shutdown is reliable, in that it without doubt returns control
back to the terminal, after killing the postmaster at least.

Thanks,
-- 
Hitoshi Harada


Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-21 Thread MauMau

From: Alvaro Herrera alvhe...@2ndquadrant.com

MauMau escribió:


One concern is that umount would fail in such a situation because
postgres has some open files on the filesystem, which is on the
shared disk in case of traditional HA cluster.


See my reply to Noah.  If postmaster stays around, would this be any
different?  I don't think so.


I'll consider this a bit and respond as a reply to Andres-san's mail.



Actually, in further testing I noticed that the fast-path you introduced
in BackendCleanup (or was it HandleChildCrash?) in the immediate
shutdown case caused postmaster to fail to clean up properly after
sending the SIGKILL signal, so I had to remove that as well.  Was there
any reason for that?


You are talking about the code below at the beginning of HandleChildCrash(), 
aren't you?


/* Do nothing if the child terminated due to immediate shutdown */
if (Shutdown == ImmediateShutdown)
 return;

If my memory is correct, this was necessary; without this, 
HandleChildCrash() or LogChildExit() left extra messages in the server log.


LOG:  %s (PID %d) exited with exit code %d
LOG:  terminating any other active server processes

These messages are output because postmaster sends SIGQUIT to its children 
and waits for them to terminate.  The children exit with non-zero status 
when they receive SIGQUIT.



Regards
MauMau





Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-21 Thread MauMau

From: Alvaro Herrera alvhe...@2ndquadrant.com

Actually, I think it would be cleaner to have a new state in pmState,
namely PM_IMMED_SHUTDOWN which is entered when we send SIGQUIT.  When
we're in this state, postmaster is only waiting for the timeout to
expire; and when it does, it sends SIGKILL and exits.  Pretty much the
same you have, except that instead of checking AbortStartTime we check
the pmState variable.


Are you suggesting simplifying the following part in ServerLoop()?  I 
welcome the idea if this condition becomes simpler.  However, I cannot 
imagine how.


 if (AbortStartTime != 0 &&  /* SIGKILL only once */
     (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
     now - AbortStartTime >= 10)
 {
     SignalAllChildren(SIGKILL);
     AbortStartTime = 0;
 }

I thought of adding a new pmState state (that might be the same as your 
idea).  But I refrained from doing that, because pmState already has many 
states.  I was afraid adding a new pmState value for this bug fix would 
complicate the state management (e.g. state transitions in 
PostmasterStateMachine()).  In addition, I felt PM_WAIT_BACKENDS was 
appropriate because postmaster is actually waiting for backends to 
terminate after sending SIGQUIT.  The state name is natural.


I don't have strong objection to your idea if it makes the code cleaner and 
more understandable.  Thank you very much.


Regards
MauMau





Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-21 Thread Robert Haas
On Thu, Jun 20, 2013 at 12:33 PM, Alvaro Herrera
alvhe...@2ndquadrant.com wrote:
 I will go with 5 seconds, then.

I'm uncomfortable with this whole concept, and particularly with such
a short timeout.  On a very busy system, things can take a LOT longer
than we think they should; it can take 30 seconds or more just to get
a prompt back from a shell command.  5 seconds is the blink of an eye.

More generally, what do we think the point is of sending SIGQUIT
rather than SIGKILL in the first place, and why does that point cease
to be valid after 5 seconds?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-21 Thread Tom Lane
Robert Haas robertmh...@gmail.com writes:
 More generally, what do we think the point is of sending SIGQUIT
 rather than SIGKILL in the first place, and why does that point cease
 to be valid after 5 seconds?

Well, mostly it's about telling the client we're committing hara-kiri.
Without that, there's no very good reason to run quickdie() at all.

A practical issue with starting to send SIGKILL ourselves is that we
will no longer be able to reflexively diagnose server process died
on signal 9 as the linux OOM killer got you.  I'm not at all
convinced that the cases where SIGQUIT doesn't work are sufficiently
common to justify losing that property.

regards, tom lane




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-21 Thread Robert Haas
On Fri, Jun 21, 2013 at 2:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
 Robert Haas robertmh...@gmail.com writes:
 More generally, what do we think the point is of sending SIGQUIT
 rather than SIGKILL in the first place, and why does that point cease
 to be valid after 5 seconds?

 Well, mostly it's about telling the client we're committing hara-kiri.
 Without that, there's no very good reason to run quickdie() at all.

That's what I thought, too.  It seems to me that if we think that's
important, then it's important even if it takes more than 5 seconds
for some reason.

 A practical issue with starting to send SIGKILL ourselves is that we
 will no longer be able to reflexively diagnose server process died
 on signal 9 as the linux OOM killer got you.  I'm not at all
 convinced that the cases where SIGQUIT doesn't work are sufficiently
 common to justify losing that property.

I'm not, either.  Maybe this question will provoke many indignant
responses, but who in their right mind even uses immediate shutdown on
a production server with any regularity?  The shutdown checkpoint is
sometimes painfully long, but do you really want to run recovery just
to avoid it?  And in the rare case where an immediate shutdown fails
to work, what's wrong with killall -9 postgres?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-21 Thread Christopher Browne
The case where I wanted routine shutdown immediate (and I'm not sure I
ever actually got it) was when we were using IBM HA/CMP, where I wanted a
terminate with a fair bit of prejudice.

If we know we want to switch right away now, immediate seemed pretty much
right.  I was fine with interrupting user sessions, and there wasn't as
much going on in the way of system background stuff back then.

I wasn't keen on waiting on much of anything.  The background writer ought
to be keeping things from being too desperately out of date.

If there's stuff worth waiting a few seconds for, I'm all ears.

But if I have to wait arbitrarily long, colour me unhappy.

If I have to distinguish, myself, between a checkpoint nearly done flushing
and a backend that's stuck waiting forlornly for filesystem access, I'm
inclined to kill -9 and hope recovery doesn't take *too* long on the next
node...

If shutting a server down in an emergency situation requires a DBA to look
in, as opposed to init.d doing its thing, I think that's pretty much the
same problem too.


Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-21 Thread MauMau

From: Robert Haas robertmh...@gmail.com
On Thu, Jun 20, 2013 at 12:33 PM, Alvaro Herrera

alvhe...@2ndquadrant.com wrote:

I will go with 5 seconds, then.


I'm uncomfortable with this whole concept, and particularly with such
a short timeout.  On a very busy system, things can take a LOT longer
than we think they should; it can take 30 seconds or more just to get
a prompt back from a shell command.  5 seconds is the blink of an eye.


I'm comfortable with 5 seconds.  We are talking about the interval between 
sending SIGQUIT to the children and then sending SIGKILL to them.  In most 
situations, the backends should terminate immediately.  However, as I said a 
few months ago, the ereport() call in quickdie() can deadlock indefinitely. 
This is a PostgreSQL bug.  In addition, Tom-san was concerned that some 
PLs (PL/Perl or PL/Python?) block SIGQUIT while executing a UDF, so they 
may not be able to respond to the immediate shutdown request.


What DBAs want from pg_ctl stop -mi is to shut down the database server as 
quickly as possible.  So I think 5 seconds is reasonable.


Regards
MauMau





Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-21 Thread MauMau

From: Robert Haas robertmh...@gmail.com

On Fri, Jun 21, 2013 at 2:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:

Robert Haas robertmh...@gmail.com writes:

More generally, what do we think the point is of sending SIGQUIT
rather than SIGKILL in the first place, and why does that point cease
to be valid after 5 seconds?


Well, mostly it's about telling the client we're committing hara-kiri.
Without that, there's no very good reason to run quickdie() at all.


That's what I thought, too.  It seems to me that if we think that's
important, then it's important even if it takes more than 5 seconds
for some reason.


I guess Tom-san is saying we should be as kind as possible to the client, 
so try to notify the client of the shutdown.  Not complete kindness, 
because the DBA requested immediate shutdown by running pg_ctl stop -mi; 
the top priority is to shut down the database server as immediately as 
possible.  The idea here is to try to be friendly to the client for as long 
as the DBA can stand.  That's the 5 seconds.




A practical issue with starting to send SIGKILL ourselves is that we
will no longer be able to reflexively diagnose server process died
on signal 9 as the linux OOM killer got you.  I'm not at all
convinced that the cases where SIGQUIT doesn't work are sufficiently
common to justify losing that property.


I'm not, either.  Maybe this question will provoke many indignant
responses, but who in their right mind even uses immediate shutdown on
a production server with any regularity?  The shutdown checkpoint is
sometimes painfully long, but do you really want to run recovery just
to avoid it?  And in the rare case where an immediate shutdown fails
to work, what's wrong with killall -9 postgres?


Checkpoint is irrelevant here because we are discussing immediate shutdown. 
Some problems with killall -9 postgres are:


1. It's not available on Windows.
2. It may kill other database server instances running on the same machine.

Regards
MauMau






Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-20 Thread MauMau

First, thank you for the review.

From: Alvaro Herrera alvhe...@2ndquadrant.com

This seems reasonable.  Why 10 seconds?  We could wait 5 seconds, or 15.
Is there a rationale behind the 10?  If we said 60, that would fit
perfectly well within the already existing 60-second loop in postmaster,
but that seems way too long.


There is no good rationale.  I arbitrarily chose a short period because this 
is an immediate shutdown.  I felt more than 10 seconds was long; 5 seconds 
may be better.  Although not directly related to this fix, these influenced 
my choice:


1. According to the man page of init, init sends SIGKILL to all remaining 
processes 5 seconds after it sends SIGTERM to them.


2. At computer shutdown, Windows proceeds with a forced shutdown after 
waiting 20 seconds for services to terminate.




I have only one concern about this patch, which is visible in the
documentation proposed change:

  <para>
  This is the <firstterm>Immediate Shutdown</firstterm> mode.
  The master <command>postgres</command> process will send a
-  <systemitem>SIGQUIT</systemitem> to all child processes and exit
-  immediately, without properly shutting itself down. The child processes
-  likewise exit immediately upon receiving
-  <systemitem>SIGQUIT</systemitem>. This will lead to recovery (by
+  <systemitem>SIGQUIT</systemitem> to all child processes, wait for
+  them to terminate, and exit. The child processes
+  exit immediately upon receiving
+  <systemitem>SIGQUIT</systemitem>. If any of the child processes
+  does not terminate within 10 seconds for some unexpected reason,
+  the master postgres process will send a <systemitem>SIGKILL</systemitem>
+  to all remaining ones, wait for their termination
+  again, and exit. This will lead to recovery (by
  replaying the WAL log) upon next start-up. This is recommended
  only in emergencies.
  </para>

Note that the previous text said that postmaster will send SIGQUIT, then
terminate without checking anything.  In the new code, postmaster sends
SIGQUIT, then waits, then SIGKILL, then waits again.  If there's an
unkillable process (say because it's stuck in a noninterruptible sleep)
postmaster might never exit.  I think it should send SIGQUIT, then wait,
then SIGKILL, then exit without checking.


At first I thought the same, but decided not to do that.  The purpose of 
this patch is to make the immediate shutdown reliable.  Here, "reliable" 
means that the database server is certainly shut down when pg_ctl returns, 
not telling a lie that "I shut down the server processes for you," so you do 
not have to be worried that some postgres process might still remain and 
write to disk.  I suppose reliable shutdown is crucial especially in an HA 
cluster.  If "pg_ctl stop -mi" gets stuck forever when there is an unkillable 
process (in what situations does this happen? OS bug, or NFS hard mount?), I 
think the DBA has to notice this situation from the unfinished pg_ctl, 
investigate the cause, and take corrective action.  Anyway, in an HA cluster, 
the clusterware will terminate the node with STONITH, not leaving pg_ctl 
running forever.




I have tweaked the patch a bit and I'm about to commit as soon as we
resolve the above two items.


I appreciate your tweaking, especially in the documentation and source code 
comments, as English is not my mother tongue.


Regards
MauMau





Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-20 Thread Alvaro Herrera
MauMau escribió:
 First, thank you for the review.
 
 From: Alvaro Herrera alvhe...@2ndquadrant.com
 This seems reasonable.  Why 10 seconds?  We could wait 5 seconds, or 15.
 Is there a rationale behind the 10?  If we said 60, that would fit
 perfectly well within the already existing 60-second loop in postmaster,
 but that seems way too long.
 
 There is no good rationale.  I arbitrarily chose a short period
 because this is immediate shutdown.  I felt more than 10 second
 was long.  I think 5 second may be better.  Although not directly
 related to this fix, these influenced my choice:
 
 1. According to the man page of init, init sends SIGKILL to all
 remaining processes 5 seconds after it sends SIGTERM to them.
 
 2. At computer shutdown, Windows proceeds shutdown forcibly after
 waiting for services to terminate 20 seconds.

Well, as for the first of these, I don't think it matters whether
processes get their SIGKILL from postmaster or from init when the system
is shutting down or changing runlevel.  Now, one could argue that this sets a
precedent.  I think 20 seconds might be too long; perhaps that's expected
in an operating system that necessarily has a GUI.  I feel safe ignoring
that.

I will go with 5 seconds, then.


 Note that the previous text said that postmaster will send SIGQUIT, then
 terminate without checking anything.  In the new code, postmaster sends
 SIGQUIT, then waits, then SIGKILL, then waits again.  If there's an
 unkillable process (say because it's stuck in a noninterruptible sleep)
 postmaster might never exit.  I think it should send SIGQUIT, then wait,
 then SIGKILL, then exit without checking.
 
 At first I thought the same, but decided not to do that.  The
 purpose of this patch is to make the immediate shutdown reliable.

My point is that there is no difference.  For one thing, once we enter
immediate shutdown state, and sigkill has been sent, no further action
is taken.  Postmaster will just sit there indefinitely until processes
are gone.  If we were to make it repeat SIGKILL until they die, that
would be different.  However, repeating SIGKILL is pointless, because if
they didn't die when they first received it, they will still not die
when they receive it a second time.  Also, if they're in uninterruptible sleep
and don't die, then they will die as soon as they get out of that state;
no further queries will get processed, no further memory access will be
done.  So there's no harm in them remaining there until the underlying
storage returns to life, ISTM.

 Here, reliable means that the database server is certainly shut
 down when pg_ctl returns, not telling a lie that I shut down the
 server processes for you, so you do not have to be worried that some
 postgres process might still remain and write to disk.  I suppose
 reliable shutdown is crucial especially in HA cluster.  If pg_ctl
 stop -mi gets stuck forever when there is an unkillable process (in
 what situations does this happen? OS bug, or NFS hard mount?), I
 think the DBA has to notice this situation from the unfinished
 pg_ctl, investigate the cause, and take corrective action.

So you're suggesting that keeping postmaster up is a useful sign that
the shutdown is not going well?  I'm not really sure about this.  What
do others think?

 I have tweaked the patch a bit and I'm about to commit as soon as we
 resolve the above two items.
 
 I appreciate your tweaking, especially in the documentation and
 source code comments, as English is not my mother tongue.

IIRC the only other interesting tweak I did was rename the
SignalAllChildren() function to TerminateChildren().  I did this because
it doesn't really signal all children; syslogger and dead_end backends
are kept around.  So the original name was a bit misleading.  And we
couldn't really name it SignalAlmostAllChildren(), could we ..

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-20 Thread MauMau

From: Alvaro Herrera alvhe...@2ndquadrant.com

I will go with 5 seconds, then.


OK, I agree.



My point is that there is no difference.  For one thing, once we enter
immediate shutdown state, and sigkill has been sent, no further action
is taken.  Postmaster will just sit there indefinitely until processes
are gone.  If we were to make it repeat SIGKILL until they die, that
would be different.  However, repeating SIGKILL is pointless, because if
they didn't die when they first received it, they will still not die
when they receive it a second time.  Also, if they're in uninterruptible sleep
and don't die, then they will die as soon as they get out of that state;
no further queries will get processed, no further memory access will be
done.  So there's no harm in them remaining there until the underlying
storage returns to life, ISTM.


Here, reliable means that the database server is certainly shut
down when pg_ctl returns, not telling a lie that I shut down the
server processes for you, so you do not have to be worried that some
postgres process might still remain and write to disk.  I suppose
reliable shutdown is crucial especially in HA cluster.  If pg_ctl
stop -mi gets stuck forever when there is an unkillable process (in
what situations does this happen? OS bug, or NFS hard mount?), I
think the DBA has to notice this situation from the unfinished
pg_ctl, investigate the cause, and take corrective action.


So you're suggesting that keeping postmaster up is a useful sign that
the shutdown is not going well?  I'm not really sure about this.  What
do others think?


I think you are right, and there is no harm in leaving postgres processes in 
an unkillable state.  I'd like to leave the decision to you and/or others.


One concern is that umount would fail in such a situation because postgres 
has some open files on the filesystem, which is on the shared disk in the case 
of a traditional HA cluster.  However, STONITH should resolve the problem by 
terminating the stuck node...  I just feel it is strange for umount to fail 
due to remaining postgres processes, because "pg_ctl stop -mi" reported success.



IIRC the only other interesting tweak I did was rename the
SignalAllChildren() function to TerminateChildren().  I did this because
it doesn't really signal all children; syslogger and dead_end backends
are kept around.  So the original name was a bit misleading.  And we
couldn't really name it SignalAlmostAllChildren(), could we ..


I see.  Thank you.

Regards
MauMau





Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-20 Thread Alvaro Herrera
MauMau escribió:
 From: Alvaro Herrera alvhe...@2ndquadrant.com

 One concern is that umount would fail in such a situation because
 postgres has some open files on the filesystem, which is on the
 shared disk in case of traditional HA cluster.

See my reply to Noah.  If postmaster stays around, would this be any
different?  I don't think so.

 IIRC the only other interesting tweak I did was rename the
 SignalAllChildren() function to TerminateChildren().  I did this because
 it doesn't really signal all children; syslogger and dead_end backends
 are kept around.  So the original name was a bit misleading.  And we
 couldn't really name it SignalAlmostAllChildren(), could we ..
 
 I see.  thank you.

Actually, in further testing I noticed that the fast-path you introduced
in BackendCleanup (or was it HandleChildCrash?) in the immediate
shutdown case caused postmaster to fail to clean up properly after
sending the SIGKILL signal, so I had to remove that as well.  Was there
any reason for that?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-20 Thread Alvaro Herrera
Actually, I think it would be cleaner to have a new state in pmState,
namely PM_IMMED_SHUTDOWN which is entered when we send SIGQUIT.  When
we're in this state, postmaster is only waiting for the timeout to
expire; and when it does, it sends SIGKILL and exits.  Pretty much the
same you have, except that instead of checking AbortStartTime we check
the pmState variable.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-06-19 Thread Alvaro Herrera
MauMau escribió:

 Could you review the patch?  The summary of the change is:
 1. postmaster waits for children to terminate when it gets an
 immediate shutdown request, instead of exiting.
 
 2. postmaster sends SIGKILL to remaining children if all of the
 child processes do not terminate within 10 seconds since the start
 of immediate shutdown or FatalError condition.

This seems reasonable.  Why 10 seconds?  We could wait 5 seconds, or 15.
Is there a rationale behind the 10?  If we said 60, that would fit
perfectly well within the already existing 60-second loop in postmaster,
but that seems way too long.

I have only one concern about this patch, which is visible in the
documentation proposed change:

   <para>
   This is the <firstterm>Immediate Shutdown</firstterm> mode.
   The master <command>postgres</command> process will send a
-  <systemitem>SIGQUIT</systemitem> to all child processes and exit
-  immediately, without properly shutting itself down. The child processes
-  likewise exit immediately upon receiving
-  <systemitem>SIGQUIT</systemitem>. This will lead to recovery (by
+  <systemitem>SIGQUIT</systemitem> to all child processes, wait for
+  them to terminate, and exit. The child processes
+  exit immediately upon receiving
+  <systemitem>SIGQUIT</systemitem>. If any of the child processes
+  does not terminate within 10 seconds for some unexpected reason,
+  the master postgres process will send a <systemitem>SIGKILL</systemitem>
+  to all remaining ones, wait for their termination
+  again, and exit. This will lead to recovery (by
   replaying the WAL log) upon next start-up. This is recommended
   only in emergencies.
   </para>

Note that the previous text said that postmaster will send SIGQUIT, then
terminate without checking anything.  In the new code, postmaster sends
SIGQUIT, then waits, then SIGKILL, then waits again.  If there's an
unkillable process (say because it's stuck in a noninterruptible sleep)
postmaster might never exit.  I think it should send SIGQUIT, then wait,
then SIGKILL, then exit without checking.

I have tweaked the patch a bit and I'm about to commit as soon as we
resolve the above two items.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-02-07 Thread MauMau

Hello, Tom-san, folks,

From: Tom Lane t...@sss.pgh.pa.us

I think if we want to make it bulletproof we'd have to do what the
OP suggested and switch to SIGKILL.  I'm not enamored of that for the
reasons I mentioned --- but one idea that might dodge the disadvantages
is to have the postmaster wait a few seconds and then SIGKILL any
backends that hadn't exited.

I think if we want something that's actually trustworthy, the
wait-and-then-kill logic has to be in the postmaster.



I'm sorry to be late in submitting a patch.  I made a fix according to the above 
comment.  Unfortunately, it was not in time for the 9.1.8 release...  I hope 
this patch will be included in 9.1.9.


Could you review the patch?  The summary of the change is:
1. postmaster waits for children to terminate when it gets an immediate 
shutdown request, instead of exiting.


2. postmaster sends SIGKILL to remaining children if all of the child 
processes do not terminate within 10 seconds since the start of immediate 
shutdown or FatalError condition.


3. On Windows, kill(SIGKILL) calls TerminateProcess().

4. Documentation change to describe the behavior of immediate shutdown.


Regards
MauMau


reliable_immediate_shutdown.patch
Description: Binary data



Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-02-01 Thread Peter Eisentraut
On 1/31/13 5:42 PM, MauMau wrote:
 Thank you for sharing your experience.  So you also considered making
 postmaster SIGKILL children like me, didn't you?  I bet most of people
 who encounter this problem would feel like that.
 
 It is definitely pg_ctl who needs to be prepared, not the users.  It may
 not be easy to find out postgres processes to SIGKILL if multiple
 instances are running on the same host.  Just doing pkill postgres
 will unexpectedly terminate postgres of other instances.

In my case, it was one backend process segfaulting, and then some other
backend processes didn't respond to the subsequent SIGQUIT sent out by
the postmaster.  So pg_ctl didn't have any part in it.

We ended up addressing that by installing a nagios event handler that
checked for this situation and cleaned it up.

 I would like to make a patch that changes SIGQUIT to SIGKILL when
 postmaster terminates children.  Any other better ideas?

That was my idea back then, but there were some concerns about it.

I found an old patch that I had prepared for this, which I have
attached.  YMMV.
From bebb95abe7a55173cab0558da3373d6a3631465b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut pet...@postgresql.org
Date: Wed, 16 Dec 2009 17:19:14 +0200
Subject: [PATCH 3/3] Time out the ereport() call in quickdie() after 60 seconds

---
 src/backend/tcop/postgres.c |   25 +
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b2fb501..ab6805a 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -191,6 +191,7 @@ static bool IsTransactionExitStmtList(List *parseTrees);
 static bool IsTransactionStmtList(List *parseTrees);
 static void drop_unnamed_stmt(void);
 static void SigHupHandler(SIGNAL_ARGS);
+static void quickdie_alarm_handler(SIGNAL_ARGS);
 static void log_disconnections(int code, Datum arg);
 
 
@@ -2539,9 +2540,17 @@ void
 quickdie(SIGNAL_ARGS)
 {
	sigaddset(&BlockSig, SIGQUIT); /* prevent nested calls */
+	sigdelset(&BlockSig, SIGALRM);
	PG_SETMASK(&BlockSig);
 
/*
+* Set up a timeout in case the ereport() call below blocks for a
+* long time.
+*/
+   pqsignal(SIGALRM, quickdie_alarm_handler);
+   alarm(60);
+
+   /*
 * If we're aborting out of client auth, don't risk trying to send
 * anything to the client; we will likely violate the protocol,
 * not to mention that we may have interrupted the guts of OpenSSL
@@ -2586,6 +2595,22 @@ quickdie(SIGNAL_ARGS)
 }
 
 /*
+ * Take over quickdie()'s work if the alarm expired.
+ */
+static void
+quickdie_alarm_handler(SIGNAL_ARGS)
+{
+   /*
+* We got here if ereport() was blocking, so don't go there again
+* except when really asked for.
+*/
+   elog(DEBUG5, "quickdie aborted by alarm");
+
+   on_exit_reset();
+   exit(2);
+}
+
+/*
  * Shutdown signal from postmaster: abort transaction and exit
  * at soonest convenient time
  */
-- 
1.6.5




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-02-01 Thread Andres Freund
On 2013-02-01 08:55:24 -0500, Peter Eisentraut wrote:
 On 1/31/13 5:42 PM, MauMau wrote:
  Thank you for sharing your experience.  So you also considered making
  postmaster SIGKILL children like me, didn't you?  I bet most of people
  who encounter this problem would feel like that.
  
  It is definitely pg_ctl who needs to be prepared, not the users.  It may
  not be easy to find out postgres processes to SIGKILL if multiple
  instances are running on the same host.  Just doing pkill postgres
  will unexpectedly terminate postgres of other instances.
 
 In my case, it was one backend process segfaulting, and then some other
 backend processes didn't respond to the subsequent SIGQUIT sent out by
 the postmaster.  So pg_ctl didn't have any part in it.
 
 We ended up addressing that by installing a nagios event handler that
 checked for this situation and cleaned it up.
 
  I would like to make a patch that changes SIGQUIT to SIGKILL when
  postmaster terminates children.  Any other better ideas?
 
 That was my idea back then, but there were some concerns about it.
 
 I found an old patch that I had prepared for this, which I have
 attached.  YMMV.

 +static void
 +quickdie_alarm_handler(SIGNAL_ARGS)
 +{
 + /*
 +  * We got here if ereport() was blocking, so don't go there again
 +  * except when really asked for.
 +  */
 + elog(DEBUG5, "quickdie aborted by alarm");
 +

It's probably not wise to enter elog.c again; that path might allocate
memory and we wouldn't be any wiser.  Unfortunately there's not much
besides a write(2) to stderr that can safely be done...

Greetings,

Andres Freund

-- 
 Andres Freund                 http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-02-01 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-02-01 08:55:24 -0500, Peter Eisentraut wrote:
 I found an old patch that I had prepared for this, which I have
 attached.  YMMV.

 +static void
 +quickdie_alarm_handler(SIGNAL_ARGS)
 +{
 +/*
 + * We got here if ereport() was blocking, so don't go there again
 + * except when really asked for.
 + */
 +elog(DEBUG5, "quickdie aborted by alarm");
 +

 Its probably not wise to enter elog.c again, that path might allocate
 memory and we wouldn't be any wiser. Unfortunately there's not much
 besides a write(2) to stderr that can safely be done...

Another objection to this is it still doesn't fix the case where the
process has blocked SIGQUIT.  My faith that libperl, libR, etc will
never do that is nil.

I think if we want something that's actually trustworthy, the
wait-and-then-kill logic has to be in the postmaster.

regards, tom lane




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-01-31 Thread Peter Eisentraut
On 1/30/13 9:11 AM, MauMau wrote:
 When I ran pg_ctl stop -mi against the primary, some applications
 connected to the primary did not stop.  The cause was that the backends
 was deadlocked in quickdie() with some call stack like the following. 
 I'm sorry to have left the stack trace file on the testing machine, so
 I'll show you the precise stack trace tomorrow.

I've had similar problems in the past:

http://www.postgresql.org/message-id/1253704891.20834.8.ca...@fsopti579.f-secure.com

The discussion there never quite concluded.  But yes, you need to be
prepared that in rare circumstances SIGQUIT won't succeed and you need
to use SIGKILL.





Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-01-31 Thread MauMau

From: Peter Eisentraut pete...@gmx.net

On 1/30/13 9:11 AM, MauMau wrote:

When I ran pg_ctl stop -mi against the primary, some applications
connected to the primary did not stop.  The cause was that the backends
was deadlocked in quickdie() with some call stack like the following.
I'm sorry to have left the stack trace file on the testing machine, so
I'll show you the precise stack trace tomorrow.


I've had similar problems in the past:

http://www.postgresql.org/message-id/1253704891.20834.8.ca...@fsopti579.f-secure.com

The discussion there never quite concluded.  But yes, you need to be
prepared that in rare circumstances SIGQUIT won't succeed and you need
to use SIGKILL.


Thank you for sharing your experience.  So you also considered making 
postmaster SIGKILL children, like me, didn't you?  I bet most people who 
encounter this problem would feel like that.


It is definitely pg_ctl that needs to be prepared, not the users.  It may not 
be easy to find the postgres processes to SIGKILL if multiple instances are 
running on the same host.  Just doing "pkill postgres" will unexpectedly 
terminate the postgres processes of other instances.


I would like to make a patch that changes SIGQUIT to SIGKILL when 
postmaster terminates children.  Any other better ideas?


Regards
MauMau





Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-01-31 Thread Kevin Grittner
MauMau maumau...@gmail.com wrote:

 Just doing pkill postgres will unexpectedly terminate postgres
 of other instances.

Not if you run each instance under a different OS user, and execute
pkill with the right user.  (Never use root for that!)  This is
just one of the reasons that you should not run multiple clusters
on the same machine with the same OS user.

-Kevin




backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-01-30 Thread MauMau

From: Tom Lane t...@sss.pgh.pa.us

Since we've fixed a couple of relatively nasty bugs recently, the core
committee has determined that it'd be a good idea to push out PG update
releases soon.  The current plan is to wrap on Monday Feb 4 for public
announcement Thursday Feb 7.  If you're aware of any bug fixes you think
ought to get included, now's the time to get them done ...


I've just encountered another serious bug, which I wish to be fixed in the 
upcoming minor release.


I'm using streaming replication with PostgreSQL 9.1.6 on Linux (RHEL6.2, 
kernel 2.6.32).  But this problem should happen regardless of the use of 
streaming replication.


When I ran "pg_ctl stop -mi" against the primary, some applications 
connected to the primary did not stop.  The cause was that the backends were 
deadlocked in quickdie() with a call stack like the following.  I'm sorry 
to have left the stack trace file on the testing machine, so I'll show you 
the precise stack trace tomorrow.


some lock function
malloc()
gettext()
errhint()
quickdie()
signal handler called because of SIGQUIT
free()
...
PostgresMain()
...

The root cause is that gettext() is called in the signal handler quickdie() 
via errhint().  As you know, malloc() cannot be called in a signal handler:


http://www.gnu.org/software/libc/manual/html_node/Nonreentrancy.html#Nonreentrancy

[Excerpt]
On most systems, malloc and free are not reentrant, because they use a 
static data structure which records what memory blocks are free. As a 
result, no library functions that allocate or free memory are reentrant. 
This includes functions that allocate space to store a result.



And gettext() calls malloc(), as reported below:

http://lists.gnu.org/archive/html/bug-coreutils/2005-04/msg00056.html

I think the solution is the typical one.  That is, to just record the 
receipt of SIGQUIT by setting a global variable, call siglongjmp() in 
quickdie(), and perform the tasks currently done in quickdie() when sigsetjmp() 
returns in PostgresMain().


What do think about the solution?  Could you include the fix?  If it's okay 
and you want, I'll submit the patch.


Regards
MauMau





Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-01-30 Thread Tom Lane
MauMau maumau...@gmail.com writes:
 When I ran pg_ctl stop -mi against the primary, some applications 
 connected to the primary did not stop. ...
 The root cause is that gettext() is called in the signal handler quickdie() 
 via errhint().

Yeah, it's a known hazard that quickdie() operates like that.

 I think the solution is the typical one.  That is, to just remember the 
 receipt of SIGQUIT by setting a global variable and call siglongjmp() in 
 quickdie(), and perform tasks currently done in quickdie() when sigsetjmp() 
 returns in PostgresMain().

I think this cure is considerably worse than the disease.  As stated,
it's not a fix at all: longjmp'ing out of a signal handler is no better
defined than what happens now, in fact it's probably even less safe.
We could just set a flag and wait for the mainline code to notice,
but that would make SIGQUIT hardly any stronger than SIGTERM --- in
particular it couldn't get you out of any loop that wasn't checking for
interrupts.

The long and the short of it is that SIGQUIT is the emergency-stop panic
button.  You don't use it for routine shutdowns --- you use it when
there is a damn good reason to and you're prepared to do some manual
cleanup if necessary.

http://en.wikipedia.org/wiki/Big_Red_Switch

 What do think about the solution?  Could you include the fix?

Even if we had an arguably-better solution, I'd be disinclined to
risk cramming it into stable branches on such short notice.

What might make sense on short notice is to strengthen the
documentation's cautions against using SIGQUIT unnecessarily.

regards, tom lane




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-01-30 Thread Andres Freund
On 2013-01-30 10:23:09 -0500, Tom Lane wrote:
 MauMau maumau...@gmail.com writes:
  When I ran pg_ctl stop -mi against the primary, some applications 
  connected to the primary did not stop. ...
  The root cause is that gettext() is called in the signal handler quickdie() 
  via errhint().
 
 Yeah, it's a known hazard that quickdie() operates like that.

What about not translating those? The messages are static and all memory
needed by postgres should be pre-allocated.

Greetings,

Andres Freund

-- 
 Andres Freund                 http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-01-30 Thread Tom Lane
Andres Freund and...@2ndquadrant.com writes:
 On 2013-01-30 10:23:09 -0500, Tom Lane wrote:
 Yeah, it's a known hazard that quickdie() operates like that.

 What about not translating those? The messages are static and all memory
 needed by postgres should be pre-allocated.

That would reduce our exposure slightly, but hardly to zero.  For
instance, if SIGQUIT happened in the midst of handling a regular error,
ErrorContext might be pretty full already, necessitating further malloc
requests.  I thought myself about suggesting that quickdie do something
to disable gettext(), but it doesn't seem like that would make it enough
safer to justify the loss of user-friendliness for non English speakers.

I think the conflict between "we don't want SIGQUIT to interrupt this"
and "we do want SIGQUIT to interrupt that" is pretty fundamental, and
there's probably not any bulletproof solution (or at least none that
would have reasonable development/maintenance cost).  If we had more
confidence that there were no major loops lacking CHECK_FOR_INTERRUPTS
calls, maybe the set-a-flag approach would be acceptable ... but I
sure don't have such confidence.

regards, tom lane




Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-01-30 Thread MauMau

From: Tom Lane t...@sss.pgh.pa.us

MauMau maumau...@gmail.com writes:

I think the solution is the typical one.  That is, to just remember the
receipt of SIGQUIT by setting a global variable and call siglongjmp() in
quickdie(), and perform tasks currently done in quickdie() when
sigsetjmp() returns in PostgresMain().


I think this cure is considerably worse than the disease.  As stated,
it's not a fix at all: longjmp'ing out of a signal handler is no better
defined than what happens now, in fact it's probably even less safe.
We could just set a flag and wait for the mainline code to notice,
but that would make SIGQUIT hardly any stronger than SIGTERM --- in
particular it couldn't get you out of any loop that wasn't checking for
interrupts.


Oh, I was careless.  You are right, my suggestion is not a fix at all 
because free() would continue to hold some lock after siglongjmp(), which 
malloc() tries to acquire.



The long and the short of it is that SIGQUIT is the emergency-stop panic
button.  You don't use it for routine shutdowns --- you use it when
there is a damn good reason to and you're prepared to do some manual
cleanup if necessary.

http://en.wikipedia.org/wiki/Big_Red_Switch


How about the case where some backend crashes due to a bug in PostgreSQL? 
In this case, postmaster sends SIGQUIT to all backends, too.  The instance 
is expected to disappear cleanly and quickly.  Doesn't a hanging backend 
interfere with the restart of the instance?


How about using SIGKILL instead of SIGQUIT?  The purpose of SIGQUIT is to 
shut down the processes quickly, and SIGKILL is the strongest signal for 
that purpose.  The WARNING message would not be sent to clients, but that 
does not justify failing to shut down immediately.


Regards
MauMau






Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)

2013-01-30 Thread Tom Lane
MauMau maumau...@gmail.com writes:
 From: Tom Lane t...@sss.pgh.pa.us
 The long and the short of it is that SIGQUIT is the emergency-stop panic
 button.  You don't use it for routine shutdowns --- you use it when
 there is a damn good reason to and you're prepared to do some manual
 cleanup if necessary.

 How about the case where some backend crashes due to a bug in PostgreSQL? 
 In this case, postmaster sends SIGQUIT to all backends, too.  The instance 
 is expected to disappear cleanly and quickly.  Doesn't a hanging backend 
 interfere with the restart of the instance?

[ shrug... ]  That isn't guaranteed, and never has been --- for
instance, the process might have SIGQUIT blocked, perhaps as a result
of third-party code we have no control over.

 How about using SIGKILL instead of SIGQUIT?

Because then we couldn't notify clients at all.  One practical
disadvantage of that is that it would become quite hard to tell from
the outside which client session actually crashed, which is frequently
useful to know.

This isn't an area that admits of quick-fix solutions --- everything
we might do has disadvantages.  Also, the lack of complaints to date
shows that the problem is not so large as to justify panic responses.
I'm not really inclined to mess around with a tradeoff that's been
working pretty well for a dozen years or more.

regards, tom lane

