Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
MauMau escribió:
> Hi, I did this. Please find attached the revised patch. I modified
> HandleChildCrash(). I tested the immediate shutdown, and the child
> cleanup succeeded.

Thanks, committed. There are two matters pending here:

1. do we want postmaster to exit immediately after sending the SIGKILL,
   or hang around until all children are gone?
2. how far do we want to backpatch this, if at all?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
On Fri, Jun 28, 2013 at 6:00 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
> MauMau escribió:
>> Hi, I did this. Please find attached the revised patch. I modified
>> HandleChildCrash(). I tested the immediate shutdown, and the child
>> cleanup succeeded.
>
> Thanks, committed. There are two matters pending here:
> 1. do we want postmaster to exit immediately after sending the SIGKILL,
>    or hang around until all children are gone?
> 2. how far do we want to backpatch this, if at all?

Considering that neither Tom nor I was convinced that this was a good idea AT ALL, I'm surprised you committed it in the first place. I'd be more inclined to revert it than back-patch it, but at least if we only change it in HEAD we have some chance of finding out whether or not it is in fact a bad idea before it's too late to change our mind.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
Hi, Alvaro san,

From: Alvaro Herrera alvhe...@2ndquadrant.com
> MauMau escribió:
> Yeah, I see that --- after removing that early exit, there are unwanted
> messages. And in fact there are some signals sent that weren't
> previously sent. Clearly we need something here: if we're in immediate
> shutdown handler, don't signal anyone (because they have already been
> signalled) and don't log any more messages; but the cleaning up of
> postmaster's process list must still be carried out. Would you please
> add that on top of the attached cleaned up version of your patch?

I did this. Please find attached the revised patch. I modified HandleChildCrash(). I tested the immediate shutdown, and the child cleanup succeeded.

In addition, I added an if condition at the end of the function. This is to prevent resetting AbortStartTime every time one child terminates.

Regards
MauMau

[Attachment: reliable_immediate_shutdown-3.patch]
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: Alvaro Herrera alvhe...@2ndquadrant.com
> Yeah, I see that --- after removing that early exit, there are unwanted
> messages. And in fact there are some signals sent that weren't
> previously sent. Clearly we need something here: if we're in immediate
> shutdown handler, don't signal anyone (because they have already been
> signalled) and don't log any more messages; but the cleaning up of
> postmaster's process list must still be carried out. Would you please
> add that on top of the attached cleaned up version of your patch?

Thanks. I'll do that tomorrow at the earliest.

> Noah Misch escribió:
>> On Sun, Jun 23, 2013 at 01:55:19PM +0900, MauMau wrote:
>>> the clients at the immediate shutdown. The code gets much simpler. In
>>> addition, it may be better that we similarly send SIGKILL in backend
>>> crash (FatalError) case, eliminate the use of SIGQUIT and remove
>>> quickdie() and other SIGQUIT handlers.
>>
>> My take is that the client message has some value, and we shouldn't
>> just discard it to simplify the code slightly. Finishing the shutdown
>> quickly has value, of course. The relative importance of those things
>> should guide the choice of a timeout under method #2, but I don't see
>> a rigorous way to draw the line. I feel five seconds is, if anything,
>> too high.
>
> There's obviously a lot of disagreement on 5 seconds being too high or
> too low.
>
>> We have just followed SysV init's precedent of waiting 5 seconds by
>> default between sending signals TERM and QUIT during a shutdown. I
>> will note that during a normal shutdown, services are entitled to do
>> much more work than just signal all their children to exit
>> immediately; and yet I don't find much evidence that this period is
>> inordinately short. I don't feel strongly that it couldn't be shorter,
>> though.
>>
>> What about using deadlock_timeout? It constitutes a rough barometer of
>> the system's tolerance for failure detection latency, and we already
>> overload it by having it guide log_lock_waits. The default of 1s makes
>> sense to me for this purpose, too. We can always add a separate
>> immediate_shutdown_timeout if there's demand, but I don't expect such
>> demand. (If we did add such a GUC, setting it to zero could be the way
>> to request method 1. If some folks strongly desire method 1, that's
>> probably the way to offer it.)
>
> I dunno. Having this be configurable seems overkill to me. But perhaps
> that's the way to satisfy most people: some people could set it very
> high so that they could have postmaster wait longer if they believe
> their server is going to be overloaded; people wishing immediate
> SIGKILL could get that too, as you describe. I think this should be a
> separate patch, however.

I think so, too. We can add a parameter later if we find it highly necessary after some experience in the field.

Regards
MauMau
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
MauMau escribió:
> From: Alvaro Herrera alvhe...@2ndquadrant.com
>> Actually, in further testing I noticed that the fast-path you
>> introduced in BackendCleanup (or was it HandleChildCrash?) in the
>> immediate shutdown case caused postmaster to fail to clean up properly
>> after sending the SIGKILL signal, so I had to remove that as well. Was
>> there any reason for that?
>
> You are talking about the code below at the beginning of
> HandleChildCrash(), aren't you?
>
>     /* Do nothing if the child terminated due to immediate shutdown */
>     if (Shutdown == ImmediateShutdown)
>         return;
>
> If my memory is correct, this was necessary; without this,
> HandleChildCrash() or LogChildExit() left extra messages in the server
> log.
>
>     LOG:  %s (PID %d) exited with exit code %d
>     LOG:  terminating any other active server processes
>
> These messages are output because postmaster sends SIGQUIT to its
> children and waits for them to terminate. The children exit with
> non-zero status when they receive SIGQUIT.

Yeah, I see that --- after removing that early exit, there are unwanted messages. And in fact there are some signals sent that weren't previously sent. Clearly we need something here: if we're in immediate shutdown handler, don't signal anyone (because they have already been signalled) and don't log any more messages; but the cleaning up of postmaster's process list must still be carried out. Would you please add that on top of the attached cleaned up version of your patch?

Noah Misch escribió:
> On Sun, Jun 23, 2013 at 01:55:19PM +0900, MauMau wrote:
>> the clients at the immediate shutdown. The code gets much simpler. In
>> addition, it may be better that we similarly send SIGKILL in backend
>> crash (FatalError) case, eliminate the use of SIGQUIT and remove
>> quickdie() and other SIGQUIT handlers.
>
> My take is that the client message has some value, and we shouldn't
> just discard it to simplify the code slightly. Finishing the shutdown
> quickly has value, of course. The relative importance of those things
> should guide the choice of a timeout under method #2, but I don't see a
> rigorous way to draw the line. I feel five seconds is, if anything, too
> high.

There's obviously a lot of disagreement on 5 seconds being too high or too low.

> We have just followed SysV init's precedent of waiting 5 seconds by
> default between sending signals TERM and QUIT during a shutdown. I will
> note that during a normal shutdown, services are entitled to do much
> more work than just signal all their children to exit immediately; and
> yet I don't find much evidence that this period is inordinately short.
> I don't feel strongly that it couldn't be shorter, though.
>
> What about using deadlock_timeout? It constitutes a rough barometer of
> the system's tolerance for failure detection latency, and we already
> overload it by having it guide log_lock_waits. The default of 1s makes
> sense to me for this purpose, too. We can always add a separate
> immediate_shutdown_timeout if there's demand, but I don't expect such
> demand. (If we did add such a GUC, setting it to zero could be the way
> to request method 1. If some folks strongly desire method 1, that's
> probably the way to offer it.)

I dunno. Having this be configurable seems overkill to me. But perhaps that's the way to satisfy most people: some people could set it very high so that they could have postmaster wait longer if they believe their server is going to be overloaded; people wishing immediate SIGKILL could get that too, as you describe. I think this should be a separate patch, however.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services

*** a/doc/src/sgml/runtime.sgml
--- b/doc/src/sgml/runtime.sgml
***************
*** 1362,1372 ****
      <listitem>
       <para>
        This is the <firstterm>Immediate Shutdown</firstterm> mode.
!       The master <command>postgres</command> process will send a
!       <systemitem>SIGQUIT</systemitem> to all child processes and exit
!       immediately, without properly shutting itself down.  The child processes
!       likewise exit immediately upon receiving
!       <systemitem>SIGQUIT</systemitem>.  This will lead to recovery (by
        replaying the WAL log) upon next start-up. This is recommended
        only in emergencies.
       </para>
--- 1362,1372 ----
      <listitem>
       <para>
        This is the <firstterm>Immediate Shutdown</firstterm> mode.
!       The server will send <systemitem>SIGQUIT</systemitem> to all child
!       processes and wait for them to terminate.  Those that don't terminate
!       within 5 seconds, will be sent <systemitem>SIGKILL</systemitem> by the
!       master <command>postgres</command> process, which will then terminate
!       without further waiting.  This will lead to recovery (by
        replaying the WAL log) upon next start-up. This is recommended
        only in emergencies.
       </para>
***
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
MauMau escribió:
> Are you suggesting simplifying the following part in ServerLoop()? I
> welcome the idea if this condition becomes simpler. However, I cannot
> imagine how.
>
>     if (AbortStartTime > 0 &&      /* SIGKILL only once */
>         (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
>         now - AbortStartTime >= 10)
>     {
>         SignalAllChildren(SIGKILL);
>         AbortStartTime = 0;
>     }

Yes, that's what I'm suggesting. I haven't tried coding it yet.

> I thought of adding some new state of pmState for some reason (that
> might be the same as your idea). But I refrained from doing that,
> because pmState already has many states. I was afraid adding a new
> pmState value for this bug fix would complicate the state management
> (e.g. state transition in PostmasterStateMachine()). In addition, I
> felt PM_WAIT_BACKENDS was appropriate because postmaster is actually
> waiting for backends to terminate after sending SIGQUIT. The state name
> is natural.

Well, a natural state name is of no use if we're using it to indicate two different states, which I think is what would be happening here. Basically, with your patch, PM_WAIT_BACKENDS means two different things depending on AbortStartTime.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
On Fri, Jun 21, 2013 at 10:02 PM, MauMau maumau...@gmail.com wrote:
> I'm comfortable with 5 seconds. We are talking about the interval
> between sending SIGQUIT to the children and then sending SIGKILL to
> them. In most situations, the backends should terminate immediately.
> However, as I said a few months ago, the ereport() call in quickdie()
> can deadlock indefinitely. This is a PostgreSQL bug.

So let's fix that bug. Then we don't need this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: Alvaro Herrera alvhe...@2ndquadrant.com
> MauMau escribió:
>> I thought of adding some new state of pmState for some reason (that
>> might be the same as your idea). But I refrained from doing that,
>> because pmState already has many states. I was afraid adding a new
>> pmState value for this bug fix would complicate the state management
>> (e.g. state transition in PostmasterStateMachine()). In addition, I
>> felt PM_WAIT_BACKENDS was appropriate because postmaster is actually
>> waiting for backends to terminate after sending SIGQUIT. The state
>> name is natural.
>
> Well, a natural state name is of no use if we're using it to indicate
> two different states, which I think is what would be happening here.
> Basically, with your patch, PM_WAIT_BACKENDS means two different things
> depending on AbortStartTime.

I probably got your feeling. You aren't feeling comfortable about using the time variable AbortStartTime as a state variable to change postmaster's behavior, are you? That may be right, but I'm not sure... because in PM_WAIT_BACKENDS, as the name indicates, postmaster is indeed waiting for the backends to terminate regardless of AbortStartTime. Apart from this, postmaster seems to change its behavior in the same pmState depending on other variables such as Shutdown and FatalError.

I'm not confident in which is better, so I won't object if the reviewer or committer modifies the code.

Regards
MauMau
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: Robert Haas robertmh...@gmail.com
> On Fri, Jun 21, 2013 at 10:02 PM, MauMau maumau...@gmail.com wrote:
>> I'm comfortable with 5 seconds. We are talking about the interval
>> between sending SIGQUIT to the children and then sending SIGKILL to
>> them. In most situations, the backends should terminate immediately.
>> However, as I said a few months ago, the ereport() call in quickdie()
>> can deadlock indefinitely. This is a PostgreSQL bug.
>
> So let's fix that bug. Then we don't need this.

There are two ways to fix that bug. You are suggesting one of the following two methods, aren't you?

1. (Robert-san's idea) Upon receipt of an immediate shutdown request, postmaster sends SIGKILL to its children.

2. (Tom-san's idea) Upon receipt of an immediate shutdown request, postmaster first sends SIGQUIT to its children, waits a while for them to terminate, then sends SIGKILL to them.

At first I proposed 1. Then Tom-san suggested 2 so that the server is as friendly to the clients as now by notifying them of the server shutdown. I was convinced by that idea and chose 2.

Basically, I think both ideas are right. They can solve the original problem. However, re-considering the meaning of immediate shutdown, I feel 1 is a bit better, because trying to do something in quickdie() or somewhere does not match the idea of "immediate". We may not have to be friendly to the clients at the immediate shutdown. The code gets much simpler. In addition, it may be better that we similarly send SIGKILL in the backend crash (FatalError) case, eliminate the use of SIGQUIT, and remove quickdie() and other SIGQUIT handlers.

What do you think? How should we reach consensus and proceed?

Regards
MauMau
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
On Thu, Jun 20, 2013 at 3:40 PM, MauMau maumau...@gmail.com wrote:
>>> Here, reliable means that the database server is certainly shut down
>>> when pg_ctl returns, not telling a lie that "I shut down the server
>>> processes for you, so you do not have to be worried that some
>>> postgres process might still remain and write to disk." I suppose
>>> reliable shutdown is crucial especially in HA cluster.
>>>
>>> If pg_ctl stop -mi gets stuck forever when there is an unkillable
>>> process (in what situations does this happen? OS bug, or NFS hard
>>> mount?), I think the DBA has to notice this situation from the
>>> unfinished pg_ctl, investigate the cause, and take corrective action.
>>
>> So you're suggesting that keeping postmaster up is a useful sign that
>> the shutdown is not going well? I'm not really sure about this. What
>> do others think?
>
> I think you are right, and there is no harm in leaving postgres
> processes in an unkillable state. I'd like to leave the decision to you
> and/or others.

+1 for leaving processes, not waiting for long (or possibly forever; remember not all people are running postgres on such cluster ware). I'm sure some users rely on the current behavior of immediate shutdown. If someone wants to ensure processes are finished when pg_ctl returns, isn't that fast shutdown, not immediate shutdown? To me the current immediate shutdown is reliable, in that it without doubt returns control back to the terminal, after killing postmaster at least.

Thanks,
-- 
Hitoshi Harada
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: Alvaro Herrera alvhe...@2ndquadrant.com
> MauMau escribió:
>> One concern is that umount would fail in such a situation because
>> postgres has some open files on the filesystem, which is on the shared
>> disk in case of traditional HA cluster.
>
> See my reply to Noah. If postmaster stays around, would this be any
> different?

I don't think so. I'll consider this a bit and respond as a reply to Andres-san's mail.

> Actually, in further testing I noticed that the fast-path you
> introduced in BackendCleanup (or was it HandleChildCrash?) in the
> immediate shutdown case caused postmaster to fail to clean up properly
> after sending the SIGKILL signal, so I had to remove that as well. Was
> there any reason for that?

You are talking about the code below at the beginning of HandleChildCrash(), aren't you?

    /* Do nothing if the child terminated due to immediate shutdown */
    if (Shutdown == ImmediateShutdown)
        return;

If my memory is correct, this was necessary; without this, HandleChildCrash() or LogChildExit() left extra messages in the server log.

    LOG:  %s (PID %d) exited with exit code %d
    LOG:  terminating any other active server processes

These messages are output because postmaster sends SIGQUIT to its children and waits for them to terminate. The children exit with non-zero status when they receive SIGQUIT.

Regards
MauMau
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: Alvaro Herrera alvhe...@2ndquadrant.com
> Actually, I think it would be cleaner to have a new state in pmState,
> namely PM_IMMED_SHUTDOWN, which is entered when we send SIGQUIT. When
> we're in this state, postmaster is only waiting for the timeout to
> expire; and when it does, it sends SIGKILL and exits. Pretty much the
> same you have, except that instead of checking AbortStartTime we check
> the pmState variable.

Are you suggesting simplifying the following part in ServerLoop()? I welcome the idea if this condition becomes simpler. However, I cannot imagine how.

    if (AbortStartTime > 0 &&      /* SIGKILL only once */
        (Shutdown == ImmediateShutdown || (FatalError && !SendStop)) &&
        now - AbortStartTime >= 10)
    {
        SignalAllChildren(SIGKILL);
        AbortStartTime = 0;
    }

I thought of adding some new state of pmState for some reason (that might be the same as your idea). But I refrained from doing that, because pmState already has many states. I was afraid adding a new pmState value for this bug fix would complicate the state management (e.g. state transition in PostmasterStateMachine()). In addition, I felt PM_WAIT_BACKENDS was appropriate because postmaster is actually waiting for backends to terminate after sending SIGQUIT. The state name is natural.

I don't have strong objection to your idea if it makes the code cleaner and more understandable. Thank you very much.

Regards
MauMau
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
On Thu, Jun 20, 2013 at 12:33 PM, Alvaro Herrera alvhe...@2ndquadrant.com wrote:
> I will go with 5 seconds, then.

I'm uncomfortable with this whole concept, and particularly with such a short timeout. On a very busy system, things can take a LOT longer than we think they should; it can take 30 seconds or more just to get a prompt back from a shell command. 5 seconds is the blink of an eye.

More generally, what do we think the point is of sending SIGQUIT rather than SIGKILL in the first place, and why does that point cease to be valid after 5 seconds?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
Robert Haas robertmh...@gmail.com writes:
> More generally, what do we think the point is of sending SIGQUIT rather
> than SIGKILL in the first place, and why does that point cease to be
> valid after 5 seconds?

Well, mostly it's about telling the client we're committing hara-kiri. Without that, there's no very good reason to run quickdie() at all.

A practical issue with starting to send SIGKILL ourselves is that we will no longer be able to reflexively diagnose "server process died on signal 9" as "the linux OOM killer got you". I'm not at all convinced that the cases where SIGQUIT doesn't work are sufficiently common to justify losing that property.

			regards, tom lane
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
On Fri, Jun 21, 2013 at 2:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
> Robert Haas robertmh...@gmail.com writes:
>> More generally, what do we think the point is of sending SIGQUIT
>> rather than SIGKILL in the first place, and why does that point cease
>> to be valid after 5 seconds?
>
> Well, mostly it's about telling the client we're committing hara-kiri.
> Without that, there's no very good reason to run quickdie() at all.

That's what I thought, too. It seems to me that if we think that's important, then it's important even if it takes more than 5 seconds for some reason.

> A practical issue with starting to send SIGKILL ourselves is that we
> will no longer be able to reflexively diagnose "server process died on
> signal 9" as "the linux OOM killer got you". I'm not at all convinced
> that the cases where SIGQUIT doesn't work are sufficiently common to
> justify losing that property.

I'm not, either. Maybe this question will provoke many indignant responses, but who in their right mind even uses immediate shutdown on a production server with any regularity? The shutdown checkpoint is sometimes painfully long, but do you really want to run recovery just to avoid it? And in the rare case where an immediate shutdown fails to work, what's wrong with killall -9 postgres?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
The case where I wanted routine shutdown immediate (and I'm not sure I ever actually got it) was when we were using IBM HA/CMP, where I wanted a terminate with a fair bit of prejudice. If we know we want to switch right away now, immediate seemed pretty much right. I was fine with interrupting user sessions, and there wasn't as much going on in the way of system background stuff back then.

I wasn't keen on waiting on much of anything. The background writer ought to be keeping things from being too desperately out of date. If there's stuff worth waiting a few seconds for, I'm all ears. But if I have to wait arbitrarily long, colour me unhappy. If I have to distinguish, myself, between a checkpoint nearly done flushing and a backend that's stuck waiting forlornly for filesystem access, I'm inclined to kill -9 and hope recovery doesn't take *too* long on the next node...

If shutting a server down in an emergency situation requires a DBA to look in, as opposed to init.d doing its thing, I think that's pretty much the same problem too.
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: Robert Haas robertmh...@gmail.com
> On Thu, Jun 20, 2013 at 12:33 PM, Alvaro Herrera
> alvhe...@2ndquadrant.com wrote:
>> I will go with 5 seconds, then.
>
> I'm uncomfortable with this whole concept, and particularly with such a
> short timeout. On a very busy system, things can take a LOT longer than
> we think they should; it can take 30 seconds or more just to get a
> prompt back from a shell command. 5 seconds is the blink of an eye.

I'm comfortable with 5 seconds. We are talking about the interval between sending SIGQUIT to the children and then sending SIGKILL to them. In most situations, the backends should terminate immediately. However, as I said a few months ago, the ereport() call in quickdie() can deadlock indefinitely. This is a PostgreSQL bug. In addition, Tom-san was concerned that some PLs (PL/Perl or PL/Python?) block SIGQUIT while executing a UDF, so they may not be able to respond to the immediate shutdown request.

What DBAs want from pg_ctl stop -mi is to shut down the database server as immediately as possible. So I think 5 seconds is reasonable.

Regards
MauMau
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: Robert Haas robertmh...@gmail.com
> On Fri, Jun 21, 2013 at 2:55 PM, Tom Lane t...@sss.pgh.pa.us wrote:
>> Robert Haas robertmh...@gmail.com writes:
>>> More generally, what do we think the point is of sending SIGQUIT
>>> rather than SIGKILL in the first place, and why does that point cease
>>> to be valid after 5 seconds?
>>
>> Well, mostly it's about telling the client we're committing hara-kiri.
>> Without that, there's no very good reason to run quickdie() at all.
>
> That's what I thought, too. It seems to me that if we think that's
> important, then it's important even if it takes more than 5 seconds for
> some reason.

I guess Tom-san is saying we should be as kind as possible to the client, so try to notify the client of the shutdown. Not complete kindness, though. Because the DBA requested immediate shutdown by running pg_ctl stop -mi, the top priority is to shut down the database server as immediately as possible. The idea here is to try to be friendly to the client as long as the DBA can stand. That's the 5 seconds.

>> A practical issue with starting to send SIGKILL ourselves is that we
>> will no longer be able to reflexively diagnose "server process died on
>> signal 9" as "the linux OOM killer got you". I'm not at all convinced
>> that the cases where SIGQUIT doesn't work are sufficiently common to
>> justify losing that property.
>
> I'm not, either. Maybe this question will provoke many indignant
> responses, but who in their right mind even uses immediate shutdown on
> a production server with any regularity? The shutdown checkpoint is
> sometimes painfully long, but do you really want to run recovery just
> to avoid it? And in the rare case where an immediate shutdown fails to
> work, what's wrong with killall -9 postgres?

Checkpoint is irrelevant here because we are discussing immediate shutdown. Some problems with killall -9 postgres are:

1. It's not available on Windows.
2. It may kill other database server instances running on the same machine.

Regards
MauMau
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
First, thank you for the review.

From: Alvaro Herrera alvhe...@2ndquadrant.com
> This seems reasonable. Why 10 seconds? We could wait 5 seconds, or 15.
> Is there a rationale behind the 10? If we said 60, that would fit
> perfectly well within the already existing 60-second loop in
> postmaster, but that seems way too long.

There is no good rationale. I arbitrarily chose a short period because this is immediate shutdown. I felt more than 10 seconds was long. I think 5 seconds may be better. Although not directly related to this fix, these influenced my choice:

1. According to the man page of init, init sends SIGKILL to all remaining processes 5 seconds after it sends SIGTERM to them.
2. At computer shutdown, Windows proceeds with shutdown forcibly after waiting 20 seconds for services to terminate.

> I have only one concern about this patch, which is visible in the
> documentation proposed change:
>
>      <para>
>       This is the <firstterm>Immediate Shutdown</firstterm> mode.
>       The master <command>postgres</command> process will send a
> -     <systemitem>SIGQUIT</systemitem> to all child processes and exit
> -     immediately, without properly shutting itself down.  The child processes
> -     likewise exit immediately upon receiving
> -     <systemitem>SIGQUIT</systemitem>.  This will lead to recovery (by
> +     <systemitem>SIGQUIT</systemitem> to all child processes, wait for
> +     them to terminate, and exit.  The child processes
> +     exit immediately upon receiving
> +     <systemitem>SIGQUIT</systemitem>.  If any of the child processes
> +     does not terminate within 10 seconds for some unexpected reason,
> +     the master postgres process will send a <systemitem>SIGKILL</systemitem>
> +     to all remaining ones, wait for their termination
> +     again, and exit.  This will lead to recovery (by
>       replaying the WAL log) upon next start-up. This is recommended
>       only in emergencies.
>      </para>
>
> Note that the previous text said that postmaster will send SIGQUIT,
> then terminate without checking anything. In the new code, postmaster
> sends SIGQUIT, then waits, then SIGKILL, then waits again. If there's
> an unkillable process (say because it's stuck in a noninterruptible
> sleep) postmaster might never exit. I think it should send SIGQUIT,
> then wait, then SIGKILL, then exit without checking.

At first I thought the same, but decided not to do that. The purpose of this patch is to make the immediate shutdown reliable. Here, reliable means that the database server is certainly shut down when pg_ctl returns, not telling a lie that "I shut down the server processes for you, so you do not have to be worried that some postgres process might still remain and write to disk." I suppose reliable shutdown is crucial especially in HA cluster.

If pg_ctl stop -mi gets stuck forever when there is an unkillable process (in what situations does this happen? OS bug, or NFS hard mount?), I think the DBA has to notice this situation from the unfinished pg_ctl, investigate the cause, and take corrective action. Anyway, in HA cluster, the clusterware will terminate the node with STONITH, not leaving pg_ctl running forever.

> I have tweaked the patch a bit and I'm about to commit as soon as we
> resolve the above two items.

I appreciate your tweaking, especially in the documentation and source code comments, as English is not my mother tongue.

Regards
MauMau
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
MauMau escribió:
First, thank you for the review. From: Alvaro Herrera alvhe...@2ndquadrant.com This seems reasonable. Why 10 seconds? We could wait 5 seconds, or 15. Is there a rationale behind the 10? If we said 60, that would fit perfectly well within the already existing 60-second loop in postmaster, but that seems way too long. There is no good rationale. I arbitrarily chose a short period because this is an immediate shutdown. I felt that more than 10 seconds was too long; 5 seconds may be better. Although not directly related to this fix, these influenced my choice: 1. According to the man page of init, init sends SIGKILL to all remaining processes 5 seconds after it sends SIGTERM to them. 2. At computer shutdown, Windows proceeds with forced shutdown after waiting 20 seconds for services to terminate.

Well, as for the first of these, I don't think it matters whether processes get their SIGKILL from postmaster or from init when a shutdown or runlevel change is taking place. Now, one would think that this sets a precedent. I think 20 seconds might be too long; perhaps it's expected in an operating system that forcibly has a GUI. I feel safe ignoring that. I will go with 5 seconds, then.

Note that the previous text said that postmaster will send SIGQUIT, then terminate without checking anything. In the new code, postmaster sends SIGQUIT, then waits, then SIGKILL, then waits again. If there's an unkillable process (say because it's stuck in a noninterruptible sleep) postmaster might never exit. I think it should send SIGQUIT, then wait, then SIGKILL, then exit without checking. At first I thought the same, but decided not to do that. The purpose of this patch is to make the immediate shutdown reliable.

My point is that there is no difference. For one thing, once we enter immediate shutdown state and SIGKILL has been sent, no further action is taken. Postmaster will just sit there indefinitely until the processes are gone.
If we were to make it repeat SIGKILL until they die, that would be different. However, repeating SIGKILL is pointless, because if they didn't die when they first received it, they will still not die when they receive it a second time. Also, if they're in uninterruptible sleep and don't die, then they will die as soon as they get out of that state; no further queries will get processed, no further memory access will be done. So there's no harm in their remaining there until the underlying storage returns to life, ISTM.

Here, "reliable" means that the database server is certainly shut down when pg_ctl returns, not telling a lie that "I shut down the server processes for you", so you do not have to worry that some postgres process might still remain and write to disk. I suppose reliable shutdown is crucial especially in an HA cluster. If pg_ctl stop -mi gets stuck forever when there is an unkillable process (in what situations does this happen? An OS bug, or an NFS hard mount?), I think the DBA has to notice this situation from the unfinished pg_ctl, investigate the cause, and take corrective action.

So you're suggesting that keeping postmaster up is a useful sign that the shutdown is not going well? I'm not really sure about this. What do others think?

I have tweaked the patch a bit and I'm about to commit as soon as we resolve the above two items. I appreciate your tweaking, especially in the documentation and source code comments, as English is not my mother tongue.

IIRC the only other interesting tweak I did was rename the SignalAllChildren() function to TerminateChildren(). I did this because it doesn't really signal all children; syslogger and dead_end backends are kept around. So the original name was a bit misleading. And we couldn't really name it SignalAlmostAllChildren(), could we ..
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: Alvaro Herrera alvhe...@2ndquadrant.com
I will go with 5 seconds, then.

OK, I agree.

My point is that there is no difference. For one thing, once we enter immediate shutdown state and SIGKILL has been sent, no further action is taken. Postmaster will just sit there indefinitely until the processes are gone. If we were to make it repeat SIGKILL until they die, that would be different. However, repeating SIGKILL is pointless, because if they didn't die when they first received it, they will still not die when they receive it a second time. Also, if they're in uninterruptible sleep and don't die, then they will die as soon as they get out of that state; no further queries will get processed, no further memory access will be done. So there's no harm in their remaining there until the underlying storage returns to life, ISTM.

Here, "reliable" means that the database server is certainly shut down when pg_ctl returns, not telling a lie that "I shut down the server processes for you", so you do not have to worry that some postgres process might still remain and write to disk. I suppose reliable shutdown is crucial especially in an HA cluster. If pg_ctl stop -mi gets stuck forever when there is an unkillable process (in what situations does this happen? An OS bug, or an NFS hard mount?), I think the DBA has to notice this situation from the unfinished pg_ctl, investigate the cause, and take corrective action.

So you're suggesting that keeping postmaster up is a useful sign that the shutdown is not going well? I'm not really sure about this. What do others think?

I think you are right, and there is no harm in leaving postgres processes in an unkillable state. I'd like to leave the decision to you and/or others. One concern is that umount would fail in such a situation because postgres has some open files on the filesystem, which is on the shared disk in the case of a traditional HA cluster. However, STONITH should resolve the problem by terminating the stuck node...
I just feel it is strange for umount to fail due to remaining postgres processes when pg_ctl stop -mi reported success.

IIRC the only other interesting tweak I did was rename the SignalAllChildren() function to TerminateChildren(). I did this because it doesn't really signal all children; syslogger and dead_end backends are kept around. So the original name was a bit misleading. And we couldn't really name it SignalAlmostAllChildren(), could we ..

I see. Thank you.

Regards
MauMau
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
MauMau escribió: From: Alvaro Herrera alvhe...@2ndquadrant.com One concern is that umount would fail in such a situation because postgres has some open files on the filesystem, which is on the shared disk in case of traditional HA cluster. See my reply to Noah. If postmaster stays around, would this be any different? I don't think so. IIRC the only other interesting tweak I did was rename the SignalAllChildren() function to TerminateChildren(). I did this because it doesn't really signal all children; syslogger and dead_end backends are kept around. So the original name was a bit misleading. And we couldn't really name it SignalAlmostAllChildren(), could we .. I see. thank you. Actually, in further testing I noticed that the fast-path you introduced in BackendCleanup (or was it HandleChildCrash?) in the immediate shutdown case caused postmaster to fail to clean up properly after sending the SIGKILL signal, so I had to remove that as well. Was there any reason for that? -- Álvaro Herrerahttp://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
Actually, I think it would be cleaner to have a new state in pmState, namely PM_IMMED_SHUTDOWN, which is entered when we send SIGQUIT. When we're in this state, postmaster is only waiting for the timeout to expire; and when it does, it sends SIGKILL and exits. Pretty much the same as you have, except that instead of checking AbortStartTime we check the pmState variable.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
MauMau escribió:
Could you review the patch? The summary of the change is:
1. postmaster waits for children to terminate when it gets an immediate shutdown request, instead of exiting.
2. postmaster sends SIGKILL to remaining children if the child processes do not all terminate within 10 seconds of the start of immediate shutdown or the FatalError condition.

This seems reasonable. Why 10 seconds? We could wait 5 seconds, or 15. Is there a rationale behind the 10? If we said 60, that would fit perfectly well within the already existing 60-second loop in postmaster, but that seems way too long.

I have only one concern about this patch, which is visible in the proposed documentation change:

     <para>
      This is the <firstterm>Immediate Shutdown</firstterm> mode.
      The master <command>postgres</command> process will send a
-     <systemitem>SIGQUIT</systemitem> to all child processes and exit
-     immediately, without properly shutting itself down. The child processes
-     likewise exit immediately upon receiving
-     <systemitem>SIGQUIT</systemitem>. This will lead to recovery (by
+     <systemitem>SIGQUIT</systemitem> to all child processes, wait for
+     them to terminate, and exit. The child processes
+     exit immediately upon receiving
+     <systemitem>SIGQUIT</systemitem>. If any of the child processes
+     does not terminate within 10 seconds for some unexpected reason,
+     the master postgres process will send a <systemitem>SIGKILL</systemitem>
+     to all remaining ones, wait for their termination
+     again, and exit. This will lead to recovery (by
      replaying the WAL log) upon next start-up. This is recommended
      only in emergencies.
     </para>

Note that the previous text said that postmaster will send SIGQUIT, then terminate without checking anything. In the new code, postmaster sends SIGQUIT, then waits, then SIGKILL, then waits again. If there's an unkillable process (say because it's stuck in a noninterruptible sleep) postmaster might never exit. I think it should send SIGQUIT, then wait, then SIGKILL, then exit without checking.
I have tweaked the patch a bit and I'm about to commit as soon as we resolve the above two items.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
Hello, Tom-san, folks,

From: Tom Lane t...@sss.pgh.pa.us
I think if we want to make it bulletproof we'd have to do what the OP suggested and switch to SIGKILL. I'm not enamored of that for the reasons I mentioned --- but one idea that might dodge the disadvantages is to have the postmaster wait a few seconds and then SIGKILL any backends that hadn't exited. I think if we want something that's actually trustworthy, the wait-and-then-kill logic has to be in the postmaster.

I'm sorry to be late in submitting a patch. I made a fix according to the above comment. Unfortunately, it was not in time for the 9.1.8 release... I hope this patch will be included in 9.1.9.

Could you review the patch? The summary of the change is:
1. postmaster waits for children to terminate when it gets an immediate shutdown request, instead of exiting.
2. postmaster sends SIGKILL to remaining children if the child processes do not all terminate within 10 seconds of the start of immediate shutdown or the FatalError condition.
3. On Windows, kill(SIGKILL) calls TerminateProcess().
4. Documentation change to describe the behavior of immediate shutdown.

Regards
MauMau

reliable_immediate_shutdown.patch
Description: Binary data
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
On 1/31/13 5:42 PM, MauMau wrote:
Thank you for sharing your experience. So you also considered making postmaster SIGKILL children, like me, didn't you? I bet most people who encounter this problem would feel like that. It is definitely pg_ctl that needs to be prepared, not the users. It may not be easy to find out which postgres processes to SIGKILL if multiple instances are running on the same host. Just doing "pkill postgres" will unexpectedly terminate postgres of other instances.

In my case, it was one backend process segfaulting, and then some other backend processes didn't respond to the subsequent SIGQUIT sent out by the postmaster. So pg_ctl didn't have any part in it. We ended up addressing that by installing a nagios event handler that checked for this situation and cleaned it up.

I would like to make a patch that changes SIGQUIT to SIGKILL when postmaster terminates children. Any other better ideas?

That was my idea back then, but there were some concerns about it. I found an old patch that I had prepared for this, which I have attached. YMMV.
From bebb95abe7a55173cab0558da3373d6a3631465b Mon Sep 17 00:00:00 2001
From: Peter Eisentraut pet...@postgresql.org
Date: Wed, 16 Dec 2009 17:19:14 +0200
Subject: [PATCH 3/3] Time out the ereport() call in quickdie() after 60 seconds

---
 src/backend/tcop/postgres.c | 25 +
 1 files changed, 25 insertions(+), 0 deletions(-)

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index b2fb501..ab6805a 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -191,6 +191,7 @@ static bool IsTransactionExitStmtList(List *parseTrees);
 static bool IsTransactionStmtList(List *parseTrees);
 static void drop_unnamed_stmt(void);
 static void SigHupHandler(SIGNAL_ARGS);
+static void quickdie_alarm_handler(SIGNAL_ARGS);

 static void log_disconnections(int code, Datum arg);
@@ -2539,9 +2540,17 @@ void
 quickdie(SIGNAL_ARGS)
 {
 	sigaddset(&BlockSig, SIGQUIT);	/* prevent nested calls */
+	sigdelset(&BlockSig, SIGALRM);
 	PG_SETMASK(&BlockSig);

 	/*
+	 * Set up a timeout in case the ereport() call below blocks for a
+	 * long time.
+	 */
+	pqsignal(SIGALRM, quickdie_alarm_handler);
+	alarm(60);
+
+	/*
 	 * If we're aborting out of client auth, don't risk trying to send
 	 * anything to the client; we will likely violate the protocol,
 	 * not to mention that we may have interrupted the guts of OpenSSL
@@ -2586,6 +2595,22 @@
 }

 /*
+ * Take over quickdie()'s work if the alarm expired.
+ */
+static void
+quickdie_alarm_handler(SIGNAL_ARGS)
+{
+	/*
+	 * We got here if ereport() was blocking, so don't go there again
+	 * except when really asked for.
+	 */
+	elog(DEBUG5, "quickdie aborted by alarm");
+
+	on_exit_reset();
+	exit(2);
+}
+
+/*
  * Shutdown signal from postmaster: abort transaction and exit
  * at soonest convenient time
  */
--
1.6.5
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
On 2013-02-01 08:55:24 -0500, Peter Eisentraut wrote:
On 1/31/13 5:42 PM, MauMau wrote:
Thank you for sharing your experience. So you also considered making postmaster SIGKILL children, like me, didn't you? I bet most people who encounter this problem would feel like that. It is definitely pg_ctl that needs to be prepared, not the users. It may not be easy to find out which postgres processes to SIGKILL if multiple instances are running on the same host. Just doing "pkill postgres" will unexpectedly terminate postgres of other instances.

In my case, it was one backend process segfaulting, and then some other backend processes didn't respond to the subsequent SIGQUIT sent out by the postmaster. So pg_ctl didn't have any part in it. We ended up addressing that by installing a nagios event handler that checked for this situation and cleaned it up.

I would like to make a patch that changes SIGQUIT to SIGKILL when postmaster terminates children. Any other better ideas?

That was my idea back then, but there were some concerns about it. I found an old patch that I had prepared for this, which I have attached. YMMV.

+static void
+quickdie_alarm_handler(SIGNAL_ARGS)
+{
+	/*
+	 * We got here if ereport() was blocking, so don't go there again
+	 * except when really asked for.
+	 */
+	elog(DEBUG5, "quickdie aborted by alarm");
+

It's probably not wise to enter elog.c again; that path might allocate memory and we wouldn't be any wiser. Unfortunately there's not much besides a write(2) to stderr that can safely be done...

Greetings,

Andres Freund

--
Andres Freund	                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training Services
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
Andres Freund and...@2ndquadrant.com writes:
On 2013-02-01 08:55:24 -0500, Peter Eisentraut wrote:
I found an old patch that I had prepared for this, which I have attached. YMMV.

+static void
+quickdie_alarm_handler(SIGNAL_ARGS)
+{
+	/*
+	 * We got here if ereport() was blocking, so don't go there again
+	 * except when really asked for.
+	 */
+	elog(DEBUG5, "quickdie aborted by alarm");
+

It's probably not wise to enter elog.c again; that path might allocate memory and we wouldn't be any wiser. Unfortunately there's not much besides a write(2) to stderr that can safely be done...

Another objection to this is that it still doesn't fix the case where the process has blocked SIGQUIT. My faith that libperl, libR, etc. will never do that is nil. I think if we want something that's actually trustworthy, the wait-and-then-kill logic has to be in the postmaster.

regards, tom lane
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
On 1/30/13 9:11 AM, MauMau wrote: When I ran pg_ctl stop -mi against the primary, some applications connected to the primary did not stop. The cause was that the backends was deadlocked in quickdie() with some call stack like the following. I'm sorry to have left the stack trace file on the testing machine, so I'll show you the precise stack trace tomorrow. I've had similar problems in the past: http://www.postgresql.org/message-id/1253704891.20834.8.ca...@fsopti579.f-secure.com The discussion there never quite concluded. But yes, you need to be prepared that in rare circumstances SIGQUIT won't succeed and you need to use SIGKILL. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: Peter Eisentraut pete...@gmx.net
On 1/30/13 9:11 AM, MauMau wrote:
When I ran pg_ctl stop -mi against the primary, some applications connected to the primary did not stop. The cause was that the backends were deadlocked in quickdie() with a call stack like the following. I'm sorry to have left the stack trace file on the testing machine, so I'll show you the precise stack trace tomorrow.

I've had similar problems in the past: http://www.postgresql.org/message-id/1253704891.20834.8.ca...@fsopti579.f-secure.com
The discussion there never quite concluded. But yes, you need to be prepared that in rare circumstances SIGQUIT won't succeed and you need to use SIGKILL.

Thank you for sharing your experience. So you also considered making postmaster SIGKILL children, like me, didn't you? I bet most people who encounter this problem would feel like that. It is definitely pg_ctl that needs to be prepared, not the users. It may not be easy to find out which postgres processes to SIGKILL if multiple instances are running on the same host. Just doing "pkill postgres" will unexpectedly terminate postgres of other instances.

I would like to make a patch that changes SIGQUIT to SIGKILL when postmaster terminates children. Any other better ideas?

Regards
MauMau
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
MauMau maumau...@gmail.com wrote: Just doing pkill postgres will unexpectedly terminate postgres of other instances. Not if you run each instance under a different OS user, and execute pkill with the right user. (Never use root for that!) This is just one of the reasons that you should not run multiple clusters on the same machine with the same OS user. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: Tom Lane t...@sss.pgh.pa.us
Since we've fixed a couple of relatively nasty bugs recently, the core committee has determined that it'd be a good idea to push out PG update releases soon. The current plan is to wrap on Monday Feb 4 for public announcement Thursday Feb 7. If you're aware of any bug fixes you think ought to get included, now's the time to get them done ...

I've just encountered another serious bug, which I wish to be fixed in the upcoming minor release. I'm using streaming replication with PostgreSQL 9.1.6 on Linux (RHEL6.2, kernel 2.6.32), but this problem should happen regardless of the use of streaming replication.

When I ran pg_ctl stop -mi against the primary, some applications connected to the primary did not stop. The cause was that the backends were deadlocked in quickdie() with a call stack like the following. I'm sorry to have left the stack trace file on the testing machine, so I'll show you the precise stack trace tomorrow.

some lock function
malloc()
gettext()
errhint()
quickdie()
signal handler called because of SIGQUIT
free()
...
PostgresMain()
...

The root cause is that gettext() is called in the signal handler quickdie() via errhint(). As you know, malloc() cannot be called in a signal handler:
http://www.gnu.org/software/libc/manual/html_node/Nonreentrancy.html#Nonreentrancy
[Excerpt] On most systems, malloc and free are not reentrant, because they use a static data structure which records what memory blocks are free. As a result, no library functions that allocate or free memory are reentrant. This includes functions that allocate space to store a result.

And gettext() calls malloc(), as reported below:
http://lists.gnu.org/archive/html/bug-coreutils/2005-04/msg00056.html

I think the solution is the typical one: just remember the receipt of SIGQUIT by setting a global variable and calling siglongjmp() in quickdie(), and perform the tasks currently done in quickdie() when sigsetjmp() returns in PostgresMain().
What do you think about the solution? Could you include the fix? If it's okay and you want, I'll submit the patch.

Regards
MauMau
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
MauMau maumau...@gmail.com writes:
When I ran pg_ctl stop -mi against the primary, some applications connected to the primary did not stop. ... The root cause is that gettext() is called in the signal handler quickdie() via errhint().

Yeah, it's a known hazard that quickdie() operates like that.

I think the solution is the typical one. That is, to just remember the receipt of SIGQUIT by setting a global variable and call siglongjmp() in quickdie(), and perform tasks currently done in quickdie() when sigsetjmp() returns in PostgresMain().

I think this cure is considerably worse than the disease. As stated, it's not a fix at all: longjmp'ing out of a signal handler is no better defined than what happens now; in fact it's probably even less safe. We could just set a flag and wait for the mainline code to notice, but that would make SIGQUIT hardly any stronger than SIGTERM --- in particular it couldn't get you out of any loop that wasn't checking for interrupts.

The long and the short of it is that SIGQUIT is the emergency-stop panic button. You don't use it for routine shutdowns --- you use it when there is a damn good reason to and you're prepared to do some manual cleanup if necessary.
http://en.wikipedia.org/wiki/Big_Red_Switch

What do you think about the solution? Could you include the fix?

Even if we had an arguably-better solution, I'd be disinclined to risk cramming it into stable branches on such short notice. What might make sense on short notice is to strengthen the documentation's cautions against using SIGQUIT unnecessarily.

regards, tom lane
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
On 2013-01-30 10:23:09 -0500, Tom Lane wrote: MauMau maumau...@gmail.com writes: When I ran pg_ctl stop -mi against the primary, some applications connected to the primary did not stop. ... The root cause is that gettext() is called in the signal handler quickdie() via errhint(). Yeah, it's a known hazard that quickdie() operates like that. What about not translating those? The messages are static and all memory needed by postgres should be pre-allocated. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
Andres Freund and...@2ndquadrant.com writes: On 2013-01-30 10:23:09 -0500, Tom Lane wrote: Yeah, it's a known hazard that quickdie() operates like that. What about not translating those? The messages are static and all memory needed by postgres should be pre-allocated. That would reduce our exposure slightly, but hardly to zero. For instance, if SIGQUIT happened in the midst of handling a regular error, ErrorContext might be pretty full already, necessitating further malloc requests. I thought myself about suggesting that quickdie do something to disable gettext(), but it doesn't seem like that would make it enough safer to justify the loss of user-friendliness for non English speakers. I think the conflict between we don't want SIGQUIT to interrupt this and we do want SIGQUIT to interrupt that is pretty fundamental, and there's probably not any bulletproof solution (or at least none that would have reasonable development/maintenance cost). If we had more confidence that there were no major loops lacking CHECK_FOR_INTERRUPTS calls, maybe the set-a-flag approach would be acceptable ... but I sure don't have such confidence. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
From: Tom Lane t...@sss.pgh.pa.us
MauMau maumau...@gmail.com writes:
I think the solution is the typical one. That is, to just remember the receipt of SIGQUIT by setting a global variable and call siglongjmp() in quickdie(), and perform tasks currently done in quickdie() when sigsetjmp() returns in PostgresMain().

I think this cure is considerably worse than the disease. As stated, it's not a fix at all: longjmp'ing out of a signal handler is no better defined than what happens now; in fact it's probably even less safe. We could just set a flag and wait for the mainline code to notice, but that would make SIGQUIT hardly any stronger than SIGTERM --- in particular it couldn't get you out of any loop that wasn't checking for interrupts.

Oh, I was careless. You are right: my suggestion is not a fix at all, because free() would continue to hold some lock after siglongjmp(), which malloc() then tries to acquire.

The long and the short of it is that SIGQUIT is the emergency-stop panic button. You don't use it for routine shutdowns --- you use it when there is a damn good reason to and you're prepared to do some manual cleanup if necessary.
http://en.wikipedia.org/wiki/Big_Red_Switch

How about the case where some backend crashes due to a bug in PostgreSQL? In this case, postmaster sends SIGQUIT to all backends, too. The instance is expected to disappear cleanly and quickly. Doesn't the hanging backend harm the restart of the instance?

How about using SIGKILL instead of SIGQUIT? The purpose of SIGQUIT is to shut down the processes quickly, and SIGKILL is the best signal for that purpose. The WARNING message would not be sent to clients, but that does not justify the inability to shut down immediately.

Regards
MauMau
Re: backend hangs at immediate shutdown (Re: [HACKERS] Back-branch update releases coming in a couple weeks)
MauMau maumau...@gmail.com writes: From: Tom Lane t...@sss.pgh.pa.us The long and the short of it is that SIGQUIT is the emergency-stop panic button. You don't use it for routine shutdowns --- you use it when there is a damn good reason to and you're prepared to do some manual cleanup if necessary. How about the case where some backend crashes due to a bug of PostgreSQL? In this case, postmaster sends SIGQUIT to all backends, too. The instance is expected to disappear cleanly and quickly. Doesn't the hanging backend harm the restart of the instance? [ shrug... ] That isn't guaranteed, and never has been --- for instance, the process might have SIGQUIT blocked, perhaps as a result of third-party code we have no control over. How about using SIGKILL instead of SIGQUIT? Because then we couldn't notify clients at all. One practical disadvantage of that is that it would become quite hard to tell from the outside which client session actually crashed, which is frequently useful to know. This isn't an area that admits of quick-fix solutions --- everything we might do has disadvantages. Also, the lack of complaints to date shows that the problem is not so large as to justify panic responses. I'm not really inclined to mess around with a tradeoff that's been working pretty well for a dozen years or more. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers