Re: [HACKERS] max_standby_delay considered harmful
On Thu, May 13, 2010 at 1:12 PM, Josh Berkus j...@agliodbs.com wrote: On 5/12/10 8:07 PM, Robert Haas wrote: I think that would be a good thing to check (it'll confirm whether this is the same bug), but I'm not convinced we should actually fix it that way. Prior to 8.4, we handled a smart shutdown during recovery at the conclusion of recovery, just prior to entering normal running. I'm wondering if we shouldn't revert to that behavior in both 8.4 and HEAD. This would be OK as long as we document it well. We patched the shutdown the way we did specifically because Fujii thought it would be an easy fix; if it's complicated, we should revert it and document the issue for DBAs. I don't understand this comment. Oh, and to confirm: the same issue exists, and has always existed, with Warm Standby. That's what I was thinking, but I hadn't gotten around to testing it. Thanks for the confirmation. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
This would be OK as long as we document it well. We patched the shutdown the way we did specifically because Fujii thought it would be an easy fix; if it's complicated, we should revert it and document the issue for DBAs. I don't understand this comment. In other words, I'm saying that it's not critical that we troubleshoot this for 9.0. Reverting Fujii's patch, if it's not working, is an option. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] max_standby_delay considered harmful
On Fri, May 14, 2010 at 5:51 PM, Josh Berkus j...@agliodbs.com wrote: This would be OK as long as we document it well. We patched the shutdown the way we did specifically because Fujii thought it would be an easy fix; if it's complicated, we should revert it and document the issue for DBAs. I don't understand this comment. In other words, I'm saying that it's not critical that we troubleshoot this for 9.0. Reverting Fujii's patch, if it's not working, is an option. There is no patch which we could revert to fix this, by Fujii Masao or anyone else. The patch he proposed has not been committed. I am still studying the problem to try to figure out where to go with it. We could decide to punt the whole thing for 9.1, but I'd like to understand what the options are before we make that decision. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] max_standby_delay considered harmful
On 5/12/10 8:07 PM, Robert Haas wrote: I think that would be a good thing to check (it'll confirm whether this is the same bug), but I'm not convinced we should actually fix it that way. Prior to 8.4, we handled a smart shutdown during recovery at the conclusion of recovery, just prior to entering normal running. I'm wondering if we shouldn't revert to that behavior in both 8.4 and HEAD. This would be OK as long as we document it well. We patched the shutdown the way we did specifically because Fujii thought it would be an easy fix; if it's complicated, we should revert it and document the issue for DBAs. Oh, and to confirm: the same issue exists, and has always existed, with Warm Standby. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Tue, 2010-05-11 at 14:01 +0900, Fujii Masao wrote: On Mon, May 10, 2010 at 3:27 PM, Simon Riggs si...@2ndquadrant.com wrote: I already explained that killing the startup process first is a bad idea for many reasons when shutdown was discussed. Can't remember who added the new standby shutdown code recently, but it sounds like their design was pretty poor if it didn't include shutting down properly with HS. I hope they fix the bug they have introduced. HS was never designed to work that way, so there is no flaw there; it certainly worked when committed. New smart shutdown during recovery doesn't kill the startup process until all of the read only backends have gone away. So it works fine with HS. Yes, I thought some more about what Robert said. HS works identically to normal running in this regard, so there's no hint of a bug or design flaw on that for either of us to worry about. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Wed, May 12, 2010 at 2:50 AM, Simon Riggs si...@2ndquadrant.com wrote: On Tue, 2010-05-11 at 14:01 +0900, Fujii Masao wrote: On Mon, May 10, 2010 at 3:27 PM, Simon Riggs si...@2ndquadrant.com wrote: I already explained that killing the startup process first is a bad idea for many reasons when shutdown was discussed. Can't remember who added the new standby shutdown code recently, but it sounds like their design was pretty poor if it didn't include shutting down properly with HS. I hope they fix the bug they have introduced. HS was never designed to work that way, so there is no flaw there; it certainly worked when committed. New smart shutdown during recovery doesn't kill the startup process until all of the read only backends have gone away. So it works fine with HS. Yes, I thought some more about what Robert said. HS works identically to normal running in this regard, so there's no hint of a bug or design flaw on that for either of us to worry about. I'm not sure what to make of this. Sometimes not shutting down doesn't sound like a feature to me. http://archives.postgresql.org/pgsql-hackers/2010-05/msg00098.php http://archives.postgresql.org/pgsql-hackers/2010-05/msg00103.php -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote: I'm not sure what to make of this. Sometimes not shutting down doesn't sound like a feature to me. It acts exactly the same in recovery as in normal running. It is not a special feature of recovery at all, bug or otherwise. You may think it's a strange feature generally and I would agree. I would welcome you changing that in 9.1+, as long as your change works in both recovery and normal running. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] max_standby_delay considered harmful
On Wed, May 12, 2010 at 12:26 PM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote: I'm not sure what to make of this. Sometimes not shutting down doesn't sound like a feature to me. It acts exactly the same in recovery as in normal running. It is not a special feature of recovery at all, bug or otherwise. I admit I've sometimes been surprised that smart shutdown was waiting when I didn't expect it to. It would be good to give the shutdown more feedback. If it explicitly shows Waiting for n sessions with active transactions to commit or Waiting for n sessions to disconnect then the user would at least understand why it was waiting and what would be necessary to get it to continue. -- greg -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Wed, May 12, 2010 at 7:26 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote: I'm not sure what to make of this. Sometimes not shutting down doesn't sound like a feature to me. It acts exactly the same in recovery as in normal running. It is not a special feature of recovery at all, bug or otherwise. Simon, that doesn't make any sense. We are talking about a backend getting stuck forever on an exclusive lock that is held by the startup process and which will never be released (for example, because the master has shut down and no more WAL can be obtained for replay). The startup process does not hold locks in normal operation. There are other things we might want to change about the shutdown behavior (for example, switching from smart to fast automatically after N seconds) which could apply to both the primary and the standby and which might also be workarounds for this problem, but this particular issue is specific to Hot Standby mode and pretending otherwise is just sticking your head in the sand. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
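The smart-to-fast fallback mentioned above can be approximated from the operator's side. What follows is a minimal sketch under stated assumptions, not server behaviour and not anything proposed in this thread: the function name stop_with_fallback and the 60-second default are hypothetical, and the sketch relies only on pg_ctl's documented -D, -m, -t and -w options.

```shell
# Hypothetical helper: attempt a smart shutdown, then escalate to fast.
# Usage: stop_with_fallback DATADIR [TIMEOUT_SECONDS]
stop_with_fallback() {
    datadir="$1"
    timeout="${2:-60}"   # assumed default; pick what suits the installation

    # -m smart waits for sessions to disconnect; -w -t bounds how long we wait.
    if pg_ctl stop -D "$datadir" -m smart -w -t "$timeout"; then
        echo "smart shutdown completed"
        return 0
    fi

    # pg_ctl exits non-zero when the server has not stopped within the timeout.
    echo "smart shutdown did not finish within ${timeout}s; escalating to fast" >&2
    pg_ctl stop -D "$datadir" -m fast -w -t "$timeout"
}
```

A fast shutdown disconnects clients rather than waiting for them, so this only works around a session that never goes away; it would not by itself help if the postmaster were stuck for some other reason.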
Re: [HACKERS] max_standby_delay considered harmful
On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote: On Wed, May 12, 2010 at 7:26 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote: I'm not sure what to make of this. Sometimes not shutting down doesn't sound like a feature to me. It acts exactly the same in recovery as in normal running. It is not a special feature of recovery at all, bug or otherwise. Simon, that doesn't make any sense. We are talking about a backend getting stuck forever on an exclusive lock that is held by the startup process and which will never be released (for example, because the master has shut down and no more WAL can be obtained for replay). The startup process does not hold locks in normal operation. When I test it, startup process holding a lock does not prevent shutdown of a standby. I'd be happy to see your test case showing a bug exists and that the behaviour differs from normal running. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
Simon Riggs wrote: On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote: On Wed, May 12, 2010 at 7:26 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote: I'm not sure what to make of this. Sometimes not shutting down doesn't sound like a feature to me. It acts exactly the same in recovery as in normal running. It is not a special feature of recovery at all, bug or otherwise. Simon, that doesn't make any sense. We are talking about a backend getting stuck forever on an exclusive lock that is held by the startup process and which will never be released (for example, because the master has shut down and no more WAL can be obtained for replay). The startup process does not hold locks in normal operation. When I test it, startup process holding a lock does not prevent shutdown of a standby. I'd be happy to see your test case showing a bug exists and that the behaviour differs from normal running. In my testing, once in a while the postmaster simply does not shut down, even with no clients connected any more. Most of the time it works just fine, but in roughly 1 out of 10 cases it gets stuck. My test case (as detailed in the related thread) simply runs an interval load on the master (pgbench -T 120; sleep 30; pgbench -T 120; rinse and repeat as needed) and pgbench -S; pg_ctl restart; pgbench -S in a loop on the standby. Once in a while the standby will simply not shut down (forever, not merely exceeding the default timeout of pg_ctl, which seems to get triggered much more often on the standby than on the master; I have not looked into that in detail yet). Stefan
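The test case described above could be scripted roughly as follows. This is a sketch, not Stefan's actual script: the database name, the length of each pgbench -S run, and the number of rounds are not given in the message, so PGBENCH_DB, SELECT_SECS, and the rounds arguments are assumptions. It also assumes the pgbench tables were already created with pgbench -i on the master.

```shell
PGBENCH_DB="${PGBENCH_DB:-postgres}"   # assumed database name
SELECT_SECS="${SELECT_SECS:-30}"       # assumed length of each read-only run

# Master side: two minutes of load, 30 seconds idle, repeated.
master_interval_load() {
    rounds="$1"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        pgbench -T 120 "$PGBENCH_DB"
        sleep 30
        i=$((i + 1))
    done
}

# Standby side: read-only load, then a restart, repeated.
# Stops at the first restart that pg_ctl reports as failed.
standby_restart_loop() {
    datadir="$1"
    rounds="$2"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        pgbench -S -T "$SELECT_SECS" "$PGBENCH_DB"
        if ! pg_ctl -D "$datadir" restart; then
            echo "restart failed on iteration $i" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "completed $rounds restarts"
}
```

Since the failure is reported as sporadic (about 1 in 10), the standby loop would be run with a large rounds value, for example standby_restart_loop /path/to/standby/data 500.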
Re: [HACKERS] max_standby_delay considered harmful
On Wed, 2010-05-12 at 16:03 +0200, Stefan Kaltenbrunner wrote: Simon Riggs wrote: On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote: On Wed, May 12, 2010 at 7:26 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote: I'm not sure what to make of this. Sometimes not shutting down doesn't sound like a feature to me. It acts exactly the same in recovery as in normal running. It is not a special feature of recovery at all, bug or otherwise. Simon, that doesn't make any sense. We are talking about a backend getting stuck forever on an exclusive lock that is held by the startup process and which will never be released (for example, because the master has shut down and no more WAL can be obtained for replay). The startup process does not hold locks in normal operation. When I test it, startup process holding a lock does not prevent shutdown of a standby. I'd be happy to see your test case showing a bug exists and that the behaviour differs from normal running. In my testing, once in a while the postmaster simply does not shut down, even with no clients connected any more. Most of the time it works just fine, but in roughly 1 out of 10 cases it gets stuck. My test case (as detailed in the related thread) simply runs an interval load on the master (pgbench -T 120; sleep 30; pgbench -T 120; rinse and repeat as needed) and pgbench -S; pg_ctl restart; pgbench -S in a loop on the standby. Once in a while the standby will simply not shut down (forever, not merely exceeding the default timeout of pg_ctl, which seems to get triggered much more often on the standby than on the master; I have not looked into that in detail yet). If you could recreate that on a server in debug mode we can see what's happening. If you can attach to the server and get a back trace that would help. I've not seen that behaviour at all during testing, and if the issue is sporadic it's not likely to help much for me to try to recreate it myself. 
This could be an issue with SR, or an issue with the shutdown code itself. -- Simon Riggs www.2ndQuadrant.com
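For anyone wanting to collect the requested back trace from a stuck standby, one possible approach is sketched below. The function name startup_backtrace is hypothetical, and the process-title pattern is an assumption: the title shown by ps differs between PostgreSQL versions and installations, so adjust STARTUP_PATTERN as needed. The gdb options used (-batch, -ex, -p) are standard, and attaching normally requires running as the postgres OS user or root.

```shell
# Assumed process-title pattern for the startup (recovery) process.
STARTUP_PATTERN="${STARTUP_PATTERN:-postgres: startup}"

# Hypothetical helper: attach gdb to the startup process and print a back trace.
startup_backtrace() {
    # Take the first matching PID; tolerate "no match" without aborting.
    pid=$(pgrep -f "$STARTUP_PATTERN" | head -n 1) || true
    if [ -z "$pid" ]; then
        echo "no startup process found matching: $STARTUP_PATTERN" >&2
        return 1
    fi
    # -batch exits after running the commands; -ex bt prints the stack.
    gdb -batch -ex bt -p "$pid"
}
```

A build with debugging symbols gives far more useful output, since frames then include arguments and line numbers.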
Re: [HACKERS] max_standby_delay considered harmful
On Wed, 2010-05-12 at 14:18 +0100, Simon Riggs wrote: On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote: On Wed, May 12, 2010 at 7:26 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote: I'm not sure what to make of this. Sometimes not shutting down doesn't sound like a feature to me. It acts exactly the same in recovery as in normal running. It is not a special feature of recovery at all, bug or otherwise. Simon, that doesn't make any sense. We are talking about a backend getting stuck forever on an exclusive lock that is held by the startup process and which will never be released (for example, because the master has shut down and no more WAL can be obtained for replay). The startup process does not hold locks in normal operation. When I test it, startup process holding a lock does not prevent shutdown of a standby. I'd be happy to see your test case showing a bug exists and that the behaviour differs from normal running. Let me put this differently: I accept that Stefan has reported a problem. Neither Tom nor myself can reproduce the problem. I've re-run Stefan's test case and restarted the server more than 400 times now without any issue. I re-read your post where you gave what you yourself called uninformed speculation. There's no real polite way to say it, but yes your speculation does appear to be uninformed, since it is incorrect. Reasons would be not least that Stefan's tests don't actually send any locks to the standby anyway (!), but even if they did your speculation as to the cause is still all wrong, as explained. There is no evidence to link this behaviour with HS, as yet, and you should be considering the possibility the problem lies elsewhere, especially since it could be code you committed that is at fault. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Wed, May 12, 2010 at 11:28 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-05-12 at 14:18 +0100, Simon Riggs wrote: On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote: On Wed, May 12, 2010 at 7:26 AM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote: I'm not sure what to make of this. Sometimes not shutting down doesn't sound like a feature to me. It acts exactly the same in recovery as in normal running. It is not a special feature of recovery at all, bug or otherwise. Simon, that doesn't make any sense. We are talking about a backend getting stuck forever on an exclusive lock that is held by the startup process and which will never be released (for example, because the master has shut down and no more WAL can be obtained for replay). The startup process does not hold locks in normal operation. When I test it, startup process holding a lock does not prevent shutdown of a standby. I'd be happy to see your test case showing a bug exists and that the behaviour differs from normal running. Let me put this differently: I accept that Stefan has reported a problem. Neither Tom nor myself can reproduce the problem. I've re-run Stefan's test case and restarted the server more than 400 times now without any issue. OK, I'm glad to hear you've been testing this. I wasn't aware of that. I re-read your post where you gave what you yourself called uninformed speculation. There's no real polite way to say it, but yes your speculation does appear to be uninformed, since it is incorrect. Reasons would be not least that Stefan's tests don't actually send any locks to the standby anyway (!), Hmm. Well, assuming you're correct, that does seem to be a, uh, slight problem with my theory. but even if they did your speculation as to the cause is still all wrong, as explained. You lost me. I don't understand why the problem that I'm referring to couldn't happen, even if it's not what's happening here. 
There is no evidence to link this behaviour with HS, as yet, and you should be considering the possibility the problem lies elsewhere, especially since it could be code you committed that is at fault. Huh?? The evidence that this bug is linked with HS is that it occurs on a server running in HS mode, and not otherwise. As for whether the bug is code I committed, that's certainly possible, but keep in mind it didn't work at all before IN HOT STANDBY MODE - and that will be code you committed. I'm going to go test this and see if I can figure out what's going on. I hope you will keep at it also - as you point out, your knowledge of this code far exceeds mine. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Wed, 2010-05-12 at 12:04 -0400, Robert Haas wrote: Huh?? The evidence that this bug is linked with HS is that it occurs on a server running in HS mode, and not otherwise. As for whether the bug is code I committed, that's certainly possible, but keep in mind it didn't work at all before IN HOT STANDBY MODE - and that will be code you committed. I'll say it now, so it's plain. I'm not going to investigate every bug that occurs on Postgres, just because someone was in HS when they found it. Any more than all bugs on Postgres in normal running are MVCC bugs. There needs to be reasonable evidence or a conjecture by someone that knows something about the code. If HS were the only thing changed in recovery in this release, that might not seem reasonable, but since we have much new code and I am not the only developer, it is. Normal shutdown didn't work on a standby before HS was committed and it didn't work afterwards either. Use all the capitals you like but if you use poor arguments and combine that with no evidence then we'll not get very far, either in working together or in solving the actual bugs. Please don't continue to make wild speculations about things related to HS and recovery, so that issues do not become confused; there is no need to comment on every thread. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] max_standby_delay considered harmful
On Wed, May 12, 2010 at 5:49 PM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-05-12 at 12:04 -0400, Robert Haas wrote: Huh?? The evidence that this bug is linked with HS is that it occurs on a server running in HS mode, and not otherwise. As for whether the bug is code I committed, that's certainly possible, but keep in mind it didn't work at all before IN HOT STANDBY MODE - and that will be code you committed. I'll say it now, so it's plain. I'm not going to investigate every bug that occurs on Postgres, just because someone was in HS when they found it. Fair enough, though your help debugging is always appreciated regardless of whether a problem is HS related or not. Nobody's obligated to work on anything in Postgres after all. I'm not sure who to blame for the shouting match over whose commit introduced the bug -- it doesn't seem like a relevant or useful thing to argue about, please both stop. there is no need to comment on every thread. This is out of line. -- greg
Re: [HACKERS] max_standby_delay considered harmful
On Wed, 2010-05-12 at 18:05 +0100, Greg Stark wrote: I'm not sure who to blame for the shouting match over whose commit introduced the bug -- it doesn't seem like a relevant or useful thing to argue about, please both stop. I haven't blamed Robert's code, merely asked him to consider that it is something other than HS, since we have no evidence either way at present because the issue is sporadic and has not been replicated as yet, with no specific detail leading to any section of code. there is no need to comment on every thread. This is out of line. Quoted out of context, it is. My full comment is "Please don't continue to make wild speculations about things related to HS and recovery, so that issues do not become confused; there is no need to comment on every thread." ... by which I mean threads related to HS and recovery. I respect everybody's right to free speech here, but I would say the same to anyone if they do it repeatedly. I'm not the first to make such a comment on hackers either. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] max_standby_delay considered harmful
On Wed, 2010-05-12 at 17:49 +0100, Simon Riggs wrote: On Wed, 2010-05-12 at 12:04 -0400, Robert Haas wrote: Normal shutdown didn't work on a standby before HS was committed and it didn't work afterwards either. Use all the capitals you like but if you use poor arguments and combine that with no evidence then we'll not get very far, either in working together or in solving the actual bugs. Please don't continue to make wild speculations about things related to HS and recovery, so that issues do not become confused; there is no need to comment on every thread. Simon, People are very passionate about this feature. This feature has the ability to show us as moving forward in a fashion that will allow us to directly compete with the big boys in the big installs, although we are still probably 2-3 releases from that. It also has the ability to make us look like a bunch of yahoos (no pun intended) who are better served beating up on that database that Oracle just bought, versus Oracle itself. Patience is a virtue for all when it comes to this feature. Joshua D. Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering
Re: [HACKERS] max_standby_delay considered harmful
On Wed, May 12, 2010 at 1:21 PM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-05-12 at 18:05 +0100, Greg Stark wrote: I'm not sure who to blame for the shouting match over whose commit introduced the bug -- it doesn't seem like a relevant or useful thing to argue about, please both stop. I haven't blamed Robert's code, merely asked him to consider that it is something other than HS, since we have no evidence either way at present because the issue is sporadic and has not been replicated as yet, with no specific detail leading to any section of code. I'm not really sure what we're arguing about here. I feel like I'm being accused either of (a) introducing the bug (which is possible) or (b) saying that Simon introduced the bug (which presumably is also possible, although it's not really my point). I ventured an uninformed guess at what the problem might be; Simon thinks my guess is wrong, and it may well be: but either way there's a bug buried in here somewhere and it would be nice to fix it. I thought that it would be a good idea for Simon to look at it because, on the surface, it APPEARS to have something to do with Hot Standby, since that's what Stefan was testing when he found it. Sure, the investigation might lead somewhere else; I completely admit that. Now, Simon just said he HAS looked at it and can't reproduce the problem. So now I'm even less sure what we're arguing about. I'm glad he looked at it. It's interesting that he wasn't able to reproduce the problem. I hope that he or someone else will find something that helps us move forward. I am having difficulty reproducing Stefan's test environment and perhaps for that reason I can't reproduce it either, though I've encountered several other problems about which, I suppose, I will post separate emails. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] max_standby_delay considered harmful
On 05/12/2010 05:28 PM, Simon Riggs wrote: On Wed, 2010-05-12 at 14:18 +0100, Simon Riggs wrote: On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote: On Wed, May 12, 2010 at 7:26 AM, Simon Riggssi...@2ndquadrant.com wrote: On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote: I'm not sure what to make of this. Sometimes not shutting down doesn't sound like a feature to me. It acts exactly the same in recovery as in normal running. It is not a special feature of recovery at all, bug or otherwise. Simon, that doesn't make any sense. We are talking about a backend getting stuck forever on an exclusive lock that is held by the startup process and which will never be released (for example, because the master has shut down and no more WAL can be obtained for replay). The startup process does not hold locks in normal operation. When I test it, startup process holding a lock does not prevent shutdown of a standby. I'd be happy to see your test case showing a bug exists and that the behaviour differs from normal running. Let me put this differently: I accept that Stefan has reported a problem. Neither Tom nor myself can reproduce the problem. I've re-run Stefan's test case and restarted the server more than 400 times now without any issue. I re-read your post where you gave what you yourself called uninformed speculation. There's no real polite way to say it, but yes your speculation does appear to be uninformed, since it is incorrect. Reasons would be not least that Stefan's tests don't actually send any locks to the standby anyway (!), but even if they did your speculation as to the cause is still all wrong, as explained. There is no evidence to link this behaviour with HS, as yet, and you should be considering the possibility the problem lies elsewhere, especially since it could be code you committed that is at fault. 
Well I'm not sure why people seem to have that hard a time reproducing that issue - it seems that I can provoke it really trivially (in this case no loops, no pgbench, no tricks). A few minutes ago I logged into my test standby (which is idle except for the odd connect to template1 caused by nagios - the master is idle as well and has been for days):

postg...@soldata005:~$ psql
psql (9.0beta1)
Type help for help.

postgres=# select 1;
 ?column?
----------
        1
(1 row)

postgres=# \q
postg...@soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down done
server stopped
server starting
postg...@soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down done
server stopped
server starting
postg...@soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down... failed
pg_ctl: server does not shut down

the server log for that is as follows:

2010-05-12 20:36:18.166 CEST,,, LOG: received smart shutdown request
2010-05-12 20:36:18.167 CEST,,, FATAL: terminating walreceiver process due to administrator command
2010-05-12 20:36:18.174 CEST,,, LOG: shutting down
2010-05-12 20:36:18.251 CEST,,, LOG: database system is shut down
2010-05-12 20:36:19.706 CEST,,, LOG: database system was interrupted while in recovery at log time 2010-05-06 17:36:05 CEST
2010-05-12 20:36:19.706 CEST,,, HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2010-05-12 20:36:19.706 CEST,,, LOG: entering standby mode
2010-05-12 20:36:19.721 CEST,,, LOG: consistent recovery state reached at 1/1278
2010-05-12 20:36:19.721 CEST,,, LOG: invalid record length at 1/1278
2010-05-12 20:36:19.723 CEST,,, LOG: database system is ready to accept read only connections
2010-05-12 20:36:19.737 CEST,,, LOG: streaming replication successfully connected to primary
2010-05-12 20:36:19.918 CEST,,, LOG: received smart shutdown request
2010-05-12 20:36:19.919 CEST,,, FATAL: terminating walreceiver process due to administrator command
2010-05-12 20:36:19.922 CEST,,, LOG: shutting down
2010-05-12 20:36:19.937 CEST,,, LOG: database system is shut down
2010-05-12 20:36:21.433 CEST,,, LOG: database system was interrupted while in recovery at log time 2010-05-06 17:36:05 CEST
2010-05-12 20:36:21.433 CEST,,, HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2010-05-12 20:36:21.433 CEST,,, LOG: entering standby mode
2010-05-12 20:36:21.482 CEST,,, LOG: received smart shutdown request
2010-05-12 20:36:21.504 CEST,,, LOG: consistent recovery state reached at 1/1278
2010-05-12 20:36:21.504 CEST,,, LOG: invalid record length at 1/1278
2010-05-12 20:36:21.505 CEST,,, LOG: database system is ready to accept read only connections
2010-05-12 20:36:21.516 CEST,,, LOG: streaming replication successfully connected to primary

so it restarted two times
Re: [HACKERS] max_standby_delay considered harmful
On Wed, 2010-05-12 at 21:10 +0200, Stefan Kaltenbrunner wrote: There is no evidence to link this behaviour with HS, as yet, and you should be considering the possibility the problem lies elsewhere, especially since it could be code you committed that is at fault. Well I'm not sure why people seem to have that hard a time reproducing that issue - it seems that I can provoke it really trivially(in this case no loops, no pgbench, no tricks). A few minutes ago I logged into my test standby (which is idle except for the odd connect to template1 caused by nagios - the master is idle as well and has been for days): Thanks, good report. so it restarted two times successfully - however if one looks at the third time one can see that it received the smart shutdown request BEFORE it reached a consistent recovery state - yet it continued to enable HS and reenabled SR as well. The database is now sitting there doing nothing and it more or less broken because you cannot connect to it in the current state: ~$ psql psql: FATAL: the database system is shutting down the startup process has the following backtrace: (gdb) bt #0 0x7fbe24cb2c83 in select () from /lib/libc.so.6 #1 0x006e811a in pg_usleep () #2 0x0048c333 in XLogPageRead () #3 0x0048c967 in ReadRecord () #4 0x00493ab6 in StartupXLOG () #5 0x00495a88 in StartupProcessMain () #6 0x004ab25e in AuxiliaryProcessMain () #7 0x005d4a7d in StartChildProcess () #8 0x005d70c2 in PostmasterMain () #9 0x0057d898 in main () Well, its waiting for new info from primary. Nothing to do with locking, but that's not an indication that its an SR issue though either. ;-) I'll put some waits into that part of the code and see if I can induce the failure. Maybe its just a simple lack of a CHECK_FOR_INTERRUPTS(). -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
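Simon's guess about a missing CHECK_FOR_INTERRUPTS() can be illustrated with a toy model. The sketch below is Python, not PostgreSQL source: `wait_for_wal`, `shutdown_requested`, `check_interrupts` and `max_polls` are hypothetical names invented for illustration, and nothing here claims to match what XLogPageRead() actually does. It only shows the general pattern: a loop that sleeps while waiting for data from the primary never notices a shutdown request unless it checks for one on each iteration.

```python
import threading

def wait_for_wal(data_ready, shutdown_requested, check_interrupts, max_polls=50):
    """Poll until data arrives. Returns 'data', 'shutdown', or 'stuck'.

    max_polls stands in for "forever" so that the example terminates.
    """
    for _ in range(max_polls):
        if data_ready.is_set():
            return "data"
        # The interrupt check is the piece the loop might be missing.
        if check_interrupts and shutdown_requested.is_set():
            return "shutdown"
        # Short sleep between polls (a stand-in for the pg_usleep() frame
        # seen in the backtrace).
        data_ready.wait(timeout=0.001)
    return "stuck"

data_ready = threading.Event()          # never set: the primary is idle
shutdown_requested = threading.Event()
shutdown_requested.set()                # smart shutdown arrives while waiting

result_without_check = wait_for_wal(data_ready, shutdown_requested, check_interrupts=False)
result_with_check = wait_for_wal(data_ready, shutdown_requested, check_interrupts=True)
```

With an idle primary, the loop without the check runs out its polls and reports "stuck", while the loop with the check returns "shutdown" on its first pass. Whether the real hang has this cause is exactly what the thread is still trying to establish.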
Re: [HACKERS] max_standby_delay considered harmful
Excerpts from Stefan Kaltenbrunner's message of Wed May 12 15:10:28 -0400 2010:

the startup process has the following backtrace:

(gdb) bt
#0 0x7fbe24cb2c83 in select () from /lib/libc.so.6
#1 0x006e811a in pg_usleep ()
#2 0x0048c333 in XLogPageRead ()
#3 0x0048c967 in ReadRecord ()
#4 0x00493ab6 in StartupXLOG ()
#5 0x00495a88 in StartupProcessMain ()
#6 0x004ab25e in AuxiliaryProcessMain ()
#7 0x005d4a7d in StartChildProcess ()
#8 0x005d70c2 in PostmasterMain ()
#9 0x0057d898 in main ()

I just noticed that we have some code assigning the return value of time() to a pg_time_t variable. Is this supposed to work reliably? (xlog.c lines 9267ff)
Re: [HACKERS] max_standby_delay considered harmful
On Wed, 2010-05-12 at 15:36 -0400, Alvaro Herrera wrote: I just noticed that we have some code assigning the return value of time() to a pg_time_t variable. Is this supposed to work reliably? (xlog.c lines 9267ff) Code's used that for a while now. Checkpoints and everywhere. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] max_standby_delay considered harmful
On Wed, May 12, 2010 at 3:36 PM, Alvaro Herrera alvhe...@alvh.no-ip.org wrote: Excerpts from Stefan Kaltenbrunner's message of mié may 12 15:10:28 -0400 2010: the startup process has the following backtrace: (gdb) bt #0 0x7fbe24cb2c83 in select () from /lib/libc.so.6 #1 0x006e811a in pg_usleep () #2 0x0048c333 in XLogPageRead () #3 0x0048c967 in ReadRecord () #4 0x00493ab6 in StartupXLOG () #5 0x00495a88 in StartupProcessMain () #6 0x004ab25e in AuxiliaryProcessMain () #7 0x005d4a7d in StartChildProcess () #8 0x005d70c2 in PostmasterMain () #9 0x0057d898 in main () I just noticed that we have some code assigning the return value of time() to a pg_time_t variable. Is this supposed to work reliably? (xlog.c lines 9267ff) I' -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Wed, May 12, 2010 at 3:51 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, May 12, 2010 at 3:36 PM, Alvaro Herrera alvhe...@alvh.no-ip.org wrote: Excerpts from Stefan Kaltenbrunner's message of mié may 12 15:10:28 -0400 2010: the startup process has the following backtrace: (gdb) bt #0 0x7fbe24cb2c83 in select () from /lib/libc.so.6 #1 0x006e811a in pg_usleep () #2 0x0048c333 in XLogPageRead () #3 0x0048c967 in ReadRecord () #4 0x00493ab6 in StartupXLOG () #5 0x00495a88 in StartupProcessMain () #6 0x004ab25e in AuxiliaryProcessMain () #7 0x005d4a7d in StartChildProcess () #8 0x005d70c2 in PostmasterMain () #9 0x0057d898 in main () I just noticed that we have some code assigning the return value of time() to a pg_time_t variable. Is this supposed to work reliably? (xlog.c lines 9267ff) I' I have a love-hate relationship with GMail, sorry. I am wondering if we are not correctly handling the case where we get a shutdown request while we are still in the PM_STARTUP state. It looks like we might go ahead and switch to PM_RECOVERY and then PM_RECOVERY_CONSISTENT without noticing the shutdown. There is some logic to handle the shutdown when the startup process exits, but if the startup process never exits it looks like we might get stuck. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
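Robert's hypothesis above can be written down as a toy state machine. This is a Python sketch, not postmaster code: the three state names come from his message, but `ToyPostmaster`, `SHUTTING_DOWN`, `recheck_on_transition` and the transition rules are assumptions made up for illustration. It models only the suspected failure: a shutdown request that arrives in the startup state is recorded, and then nobody looks at it again when the state advances.

```python
from enum import Enum, auto

class PMState(Enum):
    STARTUP = auto()
    RECOVERY = auto()
    RECOVERY_CONSISTENT = auto()
    SHUTTING_DOWN = auto()   # hypothetical catch-all for "shutdown in progress"

class ToyPostmaster:
    def __init__(self, recheck_on_transition):
        self.state = PMState.STARTUP
        self.shutdown_pending = False
        self.recheck_on_transition = recheck_on_transition

    def request_smart_shutdown(self):
        self.shutdown_pending = True
        # Assumption: the request is acted on at once only if connections
        # are already possible; earlier states just remember it.
        if self.state is PMState.RECOVERY_CONSISTENT:
            self.state = PMState.SHUTTING_DOWN

    def advance(self, new_state):
        """State change signalled by the startup process."""
        self.state = new_state
        # The suspected missing step: re-check a pending shutdown here.
        if self.recheck_on_transition and self.shutdown_pending:
            self.state = PMState.SHUTTING_DOWN

def run(recheck_on_transition):
    pm = ToyPostmaster(recheck_on_transition)
    pm.request_smart_shutdown()              # arrives while still in STARTUP
    pm.advance(PMState.RECOVERY)
    if pm.state is not PMState.SHUTTING_DOWN:
        pm.advance(PMState.RECOVERY_CONSISTENT)
    return pm.state

stuck_state = run(recheck_on_transition=False)
fixed_state = run(recheck_on_transition=True)
```

Without the re-check the toy ends up in RECOVERY_CONSISTENT with a shutdown still pending, which resembles the log Stefan posted: the shutdown request is logged before the consistent recovery state is reached, yet read-only connections are enabled afterwards. With the re-check it moves to SHUTTING_DOWN on the first transition.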
Re: [HACKERS] max_standby_delay considered harmful
On Wed, 2010-05-12 at 14:43 -0400, Robert Haas wrote: I thought that it would be a good idea for Simon to look at it because, on the surface, it APPEARS to have something to do with Hot Standby, since that's what Stefan was testing when he found it. He was also testing SR, yet you haven't breathed a word about that for some strange reason. It didn't APPEAR like it was HS at all, not from basic logic or from technical knowledge. So you'll have to forgive me if I don't leap into action when you say something is an HS problem in the future. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
Simon, Robert,

He was also testing SR, yet you haven't breathed a word about that for some strange reason. It didn't APPEAR like it was HS at all, not from basic logic or from technical knowledge. So you'll have to forgive me if I don't leap into action when you say something is an HS problem in the future.

Can we please chill out on this some? Especially since we now have an actual reproducible bug? Simon, it's natural for people to come to you because you are knowledgeable and responsive. You should take it as a compliment.

-- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] max_standby_delay considered harmful
On Wed, 2010-05-12 at 22:34 +0100, Simon Riggs wrote: On Wed, 2010-05-12 at 14:43 -0400, Robert Haas wrote: I thought that it would be a good idea for Simon to look at it because, on the surface, it APPEARS to have something to do with Hot Standby, since that's what Stefan was testing when he found it. He was also testing SR, yet you haven't breathed a word about that for some strange reason. It didn't APPEAR like it was HS at all, not from basic logic or from technical knowledge. So you'll have to forgive me if I don't leap into action when you say something is an HS problem in the future. Simon, with respect -- knock it off. Robert gave a very reasonable response. He is just trying to help. Relax man. Joshua Drake -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564 Consulting, Training, Support, Custom Development, Engineering -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Wed, May 12, 2010 at 5:34 PM, Simon Riggs si...@2ndquadrant.com wrote: On Wed, 2010-05-12 at 14:43 -0400, Robert Haas wrote: I thought that it would be a good idea for Simon to look at it because, on the surface, it APPEARS to have something to do with Hot Standby, since that's what Stefan was testing when he found it. He was also testing SR, yet you haven't breathed a word about that for some strange reason. It didn't APPEAR like it was HS at all, not from basic logic or from technical knowledge. So you'll have to forgive me if I don't leap into action when you say something is an HS problem in the future. Well, the original subject line of the report had mentioned SR only, but I had a specific theory about what might be happening that was related to the operation of HS. You've said that you think my guess is incorrect, and that's very possible, but until we actually find and fix the bug we're all just guessing. I wasn't intending to cast aspersions on your code. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Thu, May 13, 2010 at 4:55 AM, Robert Haas robertmh...@gmail.com wrote: I am wondering if we are not correctly handling the case where we get a shutdown request while we are still in the PM_STARTUP state. It looks like we might go ahead and switch to PM_RECOVERY and then PM_RECOVERY_CONSISTENT without noticing the shutdown. There is some logic to handle the shutdown when the startup process exits, but if the startup process never exits it looks like we might get stuck. Right. I reported this problem and submitted the patch before. http://archives.postgresql.org/pgsql-hackers/2010-04/msg00592.php Stefan, Could you check whether the patch fixes the problem you encountered? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Wed, May 12, 2010 at 10:46 PM, Fujii Masao masao.fu...@gmail.com wrote: On Thu, May 13, 2010 at 4:55 AM, Robert Haas robertmh...@gmail.com wrote: I am wondering if we are not correctly handling the case where we get a shutdown request while we are still in the PM_STARTUP state. It looks like we might go ahead and switch to PM_RECOVERY and then PM_RECOVERY_CONSISTENT without noticing the shutdown. There is some logic to handle the shutdown when the startup process exits, but if the startup process never exits it looks like we might get stuck. Right. I reported this problem and submitted the patch before. http://archives.postgresql.org/pgsql-hackers/2010-04/msg00592.php Sorry we missed that. Stefan, Could you check whether the patch fixes the problem you encountered? I think that would be a good thing to check (it'll confirm whether this is the same bug), but I'm not convinced we should actually fix it that way. Prior to 8.4, we handled a smart shutdown during recovery at the conclusion of recovery, just prior to entering normal running. I'm wondering if we shouldn't revert to that behavior in both 8.4 and HEAD. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Sun, 2010-05-09 at 20:56 -0400, Robert Haas wrote: Seems like it could take FOREVER on a busy system. Surely that's not OK. The fact that Hot Standby has to take exclusive locks that can't be released until WAL replay has progressed to a certain point seems like a fairly serious wart. If this is a serious wart then it's not one of hot standby, but one of postgres proper. AccessExclusiveLocks (SELECT-blocking locks that is, as opposed to UPDATE/DELETE-blocking locks) are never necessary from a correctness POV, they're only there for implementation reasons. Getting rid of them doesn't seem completely insurmountable either - just as multiple row versions remove the need to block SELECTs dues to concurrent UPDATEs, multiple datafile versions could remove the need to block SELECTs due to concurrent ALTERs. But people seem to live with them quite well, judged from the amount of work put into getting rid of them (zero). I therefore fail to see why they should pose a significant problem in HS setups. The difference is that in HS you have to wait for a moment where *no exclusive lock at all* exist, possibly without contending for any of them, while on the master you might not even blocked by the existence of any of those locks. If you have two sessions which in overlapping transactions lock different tables exlusively you have no problem shutting the master down, but you will never reach a point where no exclusive lock is taken on the slave. A possible solution to this in the shutdown case is to kill anyone waiting on a lock held by the startup process at the same time we kill the startup process, and to kill anyone who subsequently waits for such a lock as soon as they attempt to take it. I already explained that killing the startup process first is a bad idea for many reasons when shutdown was discussed. Can't remember who added the new standby shutdown code recently, but it sounds like their design was pretty poor if it didn't include shutting down properly with HS. 
I hope they fix the bug they have introduced. HS was never designed to work that way, so there is no flaw there; it certainly worked when committed. I'm not sure if this would also make sense in the pause case. Not sure why pausing replay would make any difference at all. Being between one WAL record and the next is a valid and normal state that exists many thousands of times per second. If making that state longer would cause problems we would already have seen any issues. There are none, it will work fine. Another possible solution would be to try to figure out if there's a way to delay application of WAL that requires the taking of AELs to the point where we could apply it all at once. That might not be feasible, though, or only in some cases, and it's certainly 9.1 material (at least) in any case. Locks usually protect users from accessing a table while its being clustered or dropped or something like that. Locks are not bad. They are also used by some developers to specifically serialize access to an object. AccessExclusiveLocks are rare in normal running and not to be avoided when they do exist. HS correctly supports locking, as and when such locks are made on the master. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
Robert Haas robertmh...@gmail.com writes: On Sun, May 9, 2010 at 6:58 PM, Andres Freund and...@anarazel.de wrote: The difference is that in HS you have to wait for a moment where *no exclusive lock at all* exists, possibly without contending for any of them, while on the master you might not even be blocked by the existence of any of those locks. If you have two sessions which in overlapping transactions lock different tables exclusively you have no problem shutting the master down, but you will never reach a point where no exclusive lock is taken on the slave.

A possible solution to this in the shutdown case is to kill anyone waiting on a lock held by the startup process at the same time we kill the startup process, and to kill anyone who subsequently waits for such a lock as soon as they attempt to take it. I'm not sure if this would also make sense in the pause case.

Well, wait, I'm getting lost here. It seems to me that no query on the slave is allowed to take an AEL, no matter what. The only case is a query waiting for the replay to release its locks. The only consequence of pause not waiting for any lock to get released from the replay is that those backends will be, well, paused. But that applies the same to any backend started after we pause. Waiting for replay to release all its locks before pausing would mean that there's a possibility that the activity on the master is such that you never reach a pause in the WAL stream.

Let's assume we want any new code we throw in at this stage to be a magic wand making every user happy at once. So we'd need a pause function taking either 1 or 2 arguments: the first says we pause now, even if we know the replay is holding some locks that might pause the reporting queries too; the other says to wait until the locks are not held anymore, with a timeout (default 1min?). Ok, that's designing the API we're missing, and we should not be in the process of doing any design at this stage. But we are.
[good summary of current positions] I can't presume to extract a consensus from that; I don't think there is one. All we know for sure is that Tom does not want to release as-is, and he rightfully insists on several objectives as far as the editing is concerned: - no addition of code we might want to throw away later - avoid having to deprecate released behavior, it's too hard - minimal change set, possibly with no new features. One more, pausing the replay is *already* in the code base, it's exactly what happens under the hood if you favor queries rather than replay, to the point I don't understand why the pause design needs to happen now. We're only talking about having an *explicit* version of it. Regards, -- dim I too am growing tired of insisting this much. I only continue because I really can't get to understand why-o-why considering a new API over existing feature is not possible at this stage. I'm hitting my head on the wal, so to say… -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
Robert Haas wrote: On Sun, May 9, 2010 at 6:58 PM, Andres Freund and...@anarazel.de wrote: On Monday 10 May 2010 00:25:44 Florian Pflug wrote: On May 9, 2010, at 22:01 , Robert Haas wrote: On Sun, May 9, 2010 at 3:09 PM, Dimitri Fontaine dfonta...@hi-media.com wrote: Seems like it could take FOREVER on a busy system. Surely that's not OK. The fact that Hot Standby has to take exclusive locks that can't be released until WAL replay has progressed to a certain point seems like a fairly serious wart. If this is a serious wart then it's not one of hot standby, but one of postgres proper. AccessExclusiveLocks (SELECT-blocking locks that is, as opposed to UPDATE/DELETE-blocking locks) are never necessary from a correctness POV, they're only there for implementation reasons. Getting rid of them doesn't seem completely insurmountable either - just as multiple row versions remove the need to block SELECTs dues to concurrent UPDATEs, multiple datafile versions could remove the need to block SELECTs due to concurrent ALTERs. But people seem to live with them quite well, judged from the amount of work put into getting rid of them (zero). I therefore fail to see why they should pose a significant problem in HS setups. The difference is that in HS you have to wait for a moment where *no exclusive lock at all* exist, possibly without contending for any of them, while on the master you might not even blocked by the existence of any of those locks. If you have two sessions which in overlapping transactions lock different tables exlusively you have no problem shutting the master down, but you will never reach a point where no exclusive lock is taken on the slave. A possible solution to this in the shutdown case is to kill anyone waiting on a lock held by the startup process at the same time we kill the startup process, and to kill anyone who subsequently waits for such a lock as soon as they attempt to take it. 
If you're not going to apply any more WAL records before shutdown, you could also just release all the AccessExclusiveLocks held by the startup process. Whatever the transaction was doing with the locked relation, if we're not going to replay any more WAL records before shutdown, we will not see the transaction committing or doing anything else with the relation, so we should be safe. Whatever state the data on disk is in, it must be valid, or we would have a problem with crash recovery recovering up to this WAL record and then starting up too. I'm not 100% clear if that reasoning applies to AccessExclusiveLocks taken explicitly with LOCK TABLE. It's not clear what the application would use the lock for. Nevertheless, maybe killing the transactions that wait for the locks would be more intuitive anyway. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
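The scenario Andres describes in the quoted text (two sessions whose exclusive locks overlap, so the standby never sees a moment with no exclusive lock held) is easy to check with interval arithmetic. The Python sketch below is illustrative only: `lock_free_gaps`, the time units and the sample intervals are all made up, and real lock replay on a standby is more involved. It just merges lock-hold intervals and reports the gaps where nothing is held.

```python
def lock_free_gaps(intervals, horizon):
    """Return (start, end) gaps within [0, horizon] where no lock is held.

    intervals: iterable of (acquire_time, release_time) pairs, any order.
    """
    gaps = []
    cursor = 0  # latest release time seen so far
    for start, end in sorted(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < horizon:
        gaps.append((cursor, horizon))
    return gaps

# Hypothetical sessions locking *different* tables in overlapping transactions.
session_a = [(0, 10), (15, 25), (30, 40)]
session_b = [(8, 17), (23, 32)]

gaps_one_session = lock_free_gaps(session_a[:2], horizon=30)
gaps_two_sessions = lock_free_gaps(session_a + session_b, horizon=40)
```

One session alone leaves gaps at (10, 15) and (25, 30). With both sessions the merged intervals cover the whole window and the gap list is empty, even though neither session ever blocks the other. A shutdown or pause that waits for "no exclusive lock at all" would wait for the entire window in that case.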
Re: [HACKERS] max_standby_delay considered harmful
Robert Haas wrote: On Thu, May 6, 2010 at 2:47 PM, Josh Berkus j...@agliodbs.com wrote: Now that I've realized what the real problem is with max_standby_delay (namely, that inactivity on the master can use up the delay), I think we should do what Tom originally suggested here. It's not as good as a really working max_standby_delay, but we're not going to have that for 9.0, and it's clearly better than a boolean.

I guess I'm not clear on how what Tom proposed is fundamentally different from max_standby_delay = -1. If there's enough concurrent queries, recovery would never catch up.

If your workload is that the standby server is getting pounded with queries like crazy, then it's probably not that different: it will fall progressively further behind. But I suspect many people will set up standby servers where most of the activity happens on the primary, but they run some reporting queries on the standby. If you expect your reporting queries to finish in 10s, you could set the max delay to say 60s. In the event that something gets wedged, recovery will eventually kill it and move on rather than just getting stuck forever. If the volume of queries is known not to be too high, it's reasonable to expect that a few good whacks will be enough to get things back on track.

Yeah, I could live with that. A problem with using the name max_standby_delay for Tom's suggestion is that it sounds like a hard limit, which it isn't. But if we name it something like:

# -1 = no timeout
# 0 = kill conflicting queries immediately
# > 0 wait for N seconds, then kill query
standby_conflict_timeout = -1

it's more clear that the setting is a timeout for each *conflict*, and it's less surprising that the standby can fall indefinitely behind in the worst case. If we name the setting along those lines, I could live with that.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
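The three cases of the proposed setting, and the reason it is not a hard limit on lag, can be sketched in a few lines. This is a Python illustration, not an implementation: `resolve_conflict` and `total_replay_delay` are hypothetical helpers, and the model assumes conflicts happen one after another, with each conflicting query described only by how long it still needs to run.

```python
def resolve_conflict(timeout_s, query_remaining_s):
    """Return (action, replay_delay_s) for a single conflict.

    timeout_s follows the proposed semantics:
      -1 = no timeout, 0 = kill immediately, > 0 = wait N seconds, then kill.
    """
    if timeout_s < 0:
        return ("wait", query_remaining_s)      # replay waits as long as it takes
    if query_remaining_s <= timeout_s:
        return ("wait", query_remaining_s)      # query finishes within the timeout
    return ("cancel", timeout_s)                # timeout expires, query is killed

def total_replay_delay(timeout_s, conflicts):
    """Sum the delay over a sequence of conflicts handled one after another."""
    return sum(resolve_conflict(timeout_s, q)[1] for q in conflicts)

short_query = resolve_conflict(60, 10)
wedged_query = resolve_conflict(60, 300)
immediate_kill = resolve_conflict(0, 5)
no_timeout = resolve_conflict(-1, 300)
accumulated = total_replay_delay(60, [10, 300, 10])
```

Using the numbers from Robert's example, a 10s reporting query under a 60s timeout simply runs to completion, while a wedged 300s query is cancelled after 60s. Three conflicts in a row add up to 80s of replay delay, more than the 60s setting, which is the sense in which the timeout applies per conflict and does not cap how far the standby falls behind.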
Re: [HACKERS] max_standby_delay considered harmful
On May 10, 2010, at 11:43 , Heikki Linnakangas wrote: If you're not going to apply any more WAL records before shutdown, you could also just release all the AccessExclusiveLocks held by the startup process. Whatever the transaction was doing with the locked relation, if we're not going to replay any more WAL records before shutdown, we will not see the transaction committing or doing anything else with the relation, so we should be safe. Whatever state the data on disk is in, it must be valid, or we would have a problem with crash recovery recovering up to this WAL record and then starting up too.

Sounds plausible. But wouldn't this imply that HS could *always* postpone the acquisition of an AccessExclusiveLock until right before the corresponding commit record is replayed? I fail to see a case where this would fail, yet recovery in case of an intermediate crash would be correct.

best regards, Florian Pflug
Re: [HACKERS] max_standby_delay considered harmful
* Heikki Linnakangas heikki.linnakan...@enterprisedb.com [100510 06:03]: A problem with using the name max_standby_delay for Tom's suggestion is that it sounds like a hard limit, which it isn't. But if we name it something like:

I'd still rather have an "if you're killing something, make sure you kill enough to get all the way current" behaviour, but that's just me. I want to run my standbys in an "always current" mode... But if I decide to play with a lagged HR, I really want to make sure there is some mechanism to cap the lag, and the cap is something I can understand and use to make a reasonable estimate as to when data I know is live on the primary will be seen on the standby... bonus points if it works similarly for archive recovery ;-)

a.

-- Aidan Van Dyk ai...@highrise.ca http://www.highrise.ca/
Create like a god, command like a king, work like a slave.
Re: [HACKERS] max_standby_delay considered harmful
On Mon, May 10, 2010 at 2:27 AM, Simon Riggs si...@2ndquadrant.com wrote: I already explained that killing the startup process first is a bad idea for many reasons when shutdown was discussed. Can't remember who added the new standby shutdown code recently, but it sounds like their design was pretty poor if it didn't include shutting down properly with HS. I hope they fix the bug they have introduced. HS was never designed to work that way, so there is no flaw there; it certainly worked when committed. The patch was written by Fujii Masao and committed, after review, by me. Prior to that patch, smart shutdown never worked; now it works, or so I believe, unless recovery is stalled holding a lock upon which a regular back-end is blocking. Clearly that is both better and not all that good. If you have any ideas to improve the situation further, I'm all ears. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Mon, May 10, 2010 at 6:03 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Yeah, I could live with that. A problem with using the name max_standby_delay for Tom's suggestion is that it sounds like a hard limit, which it isn't. But if we name it something like:

# -1 = no timeout
# 0 = kill conflicting queries immediately
# > 0 wait for N seconds, then kill query
standby_conflict_timeout = -1

it's more clear that the setting is a timeout for each *conflict*, and it's less surprising that the standby can fall indefinitely behind in the worst case. If we name the setting along those lines, I could live with that.

Yeah, if we do it that way, +1 for changing the name, and your suggestion seems good.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] max_standby_delay considered harmful
On Mon, May 10, 2010 at 6:13 AM, Florian Pflug f...@phlo.org wrote: On May 10, 2010, at 11:43 , Heikki Linnakangas wrote: If you're not going to apply any more WAL records before shutdown, you could also just release all the AccessExclusiveLocks held by the startup process. Whatever the transaction was doing with the locked relation, if we're not going to replay any more WAL records before shutdown, we will not see the transaction committing or doing anything else with the relation, so we should be safe. Whatever state the data on disk is in, it must be valid, or we would have a problem with crash recovery recovering up to this WAL record and then starting up too. Sounds plausible. But wouldn't this imply that HS could *always* postpone the acquisition of an AccessExclusiveLocks until right before the corresponding commit record is replayed? If fail to see a case where this would fail, yet recovery in case of an intermediate crash would be correct. Yeah, I'd like to understand this, too. I don't have a clear understanding of when HS needs to take locks here in the first place. [removing Josh Berkus's persistently bouncing email from the CC line] -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
Florian Pflug wrote: On May 10, 2010, at 11:43 , Heikki Linnakangas wrote: If you're not going to apply any more WAL records before shutdown, you could also just release all the AccessExclusiveLocks held by the startup process. Whatever the transaction was doing with the locked relation, if we're not going to replay any more WAL records before shutdown, we will not see the transaction committing or doing anything else with the relation, so we should be safe. Whatever state the data on disk is in, it must be valid, or we would have a problem with crash recovery recovering up to this WAL record and then starting up too. Sounds plausible. But wouldn't this imply that HS could *always* postpone the acquisition of an AccessExclusiveLocks until right before the corresponding commit record is replayed? If fail to see a case where this would fail, yet recovery in case of an intermediate crash would be correct. I guess it could in some situations, but for example the AccessExclusiveLock taken at the end of lazy vacuum to truncate the relation must be held during the truncation, or concurrent readers will get upset. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Monday 10 May 2010 14:00:45 Heikki Linnakangas wrote: Florian Pflug wrote: On May 10, 2010, at 11:43 , Heikki Linnakangas wrote: If you're not going to apply any more WAL records before shutdown, you could also just release all the AccessExclusiveLocks held by the startup process. Whatever the transaction was doing with the locked relation, if we're not going to replay any more WAL records before shutdown, we will not see the transaction committing or doing anything else with the relation, so we should be safe. Whatever state the data on disk is in, it must be valid, or we would have a problem with crash recovery recovering up to this WAL record and then starting up too. Sounds plausible. But wouldn't this imply that HS could *always* postpone the acquisition of an AccessExclusiveLocks until right before the corresponding commit record is replayed? If fail to see a case where this would fail, yet recovery in case of an intermediate crash would be correct. I guess it could in some situations, but for example the AccessExclusiveLock taken at the end of lazy vacuum to truncate the relation must be held during the truncation, or concurrent readers will get upset. Actually all the locks that do not need to be taken on the slave would not need to be an ACCESS EXCLUSIVE but a EXCLUSIVE on the master, right? That should be fixed on the master, not hacked up on the slave and is by far out of scope of 9.0. Thats an area where I definitely would like to improve pg in the future... Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
Simon Riggs wrote: Bruce has used the word crippleware for the current state. Raising a problem and then blocking solutions is the best way I know to cripple a release. It should be clear that I've done my best to avoid this FYI, it was Robert Haas who used the term crippleware to describe a boolean value for max_standby_delay, and I was just repeating his term, and disputing that it would be crippleware. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] max_standby_delay considered harmful
Robert Haas wrote: Wultsch (who doesn't ever want to kill queries and therefore would be happy with a boolean), Yeb Havinga (who never wants to stall recovery and therefore would also be happy with a boolean), and Florian Pflug (who points out that pause/resume is actually a nontrivial feature). Apologies if I've left anyone out or misrepresented their position. Overall I would say opinion is about evenly split between: - leave it as-is - make it a Boolean - change it in some way but to something more expressive than a Boolean I can't presume to extract a consensus from that; I don't think there is one. You could say the majority of people want to change something and that would be true; you could also say the majority of people don't want a Boolean and that would also be true. Yep, this is where we are. Discussion had stopped, so it seemed like time for a decision, and with no one agreeing on what to do, feature removal seemed like the best approach. Suggesting we will fix it later in beta is not a solution. Now, if everyone agrees we should do X, and X is simple, let's do X, but I am still not seeing that. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] max_standby_delay considered harmful
On Mon, May 10, 2010 at 6:03 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Robert Haas wrote: On Thu, May 6, 2010 at 2:47 PM, Josh Berkus j...@agliodbs.com wrote: Now that I've realized what the real problem is with max_standby_delay (namely, that inactivity on the master can use up the delay), I think we should do what Tom originally suggested here. It's not as good as a really working max_standby_delay, but we're not going to have that for 9.0, and it's clearly better than a boolean. I guess I'm not clear on how what Tom proposed is fundamentally different from max_standby_delay = -1. If there's enough concurrent queries, recovery would never catch up. If your workload is that the standby server is getting pounded with queries like crazy, then it's probably not that different: it will fall progressively further behind. But I suspect many people will set up standby servers where most of the activity happens on the primary, but they run some reporting queries on the standby. If you expect your reporting queries to finish in 10s, you could set the max delay to say 60s. In the event that something gets wedged, recovery will eventually kill it and move on rather than just getting stuck forever. If the volume of queries is known not to be too high, it's reasonable to expect that a few good whacks will be enough to get things back on track. Yeah, I could live with that. A problem with using the name max_standby_delay for Tom's suggestion is that it sounds like a hard limit, which it isn't. But if we name it something like:
# -1 = no timeout
# 0 = kill conflicting queries immediately
# > 0 wait for N seconds, then kill query
standby_conflict_timeout = -1
it's more clear that the setting is a timeout for each *conflict*, and it's less surprising that the standby can fall indefinitely behind in the worst case. If we name the setting along those lines, I could live with that. +1 from the peanut gallery.
-- Mike Rylander | VP, Research and Design | Equinox Software, Inc. / The Evergreen Experts | phone: 1-877-OPEN-ILS (673-6457) | email: mi...@esilibrary.com | web: http://www.esilibrary.com
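A minimal Python sketch of the per-conflict timeout semantics discussed above; this is not PostgreSQL code. The function names and the simplified model (one blocking query per conflict, all times in seconds) are assumptions made for illustration; only the three cases -1, 0 and N come from the thread.

```python
def recovery_wait(conflict_timeout, query_runtime_left):
    """Model ONE conflict between WAL replay and a standby query.

    conflict_timeout: -1 = no timeout, 0 = kill immediately,
                      N > 0 = wait up to N seconds, then kill.
    Returns (seconds replay waits, whether the query is killed).
    """
    if conflict_timeout < 0:
        # No timeout: replay waits until the query finishes on its own.
        return query_runtime_left, False
    if query_runtime_left <= conflict_timeout:
        # The query finishes before the timer expires.
        return query_runtime_left, False
    # The timer expires first: replay waits the full timeout, then kills.
    return conflict_timeout, True


def total_lag(conflict_timeout, queries):
    """Sum the waits over a series of conflicts (hypothetical helper).

    The timer restarts for every conflict, so the total is unbounded.
    """
    return sum(recovery_wait(conflict_timeout, q)[0] for q in queries)
```

With a 60 s timeout and ten consecutive conflicts against 300 s queries, replay waits 600 s in total in this model, which is why the setting reads as a timeout per conflict and not as a hard limit on how far the standby can fall behind.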
Re: [HACKERS] max_standby_delay considered harmful
Bruce Momjian br...@momjian.us wrote: Robert Haas wrote: Overall I would say opinion is about evenly split between: - leave it as-is - make it a Boolean - change it in some way but to something more expressive than a Boolean I think a boolean would limit the environments in which HS would be useful. Personally, I think how far the replica is behind the source is a more useful metric, even with anomalies on the transition from idle to active; but a blocking duration would be much better than no finer control than the boolean. So my instant runoff second choice would be for the block duration knob. time for a decision, and with no one agreeing on what to do, feature removal seemed like the best approach. I keep wondering at the assertion that once a GUC is present (especially a tuning GUC like this) that we're stuck with it. I know that's true of SQL code constructs, but postgresql.conf files? How about redirect_stderr, max_fsm_*, sort_mem, etc.? This argument seems tenuous. Suggesting we will fix it later in beta is not a solution. I'm with you there, 100% -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
* Aidan Van Dyk (ai...@highrise.ca) wrote: * Heikki Linnakangas heikki.linnakan...@enterprisedb.com [100510 06:03]: A problem with using the name max_standby_delay for Tom's suggestion is that it sounds like a hard limit, which it isn't. But if we name it something like: I'd still rather have an "if you're killing something, make sure you kill enough to get all the way current" behaviour, but that's just me I agree with that comment, and it's more like what max_standby_delay was. That's what I had thought Tom was proposing initially, since it makes a heck of a lot more sense to me than just keep waiting, just keep waiting... Now, if it's possible to have things queue up behind the recovery process, such that the recovery process will only wait up to timeout * # of locks held when recovery started, that might be alright, but that's not the impression I've gotten about how this will work. Of course, I also want to be able to have a Nagios hook that checks how far behind the slave has gotten, and a way to tell the slave "oook, you're too far behind, just forcibly catch up right *now*". If I could use reload to change max_standby_delay (or whatever) and I can figure out how long the delay is (even if I have to update a table on the master and then see what it says on the slave..), I'd be happy. That being said, I do think it makes more sense to wait until we've got a conflict to start the timer, and I rather like avoiding the uncertainty of time sync between master and slave by using WAL arrival time on the slave. Thanks, Stephen
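The "update a table on the master and see what it says on the slave" check mentioned above could look roughly like the following Python sketch. It only covers the arithmetic: how the heartbeat timestamp is written on the master and read on the slave is left out, and the function names and thresholds are assumptions. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL). Note that this approach relies on the two clocks being in sync, which is exactly the uncertainty raised in the message.

```python
from datetime import datetime, timedelta, timezone

# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def standby_lag(heartbeat_seen_on_standby, now):
    """Lag = standby's current time minus the newest heartbeat it has replayed.

    Clamped at zero because clock skew between master and slave can make
    the raw difference negative.
    """
    return max(now - heartbeat_seen_on_standby, timedelta(0))


def nagios_status(lag,
                  warn=timedelta(seconds=60),
                  crit=timedelta(seconds=300)):
    """Map a lag value to a Nagios status; thresholds are illustrative."""
    if lag >= crit:
        return CRITICAL
    if lag >= warn:
        return WARNING
    return OK
```

A check script would read the heartbeat row on the slave, call `standby_lag` with the current time, and exit with the value returned by `nagios_status`.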
Re: [HACKERS] max_standby_delay considered harmful
Kevin Grittner wrote: Bruce Momjian br...@momjian.us wrote: Robert Haas wrote: Overall I would say opinion is about evenly split between: - leave it as-is - make it a Boolean - change it in some way but to something more expressive than a Boolean I think a boolean would limit the environments in which HS would be useful. Personally, I think how far the replica is behind the source is a more useful metric, even with anomalies on the transition from idle to active; but a blocking duration would be much better than no finer control than the boolean. So my instant runoff second choice would be for the block duration knob. time for a decision, and with no one agreeing on what to do, feature removal seemed like the best approach. I keep wondering at the assertion that once a GUC is present (especially a tuning GUC like this) that we're stuck with it. I know that's true of SQL code constructs, but postgresql.conf files? How about redirect_stderr, max_fsm_*, sort_mem, etc.? This argument seems tenuous. You are right that we are much more flexible about changing administrative configuration parameters (like this one) than SQL. In the past, we even renamed logging parameters to be more consistent, and I think that proves the bar is quite low for GUC administrative parameter change. :-) The concern about 'max_standby_delay' is that it controls a lot of new code and affects the behavior of HS/SR in ways that might cause a poor user experience, especially for non-expert users. I admit that expert users can use the setting, but we are coding for a general user base, and we might have to field many questions about 'max_standby_delay' from general users that will make us look bad. "The setting is totally useless" is something we have heard about other partial solutions we have released in the past. We try to avoid that. ;-) Labeling something experimental also makes our code look sloppy.
And if we decide the problem is unsolvable using this approach, we should remove it now rather than later. We don't like to carry around a wart for a small segment of our userbase. I realize many of you have not been around to see some of our less-than-perfect solutions and to see the pain they cause. Once something gets in, we have to fight to remove it. In fact, there is no way we would add 'max_standby_delay' into our codebase now, knowing its limitations, but people are having to fight hard for its removal, if necessary. Now that discussion has restarted again, let's keep going to see if we can reach some kind of simple solution. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] max_standby_delay considered harmful
On Mon, May 10, 2010 at 5:20 PM, Bruce Momjian br...@momjian.us wrote: You are right that we are much more flexible about changing administrative configuration parameters (like this one) than SQL. In the past, we even renamed logging parameters to be more consistent, and I think that proves the bar is quite low for GUC administrative parameter change. :-) The concern about 'max_standby_delay' is that it controls a lot of new code and affects the behavior of HS/SR in ways that might cause a poor user experience, especially for non-expert users. I would like to propose that we do the following: 1) Replace max_standby_delay with a boolean as per Heikki's suggestion 2) Add an explicitly experimental option like max_standby_delay or recovery_conflict_timeout which is only effective if you've chosen the recovery_conflict=pause recovery option and is explicitly documented as being scheduled to be replaced with a more complete system in future versions. My thinking is that when we do replace max_standby_delay we would keep the recovery_conflict parameter with the same semantics. It's just the additional experimental option which would change. -- greg
Re: [HACKERS] max_standby_delay considered harmful
1) Replace max_standby_delay with a boolean as per Heikki's suggestion 2) Add an explicitly experimental option like max_standby_delay or recovery_conflict_timeout which is only effective if you've chosen the recovery_conflict=pause recovery option and is explicitly documented as being scheduled to be replaced with a more complete system in future versions. +1 As far as I can tell, the current delay *works*. It just doesn't necessarily work the way most people expect it to work. Kind of like, hmmm, shared_buffers? Or effective_cache_size? Or effective_io_concurrency? And I still think that having this kind of a delay option will give us invaluable user feedback on how the option *should* work in 9.1, which we won't get if we don't have an option. I think we will be overhauling it for 9.1, but I don't think that overhaul will benefit from a lack of data. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] max_standby_delay considered harmful
On May 10, 2010, at 17:39, Kevin Grittner wrote: Bruce Momjian br...@momjian.us wrote: Robert Haas wrote: Overall I would say opinion is about evenly split between: - leave it as-is - make it a Boolean - change it in some way but to something more expressive than a Boolean I think a boolean would limit the environments in which HS would be useful. Personally, I think how far the replica is behind the source is a more useful metric, even with anomalies on the transition from idle to active; but a blocking duration would be much better than no finer control than the boolean. So my instant runoff second choice would be for the block duration knob. You could always toggle that boolean automatically, based on some measurement of the replication lag (assuming the boolean would be settable at runtime). That'd give you much more flexibility than any built-in knob could provide, and even more so than a built-in knob with known deficiencies. My preference is hence to make it a boolean, but in a way that allows more advanced behavior to be implemented on top of it. In the simplest case by allowing the boolean to be flipped at runtime and ensuring that the system reacts in a sane way. best regards, Florian Pflug
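As a rough illustration of toggling the boolean from outside based on measured lag, here is a small Python sketch of the decision logic only. It assumes the two values "queries" and "replay" (borrowed from the hot_standby_conflict_winner proposal elsewhere in this thread) and hypothetical thresholds; how the lag is measured and how the setting is changed at runtime are not covered.

```python
def choose_winner(lag_s, current, high_s=300.0, low_s=30.0):
    """Pick which side wins conflicts, with hysteresis.

    lag_s:   measured replication lag in seconds (measurement not shown).
    current: the value in effect now, "queries" or "replay".
    high_s:  above this lag, let replay win so the standby catches up.
    low_s:   below this lag, let queries win again.
    Between the two thresholds the current value is kept, so the
    setting does not flap on every measurement.
    """
    if lag_s >= high_s:
        return "replay"
    if lag_s <= low_s:
        return "queries"
    return current
```

An external monitor would call this periodically and only touch the server configuration when the returned value differs from the current one.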
Re: [HACKERS] max_standby_delay considered harmful
On Mon, May 10, 2010 at 3:27 PM, Simon Riggs si...@2ndquadrant.com wrote: I already explained that killing the startup process first is a bad idea for many reasons when shutdown was discussed. Can't remember who added the new standby shutdown code recently, but it sounds like their design was pretty poor if it didn't include shutting down properly with HS. I hope they fix the bug they have introduced. HS was never designed to work that way, so there is no flaw there; it certainly worked when committed. New smart shutdown during recovery doesn't kill the startup process until all of the read only backends have gone away. So it works fine with HS. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Sat, 2010-05-08 at 14:48 -0400, Bruce Momjian wrote: I think the consensus is to change this setting to a boolean. If you don't want to do it, I am sure we can find someone who will. You expect others to act on consensus and follow rules, yet ignore them yourself when it suits your purpose. Your other points seem designed to distract people from seeing that. There is clear agreement that a problem exists. The action to take as a result of that problem is very clearly in doubt and yet you repeatedly ignore other people's comments and viable technical resolutions. If you can find a cat's paw to break consensus for you, more fool them. You might find someone with a good resolution, if you ask that instead. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
Bruce Momjian wrote: I think everyone agrees the current code is unusable, per Heikki's comment about a WAL file arriving after a period of no WAL activity I don't. I am curious to hear how many complaints we've had from alpha and beta testers of HS regarding this issue. I know that if we used it with our software, the issue would probably go unnoticed because of our usage patterns and automatic query retry. A positive setting would work as intended for us. I can think of pessimal usage patterns, different software approaches, and/or goals for HS usage which would conflict badly with a positive setting. Hopefully we can document this area better than we've historically done with, for example, fsync -- which has similar trade-offs, only with more dire consequences for bad user choices. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
Tom Lane t...@sss.pgh.pa.us writes: I like the proposal of a boolean because it provides only the minimal feature set of two cases that are both clearly needed and easily implementable. Whatever we do later is certain to provide a superset of those two cases. If we do something else (and that includes my own proposal of a straight lock timeout), we'll be implementing something we might wish to take back later. Taking out features after they've been in a release is very hard, even if we realize they're badly designed. That's where I thought my proposal fitted in. I fail to see us wanting to take back explicit pause/resume admin functions in any future release. Now, after having read Greg's arguments, my vote would be the following: - hot_standby_conflict_winner = queries|replay, defaults to replay - add pause/resume so that people can switch temporarily to queries - label max_standby_delay *experimental*, keep current code By clearly stating the feature is *experimental* it should be easy to both get feedback on it so that we know what to implement in 9.1, and should that be completely different, take back the feature. It should even be possible to continue tweaking its behavior during beta, or do something better. Of course it will piss off some users, but they knew they were depending on some *experimental* feature after all. Regards, -- dim
Re: [HACKERS] max_standby_delay considered harmful
On May 9, 2010, at 13:59, Dimitri Fontaine wrote: Tom Lane t...@sss.pgh.pa.us writes: I like the proposal of a boolean because it provides only the minimal feature set of two cases that are both clearly needed and easily implementable. Whatever we do later is certain to provide a superset of those two cases. If we do something else (and that includes my own proposal of a straight lock timeout), we'll be implementing something we might wish to take back later. Taking out features after they've been in a release is very hard, even if we realize they're badly designed. That's where I thought my proposal fitted in. I fail to see us wanting to take back explicit pause/resume admin functions in any future release. Now, after having read Greg's arguments, my vote would be the following: - hot_standby_conflict_winner = queries|replay, defaults to replay - add pause/resume so that people can switch temporarily to queries - label max_standby_delay *experimental*, keep current code Adding pause/resume seems to introduce some non-trivial locking problems, though. How would you handle a pause request if the recovery process currently held a lock? Dropping the lock is not an option for correctness reasons. Otherwise you wouldn't have needed to take the lock in the first place, no? Pausing with the lock held leads to priority-inversion-like problems. Queries now might block until recovery is resumed - quite the opposite of what pause() is supposed to achieve. The only remaining option is to continue applying WAL until you reach a point where no locks are held, then pause. But from a user's POV that is nearly indistinguishable from simply setting hot_standby_conflict_winner to in the first place I think. best regards, Florian Pflug
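The "keep applying WAL until no locks are held, then pause" option can be sketched as a toy model in Python. This is an illustration only: the record format, the function name, and the idea that a single "release" record drops a lock (standing in for whatever commit or abort record releases it during replay) are all assumptions, not a description of the actual replay code.

```python
def first_lock_free_point(records, held=()):
    """Return how many records must still be replayed before pausing.

    records: iterable of (kind, relation) tuples, where kind is
             "lock", "release", or anything else (ignored).
    held:    relations whose AccessExclusiveLock the startup process
             holds when the pause is requested.
    Returns 0 if a pause is possible right away, or None if the
    available WAL never reaches a point with no locks held.
    """
    held = set(held)
    if not held:
        return 0
    for count, (kind, rel) in enumerate(records, start=1):
        if kind == "lock":
            held.add(rel)
        elif kind == "release":
            held.discard(rel)
        if not held:
            return count
    return None
```

In this model, overlapping lock lifetimes can push the pause point arbitrarily far out: if a new lock is always taken before the previous one is released, the function returns None for any finite stretch of WAL.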
Re: [HACKERS] max_standby_delay considered harmful
On Sun, May 9, 2010 at 4:00 AM, Greg Smith g...@2ndquadrant.com wrote: The use cases are covered as best they can be without better support from expected future SR features like heartbeats and XID loopback. For what it's worth I think deferring these extra complications is a very useful exercise. I would like to see a system that doesn't depend on them for basic functionality. In particular I would like to see a system that can be useful using purely WAL log shipping without streaming replication at all. I'm a bit unclear how the boolean proposal would solve things though. Surely if you set the boolean to recovery-wins then when using streaming replication with any non-idle master virtually every query would be cancelled immediately, as every HOT cleanup would cause a snapshot conflict with even short-lived queries in the slave. -- greg
Re: [HACKERS] max_standby_delay considered harmful
On Sun, May 9, 2010 at 12:47 PM, Greg Stark gsst...@mit.edu wrote: On Sun, May 9, 2010 at 4:00 AM, Greg Smith g...@2ndquadrant.com wrote: The use cases are covered as best they can be without better support from expected future SR features like heartbeats and XID loopback. For what it's worth I think deferring these extra complications is a very useful exercise. I would like to see a system that doesn't depend on them for basic functionality. In particular I would like to see a system that can be useful using purely WAL log shipping without streaming replication at all. I'm a bit unclear how the boolean proposal would solve things though. Surely if you set the boolean to recovery-wins then when using streaming replication with any non-idle master virtually every query would be cancelled immediately, as every HOT cleanup would cause a snapshot conflict with even short-lived queries in the slave. It sounds to me like what we need here is some testing. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] max_standby_delay considered harmful
On Sat, 2010-05-08 at 20:57 -0400, Tom Lane wrote: Andres Freund and...@anarazel.de writes: On Sunday 09 May 2010 01:34:18 Bruce Momjian wrote: I think everyone agrees the current code is unusable, per Heikki's comment about a WAL file arriving after a period of no WAL activity, and look how long it took our group to even understand why that fails so badly. To be honest it's not *that* hard to simply make sure WAL is generated regularly to combat that. While it surely ain't a nice workaround, it's not much of a problem either. Well, that's dumping a kluge onto users; but really that isn't the point. What we have here is a badly designed and badly implemented feature, and we need to not ship it like this so as to not institutionalize a bad design. No, you have it backwards. HS was designed to work with SR. SR unfortunately did not deliver any form of monitoring, and in doing so the keepalive that it was known HS needed was left out, although it had been on the todo list for some time. Luckily Greg and I argued to have some monitoring added and my code was used to provide barest minimum monitoring for SR, yet not enough to help HS. Of course, if one team doesn't deliver for whatever reason then others must take up the slack, if they can: no complaints. Since I personally didn't know this was going to be the case until after freeze, it is very late to resolve this situation sensibly and time has been against us. It's much harder for me to reach into the depths of another person's work and see how to add necessary mechanisms, especially when I'm working elsewhere. Even if I had done, it's likely that I would have been blocked with the "great idea, next release" response as already used on this thread. Without doubt the current mechanism suffers from the issues you mention, though the current state is not the result of bad design, merely inaction and lack of integration. We could resolve the current state in many ways, if we chose.
Bruce has used the word crippleware for the current state. Raising a problem and then blocking solutions is the best way I know to cripple a release. It should be clear that I've done my best to avoid this situation and have been active on both SR and HS. Had I not acted as I have done to date, SR would at this point slurp CPU like a bandit and be unmonitorable, both fatal flaws in production. I point this out not to argue, but to set the record straight. IMHO your assignment of blame is misplaced and your comments about poor design do not reflect how we arrived at the current state. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Sun, 2010-05-09 at 16:10 +0200, Florian Pflug wrote: Adding pause/resume seems to introduce some non-trivial locking problems, though. How would you handle a pause request if the recovery process currently held a lock? (We are only talking about AccessExclusiveLocks here. No LWLocks are held across WAL records during replay.) Just pause. There is no technical problem there. Perhaps a danger of unforeseen consequences, though doing that might also be desirable, who can say? -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] max_standby_delay considered harmful
Florian Pflug f...@phlo.org writes: The only remaining option is to continue applying WAL until you reach a point where no locks are held, then pause. But from a user's POV that is nearly indistinguishable from simply setting hot_standby_conflict_winner to in the first place I think. Not really, the use case would be using the slave as a reporting server, you know you have say 4 hours of reporting queries during which you will pause the recovery. So it's ok for the pause command to take time. What I understand the boolean option would do is to force the user into choosing either high-availability or using the slave for other purposes too. The problem is in wanting both, and that's what HS was meant to solve. Having pause/resume allows for a mixed case usage which is simple to drive and understand, yet fails to provide adaptive behavior where queries are allowed to pause recovery implicitly for a while. In my mind, that would be a compromise we could reach for 9.0, but it seems introducing those admin functions now is too far a stretch. I've been failing to understand exactly why, only getting a generic answer I find unsatisfying here, because all the alternative paths being proposed, apart from improving the documentation, are more involved code-wise. Regards, -- dim
Re: [HACKERS] max_standby_delay considered harmful
On May 9, 2010, at 21:04, Simon Riggs wrote: On Sun, 2010-05-09 at 16:10 +0200, Florian Pflug wrote: Adding pause/resume seems to introduce some non-trivial locking problems, though. How would you handle a pause request if the recovery process currently held a lock? (We are only talking about AccessExclusiveLocks here. No LWLocks are held across WAL records during replay.) Just pause. There is no technical problem there. Perhaps a danger of unforeseen consequences, though doing that might also be desirable, who can say? No technical problems perhaps, but some usability ones, no? I assume people would pause recovery to prevent it from interfering with long-running reporting queries. Now, if those queries might block indefinitely if the pause request by chance was issued while the recovery process held an AccessExclusiveLock, then the pause *caused* exactly what it was supposed to prevent. Setting hot_standby_conflict_winner to queries would at least have allowed the reporting queries to finish eventually. If AccessExclusiveLocks are taken out of the picture (they're supposed to be pretty rare on a production system anyway), setting hot_standby_conflict_winner to queries seems to act like a conditional pause request - recovery is paused as soon as it gets in the way. In this setting, the real advantage of pause would be to prevent recovery from using up all available IO bandwidth. This seems like a valid concern, but calls more for something like recovery_delay (similar to vacuum_delay) instead of pause(). best regards, Florian Pflug
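To make the recovery_delay idea a little more concrete, here is a hedged Python sketch of cost-based throttling in the style of the existing vacuum_cost_delay / vacuum_cost_limit settings. No recovery_delay setting exists; the function name, the per-record cost values, and the injected sleep callback are assumptions for illustration.

```python
def replay_throttled(record_costs, cost_limit, delay_s, sleep):
    """Replay records, napping whenever accumulated cost hits the limit.

    record_costs: estimated I/O cost of each WAL record (made-up units).
    cost_limit:   accumulated cost that triggers a nap.
    delay_s:      length of each nap in seconds.
    sleep:        callable used to nap, e.g. time.sleep; injected so the
                  logic can be exercised without actually waiting.
    Returns the number of naps taken.
    """
    balance = 0
    naps = 0
    for cost in record_costs:
        # (The record would be applied here.)
        balance += cost
        if balance >= cost_limit:
            sleep(delay_s)
            naps += 1
            balance = 0  # start accumulating again after the nap
    return naps
```

Unlike pause(), this never stops replay entirely; it only caps how much I/O replay can consume per unit of time, which matches the concern about recovery using up all available bandwidth.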
Re: [HACKERS] max_standby_delay considered harmful
On Sun, May 9, 2010 at 3:09 PM, Dimitri Fontaine dfonta...@hi-media.com wrote: Florian Pflug f...@phlo.org writes: The only remaining option is to continue applying WAL until you reach a point where no locks are held, then pause. But from a user's POV that is nearly indistinguishable from simply setting hot_standby_conflict_winner to in the first place I think. Not really, the use case would be using the slave as a reporting server, you know you have say 4 hours of reporting queries during which you will pause the recovery. So it's ok for the pause command to take time. Seems like it could take FOREVER on a busy system. Surely that's not OK. The fact that Hot Standby has to take exclusive locks that can't be released until WAL replay has progressed to a certain point seems like a fairly serious wart. We had a discussion on another thread of how this can make the database fail to shut down properly, a problem we're not addressing because we're too busy arguing about max_standby_delay. In fact, if we knew how to pause replay without leaving random locks lying around, we could rearrange the whole smart shutdown sequence so that we paused replay FIRST and then waited for all backends to exit, but the consensus on the thread where we discussed this was that we did not know how to do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Sun, 2010-05-09 at 16:01 -0400, Robert Haas wrote: The fact that Hot Standby has to take exclusive locks that can't be released until WAL replay has progressed to a certain point seems like a fairly serious wart. LOL And people lecture me about design. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On May 9, 2010, at 22:01, Robert Haas wrote: On Sun, May 9, 2010 at 3:09 PM, Dimitri Fontaine dfonta...@hi-media.com wrote: Florian Pflug f...@phlo.org writes: The only remaining option is to continue applying WAL until you reach a point where no locks are held, then pause. But from a user's POV that is nearly indistinguishable from simply setting hot_standby_conflict_winner to in the first place I think. Not really, the use case would be using the slave as a reporting server, you know you have say 4 hours of reporting queries during which you will pause the recovery. So it's ok for the pause command to take time. Seems like it could take FOREVER on a busy system. Surely that's not OK. The fact that Hot Standby has to take exclusive locks that can't be released until WAL replay has progressed to a certain point seems like a fairly serious wart. If this is a serious wart then it's not one of hot standby, but one of postgres proper. AccessExclusiveLocks (SELECT-blocking locks that is, as opposed to UPDATE/DELETE-blocking locks) are never necessary from a correctness POV, they're only there for implementation reasons. Getting rid of them doesn't seem completely insurmountable either - just as multiple row versions remove the need to block SELECTs due to concurrent UPDATEs, multiple datafile versions could remove the need to block SELECTs due to concurrent ALTERs. But people seem to live with them quite well, judged from the amount of work put into getting rid of them (zero). I therefore fail to see why they should pose a significant problem in HS setups. We had a discussion on another thread of how this can make the database fail to shut down properly, a problem we're not addressing because we're too busy arguing about max_standby_delay.
In fact, if we knew how to pause replay without leaving random locks lying around, we could rearrange the whole smart shutdown sequence so that we paused replay FIRST and then waited for all backends to exit, but the consensus on the thread where we discussed this was that we did not know how to do that. Yeah, this was exactly my line of thought too. best regards, Florian Pflug
Re: [HACKERS] max_standby_delay considered harmful
On Monday 10 May 2010 00:25:44 Florian Pflug wrote: On May 9, 2010, at 22:01 , Robert Haas wrote: On Sun, May 9, 2010 at 3:09 PM, Dimitri Fontaine dfonta...@hi-media.com wrote: Florian Pflug f...@phlo.org writes: The only remaining option is to continue applying WAL until you reach a point where no locks are held, then pause. But from a user's POV that is nearly indistinguishable from simply setting hot_standby_conflict_winner to in the first place I think. Not really, the use case would be using the slave as a reporting server, you know you have say 4 hours of reporting queries during which you will pause the recovery. So it's ok for the pause command to take time. Seems like it could take FOREVER on a busy system. Surely that's not OK. The fact that Hot Standby has to take exclusive locks that can't be released until WAL replay has progressed to a certain point seems like a fairly serious wart. If this is a serious wart then it's not one of hot standby, but one of postgres proper. AccessExclusiveLocks (SELECT-blocking locks that is, as opposed to UPDATE/DELETE-blocking locks) are never necessary from a correctness POV, they're only there for implementation reasons. Getting rid of them doesn't seem completely insurmountable either - just as multiple row versions remove the need to block SELECTs due to concurrent UPDATEs, multiple datafile versions could remove the need to block SELECTs due to concurrent ALTERs. But people seem to live with them quite well, judged from the amount of work put into getting rid of them (zero). I therefore fail to see why they should pose a significant problem in HS setups. The difference is that in HS you have to wait for a moment where *no exclusive lock at all* exists, possibly without contending for any of them, while on the master you might not even be blocked by the existence of any of those locks.
If you have two sessions which in overlapping transactions lock different tables exclusively you have no problem shutting the master down, but you will never reach a point where no exclusive lock is taken on the slave. Andres
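The point about overlapping transactions can be made concrete with a small sketch. This is not PostgreSQL code: the function name `lock_free_gaps` and the lock intervals are made up for illustration, and time is just an abstract number. It only shows that two sessions, each of which releases its own lock regularly, can together leave no instant at which the standby holds no exclusive lock.

```python
def lock_free_gaps(intervals, horizon):
    """Return the spans within [0, horizon) during which no exclusive lock is held.

    intervals: (start, end) pairs, one per period in which some session holds
    an exclusive lock on some table (hypothetical data, not read from a server).
    """
    gaps = []
    cursor = 0  # everything before `cursor` is already accounted for
    for start, end in sorted(intervals):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < horizon:
        gaps.append((cursor, horizon))
    return gaps


# Session A locks table t1, session B locks table t2; their transactions overlap.
session_a = [(0, 10), (15, 25), (30, 40)]
session_b = [(5, 20), (22, 35), (38, 50)]

print(lock_free_gaps(session_a, 50))              # A alone leaves gaps
print(lock_free_gaps(session_b, 50))              # so does B alone
print(lock_free_gaps(session_a + session_b, 50))  # together: no gap at all
```

On the master neither session blocks the other, since they lock different tables. A "pause when no lock is held" rule on the standby, however, would wait for a gap in the combined list, and in this example there is none.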
Re: [HACKERS] max_standby_delay considered harmful
On Sun, May 9, 2010 at 6:58 PM, Andres Freund and...@anarazel.de wrote: On Monday 10 May 2010 00:25:44 Florian Pflug wrote: On May 9, 2010, at 22:01 , Robert Haas wrote: On Sun, May 9, 2010 at 3:09 PM, Dimitri Fontaine dfonta...@hi-media.com wrote: Florian Pflug f...@phlo.org writes: The only remaining option is to continue applying WAL until you reach a point where no locks are held, then pause. But from a user's POV that is nearly indistinguishable from simply setting hot_standby_conflict_winner to in the first place I think. Not really, the use case would be using the slave as a reporting server, you know you have say 4 hours of reporting queries during which you will pause the recovery. So it's ok for the pause command to take time. Seems like it could take FOREVER on a busy system. Surely that's not OK. The fact that Hot Standby has to take exclusive locks that can't be released until WAL replay has progressed to a certain point seems like a fairly serious wart. If this is a serious wart then it's not one of hot standby, but one of postgres proper. AccessExclusiveLocks (SELECT-blocking locks that is, as opposed to UPDATE/DELETE-blocking locks) are never necessary from a correctness POV, they're only there for implementation reasons. Getting rid of them doesn't seem completely insurmountable either - just as multiple row versions remove the need to block SELECTs dues to concurrent UPDATEs, multiple datafile versions could remove the need to block SELECTs due to concurrent ALTERs. But people seem to live with them quite well, judged from the amount of work put into getting rid of them (zero). I therefore fail to see why they should pose a significant problem in HS setups. The difference is that in HS you have to wait for a moment where *no exclusive lock at all* exist, possibly without contending for any of them, while on the master you might not even blocked by the existence of any of those locks. 
If you have two sessions which in overlapping transactions lock different tables exlusively you have no problem shutting the master down, but you will never reach a point where no exclusive lock is taken on the slave. A possible solution to this in the shutdown case is to kill anyone waiting on a lock held by the startup process at the same time we kill the startup process, and to kill anyone who subsequently waits for such a lock as soon as they attempt to take it. I'm not sure if this would also make sense in the pause case. Another possible solution would be to try to figure out if there's a way to delay application of WAL that requires the taking of AELs to the point where we could apply it all at once. That might not be feasible, though, or only in some cases, and it's certainly 9.1 material (at least) in any case. Anyway, this is all a little off-topic. We need to get back to arguing about how best to cut the legs out from under a feature that's been in the tree for six months but Tom didn't get around to looking at until last week. I'll restate my position: now that I understand what the issues are (I think), the feature as currently implemented seems pretty wonky, but cutting it down to a boolean seems like an exercise in excessive pessimism about our ability to predict future development directions, as well as possibly quite inconvenient for people attempting to use Hot Standby. Therefore I think we should adopt Tom's original proposal (with +1 also from Stephen Frost), but that doesn't seem likely to fly because, on the one hand, we have Tom himself arguing (along with Bruce and possibly Heikki) that we should whack it down all the way to a boolean; and on the other hand Simon and Greg Smith and I think also Andres Freund and Kevin Grittner arguing that the original feature is OK as-is. 
Other people who weighed in include Stefan Kaltenbrunner (who opined that Tom had a legitimate complaint about the current design but didn't vote for a specific resolution), Greg Sabino Mullane (who pointed out that SOME of the issues that Tom raised could be solved with proper time synchronization), Josh Drake (who thought requiring NTP to be working was a bad idea, and therefore presumably favors changing something), Josh Berkus (who changed his vote at least once and whose priority seems to have to do with releasing before the turn of the century than with the actual technical option we select, apologies if I'm misreading his emails), Greg Stark (who seems to think that a boolean will be bad news but didn't specifically vote for another option), Dimitri Fontaine (who wants a boolean plus pause/resume functions, or maybe a plugin facility of some kind), Rob Wultsch (who doesn't ever want to kill queries and therefore would be happy with a boolean), Yeb Havinga (who never wants to stall recovery and therefore would also be happy with a boolean), and Florian Pflug (who points out that pause/resume is actually a nontrivial feature). Apologies if I've
Re: [HACKERS] max_standby_delay considered harmful
On Thu, 2010-05-06 at 12:03 -0700, Josh Berkus wrote: So changing to a lock-based mechanism or designing a plugin interface are really not at all realistic at this date. I agree that changing to a lock-based mechanism is too much at this stage of development. However, putting in a plugin is trivial. We could do it if we choose, without instability or risk. It is as easy a change as option (1). It's not complex to design because it would use the exact same API as the internal conflict resolution module already does; we can just move the current conflict code straight into a contrib module. This can be done bug-free in about 3 hours work. There is no performance issue associated with that either. Plugins would allow all of the various mechanisms requested on list over 18 months, nor would they prevent including some of those options within the core at a later date. Without meaning to cause further contention, it is very clear that putting in contrib modules isn't bad after all, so there is no material argument against the plugin approach. I recognise that plugins for some reason ignite unstated fears, by observation that there is always an argument every time I mention them. I invite an explanation of that off-list. Realistically, we have two options at this point: 1) reduce max_standby_delay to a boolean. 2) have a delay option (based either on WAL glob start time or on system time) like the current max_standby_delay, preferably with some bugs fixed. With a plugin option, I would not object to option 1. If option 1 was taken, without plugins, it's clear that would be against consensus. Having said that, I'll confirm now there will not be an extreme reaction from me if option (1) was forced, nor do I counsel that from others. I said it before and I'll say it again: release early, release often. None of this needs to delay release. 
-- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] max_standby_delay considered harmful
Simon Riggs wrote: With a plugin option, I would not object to option 1. If option 1 was taken, without plugins, it's clear that would be against consensus. Having said that, I'll confirm now there will not be an extreme reaction from me if option (1) was forced, nor do I counsel that from others. I found this email amusing. You phrase it like the community is supposed to be worried by an objection from you or an extreme reaction; I certainly am not. You have been in the community long enough to not use such phrasing. This is not the first time I have complained about this. I have no idea why an objection from you should mean more than an objection from anyone else in the community, and I have no idea what an extreme reaction means, or why anyone should care. Do you think the community is negotiating with you? I think the consensus is to change this setting to a boolean. If you don't want to do it, I am sure we can find someone who will. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] max_standby_delay considered harmful
On Sat, May 8, 2010 at 2:48 PM, Bruce Momjian br...@momjian.us wrote: I think the consensus is to change this setting to a boolean. If you don't want to do it, I am sure we can find someone who will. I still think we should revert to Tom's original proposal. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] max_standby_delay considered harmful
Robert Haas wrote: On Sat, May 8, 2010 at 2:48 PM, Bruce Momjian br...@momjian.us wrote: I think the consensus is to change this setting to a boolean. If you don't want to do it, I am sure we can find someone who will. I still think we should revert to Tom's original proposal. And Tom's proposal was to do it on WAL slave arrival time? If we could get agreement from everyone that that is the proper direction, fine, but I am hearing things like plugins, and other complexity that makes it seem we are not getting closer to an agreed solution, and without agreement, the simplest approach seems to be just to remove the part we can't agree upon. I think the big question is whether this issue is significant enough that we should ignore our policy of no feature design during beta. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] max_standby_delay considered harmful
On Sat, May 8, 2010 at 3:40 PM, Bruce Momjian br...@momjian.us wrote: Robert Haas wrote: On Sat, May 8, 2010 at 2:48 PM, Bruce Momjian br...@momjian.us wrote: I think the consensus is to change this setting to a boolean. If you don't want to do it, I am sure we can find someone who will. I still think we should revert to Tom's original proposal. And Tom's proposal was to do it on WAL slave arrival time? If we could get agreement from everyone that that is the proper direction, fine, but I am hearing things like plugins, and other complexity that makes it seem we are not getting closer to an agreed solution, and without agreement, the simplest approach seems to be just to remove the part we can't agree upon. I think the big question is whether this issue is significant enough that we should ignore our policy of no feature design during beta. Tom's proposal was basically to define recovery_process_lock_timeout. The recovery process would wait X seconds for a lock, then kill whoever held it. It's not the greatest knob in the world for the reasons already pointed out, but I think it's still better than a boolean and will be useful to some users. And it's pretty simple. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
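For readers trying to picture the "wait X seconds for a lock, then kill whoever held it" rule, here is a minimal model of it. It is a sketch under stated assumptions, not the proposed patch: the function names, the backend names and the idea of treating each lock request independently are all invented for illustration, and nothing here talks to a real server.

```python
def wait_for_lock(holders, lock_timeout):
    """Model one lock request by the recovery process.

    holders: dict mapping a backend name to the seconds it would keep the
    conflicting lock if left alone (hypothetical input).
    Returns (seconds replay is stalled, backends cancelled at the timeout).
    """
    longest = max(holders.values(), default=0)
    stalled = min(longest, lock_timeout)
    cancelled = sorted(name for name, left in holders.items() if left > lock_timeout)
    return stalled, cancelled


def replay(lock_requests, lock_timeout):
    """Apply the policy to a sequence of lock requests; the stalls add up."""
    total, killed = 0, []
    for holders in lock_requests:
        stalled, cancelled = wait_for_lock(holders, lock_timeout)
        total += stalled
        killed.extend(cancelled)
    return total, killed


requests = [{"report_1": 5}, {"report_2": 120}, {"report_3": 45, "report_4": 10}]
print(replay(requests, lock_timeout=30))
```

In this model each individual wait is capped at the timeout, but the total stall across several lock requests is not, so the knob bounds the delay per lock rather than how far the standby falls behind overall.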
Re: [HACKERS] max_standby_delay considered harmful
Bruce Momjian br...@momjian.us writes: I have no idea why an objection from you should mean more than an objection from anyone else in the community, and I have no idea what an extreme reaction means, or why anyone should care. Maybe I shouldn't say anything here. But clearly while you're spot on that Simon's objection is worth just as much as any other contributor's, I disagree that we shouldn't care about the way those people feel about being a member of our community. I appreciate your efforts to avoid having anyone here use such a wording but I can't help to dislike your argument for it. I hope that's simply a localisation issue (l10n is so much harder than i18n). Anyway, I so much hate reading such exchanges here that I couldn't help ranting about it. Back to suitable -hackers content. I think the concensus is to change this setting to a boolean. If you don't want to do it, I am sure we can find someone who will. I don't think so. I understand the current state to be: a. this problem is not blocking beta, but a must fix before release b. we either have to change the API or the behavior c. only one behavior change has been proposed, by Tom d. proposed behavior would favor queries rather than availability e. API change 1 is boolean + explicit pause/resume command f. API change 2 is boolean + plugin facility, with a contrib for current behavior. g. API change 3 is boolean only I don't remember reading any mail on this thread bearing consensus on the choices above, but rather either one of us pushing for his vision or people defending the current situation, complaining about it or asking that a reasonable choice is made soon. If we have to choose between reasonable and soon, soon won't be my vote. Beta is meant to last more or less 3 months after all. Each party's standing is clear. Decision remains to be made, and I guess that the one writing the code will have a much louder voice. 
Regards, -- dim
Re: [HACKERS] max_standby_delay considered harmful
Robert Haas wrote: On Sat, May 8, 2010 at 3:40 PM, Bruce Momjian br...@momjian.us wrote: Robert Haas wrote: On Sat, May 8, 2010 at 2:48 PM, Bruce Momjian br...@momjian.us wrote: I think the consensus is to change this setting to a boolean. If you don't want to do it, I am sure we can find someone who will. I still think we should revert to Tom's original proposal. And Tom's proposal was to do it on WAL slave arrival time? If we could get agreement from everyone that that is the proper direction, fine, but I am hearing things like plugins, and other complexity that makes it seem we are not getting closer to an agreed solution, and without agreement, the simplest approach seems to be just to remove the part we can't agree upon. I think the big question is whether this issue is significant enough that we should ignore our policy of no feature design during beta. Tom's proposal was basically to define recovery_process_lock_timeout. The recovery process would wait X seconds for a lock, then kill whoever held it. It's not the greatest knob in the world for the reasons already pointed out, but I think it's still better than a boolean and will be useful to some users. And it's pretty simple. I thought there was concern about lock stacking causing unpredictable/unbounded delays. I am not sure boolean has a majority vote, but I am suggesting that because it is the _minimal_ feature set, and when we can't agree during beta, the minimal feature set seems like the best choice. Clearly, anything is more feature-full than boolean --- the big question is whether Tom's proposal is significantly better than boolean that we should spend the time designing and implementing it, with the possibility it will all be changed in 9.1. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] max_standby_delay considered harmful
Bruce Momjian wrote: I think the big question is whether this issue is significant enough that we should ignore our policy of no feature design during beta The idea that you're considering removal of a feature that we already have people using in beta and making plans around is a policy violation too you know. A freeze should include not cutting things just because their UI or implementation is not ideal yet. And you've been using the word consensus here when there is no such thing. At best there's barely a majority here among people who have stated an opinion, and consensus means something much stronger even than that; that means something closer to unanimity. I thought the summary of where the project is at Josh wrote at http://archives.postgresql.org/message-id/4be31279.7040...@agliodbs.com was excellent, both from a technical and a process commentary standpoint. I'd be completely happy to follow that plan, and then we'd be at a consensus--with no one left arguing. It was very clear back in February that if SR didn't hit the feature set to make HS less troublesome out of the box, there would be some limitations here, and that set of concerns hasn't changed much since then. I thought the backup plan if we didn't get things like xid feedback was to keep the capability as written anyway, knowing that it's still much better than no control over cancellation timing available at all. Keep improving documentation around its issues, and continue to hack away at them in user space and in the field. Then we do better for 9.1. You seem bent on removing the feedback part of that cycle. The full statement of the ESR bit Josh was quoting is Release early. Release often. And listen to your customers.[1] My customers include some of whom believed the PostgreSQL community process enough to contribute toward the HS development that's been completed and donated to the project. They have a pretty clear view on this I'm relaying when I talk about what I'd like to see happen. 
They are saying they cannot completely ignore their requirements for HA failover, but would be willing to loosen them just a bit (increasing failover time slightly) if it reduces the odds of query cancellation, and therefore improves how much load they can expect to push toward the standby. max_standby_delay is a currently available mechanism that does that. I'm not going to be their nanny and say no, that's not perfectly predictable, you might get a query canceled sometimes when you don't expect it anyway. Instead, I was hoping to let them deploy 9.0 with this option available (but certainly not the default), informed of the potential risks, see how that goes. We can confirm whether the userland workarounds we believe will be effective here really are. If so, then we can solider forward directly incorporating them into the server code, knowing that works. If not, switch to one of the safer modes, see if there's something better to use altogether in 9.1, and perhaps this whole approach gets removed. That's healthy development progress either way. Upthread Bruce expressed some concern that this was going to live forever once deployed. There is no way I'm going to let this behavior continue to be available in 9.1 if field tests say the workarounds aren't good enough. That's going to torture all of us who do customer deployments of this technology every day if that turns out to be the case, and nobody is going to feel the heat from that worse than 2ndQuadrant. I did a round once of removing GUCs that didn't do what they were expected to in the field before, based on real-world tests showing regular misuse, and I'll do it again if this falls into that same category. We've already exposed this release to a whole stack of risk from work during its development cycle, risk that doesn't really drop much just from cutting this one bit. I'd at least like to get all the reward possible from that risk, which I expected to include feedback in this area. 
Circumventing the planned development process by dropping this now will ruin how I expected the project to feel out the right thing on the user side, and we'll all be left with little more insight for what to do in 9.1 than we have now. And I'm not looking forward to explaining to people why a feature they've been seeing and planning to deploy for months has now been cut only after what was supposed to be a freeze for beta. [1] http://catb.org/esr/writings/homesteading/cathedral-bazaar/ar01s04.html , and this particular bit is quite relevant here: Linus was keeping his hacker/users constantly stimulated and rewarded—stimulated by the prospect of having an ego-satisfying piece of the action, rewarded by the sight of constant (even daily) improvement in their work. Linus was directly aiming to maximize the number of person-hours thrown at debugging and development, even at the possible cost of instability in the code and user-base burnout
Re: [HACKERS] max_standby_delay considered harmful
Greg Smith wrote: Bruce Momjian wrote: I think the big question is whether this issue is significant enough that we should ignore our policy of no feature design during beta The idea that you're considering removal of a feature that we already have people using in beta and making plans around is a policy violation too you know. A freeze should include not cutting things just because their UI or implementation is not ideal yet. And you've been using the word consensus here when there is no such thing. At best there's barely a majority here among people who have stated an opinion, and consensus means something much stronger even than that; that means something closer to unanimity. I thought the summary of where the project is at Josh wrote at http://archives.postgresql.org/message-id/4be31279.7040...@agliodbs.com was excellent, both from a technical and a process commentary standpoint. I'd be completely happy to follow that plan, and then we'd be at a consensus--with no one left arguing. I can't argue with anything you have said in your email. The big question is whether designing during beta is worth it in this case, and whether we can get something that is useful and gives us useful feedback for 9.1, and is it worth spending the time to figure this out during beta? If we can, great, let's do it, but I have not seen that yet, and I am unclear how long we should keep trying to find it. I think everyone agrees the current code is unusable, per Heikki's comment about a WAL file arriving after a period of no WAL activity, and look how long it took our group to even understand why that fails so badly. I thought Tom's idea had problems, and there were ideas of how to improve it. It just seems like we are drifting around on something that has no easy solution, and not something that we are likely to hit during beta where we should be focusing on the release. 
Saying we have three months to fix this during beta seems like a recipe for delaying the final release, and this feature is not worth that. What we could do is to convert max_standby_delay to a boolean, 'ifdef' out the code that was handling non-boolean cases, and then if someone wants to work on a patch in a corner and propose something in a month that improves this, we can judge the patch on its own merits, and apply it if it is a great benefit, because basically that is what we are doing now if we fix this --- adding a new patch/feature during beta. (Frankly, because we are not requiring an initdb during beta, I am unclear how we are going to rename max_standby_delay to behave as a boolean.) It is great if we can get a working max_standby_delay, but I fear drifting/distraction at this stage. -- Bruce Momjian br...@momjian.ushttp://momjian.us EnterpriseDB http://enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Sunday 09 May 2010 01:34:18 Bruce Momjian wrote: I think everyone agrees the current code is unusable, per Heikki's comment about a WAL file arriving after a period of no WAL activity, and look how long it took our group to even understand why that fails so badly. To be honest, it's not *that* hard to simply make sure WAL is generated regularly to combat that. While it surely ain't a nice workaround, it's not much of a problem either. Andres
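A rough sketch of why regular WAL generation helps, under one explicit assumption: that the standby measures its delay against the timestamp carried by the last WAL it received, rather than against when the conflict started. The thread only says that a WAL file arriving after a quiet period "fails so badly"; the formula, the function name `grace_period` and the numbers below are illustrative guesses at that mechanism, not the server's actual code.

```python
def grace_period(now, last_wal_timestamp, max_standby_delay):
    """Seconds a conflicting query may keep running before it is cancelled,
    if delay is measured against the timestamp of the last WAL received.
    All arguments are plain seconds on one clock (an assumption of this sketch).
    """
    apparent_lag = now - last_wal_timestamp
    return max(0, max_standby_delay - apparent_lag)


# Master idle for 10 minutes, then a conflicting WAL record arrives.
print(grace_period(now=1000, last_wal_timestamp=400, max_standby_delay=30))
# Same situation, but something on the master generated WAL 10 seconds ago.
print(grace_period(now=1000, last_wal_timestamp=990, max_standby_delay=30))
```

If that assumption holds, a quiet master makes the standby look far behind even though it has applied everything, so queries get no grace at all. A periodic dummy write on the master keeps the apparent lag small, which is the workaround described above.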
Re: [HACKERS] max_standby_delay considered harmful
Andres Freund and...@anarazel.de writes: On Sunday 09 May 2010 01:34:18 Bruce Momjian wrote: I think everyone agrees the current code is unusable, per Heikki's comment about a WAL file arriving after a period of no WAL activity, and look how long it took our group to even understand why that fails so badly. To be honest, it's not *that* hard to simply make sure WAL is generated regularly to combat that. While it surely ain't a nice workaround, it's not much of a problem either. Well, that's dumping a kluge onto users; but really that isn't the point. What we have here is a badly designed and badly implemented feature, and we need to not ship it like this so as to not institutionalize a bad design. I like the proposal of a boolean because it provides only the minimal feature set of two cases that are both clearly needed and easily implementable. Whatever we do later is certain to provide a superset of those two cases. If we do something else (and that includes my own proposal of a straight lock timeout), we'll be implementing something we might wish to take back later. Taking out features after they've been in a release is very hard, even if we realize they're badly designed. regards, tom lane
Re: [HACKERS] max_standby_delay considered harmful
Tom Lane wrote: Taking out features after they've been in a release is very hard, even if we realize they're badly designed. It doesn't have to be; that's the problem the release often part takes care of. If a release has only been out a year, and a new one comes out saying oh, that thing we released for the first time in the last version, it didn't work as well as we'd hoped in the field; you should try to avoid that and use this new implementation that works better instead once you can upgrade, that's not only not hard, it's exactly what people using a X.0 release expect to happen. I've read the message from you that started off this thread several times now. Your low-level code implementation details shared later obviously need to be addressed. But all of the fundamental and fatal issues you mentioned at the start continue to strike me as either situations where you don't agree with the use case this was designed for, or spots where you feel the userland workarounds required to make it work right are too onerous. Bruce's objections seem to fall mainly into the latter category. I've been wandering around talking to people about that exact subject--what do people want and expect from Hot Standby, and what would they do to gain its benefits--for over six months now, independently of Simon's work which did a lot of that before me too. The use cases are covered as best they can be without better support from expected future SR features like heartbeats and XID loopback. As for the workarounds required to make things work, the responses I get match what we just saw from Andres. When the required details are explained, people say that's annoying but I can do that, and off we go. There are significant documentation issues I know need to be cleaned up here, and I've already said I'll take care of that as soon as freeze is really here and I have a stable target. 
(That this discussion is still going on says that's not yet) What I fail to see are problems significant enough to not ship the parts of this feature that are done, so that it can be used by those it is appropriate for, allow feedback, and make it easy to test individual improvements upon what's already there. I can't make you prioritize based on what people are telling me. All I can do is suggest you reconsider handing control over the decision to use this feature or not to the users of the software, so they can make their own choice. I'm tired of arguing about this instead of doing productive work, and I've done all I can here to try and work within the development process of the community. If talk of removing the max_standby_delay feature clears up, I'll happily provide my promised round of documentation updates, to make its limitations and associated workarounds as clear as they can be, within a week of being told go on that. If instead this capability goes away, making those moot, I'll maintain my own release for the 2ndQuadrant customers who have insisted they need this capability if I have to. That would be really unfortunate, because the only bucket I can pull time out of for that is the one I currently allocate to answering questions on the mailing lists here most days. I'd rather spend that helping out the PostgreSQL community, but we do need to deliver what our customers want too. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
Greg Smith wrote: Tom Lane wrote: Taking out features after they've been in a release is very hard, even if we realize they're badly designed. It doesn't have to be; that's the problem the "release often" part takes care of. If a release has only been out a year, and a new one comes out saying "oh, that thing we released for the first time in the last version, it didn't work as well as we'd hoped in the field; you should try to avoid that and use this new implementation that works better instead once you can upgrade", that's not only not hard, it's exactly what people using an X.0 release expect to happen. I think this is the crux of the issue. Tom and I are saying that historically we have shipped only complete features, or as complete as reasonable, and have removed items during beta that we found didn't meet these criteria, in an attempt to reduce the amount of feature set churn in Postgres. A database is complex, so modifying the API between major releases is something we only do when we find a significant benefit. In this case, if we keep max_standby_delay as non-boolean, we know it will have to be redesigned in 9.1, and it is unclear to me what additional knowledge we will gain by shipping it in 9.0, except to have to tell people that it doesn't work well or requires complex work-arounds, and that doesn't thrill any of us. (I already suggested that statement_timeout might supply a reasonable and predictable workaround for non-boolean usage of max_standby_delay.) -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] max_standby_delay considered harmful
On Sat, May 8, 2010 at 6:51 PM, Bruce Momjian br...@momjian.us wrote: Robert Haas wrote: On Sat, May 8, 2010 at 3:40 PM, Bruce Momjian br...@momjian.us wrote: Robert Haas wrote: On Sat, May 8, 2010 at 2:48 PM, Bruce Momjian br...@momjian.us wrote: I think the consensus is to change this setting to a boolean. If you don't want to do it, I am sure we can find someone who will. I still think we should revert to Tom's original proposal. And Tom's proposal was to do it on WAL slave arrival time? If we could get agreement from everyone that that is the proper direction, fine, but I am hearing things like plugins, and other complexity that makes it seem we are not getting closer to an agreed solution, and without agreement, the simplest approach seems to be just to remove the part we can't agree upon. I think the big question is whether this issue is significant enough that we should ignore our policy of no feature design during beta. Tom's proposal was basically to define recovery_process_lock_timeout. The recovery process would wait X seconds for a lock, then kill whoever held it. It's not the greatest knob in the world for the reasons already pointed out, but I think it's still better than a boolean and will be useful to some users. And it's pretty simple. I thought there was concern about lock stacking causing unpredictable/unbounded delays. I am not sure boolean has a majority vote, but I am suggesting that because it is the _minimal_ feature set, and when we can't agree during beta, the minimal feature set seems like the best choice. Clearly, anything is more feature-full than boolean --- the big question is whether Tom's proposal is significantly better than boolean that we should spend the time designing and implementing it, with the possibility it will all be changed in 9.1. I doubt it's likely to be thrown out completely. We might decide to fine-tune it in some way. My fear is that if we ship this with only a boolean, we're shipping crippleware. 
If that fear turns out to be unfounded, I will of course be happy, but that's my concern, and I don't believe that it's entirely unfounded. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
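The lock-stacking concern raised against the per-lock timeout can be made concrete with a little arithmetic. The sketch below is only an illustration of the idea as described in this message (wait up to X seconds for each lock, then kill the holder); the function name and its parameters are hypothetical and do not correspond to any PostgreSQL code.

```python
from typing import Iterable


def replay_stall_s(holder_runtimes_s: Iterable[float], lock_timeout_s: float) -> float:
    """Total seconds recovery spends waiting across a series of lock conflicts.

    Hypothetical model: for each conflicting lock, recovery waits until the
    holder finishes on its own or until the per-lock timeout expires, at
    which point the holder is cancelled and replay continues.
    """
    return sum(min(remaining, lock_timeout_s) for remaining in holder_runtimes_s)


# One short query and two long ones, with a 30 second per-lock timeout.
print(replay_stall_s([5, 100, 100], 30))

# Ten long-running holders in a row: the stall grows with the number of
# conflicts, which is why a per-lock timeout does not bound total replay lag.
print(replay_stall_s([100] * 10, 30))
```

Under these assumptions the total stall is bounded only by the number of conflicts times the timeout, which is the sense in which the delay is unbounded.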
Re: [HACKERS] max_standby_delay considered harmful
Robert Haas wrote: Clearly, anything is more feature-full than boolean --- the big question is whether Tom's proposal is significantly better than boolean that we should spend the time designing and implementing it, with the possibility it will all be changed in 9.1. I doubt it's likely to be thrown out completely. We might decide to fine-tune it in some way. My fear is that if we ship this with only a boolean, we're shipping crippleware. If that fear turns out to be unfounded, I will of course be happy, but that's my concern, and I don't believe that it's entirely unfounded. Well, historically, we have been willing to not ship features if we can't get it right. No one has ever accused us of crippleware, but our hesitancy has caused slower user adoption, though long-term, it has helped us grow a dedicated user base that trusts us. -- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com
Re: [HACKERS] max_standby_delay considered harmful
On Sun, May 9, 2010 at 12:08 AM, Bruce Momjian br...@momjian.us wrote: Robert Haas wrote: Clearly, anything is more feature-full than boolean --- the big question is whether Tom's proposal is significantly better than boolean that we should spend the time designing and implementing it, with the possibility it will all be changed in 9.1. I doubt it's likely to be thrown out completely. We might decide to fine-tune it in some way. My fear is that if we ship this with only a boolean, we're shipping crippleware. If that fear turns out to be unfounded, I will of course be happy, but that's my concern, and I don't believe that it's entirely unfounded. Well, historically, we have been willing to not ship features if we can't get it right. No one has ever accused us of crippleware, but our hesitancy has caused slower user adoption, though long-term, it has helped us grow a dedicated user base that trusts us. We can make the decision to not ship the feature if the feature is max_standby_delay. But I think the feature is Hot Standby, which I think we've pretty much committed to shipping. And I am concerned that if the only mechanism for controlling query cancellation vs. recovery lag is a boolean, people will feel that we didn't get Hot Standby right. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] max_standby_delay considered harmful
On Wed, May 5, 2010 at 9:32 PM, Robert Haas robertmh...@gmail.com wrote: On Wed, May 5, 2010 at 11:50 PM, Bruce Momjian br...@momjian.us wrote: If someone wants to suggest that HS is useless if max_standby_delay supports only boolean values, I am ready to suggest we remove HS as well and head to 9.0 because that would suggest that HS itself is going to be useless. I think HS is going to be a lot less useful than many people think, at least in 9.0. But I think ripping out max_standby_delay will make it worse. The code will not be thrown away; we will bring it back for 9.1. If that's the case, then taking it out makes no sense. mysql dba troll I manage a bunch of different environments and I am pretty sure that in any of them if the db started seemingly randomly killing queries I would have application teams followed quickly by executives coming after me with torches and pitchforks. I cannot imagine setting this value to anything other than a bool and most of the time that bool would be -1. I would only be unleashing a kill storm in utter desperation and I would probably need to explain myself in detail after. Utter desperation means I am sure I am going to have to do an impactful failover at any moment and need a slave completely up to date NOW. It is good to have the option to automatically cancel queries, but I think it is a mistake to assume many people will use it. What I would really need for instrumentation is the ability to determine *easily* how much a slave is lagging in clock time. /mysql dba troll -- Rob Wultsch wult...@gmail.com
Re: [HACKERS] max_standby_delay considered harmful
On Thu, 2010-05-06 at 00:47 -0400, Robert Haas wrote: That just doesn't sound that bad to me, especially since the proposed alternative is: - Queries will get cancelled like crazy, period. Or else: - Replication can fall infinitely far behind and you can write a tedious and error-prone script to try to prevent it if you like. I think THAT is going to tarnish our reputation. Yes, that will. There is no consensus to remove max_standby_delay. It could be improved with minor adjustments and it makes more sense to allow a few of those, treating them as bugs. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Wed, 2010-05-05 at 23:15 -0700, Rob Wultsch wrote: I manage a bunch of different environments and I am pretty sure that in any of them if the db started seemingly randomly killing queries I would have application teams followed quickly by executives coming after me with torches and pitchforks. Fully understood and well argued, thanks for your input. HS doesn't randomly kill queries and there are documented work-arounds to control this behaviour. Removing the parameter won't help the situation at all, it will make the situation *worse* by removing control from where it's clearly needed and removing all hope of making the HS feature work in practice. There is no consensus to remove the parameter. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
Greg Smith g...@2ndquadrant.com writes: If you need a script that involves changing a server setting to do something, that translates into "you can't do that" for a typical DBA. The idea of a program regularly changing a server configuration setting on a production system is one you just can't sell. That makes this idea incredibly more difficult to use in the field than any of the workarounds that cope with the known max_standby_delay issues. I still think that the best API we can do in a timely fashion for 9.0 is: standby_conflict_winner = replay|queries and pg_pause_recovery() / pg_resume_recovery(). It seems to me those two functions are only exposing existing facilities in the code, so that's more an API change than a new feature inclusion. Of course I'm certainly wrong. But the code has already been written. I don't think we'll find anything better to offer our users in the right time frame. Now I'll try to step back and stop repeating myself in the void :) Regards, -- dim
Re: [HACKERS] max_standby_delay considered harmful
On May 6, 2010, at 11:26 , Dimitri Fontaine wrote: Greg Smith g...@2ndquadrant.com writes: If you need a script that involves changing a server setting to do something, that translates into "you can't do that" for a typical DBA. The idea of a program regularly changing a server configuration setting on a production system is one you just can't sell. That makes this idea incredibly more difficult to use in the field than any of the workarounds that cope with the known max_standby_delay issues. I still think that the best API we can do in a timely fashion for 9.0 is: standby_conflict_winner = replay|queries and pg_pause_recovery() / pg_resume_recovery(). It seems to me those two functions are only exposing existing facilities in the code, so that's more an API change than a new feature inclusion. Of course I'm certainly wrong. But the code has already been written. If there was an additional SQL-callable function that returned the backends the recovery process is currently waiting for, plus one that reported the last timestamp seen in the WAL, then all those different cancellation policies could be implemented as daemons that monitor recovery and kill backends as needed, no? That would allow people to experiment with different cancellation policies, and maybe shed some light on what the useful policies are in practice. best regards, Florian Pflug
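To make the daemon idea above a little more tangible, here is a minimal sketch of one cancellation policy such a monitor might apply. Everything in it is an assumption for illustration: the two inputs stand in for the SQL-callable functions proposed in this message (the blocking backends and the last WAL timestamp), which do not exist, and the policy itself is just one of many a user might try.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class BlockingBackend:
    """A standby backend that recovery is currently waiting on (hypothetical)."""
    pid: int
    query_start: datetime


def backends_to_cancel(
    blockers: List[BlockingBackend],
    last_wal_time: datetime,
    now: datetime,
    max_lag: timedelta,
) -> List[int]:
    """Return the PIDs to cancel, oldest query first, once replay lags too far.

    Policy sketch: leave everything alone while the standby is within
    max_lag of the last WAL timestamp; beyond that, cancel all blockers.
    """
    if now - last_wal_time <= max_lag:
        return []
    ordered = sorted(blockers, key=lambda b: b.query_start)
    return [b.pid for b in ordered]
```

A real daemon would poll these values in a loop and signal the returned PIDs; a different policy (for example, cancelling only the single oldest blocker per poll) would change only the last two lines.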
Re: [HACKERS] max_standby_delay considered harmful
Rob Wultsch wrote: I manage a bunch of different environments and I am pretty sure that in any of them if the db started seemingly randomly killing queries I would have application teams followed quickly by executives coming after me with torches and pitchforks. I cannot imagine setting this value to anything other than a bool and most of the time that bool would be -1. I would only be unleashing a kill storm in utter desperation and I would probably need to explain myself in detail after. Utter desperation means I am sure I am going to have to do an impactful failover at any moment and need a slave completely up to date NOW. That's funny because when I was reading this thread, I was thinking the exact opposite: having max_standby_delay always set to 0 so I know the standby server is as up-to-date as possible. The application that accesses the hot standby has to be 'special' anyway because it might deliver not-up-to-date data. If that information about specialties regarding querying the standby server includes the warning that queries might get cancelled, they can opt for a retry themselves (is there a special return code to catch that case? like PGRES_RETRY_LATER) or a message to the user that their report is currently unavailable and they should retry in a few minutes. regards, Yeb Havinga
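The application-side retry described above could look roughly like the following. This is a sketch under stated assumptions: QueryCanceledOnStandby is a made-up stand-in for whatever error a given client driver raises when the standby cancels a query, and no PGRES_RETRY_LATER style return code is assumed to exist.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class QueryCanceledOnStandby(Exception):
    """Hypothetical stand-in for the driver error raised on cancellation."""


def run_with_retry(query: Callable[[], T], attempts: int = 3, backoff_s: float = 0.0) -> T:
    """Run query(), retrying when the standby cancels it.

    Re-raises the cancellation after the final attempt so the caller can
    tell the user the report is unavailable and to try again later.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return query()
        except QueryCanceledOnStandby:
            if attempt == attempts:
                raise
            # Simple linear backoff before the next attempt.
            time.sleep(backoff_s * attempt)
    raise AssertionError("unreachable")
```

Only the cancellation error is retried; any other failure propagates immediately, which keeps genuine query errors from being masked.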
Re: [HACKERS] max_standby_delay considered harmful
On Thu, May 6, 2010 at 1:35 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Robert Haas wrote: On Wed, May 5, 2010 at 11:52 PM, Bruce Momjian br...@momjian.us wrote: I am afraid the current setting is tempting for users to enable, but will be so unpredictable that it will tarnish the reputation of HS and Postgres. We don't want to be thinking in 9 months, "Wow, we shouldn't have shipped that feature. It is causing all kinds of problems." We have done that before (rarely), and it isn't a good feeling. I am not convinced it will be unpredictable. The only caveats that I've seen so far are: - You need to run ntpd. - Queries will get cancelled like crazy if you're not using streaming replication. And also in situations where the master is idle for a while and then starts doing stuff. That's the most significant source of confusion, IMHO, I wouldn't mind the requirement of ntpd so much. Oh. Ouch. OK, sorry, I missed that part. Wow, that's awful. OK, I agree: we can't ship that as-is. /me feels embarrassed for completely failing to understand the root of the issue until 84 emails into the thread. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] max_standby_delay considered harmful
Hi, On Thursday 06 May 2010 07:35:49 Heikki Linnakangas wrote: Robert Haas wrote: On Wed, May 5, 2010 at 11:52 PM, Bruce Momjian br...@momjian.us wrote: I am afraid the current setting is tempting for users to enable, but will be so unpredictable that it will tarnish the reputation of HS and Postgres. We don't want to be thinking in 9 months, "Wow, we shouldn't have shipped that feature. It is causing all kinds of problems." We have done that before (rarely), and it isn't a good feeling. I am not convinced it will be unpredictable. The only caveats that I've seen so far are: - You need to run ntpd. - Queries will get cancelled like crazy if you're not using streaming replication. And also in situations where the master is idle for a while and then starts doing stuff. That's the most significant source of confusion, IMHO, I wouldn't mind the requirement of ntpd so much. Personally I would much rather like to keep that configurability and manually generate a record a second. Or possibly do something akin to archive_timeout... That may be not as important once there are fewer sources of conflict resolution - but that's something *definitely* not going to happen for 9.0... Andres
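The idle-master caveat and the "generate a record a second" workaround can be illustrated with a simplified model. The sketch assumes that the standby measures its delay as its own clock minus the timestamp of the last WAL record it has seen; that is my reading of the caveats listed in this message (including the ntpd requirement), not a description of the actual server code, and the function names are hypothetical.

```python
def apparent_delay_s(standby_now_s: float, last_wal_record_s: float) -> float:
    """Delay as the standby would see it in this simplified model."""
    return standby_now_s - last_wal_record_s


def grace_left_s(max_standby_delay_s: float, standby_now_s: float, last_wal_record_s: float) -> float:
    """Seconds a conflicting query may still run before being cancelled."""
    remaining = max_standby_delay_s - apparent_delay_s(standby_now_s, last_wal_record_s)
    return max(0.0, remaining)


# Master idle for 600 seconds: the last record looks 600 s old, so a query
# that conflicts with the next record gets no grace period at all.
print(grace_left_s(30.0, 1000.0, 400.0))

# With a record generated every second the last timestamp is at most about
# one second old, so nearly the whole max_standby_delay budget is available.
print(grace_left_s(30.0, 1000.0, 999.0))
```

The same subtraction also shows why clock skew between master and standby matters in this model: any offset between the two clocks is added directly to the apparent delay.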
Re: [HACKERS] max_standby_delay considered harmful
On Thu, 2010-05-06 at 11:36 +0200, Florian Pflug wrote: If there was an additional SQL-callable function that returned the backends the recovery process is currently waiting for, plus one that reported the last timestamp seen in the WAL, then all those different cancellation policies could be implemented as daemons that monitor recovery and kill backends as needed, no? That would allow people to experiment with different cancellation policies, and maybe shed some light on what the useful policies are in practice. It would be easier to implement a conflict resolution plugin that is called when a conflict occurs, allowing users to have a customisable mechanism. Again, I have no objection to that proposal. -- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] max_standby_delay considered harmful
On May 6, 2010, at 12:48 , Simon Riggs wrote: On Thu, 2010-05-06 at 11:36 +0200, Florian Pflug wrote: If there was an additional SQL-callable function that returned the backends the recovery process is currently waiting for, plus one that reported the last timestamp seen in the WAL, then all those different cancellation policies could be implemented as daemons that monitor recovery and kill backends as needed, no? That would allow people to experiment with different cancellation policies, and maybe shed some light on what the useful policies are in practice. It would be easier to implement a conflict resolution plugin that is called when a conflict occurs, allowing users to have a customisable mechanism. Again, I have no objection to that proposal. True, providing a plugin API would be even better, since no SQL callable API would have to be devised, and possible algorithms wouldn't be constrained by such an API's limitations. The existing max_standby_delay logic could be moved to such a plugin, living in contrib. Since it was already established (I believe) that the existing max_standby_delay logic is sufficiently fragile to require significant knowledge on the user's side about potential pitfalls, asking those users to install the plugin from contrib shouldn't be too much to ask for. This way, users who really need something more sophisticated than "recovery wins always" or "standby wins always" are given the tools they need *if* they're willing to put in the extra effort. For those who don't, offering max_standby_delay probably does more harm than good anyway, so nothing is lost by not offering it in the first place. best regards, Florian Pflug
Re: [HACKERS] max_standby_delay considered harmful
On Thu, 2010-05-06 at 13:46 +0200, Florian Pflug wrote: On May 6, 2010, at 12:48 , Simon Riggs wrote: On Thu, 2010-05-06 at 11:36 +0200, Florian Pflug wrote: If there was an additional SQL-callable function that returned the backends the recovery process is currently waiting for, plus one that reported the last timestamp seen in the WAL, then all those different cancellation policies could be implemented as daemons that monitor recovery and kill backends as needed, no? That would allow people to experiment with different cancellation policies, and maybe shed some light on what the useful policies are in practice. It would be easier to implement a conflict resolution plugin that is called when a conflict occurs, allowing users to have a customisable mechanism. Again, I have no objection to that proposal. True, providing a plugin API would be even better, since no SQL callable API would have to be devised, and possible algorithms wouldn't be constrained by such an API's limitations. The existing max_standby_delay logic could be moved to such a plugin, living in contrib. Since it was already established (I believe) that the existing max_standby_delay logic is sufficiently fragile to require significant knowledge on the user's side about potential pitfalls, asking those users to install the plugin from contrib shouldn't be too much to ask for. This way, users who really need something more sophisticated than "recovery wins always" or "standby wins always" are given the tools they need *if* they're willing to put in the extra effort. For those who don't, offering max_standby_delay probably does more harm than good anyway, so nothing is lost by not offering it in the first place. No problem from me with that approach. As long as 9.0 ships with the current capability to enforce max_standby_delay, I have no problem. 
-- Simon Riggs www.2ndQuadrant.com
Re: [HACKERS] max_standby_delay considered harmful
Heikki Linnakangas wrote: Robert Haas wrote: I am not convinced it will be unpredictable. The only caveats that I've seen so far are: - You need to run ntpd. - Queries will get cancelled like crazy if you're not using streaming replication. And also in situations where the master is idle for a while and then starts doing stuff. That's the most significant source of confusion, IMHO, I wouldn't mind the requirement of ntpd so much. I consider it mandatory to include a documentation update here that says "if you set max_standby_delay 0, and do not run something that regularly generates activity to the master like [example], you will get unnecessary query cancellation on the standby". As well as something like what Josh was suggesting, adding warnings that this is for advanced users only, to borrow his wording. This is why my name has been on the open items list for a while now--to make sure I follow through on that. I haven't written it yet because there were still changes to the underlying code being made up until moments before beta started, then this discussion started without a break between. There is a clear set of userland things that can be done to make up for the deficiencies in the state of the server code, but we won't even get to see how they work out in the field (feedback needed to improve the 9.1 design) if this capability goes away altogether. Is it not clear that there are some people who consider the occasional bit of cancellation OK, because they can correct for it at the application layer and they're willing to factor it in to their design if it allows using the otherwise idle HA standby? I'm fine with expanding that section of the documentation too, to make it more obvious that's the only situation this aspect of HS is aimed at and suitable for. 
-- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us
Re: [HACKERS] max_standby_delay considered harmful
Yeb Havinga wrote: Rob Wultsch wrote: I can not imagine setting this value to anything other than a bool and most of the time that bool would be -1. That's funny because when I was reading this thread, I was thinking the exact opposite: having max_standby_delay always set to 0 so I know the standby server is as up-to-date as possible. If you ask one person about this, you'll discover they only consider one behavior here sane, and any other setting is crazy. Ask five people, and you'll likely find someone who believes the complete opposite. Ask ten and carefully work out the trade-offs they're willing to make given the fundamental limitations of replication, and you'll arrive at the range of behaviors available right now, plus some more that haven't been built yet. There are a lot of different types of database applications out there, each with their own reliability and speed requirements to balance. -- Greg Smith 2ndQuadrant US Baltimore, MD PostgreSQL Training, Services and Support g...@2ndquadrant.com www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
Simon Riggs si...@2ndquadrant.com writes: It would be easier to implement a conflict resolution plugin that is called when a conflict occurs, allowing users to have a customisable mechanism. Again, I have no objection to that proposal. To implement, if you say so, no doubt. To use, that means you need to install a contrib module after validation that the trade offs there are the one you're interested into, or you have to code it yourself. In C. I don't see that as an improvement over what we have now. Our main problem seems to be the documentation of the max_standby_delay, where we give the impression it's doing things the code can not do. IIUC. Regards, -- dim -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
On Thu, 2010-05-06 at 16:09 +0200, Dimitri Fontaine wrote: Simon Riggs si...@2ndquadrant.com writes: It would be easier to implement a conflict resolution plugin that is called when a conflict occurs, allowing users to have a customisable mechanism. Again, I have no objection to that proposal. To implement, if you say so, no doubt. To use, that means you need to install a contrib module after validation that the trade offs there are the one you're interested into, or you have to code it yourself. In C. I don't see that as an improvement over what we have now. Our main problem seems to be the documentation of the max_standby_delay, where we give the impression it's doing things the code can not do. IIUC. I meant easier to implement than what Florian suggested. The plugin would also allow you to have the pause/resume capability. -- Simon Riggs www.2ndQuadrant.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] max_standby_delay considered harmful
Simon Riggs wrote: On Thu, 2010-05-06 at 16:09 +0200, Dimitri Fontaine wrote: Simon Riggs si...@2ndquadrant.com writes: It would be easier to implement a conflict resolution plugin that is called when a conflict occurs, allowing users to have a customisable mechanism. Again, I have no objection to that proposal. To implement, if you say so, no doubt. To use, that means you need to install a contrib module after validation that the trade offs there are the one you're interested into, or you have to code it yourself. In C. I don't see that as an improvement over what we have now. Our main problem seems to be the documentation of the max_standby_delay, where we give the impression it's doing things the code can not do. IIUC. I meant easier to implement than what Florian suggested. The plugin would also allow you to have the pause/resume capability. Not the same plugin. A hook for stop/resume would need to be called before and/or after each record, the one for conflict resolution would need to be called at each conflict. Designing a good interface for a plugin is hard, you need at least a couple of sample ideas for plugins that would use the hook, before you know the interface is flexible enough. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers