Re: [HACKERS] Issues with Quorum Commit
Tom Lane wrote:

Greg Smith g...@2ndquadrant.com writes: I don't see this as needing any implementation any more complicated than the usual way such timeouts are handled. Note how long you've been trying to reach the standby. Default to -1 for forever. And if you hit the timeout, mark the standby as degraded and force them to do a proper resync when they disconnect. Once that's done, they can re-enter sync rep mode again, via the same process a new node would use.

Well, actually, that's *considerably* more complicated than just a timeout. How are you going to mark the standby as degraded? The standby can't keep that information, because it's not even connected when the master makes the decision. ISTM that this requires: 1. a unique identifier for each standby (not just role names that multiple standbys might share); 2. state on the master associated with each possible standby -- not just the ones currently connected. Both of those are perhaps possible, but the sense I have of the discussion is that people want to avoid them.

Actually, #2 seems rather difficult even if you want it. Presumably you'd like to keep that state in reliable storage, so it survives master crashes. But how are you going to commit a change to that state if you just lost every standby (suppose the master's ethernet cable got unplugged)? Looks to me like it has to be reliable non-replicated storage. Leaving aside the question of how reliable it can really be if not replicated, it's still the case that we have no place to put such information given the WAL-is-across-the-whole-cluster design.

I assumed we would have a parameter called sync_rep_failure that would take a command, and the command would be called when communication to the slave was lost. If you restart, it tries again and might call the function again.

-- Bruce Momjian br...@momjian.us http://momjian.us EnterpriseDB http://enterprisedb.com + It's impossible for everything to be true. +
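The sync_rep_failure parameter Bruce describes is purely hypothetical -- no such GUC exists in PostgreSQL -- but a postgresql.conf sketch of the idea, with invented names, might look like this:

```
# Hypothetical GUCs -- neither of these exists; names are illustrative only.
# How long to keep trying to reach a sync standby; -1 = wait forever
# (the default Greg suggests).
sync_standby_timeout = -1
# Command to run when communication with the standby is lost.
sync_rep_failure = '/usr/local/bin/mark-standby-degraded.sh'
```

Note that this sketch sidesteps Tom's objection: the command merely notifies an external system, rather than durably marking a specific standby as degraded on the master.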
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Tue, Oct 12, 2010 at 11:50 PM, Robert Haas robertmh...@gmail.com wrote: There's another problem here we should think about, too. Suppose you have a master and two standbys. The master dies. You promote one of the standbys, which turns out to be behind the other. You then repoint the other standby at the one you promoted. Congratulations, your database is now very possibly corrupt, and you may very well get no warning of that fact. It seems to me that we would be well-advised to install some kind of bullet-proof safeguard against this kind of problem, so that you will KNOW that the standby needs to be re-synced. I mention this because I have a vague feeling that timelines are supposed to prevent you from getting different WAL histories confused with each other, but they don't actually cover all the cases that can happen.

Why don't the usual protections kick in here? The new record, read from the location where the xlog reader expects to find it, has to have a valid CRC and a correct back pointer to the previous record. If the new walsender is behind the old one, the new record it sends won't match up at all.

-- greg
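The two replay-time protections Greg mentions -- a CRC over each record's contents and a back pointer to the previous record -- can be illustrated with a toy model (a sketch only; real WAL record headers differ):

```python
import zlib
from dataclasses import dataclass

@dataclass
class WalRecord:
    lsn: int        # position where this record starts
    prev_lsn: int   # back pointer to the previous record
    payload: bytes
    crc: int = 0

def checksum(rec: WalRecord) -> int:
    # CRC over the payload and the header fields it protects.
    return zlib.crc32(rec.payload +
                      rec.lsn.to_bytes(8, "big") +
                      rec.prev_lsn.to_bytes(8, "big"))

def make_record(lsn: int, prev_lsn: int, payload: bytes) -> WalRecord:
    rec = WalRecord(lsn, prev_lsn, payload)
    rec.crc = checksum(rec)
    return rec

def next_record_ok(prev: WalRecord, rec: WalRecord) -> bool:
    # Reject a record whose back pointer doesn't point at the record we
    # just replayed, or whose CRC doesn't match its contents.
    return rec.prev_lsn == prev.lsn and rec.crc == checksum(rec)
```

A record streamed from a walsender on a diverged history fails the back-pointer check, which is why replication usually stops -- though, as the thread goes on to show, "usually" is not "always".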
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 13, 2010 at 5:22 AM, Fujii Masao masao.fu...@gmail.com wrote:

On Wed, Oct 13, 2010 at 3:50 PM, Robert Haas robertmh...@gmail.com wrote: There's another problem here we should think about, too. Suppose you have a master and two standbys. The master dies. You promote one of the standbys, which turns out to be behind the other. You then repoint the other standby at the one you promoted. Congratulations, your database is now very possibly corrupt, and you may very well get no warning of that fact. It seems to me that we would be well-advised to install some kind of bullet-proof safeguard against this kind of problem, so that you will KNOW that the standby needs to be re-synced.

Yep. This is why I said it's not easy to implement that. To start the standby without taking a base backup from the new master after failover, the user basically has to promote the standby which is ahead of the other standbys (e.g., by comparing pg_last_xlog_replay_location on each standby). As the safeguard, we seem to need to compare the location of the timeline switch on the master with the last replay location on the standby. If the latter location is ahead AND the timeline ID of the standby is not the same as that of the master, we should emit a warning and terminate the replication connection.

That doesn't seem very bullet-proof. You can accidentally corrupt a standby even when only one timeline is involved. AFAIK, stopping a standby, removing recovery.conf, and starting it up again does not change timelines. You can even shut down the standby, bring it up as a master, generate a little WAL, shut it back down, and bring it back up as a standby pointing to the same master. It would be nice to embed in each checkpoint record an identifier that changes randomly on each transition to normal running, so that if you do something like this we can notice and complain loudly.
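Robert's randomly changing identifier is a suggestion, not an existing mechanism; a minimal sketch of how such a check could work (all names invented) might be:

```python
import os

def new_run_id() -> str:
    # Generated afresh on every transition to normal (read-write) running,
    # and embedded in checkpoint records per Robert's suggestion.
    return os.urandom(8).hex()

def standby_may_follow(master_run_history: list, standby_seen: list) -> bool:
    # The run ids the standby has observed in checkpoint records must be
    # a prefix of the master's history; anything else means the standby
    # replayed WAL from a run this master never performed (e.g. it was
    # briefly promoted itself), and it needs a re-sync.
    return master_run_history[:len(standby_seen)] == standby_seen
```

Because the identifier changes even when the timeline does not (e.g. the shut-down-and-briefly-promote case above), it catches divergence that timeline IDs miss.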
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Issues with Quorum Commit
On Thu, Oct 14, 2010 at 11:18 AM, Greg Stark gsst...@mit.edu wrote: Why don't the usual protections kick in here? The new record, read from the location where the xlog reader expects to find it, has to have a valid CRC and a correct back pointer to the previous record.

Yep. In most cases, those protections seem to be able to make the standby notice the inconsistency of the WAL and then give up continuing replication. But not in all cases. Can we regard those protections as a bullet-proof safeguard?

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 13, 2010 at 10:18 PM, Greg Stark gsst...@mit.edu wrote:

On Tue, Oct 12, 2010 at 11:50 PM, Robert Haas robertmh...@gmail.com wrote: There's another problem here we should think about, too. Suppose you have a master and two standbys. The master dies. You promote one of the standbys, which turns out to be behind the other. You then repoint the other standby at the one you promoted. Congratulations, your database is now very possibly corrupt, and you may very well get no warning of that fact. It seems to me that we would be well-advised to install some kind of bullet-proof safeguard against this kind of problem, so that you will KNOW that the standby needs to be re-synced. I mention this because I have a vague feeling that timelines are supposed to prevent you from getting different WAL histories confused with each other, but they don't actually cover all the cases that can happen.

Why don't the usual protections kick in here? The new record, read from the location where the xlog reader expects to find it, has to have a valid CRC and a correct back pointer to the previous record. If the new walsender is behind the old one, the new record it sends won't match up at all.

There's some kind of logic that rewinds to the beginning of the WAL segment and tries to replay from there.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Issues with Quorum Commit
On 13.10.2010 08:21, Fujii Masao wrote:

On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: It shouldn't be too hard to fix. Walsender needs to be able to read WAL from preceding timelines, like recovery does, and walreceiver needs to write the incoming WAL to the right file.

And walsender seems to need to transfer the current timeline history to the standby. Otherwise, the standby cannot recover the WAL file with the new timeline. And the standby might need to create the timeline history file in order to recover the WAL file with the new timeline even after it's restarted.

Yes, true, you need that too. It might be good to divide this work into two phases: teaching archive recovery to notice new timelines appearing in the archive first, and doing the walsender/walreceiver changes after that.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 13, 2010 at 2:43 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:

On 13.10.2010 08:21, Fujii Masao wrote: On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: It shouldn't be too hard to fix. Walsender needs to be able to read WAL from preceding timelines, like recovery does, and walreceiver needs to write the incoming WAL to the right file. And walsender seems to need to transfer the current timeline history to the standby. Otherwise, the standby cannot recover the WAL file with the new timeline. And the standby might need to create the timeline history file in order to recover the WAL file with the new timeline even after it's restarted.

Yes, true, you need that too. It might be good to divide this work into two phases, teaching archive recovery to notice new timelines appearing in the archive first, and doing the walsender/walreceiver changes after that.

There's another problem here we should think about, too. Suppose you have a master and two standbys. The master dies. You promote one of the standbys, which turns out to be behind the other. You then repoint the other standby at the one you promoted. Congratulations, your database is now very possibly corrupt, and you may very well get no warning of that fact. It seems to me that we would be well-advised to install some kind of bullet-proof safeguard against this kind of problem, so that you will KNOW that the standby needs to be re-synced. I mention this because I have a vague feeling that timelines are supposed to prevent you from getting different WAL histories confused with each other, but they don't actually cover all the cases that can happen.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Re: [HACKERS] Issues with Quorum Commit
On 10/13/2010 06:43 AM, Fujii Masao wrote: Unfortunately even enough standbys don't increase write-availability unless you choose wait-forever. Because, after promoting one of the standbys to the new master, you must keep all the transactions waiting until at least one standby has connected to and caught up with the new master. Currently this wait time is not short.

Why is that? Don't the standbys just have to switch from one walsender to another? If there's any significant delay in switching, this either hurts availability or robustness, yes.

Hmm.. that increases the number of procedures which the users must perform at the failover.

I only consider fully automated failover. However, you seem to be worried about the initial setup of sync rep.

At least, the users seem to have to wait until the standby has caught up with the new master, increase quorum_commit and then reload the configuration file.

For switching from a single node to a sync replication setup with one or more standbys, that seems reasonable. There are way more components you need to set up or adjust in such a case (network, load balancer, alerting system and maybe even the application itself). There's really no other option, if you want the kind of robustness guarantee that sync rep with wait-forever provides. OTOH, if you just replicate to whatever standby is there and don't care much if it isn't, the admin doesn't need to worry much about quorum_commit - it doesn't have much of an effect anyway.

Regards Markus Wanner
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 13, 2010 at 3:43 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:

On 13.10.2010 08:21, Fujii Masao wrote: On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: It shouldn't be too hard to fix. Walsender needs to be able to read WAL from preceding timelines, like recovery does, and walreceiver needs to write the incoming WAL to the right file. And walsender seems to need to transfer the current timeline history to the standby. Otherwise, the standby cannot recover the WAL file with the new timeline. And the standby might need to create the timeline history file in order to recover the WAL file with the new timeline even after it's restarted.

Yes, true, you need that too. It might be good to divide this work into two phases, teaching archive recovery to notice new timelines appearing in the archive first, and doing the walsender/walreceiver changes after that.

OK. In detail:

1. After failover, when the standby connects to the new master, walsender transfers the current timeline history in the handshake processing.
2. If the timeline history on the master is inconsistent with that on the standby, walreceiver terminates the replication connection.
3. Walreceiver creates the timeline history file.
4. Walreceiver signals the change of timeline history to the startup process and makes it read the timeline history file. After this, the startup process tries to recover the WAL files even with the new timeline ID.
5. After the handshake, walsender sends the WAL from preceding timelines, like recovery does, and walreceiver writes the incoming WAL to the right file.

Am I missing something?

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
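The consistency test in step 2 boils down to a prefix check on the list of timeline IDs; a sketch (not the actual handshake code):

```python
def timeline_history_ok(master_history: list, standby_history: list) -> bool:
    """Step 2: the standby may keep streaming only if its own timeline
    history is a prefix of the master's, i.e. the master's history
    branched at or after the standby's current timeline. Otherwise
    walreceiver terminates the replication connection."""
    n = len(standby_history)
    return n <= len(master_history) and master_history[:n] == standby_history
```

For example, a standby on timeline history [1, 2] can follow a master on [1, 2, 3], but a standby on [1, 4] cannot: the two histories branched.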
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 13, 2010 at 3:50 PM, Robert Haas robertmh...@gmail.com wrote: There's another problem here we should think about, too. Suppose you have a master and two standbys. The master dies. You promote one of the standbys, which turns out to be behind the other. You then repoint the other standby at the one you promoted. Congratulations, your database is now very possibly corrupt, and you may very well get no warning of that fact. It seems to me that we would be well-advised to install some kind of bullet-proof safeguard against this kind of problem, so that you will KNOW that the standby needs to be re-synced.

Yep. This is why I said it's not easy to implement that. To start the standby without taking a base backup from the new master after failover, the user basically has to promote the standby which is ahead of the other standbys (e.g., by comparing pg_last_xlog_replay_location on each standby).

As the safeguard, we seem to need to compare the location of the timeline switch on the master with the last replay location on the standby. If the latter location is ahead AND the timeline ID of the standby is not the same as that of the master, we should emit a warning and terminate the replication connection.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
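Fujii's proposed safeguard reduces to a single comparison; a sketch with invented names (not actual PostgreSQL code):

```python
def replication_allowed(master_tli: int, switch_lsn: int,
                        standby_tli: int, standby_replay_lsn: int) -> bool:
    # If the standby is still on an older timeline but has already
    # replayed past the point where the master's timeline branched off,
    # the two WAL histories have diverged: warn and refuse to stream.
    if standby_tli != master_tli and standby_replay_lsn > switch_lsn:
        return False
    return True
```

This catches the "promoted the wrong standby" scenario Robert describes, but as he replies next, it does not catch divergence within a single timeline.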
Re: [HACKERS] Issues with Quorum Commit
On Sat, Oct 9, 2010 at 12:12 AM, Markus Wanner mar...@bluegap.ch wrote:

On 10/08/2010 04:48 PM, Fujii Masao wrote: I believe many systems require write-availability.

Sure. Make sure you have enough standbys to fail over to.

Unfortunately even enough standbys don't increase write-availability unless you choose wait-forever. Because, after promoting one of the standbys to the new master, you must keep all the transactions waiting until at least one standby has connected to and caught up with the new master. Currently this wait time is not short.

(I think there are even more situations where read-availability is much more important, though.)

Even so, we should not ignore the write-availability aspect.

Start with 0 (i.e. replication off), then add standbys, then increase quorum_commit to your new requirements.

No. This only makes the procedure of failover more complex.

Huh? This doesn't affect fail-over at all. Quite the opposite, the guarantees and requirements remain the same even after a fail-over.

Hmm.. that increases the number of procedures which the users must perform at the failover. At least, the users seem to have to wait until the standby has caught up with the new master, increase quorum_commit and then reload the configuration file.

What is a full-cluster crash?

The event that all of your cluster nodes are down (most probably due to power failure, but fires or other catastrophic events can be other causes). Chances for that to happen can certainly be reduced by distributing to distant locations, but that equally certainly increases latency, which isn't always an option.

Yep.

Why does it cause a split-brain?

First master node A fails, a standby B takes over, but then fails as well. Let node C take over. Then the power aggregate catches fire, the infamous full-cluster crash (where lights-out management gets a completely new meaning ;-) ).
Split brain would be the situation that arises if all three nodes (A, B and C) start up again and each thinks it has been the former master, so it can now continue to apply new transactions. Their data diverges, leading to what could be seen as a split-brain from the outside. Obviously, you must disallow A and B from taking the role of the master after recovery.

Yep. Something like STONITH would be required.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Issues with Quorum Commit
On Sat, Oct 9, 2010 at 1:41 AM, Josh Berkus j...@agliodbs.com wrote: And, I'd like to know whether the master waits forever because of the standby failure in other solutions such as Oracle Data Guard or MySQL semi-synchronous replication.

MySQL used to be fond of simply failing silently. Not sure what 5.4 does, or Oracle. In any case MySQL's replication has always really been async (except Cluster, which is a very different database), so it's not really a comparison.

IIRC, MySQL *semi-synchronous* replication is not async, so it can be a comparison. Of course, MySQL's default replication is async.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Issues with Quorum Commit
On Sat, Oct 9, 2010 at 4:31 AM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote:

Yes. But if there is no unsent WAL when the master goes down, we can start a new standby without a new backup by copying the timeline history file from the new master to the new standby and setting recovery_target_timeline to 'latest'.

.. and restarting the standby.

Yes.

It's a pretty severe shortcoming at the moment. For starters, it means that you need a shared archive, even if you set wal_keep_segments to a high number. Secondly, it's a lot of scripting to get it working; I don't like the thought of testing failovers in synchronous replication if I have to do all that. Frankly, this seems more important to me than synchronous replication.

There seems to be a difference in outlook between us. I prefer sync rep. But I'm OK with addressing that first if it's not hard.

It shouldn't be too hard to fix. Walsender needs to be able to read WAL from preceding timelines, like recovery does, and walreceiver needs to write the incoming WAL to the right file.

And walsender seems to need to transfer the current timeline history to the standby. Otherwise, the standby cannot recover the WAL file with the new timeline. And the standby might need to create the timeline history file in order to recover the WAL file with the new timeline even after it's restarted.

Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Issues with Quorum Commit
Greg, to me it looks like we have very similar goals, but start from different preconditions. I absolutely agree with you given the preconditions you named.

On 10/08/2010 10:04 PM, Greg Smith wrote: How is that a new problem? It's already possible to end up with a standby pair that has suffered through some bizarre failure chain such that it's not necessarily obvious which of the two systems has the most recent set of data on it. And that's not this project's problem to solve.

Thanks for pointing that out. I think that might not have been clear to me. This limitation of scope certainly makes sense for the Postgres project in general.

Regards Markus Wanner
Re: [HACKERS] Issues with Quorum Commit
Greg Smith g...@2ndquadrant.com writes: […] I don't see this as needing any implementation any more complicated than the usual way such timeouts are handled. Note how long you've been trying to reach the standby. Default to -1 for forever. And if you hit the timeout, mark the standby as degraded and force them to do a proper resync when they disconnect. Once that's done, they can re-enter sync rep mode again, via the same process a new node would use.

Thank you for this post, which is so much better than anything I could achieve. Just wanted to add that it should be possible in lots of cases to have a standby rejoin the party without getting as far back as taking a new base backup. It depends on wal_keep_segments and the standby's degraded state, among other parameters (archives, etc).

Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 12:30 AM, Simon Riggs wrote: I do, but it's not a parameter. The k = 1 behaviour is hardcoded and considerably simplifies the design. Moving to k > 1 is additional work, slows things down and seems likely to be fragile.

Perfect! So I'm all in favor of committing that, but leaving out the timeout thing, which I think is just adding unneeded complexity and fragility.

Regards Markus Wanner
Re: [HACKERS] Issues with Quorum Commit
Simon,

On 10/08/2010 12:25 AM, Simon Riggs wrote: Asking for k > 1 does *not* mean those servers are time-synchronised.

Yes, it's technically impossible to create a fully synchronized cluster (on the basis of the shared-nothing nodes we are aiming for, that is). There always is some kind of lag on either side. Maybe the use case for a no-lag cluster doesn't exist, because it's technically not feasible.

In a bad case, those 3 acknowledgements might happen say 5 seconds apart on the worst and best of the 3 servers. So the first standby to receive the data could have applied the changes ~4.8 seconds prior to the 3rd standby. There is still a chance of reading stale data on one standby, but reading fresh data on another server. In most cases the time window is small, but still exists.

Well, the transaction isn't committed on the master, so one could argue it shouldn't matter. The guarantee just needs to be one way: as soon as the commit is confirmed to the client, all k standbys need to have it committed, too. (At least for the apply replication level.)

So standbys are eventually consistent whether or not the master relies on them to provide an acknowledgement. The only place where you can guarantee non-stale data is on the master.

That's formulated a bit too strongly. With the apply replication level, you should be able to rely on the guarantee that a committed transaction is visible on at least k standbys. Maybe in advance of the commit on the master, but I wouldn't call that stale data. Given the current proposals, the master is the one that's lagging the most, compared to the k standbys.

High values of k reduce the possibility of data loss, whereas expected cluster availability is reduced as N - k gets smaller.

Exactly. One addendum: a timeout increases availability at the cost of increased danger of data loss and higher complexity. Don't use it, just increase (N - k) instead.
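Simon's point -- that the master merely stops waiting after the k-th acknowledgement and says nothing about the other standbys -- can be illustrated with a toy calculation (illustrative only):

```python
def commit_release_time(ack_times: list, k: int) -> float:
    """The master releases the commit when the k-th acknowledgement
    arrives; the remaining N - k standbys may still lag arbitrarily."""
    if not 1 <= k <= len(ack_times):
        raise ValueError("need 1 <= k <= number of standbys")
    return sorted(ack_times)[k - 1]
```

With acknowledgements arriving at 0.2 s, 1.0 s and 5.0 s and k = 2, the commit returns at 1.0 s, while the slowest standby applies the change 4 seconds later -- the stale-read window Simon describes.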
Regards Markus Wanner
Re: [HACKERS] Issues with Quorum Commit
On 07.10.2010 21:38, Markus Wanner wrote:

On 10/07/2010 03:19 PM, Dimitri Fontaine wrote: I think you're all into durability, and that's good. The extra cost is service downtime

It's just *reduced* availability. That doesn't necessarily mean downtime, if you combine cleverly with async replication.

if that's not what you're after: there's also availability and load balancing read queries on a system with no lag (no stale data servicing) when all is working right.

All I'm saying is that those use cases are much better served with async replication. Maybe together with something that warns and takes action in case the standby's lag gets too big. Or what kind of customers do you think really need a no-lag solution for read-only queries? In the LAN case, the lag of async rep is negligible, and in the WAN case the latencies of sync rep are prohibitive.

There is a very good use case for that particular setup, actually. If your hot standby is guaranteed to be up-to-date with any transaction that has been committed on the master, you can use the standby interchangeably with the master for read-only queries. Very useful for load balancing.

Imagine a web application that's mostly read-only, but a user can modify his own personal details like name and address, for example. Imagine that the user changes his street address and clicks 'save', causing an UPDATE, and the next query fetches that information again to display to the user. If you use load balancing, the query can be routed to the hot standby server, and if it lags even 1-2 seconds behind, it's quite possible that it will still return the old address. The user will go "WTF, I just changed that!".

That's the load-balancing use case, which is quite different from the zero-data-loss-on-server-failure use case that most people here seem to be interested in.
-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 04:01 AM, Fujii Masao wrote: Really? I don't think that ko-count=0 means wait-forever.

Telling from the documentation, I'd also say it doesn't wait forever by default. However, please note that there are different parameters for the initial wait for a connection during boot-up (wfc-timeout and degr-wfc-timeout). So you might want to test what happens on a node failure, not just in the absence of a standby.

Regards Markus Wanner
Re: [HACKERS] Issues with Quorum Commit
On 08.10.2010 06:41, Fujii Masao wrote:

On Thu, Oct 7, 2010 at 3:01 AM, Markus Wanner mar...@bluegap.ch wrote: Of course, it doesn't make sense to wait-forever on *every* standby that ever gets added. Quorum commit is required, yes (and that's what this thread is about, IIRC). But with quorum commit, adding a standby only improves availability and certainly doesn't block the master in any way.

But, even with quorum commit, if you choose the wait-forever option, failover would decrease availability. Right after the failover, no standby has connected to the new master, so if quorum >= 1, all the transactions must wait for a while.

Sure, the new master can't proceed with commits until enough standbys have connected to it.

Basically we need to take a base backup from the new master to start the standbys and make them connect to the new master.

Do we really need that? I don't think that's acceptable; we'll need to fix it if that's the case. I think you're right, streaming replication doesn't work across timeline changes. We left that out of 9.0, to keep things simple, but it seems that we really should fix that for 9.1. You can cross timelines with the archive, though. But IIRC there was some issue with that too: you needed to restart the standbys, because the standby scans what timelines exist at the beginning of recovery, and won't notice new timelines that appear after that?

We need to address that, apart from any of the other things discussed wrt. synchronous replication. It will benefit asynchronous replication too. IMHO *that* is the next thing we should do, the next patch we commit.

-- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Issues with Quorum Commit
On Fri, 2010-10-08 at 09:52 +0200, Markus Wanner wrote: One addendum: a timeout increases availability at the cost of increased danger of data loss and higher complexity. Don't use it, just increase (N - k) instead.

Completely agree.

-- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 05:41 AM, Fujii Masao wrote: But, even with quorum commit, if you choose the wait-forever option, failover would decrease availability. Right after the failover, no standby has connected to the new master, so if quorum >= 1, all the transactions must wait for a while.

That's a point, yes. But again, this is just write-availability; you can happily read from all active standbys. And connection time is certainly negligible compared to any kind of timeout (which certainly needs to be way bigger than a couple of network round-trips).

Basically we need to take a base backup from the new master to start the standbys and make them connect to the new master. This might take a long time. Since transaction commits cannot advance for that time, availability would go down.

Just don't increase your quorum_commit to unreasonable values which your hardware cannot possibly satisfy. It doesn't make sense to set a quorum_commit of 1 or even bigger if you don't already have a standby attached. Start with 0 (i.e. replication off), then add standbys, then increase quorum_commit to your new requirements.

Or do you think that the wait-forever option is applied only when the standby goes down?

That wouldn't work in case of a full-cluster crash, where the wait-forever option is required again. Otherwise you risk a split-brain situation.

Regards Markus Wanner
Re: [HACKERS] Issues with Quorum Commit
On 08.10.2010 01:25, Simon Riggs wrote: On Thu, 2010-10-07 at 13:44 -0400, Aidan Van Dyk wrote: To get non-stale responses, you can only query those k=3 servers. But you've shot yourself in the foot because you don't know which 3/10 those will be. The other 7 *are* stale (by definition). They talk about picking the caught up slave when the master fails, but you actually need to do that for *every query*. There is a big confusion around that point and I need to point out that statement isn't accurate. It's taken me a long while to understand this. Asking for k > 1 does *not* mean those servers are time synchronised. All it means is that the master will stop waiting after 3 acknowledgements. There is no connection between the master receiving acknowledgements and the standby applying changes received from the master; the standbys are all independent of one another. In a bad case, those 3 acknowledgements might happen, say, 5 seconds apart on the worst and best of the 3 servers. So the first standby to receive the data could have applied the changes ~4.8 seconds prior to the 3rd standby. There is still a chance of reading stale data on one standby, but reading fresh data on another server. In most cases the time window is small, but it still exists. The other 7 are stale with respect to the first 3. But then so are the last 9 compared with the first one. The value of k has nothing whatsoever to do with the time difference between the master and the last standby to receive/apply the changes. The gap between first and last standby (i.e. N, not k) is the time window during which a query might/might not see a particular committed result. So standbys are eventually consistent whether or not the master relies on them to provide an acknowledgement. The only place where you can guarantee non-stale data is on the master. Yes, that's a good point.
Synchronous replication for load-balancing purposes guarantees that when *you* perform a commit, after it finishes it will be visible in all standbys. But if you run the same query across different standbys, you're not guaranteed to get the same results. If you just pick a random server for every query, you might even see time moving backwards. Affinity is definitely a good idea for the load-balancing scenario, but even then the anomaly is possible if you get re-routed to a different server because the one you were bound to dies. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
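The affinity idea could be sketched roughly as follows — a hypothetical load-balancer fragment, not anything in PostgreSQL; the class and method names are invented for illustration. Pinning each session to one standby at least keeps time moving forward for that session, with the caveat Heikki notes: re-pinning on failure can still jump to a standby that lags the old one.

```python
import random

class AffinityRouter:
    """Pin each session to one standby so repeated reads within a
    session never move backwards in time; re-pin only if the chosen
    standby dies."""

    def __init__(self, standbys):
        self.standbys = list(standbys)   # e.g. ["standby1", "standby2"]
        self.pinned = {}                 # session_id -> standby name

    def route(self, session_id, alive):
        server = self.pinned.get(session_id)
        if server is None or server not in alive:
            # First query, or the pinned standby died: pick a live one.
            # Note: re-pinning is exactly where the anomaly can reappear,
            # since the new standby may lag behind the old one.
            server = random.choice([s for s in self.standbys if s in alive])
            self.pinned[session_id] = server
        return server
```

A session keeps hitting the same standby until that standby drops out of the live set.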
Re: [HACKERS] Issues with Quorum Commit
On Fri, 2010-10-08 at 10:56 +0300, Heikki Linnakangas wrote: Or what kind of customers do you think really need a no-lag solution for read-only queries? In the LAN case, the lag of async rep is negligible and in the WAN case the latencies of sync rep are prohibitive. There is a very good use case for that particular set up, actually. If your hot standby is guaranteed to be up-to-date with any transaction that has been committed in the master, you can use the standby interchangeably with the master for read-only queries. This is an important point. It is desirable, but there is no such thing. We must not take any project decisions based upon that false premise. Hot Standby is never guaranteed to be up-to-date with master. There is no such thing as certainty that you have the same data as the master. All sync rep gives you is a better durability guarantee that the changes are safe. It doesn't guarantee those changes are transferred to all nodes prior to making the data changes on any one standby. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 08.10.2010 11:25, Simon Riggs wrote: On Fri, 2010-10-08 at 10:56 +0300, Heikki Linnakangas wrote: Or what kind of customers do you think really need a no-lag solution for read-only queries? In the LAN case, the lag of async rep is negligible and in the WAN case the latencies of sync rep are prohibitive. There is a very good use case for that particular set up, actually. If your hot standby is guaranteed to be up-to-date with any transaction that has been committed in the master, you can use the standby interchangeably with the master for read-only queries. This is an important point. It is desirable, but there is no such thing. We must not take any project decisions based upon that false premise. Hot Standby is never guaranteed to be up-to-date with master. There is no such thing as certainty that you have the same data as the master. Synchronous replication in the 'replay' mode is supposed to guarantee exactly that, no? -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 10:27 AM, Heikki Linnakangas wrote: Synchronous replication in the 'replay' mode is supposed to guarantee exactly that, no? The master may lag behind, so it's not strictly speaking the same data. Regards Markus Wanner
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 09:56 AM, Heikki Linnakangas wrote: Imagine a web application that's mostly read-only, but a user can modify his own personal details like name and address, for example. Imagine that the user changes his street address and clicks 'save', causing an UPDATE, and the next query fetches that information again to display to the user. I don't think that use case justifies sync replication and the additional network overhead that brings. Latency is low in that case, okay, but so is the lag for async replication. Why not tell the load balancer to read from the master for n seconds after the last write? After that, it should be safe to query standbys again. If the load on the master is the problem, and you want to reduce that by moving the read-only transactions to the slave, sync replication pretty certainly won't help you either, because it actually *increases* concurrency on the master (commits take longer, so more transactions are in flight at once). Regards Markus Wanner
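Markus's n-second rule could look roughly like this in a load balancer — a hedged sketch with invented names, assuming a fixed upper bound on async replication lag:

```python
import time

class ReadYourWritesRouter:
    """Send a session's reads to the master for WINDOW seconds after
    that session's last write; beyond the window, any standby is
    assumed to have replayed the write already."""

    WINDOW = 5.0  # assumed upper bound on async replication lag (seconds)

    def __init__(self):
        self.last_write = {}  # session_id -> monotonic time of last write

    def note_write(self, session_id):
        self.last_write[session_id] = time.monotonic()

    def route_read(self, session_id, standby):
        t = self.last_write.get(session_id)
        if t is not None and time.monotonic() - t < self.WINDOW:
            return "master"   # still inside the replication-lag window
        return standby
```

Sessions that haven't written recently (or at all) get routed to a standby; the user who just clicked 'save' reads his own update from the master.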
Re: [HACKERS] Issues with Quorum Commit
On Fri, 2010-10-08 at 11:27 +0300, Heikki Linnakangas wrote: On 08.10.2010 11:25, Simon Riggs wrote: On Fri, 2010-10-08 at 10:56 +0300, Heikki Linnakangas wrote: Or what kind of customers do you think really need a no-lag solution for read-only queries? In the LAN case, the lag of async rep is negligible and in the WAN case the latencies of sync rep are prohibitive. There is a very good use case for that particular set up, actually. If your hot standby is guaranteed to be up-to-date with any transaction that has been committed in the master, you can use the standby interchangeably with the master for read-only queries. This is an important point. It is desirable, but there is no such thing. We must not take any project decisions based upon that false premise. Hot Standby is never guaranteed to be up-to-date with master. There is no such thing as certainty that you have the same data as the master. Synchronous replication in the 'replay' mode is supposed to guarantee exactly that, no? From the perspective of the person making the change on the master: yes. If they make the change, wait for commit, then check the value on a standby, yes it will be there (or a later version). From the perspective of an observer, randomly selecting a standby for load balancing purposes: No, they are not guaranteed to see the latest answer, nor even can they find out whether what they are seeing is the latest answer. What sync rep does guarantee is that if the person making the change is told it succeeded (commit) then that change is safe on at least k other servers. Sync rep is about guarantees of safety, not observability. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 01:44 AM, Greg Smith wrote: They'll use Sync Rep to maximize the odds a system failure doesn't cause any transaction loss. They'll use good quality hardware on the master so it's unlikely to fail. ..unlikely to fail? Ehm.. is that you speaking, Greg? ;-) But when the database finds the standby unreachable, and it's left with the choice between either degrading into async rep or coming to a complete halt, you must give people the option of choosing to degrade instead after a timeout. Let them set off the red flashing lights, sound the alarms, and pray the master doesn't go down until you can fix the problem. Okay, okay, fair enough - if there had been red flashing lights. And alarms. And bells and whistles. But that's what I'm afraid the timeout is removing. I don't see this as needing any implementation any more complicated than the usual way such timeouts are handled. Note how long you've been trying to reach the standby. Default to -1 for forever. And if you hit the timeout, mark the standby as degraded ..and how do you make sure you are not marking your second standby as degraded just because it's currently lagging? Effectively degrading the utterly needed one, because your first standby has just bitten the dust? And how do you prevent the split brain situation in case the master dies shortly after these events, but fails to come up again immediately? Your list of data recovery projects will get larger and the projects more complicated. Because there's a lot more to it than just the implementation of a timeout. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 11:00 AM, Simon Riggs wrote: From the perspective of an observer, randomly selecting a standby for load balancing purposes: No, they are not guaranteed to see the latest answer, nor even can they find out whether what they are seeing is the latest answer. I completely agree. The application (or at least the load balancer) needs to be aware of that fact. Regards Markus
Re: [HACKERS] Issues with Quorum Commit
Markus Wanner mar...@bluegap.ch writes: ..and how do you make sure you are not marking your second standby as degraded just because it's currently lagging? Well, in sync rep, a standby that's not able to stay under the timeout is degraded. Full stop. The presence of the timeout (or its value not being -1) means that the admin has chosen this definition. Effectively degrading the utterly needed one, because your first standby has just bitten the dust? Well, now you have a worst case scenario: the first standby is dead and the remaining one was not able to keep up. You have lost all your master's failover replacements. And how do you prevent the split brain situation in case the master dies shortly after these events, but fails to come up again immediately? Same old story. Either you're able to try and fix the master so that you don't lose any data and don't even have to check for that, or you take a risk and start from a non-synced standby. It's all availability against durability again. What I really want us to be able to provide is the clear facts, so that whoever has to take the decision is able to. Meaning, here, that it should be easy to see that neither of the standbys is in sync at this point. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 11:41 AM, Dimitri Fontaine wrote: Same old story. Either you're able to try and fix the master so that you don't lose any data and don't even have to check for that, or you take a risk and start from a non-synced standby. It's all availability against durability again. ..and a whole lot of manual work that's prone to error, for something that could easily be automated at an initial, additional cost of certainly less than 2000 EUR (if any at all, in case you already have three servers). Sorry, I still fail to understand that use case. It reminds me of the customer that wanted to save the cost of the BBU and ran with fsync=off. Until his server went down due to a power outage. But yeah, we provide that option as well, yes. Point taken. Regards Markus Wanner
Re: [HACKERS] Issues with Quorum Commit
Markus Wanner mar...@bluegap.ch writes: ..and a whole lot of manual work, that's prone to error for something that could easily be automated So, the master just crashed, the first standby is dead and the second ain't in sync. What's the easy and automated way out? Sorry, I need a hand here. -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Issues with Quorum Commit
On Fri, Oct 8, 2010 at 5:07 PM, Markus Wanner mar...@bluegap.ch wrote: On 10/08/2010 04:01 AM, Fujii Masao wrote: Really? I don't think that ko-count=0 means wait-forever. Telling from the documentation, I'd also say it doesn't wait forever by default. However, please note that there are different parameters for the initial wait for connection during boot up (wfc-timeout and degr-wfc-timeout). So you might want to test what happens on a node failure, not just the absence of a standby. Unfortunately I've already taken down my DRBD environment. As far as I heard from my colleague who is familiar with DRBD, standby node failure doesn't prevent the master from writing data to the DRBD disk by default. If there is a DRBD environment available around me, I'll try the test. And I'd like to know whether the master waits forever because of a standby failure in other solutions such as Oracle Data Guard and MySQL semi-synchronous replication. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Issues with Quorum Commit
Greg Smith g...@2ndquadrant.com writes: I don't see this as needing any implementation any more complicated than the usual way such timeouts are handled. Note how long you've been trying to reach the standby. Default to -1 for forever. And if you hit the timeout, mark the standby as degraded and force them to do a proper resync when they disconnect. Once that's done, then they can re-enter sync rep mode again, via the same process a new node would have done so. Well, actually, that's *considerably* more complicated than just a timeout. How are you going to mark the standby as degraded? The standby can't keep that information, because it's not even connected when the master makes the decision. ISTM that this requires 1. a unique identifier for each standby (not just role names that multiple standbys might share); 2. state on the master associated with each possible standby -- not just the ones currently connected. Both of those are perhaps possible, but the sense I have of the discussion is that people want to avoid them. Actually, #2 seems rather difficult even if you want it. Presumably you'd like to keep that state in reliable storage, so it survives master crashes. But how you gonna commit a change to that state, if you just lost every standby (suppose master's ethernet cable got unplugged)? Looks to me like it has to be reliable non-replicated storage. Leaving aside the question of how reliable it can really be if not replicated, it's still the case that we have noplace to put such information given the WAL-is-across-the-whole-cluster design. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Fri, Oct 8, 2010 at 5:10 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Do we really need that? Yes. But if there is no unsent WAL when the master goes down, we can start a new standby without a new backup by copying the timeline history file from the new master to the new standby and setting recovery_target_timeline to 'latest'. In this case, the new standby advances the recovery to the latest timeline ID which the new master uses, before connecting to the master. This seems to have been successful in my test environment, though I may be missing something. I don't think that's acceptable, we'll need to fix that if that's the case. Agreed. You can cross timelines with the archive, though. But IIRC there was some issue with that too; you needed to restart the standbys because the standby scans what timelines exist at the beginning of recovery, and won't notice new timelines that appear after that? Yes. We need to address that, apart from any of the other things discussed wrt. synchronous replication. It will benefit asynchronous replication too. IMHO *that* is the next thing we should do, the next patch we commit. You mean to commit that capability before synchronous replication? If so, I disagree with you. I think that it's not easy to address that problem, so I'm worried that implementing that capability first would mean missing sync rep in 9.1. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
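The procedure Fujii describes would amount to something like the following recovery.conf on the re-pointed standby (the host name is a placeholder; this is a sketch of the setup being discussed, not a tested recipe):

```
standby_mode = 'on'
primary_conninfo = 'host=new-master port=5432'
# Follow whatever timeline the new master switched to at failover,
# instead of the timeline that was current when the base backup was taken.
recovery_target_timeline = 'latest'
```

The timeline history file from the new master still has to be present (copied or fetched from the archive) for the standby to be able to cross onto the new timeline.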
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 04:11 PM, Tom Lane wrote: Actually, #2 seems rather difficult even if you want it. Presumably you'd like to keep that state in reliable storage, so it survives master crashes. But how you gonna commit a change to that state, if you just lost every standby (suppose master's ethernet cable got unplugged)? IIUC you seem to assume that the master node keeps its master role. But users who value availability a lot certainly want automatic fail-over, so any node can potentially be the new master. After recovery from a full-cluster outage, the first question is which node was the most recent master (or which former standby is up to date and could take over). Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Tom Lane t...@sss.pgh.pa.us writes: Well, actually, that's *considerably* more complicated than just a timeout. How are you going to mark the standby as degraded? The standby can't keep that information, because it's not even connected when the master makes the decision. ISTM that this requires 1. a unique identifier for each standby (not just role names that multiple standbys might share); 2. state on the master associated with each possible standby -- not just the ones currently connected. Both of those are perhaps possible, but the sense I have of the discussion is that people want to avoid them. What we'd like to avoid is for the users to have to cope with such needs. Now, if that's internal to the code and automatic, that's not the same thing at all. What I'd have in mind is a database standby system identifier that would be part of the initial handshake in the replication protocol, and a system function to be able to unregister the standby. Actually, #2 seems rather difficult even if you want it. Presumably you'd like to keep that state in reliable storage, so it survives master crashes. But how you gonna commit a change to that state, if you just lost every standby (suppose master's ethernet cable got unplugged)? I don't see that as a huge problem myself, because I'm already well sold on the per-transaction replication-synchronous behaviour. So any change done there by the master would be hard-coded as async. What am I missing? Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 12:05 PM, Dimitri Fontaine wrote: Markus Wanner mar...@bluegap.ch writes: ..and a whole lot of manual work, that's prone to error for something that could easily be automated So, the master just crashed, first standby is dead and second ain't in sync. What's the easy and automated way out? Sorry, I need a hand here. Thinking this through, I'm realizing that this can potentially work automatically with three nodes in both cases. Each node needs to keep track of whether or not it is (or became) the master - and when (a lamport timestamp, maybe, not necessarily wall clock). A new master might continue to commit new transactions after a fail-over, without the old master being able to record that fact (because it's down). This means there's a different requirement after a full-cluster crash (i.e. master failure and no up-to-date standby is available). With the timeout, you absolutely need the former master to come back up again for zero data loss, no matter what your quorum_commit setting was. To be able to automatically tell who was the most recent master, you need to query the state of all other nodes, because any of them could be a more recent master. If that's not possible (or not feasible, because the replacement part isn't currently available), you are at risk of data loss. With the given three node scenario, the zero data loss guarantee only holds true as long as either at least one node (that is in sync) is running, or you can recover the former master after a full-cluster crash. When waiting forever, you only need one of the k nodes to come back up again. You also need to query other nodes to find out which of the N nodes are the k in-sync ones, but being able to recover (N - k + 1) nodes is sufficient to figure that out. So any (k-1) nodes may fail, even permanently, at any point in time, and you are still not at risk of losing data. (Nor at risk of losing availability, BTW.) I'm still of the opinion that that's the way easier and clearer guarantee.
Also note that with higher values of N, this gets more and more important, because the chance of being able to recover all N nodes after a full crash shrinks with increasing N (while the time required to do so increases). But maybe the current sync rep feature doesn't need to target setups with that many nodes. I certainly agree that either way is complicated to implement. With Postgres-R, I'm clearly going the way that's able to satisfy large numbers of nodes. Thanks for an interesting discussion. And for respectful disagreement. Regards Markus Wanner
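Markus's lamport-timestamp idea might be sketched like this — purely illustrative Python with invented names, not anything proposed for the PostgreSQL code. Each node records a (counter, node_id) pair whenever it is promoted; after a full-cluster crash, the recovered nodes compare their pairs to decide which of them was master last.

```python
class Node:
    """Track master promotions with a Lamport-style counter so that,
    after a full-cluster crash, nodes can agree which was master last."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.promotion = None  # (counter, node_id) of our last promotion

    def promote(self, peers):
        # Take the highest promotion counter seen anywhere, then go one
        # past it; node_id breaks ties deterministically.
        seen = [n.promotion for n in peers + [self] if n.promotion is not None]
        counter = max((c for c, _ in seen), default=0) + 1
        self.promotion = (counter, self.node_id)

def most_recent_master(nodes):
    # Only sound if enough nodes are recovered that the true maximum is
    # among them -- Markus's (N - k + 1) recovery condition.
    candidates = [n for n in nodes if n.promotion is not None]
    return max(candidates, key=lambda n: n.promotion) if candidates else None
```

In the A/B/C story from this thread: A is promoted first, then B after A fails, then C after B fails; comparing the pairs after the crash identifies C as the node whose data must win.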
Re: [HACKERS] Issues with Quorum Commit
Markus Wanner mar...@bluegap.ch writes: On 10/08/2010 04:11 PM, Tom Lane wrote: Actually, #2 seems rather difficult even if you want it. Presumably you'd like to keep that state in reliable storage, so it survives master crashes. But how you gonna commit a change to that state, if you just lost every standby (suppose master's ethernet cable got unplugged)? IIUC you seem to assume that the master node keeps its master role. But users who value availability a lot certainly want automatic fail-over, Huh? Surely loss of the slaves shouldn't force a failover. Maybe the slaves really are all dead. regards, tom lane
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 04:38 PM, Tom Lane wrote: Markus Wanner mar...@bluegap.ch writes: IIUC you seem to assume that the master node keeps its master role. But users who value availability a lot certainly want automatic fail-over, Huh? Surely loss of the slaves shouldn't force a failover. Maybe the slaves really are all dead. I think we are talking across each other. I'm speaking about the need to be able to fail-over to a standby in case the master fails. In case of a full-cluster crash after such a fail-over, you need to take care you don't enter split brain. Some kind of STONITH, lamport clock, or what not. Figuring out which node has been the most recent (and thus most up to date) master is far from trivial. (See also my mail in answer to Dimitri a few minutes ago). Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Fri, 2010-10-08 at 10:11 -0400, Tom Lane wrote: 1. a unique identifier for each standby (not just role names that multiple standbys might share); That is difficult because each standby is identical. If a standby goes down, people can regenerate a new standby by taking a copy from another standby. What number do we give this new standby? 2. state on the master associated with each possible standby -- not just the ones currently connected. Both of those are perhaps possible, but the sense I have of the discussion is that people want to avoid them. Yes, I really want to avoid such issues and the likely complexities we get into trying to solve them. In reality they should not be common, because they only happen if the sysadmin has not configured a sufficient number of redundant standbys. My proposed design is that the timeout does not cause the standby to be marked as degraded. It is up to the user to decide whether they wait, or whether they progress without sync rep. Or the sysadmin can release the waiters via a function call. If the cluster does become degraded, the sysadmin just generates a new standby, plugs it back into the cluster, and away we go. Simple: no state to be recorded, and no state to get screwed up either. I don't think we should be spending too much time trying to help people that say they want additional durability guarantees but do not match that with sufficient hardware resources to make it happen smoothly. If we do try to tackle those problems, who will be able to validate that our code actually works? -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Issues with Quorum Commit
On Fri, Oct 8, 2010 at 5:16 PM, Markus Wanner mar...@bluegap.ch wrote: On 10/08/2010 05:41 AM, Fujii Masao wrote: But, even with quorum commit, if you choose the wait-forever option, failover would decrease availability. Right after the failover, no standby has connected to the new master, so if quorum = 1, all the transactions must wait for a while. That's a point, yes. But again, this is just write-availability, you can happily read from all active standbys. I believe many systems require write-availability. Basically we need to take a base backup from the new master to start the standbys and make them connect to the new master. This might take a long time. Since transaction commits cannot advance for that time, availability would go down. Just don't increase your quorum_commit to unreasonable values which your hardware cannot possibly satisfy. It doesn't make sense to set a quorum_commit of 1 or even bigger, if you don't already have a standby attached. Start with 0 (i.e. replication off), then add standbys, then increase quorum_commit to your new requirements. No. This only makes the procedure of failover more complex. Or you think that the wait-forever option is applied only when the standby goes down? That wouldn't work in case of a full-cluster crash, where the wait-forever option is required again. Otherwise you risk a split-brain situation. What is a full-cluster crash? Why does it cause a split-brain? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Issues with Quorum Commit
On Fri, Oct 8, 2010 at 6:00 PM, Simon Riggs si...@2ndquadrant.com wrote: From the perspective of an observer, randomly selecting a standby for load balancing purposes: No, they are not guaranteed to see the latest answer, nor even can they find out whether what they are seeing is the latest answer. To guarantee that each standby returns the same result, we would need to use a cluster-wide snapshot to run queries. IIRC, Postgres-XC provides that feature. Though I'm not sure if it can be applied in HS/SR. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Issues with Quorum Commit
On Fri, 2010-10-08 at 23:55 +0900, Fujii Masao wrote: On Fri, Oct 8, 2010 at 6:00 PM, Simon Riggs si...@2ndquadrant.com wrote: From the perspective of an observer, randomly selecting a standby for load balancing purposes: No, they are not guaranteed to see the latest answer, nor even can they find out whether what they are seeing is the latest answer. To guarantee that each standby returns the same result, we would need to use the cluster-wide snapshot to run queries. IIRC, Postgres-XC provides that feature. Though I'm not sure if it can be applied in HS/SR. That is my understanding. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 04:47 PM, Simon Riggs wrote: Yes, I really want to avoid such issues and likely complexities we get into trying to solve them. In reality they should not be common because it only happens if the sysadmin has not configured sufficient number of redundant standbys. Well, full-cluster outages are infrequent, but sadly cannot be avoided entirely. (Murphy's laughing.) IMO we should be prepared to deal with those. Or am I misunderstanding you here? I don't think we should be spending too much time trying to help people that say they want additional durability guarantees but do not match that with sufficient hardware resources to make it happen smoothly. I fully agree with that statement. Regards Markus Wanner
Re: [HACKERS] Issues with Quorum Commit
On 10/08/2010 04:48 PM, Fujii Masao wrote: I believe many systems require write-availability. Sure. Make sure you have enough standbys to fail over to. (I think there are even more situations where read-availability is much more important, though). Start with 0 (i.e. replication off), then add standbys, then increase quorum_commit to your new requirements. No. This only makes the procedure of failover more complex. Huh? This doesn't affect fail-over at all. Quite the opposite, the guarantees and requirements remain the same even after a fail-over. What is a full-cluster crash? The event that all of your cluster nodes are down (most probably due to power failure, but fires or other catastrophic events can be other causes). Chances for that to happen can certainly be reduced by distributing to distant locations, but that equally certainly increases latency, which isn't always an option. Why does it cause a split-brain? First master node A fails, a standby B takes over, but then fails as well. Let node C take over. Then the power aggregate catches fire, the infamous full-cluster crash (where lights out management gets a completely new meaning ;-) ). Split brain would be the situation that arises if all three nodes (A, B and C) start up again and think they have been the former master, so they can now continue to apply new transactions. Their data diverges, leading to what could be seen as a split-brain from the outside. Obviously, you must disallow A and B from taking the role of the master after recovery. Ideally, C would continue as the master. However, if the fire destroyed node C, let's hope you had another (sync!) standby that can act as the new master. Otherwise you've lost data. Hope that explains it. Wikipedia certainly provides a better (and less Postgres-colored) explanation. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
And, I'd like to know whether the master waits forever because of the standby failure in other solutions such as Oracle DataGuard, MySQL semi-synchronous replication. MySQL used to be fond of simply failing silently. Not sure what 5.4 does, or Oracle. In any case MySQL's replication has always really been async (except Cluster, which is a very different database), so it's not really a comparison. Here's the comparables: Oracle DataGuard, DRBD, SQL Server, DB2. If anyone knows what the above do by default, please speak up! -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
* On 10/8/10, Fujii Masao masao.fu...@gmail.com wrote: On Fri, Oct 8, 2010 at 5:10 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Do we really need that? Yes. But if there is no unsent WAL when the master goes down, we can start a new standby without taking a new backup, by copying the timeline history file from the new master to the new standby and setting recovery_target_timeline to 'latest'. In this case, the new standby advances the recovery to the latest timeline ID which the new master uses, before connecting to the master. This seems to have been successful in my test environment, unless I'm missing something. I don't think that's acceptable, we'll need to fix that if that's the case. Agreed. You can cross timelines with the archive, though. But IIRC there was some issue with that too, you needed to restart the standbys because the standby scans what timelines exist at the beginning of recovery, and won't notice new timelines that appear after that? Yes. We need to address that, apart from any of the other things discussed wrt. synchronous replication. It will benefit asynchronous replication too. IMHO *that* is the next thing we should do, the next patch we commit. You mean to commit that capability before synchronous replication? If so, I disagree with you. I think that it's not easy to address that problem, so I'm worried that implementing that capability first means missing sync rep in 9.1. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers -- Rob Wultsch wult...@gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 08.10.2010 17:26, Fujii Masao wrote: On Fri, Oct 8, 2010 at 5:10 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: Do we really need that? Yes. But if there is no unsent WAL when the master goes down, we can start a new standby without taking a new backup, by copying the timeline history file from the new master to the new standby and setting recovery_target_timeline to 'latest'. .. and restart the standby. In this case, the new standby advances the recovery to the latest timeline ID which the new master uses, before connecting to the master. This seems to have been successful in my test environment, unless I'm missing something. Yeah, that should work, but it's awfully complicated. I don't think that's acceptable, we'll need to fix that if that's the case. Agreed. You can cross timelines with the archive, though. But IIRC there was some issue with that too, you needed to restart the standbys because the standby scans what timelines exist at the beginning of recovery, and won't notice new timelines that appear after that? Yes. We need to address that, apart from any of the other things discussed wrt. synchronous replication. It will benefit asynchronous replication too. IMHO *that* is the next thing we should do, the next patch we commit. You mean to commit that capability before synchronous replication? If so, I disagree with you. I think that it's not easy to address that problem, so I'm worried that implementing that capability first means missing sync rep in 9.1. It's a pretty severe shortcoming at the moment. For starters, it means that you need a shared archive, even if you set wal_keep_segments to a high number. Secondly, it's a lot of scripting to get it working; I don't like the thought of testing failovers in synchronous replication if I have to do all that. Frankly, this seems more important to me than synchronous replication. It shouldn't be too hard to fix. 
Walsender needs to be able to read WAL from preceding timelines, like recovery does, and walreceiver needs to write the incoming WAL to the right file. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Markus Wanner wrote: ..and how do you make sure you are not marking your second standby as degraded just because it's currently lagging? Effectively degrading the utterly needed one, because your first standby has just bitten the dust? People are going to monitor the standby lag. If it gets excessive relative to where it's approaching the known timeout, the flashing yellow lights should go off at this point, before it gets this bad. And if you've set a reasonable business oriented timeout on how long you can stand for the master to be held up waiting for a lagging standby, the right thing to do may very well be to cut it off. At some point people will want to stop waiting for a standby if it's taking so long to commit that it's interfering with the ability of the master to operate normally. Such a master is already degraded, if your performance metrics for availability includes processing transactions in a timely manner. And how do you prevent the split brain situation in case the master dies shortly after these events, but fails to come up again immediately? How is that a new problem? It's already possible to end up with a standby pair that has suffered through some bizarre failure chain such that it's not necessarily obvious which of the two systems has the most recent set of data on it. And that's not this project's problem to solve. Useful answers to the split brain problem involve fencing implementations that normally drop to the hardware level, and clustering solutions including those features are already available that PostgreSQL can integrate into. Assuming you have to solve this in order to deliver a useful database replication component is excessively ambitious. You seem to be under the assumption that a more complicated replication implementation here will make reaching a bad state impossible. I think that's optimistic, both in theory and in regards to how successful code gets built. 
Here's the thing: the difficulty of testing to prove your code actually works is also proportional to that complexity. This project can choose to commit and potentially ship a simple solution that has known limitations, and expect that people will fill in the gap with existing add-on software to handle the clustering parts it doesn't: fencing, virtual IP address assignment, etc. All while getting useful testing feedback on the simple bottom layer, whose main purpose in life is to transport WAL data synchronously. Or, we can argue in favor of adding additional complexity on top first instead, so we end up with layers and layers of untested code. That path leads to situations where you're lucky to ship at all, and when you do the result is difficult to support. -- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Tom Lane wrote: How are you going to mark the standby as degraded? The standby can't keep that information, because it's not even connected when the master makes the decision. From a high level, I'm assuming only that the master has a list in memory of the standby system(s) it believes are up to date, and that it is supposed to commit to synchronously. When I say mark as degraded, I mean that the master merely closes whatever communications channel it had open with that system and removes the standby from that list. If that standby now reconnects again, I don't see how resolving what happens at that point is any different from when a standby is first started after both systems were turned off. If the standby is current with the data available on the master when it has an initial conversation, great; it's now available for synchronous commit too then. If it's not, it goes into a catchup mode first instead. When the master sees you're back to current again, if you're on the list of sync servers too you go back onto the list of active sync systems. There shouldn't be any state information to save here. If the master and standby can't figure out if they are in or out of sync with one another based on the conversation they have when they first connect to one another, that suggests to me there need to be improvements made in the communications protocol they use to exchange messages. -- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
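The bookkeeping Greg describes above can be modelled roughly like this (a toy sketch, not actual server code; the class and method names are invented purely for illustration):

```python
class SyncStandbyList:
    """In-memory list of standbys eligible for synchronous commit.

    Nothing here survives a master restart: a reconnecting standby is
    treated exactly like a freshly started one, and must prove it is
    caught up before rejoining the active sync set.
    """

    def __init__(self, configured_sync_names):
        self.configured = set(configured_sync_names)
        self.active = set()          # currently connected and caught up

    def on_connect(self, name, caught_up):
        if name in self.configured and caught_up:
            self.active.add(name)    # eligible for sync commit right away
        # otherwise the standby enters catchup mode; on_caught_up() fires later

    def on_caught_up(self, name):
        if name in self.configured:
            self.active.add(name)    # back to current: rejoin the sync set

    def on_disconnect_or_timeout(self, name):
        # "Mark as degraded" is nothing more than dropping it from the list;
        # no persistent state is written anywhere.
        self.active.discard(name)
```

A lagging standby that reconnects goes through on_connect(caught_up=False) and only re-enters the active set via on_caught_up(), mirroring the catchup-mode conversation described above.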
Re: [HACKERS] Issues with Quorum Commit
On Fri, 2010-10-08 at 17:06 +0200, Markus Wanner wrote: Well, full cluster outages are infrequent, but sadly cannot be avoided entirely. (Murphy's laughing). IMO we should be prepared to deal with those. I've described how I propose to deal with those. I'm not waving away these issues, just proposing that we consciously choose simplicity and therefore robustness. Let me say it again for clarity. (This is written for the general case, though my patch uses only k=1, i.e. one acknowledgement): If we want robustness, we have multiple standbys. So if you lose one, you continue as normal without interruption. That is the first and most important line of defence - not software. When we would start to wait, if there aren't sufficient active standbys to acknowledge a commit, then the commit won't wait at all. This behaviour helps us avoid situations where we are hours or days away from having a working standby to acknowledge the commit. We've had a long debate about servers that ought to be there but aren't; I suggest we treat standbys that aren't there as having a strong possibility they won't come back, and hence not worth waiting for. Heikki disagrees; I have no problem with adding server registration so that we can add additional waits, but I doubt that the majority of users prefer waiting over availability. It can be an option. Once we are waiting, if insufficient standbys acknowledge the commit we will wait until the timeout expires, after which we commit and continue working. If you don't like timeouts, set the timeout to 0 to wait forever. This behaviour is designed to emphasise availability. (I acknowledge that some people are so worried by data loss that they would choose to stop changes altogether, and accept unavailability; I regard that as a minority use case, but one which I would not argue against including as an option at some point in the future.) 
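Simon's waiting rules can be summarised in a small decision sketch (purely illustrative; the function name and return values are invented here, and the real patch works in terms of WAL positions and backend wait queues rather than strings):

```python
def commit_wait_behavior(active_standbys, k, timeout_ms):
    """Model of the proposed commit-wait rules (illustrative only).

    What the master does at commit time:
    - with fewer than k active (connected, caught-up) standbys,
      don't wait at all rather than block on servers that may
      never come back;
    - otherwise wait for k acknowledgements, bounded by timeout_ms,
      where a timeout of 0 means wait forever.
    """
    if active_standbys < k:
        return "no-wait"
    if timeout_ms == 0:
        return "wait-forever"
    return "wait-up-to-%dms" % timeout_ms
```

For example, a master with two active standbys, k=1 and a 5-second timeout would wait at most 5 seconds for the first acknowledgement; with no standbys connected, the same commit would not wait at all.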
To cover Dimitri's observation that when a streaming standby first connects it might take some time before it can sensibly acknowledge, we don't activate the standby until it has caught up. Once caught up, it will advertise its capability to offer a sync rep service. Standbys that don't wish to be failover targets can set synchronous_replication_service = off. The paths between servers aren't defined explicitly, so the parameters all still work even after failover. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Fri, 2010-10-08 at 16:34 -0400, Greg Smith wrote: Tom Lane wrote: How are you going to mark the standby as degraded? The standby can't keep that information, because it's not even connected when the master makes the decision. From a high level, I'm assuming only that the master has a list in memory of the standby system(s) it believes are up to date, and that it is supposed to commit to synchronously. When I say mark as degraded, I mean that the master merely closes whatever communications channel it had open with that system and removes the standby from that list. My current coding works with two sets of parameters: the "master marks standby as degraded" part is handled by the TCP keepalives. When it notices no response, it kicks out the standby. We already had this, so I never mentioned it before as being part of the solution. The second part is the synchronous_replication_timeout, a user-settable parameter defining how long the app is prepared to wait, which could be more or less time than the keepalives. If that standby now reconnects again, I don't see how resolving what happens at that point is any different from when a standby is first started after both systems were turned off. If the standby is current with the data available on the master when it has an initial conversation, great; it's now available for synchronous commit too then. If it's not, it goes into a catchup mode first instead. When the master sees you're back to current again, if you're on the list of sync servers too you go back onto the list of active sync systems. There shouldn't be any state information to save here. If the master and standby can't figure out if they are in or out of sync with one another based on the conversation they have when they first connect to one another, that suggests to me there need to be improvements made in the communications protocol they use to exchange messages. Agreed. 
-- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Wed, 2010-10-06 at 10:57 -0700, Josh Berkus wrote: I also strongly believe that we should get single-standby functionality committed and tested *first*, before working further on multi-standby. Yes, let's get k = 1 first. With k = 1 the number of standbys is not limited, so we can still have very robust and highly available architectures. So we mean first-acknowledgement-releases-waiters. (1) Consistency: this is another DBA-false-confidence issue. DBAs who implement (1) are liable to do so thinking that they are not only guaranteeing the consistency of every standby with the master, but the consistency of every standby with every other standby -- a kind of dummy multi-master. They are not, so it will take multiple reminders and workarounds in the docs to explain this. And we'll get complaints anyway. This puts the matter very clearly. Setting k = N is not as good an idea as it sounds when first described. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
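The first-acknowledgement-releases-waiters semantics for k = 1 can be sketched as follows (a toy model using a threading event, not actual walsender/backend code; all names here are illustrative):

```python
import threading

class FirstAckReleasesWaiters:
    """Toy model of k = 1 semantics: the committing backend blocks until
    the *first* standby acknowledges; acks beyond the first change nothing
    about whether the commit is released."""

    def __init__(self):
        self._first_ack = threading.Event()
        self._lock = threading.Lock()
        self.ack_count = 0

    def standby_ack(self):
        """Called once per standby that has flushed the commit record."""
        with self._lock:
            self.ack_count += 1
        self._first_ack.set()        # the first ack releases all waiters

    def wait_for_commit(self, timeout=None):
        """Backend side: block until an ack arrives (or timeout expires).
        Returns True if acknowledged, False on timeout."""
        return self._first_ack.wait(timeout)
```

With multiple standbys attached, whichever acknowledges first releases the waiting backend, which is why k = 1 still scales to any number of standbys.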
Re: [HACKERS] Issues with Quorum Commit
On Wed, 2010-10-06 at 10:57 -0700, Josh Berkus wrote: (2), (3) Degradation: (Jeff) these two cases make sense only if we give DBAs the tools they need to monitor which standbys are falling behind, and to drop and replace those standbys. Otherwise we risk giving DBAs false confidence that they have better-than-1-standby reliability when actually they don't. Current tools are not really adequate for this. Current tools work just fine for identifying if a server is falling behind. This improved in 9.0 to give fine-grained information. Nothing more is needed here within the server. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/06/2010 10:01 PM, Simon Riggs wrote: The code to implement your desired option is more complex and really should come later. I'm sorry, but I think of that exactly the opposite way. The timeout for automatic continuation after waiting for a standby is the addition. The wait state of the master is there anyway, whether or not it's bound by a timeout. The timeout option should thus come later. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Markus Wanner mar...@bluegap.ch writes: I'm just saying that this should be an option, not the only choice. I'm sorry, I just don't see the use case for a mode that drops guarantees when they are most needed. People who don't need those guarantees should definitely go for async replication instead. We're still talking about freezing the master and all the applications when the first standby still has to do a base backup and catch up to where the master currently is, right? What does a synchronous replication mode that falls back to async upon failure give you, except for a severe degradation in performance during normal operation? Why not use async right away in such a case? It's all about the standard case you're building, sync rep, and how to manage errors. In most cases I want flexibility. Alert says standby is down, you lost your durability requirements, so now I'm building a new standby. Does it mean my applications are all off and the master refusing to work? I sure hope I can choose about that, if possible per application. Next step, the old standby has been able to boot again, thanks to the sysadmins who repaired it, so it's online again, and my replacement machine is doing a base backup. Are all the applications still unavailable? I sure hope I have a word in this decision. So opening a superuser connection to act on the currently waiting transaction is still possible (pass/fail, but fail is what at this point? shutdown to wait some more offline?). Not sure I'm following here. The admin will be busy re-establishing (connections to) standbys; killing transactions on the master doesn't help anything - whether or not the master waits forever. The idea here would be to be able to manually ACK a transaction that's waiting forever, because you know it won't have an answer and you'd prefer the application to just continue. But I see that's not a valid use case for you. 
Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 07.10.2010 12:52, Dimitri Fontaine wrote: Markus Wannermar...@bluegap.ch writes: I'm just saying that this should be an option, not the only choice. I'm sorry, I just don't see the use case for a mode that drops guarantees when they are most needed. People who don't need those guarantees should definitely go for async replication instead. We're still talking about freezing the master and all the applications when the first standby still has to do a base backup and catch-up to where the master currently is, right? Either that, or you configure your system for asynchronous replication first, and flip the switch to synchronous only after the standby has caught up. Setting up the first standby happens only once when you initially set up the system, or if you're recovering from a catastrophic loss of the standby. What does a synchronous replication mode that falls back to async upon failure give you, except for a severe degradation in performance during normal operation? Why not use async right away in such a case? It's all about the standard case you're building, sync rep, and how to manage errors. In most cases I want flexibility. Alert says standby is down, you lost your durability requirements, so now I'm building a new standby. Does it mean my applications are all off and the master refusing to work? Yes. That's why you want to have at least two standbys if you care about availability. Or if durability isn't that important to you after all, use asynchronous replication. Of course, if in the heat of the moment the admin is willing to forge ahead without the standby, he can temporarily change the configuration in the master. If you want the standby to be rebuilt automatically, you can even incorporate that configuration change in the scripts too. The important point is that you or your scripts are in control, and you know at all times whether you can trust the standby or not. 
If the master makes such decisions automatically, you don't know if the standby is trustworthy (ie. guaranteed up-to-date) or not. so opening a superuser connection to act on the currently waiting transaction is still possible (pass/fail, but fail is what at this point? shutdown to wait some more offline?). Not sure I'm following here. The admin will be busy re-establishing (connections to) standbies, killing transactions on the master doesn't help anything - whether or not the master waits forever. The idea here would be to be able to manually ACK a transaction that's waiting forever, because you know it won't have an answer and you'd prefer the application to just continue. But I see that's not a valid use case for you. I don't see anything wrong with having tools for admins to deal with the unexpected. I'm not sure overriding individual transactions is very useful though, more likely you'll want to take the whole server offline, or you want to change the config to allow all transactions to continue without the synchronous standby. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: Either that, or you configure your system for asynchronous replication first, and flip the switch to synchronous only after the standby has caught up. Setting up the first standby happens only once when you initially set up the system, or if you're recovering from a catastrophic loss of the standby. Or if the standby is lagging and the master wal_keep_segments is not sized big enough. Is that a catastrophic loss of the standby too? It's all about the standard case you're building, sync rep, and how to manage errors. In most cases I want flexibility. Alert says standby is down, you lost your durability requirements, so now I'm building a new standby. Does it mean my applications are all off and the master refusing to work? Yes. That's why you want to have at least two standbys if you care about availability. Or if durability isn't that important to you after all, use asynchronous replication. Agreed, that's a nice simple use case. Another one is to say that I want sync rep when the standby is available, but I don't have the budget for more. So I prefer a good alerting system and low-budget-no-guarantee when the standby is down, that's my risk evaluation. Of course, if in the heat of the moment the admin is willing to forge ahead without the standby, he can temporarily change the configuration in the master. If you want the standby to be rebuilt automatically, you can even incorporate that configuration change in the scripts too. The important point is that you or your scripts are in control, and you know at all times whether you can trust the standby or not. If the master makes such decisions automatically, you don't know if the standby is trustworthy (ie. guaranteed up-to-date) or not. My proposal is that the master has the information to make the decision, and the behavior is something you setup. 
Default to security, so wait forever and block the applications, but it could be set to ignore standbys that have not at least reached this state. I don't see that you can make everybody happy without a knob here, and I don't see how we can deliver one without a clear state diagram of the standby's possible current states and transitions. The other alternative is to just not care and accept the timeout as being an option with the quorum, so that you just don't wait for the quorum if that's what you want. It's much more dynamic and dangerous, but with a good alerting system it'll be very popular, I guess. I don't see anything wrong with having tools for admins to deal with the unexpected. I'm not sure overriding individual transactions is very useful though, more likely you'll want to take the whole server offline, or you want to change the config to allow all transactions to continue without the synchronous standby. The question then is, should the new configuration alter running transactions? My implicit assumption was that I don't think so, and then I need another facility, such as SELECT pg_cancel_quorum_wait(procpid) FROM pg_stat_activity WHERE waiting_quorum; Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Thu, 2010-10-07 at 11:46 +0200, Markus Wanner wrote: On 10/06/2010 10:01 PM, Simon Riggs wrote: The code to implement your desired option is more complex and really should come later. I'm sorry, but I think of that exactly the opposite way. I see why you say that. Dimitri's suggestion is an enhancement on the basic feature, just as Heikki's is. My reply was directed at Heikki, but should also apply to Dimitri's idea also. The timeout for automatic continuation after waiting for a standby is the addition. The wait state of the master is there anyway, whether or not it's bound by a timeout. The timeout option should thus come later. Adding timeout is very little code. We can take that out of the patch if that's an objection. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/07/2010 01:08 PM, Simon Riggs wrote: Adding timeout is very little code. We can take that out of the patch if that's an objection. Okay. If you take it out, we are at the wait-forever option, right? If not, I definitely don't understand how you envision things to happen. I've been asking [1] about that distinction before, but didn't get a direct answer. Regards Markus Wanner [1]: Re: Configuring synchronous replication, Markus Wanner: http://archives.postgresql.org/message-id/4c9c5887.4040...@bluegap.ch -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Thu, Oct 7, 2010 at 3:30 AM, Simon Riggs si...@2ndquadrant.com wrote: Yes, lets get k = 1 first. With k = 1 the number of standbys is not limited, so we can still have very robust and highly available architectures. So we mean first-acknowledgement-releases-waiters. +1. I like the design Greg Smith proposed yesterday (though there are details to be worked out). -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Salut Dimitri, On 10/07/2010 12:32 PM, Dimitri Fontaine wrote: Another one is to say that I want sync rep when the standby is available, but I don't have the budget for more. So I prefer a good alerting system and low-budget-no-guarantee when the standby is down, that's my risk evaluation. I think that's a pretty special case, because the good alerting system is at least as expensive as another server that just persistently stores and ACKs incoming WAL. Why does one ever want the guarantee that sync replication gives to only hold true up to one failure, if a better guarantee doesn't cost anything extra? (Note that a good alerting system is impossible to achieve with only two servers. You need a third device anyway). Or put another way: a good alerting system is one that understands Postgres to some extent. It protects you from data loss in *every* case. If you attach at least two database servers to it, you get availability as long as any one of the two is up and running. No matter what happened before, even a full cluster power outage is guaranteed to recover from automatically without any data loss. [ Okay, the standby mode that only stores and ACKs WAL without having a full database behind still needs to be written. However, pg_streamrecv certainly goes that direction already, see [1]. ] Sync replication between really just two servers is asking for trouble and certainly not worth the savings in hardware cost. Better invest in a good UPS and redundant power supplies for a single server. The question then is, should the new configuration alter running transactions? It should definitely affect all currently running and waiting transactions. For anything beyond three servers, where quorum_commit could be bigger than one, it absolutely makes sense to be able to just lower the requirements temporarily, instead of having to cancel the guarantee completely. 
Regards Markus Wanner [1]: Using streaming replication as log archiving, Magnus Hagander http://archives.postgresql.org/message-id/aanlkti=_bzsyt8a1kjtpwzxnwyygqnvp1nbjwrnsd...@mail.gmail.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Markus Wanner mar...@bluegap.ch writes: Why does one ever want the guarantee that sync replication gives to only hold true up to one failure, if a better guarantee doesn't cost anything extra? (Note that a good alerting system is impossible to achieve with only two servers. You need a third device anyway). I think you're all into durability, and that's good. The extra cost is service downtime. If that's not what you're after: there's also availability, and load-balancing read queries on a system with no lag (no stale-data serving) when all is working right. I still think your use case is a solid one, but we need to be ready to answer some other ones, which you call relaxed and wrong because of data-loss risks. My proposal is to make the risk window obvious and the behavior when you enter it configurable. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Thu, Oct 7, 2010 at 6:32 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: Or if the standby is lagging and the master wal_keep_segments is not sized big enough. Is that a catastrophic loss of the standby too? Sure, but that lagged standby is already asynchronous, not synchronous. If it was synchronous, it would have slowed the master down enough that it would not be lagged. I'm really confused by all these k < N scenarios I see bandied about, because all they really amount to is I only want *one* synchronous replica, and a bunch of asynchronous replicas. And a bit of chance thrown in the mix to hope the synchronous one is pretty stable, and the asynchronous ones aren't *too* far behind (define too and far at your leisure). And then I see a lot of posturing about how to recover when the asynchronous standbys aren't synchronous enough at some point... Agreed, that's a nice simple use case. Another one is to say that I want sync rep when the standby is available, but I don't have the budget for more. So I prefer a good alerting system and low-budget-no-guarantee when the standby is down, that's my risk evaluation. That screams wrong in my book: OK, I want durability, so I always want to have 2 copies of the data, but if we lose one copy, I want to keep on trucking, because I don't *really* want durability. If you want most-of-the-time mostly-2-copy durability, then really good asynchronous replication is a really good solution. Yes, I believe you need to have a way for an admin (or process/control/config) to be able to demote a synchronous replication scenario into async (or standalone, which is just an extension of really async). But it's no longer synchronous replication at that point. And if the choice is made to keep trucking while a new standby is being brought online and made available and caught up, that's fine too. But during that period, until the slave is caught up and synchronously replicating, it's *not* synchronous replication.
So I'm not arguing that there shouldn't be a way to turn off synchronous replication once it's on, hopefully without having to take down the cluster (pg-instance-type cluster). But I am pleading that there be a way to set up PG such that synchronous replication *is* synchronously replicating, or things stop and back up until such a time as it is. a. -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Aidan Van Dyk ai...@highrise.ca writes: Sure, but that lagged standby is already asynchronous, not synchronous. If it was synchronous, it would have slowed the master down enough that it would not be lagged. Agreed, except in the case of a joining standby. But you're saying it better than I do: Yes, I believe you need to have a way for an admin (or process/control/config) to be able to demote a synchronous replication scenario into async (or standalone, which is just an extension of really async). But it's no longer synchronous replication at that point. And if the choice is made to keep trucking while a new standby is being brought online and made available and caught up, that's fine too. But during that period, until the slave is caught up and synchronously replicating, it's *not* synchronous replication. That's exactly my point. I think we need to handle the case and make it obvious that this window is a data-loss window where there's no sync rep ongoing, then offer users a choice of behaviour. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Thu, Oct 7, 2010 at 10:08 AM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: Aidan Van Dyk ai...@highrise.ca writes: Sure, but that lagged standby is already asynchronous, not synchronous. If it was synchronous, it would have slowed the master down enough that it would not be lagged. Agreed, except in the case of a joining standby. *shrug* The joining standby is still asynchronous at this point. It's not synchronous replication. It's just another of the N-k slaves serving stale data ;-) But you're saying it better than I do: Yes, I believe you need to have a way for an admin (or process/control/config) to be able to demote a synchronous replication scenario into async (or standalone, which is just an extension of really async). But it's no longer synchronous replication at that point. And if the choice is made to keep trucking while a new standby is being brought online and made available and caught up, that's fine too. But during that period, until the slave is caught up and synchronously replicating, it's *not* synchronous replication. That's exactly my point. I think we need to handle the case and make it obvious that this window is a data-loss window where there's no sync rep ongoing, then offer users a choice of behaviour. Again, I'm stating there is *no* choice in synchronous replication. It's *got* to block, otherwise it's not synchronous replication. The choice is whether you want synchronous replication or not at that point. And turning it off might be a good (best) choice for most people. I just want to make sure that: 1) There's no way to *sensibly* think it's still synchronously replicating 2) There is a way to enforce that the commits happening *are* synchronously replicating. a. -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Aidan Van Dyk ai...@highrise.ca writes: *shrug* The joining standby is still asynchronous at this point. It's not synchronous replication. It's just another of the N-k slaves serving stale data ;-) Agreed *here*, but if you read the threads again, you'll see that's not at all what's been talked about before my proposal. In particular, the questions about how to unlock a master's setup while its synced standby is doing a base backup should not be allowed to exist, and you seem to agree with my point. That's exactly my point. I think we need to handle the case and make it obvious that this window is a data-loss window where there's no sync rep ongoing, then offer users a choice of behaviour. Again, I'm stating there is *no* choice in synchronous replication. It's *got* to block, otherwise it's not synchronous replication. The choice is whether you want synchronous replication or not at that point. Exactly, even if I didn't dare spell it this way. What I want to propose is for the user to be able to configure things so that he loses the sync aspect of the replication if it so happens that the setup is not able to provide for it. It may sound strange, but it's needed when all you want is a no-stale-data reporting standby, e.g. And it so happens that it's already in Simon's code, AFAIUI (yet to read it, you see). And turning it off might be a good (best) choice for most people. I just want to make sure that: 1) There's no way to *sensibly* think it's still synchronously replicating 2) There is a way to enforce that the commits happening *are* synchronously replicating. We're on the same track. I don't know how to offer your options without a clear listing of standby states and transitions, which must include the synchronicity and whether you just lost it or whatnot.
Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Markus Wanner wrote: I think that's a pretty special case, because the good alerting system is at least as expensive as another server that just persistently stores and ACKs incoming WAL. The cost of hardware capable of running a database server is a large multiple of what you can build an alerting machine for. I have two systems that are approaching the trash heap just at my house, relative to the main work I do, but that are fully capable of running an alerting system. Building a production quality database server requires a more significant investment: high quality disks, ECC RAM, battery-backed RAID controller, etc. Relative to what the hardware in a database server costs, what you need to build an alerting system is almost free. Oh: and most businesses that are complicated enough to need a serious database server already have them, so they actually cost nothing beyond the software setup time to point them toward the databases, too. Why does one ever want the guarantee that sync replication gives to only hold true up to one failure, if a better guarantee doesn't cost anything extra? (Note that a good alerting system is impossible to achieve with only two servers. You need a third device anyway). I do not disagree with your theory or reasoning. But as a practical matter, I'm afraid the true cost of the better guarantee you're suggesting here is additional code complexity that will likely cause this feature to miss 9.1 altogether. As far as I'm concerned, this whole diversion into the topic of quorum commit is only consuming resources away from targeting something achievable in the time frame of a single release. Sync replication between really just two servers is asking for trouble and certainly not worth the savings in hardware cost. Better invest in a good UPS and redundant power supplies for a single server. 
I wish I could give you the long list of data recovery projects I've worked on over the last few years, so you could really appreciate how much what you're saying here is exactly the opposite of the reality I've seen. You cannot make a single server reliable enough to survive all of the things that Murphy's Law will inflict upon it, at any price. For most of the businesses I work with who want sync rep, data is not considered safe until the second copy is on storage miles away from the original, because they know this too. Personal anecdote I can share: I used to have an important project related to stock trading where I kept my backup system about 50 miles away from me. I was aiming for constant availability, while still being able to drive to the other server if needed for disaster recovery. Guess what? Even those two turned out not to be nearly independent enough; see http://en.wikipedia.org/wiki/Northeast_Blackout_of_2003 for details of how I lost both of those at the same time for days. Silly me, I'd only spread them across two adjacent states with different power providers! Not nearly good enough to avoid a correlated failure. -- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/7/10 6:41 AM, Aidan Van Dyk wrote: I'm really confused by all these k < N scenarios I see bandied about, because all they really amount to is I only want *one* synchronous replica, and a bunch of asynchronous replicas. And a bit of chance thrown in the mix to hope the synchronous one is pretty stable, the asynchronous ones aren't *too* far behind (define too and far at your leisure). Effectively, yes. The difference between k of N synch rep and 1 synch standby + several async standbys is that in k of N, you have a pool and aren't dependent on having a specific standby be very reliable, just that any one of them is. So if you have k = 3 and N = 10, then you can have 10 standbys and only 3 of them need to ack any specific commit for the master to proceed. As long as (a) you retain at least one of the 3 which ack'd, and (b) you have some way of determining which standby is the most caught up, data loss is fairly unlikely; you'd need to lose 4 of the 10, and the wrong 4, to lose data. The advantage of this for availability over just having k = N = 3 comes when one of the standbys is responding slowly (due to traffic) or goes offline unexpectedly due to a hardware failure. In the k = N = 3 case, the system halts. In the k = 3, N = 10 case, you can lose up to 7 standbys without the system going down. It's notable that the massively scalable transactional databases (Dynamo, Cassandra, various telecom databases, etc.) all operate this way. However, I do consider this advanced functionality and not worth pursuing until we have the k = 1 case implemented and well-tested. For comparison, Cassandra, Hypertable and Riak have been working on their k < N functionality for a couple years now and none of them has it stable *and* fast. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
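Josh's first-k-acknowledgements-release-waiters rule can be modeled in a few lines. This is a toy sketch, not the walsender code; `send_wal` is a hypothetical stand-in for the whole ship-plus-remote-fsync round trip to one standby:

```python
import threading

def quorum_commit(standby_count, k, send_wal):
    """Toy model: block the committing backend until k of N standbys
    have acknowledged the commit record; the remaining N - k standbys
    keep streaming, but nobody waits for them."""
    acks = threading.Semaphore(0)

    def ship(i):
        send_wal(i)        # simulate WAL shipping + remote fsync
        acks.release()     # standby i acknowledges

    for i in range(standby_count):
        threading.Thread(target=ship, args=(i,), daemon=True).start()
    for _ in range(k):     # the first k acks release the waiter
        acks.acquire()
```

With k = 3 and N = 10, the commit returns as soon as the three fastest standbys reply, which is exactly why a slow or dead standby doesn't halt the system until more than N - k of them are gone.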
Re: [HACKERS] Issues with Quorum Commit
On Thu, Oct 7, 2010 at 1:22 PM, Josh Berkus j...@agliodbs.com wrote: So if you have k = 3 and N = 10, then you can have 10 standbys and only 3 of them need to ack any specific commit for the master to proceed. As long as (a) you retain at least one of the 3 which ack'd, and (b) you have some way of determining which standby is the most caught up, data loss is fairly unlikely; you'd need to lose 4 of the 10, and the wrong 4, to lose data. The advantage of this for availability over just having k = N = 3 comes when one of the standbys is responding slowly (due to traffic) or goes offline unexpectedly due to a hardware failure. In the k = N = 3 case, the system halts. In the k = 3, N = 10 case, you can lose up to 7 standbys without the system going down. Sure, but here is where I might not be following. If you want synchronous replication because you want query availability while making sure you're not getting stale queries from all your slaves, then using your k < N (k = 3 and N = 10) situation is screwing yourself. To get non-stale responses, you can only query those k=3 servers. But you've shot yourself in the foot because you don't know which 3/10 those will be. The other 7 *are* stale (by definition). They talk about picking the caught-up slave when the master fails, but you actually need to do that for *every query*. If you say they are pretty close so by the time you get the query to them they will be caught up, well then, all you really want is good async replication, you don't really *need* the synchronous part. The only case I see a race-to-quorum type of k < N being useful is if you're just trying to duplicate data everywhere, but not actually querying any of the replicas. I can see that all queries go to the master, but the chances are pretty high that multiple machines are going to fail, so I want multiple replicas being useful, but I *don't* think that's what most people are wanting in their I want 3 of 10 servers to ack the commit.
The difference between good async and sync is only the *guarantee*. If you don't need the guarantee, you don't need the synchronous part. a. -- Aidan Van Dyk Create like a god, ai...@highrise.ca command like a king, http://www.highrise.ca/ work like a slave. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
If you want synchronous replication because you want query availability while making sure you're not getting stale queries from all your slaves, then using your k < N (k = 3 and N = 10) situation is screwing yourself. Correct. If that is your reason for synch standby, then you should be using a k = N configuration. However, some people are willing to sacrifice consistency for durability and availability. We should give them that option (eventually), since among that triad you can never have more than two. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/07/2010 06:41 PM, Greg Smith wrote: The cost of hardware capable of running a database server is a large multiple of what you can build an alerting machine for. You realize you don't need lots of disks or RAM for a box that only ACKs? A box with two SAS disks and a BBU isn't that expensive anymore. I do not disagree with your theory or reasoning. But as a practical matter, I'm afraid the true cost of the better guarantee you're suggesting here is additional code complexity that will likely cause this feature to miss 9.1 altogether. As far as I'm concerned, this whole diversion into the topic of quorum commit is only consuming resources away from targeting something achievable in the time frame of a single release. So far I've been under the impression that Simon already has the code for quorum_commit k = 1. What I'm opposed to is the timeout feature, which I consider to be additional code, unneeded complexity and a foot-gun. You cannot make a single server reliable enough to survive all of the things that Murphy's Law will inflict upon it, at any price. That's exactly what I'm saying applies to two servers as well. And why a timeout is a bad thing here, because the chance that the second node fails as well is there (and is higher than you think, according to Murphy). For most of the businesses I work with who want sync rep, data is not considered safe until the second copy is on storage miles away from the original, because they know this too. Now, those are the people who really need sync rep, yes. How happy do you think those businesses were to find out that Postgres is cheating on them in case of a network outage, for example? Do they really value (write!) availability more than data safety? Silly me, I'd only spread them across two adjacent states with different power providers! Not nearly good enough to avoid a correlated failure. Thanks for sharing this. I hope you didn't lose data.
Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
But as a practical matter, I'm afraid the true cost of the better guarantee you're suggesting here is additional code complexity that will likely cause this feature to miss 9.1 altogether. As far as I'm concerned, this whole diversion into the topic of quorum commit is only consuming resources away from targeting something achievable in the time frame of a single release. Yes. My purpose in starting this thread was to show that k > 1 quorum commit is considerably more complex than the people who have been bringing it up in other threads seem to think it is. It is not achievable for 9.1, and maybe not even for 9.2. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Aidan Van Dyk ai...@highrise.ca wrote: To get non-stale responses, you can only query those k=3 servers. But you've shot your self in the foot because you don't know which 3/10 those will be. The other 7 *are* stale (by definition). They talk about picking the caught up slave when the master fails, but you actually need to do that for *every query*. With web applications, at least, you often don't care that the data read is absolutely up-to-date, as long as the point in time doesn't jump around from one request to the next. When we have used load balancing between multiple database servers (which has actually become unnecessary for us lately because PostgreSQL has gotten so darned fast!), we have established affinity between a session and one of the database servers, so that if they became slightly out of sync, data would not pop in and out of existence arbitrarily. I think a reasonable person could combine this technique with a 3 of 10 synchronous replication quorum to get both safe persistence of data and reasonable performance. I can also envision use cases where this would not be desirable. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
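The session-to-server affinity Kevin describes can be as simple as a stable hash from session key to standby. This is an illustrative sketch with invented names, not the actual load-balancer logic; the property that matters is that the same session always lands on the same standby, so its view of the data never jumps around:

```python
import hashlib

def standby_for_session(session_id, standbys):
    """Deterministically pin a session to one standby, so repeated
    requests in the same session see one consistent point in time,
    even if the standbys are slightly out of sync with each other."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return standbys[int.from_bytes(digest[:8], "big") % len(standbys)]
```

Note this only helps read-only traffic, per Robert's caveat downthread: all writes still have to go to the single master.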
Re: [HACKERS] Issues with Quorum Commit
On Thu, Oct 7, 2010 at 2:10 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Aidan Van Dyk ai...@highrise.ca wrote: To get non-stale responses, you can only query those k=3 servers. But you've shot your self in the foot because you don't know which 3/10 those will be. The other 7 *are* stale (by definition). They talk about picking the caught up slave when the master fails, but you actually need to do that for *every query*. With web applications, at least, you often don't care that the data read is absolutely up-to-date, as long as the point in time doesn't jump around from one request to the next. When we have used load balancing between multiple database servers (which has actually become unnecessary for us lately because PostgreSQL has gotten so darned fast!), we have established affinity between a session and one of the database servers, so that if they became slightly out of sync, data would not pop in and out of existence arbitrarily. I think a reasonable person could combine this technique with a 3 of 10 synchronous replication quorum to get both safe persistence of data and reasonable performance. I can also envision use cases where this would not be desirable. Well, keep in mind all updates have to be done on the single master. That works pretty well for fine-grained replication, but I don't think it's very good for full-cluster replication. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Robert Haas robertmh...@gmail.com wrote: Kevin Grittner kevin.gritt...@wicourts.gov wrote: With web applications, at least, you often don't care that the data read is absolutely up-to-date, as long as the point in time doesn't jump around from one request to the next. When we have used load balancing between multiple database servers (which has actually become unnecessary for us lately because PostgreSQL has gotten so darned fast!), we have established affinity between a session and one of the database servers, so that if they became slightly out of sync, data would not pop in and out of existence arbitrarily. I think a reasonable person could combine this technique with a 3 of 10 synchronous replication quorum to get both safe persistence of data and reasonable performance. I can also envision use cases where this would not be desirable. Well, keep in mind all updates have to be done on the single master. That works pretty well for fine-grained replication, but I don't think it's very good for full-cluster replication. I'm completely failing to understand your point here. Could you restate another way? -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/07/2010 03:19 PM, Dimitri Fontaine wrote: I think you're all into durability, and that's good. The extra cost is service downtime It's just *reduced* availability. That doesn't necessarily mean downtime, if you combine cleverly with async replication. if that's not what you're after: there's also availability and load balancing read queries on a system with no lag (no stale data servicing) when all is working right. All I'm saying is that those use cases are much better served with async replication. Maybe together with something that warns and takes action in case the standby's lag gets too big. Or what kind of customers do you think really need a no-lag solution for read-only queries? In the LAN case, the lag of async rep is negligible and in the WAN case the latencies of sync rep are prohibitive. My proposal is to make the risk window obvious and the behavior when you enter it configurable. I don't buy that. The risk calculation gets a lot simpler and obvious with strict guarantees. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On 10/07/2010 07:44 PM, Aidan Van Dyk wrote: The only case I see a race-to-quorum type of k < N being useful is if you're just trying to duplicate data everywhere, but not actually querying any of the replicas. I can see that all queries go to the master, but the chances are pretty high that multiple machines are going to fail, so I want multiple replicas being useful, but I *don't* think that's what most people are wanting in their I want 3 of 10 servers to ack the commit. What else do you think they want it for, if not for protection against data loss? (Note that the queries don't need to go to the master exclusively if you can live with some lag - and I think the vast majority of people can. The zero data loss guarantee holds true in any case, though). The difference between good async and sync is only the *guarantee*. If you don't need the guarantee, you don't need the synchronous part. Here we are exactly on the same page again. Regards Markus Wanner -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Thu, Oct 7, 2010 at 2:31 PM, Kevin Grittner kevin.gritt...@wicourts.gov wrote: Robert Haas robertmh...@gmail.com wrote: Kevin Grittner kevin.gritt...@wicourts.gov wrote: With web applications, at least, you often don't care that the data read is absolutely up-to-date, as long as the point in time doesn't jump around from one request to the next. When we have used load balancing between multiple database servers (which has actually become unnecessary for us lately because PostgreSQL has gotten so darned fast!), we have established affinity between a session and one of the database servers, so that if they became slightly out of sync, data would not pop in and out of existence arbitrarily. I think a reasonable person could combine this technique with a 3 of 10 synchronous replication quorum to get both safe persistence of data and reasonable performance. I can also envision use cases where this would not be desirable. Well, keep in mind all updates have to be done on the single master. That works pretty well for fine-grained replication, but I don't think it's very good for full-cluster replication. I'm completely failing to understand your point here. Could you restate another way? Establishing an affinity between a session and one of the database servers will only help if the traffic is strictly read-only. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
Markus Wanner mar...@bluegap.ch writes: I don't buy that. The risk calculation gets a lot simpler and obvious with strict guarantees. Ok, I'm lost in the use cases and analysis. I still don't understand why you want to consider the system already synchronous when it's not, whatever the guarantee is that you're asking for. All I'm saying is that we should be able to know and show what the current system is up to, and we should be able to offer sane reactions in case of errors. You're calling a sane reaction blocking the master entirely when the standby ain't ready yet (it's still at the base backup state), and I can live with that. As an option. I say that either we go the lax quorum route, or we have to care for details and summarize the failure cases with precision, and the possible responses with care. I don't see that being possible without a clear state for each element in the system, their transitions, and a way to derive the global state of the distributed system out of that. It might be that the simpler way to go here is what Greg Smith has been proposing for a long time already, and again quite recently on this thread: have all the information you need in a system table and offer to run a user-defined function to determine the state of the system. I think we managed to show what Josh Berkus wanted to know now. That's a quagmire here. Now, the problem I have is not Quorum Commit but the very definition of synchronous replication and the system we're trying to build. Not sure there are two of us wanting the same thing here. Regards, -- Dimitri Fontaine http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
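The "clear listing of standby states and transitions" Dimitri asks for could start as small as the sketch below. The state names are invented for illustration and deliberately not taken from Simon's patch; the point is that with an explicit transition table the master can always answer "is sync rep actually in force right now?":

```python
from enum import Enum, auto

class StandbyState(Enum):
    JOINING = auto()    # base backup / catchup: still asynchronous
    SYNC = auto()       # caught up and acknowledging synchronously
    DEGRADED = auto()   # disconnected or demoted: no sync guarantee

# Hypothetical transition table: which state changes are legal.
# A degraded standby must rejoin (and catch up) before it is SYNC again.
TRANSITIONS = {
    StandbyState.JOINING: {StandbyState.SYNC, StandbyState.DEGRADED},
    StandbyState.SYNC: {StandbyState.DEGRADED},
    StandbyState.DEGRADED: {StandbyState.JOINING},
}

def transition(state, new_state):
    """Apply a state change, rejecting anything the table doesn't allow."""
    if new_state not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```

Exposing such a state per standby in a system table would give users both of Aidan's guarantees: no way to sensibly believe a DEGRADED standby is synchronous, and a hook to refuse commits unless some standby is SYNC.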
Re: [HACKERS] Issues with Quorum Commit
Robert Haas robertmh...@gmail.com wrote: Establishing an affinity between a session and one of the database servers will only help if the traffic is strictly read-only. Thanks; I now see your point. In our environment, that's pretty common. Our most heavily used web app (the one for which we have, at times, needed load balancing) connects to the database with a read-only login. Many of our web apps do their writing by posting to queues which are handled at the appropriate source database later. (I had the opportunity to use one of these for real last night, to fill in a juror questionnaire after receiving a summons from the jury clerk in the county where I live.) Like I said, there are sane cases for this usage, but it won't fit everybody. I have no idea on percentages. -Kevin -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Issues with Quorum Commit
On Thu, 2010-10-07 at 13:44 -0400, Aidan Van Dyk wrote: To get non-stale responses, you can only query those k=3 servers. But you've shot yourself in the foot because you don't know which 3/10 those will be. The other 7 *are* stale (by definition). They talk about picking the caught up slave when the master fails, but you actually need to do that for *every query*. There is a big confusion around that point and I need to point out that statement isn't accurate. It's taken me a long while to understand this. Asking for k > 1 does *not* mean those servers are time synchronised. All it means is that the master will stop waiting after 3 acknowledgements. There is no connection between the master receiving acknowledgements and the standbys applying changes received from the master; the standbys are all independent of one another. In a bad case, those 3 acknowledgements might happen, say, 5 seconds apart on the worst and best of the 3 servers. So the first standby to receive the data could have applied the changes ~4.8 seconds prior to the 3rd standby. There is still a chance of reading stale data on one standby, but reading fresh data on another server. In most cases the time window is small, but it still exists. The other 7 are stale with respect to the first 3. But then so are the last 9 compared with the first one. The value of k has nothing whatsoever to do with the time difference between the master and the last standby to receive/apply the changes. The gap between first and last standby (i.e. N, not k) is the time window during which a query might/might not see a particular committed result. So standbys are eventually consistent whether or not the master relies on them to provide an acknowledgement. The only place where you can guarantee non-stale data is on the master. High values of k reduce the possibility of data loss, whereas expected cluster availability is reduced as N - k gets smaller. 
-- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
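Simon's point above can be made concrete with a small model: the master's commit wait ends at the k-th fastest acknowledgement, while the visibility window is the gap between the first and last standby to apply, which depends on all N standbys and not on k. This is an illustrative sketch only, not PostgreSQL code; the acknowledgement times are hypothetical milliseconds.

```python
# Illustrative model of quorum commit (first k of N acks), not PostgreSQL code.
# All timing values are hypothetical, in milliseconds.

def commit_release_time(ack_times_ms, k):
    """The master releases the commit at the k-th fastest acknowledgement."""
    if not 1 <= k <= len(ack_times_ms):
        raise ValueError("need 1 <= k <= N")
    return sorted(ack_times_ms)[k - 1]

def staleness_window_ms(apply_times_ms):
    """Window during which a query may or may not see the committed result:
    the gap between the first and last standby to apply.
    Note: this depends on all N standbys, not on k."""
    return max(apply_times_ms) - min(apply_times_ms)

# With N=4 standbys and k=3, the master waits only for the 3rd ack,
# but the visibility window spans from first to last apply:
acks = [200, 1000, 5000, 7000]
release = commit_release_time(acks, k=3)   # ends the master's wait
window = staleness_window_ms(acks)         # unchanged whatever k is
```

Raising k moves the commit release time later (less data-loss risk) but never shrinks the first-to-last apply gap, which matches Simon's observation that only the master can guarantee non-stale reads.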
Re: [HACKERS] Issues with Quorum Commit
On Thu, 2010-10-07 at 19:50 +0200, Markus Wanner wrote: So far I've been under the impression that Simon already has the code for quorum_commit k = 1. I do, but it's not a parameter. The k = 1 behaviour is hardcoded and considerably simplifies the design. Moving to k > 1 is additional work, slows things down and seems likely to be fragile. -- Simon Riggs www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Training and Services
Re: [HACKERS] Issues with Quorum Commit
All, Establishing an affinity between a session and one of the database servers will only help if the traffic is strictly read-only. I think this thread has drifted very far away from anything we're going to do for 9.1. And seems to have little to do with synchronous replication. Synch rep ensures durability. It is not, by itself, a method of ensuring consistency, nor does it pretend to be one. -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: [HACKERS] Issues with Quorum Commit
Markus Wanner wrote: So far I've been under the impression that Simon already has the code for quorum_commit k = 1. What I'm opposed to is the timeout feature, which I consider to be additional code, unneeded complexity and a foot-gun. Additional code? Yes. Foot-gun? Yes. Timeout should be disabled by default so that you get wait forever unless you ask for something different? Probably. Unneeded? This is where we don't agree anymore. The example that Josh Berkus just sent to the list is a typical example of what I expect people to do here. They'll use Sync Rep to maximize the odds a system failure doesn't cause any transaction loss. They'll use good quality hardware on the master so it's unlikely to fail. But when the database finds the standby unreachable, and it's left with the choice between either degrading into async rep or coming to a complete halt, you must give people the option of choosing to degrade instead after a timeout. Let them set off the red flashing lights, sound the alarms, and pray the master doesn't go down until you can fix the problem. The choice to allow uptime concerns to win over the normal sync rep preferences is a completely valid business decision people will absolutely want to make, in a way opposite of your personal preference here. I don't see this as needing any implementation more complicated than the usual way such timeouts are handled. Note how long you've been trying to reach the standby. Default to -1 for forever. And if you hit the timeout, mark the standby as degraded and force them to do a proper resync when they disconnect. Once that's done, then they can re-enter sync rep mode again, via the same process a new node would have used. 
-- Greg Smith, 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services and Support www.2ndQuadrant.us Author, PostgreSQL 9.0 High Performance. Pre-ordering at: https://www.packtpub.com/postgresql-9-0-high-performance/book
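A minimal sketch of the timeout policy Greg describes might look like the following. This is hypothetical pseudologic, not actual PostgreSQL code: the parameter name SYNC_REP_TIMEOUT, the function wait_for_ack, and the "degraded" state are all invented here for illustration.

```python
# Hypothetical sketch of the timeout policy described above; names are
# illustrative, not actual PostgreSQL GUCs or internals.
import time

SYNC_REP_TIMEOUT = -1  # seconds; -1 means "wait forever" (the safe default)

def wait_for_ack(standby, timeout=SYNC_REP_TIMEOUT, poll=0.01):
    """Block the commit until the standby acknowledges. If the timeout
    expires first, mark the standby degraded (forcing a full resync
    before it re-enters sync rep) and let the commit proceed."""
    start = time.monotonic()
    while not standby.acked():
        if timeout >= 0 and time.monotonic() - start >= timeout:
            standby.state = "degraded"   # this is where the alarms fire
            return False                 # commit proceeds asynchronously
        time.sleep(poll)
    return True
```

The key design point from the discussion: with the default of -1 the loop never exits on its own, so "wait forever" is what you get unless you explicitly trade durability for availability.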
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 6, 2010 at 6:11 PM, Markus Wanner mar...@bluegap.ch wrote: Yeah, sounds more likely. Then I'm surprised that I didn't find any warning that Protocol C definitely reduces availability (with the ko-count=0 default, that is). Really? I don't think that ko-count=0 means wait-forever. IIRC, when I tried DRBD, I could write data to the master's DRBD disk without a connected standby. So I think that by default the master waits for the timeout and works alone when the standby goes down. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 6, 2010 at 6:00 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: In general, salvaging the WAL that was not sent to the standby yet is outright impossible. You can't achieve zero data loss with asynchronous replication at all. No. That depends on the type of failure. Unless the disk in the master has been corrupted, we might be able to salvage WAL. If we want only no data loss, we only have to implement the wait-forever option. But if we take the above-mentioned availability into consideration, the return-immediately option would also be required. In some (many, I think) cases, we need to consider availability and no data loss together, and find the balance between them. If you need both, you need three servers as Simon pointed out earlier. There is no way around that. No. That depends on how far you'd like to ensure no data loss. People who use a shared disk failover solution with one master and one standby don't need such high durability. They can avoid data loss by using something like RAID to a certain extent. So it's not a problem for them to run the master alone after failover happens or the standby goes down. But something like RAID cannot increase availability. Synchronous replication is the solution for that purpose. Of course, if we are worried about running the master alone, we can increase the number of standbys. Furthermore, if we'd like to avoid data loss from a disaster which can destroy all the servers at the same time, we might need to increase the standbys further and locate some of them at a remote site. So return-immediately (i.e., a small timeout) is useful for some use cases. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Issues with Quorum Commit
On Thu, Oct 7, 2010 at 10:24 PM, Fujii Masao masao.fu...@gmail.com wrote: On Wed, Oct 6, 2010 at 6:00 PM, Heikki Linnakangas heikki.linnakan...@enterprisedb.com wrote: In general, salvaging the WAL that was not sent to the standby yet is outright impossible. You can't achieve zero data loss with asynchronous replication at all. No. That depends on the type of failure. Unless the disk in the master has been corrupted, we might be able to salvage WAL. So I guess another way to say this is that zero data loss is unachievable, period. Greg Smith made a flip comment about having been so silly as to only put his redundant servers in adjacent states on different power grids, and yet still having an outage due to the Northeast blackouts. So what would he have had to do to completely rule out a correlated failure? Answer: It can't be done. If a massive asteroid comes zooming into the inner solar system tomorrow and hits the earth, obliterating all life, you're toast. Or likewise if nuclear war ensues. You could put your redundant server on the moon or, better yet, on a moon of one of the outer planets, but the hosting costs are pretty high and the ping times suck. So the point is that the question is not whether or not a correlated failure can happen, but whether you can imagine a scenario where a correlated failure has occurred yet you still wish you had your data. Different people will, obviously, draw that line in different places. Let's start by doing something simple that covers SOME of the cases people want, get it committed, and then move on from there. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise Postgres Company
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 6, 2010 at 9:22 PM, Dimitri Fontaine dimi...@2ndquadrant.fr wrote: From my experience operating londiste, those states would be:
1. base-backup — self explaining
2. catch-up — getting the WAL to catch up after base backup
3. wanna-sync — don't yet have all the WAL to get in sync
4. do-sync — all WALs are there, coming soon
5. ok (async | recv | fsync | reply — feedback loop engaged)
I agree with managing these standby states, from a different standpoint. To avoid data loss, at failover we must not promote to new master a standby which is still halfway through catching up with the master. If clusterware can get the current standby state via SQL, it can check whether the failover would cause data loss and give up the failover before creating the trigger file. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
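The state list Dimitri gives and the pre-failover check Fujii describes could be sketched together like this. The state names follow the list above; the promotion check and function names are hypothetical clusterware logic, not anything that exists in PostgreSQL or londiste.

```python
# Sketch only: standby states as listed above, plus a hypothetical
# clusterware check run before creating the trigger file.

# Ordered from "just started" to "fully participating in sync rep".
STATES = ["base-backup", "catch-up", "wanna-sync", "do-sync", "ok"]

def may_promote(standby_state):
    """Promoting a standby that is still catching up would lose data,
    so only a fully synchronized standby ('ok') is eligible for
    promotion. Clusterware would query this state via SQL and give up
    the failover, rather than create the trigger file, otherwise."""
    return standby_state == "ok"
```

For example, a standby reporting "catch-up" right after its base backup would be rejected, while one reporting "ok" could safely be promoted.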
Re: [HACKERS] Issues with Quorum Commit
On Thu, Oct 7, 2010 at 5:01 AM, Simon Riggs si...@2ndquadrant.com wrote: You seem willing to trade anything for that guarantee. I seek a more pragmatic approach that balances availability and risk. Those views are different, but not inconsistent. Oracle manages to offer multiple options and so can we. +1 Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Issues with Quorum Commit
On Thu, Oct 7, 2010 at 3:01 AM, Markus Wanner mar...@bluegap.ch wrote: Of course, it doesn't make sense to wait-forever on *every* standby that ever gets added. Quorum commit is required, yes (and that's what this thread is about, IIRC). But with quorum commit, adding a standby only improves availability, but certainly doesn't block the master in any way. But, even with quorum commit, if you choose the wait-forever option, failover would decrease availability. Right after the failover, no standby has connected to the new master, so even if quorum = 1, all the transactions must wait for a while. Basically we need to take a base backup from the new master to start the standbys and make them connect to the new master. This might take a long time. Since transaction commits cannot advance for that time, availability would go down. Or do you think that the wait-forever option is applied only when the standby goes down? Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Issues with Quorum Commit
On Thu, 2010-10-07 at 19:44 -0400, Greg Smith wrote: I don't see this as needing any implementation any more complicated than the usual way such timeouts are handled. Note how long you've been trying to reach the standby. Default to -1 for forever. And if you hit the timeout, mark the standby as degraded and force them to do a proper resync when they disconnect. Once that's done, then they can re-enter sync rep mode again, via the same process a new node would have done so. What I don't understand is why this isn't obvious to everyone. Greg, this is very well put, and the -hackers need to start thinking like people who actually use the database. JD -- PostgreSQL.org Major Contributor Command Prompt, Inc: http://www.commandprompt.com/ - 509.416.6579 Consulting, Training, Support, Custom Development, Engineering http://twitter.com/cmdpromptinc | http://identi.ca/commandprompt
Re: [HACKERS] Issues with Quorum Commit
On Fri, Oct 8, 2010 at 8:44 AM, Greg Smith g...@2ndquadrant.com wrote: Additional code? Yes. Foot-gun? Yes. Timeout should be disabled by default so that you get wait forever unless you ask for something different? Probably. Unneeded? This is where we don't agree anymore. The example that Josh Berkus just sent to the list is a typical example of what I expect people to do here. They'll use Sync Rep to maximize the odds a system failure doesn't cause any transaction loss. They'll use good quality hardware on the master so it's unlikely to fail. But when the database finds the standby unreachable, and it's left with the choice between either degrading into async rep or coming to a complete halt, you must give people the option of choosing to degrade instead after a timeout. Let them set off the red flashing lights, sound the alarms, and pray the master doesn't go down until you can fix the problem. But the choice to allow uptime concerns to win over the normal sync rep preferences, that's a completely valid business decision people will absolutely want to make in a way opposite of your personal preference here. Definitely agreed. I don't see this as needing any implementation any more complicated than the usual way such timeouts are handled. Note how long you've been trying to reach the standby. Default to -1 for forever. And if you hit the timeout, mark the standby as degraded and force them to do a proper resync when they disconnect. Once that's done, then they can re-enter sync rep mode again, via the same process a new node would have done so. Fair enough. One question is when this timeout is applied. Obviously it should be applied when the standby goes down. But should the timeout also be applied when we initially start the master, and when no standby has connected to the new master yet after failover? I guess that people who want wait-forever would want to use timeout = -1 for all those cases. Otherwise they cannot guarantee zero data loss. 
OTOH, people who don't want wait-forever would not want to wait for the timeout in the latter two cases. So ISTM that something like an enable_wait_forever or reaction_after_timeout parameter is required separately from the timeout. Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
Re: [HACKERS] Issues with Quorum Commit
On 06.10.2010 01:14, Josh Berkus wrote: Last I checked, our goal with synch standby was to increase availability, not decrease it. No. Synchronous replication does not help with availability. It allows you to achieve zero data loss, i.e. if the master dies, you are guaranteed that any transaction that was acknowledged as committed is still committed. The other use case is keeping a hot standby server (or servers) up-to-date, so that you can run queries against it and you are guaranteed to get the same results you would if you ran the query in the master. Those are the two reasonable use cases I've seen. Anything else that has been discussed is some sort of a combination of those two, or something that doesn't make much sense when you scratch the surface and start looking at the failure modes. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
Re: [HACKERS] Issues with Quorum Commit
On 06.10.2010 01:14, Josh Berkus wrote: You start a new one from the latest base backup and let it catch up? Possibly modifying the config file in the master to let it know about the new standby, if we go down that path. This part doesn't seem particularly hard to me. Agreed, not sure of the issue there. See previous post. The critical phrase is *without restarting the master*. AFAICT, no patch has addressed the need to change the master's synch configuration without restarting it. It's possible that I'm not following something, in which case I'd love to have it pointed out. Fair enough. I agree it's important that the configuration can be changed on the fly. It's orthogonal to the other things discussed, so let's just assume for now that we'll have that. If not in the first version, it can be added afterwards. pg_ctl reload is probably how it will be done. There are some interesting behavioral questions there on what happens when the configuration is changed. Like if you first define that 3 out of 5 servers must acknowledge, and you have an in-progress commit that has received 2 acks already. If you then change the config to 2 out of 4 servers must acknowledge, is the in-progress commit now satisfied? From the admin point of view, the server that was removed from the system might've been one that had acknowledged already, and logically in the new configuration the transaction has only received 1 acknowledgment from those servers that are still part of the system. Explicitly naming the standbys in the config file would solve that particular corner case, but it would no doubt introduce other similar ones. But it's an orthogonal issue, we'll figure it out when we get there. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com
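The reconfiguration corner case Heikki raises can be made explicit with a toy comparison of the two accounting schemes. This is an illustrative sketch, not proposed code; the server names and functions are invented for the example.

```python
# Sketch of the "3-of-5 becomes 2-of-4 mid-commit" corner case.
# Server names (s1..s5) and both functions are hypothetical.

def satisfied_by_count(acks_received, new_k):
    """Anonymous quorum counting: any k acks satisfy the commit."""
    return acks_received >= new_k

def satisfied_by_name(acked_servers, new_members, new_k):
    """Named standbys: only acks from servers still present in the
    new configuration count toward the quorum."""
    return len(set(acked_servers) & set(new_members)) >= new_k

# In-progress commit had acks from s1 and s5; a reload then removes s5
# and changes the rule from 3-of-5 to 2-of-4:
acked = {"s1", "s5"}
new_members = {"s1", "s2", "s3", "s4"}
# Counting says the commit is satisfied; naming says it is not,
# because one of the two acks came from a server no longer configured.
```

This is exactly the ambiguity Heikki notes: explicit naming resolves this case, at the cost of other bookkeeping.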
Re: [HACKERS] Issues with Quorum Commit
On 10/06/2010 04:31 AM, Simon Riggs wrote: That situation would require two things * First, you have set up async replication and you're not monitoring it properly. Shame on you. The way I read it, Jeff is complaining about the timeout you propose that effectively turns sync into async replication in case of a failure. With a master that waits forever, the standby that's newly required for quorum certainly still needs its time to catch up. But it wouldn't live in danger of being optimized away for availability in case it cannot catch up within the given timeout. It's a tradeoff between availability and durability. So it can occur in both cases, though it now looks to me that it's less important an issue in either case. So I think this doesn't rate the term dangerous to describe it any longer. The proposed timeout certainly still sounds dangerous to me. I'd rather recommend setting it to an incredibly huge value to minimize its dangers and get sync replication when that is what has been asked for. Use async replication for increased availability. Or do you envision any use case that requires a quorum of X standbys for normal operation but is just fine with only none to (X-1) standbys in case of failures? IMO that's when sync replication is most needed and when it absolutely should hold to its promises - even if it means to stop the system. There's no point in continuing operation if you cannot guarantee the minimum requirements for durability. If you happen to want such a thing, you should better rethink your minimum requirement (as performance for normal operations might benefit from a lower minimum as well). Regards Markus Wanner
Re: [HACKERS] Issues with Quorum Commit
On Wed, Oct 6, 2010 at 10:52 AM, Jeff Davis pg...@j-davis.com wrote: I'm not sure I entirely understand. I was concerned about the case of a standby server being allowed to lag behind the rest by a large number of WAL records. That can't happen in the wait for all servers to apply case, because the system would become unavailable rather than allow a significant difference in the amount of WAL applied. I'm not saying that an unavailable system is good, but I don't see how my particular complaint applies to the wait for all servers to apply case. The case I was worried about is: * 1 master and 2 standbys * The rule is wait for at least one standby to apply the WAL In your notation, I believe that's M -> { S1, S2 } In that case, if S1 is just a little faster than S2, then S2 might build up a significant queue of unapplied WAL. Then, when S1 goes down, there's no way for the slower one to acknowledge a new transaction without playing through all of the unapplied WAL. Intuitively, the administrator would think that he was getting both HA and redundancy, but in reality the availability is no better than if there were only two servers (M -> S1), except that it might be faster to replay the WAL than to set up a new standby (but that's not guaranteed). Agreed. This is similar to my previous complaint. http://archives.postgresql.org/pgsql-hackers/2010-09/msg00946.php This problem would happen even if we fix the quorum to 1 as Josh proposes. To avoid this, the master must wait for ACK from all the connected synchronous standbys. I think that this is likely to happen especially when we choose the 'apply' replication level, because that level can easily make a synchronous standby lag due to conflicts between recovery and read-only queries. 
Regards, -- Fujii Masao NIPPON TELEGRAPH AND TELEPHONE CORPORATION NTT Open Source Software Center
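The lagging-standby scenario above can be quantified with a toy model: if S2 applies WAL at only a fraction of the commit rate while S1 keeps up (and quorum=1 acks are satisfied by S1 alone), S2's unapplied backlog grows linearly, and that backlog is exactly what the next commit must wait out once S1 fails. All rates and counts here are illustrative, not measurements.

```python
# Toy model of the M -> {S1, S2}, quorum=1 scenario described above.
# The rate and commit counts are hypothetical, for illustration only.

def s2_backlog(commits, s2_rate):
    """Unapplied WAL records queued on the slower standby S2 after
    `commits` commits, when S2 applies s2_rate records per commit
    interval and S1 keeps up fully (so quorum=1 is always met by S1)."""
    applied = min(commits, int(commits * s2_rate))
    return commits - applied

# The longer the imbalance persists, the worse the availability hit
# when S1 fails: the first post-failure commit waits for this backlog.
backlog_after_1000 = s2_backlog(1000, 0.8)
backlog_after_2000 = s2_backlog(2000, 0.8)   # twice as large
```

This matches Jeff's intuition: the administrator appears to have HA plus redundancy, but availability after an S1 failure degrades without bound as the backlog accumulates, which is why Fujii suggests waiting for ACKs from all connected synchronous standbys to bound it.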
Re: [HACKERS] Issues with Quorum Commit
On 10/06/2010 08:31 AM, Heikki Linnakangas wrote: On 06.10.2010 01:14, Josh Berkus wrote: Last I checked, our goal with synch standby was to increase availability, not decrease it. No. Synchronous replication does not help with availability. It allows you to achieve zero data loss, i.e. if the master dies, you are guaranteed that any transaction that was acknowledged as committed is still committed. Strictly speaking, it even reduces availability. Which is why nobody actually wants *only* synchronous replication. Instead they use quorum commit or semi-synchronous (shudder) replication, which only requires *some* nodes to be in sync, but effectively replicates asynchronously to the others. From that point of view, the requirement of having one synch and two async standbys is pretty much the same as having three synch standbys with a quorum commit of 1. (Except for the additional availability of the latter variant, because in case of a failure of the one sync standby, any of the others can take over without admin intervention.) Regards Markus Wanner