Re: [HACKERS] Synchronous Standalone Master Redoux

Hampus Wessman Fri, 13 Jul 2012 00:13:52 -0700

Hi all,

Here are some (slightly too long) thoughts about this.


Shaun Thomas wrote 2012-07-12 22:40:

On 07/12/2012 12:02 PM, Bruce Momjian wrote:

Well, the problem also exists if we add it as an internal database
feature --- how long do we wait to consider the standby dead, how do
we inform administrators, etc.


True. Though if there is no secondary connected, either because it's not
there yet, or because it disconnected, that's an easy check. It's the
network lag/stall detection that's tricky.

It is indeed tricky to detect this. If you don't get an (immediate) reply from the secondary (and you never do!), then all you can do is wait and *eventually* (after how long? 250ms? 10s?) assume that there is no connection between them. The conclusion may very well be wrong sometimes. A second problem is that we still don't know if this is caused by some kind of network problem or if it's caused by the secondary not running. It's perfectly possible that both servers are working, but just can't communicate at the moment.
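
To make the ambiguity concrete, here is a minimal sketch of timeout-based detection. The names (`StandbyStatus`, `classify_standby`) and the timeout values are made up for illustration and are not anything PostgreSQL provides. Note that the only thing the primary can ever conclude is "suspected down"; it cannot tell a dead secondary from a stalled network.

```python
from enum import Enum


class StandbyStatus(Enum):
    CONNECTED = "connected"
    # Deliberately vague: a crashed secondary and a network stall
    # look exactly the same from the primary's side.
    SUSPECTED_DOWN = "suspected down"


def classify_standby(last_reply_at: float, now: float, timeout_s: float) -> StandbyStatus:
    """Classify the secondary based only on the time since its last reply.

    All arguments are in seconds. timeout_s is an arbitrary choice
    (250 ms? 10 s?), and any value can give the wrong answer sometimes.
    """
    if now - last_reply_at > timeout_s:
        return StandbyStatus.SUSPECTED_DOWN
    return StandbyStatus.CONNECTED


# Same silence of 0.5 s, opposite conclusions depending on the timeout chosen.
short_timeout = classify_standby(last_reply_at=0.0, now=0.5, timeout_s=0.25)
long_timeout = classify_standby(last_reply_at=0.0, now=0.5, timeout_s=10.0)
```

With the short timeout the secondary is already written off, with the long one it is still considered connected, and neither result says anything about *why* the replies stopped.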

The thing is that what we do next (at least if our data is important and why otherwise use synchronous replication of any kind...) depends on what *did* happen. Assume that we have two database servers. At any time we need at most one primary database to be running. Without that requirement our data can get messed up completely... If HA is important to us, we may choose to do a failover to the secondary (and live without replication for the moment) if the primary fails. With synchronous replication, we can do this without losing any data. If the secondary also dies, then we do lose data (and we'll know it!), but it might be an acceptable risk. If the secondary isn't permanently damaged, then we might even be able to get the data back after some down time. Ok, so that's one way to reconfigure the database servers on a failure. If the secondary fails instead, then we can do similarly and remove it from the "cluster" (or in other words, disable synchronous replication to the secondary). Again, we don't lose any data by doing this. We're taking a certain risk, however. We can't safely do a failover to the secondary anymore... So if the primary fails now, then the only way not to lose data is to hope that we can get it back from the failed machine (the failure may be temporary).

There's also the third possibility, of course, that the two servers are both up and running, but they can't communicate over the network at the moment (this is, by the way, a difference from RAID, I guess). What do we do then? Well, we still need at most one primary database server. We'll have to (somehow, which doesn't matter as much) decide which database to keep and consider the other one "down". Then we can just do as above (with all the same implications!). Is it always a good idea to keep the primary? No! What if you (as a stupid example) pull the network cable from the primary (or maybe turn off a switch so that it's isolated from most of the network)? In that case you probably want the secondary to take over instead. At least if you value service availability. At this point we can still do a safe failover too.

My point here is that if HA is important to you, then you may very well want to disable synchronous replication on a failure to avoid down time, but this has to be integrated with your overall failover / cluster management solution. Just having the primary automatically disable synchronous replication doesn't seem overly useful to me... If you're using synchronous replication to begin with, you probably want to *know* if you may have lost data or not. Otherwise, you will have to assume that you did and then you could frankly have been running async replication all along. If you do integrate it with your failover solution, then you can keep track of when it's safe to do a failover and when it's not, however, and decide how to handle each case.
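
As a rough illustration of what "keeping track" could mean, here is a small sketch of the bookkeeping a cluster manager would need. Everything in it (`ClusterState`, `drop_secondary`, `fail_over`, the node names) is hypothetical; it is not an existing PostgreSQL or Pacemaker interface, just the rule from the text written down: a failover is only known to be lossless while synchronous replication to the secondary has been active the whole time.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterState:
    """State held by the (hypothetical) cluster manager, not by the primary alone."""
    primary: str
    secondary: Optional[str]  # None once the secondary has been removed
    sync_active: bool = True  # is synchronous replication currently enforced?

    def failover_is_safe(self) -> bool:
        # Safe means: promoting the secondary cannot lose committed data.
        return self.sync_active and self.secondary is not None

    def drop_secondary(self) -> None:
        """Secondary failed: keep the primary running without sync replication."""
        self.secondary = None
        self.sync_active = False

    def fail_over(self) -> None:
        """Primary failed: promote the secondary, which then runs alone."""
        if not self.failover_is_safe():
            raise RuntimeError("failover may lose data; needs an explicit decision")
        self.primary, self.secondary = self.secondary, None
        self.sync_active = False


# Case 1: primary fails while sync replication is active -> lossless failover.
healthy = ClusterState(primary="db1", secondary="db2")
healthy.fail_over()

# Case 2: secondary was dropped earlier -> a later failover is no longer safe.
degraded = ClusterState(primary="db1", secondary="db2")
degraded.drop_secondary()
```

The important part is that the decision to drop the secondary is recorded somewhere outside the primary, so that it is still known after the primary itself has failed.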

How you decide what to do with the servers on failures isn't that important here, really. You can probably run e.g. Pacemaker on 3+ machines and have it check for quorums to accomplish this. That's a good approach at least. You can still have only 2 database servers (for cost reasons), if you want. PostgreSQL could have all this built-in, but I don't think it sounds overly useful to only be able to disable synchronous replication on the primary after a timeout. Then you can never safely do a failover to the secondary, because you can't be sure synchronous replication was active on the failed primary...
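
For what it's worth, the quorum idea can be sketched in a few lines. This is not how Pacemaker is configured or implemented; `has_quorum`, `choose_primary` and the node names are invented for the example. It assumes three machines in total, two of which run a database and one of which only votes.

```python
from typing import List, Optional, Set


def has_quorum(group: Set[str], cluster_size: int) -> bool:
    """A group of mutually reachable nodes has quorum if it is a strict majority."""
    return len(group) > cluster_size // 2


def choose_primary(partitions: List[Set[str]], cluster_size: int,
                   db_nodes: List[str]) -> Optional[str]:
    """Pick the database node that should be primary, or None if nobody has quorum.

    partitions: groups of nodes that can currently talk to each other.
    db_nodes:   database servers in order of preference (current primary first).
    """
    for group in partitions:
        if has_quorum(group, cluster_size):
            for node in db_nodes:
                if node in group:
                    return node
    return None


DB_NODES = ["db1", "db2"]  # db1 is the current primary; "arbiter" has no database

# Everything reachable: keep the current primary.
all_up = choose_primary([{"db1", "db2", "arbiter"}], 3, DB_NODES)

# Network cable pulled from the primary: the majority side promotes db2.
primary_isolated = choose_primary([{"db1"}, {"db2", "arbiter"}], 3, DB_NODES)

# Every node isolated: no majority anywhere, so nobody may act as primary.
no_majority = choose_primary([{"db1"}, {"db2"}, {"arbiter"}], 3, DB_NODES)
```

Since at most one group can hold a strict majority, this gives at most one primary at any time, and it also handles the pulled-cable case where keeping the old primary would be the wrong choice.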


Regards,
Hampus

